How to Encode Categorical Data

TASK: To experiment with and implement different types of encoding to deal with categorical data

Shelvi Garg
Jun 11, 2021 · 6 min read

In this blog we will explore and implement:

  • One-Hot Encoding, using:
      • Python’s category_encoders library
      • sklearn’s preprocessing module
      • pandas’ get_dummies
  • Binary Encoding
  • Frequency Encoding
  • Label Encoding
  • Ordinal Encoding

What is Categorical Data?

Categorical data is a type of data that groups information with similar characteristics, while numerical data expresses information in the form of numbers.

Example: Gender

Why do we need Encoding?

  • Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.
  • Many algorithms’ performance even varies based on how the categorical variables are encoded.

Categorical variables can be divided into two categories:

  • Nominal (no particular order)
  • Ordinal (ordered)

We will also refer to a cheat sheet that shows when to use which type of encoding.

Method 1: Using Python’s category_encoders Library

category_encoders is an amazing Python library that provides 15 different encoding schemes.

Here is the list of the 15 encoding types:

  • One Hot Encoding
  • Label Encoding
  • Ordinal Encoding
  • Helmert Encoding
  • Binary Encoding
  • Frequency Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Importing Libraries

# install the library once, e.g. from a terminal: pip install category_encoders
import pandas as pd
import sklearn
import category_encoders as ce

Creating Dataframe

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram'],
})
data.head()
   gender class      city
0    Male     A     Delhi
1  Female     B  Gurugram
2    Male     C     Delhi
3  Female     D     Delhi
4  Female     A  Gurugram

Implementing One-Hot Encoding through category_encoder

In this method, each category is mapped to a vector that contains 1 and 0 denoting the presence or absence of the feature. The number of vectors depends on the number of categories for features.

Create an object of the One Hot Encoder

ce_OHE = ce.OneHotEncoder(cols=['gender','city']) 

ce_OHE
OneHotEncoder(cols=['gender', 'city'])

data1 = ce_OHE.fit_transform(data)
data1.head()

Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column.

ce_be = ce.BinaryEncoder(cols=['class'])

# transform the data
data_binary = ce_be.fit_transform(data["class"])

Print Data

print(data["class"])
data_binary
0 A
1 B
2 C
3 D
4 A
Name: class, dtype: object

Similarly, this library provides the other 14 encoding types.
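To see what binary encoding is doing, here is a minimal pandas/NumPy sketch of the idea (illustrative only, not the library’s actual implementation): each category first gets an ordinal code, and that code is then spelled out in binary, one column per bit.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'D', 'A'], name='class')

# step 1: ordinal-encode the categories in order of appearance (A→1, B→2, C→3, D→4)
codes = pd.factorize(s)[0] + 1  # numpy array [1, 2, 3, 4, 1]

# step 2: write each code in binary, most significant bit first
n_bits = int(codes.max()).bit_length()  # the largest code, 4, needs 3 bits
binary = pd.DataFrame({
    f'class_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(binary)  # A→001, B→010, C→011, D→100
```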

Method 2: Using pandas’ get_dummies

pd.get_dummies(data,columns=["gender","city"])

We can assign a custom prefix if we want to; otherwise, get_dummies uses the column name as the default prefix.

pd.get_dummies(data,prefix=["gen","city"],columns=["gender","city"])
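get_dummies also accepts a drop_first parameter; dropping one level per column avoids the "dummy variable trap" (perfectly collinear columns) when the result feeds a linear model. A quick sketch with the same data:

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram'],
})

# drop_first=True keeps k-1 dummies per column; an all-zero row
# implicitly identifies the dropped (alphabetically first) category
dummies = pd.get_dummies(data, columns=['gender', 'city'], drop_first=True)
print(dummies.columns.tolist())
```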

Method 3: Using sklearn

sklearn also provides several built-in encoders, which can be accessed from sklearn.preprocessing.

sklearn One-Hot Encoding

Let’s first get a list of the categorical variables from our data:

s = (data.dtypes == 'object')
cols = list(s[s].index)

Importing:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# note: the sparse argument was renamed to sparse_output in sklearn >= 1.2

Applying it to the gender column:

data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))

data_gender

Applying it to the city column:

data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))

data_city

Applying it to the class column:

data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))

data_class

The output has four columns this time because the class column has 4 unique values.

Applying on the list of categorical variables:

data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))

data_cols

Here the first 2 columns represent gender, the next 4 represent class, and the remaining 2 represent city.

sklearn Label Encoding

In label encoding, each category is assigned a value from 0 through N-1, where N is the number of categories for the feature. There is no relation or order between these assignments.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

Label encoder takes no arguments

le_class = le.fit_transform(data["class"])
le_class
array([0, 1, 2, 3, 0])

Comparing with one-hot encoding

data_class
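LabelEncoder assigns the codes in sorted (alphabetical) order of the labels; the fitted classes_ attribute records the mapping, and inverse_transform goes back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le_class = le.fit_transform(['A', 'B', 'C', 'D', 'A'])

# classes_ holds the sorted labels; a label's position is its assigned code
print(le.classes_)                     # ['A' 'B' 'C' 'D']
print(le_class)                        # [0 1 2 3 0]
print(le.inverse_transform(le_class))  # ['A' 'B' 'C' 'D' 'A']
```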

Ordinal Encoding

Ordinal encoding’s encoded variables retain the ordinal (ordered) nature of the variable. It looks very similar to label encoding; the only difference is that label encoding does not consider whether a variable is ordinal or not, it assigns a sequence of integers regardless.

Example: ordinal encoding will assign values as Very Good (1) < Good (2) < Bad (3) < Worse (4)

First, we need to assign the original order of the variable through a dictionary.

temp = {'temperature' :['very cold', 'cold', 'warm', 'hot', 'very hot']}

df=pd.DataFrame(temp,columns=["temperature"])

temp_dict = {
'very cold': 1,
'cold': 2,
'warm': 3,
'hot': 4,
"very hot": 5
}
temp_dict
{'very cold': 1, 'cold': 2, 'warm': 3, 'hot': 4, 'very hot': 5}

temp
{'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}

df

Then we can map each row for the variable as per the dictionary.

df["temp_ordinal"] = df.temperature.map(temp_dict)
df
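The same mapping can be done without a hand-written dictionary using sklearn’s OrdinalEncoder, passing the desired order through its categories parameter (note it codes from 0 rather than 1):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ['very cold', 'cold', 'warm', 'hot', 'very hot']
df = pd.DataFrame({'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']})

# categories=[order] pins the encoding: 'very cold'→0.0, ..., 'very hot'→4.0
oe = OrdinalEncoder(categories=[order])
df['temp_ordinal'] = oe.fit_transform(df[['temperature']]).ravel()
print(df)
```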

Frequency Encoding

Each category is assigned a value based on its frequency in the overall dataset.

data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})

Grouping by class column

fe = data_freq.groupby("class").size()
fe
class
A 4
B 2
C 4
D 2
E 3
dtype: int64

Dividing by length

len(data_freq)
15

fe_ = fe / len(data_freq)

Mapping and Rounding off

data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)
data_freq
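The groupby/len steps above can be collapsed into one line with value_counts(normalize=True), which returns each category’s relative frequency directly:

```python
import pandas as pd

data_freq = pd.DataFrame({
    'class': ['A', 'B', 'C', 'D', 'A', 'B', 'E', 'E', 'D', 'C', 'C', 'C', 'E', 'A', 'A']
})

# relative frequency of each category, mapped back onto the rows
fe_ = data_freq['class'].value_counts(normalize=True)
data_freq['data_fe'] = data_freq['class'].map(fe_).round(2)
print(data_freq.head())
```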

We saw 5 types of encoding schemes. Similarly, there are 10 other types of encoding:

  • Helmert Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Which One is Best then?

There is no single method that works best for every problem or dataset. Personally, I think the get_dummies method has an advantage in that it is very easy to implement.

If you want to read about all 15 encoding types in detail, here is a very good article to refer to: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

I am also attaching a cheat sheet on when to use what type of encoding.


If you like this blog, don’t forget to leave a few hearty claps :)

Connect with me on LinkedIn

References:

  1. https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
  2. https://pypi.org/project/category-encoders/
  3. https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
