How to Encode Categorical Data

TASK: To experiment with and implement different types of encoding to deal with categorical data

Shelvi Garg
Jun 11, 2021 · 6 min read

In this blog we will explore and implement:

  • One-Hot Encoding, using:
      • Python’s category_encoders library
      • sklearn’s preprocessing module
      • pandas’ get_dummies
  • Binary Encoding
  • Frequency Encoding
  • Label Encoding
  • Ordinal Encoding

What is Categorical Data?

Categorical data is a type of data that groups information with similar characteristics, while numerical data expresses information in the form of numbers.

Example: Gender

Why do we need Encoding?

  • Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.
  • Many algorithms’ performance even varies based on how the categorical variables are encoded.

Categorical variables can be divided into two categories:

  • Nominal (no particular order)
  • Ordinal (ordered)

We will also refer to a cheat sheet that shows when to use which type of encoding.

Method 1: Using Python’s category_encoders Library

category_encoders is an amazing Python library that provides 15 different encoding schemes.

Here is the list of the 15 encoding types:

  • One Hot Encoding
  • Label Encoding
  • Ordinal Encoding
  • Helmert Encoding
  • Binary Encoding
  • Frequency Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Importing Libraries

# install the library once, e.g. from a terminal: pip install category_encoders
import pandas as pd
import sklearn
import category_encoders as ce

Creating Dataframe

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram'],
})
data.head()
   gender class      city
0    Male     A     Delhi
1  Female     B  Gurugram
2    Male     C     Delhi
3  Female     D     Delhi
4  Female     A  Gurugram

Implementing One-Hot Encoding through category_encoder

In this method, each category is mapped to a vector that contains 1 and 0 denoting the presence or absence of the feature. The number of vectors depends on the number of categories for features.

Create an object of the One Hot Encoder

ce_OHE = ce.OneHotEncoder(cols=['gender','city']) 

ce_OHE
OneHotEncoder(cols=['gender', 'city'])

data1 = ce_OHE.fit_transform(data)
data1.head()

Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column.

ce_be = ce.BinaryEncoder(cols=['class'])

# transform the data
data_binary = ce_be.fit_transform(data["class"])

Print Data

print(data["class"])
data_binary
0 A
1 B
2 C
3 D
4 A
Name: class, dtype: object

Similarly, this library provides the other 14 encoding types.
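To see what binary encoding is doing, here is a minimal pandas/NumPy sketch of the idea (illustrative only, not the library’s actual implementation): each category first gets an ordinal code, and that code is then spelled out in binary, one column per bit.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'D', 'A'], name='class')

# step 1: ordinal-encode the categories in order of appearance (A→1, B→2, C→3, D→4)
codes = pd.factorize(s)[0] + 1  # numpy array [1, 2, 3, 4, 1]

# step 2: write each code in binary, most significant bit first
n_bits = int(codes.max()).bit_length()  # the largest code, 4, needs 3 bits
binary = pd.DataFrame({
    f'class_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(binary)  # A→001, B→010, C→011, D→100
```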

Method 2: Using pandas’ get_dummies

pd.get_dummies(data,columns=["gender","city"])

We can assign a custom prefix if we want to; otherwise, get_dummies uses the column name as the default prefix.

pd.get_dummies(data,prefix=["gen","city"],columns=["gender","city"])
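get_dummies also accepts a drop_first parameter; dropping one level per column avoids the "dummy variable trap" (perfectly collinear columns) when the result feeds a linear model. A quick sketch with the same data:

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram'],
})

# drop_first=True keeps k-1 dummies per column; an all-zero row
# implicitly identifies the dropped (alphabetically first) category
dummies = pd.get_dummies(data, columns=['gender', 'city'], drop_first=True)
print(dummies.columns.tolist())
```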

Method 3: Using sklearn

sklearn also provides several built-in encoders, which can be accessed from sklearn.preprocessing.

sklearn One-Hot Encoding

Let’s first get a list of the categorical variables from our data:

s = (data.dtypes == 'object')
cols = list(s[s].index)

Importing:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# note: the sparse argument was renamed to sparse_output in sklearn >= 1.2

Applying it to the gender column:

data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))

data_gender

Applying it to the city column:

data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))

data_city

Applying it to the class column:

data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))

data_class

The output has four columns this time because the class column has 4 unique values.

Applying on the list of categorical variables:

data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))

data_cols

Here the first 2 columns represent gender, the next 4 represent class, and the remaining 2 represent city.

sklearn Label Encoding

In label encoding, each category is assigned a value from 0 through N-1, where N is the number of categories for the feature. There is no relation or order between these assignments.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

Label encoder takes no arguments

le_class = le.fit_transform(data["class"])
le_class
array([0, 1, 2, 3, 0])

Comparing with one-hot encoding

data_class
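LabelEncoder assigns the codes in sorted (alphabetical) order of the labels; the fitted classes_ attribute records the mapping, and inverse_transform goes back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le_class = le.fit_transform(['A', 'B', 'C', 'D', 'A'])

# classes_ holds the sorted labels; a label's position is its assigned code
print(le.classes_)                     # ['A' 'B' 'C' 'D']
print(le_class)                        # [0 1 2 3 0]
print(le.inverse_transform(le_class))  # ['A' 'B' 'C' 'D' 'A']
```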

Ordinal Encoding

Ordinal encoding’s encoded variables retain the ordinal (ordered) nature of the variable. It looks very similar to label encoding; the only difference is that label encoding does not consider whether a variable is ordinal or not, it assigns a sequence of integers regardless.

Example: ordinal encoding will assign values as Very Good (1) < Good (2) < Bad (3) < Worse (4)

First, we need to assign the original order of the variable through a dictionary.

temp = {'temperature' :['very cold', 'cold', 'warm', 'hot', 'very hot']}

df=pd.DataFrame(temp,columns=["temperature"])

temp_dict = {
'very cold': 1,
'cold': 2,
'warm': 3,
'hot': 4,
"very hot": 5
}
temp_dict
{'very cold': 1, 'cold': 2, 'warm': 3, 'hot': 4, 'very hot': 5}

temp
{'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}

df

Then we can map each row for the variable as per the dictionary.

df["temp_ordinal"] = df.temperature.map(temp_dict)
df
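The same mapping can be done without a hand-written dictionary using sklearn’s OrdinalEncoder, passing the desired order through its categories parameter (note it codes from 0 rather than 1):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ['very cold', 'cold', 'warm', 'hot', 'very hot']
df = pd.DataFrame({'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']})

# categories=[order] pins the encoding: 'very cold'→0.0, ..., 'very hot'→4.0
oe = OrdinalEncoder(categories=[order])
df['temp_ordinal'] = oe.fit_transform(df[['temperature']]).ravel()
print(df)
```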

Frequency Encoding

Each category is assigned a value based on its frequency in the overall dataset.

data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})

Grouping by class column

fe = data_freq.groupby("class").size()
fe
class
A 4
B 2
C 4
D 2
E 3
dtype: int64

Dividing by length

len(data_freq)
15

fe_ = fe / len(data_freq)

Mapping and Rounding off

data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)
data_freq
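The groupby/len steps above can be collapsed into one line with value_counts(normalize=True), which returns each category’s relative frequency directly:

```python
import pandas as pd

data_freq = pd.DataFrame({
    'class': ['A', 'B', 'C', 'D', 'A', 'B', 'E', 'E', 'D', 'C', 'C', 'C', 'E', 'A', 'A']
})

# relative frequency of each category, mapped back onto the rows
fe_ = data_freq['class'].value_counts(normalize=True)
data_freq['data_fe'] = data_freq['class'].map(fe_).round(2)
print(data_freq.head())
```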

We saw 5 types of encoding schemes. Similarly, there are 10 other types of encoding:

  • Helmert Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Which One is Best then?

There is no single method that works best for every problem or dataset. Personally, I think the get_dummies method has an advantage in that it is very easy to implement.

If you want to read about all 15 encoding types in detail, here is a very good article to refer to: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

I am also attaching a cheat sheet on when to use what type of encoding.


If you like this blog, don’t forget to leave a few hearty claps :)

Connect with me on LinkedIn

References:

  1. https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
  2. https://pypi.org/project/category-encoders/
  3. https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
