How to Encode Categorical Data
TASK: To experiment with and implement different types of encoding for dealing with categorical data.
In this blog we will explore and implement:
One-Hot Encoding using:
- Python's category_encoders library
- Sklearn preprocessing
- pandas' get_dummies
Binary Encoding
Frequency Encoding
Label Encoding
Ordinal Encoding
What is Categorical Data?
Categorical data is a type of data that groups information with similar characteristics, while numerical data expresses information in the form of numbers.
Example: Gender
Why do we need Encoding?
- Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.
- The performance of many algorithms even varies based on how the categorical variables are encoded.
Categorical variables can be divided into two categories:
- Nominal (no particular order)
- Ordinal (ordered)
We will also refer to a cheat sheet that shows when to use which type of encoding.
Method 1: Using Python's category_encoders Library
category_encoders is an amazing Python library that provides 15 different encoding schemes.
Here is the list of the 15 types of encoding:
- One Hot Encoding
- Label Encoding
- Ordinal Encoding
- Helmert Encoding
- Binary Encoding
- Frequency Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder
Importing Libraries
# Install the library first if needed:
# pip install category_encoders
import pandas as pd
import sklearn
import category_encoders as ce
Creating Dataframe
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram']
})
data.head()
Implementing One-Hot Encoding through category_encoders
In this method, each category is mapped to a vector that contains 1 and 0 denoting the presence or absence of the feature. The number of vectors depends on the number of categories for features.
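The mapping can be sketched in plain Python (an illustrative stand-in for what the encoder produces, not the library's internals):

```python
# Manual sketch of one-hot encoding for a single column.
values = ["Male", "Female", "Male", "Female", "Female"]
categories = sorted(set(values))  # ['Female', 'Male']
# One 0/1 vector per row, with a 1 in the position of the row's category
vectors = [[1 if v == c else 0 for c in categories] for v in values]
print(vectors)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 0]]
```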
Create an object of the One Hot Encoder
ce_OHE = ce.OneHotEncoder(cols=['gender', 'city'])
data1 = ce_OHE.fit_transform(data)
data1.head()
Binary Encoding
Binary encoding converts a category into binary digits. Each binary digit creates one feature column.
ce_be = ce.BinaryEncoder(cols=['class'])
# transform the data
data_binary = ce_be.fit_transform(data["class"])
Print Data
print(data["class"])
data_binary

The print shows the original column:
0    A
1    B
2    C
3    D
4    A
Name: class, dtype: object
Similarly, the other encoding schemes listed above are available through this library.
Method 2: Using pandas' get_dummies
pd.get_dummies(data,columns=["gender","city"])
We can assign a custom prefix if we want to; otherwise, get_dummies uses the column name as the default prefix.
pd.get_dummies(data,prefix=["gen","city"],columns=["gender","city"])
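A related get_dummies parameter worth knowing is drop_first, which drops the first level of each feature; the dropped level is implied when all remaining dummies are 0, so no information is lost:

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram']
})

# With drop_first=True, each 2-category feature collapses to a single column
dummies = pd.get_dummies(data, columns=["gender", "city"], drop_first=True)
print(dummies.columns.tolist())  # ['gender_Male', 'city_Gurugram']
```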
Method 3: Using sklearn
sklearn also provides several built-in encoders, which can be accessed from sklearn.preprocessing.
Sklearn One-Hot Encoding
Let's first get a list of the categorical variables from our data:
s = (data.dtypes == 'object')
cols = list(s[s].index)
Importing the encoder:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# Note: in sklearn >= 1.2, use sparse_output=False instead of sparse=False
Applying on gender column
data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))
data_gender
Applying on City Column
data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))
data_city
Applying on the class column
data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))
data_class
Here we get 4 columns because the class column has 4 unique values.
Applying on the list of categorical variables:
data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))
data_cols
Here the first 2 columns represent gender, the next 4 columns represent class, and the remaining 2 represent city.
Sklearn Label Encoding
In label encoding, each category is assigned a value from 0 through N-1, where N is the number of distinct categories for the feature. The assignment implies no relation or order between the categories.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
LabelEncoder takes no arguments. Note that it expects a 1-D input, so we pass the column as a Series:
le_class = le.fit_transform(data["class"])
le_class

which returns:
array([0, 1, 2, 3, 0])
Comparing with one-hot encoding
data_class
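LabelEncoder also remembers the mapping it learned: classes_ holds the categories in code order, and inverse_transform maps codes back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["A", "B", "C", "D", "A"])

print(list(le.classes_))                  # ['A', 'B', 'C', 'D'] -> codes 0..3
print(list(le.inverse_transform(codes)))  # ['A', 'B', 'C', 'D', 'A']
```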
Ordinal Encoding
Ordinal encoding retains the ordinal (ordered) nature of the variable, which makes it look similar to label encoding. The only difference is that label encoding does not consider whether a variable is ordinal or not; it simply assigns a sequence of integers.
Example: Ordinal encoding will assign values as Very Good (1) < Good (2) < Bad (3) < Worse (4).
First, we need to assign the original order of the variable through a dictionary.
temp = {'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}
df = pd.DataFrame(temp, columns=["temperature"])
temp_dict = {
    'very cold': 1,
    'cold': 2,
    'warm': 3,
    'hot': 4,
    'very hot': 5
}
Then we can map each row for the variable as per the dictionary.
df["temp_ordinal"] = df.temperature.map(temp_dict)
df
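As an alternative to the hand-built dictionary, sklearn's OrdinalEncoder accepts the order explicitly through its categories parameter (note that its codes start at 0, unlike the 1-based dictionary above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']})

# One list of levels per encoded column, in the desired order
order = [['very cold', 'cold', 'warm', 'hot', 'very hot']]
oe = OrdinalEncoder(categories=order)
df['temp_ordinal'] = oe.fit_transform(df[['temperature']]).ravel()
print(df['temp_ordinal'].tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0]
```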
Frequency Encoding
Each category is replaced by its frequency, i.e. its share of the total number of rows.
data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})
Grouping by class column
fe = data_freq.groupby("class").size()
fe

which returns:
class
A    4
B    2
C    4
D    2
E    3
dtype: int64
Dividing by the total length:
fe_ = fe / len(data_freq)   # len(data_freq) is 15
Mapping and rounding off:
data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)
data_freq
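The same result can be reached more directly with value_counts(normalize=True), which returns each category's share of the rows in one call:

```python
import pandas as pd

data_freq = pd.DataFrame({
    'class': ['A', 'B', 'C', 'D', 'A', 'B', 'E', 'E', 'D', 'C', 'C', 'C', 'E', 'A', 'A']
})

# value_counts(normalize=True) == groupby size divided by the total length
freq = data_freq['class'].value_counts(normalize=True)
data_freq['data_fe'] = data_freq['class'].map(freq).round(2)
print(data_freq.head())
```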
We saw 5 types of encoding schemes. Similarly, there are 10 other types of encoding :
- Helmert Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder
Which One is Best, then?
There is no single method that works best for every problem or dataset. Personally, I think the get_dummies method has an advantage in being very easy to implement.
If you want to read about all 15 types of encoding in detail, here is a very good article to refer to: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
I am also attaching a cheat sheet on when to use what type of encoding.
If you like this blog, don't forget to leave a few hearty claps :)
Connect with me on LinkedIn.