Dropping Constant Features using VarianceThreshold: Feature Selection -1

The most straightforward guide to removing constant/quasi-constant predictors with Python's VarianceThreshold

Shelvi Garg
Nerd For Tech


In this guide, you will read exactly what you need to remove constant features while doing feature selection.


Constant features take the same (or nearly the same) value across all observations in the dataset. They provide no information that allows an ML model to predict the target.

[Figure: a sample dataset in which columns C and D are constant features]

High variance in a predictor: a good sign.

Low variance in a predictor: not good for the model.
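The point is easy to verify with pandas on a tiny frame like the one pictured above (hypothetical values): a constant column has zero variance.

```python
import pandas as pd

# Hypothetical toy frame: A and B vary, C and D are constant
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 10, 30],
    "C": [5, 5, 5, 5],         # constant numeric column
    "D": ["x", "x", "x", "x"]  # constant string column
})

# Population variance of each numeric column; constants score 0
print(df[["A", "B", "C"]].var(ddof=0))
# C's variance is 0: it cannot help a model separate rows
```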

We can drop constant features using scikit-learn's VarianceThreshold.

Refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

You can find the complete code file and dataset here on my GitHub: https://github.com/shelvi31/Feature-Selection

Variance Threshold:

VarianceThreshold is a feature selector that removes all low-variance features from the dataset, which are of little use in modeling.

It looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

The default value of the threshold is 0.

  • threshold = 0: removes only constant features
  • threshold > 0: also removes quasi-constant features
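The two cases can be sketched on a small hypothetical array (VarianceThreshold compares the population variance of each column against the threshold):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 1, 7],
    [0, 1, 8],
    [0, 0, 9],
    [0, 1, 7],
])  # col 0 is constant, col 1 is quasi-constant, col 2 varies

# Default threshold=0: only the strictly constant column is removed
vt0 = VarianceThreshold(threshold=0)
print(vt0.fit_transform(X).shape[1])  # 2 columns survive

# A positive threshold also removes the quasi-constant column:
# variance of col 1 is 0.75 * 0.25 = 0.1875 < 0.2
vt = VarianceThreshold(threshold=0.2)
print(vt.fit_transform(X).shape[1])  # 1 column survives
```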

Python Implementation:

import pandas as pd
import numpy as np

# Loading data from the train CSV file
train_df = pd.read_csv("train data credit card.csv")
train_df.head(5)

train_df.shape
(245725, 11)

Shortening the huge dataset

train = train_df.loc[1:40000, :]
train.shape
(40000, 11)

# Filling null values, if any
train = train.fillna("None")

Dropping ID Column, defining target

train1 = train.drop(["ID","Is_Lead"],axis=1)
y = train["Is_Lead"]

VarianceThreshold works only on numerical data, so we first need to convert the data types of the non-integer/non-float columns. For this, we will use an OrdinalEncoder.


To see the number of unique values in each column:

train1.nunique(axis=0)

Gender                     2
Age                       62
Region_Code               35
Occupation                 4
Channel_Code               4
Vintage                   66
Credit_Product             3
Avg_Account_Balance    35278
Is_Active                  2
dtype: int64

Using Ordinal Encoder: Required Before Thresholding

In ordinal encoding, each unique category value is assigned an integer value. For example, “red” is 1, “green” is 2, and “blue” is 3. This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.
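As a quick illustration on made-up data: scikit-learn's OrdinalEncoder assigns the integers by sorted category order, starting at zero (rather than the 1/2/3 of the generic example above), and the mapping is reversible via inverse_transform.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column categorical data
colors = [["red"], ["green"], ["blue"], ["green"]]

enc = OrdinalEncoder()
encoded = enc.fit_transform(colors)

# Categories are sorted alphabetically: blue -> 0, green -> 1, red -> 2
print(enc.categories_)
print(encoded.ravel())  # [2. 1. 0. 1.]

# The encoding is easily reversed
print(enc.inverse_transform(encoded).ravel())
```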

# import ordinal encoder from sklearn
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()

# Transform the data
cat_cols = ["Gender", "Region_Code", "Occupation",
            "Channel_Code", "Credit_Product", "Is_Active"]
train1[cat_cols] = ord_enc.fit_transform(train1[cat_cols])

MAIN CODE:

Defining and Fitting Threshold

For quasi-constant features, which take the same value for a very large subset of observations, a threshold of 0.01 would drop any binary column in which roughly 99% of the values are identical, since its variance is about 0.99 × 0.01 = 0.0099.
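That arithmetic is worth checking once by hand. For a 0/1 column, the population variance is p(1 − p), where p is the fraction of ones (hypothetical data below):

```python
import numpy as np

# A hypothetical binary column where 99% of values are identical
col = np.array([0] * 99 + [1])

# Population variance of a Bernoulli(p) variable is p * (1 - p)
p = col.mean()                 # 0.01
print(p * (1 - p))             # 0.0099
print(np.var(col))             # same value, computed directly

# 0.0099 < 0.01, so VarianceThreshold(threshold=0.01) would drop it
```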

from sklearn.feature_selection import VarianceThreshold

var_thr = VarianceThreshold(threshold=0.25)  # removes both constant and quasi-constant features
var_thr.fit(train1)

var_thr.get_support()
array([False, True, True, True, True, True, True, True, False])

OUTPUT:

  • True: High Variance
  • False: Low Variance

Picking out the low-variance columns:

With threshold = 0.25 as in my code above, every 0/1 column gets dropped: the variance of a binary feature is p(1 − p), which never exceeds 0.25, and a binary column that is 75% one value, for instance, has variance 0.75 × 0.25 ≈ 0.19. Choose the threshold to match how much near-constancy you are willing to tolerate.

concol = [column for column in train1.columns
          if column not in train1.columns[var_thr.get_support()]]

for feature in concol:
    print(feature)

Gender
Is_Active

Dropping Low Variance Columns:

train1 = train1.drop(concol, axis=1)

train1.columns

Index(['Age', 'Region_Code', 'Occupation', 'Channel_Code', 'Vintage',
       'Credit_Product', 'Avg_Account_Balance'],
      dtype='object')

Note that drop returns a new DataFrame, so the result must be assigned back to train1 (or called with inplace=True).

This is how we can see which columns have high variance and thus contribute to a better model. Don't forget to convert the column dtypes to integer or float before applying the threshold.

Once you identify your low-variance columns, you can always reverse the encoding and continue your journey with the original data. Also, don't forget to drop the same columns from the test data before predicting results! :)
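A minimal sketch of that last step, assuming a hypothetical test_df with the same columns as the training data (the concol names are the ones found on the train set above):

```python
import pandas as pd

concol = ["Gender", "Is_Active"]  # low-variance columns found on the TRAIN set

# Hypothetical test frame with the same schema as the training data
test_df = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [30, 45],
    "Is_Active": ["Yes", "No"],
})

# Drop exactly the columns identified on the train set; never re-fit
# the selector on test data, or the kept columns may differ
test_reduced = test_df.drop(columns=concol)
print(test_reduced.columns.tolist())  # ['Age']
```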

Cheers!

Shelvi ❤
