It’s well known that many machine learning models can’t process categorical features natively. While there are some exceptions, it’s usually up to the practitioner to decide on a numeric representation of each categorical feature. There are many ways to accomplish this, but one strategy seldom recommended is label encoding.
Label encoding replaces each categorical value with an arbitrary number. For instance, if we have a feature containing letters of the alphabet, label encoding might assign the letter “A” a value of 0, the letter “B” a value of 1, and continue this pattern until “Z”, which is assigned 25. After this process, technically speaking, any algorithm should be able to handle the encoded feature.
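As a quick sketch of what this looks like in practice, here is the alphabet example using scikit-learn’s `LabelEncoder` (which assigns codes in sorted order, so the letters happen to map to 0 through 25 exactly as described):

```python
import string

from sklearn.preprocessing import LabelEncoder

# The 26 uppercase letters, "A" through "Z"
letters = list(string.ascii_uppercase)

# LabelEncoder sorts the unique values and assigns each an integer code
encoder = LabelEncoder()
codes = encoder.fit_transform(letters)

print(codes[0], codes[-1])  # "A" -> 0, "Z" -> 25
```

Note that the resulting integers carry an implied ordering and spacing ("B" is not “one more” than “A” in any meaningful sense), which is exactly the issue the rest of this article examines.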
But what’s the problem with this? Shouldn’t sophisticated machine learning models be able to handle this type of encoding? If so, why do libraries like CatBoost exist, along with so many other encoding strategies for high-cardinality categorical features?
This article will explore two examples demonstrating why label encoding can be problematic for machine learning models. These examples will help us appreciate why there are so many alternatives to label encoding, and they will deepen our understanding of the relationship between data complexity and model performance.
One of the best ways to build intuition for a machine learning concept is to understand how it works in a low-dimensional space and try to extrapolate the result to higher dimensions. This mental extrapolation doesn’t always align with reality, but for our purposes, a single feature is all we need to see why better categorical encoding strategies are necessary.
A Feature With 25 Categories
Let’s start by looking at a basic toy dataset with one feature and a continuous target. Here are the dependencies we need:
import numpy as np
import polars as pl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split