Cornellius Yudha Wijaya
2024-06-05 08:00:12
www.kdnuggets.com
Let’s learn how to use Scikit-learn’s imputer for handling missing data.
Preparation
Ensure you have NumPy, Pandas, and Scikit-learn installed in your environment. If not, you can install them via pip with the following command:
pip install numpy pandas scikit-learn
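Then, import the packages into your environment:
import numpy as np
import pandas as pd
import sklearn
# IterativeImputer is experimental: this import must run before it can be imported from sklearn.impute
from sklearn.experimental import enable_iterative_imputer  # noqa: F401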
Handle Missing Data with Imputer
A Scikit-learn imputer is a class that replaces missing data with substitute values, which can streamline your data preprocessing. We will explore several strategies for handling missing data.
Let’s create a sample dataset for our example:
sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan,9], 'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8,9]}
df = pd.DataFrame(sample_data)
print(df)
   First  Second
0    1.0     NaN
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     NaN
7    NaN     8.0
8    9.0     9.0
You can fill the columns’ missing values with the Scikit-Learn Simple Imputer using the respective column’s mean.
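The code for this step does not appear in the text. A minimal sketch, assuming it mirrors the median example further down but with `strategy='mean'` (the `SimpleImputer` default), could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as above
sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)

# Replace each NaN with the mean of the observed values in its own column
imputer = SimpleImputer(strategy='mean')
df_imputed = round(pd.DataFrame(imputer.fit_transform(df), columns=df.columns), 2)
print(df_imputed)
```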
   First  Second
0   1.00    5.29
1   2.00    2.00
2   3.00    3.00
3   4.00    4.00
4   5.00    5.00
5   6.00    6.00
6   7.00    5.29
7   4.62    8.00
8   9.00    9.00
Note that we round the result to 2 decimal places.
It’s also possible to impute the missing data with the median using the Simple Imputer.
from sklearn.impute import SimpleImputer
# Replace each NaN with the median of the observed values in its column
imputer = SimpleImputer(strategy='median')
df_imputed = round(pd.DataFrame(imputer.fit_transform(df), columns=df.columns), 2)
print(df_imputed)
   First  Second
0    1.0     5.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.0
7    4.5     8.0
8    9.0     9.0
The mean and median imputation approaches are simple, but they can distort the data distribution and bias the relationships between features.
It is also possible to use a K-NN imputer to fill in the missing data using the nearest neighbour approach.
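One way to see this distortion is to compare each column's standard deviation before and after mean imputation. The sketch below is an added illustration on the same sample data, not part of the original walkthrough. Every imputed value sits exactly at the column mean, so the spread shrinks.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)

mean_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns
)

# pandas skips NaN by default, so std_before describes the observed values only
std_before = df.std()
std_after = mean_imputed.std()
print(pd.DataFrame({'before': std_before, 'after': std_after}))
```

The more values are imputed this way, the more the column's variance is understated.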
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)
print(knn_imputed_df)
   First  Second
0    1.0     2.5
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.5
7    7.5     8.0
8    9.0     9.0
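The KNN imputer fills each missing value with the mean of that feature across the k nearest neighbours, optionally weighted by distance. Closeness is measured only on the features that both rows have observed.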
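To make the neighbour selection concrete, the following sketch reproduces the imputed value for row 0 by hand. It is an illustration under stated assumptions rather than the library's internal code: it uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`, and the helper names `donors` and `nearest` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)
X = df.to_numpy()

# Distance from row 0 to every row, ignoring coordinates that are missing
dist = nan_euclidean_distances(X[[0]], X)[0]

# Usable donors: 'Second' is observed and the distance to row 0 is defined
donors = np.where(~np.isnan(X[:, 1]) & ~np.isnan(dist))[0]

# Take the two closest donors and average their 'Second' values
nearest = donors[np.argsort(dist[donors])[:2]]
manual_value = X[nearest, 1].mean()

sklearn_value = KNNImputer(n_neighbors=2).fit_transform(df)[0, 1]
print(nearest, manual_value, sklearn_value)
```

Row 7 is skipped as a donor for row 0 because the two rows share no observed feature, so no distance can be computed between them.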
Lastly, there is the iterative imputer, which models each feature with missing values as a function of the other features. Scikit-learn still treats it as an experimental feature, so it has to be enabled first, which is what the enable_iterative_imputer import in the Preparation section does.
from sklearn.impute import IterativeImputer
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = round(pd.DataFrame(iterative_imputed_data, columns=df.columns), 2)
print(iterative_imputed_df)
   First  Second
0    1.0     1.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     7.0
7    8.0     8.0
8    9.0     9.0
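The idea of modelling one feature as a function of the other can be sketched with a single regression step. This is a deliberate simplification and not how `IterativeImputer` is implemented: the real class uses a `BayesianRidge` estimator by default and cycles over all features for several rounds, whereas the sketch fits one plain `LinearRegression` for the 'Second' column only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)

# Rows where both columns are observed serve as training data
complete = df.dropna()
model = LinearRegression().fit(complete[['First']], complete['Second'])

# Predict 'Second' wherever it is missing but 'First' is available
mask = df['Second'].isna() & df['First'].notna()
predicted = model.predict(df.loc[mask, ['First']])
print(pd.Series(predicted, index=df.index[mask]).round(2))
```

Because the two columns in the sample data are perfectly correlated, even this one-step version recovers the values that the iterative imputer produced for rows 0 and 6.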
Used properly, an imputer can help improve the quality of your data science project.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.