Cornellius Yudha Wijaya
2024-09-03 08:00:05
www.kdnuggets.com
Let’s learn how to work with categorical data in Pandas.
Preparation
Before we start, we need the Pandas and NumPy packages installed. You can install them with pip:
pip install pandas numpy
With the packages installed, let’s jump into the main part of the article.
Manage Categorical Data in Pandas
Categorical data is a Pandas data type that represents a fixed, usually small, set of distinct values (categories). It differs from the string or object data type, especially in how Pandas stores the data.
Categorical data is more memory-efficient because each distinct category is stored only once, and every row holds just a small integer code that points to its category. In contrast, the object data type stores each value as a separate Python string, which requires much more memory when values repeat.
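To make the storage model concrete, here is a minimal sketch that inspects the two pieces a categorical column is built from: the categories and the per-row integer codes. The sample values and the variable names `s`, `categories`, and `codes` are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
s = pd.Series(['apple', 'kiwi', 'apple', 'kiwi', 'watermelon'], dtype='category')

# Each distinct value is stored once, in the categories index
categories = list(s.cat.categories)

# Every row holds only a small integer code pointing into that index
codes = s.cat.codes.tolist()

print(categories)         # ['apple', 'kiwi', 'watermelon']
print(codes)              # [0, 1, 0, 1, 2]
print(s.cat.codes.dtype)  # int8
```

With only a few categories, Pandas picks the smallest integer type that fits the codes, which is where most of the memory saving comes from.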
Let’s try out the categorical data type with an example. Below is how we can create categorical columns with Pandas.
import pandas as pd
df = pd.DataFrame({
'fruits': pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi']),
'size': pd.Categorical(['small', 'large', 'large', 'small', 'large', 'small'])
})
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fruits 6 non-null category
1 size 6 non-null category
dtypes: category(2)
memory usage: 396.0 bytes
You can see that the data type for the fruits and size columns is category instead of the object type we usually get.
We can try to compare the memory usage for the categorical and object data types with the following code:
import numpy as np
n = 100000
df_object = pd.DataFrame({
'fruit': np.random.choice(['apple', 'banana', 'orange'], size=n)
})
print('Memory usage with object type:')
print(df_object['fruit'].memory_usage(deep=True))
df_category = pd.DataFrame({
'fruit': pd.Categorical(np.random.choice(['apple', 'banana', 'orange'], size=n))
})
print('Memory usage with categorical type:')
print(df_category['fruit'].memory_usage(deep=True))
Output:
Memory usage with object type:
6267209
Memory usage with categorical type:
100424
You can see that the object type consumes far more memory than the categorical data type, and the absolute gap grows with the number of rows.
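In practice, a column often already exists as object data, and it can be converted rather than rebuilt. The following is a rough sketch of that conversion using `astype('category')`; the names `df_demo`, `obj_col`, and `cat_col` and the seed are assumptions for illustration, and the exact byte counts will vary with the Pandas version and platform.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible; the seed itself is arbitrary
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'fruit': rng.choice(['apple', 'banana', 'orange'], size=100_000)
})

# Force object storage explicitly so the comparison does not depend on
# the default string dtype of the installed Pandas version
obj_col = df_demo['fruit'].astype(object)

# Convert the existing column to the categorical data type
cat_col = obj_col.astype('category')

obj_bytes = obj_col.memory_usage(deep=True)
cat_bytes = cat_col.memory_usage(deep=True)

print(f'object:   {obj_bytes} bytes')
print(f'category: {cat_bytes} bytes')
print(f'ratio:    {obj_bytes / cat_bytes:.1f}x')
```

The conversion keeps the values unchanged; only the underlying storage differs.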
Next, we will examine the methods specific to categorical data types, which are available through the .cat accessor. For example, you can get the categories:
df['fruits'].cat.categories
Output:
Index(['apple', 'kiwi', 'watermelon'], dtype='object')
Also, we can rename the categories:
# New names are assigned by position, so they must follow the order of .cat.categories
df['fruits'] = df['fruits'].cat.rename_categories(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'])
print(df['fruits'].cat.categories)
Output:
Index(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'], dtype='object')
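Passing a list to `rename_categories` renames by position, so it is easy to attach a new name to the wrong category. A dict maps old names to new ones explicitly, and categories missing from the dict are left unchanged. The sketch below assumes a small standalone Series named `fruits`, which is illustrative and not part of the example above.

```python
import pandas as pd

# Hypothetical standalone Series, used only for illustration
fruits = pd.Series(pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi']))

# Rename a single category by name; the others are passed through as-is
renamed = fruits.cat.rename_categories({'kiwi': 'fruit_kiwi'})

print(list(renamed.cat.categories))  # ['apple', 'fruit_kiwi', 'watermelon']
print(renamed.tolist())              # ['apple', 'fruit_kiwi', 'watermelon', 'fruit_kiwi']
```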
The categorical data type can also be ordered, which lets us compare values against a category. For example, we can check which sizes are smaller than 'large':
df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)
df['size'] < 'large'
Output:
0 True
1 False
2 False
3 True
4 False
5 True
Name: size, dtype: bool
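An ordered categorical also affects sorting and aggregation, which follow the declared category order rather than alphabetical order. Here is a minimal sketch, assuming a standalone Series named `sizes` with the same values and category order as the size column:

```python
import pandas as pd

# Standalone copy of the size values; the name `sizes` is illustrative
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'large', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# min and max respect the declared order: small < medium < large
print(sizes.min(), sizes.max())  # small large

# Sorting also follows the category order, not the alphabet
sorted_sizes = sizes.sort_values().tolist()
print(sorted_sizes)  # ['small', 'small', 'small', 'large', 'large', 'large']

# A comparison can use any declared category, even one with no rows
at_least_medium = (sizes >= 'medium').tolist()
print(at_least_medium)  # [False, True, True, False, True, False]
```

Sorted alphabetically, 'large' would come before 'small'; the declared order is what places 'small' first here.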
Mastering the categorical data type can give you an edge in data analysis, both in memory usage and in expressing ordered values.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.