2024-11-05 04:30:00
marcusolsson.dev
Imagine you’re a data analyst at a global company who’s been asked to provide employee statistics for a survey on remote working and distributed teams. You’ve extracted the relevant employee data, but sharing it as-is could violate privacy laws. How can you anonymize this data while ensuring it’s still useful? In this article, you’ll learn about k-anonymity and l-diversity—two valuable techniques in privacy engineering to help you reduce the privacy risk in datasets.
Before you continue reading
Data anonymization is a complex topic that’s difficult to accomplish in practice. This article aims to give you a basic understanding of two commonly used data anonymization techniques. Using these techniques, however, does not guarantee compliance with regulations such as GDPR.
First, let’s look at the data you’ve extracted from your internal HR system:
| full_name | email | country | tenure_years | department |
|---|---|---|---|---|
| John Smith | jsmith@… | USA | 5 | Sales |
| Maria Garcia | mgarcia@… | USA | 3 | Marketing |
| Yuki Tanaka | ytanaka@… | Japan | 7 | Engineering |
| Hans Mueller | hmueller@… | Germany | 2 | Finance |
| Sarah Johnson | sjohnson@… | UK | 5 | HR |
| Pierre Dubois | pdubois@… | UK | 3 | Sales |
| Li Wei | lwei@… | China | 7 | Engineering |
| Anna Kowalski | akowalski@… | USA | 2 | Marketing |
| Eva Schmidt | eschmidt@… | Germany | 5 | Finance |
| Priya Patel | ppatel@… | UK | 3 | HR |
The data identifies individual employees by name and email, so sharing it with a third party may violate privacy laws.
A first attempt at anonymization
Fortunately, the survey partner doesn’t need this level of specificity. Let’s start by removing all fields that directly identify individual employees, such as full_name and email.
Next, we’ll attempt to further de-identify the individuals by aggregating individual rows into groups.
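If you’re working in pandas, a minimal sketch of this first pass might look like the following. The employees DataFrame, the example rows, and the 0-5 / 6-10 tenure buckets are illustrative assumptions, not a prescribed implementation:

```python
import pandas as pd

# A small, made-up stand-in for the HR extract shown above.
employees = pd.DataFrame({
    "full_name": ["John Smith", "Yuki Tanaka", "Hans Mueller"],
    "email": ["jsmith@example.com", "ytanaka@example.com", "hmueller@example.com"],
    "country": ["USA", "Japan", "Germany"],
    "tenure_years": [5, 7, 2],
    "department": ["Sales", "Engineering", "Finance"],
})

# Drop the fields that directly identify individuals.
deidentified = employees.drop(columns=["full_name", "email"])

# Generalize tenure into coarse buckets; (0, 5] and (5, 10] are arbitrary choices.
deidentified["tenure_years"] = pd.cut(
    deidentified["tenure_years"], bins=[0, 5, 10], labels=["0-5", "6-10"]
)

# Aggregate the remaining rows into groups per country and tenure bucket.
grouped = (
    deidentified
    .groupby(["country", "tenure_years"], observed=True)
    .agg(
        departments=("department", lambda s: ", ".join(sorted(s.unique()))),
        count=("department", "size"),
    )
    .reset_index()
)
print(grouped)
```

Applied to the full ten-row extract, this kind of aggregation produces the grouped view below.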
| country | tenure_years | departments | count |
|---|---|---|---|
| USA | 0-5 | Sales, Marketing | 3 |
| Japan | 6-10 | Engineering | 1 |
| Germany | 0-5 | Finance | 2 |
| UK | 0-5 | HR, Sales | 3 |
| China | 6-10 | Engineering | 1 |
There, we’ve removed the names and emails, but is the new dataset truly anonymous? Assume that an attacker knows that Yuki Tanaka lives in Japan. With this dataset, they could infer that Yuki works in Engineering and roughly how long he’s been with the company. Even though a dataset doesn’t directly identify an individual, some individuals may still be identifiable.
Quasi-identifiers
Attributes that don’t uniquely identify an individual on their own, but may do so when combined with other attributes, are called quasi-identifiers. In our example, country, tenure, and department are quasi-identifiers. None of them directly identifies a person, but combined they might narrow down the possibilities enough to uniquely identify someone.
It’s clear that our first attempt isn’t enough to reasonably protect the privacy of the employees. Next, we’ll see how we can further reduce the privacy risk with k-anonymity.
K-anonymity
K-anonymity is a data anonymization technique ensuring that for each combination of quasi-identifying attributes (such as country and tenure), there are at least k rows that share those exact values.
The k is a number we choose. For example, if k=2, the data is considered to be 2-anonymous. A higher k value provides more privacy.
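Measuring the k of a dataset is straightforward: group the rows by the quasi-identifiers and take the size of the smallest group. A minimal pandas sketch (the function name and arguments are my own, not part of any standard library):

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the k of a dataset: the size of the smallest group of rows
    sharing the same combination of quasi-identifier values."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())
```

Applied to the per-row data behind the grouped table above, with quasi-identifiers ["country", "tenure_years"], this returns 1, because Japan and China each contain a single employee.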
Example with k=2
By setting k=2, we’re saying that each combination of quasi-identifiers must appear at least twice in the dataset.
Here’s how our data looks after applying k=2 anonymity:
| country | tenure_years | departments | count |
|---|---|---|---|
| USA | 0-5 | Sales, Marketing | 3 |
| UK | 0-5 | HR, Sales | 3 |
| Germany | 0-5 | Finance | 2 |
| Other | 6-10 | Engineering | 2 |
Notice how Japan and China are now grouped as “Other”. Since both of them only have one employee, we had to combine them for all combinations to have a count of 2 or more.
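One simple, and fairly blunt, way to get there programmatically is to generalize any country that occurs fewer than k times into a shared Other bucket and re-aggregate. The sketch below is just one possible strategy under that assumption; it reuses the hypothetical deidentified DataFrame from earlier and is not a complete k-anonymization algorithm:

```python
def generalize_rare_countries(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Replace countries that occur fewer than k times with 'Other', then
    re-aggregate. Note: this only checks the country counts; a real
    implementation would also need to verify the combined quasi-identifier
    groups reach the desired k."""
    counts = df["country"].map(df["country"].value_counts())
    generalized = df.assign(country=df["country"].where(counts >= k, "Other"))
    return (
        generalized
        .groupby(["country", "tenure_years"], observed=True)
        .agg(
            departments=("department", lambda s: ", ".join(sorted(s.unique()))),
            count=("department", "size"),
        )
        .reset_index()
    )

print(generalize_rare_countries(deidentified, k=2))
```

Run against the full extract, this reproduces the 2-anonymous table above.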
Example with k=5
If a higher k means improved privacy, why don’t we just set it to a really big number? Let’s see what happens when we increase k to 5—each combination of quasi-identifiers appearing at least five times:
| continent | tenure_years | departments | count |
|---|---|---|---|
| Europe | 0-5 | HR, Sales, Finance | 5 |
| Other | 0-10 | Engineering, Sales, Marketing | 5 |
It’s now much harder to identify individual employees in the dataset. But as a result, we had to remove so much information that it lost its usefulness. Any employee outside of Europe is grouped under “Other”, and the range of tenure becomes so big that it’s no longer useful.
Lower k values increase specificity, while higher k values offer better privacy protection. The best value for k depends on your dataset and how sensitive your data is.
L-diversity
Let’s assume we choose k=2 so that all combinations occur at least twice. Unfortunately, this still fails to protect the employees in Germany, Japan, and China.
Since all the employees in Asia work in Engineering, we’ve done little to protect Yuki’s privacy, and the same goes for Germany, where everyone works in Finance. You may also have realized that it doesn’t actually matter how many employees are in the Other group if all of them are in Engineering. The problem isn’t the size of the group, but the lack of diversity in the departments column.
This is where l-diversity comes in. L-diversity builds on k-anonymity to provide even more privacy by ensuring a given level of diversity in sensitive attributes within each group.
L-diversity ensures that there’s sufficient diversity within each combination of quasi-identifiers. For example, l=5 means that within each group, there are at least 5 well-represented values for the sensitive attribute (in this case, department).
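The simplest variant, often called distinct l-diversity, counts the distinct sensitive values in each group and takes the minimum. A minimal sketch, again with a function name of my own choosing:

```python
def distinct_l_diversity(df: pd.DataFrame, quasi_identifiers: list[str],
                         sensitive: str) -> int:
    """Return the smallest number of distinct sensitive values found in any
    quasi-identifier group (distinct l-diversity)."""
    return int(
        df.groupby(quasi_identifiers, observed=True)[sensitive].nunique().min()
    )
```

For the 2-anonymous table above, with department as the sensitive attribute, this returns 1: the Germany and Other groups each contain a single department, which is exactly the problem we just described.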
What is a well-represented value?
Note that l-diversity doesn’t necessarily mean l unique values in each group. For example, if the UK had 9 employees in Sales and only one in HR, an external actor could guess with 90% confidence that a UK employee works in Sales, so even though the group contains two distinct values, it offers little real protection.
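A quick way to quantify that risk is to look at the largest within-group share of any single sensitive value, which corresponds to the confidence of an attacker’s best guess. A small, hedged sketch:

```python
def max_sensitive_share(df: pd.DataFrame, quasi_identifiers: list[str],
                        sensitive: str) -> float:
    """Return the highest within-group proportion of any single sensitive
    value, i.e. how confidently an attacker could guess it for the
    worst-case group."""
    shares = (
        df.groupby(quasi_identifiers, observed=True)[sensitive]
        .value_counts(normalize=True)
    )
    return float(shares.max())
```

In the hypothetical UK example (9 in Sales, 1 in HR), this would return 0.9.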
In our example, one way to achieve l=2 is to move all employees in Germany into the Other group. Note that, as a result, the dataset is now 3-anonymous.
| country | tenure_years | departments | count |
|---|---|---|---|
| USA | 0-5 | Sales, Marketing | 3 |
| UK | 0-5 | HR, Sales | 3 |
| Other | 0-10 | Finance, Engineering | 4 |
Now, each group has at least two different department values, satisfying l-diversity with l=2. Even if someone knew Yuki was based in Japan, it’s no longer trivial to deduce that he’s in Engineering. Unfortunately—just like for k=5—the tenure range (0-10) is likely too wide to be useful.
As you can see, setting the values of k and l involves balancing data utility against privacy protection. It also illustrates an important point: using k-anonymity and l-diversity doesn’t guarantee the absence of privacy risk.
Limitations and considerations
K-anonymity and l-diversity allow us to communicate privacy risk in a dataset. However, as we saw throughout the article, these techniques have limitations:
- Homogeneity attacks: An attacker can still infer sensitive information if all sensitive values within a k-anonymous group are the same.
- Background knowledge attacks: Additional information might allow attackers to narrow down possibilities.
- Skewness attacks: Even with l-diversity, if one value is much more frequent, high-probability inferences are possible.
- Similarity attacks: If the sensitive values in a group are semantically similar, it may still allow harmful inferences.
- Data utility trade-off: As we increase privacy protections, we often lose some of the data’s usefulness or specificity.
In practice, it may be impossible to completely eliminate the risk of re-identification without rendering the data useless in the process. The best way to remove privacy risk is to avoid collecting or sharing the information in the first place.
Learn more
To learn more, I recommend the Data Privacy Handbook by Utrecht University. In addition to k-anonymity and l-diversity, it also covers t-closeness—another technique to further reduce privacy risk—along with videos on each technique.
Also, if you’re using Google Cloud Platform, check out their Sensitive Data Protection, which lets you compute k-anonymity and l-diversity on your datasets.
Conclusion
As data professionals, we need to balance privacy risks and data utility when sharing sensitive data. K-anonymity and l-diversity are two data anonymization techniques that can help you reason about and make conscious decisions on the privacy risks in a dataset.
Unfortunately, it may be close to impossible to guarantee that an attacker won’t be able to re-identify individuals in a dataset. Data anonymization techniques should only be used as one part of a larger, comprehensive privacy program.