So You Want to Work With Patron Data… De-identification Basics

Welcome to this week’s Tip of the Hat! This week’s post is a “back to basics” look at de-identification and patron data. Why? A recent article published in the Code4Lib Journal combined patron data with external data sets without de-identifying it first, so now is as good a time as any to remind library workers about de-identification. [1]

De-identification Definitions

Before we talk about de-identification, we must talk about anonymization and the differences between the two:

  • Anonymization is when the connection between the data and any identifiable individual in the real world is removed permanently and irreversibly, so the data cannot be re-identified.
  • De-identification is when you remove the connection between the data and any identifiable individual in the real world, but some possibility of re-identification remains. Sometimes de-identified datasets replace personally identifiable information (PII) with a unique identifier for each person, which is then called pseudonymization.

De-identification provides a way for some organizations to work with data, such as tracking individual trends, with a reduced risk of re-identification and other privacy risks. Why “for some” and “reduced”? We’ll get into the whys of the issues with de-identification later in this post.

De-identification Method Basics

PII comes in two forms: data about a person and data about a person’s activities that can be linked back to the person. The methods and level of work needed to sufficiently de-identify patron data depend on the type of PII in the data set. The methods commonly used to de-identify PII include obfuscation, truncation, and aggregation.

  • Obfuscation moves the reference point of the data up a few levels of granularity. An example is using a birth year or age instead of the person’s full birth date.
  • Truncation strips the raw data to a small subsection general enough that it cannot be easily connected to an identifiable person. A real-world example of truncation is HIPAA’s guidance on physical address de-identification, truncating the address to the first three digits of the zip code.
  • Aggregation further groups individual data points, creating a more generalized data set. Going back to the obfuscation example, individual ages can be aggregated into age ranges.
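The three methods above are easy to prototype. Here is a minimal Python sketch, using a made-up patron record whose field names are purely illustrative:

```python
from datetime import date

# A hypothetical patron record; the field names are illustrative only.
record = {"birth_date": date(1985, 6, 14), "zip_code": "97214-1234"}

# Obfuscation: replace the full birth date with just the birth year.
birth_year = record["birth_date"].year          # 1985

# Truncation: keep only the first three digits of the ZIP code,
# as in HIPAA's guidance for physical addresses.
zip3 = record["zip_code"][:3]                   # "972"

# Aggregation: group an individual age into a ten-year range.
def age_range(age, width=10):
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print(birth_year, zip3, age_range(38))          # 1985 972 30-39
```

Note that each step discards detail for good; the de-identified output should never carry the original values alongside the generalized ones.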

There are more methods to de-identify data, some of which can get quite complex, such as differential privacy. The three methods mentioned above, nonetheless, are some of the more accessible de-identification methods available to libraries.
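For a taste of why differential privacy is more complex, here is a rough Python sketch of its most common building block: a count query protected with Laplace noise. The epsilon value and count are made up for illustration, and a real deployment would also need careful privacy-budget accounting:

```python
import random

def dp_count(true_count, epsilon, rng):
    """Return a count with Laplace noise calibrated for epsilon-DP.

    A counting query has sensitivity 1, so the noise scale is 1/epsilon;
    a smaller epsilon means more noise and stronger privacy.
    """
    scale = 1.0 / epsilon
    # A Laplace(0, scale) variate is the difference of two exponentials.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

rng = random.Random(42)  # fixed seed so the sketch is repeatable
print(dp_count(128, epsilon=1.0, rng=rng))
```

Each published answer is close to the truth on average, but no single answer reveals whether any one patron’s record was included.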

Before You De-identify…

Remember how in the first section we mentioned that de-identification only works for some data sets and only reduces privacy risk? There are two main reasons for this:

  1. De-identification does not protect outliers in the data or small population data sets. There are equations and properties that can help you determine how easily your dataset can be re-identified, but for most libraries, de-identification is not possible due to the type or size of the data set they wish to de-identify.
  2. De-identified data can still be re-identified through the use of external data sets, particularly if the data was not properly de-identified in the first place. An evergreen example is the AOL data set, which retained identifying data in the search queries even though AOL scrubbed identifying data about the searchers.
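One of the properties alluded to in the first reason is k-anonymity: a data set is k-anonymous if every combination of quasi-identifier values is shared by at least k records, so outliers surface as tiny groups. A minimal Python check, using hypothetical, already-obfuscated records:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the smallest group size over the quasi-identifier columns."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(groups.values())

# Hypothetical, already-obfuscated circulation records.
rows = [
    {"zip3": "972", "age_range": "30-39", "checkouts": 12},
    {"zip3": "972", "age_range": "30-39", "checkouts": 3},
    {"zip3": "972", "age_range": "60-69", "checkouts": 7},
]

# k = 1 here: the lone 60-69 patron forms a group of one, so anyone who
# knows a 60-something patron at this branch can re-identify their row.
print(k_anonymity(rows, ["zip3", "age_range"]))   # 1
```

A result of 1 means at least one record is unique on its quasi-identifiers and the set should be generalized further, or those records suppressed, before release.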

It is possible to create a de-identified data set of patron data, but the process is not fool-proof. De-identification requires running multiple sample de-identification processes and analyzing the results to determine how easily the data can be reconnected to an individual.

Overall, de-identification is a tool to help protect patron privacy, but it should not be the only privacy tool used in the patron data lifecycle. The most effective privacy tools and methods in the patron data lifecycle are the questions you ask at the beginning of the lifecycle:

  • Why are you collecting this data?
  • Does this reason tie to a demonstrated business need?
  • Are there other ways you can achieve the business need without collecting high-risk patron PII?

If you want to learn more about de-identification and privacy risks, check out the resources below:

[1] The article contains additional privacy and security concerns that we will not cover in this post, including technical, administrative, and ethical concerns.

Hat Tip: Latanya Sweeney, Ph.D.

Welcome to this week’s Tip of the Hat!

Many of you might be preparing the last public displays for Black History Month or setting up the first set of Women’s History Month displays. If you need one more person to feature in either or both displays, or if you wish to learn about more important Black women in STEM, you’re in luck! Today’s newsletter is a quick introduction to one of the major players in the data privacy field, Latanya Sweeney, Ph.D.

Latanya Sweeney is a Professor of Government and Technology in Residence at Harvard University and the founding director of Harvard’s Data Privacy Lab. She is also the first African American woman to receive a Computer Science Ph.D. from MIT. Sweeney has made many major contributions to the technology field, but the best known among privacy professionals is her work on k-anonymity. Her work on the re-identification of individuals through data has prompted many in the privacy field to reassess the concept of anonymization. For example, in a study published in 2000, Sweeney found that 87% of the US population could be uniquely identified by zip code, gender, and date of birth. Health data is another area in which Sweeney has shown again and again how easily data protected by certain anonymization methods can be re-identified.

Other parts of Professor Sweeney’s work delve into how data can be used to discriminate, including her work on discrimination in online ad delivery. The projects page for the Data Privacy Lab and the various tools on its home page show the vast array of research areas under Sweeney’s direction of the Lab.

Did we also mention that she was the Chief Technologist at the FTC in 2014?

Some recent talks and panels include:

We leave you with an excerpt from a 2007 Scientific American interview that captures Sweeney’s approach to privacy:

[Walter] Why is privacy versus security becoming such a problem? Why should we even care?

[Sweeney](Laughs) Well, one issue is we need privacy. I don’t mean political issues. We literally can’t live in a society without it. Even in nature animals have to have some kind of secrecy to operate…. There’s a primal need for secrecy so we can achieve our goals.

Privacy also allows an individual the opportunity to grow and make mistakes and really develop in a way you can’t do in the absence of privacy, where there’s no forgiving and everyone knows what everyone else is doing… With today’s technology, though, you basically get a record from birth to grave and there’s no forgiveness. And so as a result we need technology that will preserve our privacy.