Getting “On The Same Page” – Personal Data and Libraries

We cover a lot of ground on the Tip of The Hat! There’s so much to explore with data privacy and security that sometimes it’s easy to get lost in the details and lose track of the fundamentals. We’re also in a field where it’s improbable that everyone shares the same background or knowledge about a specific topic, which contributes to some of the misunderstandings and confusion in discussions around data privacy and security.

We talk about the importance of setting expectations and shared understandings in our work at LDH, such as defining essential terms and concepts with vendors so that everyone is clear on what’s being said in contract negotiations. This week’s post is our attempt to extend this philosophy to the blog with the start of the On The Same Page series. The series will aim to define the terms that form the basis of library data privacy and security. This week we start with a term that is often used but whose definition is hard to pin down – personal data.

What is Personal Data?

Short answer – it’s complicated.

One of the reasons defining personal data is complicated is the legal world. Sometimes data privacy regulations use different terms, such as personally identifiable (or identifying) information (PII) or personal information (PI). If that wasn’t enough, these regulations have different definitions for the same concept. You can get a sense of how this confusion can play out after reading a comparison of the different terms and definitions of personal data for the EU’s General Data Protection Regulation (GDPR) with various US state data privacy laws such as the California Consumer Privacy Act (CCPA). There are some similarities between the legal definitions, but just enough difference (or vagueness) to make defining personal data a bit more complicated than expected.

We also can’t leave the definition as “data about an individual person” because the definition doesn’t fully capture what counts as personal data. The National Institute of Standards and Technology (NIST)’s definition of PII captures some of this complexity in the two main parts of their definition: “any information that can be used to distinguish or trace an individual’s identity” and “any other information that is linked or linkable to an individual.” This definition of personal data is not very helpful to the layperson who is trying to decide whether the pile of electronic resource use data in front of them is personal data. Can this data identify an individual patron? There are no names attached to it, so 🤷🏻‍♀️?

Despite the differences between these definitions, there are some common threads from which we can get a sense of what personal data is. We’ll break these threads into three categories:

  • Direct identifiers – data that directly identifies a person, such as a person’s name, government or organization identification number, and IP address.
  • Indirect identifiers – data that can identify a person with a great degree of confidence when combined with other indirect identifiers. This includes demographic, socioeconomic, and location data. A classic example of identifying people by combining indirect identifiers comes from Dr. Latanya Sweeney’s work identifying individuals using their date of birth, ZIP code, and gender.
  • Behavioral data – data that describes a person’s behaviors, activities, or habits. When collected over a length of time, behavioral data can identify a person when combined with direct identifiers, or if the behavioral data itself contains direct or indirect identifiers (put a pin in this for later!).
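To make the point about combining indirect identifiers more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records are made up, and the field names (birth_year, zip, patron_type) and the unique_records helper are hypothetical, not taken from any library system. It counts how many records are the only one with a given combination of values.

```python
from collections import Counter

# Hypothetical, made-up patron records with no names or barcode numbers.
records = [
    {"birth_year": 1985, "zip": "98101", "patron_type": "resident"},
    {"birth_year": 1985, "zip": "98101", "patron_type": "resident"},
    {"birth_year": 1990, "zip": "98101", "patron_type": "resident"},
    {"birth_year": 1985, "zip": "98115", "patron_type": "non-resident"},
    {"birth_year": 1972, "zip": "98115", "patron_type": "resident"},
]


def unique_records(rows, fields):
    """Return the rows whose combination of values for `fields` appears only once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


# No record stands out when we look at one indirect identifier alone...
single_field = unique_records(records, ["zip"])
# ...but several do once the indirect identifiers are combined.
combined = unique_records(records, ["birth_year", "zip", "patron_type"])

print(len(single_field), "records unique by zip alone")
print(len(combined), "records unique by birth year + zip + patron type")
```

In this toy data set, no record is unique by ZIP code alone, but three of the five become unique once the three fields are combined. Real patron data is far larger and messier, so treat this as an illustration of the idea rather than a re-identification test.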

In short, personal data is much more than a person’s name or ID number. Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. The second part of that definition is crucial to libraries working with patron data.

Libraries, Patrons, and Personal Data

When working with patron data, libraries work with all three types of personal data. The following is just a tiny sample of the kinds of patron data that fall under each category of personal data:

Direct identifiers

  • Name
  • Physical and email addresses
  • Patron record and barcode numbers
  • User account login and password
  • Device information (operating system, browser, device identification number, and other information that makes up a digital fingerprint)

Indirect identifiers

  • Demographic information such as age, gender identity, and race/ethnicity
  • Declared major or minor (or grade level in K-12 schools)
  • Disability status
  • Patron type (e.g., resident or non-resident; student, faculty, or staff; specific patron statuses based on specific services, library card types, or market segments)
  • Geographical information (e.g., region, neighborhood, home branch)

Behavioral data

  • Borrowing history
  • Search history
  • Reference question logs
  • Library website analytics capturing website activity
  • Electronic content access logs

Some of you might already know why we put a pin in the behavioral data earlier – search and question logs have a plethora of direct and indirect identifiers. For example, a reference chat log history for a typical day can contain direct identifiers, such as patron account login information and addresses, as well as subject matter that serves as an indirect identifier of the patron in question. IP addresses and device information from website analytics and system logs (such as proxy server logs) can also potentially identify a patron.
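As a rough illustration of how direct identifiers hide inside behavioral data, the sketch below scans a single made-up chat message for a few patterns. The patterns are assumptions: the 14-digit barcode format, the simplified email and IPv4 expressions, and the find_identifiers helper are all hypothetical and would need tuning to a library’s own systems.

```python
import re

# Hypothetical patterns; a real library would tune these to its own formats.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "barcode": re.compile(r"\b\d{14}\b"),  # assumes 14-digit barcode numbers
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_identifiers(text):
    """Return a dict mapping each pattern name to the matches found in `text`."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


# A made-up reference chat message.
chat_line = "Hi, my card 21234567890123 will not log in. Email me at pat@example.org."
result = find_identifiers(chat_line)
print(result)
```

A scan like this only catches identifiers that follow a predictable format. It will miss names, street addresses typed in free text, and the subject matter of the question itself, so a clean result does not mean the log is free of personal data.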

It’s almost impossible for patrons to use the library without the library collecting some form of personal data. The shift from print to electronic resources and services significantly increased the library’s ability to collect behavioral data that can identify patrons on its own. Even if the patron goes to the physical library just to pull a book from the shelf and read at the library, the security cameras in the building might record the person’s face (direct identifier) and the book that they pulled from the stacks (behavioral data) at a specific branch location and time of day (more behavioral data with a dash of an indirect identifier). Leaving the definition of personal data at “data about a person” does not capture the reality of how the evolution of services, resources, and technology in libraries has changed the type and amount of patron data generated when patrons use the library.

Constant Changes in What Counts as Personal Data

It’s tempting to settle on the general definition of personal data with the three categories and call it a day. However, the rapid pace of change in legal regulations and technologies means that the nature of personal data will change. What might be considered non-personal data today (such as highly aggregated data based on the definitions of several data privacy regulations) might be personal data in the near future when someone discovers how to connect that data to an individual using newer technologies, algorithms, or improved re-identification methods. It also might be that more categories of personal data are waiting to be defined or refined. We weren’t kidding when we said that defining personal data is complicated!

Nevertheless, what we have today is a good working definition that we can use when talking about patron data privacy and security: Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. While it’s easier to think about personal data when we limit ourselves to someone’s name or barcode number, we must remember that personal data takes on many, and sometimes deceptive, forms – particularly when it comes to the behavioral data generated by patron use of the library.