Getting “On The Same Page” – Personal Data and Libraries

An adult white woman stands in front of a wall. Binary code and a programming script are being projected on both the woman and the wall. – Photo by ThisisEngineering RAEng on Unsplash

We cover a lot of ground on the Tip of The Hat! There’s so much to explore with data privacy and security that sometimes it’s easy to get lost in the details and lose track of the fundamentals. We’re also in a field where it’s improbable that everyone shares the same background or knowledge about a specific topic, which contributes to some of the misunderstandings and confusion in discussions around data privacy and security.

We talk about the importance of setting expectations and shared understandings in our work at LDH, such as defining essential terms and concepts with vendors so that everyone is clear on what’s being said in contract negotiations. This week’s post is our attempt to extend this philosophy to the blog with the start of the On The Same Page series. The series will aim to define the terms that form the basis of library data privacy and security. This week we start with a term that is often used but hard to pin down – personal data.

What is Personal Data?

Short answer – it’s complicated.

One of the reasons defining personal data is complicated is the legal world. Sometimes data privacy regulations use different terms, such as personally identifiable (or identifying) information (PII) or personal information (PI). If that wasn’t enough, these regulations have different definitions for the same concept. You can get a sense of how this confusion can play out after reading a comparison of the different terms and definitions of personal data for the EU’s General Data Protection Regulation (GDPR) with various US state data privacy laws such as the California Consumer Privacy Act (CCPA). There are some similarities between the legal definitions, but just enough difference (or vagueness) to make defining personal data a bit more complicated than expected.

We also can’t leave the definition as “data about an individual person” because the definition doesn’t fully capture what counts as personal data. The National Institute of Standards and Technology (NIST)’s definition of PII captures some of this complexity in the two main parts of their definition: “any information that can be used to distinguish or trace an individual’s identity” and “any other information that is linked or linkable to an individual.” This definition is not very helpful to the layperson who is trying to decide whether the pile of electronic resource use data in front of them is personal data. Can this data identify an individual patron? There are no names attached to it, so 🤷🏻‍♀️?

Despite the differences between these definitions, there are some common threads that give us a sense of what personal data is. We’ll break these threads into three categories:

Direct identifiers – data that directly identifies a person, such as a person’s name, government or organization identification number, and IP address.
Indirect identifiers – data that can identify a person with a great degree of confidence when combined with other indirect identifiers. This includes demographic, socioeconomic, and location data. A classic example of identifying people by combining indirect identifiers comes from Dr. Latanya Sweeney’s work identifying individuals using the date of birth, zip code, and gender.
Behavioral data – data that describes a person’s behaviors, activities, or habits. When collected over a length of time, behavioral data can identify a person when combined with direct identifiers, or if the behavioral data itself contains direct or indirect identifiers (put a pin in this for later!).

In short, personal data is much more than a person’s name or ID number. Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. The second part of that definition is crucial to libraries working with patron data.
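To make the indirect identifier point concrete, here is a minimal sketch that counts how many records in a data set become unique once several indirect identifiers are combined. The records, field names, and the unique_combinations helper are made up for illustration; this is not Dr. Sweeney’s method or data, only a small demonstration of the same idea.

```python
from collections import Counter

# Hypothetical, made-up records: no names or ID numbers, only indirect identifiers.
records = [
    {"birth_date": "1980-04-12", "zip_code": "98101", "gender": "F"},
    {"birth_date": "1980-04-12", "zip_code": "98101", "gender": "F"},
    {"birth_date": "1975-09-30", "zip_code": "98101", "gender": "M"},
    {"birth_date": "1992-01-05", "zip_code": "98115", "gender": "F"},
    {"birth_date": "1992-01-05", "zip_code": "98115", "gender": "M"},
]


def unique_combinations(rows, fields):
    """Return the rows whose combination of values for `fields` appears only once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [row for row in rows if counts[tuple(row[f] for f in fields)] == 1]


# Zip code alone does not single anyone out in this sample...
single_field = unique_combinations(records, ["zip_code"])
# ...but combining three indirect identifiers does.
combined = unique_combinations(records, ["birth_date", "zip_code", "gender"])

print(len(single_field), "records are unique by zip code alone")
print(len(combined), "records are unique by birth date + zip code + gender")
```

A record that is unique on a combination of fields is a candidate for re-identification by anyone who can match those same fields against another data set.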

Libraries, Patrons, and Personal Data

When working with patron data, libraries work with all three types of personal data. The following is just a tiny sample of the kinds of patron data that fall under each category of personal data:

Direct identifiers

Name
Physical and email addresses
Patron record and barcode numbers
User account login and password
Device information (operating system, browser, device identification number, and other information that makes up a digital fingerprint)

Indirect identifiers

Demographic information such as age, gender identity, and race/ethnicity
Declared major or minor (or grade level in K-12 schools)
Disability status
Patron type (e.g., resident or non-resident; student, faculty, or staff; specific patron statuses based on specific services, library card types, or market segments)
Geographical information (e.g., region, neighborhood, home branch)

Behavioral data

Borrowing history
Search history
Reference question logs
Library website analytics capturing website activity
Electronic content access logs

Some of you might already know why we put a pin in the behavioral data earlier – search and question logs have a plethora of direct and indirect identifiers. For example, a reference chat log history for a typical day can contain direct identifiers such as patron account login information and addresses and subject matter that serve as indirect identifiers of the patron in question. IP addresses and device information from website analytics and system logs (such as proxy server logs) can also potentially identify a patron.
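As a rough illustration of how identifiers hide inside behavioral data, the sketch below scans a few made-up chat log lines for email addresses and IPv4 addresses. The patterns are deliberately simple assumptions, and the log format is hypothetical; real logs will need patterns tuned to what your systems actually record.

```python
import re

# Simplified patterns for illustration; they will miss some valid forms
# and may flag some false positives.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_identifiers(line):
    """Return any email or IPv4 addresses found in one log line."""
    return {
        "emails": EMAIL_PATTERN.findall(line),
        "ip_addresses": IPV4_PATTERN.findall(line),
    }


# Hypothetical chat log lines (made up for this example).
sample_log = [
    "10:02 patron: my card isn't working, email me at pat@example.org",
    "10:03 system: session started from 192.0.2.44",
    "10:05 patron: do you have books about beekeeping?",
]

# Keep only the lines where at least one identifier was found.
flagged = [line for line in sample_log if any(find_identifiers(line).values())]

for line in flagged:
    print(line)
```

Note that the third line passes this check even though the subject matter of a question can still act as an indirect identifier. Pattern matching only catches the obvious cases.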

It’s almost impossible for patrons to use the library without the library collecting some form of personal data. The shift from print to electronic resources and services significantly increased the library’s ability to collect behavioral data that can identify patrons on its own. Even if the patron goes to the physical library just to pull a book from the shelf and read at the library, the security cameras in the building might record the person’s face (direct identifier) and the book that they pulled from the stacks (behavioral data) at a specific branch location and time of day (more behavioral data with a dash of an indirect identifier). Leaving the definition of personal data at “data about a person” does not capture the reality of how the evolution of services, resources, and technology in libraries has changed the type and amount of patron data generated by library use.

Constant Changes in What Counts as Personal Data

It’s tempting to settle on the general definition of personal data with the three categories and call it a day. However, the rapid pace of change in legal regulations and technologies means that the nature of personal data will change. What might be considered non-personal data today (such as highly aggregated data based on the definitions of several data privacy regulations) might be personal data in the near future when someone discovers how to connect that data to an individual using newer technologies, algorithms, or improved re-identification methods. It also might be that more categories of personal data are waiting to be defined or refined. We weren’t kidding when we said that defining personal data is complicated!

Nevertheless, what we have today is a good working definition that we can use when talking about patron data privacy and security: Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. While it’s easier to think about personal data when we limit ourselves to someone’s name or barcode number, we must remember that personal data takes on many, and sometimes deceptive, forms – particularly when it comes to the behavioral data generated by patron use of the library.

March 22, 2021

#DataSpringCleaning 2021 – Email and Patron Data

A white and brown short-haired dog places their right front paw on top of an open laptop keyboard. The laptop screen shows a blurred Gmail inbox window. – Image source: https://www.flickr.com/photos/karenbaijens/16241866468/ (CC BY 2.0)

Welcome to the first week of Spring in the Northern Hemisphere! This month marks one year of working from home for some library workers and the hybrid remote/onsite work limbo for others. In both cases, this anniversary also marks a year’s worth of patron data collected and stored all over the place due to the abrupt switch to remote work and virtual services. It’s safe to say that many disaster or business continuity plans didn’t plan for a pandemic, and the resulting scramble to virtual or reduced physical services/work created new or exacerbated existing data privacy gaps. Last year’s #DataSpringCleaning focused on setting up the home office to address a common privacy problem – the over-retention of patron data. Check out the post and the companion workshop materials about protecting patron privacy while working from home if you haven’t already done so.

This year’s #DataSpringCleaning project is as ambitious as it is daunting. This year is the Sisyphean project of data cleanup projects – no matter how many times we try and fail, we keep coming back to this one project in hopes of finally completing it. Let us go back once more into the breach, friends. It’s time to scrub our work email.

Email as a Major Risk to Patron Privacy

While many library workers are aware that their emails can contain patron data, they might not be aware of how much patron data is stored in their accounts. Personally identifiable information, or PII, includes data about a patron as well as data of a patron’s activity. The former can be easy to identify and easy to email without much thought about the privacy risk of doing so:

Name
Physical and email addresses
Birthdate or age
Patron record number
Username and password

A patron’s activities, on the other hand, can be harder to identify once you factor in the types of emails a library worker can receive or send in any given day:

Help desk ticket threads
Reference form or chat tickets or transcripts
Direct email from patrons
System or application reports or alerts
Vendor service desk tickets or reports

This list is just a small selection of the types of emails that can contain data around a patron’s activities such as:

Reference questions
Search and circulation histories
IP addresses
Electronic resource authentication and access history
Library computer and wifi logs and activity

And that’s just the start of how much patron data is in staff emails!

The ease of storing and sharing data through email makes it difficult to control data sharing and retention once the data hits the email system. The risk to patron privacy compounds once the email containing patron data leaves the library’s email system and lands in a third-party email account, be it a vendor or even a personal email account. Another risk for many libraries is that staff emails are subject to public disclosure requests. Several state and local regulations protect patron record data from disclosure, but in some cases, this protection might not extend to patron data in staff email. If your library’s emails can be publicly requested, don’t assume that you’ll get a chance to redact patron data before the emails are released to the public.

Starting the Long Journey of Protecting Patron Privacy in Staff Email

Scrubbing patron data from library email is a Sisyphean task. You can tell patrons not to email PII only to have patrons send over their logins for the financial website they can’t log into on a public computer. You can tell staff not to store patron data in work email, only to have staff use email as their primary knowledgebase for reference chat questions and answers. However, you have more control over how staff uses library email than you do patrons – this is where we start our scrubbing journey.

We’ll break this journey into two parts: the short and long term. The following are some actions workers and organizations can take to mitigate patron privacy risk in library emails:

Short term (individual) actions

First, get familiar with your email system’s filter and search capabilities! These will make the deletion process less painful.
Find and delete system-generated emails that contain patron data. These can be found through searching by a shared email address or subject line.
Search for emails with attachments and delete attachments if they contain patron data
Before deleting the email, migrate patron data that absolutely must be retained for a demonstrated operational need from email to a secured storage area designated by work (if one is available)
Create email rules to automatically delete incoming system-generated emails containing patron data
Learn how to use the ticketing system or other help desk or information desk systems as the primary mode of communication with other library staff about tickets and other

Long term (organizational) actions

Create policies and procedures around restricting the use of staff email to transmit or store certain types of patron data based on data classification level and/or privacy risk
Create secured data/file transfer options for sharing patron data, particularly between staff and authorized third parties
Set up applications and systems to not include patron data in system-generated reports and emails
Set up retention policies in email systems to automatically delete email based on organizational retention schedules or retention schedules set by legal regulation
Create procedures or processes to use the ticketing system or other help desk or information desk systems as the primary mode of communication between staff as well as between staff and patrons
Create secured storage outside of staff email for patron data that absolutely must be retained for a demonstrated operational need, and create retention schedules for the data retained in storage

The short-term actions can take a while with manual reviewing of attachments and individual emails. But, with the magic of search and filter options, you can quickly eliminate a good portion of privacy risks by deleting the archive of system-generated emails. The long-term actions require a team effort in the organization, from administration drafting policies to IT creating automatic retention policies and secured storage and transmission options.
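If your email system lets you export or script against your mailbox, the search-and-filter step might look something like the sketch below. It uses Python’s standard email library on a handful of in-memory sample messages; the sender address, subject keywords, and helper names are hypothetical, and it only builds a review list rather than deleting anything.

```python
from email.message import EmailMessage

# Hypothetical values: substitute the shared addresses and subject lines
# that your own systems actually use.
SYSTEM_SENDERS = {"ils-reports@library.example"}
SUBJECT_KEYWORDS = ("overdue report", "hold notice")


def make_message(sender, subject):
    """Build a small sample message (stand-in for messages read from a mailbox)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content("(body omitted)")
    return msg


def is_system_generated(msg):
    """Flag a message by shared sender address or by subject line keyword."""
    sender = str(msg["From"] or "").lower()
    subject = str(msg["Subject"] or "").lower()
    return sender in SYSTEM_SENDERS or any(k in subject for k in SUBJECT_KEYWORDS)


inbox = [
    make_message("ils-reports@library.example", "Daily Overdue Report"),
    make_message("colleague@library.example", "Lunch on Friday?"),
    make_message("noreply@vendor.example", "Hold notice for patron"),
]

# Collect subjects for human review before anything is deleted.
to_review = [str(m["Subject"]) for m in inbox if is_system_generated(m)]
print(to_review)
```

Reviewing the flagged list before deleting gives you a chance to move anything that must be retained into secured storage first.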

None of us want to spend more time dealing with email than we have to, and trying to keep up with the current email inbox count is near impossible as it is. Nonetheless, we need to keep in mind that work email can put patron privacy at risk, and we must address that risk as part of our library duties. It’s a #DataSpringCleaning project that never ends, but as long as we have email, there will always be the need to clean our inboxes to protect patron privacy.

May 26, 2020 (updated July 29, 2020)

Just Published! Library Data Risk Assessment Guide

Welcome to this week’s Tip of the Hat!

To build or to outsource?

Building an application or creating a process in a library takes time and resources. A major benefit of keeping it local, though, is that libraries have the greatest control over the data collected, stored, and processed by that application or system. Conversely, a major drawback of keeping it local is the sheer number of moving parts to keep track of in the building process. Some libraries have the technical know-how to build their own applications or have the resources to keep a process in house. Keeping track of privacy risks is another matter. Risk assessment and management must be addressed in any system or process that touches patron data, so how can libraries with limited privacy risk assessment or management experience make sure that their local systems and processes mitigate patron privacy risks?

Libraries have a new resource to help with privacy risk management! The Digital Library Federation’s Privacy and Ethics in Technology Working Group (formerly known as the Technologies of Surveillance Working Group) published “A Practical Guide to Performing a Library User Data Risk Assessment in Library-Built Systems”. This 28-page guide provides best practices and practical strategies in conducting a data risk assessment, including:

Classifications of library user data and privacy risk
A table of common risk areas, including probability, severity, and mitigation strategies
Practical steps to mitigate data privacy risks in the library, ranging from policy to data minimization
A template for readers to conduct their own user data inventory and risk assessment

This guide joins the other valuable resources produced by the DLF Privacy and Ethics in Technology Working Group:

The group also plans to publish a set of guidelines around vendor privacy in the coming months, so be sure to bookmark https://wiki.diglib.org/Privacy_and_Ethics_in_Technology and check back for any updates!

July 15, 2019 (updated July 29, 2020)

CRMS 101

Welcome to this week’s Tip of the Hat! Today we have a brief overview of a tool that is becoming popular in libraries – the customer relationship management system [CRMS] – and how this new player in the library field affects patron privacy. While some folks know about CRMS, others might not be exactly sure what these systems are or what they have to do with libraries. Below is a “101”-type guide to help folks get up to speed on the ongoing conversation.

What is a CRMS?

A customer relationship management system [CRMS] manages an organization’s interactions with customers with the goal of growing and maintaining customer relationships with the organization. CRMS products have been used in fields outside of librarianship for decades, mostly in commercial businesses, but the increased importance of data analysis and improving customer experiences has led to wider adoption of CRMS products in other fields, including libraries.

What is a CRMS used for?

Many organizations use CRMS products to track various communications with customers (email, social media, phone, etc.) as well as data about a customer’s interests, demographics, and other data that can be used for data analysis. This analysis is then used to improve and customize the user experience (targeted marketing, personal recommendations and invitations, etc.) as well as to make business decisions surrounding products, services, and organization-customer relations. This analysis can also be used to create user profiles or for market segmentation research.

What are some examples of CRMS?

There are many proprietary and open source options, though Salesforce is one of the most recognized CRM companies in the overall field. In the library world, several library vendors sell standalone CRMS products, such as OrangeBoy’s Savannah. Other library vendors have started offering products that integrate the CRMS into the Integrated Library System [ILS]. OCLC’s WISE is one such example of this integration, while other library vendors plan to release their versions in the near future.

What data is collected in a CRMS?

A CRMS is capable of collecting a large quantity of very detailed data about a customer. Types of patron data that can be collected with a library CRMS include (but are not limited to):

Demographic information
Circulation information like total checkouts, types of materials checked out, and physical location of checkouts
Public computer reservation information
Electronic resource usage
Program attendance

In addition to library-supplied data, other data sets from external sources can be imported into the CRMS, ranging from US Census data to open data sets from cities and other organizations that could include other demographic information by geographical area (such as zip code) or by other indicators.

How is patron privacy impacted by CRMS?

The amount of information that can be collected by a CRMS is akin to the type of information collected by commercial companies who sell services and products. By creating a user profile, the company can use that information to personalize that customer’s experience and interactions with the company, with the ultimate goal of creating and maintaining return customers. Traditionally libraries do collect and store some of the same information that CRMS products collect; however, it is usually not stored in one central database. Creating a profile of a patron’s use of the library leaves both the library and the patron at high risk for harm on both a personal and organizational level. This user profile is subject to unauthorized access by library staff, data breaches and leaks, or intentional misuse by staff or by the vendor that is hosting the system. This user profile can also be subject to a judicial subpoena, which puts patrons who are part of vulnerable populations at higher risk for personal harm if the information is collected and stored in the CRMS.

Further reading on the conflict between the CRMS, data collection, and library privacy:

Big Brother is Watching You: The ethical role of libraries and big data by Erin Berman
The Challenge of Balancing Customer Service with Privacy by Sarah Houghton

What can we do to mitigate privacy risks if we use a CRMS?

If your library chooses to use a CRMS:

Limit the type and amount of patron data collected by the system. For data that is collected and stored in the CRMS, consider de-identification methods, such as aggregation, obfuscation, and truncation
Perform risk assessments to gauge the level of potential harm connected with collecting and storing certain types of patron information as well as matching up patron information with imported data sets from external sources
Negotiate at the contract signing or renewal stage with the vendor regarding privacy and security policies and standards around the collection, storage, access, and deletion/retention of patron data, as well as who is responsible for what in case there is a data breach
Perform regular privacy and security audits for both the library and the vendor
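For a sense of what the de-identification methods in the first item can look like in practice, here is a minimal sketch of truncation, aggregation, and one possible reading of obfuscation (a keyed hash). The field names, sample patrons, and secret key are hypothetical, and none of this reflects how any particular CRMS product works. A keyed hash also only pseudonymizes a barcode; it does not make the rest of the record anonymous.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret; in practice this would be stored securely, not in code.
SECRET_KEY = b"replace-with-a-real-secret"


def truncate_zip(zip_code, keep=3):
    """Truncation: keep only the first `keep` characters of a zip code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def age_band(age, width=10):
    """Aggregation: replace an exact age with a range such as 30-39."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def obfuscate_barcode(barcode):
    """Obfuscation (assumed here to mean a keyed hash of the barcode)."""
    return hmac.new(SECRET_KEY, barcode.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up patron records for illustration.
patrons = [
    {"barcode": "21234000000001", "zip_code": "98101", "age": 34},
    {"barcode": "21234000000002", "zip_code": "98115", "age": 37},
    {"barcode": "21234000000003", "zip_code": "98101", "age": 62},
]

deidentified = [
    {
        "patron_key": obfuscate_barcode(p["barcode"]),
        "zip_code": truncate_zip(p["zip_code"]),
        "age_band": age_band(p["age"]),
    }
    for p in patrons
]

# Aggregated counts are often all that a report actually needs.
band_counts = Counter(row["age_band"] for row in deidentified)
print(band_counts)
```

Even after these steps, combinations of the remaining fields can still single people out in small populations, which is where the risk assessment in the second item comes in.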

We hope that you find this guide useful! Please feel free to forward or pass along the guide in your organizations if you are having conversations about CRMS adoption or implementation. LDH can also help you through the decision, negotiation, or implementation processes – contact us to learn more!

April 22, 2019 (updated July 29, 2020)

[REDACTED] – Redacting PII From Digital Collections

Welcome to this week’s Tip of the Hat! The Executive Assistant is back, and you know what that means…

A sitting black cat looking up at the camera, meowing loudly.
We’re back in business, newsletter-wise!

This week’s topic comes from a recent post to the Code4Lib mailing list. A library is planning to scan a batch of archival documents to PDF format, and is looking for ways to automate the process of identifying personally identifiable information [PII] in the documents and redacting said PII. The person mentioned that the documents might contain Social Security Numbers or credit card numbers.

Many libraries and archives have resources – digital and physical – that contain some form of PII in the source. While physical resources can be restricted to specific physical locations (unless someone copies the source via copier, pencil and paper, camera, etc.), digital resources that are available through a digital repository can increase the risk of privacy harm if that digital resource contains unredacted PII.

When libraries and archives are incorporating personal collections, research data sets, or other resources that may contain PII, here are some considerations to keep in mind to help through the process of mitigating the risk of data breaches and other privacy harms:

Who is included or mentioned in the resource – Some archival collections contain PII surrounding the individual who donated their materials. When dealing with institutional/educational records or research data sets, however, you might be dealing with different types of PII regulations and policies depending on who is included in the resource and what type of PII is present.

What PII is in the resource – When most folks think about PII, they think about information about a person: name, Social Security number, financial information, addresses, and so on. What tends to be overlooked is PII that is information about an activity surrounding a person that could identify that person. Think library checkout histories, web search histories, and purchase history. You will need to decide what types of PII need redacting, but keep both facets of PII in mind when deciding.

What is the redaction workflow – This gets into the question from the mailing list. The workflow of redacting PII depends on several factors, including what PII needs to be redacted, the number of resources needing to be redacted, and what format the resource is in. Integrating redaction into a digitization or intake workflow reduces the time spent retroactively redacting PII by staff. Here I’d like to offer a word of caution – while automating workflows for efficiency can be positive, sub-optimizing a part of a workflow can lead to a less efficient overall workflow as well as have negative effects on work quality or resources.

What tools and resources are available – While looking at the overall workflow for redacting PII, keep in mind that the resources and knowledge available to you as an organization to build and maintain a redaction workflow will greatly shape said workflow, or even the ability to redact PII in a systematic manner. There are many commercial tools that automate data classification and redaction workflows, and there are options to “roll your own” identification and redaction tool using various programming languages and regular expressions. If you work at a library or archive that is part of a bigger institution, there might be tools or resources already available through central IT or through departments that oversee compliance or information security and privacy. Don’t be afraid to reach out to these folks!
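As a starting point for the “roll your own” route, here is a minimal sketch that redacts Social Security Numbers and credit card numbers using regular expressions. It assumes the text has already been extracted from the scanned PDF (OCR and PDF handling are out of scope here), the patterns are simplified for illustration, and the function names are our own. The Luhn checksum is used only to cut down on false positives for card numbers.

```python
import re

# Simplified patterns: SSNs written as 123-45-6789, and 13 to 19 digit
# card numbers with optional spaces or hyphens between digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def passes_luhn(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text):
    """Replace SSNs and Luhn-valid card numbers with placeholder labels."""
    text = SSN_PATTERN.sub("[REDACTED SSN]", text)

    def replace_card(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED CARD]" if passes_luhn(digits) else match.group()

    return CARD_PATTERN.sub(replace_card, text)


sample = "SSN 123-45-6789 and card 4111 1111 1111 1111 on file."
print(redact(sample))
```

A script like this will miss identifiers that OCR has garbled or that are written in unexpected formats, so a human spot check of the output is still needed before anything goes into a public repository.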

If you’re wondering where to begin or how other organizations approach redaction, here are some resources to start with:

“Data Redaction: What it is, why you need it, and how to start” (OPIN Systems)
Duke University’s Medical Center Archives post about archival restriction of information in a medical archive
“Ethics in Archives: How Special Collections Protects Your Privacy” by Jessica L. Serrao
Indiana University’s guide on redacting data in multiple formats
“Major Legal Challenges Facing Oral History in the Digital Age” by John Neuenschwander