Hello, Cherry Blossoms

Let’s take a break to appreciate the cherry blossoms across town.

A closeup on a group of cherry blossoms in bloom and flower buds on a tree branch in front of a blurred church steeple background.
Image source: https://www.flickr.com/photos/40441865@N08/16528632440/ (CC BY 2.0)
A cherry tree in full bloom in a secluded park in early spring. The blossoms appear to cascade from the tree on its various branches.
Image source: https://www.flickr.com/photos/kaoru_o/13596683015/ (CC BY ND 2.0)
A row of blooming cherry trees in front of red bricked academic buildings on the University of Washington Seattle campus.
Image source: https://www.flickr.com/photos/brianholsclaw/4447935281/ (CC BY ND 2.0)

[Bonus – If you’re curious about what makes a cherry tree a cherry tree, the University of Washington created an animated illustration describing the anatomy of a cherry tree.]

Take some time to appreciate the flower blossoms wherever you are – we’ll be back next week with the latest library privacy news and updates.

In The Meantime…

Do you have a library privacy question for us? Email us at newsletter@ldhconsultingservices.com with your question or idea and we’ll feature it in a future newsletter. We also welcome guest writers for the newsletter. If you have an idea for a guest post, let us know for a chance to be featured on the blog. We look forward to your questions and ideas!

#DataSpringCleaning 2022 – Glitter, Data, and You

Happy belated Spring Equinox to our fellow Northern Hemisphere dwellers! It doesn’t exactly feel like spring for many folks, but soon enough, there will be leaves on the trees, flowers in the gardens, and pollen in the air. So, so much pollen. Pollen that makes you sneeze even if you haven’t ventured outside in days and have all the windows and doors closed. Pollen that coats your car to the point where you can’t see out of the windshield. Pollen clouds. Pollen is everywhere. It’s like nature’s version of glitter.

The analogy of pollen-as-glitter doesn’t quite match up one-to-one. For example, limiting the amount of glitter we come into contact with is easier than limiting the amount of pollen, unless you take drastic measures (like moving to another part of the world to avoid certain types of pollen). However, there is a more accurate analogy to be made – data as glitter. Here are some ways data is like glitter, from our tweet in 2020:

Hot take – Data is not the new oil. Data is the new glitter:

– Lures humans in with its shininess
– Very easy to accumulate
– Found in places you least likely expect to find it
– Almost impossible to get rid of
– Everyone insists on using it w/o thinking through the consequences

We all had a glitter phase – all glitter, all the time. Some of us, though, are the ones left cleaning up after someone somewhere in the building uses glitter. The nature of glitter – the attractiveness of its shininess, the ease of getting hold of it, the lightweight and aerodynamic nature of individual glitter specks – is a recipe for disaster if there are no guidelines in place for using it. Parents and educators might already know a few of these guidelines: laying down plastic or paper over the workspace for easy cleanup, not leaving glitter containers open when not in use, and washing hands when finished working with glitter. For such tiny specks of plastic, it takes a lot of effort to ensure that the glitter doesn’t get everywhere and on everyone.

Data is like glitter. If there are no guidelines or measures to control the use and flow of data, you will have multiple versions of the same data set in various places. In previous #DataSpringCleaning posts, we talked about electronic and physical data retention and deletion, but that only addresses some of the privacy risks we face when working with data. For those unfortunate enough to have to clean up after a glitter explosion, it’s nearly impossible to get all the glitter if control measures were not put in place. The same is true with data – left unrestricted, data will get everywhere, making it almost impossible to delete. It also makes it practically impossible to control who has access, what is shared, and even when it’s appropriate to work with patron data.

For this year’s #DataSpringCleaning, we’re taking a proactive approach to avoid cleaning up explosion after explosion of glitter-like data. What are some ways you can limit the spread of patron data in your library or organization? The data lifecycle is a great place to start:

  • What data do you absolutely need to collect to do what you need to do?
  • Where should you keep the data?
  • Who should have access to the data?
  • How should the data be shared, if at all?
  • How do you clean up after the data is no longer used or needed?

Another place to start is to get into the habit of asking if you truly need to use patron data in the first place. Some of the worst glitter cleanups come from times when glitter use was absolutely unnecessary – for example, before you use that glitter bath bomb, do you really need to have glitter all over yourself and your bathtub and your bathroom and your pets who enter the bathroom and your carpet and your furniture and your clothes and everyone who comes into contact with you or the other glittered surfaces? The answer is almost always “no.”

Stopping to ask yourself if patron data is needed in the first place to do the thing that you need to do is one of the best ways to avoid putting patron privacy at risk at your library. Thinking about data in terms of glitter can help you get into the habit of being more judicious about when to use patron data and how it should be used to limit unmitigated messes that will take considerable amounts of time to clean up. Data is glitter – plan accordingly!

Say What You Mean, or When Not to Use Certain Technical Terms

The phrase "Choose your words" is spelled out using wooden Scrabble letter tiles on a white table. The word "your" is spelled vertically using the "o" and "r" in the horizontal words "choose" and "words", respectively.
Photo by Brett Jordan on Unsplash

Welcome to the first week of Daylight Saving Time for most of our readers in the US! Now that we are short one hour of sleep, it’s the best time to start with a thought experiment. The following is an excerpt from a recent library technology conference poster proposal:

“Patrons who visit an academic library with their smart devices (i.e., cell phones, laptops, tablets) connected to the campus Wi-Fi services would have their geolocation data, user ID, and time stamp stored in the Wi-Fi service provider’s system. The big data harvested provides a clear view of patron demographic information, including majors, classes being taken, along with other data… the use of Artificial Intelligence has helped the library to predict user behavior and thus be able to more closely tailor facilities, collections, and instruction to enhance student success.” (emphasis added)

Now for the question – what would you use instead of “Artificial Intelligence” in that excerpt? Take a moment to write down whatever comes to mind.

As we started exploring in our “On The Same Page” series, words are complicated. Sometimes they don’t fully convey the complexity of the concept they represent, such as personal data. Other terms are prone to obscure, misdirect, or otherwise conceal the real-world consequences of the ideas and actions they represent. Phrases like “artificial intelligence” and “machine learning” find widespread use in our lives without much thought about what they mean and what those terms imply for our understanding of technology. What can we use instead of these terms, though?

An excellent place to start is to say what you mean. The Center on Privacy & Technology at Georgetown recently announced that they will no longer use terms like “artificial intelligence” and “machine learning.” Instead, they will use the following guidelines to say what they mean:

1. Be as specific as possible about what the technology in question is and how it works.

2. Identify any obstacles to our own understanding of a technology that result from failures of corporate or government transparency.

3. Name the corporations responsible for creating and spreading the technological product.

4. Attribute agency to the human actors building and using the technology, never to the technology itself.

One example provided by the article takes the phrase “face recognition uses artificial intelligence” and replaces it with “tech companies use massive data sets to train algorithms to match images of human faces.” The latter phrase is specific about everything involved, including the human involvement behind the technologies referenced in the former term. The latter phrase also doesn’t conceal the process of facial recognition – it takes data from real human faces, and lots of it, for an algorithm to correctly match a face with an image. But wait – where do the faces come from? What decisions are being made about which faces to feed into the algorithm? Do the people whose faces are being used to train this algorithm know that their faces are being used in training? What are the ultimate goals of the tech companies in creating this type of technology? Who are these tech companies in the first place?

Being specific about the technology, how it works, and the humans behind it better positions readers to ask questions about the real-world impact of these technologies. It also makes more apparent the potential harms that can come from these technologies, such as the potential lack of consent from the people whose faces are being used for training and the potential bias in the data set itself based on who is included. Spelling out the specifics breaks us from using technical terms when we and our audience might not fully understand them or be aware of the potential privacy risks and harms inherent in these technologies.

Let’s revisit the excerpt from the beginning of the post. With the Privacy Center’s guidelines in mind, what would you say instead of “Artificial Intelligence” in the last sentence?

(Bonus – Are there other technical terms in the excerpt that need to be spelled out? If so, what should be said instead of those terms?)

We invite you to share your answers with us using the following form. We are not collecting personal data such as IP address, name, or address for submissions. We’ll return to the exercise and share the responses in a future post, so stay tuned!

Getting “On The Same Page” – Personal Data and Libraries

We cover a lot of ground on the Tip of The Hat! There’s so much to explore with data privacy and security that sometimes it’s easy to get lost in the details and lose track of the fundamentals. We’re also in a field where it’s improbable that everyone shares the same background or knowledge about a specific topic, which contributes to some of the misunderstandings and confusion in discussions around data privacy and security.

We talk about the importance of setting expectations and shared understandings in our work at LDH, such as defining essential terms and concepts with vendors so that everyone is clear on what’s being said in contract negotiations. This week’s post is our attempt to extend this philosophy to the blog with the start of the On The Same Page series. The series will aim to define the terms that form the basis of library data privacy and security. This week we start with a term often used, but its definition is hard to pin down – personal data.

What is Personal Data?

Short answer – it’s complicated.

One of the reasons defining personal data is complicated is the legal world. Sometimes data privacy regulations use different terms, such as personally identifiable (or identifying) information (PII) or personal information (PI). If that wasn’t enough, these regulations have different definitions for the same concept. You can get a sense of how this confusion can play out after reading a comparison of the different terms and definitions of personal data for the EU’s General Data Protection Regulation (GDPR) with various US state data privacy laws such as the California Consumer Privacy Act (CCPA). There are some similarities between the legal definitions, but just enough difference (or vagueness) to make defining personal data a bit more complicated than expected.

We also can’t leave the definition as “data about an individual person” because that doesn’t fully capture what counts as personal data. The National Institute of Standards and Technology (NIST)’s definition of PII captures some of this complexity in the two main parts of its definition: “any information that can be used to distinguish or trace an individual‘s identity” and “any other information that is linked or linkable to an individual.” This definition is not very helpful to the layperson trying to figure out if the pile of electronic resource use data in front of them is personal data. Can this data identify an individual patron? There are no names attached to it, so 🤷🏻‍♀️?

Despite the differences between these definitions, there are some common threads from which we can get a sense of what personal data is. We’ll break these threads into three categories:

  • Direct identifiers – data that directly identifies a person, such as a person’s name, government or organization identification number, and IP address.
  • Indirect identifiers – data that can identify a person with a great degree of confidence when combined with other indirect identifiers. This includes demographic, socioeconomic, and location data. A classic example of identifying people by combining indirect identifiers comes from Dr. Latanya Sweeney’s work identifying individuals using date of birth, ZIP code, and gender.
  • Behavioral data – data that describes a person’s behaviors, activities, or habits. When collected over a length of time, behavioral data can identify a person when combined with direct identifiers, or if the behavioral data itself contains direct or indirect identifiers (put a pin in this for later!).
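To make the re-identification risk from combining indirect identifiers concrete, here is a minimal, hypothetical sketch in Python. The records, field names, and values are entirely made up for illustration; the point is simply that a combination of indirect identifiers can single out a person in a data set that contains no names at all.

```python
from collections import Counter

# Hypothetical patron records: names removed, but quasi-identifiers remain.
records = [
    {"dob": "1987-04-12", "zip": "98105", "gender": "F", "checkout": "Book A"},
    {"dob": "1987-04-12", "zip": "98105", "gender": "F", "checkout": "Book B"},
    {"dob": "1990-09-30", "zip": "98105", "gender": "F", "checkout": "Book C"},
    {"dob": "1987-04-12", "zip": "98118", "gender": "M", "checkout": "Book D"},
]

# Count how many records share each (dob, zip, gender) combination.
combos = Counter((r["dob"], r["zip"], r["gender"]) for r in records)

# Any combination appearing exactly once pins the record to a single person,
# even though no name appears anywhere in the data.
unique = [r for r in records if combos[(r["dob"], r["zip"], r["gender"])] == 1]
print(f"{len(unique)} of {len(records)} records are uniquely identifiable")
```

In this toy data set, two of the four “anonymous” records are pinned to exactly one person by their date of birth, ZIP code, and gender combination – the same risk Dr. Sweeney’s research demonstrated at scale.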

In short, personal data is much more than a person’s name or ID number. Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. The second part of that definition is crucial to libraries working with patron data.

Libraries, Patrons, and Personal Data

When working with patron data, libraries work with all three types of personal data. The following is just a tiny sample of the kinds of patron data that fall under each category of personal data:

Direct identifiers

  • Name
  • Physical and email addresses
  • Patron record and barcode numbers
  • User account login and password
  • Device information (operating system, browser, device identification number, and other information that makes up a digital fingerprint)

Indirect identifiers

  • Demographic information such as age, gender identity, and race/ethnicity
  • Declared major or minor (or grade level in K-12 schools)
  • Disability status
  • Patron type (e.g., resident or non-resident; student, faculty, or staff; specific patron statuses based on specific services, library card types, or market segments)
  • Geographical information (e.g., region, neighborhood, home branch)

Behavioral data

  • Borrowing history
  • Search history
  • Reference question logs
  • Library website analytics capturing website activity
  • Electronic content access logs

Some of you might already know why we put a pin in behavioral data earlier – search and question logs contain a plethora of direct and indirect identifiers. For example, a reference chat log for a typical day can contain direct identifiers, such as patron account login information and addresses, as well as subject matter that serves as an indirect identifier of the patron in question. IP addresses and device information from website analytics and system logs (such as proxy server logs) can also potentially identify a patron.

It’s almost impossible for patrons to use the library without the library collecting some form of personal data. The shift from print to electronic resources and services significantly increased the library’s ability to collect behavioral data that can identify patrons on its own. Even if a patron goes to the physical library just to pull a book from the shelf and read, the security cameras in the building might record the person’s face (direct identifier) and the book that they pulled from the stacks (behavioral data) at a specific branch location and time of day (more behavioral data with a dash of an indirect identifier). Leaving the definition of personal data at “data about a person” does not capture how the evolution of services, resources, and technology in libraries has changed the type and amount of data generated by patron use of the library.

Constant Changes in What Counts as Personal Data

It’s tempting to settle on the general definition of personal data with the three categories and call it a day. However, the rapid pace of change in legal regulations and technologies means that the nature of personal data will change. What might be considered non-personal data today (such as highly aggregated data based on the definitions of several data privacy regulations) might be personal data in the near future when someone discovers how to connect that data to an individual using newer technologies, algorithms, or improved re-identification methods. It also might be that more categories of personal data are waiting to be defined or refined. We weren’t kidding when we said that defining personal data is complicated!

Nevertheless, what we have today is a good working definition that we can use when talking about patron data privacy and security: Personal data is data about a person, be it a direct identifier or data that can reasonably be linked back to a person. While it’s easier to think about personal data when we limit ourselves to someone’s name or barcode number, we must remember that personal data takes on many, and sometimes deceptive, forms – particularly when it comes to the behavioral data generated by patron use of the library.