Shedding Light on Endangered Data

April 19, 2017

This is a cross-posting from an April 17, 2017 post on Heritage Bytes, the Open Context blog:

For the week of April 17-21, we’re joining a large community-wide effort to raise greater awareness of “endangered data”. In light of all of the other crises in the world, highlighting endangered data may seem silly. After all, given the daily news onslaught of increasing authoritarianism, kleptocracy, war, bigotry, poverty and environmental problems, the fate of abstract electronic databases seems low on the priority list.

However, we argue that safeguarding data represents a need to safeguard our civil liberties, civil society, future environment, and broader understanding of our world. This last point is key. Data are often integral to how we try to understand the world.

As authoritarianism takes hold, data become increasingly politicized and precarious. Authoritarians attempt to dictate what is and is not true. Truth must conform to the needs of vested interests or ideologies or it will be suppressed. The current administration’s campaign against climate science is a stunning assault on an “Inconvenient Truth” (so aptly named by Al Gore). Beyond climate science, researchers create data key to understanding social, historical, and governance issues. As with climate science, better understanding in these other domains can threaten powerful and entrenched interests, which is why authoritarians may seek to suppress or corrupt data documenting such topics.

Unfortunately, we don’t really understand the full scope and magnitude of what data may be under threat. We also don’t have a good understanding of what threats may be more immediate and where to prioritize our “data rescue” efforts. But here are some (incomplete) thoughts about what threatens data:

  • Outright Suppression: Some datasets may be suppressed and destroyed overtly. This is a digital equivalent of burning books or even whole libraries.
  • Lack of Funding: Creating, maintaining, curating and preserving data all require effort, often by dedicated professionals and institutions. Cutting off funding to these professionals and organizations will quickly endanger data.
  • Lack of Time: People need time to dedicate their attention to work on data. Badly structured rewards, incentive systems, and other bureaucratic pressures in academic research force many researchers to neglect data. Researchers need intellectual freedom to devote their time to data, where the rewards are still uneven and uncertain.
  • Lack of Access: Hiding data away from wider scrutiny makes them easier to delete, alter or corrupt. It also makes it easier to make spurious claims (and harder to refute them).
  • Collection Biases: Political and ideological agendas shape how we collect data and what data we collect. We’ve already seen Republican attempts to cease collecting data about housing discrimination, no doubt with a motivation to make the problem “disappear”.
  • Analytical Biases: Data need analysis to be interpreted and used. People apply different models and analytic methods that may (or may not) explicitly or implicitly bias understanding of data.
  • Filter Biases: The past several months have provided a hard education on the problem of “fake news” (propaganda) in the contemporary news media. Even if we manage to preserve some integrity in our data and analyses, we face the steep challenge of communicating our understandings in an overtly hostile and ideologically-charged media environment.

In arguing for the importance of data, we’re not suggesting that data are wholly objective or empirical. Data are never complete, perfect, or objective. As brilliantly discussed by Cathy O’Neil, data reflect our incomplete and often biased views of the world. Because data, like other forms of knowledge, are imperfect, they need to be a part of open conversations and debates in civil society. If we do a better job at making data more open to critique and evaluation from people with a wider variety of perspectives, we can improve both the data themselves and our understandings derived from them.

Over the past several months, we have taken part in “data rescue” events organized across the nation. There is a strong focus on climate data, but our participation involved endangered data from National Park Service websites. Working with Max Ogden and colleagues at the California Digital Library, we safeguarded more than a terabyte of data from a National Park Service database, as well as some 20,000 web pages, especially those that bring US national parks to underrepresented communities (African American, Asian American, Native American, LGBTQ).

As we move forward with Endangered Data Week, we will post more about the need to protect public data, the importance of public data for a healthy civil society, and some of our broader collaborations to make public data better protected and understood.
