Our last blog post about Open Context discussed how we were rebuilding many aspects of how we organized the data we publish. This post gets into some of the details.
For background, Open Context is a bit different from many research data repositories like the Archaeology Data Service or tDAR. Open Context’s core purpose is to publish (we’ll explain what we mean by “publish” below) archaeological data. We’re not primarily in the business of preserving those data. Data archiving is (obviously) a key requirement, but in the case of Open Context, we rely on other partner institutions to preserve data. Those partners include the California Digital Library, Zenodo, and the Internet Archive.
That’s not to say we play no role in preservation. We try to align our publishing work with the needs of data preservation and archiving. This means data cleanup, metadata documentation, use of shared standards, and use of open and non-proprietary file formats. Efforts in each of these areas make data preservation in repositories more meaningful.
What does “data publishing” mean?
We use the term “publishing” because it helps communicate some core needs in archaeological data sharing. “Raw data” are rarely easy to understand and use. Data typically need description, cleanup, and presentation for others to understand. It takes people dedicating their time, expertise, and effort, together with supporting software and databases, to communicate data meaningfully. The term publishing encapsulates these requirements.
As noted in our last blog post about Open Context’s new phase of technical development, we’re making some fundamental changes to the databases and software that support our data publishing.
Publishing Challenges
Most websites like Open Context use software and databases to manage and deliver content to your browser. This blog itself uses WordPress (open source content management software) and MySQL (an open source database) to share news and updates. Different kinds of software and database technologies will be suitable for different kinds of online applications.
Open Context requires specialized software and database organization to meet some of the particular needs of data publishing in archaeology. There are a number of concerns that make archaeology uniquely challenging:
- Archaeologists lack many common conventions in recording or organizing their data.
- Archaeologists work across many disciplines that may each have lots of diversity in approaches to recording and organizing data.
- Some archaeological data are highly structured (typically in tables of rows and columns), but other documentation is loosely structured (narrative field notes), and archaeologists use a wide variety of media types (images, video, GIS, 3D).
- Archaeological data creation takes place at different times, with different people, often with little coordination. This makes it difficult to merge related datasets together.
- Archaeologists juggle lots of concerns, and sometimes don’t make a priority of good data management practices.
- Archaeological data are not ethically neutral and need proper contextualization, security and sensitivity protections, and appropriate governance. While these needs may not seem overtly “technical”, they also shape software and data modeling requirements.
The diversity and complexity of archaeological data make data publishing a big challenge. Another aspect of the challenge is that archaeologists often lack the time, training, and resources to create data that they or anyone else can actually use! Currently, the Open Context editorial team invests a lot of time and effort in the cleanup and reorganization of datasets.
Dealing with these challenges requires learning, experience, and wider engagement. That’s why our organization is investing so heavily in “data literacy”. We need to gain much more experience and understanding about what’s required to actually use archaeological data. If we can share that understanding more broadly, then archaeologists can make data that’s easier to use. That understanding will also help us improve our approaches to publishing data. Our current work in “software refactoring” (cleanup, organization improvements, maintenance) helps set the stage for making such improvements.