(From XKCD, Creative Commons Attribution Non-Commercial License: https://xkcd.com/974/)
The above XKCD comic elegantly captures so much of what’s hard about working on the software infrastructure behind Open Context. Basically, the complexity, diversity, and messiness of archaeological data make software support hard. In other application areas, “passing the salt” can be really straightforward, since those domains may limit condiments to salt and pepper only. But in archaeology, our “data condiments” include all the world’s various spice racks and weirdly sized and shaped jars, squeeze-bottles, and single-serve packets that you can imagine.
In short, archaeologists have eclectic tastes! Open Context passes these various data condiments with a publishing workflow. To do so, it ingests data from many sources that all have different relationships, different descriptive attributes, and different controlled vocabularies. However, it’s very hard to build software that can handle the vast diversity of ways that archaeologists model and describe their data.
Databases in Archaeology
In order to make a searchable and browsable website that enables users to query and navigate through all this diversity, one needs a database organization (a “schema”) and associated software that’s organized with a great deal of abstraction and generalization.
Most people who work with data are familiar with tables as an organizing principle. Typically, a data table has rows of instances described by values in columns of attributes. For example, a table of bones may be described like this:
| Bone ID | Context ID | Taxon | Element | Comment |
| 1 | 1 | Sheep | Tibia | Broken in excavation? |
| 2 | 1 | Dog | Mandible | Has canine canines |
| 3 | 3 | Sheep | Humerus | Not really funny |
In this table, the column headings “Bone ID”, “Context ID”, “Taxon”, “Element”, and “Comment” all provide different kinds of descriptive attributes for the three bones in the table. Moreover, a “relational database” allows one to relate multiple data tables together. For example, another data table describing contexts can look something like this:
| Context ID | Type | Excavated By |
A relational database would make it possible to “join” information from the Contexts table with the Bones table across their shared “Context ID” field. This lets you know, for example, that Bone ID 1 came from a topsoil context (Context 1) excavated by Jane.
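As a minimal sketch of such a join, the following uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are illustrative choices, not Open Context's actual schema. The Context 1 row (topsoil, excavated by Jane) comes from the example above; the Context 3 row is a made-up placeholder so that every bone has a matching context.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contexts (
    context_id INTEGER PRIMARY KEY,
    type TEXT,
    excavated_by TEXT
);
CREATE TABLE bones (
    bone_id INTEGER PRIMARY KEY,
    context_id INTEGER,
    taxon TEXT,
    element TEXT,
    comment TEXT
);
""")

conn.executemany(
    "INSERT INTO contexts VALUES (?, ?, ?)",
    [
        (1, "Topsoil", "Jane"),   # from the example in the text
        (3, "Pit fill", "Sam"),   # hypothetical row, assumed for illustration
    ],
)
conn.executemany(
    "INSERT INTO bones VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Sheep", "Tibia", "Broken in excavation?"),
        (2, 1, "Dog", "Mandible", "Has canine canines"),
        (3, 3, "Sheep", "Humerus", "Not really funny"),
    ],
)

# Join the two tables on their shared Context ID field.
rows = conn.execute("""
    SELECT b.bone_id, b.taxon, c.type, c.excavated_by
    FROM bones AS b
    JOIN contexts AS c ON b.context_id = c.context_id
    ORDER BY b.bone_id
""").fetchall()

for row in rows:
    print(row)
```

The first row returned pairs Bone ID 1 with the topsoil context excavated by Jane, which is the lookup described above.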
Databases Get Complicated Quickly
Data tables like these seem very straightforward! In practice, though, things can get hard.
It turns out that organizing data in tables like these (either as single tables or multiple tables in a relational database) works fine if there’s a single and stable (rarely changing) idea about how data should be organized. If you want to change the attributes you’re recording in a table, you need to change the structure of the table by adding or removing columns. Similarly, if you want to record something that doesn’t fit into a “Bone” or a “Context” data table, you need to create a whole new table that describes something different, like “Coins” or “Glass Artifacts”. Typically, each new table will have its own set of attribute columns. Because archaeologists study a huge variety of “stuff”, an archaeological database can end up with dozens of different data tables, each designed to describe some type of finds, contexts, features, etc.
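One common way to generalize beyond one table per kind of find is to store each observation as a separate row of (item, item type, attribute, value), so that new attributes or new kinds of finds become new rows rather than new columns or tables. The sketch below illustrates that general idea only; it is not Open Context's or Arches' actual data model, and the coin record and the function names are hypothetical.

```python
from collections import defaultdict

# Each row is one statement about one item: (item ID, item type, attribute, value).
# The bone values come from the example table; the coin values are invented.
assertions = [
    ("bone-1", "Bone", "Taxon", "Sheep"),
    ("bone-1", "Bone", "Element", "Tibia"),
    ("bone-2", "Bone", "Taxon", "Dog"),
    ("bone-2", "Bone", "Element", "Mandible"),
    ("coin-1", "Coin", "Material", "Bronze"),     # hypothetical
    ("coin-1", "Coin", "Weight (g)", "3.2"),      # hypothetical
]


def describe(item_id):
    """Collect every attribute recorded for one item into a dictionary."""
    return {
        attribute: value
        for (iid, _item_type, attribute, value) in assertions
        if iid == item_id
    }


def attributes_by_type():
    """List which attributes are in use for each type of item."""
    result = defaultdict(set)
    for _iid, item_type, attribute, _value in assertions:
        result[item_type].add(attribute)
    return dict(result)


print(describe("bone-1"))
print(attributes_by_type())
```

Adding a "Coin" record here required no structural change, only more rows. The trade-off is that the software must do more work to reassemble a familiar table view, which is part of why this kind of abstraction is hard to build well.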
So what does this have to do with Open Context?
Open Context and Diverse Data
It turns out that archaeologists love organizing their data in tables in many, many, many different ways. In fact, it is very rare for two different archaeologists to share a common way of organizing their data. It’s pretty much a free-for-all. This diversity represents a huge and ongoing technical challenge for Open Context.
This recognition drives a great deal of our technology planning and strategies. Since it’s the end of 2022, we thought we’d share some plans for 2023 and beyond. This work includes:
- Major Open Context Revision: First, we are very close to completing a major revision of Open Context, now in testing at https://staging.opencontext.org. This major update includes several “backend” improvements that increase speed and scalability and enforce certain data quality expectations more strictly. The user interface also has more data visualization features and more streamlined search and data export features.
- Modularizing Open Context: But the main improvements are behind the scenes. We use smaller, clearer, and more streamlined code to run Open Context. These improvements make Open Context easier to maintain and adapt. More fundamentally, we’ve made Open Context’s internal structure more modular. Modularity gives flexibility by making it easier to change specific parts of Open Context without breaking other parts that we still want to keep.
- Incorporating Arches: Modularity will allow us to better benefit from the investments and talent of other projects. Starting in 2023, we will begin to replace the database application behind Open Context with the Arches Project. Arches uses data organization strategies very similar to Open Context’s to accommodate the diversity of cultural heritage data. However, Arches has attracted much more investment and has a much wider user community. It also benefits from more open source developer talent and expertise. It makes sense for us to use Arches rather than waste time and money on very similar kinds of parallel development for use only with Open Context.
Next Steps with Arches
We’re still trying to figure out exactly how to incorporate Arches as the core database application for Open Context. Our current thinking is to maintain Open Context’s revised user interface for viewing, browsing, searching, and visualizing records as separate components, with Arches mainly running “behind the scenes” as the data source.
While we have lots of problems to solve, we’re looking forward to teaming with the wider and very vibrant Arches community. So the next steps in Open Context’s technical evolution will involve much deeper and richer kinds of collaboration.