Open Context has evolved to increasingly emphasize a model of “data sharing as publication.” Our user experience study showed that professional researchers have concerns about data quality and branding, along with needs for standards alignment and rich documentation. Indeed, data sharing advocates agree that effective data sharing requires more than “dumps” of raw and undocumented data on the Web. Publication models can align professional and career interests with the research interests of the larger community.
Demonstrating the research benefits of data publishing
In September 2012, the AAI received a 2-year Digital Humanities Implementation grant from the National Endowment for the Humanities. The project Applying Linked Open Data: Refining a Model of Data Sharing as Publication aims to demonstrate how publication processes can help improve the discoverability, reuse, and longevity of primary scholarly materials. The AAI will collaborate with archaeologists working in the Mediterranean region to further develop workflows to publish archaeological datasets as Linked Open Data. Though demonstrated through the theme of ancient trade and exchange, the project’s tools and workflows apply to any field needing better data dissemination. A major project outcome is a generalized model for publishing well-documented and reusable scholarly data. The success of this model lies in its outward orientation. Rather than working toward monolithic centralization, this approach enables researchers to participate in a growing and widely distributed humanities information ecosystem. An all-day workshop involving all members of the project’s working groups took place during the Society for American Archaeology conference in Honolulu in April 2013.
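To make the Linked Open Data approach concrete, the sketch below expresses a single archaeological find as a JSON-LD record that points to a shared concept URI. This is an illustrative example only: the record identifiers, property choices, and the `example.org` vocabulary URI are invented for this sketch and do not reflect Open Context's actual schema (the `rdfs:label` and `dcterms:subject` terms are real W3C/Dublin Core vocabulary URIs).

```python
import json

# Hypothetical sketch of one archaeological find expressed as Linked Open
# Data (JSON-LD). The @id and the subject concept URI are invented for
# illustration; only the rdfs/dcterms vocabulary URIs are real.
record = {
    "@context": {
        "label": "http://www.w3.org/2000/01/rdf-schema#label",
        "subject": {"@id": "http://purl.org/dc/terms/subject", "@type": "@id"},
    },
    "@id": "http://example.org/finds/amphora-001",
    "label": "Transport amphora fragment",
    # Linking to a shared concept URI is what lets independently published
    # datasets about trade and exchange be discovered and queried together.
    "subject": "http://example.org/vocab/trade-and-exchange",
}

print(json.dumps(record, indent=2))
```

Because every dataset that annotates its records with the same concept URI becomes part of the same distributed graph, this style of markup supports the outward, decentralized orientation described above.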
Documenting datasets and improving data quality both require expert guidance and effort. Effective data dissemination must involve some formal mechanisms of professional publication rather than informal sharing of user-generated content. Editorial oversight, coupled with clear and trustworthy citation practices, can make data dissemination a recognized and professionally valued form of publication. Editorial boards can perform important signaling roles in academia by elevating the prestige of data sharing. To this end, in the spring of 2011, we established an Open Context Editorial Board composed of domain experts representing several specializations in archaeology. The Editorial Board assisted with the creation of Open Context’s Editorial Policies & Author Guidelines for data publication.
Data publishing workflows
Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. Similarly, appropriate workflows and technology can facilitate data publishing. Fortunately, open source tools for collaborative software development are highly refined and can be readily adapted for use with datasets. These tools support key requirements of robust version control and collaborative bug tracking. With support from the Alfred P. Sloan Foundation, we developed the Data Refine System for evaluating, annotating, and linking datasets. This system integrates two open source components: (1) Google Refine (for data editing and robust versioning); and (2) Mantis, a popular PHP-based issue-tracker. Rather than simply being “shared,” datasets processed through the Data Refine System are first reviewed, edited, and annotated for formalized publication and archiving. Contributing researchers and editors collaborate to identify, track, and resolve issues, clean data, and create documentation. This gives datasets more value as intelligible and high-quality scholarly outputs.
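The editorial workflow described above can be sketched in miniature: an editor flags an issue against a dataset, the contributor submits an edit that resolves it, each edit preserves the prior state for version history, and publication is gated on all issues being resolved. This is an illustrative simulation of the workflow's logic, not the Data Refine System's actual code or data model; the `Issue` and `Dataset` classes and their fields are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    """A tracked problem with a dataset, as in a bug tracker like Mantis."""
    description: str
    resolved: bool = False

@dataclass
class Dataset:
    rows: list
    versions: list = field(default_factory=list)  # prior states, oldest first
    issues: list = field(default_factory=list)

    def flag(self, description):
        """Editor logs an issue against the dataset."""
        self.issues.append(Issue(description))

    def edit(self, new_rows, resolves=None):
        """Contributor revises the data; prior state is kept, as in version control."""
        self.versions.append(list(self.rows))
        self.rows = new_rows
        if resolves is not None:
            self.issues[resolves].resolved = True

    def publishable(self):
        """Formal publication is gated on every issue being resolved."""
        return all(issue.resolved for issue in self.issues)

# One review cycle: flag inconsistent data, fix it, check publication status.
ds = Dataset(rows=[{"taxon": "Bos taurus"}, {"taxon": "bos Taurus"}])
ds.flag("Inconsistent capitalization in 'taxon' column")
ds.edit([{"taxon": "Bos taurus"}, {"taxon": "Bos taurus"}], resolves=0)
print(ds.publishable())
```

The point of the sketch is the coupling: issue tracking gives the collaboration between editor and contributor an auditable record, while versioning means no edit is destructive, which together make the reviewed dataset a more trustworthy publication than an informally shared file.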
Challenges in working with data
Datasets have several important qualities that differ from manuscripts. Datasets can often be quite large and full of complex interrelationships between various tables and multimedia files (images, videos, GIS, etc.). This is especially the case in archaeology, where projects can involve large teams, including specialists who create their own datasets. Archaeological documentation also relies heavily on multimedia and can generate tens of thousands of images and other media files (3D scans, GIS, remote sensing, etc.) that need association with other documentation. Our experience shows that it is common to see complex dependencies between various parts of an archaeological project. For example, the datasets of different specialists (e.g., zooarchaeology, pottery, or lithic analysis) at an excavation typically need to be related through reference to archaeological contexts. Integrating, cleaning, and adequately documenting such large and complex datasets requires a great deal of effort and experience with data.
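The cross-specialist dependencies described above can be illustrated with a small join: faunal and pottery records only become comparable once each is keyed to a shared excavation context. The context IDs, field names, and records below are invented for illustration and do not come from any real project.

```python
# Hypothetical specialist tables, each keyed to a shared excavation context.
contexts = {"C-101": {"phase": "Iron Age"}, "C-102": {"phase": "Bronze Age"}}
fauna = [{"context": "C-101", "taxon": "Ovis aries"}]
pottery = [{"context": "C-101", "ware": "cooking ware"},
           {"context": "C-102", "ware": "amphora"}]

def by_context(records):
    """Group specialist records under their context ID."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["context"], []).append(record)
    return grouped

# Integrate: each specialist dataset is related to the others only through
# the context record, so a broken or inconsistent context ID silently
# orphans data -- which is why editorial review of these links matters.
integrated = {
    cid: {"phase": cdata["phase"],
          "fauna": by_context(fauna).get(cid, []),
          "pottery": by_context(pottery).get(cid, [])}
    for cid, cdata in contexts.items()
}

print(integrated["C-101"]["phase"])  # Iron Age
```

Note that context C-102 ends up with pottery but no fauna; spotting and documenting such gaps (real absence versus missing data) is part of the cleaning and documentation effort the text describes.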
While datasets are often larger and more complex than manuscripts, the most important difference lies in the uses of data versus text. The primary purpose of text is to be read by a human reader. In contrast, data usually needs heavy machine-processing before it can be meaningfully rendered to people. To be used, data needs software mediation, and as such, data become an aspect of software. The complex interdependencies between various parts of a large dataset and the software needed to use and interpret data require different quality control processes than seen in manuscript publishing.
Published in the International Journal of Digital Curation (April 2014)
This paper won the “best paper” prize at the International Digital Curation Conference in San Francisco (February 2014).
Published in PLOS ONE (June 2014)
This paper presents the collaborative research results from 23 co-authors who analyzed primary archaeozoological data for over 200,000 faunal specimens excavated from seventeen sites in Turkey.