Editor’s note: This month, Elsevier Connect is exploring “how science can build a sustainable future.” In this article, CLOCKSS Executive Director Craig Van Dyck and LOCKSS and Web Archiving Program Manager Nicholas Taylor write about the challenges of preserving digital scholarly content for generations to come. Elsevier works with the CLOCKSS Archive to store all book and journal content on ScienceDirect.
Recent experience shows us that the Internet is far from foolproof. Yet scholarly communications rely on the Internet as a cornerstone of their infrastructure.
From manuscript submission and peer review systems, to content production, publishing platforms, discovery tools, subscription and customer information, archiving solutions, email, blogs, Twitter, product information, conference announcements and registration – the Internet is the basis for the scholarly publishing value chain.
In just the past few months, we have seen several large-scale Internet mishaps, for example:
- The WannaCry ransomware attack in May 2017
- The British Airways IT failure, also in May
- The outage of the Amazon Web Services (AWS) S3 storage service in March
- The use of disinformation, data thefts, leaks, and social media "trolls" to influence elections
- The removal of climate change information from US government websites
- Large-scale content piracy
In response, organizations have invested in stronger security infrastructure to defend against “bad actors,” added greater system redundancy, and in some cases taken legal action. But even these steps cannot guarantee against the loss of vital information or the failure of key systems.
These are just conspicuous recent examples of how precarious digital information is. How can we ensure that crucial digital information, such as scholarly publications, survives for the long term, given anticipated risks such as system failure, organizational failure, financial unsustainability, bad actors, and hardware and software obsolescence?
A growing international community of practice focused on digital preservation has come into its own in the last two decades. National libraries, research universities, and libraries and archives of all shapes and sizes steward digitized and, increasingly, born-digital content of varied types and origin. Their efforts are often not limited to collections they own or have received as donations, or their own records; there is interest also in the preservation of common cultural heritage. When a library invests in the digitization of important physical artefacts, the library needs assurance that the digital files are safely preserved for the long term.
A prominent example of this is the web-wide crawls of the Internet Archive. Since 1996, the Internet Archive has collected broad swaths of the public web and made them accessible. Hundreds of institutions use subscription web archiving services from the Internet Archive to curate their own topical collections, contributing to a communal effort to preserve the web.
These and most other web archiving efforts are largely focused on public, freely-available web resources. The techniques for collecting and preserving web-accessible digital information are also applicable to other use cases. One such use case is the preservation of scholarly publications, now for some time disseminated through the web.
How vulnerable is the scholarly literature?
Readers find it very frustrating when they cannot locate a web resource that they have accessed previously. When a publisher makes articles and books available online, readers expect that the “scholarly record” will be reliably maintained and will continue to be available.
However, there are cases where online scholarly content disappears. If a publisher goes out of business, or decides to cease hosting a journal or book, or has a sustained catastrophic failure of their platform, then end-users may not be able to find the content that they have previously accessed. This is where a preservation service will step in to ensure that the content remains available online.
What is the CLOCKSS Archive?
Strong infrastructure exists for the long-term preservation of scholarly content. For example, the CLOCKSS Archive is a leading provider of long-term preservation services for scholarly publishers and librarians. Funded by a mix of publisher fees and library contributions, the system holds the scholarly record (“the minutes of science”) in trust on behalf of the entire community.
CLOCKSS (short for Controlled LOCKSS) uses the award-winning open source software LOCKSS (Lots of Copies Keep Stuff Safe). CLOCKSS is a “dark archive” (end-users do not access the content except in exceptional cases) that uses the unique polling mechanism of the LOCKSS software. LOCKSS solves one of the problems of long-term preservation: in a dark archive, how can you be sure that the data is still valid?
CLOCKSS maintains 12 copies of all of its millions of journal articles and books, and the 12 nodes continuously cross-check their content against one another. If one node's content differs from what the other nodes report, then that node is “out-polled” and its variant content is overwritten with correct content from another node.
The LOCKSS software carries out this cross-checking slowly over the Internet, and the content is safely preserved on all 12 of the globally distributed CLOCKSS server nodes, which are located at top-tier libraries.
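The “out-polling” idea described above can be illustrated with a short sketch. This is a deliberately simplified majority vote over content fingerprints, not the actual LOCKSS polling protocol; the function names, node names and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of one node's copy (hash choice is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare every node's copy and overwrite out-polled copies in place.

    Hypothetical helper: returns the names of the nodes that were repaired.
    """
    votes = Counter(digest(content) for content in copies.values())
    winning_digest, winning_votes = votes.most_common(1)[0]

    # Require a strict majority; without one there is no clear consensus.
    if winning_votes * 2 <= len(copies):
        raise ValueError("no majority agreement among nodes")

    # Take any copy that matches the majority fingerprint as the repair source.
    good_copy = next(c for c in copies.values() if digest(c) == winning_digest)

    repaired = []
    for node, content in list(copies.items()):
        if digest(content) != winning_digest:
            copies[node] = good_copy  # overwrite the variant content
            repaired.append(node)
    return repaired


# Twelve hypothetical nodes, one of which holds a damaged copy.
nodes = {f"node-{i}": b"article text" for i in range(1, 13)}
nodes["node-7"] = b"article t3xt"
repaired_nodes = poll_and_repair(nodes)
```

In this toy run, the one node whose fingerprint disagrees with the other eleven is the only one overwritten. A real system also has to cope with unreachable nodes, partial agreement and deliberate attacks, none of which this sketch attempts to model.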
The primary purpose of the long-term preservation of digital content is to ensure that end-users will continue to have access to the valuable resources that they rely upon, even if that content disappears from the web. In that case, CLOCKSS will “trigger” the content for access to end-users.
CLOCKSS has signed agreements with its participating publishers, which give CLOCKSS permission to make the triggered content available. CLOCKSS has over 20,000 journal titles and over 6,000 books in its archive. In its 12 years of operation, CLOCKSS has triggered access to 32 journals.
In addition, CLOCKSS is a preservation partner for CHORUS, which ensures that publicly funded articles that are openly available are accessible to the public. CLOCKSS will provide public access to the archived publications should a publisher fail to provide public access.
There are several preservation services that focus on scholarly publications. The Keepers Registry tracks the holdings of 13 preservation systems. A few of these services aim to cover all of scholarly publishing (or at least all journals), while others have more specific focuses.
It is a strength of the current long-term preservation environment that there are multiple players with diverse technology solutions, organizational structures and business models.
In general, it seems that the scholarly literature is well cared for. It is a sign of the health of the community that there is a diverse set of solutions the community itself has implemented. It is encouraging that the scholarly community is “taking care of its business.”
Elsevier's digital archive
To preserve scholarly research, we work in partnership with various organizations as well as maintaining our own digital archive.