IIPC General Assembly and Web Archiving Conference
Event Details:
Speaker(s):
Thib Guicherd-Callin
Nicholas Taylor
Thib Guicherd-Callin and Nicholas Taylor will participate in the 2018 IIPC General Assembly and Web Archiving Conference. We look forward to connecting with any LOCKSS partners who may also be attending!
Thib will present "Sifting needles out of (well-formed) haystacks: using LOCKSS plugins for web archive metadata extraction" as part of the session "Leveraging Archived Content".
Here is the abstract:
As the volume of web archives has grown and web archiving has matured from a supplementary to an increasingly essential mechanism for collection development, there has been growing attention to the challenge of curating that content at scale. National libraries engaged in national domain-scale harvesting are envisioning workflows to identify and meaningfully process the online successors to the offline documents they historically curated. There is elsewhere increasing interest in the application of artificial intelligence to making sense of digital collections, including archived web materials. Where automated or semi-automated technologies are not yet adequate, crowd-sourcing also remains a strategy for scaling curation of granular objects within web archives.
The LOCKSS Program has developed significant expertise and tooling for identifying and parsing metadata for web-accessible objects in the domain of scholarly works: electronic journals and books, both subscription and open access, as well as government information. This is enabled by means of a highly flexible plugin architecture in the LOCKSS software, which augments the traditional crawl configuration options of an archival harvester like Heritrix with additional functionality focused on content discovery, content filtering, logical normalization, and metadata extraction. A LOCKSS plugin specified for a given publishing platform or content management system encodes the rules for how a harvester can parse its content, allowing for extraction of bibliographic metadata from, e.g., HTML, PDF, RIS, XML, and other formats.
While the LOCKSS software has always fundamentally been a web preservation system, it has largely evolved in parallel with the tools and approaches of the larger web archiving community. However, a major re-architecture effort is currently underway that will bring the two into much closer alignment. The LOCKSS software is incorporating many of the technologies used by the web archiving community as it is re-implemented as a set of modular web services. The increased participation of LOCKSS in the broader community should bolster sustainability, but a more promising possibility is for cross-pollination of technical capabilities, with metadata extraction a component of probable interest to many web archiving initiatives.
This presentation will detail the capabilities of the LOCKSS plugin architecture, with examples of how it has been applied for LOCKSS use cases, how it will work as a standalone web service, and discussion with the audience of where and how such capabilities might be applied for broader web archiving use cases.
Nicholas will participate in the panel "Collaborative, selective, contemporary: lessons and outcomes from new web archiving forays focused on China and Japan".
Here is the abstract:
A reasonable assessment of web archiving efforts focused on China and Japan suggests that the level of collecting is not commensurate with the prominence of Chinese and Japanese web content broadly. Mandarin Chinese is the second-most common language of world internet users; Japanese is the seventh. The distribution of the languages of websites is dominated by English, with other languages in the long tail, but Mandarin Chinese and Japanese are both in the top nine languages, representing 2% and 5.1% of websites respectively.
Meanwhile, the number of web archiving efforts focused on China and Japan is comparatively modest. The community-maintained list of web archiving initiatives highlights only three (out of 85) efforts focused on China or Japan. A search for “china” or “chinese” on the Archive-It portal yielded 56 collections (out of 4,846); a search for “japan” or “japanese” yielded 43 – 1% or less of Archive-It collections for both.
Recognizing the opportunity for more selective archiving of Chinese and Japanese web content, the Stanford East Asia Library has over the last several years led efforts to curate two major new collections, documenting Chinese civil society and contemporary Japanese affairs, respectively. This panel will discuss the particular motivations and impact of these collecting efforts, as well as address the following questions of more general interest to web archive curation practice:
- How can collaboration with researchers inform web content collecting efforts?
- What role do content creators themselves play in facilitating web content collecting efforts?
- How can coordination with and consideration of other institutions’ web content collecting efforts inform local collecting?
- What challenges — in terms of communications, funding, metadata, policy, quality assurance, staffing, workflow — are entailed in undertaking a new web archiving initiative and how can they be addressed?
- How is web content collecting continuous or discontinuous with the kinds of collecting that libraries have traditionally engaged in?
Apart from these questions of curatorial concern, this panel will also detail technical aspects of the two projects, including quality assurance observations and how Stanford Libraries has managed the collections through a hybrid infrastructure consisting of Archive-It, the Stanford Digital Repository, a local OpenWayback instance, and Blacklight-based discovery and exhibits platforms.