Scaling LOCKSS
Scale is a predictable preoccupation as memory institutions collectively contemplate the growing amount and size of candidate content for preservation. To ensure that limited resources have maximum impact, we want our digital preservation systems to scale well in terms of capacity and marginal cost; that is, to maintain their effectiveness efficiently as capacity increases. Folks considering LOCKSS wonder how a system conspicuously touting liberal replication as a core feature could possibly be scalable.
The LOCKSS system is demonstrably scalable in a couple of key ways:
- It scales well horizontally; at least one LOCKSS network has a massive number of participants. Each node adds to the resilience of the overall network and therefore to the quality of digital preservation available to all participants.
- It scales well vertically; the system remains performant for a 100+ terabyte network with terabyte-scale Archival Units.
Where there is greater doubt about the scalability of LOCKSS is in the number of copies it keeps.
In defense of "lots of copies": LOCKSS insists on more copies because they are (much) more effective. You could run LOCKSS with three copies (in keeping with ostensible best practice), but if one of those copies became unavailable even temporarily, you would have no way to arbitrate between the two remaining copies if their fixity disagreed, unless your checksum store were perfectly secure and trustworthy (which it almost certainly couldn't be). Three copies is surely less expensive, and therefore more scalable, than four copies, but it's a lot less fit for the purpose of digital preservation. It also matters that the lots of copies be as independent as possible and hosted within diverse organizational infrastructures, which aligns with digital preservation being most effective as a community commitment rather than a solely institutional one.
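To make the arbitration point concrete, here is a minimal Python sketch of majority-based fixity checking among replicas. It is an illustration only, not the LOCKSS polling protocol; the node names and contents are invented.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Checksum used to compare replicas."""
    return hashlib.sha256(content).hexdigest()


def arbitrate(replicas: dict[str, bytes]) -> str | None:
    """Return the majority checksum among available replicas, or None if there is no majority."""
    counts = Counter(digest(content) for content in replicas.values())
    checksum, votes = counts.most_common(1)[0]
    return checksum if votes > len(replicas) / 2 else None


good, damaged = b"preserved content", b"preserved c0ntent"

# Three copies, one damaged, all available: the majority identifies the intact content.
print(arbitrate({"node-a": good, "node-b": good, "node-c": damaged}) == digest(good))  # True

# One copy temporarily unavailable and the two survivors disagree:
# no majority, so neither copy can be preferred over the other.
print(arbitrate({"node-a": good, "node-c": damaged}))  # None
```

With only two available and disagreeing copies, the tie can only be broken by an external checksum record, which is exactly the dependency the argument above calls into question.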
There is a tendency to suppose that storage costs, by virtue of their being more easily measured, represent the largest category in the lifetime costs of digital preservation. A summary of relevant research suggests that storage typically accounts for less than half of the total cost. At low storage volumes, the proportional cost of storage may be even lower — note the ratio of storage to non-storage costs for the Digital Preservation Network (PDF), for example. While use cases have since expanded and the web has become more dynamic, the LOCKSS software was built to make ingest of the content that was its original animating use case — web-based scholarly publications — as easy as possible, recognizing that ingest was typically a more significant contributor to the lifetime costs of digital preservation than storage.
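A rough back-of-the-envelope calculation shows why this matters for replication counts; the percentages below are assumptions chosen for illustration, not figures from the research cited above.

```python
# Back-of-the-envelope only: the numbers are assumptions, not measured costs.
storage_share = 0.40                 # assume storage is 40% of lifetime cost
copies_before, copies_after = 4, 3

storage_saving = 1 - copies_after / copies_before   # 25% of the storage bill
total_saving = storage_share * storage_saving       # only ~10% of lifetime cost

print(f"Cutting from {copies_before} to {copies_after} copies saves about "
      f"{total_saving:.0%} of lifetime cost under these assumptions.")
```

In other words, the savings from reducing the number of copies are bounded by storage's share of the total cost, which is smaller than intuition suggests.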
Meanwhile, infrastructures that we would instinctively characterize as "scaling well" — the big commercial cloud service providers — are a questionable fit for digital preservation. They are more vulnerable to single points of failure (e.g., privileged insider attacks, more strongly-incentivized external attacks, more tightly-coupled systems, less diverse system components); their assurances of data integrity are opaque; their service models disfavor data access, shifting digital preservation from "persistent access" to "contingent access"; and the continuity and perpetuity of their hosting of data depend on flawlessly uninterrupted ongoing payment.
With those caveats noted, we acknowledge that slowing Kryder rates and larger data volumes present unavoidable trade-offs between preserving less content at a higher replication factor and preserving more content at a lower replication factor. The obvious question for us is: how can we provide LOCKSS-based digital preservation without lots of copies of the content?
We have some ideas about how we could approach this. On the foundation of the re-architected LOCKSS software, we're starting by prototyping what we're calling the LOCKSS fixity service (PPTX) — a highly-replicated LOCKSS network that preserves checksums and can therefore provide high-confidence fixity assertions for large volumes of data stored at lower replication factors. The service has been designed and is in the early stages of development. We look forward to sharing more details as we progress.
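To give a flavor of the idea, here is a conceptual Python sketch; the names and interface are purely illustrative and not the actual service design. Content is held at a lower replication factor, while its checksums are preserved in a highly-replicated network that can answer fixity queries.

```python
import hashlib


class ReplicatedChecksumStore:
    """Stand-in for a highly-replicated network preserving checksums (illustrative API)."""

    def __init__(self) -> None:
        self._checksums: dict[str, str] = {}

    def record(self, object_id: str, content: bytes) -> None:
        """Preserve the checksum of an object whose content lives elsewhere."""
        self._checksums[object_id] = hashlib.sha256(content).hexdigest()

    def verify(self, object_id: str, content: bytes) -> bool:
        """Fixity assertion: does this copy still match the preserved checksum?"""
        return self._checksums.get(object_id) == hashlib.sha256(content).hexdigest()


store = ReplicatedChecksumStore()
store.record("au:example-journal/2024", b"large archival unit payload")

# A repository holding the content at a low replication factor can later ask
# whether its copy still matches the highly-replicated checksum record.
print(store.verify("au:example-journal/2024", b"large archival unit payload"))  # True
print(store.verify("au:example-journal/2024", b"bit-rotted payload"))           # False
```

The point of the sketch is that confidence in the fixity assertion derives from the high replication of the checksums, not of the content itself.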