July 2000
Preserving journals
Maintaining integrity
Model of distrust
David S. H. Rosenthal of Sun Microsystems presented work conducted with Vicky Reich of Stanford University Libraries, entitled "Permanent Web Publishing." The talk was presented at the USENIX Annual Technical Conference in San Diego on 22 June 2000, and the paper appears in the conference proceedings.
Rosenthal began by noting that research librarians have a duty to make journal articles available to interested parties, but they also have a duty to ensure that the information in these articles cannot be suppressed. Their method for satisfying these objectives is to scatter lots of copies in many locations, so that it is easy to find some copies (to consult) but hard to find all copies (to destroy). They also provide copies of materials to other libraries that need them.
As scholarly journals move toward electronic publishing, it remains important that the articles be difficult to suppress. The authors propose to achieve this by caching lots of copies at many sites on the Internet, never removing articles from these caches, comparing the copies at different sites to detect and repair damage, and making it hard to find all of the copies.
In the authors' proposal, a large number of archives keep copies of various journal articles. When a new issue of a journal comes out, each library's archive site retrieves the new material from the publisher's web site and stores it. (The library is a subscriber to the journal, so the access is authorized.) At random intervals, an archive may conduct a poll of other sites to verify the integrity of its collection. It does this by multicasting to other archives an invitation to participate, whose components include a challenge and a duration for the poll.
Archives that receive this invitation must decide whether to participate. Participation requires computing a hash of the contents and is therefore expensive, so an archive will participate in some polls and decline others. If the archive chooses to participate, it will, at some point before the duration is up, choose a random verifier, prepend the verifier and the challenge to the collection, and compute a hash. It then multicasts the challenge, verifier, remaining duration, and hash to all other parties. The challenge and the verifier prevent replays: a hash computed for an earlier poll cannot be reused, so every participating archive must recompute it. Each participant tallies the votes by recomputing the hash with the verifier from each received vote. (This makes participation even more expensive.)
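The voting and tallying steps can be sketched in a few lines of Python. This is only an illustration of the idea as described in the talk, not the LOCKSS code: the function names, the choice of SHA-256, the 16-byte random values, and the order in which challenge and verifier are prepended are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import os


def vote_hash(challenge: bytes, verifier: bytes, content: bytes) -> bytes:
    """Hash the challenge and verifier prepended to the collection content."""
    h = hashlib.sha256()  # assumption: the talk summary names no hash function
    h.update(challenge)   # assumption: challenge first, then verifier
    h.update(verifier)
    h.update(content)
    return h.digest()


def cast_vote(challenge: bytes, content: bytes) -> tuple[bytes, bytes]:
    """Choose a fresh random verifier and return the vote (verifier, hash)."""
    verifier = os.urandom(16)  # assumption: 16 random bytes
    return verifier, vote_hash(challenge, verifier, content)


def tally(challenge: bytes, my_content: bytes,
          votes: list[tuple[bytes, bytes]]) -> tuple[int, int]:
    """Recompute every vote's hash over the local copy; return (agree, disagree)."""
    agree = sum(1 for verifier, digest in votes
                if vote_hash(challenge, verifier, my_content) == digest)
    return agree, len(votes) - agree


# Two archives hold an intact copy; a third holds a damaged one.
challenge = os.urandom(16)
good = b"article text"
votes = [cast_vote(challenge, good),
         cast_vote(challenge, good),
         cast_vote(challenge, b"article t3xt")]
print(tally(challenge, good, votes))  # (2, 1)
```

Because each vote carries its own verifier, the tallying archive has to run the hash over its whole copy once per vote, which is where the extra expense comes from.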
Naturally, there is a mechanism for conducting polls on various subsets of the collection, to locate the cause of disagreeing poll results.
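The talk summary does not say how the subsets are chosen. One plausible strategy, shown here purely as an assumption, is to halve the disagreeing subset repeatedly until single damaged items remain. The `poll_agrees` callback is hypothetical: it stands in for running a real poll on a subset and reporting whether the local copy won.

```python
from __future__ import annotations

import hashlib
from typing import Callable, Sequence


def locate_damage(items: Sequence[str],
                  poll_agrees: Callable[[Sequence[str]], bool]) -> list[str]:
    """Return the items whose local copies lose the poll, polling ever-smaller subsets."""
    if not items or poll_agrees(items):
        return []            # this subset wins its poll: nothing to repair
    if len(items) == 1:
        return list(items)   # a single item that loses its poll is damaged
    mid = len(items) // 2
    return (locate_damage(items[:mid], poll_agrees)
            + locate_damage(items[mid:], poll_agrees))


def digest(store: dict[str, bytes], keys: Sequence[str]) -> bytes:
    """Hash the named items of a store (a stand-in for a real poll)."""
    h = hashlib.sha256()
    for key in keys:
        h.update(key.encode())
        h.update(store[key])
    return h.digest()


# Simulated consensus copy and a local copy with one damaged article.
consensus = {"a1": b"alpha", "a2": b"beta", "a3": b"gamma", "a4": b"delta"}
local = dict(consensus, a3=b"gamm4")
damaged = locate_damage(
    sorted(local),
    lambda keys: digest(local, keys) == digest(consensus, keys))
print(damaged)  # ['a3']
```

With one damaged item among n, this needs on the order of log n polls rather than one poll per article, at the price of each poll taking a long time.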
Rosenthal stressed that in this approach, there is no central authority, no absolute right or wrong, but only consensus. Each archive site acts purely to maintain the integrity of its own collection. The process contains a lot of randomness and is slow, which are unusual design goals for software, but in this case create an environment in which somebody who wishes to alter the collection will probably be discovered before much damage can occur.
Rosenthal contrasted this design with the usual security model of having secrets, certificates, and credentials. He believes that over the timescales of interest (a century or so), secrets cannot be maintained. In an effort to provoke the crowd, he described the different approaches as "authoritarian vs. libertarian systems."
During the question session, one member of the audience expressed concern that there is no absolute right or wrong in this system, and that there is some possibility that an incorrect version of an article could propagate throughout the system. I think this is an invalid criticism. In Rosenthal's system, propagating incorrect data requires that identical, incorrect data be present in a large number of archives. This will not happen by random corruption or human error, because those will produce different problems at different archives. It would instead indicate a concerted attack on the system. If there were instead a single authoritative archive, it would be much more vulnerable to such an attack than the distributed consensus-based system.
The implementation that the authors have developed is called LOCKSS, "Lots Of Copies Keep Stuff Safe," and various documents relating to it are available from http://lockss.stanford.edu/.