Wednesday, 15 November 2017

Thoughts on Fixity Checking in Digital Preservation Systems

I would like to question the rationale for doing periodic fixity checking in isolation. This has bugged me for a while, so I am going to unload.

As far as I can see, the main reasons for doing it are to catch undetected corruption on storage and tampering that bypasses the chain of custody.

All storage media now have built-in error detection and correction using Reed-Solomon, Hamming or similar codes, which are generally capable of dealing with small multi-bit errors. In modern environments this gives unrecoverable read error rates of, at worst, around 1 in 10^14 bits read, and generally several orders of magnitude better – roughly one error per 12 TB read in the worst case. Write errors are less frequent – they do occur, but they can be detected by device firmware and retried elsewhere on the medium. These are absolute worst-case figures, and they result in *detectable* failure long before we even get to computing fixity. The chance of bit flips occurring in a pattern that defeats the error correction coding is several orders of magnitude smaller still – comparable to the chance of bit flips leaving an MD5 hash unchanged. Interestingly, in most cases the mere act of reading data allows devices to detect and correct errors as the storage medium becomes marginal, so there is value in doing that.
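
To make the arithmetic behind that 12 TB figure explicit, here is a quick back-of-the-envelope check in Python, assuming the worst-case rate of one unrecoverable error per 10^14 bits read quoted above:

```python
# Back-of-the-envelope check of the worst-case figure quoted above:
# one unrecoverable error per 10^14 bits read.
bits_per_error = 1e14

bytes_per_error = bits_per_error / 8           # 1.25e13 bytes
terabytes_per_error = bytes_per_error / 1e12   # ~12.5 TB

print(f"Roughly one unrecoverable read error per {terabytes_per_error:.1f} TB read")
```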

Consequently, undetected corruption is most likely when data moves out of the error-corrected environment of the medium into less robust environments. At the interconnect level, protocols such as SCSI, SATA, Ethernet and FC are all error corrected, as is the PCI-E bus itself. The most likely failure points are a curator’s PC or software. How many curators work on true workstation-grade systems with error-corrected RAM and error-corrected CPU caches? How well tested are your hashing implementations (MD5 had a bug not so long ago)? How about all the scripts that tie everything together? How about every tool in your preservation toolchain? How many of these fail properly when an unrecoverable media error is encountered?
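
To illustrate that last question, here is a minimal sketch of a checksum routine that fails loudly on a read error rather than swallowing it – a Python example using the standard hashlib module; the function name and the self-test are my own, not taken from any particular preservation toolchain:

```python
import hashlib

def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks, letting any I/O error propagate to the caller.

    The important part is what this does NOT do: it does not wrap the read
    loop in a bare try/except that logs and carries on, which would quietly
    turn an unrecoverable media error into a digest of partial data.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # an OSError here should abort the job
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

# Sanity-check the hash implementation against a published test vector
# before trusting it with real objects.
assert hashlib.sha256(b"abc").hexdigest() == (
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
)
```

The hashing itself is the easy part; the failure behaviour is what tends to go untested in home-grown scripts.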

If we consider malicious activity then, again, we have to ask whether it is easier to attack the storage (which may require targeting several geographically dispersed and reasonably secure targets) or the curation workflow, which is localised, generally in a less secure location than a machine room, and can legitimise changes. A robust digital signature environment is the way to deal with this – and fixity hashes *can* be used to make this more efficient (sign the hash rather than the whole object).  
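
To make the "sign the hash rather than the whole object" point concrete, here is a minimal sketch using Python's hashlib together with the third-party cryptography package and Ed25519 keys. The file name and function names are placeholders, and in a real deployment the private key would live somewhere far better protected than the script:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def _digest(path):
    """SHA-256 digest of the object (read in one go for brevity)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()

def sign_object(path, private_key):
    """Sign the digest of the object rather than the object itself."""
    digest = _digest(path)
    return digest, private_key.sign(digest)

def verify_object(path, digest, signature, public_key):
    """Recompute the digest, then check the signature over it.

    verify() raises InvalidSignature if the signed digest was tampered with;
    the explicit comparison catches changes to the object itself.
    """
    if _digest(path) != digest:
        raise ValueError("object no longer matches its signed digest")
    public_key.verify(signature, digest)

# Hypothetical usage with a throwaway key and a placeholder file name.
private_key = Ed25519PrivateKey.generate()
digest, signature = sign_object("object.tif", private_key)
verify_object("object.tif", digest, signature, private_key.public_key())
```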

Locally computed hashes can be very useful as a bandwidth-efficient way of comparing multiple copies of an object to ensure that they are in sync (rsync has done this for ages).
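
A sketch of that idea, assuming each storage site can compute a digest over its own copy and report back just the hex string – the site names, file name and digest values below are hypothetical placeholders:

```python
import hashlib

def local_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of the copy held at this site, computed locally."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicas_in_sync(local_path, remote_digests):
    """Compare a few dozen bytes of digest per site instead of shipping
    whole objects across the network to compare them byte by byte."""
    expected = local_digest(local_path)
    return {site: digest == expected for site, digest in remote_digests.items()}

# Hypothetical usage: each remote site computes and reports its own digest.
remote_digests = {
    "site-b": "placeholder-digest-from-site-b",
    "site-c": "placeholder-digest-from-site-c",
}
print(replicas_in_sync("object.tif", remote_digests))
```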

So there are reasons to compute hashes, when needed, but fixity is not necessarily a compelling reason given the way modern systems are engineered.

In practice, these checks do detect failures, but the failures are almost exclusively transmission errors resulting from uncontrolled (and unauditable) activities – often by sysadmins or third-party suppliers not well versed in digital preservation. In these cases the wrong data is actually written to storage, so there is no fixity to lose. Periodic "fixity" checking can catch these cases, but ideally you want visibility of these processes and a check immediately after they complete. If the errors are in an automated process, waiting for the next periodic check to come round may allow significant damage to accumulate.
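
A minimal sketch of that "check immediately after completion" pattern, assuming a plain local file copy stands in for whatever transfer mechanism is actually involved:

```python
import hashlib
import shutil

def sha256_of(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(source, destination):
    """Copy an object and check it straight away, rather than leaving a bad
    transfer to be discovered by the next periodic fixity sweep."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise RuntimeError(f"transfer of {source} corrupted: {expected} != {actual}")
    return expected  # worth recording alongside the object for later comparison
```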

Originally posted to the PASIG mailing list, with updates as a result of discussion with Kyle Rimkus (University of Illinois at Urbana-Champaign).

...also now posted on the DPC blog.
