Thursday, 16 November 2017
Retconning this Blog
I've got a load of bits and pieces lying around the internet that I've posted over the years. I'm going to collect them all here with their original dates. I would like to claim it's for preservation but, frankly, it's just that I can't keep track of the stuff.
Wednesday, 15 November 2017
Thoughts on Fixity Checking in Digital Preservation Systems
I would like to query the rationale for actually doing periodic fixity checking in isolation. This has bugged me for a bit so I am going to unload.
As far as I can see, the main reasons would be undetected corruption on storage and tampering that doesn’t hijack the chain of custody.
All storage media now have built-in error detection and correction using Reed-Solomon, Hamming or similar codes, which are generally capable of dealing with small multi-bit errors. In modern environments, this gives unrecoverable read error rates of at worst around 1 in 10^14 bits read, and generally several orders of magnitude better – that worst case works out at roughly one error per 12TB read in total. Write errors are less frequent – they do occur but can be detected by device firmware and retried elsewhere on the medium. These are absolute worst case figures and they result in *detectable* failure long before we even get to computing fixity. The chance of bit flips occurring in such a pattern as to defeat the error correction coding is several orders of magnitude smaller still – comparable to bit flips leaving an MD5 hash unchanged. Interestingly, in most cases the mere act of reading data allows devices to detect and correct errors before they become unrecoverable as the storage medium becomes marginal, so there is value in reading regularly.
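To put those figures in perspective, here is a small back-of-the-envelope sketch (Python, purely illustrative) of the probability of hitting at least one unrecoverable – but detectable and reported – read error for a given volume of data at the worst-case rate quoted above.
```python
import math

BER = 1e-14            # worst-case unrecoverable read errors per bit read
BITS_PER_TB = 8e12     # 10^12 bytes * 8 bits

for tb_read in (1, 12.5, 100):
    bits = tb_read * BITS_PER_TB
    # P(at least one error) = 1 - (1 - BER)^bits, computed in a numerically safe way
    p_error = -math.expm1(bits * math.log1p(-BER))
    print(f"{tb_read:>6} TB read -> P(>=1 unrecoverable read error) = {p_error:.3f}")
```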
Consequently, undetected corruption is most likely when data moves from the error corrected environment of the medium into less robust environments. At the interconnect level, protocols such as SCSI, SATA, Ethernet and FC are all error corrected, as is the PCI-E bus itself. The most likely failure points are therefore a curator’s PC or software. How many curators work on true workstation grade systems with error corrected RAM and error corrected CPU caches? How well tested are your hashing implementations (MD5 had a bug not so long ago)? How about all the scripts that tie everything together? How about every tool in your preservation toolchain? How many of these fail properly when an unrecoverable media error is encountered?
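On that last point, a hashing routine should fail loudly rather than quietly return a hash of whatever partial data it managed to read. A minimal sketch in Python (the file path and the choice of SHA-256 are just for illustration):
```python
import hashlib
import sys

def fixity(path, algorithm="sha256", chunk=1024 * 1024):
    """Hash a file in chunks, letting I/O errors propagate instead of
    silently producing a hash of partially read data."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)   # raises OSError on an unrecoverable media error
            if not block:
                break
            h.update(block)
    return h.hexdigest()

if __name__ == "__main__":
    try:
        print(fixity(sys.argv[1]))
    except OSError as exc:
        print(f"READ FAILED, no fixity value computed: {exc}", file=sys.stderr)
        sys.exit(1)
```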
If we consider malicious activity then, again, we have to ask whether it is easier to attack the storage (which may require targeting several geographically dispersed and reasonably secure targets) or the curation workflow, which is localised, generally in a less secure location than a machine room, and can legitimise changes. A robust digital signature environment is the way to deal with this – and fixity hashes *can* be used to make this more efficient (sign the hash rather than the whole object).
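As a sketch of "sign the hash rather than the whole object", the fragment below uses the third-party Python cryptography package (an assumption on my part, not part of any particular preservation stack) to sign a precomputed SHA-256 digest with an Ed25519 key:
```python
# pip install cryptography   (third-party package, used here purely for illustration)
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()

# In practice the digest would be the fixity hash already held for the object.
fixity_hash = hashlib.sha256(b"...object bytes read from storage...").digest()

signature = signing_key.sign(fixity_hash)   # sign the 32-byte digest, not the whole object
try:
    signing_key.public_key().verify(signature, fixity_hash)
    print("hash is the one that was signed")
except InvalidSignature:
    print("hash (and hence object) no longer matches what was signed")
```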
Locally computed hashes can be very useful as a bandwidth efficient way of comparing multiple copies of an object (rsync has done this for ages) to ensure that they are in sync.
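A minimal sketch of that idea: each site computes a manifest of hashes for its copy locally, and only the small manifests are exchanged and compared (the directory paths below are hypothetical).
```python
import hashlib
from pathlib import Path

def manifest(root, chunk=1024 * 1024):
    """Relative path -> SHA-256 hex digest for every file under root."""
    result = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            result[str(path.relative_to(root))] = h.hexdigest()
    return result

# In reality each manifest is built at its own site and only these dictionaries
# cross the network, not the objects themselves.
local = manifest("/copies/site-a")
remote = manifest("/copies/site-b")

differing = {p for p in local.keys() & remote.keys() if local[p] != remote[p]}
missing = local.keys() ^ remote.keys()
print("content differs:", sorted(differing))
print("present in only one copy:", sorted(missing))
```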
So there are reasons to compute hashes when they are needed, but fixity checking by itself is not necessarily a compelling one, given the way modern systems are engineered.
In practice, these checks do detect failures, but they are almost exclusively transmission errors resulting from uncontrolled (and unauditable) activities - often by sysadmins or third party suppliers not well versed in digital preservation. In these cases, the wrong data is actually written to storage, so there is no fixity to lose. Periodic "fixity" checking can catch these cases, but ideally you want visibility of these processes and a check immediately after they complete. If the errors are in automated processes, waiting for a periodic check to come round may allow significant damage to occur.
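For example, a transfer step can verify the destination immediately after it finishes rather than leaving it to the next periodic sweep. A minimal sketch (the paths are hypothetical; in a real workflow the expected hash would normally come from the supplier or the ingest record):
```python
import hashlib
import shutil

def sha256_of(path, chunk=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def copy_and_verify(src, dst):
    """Copy src to dst, then immediately re-read dst and confirm it matches."""
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        raise IOError(f"transfer corrupted {dst}: {actual} != {expected}")
    return expected   # record alongside the object as provenance

copy_and_verify("/ingest/object.tiff", "/archive/objects/object.tiff")
```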
Originally posted to the PASIG mailing list, with updates as a result of discussion with Kyle Rimkus (University of Illinois at Urbana-Champaign).
...also now posted on the DPC blog.
Monday, 6 November 2017
ORCID Token Revocation
At the last Cultivating ORCIDs Meeting in Birmingham in June 2017, I ran a working group looking at different approaches to implementing ORCID IDs. One of the outcomes was the identification of a common issue when it came to ORCID implementations and third party suppliers, namely, that institutional users needed to explicitly grant access to third party suppliers in addition to their own institution. This behaviour has a number of undesirable side effects:
- Communicating this to users can be difficult since they are not always aware of these third parties
- Getting consistent take-up across multiple systems can be difficult (users lose interest), which makes downstream integration more awkward than necessary
- The institution has little visibility of these third party interactions – which can cause problems when suppliers are dropped or other issues arise
- The only way currently round this is to let a supplier use the institutional key – which then grants them *ALL* the rights and access that the institution has
At the meeting, Will Simpson of ORCID presented a very useful non-technical overview of how authentication and ORCID/OAuth tokens work in terms of managing access permissions. OAuth is the technology/standard that ORCID uses for authorisation/access control. Discussion then moved on to the main topic of how ORCID permissions might be delegated to third party providers and, in particular, how to handle the termination of third party arrangements. During these discussions, Will indicated that support for the optional OAuth functionality for token revocation was being considered by ORCID. At the moment, tokens are granted by default for 20 years, or for 1 hour for effectively single, short-term use. Naturally, neither of these matches the typical duration of a scholar’s relationship with an institution. Minimising the number of active tokens would be good from both a security and “data hygiene” standpoint, so the ability for an institution to relinquish its token when a scholar leaves would be useful in its own right. Scholars can revoke their tokens manually when they leave, but it is unrealistic to rely on them to remember to do so.
At the moment, it is possible to work around this situation by making creative use of the OAuth token refresh facility. This functionality is important since it is what will allow an institution to grant tokens to a third party on behalf of an individual researcher (which will be explored in the next posting), but, in this context, it also provides a slightly unorthodox method for effectively relinquishing a token. Intended for use when an existing token nears expiry, a replacement token may be requested with a new expiry date, which then invalidates the previous token. However, this can *actually* be done at any time, and a 20-year token *can* be replaced by a 1-hour token which can simply be allowed to expire, resulting in no active tokens.
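For concreteness, here is a rough sketch of what such a refresh request might look like, using Python's requests library against ORCID's OAuth token endpoint. The client credentials and tokens are placeholders, and the exact parameter ORCID accepts for setting the replacement token's lifetime (shown here as expires_in) should be checked against the current ORCID API documentation.
```python
import requests

# Placeholders throughout - substitute real institutional credentials and the
# refresh token issued alongside the original 20-year access token.
response = requests.post(
    "https://orcid.org/oauth/token",
    headers={"Accept": "application/json"},
    data={
        "client_id": "APP-XXXXXXXXXXXXXXXX",
        "client_secret": "<client-secret>",
        "grant_type": "refresh_token",
        "refresh_token": "<refresh-token>",
        "expires_in": 3600,  # request a 1-hour replacement; verify this parameter against ORCID's docs
    },
)
response.raise_for_status()
print(response.json())  # the short-lived replacement token is then simply allowed to expire
```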
This is a concatenation of two articles posted on the UK ORCID Consortium blog.
Monday, 25 September 2017
Data2paper Poster for 10th RDA Plenary
Data2paper is a cloud-based application to automate the process of compiling and submitting a data paper to a journal without the researcher having to leave the research space or wrestle directly with the journal’s submission system.
Serving the wider research community, data2paper works with academic institutions, publishers, data repositories, funding agencies and organizations interested in research and scholarly communication. Data2paper aims to advance data papers, enabling data re-use and giving researchers credit for their data.
The development of the original idea was funded as part of the Jisc Data Spring initiative.
SWORDV3 Poster for 10th RDA Plenary
SWORD (Simple Web-service Offering Repository Deposit) is a lightweight protocol for depositing and updating content from one location to another. The SWORD vision is ‘lowering the barriers to deposit’, principally for depositing content into repositories, but potentially for depositing into any system which wants to receive content from remote sources.
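As a flavour of what depositing content from one location to another looks like on the wire, here is a minimal sketch of a SWORDv2-style binary deposit in Python. The collection URL and credentials are hypothetical, and SWORDV3 may well change the details, so treat the headers as illustrative of the current (v2) profile rather than a definitive recipe.
```python
import requests

# Hypothetical repository collection endpoint and credentials.
COLLECTION = "https://repository.example.org/sword2/collection/research-data"

with open("deposit.zip", "rb") as package:
    response = requests.post(
        COLLECTION,
        auth=("depositor", "password"),
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=deposit.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "true",   # the deposit can be augmented or completed later
        },
        data=package,
    )

response.raise_for_status()
print(response.status_code, response.headers.get("Location"))  # URI of the created resource
```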
The SWORDV3 project has been funded by Jisc with two key aims for the next generation of the SWORD Protocol:
- To bring SWORD up to date with developments in the repository sphere over the last 5 years, with alignment to new protocols and new use-cases such as data publishing and complex objects.
- To establish community and governance mechanisms for the standard and supporting code libraries to ensure ongoing maintenance and evolution. This will include a technical validation process to allow third party libraries to be hosted under the SWORD brand.
The project team:
- Dom Fripp – Jisc (Funder) – Project Manager, SWORD user through the Jisc Research Data Shared Service
- Richard Jones – Cottage Labs Ltd – Technical Lead, SWORD contributor and implementer
- Neil Jefferies – Bodleian Libraries, University of Oxford – Community Lead, SWORD user through the Data2paper project (and internally within the Bodleian Library)
Friday, 1 September 2017
GPD Pocket
I've had a clearout - the Omnibooks have gone (except the 300), the Lifebook U810 has gone, the Zaurus SL-C1000 has gone. In their place I have a GPD Pocket which I helped crowdfund on Indiegogo...not that they needed it, they exceeded their target by something like 1500%.
I ordered the Windows version since I wanted something that pretty much worked out of the box, as I expected the Ubuntu port would be a bit rough. This turned out to be the case since GPD hadn't really done Linux before and Intel's docs for the Cherry Trail platform are woeful. However, the community has now stepped up and there are several flavours which perform quite acceptably. I'll probably get round to trying some in due course.
It's a lovely little device which you can read about elsewhere online (seriously, just Google it) but I can say I haven't really had any of the problems that others have reported. It's one of the first run of the devices and I haven't even done the "standard" mods like flashing the unlocked BIOS or replacing the thermal compound on the heatsink. Keyboard is not a problem - it's huge compared to the U810 or the Zaurus - and I never liked trackpads anyway so the trackpoint is just fine.
However, on to the main topic of this posting. Since the Pocket is running Windows 10, it is definitely going to need some work before I consider it usable. It's interesting how much better the Linux "out-of-the-box experience" generally is (when all the drivers work)...which is a bit of a turnaround. Then again, the following process is all about making Windows as Linux-like as possible!
- Wait an age while Windows updates and reboots, updates and reboots while attempting to index the contents of the SSD. So turn off the indexing service to speed things up...and probably never turn it on again since it never seems to be any use to me.
- Install NTLite and prune out the cruft - like all the apps, the App Store, Cortana etc. that take up space and, on occasion, CPU and thus battery power. After that, the Windows 10 tile menus start to look a bit empty...
- Install Classic Shell to get some nice sensible menus. I favour Classic style menus.
- Install Spybot Anti-Beacon to shut off most of Microsoft's snooping. Default settings seem to work best - delve too deeply into the optional settings and you can break updates which is probably worse from a security point of view than the meagre trickle of data that remains.
- Install WinXCorners to make the half-baked multiple desktop implementation in Windows actually usable. I set it so the mouse in the top right corner shows all desktops for task switching - which is how I have it in Linux.
- Install ThrottleStop to better manage the CPU power draw - I have the following config:
- CPU Voltage ID reduced to 0.6650 V. I'm not sure that the number displayed is right but it does drop power consumption noticeably. Any lower and I get occasional blue screens.
- Turbo Power Limits (TPL Button) configured with:
- Package power long: 3W (this is the long term average power limit)
- Package power short: 5W (this allows short bursts above the average up to the turbo time limit)
- Turbo time limit: 10s
- Low power profile that disables turbo (limits the CPU to 1.6GHz)
- Default power profile that allows full turbo to 2.56GHz
- I ran some tests with Geekbench 3 at the various power levels. At 5W, all the cores will turbo up to 2.56GHz, at 4W only two cores can spin up, and at 3W only one core will boost, and even then only to around 2.4GHz. At the 2W notional "scenario design power" it appears that only one core can even reach 1.6GHz, which is basically unusable, so Intel have been rather creative with the figures.
- Also, note that this does not factor in GPU power consumption since I don't really game much on the GPD. If you do then you'll need to add a watt or two to the package limits.
- With these settings I will get typically 8-10 hours of use during the day - web browsing, document editing, emailing and Skyping - which suits me fine.
- Install the rest of the Open Source stack that I use
- NextCloud - I always sync everything (phone/laptop etc.) to my own servers. Nothing that I can't afford to lose goes into anyone's cloud!
- LibreOffice of course
- I do a fair bit of management of images from my camera (that's another post!) so RawTherapee and Digikam, although I am now looking at DarkTable since it now has a Windows version.
- Firefox and Thunderbird - Firefox Quantum is really looking rather good!
- Password Gorilla
- And a couple of proprietary but free goodies
- Microsoft Research's Image Composite Editor which is a really very good panorama stitcher that runs a treat on the Pocket. Whatever you think of MS as a whole, the guys at MS Research do some really good stuff.
- Foxit Reader - the least annoying PDF viewer that I have encountered
Monday, 19 June 2017
Rethinking Digital Preservation (for Research Libraries)
Presentation given at the Digital Preservation for the Arts, Social Sciences and Humanities conference. I will expand on a lot of the slides in future articles, where I will discuss the difference between the preservation of digital knowledge as opposed to digital objects. Knowledge is dynamic, evolving and context dependent, and so requires a rather different approach that is less focussed on static fixity and more process oriented.