Saturday, 7 July 2012

Representing knowledge – metadata, data and linked data

Reposted Op-Ed from Wikipedia Signpost

This piece examines a key question that new Wikimedia projects such as Wikidata are concerned with: how to properly represent knowledge digitally at the most basic level. There is a real danger that an inflexible, prescriptive approach to data will severely limit the scope, capabilities and ultimate utility of the resulting service.

At one level, the textual representation of information and knowledge in books and online can be viewed as simply another serialisation and packaging format, optimised for human rather than machine consumption. Within the Wikimedia community – Wikidata and elsewhere – there is a perceived utility in using more structured, machine-friendly formats to enable better information sharing and computer-assisted analysis and research. However, there remains a lot of debate about the best approach, to which I will contribute the views I have developed over nearly a decade of research and development projects at the Bodleian Library[1] and, before that, through my involvement with knowledge management in the commercial domain.

My first point is that metadata and data are really different aspects of a continuum. In the majority of cases, data acquires much of its meaning only in connection with its context, which is largely contained within so-called metadata. This is especially true for numerical data streams, but holds even for data in the form of text and images: when and where a text was written are often critical elements in understanding the meaning.[2] Data and metadata should be considered not as distinct entities but as complementary facets of a greater whole.

Secondly, there will be no single unifying metadata "standard" (or even a few such standards), so deal with it! For example, biosharing.org lists just under 200 metadata standards for experimental biosciences alone. The notion of a single standard that led to the development of MARC, and latterly RDA, in the library sphere is simply not applicable to the way in which metadata is now used within the field of academic enquiry. This means that any solution to handling digital objects must have a mechanism for handling a multiplicity of standards, ideally even within an individual object – for example, bibliographic, rights and preservation metadata may quite reasonably be encoded using different standards.[3] The corollary of this is that if we have such a mechanism there is no need to abandon existing standards prematurely. This avoidance of over-prescribing and premature decision-making will be familiar to Agile developers. Consequently, Wikidata developers would be ill-advised to aim for a rigid, unitary metadata model – even at a basic level, representing knowledge is too complex and variable for such an approach.
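To make this concrete, here is a rough Python sketch of what "multiple standards within a single object" might look like; the structure, field names and identifiers are purely illustrative assumptions on my part, not a proposal for any particular system.

```python
# Purely illustrative: one digital object carrying metadata blocks in several
# different standards, each kept in its own, separately-schema'd section.
digital_object = {
    "id": "example:object/0001",            # hypothetical persistent identifier
    "content": "images/folio-17r.tiff",     # the payload itself
    "metadata": {
        "dcterms":      {"title": "Letter, author unknown", "language": "en"},
        "rights":       {"standard": "ODRL", "licence": "CC-BY"},
        "preservation": {"standard": "PREMIS", "fixity": "sha256:..."},
    },
}

def metadata_block(obj, standard):
    """Return the block for a given standard, if the object carries one."""
    return obj["metadata"].get(standard)
```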

So how do we balance this proliferation of standards with the desire for sharing and interoperability? We can find several key areas in which a consensus view is emerging, not through explicit standard-setting activities but through experience and necessity. This gives us a good indication that these are sensible points on which to base longer-term interoperability.
  1. An emergent data/object model. Besides the bibliographic entities, such as digitised texts, images and data, a number of key types of "context-object" recur when we start to try to build more complex systems for handling digital information. This can be seen in such diverse areas as the specifications for TEI, Freebase, CERIF and schema.org. The most important of these elements are people, places, vocabularies/ontologies and the notion of time dependency. Indeed, for many projects in the humanities, these objects actually form the basis for expressing ideas and framing discourse, with the conventional bibliographic objects providing an evidentiary base.
  2. Aggregations as a key organising tool for this expanded universe of digital objects. In many cases, these aggregations are also objects in their own right, representing content collections, organisations, geopolitical entities and even projects – each potentially with a history and other attributes. An essential characteristic of aggregations is that they need not be hierarchical, but rather a graph capable of capturing the more unstructured, web-like way people have of organising themselves and their knowledge.[4]
  3. Agreement on essential common properties. For each object type there is usually a general consensus on a minimal set of properties that are sufficient to both uniquely identify an object and provide enough information to a human reader that the object is the one that they are interested in. Often, the latter is actually a less strict requirement as a person can use circumstantial evidence such as the context in which an object occurs for disambiguation. While it is desirable to try to capture contextual information systematically, we have to accept that this is frequently not done. Sources for this common baseline could include Dublin Core (or dcterms to be explicit) records, DataCite records, gazetteers, and name authority lists, for example.
These common properties are obviously very amenable to storage and manipulation in a relational database. Indeed, for large-scale data ingestion followed by clean-up, de-duplication and merging of records/objects, this is likely to be the best tool for the job. However, once this task has been completed and we delve into the more varied elements of the objects, the advantages of a purely relational approach are less clear-cut.
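As a sketch of that ingestion stage, the common baseline maps naturally onto a single table; the column set below is my own loose reading of a Dublin Core-ish minimum, not a fixed schema.

```python
import sqlite3

# Illustrative only: a minimal "common properties" table plus a query that
# surfaces candidate duplicates arriving from different sources.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        id      TEXT PRIMARY KEY,   -- persistent identifier
        type    TEXT,               -- person, place, text, dataset, ...
        title   TEXT,
        creator TEXT,
        date    TEXT,
        source  TEXT                -- which feed or catalogue it came from
    )
""")

candidate_dupes = conn.execute("""
    SELECT title, creator, COUNT(*) AS copies
    FROM objects
    GROUP BY title, creator
    HAVING COUNT(*) > 1
""").fetchall()
```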

Instead, we can treat each object as an independent, web-addressable entity – which in practice is desirable in its own right as a mode of publication and dissemination. In particular, we can use search engines to index across heterogeneous fields – Apache Solr excels at faceting and grouping, while ElasticSearch can index arbitrarily structured documents without a predefined schema (i.e. all of the varied domain-specific metadata). These tools give users ways into the material that are much easier to use and more intuitive.
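For example, a faceted query against a Solr index might look something like the sketch below; the host, core name and field names are assumptions for illustration only.

```python
import requests

# Ask Solr for facet counts across a few heterogeneous fields rather than
# returning documents (rows=0). Core and field names are hypothetical.
resp = requests.get(
    "http://localhost:8983/solr/objects/select",
    params={
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": ["creator", "place", "object_type"],
        "wt": "json",
    },
)
facet_counts = resp.json()["facet_counts"]["facet_fields"]
```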

The objects alone are only a part of the picture – the relationships between objects are critical to the structure of the overall collection. In fact, in many cases (especially in the humanities) a significant proportion of research activity actually involves discovering, analysing and documenting such relationships. The Semantic Web or, more precisely, the ideas behind the Resource Description Framework (RDF) and linked data, provide a mechanism for expressing these relationships in a way that is structured, through the use of defined vocabularies, but also flexible and extensible, through the ability to use multiple vocabularies. While theoretically it is possible to express all metadata in RDF, this is not practical for performance[5] and usability[6] reasons, and is unnecessary.
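A small rdflib sketch shows the flavour of this: two vocabularies (dcterms and FOAF) describing one relationship within a single graph. The URIs are invented examples, not real identifiers.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF

EX = Namespace("http://example.org/")   # illustrative namespace only
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("foaf", FOAF)

letter = EX["letter/42"]
author = EX["person/locke"]

g.add((letter, DCTERMS.title, Literal("A Letter Concerning Toleration")))
g.add((letter, DCTERMS.creator, author))            # bibliographic vocabulary
g.add((author, FOAF.name, Literal("John Locke")))   # a second vocabulary

print(g.serialize(format="turtle"))
```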

This model of linked data, combining a mix of standardised fields and less-structured textual content, should not be entirely unfamiliar to people used to working with Semantic MediaWiki, sharing their metadata on Wikidata, or using data boxes in Wikipedia! However, when applying this model to practical research projects it emerges that a critical element is still lacking. Although we can describe relationships between objects using RDF, we are limited to making assertions of the form [subject][predicate/relationship][object] (the RDF "triple"). In practice, relatively few statements of this form can be considered universally and absolutely true. For example: a person may live at a particular address, but only for a certain period of time; the copyright on a book may last for 50 years, but only in a particular country. Essentially, what is needed is a mechanism to define the circumstances under which a relationship can be considered valid. A number of possible mechanisms could do this: replacing RDF triples with "quads" that include a context object, or annotating relationships using the Open Annotation Collaboration (OAC) model.
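One way to picture the quad approach is to place each assertion in its own named graph and attach the validity conditions to that graph. A hedged rdflib sketch, with vocabulary terms and URIs invented purely for illustration:

```python
from rdflib import Dataset, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
ds = Dataset()

# The assertion lives in a named graph - the "context object" of the quad.
ctx = EX["context/residence-1"]
g = ds.graph(ctx)
g.add((EX["person/smith"], EX["livesAt"], EX["place/23-high-street"]))

# Statements about the context itself: when the assertion holds.
ds.add((ctx, EX["validFrom"], Literal("1720-01-01", datatype=XSD.date)))
ds.add((ctx, EX["validTo"], Literal("1735-12-31", datatype=XSD.date)))

print(ds.serialize(format="trig"))
```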

These examples are really just special cases of a more general requirement that is of great interest to scholars. This is the ability to qualify a relationship or assertion to capture an element of provenance. Specifically, we need to know who made an assertion, when, on the basis of what evidence, and under which circumstances it holds. This may be manifested in several ways:
  • Differences of scholarly opinion – it should be possible for there to be contradictory assertions in the data relating to an object, provided we can supply the evidence for each point of view.
  • Quality of the evidence – information can be incomplete, or just unclear if we are dealing with digitised materials. In this case we want to capture the assumptions under which an assertion is made.
  • Proximity of evidence – we may have an undated document but if we know the biography of the author we can place some limits on probable dates. This evidence is not intrinsic to the object but can be derived from its context.
  • Omissions – collections are usually incomplete for various reasons. It is important to distinguish the absence of material as a result of inactivity or specific omission from subsequent failures in collection building.
These qualifications become especially important when we try to use computational tools such as analytics and visualisation. Indeed, projects such as Mapping the Republic of Letters (Stanford University) are expending significant effort to find ways of representing uncertainty and omission in visualisations.
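Continuing the named-graph sketch above, contradictory datings of the same letter can coexist, each carrying its own attribution and evidence. PROV-O is one plausible vocabulary for the attribution; the remaining terms and URIs are invented for illustration.

```python
from rdflib import Dataset, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")
ds = Dataset()

# Two scholars date the same undated letter differently; each view sits in
# its own named graph so both assertions can be held side by side.
view_a = ds.graph(EX["assertion/1"])
view_a.add((EX["letter/42"], EX["writtenIn"], Literal("1687", datatype=XSD.gYear)))

view_b = ds.graph(EX["assertion/2"])
view_b.add((EX["letter/42"], EX["writtenIn"], Literal("1689", datatype=XSD.gYear)))

# Provenance attached to each assertion: who made it, and on what evidence.
ds.add((EX["assertion/1"], PROV.wasAttributedTo, EX["person/scholar-a"]))
ds.add((EX["assertion/1"], EX["basedOn"], Literal("watermark analysis")))
ds.add((EX["assertion/2"], PROV.wasAttributedTo, EX["person/scholar-b"]))
ds.add((EX["assertion/2"], EX["basedOn"], Literal("an apparent reference to events of 1689")))
```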

I believe there needs to be a subtle change in mindset when creating reference resources for scholarly purposes (and, arguably, more generally). Rather than always aiming for objective statements of truth, we need to recognise that a large amount of knowledge is derived via inference from a limited and imperfect evidence base, especially in the humanities. Thus we should aim to accurately represent the state of knowledge about a topic, including omissions, uncertainty and differences of opinion.


Notes

  1. In particular, Cultures of Knowledge.
  2. Usefully, most books come with a reasonable amount of metadata (author, publisher, date, version etc.) encapsulated in the format, but this represents something of an anomaly. Before the advent of the book and, more recently, in online materials, metadata tends to be scarcer.
  3. However, I concede that it is not unreasonable to expect that things are generally encoded in XML with a defined schema.
  4. Our own experience of trying to model the organisational structure of the University of Oxford (notionally hierarchical) convinced us that this was essential.
  5. RDF databases (triple stores) currently scale to the order of billions of triples – this limit can be reached quite easily when you consider that a MARC record for a single library book may contain well over 100 fields.
  6. RDF is a very verbose format. Existing domain-specific XML formats can be much easier to read and manipulate.

Tuesday, 3 July 2012

VMWare ESX Failover/HA

When you take a VMWare host down into maintenance mode to drop in a few more CPUs and some RAM and then bring it back up, it appears that vMotion and Failover drop off the network ports they were enabled on. Annoying, since if you're not careful you end up with a cluster of standalone hosts manifestly not doing cluster-y things.

VMWare Tools

When VMWare Tools just doesn't seem to be installing no matter how many times you run vmware-config-tools.pl, remember the --clobber-kernel-modules switch. Sometimes it decides it won't overwrite the old modules for no apparent reason.

Wednesday, 27 June 2012

Shoehorning Duallies Part 1A - BOM

As people have asked - here is the Bill of Materials for the build in my previous post:

Case: CoolerMaster Elite 360
Motherboard: Asus K8N-DL
Processors: 2x AMD Opteron OSA275FAA6CB
Coolers: AMD Stock quad-heatpipe
RAM: 6x Micron 2GB PC3200 DDR ECC Reg
PSU: Meridien Xclio GreatPower 600W
Fans: 2x Akasa 120mm Blue (Top outlet and side inlet),
Fans: 3x 80mm Arctic Cooling TC (bottom inlet and rear outlet)
GPU: Asus Passive Radeon HD 4350
SATA backplane: StarTech SATSASBAY425 (Actually it's an older model)
Fan bus: 5-way ModMyToys 4-Pin Distribution PCB from KustomPC's
Card-Reader: Generic Acorp 3.5-bay reader

Sunday, 24 June 2012

Shoehorning Duallies Part 1 - The Elite 360

I like compact systems, which is somewhat incompatible with my penchant for multiprocessor systems. Recently, I have got hold of a couple of CoolerMaster Elite 360 cases, which are about as small ((W) 148 x (H) 360 x (D) 439 mm) as you can get and still fit a regular ATX PSU and motherboard. The K8N-DL here does, however, take liberties with the ATX specification (being slightly L-shaped and 10.5 inches rather than 9.6 inches deep). Fortunately, the Elite 360 drive supports are placed to exactly fit this "L"!
While you can technically fit two 5.25-inch devices into this case, it makes things a *lot* neater if you only use one and route/stash cables in the other. Here I have a nice little 4x2.5-inch SATA backplane for SSDs/HDDs and a USB card reader that gives me another front port too. The top 3.5-inch hard drive bay is also used as a cable stash, and just behind it there is a 5-way ModMyToys 4-Pin Distribution PCB from KustomPC's which runs the top, rear and bottom fans.

Since I now use my VE200 for most installs, the lack of an optical drive isn't really a problem - most of my stuff is downloaded anyway.

Also note that the PSU has to go in last as it sits over the motherboard (and you have to remove the graphics card, the bottom CPU cooler and any tall NB coolers to get it in!). A modular PSU is a must!

So what goes into the HDD backplane? For booting, rather than use SSDs, I use a couple of CF/SATA adapters.
I have a couple of the Addonics ones in the link and a couple of cheap Chinese ones from eBay and, frankly, they are identical. I have some Transcend 4GB 266x and Lexar 8GB 200x cards which are as cheap as chips compared to SSDs - rather slower too, but Linux still boots quite snappily with no seeking to worry about.

For main storage I have a mix but the Samsung M7 laptop drives are working well for me in RAID1 pairs.

P.S. I'm aware I've mixed units. Being a Physicist by training, I work in SI but IT is often still standardised in (US) Imperial - it makes sense to use native units. Who uses a 133.4mm drive?

Friday, 15 June 2012

VMWare Workstation 8 and OpenSUSE 12.1 revisited

I'm not entirely happy with the configuration above - although everything installs and basically works, there are enough little bits of misbehaviour to leave me dissatisfied.
  • The fglrx drivers for the Radeon cards in my machines seem to generate odd screen corruptions, for example when a menu pops up. These are overwritten fairly quickly by the proper content, but there is a visual discontinuity.
  • Workstation 8 is not supported on OpenSUSE 12.1, nor does it officially support 12.1 as a guest, although both work more or less. This is irritating since 12.1 has been out for a while.
  • Automatic input capture in VMs utterly failed for me until Workstation 8.0.4 (released yesterday as I write!), so I had to use Ctrl-G to capture input manually.
  • VMs seem to be able to bog down the entire machine while saturating only one core (and it's not an I/O bottleneck). This is not something I have encountered previously. It is really noticeable when installing a VM: the host pretty much grinds to a halt and screen updates become glacial as files are uncompressed and copied.
So, basically, I am going back to OpenSUSE 11.4 since that does seem to work properly.

Thursday, 14 June 2012

Adventures with openSUSE 12.1 and a K8N-DL

Decided to upgrade one of my K8N-DL based Opteron boxes to OpenSUSE 12.1 from 11.3. This machine boots from a CF card, then mounts a pair of HDDs in RAID 1 on /srv and hosts a collection of VMs. The old config used VirtualBox and fitted everything on a 4GB CF card, but I wanted to try out VMWare Workstation 8 on this one, which is a fair bit bigger (since it now effectively packages an instance of VMWare Server and a vSphere client), so I needed to up this to 8GB.

All well and good - I slotted in the new card and fired up the OpenSUSE CD for installation, telling it to format the CF and then mount the existing RAID on /srv. One little oddity, I thought, was that it labelled the array as md127 rather than md1. So, the installation completes and I reboot...to be confronted by the emergency login in text mode. Not good, so I take a look at dmesg and, sure enough, it can't mount /dev/md127 because it doesn't exist. A look in /proc/partitions tells me that I now have a device called md1, so I edit /etc/fstab to mount this and reboot - all appears well.

A little down the line, after some patching/updating, I am installing VMWare when I notice that only one drive light is on. Strange. So I run up mdadm, which informs me that I now have two arrays, md1 and md127, with one drive in each. Urgh! I delete md127 and add the now-freed drive back to md1 and, after a bit of resync, all seems sane.

However, VMware Workstation was refusing to start, filling the logs with Hostd: error: N7Vmacore15SystemExceptionE messages and streams of hex that looked not entirely unlike a kernel panic...except it wasn't taking the system down. A bit of googling told me I needed to enable IPv6 to fix this. It works fine now, but WTF?

Wednesday, 14 March 2012

Too Many Files on the Zalman VE-200

It appears that the VE-200 has a limit of 32 files per NTFS directory when browsing ISO images - otherwise "Too Many Files" flashes up on the little LCD display. Not a major problem, since you're allowed subdirectories, but just something to bear in mind. Probably best to keep filenames and directory paths shortish too, as a general principle.
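If you want to check a drive before plugging it in, something like this quick Python sketch will flag any offending directory; the 32-file figure comes from the behaviour above, and the root path is just an example - point it at wherever your images live.

```python
import os

ISO_ROOT = r"E:\_iso"   # example path: wherever the ISO images live on the VE-200 partition
LIMIT = 32              # files per directory before "Too Many Files" appears

for path, dirs, files in os.walk(ISO_ROOT):
    if len(files) > LIMIT:
        print(f"{path}: {len(files)} files - consider splitting into subdirectories")
```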

Windows XP and WebDAV

I am playing with ownCloud on a laptop (the U810 to be precise) as a nice way of having cloud-y storage when I'm at home that I can pick up and take along when I go travelling to non-networked locations. The primary way of getting stuff onto it is WebDAV, and although I am mostly Linux-based, there are XP boxes (and the odd Mac) in my life too.

However, WebDAV on Windows seems to have a troubled existence - there is much material online about its brokenness at various levels. Microsoft seems to have had several goes at a client, each with its own bugs and idiosyncrasies. Herewith is what worked for me with the XP SP3 built-in client (the WebClient service).

  1. Create a Network Place, but ensure that you specify port 80 explicitly - otherwise the place creation fails because, I think, Windows tries to mount it as a CIFS share. It's not like the "http:" is a clue!
  2. Edit the Registry and create HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\WebClient\Parameters\UseBasicAuth as a DWORD and set it to 2 (forces BASIC authentication). Why is authentication in Windows always borked in some way?
  3. Forget about https: connections
The lack of https: might be an issue except that I'm primarily interested in getting stuff off the Windows boxes. Once that's done I can probably push the odd new file up through the Web UI and not bother with WebDAV on XP much more.
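For step 2, the registry change can also be scripted; here is a rough sketch using Python's winreg module (on the Python 2 installs typical of XP the module is called _winreg), to be run with admin rights and followed by a restart of the WebClient service.

```python
import winreg  # "_winreg" on Python 2

# Force the WebClient service to use BASIC authentication (UseBasicAuth = 2),
# matching the manual registry edit described in step 2 above.
key_path = r"SYSTEM\CurrentControlSet\Services\WebClient\Parameters"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, key_path, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "UseBasicAuth", 0, winreg.REG_DWORD, 2)
```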