Cultural Heritage Institutions and Archives have the potential to set the agenda for major applications of Linked Data


Cultural Heritage institutions such as libraries, museums and archives face many different challenges in preparing Linked Data, as this essay will survey. The institutions themselves vary widely in what they collect, what function they serve, and even what language they speak, which leads to different preferences in metadata standards. These and other factors promote heterogeneity in the Linked Data, undermining interoperability between the different Cultural Heritage institutions, especially in aggregators like Europeana. Added to this, valuable and sophisticated metadata is often sacrificed to make the data more compatible with other datasets. But by examining how institutions reckon with these problems, key solutions are highlighted that matter to many different applications of Linked Data. Examining how these challenges are eventually resolved sets the agenda for Linked Data research in the future.

It is of huge benefit to society that anyone with an internet connection can access most of the world’s museums, libraries and archives. Indeed, on encountering the first museum exhibit online, Tim Berners-Lee felt that it encapsulated some of the Web’s most basic principles of free and open access to information: “This use of the Web to bring distant people to great resources, and the navigational idiom used to make the virtual museum, both caught on and inspired many excellent Web sites” (Berners-Lee, p.64).[i]

In a definition of Cultural Heritage, Eero Hyvonen shows how the intentions behind Linked Data and Cultural Heritage are similarly aligned, in that both seek to protect data for posterity: “Cultural Heritage (CH) refers to the legacy of physical objects, environment, traditions, and knowledge of a society that are inherited from the past, maintained and developed further in the present, and preserved (conserved) for the benefit of future generations” (Hyvonen, p.1).[ii] The consequence of putting a Cultural Heritage collection online is that it is no longer hemmed in by physical constraints such as space or time: many more people can view it at once and, unlike in the physical world, can then immediately access connected objects or collections many leagues away.

Berners-Lee writes of hyperlinks opening up research, and the same applies to Cultural Heritage institutions: “On the web however, research ideas in hypertext links can be followed up in seconds, rather than weeks of making phone calls and waiting for deliveries in the mail” (Berners-Lee, p.41). The Web accelerated the speed of research, but before Linked Open Data, performing searches could still be arduous. Howard Besser gives an impression of the constant back-tracking once necessary when searching catalogues and collections, working across search engines that required different syntax and protocols, with little interoperability or interconnection: “each of these repositories might have required a different syntax and different set of viewers. Once the user had searched several different repositories, he or she still could not examine all retrieved objects together. There was no way of merging sets” (Besser, pp.561-562).[iii]

The natural extension of this is the pooling of collections and databases between separate institutions. One example of such an “information portal” is Europeana: “A flagship application here is Europeana, based on millions of collection objects originating from memory organizations all over Europe” (Hyvonen, p.2). Europeana aggregates the data of over 2,000 institutions across Europe, allowing users to explore these institutions’ collections. For these collections to be linked they must follow the same metadata standard, which is problematic considering how many different institutions, and how many possibilities for unique metadata, there are. And this compliance usually comes at the cost of metadata quality.

Europeana and other Cultural Heritage Linked Data aggregators harmonize these conflicting datasets, at least for their own use, by ingesting the widely varied metadata records and applying their own structure to the data, which in Europeana’s case is the EDM (Europeana Data Model), before publishing it on the Europeana servers: “[EDM] should facilitate Europeana’s transition from a closed data repository to an open information space that integrates with the Web architecture and the Linked Data principles for identifying and exposing resources on the Web” (Haslhofer et al, p.97).[iv] But while this method may create consistency between national institutions, it promotes inconsistent metadata within the institution itself: “This approach ensures a level of consistency and interoperability between the datasets from different institutions [but] creates the problem of a disconnect between the Cultural Heritage institute original metadata and the Linked Data version” (Boer et al, p.2).[v]

When Cultural Heritage metadata is converted into the Linked Data model used by Europeana, it is forced to comply with the target meta-model, which simplifies and impoverishes the valuable metadata: “the original metadata is forced into a target meta-model such as Dublin Core [which] means that specific metadata values are copied to the new model, deleting other values. While this produces ’clean’ data in the new meta-model, the original complexity and detail of the data may be lost” (ibid, p.3). The prevailing structure of Cultural Heritage data is graph-like, reflecting its semantic richness and its need to connect many different objects and datatypes, where other datasets adopt a flat shape. As a consequence these datasets require more complex XML, making them more difficult to convert to RDF for use in large aggregators: “For datasets that are more or less ‘flat’ record structures, the produced RDF is usually of sufficient quality. However, Cultural Heritage datasets usually are inherently graph-like and will consist of more complex XML” (ibid).
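The loss described above can be made concrete with a minimal sketch. The field names and mapping below are invented for illustration; real crosswalks to Dublin Core are far larger, but the mechanism is the same: only mapped values are copied, and everything else is deleted.

```python
# Illustrative sketch (hypothetical field names): flattening a rich source
# record into a small Dublin Core-style target model. Fields without a
# mapping are simply dropped, which is how detail gets lost in conversion.

# A rich, graph-like source record for a museum object
source_record = {
    "title": "Lotus vase",
    "maker": "Unknown",
    "production.place": "Jingdezhen",
    "production.period": "Song dynasty",
    "material": "porcelain",
    "glaze.colour": "pale green",
}

# Mapping from source fields to Dublin Core terms; unmapped fields are lost
DC_MAPPING = {
    "title": "dc:title",
    "maker": "dc:creator",
    "production.period": "dc:date",
}

def to_dublin_core(record):
    """Copy only the mapped values into the flat target model."""
    return {DC_MAPPING[k]: v for k, v in record.items() if k in DC_MAPPING}

flat = to_dublin_core(source_record)
lost = set(source_record) - set(DC_MAPPING)
print(flat)          # the 'clean' but impoverished record
print(sorted(lost))  # fields deleted by the conversion
```

Running this shows the place, material and glaze information vanishing from the target record: “clean” data, but poorer data.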

Such variety in metadata appears to be unique to Cultural Heritage: “A special characteristic of cultural collection databases is that they contain semantically rich information. Collection items have a history and are related in many ways to our environment, to the society, and to other collection items” (Hyvonen, Makela et al, p.1).[vi] Part of the reason Cultural Heritage metadata is so rich and varied is, as mentioned above, that it incorporates so many different elements; the data collected in these portals ranges across photographs, artworks, audio and video footage, biography, historical artefacts and more: “Managing publication of such richness and variety of content on the Web… poses challenges where traditional publication approaches need to be re-thought” (Hyvonen, p.viii).

The example given by Hyvonen to illustrate this point about the richness in Cultural Heritage data uses a chair. Appearing simple at first, the chair may be connected to a style, a period, a famous designer, be linked to someone historically significant, etc, and expand outwards into many different collections: “Other collection items, locations, time periods, designers, companies etc. can be related to the chair through their properties and implicitly constitute a complicated semantic network of associations. This semantic network is not limited to a single collection but spans over other related collections in other museums” (ibid, pp.1, 2).

It is also important to note the differences among the separate types of Cultural Heritage institution: libraries, museums and archives. They differ primarily in what they collect, and from this arise different standards and recommended datasets: “Archives are the third major type of memory organizations in addition to museums and libraries. Archives, libraries, and museums differ from each other in 1) what they remember and 2) who they serve” (ibid, p.41). Archives contain greater numbers of legal and historical documents, and more primary-source documents pertaining to historical events. In contrast to archives, the items collected in a library are rarely unique; multiple copies of an object usually exist somewhere, because each copy is of equal value to the user. Museums differ from libraries and archives, but they can also differ vastly among themselves in what they house, for example art, animals, machinery or technology. Usually their objects are unique, though not necessarily to the same degree as in archives. So while the same object could conceivably be housed in each of these institutions, it is logical that the accompanying metadata would differ, having been made to serve very different ends.

Between separate branches of the same type of institution there can also be significant variation in metadata. Although each type of institution has preferred standards, such as Dublin Core for libraries, the way metadata is extended and personalized makes interoperability between individual institutions difficult: “The data often includes attributes that are unique to a particular museum, and the data is often inconsistent and noisy because it has been maintained over a long period of time by many individuals” (Szekely et al, p.2).[vii]

The idiosyncratic way that metadata is written promotes even greater variety within aggregated collections, making it less interoperable and so less linked: “a fundamental problem area in dealing with CH data is to make the content mutually interoperable, so that it can be searched, linked, and presented in a harmonized way across the boundaries of the datasets and data silos” (Hyvonen, p.5). One of the solutions to make datasets interoperable within a larger database is to find a way to homogenise the data: “Web languages, standards, and ontologies make it possible to make heterogeneous museum collections of different kind mutually interoperable. This enables, e.g., the creation of large inter-museum exhibitions” (Hyvonen, Makela et al, p. 2).

Another source of heterogeneity in metadata is language. This especially applies to international aggregators like Europeana, which collects data from European cultural heritage institutions in their individual languages. When indexing data about an object or event, what language should be used? English may be the dominant language of the internet, but choosing it requires translating all the metadata.

Hyvonen gives the following example concerning the event “Battle of Albert”, fought in France during WW1. When indexing data the event is named in both French (“Bataille d’Albert 1914”) and Finnish (“Albertin taistelu 1914”). The solution he puts forward is to use a neutral Uniform Resource Identifier: “[The neutral] URI is actually used in DBpedia… and although it is based on English, it is language neutral in the sense the URI indeed has different labels “Bataille d’Albert (1914)” in French and “Battle of Albert (1914)” in English attached to it” (Hyvonen, p.26). While a URI not based on any language might be the more neutral choice, it is “also more difficult to use from a human perspective” (ibid), so English is preferred.
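The mechanism can be sketched as a single identifier carrying labels in several languages, so French and Finnish records can index the same event without agreeing on a name. The abbreviated URI and the lookup function below are illustrative assumptions, not DBpedia’s actual API.

```python
# Sketch (URI abbreviated, function invented): one language-neutral URI
# carries labels in several languages; providers index by URI and only
# render a label for display.

labels = {
    "dbpedia:Battle_of_Albert_(1914)": {
        "en": "Battle of Albert (1914)",
        "fr": "Bataille d'Albert (1914)",
        "fi": "Albertin taistelu 1914",
    }
}

def label_for(uri, lang, default_lang="en"):
    """Return the label in the requested language, falling back to English."""
    entry = labels.get(uri, {})
    return entry.get(lang) or entry.get(default_lang)

# A Finnish and a French portal display the same underlying resource
print(label_for("dbpedia:Battle_of_Albert_(1914)", "fi"))
print(label_for("dbpedia:Battle_of_Albert_(1914)", "fr"))
```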

Another example of heterogeneous data can be found in the following case study. Databases of stolen art, kept by international police forces such as Interpol and Scotland Yard, may hold detailed descriptions of both missing and found items, yet because the descriptions differ the two might never be aligned: “A museum reporting a theft may describe an object as “a Song dynasty Ying Ging lotus vase”, whereas a police officer reporting a recovered item may simply enter a “12.5 inch high pale green vase with floral designs” (Antoniou, p.199).[viii] Human involvement, in this case from art experts, is still required to make such deductions and inferences.

But as Heflin writes, differing metadata is to be expected. The way to address this and make the metadata more homogeneous is to provide a means of converting it between standards: “Different ontologies may model the same concepts in different ways. The language should provide primitives for relating different representations, thus allowing data to be converted to different ontologies and enabling a “web of ontologies” (Heflin).[ix]

A further challenge to Cultural Heritage Linked Data is presented by homonyms. Homonyms disrupt databases because annotation relies on restrictive ontologies, which may not account for more than one sense of a term: “The problem of homonymous terms occurs when there are homonyms within the range of ontologies used for annotating the ontological feature at hand” (Hyvonen, Makela et al, p.8). One way to avoid the headaches brought about by homonymous terms is to place fewer restrictions on the database. Another solution is to include on every RDF card the potential interpretations and then present these to a human editor, who can “remove the false interpretations on the RDF card manually” (ibid). In Hyvonen’s work with MuseumFinland, which used Finnish, this problem did not occur very often, partly because the homonyms usually occurred between terms in different domain ontologies. “However, the problem still remains in some cases and is likely to be more severe in languages like English having more homonyms” (ibid).

What should be done with entities, such as people, that share the same name? This is where ontologies come into play, linking data to its exact referents in the real world. Using ontologies, Hyvonen et al put forward a solution for shared names: “two persons who happen to have the same name should be disambiguated by different URIs, and a person whose name can be written in many ways, should be identified by a single URI to which the alternative terms refer” (Hyvonen, Makela et al, p.20).
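Both halves of that principle can be sketched in a few lines. The URIs, names and the `resolve` helper below are invented for illustration: two people with the same name get distinct URIs, while every written variant of one name resolves to a single URI.

```python
# Sketch with invented URIs and names: distinct URIs disambiguate two people
# who share a name, while a variant-to-URI index lets many spellings resolve
# to one person.

# Two different people, same name, kept apart by their URIs
persons = {
    "ex:person/001": {"name": "John Smith", "born": 1850},
    "ex:person/002": {"name": "John Smith", "born": 1923},
}

# All known written variants of one name point to a single URI
name_variants = {
    "P. I. Tchaikovsky": "ex:person/003",
    "Pyotr Ilyich Tchaikovsky": "ex:person/003",
    "Peter Tschaikowsky": "ex:person/003",
}

def resolve(name):
    """Look a written name up in the variant index; None if unknown."""
    return name_variants.get(name)

# Three spellings, one identity
print(resolve("Peter Tschaikowsky") == resolve("P. I. Tchaikovsky"))
```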

Hyvonen examines the complications that follow from a much-reproduced work like the “Mona Lisa”, an original artwork of which many different versions exist across various media, and puts forward a simple solution to the problem: “the painting ‘Mona Lisa’ is in the Louvre, but there are numerous copies, photographs, texts, drawings, and statues depicting it in various European museums. Different views of the same object can be represented using a special proxy mechanism” (Hyvonen, pp.43-44).

Finally on the subject of names, confusion arises in databases when the opposite of the above happens: two different names are recorded for the same individual. This is more likely with languages like Russian that use patronymics, or any language where names differ substantially from English, i.e. most languages. Szekely et al, working with the Smithsonian American Art Museum, had to go to great lengths to reconcile heterogeneous names of persons: “We matched people using their names, including variants, and their birth dates and death dates. The task is challenging because people’s names are recorded in many different ways, multiple people can have the same name, and birth dates and death dates are often missing or incorrect” (Szekely et al, p.10).
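The kind of matching Szekely et al describe can be sketched roughly as below. The normalisation and tolerance rules here are my own illustrative assumptions, not the Smithsonian pipeline: names are normalised before comparison, conflicting dates rule a match out, and missing dates are tolerated rather than counted against it.

```python
# Sketch of person matching on names plus birth/death dates. The rules are
# illustrative assumptions, not a real record-linkage system.

def normalise(name):
    """Lowercase, strip commas/periods, sort tokens so 'Smith, John' == 'John Smith'."""
    tokens = name.replace(",", " ").replace(".", " ").lower().split()
    return " ".join(sorted(tokens))

def same_person(a, b):
    """Match on normalised name; dates must agree when both records have them."""
    if normalise(a["name"]) != normalise(b["name"]):
        return False
    for key in ("born", "died"):
        if a.get(key) and b.get(key) and a[key] != b[key]:
            return False   # conflicting dates rule the match out
    return True            # a missing date is tolerated, not a mismatch

museum = {"name": "Smith, John", "born": 1850, "died": 1920}
external = {"name": "John Smith", "born": 1850}   # death date missing
print(same_person(museum, external))
```

In practice such rules only narrow the candidate set; as the stolen-art example above showed, ambiguous cases still fall to human experts.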

The above is an example not of homonyms but of named synonyms. Synonyms fall under heterogeneity: different names are given for the same person or object. This is such a problem because machines are unyieldingly literal, where a human in the same situation might exercise inference and conclude that this thing must also be that. Hyvonen underlines the difficulty names present in a Cultural Heritage institution database:

The problem of synonyms is particularly difficult when dealing with names. Firstly, names can be written using different syntactic conventions… Second, the names may be written using different transliteration systems… Third, a person may have different names during his/her life (due to marriage, for example), use artistic pseudo-names or…are given nicknames by others (Hyvonen, p.109).

Synonyms and homonyms in a search can be avoided through semantic disambiguation, which refers to the context in which a word appears: e.g. the proximity of the word “bear” to “cubs” or “grizzly” indicates that this “bear” is an animal, whereas “bear” near words of weight or emotion could mean the verb “bear” and not the noun. This semantic disambiguation can be done manually at the indexing stage, by using keywords from an appropriate ontology, or as above through deduction based on the word’s context.
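The bear example can be sketched as a toy scorer: each sense of an ambiguous word is scored by how many of its cue words appear in the surrounding context. The cue lists are invented for illustration and not drawn from any real ontology.

```python
# Toy sketch of context-based disambiguation: pick the sense whose cue words
# overlap the surrounding context most. Cue lists are invented.

SENSES = {
    "bear(animal)": {"cubs", "grizzly", "forest", "den"},
    "bear(verb)":   {"weight", "burden", "emotion", "load"},
}

def disambiguate(word, context_words):
    """Return the best-overlapping sense, or None if no cue word matches."""
    context = set(w.lower() for w in context_words)
    best = max(SENSES, key=lambda sense: len(SENSES[sense] & context))
    return best if SENSES[best] & context else None

print(disambiguate("bear", ["the", "grizzly", "and", "her", "cubs"]))
print(disambiguate("bear", ["a", "heavy", "weight", "to", "carry"]))
```

Real systems weight and smooth such evidence statistically, but the principle, context narrowing down sense, is the same one an indexer applies by hand.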

All these causes increase heterogeneity and lower interoperability between the datasets of Cultural Heritage institutions. One method for overcoming the heterogeneity in metadata is for different data controllers to refer to shared URIs when linking their data: “Using URIs has many obvious benefits for linking data. If different content providers index their data using shared URIs, then the distributed data can be linked together automatically in order to enrich it” (Hyvonen, p.28). Even then, the ontologies and datasets being linked may use different terms, so DBpedia may name Hyvonen’s example of the Battle of Albert differently than the Imperial War Museum database does: “Using a single global ontology for URIs would be an optimal solution, but the reality is that different repositories and communities will continue using different identifiers for the same things” (ibid).
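The “linked together automatically” claim is easy to demonstrate. In the sketch below, with invented URIs and providers, two datasets that index with the same identifier merge mechanically, each enriching the other with labels the other lacked.

```python
# Sketch of automatic enrichment via shared URIs: records from different
# providers that use the same identifier are unioned. URIs are invented.

provider_a = {"ex:Battle_of_Albert": {"label_fr": "Bataille d'Albert 1914"}}
provider_b = {"ex:Battle_of_Albert": {"label_fi": "Albertin taistelu 1914"},
              "ex:Battle_of_Somme":  {"label_en": "Battle of the Somme"}}

def merge(*datasets):
    """Union the properties of records that share a URI; others stay separate."""
    linked = {}
    for data in datasets:
        for uri, props in data.items():
            linked.setdefault(uri, {}).update(props)
    return linked

linked = merge(provider_a, provider_b)
print(linked["ex:Battle_of_Albert"])   # now carries both labels
```

Note that the merge only works because both providers chose the same URI; where they did not, the alignment techniques discussed next are needed.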

A further way of tackling this heterogeneity is to use ontology alignments between the collections: “[Ontology alignments] define what URIs refer to the same concept in different RDF stores, or overlap in meaning when the concepts do not fully correspond to each other” (ibid). The simplest way of achieving interoperability between datasets may still be to adopt the same schema, but this is not always a viable option, particularly when alternative schemas have become established in databases or by data controllers: “interoperability problems can be tackled effectively by using a single schema. However, different schemas are needed and used for different kinds of objects in portal applications dealing with cross-domain contents” (Hyvonen, p.42).
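An alignment can be pictured as a lookup table of equivalent URIs, applied to rewrite one store’s identifiers into a shared vocabulary. All URIs below are invented, and real alignments (e.g. `owl:sameAs` links) also express partial overlap, which this sketch omits.

```python
# Sketch of an ontology alignment: a table of URI equivalences across RDF
# stores, used to rewrite triples into a shared vocabulary. URIs invented.

ALIGNMENT = {
    # local concept URI          -> shared/target URI
    "museumA:concept/chair":       "shared:Chair",
    "museumB:objects/seat-chair":  "shared:Chair",
    "museumA:concept/armchair":    "shared:Armchair",
}

def align(triples):
    """Rewrite subjects and objects through the alignment; unknown terms pass through."""
    fix = lambda term: ALIGNMENT.get(term, term)
    return [(fix(s), p, fix(o)) for (s, p, o) in triples]

# Two museums' chairs now meet under one shared concept
triples = [("museumA:item/42", "rdf:type", "museumA:concept/chair"),
           ("museumB:item/7",  "rdf:type", "museumB:objects/seat-chair")]
print(align(triples))
```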

An example of a schema designed to promote interoperability is Europeana’s ESE (Europeana Semantic Elements), which achieves data harmonization by transforming separate institutions’ data into the same schema. The EDM (Europeana Data Model) goes further, linking the data as well: “EDM is a Semantic Web-based framework for representing cross-domain collection metadata in museums, libraries, and archives. The model facilitates richer content descriptions than ESE, and data linking based on shared resources” (Hyvonen, p.43).

Another example of a framework designed to offset the shortcomings of Linked Open Data is the CIDOC Conceptual Reference Model (CRM). Oldman et al point out that this is an ontology designed specifically for the multi-disciplinary data of Cultural Heritage institutions: “The CRM came about through the realisation that Cultural Heritage institutions represented such a wide variety of different knowledge that attempting to model or integrate this within established meta-models… would be unsustainable and semantically limiting” (Oldman et al, p.13).[x] As can be seen, the steps taken to overcome the limitations of Cultural Heritage institution data have led to many important developments in Linked Open Data.

By charting the various difficulties that Cultural Heritage institutions have encountered, this essay has shown how the field has set the agenda for Linked Data, in the solutions and breakthroughs these specific problems have prompted. Overcoming these obstacles to well-formed Linked Data will benefit not just Cultural Heritage data but other areas of data as well. It is impossible to say where Cultural Heritage Linked Data will go next, but Hyvonen speculates it will grow in size, breadth and ambition: “It is easy to envision that the development is leading toward larger semantic CH portals, since larger and larger linked datasets with better and better quality are being published. Such datasets are crossing geographical, cultural, and linguistic barriers of content providers in different countries” (Hyvonen, p.121).

[i] Berners-Lee, Tim, Weaving the Web

[ii] Hyvonen, Eero, Publishing and Using Cultural Heritage Linked Data on the Semantic Web

[iii] Besser, Howard, “The Past, Present, and Future of Digital Libraries”

[iv] Haslhofer, Bernhard, et al, “data.europeana.eu: The Europeana Linked Open Data Pilot”

[v] Boer, Victor de, et al, “Supporting Linked Data Production for Cultural Heritage Institutes: The Amsterdam Museum Case Study”

[vi] Hyvonen, Makela et al, “MuseumFinland: Finnish Museums on the Semantic Web”

[vii] Szekely, Pedro, et al, “Connecting the Smithsonian American Art Museum to the Linked Data Cloud”

[viii] Antoniou, Grigoris et al, A Semantic Web Primer

[ix] Heflin, J. “OWL Web Ontology Language Use Cases and Requirements”

[x] Oldman, Dominic, Doerr, Martin and Gradmann, Stefan, “ZEN and the Art of Linked Data: New Strategies for a Semantic Web of Humanist Knowledge”
