More and more data is being published on the Internet as linked data, in a variety of formats such as microformats, RDFa, and RDF/XML.
Linked data represents entities as consistent URLs and includes links to other linked data sets. In a sense, it's the computer equivalent of human-readable web pages. Often, these formats are used for open data, such as the data published by some governments, including the UK's.
Linked data adds a lot of flexibility and power, but it also introduces more complexity. Often, to work effectively with linked data, we need to start a triple store of some kind. In this recipe and the next three, we'll use Sesame (http://rdf4j.org/) and the kr Clojure library (https://github.com/drlivingston/kr).
First, we need to make sure that the dependencies are listed in our Leiningen project.clj file:
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]])
We'll run the following to load these packages into our script or REPL:
(use 'incanter.core
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb
     'clojure.set)
(import [java.io File])
For this example, we'll get data from the Telegraphis Linked Data assets. We'll pull down the database of currencies at http://telegraphis.net/data/currencies/currencies.ttl. Just to be safe, I've downloaded that file and saved it as data/currencies.ttl, and we'll access it from there.
We'll store the data, at least temporarily, in a Sesame data store (http://notes.3kbo.com/sesame) that allows us to easily store and query linked data.
The longest part of this process will be to define the data. The libraries we're using do all of the heavy lifting, as shown in the steps given below:
1. Create the triple store and register the namespaces that the data uses. We'll bind this triple store to the name t-store:

(defn kb-memstore
  "This creates a Sesame triple store in memory."
  []
  (kb :sesame-mem))

(defn init-kb [kb-store]
  (register-namespaces
    kb-store
    '(("geographis" "http://telegraphis.net/ontology/geography/geography#")
      ("code" "http://telegraphis.net/ontology/measurement/code#")
      ("money" "http://telegraphis.net/ontology/money/money#")
      ("owl" "http://www.w3.org/2002/07/owl#")
      ("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
      ("xsd" "http://www.w3.org/2001/XMLSchema#")
      ("currency" "http://telegraphis.net/data/currencies/")
      ("dbpedia" "http://dbpedia.org/resource/")
      ("dbpedia-ont" "http://dbpedia.org/ontology/")
      ("dbpedia-prop" "http://dbpedia.org/property/")
      ("err" "http://ericrochester.com/"))))

(def t-store (init-kb (kb-memstore)))
2. Define the query that picks out the currency fields we want, and bind it to the name q:

(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/full_name)
    (?/c money/shortName ?/name)
    (?/c money/symbol ?/symbol)
    (?/c money/minorName ?/minor_name)
    (?/c money/minorExponent ?/minor_exp)
    (?/c money/isoAlpha ?/iso)
    (?/c money/currencyOf ?/country)))
3. The query variables need to be converted into keywords that can serve as column names. The header-keyword and fix-headers functions will do this:

(defn header-keyword
  "This converts a query symbol to a keyword."
  [header-symbol]
  (keyword (.replace (name header-symbol) \_ \-)))

(defn fix-headers
  "This changes all of the keys in the map to make them valid header keywords."
  [coll]
  (into {} (map (fn [[k v]] [(header-keyword k) v]) coll)))
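4. Define load-data, which loads the RDF file into the triple store, runs the query, and converts the results into an Incanter dataset:

(defn load-data [k rdf-file q]
  (load-rdf-file k rdf-file)
  (to-dataset (map fix-headers (query k q))))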
5. Load the data from the file and look at the first few rows:

user=> (def d (load-data t-store (File. "data/currencies.ttl") q))
user=> (sel d :rows (range 3) :cols [:full-name :name :iso :symbol])

| :full-name                  | :name   | :iso | :symbol |
|-----------------------------+---------+------+---------|
| United Arab Emirates dirham | dirham  | AED  | إ.د |
| Afghan afghani              | afghani | AFN  | ؋ |
| Albanian lek                | lek     | ALL  | L       |
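To see what the header conversion in these steps does without starting a triple store, the sketch below applies it to a hand-written result map. The sample map is an assumption made for illustration: it supposes that each query result is keyed by the query's variable symbols, such as ?/full_name.

```clojure
(defn header-keyword
  "Converts a query symbol such as ?/full_name to the keyword :full-name."
  [header-symbol]
  ;; name drops the ?/ part; the underscore is then swapped for a hyphen.
  (keyword (.replace (name header-symbol) \_ \-)))

(defn fix-headers
  "Rewrites every key in a result map as a header keyword."
  [coll]
  (into {} (map (fn [[k v]] [(header-keyword k) v]) coll)))

;; Hypothetical result row, written by hand rather than returned by a query.
(def sample-row
  {'?/full_name "Albanian lek"
   '?/name      "lek"
   '?/iso       "ALL"})

(fix-headers sample-row)
;; => {:full-name "Albanian lek", :name "lek", :iso "ALL"}
```

Keywords such as :full-name are what Incanter then uses as column names, which is why the sel call above can ask for :full-name rather than full_name.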
First, here's some background information. Resource Description Framework (RDF) isn't an XML format, although it's often written using XML. (There are other formats as well, such as N3 and Turtle.) RDF sees the world as a set of statements. Each statement has at least three parts (a triple): a subject, predicate, and object. The subject and predicate must be URIs. (URIs are like URLs, only more general. For example, uri:7890 is a valid URI.) Objects can be a literal or a URI. The URIs form a graph. They are linked to each other and make statements about each other. This is where the "linked" in linked data comes from.
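As a rough illustration of the triple model, the sketch below writes a few statements as plain Clojure vectors and follows a link from one resource to another. The vector representation, the triples-about helper, and the example.org statements are all assumptions made for illustration; they are not taken from a real data source, and they are not how a triple store holds its data internally.

```clojure
;; Each statement is a [subject predicate object] vector.
;; All of these statements are invented for illustration.
(def triples
  [["http://example.org/currency/ALL"
    "http://example.org/ontology/name"
    "lek"]                                  ; the object is a literal
   ["http://example.org/currency/ALL"
    "http://example.org/ontology/currencyOf"
    "http://example.org/country/Albania"]   ; the object is a URI
   ["http://example.org/country/Albania"
    "http://example.org/ontology/name"
    "Albania"]])

(defn triples-about
  "Returns every statement whose subject is the given URI."
  [subject]
  (filter (fn [[s _ _]] (= s subject)) triples))

;; Follow the currencyOf link from the currency to the country ...
(def country
  (some (fn [[_ p o]]
          (when (= p "http://example.org/ontology/currencyOf") o))
        (triples-about "http://example.org/currency/ALL")))

;; ... and read the statements made about that country.
(triples-about country)
```

Because the object of one statement is the subject of another, the statements connect into a graph that can be walked from resource to resource.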
If you want more information about linked data, http://linkeddata.org/guides-and-tutorials has some good recommendations.
Now, about our recipe. From a high level, the process we used here is pretty simple, given as follows:
1. Create a triple store (kb-memstore and init-kb)
2. Load the data (load-data)
3. Query the data to pull out only what we want (q and load-data)
4. Transform it into a format that Incanter can ingest easily (header-keyword and fix-headers)
5. Create the Incanter dataset (load-data)
The newest thing here is the query format. kr uses a nice SPARQL-like DSL to express the queries. In fact, it's so easy to use that we'll deal with it instead of working with raw RDF. The items starting with ?/ are variables, which will be used as keys for the result maps. The other items look like rdf-namespace/value. The namespace is taken from the registered namespaces defined in init-kb. These are different from Clojure's namespaces, although they serve a similar function for your data: to partition and provide context.
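The following sketch shows, in plain Clojure and without calling kr, how the symbols in a query pattern can be told apart and how a prefixed name corresponds to a full URI. The helper names variable?, query-variables, and expand, and the small prefixes map, are hypothetical and exist only for this illustration; the library does this work internally in its own way.

```clojure
;; A shortened version of the recipe's query.
(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/full_name)
    (?/c money/shortName ?/name)))

;; Two of the prefixes registered in init-kb.
(def prefixes
  {"rdf"   "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   "money" "http://telegraphis.net/ontology/money/money#"})

(defn variable?
  "True when the namespace part of the symbol is ?, as in ?/c."
  [sym]
  (= "?" (namespace sym)))

(defn query-variables
  "Collects the distinct variables in a query, in order of first appearance."
  [query]
  (distinct (filter variable? (apply concat query))))

(defn expand
  "Expands a prefixed symbol such as money/name to a full URI string."
  [sym]
  (str (get prefixes (namespace sym)) (name sym)))

(query-variables q)
;; => (?/c ?/full_name ?/name)

(expand 'money/name)
;; => "http://telegraphis.net/ontology/money/money#name"
```

Both halves rely on the fact that the Clojure reader already splits a symbol at the slash, so namespace and name are enough to recover the prefix and the local part.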