In the last recipe, Reading RDF data, the embedded domain-specific language (EDSL) used for the query was converted to SPARQL, which is the query language for many linked data systems. If you squint just right at the query, it looks kind of like a SPARQL WHERE clause. For example, you can query DBPedia to get information about a city, such as its population, location, and other data. It's a simple query, but a query nevertheless.
This worked great when we had access to the raw data in our own triple store. However, if we need to access a remote SPARQL endpoint directly, it's more complicated.
For this recipe, we'll query DBPedia (http://dbpedia.org) for information on the currency of the United Arab Emirates, the dirham. DBPedia extracts structured information from Wikipedia (the summary boxes) and republishes it as RDF. Just as Wikipedia is a useful first stop for humans to get information about something, DBPedia is a good starting point for computer programs that want to gather data about a domain.
First, we need to make sure that the dependencies are listed in our Leiningen project.clj file:

    (defproject getting-data "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.6.0"]
                     [incanter "1.5.5"]
                     [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                     [org.clojure/tools.logging "0.3.0"]
                     [org.slf4j/slf4j-simple "1.7.7"]])
Then, load the Clojure and Java libraries we'll use:

    (require '[clojure.java.io :as io]
             '[clojure.xml :as xml]
             '[clojure.pprint :as pp]
             '[clojure.zip :as zip])
    (use 'incanter.core
         'edu.ucdenver.ccp.kr.kb
         'edu.ucdenver.ccp.kr.rdf
         'edu.ucdenver.ccp.kr.sparql
         'edu.ucdenver.ccp.kr.sesame.kb
         'clojure.set)
    (import [java.io File]
            [java.net URL URLEncoder])
As we work through this, we'll define a series of functions. Finally, we'll create one function, load-data, to orchestrate everything. The steps are as follows:
To create and initialize a triple store, we'll reuse the kb-memstore and init-kb functions from Reading RDF data. We define a function that takes a URI for a subject in the triple store and constructs a SPARQL query that returns at most 200 statements about this subject. The query filters out any statements with non-English strings for objects, but it allows everything else:

    (defn make-query
      "This creates a query that returns all of the triples related to a
      subject URI. It filters out non-English strings."
      ([subject kb]
       (binding [*kb* kb
                 *select-limit* 200]
         (sparql-select-query
           ;; The syntax quote (backtick) is needed so that ~subject
           ;; is replaced by the value of subject.
           (list `(~subject ?/p ?/o)
                 '(:or (:not (:isLiteral ?/o))
                       (!= (:datatype ?/o) rdf/langString)
                       (= (:lang ?/o) ["en"])))))))
Next, we encode the query into a URL that we can use to retrieve the results:

    (defn make-query-uri
      "This constructs a URI for the query."
      ([base-uri query]
       (URL. (str base-uri
                  "?format=" (URLEncoder/encode "text/xml" "UTF-8")
                  "&query=" (URLEncoder/encode query "UTF-8")))))
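To see what the encoded request looks like, here is a minimal sketch that builds a query URL by hand. The encode helper and the short sample query are assumptions for illustration, not part of the recipe; the two-argument form of URLEncoder/encode names the character set explicitly.

```clojure
(import '[java.net URL URLEncoder])

(defn encode
  "Hypothetical helper: percent-encodes s as UTF-8 form data."
  [s]
  (URLEncoder/encode s "UTF-8"))

;; A short sample query, not the one the recipe generates.
(def sample-query "SELECT ?p ?o WHERE { ?s ?p ?o } LIMIT 5")

;; Same layout as the recipe's URL: endpoint, result format, query.
(def sample-url
  (URL. (str "http://dbpedia.org/sparql"
             "?format=" (encode "text/xml")
             "&query=" (encode sample-query))))

(println (str sample-url))
```

Spaces become + and characters such as ? and / are percent-encoded, so the whole query travels safely in a single query parameter.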
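Once we have the results, we'll parse the XML, wrap it in a zipper, and navigate to the first result. The next function takes this first result node and returns a sequence of all the results:

    (defn result-seq
      "This takes the first result and returns a sequence of this node,
      plus all of the nodes to the right of it."
      ([first-result]
       (cons (zip/node first-result)
             (zip/rights first-result))))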
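The zipper navigation is easier to follow on a small document. The sketch below parses a hand-written sample laid out like a SPARQL XML result set and collects the result nodes with the same down, right, down walk. The sample data and the sample-xml and parse-str names are assumptions for illustration, and the namespace declaration of a real response is left out for brevity.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '[java.io ByteArrayInputStream])

;; Hand-written sample response with two results and no whitespace
;; between tags, which keeps the parsed tree easy to read.
(def sample-xml
  (str "<sparql>"
       "<head><variable name=\"p\"/><variable name=\"o\"/></head>"
       "<results>"
       "<result>"
       "<binding name=\"p\"><uri>http://example.org/p1</uri></binding>"
       "<binding name=\"o\"><literal>first</literal></binding>"
       "</result>"
       "<result>"
       "<binding name=\"p\"><uri>http://example.org/p2</uri></binding>"
       "<binding name=\"o\"><literal>second</literal></binding>"
       "</result>"
       "</results>"
       "</sparql>"))

(defn parse-str
  "Hypothetical helper: parses an XML string with clojure.xml."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

;; The root is <sparql>. Going down reaches <head>, right reaches
;; <results>, and down again reaches the first <result>.
(def first-result
  (-> sample-xml parse-str zip/xml-zip zip/down zip/right zip/down))

;; The node under the cursor plus its right-hand siblings.
(def results
  (cons (zip/node first-result) (zip/rights first-result)))

(println (count results))
(println (map :tag results))
```

Note that zip/node and zip/rights both return plain nodes rather than zipper locations, so the resulting sequence can be processed with ordinary map and reduce calls.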
The next function takes each result node and returns a key-value pair (result-to-kv). It uses binding-str to pull the results out of the XML. Then, accum-hash pushes the key-value pairs into a map. Keys that occur more than once have their values concatenated into a single space-separated string:

    (defn binding-str
      "This takes a binding, pulls out the first tag's content, and
      concatenates it into a string."
      ([b] (apply str (:content (first (:content b))))))

    (defn result-to-kv
      "This takes a result node and creates a key-value vector pair from it."
      ([r]
       (let [[p o] (:content r)]
         [(binding-str p) (binding-str o)])))

    (defn accum-hash
      "This adds a key-value pair to the map. If the key is already
      present, the new value is appended after a space."
      ([m [k v]]
       (if-let [current (m k)]
         (assoc m k (str current \space v))
         (assoc m k v))))
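As a quick check of how repeated keys are folded together, the sketch below repeats a space-joining accumulator so that it can be pasted on its own and runs it over made-up pairs. It then shows a hypothetical variant, accum-vec, that collects repeated values in a vector instead. Neither the sample pairs nor accum-vec come from the recipe.

```clojure
(defn accum-hash
  "Adds [k v] to m; a repeated key gets its values joined by a space."
  [m [k v]]
  (if-let [current (m k)]
    (assoc m k (str current \space v))
    (assoc m k v)))

(defn accum-vec
  "Hypothetical variant: every key maps to a vector of its values."
  [m [k v]]
  (assoc m k (conj (get m k []) v)))

;; Made-up key-value pairs; the key "coins" appears twice.
(def pairs [["label" "dirham"]
            ["coins" "25 fils"]
            ["coins" "50 fils"]])

(println (reduce accum-hash {} pairs))
(println (reduce accum-vec {} pairs))
```

The string version keeps every cell a plain string, which is convenient for a one-row dataset. The vector version preserves the boundaries between values, at the cost of cells that hold collections.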
For the last utility function, we'll define rekey. This will convert the keys of a map based on another map:

    (defn rekey
      "This flips the arguments for clojure.set/rename-keys to make it
      more convenient. It keeps only the keys listed in k-map."
      ([k-map map]
       (rename-keys (select-keys map (keys k-map)) k-map)))
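A small example may help show what this does to a map keyed by property URIs. The sketch below repeats the function so that it runs on its own; the sample maps are made up for illustration.

```clojure
(require '[clojure.set :as set])

(defn rekey
  "Keeps only the keys listed in k-map and renames them."
  [k-map m]
  (set/rename-keys (select-keys m (keys k-map)) k-map))

(def rdfs "http://www.w3.org/2000/01/rdf-schema#")

;; Calling str on a string and a symbol appends the symbol's name,
;; so this key is the full URI of rdfs:label.
(def k-map {(str rdfs 'label) :name})

;; Made-up input: one property we want and one we do not.
(def raw {(str rdfs 'label)   "United Arab Emirates dirham"
          (str rdfs 'comment) "not listed in k-map, so dropped"})

(println (rekey k-map raw))
```

Because select-keys runs first, any property that the column map does not mention is discarded rather than carried through under its long URI key.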
Now we add a function that takes a SPARQL endpoint and a subject and returns a sequence of result nodes. It uses several of the functions we've just defined:

    (defn query-sparql-results
      "This queries a SPARQL endpoint and returns a sequence of result nodes."
      ([sparql-uri subject kb]
       (->> kb
            ;; Build the URI query string.
            (make-query subject)
            (make-query-uri sparql-uri)
            ;; Get the results, parse the XML, and return the zipper.
            io/input-stream
            xml/parse
            zip/xml-zip
            ;; Navigate to the first result node.
            zip/down
            zip/right
            zip/down
            ;; Convert all children into a sequence.
            result-seq)))
Finally, we can pull all of this together in load-data:

    (defn load-data
      "This loads the data about a currency for the given URI."
      [sparql-uri subject col-map]
      (->>
        ;; Initialize the triple store.
        (kb-memstore)
        init-kb
        ;; Get the results.
        (query-sparql-results sparql-uri subject)
        ;; Generate a mapping.
        (map result-to-kv)
        (reduce accum-hash {})
        ;; Translate the keys in the map.
        (rekey col-map)
        ;; And create a dataset.
        to-dataset))
Now, let's use this. We define a set of variables for the namespaces we'll use, and we use them to create the mapping from properties to column names:

    (def rdfs "http://www.w3.org/2000/01/rdf-schema#")
    (def dbpedia "http://dbpedia.org/resource/")
    (def dbpedia-ont "http://dbpedia.org/ontology/")
    (def dbpedia-prop "http://dbpedia.org/property/")

    (def col-map {(str rdfs 'label) :name,
                  (str dbpedia-prop 'usingCountries) :country
                  (str dbpedia-prop 'peggedWith) :pegged-with
                  (str dbpedia-prop 'symbol) :symbol
                  (str dbpedia-prop 'usedBanknotes) :used-banknotes
                  (str dbpedia-prop 'usedCoins) :used-coins
                  (str dbpedia-prop 'inflationRate) :inflation})
Finally, we call load-data with the DBPedia SPARQL endpoint, the resource we want information about (as a symbol), and the column map:

    user=> (def d (load-data "http://dbpedia.org/sparql"
                             (symbol (str dbpedia "United_Arab_Emirates_dirham"))
                             col-map))
    user=> (sel d :cols [:country :name :symbol])

    | :country             | :name                       | :symbol |
    |----------------------+-----------------------------+---------|
    | United Arab Emirates | United Arab Emirates dirham | إ.د     |
The only part of this recipe that has to do with SPARQL, really, is the make-query function. It uses the sparql-select-query function to generate a SPARQL query string from the query pattern. This pattern has to be interpreted in the context of the triple store that has the namespaces defined. This context is set using the binding form, which rebinds *kb* and *select-limit* for the duration of the call. We can see how this function works by calling it from the REPL by itself:

    user=> (println
             (make-query
               (symbol (str dbpedia "United_Arab_Emirates_dirham"))
               (init-kb (kb-memstore))))
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?p ?o
    WHERE { <http://dbpedia.org/resource/United_Arab_Emirates_dirham> ?p ?o .
    FILTER ( ( ! isLiteral(?o)
    || ( datatype(?o) !=<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> )
    || ( lang(?o) = "en" ) )
    ) } LIMIT 200
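To make the shape of the generated query concrete without the KR library, here is a sketch that assembles equivalent query text with plain string concatenation. The english-triples-query name is hypothetical, and this is not how sparql-select-query works internally; it only produces the same kind of text as the REPL output above.

```clojure
;; Datatype IRI for language-tagged strings, as in the REPL output.
(def lang-string
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString")

(defn english-triples-query
  "Hypothetical helper: builds SPARQL text that selects up to limit
  predicate/object pairs for subject-uri. Objects pass the filter if
  they are not literals, are not language-tagged strings, or are
  tagged as English."
  [subject-uri limit]
  (str "SELECT ?p ?o\n"
       "WHERE { <" subject-uri "> ?p ?o .\n"
       "  FILTER ( !isLiteral(?o)\n"
       "        || datatype(?o) != <" lang-string ">\n"
       "        || lang(?o) = \"en\" )\n"
       "} LIMIT " limit))

(println
  (english-triples-query
    "http://dbpedia.org/resource/United_Arab_Emirates_dirham" 200))
```

Building the text by hand like this is fine for a fixed pattern, but the EDSL has the advantage of resolving namespaces from the triple store, which is why the recipe binds *kb* before generating the query.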
The rest of the recipe is concerned with parsing the XML format of the results, and in many ways, it's similar to the last recipe.