One of the best things about Cascalog queries is that they can be composed. As with function composition, this is a good way to build a complex process from smaller, easy-to-understand parts.
In this recipe, we'll parse the Virginia census data we first used in the Managing program complexity with STM recipe in Chapter 3, Managing Complexity with Concurrent Programming. You can download this data from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv. We'll also use a new census datafile that contains the race data. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv.
Since we're reading CSV, we'll need to use the dependencies and imports from the Parsing CSV files with Cascalog recipe.
We'll also use the hfs-text-delim function from that recipe and ->long from the Aggregating data with Cascalog recipe.
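For readers who don't have the earlier recipe at hand, the conversion that ->long performs can be sketched in plain Clojure. This is only a minimal sketch and not the definition from that recipe: the name parse-long-field is hypothetical, and trimming whitespace before parsing is an assumption made for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the recipe's ->long helper.
;; Assumption: fields arrive as strings holding base-10 integers,
;; possibly padded with whitespace.
(defn parse-long-field
  "Parses a string field such as \"8191\" into a long."
  [s]
  (Long/parseLong (str/trim s)))

;; Convert a single field.
(parse-long-field "8191")

;; Convert several fields at once.
(map parse-long-field ["8191" " 4271" "2056"])
```

A field that is not a valid integer makes Long/parseLong throw a NumberFormatException, so a version meant for messier data would need to handle that case.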
Also, we'll need the data files from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv and http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv. We'll put them into the data directory, as follows:
(def families-file "data/all_160_in_51.P35.csv")
(def race-file "data/all_160_in_51.P3.csv")
We'll read these datasets and convert some of the fields in each to integers. Then we'll join the two together and select only a few of the fields.
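The same three steps (convert, join, select) can be sketched on in-memory data in plain Clojure, which may make the shape of the process easier to see before it is expressed as Cascalog queries. Everything in this sketch is an assumption for illustration: the rows, keys, and values are made up and are not taken from the census files, and convert-fields is a hypothetical helper. It uses clojure.set/join, which performs a natural join on the keys the two relations share (here, only :geoid). In Cascalog, the join is instead implied by using the same variable in both generators.

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows with made-up values, keyed by a shared :geoid.
(def family-rows
  #{{:geoid "A1" :name "Town A" :families "20"}
    {:geoid "B2" :name "Town B" :families "35"}})

(def race-rows
  #{{:geoid "A1" :white "50" :black "30"}
    {:geoid "B2" :white "70" :black "40"}})

(defn convert-fields
  "Parses the string values under the keys ks into longs."
  [row ks]
  (reduce (fn [r k] (update r k #(Long/parseLong %))) row ks))

(def joined
  (->> (set/join
         ;; Step 1: convert the numeric fields in each dataset.
         (set (map #(convert-fields % [:families]) family-rows))
         (set (map #(convert-fields % [:white :black]) race-rows)))
       ;; Step 2 happened in set/join: rows are matched on :geoid.
       ;; Step 3: keep only a few of the fields.
       (map #(select-keys % [:name :families :white :black]))
       (sort-by :name)))
```

Each piece here is small enough to test on its own, which is the same benefit the recipe gets from defining separate queries and composing them.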
First, we'll define a Cascalog query that reads the families data file and converts the integer fields to numbers:
(def family-data
  (<- [?GEOID ?SUMLEV ?STATE ?NAME ?POP100 ?HU100 ?P035001]
      ((hfs-text-delim families-file :has-header true)
       ?GEOID ?SUMLEV ?STATE _ _ _ _ _ ?NAME ?spop100 ?shu100 _ _ ?sp035001 _)
      (->long ?spop100 :> ?POP100)
      (->long ?shu100 :> ?HU100)
      (->long ?sp035001 :> ?P035001)))
Next, we'll define a similar query that reads the race data file:
(def race-data
  (<- [?GEOID ?SUMLEV ?STATE ?NAME ?POP100 ?HU100
       ?P003001 ?P003002 ?P003003 ?P003004
       ?P003005 ?P003006 ?P003007 ?P003008]
      ((hfs-text-delim race-file :has-header true)
       ?GEOID ?SUMLEV ?STATE _ _ _ _ _ ?NAME ?spop100 ?shu100 _ _
       ?sp003001 _ ?sp003002 _ ?sp003003 _ ?sp003004 _
       ?sp003005 _ ?sp003006 _ ?sp003007 _ ?sp003008 _)
      (->long ?spop100 :> ?POP100)
      (->long ?shu100 :> ?HU100)
      (->long ?sp003001 :> ?P003001)
      (->long ?sp003002 :> ?P003002)
      (->long ?sp003003 :> ?P003003)
      (->long ?sp003004 :> ?P003004)
      (->long ?sp003005 :> ?P003005)
      (->long ?sp003006 :> ?P003006)
      (->long ?sp003007 :> ?P003007)
      (->long ?sp003008 :> ?P003008)))
Finally, we'll define a query that joins these two queries on the ?GEOID field. It will also rename some of the fields:
(def census-joined
  (<- [?name ?pop100 ?hu100 ?families
       ?white ?black ?indian ?asian ?hawaiian ?other ?multiple]
      (family-data ?geoid _ _ ?name ?pop100 ?hu100 ?families)
      (race-data ?geoid _ _ _ _ _ _
                 ?white ?black ?indian ?asian ?hawaiian ?other ?multiple)))
Now we can run this and send the results to the standard output:
user=> (?- (stdout) census-joined)
…
RESULTS
-----------------------
Abingdon town 8191 4271 2056 7681 257 15 86 6
Accomac town 519 229 117 389 106 0 3 1
Adwolf CDP 1530 677 467 1481 17 1 4 0
Alberta town 298 163 77 177 112 4 0 0
Alexandria city 139966 72376 30978 85186 30491 589 8432 141
Altavista town 3450 1669 928 2415 891 5 20 0
Amherst town 2231 1032 550 1571 550 17 14 0
Annandale CDP 41008 14715 9790 20670 3533 212 10103 53
Appalachia town 1754 879 482 1675 52 4 2 0
Appomattox town 1733 849 441 1141 540 8 3 0
Aquia Harbour CDP 6727 2300 1914 5704 521 38 150 …