If you work with data for long enough, you'll eventually come across data that you can't find a library for, and you'll need to write your own parser. Some formats might be simple enough for regular expressions, but if you need to balance syntactic structures in the input, or handle anything too complicated, you're probably better off creating a custom parser. Sometimes, custom parsers can be slower than regular expressions for very large inputs, but sometimes they're still your best option.
Clojure, and most functional languages, are great for parsing, and many have parser-combinator libraries that make writing parsers extremely simple.
For this recipe, as an example of a data format that needs parsing, we'll work with some FASTA data (http://en.wikipedia.org/wiki/FASTA_format). FASTA is a file format that's used in bioinformatics to exchange nucleotide and peptide sequences. Of course, there are parsers already for this, but it's a simple, yet non-trivial format, which makes a good example case for this recipe.
The first line of FASTA data starts with a >
character followed by a unique identifier. This line often contains other information about the specimen described, the database it came from, and more. After this line comes one or more lines that list the sequence information. A more detailed explanation of the FASTA format is available at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. This is how an example FASTA record looks:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPY IGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIP FHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPY TIIGQMASILYFSIILAFLPIAGXIENY
We'll use the parse-ez library (https://github.com/protoflex/parse-ez) to build the parser.
We need to make sure that parse-ez
is listed in our Leiningen project.clj
file:
(defproject cleaning-data "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.6.0"] [parse-ez "0.3.6"]])
We also need to make it available to our script or REPL:
(use 'protoflex.parse)
To define a parser, we just define functions that parse the different parts of the input and then combine them to parse larger structures:
(defn <| [l r] (let [l-output (l)] (r) l-output))
(defn nl [] (chr-in #{ ewline eturn}))
>
character, followed by anything up to the end of the line:(defn defline [] (chr >) (<| #(read-to-re #"[ ]+") nl))
(defn acid-code [] (chr-in #{A B C D E F G H I K L M N P Q R S T U V W X Y - *})) (defn acid-code-line [] (<| #(multi+ acid-code) #(attempt nl)))
(defn fasta [] (ws?) (let [dl (defline) gls (apply str (flatten (multi+ acid-code-line)))] {:defline dl, :gene-seq gls})) (defn multi-fasta [] (<| #(multi+ fasta) ws?))
(defn parse-fasta [input] (parse multi-fasta input :eof false :auto-trim false))
Now, we can use this function to parse the example record, provided at the beginning of this recipe:
user=> parse-fasta test-data) {:defline "gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]", :gene-seq "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY"}
At the most abstract level, parsers are functions. They take a string as input and return a data structure. More complex, advanced parsers are built by combining simpler elements.
The <|
function is a good example of this. It does not parse anything by itself. However, this function makes it possible to combine two other parsers in a useful way: it parses both parts and throws away the results of the second.
The acid-code
function is an example of how to create a parser from a basic component. It matches any of the characters in the set.
acid-code-line
then combines the acid-code
parser. It has to match one or more acid-code
characters, optionally followed by a newline
. It uses the <|
combinator to throw away the newline
and return the sequence of acid-codes
.
This entire parser is built up in this way, by composing complex structures from simple parts. While this is a very basic parser, it's possible to create quite complex parsers in this way, leveraging the full power of Clojure while keeping the code readable and maintainable.