Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 5. String and Text Processing

Someone will call Something will fall And smash on the floor Without reading the text Know what comes next Seen it before And it’s painful Things must change We must rearrange them Or we’ll have to estrange them All that I’m saying The game’s not worth playing Over and over again
—Depeche Mode, "The Sun and the Rainfall"

5.0 Introduction

Users who come to Mathematica for its superior mathematical capabilities are pleasantly surprised to find strong abilities in programming areas outside of mathematics proper. This is certainly true in the area of textual and string processing. Mathematica’s rich library of functions for string and structured text manipulation rivals Java, Perl, or any other modern language you can tie a string around.

The sections in this introduction provide information on some of the basic tools of strings and string manipulation.

Characters and Character Encodings

Mathematica uses Unicode internally, but externally (e.g., when saving a notebook) it uses ASCII codes, encoding non-ASCII characters in a special form.

For example, lowercase Greek letters and other non-ASCII characters are encoded using backslash-bracketed character names ([name]).

In[1]:=  alpha = "α"
Out[1]=  α

The function ToString will translate strings using different encoding schemes.

The default character encoding used by Mathematica is stored in $CharacterEncoding, and the native character encoding of the underlying operating system Mathematica is running is stored in $SystemCharacterEncoding. All available encodings are stored in $CharacterEncodings.

    In[3]:=  $CharacterEncoding
    Out[3]=  UTF-8

    In[4]:=  $SystemCharacterEncoding
    Out[4]=  UTF-8

    In[5]:=  Partition[$CharacterEncodings, 4] // TableForm
Out[5]//TableForm=
             AdobeStandard      ASCII                       CP936                       CP949
             CP950              Custom                      EUC-JP                      EUC
             IBM-850            ISO10646-1                  ISO8859-10                  ISO8859-11
             ISO8859-13         ISO8859-14                  ISO8859-15                  ISO8859-16
             ISO8859-1          ISO8859-2                   ISO8859-3                   ISO8859-4
             ISO8859-5          ISO8859-6                   ISO8859-7                   ISO8859-8
             ISO8859-9          ISOLatin1                   ISOLatin2                   ISOLatin3
             ISOLatin4          ISOLatinCyrillic            Klingon                     koi8-r
             MacintoshArabic    MacintoshChineseSimplified  MacintoshChineseTraditional MacintoshCroatian
             MacintoshCyrillic  MacintoshGreek              MacintoshHebrew             MacintoshIcelandic
             MacintoshKorean    MacintoshNonCyrillicSlavic  MacintoshRomanian           MacintoshRoman
             MacintoshThai      MacintoshTurkish            MacintoshUkrainian          Math1
             Math2              Math3                       Math4                       Math5
             Mathematica1       Mathematica2                Mathematica3                Mathematica4
             Mathematica5       Mathematica6                Mathematica7                PrintableASCII
             ShiftJIS           Symbol                      Unicode                     UTF8
             WindowsANSI        WindowsBaltic               WindowsCyrillic             WindowsEastEurope
             WindowsGreek       WindowsThai                 WindowsTurkish              ZapfDingbats

Notice how UTF-8 needs two bytes to display alpha.

ToCharacterCode gives the numerical representation.

You can map from character codes back to characters using FromCharacterCode[].

In[8]:=  FromCharacterCode[{87, 88, 89, 90}]
Out[8]=  WXYZ

The mapping may not be one-to-one for certain encodings.

In[9]:=  FromCharacterCode[{206, 177}, "UTF8"]
Out[9]=  α

String and Regular Expressions

A great deal of Mathematica’s prowess in text processing comes from its rich support for pattern matching. There are two basic classes of string patterns: string expressions and regular expressions. Introduced in version 5.1, each has a similar expressive power. The advantage of StringExpression is that it is less cryptic because it uses more words than symbols to express patterns. The advantage of RegularExpression is that it is more standardized with other languages such as Perl, Ruby, Java, and so on. Non-Mathematica programmers, especially those with a background in Unix, are more likely to understand regular expressions, although these expressions are cryptic to the uninitiated. You should become familiar with both if you plan to do much string manipulation. If you program frequently in languages outside of Mathematica, master the regular expression syntax. If you work strictly in Mathematica, choose the one that most appeals to you. If you learn the string expression syntax, you will have a leg up on learning Mathematica’s more general pattern-matching syntax, which is used in many contexts outside text processing. You can also mix string expressions and regular expressions into compound patterns.

String expressions

StringExpressions are mostly written using the infix operator ~~, which is a syntactic shortcut for the StringExpression[] function. StringExpression uses Mathematica’s blanks notation (e.g., _, __, and ___) to represent wild cards. See Chapter 4 for more on blanks.

Match "xy" followed by any character.

In[10]:=  "xy" ~~ _;

In[11]:=  StringMatchQ["xyz" , "xy" ~~ _]
Out[11]=  True

In[12]:=  StringMatchQ["xyzz" , "xy" ~~ _]
Out[12]=  False

Match "xy" followed by one or more characters.

In[13]:=  "xy" ~~ __;

In[14]:=  StringMatchQ["xyzz" , "xy" ~~ __]
Out[14]=  True

In[15]:=  StringMatchQ["xy" , "xy" ~~ __]
Out[15]=  False

Match "xy" followed by zero or more characters.

In[16]:=  "xy" ~~ ___;

In[17]:=  StringMatchQ["xyz" , "xy" ~~ ___]
Out[17]=  True

In[18]:=  StringMatchQ["xy" , "xy" ~~ ___]
Out[18]=  True

Patterns can be associated with variables so that the matching portion can be referred to in a subsequent expression. For example, the following pattern will match if the string begins and ends with the same sequence of characters.

Table 5-1 shows some of the common raw ingredients for string expressions. If you have already read Chapter 4 on pattern matching, you can see that all the same constructs are available for strings. The full set of string expression primitives can be found in tutorial/WorkingWithStringPatterns.

Table 5-1. Common string patterns

Pattern	Description
""string"" "a literal string of characters" "_" "any single character" "__" "any substring of one or more characters" "___" "any substring of zero or more characters" "x_,x__,x___" "substrings given the name x" "x:pattern" "pattern given the name x" "pattern.." "pattern repeated one or more times" "pattern..." "pattern repeated zero or more times" "patt1\|patt2\| ..." "a pattern matching at least one of the patt-i" "patt/;cond" "a pattern for which cond evaluates to True" "pattern?test" "a pattern for which test yields True for each character" "Except[pattern]" "matches anything except pattern" "Whitespace" "a sequence of whitespace characters" "NumberString" "the characters of a number" "DatePattern[spec]" "the characters of a date" "charobj" "an object representing a character class (see below)"

Pattern

Description

""string""            "a literal string of characters"
"_"                     "any single character"
"__"                    "any substring of
                          one or more characters"
"___"                   "any substring of
                          zero or more characters"
"x_,x__,x___"           "substrings given the name x"
"x:pattern"             "pattern given the name x"
"pattern.."             "pattern repeated one or more times"
"pattern..."            "pattern repeated zero or more times"
"patt1|patt2| ..."      "a pattern matching
                          at least one of the patt-i"
"patt/;cond"            "a pattern for which
                          cond evaluates to True"
"pattern?test"          "a pattern for which test
                          yields True for each character"
"Except[pattern]"       "matches anything except pattern"
"Whitespace"            "a sequence of whitespace characters"
"NumberString"          "the characters of a number"
"DatePattern[spec]"     "the characters of a date"
"charobj"               "an object representing a
                          character class (see below)"

Table 5-2 shows some of the common raw ingredients for regular expressions. The full set of regular expression primitives can be found in tutorial/WorkingWithStringPatterns. Here c or c n , where n is a number, is a placeholder for an arbitrary character, and p n is a placeholder for an arbitrary regular expression.

Table 5-2. Common regular expressions

Regular expression	Description
"[c1c2c3]" "Matches any of the characters c1, c2, or c3. "[c1-c2]" For example,[AEIOUaeiou] matches vowels." "Matches characters c1 through c2. For example, [a-z] matches all lowercase letters." "[^c1c2c3]" "Matches any characters EXCEPT c1, c2, c3. For example,[^AEIOUaeiou] matches nonvowels." "c" "Zero or more occurrences of character c. Greedy version." "c+" "One or more occurrences of character c. Greedy version." "c?" "The character c or nothing (i.e., zero or one occurrences). Greedy version." "c?" "Lazy version of c*." "c+?" "Lazy version of c+." "c??" "Lazy version of c?." "p1\|p2\|...\|pN" "Matches p1 or p2 or ... pN." "p1p2...pN" "Matches p1, followed by p2, followed by ... pN." "^p1" "Matches p1 only at the start of the string." "p1$" "Matches p1 only at the end of the string." "^p1$" "Matches only if p1 matches the entire string." "\d" "Any digit 0-9" "\s" "Whitespace"

Regular expression

Description

"[c1c2c3]"       "Matches any of the characters c1, c2, or c3.
"[c1-c2]"          For example,[AEIOUaeiou] matches vowels."
                 "Matches characters c1 through c2. For example,
                   [a-z] matches all lowercase letters."
"[^c1c2c3]"      "Matches any characters EXCEPT c1, c2, c3. For
                   example,[^AEIOUaeiou] matches nonvowels."
"c*"             "Zero or more occurrences
                   of character c. Greedy version."
"c+"             "One or more occurrences
                   of character c. Greedy version."
"c?"             "The character c or nothing (i.e., zero
                   or one occurrences). Greedy version."
"c*?"            "Lazy version of c*."
"c+?"            "Lazy version of c+."
"c??"            "Lazy version of c?."
"p1|p2|...|pN"   "Matches p1 or p2 or ... pN."
"p1p2...pN"      "Matches p1, followed by p2, followed by ... pN."
"^p1"            "Matches p1 only at the start of the string."
"p1$"            "Matches p1 only at the end of the string."
"^p1$"           "Matches only if p1 matches the entire string."
"\d"            "Any digit 0-9"
"\s"            "Whitespace"

5.1 Comparing Strings

Problem

You want to compare strings but Less, LessEqual, Greater, and GreaterEqual do not work.

Solution

Use Order[e1, e2], which returns 1 if e1 is before e2, -1 if e1 is after e2, and 0 if they are equal.

In[23]:=  Order["rat", "rate"]
Out[23]=  1

In[24]:=  Order["rat", "cat"]
Out[24]=  -1

Discussion

Most users of Mathematica will not find themselves doing direct string comparison since functions like Sort and Ordering do the right thing. However, if you find yourself needing to use the more natural comparison operators with strings, you can do the following:

In[25]:=  Unprotect[Less, LessEqual, Greater, GreaterEqual];
          Less[s1_String, s2_String] := Order[s1, s2] > 0;
          LessEqual[s1_String, s2_String] := Order[s1, s2] > -1;
          Greater[s1_String, s2_String] := Order[s1, s2] < 0;
          GreaterEqual[s1_String, s2_String] := Order[s1, s2]  < 1;
          Protect[Less, LessEqual, Greater, GreaterEqual];

In[31]:=  "rat" < "cat"
Out[31]=  False

In[32]:=  "cat" < "rat"
Out[32]=  True

In[33]:=  "cat" <= "cat"
Out[33]=  True

5.2 Removing and Replacing Characters from Strings

Problem

You want to strip certain characters (e.g., whitespace) or characters at certain positions from a string. You may also want to replace these characters with other characters.

Solution

Using patterns

StringReplace[] is an extremely versatile function that solves most character-oriented stripping and replacing operations. It supports a very general set of string-substitution rules, including regular expressions and Mathematica-specific string patterns.

Strip all spaces.

Strip leading and trailing whitespace.

Normalize whitespace: strip leading, trailing, and multiple internal whitespace.

Literal string substitution.

Ignore case while matching.

Use Mathematica-specific patterns instead of regular expressions.

Using positions

Sometimes you know exactly where the characters are that you want to remove. In that case, StringDrop[] is a lot more efficient. StringDrop[] takes the string and a second argument, which can be an offset from the front, an offset from the end, specific positions, or a range of positions.

Consider:

In[41]:= myString = "abcdefghijklmnop" ;

Here you drop the first three characters.

In[42]:= StringDrop[myString, 3]
Out[42]= defghijklmnop

Alternatively, you drop the last three characters, like so.

In[43]:= StringDrop[myString, -3]
Out[43]= abcdefghijklm

Drop only the third character, like this.

In[44]:= StringDrop[myString, {3}]
Out[44]= abdefghijklmnop

And drop the third through fifth ("cde"), using a range list.

In[45]:= StringDrop[myString, {3, 5}]
Out[45]= abfghijklmnop

The step size in the range can even be greater than one by specifying it as the third element. Here you specify a step size of two to remove every other character. The -1 upper limit is a convenient way to specify the end of the string without having to know its length.

In[46]:= StringDrop[myString, {l, -1, 2}]
Out[46]= bdfhjlnp

You can also act on several strings at once.

In[47]:= otherString = "1234567890";

In[48]:= StringDrop[{myString, otherString}, {3, 5}]
Out[48]= {abfghijklmnop, 1267890}

The positional form for replacement is called StringReplacePart[], and it works using similar conventions for specifying positions. The difference is that you must always provide a contiguous range or a list of such ranges.

In[49]:= StringReplacePart[myString, "ZZZ", {3, 5}]
Out[49]= abZZZfghijklmnop

In[50]:= StringReplacePart[myString, "ZZZ", {{3, 5}, {10, 15}}]
Out[50]= abZZZfghiZZZp

Each range can also have its own replacement string.

In[51]:= StringReplacePart[myString, {"ZZZ", "WWW"}, {{3, 5}, {10, 15}}]
Out[51]= abZZZfghiWWWp

Discussion

As you can see from the given examples, StringReplace is quite versatile. However, the versatility is derived from Mathematica’s rich support for patterns (see 5.0 Introduction). Here are some typical text-processing problems that yield to the application of StringReplace[] and pattern matching.

Stripping comments

String expression version:

Regular expression version:

Changing delimiters

Delimited text (e.g., comma-delimited text) sounds simple at first, but many delimited formats allow a way to handle the delimiters as regular text by some quoting mechanism, as well as a way to escape quotes themselves. Furthermore, you must handle empty fields. If you want to replace a comma-delimited format with, say, a semicolon-delimited format, you must craft expressions that deal with all cases. Here, "" is used to escape a double quote. This example does not handle empty fields, but see Friedl’s Mastering Regular Expressions for guidance.

Removing XML markup

Simple XML manipulations, such as discarding markup, can be accomplished with StringReplace[].

In[56]:=  NotebookDirectory[]
Out[56]=  /Users/smangano/Documents/workspace/Mathematica Cookbook/mathematica/

In[57]:=  xml = Import[FileNameJoin[
             {NotebookDirectory[], "..", "data", "ch02", "data1.xml"}], "Text"]
Out[57]=  <?xml version="1.0" encoding="UTF-8"?>
          <!-- Some data to use as a test for Mathematica's XML import -->
          <?test Just for didactic purposes?>
          <data>
              <item>
                  <name>Leonardo</name>
                  <sex>male</sex>
                  <age>8</age>
                  <height>4.7</height>
              </item>
              <item>
                  <name>Salvatore</name>
                  <sex>male</sex>
                  <age>5</age>
                  <height>4.1</height>
              </item>
              <item>
                  <name>Alexis</name>
                  <sex>female</sex>
                  <age>6</age>
                  <height>4.4</height>
              </item>
          </data>
          <!-- Comment at end -->

Replacing with expression evaluation

By capturing matched substrings in variables, you can perform expression evaluation using ToExpression[] as you replace.

Here is another example using dates.

5.3 Extracting Characters and Substrings

Problem

You want to extract a substring by position or content from a string.

Solution

Using patterns

StringCases[] provides the pattern-driven means of extracting substrings. There are two major variations. In the first, you simply extract what the patterns literally match. The second variation uses rules to transform the matched substrings into other strings and return those instead.

You can extract specific words using regular expressions (here matches word boundaries).

The same can be done using string expressions.

In[64]:=  StringCases["The pig thought he was a dog and then chased the cat.",
           WordBoundary ~~ {"a", "the"}  ~~ WordBoundary, IgnoreCase -> True]
Out[64]=  {The, a, the}

The most common reason for using rules instead of patterns is to match a substring within a specific context but return the substring alone. Here we want to return substrings bracketed by one or more occurrences of the letter a. This example also illustrates that regular expressions and string expressions can be mixed.

Using positions

Sometimes you know exactly where the characters are that you want to remove. In that case, StringTake[] is a lot more efficient. StringTake[] takes the string and a second argument, which can be an offset from the front, an offset from the end, specific positions, or a range of positions.

Consider:

In[66]:= myString = "abcdefghijklmnop";

Here you take the first three characters.

In[67]:= StringTake[myString, 3]
Out[67]= abc

Alternatively, you take the last three characters, like so.

In[68]:= StringTake[myString, -3]
Out[68]= nop

Take only the third character, like this.

In[69]:= StringTake[myString, {3}]
Out[69]= c

And take the third through fifth ("cde") using a range list.

In[70]:= StringTake[myString, {3, 5}]
Out[70]= cde

The step size in the range can even be greater than one by specifying it as the third element. Here you specify a step size of two to take every other character. The -1 upper limit is a convenient way to specify the end of the string without having to know its length.

In[71]:= StringTake[myString, {1, -1, 2}]
Out[71]= aeegikmo

Conveniently, you can also act on several strings at once.

In[72]:= otherString = "1234567890";

In[73]:= StringTake[{myString, otherstring}, {3, 5}]
Out[73]= {cde, 345}

If you have read 5.2 Removing and Replacing Characters from Strings, you see that StringTake has very similar parameter variations as StringDrop[]. However, StringTake has an additional feature: it can take a list of position specifications and produce a list of the resulting extracts.

In[74]:= StringTake[myString, {{1}, {3}, {8, 10}}]
Out[74]= {a, c, hij}

This is useful for picking multiple segments from a string in one step. However, if you want a string rather than a list, simply wrap the expression in a StringJoin[].

In[75]:= StringJoin[StringTake[myString, {{1}, {3}, {8, lO}}]]
Out[75]= aehij

Discussion

In the Solution section we used RegularExpression["(?<=a)"] (lookbehind) and RegularExpression["(?=a)"] (look-ahead) because there is no stringexpression equivalent. However, there is an option for StringCases[] called Overlaps, which when set to True, causes the matcher to continue at the character that follows the first character of the last matched substring. In the following example, this allows a single a to act as both a start of pattern and end of pattern.

Without Overlaps→True, you would not get the "cbcbd" substring.

There is a third setting, Overlaps→All, which causes the matcher to repeat searches from the same position until no new matches are found. To see the effect of All, we need to consider a different example, one in which the bracketing character is not excluded from the match. A parenthesized expression is a good example.

Out[80]//TableForm=
              ((a-b)
              ((a-b) (c + d)
              ((a-b) (c + d) (e / (f + g)
              ((a-b) (c + d) (e / (f + g))
              ((a-b) (c + d) (e / (f + g)))
              (a-b)
              (a-b) (c + d)
              (a-b) (c + d) (e / (f + g)
              (a-b) (c + d) (e / (f + g))
              (a-b) (c + d) (e / (f + g)))
              (c + d)
              (c + d) (e / (f + g)
              (c + d) (e / (f + g))
              (c + d) (e / (f + g)))
              (e / (f + g)
              (e / (f + g))
              (e / (f + g)))
              (f + g)
              (f + g))
              (f + g)))

5.4 Duplicating a String

Problem

You need to synthesize a string from a fixed number of copies of a seed string.

Solution

Use StringJoin[] on the output of Table[].

   In[81]:=  stringDup[seed_, n_: 2] := StringJoin@Array[seed &, n]

   In[82]:=  stringDup["-", 10] // InputForm
Out[82]//InputForm=
             "----------"

   In[83]:=  stringDup["wiki "]
   Out[83]=  wiki wiki

Discussion

This is a simple recipe, and I include it because it’s something you expect to be bundled as a native function, but it’s not. For most practical applications, the solution is fine, but for very large n, a doubling approach will have better performance. Rather than doing the math to get the exact string size, we simply truncate the closest sized string obtained from pure doubling of the seed.

In[84]:=  stringDup2[seed_, n_] :=
           StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n*StringLength[seed]]

In[85]:=  Mean[Table[Timing[stringDup["-", 100 000]][[1]], {10}]]
Out[85]=  0.0486878

In[86]:=  Mean[Table[Timing[stringDup2["-", 100 000]][[1]], {10}]]
Out[86]=  0.0031014

This solution may not be obvious, so let’s break it down. It should be clear that mapping the function #<>#& to a list containing a string will double that string (recall that <> is string concatenation).

In[87]:= # <> #& /@ {"_"}
Out[87]= {--}

It follows that doing this twice will quadruple it.

In[88]:= # <> #&/@ (# <> #& /@ {"_";})
Out[88]= {----)

Repeating this process m times will create a string of length 2^m. However, the input is the desired length n, not the number of doublings, so we know we need at least Ceiling[Log[2, n]] doublings; by using Nest with this number, we get exactly that. However, this overshoots the desired length in most cases, because we rarely expect n to be an exact power of 2. So we use Take to extract the correct length. The reason this is fast for large n is that it reduces a 0(n) operation in terms of Table to a 0(log n) operation using StringJoin.

You can bundle these versions together into one function that gives good performance across all sizes.

In[89]:=  Clear[stringDup];
          stringDup[seed_String, n_Integer /; n >=2^12] :=
           StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n]
          stringDup[seed_String, n_Integer: 2] :=  StringJoin@[rray[seed &, n]

5.5 Matching and Searching Text

Problem

You want to determine if a string contains a pattern and at what positions.

Solution

Use StringMatchQ[string,pattern] to determine if a string matches a pattern.

In[92]:= StringMatchQ["1234", NumberString]
Out[92]= True

Here I show a match on multiple strings with a pattern that is predicated.

In[93]:=  StringMatchQ[{"1234", "1237"}, p: NumberString /; OddQ [FromDigits [p]]]
Out[93]=  {False, True}

Use StringFreeQ[string,pattern] to determine if a string does not match a pattern.

In[94]:=  StringFreeQ[{"1234", "abcde"}, p: NumberString]
Out[94]=  {False, True}

Use StringPosition[string,pattern] to find the integer offsets of matches. The default behavior is to search for all occurrences of the pattern (i.e., Overlaps → True).

In[95]:=  StringPosition["1234abcd54321", NumberString]
Out[95]=  {{l,4}, (2,4), (3,4), (4,4),
           {9, 13}, {10, 13}, {ll, 13}, {12, 13}, (13, 13)}

With Overlaps → False, you only get matches on substrings that don’t share characters with prior matches.

Discussion

StringMatchQ[] and StringFreeQ[] very often find application in restricting inputs to functions.

In[97]:=  classify [word_String /; StringMatchQ[word,{"I", "me", "we",
                "you", "they", "him", "her", "it"}]] := pronoun[word]
          classify[word_String /; StringMatchQ[word, {"and", "or", "nor",
                "after", "although", "as", "because", "before", "how", "if",
                 "once", "since", "than", "that", "though", "till", "until",
                 "when", "where", "whether", "while"}]] := conjunction[word]
          classify[word_String /; StringMatchQ[word, DatePattern[{"DayName"}]]]  :=
           dayofweek[word]
          classify[word_String /; StringMatchQ[word, DatePattern[{"MonthName"}]]]  :=
           month[word]
          (*...*)
          classify[word_String]  := other[word] ;

You can also use them as input to other functions, like Pick[] in the following grep implementation adapted from an example in Mathematica documentation. Recall that in the standard Unix grep, option -v instructs grep to return lines that don’t match the pattern. Here Transpose and Range are used to number each line so the result contains a list of pairs {line, match text}. This grep function was implemented in terms of StringFreeQ rather than StringMatchQ since the latter only succeeds if the entire string matches.

  In[102]:=  grep[file_, patt_, "-v"]  := grepImpl[file, patt, True ]
             grep[file_, patt_] := grepImpl[file, patt, False]
             grepImpl[file_, patt_, value_] := With[{data =Import[file, "Lines"]},
               Pick[Transpose[{Range[Length[data]], data=],
                StringFreeQ[data, RegularExpression[patt]], value ]]

  In[105]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=], "bar"}] //
              TableForm
Out[105]//TableForm=
             1 bar
             4 foo bar
             5 foobar
             6 barfo

  In[106]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=}], "bar$"]
  Out[106]=  {{1, bar}, {4, foo bar}, {5, foobar}}

  In[107]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"}], "bar", "-v"]
  Out[107]=  {{2, foo}, {3, baz}, {7, fo o}}

Both StringMatchQ[] and StringFreeQ[] support the IgnoreCase → True option. StringMatchQ also supports option SpellingCorrection → True, which allows the match to succeed even if a small number of characters are wrong. However, in many cases a small number can mean only 1, as the following example demonstrates, so I would not rely too heavily on this "feature."

The output of StringPosition[] can be used as the input to StringTake.

In[110]:=  With[{str ="1234abcd54321"},
            StringTake[str, StringPosition[str, NumberString]]]
Out[110]=  {1234, 234, 34, 4, 54321, 4321, 321, 21, 1}

If you want to use it with StringDrop[], you need to map StringDrop[] over the list returned by StringPosition[]. This will give you a list with each matching segment dropped. More than likely, you will want to set Overlaps → False in this case. Try Overlaps → True with the expression given below to see why it is undesirable.

5.6 Tokenizing Text

Problem

You want to break a string into tokens based on a character or pattern.

Solution

StringSplit[] provides a variety of options for tokenizing text. The default is simply to tokenize on whitespace.

In[112]:=  StringSplit["The quick brown fox
jumped over the lazy programmer"]
Out[112]=  {The, quick, brown, fox, jumped, over, the, lazy, programmer}

Other delimiters can be specified as literals or more general patterns. Here you specify comma delimiters with zero or more whitespace characters.

In[113]:=  StringSplit["2008/01/20, test1, 100.3, 77.8,33.77",
            ","~~WhitespaceCharacter ...]
Out[113]=  {2008/01/20, test1, 100.3, 77.8, 33.77}

If there are several delimiters, give each pattern in a list. Here you decide to parse the date along with the comma-delimited text.

In[114]:=  StringSplit["2008/01/20, test1, 100.3, 77.8,33.77",
              {"," ~~WhitespaceCharacter ..., "/"}]
Out[114]=  {2008, 01, 20, test1, 100.3, 77.8, 33.77}

Discussion

StringSplit supports rules as well as patterns, which leads to some interesting applications, such as a means of highlighting output. Here is an example that stylizes XML by rendering directives, comments, and tags in specific font styles and colors. (The colors will not be visible in a monochrome print, but you can try the code on your own to see the effect.)

5.7 Working with Natural Language Dictionaries

Problem

You want to do some simple linguistic processing driven by a reliable lexicon.

Solution

As of version 6, Mathematica comes bundled with many useful data sources. One of these sources is an integrated English language dictionary (dictionaries for other languages can be installed).

Look up words that begin with th and end with y.

In[116]:=  DictionaryLookup["th" ~~ ___ ~~"y"]
Out[116]=  {thankfully, thanklessly, theatricality, theatrically,
            thematically, theocracy, theologically, theology, theoretically,
            theory, theosophy, therapeutically, therapy, thereby, thermally,
            thermodynamically, thermostatically, they, thickly, thievery,
            thingummy, thingy, thinly, thirdly, thirstily, thirsty, thirty, thorny,
            thoroughly, thoughtfully, thoughtlessly, thready, threateningly,
            threepenny, threnody, thriftily, thrifty, thrillingly, throatily,
            throaty, throwaway, thruway, thuggery, thunderously, thundery, thy}

Look up words that end in ee.

In[117]:=  DictionaryLookup[___  ~~  "ee"]
Out[117]=  {absentee, addressee, agree, Aimee, Albee, amputee, apogee, appointee,
            Ashlee, attendee, Attlee, axletree, banshee, bee, bootee, bumblebee,
            bungee, carefree, Chattahoochee, Cherokee, chickadee, chimpanzee,
            coffee, committee, conferee, consignee, coulee, Cree, debauchee, decree,
            Dee, degree, deportee, Desiree, detainee, devotee, disagree, divorcee,
            draftee, Dundee, dungaree, Elysee, emcee, employee, enlistee, entree,
            epee, escapee, evacuee, fat-free, fee, fiancee, filigree, flee, foresee,
            franchisee, free, fricassee, Frisbee, fusee, Galilee, garnishee, gee, ghee,
            glee, goatee, grandee, grantee, guarantee, gumtree, honeybee, honoree,
            Humvee, inductee, internee, interviewee, invitee, jamboree, Jaycee,
            jubilee, kedgeree, Klee, knee, lee, Lee, legatee, Legree, lessee, levee,
            licensee, manatee, marquee, matinee, McGee, McKee, melee, Menominee,
            Milwaukee, mortgagee, Murrumbidgee, Muskogee, nee, negligee, nominee,
            Okeechobee, Okefenokee, oversee, parolee, Pawnee, payee, pedigree, pee,
            peewee, Pelee, perigee, pewee, pharisee, Pharisee, pongee, prithee,
            protegee, puree, puttee, quadtree, ranee, referee, refugee, Renee,
            repartee, retiree, returnee, Rhee, rupee, Sadducee, scree, see, settee,
            Shawnee, Sheree, shoetree, singletree, sirree, Slurpee, soiree, spree,
            squeegee, standee, subcommittee, subtree, suttee, Suwanee, Swanee,
            Tallahassee, tee, Tennessee, tepee, thee, three, toffee, toll-free, topee,
            toupee, towhee, townee, Toynbee, trainee, transferee, tree, trochee,
            Truckee, trustee, Tuskegee, twee, Tweedledee, Tyree, wannabee, wee, whee,
            whiffletree, whippletree, whoopee, Yahtzee, Yankee, yippee, Zebedee}

Discussion

There are a lot of neat applications for an integrated dictionary.

Crossword puzzles

Here is how you might cheat at a crossword puzzle. Say you have three letters of a six-letter word and the clue is "51 down: unkeyed."

In[118]:=  DictionaryLookup["a"  ~~ _ ~~ "o" ~~ _ ~~ _  ~~ "l"]
Out[118]=  {amoral, atonal, avowal}

Ah, atonal sounds right (pun intended)!

Anagrams

You can also help your second grader impress the teacher on that November worksheet for finding all the words you can make out of the letters in "Thanksgiving" (i.e., anagrams). Here we use a pattern containing all combinations of the letters in "thanksgiving" and an extra constraint function to ensure letters are contained by their availability (count). Strictly speaking, an anagram must use all the letters of the input, but I ignore that here.

Using Tally[] to count letter occurrences and doing a bit of set manipulation, we can generalize this for any word. The condition checking for the empty complement at the end is not strictly needed here because we will never match a word in the dictionary that has a letter that is not in the input word. However, it is needed to make the logic if isWordSubsetQ[] is correct as a general predicate.

You can test the generality against other words.

In[125]:=  anagrams["winter"]
Out[125]=  {en, er, in, inert, inter, ire, it, net, new, newt, nit, niter, re, rein, rent,
            rite, ten, tern, ti, tie, tier, tin, tine, tire, twin, twine, twiner, we,
            weir, wen, went, wet, win, wine, winter, wire, wit, wren, writ, write}

In[126]:=  anagrams["dog"]
Out[126]=  {do, dog, go, god}

Palindromes

Here is a neat little palindrome finder (courtesy of the Mathematica documentation).

In[127]:=  DictionaryLookup[x__/; x ===StringReverse[x]}
Out[127]=  {a, aha, aka, bib, bob, boob, bub, CFC, civic, dad, deed, deified,
            did, dud, DVD, eke, ere, eve, ewe, eye, gag, gig, huh, I, kayak,
            kook, level, ma'am, madam, mam, MGM, minim, mom, mum, nan, non, noon,
            nun, oho, pap, peep, pep, pip, poop, pop, pup, radar, redder, refer,
            repaper, reviver, rotor, sagas, sees, seres, sexes, shahs, sis,
            solos, SOS, stats, stets, tat, tenet, TNT, toot, tot, tut, wow, WWW}

Spell-checker

By using all the words in the dictionary with Nearest[], you can create a rudimentary spell-checker. For our first attempt, we’ll use Nearest's default distance function. We’ll return a list for which the first element is True or False depending on the word’s inclusion in the dictionary and the second element is a list of potential correct spellings.

In[128]:=  
             (*Returns a function that, when applied to <word> and an integer <n>, 
           returns a list containing the n words in the integrated dictionary considered
           to be closest to <word>*)

           nf1 = Nearest[DictionaryLookup[]];

           SpellCheck1[word_]  := Module[{corrections = nf1[word, 15]} ,
             If[ MemberQ[ corrections, word], {True, word}, {False, corrections}]]
In[130]:=  SpellCheck1["pickel"]
Out[130]=  {False, {nickel, picked, picker, picket, bicker, dicker, dickey,
             hickey, kicked, kicker, licked, Michel, mickey, Mickey, nicked}}

We see that the default distance function used for strings (EditDistance) does not make the greatest spell-checker: the obvious suggestion of pickle is not among the first 15 nearest words. You can experiment with other distance functions. Here is one that penalizes more heavily for mistakes in consonants than for mistakes in vowels.

Here we test on some commonly misspelled words (according to the Oxford Dictionaries website: http://bit.ly/KuIQ2 ).

In[135]:=  SpellCheck2["accomodate"]
Out[135]=  {False, {accommodate, accommodated, accommodates, accumulate, accelerate,
             accentuate, acclimate, accolade, accommodation, accordant}}

In[136]:=  SpellCheck2["alcahol"]
Out[136]=  {False, {alcohol, alcohols, alcoholic,
             achoo, ahchoo, algal, anchor, carol, lethal, local}}

In[137]:=  SpellCheck2["mispell"]
Out[137]=  {False, {misspell, Aspell, Ispell, miscall,
             respell, spell, dispel, dispels, misdeal, misplay}}

This returns useful results, but performance (speed) is poor.

In[138]:=  SpellCheck2["pickel"]  // Timing
Out[138]=  {2.22533, {False, {nickel, picked, picker,
              picket, pickle, packed, packer, packet, pecked, pick}}}

We can improve the speed using a divide-and-conquer approach: pick a large but manageable number (e.g., 100) of nearest words according to simple EditDistance, and then do a second pass on the smaller set with the EditDistance sans vowels. We define a distance function called ConsonantDistance[] for the second pass.

Good results and about 43 times faster!

Mathematica also provides WordData[], which returns information about properties of words, such as parts of speech and definitions.

In[142]:=  WordData["run"]
Out[142]=  {{run, Noun, Score}, {run, Noun, Travel}, {run, Noun, RegularTrip},
            {run, Noun, ShortTrip}, {run, Noun, FootballPlay}, {run, Noun, Endeavor},
            {run, Noun, Successiveness}, {run, Noun, Flow}, {run, Noun, Damage},
            {run, Noun, Footrace}, {run, Noun, Campaign}, {run, Noun, Streak},
            {run, Noun, Stream}, {run, Noun, IndefiniteQuantity},
            {run, Noun, Liberty}, {run, Noun, TimePeriod}, {run, Verb, Disintegrate},
            {run, Verb, SplitUp}, {run, Verb, Dissolve}, {run, Verb, Treat},
            {run, Verb, Change}, {run, Verb, Get}, {run, Verb, Vie}, {run, Verb, Race},
            {run, Verb, Catch}, {run, Verb, Draw}, {run, Verb, Operate},
            {run, Verb, Function}, {run, Verb, CarryThrough}, {run, Verb, Play},
            {run, Verb, Circularize}, {run, Verb, Trip}, {run, Verb, GoThrough},
            {run, Verb, Hurry}, {run, Verb, TravelRapidly}, {run, Verb, Sport},
            {run, Verb, Accompany}, {run, Verb, Sail}, {run, Verb, SpreadOut},
            {run, Verb, Flow}, {run, Verb, GoAway}, {run, Verb, Displace},
            {run, Verb, MoveFreely}, {run, Verb, Trade}, {run, Verb, Loose},
            {run, Verb, Direct}, {run, Verb, Succeed}, {run, Verb, Implement},
            {run, Verb, Occur}, {run, Verb, Continue},{run, Verb, Endure},
            {run, Verb, Extend}, {run, Verb, MakePass}, {run, Verb, Lean},
            {run, Verb, Incur}, {run, Verb, Go}, {run, Verb, Range}}

5.8 Importing XML

Problem

You want to import and manipulate XML data in Mathematica.

Solution

Use Import[] with format "XMLObject" to import XML and convert it to a special Mathematica expression form. Consider the following XML in file datal.xml (available for download at the book’s website).

<?xml version="1.0" encoding="UTF-8"?>
<!-- Some data to use as a test for Mathematica's XML import -->
<?test Just for didactic purposes?>
<data>
    <item>
        <name>Leonardo</name>
        <sex>male</sex>
        <age>8</age>
        <height>4.7</height>
    </item>
    <item>
        <name>Salvatore</name>
        <sex>male</sex>
        <age>5</age>
        <height>4.1</height>
    </item>
    <item>
        <name>Alexis</name>
        <sex>female</sex>
        <age>6</age>
        <height>4.4</height>
    </item>
</data>

Discussion

Mathematica imports XML into expression form. You can manipulate the expression just like you would any other Mathematica expression, but first you need to understand the structure, which is a bit unconventional. Mathematica uses two types of heads to encode XML. XMLObject[ "type" ] is used to represent everything that is not an element, including the entire document (type = "Document"), comments (type = "Comment}), CDATA sections (type = "CDATASection"), processing instructions (type = "Processinglnstruction"), declarations (type = "Declaration), and document types (type = "Doctype"). In the XML above, you see examples for document, declaration, comment, and processing instruction. XMLElement[tag,{attr1→val1,...},{data1,...}] is used to represent element data for both simple (text values) and complex element types (those with child elements). Don’t get tripped up by the XMLObject notation. The entire syntax XMLObject["type"] is the head of the expression, while the remainder is a sequence of one or more arguments that depends on the type.

  In[144]:=  Head[data] // InputForm
Out[144]//InputForm=
             XMLObject["Document"]

The document version consists of three arguments: a list containing the declaration and possibly other objects, the document content, and a list of any objects (such as comments) that might appear past the last XML element. A very crude way to access structure is through Part[] or, equivalently, [[n]].

Pattern matching is much more elegant and more resilient to changes in document structure. Here we extract male elements using Cases with a pattern and an infinite level specification. This is roughly equivalent to using XPath in native XML processing.

  In[150]:=  Cases[data, XMLElement[_, _, {_, XMLElement["sex", _, {"male"}], ___}],
               Infinity] // TableForm
Out[150]//TableForm=
             XMLElement[item, {}, {XMLElement[name, {}, {Leonardo}], XMLElement[sex,
              {}, {male}], XMLElement[age, {}, {8}], XMLElement[height, {}, {4.7}]}]
             XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}], XMLElement[sex,
              {}, {male}], XMLElement[age, {}, {5}], XMLElement[height, {}, {4.1}]}]

Sometimes, the XMLObject and XMLElement notation can be a bit too heavy, and it is easier to work with simple nested lists. This can be done with Apply plus List, specifying all levels.

In[151]:=  list = Apply[List, data, {0, Infinity}]
Out[151]=  {{{{Version, 1.0}, {Encoding, UTF-8}},
             { Some data to use as a test for Mathematica's XML import },
             {test, Just for didactic purposes}}, {data, {},
             {{item, {}, {{name, {}, {Leonardo}}, {sex, {}, {male}}, {age, {}, {8}},
                {height, {}, {4.7}}}}, {item, {}, {{name, {}, {Salvatore}},
                {sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}},
              {item, {}, {{name, {}, {Alexis}}, {sex, {}, {female}},
                {age, {}, {6}}, {height, {}, {4.4}}}}}}, {{  Comment at end }}}

This can shorten the patterns needed to extract content.

In[152]:=  Cases[list, {___, {"sex", _, {"male"}}, ___}, Infinity]
Out[152]=  {{{name, {}, {Leonardo}}, {sex, {}, {male}},
             {age, {}, {8}], {height, {}, {4.7}}}, {{name, {}, {Salvatore}},
             {sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}}

Another useful transformation is to change all heads to the symbolic form of the element tag. Here we use //. (ReplaceRepeated) with rules that strip XMLObject and convert XMLElement expressions. I show the output in tree form to make it clear what this transformation does.

Note

When converting strings to symbols, you need to be cognizant of whether a symbol already exists and has a value. This bit me when I was preparing this recipe, because I failed to recognize that the top-level element tag name was "data," which, of course, turned out to be the name of the variable that I was transforming. Infinite recursion! The solution was to include the transformation from XMLElement["data", attrs_, content_] to XMLElement["items", attrs, content] as the first transformation.

5.9 Transforming XML Using Patterns and Rules

Problem

You want to transform imported XML into something more suitable to mathematical manipulation.

Solution

The format of imported XML is a bit heavy. You use pattern matching and ReplaceAll to transform it into something more digestible. Here we take our row-oriented XML data into a simple matrix.

This technique has two basic steps. First, you use Cases to extract the relevant elements. Second, you use a series of one or more transformations to massage the data into the form you want. In the first transformation, elements are taken to primitive values. Here you rely on the column position to determine when strings need conversion into numbers via ToExpression[]. The second transformation strips out the remaining XMLElement content. Until you have some experience with these types of transformations it is unlikely that you’ll whip them up off the top of your head. The final form of this transformation reflects the fact that I developed it in stages. Here are the successive refinements.

Choose the relevant elements.

In[156]:=  Cases[data , XMLElement["item", _, _], Infinity]
Out[156]=  {XMLElement[item, {},
             {XMLElement[name, {}, {Leonardo}], XMLElement[sex, {}, {male}],
              XMLElement[age, {}, {{8}], XMLElement[height, {}, {4.7}]}],
            XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}],
              XMLElement[sex, {}, {male}], XMLElement[age, {}, {5}],
              XMLElement[height, {}, {4.1}]}], XMLElement[item, {},
             {XMLElement[name, {}, {[Alexis}], XMLElement[sex, {}, {female}],
              XMLElement[age, {}, {6}], XMLElement[height, {}, {4.4 }]}]}

Strip out the data-level XML structure.

Strip out the row-level XML structure, leaving the data in matrix form but all the primitive values as strings.

Finally, do the type conversion.

Discussion

There are always many ways to solve the same transformation problem. The tradeoffs involve brevity, clarity, generality, and performance. The solution has clarity, because it accomplishes the transformation in a step-by-step fashion. However, it is neither brief nor very general. The following transformation does the same thing but is more general. It will work on any two-level XML document because it does not match on specific element names (like "item"). It also does not hardcode which columns contain numeric data. However, it is a bit more cryptic because it does not mention XMLElement at all. Rather, it immediately converts the data to a list (using Apply with head List), and it uses [[n]] to pick out the relevant items.

I demonstrate the generality by processing an XML file with a different number of rows, columns, and data types.

XML-to-XML transformations

You may find that you need to transform XML for reasons other than using the data in Mathematica. Unless you already know a language specifically designed for this purpose (like XSLT), Mathematica is a good choice. Mathematica’s pattern-matching capabilities are well suited to many types of XML transformations. Consider the problem of converting elements to attributes.

It is a bit easier to see how this worked by converting back to XML text. The stripping of carriage returns () is only for formatting purposes.

A transformation from attributes to elements follows similar lines. The use of Join[] here is not strictly necessary, but it shows you how to handle cases in which you don’t want to lose preexisting child elements at the point where you are injecting attribute content.

5.10 Transforming XML Using Recursive Functions (à la XSLT)

Problem

The pure pattern-based approach of 5.9 Transforming XML Using Patterns and Rules is too awkward, cryptic, or complex for your particular transformation problem.

Solution

Consider using an approach inspired by Extensible Stylesheet Language Transforms (XSLT). XSLT is a language that is specifically designed to transform XML. There are some rough similarities between XSLT and a style of Mathematica programming that exploits functions, patterns, and recursion. Here is how you use Mathematica to process XML in ways similar to XSLT. Consider the 5.9 Transforming XML Using Patterns and Rules transformation of elements to attributes. Rather than rely on replacement, we use a set of mutually recursive functions with patterns to navigate the XML tree while surgically inserting transformations at the correct places.

Discussion

A natural objection to using this style of transformation rather than using replacement rules is that it is more verbose. This verbosity comes with some advantages. The first advantage is that when things go wrong, it is generally easier to debug a set of discrete functions than a replacement pattern. Most of the action of a replacement pattern is happening under the covers. The second advantage comes in cases where you need to make many changes at different levels in the XML hierarchy. Here the overhead of the recursive approach is less bothersome. We implement a transformation that changes elements to attributes, renames the "item" element to "row", changes "sex" to "gender", and converts the height from feet to meters—all with very little extra overhead.

One of the first things you learn about XSLT is that if you create an empty stylesheet (XSLT’s equivalent of a program), you get some default transformation rules that act to output just the text nodes of the XML data. We can emulate that behavior in Mathematica with the following functions.

In[187]:=  ClearAll[transform]
           transform[XMLObject[type_][content__]] :=
            StringJoin[transform[#] & /@ List[content]]
           transform[XML]lement[tag_, attrs_List, data_List]] :=
            StringJoin[transform[#] & /@ data ]
           transform[text_String] := text
           transform[_] := ""
In[192]:=  transform[data]
Out[192]=  Leonardomale84.7Salvatoremale54.1Alexisfemale64.4

So far, so good, but can we do something more interesting? Suppose we want to clone our XML document but replace all occurrences of the element "sex" with the element "gender".

This recursive transformational approach is overkill in this scenario since we can more easily express this transformation using ReplaceAll.

There are certain types of structure-adding transformations that were difficult to do in XSLT until a grouping construct was added (xsl:for-each-group) in XSLT 2.0. Here is a solution to a grouping problem using Mathematica’s Sort[] and Split[] functions.

The goal of this transformation is to group all employees in the same department under a new element <Dept dept="num">. Notice how this is accomplished with little additional code. Helper functions define an ordering and an equivalence relation for Sort and Split, respectively, and a transform[] applies the additional level of grouping when it matches the "employees" element.

Of course, there are significant differences between these transformations and XSLT. For example, in XSLT, you operate on a tree and, hence, can navigate upward from child elements to parent elements. This is not the case for Mathematica’s representation of XML. The tutorial mentioned in the following See Also section provides some guidance for working around these issues.

5.11 Writing Parsers and Grammars in Mathematica

Problem

You want to write a parser in Mathematica.

Solution

The easiest type of parser to write in Mathematica is a recursive descent parser. Before writing the parser, we need to know the grammar of the language we will parse. The most common notation for grammars is Backus-Naur Form (BNF), but for reasons that will become apparent in the discussion, I use Mathematica itself to represent the grammar. For this example, I use a simplified English grammar. The presentation here is a variation of one developed and given by Daniel Lichtblau of Wolfram Research at the Wolfram Developer’s Conference in 1999. Refer to the See Also section for more information.

First, we need some helper functions to make creating the grammar easier. We use two functions, sequence and choose, with attribute HoldAll to prevent them from evaluating their arguments and causing an infinite recursion. As its name would suggest, sequence[] represents a sequence of terms of the grammar. Choose represents a choice of one out of two or more possible terms. I allow choose to take an extra argument, which is a list of probabilities for the choices. More on that later.

In[212]:=  SetAttributes[{sequence, choose}, HoldAll]
           NILL = "";

This grammar is for a small subset of English.

In[214]:=  sentence := choose[declarative, interrogative, imperative]
           declarative := sequence[subject, predicatepast]
           interrogative := sequence[qverb, subject, predicatepresent]
           imperative := sequence[actverb, subject]
           subject := choose[nounclause, sequence[nounclause, prepositionclause]]
           nounclause := sequence[adjectiveclause, noun]
           noun = {"skyscraper", "ball", "dog", "cow", "shark", "attorney", "hatter",
              "programmer", "city", "village", "buffalo", "moon", "librarian", "sheep"} ;
           adjectiveclause := sequence[article, adjectivelist]
           adjectivelist := choose[NILL, sequence[adjective, adjectivelist] , {0.7}]
           article = {"a", "the", "this", "that"};
           adjective =
             {"big", "wet", "mad", "hideous", "red", "repugnant", "slimy", "delectable",
              "mild-mannered", "lazy", "silly", "crazy", "ferocious", "cute"} ;
           prepositionclause := sequence[preposition, nounclause]
           preposition = {"in", "above", "under", "from", "near", "at", "with"} ;
           predicatepresent := sequence[verbpresent, subject]
           predicatepast := sequence[verbclause, subject]
           verbclause := sequence[adverblist, verbpast]
           adverblist := choose[NILL, sequence[adverb, adverblist ], {0.6}]
           adverb=
             {"swiftly", "unflinchingly", "smugly", "selflessly", "oddly", "mightily"} ;
           verbpast = {"ate", "threw", "gnashed", "boiled",
              "grated", "milked", "spanked", "jumped"};
           verbpresent= {"eat", "throw", "gnash", "boil", "grate",
              "milk", "spank", "salivate", "jump"};
           qverb = {"did", "will", "could", "should"} ;
           actverb= {"break","fix", "launch", "squeeze", "fetch"} ;

This grammar becomes the specification for our parser. Recursive descent parsers are probably the easiest parsers to craft by hand because their structure mimics the grammar. The goal of this parser is to create a labeled parse tree from a sentence. The parser is very simple: it contains no provision for error handling and relies on the grammar being completely conflict free. For example, the major sentence types are completely determined by the first word. Real languages or even artificial languages (like programming languages) are rarely that clean.

In[237]:=  (*Test for membership of a terminal
            symbol in a list of terminal symbols.*)
           isQ[type_, word_]:= MemberQ[type, word]

           (*Get next word for parser.*)
           getNextWord[{}] := ""
           getNextWord[words_List] := First[words]

           (*Parse a single word, classifying it as head, and return length of 1.*)
           atomParse[head_, words_List] := {head[getNextWord[words]], 1}

           (*Top level parse function for
            sentences. Dispatches based on first word.*)
           sentenceParse[sentence_sentenceType] :=
            Module[{sentencelist =Apply[List, sentence], firstWord },
             firstWord = First[sentencelist];
             If[isQ[qverb, firstWord], interrogativeParse[sentencelist],
                If[isQ[actverb, firstWord], imperativeParse[sentencelist],
                declarativeParse[sentencelist]]]]

           (*declarative := sequence[subject, predicatepast]*)
           declarativeParse[words_List]:=
            Module[{subject =subjectParse[words], predicate},
             predicate=predicatepastParse[Drop[words, subject[[2]]]];
             "DECLARATIVE SENTENCE"[subject[[1]], predicate[[1]]]]

           (*interrogative := sequence[qverb, subject, predicatepresent]*)
           interrogativeParse[words_List] :=
            Module[{qverb =atomParse["QUESTION VERB", words], subject, predicate},
             subject=subjectParse[Drop[words, qverb[[2]]]];
             predicate=predicatepresentParse[
               Drop[words, qverb[[2]] +subject[[2]]]];
             "INTERROGATIVE SENTENCE"[qverb[[1]], subject[[1]], predicate[[1]]]]
             (**)

           (*imperative := sequence[actverb, subject]*)
           imperativeParse[words_List] :=
            Module[{actverb =atomParse["ACTION VERB", words], subject},
             subject = subjectParse[Drop[words, actverb[[2]]]];
             "IMPERATIVE SENTENCE"[actverb[[1]], subject[[1]]]]
          (*subject :=
           choose[nounclause, sequence[nounclause, prepositionclause]]*)
          subjectParse[words_List] :=
           Module[{nounclause =nounclauseParse[words], prepositionclause},
            prepositionclause=Drop[words, nounclause[[2]]];
            If[! isQ[preposition, getNextWord[prepositionclause]],
             {"SUBJECT"[nounclause[[1]]], nounclause[[2]]},
             prepositionclause=prepositionclauseParse[prepositionclause];
             {"SUBJECT"[nounclause[[1]], prepositionclause[[1]]],
             nounclause[[2]] +prepositionclause[[2]]}]]

         (*predicatepast :=sequence[verbclause,subject]*)
         predicatepastParse[words_List] :=
          Module[{verbclause = verbclauseParse[words], subject},
           subject = subjectParse[Drop[words, verbclause[[2]]]];
           {"PREDICATE"[verbclause[[1]], subject[[1]]],
            verbclause[[2]] + subject[[2]]}]

        (*predicatepresent:=sequence[verbpresent,subject]*)
        predicatepresentParse[words_List] :=
         Module[{verb =atomParse["VERB (PRESENT TENSE)", words], subject},
          subject=subjectParse[Drop[words, verb[[2]]]];
          {"PREDICATE"[verb[[1]], subject[[1]]], verb[[2]] +subject[[2]]}]

        (*verbclause:=sequence[adverblist,verbpast]*)
        verbclauseParse[words_List] :=
         Module[{adverbs =adverblistParse[words], verb},
          verb=atomParse["VERB (PAST TENSE)", Drop[words, adverbs[[2]]]];
          If[adverbs[[2]] == 0, verb,
           {"VERB CLAUS]"[adverbs[[1]], verb[[1]]], adverbs[[2]] + verb[[2]]}]]

        (*nounclause:= sequence[adjectiveclause, noun]*)
        nounclauseParse[words_List] :=
         Module[{adjectiveclause = adjectiveclauseParse[words], noun},
          noun = atomParse["NOUN", Drop[words, adjectiveclause[[2]]]];
          {"NOUN CLAUS]"[adjectiveclause[[1]], noun[[1]]],
           adjectiveclause[[2]] +noun[[2]]}]

        (*adjectiveclause := sequence[article, adjectivelist]*)
        adjectiveclauseParse[words_List] :=
         Module[{art =atomParse["ARTIC)]", words], adjlist},
          adjlist=adjectivelistParse[Drop[words, art[[2]]]];
         If[adjlist[[2]] ==0, art,{"ADJECTIVE CLAUSE"[art[[1]], adjlist[[1]]],
           art[[2]] +adjlist[[2]]}]]

        (* Parse (possibly empty) list of adjectives.*)
        (*adjectivelist :=
         choose[NILL, sequence[adjective, adjectivelist] , {0.7}]* )
        adjectivelistParse[words_List] :=
         Module[{words2 =words, adj, result, len=0}, result ="ADJECTIVE LIST"[];
          While[isQ[adjective, getNextWord[words2]],
           adj=atomParse["ADJECTIVE", words2];
           len+=adj[[2]];
           result="ADJECTIVE LIST"[result, adj[[1]]];
           words2=Drop[words2, adj[[2]]]];
          {Flatten[result, Infinity, "ADJECTIVE LIST"], len}]

        (*prepositionclause := sequence[preposition, nounclause]*)
        prepositionclauseParse[words_List] :=
         Module[{preposition =atomParse["PREPOSITION", words], nounclause},
          nounclause=nounclauseParse[Drop[words, preposition[[2]]]];
          {"PREPOSITION CLAUSE"[preposition[[1]], nounclause[[1]]],
           preposition[[2]] +nounclause[[2]]}]

       (* Parse Ipossibly empty M list of adverbs.*)
       (*adverblist := choose[NILL, sequence[adverb,adverblist], {0.6}]*)
       adverblistParse[words_List] :=
        Module[{words2 =words, adv, result, len=0}, result="ADVERB LIST"[];
         While[isQ[adverb, getNextWord[words2]],
          adv=atomParse["ADVERB", words2];
          len+=adv[[2]];
          result="ADVERB LIST"[result, adv[[1]]];
          words2=Drop[words2, adv[[2]]]];
         {Flatten[result, Infinity, "ADVERB LIST"], len}]

We can test the parser on a sentence that conforms to the grammar.

In[254]:=  sentenceParse[
            sentenceType["will", "the", "wet", "programmer", "spank", "the", "moon"]]
Out[254]=  INTERROGATIVE SENTENCE[QUESTION VERB[will],
            SUBJECT[NOUN CLAUSE[ADJECTIVE CLAUSE[ARTICLE[the],
               ADJECTIVE LIST[ADJECTIVE[wet]]], NOUN[programmer]]],
            PREDICATE[VERB (PRESENT TENSE)[spank],
             SUBJECT[NOUN CLAUSE[ARTICLE[the], NOUN[moon]]]]]

Discussion

You may wonder why I took the trouble to specify the grammar using Mathematica if I was going to write the parser by hand. First, I did not write this parser; I just prettied up a parser written by Daniel Lichtblau! The more serious answer is that the grammar can be used to easily create a language generator to go along with the parser. The generator is very useful for testing the parser. Here I based a generator on Lichtblau’s implementation but made some significant improvements. The first improvement is that my implementation is more declarative than procedural because it leverages Mathematica’s pattern matching. The second improvement is that the generator absorbs all the complexity so the grammar can remain very clean. In Lichtblau’s original grammar, the representation was soiled by the presence of programmatic constructs, like Hold[] and his implementation of random choice. Other than the presence of probabilities, the grammar in the preceding Solution section is completely clean. In fact, it reads as easy as BNF. Refer to the URL in the See Also section to compare this implementation with the original.

In[255]:=  <<Combinatorica`
           (*needed for BinarySearch[]*)
           (*randomChoose[parts_List,probs_List]  selects an item from
            parts_List based on a list of probabilities the length of
            which must be one less than the number of parts and the sum
            of which is less than one. The interpretation is that each
            probability corresponds to the probability of the item in the same
            position except for the last item, which gets the residual.*)
           randomChoose[parts_List, probs_List]:= Module[{weights, test, pos},
             weights = N[Append[FoldList[Plus, First[probs], Rest[probs]], 1]];
             test = RandomReal[]; pos = Ceiling[BinarySearch[weights, test]];
             parts[[pos]]]
           (*randomPart[]  is responsible for interpreting the grammar in
            a random manner. There is a variation for each possible term,
           and recursion is used to expand nonterminals.*)
           randomPart[sequence[parts__ ]]:= randomPart[#] & /@ List[parts]
           randomPart[choose[parts__, probs_List]]  :=
            Union[Flatten[List[randomPart[randomChoose[List[parts], probs  ]]]]]
           randomPart[choose[parts__ ]]:= Module[{partList, numParts},
             partList = List[parts]; numParts =Length[partList];
             randomPart[randomChoose[partList, Table[1/numParts, {numParts -1}]]]]
           randomPart[terminals_List] :=
            terminals[[ RandomInteger[ {1, Length[terminals ]}] ]]
           randomPart[NILL ] := {}
           (*randomSentence[]  is the entry point for
            generating a random sentence of the grammar.* )
           randomSentence [] := sentenceType @@ Flatten[randomPart[sentence]]
           (* We provide a nice textual formatting for
            sentences that also takes care of punctuation.* )
           Format[sentence_sentenceType]:=
            Module[{word =First[sentence], words, punc},
             words=Map[StringJoin[#, " "] &, sentence];
             punc=If[isQ[qverb, word], "?", If[isQ[actverb, word], "!", "."]];
             words[[Length[words]]] =StringReplacePart[Last[words], punc, -1];
             words[[1]] =StringReplacePart[First[words],
                ToUpperCase[StringTake[First[words], 1]], 1];
             Apply[StringJoin, words ]]

Here you can see the result of generating 10 random sentences. They are, for the most part, utter gibberish, but some are kind of funny. They all conform to the grammar, as we can see by running them through the parser.

The parser we wrote by hand is an instance of a predictive recursive descent parser because it looks ahead wherever there is a choice so that it does not take a wrong path through the grammar. In contrast, a backtracking parser simply starts over from where it left off if a particular parse path fails. If you are ambitious, you can continue this recipe and write a backtracking parser generator in Mathematica. The references in the following See Also section provide some background.

Table of Contents for 5. String and Text Processing

Create new playlist

Sign In

Sign Up

Chapter 5. String and Text Processing

5.0 Introduction

Characters and Character Encodings

String and Regular Expressions

String expressions

See Also

5.1 Comparing Strings

Problem

Solution

Discussion

5.2 Removing and Replacing Characters from Strings

Problem

Solution

Using patterns

Using positions

Discussion

Stripping comments

Changing delimiters

Removing XML markup

Replacing with expression evaluation

See Also

5.3 Extracting Characters and Substrings

Problem

Solution

Using patterns

Using positions

Discussion

See Also

5.4 Duplicating a String

Problem

Solution

Discussion

See Also

5.5 Matching and Searching Text

Problem

Solution

Discussion

See Also

5.6 Tokenizing Text

Problem

Solution

Discussion

5.7 Working with Natural Language Dictionaries

Problem

Solution

Discussion

Crossword puzzles

Anagrams

Palindromes

Spell-checker

See Also

5.8 Importing XML

Problem

Solution

Discussion

Note

See Also

5.9 Transforming XML Using Patterns and Rules

Problem

Solution

Discussion

XML-to-XML transformations

See Also

5.10 Transforming XML Using Recursive Functions (à la XSLT)

Problem

Solution

Discussion

See Also

5.11 Writing Parsers and Grammars in Mathematica

Problem

Solution

Discussion

See Also

Table of Contents for
5. String and Text Processing