Someone will call Something will fall And smash on the floor Without reading the text Know what comes next Seen it before And it’s painful Things must change We must rearrange them Or we’ll have to estrange them All that I’m saying The game’s not worth playing Over and over again
—Depeche Mode, "The Sun and the Rainfall"
Users who come to Mathematica for its superior mathematical capabilities are pleasantly surprised to find strong abilities in programming areas outside of mathematics proper. This is certainly true in the area of textual and string processing. Mathematica’s rich library of functions for string and structured text manipulation rivals Java, Perl, or any other modern language you can tie a string around.
The sections in this introduction provide information on some of the basic tools of strings and string manipulation.
Mathematica uses Unicode internally, but externally (e.g., when saving a notebook) it uses ASCII codes, encoding non-ASCII characters in a special form.
For example, lowercase Greek letters and other non-ASCII
characters are encoded using backslash-bracketed character names
([name]
).
In[1]:= alpha = "α"
Out[1]= α
The function ToString
will translate strings using different encoding schemes.
The default character encoding used by Mathematica is stored in
$CharacterEncoding
, and the native
character encoding of the underlying operating system Mathematica is
running is stored in $SystemCharacterEncoding
. All available
encodings are stored in $CharacterEncodings
.
In[3]:= $CharacterEncoding Out[3]= UTF-8 In[4]:= $SystemCharacterEncoding Out[4]= UTF-8 In[5]:= Partition[$CharacterEncodings, 4] // TableForm Out[5]//TableForm= AdobeStandard ASCII CP936 CP949 CP950 Custom EUC-JP EUC IBM-850 ISO10646-1 ISO8859-10 ISO8859-11 ISO8859-13 ISO8859-14 ISO8859-15 ISO8859-16 ISO8859-1 ISO8859-2 ISO8859-3 ISO8859-4 ISO8859-5 ISO8859-6 ISO8859-7 ISO8859-8 ISO8859-9 ISOLatin1 ISOLatin2 ISOLatin3 ISOLatin4 ISOLatinCyrillic Klingon koi8-r MacintoshArabic MacintoshChineseSimplified MacintoshChineseTraditional MacintoshCroatian MacintoshCyrillic MacintoshGreek MacintoshHebrew MacintoshIcelandic MacintoshKorean MacintoshNonCyrillicSlavic MacintoshRomanian MacintoshRoman MacintoshThai MacintoshTurkish MacintoshUkrainian Math1 Math2 Math3 Math4 Math5 Mathematica1 Mathematica2 Mathematica3 Mathematica4 Mathematica5 Mathematica6 Mathematica7 PrintableASCII ShiftJIS Symbol Unicode UTF8 WindowsANSI WindowsBaltic WindowsCyrillic WindowsEastEurope WindowsGreek WindowsThai WindowsTurkish ZapfDingbats
Notice how UTF-8 needs two bytes to display alpha.
ToCharacterCode
gives the
numerical representation.
You can map from character codes back to characters
using FromCharacterCode[]
.
In[8]:= FromCharacterCode[{87, 88, 89, 90}]
Out[8]= WXYZ
The mapping may not be one-to-one for certain encodings.
In[9]:= FromCharacterCode[{206, 177}, "UTF8"]
Out[9]= α
A great deal of Mathematica’s prowess in text processing comes
from its rich support for pattern matching. There are two basic
classes of string patterns: string expressions and regular
expressions. Introduced in version 5.1, each has
a similar expressive power. The advantage of StringExpression
is that it is less cryptic
because it uses more words than symbols to express patterns. The
advantage of RegularExpression
is
that it is more standardized with other languages such as Perl, Ruby,
Java, and so on. Non-Mathematica programmers, especially those with a
background in Unix, are more likely to understand regular expressions,
although these expressions are cryptic to the uninitiated. You should
become familiar with both if you plan to do much string manipulation.
If you program frequently in languages outside of Mathematica, master
the regular expression syntax. If you work strictly in Mathematica,
choose the one that most appeals to you. If you learn the string
expression syntax, you will have a leg up on learning Mathematica’s
more general pattern-matching syntax, which is used in many contexts
outside text processing. You can also mix string expressions and
regular expressions into compound patterns.
StringExpressions
are
mostly written using the infix operator ~~
, which is a syntactic shortcut for the
StringExpression[]
function.
StringExpression
uses
Mathematica’s blanks notation (e.g., _, __, and ___) to represent
wild cards. See Chapter 4 for more on
blanks.
Match "xy"
followed by any
character.
In[10]:= "xy" ~~ _; In[11]:= StringMatchQ["xyz" , "xy" ~~ _] Out[11]= True In[12]:= StringMatchQ["xyzz" , "xy" ~~ _] Out[12]= False
Match "xy"
followed by one
or more characters.
In[13]:= "xy" ~~ __; In[14]:= StringMatchQ["xyzz" , "xy" ~~ __] Out[14]= True In[15]:= StringMatchQ["xy" , "xy" ~~ __] Out[15]= False
Match "xy"
followed by zero
or more characters.
In[16]:= "xy" ~~ ___; In[17]:= StringMatchQ["xyz" , "xy" ~~ ___] Out[17]= True In[18]:= StringMatchQ["xy" , "xy" ~~ ___] Out[18]= True
Patterns can be associated with variables so that the matching portion can be referred to in a subsequent expression. For example, the following pattern will match if the string begins and ends with the same sequence of characters.
Table 5-1 shows some of the common raw ingredients for string expressions. If you have already read Chapter 4 on pattern matching, you can see that all the same constructs are available for strings. The full set of string expression primitives can be found in tutorial/WorkingWithStringPatterns.
Table 5-1. Common string patterns
Table 5-2 shows some of
the common raw ingredients for regular expressions. The full set of
regular expression primitives can be found in
tutorial/WorkingWithStringPatterns. Here
c
or c
n
, where n
is a number, is a
placeholder for an arbitrary character, and p
n
is a placeholder for an
arbitrary regular expression.
Table 5-2. Common regular expressions
The definitive reference on regular expressions is Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O’Reilly). If you plan to do anything nontrivial using regular expression matching, you will save yourself hours of frustration by consulting this book.
An excellent tutorial on working with string patterns in Mathematica can be found in the documentation under tutorial/WorkingWithStringPatterns or online at http://bit.ly/yGbND . Besides being a good all-around tutorial, it has a section specifically targeting Perl programmers, which is helpful for those who already have experience with string manipulation in Perl.
Use Order[e1, e2]
, which
returns 1 if e1 is before e2, -1 if e1 is after e2, and 0 if they are
equal.
In[23]:= Order["rat", "rate"] Out[23]= 1 In[24]:= Order["rat", "cat"] Out[24]= -1
Most users of Mathematica will not find themselves doing direct
string comparison since functions like Sort
and Ordering
do the right thing. However, if you
find yourself needing to use the more natural comparison operators
with strings, you can do the following:
In[25]:= Unprotect[Less, LessEqual, Greater, GreaterEqual]; Less[s1_String, s2_String] := Order[s1, s2] > 0; LessEqual[s1_String, s2_String] := Order[s1, s2] > -1; Greater[s1_String, s2_String] := Order[s1, s2] < 0; GreaterEqual[s1_String, s2_String] := Order[s1, s2] < 1; Protect[Less, LessEqual, Greater, GreaterEqual]; In[31]:= "rat" < "cat" Out[31]= False In[32]:= "cat" < "rat" Out[32]= True In[33]:= "cat" <= "cat" Out[33]= True
You want to strip certain characters (e.g., whitespace) or characters at certain positions from a string. You may also want to replace these characters with other characters.
StringReplace[]
is an
extremely versatile function that solves most character-oriented
stripping and replacing operations. It supports a very general set
of string-substitution rules, including regular expressions and
Mathematica-specific string patterns.
Strip all spaces.
Strip leading and trailing whitespace.
Normalize whitespace: strip leading, trailing, and multiple internal whitespace.
Literal string substitution.
Ignore case while matching.
Use Mathematica-specific patterns instead of regular expressions.
Sometimes you know exactly where the characters are
that you want to remove. In that case, StringDrop[]
is a lot more efficient.
StringDrop[]
takes the string and
a second argument, which can be an offset from the front, an offset
from the end, specific positions, or a range of positions.
Consider:
In[41]:= myString = "abcdefghijklmnop" ;
Here you drop the first three characters.
In[42]:= StringDrop[myString, 3]
Out[42]= defghijklmnop
Alternatively, you drop the last three characters, like so.
In[43]:= StringDrop[myString, -3]
Out[43]= abcdefghijklm
Drop only the third character, like this.
In[44]:= StringDrop[myString, {3}]
Out[44]= abdefghijklmnop
And drop the third through fifth ("cde"
), using a range list.
In[45]:= StringDrop[myString, {3, 5}]
Out[45]= abfghijklmnop
The step size in the range can even be greater than one by
specifying it as the third element. Here you specify a step size of
two to remove every other character. The -1
upper limit is a convenient way to
specify the end of the string without having to know its
length.
In[46]:= StringDrop[myString, {l, -1, 2}]
Out[46]= bdfhjlnp
You can also act on several strings at once.
In[47]:= otherString = "1234567890"; In[48]:= StringDrop[{myString, otherString}, {3, 5}] Out[48]= {abfghijklmnop, 1267890}
The positional form for replacement is called StringReplacePart[]
, and it works using
similar conventions for specifying positions. The difference is that
you must always provide a contiguous range or a list of such
ranges.
In[49]:= StringReplacePart[myString, "ZZZ", {3, 5}] Out[49]= abZZZfghijklmnop In[50]:= StringReplacePart[myString, "ZZZ", {{3, 5}, {10, 15}}] Out[50]= abZZZfghiZZZp
Each range can also have its own replacement string.
In[51]:= StringReplacePart[myString, {"ZZZ", "WWW"}, {{3, 5}, {10, 15}}]
Out[51]= abZZZfghiWWWp
As you can see from the given examples, StringReplace
is quite versatile. However,
the versatility is derived from Mathematica’s rich support for
patterns (see 5.0 Introduction). Here are some
typical text-processing problems that yield to the application of
StringReplace[]
and pattern
matching.
Delimited text (e.g., comma-delimited text) sounds simple at first, but many delimited formats allow a way to handle the delimiters as regular text by some quoting mechanism, as well as a way to escape quotes themselves. Furthermore, you must handle empty fields. If you want to replace a comma-delimited format with, say, a semicolon-delimited format, you must craft expressions that deal with all cases. Here, "" is used to escape a double quote. This example does not handle empty fields, but see Friedl’s Mastering Regular Expressions for guidance.
Simple XML manipulations, such as discarding markup, can
be accomplished with StringReplace[]
.
In[56]:= NotebookDirectory[] Out[56]= /Users/smangano/Documents/workspace/Mathematica Cookbook/mathematica/ In[57]:= xml = Import[FileNameJoin[ {NotebookDirectory[], "..", "data", "ch02", "data1.xml"}], "Text"] Out[57]= <?xml version="1.0" encoding="UTF-8"?> <!-- Some data to use as a test for Mathematica's XML import --> <?test Just for didactic purposes?> <data> <item> <name>Leonardo</name> <sex>male</sex> <age>8</age> <height>4.7</height> </item> <item> <name>Salvatore</name> <sex>male</sex> <age>5</age> <height>4.1</height> </item> <item> <name>Alexis</name> <sex>female</sex> <age>6</age> <height>4.4</height> </item> </data> <!-- Comment at end -->
See 2.4 Mapping Multiple Functions in a Single Pass
for use of StringPosition[]
, which
returns sequence specification that can be fed into StringReplacePart[]
and StringDrop[]
.
See 2.8 Defining Indexed Functions and 2.9 Understanding the Use of Fold As an Alternative to Recursion for more sophisticated forms of XML processing.
StringCases[]
provides the pattern-driven means of extracting substrings. There
are two major variations. In the first, you simply extract what the
patterns literally match. The second variation uses rules to
transform the matched substrings into other strings and return those
instead.
You can extract specific words using regular expressions (here matches word boundaries).
The same can be done using string expressions.
In[64]:= StringCases["The pig thought he was a dog and then chased the cat.", WordBoundary ~~ {"a", "the"} ~~ WordBoundary, IgnoreCase -> True] Out[64]= {The, a, the}
The most common reason for using rules instead of patterns is to match a substring within a specific context but return the substring alone. Here we want to return substrings bracketed by one or more occurrences of the letter a. This example also illustrates that regular expressions and string expressions can be mixed.
Sometimes you know exactly where the characters are that you
want to remove. In that case, StringTake[]
is a lot more efficient.
StringTake[]
takes the string and
a second argument, which can be an offset from the front, an offset
from the end, specific positions, or a range of positions.
Consider:
In[66]:= myString = "abcdefghijklmnop";
Here you take the first three characters.
In[67]:= StringTake[myString, 3]
Out[67]= abc
Alternatively, you take the last three characters, like so.
In[68]:= StringTake[myString, -3]
Out[68]= nop
Take only the third character, like this.
In[69]:= StringTake[myString, {3}]
Out[69]= c
And take the third through fifth ("cde"
) using a range list.
In[70]:= StringTake[myString, {3, 5}]
Out[70]= cde
The step size in the range can even be greater than one by specifying it as the third element. Here you specify a step size of two to take every other character. The -1 upper limit is a convenient way to specify the end of the string without having to know its length.
In[71]:= StringTake[myString, {1, -1, 2}]
Out[71]= aeegikmo
Conveniently, you can also act on several strings at once.
In[72]:= otherString = "1234567890"; In[73]:= StringTake[{myString, otherstring}, {3, 5}] Out[73]= {cde, 345}
If you have read 5.2 Removing and Replacing Characters from Strings, you see that
StringTake
has very similar
parameter variations as StringDrop[]
. However, StringTake
has an additional feature: it
can take a list of position specifications and produce a list of the
resulting extracts.
In[74]:= StringTake[myString, {{1}, {3}, {8, 10}}]
Out[74]= {a, c, hij}
This is useful for picking multiple segments from a string in
one step. However, if you want a string rather than a list, simply
wrap the expression in a StringJoin[]
.
In[75]:= StringJoin[StringTake[myString, {{1}, {3}, {8, lO}}]]
Out[75]= aehij
In the Solution section we
used RegularExpression["(?<=a)"]
(lookbehind) and RegularExpression["(?=a)"]
(look-ahead)
because there is no stringexpression equivalent. However, there is an
option for StringCases[]
called
Overlaps
, which when set to
True
, causes the matcher to
continue at the character that follows the first character of the last
matched substring. In the following example, this allows a single
a to act as both a start of pattern and end of
pattern.
Without Overlaps→True
, you
would not get the "cbcbd"
substring.
There is a third setting, Overlaps→All
, which causes the matcher to
repeat searches from the same position until no new matches are found.
To see the effect of All
, we need
to consider a different example, one in which the bracketing character
is not excluded from the match. A parenthesized expression is a good
example.
Out[80]//TableForm= ((a-b) ((a-b) (c + d) ((a-b) (c + d) (e / (f + g) ((a-b) (c + d) (e / (f + g)) ((a-b) (c + d) (e / (f + g))) (a-b) (a-b) (c + d) (a-b) (c + d) (e / (f + g) (a-b) (c + d) (e / (f + g)) (a-b) (c + d) (e / (f + g))) (c + d) (c + d) (e / (f + g) (c + d) (e / (f + g)) (c + d) (e / (f + g))) (e / (f + g) (e / (f + g)) (e / (f + g))) (f + g) (f + g)) (f + g)))
If you have a list of strings and want to extract those
that match a pattern, you want Select
, using StringMatchQ
with a string pattern as the
test, rather than StringCases
. See
4.1 Collecting Items That Match (or Don’t Match) a
Pattern.
Use StringJoin[]
on the
output of Table[]
.
In[81]:= stringDup[seed_, n_: 2] := StringJoin@Array[seed &, n] In[82]:= stringDup["-", 10] // InputForm Out[82]//InputForm= "----------" In[83]:= stringDup["wiki "] Out[83]= wiki wiki
This is a simple recipe, and I include it because it’s something
you expect to be bundled as a native function, but it’s not. For most
practical applications, the solution is fine, but for very large
n
, a doubling approach will have
better performance. Rather than doing the math to get the exact string
size, we simply truncate the closest sized string obtained from pure
doubling of the seed.
In[84]:= stringDup2[seed_, n_] := StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n*StringLength[seed]] In[85]:= Mean[Table[Timing[stringDup["-", 100 000]][[1]], {10}]] Out[85]= 0.0486878 In[86]:= Mean[Table[Timing[stringDup2["-", 100 000]][[1]], {10}]] Out[86]= 0.0031014
This solution may not be obvious, so let’s break it down. It
should be clear that mapping the function #<>#&
to a list containing a
string will double that string (recall that <>
is string concatenation).
In[87]:= # <> #& /@ {"_"}
Out[87]= {--}
It follows that doing this twice will quadruple it.
In[88]:= # <> #&/@ (# <> #& /@ {"_";})
Out[88]= {----)
Repeating this process m
times will create a string of length 2^m
. However, the input is the desired
length n
, not the number of
doublings, so we know we need at least Ceiling[Log[2, n]]
doublings; by using
Nest
with this number, we get
exactly that. However, this overshoots the desired length in most
cases, because we rarely expect n
to be an exact power of 2. So we use Take
to extract the correct length. The
reason this is fast for large n
is
that it reduces a 0(n)
operation in
terms of Table
to a 0(log n)
operation using StringJoin
.
You can bundle these versions together into one function that gives good performance across all sizes.
In[89]:= Clear[stringDup]; stringDup[seed_String, n_Integer /; n >=2^12] := StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n] stringDup[seed_String, n_Integer: 2] := StringJoin@[rray[seed &, n]
Nest
is discussed in
2.11 Computing Through Repeated Function Application.
Use StringMatchQ[string,pattern]
to determine if
a string matches a pattern.
In[92]:= StringMatchQ["1234", NumberString]
Out[92]= True
Here I show a match on multiple strings with a pattern that is predicated.
In[93]:= StringMatchQ[{"1234", "1237"}, p: NumberString /; OddQ [FromDigits [p]]]
Out[93]= {False, True}
Use StringFreeQ[string,pattern]
to determine if
a string does not match a pattern.
In[94]:= StringFreeQ[{"1234", "abcde"}, p: NumberString]
Out[94]= {False, True}
Use StringPosition[string,pattern]
to find the
integer offsets of matches. The default behavior is to search for all
occurrences of the pattern (i.e., Overlaps →
True
).
In[95]:= StringPosition["1234abcd54321", NumberString]
Out[95]= {{l,4}, (2,4), (3,4), (4,4),
{9, 13}, {10, 13}, {ll, 13}, {12, 13}, (13, 13)}
With Overlaps → False
, you
only get matches on substrings that don’t share characters with prior
matches.
StringMatchQ[]
and
StringFreeQ[]
very often find
application in restricting inputs to functions.
In[97]:= classify [word_String /; StringMatchQ[word,{"I", "me", "we", "you", "they", "him", "her", "it"}]] := pronoun[word] classify[word_String /; StringMatchQ[word, {"and", "or", "nor", "after", "although", "as", "because", "before", "how", "if", "once", "since", "than", "that", "though", "till", "until", "when", "where", "whether", "while"}]] := conjunction[word] classify[word_String /; StringMatchQ[word, DatePattern[{"DayName"}]]] := dayofweek[word] classify[word_String /; StringMatchQ[word, DatePattern[{"MonthName"}]]] := month[word] (*...*) classify[word_String] := other[word] ;
You can also use them as input to other functions, like Pick[]
in the following grep implementation
adapted from an example in Mathematica documentation. Recall that in
the standard Unix grep, option
-v
instructs grep to return lines that don’t
match the pattern. Here Transpose
and Range
are used to number each
line so the result contains a list of pairs {line, match text}
. This grep function was
implemented in terms of StringFreeQ
rather than StringMatchQ
since the
latter only succeeds if the entire string matches.
In[102]:= grep[file_, patt_, "-v"] := grepImpl[file, patt, True ] grep[file_, patt_] := grepImpl[file, patt, False] grepImpl[file_, patt_, value_] := With[{data =Import[file, "Lines"]}, Pick[Transpose[{Range[Length[data]], data=], StringFreeQ[data, RegularExpression[patt]], value ]] In[105]:= grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=], "bar"}] // TableForm Out[105]//TableForm= 1 bar 4 foo bar 5 foobar 6 barfo In[106]:= grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=}], "bar$"] Out[106]= {{1, bar}, {4, foo bar}, {5, foobar}} In[107]:= grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"}], "bar", "-v"] Out[107]= {{2, foo}, {3, baz}, {7, fo o}}
Both StringMatchQ[]
and StringFreeQ[]
support the
IgnoreCase → True
option. StringMatchQ
also supports option SpellingCorrection → True
, which allows the
match to succeed even if a small number of characters are wrong.
However, in many cases a small number can mean only 1, as the
following example demonstrates, so I would not rely too heavily on
this "feature."
The output of StringPosition[]
can be used as the input to
StringTake
.
In[110]:= With[{str ="1234abcd54321"}, StringTake[str, StringPosition[str, NumberString]]] Out[110]= {1234, 234, 34, 4, 54321, 4321, 321, 21, 1}
If you want to use it with StringDrop[]
, you need to map StringDrop[]
over the list returned by
StringPosition[]
. This will give
you a list with each matching segment dropped. More than likely, you
will want to set Overlaps
→ False
in this case. Try Overlaps → True
with the expression given below to see why it is undesirable.
See 5.3 Extracting Characters and Substrings and
5.2 Removing and Replacing Characters from Strings for usage
of StringTake[]
and StringDrop[]
.
StringSplit[]
provides a
variety of options for tokenizing text. The default is simply to
tokenize on whitespace.
In[112]:= StringSplit["The quick brown fox
jumped over the lazy programmer"]
Out[112]= {The, quick, brown, fox, jumped, over, the, lazy, programmer}
Other delimiters can be specified as literals or more general patterns. Here you specify comma delimiters with zero or more whitespace characters.
In[113]:= StringSplit["2008/01/20, test1, 100.3, 77.8,33.77", ","~~WhitespaceCharacter ...] Out[113]= {2008/01/20, test1, 100.3, 77.8, 33.77}
If there are several delimiters, give each pattern in a list. Here you decide to parse the date along with the comma-delimited text.
In[114]:= StringSplit["2008/01/20, test1, 100.3, 77.8,33.77", {"," ~~WhitespaceCharacter ..., "/"}] Out[114]= {2008, 01, 20, test1, 100.3, 77.8, 33.77}
StringSplit
supports rules as
well as patterns, which leads to some interesting applications, such
as a means of highlighting output. Here is an example that stylizes
XML by rendering directives, comments, and tags in specific font
styles and colors. (The colors will not be visible in a monochrome
print, but you can try the code on your own to see the effect.)
As of version 6, Mathematica comes bundled with many useful data sources. One of these sources is an integrated English language dictionary (dictionaries for other languages can be installed).
Look up words that begin with th and end with y.
In[116]:= DictionaryLookup["th" ~~ ___ ~~"y"]
Out[116]= {thankfully, thanklessly, theatricality, theatrically,
thematically, theocracy, theologically, theology, theoretically,
theory, theosophy, therapeutically, therapy, thereby, thermally,
thermodynamically, thermostatically, they, thickly, thievery,
thingummy, thingy, thinly, thirdly, thirstily, thirsty, thirty, thorny,
thoroughly, thoughtfully, thoughtlessly, thready, threateningly,
threepenny, threnody, thriftily, thrifty, thrillingly, throatily,
throaty, throwaway, thruway, thuggery, thunderously, thundery, thy}
Look up words that end in ee.
In[117]:= DictionaryLookup[___ ~~ "ee"]
Out[117]= {absentee, addressee, agree, Aimee, Albee, amputee, apogee, appointee,
Ashlee, attendee, Attlee, axletree, banshee, bee, bootee, bumblebee,
bungee, carefree, Chattahoochee, Cherokee, chickadee, chimpanzee,
coffee, committee, conferee, consignee, coulee, Cree, debauchee, decree,
Dee, degree, deportee, Desiree, detainee, devotee, disagree, divorcee,
draftee, Dundee, dungaree, Elysee, emcee, employee, enlistee, entree,
epee, escapee, evacuee, fat-free, fee, fiancee, filigree, flee, foresee,
franchisee, free, fricassee, Frisbee, fusee, Galilee, garnishee, gee, ghee,
glee, goatee, grandee, grantee, guarantee, gumtree, honeybee, honoree,
Humvee, inductee, internee, interviewee, invitee, jamboree, Jaycee,
jubilee, kedgeree, Klee, knee, lee, Lee, legatee, Legree, lessee, levee,
licensee, manatee, marquee, matinee, McGee, McKee, melee, Menominee,
Milwaukee, mortgagee, Murrumbidgee, Muskogee, nee, negligee, nominee,
Okeechobee, Okefenokee, oversee, parolee, Pawnee, payee, pedigree, pee,
peewee, Pelee, perigee, pewee, pharisee, Pharisee, pongee, prithee,
protegee, puree, puttee, quadtree, ranee, referee, refugee, Renee,
repartee, retiree, returnee, Rhee, rupee, Sadducee, scree, see, settee,
Shawnee, Sheree, shoetree, singletree, sirree, Slurpee, soiree, spree,
squeegee, standee, subcommittee, subtree, suttee, Suwanee, Swanee,
Tallahassee, tee, Tennessee, tepee, thee, three, toffee, toll-free, topee,
toupee, towhee, townee, Toynbee, trainee, transferee, tree, trochee,
Truckee, trustee, Tuskegee, twee, Tweedledee, Tyree, wannabee, wee, whee,
whiffletree, whippletree, whoopee, Yahtzee, Yankee, yippee, Zebedee}
There are a lot of neat applications for an integrated dictionary.
Here is how you might cheat at a crossword puzzle. Say you have three letters of a six-letter word and the clue is "51 down: unkeyed."
In[118]:= DictionaryLookup["a" ~~ _ ~~ "o" ~~ _ ~~ _ ~~ "l"]
Out[118]= {amoral, atonal, avowal}
Ah, atonal sounds right (pun intended)!
You can also help your second grader impress the teacher on that November worksheet for finding all the words you can make out of the letters in "Thanksgiving" (i.e., anagrams). Here we use a pattern containing all combinations of the letters in "thanksgiving" and an extra constraint function to ensure letters are contained by their availability (count). Strictly speaking, an anagram must use all the letters of the input, but I ignore that here.
Using Tally[]
to
count letter occurrences and doing a bit of set manipulation, we can
generalize this for any word. The condition checking for the empty
complement at the end is not strictly needed here because we will
never match a word in the dictionary that has a letter that is not
in the input word. However, it is needed to make the logic if
isWordSubsetQ[]
is correct as a
general predicate.
You can test the generality against other words.
In[125]:= anagrams["winter"] Out[125]= {en, er, in, inert, inter, ire, it, net, new, newt, nit, niter, re, rein, rent, rite, ten, tern, ti, tie, tier, tin, tine, tire, twin, twine, twiner, we, weir, wen, went, wet, win, wine, winter, wire, wit, wren, writ, write} In[126]:= anagrams["dog"] Out[126]= {do, dog, go, god}
Here is a neat little palindrome finder (courtesy of the Mathematica documentation).
In[127]:= DictionaryLookup[x__/; x ===StringReverse[x]}
Out[127]= {a, aha, aka, bib, bob, boob, bub, CFC, civic, dad, deed, deified,
did, dud, DVD, eke, ere, eve, ewe, eye, gag, gig, huh, I, kayak,
kook, level, ma'am, madam, mam, MGM, minim, mom, mum, nan, non, noon,
nun, oho, pap, peep, pep, pip, poop, pop, pup, radar, redder, refer,
repaper, reviver, rotor, sagas, sees, seres, sexes, shahs, sis,
solos, SOS, stats, stets, tat, tenet, TNT, toot, tot, tut, wow, WWW}
By using all the words in the dictionary with Nearest[]
, you can create a rudimentary
spell-checker. For our first attempt, we’ll use Nearest'
s default distance function. We’ll
return a list for which the first element is True or False depending
on the word’s inclusion in the dictionary and the second element is
a list of potential correct spellings.
In[128]:= (*Returns a function that, when applied to <word> and an integer <n>, returns a list containing the n words in the integrated dictionary considered to be closest to <word>*) nf1 = Nearest[DictionaryLookup[]]; SpellCheck1[word_] := Module[{corrections = nf1[word, 15]} , If[ MemberQ[ corrections, word], {True, word}, {False, corrections}]] In[130]:= SpellCheck1["pickel"] Out[130]= {False, {nickel, picked, picker, picket, bicker, dicker, dickey, hickey, kicked, kicker, licked, Michel, mickey, Mickey, nicked}}
We see that the default distance function used for
strings (EditDistance)
does not
make the greatest spell-checker: the obvious suggestion of
pickle is not among the first 15 nearest words.
You can experiment with other distance functions. Here is one that
penalizes more heavily for mistakes in consonants than for mistakes
in vowels.
Here we test on some commonly misspelled words (according to the Oxford Dictionaries website: http://bit.ly/KuIQ2 ).
In[135]:= SpellCheck2["accomodate"] Out[135]= {False, {accommodate, accommodated, accommodates, accumulate, accelerate, accentuate, acclimate, accolade, accommodation, accordant}} In[136]:= SpellCheck2["alcahol"] Out[136]= {False, {alcohol, alcohols, alcoholic, achoo, ahchoo, algal, anchor, carol, lethal, local}} In[137]:= SpellCheck2["mispell"] Out[137]= {False, {misspell, Aspell, Ispell, miscall, respell, spell, dispel, dispels, misdeal, misplay}}
This returns useful results, but performance (speed) is poor.
In[138]:= SpellCheck2["pickel"] // Timing
Out[138]= {2.22533, {False, {nickel, picked, picker,
picket, pickle, packed, packer, packet, pecked, pick}}}
We can improve the speed using a divide-and-conquer
approach: pick a large but manageable number (e.g., 100) of nearest
words according to simple EditDistance,
and then do a second pass on
the smaller set with the EditDistance
sans vowels. We define a
distance function called ConsonantDistance[]
for the second
pass.
Good results and about 43 times faster!
Mathematica also provides WordData[],
which returns information
about properties of words, such as parts of speech and
definitions.
In[142]:= WordData["run"]
Out[142]= {{run, Noun, Score}, {run, Noun, Travel}, {run, Noun, RegularTrip},
{run, Noun, ShortTrip}, {run, Noun, FootballPlay}, {run, Noun, Endeavor},
{run, Noun, Successiveness}, {run, Noun, Flow}, {run, Noun, Damage},
{run, Noun, Footrace}, {run, Noun, Campaign}, {run, Noun, Streak},
{run, Noun, Stream}, {run, Noun, IndefiniteQuantity},
{run, Noun, Liberty}, {run, Noun, TimePeriod}, {run, Verb, Disintegrate},
{run, Verb, SplitUp}, {run, Verb, Dissolve}, {run, Verb, Treat},
{run, Verb, Change}, {run, Verb, Get}, {run, Verb, Vie}, {run, Verb, Race},
{run, Verb, Catch}, {run, Verb, Draw}, {run, Verb, Operate},
{run, Verb, Function}, {run, Verb, CarryThrough}, {run, Verb, Play},
{run, Verb, Circularize}, {run, Verb, Trip}, {run, Verb, GoThrough},
{run, Verb, Hurry}, {run, Verb, TravelRapidly}, {run, Verb, Sport},
{run, Verb, Accompany}, {run, Verb, Sail}, {run, Verb, SpreadOut},
{run, Verb, Flow}, {run, Verb, GoAway}, {run, Verb, Displace},
{run, Verb, MoveFreely}, {run, Verb, Trade}, {run, Verb, Loose},
{run, Verb, Direct}, {run, Verb, Succeed}, {run, Verb, Implement},
{run, Verb, Occur}, {run, Verb, Continue},{run, Verb, Endure},
{run, Verb, Extend}, {run, Verb, MakePass}, {run, Verb, Lean},
{run, Verb, Incur}, {run, Verb, Go}, {run, Verb, Range}}
Readers interested in spell-checkers should check out this approach (written in Python) by Peter Norvig of Google: http://bit.ly/19gyjN .
Use Import[]
with format
"XMLObject"
to import XML and
convert it to a special Mathematica expression form. Consider the
following XML in file datal.xml (available for
download at the book’s website).
<?xml version="1.0" encoding="UTF-8"?> <!-- Some data to use as a test for Mathematica's XML import --> <?test Just for didactic purposes?> <data> <item> <name>Leonardo</name> <sex>male</sex> <age>8</age> <height>4.7</height> </item> <item> <name>Salvatore</name> <sex>male</sex> <age>5</age> <height>4.1</height> </item> <item> <name>Alexis</name> <sex>female</sex> <age>6</age> <height>4.4</height> </item> </data>
Mathematica imports XML into expression form. You can
manipulate the expression just like you would any other Mathematica
expression, but first you need to understand the structure, which is a
bit unconventional. Mathematica uses two types of heads to encode XML.
XMLObject[
"type"
]
is used to represent everything that is
not an element, including the entire document (type = "Document"
), comments (type = "Comment}
), CDATA sections (type = "CDATASection"
), processing instructions
(type = "Processinglnstruction"
),
declarations (type = "Declaration
),
and document types (type = "Doctype"
). In the XML above, you see
examples for document, declaration, comment, and processing
instruction. XMLElement[tag,{attr1→val1,...},{data1,...}]
is used to represent element data for both simple (text values) and
complex element types (those with child elements). Don’t get tripped
up by the XMLObject
notation. The
entire syntax XMLObject["type"]
is
the head of the expression, while the remainder is a sequence of one
or more arguments that depends on the type.
In[144]:= Head[data] // InputForm
Out[144]//InputForm=
XMLObject["Document"]
The document version consists of three arguments: a list
containing the declaration and possibly other objects, the document
content, and a list of any objects (such as comments) that might
appear past the last XML element. A very crude way to access structure
is through Part[]
or, equivalently,
[[n]]
.
Pattern matching is much more elegant and more resilient
to changes in document structure. Here we extract male elements using
Cases
with a pattern and an
infinite level specification. This is roughly equivalent to using
XPath in native XML processing.
In[150]:= Cases[data, XMLElement[_, _, {_, XMLElement["sex", _, {"male"}], ___}], Infinity] // TableForm Out[150]//TableForm= XMLElement[item, {}, {XMLElement[name, {}, {Leonardo}], XMLElement[sex, {}, {male}], XMLElement[age, {}, {8}], XMLElement[height, {}, {4.7}]}] XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}], XMLElement[sex, {}, {male}], XMLElement[age, {}, {5}], XMLElement[height, {}, {4.1}]}]
Sometimes, the XMLObject
and XMLElement
notation can be a bit too heavy,
and it is easier to work with simple nested lists. This can be done
with Apply
plus List,
specifying all levels.
In[151]:= list = Apply[List, data, {0, Infinity}]
Out[151]= {{{{Version, 1.0}, {Encoding, UTF-8}},
{ Some data to use as a test for Mathematica's XML import },
{test, Just for didactic purposes}}, {data, {},
{{item, {}, {{name, {}, {Leonardo}}, {sex, {}, {male}}, {age, {}, {8}},
{height, {}, {4.7}}}}, {item, {}, {{name, {}, {Salvatore}},
{sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}},
{item, {}, {{name, {}, {Alexis}}, {sex, {}, {female}},
{age, {}, {6}}, {height, {}, {4.4}}}}}}, {{ Comment at end }}}
This can shorten the patterns needed to extract content.
In[152]:= Cases[list, {___, {"sex", _, {"male"}}, ___}, Infinity]
Out[152]= {{{name, {}, {Leonardo}}, {sex, {}, {male}},
{age, {}, {8}], {height, {}, {4.7}}}, {{name, {}, {Salvatore}},
{sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}}
Another useful transformation is to change all heads to the
symbolic form of the element tag. Here we use //. (ReplaceRepeated)
with rules that strip
XMLObject
and convert XMLElement
expressions. I show the output in
tree form to make it clear what this transformation does.
When converting strings to symbols, you need to be
cognizant of whether a symbol already exists and has a value. This
bit me when I was preparing this recipe, because I failed to
recognize that the top-level element tag name was "data," which, of
course, turned out to be the name of the variable that I was
transforming. Infinite recursion! The solution was to include the
transformation from XMLElement["data",
attrs_, content_]
to XMLElement["items", attrs, content]
as the
first transformation.
5.9 Transforming XML Using Patterns and Rules and 5.10 Transforming XML Using Recursive Functions (à la XSLT) show you how to transform imported XML into other structures.
You want to transform imported XML into something more suitable to mathematical manipulation.
The format of imported XML is a bit heavy. You use pattern
matching and ReplaceAll
to
transform it into something more digestible. Here we take our
row-oriented XML data into a simple matrix.
This technique has two basic steps. First, you use Cases
to extract the relevant elements.
Second, you use a series of one or more transformations to massage the
data into the form you want. In the first transformation, elements are
taken to primitive values. Here you rely on the column position to
determine when strings need conversion into numbers via ToExpression[].
The second transformation
strips out the remaining XMLElement
content. Until you have some experience with these types of
transformations it is unlikely that you’ll whip them up off the top of
your head. The final form of this transformation reflects the fact
that I developed it in stages. Here are the successive
refinements.
Choose the relevant elements.
In[156]:= Cases[data , XMLElement["item", _, _], Infinity]
Out[156]= {XMLElement[item, {},
{XMLElement[name, {}, {Leonardo}], XMLElement[sex, {}, {male}],
XMLElement[age, {}, {{8}], XMLElement[height, {}, {4.7}]}],
XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}],
XMLElement[sex, {}, {male}], XMLElement[age, {}, {5}],
XMLElement[height, {}, {4.1}]}], XMLElement[item, {},
{XMLElement[name, {}, {[Alexis}], XMLElement[sex, {}, {female}],
XMLElement[age, {}, {6}], XMLElement[height, {}, {4.4 }]}]}
Strip out the data-level XML structure.
Strip out the row-level XML structure, leaving the data in matrix form but all the primitive values as strings.
Finally, do the type conversion.
There are always many ways to solve the same transformation
problem. The tradeoffs involve brevity, clarity, generality, and
performance. The solution has clarity, because it accomplishes the
transformation in a step-by-step fashion. However, it is neither brief
nor very general. The following transformation does the same thing but
is more general. It will work on any two-level XML document because it
does not match on specific element names (like "item"
). It also does not hardcode which
columns contain numeric data. However, it is a bit more cryptic
because it does not mention XMLElement
at all. Rather, it immediately
converts the data to a list (using Apply
with head List),
and it uses [[n]]
to pick out the relevant items.
I demonstrate the generality by processing an XML file with a different number of rows, columns, and data types.
You may find that you need to transform XML for reasons other than using the data in Mathematica. Unless you already know a language specifically designed for this purpose (like XSLT), Mathematica is a good choice. Mathematica’s pattern-matching capabilities are well suited to many types of XML transformations. Consider the problem of converting elements to attributes.
It is a bit easier to see how this worked by
converting back to XML text. The stripping of carriage returns
(
) is only for formatting
purposes.
A transformation from attributes to elements follows similar
lines. The use of Join[]
here is
not strictly necessary, but it shows you how to handle cases in
which you don’t want to lose preexisting child elements at the point
where you are injecting attribute content.
See the tutorial XML/tutorial/TransformingXML in the Mathematica documentation (also at http://bit.ly/4tS1Ce ).
5.10 Transforming XML Using Recursive Functions (à la XSLT) shows alternate techniques for XML transformation.
The pure pattern-based approach of 5.9 Transforming XML Using Patterns and Rules is too awkward, cryptic, or complex for your particular transformation problem.
Consider using an approach inspired by Extensible Stylesheet Language Transforms (XSLT). XSLT is a language that is specifically designed to transform XML. There are some rough similarities between XSLT and a style of Mathematica programming that exploits functions, patterns, and recursion. Here is how you use Mathematica to process XML in ways similar to XSLT. Consider the 5.9 Transforming XML Using Patterns and Rules transformation of elements to attributes. Rather than rely on replacement, we use a set of mutually recursive functions with patterns to navigate the XML tree while surgically inserting transformations at the correct places.
A natural objection to using this style of
transformation rather than using replacement rules is that it is more
verbose. This verbosity comes with some advantages. The first
advantage is that when things go wrong, it is generally easier to
debug a set of discrete functions than a replacement pattern. Most of
the action of a replacement pattern is happening under the covers. The
second advantage comes in cases where you need to make many changes at
different levels in the XML hierarchy. Here the overhead of the
recursive approach is less bothersome. We implement a transformation
that changes elements to attributes, renames the "item"
element to "row",
changes "sex"
to "gender",
and converts the height from feet
to meters—all with very little extra overhead.
One of the first things you learn about XSLT is that if you create an empty stylesheet (XSLT’s equivalent of a program), you get some default transformation rules that act to output just the text nodes of the XML data. We can emulate that behavior in Mathematica with the following functions.
In[187]:= ClearAll[transform] transform[XMLObject[type_][content__]] := StringJoin[transform[#] & /@ List[content]] transform[XML]lement[tag_, attrs_List, data_List]] := StringJoin[transform[#] & /@ data ] transform[text_String] := text transform[_] := "" In[192]:= transform[data] Out[192]= Leonardomale84.7Salvatoremale54.1Alexisfemale64.4
So far, so good, but can we do something more interesting?
Suppose we want to clone our XML document but replace all occurrences
of the element "sex"
with the
element "gender".
This recursive transformational approach is overkill in
this scenario since we can more easily express this transformation
using ReplaceAll.
There are certain types of structure-adding
transformations that were difficult to do in XSLT until a grouping
construct was added (xsl:for-each-group)
in XSLT 2.0. Here is a
solution to a grouping problem using Mathematica’s Sort[]
and Split[]
functions.
The goal of this transformation is to group all
employees in the same department under a new element <Dept dept="num">
. Notice how this is
accomplished with little additional code. Helper functions define an
ordering and an equivalence relation for Sort
and Split,
respectively, and a transform[]
applies the additional level of
grouping when it matches the "employees"
element.
Of course, there are significant differences between these transformations and XSLT. For example, in XSLT, you operate on a tree and, hence, can navigate upward from child elements to parent elements. This is not the case for Mathematica’s representation of XML. The tutorial mentioned in the following See Also section provides some guidance for working around these issues.
The tutorial XML/tutorial/TransformingXML in the Mathematica documentation (also at http://bit.ly/4tS1Ce ) has a section comparing Mathematica to XSLT and can provide further help in exploiting these techniques.
You can learn more about XSLT at the XSL Working Group’s website: http://bit.ly/1fJsB.
The easiest type of parser to write in Mathematica is a recursive descent parser. Before writing the parser, we need to know the grammar of the language we will parse. The most common notation for grammars is Backus-Naur Form (BNF), but for reasons that will become apparent in the discussion, I use Mathematica itself to represent the grammar. For this example, I use a simplified English grammar. The presentation here is a variation of one developed and given by Daniel Lichtblau of Wolfram Research at the Wolfram Developer’s Conference in 1999. Refer to the See Also section for more information.
First, we need some helper functions to make creating the
grammar easier. We use two functions, sequence
and choose,
with attribute HoldAll
to prevent them from evaluating
their arguments and causing an infinite recursion. As its name would
suggest, sequence[]
represents a
sequence of terms of the grammar. Choose
represents a choice of one out of two
or more possible terms. I allow choose
to take an extra argument, which is a
list of probabilities for the choices. More on that later.
In[212]:= SetAttributes[{sequence, choose}, HoldAll] NILL = "";
This grammar is for a small subset of English.
In[214]:= sentence := choose[declarative, interrogative, imperative] declarative := sequence[subject, predicatepast] interrogative := sequence[qverb, subject, predicatepresent] imperative := sequence[actverb, subject] subject := choose[nounclause, sequence[nounclause, prepositionclause]] nounclause := sequence[adjectiveclause, noun] noun = {"skyscraper", "ball", "dog", "cow", "shark", "attorney", "hatter", "programmer", "city", "village", "buffalo", "moon", "librarian", "sheep"} ; adjectiveclause := sequence[article, adjectivelist] adjectivelist := choose[NILL, sequence[adjective, adjectivelist] , {0.7}] article = {"a", "the", "this", "that"}; adjective = {"big", "wet", "mad", "hideous", "red", "repugnant", "slimy", "delectable", "mild-mannered", "lazy", "silly", "crazy", "ferocious", "cute"} ; prepositionclause := sequence[preposition, nounclause] preposition = {"in", "above", "under", "from", "near", "at", "with"} ; predicatepresent := sequence[verbpresent, subject] predicatepast := sequence[verbclause, subject] verbclause := sequence[adverblist, verbpast] adverblist := choose[NILL, sequence[adverb, adverblist ], {0.6}] adverb= {"swiftly", "unflinchingly", "smugly", "selflessly", "oddly", "mightily"} ; verbpast = {"ate", "threw", "gnashed", "boiled", "grated", "milked", "spanked", "jumped"}; verbpresent= {"eat", "throw", "gnash", "boil", "grate", "milk", "spank", "salivate", "jump"}; qverb = {"did", "will", "could", "should"} ; actverb= {"break","fix", "launch", "squeeze", "fetch"} ;
This grammar becomes the specification for our parser. Recursive descent parsers are probably the easiest parsers to craft by hand because their structure mimics the grammar. The goal of this parser is to create a labeled parse tree from a sentence. The parser is very simple: it contains no provision for error handling and relies on the grammar being completely conflict free. For example, the major sentence types are completely determined by the first word. Real languages or even artificial languages (like programming languages) are rarely that clean.
In[237]:= (*Test for membership of a terminal symbol in a list of terminal symbols.*) isQ[type_, word_]:= MemberQ[type, word] (*Get next word for parser.*) getNextWord[{}] := "" getNextWord[words_List] := First[words] (*Parse a single word, classifying it as head, and return length of 1.*) atomParse[head_, words_List] := {head[getNextWord[words]], 1} (*Top level parse function for sentences. Dispatches based on first word.*) sentenceParse[sentence_sentenceType] := Module[{sentencelist =Apply[List, sentence], firstWord }, firstWord = First[sentencelist]; If[isQ[qverb, firstWord], interrogativeParse[sentencelist], If[isQ[actverb, firstWord], imperativeParse[sentencelist], declarativeParse[sentencelist]]]] (*declarative := sequence[subject, predicatepast]*) declarativeParse[words_List]:= Module[{subject =subjectParse[words], predicate}, predicate=predicatepastParse[Drop[words, subject[[2]]]]; "DECLARATIVE SENTENCE"[subject[[1]], predicate[[1]]]] (*interrogative := sequence[qverb, subject, predicatepresent]*) interrogativeParse[words_List] := Module[{qverb =atomParse["QUESTION VERB", words], subject, predicate}, subject=subjectParse[Drop[words, qverb[[2]]]]; predicate=predicatepresentParse[ Drop[words, qverb[[2]] +subject[[2]]]]; "INTERROGATIVE SENTENCE"[qverb[[1]], subject[[1]], predicate[[1]]]] (**) (*imperative := sequence[actverb, subject]*) imperativeParse[words_List] := Module[{actverb =atomParse["ACTION VERB", words], subject}, subject = subjectParse[Drop[words, actverb[[2]]]]; "IMPERATIVE SENTENCE"[actverb[[1]], subject[[1]]]] (*subject := choose[nounclause, sequence[nounclause, prepositionclause]]*) subjectParse[words_List] := Module[{nounclause =nounclauseParse[words], prepositionclause}, prepositionclause=Drop[words, nounclause[[2]]]; If[! isQ[preposition, getNextWord[prepositionclause]], {"SUBJECT"[nounclause[[1]]], nounclause[[2]]}, prepositionclause=prepositionclauseParse[prepositionclause]; {"SUBJECT"[nounclause[[1]], prepositionclause[[1]]], nounclause[[2]] +prepositionclause[[2]]}]] (*predicatepast :=sequence[verbclause,subject]*) predicatepastParse[words_List] := Module[{verbclause = verbclauseParse[words], subject}, subject = subjectParse[Drop[words, verbclause[[2]]]]; {"PREDICATE"[verbclause[[1]], subject[[1]]], verbclause[[2]] + subject[[2]]}] (*predicatepresent:=sequence[verbpresent,subject]*) predicatepresentParse[words_List] := Module[{verb =atomParse["VERB (PRESENT TENSE)", words], subject}, subject=subjectParse[Drop[words, verb[[2]]]]; {"PREDICATE"[verb[[1]], subject[[1]]], verb[[2]] +subject[[2]]}] (*verbclause:=sequence[adverblist,verbpast]*) verbclauseParse[words_List] := Module[{adverbs =adverblistParse[words], verb}, verb=atomParse["VERB (PAST TENSE)", Drop[words, adverbs[[2]]]]; If[adverbs[[2]] == 0, verb, {"VERB CLAUS]"[adverbs[[1]], verb[[1]]], adverbs[[2]] + verb[[2]]}]] (*nounclause:= sequence[adjectiveclause, noun]*) nounclauseParse[words_List] := Module[{adjectiveclause = adjectiveclauseParse[words], noun}, noun = atomParse["NOUN", Drop[words, adjectiveclause[[2]]]]; {"NOUN CLAUS]"[adjectiveclause[[1]], noun[[1]]], adjectiveclause[[2]] +noun[[2]]}] (*adjectiveclause := sequence[article, adjectivelist]*) adjectiveclauseParse[words_List] := Module[{art =atomParse["ARTIC)]", words], adjlist}, adjlist=adjectivelistParse[Drop[words, art[[2]]]]; If[adjlist[[2]] ==0, art,{"ADJECTIVE CLAUSE"[art[[1]], adjlist[[1]]], art[[2]] +adjlist[[2]]}]] (* Parse (possibly empty) list of adjectives.*) (*adjectivelist := choose[NILL, sequence[adjective, adjectivelist] , {0.7}]* ) adjectivelistParse[words_List] := Module[{words2 =words, adj, result, len=0}, result ="ADJECTIVE LIST"[]; While[isQ[adjective, getNextWord[words2]], adj=atomParse["ADJECTIVE", words2]; len+=adj[[2]]; result="ADJECTIVE LIST"[result, adj[[1]]]; words2=Drop[words2, adj[[2]]]]; {Flatten[result, Infinity, "ADJECTIVE LIST"], len}] (*prepositionclause := sequence[preposition, nounclause]*) prepositionclauseParse[words_List] := Module[{preposition =atomParse["PREPOSITION", words], nounclause}, nounclause=nounclauseParse[Drop[words, preposition[[2]]]]; {"PREPOSITION CLAUSE"[preposition[[1]], nounclause[[1]]], preposition[[2]] +nounclause[[2]]}] (* Parse Ipossibly empty M list of adverbs.*) (*adverblist := choose[NILL, sequence[adverb,adverblist], {0.6}]*) adverblistParse[words_List] := Module[{words2 =words, adv, result, len=0}, result="ADVERB LIST"[]; While[isQ[adverb, getNextWord[words2]], adv=atomParse["ADVERB", words2]; len+=adv[[2]]; result="ADVERB LIST"[result, adv[[1]]]; words2=Drop[words2, adv[[2]]]]; {Flatten[result, Infinity, "ADVERB LIST"], len}]
We can test the parser on a sentence that conforms to the grammar.
In[254]:= sentenceParse[ sentenceType["will", "the", "wet", "programmer", "spank", "the", "moon"]] Out[254]= INTERROGATIVE SENTENCE[QUESTION VERB[will], SUBJECT[NOUN CLAUSE[ADJECTIVE CLAUSE[ARTICLE[the], ADJECTIVE LIST[ADJECTIVE[wet]]], NOUN[programmer]]], PREDICATE[VERB (PRESENT TENSE)[spank], SUBJECT[NOUN CLAUSE[ARTICLE[the], NOUN[moon]]]]]
You may wonder why I took the trouble to specify the grammar
using Mathematica if I was going to write the parser by hand. First, I
did not write this parser; I just prettied up a parser written by
Daniel Lichtblau! The more serious answer is that the grammar can be
used to easily create a language generator to go along with the
parser. The generator is very useful for testing the parser. Here I
based a generator on Lichtblau’s implementation but made some
significant improvements. The first improvement is that my
implementation is more declarative than procedural because it
leverages Mathematica’s pattern matching. The second improvement is
that the generator absorbs all the complexity so the grammar can
remain very clean. In Lichtblau’s original grammar, the representation
was soiled by the presence of programmatic constructs, like Hold[]
and his implementation of random
choice. Other than the presence of probabilities, the grammar in the
preceding Solution section is completely
clean. In fact, it reads as easy as BNF. Refer to the URL in the See Also section to compare this implementation
with the original.
In[255]:= <<Combinatorica` (*needed for BinarySearch[]*) (*randomChoose[parts_List,probs_List] selects an item from parts_List based on a list of probabilities the length of which must be one less than the number of parts and the sum of which is less than one. The interpretation is that each probability corresponds to the probability of the item in the same position except for the last item, which gets the residual.*) randomChoose[parts_List, probs_List]:= Module[{weights, test, pos}, weights = N[Append[FoldList[Plus, First[probs], Rest[probs]], 1]]; test = RandomReal[]; pos = Ceiling[BinarySearch[weights, test]]; parts[[pos]]] (*randomPart[] is responsible for interpreting the grammar in a random manner. There is a variation for each possible term, and recursion is used to expand nonterminals.*) randomPart[sequence[parts__ ]]:= randomPart[#] & /@ List[parts] randomPart[choose[parts__, probs_List]] := Union[Flatten[List[randomPart[randomChoose[List[parts], probs ]]]]] randomPart[choose[parts__ ]]:= Module[{partList, numParts}, partList = List[parts]; numParts =Length[partList]; randomPart[randomChoose[partList, Table[1/numParts, {numParts -1}]]]] randomPart[terminals_List] := terminals[[ RandomInteger[ {1, Length[terminals ]}] ]] randomPart[NILL ] := {} (*randomSentence[] is the entry point for generating a random sentence of the grammar.* ) randomSentence [] := sentenceType @@ Flatten[randomPart[sentence]] (* We provide a nice textual formatting for sentences that also takes care of punctuation.* ) Format[sentence_sentenceType]:= Module[{word =First[sentence], words, punc}, words=Map[StringJoin[#, " "] &, sentence]; punc=If[isQ[qverb, word], "?", If[isQ[actverb, word], "!", "."]]; words[[Length[words]]] =StringReplacePart[Last[words], punc, -1]; words[[1]] =StringReplacePart[First[words], ToUpperCase[StringTake[First[words], 1]], 1]; Apply[StringJoin, words ]]
Here you can see the result of generating 10 random sentences. They are, for the most part, utter gibberish, but some are kind of funny. They all conform to the grammar, as we can see by running them through the parser.
The parser we wrote by hand is an instance of a predictive recursive descent parser because it looks ahead wherever there is a choice so that it does not take a wrong path through the grammar. In contrast, a backtracking parser simply starts over from where it left off if a particular parse path fails. If you are ambitious, you can continue this recipe and write a backtracking parser generator in Mathematica. The references in the following See Also section provide some background.
See Daniel Lichtblau’s original implementation at http://bit.ly/zXhUm .
Packrat parsing is amenable to Mathematica implementation. See http://bit.ly/RsNCe .
A functional approach to parsing is discussed in "Monadic Parser Combinators" by Graham Hutton and Erik Meijer, published in Journal of Functional Programming, Volume 8, Issue 4, 1996. See http://bit.ly/PIVAh (PostScript file).