Grammar Analysis and Description

The grammar of our legacy file formats can be broken down into two separate grammars: (1) the grammar of records and groups of records within the file, and (2) the grammar of the fields within a record. In the case of CSV files, we have imposed the restriction that each row has the same format. The CSV file grammar can be expressed rather simply with the following BNF production.

CSV File Grammar
CSVFile ::= row+

The plus sign (+) indicates one or more. So, this simply says that a CSV file contains one or more rows. For CSV files, the row grammar is the interesting part. Remember, here we don't want to describe the grammar of a specific CSV file; instead we want to abstract the essential features of CSV files and develop a grammar that describes the whole class of CSV files. So, here's the CSV row grammar. Note that we follow through with the complete grammar by using the row nonterminal symbol from the file grammar.

CSV Row Grammar
row ::= column (column_delimiter column?)* |
        (column_delimiter column?)+
column ::= column_characters_A+ |
           text_delimiter column_characters_B+ text_delimiter
column_characters_A ::= All allowed characters except
                        column_delimiter
column_characters_B ::= All allowed characters except
                        text_delimiter

Again, the plus sign (+) indicates one or more occurrences and the vertical bar or pipe (|) indicates an exclusive OR choice. The asterisk (*) indicates zero or more occurrences, the question mark (?) indicates optionality, and the parentheses are used to establish groupings, the same way they are used in mathematical equations. I've taken a bit of a shortcut in the last two productions by falling back to text rather than terminal and nonterminal symbols, but I think it is clearer than trying to enumerate the full set of allowed characters.
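One way to check your reading of this notation is to transcribe the productions into a regular expression. The following Python sketch does that under a few assumptions of mine: both delimiters are single characters, "all allowed characters" means any character at all, and the row has already been separated from its line terminator. The names build_row_pattern and is_legal_row are hypothetical, not part of the utilities developed in this book. Note that the pattern only answers whether a row belongs to the grammar; it does not decide how the row splits into columns.

```python
import re

def build_row_pattern(column_delimiter=",", text_delimiter='"'):
    """Transcribe the CSV row grammar into a compiled regular expression.

    Assumes single-character delimiters (hypothetical helper, for illustration).
    """
    cd = re.escape(column_delimiter)
    td = re.escape(text_delimiter)
    # column ::= column_characters_A+ |
    #            text_delimiter column_characters_B+ text_delimiter
    column = rf"(?:[^{cd}]+|{td}[^{td}]+{td})"
    # row ::= column (column_delimiter column?)* |
    #         (column_delimiter column?)+
    row = rf"(?:{column}(?:{cd}{column}?)*|(?:{cd}{column}?)+)"
    return re.compile(row)

def is_legal_row(row, column_delimiter=",", text_delimiter='"'):
    """Return True if the whole row matches the row production."""
    pattern = build_row_pattern(column_delimiter, text_delimiter)
    return pattern.fullmatch(row) is not None

if __name__ == "__main__":
    for candidate in ["Mary,had", ",,,", 'Mary,"had, a",lamb', ""]:
        print(repr(candidate), is_legal_row(candidate))
```

The + on column_characters_A and the + on the second alternative of row are what make the empty string the one input this sketch rejects.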

Bear in mind that any number of BNF productions can describe this grammar. Let's not get confused about whether or not this is the most elegant way to express the grammar; let's just focus on the one I present.

So, what does this grammar tell us? The first production tells us that a row can have:

  • A single column with nothing after it

  • A single column followed by any combination of empty and filled columns

  • An empty first column followed by any combination of empty and filled columns

This production tells us that all the following rows are legal, assuming we use a comma as the column delimiter.


Mary,had,a,little,lamb
Mary,had
Mary,,a,,lamb
Mary,,a,,,
,had,a,little,lamb
,had,a,,
,,,
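
To see how the first production maps these rows onto column positions, here is a minimal Python sketch. It assumes that none of the rows uses a text delimiter, so a plain split on the column delimiter is enough; column_slots is a hypothetical helper name of mine. An empty string in the result stands for an empty column.

```python
ROWS = [
    "Mary,had,a,little,lamb",
    "Mary,had",
    "Mary,,a,,lamb",
    "Mary,,a,,,",
    ",had,a,little,lamb",
    ",had,a,,",
    ",,,",
]

def column_slots(row, column_delimiter=","):
    """Split a row into column slots; '' marks an empty column.

    Assumes the row contains no text-delimited columns.
    """
    if row == "":
        # The grammar requires at least one column or one column delimiter.
        raise ValueError("a row needs at least one column or one delimiter")
    return row.split(column_delimiter)

if __name__ == "__main__":
    for row in ROWS:
        slots = column_slots(row)
        filled = sum(1 for slot in slots if slot)
        print(f"{row!r}: {len(slots)} slots, {filled} filled")
```

Every delimiter opens a new slot whether or not a column follows it, which is why a row with three delimiters and no text still has four slots.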

Note that the grammar allows a row to end with empty columns or to consist entirely of empty columns. Many applications won't produce such rows. However, since we're aiming to accommodate the widest possible class of CSV files, I saw no reason to impose this restriction. This approach also makes the parsing algorithm a bit easier. But even though we'll be able to parse rows ending with empty columns, we're not going to create such rows. As discussed in Chapter 9, the grammar of EDI records (segments) doesn't allow empty fields at the end.
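
The "parse them but don't create them" rule can be sketched on the writing side as follows. This is only an illustration in Python with a hypothetical helper name, write_row, and it assumes the columns need no text delimiters; it is not the writer developed later in the book.

```python
def write_row(columns, column_delimiter=","):
    """Join columns into a row, dropping empty columns at the end.

    Assumes no column contains the column delimiter itself.
    """
    trimmed = list(columns)
    # Remove trailing empty columns so the row never ends with a delimiter.
    while trimmed and trimmed[-1] == "":
        trimmed.pop()
    # If every column was empty, the result is "", which the row grammar
    # does not accept; the caller would have to decide what to do then.
    return column_delimiter.join(trimmed)

if __name__ == "__main__":
    print(write_row(["Mary", "", "a", "", "", ""]))
    print(write_row(["", "had", "a", "", ""]))
```

Empty columns in the middle of the row are kept, since dropping them would shift the later columns out of position.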

The last three productions tell us that a column may contain any character other than the selected column delimiter or, if it is enclosed by the text delimiter in its first and last positions, may also contain the column delimiter. Again, this allows a much wider range of variations than are usually permitted by any particular application. Many applications delimit only alphanumeric columns that contain the column delimiter (usually a comma), but some delimit all columns regardless of type. Our approach allows us to accommodate nearly all cases. As we'll shortly see, this also keeps the parsing algorithm from getting too complex, since we don't need to be concerned with the data type of the column as we are parsing. Note, however, that when we convert from XML to CSV, we use the DelimitText Element of the column's grammar to determine whether or not we should delimit the column content with the text delimiter character.
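
Here is a minimal Python sketch of a row parser driven only by the delimiters, with no knowledge of column data types. It is my own illustration, not the algorithm presented later; parse_row is a hypothetical name, and it assumes single-character delimiters. It also makes two choices the grammar leaves open: a column that starts with the text delimiter is always treated as a delimited column, and an empty delimited column is tolerated even though the production requires at least one character between the text delimiters.

```python
def parse_row(row, column_delimiter=",", text_delimiter='"'):
    """Split one row into columns, honoring text-delimited columns."""
    if row == "":
        raise ValueError("a row needs at least one column or one delimiter")
    columns = []
    i, n = 0, len(row)
    while True:
        if i < n and row[i] == text_delimiter:
            # Delimited column: take everything up to the closing delimiter.
            close = row.find(text_delimiter, i + 1)
            if close == -1:
                raise ValueError("unterminated text delimiter")
            columns.append(row[i + 1:close])
            i = close + 1
            if i < n and row[i] != column_delimiter:
                raise ValueError("unexpected text after closing text delimiter")
        else:
            # Plain (possibly empty) column: runs to the next column delimiter.
            end = row.find(column_delimiter, i)
            if end == -1:
                end = n
            columns.append(row[i:end])
            i = end
        if i >= n:
            break
        i += 1  # step over the column delimiter
    return columns

if __name__ == "__main__":
    print(parse_row('Mary,"had, a",lamb'))
    print(parse_row("Mary,,a,,,"))
```

The text delimiters themselves are dropped from the returned values, so the caller sees only the column content.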
