CSVRecordReader Class (Extends RecordReader)

Overview

We can see that most of the detailed work is performed by the CSVRecordReader class. It inherits several attributes and methods from its RecordReader and RecordHandler base classes (see Chapter 6). Here is a summary of its extensions to those classes. We'll review each of the methods.

Attributes:

  • DOM Nodelist Column Grammars

  • Character or Byte Column Delimiter

  • Character or Byte Text Delimiter

  • Integer Column Number

  • Integer Grammar Index

  • Integer Parsing State

Methods:

  • Constructor

  • parseRecord

  • saveCharacter

Note: In the Java and C++ implementations we also enumerate class-wide constants for the parsing states used by the parseRecord method.

Methods

Constructor

Here is the logic for the CSVRecordReader constructor method.

Logic for the CSVRecordReader Constructor Method
Arguments:
  DOM Document File Description Document

Call RecordReader base class constructor, passing File
    Description Document
Record Terminator <- Get "RecordTerminator" Element's value
    Attribute from File Description Document
Call setTerminator to set the Record Terminator1 and
    Record Terminator2
Column Delimiter <- Get "ColumnDelimiter" Element's value
    Attribute from File Description Document
Text Delimiter <- Get "TextDelimiter" Element's value
    Attribute from File Description Document
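
To make the delimiter lookups concrete, here is a minimal Java sketch using the standard DOM API. It is not the book's implementation. The class name CsvDelimiters, the parseXml helper, and the FileDescription root element are assumptions made for illustration. Only the ColumnDelimiter and TextDelimiter element names and the value attribute come from the logic above. The base class constructor call and the record terminator handling are left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical holder for the two delimiters; stands in for the
// corresponding CSVRecordReader attributes.
final class CsvDelimiters {
    final char columnDelimiter;
    final char textDelimiter;

    CsvDelimiters(Document fileDescription) {
        columnDelimiter = firstCharOfValue(fileDescription, "ColumnDelimiter");
        textDelimiter = firstCharOfValue(fileDescription, "TextDelimiter");
    }

    // Reads the "value" attribute of the first element with the given tag name
    // and returns its first character.
    private static char firstCharOfValue(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("Missing element: " + tagName);
        }
        String value = ((Element) nodes.item(0)).getAttribute("value");
        if (value.isEmpty()) {
            throw new IllegalArgumentException("Empty value attribute on: " + tagName);
        }
        return value.charAt(0);
    }

    // Convenience for the example only: builds a DOM Document from a string.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse description document", e);
        }
    }
}
```

Given a description document containing a ColumnDelimiter element whose value attribute is a comma, new CsvDelimiters(doc).columnDelimiter would hold the comma character.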

parseRecord

The CSVRecordReader's parseRecord method is where we finally take a more rigorous approach to the grammar of a CSV row. We can borrow several techniques and approaches from compiler construction to develop a good parsing algorithm. Most of those approaches are overkill (and indeed, some programmers may think even this discussion is overkill!), but taking advantage of some of the simpler techniques can go a long way toward keeping us out of trouble. We'll also be using them in the parseRecord method we use for EDI formats, which involves a more complex grammar.

The starting point, of course, is the grammar of a CSV row. We reviewed it in BNF earlier in the section. We now need to consider more carefully the characteristics of the grammar of a row. I show it again below so that you don't have to flip back several pages.

CSV Row Grammar
row ::= column (column_delimiter column?)* |
        (column_delimiter column?)+
column ::= column_characters_A+ |
           text_delimiter column_characters_B+ text_delimiter
column_characters_A ::= All allowed characters except
                        column_delimiter
column_characters_B ::= All allowed characters except
                        text_delimiter

If we examine the grammar closely, we can see that we can completely determine the meaning of a character, that is, its place in the grammar, simply by considering the characters that precede it. We don't need to do lookahead parsing, that is, examining one or more characters that follow the current character. Our CSV row grammar belongs to a class of grammars called “regular grammars,” which describe the same languages that regular expressions do. This fact makes life a lot easier for us than it might be with more complex grammars.

One thing it means is that we can process the grammar with a fairly simple tool known as a finite state automaton. This is an abstract machine that consists of a number of states and specifies the input that causes the machine to move from one state to another. Such machines are easy to depict with state transition diagrams. Once diagrammed it is straightforward to develop a parsing algorithm. Figure 7.1 shows the state transition diagram for parsing a CSV row.

Figure 7.1. State Transition Diagram for Parsing a CSV Row


In this diagram the circles show the various states and the arrows show the characters in the row that cause the movement between states. Generally, the states correspond to the nonterminal symbols in the CSV row grammar described above, and the transitions correspond to the terminal symbols in the grammar, that is, the characters in the row. However, we have added the transitional states of New Column, Start Delimited Column, and Finish Delimited Column. These all correspond to the delimiter characters. Those familiar with finite state automata will note that this is not a fully specified state machine in two regards. First, we don't have a final accepting state; the machine simply terminates at the end of the CSV row. Second, we don't show transitions to a so-called “dump” state where we terminate due to unexpected input. For example, if we have just scanned the closing text delimiter of a delimited column and entered the Finish Delimited Column state, anything other than the column delimiter is invalid input and will move us into the dump state.
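
One way to write such a machine down is as a transition table. The Java sketch below is an illustration rather than the book's code. The type names ParseState, InputClass, and CsvTransitions are assumptions, and the transitions are taken from the parseRecord logic in this section, with ERROR standing in for the dump state.

```java
// States of the CSV row machine; ERROR plays the role of the "dump" state.
enum ParseState {
    NEW_COLUMN, REGULAR_COLUMN, START_DELIMITED, DELIMITED_COLUMN, FINISH_DELIMITED, ERROR
}

// Every input character falls into exactly one of these classes.
enum InputClass { COLUMN_DELIMITER, TEXT_DELIMITER, OTHER }

final class CsvTransitions {
    // TABLE[state.ordinal()][inputClass.ordinal()] is the next state.
    // Rows follow the declaration order of ParseState; columns follow InputClass.
    private static final ParseState[][] TABLE = {
        /* NEW_COLUMN       */ { ParseState.NEW_COLUMN, ParseState.START_DELIMITED, ParseState.REGULAR_COLUMN },
        /* REGULAR_COLUMN   */ { ParseState.NEW_COLUMN, ParseState.REGULAR_COLUMN, ParseState.REGULAR_COLUMN },
        /* START_DELIMITED  */ { ParseState.DELIMITED_COLUMN, ParseState.FINISH_DELIMITED, ParseState.DELIMITED_COLUMN },
        /* DELIMITED_COLUMN */ { ParseState.DELIMITED_COLUMN, ParseState.FINISH_DELIMITED, ParseState.DELIMITED_COLUMN },
        /* FINISH_DELIMITED */ { ParseState.NEW_COLUMN, ParseState.ERROR, ParseState.ERROR },
        /* ERROR            */ { ParseState.ERROR, ParseState.ERROR, ParseState.ERROR },
    };

    // Classifies one character and looks up the next state.
    static ParseState next(ParseState state, char c, char columnDelimiter, char textDelimiter) {
        InputClass in = (c == columnDelimiter) ? InputClass.COLUMN_DELIMITER
                : (c == textDelimiter) ? InputClass.TEXT_DELIMITER
                : InputClass.OTHER;
        return TABLE[state.ordinal()][in.ordinal()];
    }

    // Runs the machine over a whole row and reports the state it ends in.
    static ParseState run(String row, char columnDelimiter, char textDelimiter) {
        ParseState state = ParseState.NEW_COLUMN;
        for (int i = 0; i < row.length(); i++) {
            state = next(state, row.charAt(i), columnDelimiter, textDelimiter);
        }
        return state;
    }
}
```

The table only tracks states. It performs no actions, so it accepts or rejects a row but does not yet extract any column data.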

Now, to turn this grammar and the state machine into not only a parsing approach but also a processing algorithm, we only need to add the actions that are performed during each state. This is simple since we're going to perform only two actions. First, we save the input character to the DataCell object by calling the CSVRecordReader's saveCharacter method upon entering (or reentering) the Regular Column and Delimited Column states. Second, we increment the column number each time a column delimiter moves us into (or keeps us in) the New Column state. Note that the algorithm includes “other” cases for unexpected input that correspond to moving to the dump state.

Logic for the CSVRecordReader parseRecord Method
Arguments:
  None

Returns:
  Error status or throw exception

Column Number <- 1
Column Grammars NodeList <- call Row Grammar Element's
    getElementsByTagName on "ColumnDescription"
Grammar Index <- -1
Parsing State <- New Column
DO until end of input record or Parsing State is Error
  Input Character <- Next character from input record
  DO CASE of Parsing State
    New Column:
      DO CASE of Input Character
        Column Delimiter:
          Increment Column Number
          BREAK
        Text Delimiter:
          Parsing State <- Start Delimited Column
          BREAK
        other:
          Call saveCharacter
          Parsing State <- Regular Column
          BREAK
      ENDDO
      BREAK
    Regular Column:
      DO CASE of Input Character
        Column Delimiter:
          Parsing State <- New Column
          Increment Column Number
          BREAK
        other:
          Call saveCharacter
          BREAK
      ENDDO
      BREAK
    Start Delimited Column:
      DO CASE of Input Character
        Text Delimiter:
          Parsing State <- Finish Delimited Column
          BREAK
        other:
          Call saveCharacter
          Parsing State <- Delimited Column
          BREAK
      ENDDO
      BREAK
    Delimited Column:
      DO CASE of Input Character
        Text Delimiter:
          Parsing State <- Finish Delimited Column
          BREAK
        other:
          Call saveCharacter
          BREAK
      ENDDO
      BREAK
    Finish Delimited Column:
      DO CASE of Input Character
        Column Delimiter:
          Parsing State <- New Column
          Increment Column Number
          BREAK
        other:
          Parsing State <- Error
          BREAK
      ENDDO
      BREAK
  ENDDO
ENDDO
IF Parsing State = Error
  Return error
ENDIF
Return success
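
The logic above can be sketched in Java roughly as follows. This is a simplified illustration under stated assumptions, not the book's implementation. The class name CsvRowParser is hypothetical. Instead of DataCell objects and a column number, it appends each column's text to a list, and it throws an exception where the logic above sets the Error state. It also stores the text of the last column when the row ends, a step the logic above leaves to the caller.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvRowParser {
    private enum State {
        NEW_COLUMN, REGULAR_COLUMN, START_DELIMITED, DELIMITED_COLUMN, FINISH_DELIMITED
    }

    // Returns the column values in order; empty columns come back as "".
    static List<String> parse(String row, char columnDelimiter, char textDelimiter) {
        List<String> columns = new ArrayList<>();
        StringBuilder cell = new StringBuilder();   // stands in for the DataCell buffer
        State state = State.NEW_COLUMN;
        for (int i = 0; i < row.length(); i++) {
            char c = row.charAt(i);
            switch (state) {
                case NEW_COLUMN:
                    if (c == columnDelimiter) {
                        columns.add("");            // empty column: just move on
                    } else if (c == textDelimiter) {
                        state = State.START_DELIMITED;
                    } else {
                        cell.append(c);
                        state = State.REGULAR_COLUMN;
                    }
                    break;
                case REGULAR_COLUMN:
                    if (c == columnDelimiter) {
                        columns.add(cell.toString());
                        cell.setLength(0);
                        state = State.NEW_COLUMN;
                    } else {
                        cell.append(c);
                    }
                    break;
                case START_DELIMITED:               // same transitions as DELIMITED_COLUMN
                case DELIMITED_COLUMN:
                    if (c == textDelimiter) {
                        state = State.FINISH_DELIMITED;
                    } else {
                        cell.append(c);
                        state = State.DELIMITED_COLUMN;
                    }
                    break;
                case FINISH_DELIMITED:
                    if (c != columnDelimiter) {
                        throw new IllegalArgumentException(
                                "Unexpected character after closing text delimiter at index " + i);
                    }
                    columns.add(cell.toString());
                    cell.setLength(0);
                    state = State.NEW_COLUMN;
                    break;
            }
        }
        columns.add(cell.toString());               // the row's final column
        return columns;
    }
}
```

Note that a column delimiter inside a delimited column is saved as ordinary data, which is the reason text delimiters exist in the first place.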

saveCharacter

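The saveCharacter method is fairly straightforward. If we're not currently processing a column, we advance the grammar index to the column grammar whose FieldNumber matches the current column number and then create a new DataCell object. In either case we then save the input character to the DataCell's buffer.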

Logic for the CSVRecordReader saveCharacter Method
Arguments:
  Character Input Character

Returns:
  Integer GrammarIndex

IF Parsing State is not Regular Column or Delimited Column
  Increment Grammar Index
  IF Column Grammars NodeList item(Grammar Index) is null
    return error
  ENDIF
  Grammar Column Number <- Call Column Grammars NodeList
      item(Grammar Index) getAttribute on "FieldNumber", and
      convert to integer
  DO while Grammar Column Number < Column Number
    Increment Grammar Index
    IF Column Grammars NodeList item(Grammar Index) is null
      return error
    ENDIF
    Grammar Column Number <- Call Column Grammars NodeList
        item(Grammar Index) getAttribute on "FieldNumber",
        and convert to integer
  ENDDO
  ENDDO
  IF (Grammar Column Number > Column Number)
    return error
  ENDIF
  call RecordHandler's createDataCell method
ENDIF
Call DataCell Array[Highest Cell] putByte method to append
    Input Character to DataCell buffer

Except for the new DataCell derived classes we'll develop in a later section in this chapter, this wraps up the design of our utility to convert from CSV files to XML documents. We'll next go over the design of the XML to CSV converter. Believe it or not, it is quite a bit simpler.
