write in CSVRowWriter.java

As I said earlier, this is where most of the work gets done. Again, here's the DOM-relevant snippet of code with the pseudocode as comments. The passed Row Element is referred to as nRow. All the DOM work gets done in the DO loop, which is coded as a Java while loop.

From CSVRowWriter.java—write
//  Columns NodeList <- Get Row's childNodes attribute
ColumnList = nRow.getChildNodes();

//  DO until Columns NodeList.item[index] is null
while (ColumnList.item(iRowChildren) != null)
{
  // Skip the Row's Text nodes
  if ( ColumnList.item(iRowChildren).getNodeType()
    != Node.ELEMENT_NODE)
  {
    iRowChildren++;
    continue;
  }
  
  // Get a shorthand name for this guy
  nColumn = ColumnList.item(iRowChildren);

  //  Column Name <- get NodeName attribute
  sColumnName = nColumn.getNodeName();

  //  Column Number <- Derive from Column Name
  iColumnNumber = Integer.parseInt(
    sColumnName.substring(6));

  //  IF Column Number > Highest Column
  //    Highest Column <- Column Number
  //  ENDIF
  if (iColumnNumber > iHighestColumn)
    iHighestColumn = iColumnNumber;

  //  Column Array [Column Number] <- get nodeValue of
  //    item[index] firstChild Node
  sColumnArray[iColumnNumber] =
    nColumn.getFirstChild().getNodeValue();

  //  Increment index
  iRowChildren++;
//  ENDDO
}
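
To try the loop outside the full program, here is a minimal, self-contained sketch of the same idea. The class name RowColumnsSketch, the readColumns method, and the sample Row are my own assumptions for illustration; they are not taken from CSVRowWriter.java. The sketch also guards against an empty Column Element, where getFirstChild returns null, and it assumes that no column number exceeds the array size passed in.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not the book's CSVRowWriter class.
public class RowColumnsSketch {

  // Parse an XML string and return its root Element.
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Return the column values indexed by column number (slot 0 is unused).
  // Assumes child Elements are named "Column" followed by a number that
  // is no larger than iMaxColumns.
  static String[] readColumns(Element nRow, int iMaxColumns) {
    String[] sColumnArray = new String[iMaxColumns + 1];
    NodeList columnList = nRow.getChildNodes();
    for (int i = 0; i < columnList.getLength(); i++) {
      Node nColumn = columnList.item(i);
      // Skip whitespace Text nodes and anything else that is not an Element
      if (nColumn.getNodeType() != Node.ELEMENT_NODE) {
        continue;
      }
      // "Column07" -> "07" -> 7
      int iColumnNumber =
        Integer.parseInt(nColumn.getNodeName().substring(6));
      // An empty Element such as <Column03/> has no first child
      Node nText = nColumn.getFirstChild();
      sColumnArray[iColumnNumber] =
        (nText == null) ? "" : nText.getNodeValue();
    }
    return sColumnArray;
  }

  public static void main(String[] args) {
    Element nRow = parseRoot(
      "<Row>\n\t<Column01>Widget</Column01>\n\t<Column03/>\n</Row>");
    String[] columns = readColumns(nRow, 3);
    for (int i = 1; i < columns.length; i++) {
      System.out.println("Column " + i + ": " + columns[i]);
    }
  }
}
```

Bounding the loop with getLength instead of testing item(index) for null is a matter of taste; both stop at the end of the NodeList.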

WARNING

Watch out for unexpected Text Nodes with whitespace!


I need to point out one thing here. At the top of the loop we have these few lines:

// Skip the Row's Text nodes
if ( ColumnList.item(iRowChildren).getNodeType()
  != Node.ELEMENT_NODE)
{
  iRowChildren++;
  continue;
}

We need this check because the Row's NodeList had an unexpected Text Node preceding each ColumnXX Element Node. We have to skip these Text Nodes since the only children we want to process are the ColumnXX Element Nodes. This behavior was peculiar to JAXP and Xerces. I did not observe it with MSXML.

Interestingly, I observed this behavior only when processing a file that was “pretty printed,” that is, each tag started on a new line with indentation. If I processed a file with no whitespace between the tags I didn't observe it. Some debugging statements in the code verified that these were indeed Text Nodes whose contents were a newline followed by tab characters. (Thank you, XMLSPY!)
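
If you want to reproduce this observation yourself, a small experiment along the following lines should do it. This is only a sketch: the class name WhitespaceNodeDemo, its helper methods, and the two sample Row strings are assumptions made up for the illustration. It parses the same Row once without and once with whitespace between the tags, using a default non-validating JAXP parser, and counts the Text and Element children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical experiment; names and sample data are illustrative only.
public class WhitespaceNodeDemo {

  // Parse an XML string with a default (non-validating) parser
  // and return its root Element.
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Count the direct children of parent that have the given node type.
  static int countChildren(Element parent, short nodeType) {
    NodeList children = parent.getChildNodes();
    int count = 0;
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == nodeType) {
        count++;
      }
    }
    return count;
  }

  static void report(String label, String xml) {
    Element row = parseRoot(xml);
    System.out.println(label
      + ": Elements=" + countChildren(row, Node.ELEMENT_NODE)
      + ", Text nodes=" + countChildren(row, Node.TEXT_NODE));
  }

  public static void main(String[] args) {
    // No whitespace between the tags
    report("compact",
      "<Row><Column01>A</Column01><Column02>B</Column02></Row>");
    // Each tag on its own line, indented with a tab
    report("pretty",
      "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>");
  }
}
```

With the pretty-printed sample, the whitespace before each Column Element and before the closing Row tag shows up as a separate Text child of the Row.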

Another interesting thing is that calling the Row Element's normalize method did not make the Text Nodes go away or consolidate them into a single Text Node. However, if you think about it, a normalize call should not affect this behavior since, technically speaking, these are not adjacent Text Nodes. They occur between the Column Nodes, not next to each other.
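
The difference is easy to check with a short test. The following is a sketch under my own assumptions (the NormalizeDemo class, its helpers, and the sample Row are invented for the illustration). It shows that normalize leaves the whitespace Text Nodes between the Column Elements alone, but does merge two Text Nodes that really are adjacent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demonstration; names and sample data are illustrative only.
public class NormalizeDemo {

  // Parse an XML string and return its root Element.
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  // Count the direct Text children of a node.
  static int countTextChildren(Node parent) {
    NodeList children = parent.getChildNodes();
    int count = 0;
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.TEXT_NODE) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Element row = parseRoot(
      "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>");

    // The whitespace Text nodes are separated by Elements, so
    // normalize has nothing to merge at the Row level.
    System.out.println("Row Text nodes before: " + countTextChildren(row));
    row.normalize();
    System.out.println("Row Text nodes after:  " + countTextChildren(row));

    // Appending a second Text node to Column01 creates two Text nodes
    // that really are adjacent, and normalize merges those.
    Node column = row.getElementsByTagName("Column01").item(0);
    column.appendChild(row.getOwnerDocument().createTextNode("-extra"));
    System.out.println("Column01 Text nodes before: "
      + countTextChildren(column));
    column.normalize();
    System.out.println("Column01 Text nodes after:  "
      + countTextChildren(column));
  }
}
```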

To eliminate these Text Nodes, the DocumentBuilderFactory class has a method called “setIgnoringElementContentWhitespace”. (I love these verbose Java method names!) This method strips out the “ignorable” whitespace from the content of Elements that have only child Elements and no Text content. However, the parser only knows of such elements from a DTD or a schema, so you have to be validating the input for this method to have any effect. We haven't included validation yet in our code, so the setIgnoringElementContentWhitespace method wouldn't have helped us here. We will use this method later when we do validation in Chapter 5.
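
As a preview, here is a minimal sketch of what that looks like with a validating parser. The IgnoreWhitespaceDemo class, the countRowChildren method, and the sample document with its small internal DTD are assumptions of mine for the illustration, not the code we will write later. The DTD declares that a Row contains only Column Elements, which is what lets the parser treat the whitespace between them as ignorable.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demonstration; names and sample data are illustrative only.
public class IgnoreWhitespaceDemo {

  // Pretty-printed Row with an internal DTD declaring element-only content
  static final String XML =
      "<?xml version=\"1.0\"?>\n"
    + "<!DOCTYPE Row [\n"
    + "  <!ELEMENT Row (Column01, Column02)>\n"
    + "  <!ELEMENT Column01 (#PCDATA)>\n"
    + "  <!ELEMENT Column02 (#PCDATA)>\n"
    + "]>\n"
    + "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>";

  // Parse the sample with validation on and return how many child
  // nodes the Row Element ends up with.
  static int countRowChildren(boolean ignoreWhitespace) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
      DocumentBuilder builder = factory.newDocumentBuilder();
      // Turn validation errors into exceptions instead of console output
      builder.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) { }
        public void error(SAXParseException e)
          throws SAXParseException { throw e; }
        public void fatalError(SAXParseException e)
          throws SAXParseException { throw e; }
      });
      return builder.parse(new InputSource(new StringReader(XML)))
        .getDocumentElement().getChildNodes().getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Row children, whitespace kept:    "
      + countRowChildren(false));
    System.out.println("Row children, whitespace ignored: "
      + countRowChildren(true));
  }
}
```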

I suppose that the Xerces and MSXML developers could have long arguments about which parser behaves correctly, but it is of little interest to me. The bottom line is to know how your parser behaves and to code your programs accordingly. Files intended to be used for business application import/export or exchange with trading partners rarely have such “mixed content” where an Element can have both data and child Elements. However, beware of unexpected Text Nodes and be prepared to handle them. Don't always assume that a NodeList has only Elements.
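
One way to be prepared is to put the filtering in a single helper and call it wherever you walk an Element's children. The following is a minimal sketch, not part of the book's utilities: the ElementChildren class and its childElements method are names I made up, and it assumes a Java version with generics.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; names and sample data are illustrative only.
public class ElementChildren {

  // Return only the Element children of parent, in document order,
  // skipping Text, Comment, and any other node types.
  static List<Element> childElements(Node parent) {
    List<Element> elements = new ArrayList<>();
    for (Node child = parent.getFirstChild();
         child != null;
         child = child.getNextSibling()) {
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        elements.add((Element) child);
      }
    }
    return elements;
  }

  // Parse an XML string and return its root Element.
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  public static void main(String[] args) {
    Element row = parseRoot(
      "<Row>\n\t<Column01>A</Column01>\n\t<Column02>B</Column02>\n</Row>");
    for (Element column : childElements(row)) {
      System.out.println(column.getNodeName());
    }
  }
}
```

The calling code then behaves the same way whether or not the input file was pretty printed.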
