File Description Document Design

General Considerations

As I alluded to previously, in addition to specifying the grammars of our legacy non-XML formats, the file description documents also describe physical characteristics of the input files, such as record terminators, and certain characteristics of the output XML documents, such as schema URLs. Again, let's look first at the general choices we need to make about the design of these documents. In the previous section we made decisions about how our instance documents should look. Here we consider how the file description documents should look. We allowed a bit more flexibility in the instance document design than we will here, largely because of the variation in legacy file formats and because we don't want to impose restrictions where they aren't absolutely required. The file description documents are a different case: our processing code is more tightly coupled to their design, so we'll impose a bit more order to make our coding easier.

  • Naming conventions: I have chosen relatively intuitive, descriptive names for the Elements and Attributes, with no formal analysis behind them. I have also chosen to use upper camel case.

  • Specific names versus qualifier/value pairs: This is not much of a consideration for the file description documents. To the extent that it applies, I use specific names.

  • Structure: The structure follows a logical breakdown of the file and grammar characteristics. Where we have a flat grammar, as in CSV files, the XML representation is pretty flat with only three levels. More complex grammars such as EDI can have several levels.

  • Namespaces: Again, we're not using them, for most of the reasons noted in the previous section.

The other major choice that needs to be made about the general format of the file description documents is how we use Elements and Attributes. This one deserves a bit more discussion than just a bullet point. Reviewing my thought processes and what I ended up doing may help you with similar decisions in the future.

My initial inclination was to use only Elements, matching the appearance and processing approach used for the instance documents that correspond to our legacy formats. However, when dealing with the grammars, particularly those of flat files and EDI documents, I wanted to use some Attributes. I felt the grammars would be easier to work with if the nonterminal symbols, such as groups, records, and fields, were represented as Elements. There were a few properties of groups and records that I also needed to represent, but I didn't want to depict those as Elements on a peer level with the Elements representing the grammar symbols. For example, I didn't want a flat file's record identifier tag to be a sibling Element to those representing the fields in the record's grammar. That led to creating a few selected Attributes for such things. Then, as I began to design the code, it became apparent that certain field characteristics, such as column number and data type, might be more easily accessed as Attributes. This was leading me to an inconsistent design in which some parts of the document used only Elements and other parts, in particular the grammar, used only Attributes. In the end I decided to go completely the other way and use Elements only for structure and Attributes to depict all values. Basically, I decided to adopt the approach used in the W3C XML Schema language. I felt that consistency within the document was important because it would make processing the document easier. More importantly, I felt it would be easier for end users of the utilities if I followed a consistent approach for all parts of the document.
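To make the "Elements for structure, Attributes for values" decision concrete, here is a minimal sketch in Python. The Element names come from the discussion in this chapter, but the Attribute names (Tag, Name, DataType, FieldNumber) and the sample values are assumptions made up for illustration, not the actual names used by the utilities.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a file description document.
# Element names follow the text; the Attribute names and values
# (Tag, Name, DataType, FieldNumber) are illustrative assumptions.
FRAGMENT = """
<RecordDescription Tag="HDR">
  <FieldDescription Name="OrderNumber" DataType="AN" FieldNumber="1"/>
  <FieldDescription Name="OrderDate" DataType="Date" FieldNumber="2"/>
</RecordDescription>
"""

record = ET.fromstring(FRAGMENT)

# The record identifier tag is an Attribute of the record,
# not a sibling Element of the fields.
record_tag = record.get("Tag")

# Every child Element is a grammar symbol; all values are Attributes.
fields = [
    (f.get("Name"), f.get("DataType"), int(f.get("FieldNumber")))
    for f in record
]

print(record_tag)
for name, data_type, number in fields:
    print(number, name, data_type)
```

With this layout the processing code can treat every child of a record as a field without first filtering out property Elements, which is the practical benefit of keeping values out of the Element tree.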

Major Sections and Elements

The grammars of our legacy formats are described using the following basic items. The exact Element names may vary depending on the legacy grammar being described.

  • Grammar Element: In the file description document, this is the root Element of the subtree that describes the grammar. Depending on the legacy file grammar, it may have RecordDescription or GroupDescription Elements as children.

  • GroupDescription Element: This Element describes a set of records (and perhaps other groups). Its first child Element is always a RecordDescription. It may have one or more other RecordDescription or GroupDescription Elements as children. The GroupDescription Element is not used in our CSV file grammar since we support only one record type and therefore have no groups.

  • RecordDescription Element: This Element describes the structure and overall characteristics of a record. Its child Elements are FieldDescription Elements.

  • FieldDescription Element: This Element has a number of Attributes that describe characteristics of the field in the legacy format. It always specifies a name, a data type, and a number. Depending on the legacy format it may also specify information such as length, minimum or maximum length, offset, and fill character.
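The containment rules in the list above can be sketched as a small recursive check. This is only an illustration under stated assumptions: the Attribute names Name, DataType, and FieldNumber are hypothetical stand-ins for the required name, data type, and number, and the function check_grammar is not part of the utilities described in this book.

```python
import xml.etree.ElementTree as ET

# Assumed Attribute names for the required name, data type, and number.
REQUIRED_FIELD_ATTRS = ("Name", "DataType", "FieldNumber")

def check_grammar(node):
    """Return a list of rule violations found in the subtree under node."""
    problems = []
    children = list(node)

    if node.tag == "GroupDescription":
        # A group's first child must be a RecordDescription.
        if not children or children[0].tag != "RecordDescription":
            problems.append("GroupDescription must start with a RecordDescription")

    if node.tag in ("Grammar", "GroupDescription"):
        # Grammars and groups contain only records and groups.
        for child in children:
            if child.tag not in ("RecordDescription", "GroupDescription"):
                problems.append(f"unexpected {child.tag} in {node.tag}")
    elif node.tag == "RecordDescription":
        # Records contain only fields.
        for child in children:
            if child.tag != "FieldDescription":
                problems.append(f"unexpected {child.tag} in RecordDescription")
    elif node.tag == "FieldDescription":
        # Fields carry their values as Attributes.
        for attr in REQUIRED_FIELD_ATTRS:
            if attr not in node.attrib:
                problems.append(f"FieldDescription missing {attr}")

    for child in children:
        problems.extend(check_grammar(child))
    return problems

# Hypothetical grammar with one group holding two record types.
GOOD = """
<Grammar>
  <GroupDescription>
    <RecordDescription>
      <FieldDescription Name="OrderNumber" DataType="AN" FieldNumber="1"/>
    </RecordDescription>
    <RecordDescription>
      <FieldDescription Name="ItemCode" DataType="AN" FieldNumber="1"/>
    </RecordDescription>
  </GroupDescription>
</Grammar>
"""

print(check_grammar(ET.fromstring(GOOD)))
```

A CSV grammar would pass the same check with RecordDescription as the direct child of Grammar, since no GroupDescription is involved.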

EDI grammars add another layer, but we'll talk about that in Chapter 9.
