Monday, October 8, 2012

Yet another text parser.

Text Parsing using XSD grammar.

 

Text parsing has many well-known solutions, such as a Java program with regular expressions or one of the available lexical parsers. If the parsed text also has to be converted into XML, however, we end up with two steps: first parse the text into tokens, then build a DOM document and bind it to Java types according to some XSD.
Since the XSD is part of this cycle anyway, another solution is to use the XSD itself to parse the text. This post explains the XSD parser solution and how it differs from the other known solutions.

XSD parser 

 

An XSD is an XML document that describes a structure of nested elements, so to parse text into a structure according to some rules you can use the XSD to define the desired structure. But where do you define the rules? An example makes the problem concrete. CIMP is a complex text format that airlines use to exchange flight manifests describing shipment contents. The following message, called FFM, holds manifest information:

Message Lines                                 Comment
FFM/8                                         Message identifier and version
1/CV1234/01APR/LUX                            Message header containing flight information
DXB                                           Destination
172-00122474CDGDXB/T4K800MC4.8/CLOTHING       Consignment details
172-00123012HAMAUH/S1K25MC0.1T3/BOOKS         Consignment details
ULD/AKE01063CV                                Unit load device (ULD)
172-00123012HAMAUH/S2K50MC0.2T3/BOOKS         Consignment details
172-00123060STRDXB/T5K200MC1.2/MACHINE PARTS  Consignment details
172-00123023LHRDXB/T3K30MC0.13/COMPACT DISCS  Consignment details
ULD/AKE01070CV                                Unit load device (ULD)
172-00123071STRDXB/T80K300MC0.7/WATCHES       Consignment details
HKG                                           Destination
172-00123034LHRHKG/T10K400MC1.2/GARMENTS      Consignment details
ULD/PMC12345CV                                Unit load device (ULD)
172-00123056LUXHKG/T5K2500MC15/PORCELAINE     Consignment details
CONT                                          Manifest Complete Indicator


The sample above does not contain all of the optional lines that the IATA manual describes for this message.
According to the IATA manual, the message translates to the following structure; only a partial representation of the main elements is shown:

--Identifier
----Message Id
----Version
--Header
----MessageSequenceNumber
----FlightIdentification
------CarrierCode
--Details
----DestinationHeader
------Destination
--------PointOfUnloading
------BulkLoadedCargo
------ULDLoadedCargo
--CompleteIndicator
--EndOfMessage

The message lines and elements have neither a fixed length nor a clear separator; instead they follow different patterns. For example, the flight number follows nnn(n)(a), which means three digits followed by an optional digit and an optional character. Lines and elements can be mandatory or optional, with a multiplicity of 0, 1, a specific number, or unbounded. An entire section can repeat (like DestinationHeader), a line can repeat (like the Destination line), and elements inside a line can repeat.
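As a rough illustration of how such a pattern notation maps onto a regular expression, the sketch below translates nnn(n)(a) by hand. The class and method names are made up for this example, and restricting the optional character to an uppercase ASCII letter is an assumption, not something taken from the IATA manual.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser described in this post.
class FlightNumberPattern {
    // nnn(n)(a): three digits, an optional fourth digit, an optional letter.
    // Assumption: the optional character is an uppercase ASCII letter.
    static final Pattern FLIGHT_NUMBER = Pattern.compile("\\d{3}\\d?[A-Z]?");

    static boolean isFlightNumber(String token) {
        // matches() requires the whole token to fit the pattern
        return FLIGHT_NUMBER.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"123", "1234", "123A", "1234A", "12", "12345"}) {
            System.out.println(token + " -> " + isFlightNumber(token));
        }
    }
}
```

Optional parts of the notation become optional quantifiers, which is why a fixed-width or delimiter-based split would not be enough here.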
As mentioned earlier, the message structure can be represented as an XSD. The following is the structure for the first line of the message:
<!-- Manifest Details START -->
<xsd:element name="FFMMessage" type="ffm:FFMMessage" />
<xsd:complexType name="FFMMessage">
  <xsd:sequence>
    <xsd:element name="MesssageIdentifier" type="ffm:MesssageIdentifier" />
    <!-- Rest of elements go here -->
  </xsd:sequence>
</xsd:complexType>
<!-- MESSAGE IDENTIFIER START -->
<xsd:complexType name="MesssageIdentifier">
  <xsd:sequence>
    <xsd:element name="StandardMessageIdentifier">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string" />
      </xsd:simpleType>
    </xsd:element>
    <xsd:element name="MessageTypeVersionNumber">
      <xsd:simpleType>
        <xsd:restriction base="xsd:integer" />
      </xsd:simpleType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>
<!-- MESSAGE IDENTIFIER END -->

Now we know that the text starts with a message identifier line, and that the message identifier has StandardMessageIdentifier and MessageTypeVersionNumber elements. The XSD describes the multiplicity of each element and whether it is mandatory or optional, but how do we convert the “FFM/8” line into two child elements? In other words, where is the grammar defined? This is a typical use case for regular expressions: the “FFM/8” line can be matched with the pattern (^[A-Z]{3}\b)/(\d{1,3}$).
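The pattern can be tried on the sample line with plain java.util.regex. This is only a small sketch; the class and method names are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only; the class name is hypothetical.
class MessageIdentifierLine {
    // The pattern quoted in the text, with backslashes escaped for a Java string.
    static final Pattern LINE = Pattern.compile("(^[A-Z]{3}\\b)/(\\d{1,3}$)");

    // Returns {identifier, version}, or null when the line does not match.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("FFM/8");
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

The two capture groups line up with the two child elements, StandardMessageIdentifier and MessageTypeVersionNumber, in declaration order.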
But where do we specify this pattern for the parser program? One useful XSD feature is the application information annotation (xsd:appinfo). The annotation acts as input to the parser program: it tells the parser how the line should be parsed, and the parsed tokens are then assigned to the child elements of the current element (line).
Here is the XSD after adding the appinfo annotation:
<!-- Manifest Details START -->
<xsd:element name="FFMMessage" type="ffm:FFMMessage" />
<xsd:complexType name="FFMMessage">
  <xsd:sequence>
    <xsd:element name="MesssageIdentifier" type="ffm:MesssageIdentifier">
      <xsd:annotation id="MesssageIdentifier.rule">
        <xsd:appinfo>
          <xsd:pattern>
            <![CDATA[(^[A-Z]{3}\b)/(\d{1,3}$)]]>
          </xsd:pattern>
          <xsd:type>line</xsd:type>
        </xsd:appinfo>
      </xsd:annotation>
    </xsd:element>
    <!-- Rest of elements go here -->
  </xsd:sequence>
</xsd:complexType>
<!-- MESSAGE IDENTIFIER START -->
<xsd:complexType name="MesssageIdentifier">
  <xsd:sequence>
    <xsd:element name="StandardMessageIdentifier">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string" />
      </xsd:simpleType>
    </xsd:element>
    <xsd:element name="MessageTypeVersionNumber">
      <xsd:simpleType>
        <xsd:restriction base="xsd:integer" />
      </xsd:simpleType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>
<!-- MESSAGE IDENTIFIER END -->

The annotation gives the parser program two pieces of information: the pattern to use when parsing the current element, and the fact that the current element represents a new line. Elements are not limited to lines; other types can be repeated sections or groups of fields inside a line.
Next we can write a parser that reads the XSD and uses it to parse the incoming feed messages. We will use the XSOM APIs to read the XSD file, as in the following snippet:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import com.sun.xml.xsom.parser.XSOMParser;

// AnnotationFactory is the application's own annotation parser factory (not shown).
String schemaFile = "ffm4.xsd";
XSOMParser parser = new XSOMParser();
parser.setAnnotationParser(new AnnotationFactory());
LOGGER.debug("Loading schema " + schemaFile);
// ClassPathResource is assumed to be Spring's org.springframework.core.io.ClassPathResource.
parser.parse(new BufferedReader(new InputStreamReader(new ClassPathResource(schemaFile).getInputStream())));
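The AnnotationFactory class is not shown in this post. As a hedged sketch of what it has to do, the SAX handler below collects the children of xsd:appinfo into a name-to-text map, using only JDK classes. The class name AppInfoCollector and the collect helper are hypothetical, and the wiring into XSOM's annotation parser hook is deliberately left out.

```java
import java.io.StringReader;
import java.util.HashMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the direct children of appinfo as name -> text.
class AppInfoCollector extends DefaultHandler {
    private final HashMap<String, String> values = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private int depthInAppInfo = -1; // -1 means "outside appinfo"

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depthInAppInfo >= 0) {
            depthInAppInfo++;      // entering a descendant of appinfo
            text.setLength(0);
        } else if ("appinfo".equals(localName)) {
            depthInAppInfo = 0;    // entering appinfo itself
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depthInAppInfo > 0) {
            text.append(ch, start, length); // CDATA content arrives here as well
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depthInAppInfo == 1) {
            values.put(localName, text.toString().trim()); // e.g. pattern, type
        }
        if (depthInAppInfo >= 0) {
            depthInAppInfo--;
        }
    }

    HashMap<String, String> getValues() {
        return values;
    }

    // Convenience entry point for trying the handler on an XML fragment.
    static HashMap<String, String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            AppInfoCollector handler = new AppInfoCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("Could not read annotation", e);
        }
    }
}
```

Fed the annotation from the XSD above, the map would hold the regular expression under "pattern" and "line" under "type", which is the shape the schema navigation code expects when it casts the annotation to a HashMap.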

Then we can navigate the schema and read the elements and their annotations, as in the following snippet:
XSSchemaSet result = parser.getResult();
XSSchema schema = result.getSchema(NAME_SPACE_PREFIX + "ffm");
Map<String, XSElementDecl> elements = schema.getElementDecls();
for (XSElementDecl e : elements.values()) {
    if (e.getAnnotation() == null) {
        continue; // this element carries no parsing rule
    }
    // Read the appinfo for the element and use it in parsing.
    // The cast relies on AnnotationFactory producing a HashMap<String, String>.
    @SuppressWarnings("unchecked")
    HashMap<String, String> annotation = (HashMap<String, String>) e.getAnnotation().getAnnotation();
    String linePattern = annotation.get(PATTERN);
    String type = annotation.get(TYPE);
}

Now we can use the regular expression APIs to parse the line and assign the matched tokens to the child elements:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern compiledRegex = Pattern.compile(linePattern);
Matcher matcher = compiledRegex.matcher(currentLine);
This is not a complete parser solution, but it should give you an idea of how to use the XSD in one. The next section looks at how it differs from the other options.
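To make the "assign the matched tokens to child elements" step concrete, here is a minimal sketch that turns one matched line into a DOM element. LineBinder and its bind method are hypothetical names; in the XSD-driven design the child names would come from the complex type's sequence, whereas here they are simply passed in as a list.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: maps capture group i to the i-th child element name.
class LineBinder {
    static Element bind(String elementName, List<String> childNames,
                        String linePattern, String currentLine) {
        Matcher matcher = Pattern.compile(linePattern).matcher(currentLine);
        if (!matcher.matches() || matcher.groupCount() != childNames.size()) {
            throw new IllegalArgumentException(
                "Line does not fit " + elementName + ": " + currentLine);
        }
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element parent = doc.createElement(elementName);
        doc.appendChild(parent);
        for (int i = 0; i < childNames.size(); i++) {
            Element child = doc.createElement(childNames.get(i));
            child.setTextContent(matcher.group(i + 1)); // capture groups are 1-based
            parent.appendChild(child);
        }
        return parent;
    }
}
```

Calling bind with "MesssageIdentifier", the two child names, the pattern from the annotation and the line "FFM/8" yields an element with two children holding "FFM" and "8". A fuller engine would also have to handle repetition and optional groups, which this sketch ignores.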

Other parsing options

  • Lexical parser: use a lexical recognizer or compiler-compiler to generate Java parser code. Well-known examples are ANTLR and JavaCC. The generator is fed the message grammar in a domain-specific language and generates the code needed to parse the input text into tokens.

  • Java with regular expressions: a parser is written by hand to process the incoming text message. It loops over the lines, matches the patterns, checks for mandatory lines, loops over repeated sections, lines, and elements, fills Java objects with the parsed tokens, and performs the validation.
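For contrast, a hand-written parser of the second kind might start out like the sketch below. It is purely illustrative: the class name is hypothetical, and the line shapes are guessed from the sample message shown earlier, not taken from the IATA manual.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hand-written parser: the message structure lives in the code.
class HandWrittenFfmParser {
    // Counts consignment lines per destination, in order of appearance.
    static Map<String, Integer> consignmentsPerDestination(List<String> lines) {
        // The mandatory first line is enforced with an explicit check.
        if (lines.isEmpty() || !lines.get(0).matches("[A-Z]{3}/\\d{1,3}")) {
            throw new IllegalArgumentException("Missing mandatory message identifier line");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        String destination = null;
        for (String line : lines.subList(1, lines.size())) {
            if (line.matches("[A-Z]{3}")) {
                // Destination line such as DXB starts a new repeated section.
                destination = line;
                counts.put(destination, 0);
            } else if (line.matches("\\d{3}-\\d{8}.*")) {
                // Consignment line such as 172-00122474CDGDXB/...
                if (destination == null) {
                    throw new IllegalArgumentException("Consignment before destination: " + line);
                }
                counts.merge(destination, 1, Integer::sum);
            }
            // Header, ULD and CONT lines are skipped in this sketch.
        }
        return counts;
    }
}
```

Every rule about ordering, repetition, and mandatory lines sits in an "if" or a loop, so a new message type or version means writing and maintaining another class like this one.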

Based on the following comparison, the XSD-based solution was selected to parse and translate the incoming messages:


Number of parsers
  • Java with regular expressions: a new parser for each message version or type.
  • XSD with regular expressions: one engine with multiple input XSDs for different messages.
  • Lexical compilers: a code-generated parser for each message.

Standard APIs
  • Java with regular expressions: uses the standard java.util.regex library.
  • XSD with regular expressions: uses the standard XML processing and regular expression libraries.
  • Lexical compilers: use a domain-specific language to define the grammar and generate complex code to parse the message.

Message structure representation
  • Java with regular expressions: the message structure is hardcoded in the various “if” conditions and “for/while” loops of the parser code.
  • XSD with regular expressions: the message structure is clear and represented as a nested schema with defined MOC and multiplicity options.
  • Lexical compilers: the message structure is represented in the domain-specific language of the grammar.

Ease of adding a new message or a new message version
  • Java with regular expressions: a new parser has to be developed for new messages and versions.
  • XSD with regular expressions: write a new XSD for the message, or extend a common one and change the types for different versions.
  • Lexical compilers: define the grammar for the new message and generate a new parser.

Message binding and validation
  • Java with regular expressions: new validation code has to be developed, and the binding to Java objects hardcoded, from scratch.
  • XSD with regular expressions: the JAXB APIs can handle both validation and binding based on the schema, so no code needs to be written.
  • Lexical compilers: code for binding and validation has to be developed from scratch.

Available interfaces
  • Java with regular expressions: only text input is accepted.
  • XSD with regular expressions: the application can offer potential clients an XML-based interface, through a queue or a web service, besides the text format.
  • Lexical compilers: only the text format.

Maintenance effort
  • Java with regular expressions: different parsers with no common design have to be maintained, which makes them hard to understand and change. Regular expression skills are easy to grasp.
  • XSD with regular expressions: only a regular XSD has to be maintained, which is easy to change. XSD skills are easy to find among developers.
  • Lexical compilers: the grammar is not part of the application and has to be maintained in a known location; it can easily go out of sync with the parser, and the generated code is very hard to understand or change. A domain-specific language and its utilities have to be learned to change a parser or create a new one.

