Deletion
Filter in a Filter – packaging XML filters within Java I/O streams
Replacement
Modification
Notification
Listing 1: Verbose Content Handler
Tool set documentation
Listing 2: VerboseSAXFilter test program
Listing 3: ProgrammableSAXFilter demo
Listing 4: NameList program
This article describes a framework for developing efficient XML processing “pipelines” in Java, based on the standard SAX (Simple API for XML) interfaces. The key concept is the filter: an object that receives input through one interface and delivers output to another object that implements that same interface. In this way, the filter can analyze or modify data before passing it on. The Programmable SAX Filter utilizes a toolkit approach in which simple, extensible tools can be “plugged in” to perform specific tasks on the data stream. This allows processing to be streamlined or focused, and minimizes overhead, especially when compared to approaches such as XSLT (Extensible Stylesheet Language Transformations) or DOM (Document Object Model), which need to create an object representation of the XML before processing can begin. Using this toolkit, one can create efficient, event-driven, multi-functional XML processing networks tailored to specific problems.
The basic interfaces involved in SAX filtering are defined in the org.xml.sax package: the XMLReader and ContentHandler interfaces and their derived interfaces and classes. The former defines the interface needed by an XML source such as a SAX parser, and the latter the interface needed by an XML receiver such as a DOM object builder or an XSLT transformer. When these two interfaces are combined through Java multiple-interface inheritance, as is done by the base class org.xml.sax.helpers.XMLFilterImpl, XML filtering becomes possible. This base class implements the simplest possible filtering strategy: do nothing to the data, just pass it on. By extending it, we can develop a filter that does just about anything we want to the XML stream, or that triggers other actions based on the content flowing through the SAX interfaces. Furthermore, because our filter implements standard interfaces, it can be plugged into other XML processing systems such as XSLT transforms or DOM builders.
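To make the idea of extending the pass-through base class concrete, here is a minimal sketch. It is not part of the toolkit described in this article: the class names UpperCaseFilter and UpperCaseFilterDemo are made up for illustration, and the parser is assumed to be whatever SAXParserFactory supplies on your JDK. The filter overrides a single method and inherits pass-through behaviour for everything else.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example filter: upper-cases character data and lets
// XMLFilterImpl pass every other SAX event through unchanged.
class UpperCaseFilter extends XMLFilterImpl {
    UpperCaseFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy of the slice; the parser's own buffer is left alone.
        char[] upper = new String(ch, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

class UpperCaseFilterDemo {
    // Runs the XML through the filter and returns the text seen downstream.
    static String filterText(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        UpperCaseFilter filter = new UpperCaseFilter(parser);

        final StringBuilder text = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The filter looks like an XMLReader to its caller and like a
        // ContentHandler to the parser it wraps.
        filter.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filterText("<greeting>hello, <b>sax</b></greeting>"));
    }
}
```

Because the filter is itself an XMLReader, the caller cannot tell it apart from a parser, which is what allows filters to be stacked.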
Event driven design: events and actions
Filters can perform three fundamental operations on data: addition, modification or deletion of information. To do this intelligently, these actions must be triggered by certain events or contexts in the data stream. This allows selective rather than global manipulation of information. Therefore, for a general purpose implementation, we need to abstract the concept of event detection within an XML stream to enable different detectors to use simple or complex criteria – or in other words to define for themselves what an ‘event’ is by their implementation of a standard ‘detector’ interface. The programmable filter serves as a base for these detectors to receive SAX messages. Once an event is detected however, it can be coupled to some action on the XML data, or it can serve as the source of an event notification system whereby other objects are alerted to the arrival of this data. As described below, one effective strategy is to use Java Reflection to couple an XML event to the invocation of a method on a Java Object. Another is to build an Event Source / Listener system using standard Java techniques.
SAX filtering vs XSLT transforms:
SAX filtering represents one way of modifying an XML tree. In this sense it is similar to XSL Transformation (XSLT). However, as is almost always the case, each of these technologies has strengths and weaknesses that make them suitable for a different set of tasks. The advantage of SAX filtering is that it is fast and in many situations requires little or no memory management overhead (i.e. object creation overhead plus garbage collection). Another advantage of SAX filtering is that the XML transformations can be done entirely in Java. The main disadvantage of SAX filtering as compared to XSL transformation is the degree to which the XML tree structure can be altered. With XSLT, substantial reorganization is possible, whereas with SAX filtering the basic tree structure remains: changes are limited to pruning limbs or nodes, adding nodes or modifying them. Thus for some jobs, SAX filtering is inadequate whereas for others, XSLT is overkill. However, if the XSLT transform package adheres to the SAX2 interfaces (as the latest Java XSLT implementations such as Sun’s JAXP and Apache’s Xalan do), the two technologies can be easily combined.
Introduction to SAX interfaces:
The SAX interfaces allow a source object (typically a parser) to notify ‘listener’ objects about ‘events’ in an XML character stream. This is a “notification” or “publish-and-subscribe” design pattern in which producers implement the org.xml.sax.XMLReader interface and consumers implement ContentHandler, DTDHandler, EntityResolver and ErrorHandler. (A Locator, by contrast, is supplied by the producer so that a consumer can ask where in the document an event occurred.) XMLFilter is an extension of XMLReader that enables a source to be connected to a “parent” source, thus enabling SAX processors to be chained. SAX is, as its name (Simple API for XML) implies, simple. However, its simplicity derives from the inherent simplicity of XML, and should not be equated with lack of usefulness or power. In fact, it is fair to say that these interfaces form the basis for most current Java-based XML implementations. Further, their main power comes from the freedom derived by using Java interfaces. SAX interfaces define how XML is translated into object-to-object messages. Because interfaces do not define implementation, what can be done with these messages is virtually unlimited.
To understand how SAX processing works, it is useful to examine how an XML source, XMLReader and SAX event consumer are connected. As shown below, the XMLReader object must implement these methods (as well as others not shown here):
public interface org.xml.sax.XMLReader
{
// other methods not shown …
public void setContentHandler( org.xml.sax.ContentHandler handler );
public org.xml.sax.ContentHandler getContentHandler( );
// more methods…
public void parse( org.xml.sax.InputSource xmlSource )
throws java.io.IOException, org.xml.sax.SAXException;
}
Then, as in a typical notification pattern, one constructs the source and consumer objects, sets up the “subscription” – in this case by calling setContentHandler( ) on the source, passing the consumer as a parameter – and then commands the source to begin processing by calling its parse( ) method. When this method returns, the job is done. What has happened is that the XMLReader has detected structural patterns in the XML source, and notified its ContentHandler of these events. It is then up to the ContentHandler implementation to use this information to do whatever it is designed to do. An examination of the ContentHandler interface will illuminate this. It shows that XMLReader breaks the XML into four main hierarchical levels: document, element, attribute and character data. The single document contains one or more elements, elements can contain attributes, character data and other elements. Aside from some other methods to handle namespaces, processing instructions, skipped entities, ignorable whitespace and useful objects called “Locators”, that’s it. The code below shows the five main methods in ContentHandler – where most of the work is done.
public interface org.xml.sax.ContentHandler
{
public void startDocument( ) throws SAXException;
public void endDocument( ) throws SAXException;
public void startElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName,
Attributes atts ) throws SAXException;
public void endElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName) throws SAXException;
public void characters(char[] ch, int start, int length)
throws SAXException;
// other methods
// . . .
}
These five methods provide the main processing functionality of the interface. The first two, startDocument() and endDocument( ) signal the beginning and end of an XML document and are called once during XMLReader’s execution of parse( ). The next two, startElement() and endElement() are called when an XML tag is encountered and when an end tag is reached respectively. If the element does not contain any children, e.g.
To see how all of this works, we can write a diagnostic object that implements the org.xml.sax.ContentHandler interface by printing out what happens as a document is parsed. The code for VerboseContentHandler is shown in Listing 1. To get things going, we need a SAX parser. The Saxon parser or Apache’s Xerces parser are good choices because they are free, fairly solid and up to date on standards, although any parser that implements the SAX2 XMLReader interface will do. Xerces can be obtained from http://xml.apache.org/xerces-j/. Saxon is available at http://saxon.sourceforge.net/. Once we have a parser we can write a program to observe how it works. One interesting thing to note is that org.xml.sax.InputSource, which XMLReader uses to get its input, links one set of Java standards (org.xml.sax) with another – the java.io package. This gives us the ability to insert our XML filtering code into any I/O stream (See Filter in a Filter section). Our test program takes an XML filename as input and prints out the SAX events to system output. A useful enhancement to the VerboseContentHandler might be to give its constructor a java.io.PrintStream to print to so that we can point the output anywhere we want. Listing 2 shows a test program (ContentHandlerPrinter.java) that uses the VerboseContentHandler to print SAX events for an XML file. Running this program with the following input:
<patriarch>
<child name="number One">This child has character.
<grandChild name="uno"/>
<grandChild name="two"><greatGrandChild/></grandChild>
</child>
<child name="number two" nickname="black sheep"/>
</patriarch>
we get the following output:
---- setDocumentLocator: locator = org.apache.xerces.readers.DefaultEntityHandler@4901
---- startDocument() called.
---- startElement:
uri = ''
localName = 'patriarch'
qName = 'patriarch'
---- characters = '
'
---- startElement:
uri = ''
localName = 'child'
qName = 'child'
Attributes:
LocalName QName Type URI Value
name name CDATA number One
---- characters = 'This child has character.
'
. . . etc.
If you examine the output, you can get some idea about how this parser works. To get more insights, we can ‘break’ the code a little. Our code for the characters() method looks like:
public void characters( char[] ch, int start, int length )
throws SAXException
{
if (length > 0)
{
String str = new String( ch, start, length );
System.out.println( "\ncharacters = '" + str + "'");
}
}
If instead, we ignore the start and length parameters:
public void characters( char[] ch, int start, int length )
throws SAXException
{
if (length > 0)
{
System.out.println( "\ncharacters = '"
+ new String( ch ) + "'");
}
}
we get output like this:
---- startElement:
uri = ''
localName = 'child'
qName = 'child'
Attributes:
LocalName QName Type URI Value
name name CDATA number two
nickname nickname CDATA black sheep
---- characters = '<patriarch>
<child name="number One">This child has character.
<grandChild name="uno"/><grandChild name="two"><greatGrandChild/></grand
Child>
</child>
<child name="number two" nickname="black sheep"/>
</patriarch> '
---- endElement
uri = ''
localName = 'child'
qName = 'child'
---- characters = '<patriarch>
<child name="number One">This child has character.
<grandChild name="uno"/><grandChild name="two"><greatGrandChild/></grand
Child>
</child>
<child name="number two" nickname="black sheep">
</child>
</patriarch> '
. . . etc. etc.
With this parser, the same char array is sent to the ContentHandler on every call to characters() – it holds the entire character string of the XML being parsed. (This is a property of the parser, not of SAX: the interface only promises that the characters from start to start + length - 1 are valid for the duration of the call, and a handler must not read outside that range.) The start and length parameters allow the SAX parser to “walk” this array without having to reallocate memory and copy out string segments every time it notifies the ContentHandler of character data – the handler is directed to focus its attention between the boundaries it is given. This is one design choice that allows SAX processing to be fast, because memory allocation and garbage collection are among the major performance drags in Java programs. Note also that with attributes, the SAX parser is forced to fill an Attributes object with strings before calling startElement(). This suggests that although using attributes in XML makes it cleaner, there could be a tradeoff in processing speed. That is, instead of using this structure:
<child name="John" age="23" birthday="3/22/1979"/>
use of the seemingly less efficient:
<child><name>John</name><age>23</age><birthday>3/22/1979</birthday></child>
avoids the need for the SAX parser to create/destroy Attributes objects. The question is, which is faster? In my own tests, the answer seems to be that especially with low memory conditions, the use of the non-attribute version can be up to 20 or 30% faster - even though it places an initially larger burden on memory! Another example of the truth that it is always wise to check under the hood before betting on performance.
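Whichever layout is chosen, a handler that reads element text should touch only the range it is handed, and should be prepared for the text of one element to arrive in more than one characters() call. The following is a minimal sketch of that habit, assuming the JDK's built-in SAX parser; NameTextCollector is a hypothetical class written for this illustration, and it collects the text of the name elements from the element-based layout shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the trimmed text of every <name> element.
class NameTextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> names = new ArrayList<>();
    private boolean inName = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            inName = true;
            buffer.setLength(0); // start a fresh accumulation
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only ch[start .. start + length - 1] belongs to this event.
        // Appending (rather than assigning) copes with text split over calls.
        if (inName) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(buffer.toString().trim());
            inName = false;
        }
    }

    List<String> getNames() {
        return names;
    }

    // Convenience wrapper: parse a string and return the collected names.
    static List<String> collect(String xml) throws Exception {
        NameTextCollector handler = new NameTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getNames();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(
                "<child><name>John</name><age>23</age><birthday>3/22/1979</birthday></child>"));
    }
}
```

The one unavoidable copy happens in buffer.append(); everything the handler is not interested in is skipped without any allocation.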
A slight enhancement to our VerboseContentHandler that will nest our output so that we can see the relationship of startElement and endElement calls, underscores the point made earlier that the ContentHandler is responsible for keeping track of its position in the XML tree. A simple way to do this is to have a counter that is incremented on startElement( ) and decremented on endElement( ) and use it to generate indents:
private int indent;
public void startElement( java.lang.String uri,
java.lang.String localName,
java.lang.String qName, Attributes attrs )
throws SAXException
{
++this.indent;
System.out.println( "\n" + getIndent() + "---- startElement: " );
// print out the rest . . .
}
public void endElement( java.lang.String uri,
java.lang.String localName,
java.lang.String qName )
throws SAXException
{
System.out.println( "\n" + getIndent() + "---- endElement: " );
// . . . print out the rest . . .
--this.indent;
}
A simple method for generating the indent chops the required number of spaces from one long string of spaces (there are other ways to do this, of course):
// a long run of spaces to chop indents from; lengthen it for deeply nested documents
private String spaceStr = "                                                  ";
private String getIndent( )
{
// clamp so that very deep nesting cannot run past the end of spaceStr
return spaceStr.substring( 0, Math.min( indent*2, spaceStr.length() ) );
}
Adding some more formatting to this mix gives nicer output like this:
---- startDocument() called.
---- startElement:
| uri = ''
| localName = 'patriarch'
| qName = 'patriarch'
---- characters = '
'
---- startElement:
etc... etc...
---- startElement:
| uri = ''
| localName = 'greatGrandChild'
| qName = 'greatGrandChild'
---- endElement
| uri = ''
| localName = 'greatGrandChild'
| qName = 'greatGrandChild'
---- endElement
| uri = ''
| localName = 'grandChild'
| qName = 'grandChild'
---- characters = '
'
---- endElement
| uri = ''
| localName = 'child'
| qName = 'child'
---- characters = '
'
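For readers who want to try the depth-counting idea without the full VerboseContentHandler, here is a cut-down, self-contained sketch. It is a variation rather than the code of Listing 1: the class name IndentingHandler is invented, it prints only element names, it builds its output in a StringBuilder instead of writing to System.out, and it assumes the JDK's built-in SAX parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: prints start/end events indented by nesting depth.
class IndentingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // number of elements currently open

    private void line(String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        line("start " + qName);
        depth++; // children are one level deeper
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--; // back to the level of the matching start tag
        line("end " + qName);
    }

    // Parses the XML string and returns the indented trace.
    static String trace(String xml) throws Exception {
        IndentingHandler handler = new IndentingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(trace("<patriarch><child><grandChild/></child><child/></patriarch>"));
    }
}
```

The parser supplies no depth information of its own; the counter is the only thing that pairs each end event with its start event in the output.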
Design of the Programmable SAX Filter
The ProgrammableSAXFilter utilizes a modular design consisting of a main “hub” which connects input to output and a set of tools that can perform various operations on the XML data. Since these tools are based on several standard interfaces and base classes, the tool set can be extended without changes to the main filter object. ProgrammableSAXFilter extends org.xml.sax.helpers.XMLFilterImpl and implements the ContentHandler methods by dispatching SAX events to the tools which analyze, and possibly modify the events. The cumulative results are then passed to the next ContentHandler in the chain.
The toolset covers five main areas: 1) deletion of elements, attributes or character data, 2) replacement of elements, 3) addition of child elements, attributes or character data, 4) modification of attributes or character data, and 5) notification. These functions are to some extent mutually exclusive. The filter imposes a precedence order of deletion – replacement – addition/modification/notification. If an element is being deleted, it is not seen by replacers or modifiers. Likewise, if an element is being replaced, it is not seen by modifiers. As described above, these functions are performed by specialized objects derived from several interfaces and base classes. The first of these, ElementComparator is used to determine if a particular SAX event matches the criteria for a given tool, allowing modifications to be targeted at specific elements. The interface enables both the element tag and any associated character data to be checked. Two simple implementations of this interface are TagComparator and PathComparator which match the element name or path of a tag against a template. ElementComparator serves two main purposes. First, its implementers are used to mark tags for deletion and replacement, and second, the interface is implemented by the ElementModifier base class whose subclasses are responsible for notification, addition and modification of elements. (A complete list of the tools and their capabilities can be found in the Appendix.)
Demonstration of the Programmable SAX Filter
The sample program in Listing 3 shows how to create a ProgrammableSAXFilter and set it up to process an XML file. The program uses another tool: SAXOutputter, which implements ContentHandler by turning SAX events back into XML. As shown in the “Filter in a Filter” section, which describes implementations of java.io Input- and OutputStream filters, SAXOutputter is a useful tool for maintaining an XML stream. It is also useful for diagnostic purposes. As shown in the sample program, it can be hooked to the filter’s output by passing it to setContentHandler( ). It can also be hung “off the side” – i.e., at intermediate points in a filter chain, by adding it to the ProgrammableSAXFilter’s list of ContentHandler “listeners”. Another diagnostic tool is the VerboseSAXFilter which reports SAX events to a java.io.PrintStream. As a filter, it can be placed anywhere within a SAX filter chain. These two tools allow you to monitor the flow of filter inputs (SAX events) and outputs (XML) at any point in a pipeline.
The program takes an XML input file, adds an incrementing “Number” attribute to “ChildTag” elements, removes “Dog” tags, and replaces all instances of the string “nuculer” with “nuclear” in any character data section. Then, given an input like this:
<AnXMLDocument>
<ChildTag><Dog>Spot</Dog>This child says nuculer instead of nuclear.</ChildTag>
<ChildTag><Dog>Fido</Dog>This child does not say nuculer.</ChildTag>
<ChildTag>This child works at a nuculer power plant.<Dog>Fred</Dog></ChildTag>
<ChildTag>This child is President of the United States</ChildTag>
</AnXMLDocument>
The program produces the following output.
<AnXMLDocument>
<ChildTag Number="0">This child says nuclear instead of nuclear.</ChildTag>
<ChildTag Number="1">This child does not say nuclear.</ChildTag>
<ChildTag Number="2">This child works at a nuclear power plant.</ChildTag>
<ChildTag Number="3">This child is President of the United States</ChildTag>
</AnXMLDocument>
The Toolkit: Deletion, Replacement, Modification and Notification
A list of ElementComparators is used to mark elements for deletion. Any tag that causes one of these comparators to match on a startElement() event will cause all subsequent SAX events up to the matching endElement() event to be suppressed – i.e., not passed to the output. This “simple” deletion strategy has limitations because deleted elements must be detected in their start tag, before any of their content can be seen. To handle more sophisticated deletion logic, another SAX filter was designed (SAXDeleteFilter) which temporarily saves SAX events and defers deletion decisions until all of an element’s data has been seen. If one of this filter’s delete tools decides to delete an element after seeing the entire thing (child tags and all), the filter dumps its cached events and continues on without ever having informed its ContentHandler of what has happened. If the deleter decides not to delete, all of the events from start to end are “replayed” to the output. Unfortunately, this strategy is incompatible with the design of the ProgrammableSAXFilter, so instead of one very complex implementation, two simpler, more functionally focused filters were written. However, one of the beauties of filters is that they can be combined serially, so nothing is really lost by doing this. Another point to be made here is that with both filters, element deletion is all or none: if child tags need to be preserved or otherwise “reparented”, more radical XML-tree restructuring is needed, and an approach using DOM or XSLT is required.
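The “simple” strategy of deciding at the start tag can be sketched in a few lines. The code below is an illustration only, not the ProgrammableSAXFilter implementation: SimpleDeleteFilter and its demo class are hypothetical, matching is done on the qualified name alone rather than through an ElementComparator, and the small serializer in the demo does no escaping. It assumes the JDK's built-in SAX parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: suppresses every element with the given name,
// together with everything nested inside it.
class SimpleDeleteFilter extends XMLFilterImpl {
    private final String doomedName;
    private int skipDepth = 0; // > 0 while inside a deleted element

    SimpleDeleteFilter(XMLReader parent, String doomedName) {
        super(parent);
        this.doomedName = doomedName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || doomedName.equals(qName)) {
            skipDepth++; // count nested starts so the right end tag closes the gap
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }
    // A fuller version would also suppress ignorableWhitespace(),
    // processingInstruction() and the other events while skipping.
}

class SimpleDeleteFilterDemo {
    // Filters the XML and rebuilds a simplified text form of what got through.
    static String filter(String xml, String doomedName) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SimpleDeleteFilter filter = new SimpleDeleteFilter(parser, doomedName);
        final StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filter(
                "<ChildTag><Dog>Spot</Dog>This child has no dog now.</ChildTag>", "Dog"));
    }
}
```

The depth counter, rather than a simple boolean, is what keeps the filter correct when a deleted element contains another element of the same name.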
Element Replacement:
Tools for element replacement enable an entire element to be statically or dynamically replaced. This type of tool can be used for many purposes, for example the insertion of information to replace “placeholders”. Replacement thus enables input-output functions to be modularized and extended. The difference between replacement and modification is that the latter maintains the element structure: changes are limited to attributes and character data or addition of child elements. Replacement tools are temporarily given control of the SAX input-output stream. The modifications that they can make range from deletion to modification of some of the elements to insertion of an entirely new element. The base interface, ElementReplacer, extends ElementComparator and ContentHandler and adds two methods that signal the beginning and end of the replaced element:
public interface ElementReplacer extends ElementComparator,
org.xml.sax.ContentHandler
{
public void elementStarting( java.lang.String uri,
java.lang.String localName,
java.lang.String qName,
Attributes attrs,
ContentHandler cHandler );
public void elementEnding( ContentHandler cHandler );
}
Implementations of ElementReplacer can be simple objects like SAXElementReplacer or DOMElementReplacer which replace the original element with an XML character or object source respectively; or even an object which extends XMLFilterImpl or ProgrammableSAXFilter. Objects like this can do dynamic, context-dependent replacements.
Element modification:
As stated above, the ElementModifier base class represents tools that can be used for notification, and for addition and modification of elements. (In addition to ElementModifiers, the ProgrammableSAXFilter has a list of ContentHandlers that can also be used for notification.) The ElementModifier can have either or both of two helper objects, AttributesModifier and CDataModifier. The former is an abstract base class which contains methods for attributes modification and implements a linked list so that AttributesModifiers can be chained. The code for this object is as follows:
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
public abstract class AttributesModifier {
public AttributesModifier( ) { }
public AttributesModifier( AttributesModifier next )
{
this.nextMod = next;
}
public Attributes modifyAttributes( Attributes source )
{
return modifyAttributes( new AttributesImpl( source ) );
}
public Attributes modifyAttributes( AttributesImpl source ) {
AttributesImpl modified = _modifyAttributes( source );
return (nextMod != null) ? nextMod.modifyAttributes( modified )
: modified;
}
public void setNextMod( AttributesModifier next ) {
this.nextMod = next;
}
protected abstract AttributesImpl _modifyAttributes( AttributesImpl source );
private AttributesModifier nextMod = null;
}
Note that some work is done to translate an Attributes object (passed to startElement( )) to an org.xml.sax.helpers.AttributesImpl. This is done because Attributes is a read-only interface – it only contains get methods. If we want to add, subtract, or change attributes, we need to copy the Attributes into an AttributesImpl, which is a manipulable implementation of Attributes. Our AttributesModifier base class performs this work and also implements a simple linked list to allow more than one AttributesModifier to be combined. Subclasses need to override the one abstract method:
protected abstract AttributesImpl _modifyAttributes( AttributesImpl source );
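The copy-then-modify step can be tried on its own, outside the AttributesModifier hierarchy. The sketch below is a stand-alone illustration under stated assumptions: NumberingFilter is a hypothetical class that extends XMLFilterImpl directly, it mimics the “Number” attribute of the demonstration program, and it uses the JDK's built-in SAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: adds a running "Number" attribute to one kind of element.
class NumberingFilter extends XMLFilterImpl {
    private final String tagName;
    private int counter = 0;

    NumberingFilter(XMLReader parent, String tagName) {
        super(parent);
        this.tagName = tagName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (tagName.equals(qName)) {
            // Attributes is read-only, so copy it into a modifiable AttributesImpl.
            AttributesImpl copy = new AttributesImpl(atts);
            copy.addAttribute("", "Number", "Number", "CDATA", String.valueOf(counter++));
            atts = copy; // downstream handlers just see an Attributes object
        }
        super.startElement(uri, localName, qName, atts);
    }
}

class NumberingFilterDemo {
    // Returns the "Number" values that reach the downstream handler, in order.
    static List<String> numbers(String xml, String tagName) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NumberingFilter filter = new NumberingFilter(parser, tagName);
        final List<String> seen = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String number = atts.getValue("Number");
                if (number != null) {
                    seen.add(number);
                }
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(numbers(
                "<AnXMLDocument><ChildTag/><Other/><ChildTag/></AnXMLDocument>", "ChildTag"));
    }
}
```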
The subclasses direct their surgical skills at the passed AttributesImpl, which is eventually passed back to the ProgrammableSAXFilter to substitute for the original Attributes. The ContentHandlers are unaware of our intervention because they are looking for an Attributes object, which AttributesImpl is by virtue of implementing that interface. Modification (e.g., insertion, deletion or replacement) of character data is handled by cooperation between the main filter, its ElementModifiers and their contained CDataModifier objects. Data modification starts in the filter:
public void characters( char[] ch, int start, int length )
throws SAXException
{
if (this.deletingElement) return;
if (replacingElement != null)
{
replacingElement.characters( ch, start, length );
return;
}
if (currentModifiers.size() > 0)
{
CDataContent dContent = new CDataContent( ch, start, length );
for (int i = 0; i < currentModifiers.size(); i++)
{
ElementModifier currentModifier
= (ElementModifier)currentModifiers.get(i);
currentModifier.filterCData( dContent );
}
super.characters( dContent.getChars( ),
dContent.getStart(),
dContent.getLength() );
}
else
super.characters( ch, start, length );
// notify our listeners . . .
}
If the filter is not deleting or replacing this element and has some modifiers, it creates a wrapper object, CDataContent, to hold the character data. This allows ElementModifiers to change the properties of the characters without affecting the original – which would completely wreck a SAX parser such as Xerces, since this character array is its “Rosetta Stone” – if we change it in any way, we will probably cause the entire operation to fold in a nasty way. Instead, we provide an extra layer where character data can be replaced or pointers moved without changing the original. After the CDataModifiers have done their work (if any) we send the results to our super class (org.xml.sax.helpers.XMLFilterImpl) which passes them along to its ContentHandler. Any replacements, deletions or modifications will have replaced the char[] array in our wrapper object so that we send altered parameters in place of the ones we were given. This is OK because the receiver is like an unscrupulous pawn shop owner – it is supposed to focus on what it gets and does not need to know where the stuff came from. We just cannot modify what we were given if we want to keep our source happy and our data coming – so we use the old shell game on our unsuspecting client. CDataContent looks like this:
public class CDataContent
{
public CDataContent( String str, int start, int length )
{ this.chars = str.toCharArray();
this.start = start;
this.length = length;
}
public CDataContent( char[] chars, int start, int length )
{ this.chars = chars;
this.start = start;
this.length = length;
}
public char[] getChars( ) { return this.chars; }
public void setChars(char[] chars ) { this.chars = chars; }
public int getStart( ) { return this.start; }
public void setStart( int start ) { this.start = start; }
public int getLength( ) { return this.length; }
public void setLength( int length ) { this.length = length; }
public String toString( )
{ return new String( chars, start, length );
}
private char[] chars;
private int start;
private int length;
}
CDataContent’s purpose is like that of AttributesImpl: to provide a malleable version of the original, that keeps the original untouched. This allows CDataModifier implementations to do any ripping, tweaking or pasting operations that they want by implementing the single method defined in CDataModifier:
public void filterCData( CDataContent charData );
CDataInserter, for example, allows insertions of character data. Its constructor takes a string to insert, a pattern string to locate the insertion point and a boolean flag to indicate whether the inserted string should come before or after the pattern string.
public CDataInserter( String insertString,
String matchPattern,
boolean insertAfterPattern )
In the implementation of its filterCData( ) method, a new string is constructed by string concatenation. CDataContent is now set to this new string. The original data it had pointed to is still valid and held by the source XMLReader.
public void filterCData( CDataContent dataContent )
{
String itsString = dataContent.toString( );
int patIndex = itsString.indexOf( this.patternString );
// only insert if the pattern was actually found
if ( patIndex >= 0) {
// move the insertion point past the pattern if requested
if (insertAfterPattern) patIndex += patternString.length();
// splice the insert string (the constructor's first argument) in at that point
String newString = itsString.substring( 0, patIndex )
+ insertString + itsString.substring( patIndex );
dataContent.setChars( newString.toCharArray( ) );
dataContent.setStart( 0 );
dataContent.setLength( newString.length() );
}
}
Notification:
ProgrammableSAXFilter provides a means of notifying subscribing objects of definable events in an XML document using a flexible mechanism based on Java Reflection. The subscription process involves registering an ElementComparator to detect the event, a target java.lang.Object to receive the notification, the name of the method to be called and, depending on the notifier type, parameters to be passed. For some types of notifier, the parameter comes from the XML document. There are two types of notifiers in the toolset: objects which extend the main modifier classes ElementModifier, AttributesModifier and CDataModifier, and other tools which implement the ContentHandler interface.
Subscription involves constructing a notifier object and adding it to the filter. In the program shown in Listing 4, we construct an AttributeObjectPropertySetter which adds the “name” attribute of all elements named “Author” to a NameList object. Now, given an XML input file like this:
<Books>
<Book>
<Title>Moby Dick</Title>
<Author name="Herman Melville"/>
</Book>
<Book>
<Title>War and Peace</Title>
<Author name="Leo Tolstoy"/>
</Book>
<Book>
<Title>Crime and Punishment</Title>
<Author name="Fyodor Dostoyevsky"/>
</Book>
<Book>
<Title>The Grapes of Wrath</Title>
<Author name="John Steinbeck"/>
</Book>
</Books>
The program creates the following output:
Author is Herman Melville
Author is Leo Tolstoy
Author is Fyodor Dostoyevsky
Author is John Steinbeck
The key line in the above program is the line that adds the notifier to the filter:
psf.addElementModifier( new ElementModifier(
new TagComparator( "Author" ),
new AttributeObjectPropertySetter( "", "name",
nameList, "addName" ) ) );
The magic is done by AttributeObjectPropertySetter. Its constructor takes the attribute’s namespace and name, a target object and the name of the callback method which must take a single String parameter. The constructor sets up the necessary machinery for Java reflection.
public class AttributeObjectPropertySetter extends AttributesModifier
{
public AttributeObjectPropertySetter( String uri, String localName,
Object target, String setMethodName )
{
super( );
this.uri = uri;
this.localName = localName;
this.target = target;
this.targetClass = target.getClass( );
// remember the callback name; _modifyAttributes( ) looks the method up by it
this.setMethodName = setMethodName;
// the callback must take a single String parameter
this.paramArray = new Class[] { String.class };
}
// . . .
}
During XML processing of a startElement( ) event, this object’s _modifyAttributes method will be called by its parent ElementModifier. This object does not actually modify the attributes. It takes the attribute name it was given, and invokes the method given to it on its target object passing the value of the attribute, if that attribute is present:
protected AttributesImpl _modifyAttributes( AttributesImpl source )
{
int index = source.getIndex( uri, localName );
if (index >= 0)
{
try
{
Object[] params = new Object[1];
params[0] = source.getValue( index );
Method m = targetClass.getMethod( setMethodName, paramArray );
m.invoke( target, params );
}
catch( IllegalAccessException iace ) { }
catch( IllegalArgumentException iare ) { }
catch( InvocationTargetException ite ) { }
catch( NoSuchMethodException nsme ) { }
}
return source;
}
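The reflective callback can also be tried in isolation from the filter. The following is a minimal sketch, not part of the article's package: NameCollector is a hypothetical stand-in for the name list object, and deliver is an illustrative helper that looks up a public single-String method by name and invokes it, much as _modifyAttributes does for an attribute value.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the article's name list: any object with a
// public method that takes a single String parameter would do.
class NameCollector {
    final List<String> names = new ArrayList<>();

    public void addName(String name) {
        names.add(name);
    }
}

class ReflectiveCallbackSketch {
    // Looks up target.<methodName>(String) and invokes it with the value.
    // Returns false instead of throwing when the method is missing or fails.
    static boolean deliver(Object target, String methodName, String value) {
        try {
            Method m = target.getClass().getMethod(methodName, String.class);
            m.invoke(target, value);
            return true;
        } catch (NoSuchMethodException | IllegalAccessException
                | InvocationTargetException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        NameCollector collector = new NameCollector();
        deliver(collector, "addName", "Herman Melville");
        deliver(collector, "noSuchMethod", "ignored");   // silently skipped
        System.out.println(collector.names);             // prints [Herman Melville]
    }
}
```

Returning a boolean rather than swallowing the exceptions is a choice made for this sketch only; it makes a misspelled callback name easier to notice during development.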
Connecting filters:
Pipes and Tees:
The filter’s design provides several ways of connecting SAX processors. The first is provided by the XMLFilter interface, which allows a parent XMLReader (or, by inheritance, another XMLFilter) to be set. This allows serial processor chains to be created. ProgrammableSAXFilter also maintains a ContentHandler list that receives SAX events after the filter has done its processing. Since any subclass of org.xml.sax.helpers.XMLFilterImpl implements ContentHandler, branching chains of processors can be created by adding one filter to the ContentHandler list of another. Replacement tools allow input branching, i.e. more than one source can be filtered and combined. Combining these capabilities enables XML processing networks of arbitrary complexity to be created.
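The pipe and tee connections can be illustrated with nothing more than the standard org.xml.sax classes. This is a sketch under stated assumptions: TeeFilterSketch, ElementLog and PipesAndTeesSketch are hypothetical names, the tee copies only startElement events to its side branches (a complete version would forward every ContentHandler method), and it does not reproduce ProgrammableSAXFilter's own ContentHandler list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical tee: passes events down the main chain and also copies
// startElement events to any number of side branches.
class TeeFilterSketch extends XMLFilterImpl {
    private final List<ContentHandler> branches = new ArrayList<>();

    void addBranch(ContentHandler handler) {
        branches.add(handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, localName, qName, atts);      // main chain
        for (ContentHandler handler : branches) {
            handler.startElement(uri, localName, qName, atts); // side branches
        }
    }
}

// Records the qualified names of the elements it sees.
class ElementLog extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        seen.add(qName);
    }
}

class PipesAndTeesSketch {
    // Returns the element names seen by the main chain and by the side branch.
    static List<List<String>> run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        // Pipe: parser -> tee -> passThrough -> mainLog
        TeeFilterSketch tee = new TeeFilterSketch();
        tee.setParent(parser);
        XMLFilterImpl passThrough = new XMLFilterImpl(tee);
        ElementLog mainLog = new ElementLog();
        passThrough.setContentHandler(mainLog);

        // Tee: a second handler hangs off the first filter
        ElementLog sideLog = new ElementLog();
        tee.addBranch(sideLog);

        // Parsing is always started at the last filter in the chain.
        passThrough.parse(new InputSource(new StringReader(xml)));
        return List.of(mainLog.seen, sideLog.seen);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<Books><Book><Title>Moby Dick</Title></Book></Books>"));
    }
}
```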
Filter in a Filter – packaging XML filters within java.io.Input/OutputStreams:
Filters are inherently “modular”, making it possible to build sophisticated processing systems out of simple components. As shown in the previous section, the tools provided in the SAX filter package enable an XML input stream to be parsed into SAX events, filtered in various ways and transformed into an XML output stream. This end-to-end transition from input to output is useful in many situations. However, for other purposes, it would be more convenient to put this entire package into a one-way filter such as a java.io.FilterInputStream or FilterOutputStream, which deal with input-to-input and output-to-output filtering. The main problem is how to turn an output into an input, either at the beginning (to implement a FilterOutputStream) or at the end (to implement a FilterInputStream). Fortunately, the java.io package provides the necessary plumbing parts: PipedInputStream and PipedOutputStream, which are specifically designed to talk to each other. With these parts and some judicious use of the “Delegation” design pattern we can write our I/O filters (Listings 5 and 6).
XMLFilterInputStream:
An XMLFilterInputStream, as its name implies, extends the java.io.FilterInputStream class. To do this, it must accept (in its constructor) a java.io.InputStream to read from and must itself implement the java.io.InputStream interface so that another object can read from it. We can use the InputStream we are given to start SAX filtering and can turn the output of the filter back to XML using a SAXOutputter. What remains is the problem of converting this output back to an input that we can expose through our InputStream interface. Fortunately, we do not need to write all of this code: we can use a java.io.PipedOutputStream / PipedInputStream pair. We create a PipedInputStream object that we own, connect it to a PipedOutputStream object and tell the SAXOutputter to write its output to this PipedOutputStream, which is by inheritance a java.io.OutputStream. The SAXOutputter’s constructor accepts an OutputStream reference, so all is well. To complete the connections, we simply implement the java.io.InputStream interface by delegating to our contained PipedInputStream, so that when read( ) methods are called on our object, we pass them to the PipedInputStream, which knows how to read from a PipedOutputStream, which in turn has gotten its information from the SAXOutputter, and so on back to the original InputStream we were given to read from. One potential snag in this scenario is how to coordinate the reading and writing. If all of this runs in one thread, the thread may deadlock (as described in the Javadocs for PipedInputStream), so we must do a little more work to make sure that the code that runs the SAX processing and writes to the PipedOutputStream executes in a separate thread.
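The wiring described above might look roughly like the following. This is a sketch under stated assumptions, not the article's listing: the class name XmlFilterInputStreamSketch is hypothetical, and because SAXOutputter belongs to the article's package, the JDK's identity Transformer is swapped in to turn the filtered SAX events back into XML text. FilterInputStream already delegates every read( ) call to the stream passed to its constructor, which supplies the delegation for free.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLFilterImpl;

class XmlFilterInputStreamSketch extends FilterInputStream {
    // Set if parsing or filtering fails; the reader then simply sees an early EOF.
    private volatile Exception failure;

    XmlFilterInputStreamSketch(InputStream xmlIn, XMLFilterImpl filter) throws Exception {
        // The PipedInputStream we own; inherited read( ) methods delegate to it.
        super(new PipedInputStream());
        PipedOutputStream pipeOut = new PipedOutputStream((PipedInputStream) in);

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        filter.setParent(factory.newSAXParser().getXMLReader());

        // Parse, filter and serialize in a separate thread so that the
        // reading thread and the writing end of the pipe never block each other.
        Thread worker = new Thread(() -> {
            try (OutputStream out = pipeOut) {   // closing the pipe signals EOF
                TransformerFactory.newInstance().newTransformer().transform(
                        new SAXSource(filter, new InputSource(xmlIn)),
                        new StreamResult(out));
            } catch (Exception e) {
                failure = e;
            }
        });
        worker.setDaemon(true);   // do not keep the JVM alive for an abandoned stream
        worker.start();
    }

    Exception getFailure() {
        return failure;
    }
}
```

A caller wraps any XML InputStream and reads the filtered XML back as ordinary bytes, for example new XmlFilterInputStreamSketch(source, myFilter), where source and myFilter are the caller's own objects.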
XMLFilterOutputStream:
XMLFilterOutputStream is conceptually similar but needs a different implementation, because here we must invert from output to input at the beginning. Our constructor is given an OutputStream to write to, and we must also implement the OutputStream interface so that anything written to our FilterOutputStream will eventually be written to this output. We again construct a pair of PipedInputStream and PipedOutputStream objects, but in this case we own the PipedOutputStream, give our XMLFilterImpl the connected PipedInputStream as its InputStream, and connect the SAXOutputter to the OutputStream that we were given to write to. Then, as in the XMLFilterInputStream, we implement the java.io.OutputStream interface by delegating all calls to our PipedOutputStream. Now anything written to our object provides data for the PipedInputStream to read, which is parsed, filtered and written to our output by the SAXOutputter. As before, the code that writes to the PipedOutputStream and the code that reads from the PipedInputStream must run in separate threads to avoid deadlock.
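The output-side inversion can be sketched in the same way. The assumptions are the same as before: XmlFilterOutputStreamSketch is a hypothetical name, and the JDK's identity Transformer stands in for the article's SAXOutputter. One detail goes beyond the description above and is a design choice of this sketch: close( ) waits for the worker thread, so that all filtered output has reached the target stream before the caller continues.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLFilterImpl;

class XmlFilterOutputStreamSketch extends FilterOutputStream {
    private final OutputStream target;
    private final Thread worker;
    private volatile Exception failure;

    XmlFilterOutputStreamSketch(OutputStream target, XMLFilterImpl filter) throws Exception {
        // The PipedOutputStream we own; inherited write( ) methods delegate to it.
        super(new PipedOutputStream());
        this.target = target;
        PipedInputStream pipeIn = new PipedInputStream((PipedOutputStream) out);

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        filter.setParent(factory.newSAXParser().getXMLReader());

        // The worker reads what the caller writes, filters it and
        // serializes the result to the target stream.
        worker = new Thread(() -> {
            try (PipedInputStream source = pipeIn) {
                TransformerFactory.newInstance().newTransformer().transform(
                        new SAXSource(filter, new InputSource(source)),
                        new StreamResult(target));
            } catch (Exception e) {
                failure = e;
            }
        });
        worker.start();
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);   // avoid the byte-at-a-time default
    }

    @Override
    public void close() throws IOException {
        super.close();            // closes the pipe: the parser sees end of input
        try {
            worker.join();        // wait until everything has been written
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        target.close();
        if (failure != null) {
            throw new IOException("XML filtering failed", failure);
        }
    }
}
```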
Uses of SAX filtering:
Having discussed what SAX filtering is, how it works and how to use the ProgrammableSAXFilter package, it is a good time to talk about what you can do with it. As discussed earlier, SAX filtering is a faster, completely Java-based alternative to other means of XML processing such as XSL transformation or DOM object construction followed by XML regeneration, which require more overhead in terms of object construction. It cannot compete when wholesale reorganization of an XML structure is needed, but when cutting trees, sometimes you need a pocket knife and sometimes you need a chain saw. Filters are a natural choice for implementing more restrictive parsing systems such as XML Schema validation, logging, encryption/decryption or input-output schema translations.
The combination of SAX filtering with Java reflection techniques opens a number of possibilities that make this an attractive alternative in certain situations. Two of the most compelling are flexible notification systems and property setting via notifiers. With property setting callbacks, JavaBeans can be initialized from any XML stream that contains one or more elements with the necessary property/value pairs defined either in attributes or as child element character data. Containment of JavaBeans can also be handled. Furthermore, the number of separate objects that can be initialized from a single XML document is “unlimited” (subject to restrictions like memory constraints, of course). This mechanism also allows “selective” property setting: unlike object construction via DOM, the entire XML document need not be converted into an object. Selected elements with selected properties can be plucked out as the document is parsed, enabling a more frugal conversion of XML to objects. The coding of JavaBeans can be done either manually, before or after the XML schema is developed, or automatically, using a BeanCreator, which converts XML attributes and child tags to Java code.
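Selective property setting can be shown with a plain SAX handler. This sketch assumes a hypothetical AuthorBean with String properties and maps each attribute name to a setter by the usual JavaBeans naming rule; it uses no classes from the article's package and ignores every element other than Author.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical bean used only for this illustration.
class AuthorBean {
    private String name;

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}

class BeanPopulatingHandler extends DefaultHandler {
    final List<AuthorBean> authors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (!"Author".equals(qName)) {
            return;   // selective: every other element is ignored
        }
        AuthorBean bean = new AuthorBean();
        for (int i = 0; i < atts.getLength(); i++) {
            setProperty(bean, atts.getQName(i), atts.getValue(i));
        }
        authors.add(bean);
    }

    // Calls bean.setXxx(String) for attribute "xxx"; attributes without a
    // matching setter are skipped.
    private static void setProperty(Object bean, String property, String value) {
        String setter = "set" + Character.toUpperCase(property.charAt(0))
                + property.substring(1);
        try {
            Method m = bean.getClass().getMethod(setter, String.class);
            m.invoke(bean, value);
        } catch (ReflectiveOperationException e) {
            // no such String property on this bean: leave it unset
        }
    }

    // Parses the XML and returns one bean per Author element.
    static List<AuthorBean> load(String xml) throws Exception {
        BeanPopulatingHandler handler = new BeanPopulatingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.authors;
    }
}
```

Only the beans and the attribute Strings are allocated; the Title elements and the surrounding document structure never become objects.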
Notification patterns also enable frugal implementations, especially in cases where incoming XML represents a “message” that is intended to trigger other object messages and/or processes within the target object system, and where the creation of an object to bridge the “impedance mismatch” between an XML character string and the target object system is simply overhead. SAX callback notification enables messages to be extracted from the incoming XML and dispatched to any object with little impact on memory management overhead: only the Strings extracted from the XML document represent newly created objects. This is therefore a good way to implement XML-based messaging systems such as SOAP, where speed and efficiency are premium design goals.
Finally, the extensibility of the Programmable SAX filter provides other opportunities that would be hard to duplicate with DOM or XSLT. In fact, many of the functions provided by the filter can only be fully exploited in the context of a specific application. For example, dynamic replacement of elements, attributes or character data may require application-dependent state information. By providing objects that can be extended with little excess baggage, tailored to specific tasks, and activated when needed, the package can be used to create powerful, yet efficient XML processing systems.