Wednesday, 13 April 2011

Line and Column Numbers in an XML DOM Document.

If an application reads data from XML configuration files, it can be useful to report the filename, line number and column when a problem is found with the data.

You might expect all XML parsers to provide access to this sort of information as a matter of course.  The standard SAX parser does (as we will see), but the DOM parser does not, unless the XML actually fails to parse.  The standard DOM parser probably uses a SAX parser under the hood, but the API denies us access to it.
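
To see what the SAX side offers, here is a minimal sketch that records the line number reported by the Locator at each start tag. The class and method names (SaxLocationDemo, elementLines) are made up for illustration, and it assumes the parser supplies a Locator, which SAX encourages but does not strictly require.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class: records the line reported at each start tag. */
class SaxLocationDemo {

    static Map<String, Integer> elementLines(String xml) throws Exception {
        final Map<String, Integer> lines = new LinkedHashMap<String, Integer>();

        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                // Called by the parser before any other event
                // (assumption: the parser provides a locator at all).
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                // The locator reports where the current event ends,
                // i.e. just after the start tag.
                lines.put(qName, locator.getLineNumber());
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Prints {config=1, title=2}
        System.out.println(elementLines(
                "<config>\n  <title>Demo</title>\n</config>"));
    }
}
```

Nothing equivalent is available from a Node once a DocumentBuilder has finished parsing, which is the gap the rest of this post works around.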

Switching from DOM to SAX is a high price to pay to make your error reporting better.  Besides, you may need to use DOM tools such as XSLT.  You could switch to a third-party parser, or access the underlying SAX parser in some sneaky, non-standard and unsupported way, or you can just use the following trick.

Use the javax.xml.transform API.

The great thing about the XSL Transformation API is that it will use a number of different types of input sources and output destinations.  Possible options are:
  • input and output streams
  • files
  • SAX parser input sources
  • DOM fragments and documents
  • JAXB object models

So we can read XML using a SAX parser and get a resulting DOM.

In our case we don't actually want to apply an XSL transformation.  However, the API will provide us with a Transformer that just copies to the output form without altering the data.  So here we have a tool that can convert XML from one form to another. The following examples convert XML from a file into a DOM and back.

Reading an XML file into a DOM:
    TransformerFactory transformerFactory
            = TransformerFactory.newInstance();
    // Do not share transformers between threads
    Transformer nullTransformer = transformerFactory.newTransformer();

    Source fileSource = new StreamSource(new File("input.xml"));
    DOMResult domResult = new DOMResult();
    nullTransformer.transform(fileSource, domResult);

    Document dom = (Document) domResult.getNode();
Writing an XML DOM to a file:
    Source domSource = new DOMSource(dom);
    Result fileResult = new StreamResult(new File("output.xml"));
    nullTransformer.transform(domSource, fileResult);
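
The same two conversions can be tried without any files on disk. This is only a sketch: the class name IdentityRoundTrip and its two helper methods are invented for the example, and strings stand in for input.xml and output.xml.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

/** Hypothetical demo class: string to DOM and back with an identity Transformer. */
class IdentityRoundTrip {

    static Document toDom(String xml) throws Exception {
        // newTransformer() with no stylesheet gives the copying transformer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult domResult = new DOMResult();
        identity.transform(new StreamSource(new StringReader(xml)), domResult);
        return (Document) domResult.getNode();
    }

    static String toXml(Document dom) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document dom = toDom("<greeting>hello</greeting>");
        System.out.println(dom.getDocumentElement().getTagName());
        System.out.println(toXml(dom));
    }
}
```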

So how do we obtain the line number information for DOM nodes?

The trick is to use a SAX parser and attach the location information it provides to the created element nodes as they are added to the DOM.  Here is a SAX filter that does exactly this:
public class LocationAnnotator extends XMLFilterImpl {

    private Locator locator;
    private Stack<Locator> locatorStack = new Stack<Locator>();
    // Elements inserted into the DOM whose end has not yet been seen.
    private Stack<Element> elementStack = new Stack<Element>();
    private UserDataHandler dataHandler = new LocationDataHandler();

    LocationAnnotator(XMLReader xmlReader, Document dom) {
        super(xmlReader);

        // Add listener to DOM, so we know which node was added.
        EventListener modListener = new EventListener() {
            @Override
            public void handleEvent(Event e) {
                EventTarget target = ((MutationEvent) e).getTarget();
                // Text and comment nodes fire this event too;
                // only elements are annotated.
                if (target instanceof Element) {
                    elementStack.push((Element) target);
                }
            }
        };
        ((EventTarget) dom).addEventListener("DOMNodeInserted",
                modListener, true);
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        super.setDocumentLocator(locator);
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) throws SAXException {
        super.startElement(uri, localName, qName, atts);

        // Keep snapshot of start location,
        // for later when end of element is found.
        locatorStack.push(new LocatorImpl(locator));
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {

        // Once the end of the element has been handled, the element
        // has been inserted into the DOM and its children have already
        // been popped, so it is on top of elementStack.
        super.endElement(uri, localName, qName);

        if (!locatorStack.isEmpty()) {
            Locator startLocator = locatorStack.pop();

            if (!elementStack.isEmpty()) {
                Element element = elementStack.pop();

                LocationData location = new LocationData(
                        startLocator.getSystemId(),
                        startLocator.getLineNumber(),
                        startLocator.getColumnNumber(),
                        locator.getLineNumber(),
                        locator.getColumnNumber());

                element.setUserData(
                        LocationData.LOCATION_DATA_KEY, location,
                        dataHandler);
            }
        }
    }

    // Ensure location data copied to any new DOM node.
    private class LocationDataHandler implements UserDataHandler {

        @Override
        public void handle(short operation, String key, Object data,
                Node src, Node dst) {
          
            if (src != null && dst != null) {
                LocationData locationData = (LocationData)
                        src.getUserData(LocationData.LOCATION_DATA_KEY);
              
                if (locationData != null) {
                    dst.setUserData(LocationData.LOCATION_DATA_KEY,
                            locationData, dataHandler);
                }
                }
            }
        }
    }
}
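
The reason for the UserDataHandler is that DOM user data is not carried over when a node is cloned; the handler is called instead and can copy it. The sketch below shows this in isolation. The class name, the key "demoKey" and the string value are stand-ins invented for the example, not part of the filter above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.UserDataHandler;

/** Hypothetical demo class: user data survives cloneNode() only via a handler. */
class UserDataCopyDemo {

    static final String KEY = "demoKey"; // made-up key for this example

    /** Copies the data to the destination node whenever there is one. */
    static final UserDataHandler COPYING_HANDLER = new UserDataHandler() {
        @Override
        public void handle(short operation, String key, Object data,
                Node src, Node dst) {
            if (dst != null) {
                dst.setUserData(key, data, this);
            }
        }
    };

    /** Returns the user data found on a clone of an annotated element. */
    static Object userDataOnClone(UserDataHandler handler) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element original = doc.createElement("title");
        original.setUserData(KEY, "line 3", handler);

        Node copy = original.cloneNode(true);
        return copy.getUserData(KEY);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(userDataOnClone(COPYING_HANDLER)); // data copied
        System.out.println(userDataOnClone(null));            // data lost
    }
}
```
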
Next, here is the LocationData class; the filter attaches one instance to each DOM element node.
public class LocationData {

    public static final String LOCATION_DATA_KEY = "locationDataKey";

    private final String systemId;
    private final int startLine;
    private final int startColumn;
    private final int endLine;
    private final int endColumn;

    public LocationData(String systemId, int startLine,
            int startColumn, int endLine, int endColumn) {
        super();
        this.systemId = systemId;
        this.startLine = startLine;
        this.startColumn = startColumn;
        this.endLine = endLine;
        this.endColumn = endColumn;
    }

    public String getSystemId() {
        return systemId;
    }

    public int getStartLine() {
        return startLine;
    }

    public int getStartColumn() {
        return startColumn;
    }

    public int getEndLine() {
        return endLine;
    }

    public int getEndColumn() {
        return endColumn;
    }

    @Override
    public String toString() {
        return getSystemId() + "[line " + startLine + ":"
                + startColumn + " to line " + endLine + ":"
                + endColumn + "]";
    }
}
The final piece of code shows how to wire up all the pieces:
    /*
     * During application startup
     */
    DocumentBuilderFactory documentBuilderFactory
            = DocumentBuilderFactory.newInstance();
    TransformerFactory transformerFactory
            = TransformerFactory.newInstance();
    Transformer nullTransformer
            = transformerFactory.newTransformer();

    /*
     * Create an empty document to be populated within a DOMResult.
     */
    DocumentBuilder docBuilder
            = documentBuilderFactory.newDocumentBuilder();
    Document doc = docBuilder.newDocument();
    DOMResult domResult = new DOMResult(doc);

    /*
     * Create SAX parser/XMLReader that will parse XML. If factory
     * options are not required then this can be short cut by:
     *      xmlReader = XMLReaderFactory.createXMLReader();
     */
    SAXParserFactory saxParserFactory
            = SAXParserFactory.newInstance();
    // saxParserFactory.setNamespaceAware(true);
    // saxParserFactory.setValidating(true);
    SAXParser saxParser = saxParserFactory.newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();

    /*
     * Create our filter to wrap the SAX parser, that captures the
     * locations of elements and annotates their nodes as they are
     * inserted into the DOM.
     */
    LocationAnnotator locationAnnotator
            = new LocationAnnotator(xmlReader, doc);

    /*
     * Create the SAXSource to use the annotator.
     */
    // A system ID should be a URI, not a bare file path.
    String systemId = new File("example.xml").toURI().toString();
    InputSource inputSource = new InputSource(systemId);
    SAXSource saxSource
            = new SAXSource(locationAnnotator, inputSource);

    /*
     * Finally read the XML into the DOM.
     */
    nullTransformer.transform(saxSource, domResult);

    /*
     * Find one of the element nodes in our DOM and output the location
     * information.
     */
    Node n = doc.getElementsByTagName("title").item(0);
    LocationData locationData = (LocationData)
            n.getUserData(LocationData.LOCATION_DATA_KEY);
    System.out.println(locationData);
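
Only element nodes carry location data, so if the offending node is an attribute or a text node you need the nearest enclosing element. A possible helper is sketched below. NearestLocation and find are invented names, and the key is the same string as LocationData.LOCATION_DATA_KEY, repeated here so the example stands alone.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: finds the location data nearest to any DOM node. */
class NearestLocation {

    // Same string as LocationData.LOCATION_DATA_KEY.
    static final String KEY = "locationDataKey";

    /** Returns the user data of the node or its closest annotated ancestor. */
    static Object find(Node node) {
        Node current = node;
        while (current != null) {
            Object data = current.getUserData(KEY);
            if (data != null) {
                return data;
            }
            // Attributes have no parent node; step to the owning element.
            current = (current instanceof Attr)
                    ? ((Attr) current).getOwnerElement()
                    : current.getParentNode();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("config");
        doc.appendChild(root);
        root.setAttribute("mode", "fast");
        // A plain string stands in for a LocationData instance.
        root.setUserData(KEY, "example.xml[line 1:9 to line 1:30]", null);

        System.out.println(find(root.getAttributeNode("mode")));
    }
}
```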

Although XML files can include other XML files by enabling XInclude on the SAXParserFactory, this does not currently give correct locations within included files.  See XERCESJ-1247.
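
For reference, enabling XInclude on the factory looks roughly like this. The wrapper class XIncludeFactory is invented for the example; setNamespaceAware and setXIncludeAware are the standard JAXP calls, and the sketch assumes the JDK's built-in parser, since other implementations may not support the option. It does nothing to fix the location problem described above.

```java
import javax.xml.parsers.SAXParserFactory;

/** Hypothetical helper: a SAXParserFactory with XInclude switched on. */
class XIncludeFactory {

    static SAXParserFactory create() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // XInclude elements live in a namespace, so namespace
        // processing has to be on as well.
        factory.setNamespaceAware(true);
        factory.setXIncludeAware(true);
        return factory;
    }

    public static void main(String[] args) {
        System.out.println(create().isXIncludeAware());
    }
}
```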

Tuesday, 5 April 2011

How to Process Large XML Documents Using XSLT

In the previous post I showed how to use the DOM Load and Save API (org.w3c.dom.ls) to process small parts of a larger XML document as DOM fragments.

A great use for this technique is to apply an XSLT transformation to a potentially huge XML document.  Normally it is not possible to use XSLT on large documents without having enough memory available to read the whole document into a DOM before any transformation is performed.  However, by performing the transformation on each item in the XML as it is found, we will only require enough memory to hold one item at a time.

Let's re-use the ProcessingFilter from the previous post and create a new ElementProcessor to do XSLT transformations:

public class XsltElementProcessor implements ElementProcessor {

    private final Transformer transformer;
    private final StreamResult streamResult;

    public XsltElementProcessor(String xsltFilename,
                OutputStream outputStream)
            throws TransformerConfigurationException,
                FileNotFoundException
    {
        this.streamResult = new StreamResult(outputStream);

        TransformerFactory transformerFactory =
            TransformerFactory.newInstance();
        // Pass the file itself so the XML parser detects the encoding,
        // rather than reading with the platform default charset.
        Source source = new StreamSource(new File(xsltFilename));
        Templates templates = transformerFactory.newTemplates(source);

        // Hint: The Templates can be shared between threads,
        // Transformer can not.
        this.transformer = templates.newTransformer();
        transformer.setOutputProperty("omit-xml-declaration", "yes");
    }

    @Override
    public void process(Element element) throws TransformerException {
        DOMSource domSource = new DOMSource(element);
        transformer.transform(domSource, streamResult);
    }
}
Now we re-work our code to create HTML table rows for an RSS feed:
    DOMImplementationRegistry registry = DOMImplementationRegistry
            .newInstance();
    DOMImplementationLS domImpl = (DOMImplementationLS) registry
            .getDOMImplementation("XML 1.0 LS 3.0");

    ElementProcessor processor =
            new XsltElementProcessor("itemToRow.xsl", System.out);
    ProcessingFilter filter = new ProcessingFilter(processor, "item");

    LSParser parser = domImpl.createLSParser(
            DOMImplementationLS.MODE_SYNCHRONOUS, null);
    parser.setFilter(filter);
    LSInput input = domImpl.createLSInput();

    URL url = new URL("http://news.google.com/?output=rss");
    InputStream inputStream = new BufferedInputStream(url.openStream());
   
    try {
        input.setByteStream(inputStream);

        System.out.println("<html><body><table><tr><th>Title</th>" +
                "<th>Category</th><th>Date</th></tr>");
        parser.parse(input);
        System.out.println("</table></body></html>");

        Exception ex = filter.getProcessingException();
        if (ex != null){
            throw ex;
        }
    }
    finally {
        inputStream.close();
    }
OK, the way I have output the XML that wraps our transformed items is a bit crummy, but it might just do.  If we wanted to be a bit more clever, we could perhaps create another filter: one that, when used to output a template XML document, removes a placeholder element and replaces it with the results of our transformations.

And finally a basic "itemToRow.xsl":
    <xsl:stylesheet
             xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="html" indent="yes" encoding="UTF-8"/>       
        <xsl:template match="/item">
            <xsl:variable name="href">
                <xsl:value-of select="link"/>
            </xsl:variable>
            <tr>
                <td><a href="{$href}"><xsl:value-of select="title"/></a></td>
                <td><xsl:value-of select="category"/></td>
                <td><xsl:value-of select="pubDate"/></td>
            </tr>
        </xsl:template>
   
    </xsl:stylesheet>
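
To check a stylesheet like this without fetching a feed, you can run it against a single item element built from a string. The sketch below uses a cut-down stylesheet embedded in the code (title and category only, matching "item" rather than "/item") and a made-up class name, so treat it as an illustration rather than the exact code above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical demo class: transforms one item element into a table row. */
class ItemToRowDemo {

    // Cut-down variant of itemToRow.xsl, for illustration only.
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='item'>"
        + "<tr><td><xsl:value-of select='title'/></td>"
        + "<td><xsl:value-of select='category'/></td></tr>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the first item element found in the given XML. */
    static String firstItemToRow(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node item = doc.getElementsByTagName("item").item(0);

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        // Only the subtree rooted at the item element is transformed.
        transformer.transform(new DOMSource(item), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstItemToRow(
                "<channel><item><title>Hello</title>"
                + "<category>Sport</category></item></channel>"));
    }
}
```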

Monday, 4 April 2011

What is the DOM Load and Save API (org.w3c.dom.ls) and what is it good for?

org.w3c.dom.ls is one of those enigmatic packages listed at the end of the standard Java 1.6 APIs.  The Javadoc has few clues as to what it is for, and no examples showing you how to use it.  This is a shame because it is really useful if you want to use DOM to deal with very large XML documents.

The DOM Load and Save API provides filtering of DOM nodes on input and output. During loading, completed nodes can be inspected before they are included in the final document, and vetoed if they are unwanted.  Similarly nodes can be omitted when a document is output - but that is of less interest to us.


When we process large XML documents, we might use this mechanism to filter out most of the nodes and leave us with a smaller set.  However, we can put this mechanism to much better use, to process every item in a large feed.



As each item element is completed it is passed to our filter class to be inspected. The item element is a DOM fragment that can be processed using standard DOM tools, such as XSLT or XPath.  Afterwards the filter can reject the item and the memory the DOM fragment used will be freed.  In this way we can use DOM processing on very large XML documents without requiring huge amounts of memory.  We can also begin processing as soon as the first item has arrived, instead of having to wait until the whole document has been read.

Let's write some code:

This API is composed completely of interfaces that no documented API classes implement or return, and the Javadoc doesn't explain where they come from.  The missing piece of the jigsaw is the use of the DOMImplementationRegistry and the magic parameter you have to pass to the factory method.
DOMImplementationRegistry registry = DOMImplementationRegistry
        .newInstance();
DOMImplementationLS domImpl = (DOMImplementationLS) registry
        .getDOMImplementation("XML 1.0 LS 3.0");
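
As a quick check that the factory works, the sketch below uses it to load a document from a string and save it back to one. The class and method names are invented for the example; createLSParser, createLSInput, createLSSerializer and writeToString are all part of org.w3c.dom.ls.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

/** Hypothetical demo class: load and save a document with org.w3c.dom.ls. */
class LoadSaveRoundTrip {

    static DOMImplementationLS loadSave() throws Exception {
        DOMImplementationRegistry registry =
                DOMImplementationRegistry.newInstance();
        return (DOMImplementationLS)
                registry.getDOMImplementation("XML 1.0 LS 3.0");
    }

    static Document parse(String xml) throws Exception {
        DOMImplementationLS domImpl = loadSave();
        LSParser parser = domImpl.createLSParser(
                DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput input = domImpl.createLSInput();
        input.setStringData(xml); // a byte stream or system ID also works
        return parser.parse(input);
    }

    static String serialize(Document doc) throws Exception {
        LSSerializer serializer = loadSave().createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<feed><item>one</item></feed>");
        System.out.println(serialize(doc));
    }
}
```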

Now you have a factory object from which to get implementations of the API interfaces.  The only one you need to implement yourself is LSParserFilter.  Let's create a general-purpose implementation that passes target elements to another class to do whatever processing is required:
/**
 * An implementation of {@link LSParserFilter} to process
 * DOM element nodes.
 */
public class ProcessingFilter implements LSParserFilter {

    private final ElementProcessor elementProcessor;
    private final String targetElementName;
    private final String targetNamespace;

    private boolean withinTargetNode = false;
    private Exception processingException = null;

    /**
     * An implementation of {@link LSParserFilter} to process DOM element
     * nodes.
     *
     * @param elementProcessor
     *            the component that will process each target element.
     * @param targetElementName
     *            the local name of the target elements to process.
     * @param targetNamespace
     *            the namespace of the target elements to process.
     */
    public ProcessingFilter(ElementProcessor elementProcessor,
            String targetElementName, String targetNamespace) {
        this.elementProcessor = elementProcessor;
        this.targetElementName = targetElementName;
        this.targetNamespace = targetNamespace;
    }

    /**
     * An implementation of {@link LSParserFilter} to process DOM element
     * nodes.
     *
     * @param elementProcessor
     *            the component that will process each target element.
     * @param targetElementName
     *            the name of the target elements (in the default namespace) to
     *            process.
     */
    public ProcessingFilter(ElementProcessor elementProcessor,
            String targetElementName) {
        this(elementProcessor, targetElementName, null);
    }

    public int getWhatToShow() {
        return NodeFilter.SHOW_ALL;
    }

    public short startElement(Element element) {
        if (isTargetElement(element)) {
            withinTargetNode = true;
        }
        return FILTER_ACCEPT;
    }

    public short acceptNode(Node node) {
        /* When we get a completed target element, we process it. */
        if (isTargetElement(node)) {
            try {
                elementProcessor.process((Element) node);
            } catch (Exception ex) {
                this.processingException = ex;
                return FILTER_INTERRUPT;
            }
            withinTargetNode = false;
            return FILTER_REJECT;
        }
        return withinTargetNode ? FILTER_ACCEPT : FILTER_REJECT;
    }

    public Exception getProcessingException() {
        return this.processingException;
    }

    private boolean isTargetElement(final Node node) {
        return Node.ELEMENT_NODE == node.getNodeType()
                && targetElementName.equals(node.getLocalName())
                && (targetNamespace == null ?
                    node.getNamespaceURI() == null :
                    targetNamespace.equals(node.getNamespaceURI()));
    }
}

Then we put our actual processing in a class that implements this interface.
public interface ElementProcessor {
    void process(Element element) throws Exception;
}

To put these ideas into practice, let's print out the titles in an RSS feed using an XPath expression:
    /*
     * During application startup ...
     */
    DOMImplementationRegistry registry = DOMImplementationRegistry
            .newInstance();
    DOMImplementationLS domImpl = (DOMImplementationLS) registry
            .getDOMImplementation("XML 1.0 LS 3.0");

    /*
     *  The processing for each target element ...
     */
    ElementProcessor processor = new ElementProcessor(){
       
        private XPath xpath = XPathFactory.newInstance().newXPath();

        public void process(Element element) throws Exception {
            String title = (String) xpath.evaluate(
                    "./title", element, XPathConstants.STRING);
            System.out.println(title);
        }
    };
    ProcessingFilter filter = new ProcessingFilter(processor, "item");

    /*
     *  Create the parser and process the feed ...
     */
    LSParser parser = domImpl.createLSParser(
            DOMImplementationLS.MODE_SYNCHRONOUS, null);
    parser.setFilter(filter);
    LSInput input = domImpl.createLSInput();

    URL url = new URL("http://news.google.com/?output=rss");
    InputStream inputStream = new BufferedInputStream(url.openStream());
    try {
        input.setByteStream(inputStream);

        parser.parse(input);  // Returns almost empty document DOM

        Exception ex = filter.getProcessingException();
        if (ex != null){
            throw ex;
        }
    }
    finally {
        inputStream.close();
    }
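
If the feed URL is not reachable, the same pattern can be tried offline against a string. The sketch below is a simplified stand-in for ProcessingFilter, not a replacement: TitleCollector and OfflineFilterDemo are made-up names, it only looks at element nodes, and it collects the titles in a list instead of delegating to an ElementProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

/** Hypothetical demo class: filters items out of an in-memory feed. */
class OfflineFilterDemo {

    /** Simplified stand-in for ProcessingFilter: collects item titles. */
    static class TitleCollector implements LSParserFilter {
        final List<String> titles = new ArrayList<String>();

        public int getWhatToShow() {
            // Only elements are passed to acceptNode(); other node
            // types are added to the document without being filtered.
            return NodeFilter.SHOW_ELEMENT;
        }

        public short startElement(Element element) {
            return FILTER_ACCEPT;
        }

        public short acceptNode(Node node) {
            if (node.getNodeType() != Node.ELEMENT_NODE
                    || !"item".equals(node.getNodeName())) {
                return FILTER_ACCEPT;
            }
            // The item is complete here, so its children can be read.
            Node title = ((Element) node)
                    .getElementsByTagName("title").item(0);
            titles.add(title == null ? "" : title.getTextContent());
            return FILTER_REJECT; // drop the processed item from the DOM
        }
    }

    static List<String> titles(String xml) throws Exception {
        DOMImplementationLS domImpl = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance()
                        .getDOMImplementation("XML 1.0 LS 3.0");
        LSParser parser = domImpl.createLSParser(
                DOMImplementationLS.MODE_SYNCHRONOUS, null);
        TitleCollector collector = new TitleCollector();
        parser.setFilter(collector);

        LSInput input = domImpl.createLSInput();
        input.setStringData(xml);
        parser.parse(input);
        return collector.titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titles("<rss><channel>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>"));
    }
}
```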