Tuesday, 5 April 2011

How to Process Large XML Documents Using XSLT

In the previous post I showed how to use the DOM Load and Save API (org.w3c.dom.ls) to process small parts of a larger XML document as DOM fragments.

A great use for this technique is to apply an XSLT transformation to a potentially huge XML document. Normally, XSLT cannot be used on a large document unless there is enough memory to read the whole document into a DOM before any transformation is performed. However, by transforming each item in the XML as it is found, we only need enough memory to hold one item at a time.
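For anyone who has not read the previous post, here is a minimal sketch of the idea. It is not the ProcessingFilter from that post: ItemFilterSketch and titlesOf are made-up names for illustration, and the sketch just collects each item's title instead of calling an ElementProcessor. The point it shows is that the parser hands over each completed item and then drops it, so items should not pile up in memory.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical stand-in for the previous post's filter (illustration only).
class ItemFilterSketch implements LSParserFilter {

    private final String elementName;
    private final List<String> titles = new ArrayList<>();

    ItemFilterSketch(String elementName) {
        this.elementName = elementName;
    }

    @Override
    public short startElement(Element element) {
        // Let the parser build every element; the decision is made in acceptNode.
        return FILTER_ACCEPT;
    }

    @Override
    public short acceptNode(Node node) {
        if (!elementName.equals(node.getNodeName())) {
            return FILTER_ACCEPT;
        }
        // The element is complete here, so it can be handled as a small DOM fragment.
        NodeList found = ((Element) node).getElementsByTagName("title");
        titles.add(found.getLength() > 0 ? found.item(0).getTextContent() : "");
        // Rejecting the element keeps it out of the document being built.
        return FILTER_REJECT;
    }

    @Override
    public int getWhatToShow() {
        return NodeFilter.SHOW_ELEMENT;
    }

    // Parses the XML with the filter installed and returns the collected titles.
    static List<String> titlesOf(String xml) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImpl = (DOMImplementationLS) registry
                .getDOMImplementation("XML 1.0 LS 3.0");
        ItemFilterSketch filter = new ItemFilterSketch("item");
        LSParser parser = domImpl.createLSParser(
                DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(filter);
        LSInput input = domImpl.createLSInput();
        input.setStringData(xml);
        parser.parse(input);
        return filter.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(titlesOf(xml));
    }
}
```

In the real thing, the line that collects the title is where the element would be passed to an ElementProcessor.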

Let's re-use the ProcessingFilter from the previous post and create a new ElementProcessor to do XSLT transformations:

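import java.io.File;
import java.io.OutputStream;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Element;

public class XsltElementProcessor implements ElementProcessor {

    private final Transformer transformer;
    private final StreamResult streamResult;

    public XsltElementProcessor(String xsltFilename,
                OutputStream outputStream)
            throws TransformerConfigurationException
    {
        this.streamResult = new StreamResult(outputStream);

        TransformerFactory transformerFactory =
            TransformerFactory.newInstance();
        // Passing a File lets the XML parser detect the stylesheet's
        // encoding itself; a FileReader would decode it with the
        // platform default charset.
        Source source = new StreamSource(new File(xsltFilename));
        Templates templates = transformerFactory.newTemplates(source);

        // Hint: Templates can be shared between threads,
        // a Transformer cannot.
        this.transformer = templates.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    }

    @Override
    public void process(Element element) throws TransformerException {
        // Each call appends one transformed fragment to the same output stream.
        DOMSource domSource = new DOMSource(element);
        transformer.transform(domSource, streamResult);
    }
}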
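
The hint about threads deserves a quick illustration. The following is only a sketch under my own assumptions: SharedTemplatesSketch, transformAll and the tiny inline stylesheet are invented for the example, and the items are passed in as strings rather than DOM elements to keep it short. The stylesheet is compiled once into a Templates object, and each task creates its own Transformer from it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class; not part of the original post's code.
class SharedTemplatesSketch {

    // A tiny stylesheet made up for this sketch.
    static final String XSLT =
            "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
            + "<xsl:template match='/item'>"
            + "<row><xsl:value-of select='title'/></row>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static List<String> transformAll(List<String> items) throws Exception {
        // Compile the stylesheet once; the Templates object is shared by all tasks.
        Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(XSLT)));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String item : items) {
                futures.add(pool.submit(() -> {
                    // Each task gets its own Transformer; these are not shared.
                    Transformer transformer = templates.newTransformer();
                    transformer.setOutputProperty(
                            OutputKeys.OMIT_XML_DECLARATION, "yes");
                    StringWriter out = new StringWriter();
                    transformer.transform(
                            new StreamSource(new StringReader(item)),
                            new StreamResult(out));
                    return out.toString();
                }));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> items = new ArrayList<>();
        items.add("<item><title>A</title></item>");
        items.add("<item><title>B</title></item>");
        System.out.println(transformAll(items));
    }
}
```

The XsltElementProcessor above does not need any of this, since it is only ever used from the single thread that runs the parser.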

Now we re-work our code to create HTML table rows for an RSS feed:
    DOMImplementationRegistry registry = DOMImplementationRegistry
            .newInstance();
    DOMImplementationLS domImpl = (DOMImplementationLS) registry
            .getDOMImplementation("XML 1.0 LS 3.0");

    ElementProcessor processor =
            new XsltElementProcessor("itemToRow.xsl", System.out);
    ProcessingFilter filter = new ProcessingFilter(processor, "item");

    LSParser parser = domImpl.createLSParser(
            DOMImplementationLS.MODE_SYNCHRONOUS, null);
    parser.setFilter(filter);
    LSInput input = domImpl.createLSInput();

    URL url = new URL("http://news.google.com/?output=rss");
    InputStream inputStream = new BufferedInputStream(url.openStream());
   
    try {
        input.setByteStream(inputStream);

        System.out.println("<html><body><table><tr><th>Title</th>" +
                "<th>Category</th><th>Date</th></tr>");
        parser.parse(input);
        System.out.println("</table></body></html>");

        Exception ex = filter.getProcessingException();
        if (ex != null){
            throw ex;
        }
    }
    finally {
        inputStream.close();
    }
OK, the way I have output the XML that wraps our transformed items is a bit crummy, but it will do here. If we wanted to be a bit more clever, we could create another filter: one that, when used to output a template XML document, removes a placeholder element and replaces it with the results of our transformations.
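
As a much simpler stand-in for that idea, the wrapper can be kept in a template string and split at a placeholder. To be clear, this is plain string splitting, not the filter described above, and PlaceholderWrapperSketch, Body and wrap are names I have made up for the sketch.

```java
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper; plain string splitting, not a DOM or serializer filter.
class PlaceholderWrapperSketch {

    // Callback that writes the transformed items between the two template halves.
    interface Body {
        void writeTo(Writer out) throws Exception;
    }

    static void wrap(String template, String placeholder, Writer out, Body body)
            throws Exception {
        int at = template.indexOf(placeholder);
        if (at < 0) {
            throw new IllegalArgumentException(
                    "placeholder not found: " + placeholder);
        }
        // Everything before the placeholder.
        out.write(template, 0, at);
        // The generated content, e.g. the rows produced by the parse.
        body.writeTo(out);
        // Everything after the placeholder.
        out.write(template.substring(at + placeholder.length()));
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        wrap("<table><rows/></table>", "<rows/>", out,
                w -> w.write("<tr><td>example</td></tr>"));
        System.out.println(out);
    }
}
```

In the RSS example, the body callback would be the place to call parser.parse(input), provided the processor writes to the same output as the wrapper.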

And finally a basic "itemToRow.xsl":
    <xsl:stylesheet
             xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="html" indent="yes" encoding="UTF-8"/>       
        <xsl:template match="/item">
            <xsl:variable name="href">
                <xsl:value-of select="link"/>
            </xsl:variable>
            <tr>
               <td><a href="{$href}"><xsl:value-of select="title"/></a></td>
                <td><xsl:value-of select="category"/></td>
                <td><xsl:value-of select="pubDate"/></td>
            </tr>
        </xsl:template>
   
    </xsl:stylesheet>
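
To try the stylesheet without fetching a feed, it can be run against a single hand-written item. This is just a sketch: ItemToRowSketch, toRow and the sample item are invented for the test, and the stylesheet is inlined as a string instead of being read from itemToRow.xsl.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test harness for the stylesheet; not part of the original code.
class ItemToRowSketch {

    // The same stylesheet as above, inlined so the example is self-contained.
    static final String ITEM_TO_ROW_XSL =
            "<xsl:stylesheet"
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
            + "<xsl:output method=\"html\" indent=\"yes\" encoding=\"UTF-8\"/>"
            + "<xsl:template match=\"/item\">"
            + "<xsl:variable name=\"href\">"
            + "<xsl:value-of select=\"link\"/>"
            + "</xsl:variable>"
            + "<tr>"
            + "<td><a href=\"{$href}\"><xsl:value-of select=\"title\"/></a></td>"
            + "<td><xsl:value-of select=\"category\"/></td>"
            + "<td><xsl:value-of select=\"pubDate\"/></td>"
            + "</tr>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static String toRow(String itemXml) throws Exception {
        // Parse the sample item into a small DOM, as the filter would hand it over.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(itemXml)));

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(ITEM_TO_ROW_XSL)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(
                new DOMSource(doc.getDocumentElement()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toRow(
                "<item><title>First</title>"
                + "<link>http://example.com/a</link>"
                + "<category>Tech</category>"
                + "<pubDate>Tue, 05 Apr 2011</pubDate></item>"));
    }
}
```

The exact whitespace of the row depends on how the XSLT processor handles indent="yes", so I would not rely on it when comparing output.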
