Parsing XML Documents using SAX

XML parsers are used to parse and extract information from XML documents. The two most commonly used parsing APIs are the Simple API for XML (SAX) and the Document Object Model (DOM). SAX parsers are preferred when the XML document is comparatively large and the application does not need to keep the whole document in memory for later reuse. In this tip, let us see how to parse an XML document using SAX.

Before that, it is important to understand the architecture behind SAX parsing. SAX follows a push-based model: the parser scans the XML document from top to bottom and, whenever it finds a node (such as a start tag, an end tag or a text node), it pushes a notification to the application in the form of an event. So, SAX is basically a sequential, event-based parser.
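To make the push model concrete, here is a minimal sketch that records every notification the parser sends while reading a tiny in-memory document. The class name PushEventDemo, the method collectEvents() and the sample string are hypothetical and used only for illustration; the parser and handler types come from the standard JAXP and SAX packages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects the events pushed by the parser, in order.
class PushEventDemo {

    static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<String>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                events.add("startElement:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in more than one chunk; blank chunks are skipped.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("characters:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement:" + qName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : collectEvents("<note><to>Jenny</to></note>")) {
            System.out.println(event);
        }
    }
}
```

The application never asks the parser for the next node; it only reacts to callbacks, which is why nothing has to be kept in memory unless the handler chooses to keep it.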

For example, consider the following XML file, folders.xml, which we are going to parse using SAX.

folders.xml

<?xml version='1.0' encoding='UTF-8'?>

<folders>

    <folder name = 'softwares'>
        <file name = 'Winamp'/>
        <file name = 'Java'/>
        <file name = 'Xml Editor'/>
    </folder>

    <folder name = 'users'>
        <file name = 'Admin'/>
        <file name = 'guest'/>
        <file name = 'jenny'/>
    </folder>

</folders>

Before getting into the sample code, let us spend some time understanding some common XML terminology. At a high level, an XML document is a tree of nodes. For example, in the above document, folders is the root element and it has two child elements, both named folder. Each folder element has an attribute called name. An attribute is also modelled as a node in the XML data model, although SAX reports attributes together with the start of their element rather than as separate events. Given below is the code sample for XML parsing:

SaxReader.java

package tips.xml.sax;

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SaxReader {

    public static void main(String[] args) throws Exception {

        SAXParser parser = SAXParserFactory.newInstance()
            .newSAXParser();
        FolderHandler handler = new FolderHandler();
        parser.parse(new File("folders.xml"), handler);
    }
}

SAXParserFactory follows the factory pattern to create SAXParser instances that can be used for parsing. The default implementation that ships with the JDK is derived from the Apache Xerces parser, and it can be overridden with another vendor's implementation. To handle the events emitted while the parser reads the XML document, we have defined a new class called FolderHandler and passed an instance of it to the SAXParser.parse() method along with the File object. Following is the class definition for FolderHandler.
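To check which implementation the factory actually resolves to on a given machine, the following sketch prints the concrete class names. The class name FactoryInfo and its helper methods are hypothetical; the exact class names printed depend on the JDK and on any parser libraries on the classpath, so no particular output is assumed here.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which factory and parser classes JAXP selected.
class FactoryInfo {

    static String factoryClassName() {
        // JAXP consults the javax.xml.parsers.SAXParserFactory system property
        // (among other lookup steps) before falling back to the built-in default.
        return SAXParserFactory.newInstance().getClass().getName();
    }

    static String parserClassName() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Factory: " + factoryClassName());
        System.out.println("Parser:  " + parserClassName());
    }
}
```

Setting the javax.xml.parsers.SAXParserFactory system property to another vendor's factory class name is one way to swap the implementation without touching the parsing code.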

FolderHandler.java

package tips.xml.sax;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FolderHandler extends DefaultHandler {

    // Called when the parser reaches an opening tag such as <folder name='users'>.
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) throws SAXException {
        System.out.println();
        System.out.print("<" + qName);
        if (attributes.getLength() > 0) {
            System.out.print(" ");
            for (int index = 0; index < attributes.getLength(); index++) {
                // getQName() is used because getLocalName() returns an empty string
                // when the parser is not namespace-aware (the factory default).
                System.out.print(
                    attributes.getQName(index) + " => " + attributes.getValue(index));
            }
        }
        System.out.print(">");
    }

    // Called when the parser reaches a closing tag such as </folder>.
    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.print("\n</" + qName + ">");
    }
}

Note that the above class extends DefaultHandler, which is just an adapter class that provides empty implementations of the handler callbacks such as startDocument(), endDocument(), startElement() and endElement(). The methods of interest in our case are startElement() and endElement(). The method startElement() is called as soon as the parser encounters a start tag like <folder>, and endElement() is called when the parser reaches the matching end tag </folder>. Note that the document is parsed sequentially. So, in our case, the events happen in the following order:

startElement – for the root <folders> element
startElement – for the first <folder> element
startElement – for the first <file> element within the first <folder> element
endElement – for the first <file> element within the first <folder> element
startElement – for the second <file> element within the first <folder> element
endElement – for the second <file> element within the first <folder> element

and so on…
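The ordering listed above can be observed directly with a small trace. This is only a sketch: the class name EventOrderTrace, the shortened SAMPLE string and the indentation scheme are assumptions made for illustration and are not part of the original listing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records start/end events, indented by nesting depth.
class EventOrderTrace {

    // Shortened version of folders.xml, kept inline so the sketch is self-contained.
    static final String SAMPLE =
          "<folders>"
        + "<folder name='softwares'><file name='Winamp'/><file name='Java'/></folder>"
        + "</folders>";

    static List<String> trace(String xml) throws Exception {
        final List<String> lines = new ArrayList<String>();

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                lines.add(indent(depth) + "startElement " + qName);
                depth++;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
                lines.add(indent(depth) + "endElement " + qName);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return lines;
    }

    // Two spaces of indentation per nesting level.
    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        for (String line : trace(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Note that an empty element such as <file name='Winamp'/> still produces both a startElement and an endElement event, one immediately after the other.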
Now, let us look at the arguments passed to the startElement() and endElement() methods. Four arguments are passed to startElement(); the ones of interest are the third and fourth: qName, the qualified name, which is the element name as written in the document, and attributes, which holds the list of attributes for that element. We have added some simple logic within startElement() so that we get the output in the following XML-like format:

<folders>
<folder name => softwares>
<file name => Winamp>
</file>
<file name => Java>
</file>
<file name => Xml Editor>
</file>
</folder>
<folder name => users>
<file name => Admin>
</file>
<file name => guest>
</file>
<file name => jenny>
</file>
</folder>
</folders>
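Printing is only one use of the qName and attributes arguments. As a further sketch, a handler can look up an attribute by name with Attributes.getValue(String) and build a small in-memory index of files per folder. The class name FolderIndexHandler, the SAMPLE constant and the index() helper are hypothetical and introduced only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps each folder name to the file names it contains.
class FolderIndexHandler extends DefaultHandler {

    // Same content as folders.xml, inlined so the sketch is self-contained.
    static final String SAMPLE =
          "<folders>"
        + "<folder name='softwares'>"
        + "<file name='Winamp'/><file name='Java'/><file name='Xml Editor'/>"
        + "</folder>"
        + "<folder name='users'>"
        + "<file name='Admin'/><file name='guest'/><file name='jenny'/>"
        + "</folder>"
        + "</folders>";

    private final Map<String, List<String>> filesByFolder =
        new LinkedHashMap<String, List<String>>();
    private String currentFolder;

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        // Look the attribute up by its qualified name instead of by index.
        String name = attributes.getValue("name");
        if ("folder".equals(qName)) {
            currentFolder = name;
            filesByFolder.put(name, new ArrayList<String>());
        } else if ("file".equals(qName) && currentFolder != null) {
            filesByFolder.get(currentFolder).add(name);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("folder".equals(qName)) {
            currentFolder = null; // leaving the folder element
        }
    }

    static Map<String, List<String>> index(String xml) throws Exception {
        FolderIndexHandler handler = new FolderIndexHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.filesByFolder;
    }

    public static void main(String[] args) throws Exception {
        // prints {softwares=[Winamp, Java, Xml Editor], users=[Admin, guest, jenny]}
        System.out.println(index(SAMPLE));
    }
}
```

Because SAX itself keeps no state, the handler has to remember which folder it is currently inside; the currentFolder field plays that role here.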

About Krishna Srinivasan

He is Founder and Chief Editor of JavaBeat. He has more than 8 years of experience in developing Web applications. He writes about Spring, DOJO, JSF, Hibernate and many other emerging technologies in this blog.
