
SAX1 Programming Guide
 
Constructing a parser
 

To parse XML files with XML4C, create an instance of the SAXParser class, as shown in the example below. The DocumentHandler and ErrorHandler instances required by the SAX API are provided by the HandlerBase class supplied with XML4C.

int main (int argc, char* args[]) {

    try {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& toCatch) {
        cout << "Error during initialization! :\n"
             << toCatch.getMessage() << "\n";
        return 1;
    }

    const char* xmlFile = "x1.xml";
    SAXParser* parser = new SAXParser();
    parser->setDoValidation(true);    // optional
    parser->setDoNamespaces(true);    // optional

    // HandlerBase derives from both DocumentHandler and ErrorHandler,
    // so one instance can serve as both without a cast.
    HandlerBase* handler = new HandlerBase();
    parser->setDocumentHandler(handler);
    parser->setErrorHandler(handler);

    try {
        parser->parse(xmlFile);
    }
    catch (const XMLException& toCatch) {
        cout << "\nFile not found: '" << xmlFile << "'\n"
             << "Exception message is: \n"
             << toCatch.getMessage() << "\n" ;
        return -1;
    }

    delete parser;
    delete handler;
    return 0;
}

Using the SAX API
 

The SAX API for XML parsers was originally developed for Java. Please be aware that there is no standard SAX API for C++, and that use of the XML4C SAX API does not guarantee client code compatibility with other C++ XML parsers.

The SAX API presents a callback-based interface to the parser. An application that uses SAX provides an instance of a handler class to the parser. When the parser detects an XML construct, it calls the corresponding method of the handler class, passing it information about the construct that was detected. The most commonly used handler classes are DocumentHandler, which is called when XML constructs are recognized, and ErrorHandler, which is called when an error occurs. The header files for the various SAX handler classes are in '<xml4c-3_5_1>/include/sax'.
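The callback flow described above can be sketched without the library. The code below is only an illustration: ToyHandler, toyParse and NameCollector are hypothetical names, the "parser" recognizes nothing but bare start and end tags, and plain std::string is used instead of the XMLCh strings that XML4C passes to its handlers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a SAX-style handler interface. The real
// DocumentHandler has more callbacks and receives XMLCh strings.
class ToyHandler {
public:
    virtual ~ToyHandler() {}
    // Default implementations do nothing, so a subclass overrides only
    // the callbacks it cares about.
    virtual void startElement(const std::string& /*name*/) {}
    virtual void endElement(const std::string& /*name*/) {}
};

// Hypothetical "parser": scans for <name> and </name> tags and reports
// each one to the handler. Attributes, text and errors are ignored.
void toyParse(const std::string& xml, ToyHandler& handler) {
    std::string::size_type pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::string::size_type end = xml.find('>', pos);
        if (end == std::string::npos) break;
        const std::string tag = xml.substr(pos + 1, end - pos - 1);
        if (!tag.empty() && tag[0] == '/')
            handler.endElement(tag.substr(1));
        else if (!tag.empty())
            handler.startElement(tag);
        pos = end + 1;
    }
}

// Application-side handler: records the name of every element it sees.
class NameCollector : public ToyHandler {
public:
    std::vector<std::string> names;
    void startElement(const std::string& name) override {
        names.push_back(name);
    }
};
```

An application would construct a NameCollector, pass it to toyParse together with the document text, and read the collected names afterwards. The application never drives the scan itself; it only reacts to the callbacks.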

As a convenience, XML4C provides the class HandlerBase, a single class that is publicly derived from all the Handler classes. Its default implementations of the handler callback methods do nothing. A convenient way to get started with XML4C is to derive your own handler class from HandlerBase and override only the methods you want to customize. The simple example below shows how to create a handler that prints element names and fatal error messages. The source code for the sample applications shows additional examples of how to write handler classes.

This is the header file MySAXHandler.hpp:

#include <sax/HandlerBase.hpp>

class MySAXHandler : public HandlerBase {
public:
    MySAXHandler();
    void startElement(const XMLCh* const, AttributeList&);
    void fatalError(const SAXParseException&);
};

This is the implementation file MySAXHandler.cpp:

#include "MySAXHandler.hpp"
#include <iostream.h>

MySAXHandler::MySAXHandler()
{
}

void MySAXHandler::startElement(const XMLCh* const name,
                                AttributeList& attributes)
{
    // transcode() is a user-defined function that converts
    // Unicode strings to ordinary 'char *' strings. See the
    // sample program SAXCount for an example implementation.
    cout << "I saw element: " << transcode(name) << endl;
}

void MySAXHandler::fatalError(const SAXParseException& exception)
{
    cout << "Fatal Error: " << transcode(exception.getMessage())
         << " at line: " << exception.getLineNumber()
         << endl;
}

The XMLCh and AttributeList types are supplied by XML4C and are documented in the include files. Examples of their usage appear in the source code to the sample applications.
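The handler above relies on an application-defined transcode() helper. As a rough sketch of what such a helper might look like, the code below narrows a null-terminated UTF-16 string to a std::string. It makes several assumptions: XMLCh is modeled here as char16_t rather than taken from the library headers, the function returns std::string instead of 'char *', and any code unit outside 7-bit ASCII is replaced with '?'. The SAXCount sample remains the reference for a real implementation.

```cpp
#include <cassert>
#include <string>

// Stand-in for the library's XMLCh type (an assumption for this sketch).
typedef char16_t XMLCh;

// Narrow a null-terminated UTF-16 string to a std::string. Code units
// outside 7-bit ASCII are replaced with '?', so the conversion is lossy.
std::string transcode(const XMLCh* text) {
    std::string out;
    if (text == nullptr) return out;
    for (; *text != 0; ++text)
        out += (*text < 0x80) ? static_cast<char>(*text) : '?';
    return out;
}
```

Because the result is a std::string, it can be written to cout directly and needs no manual cleanup.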



SAX2 Programming Guide
 
Constructing an XML Reader
 

To parse XML files with XML4C, create an instance of the SAX2XMLReader class, as shown in the example below. The ContentHandler and ErrorHandler instances required by the SAX2 API are provided by the DefaultHandler class supplied with XML4C.

int main (int argc, char* args[]) {

    try {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& toCatch) {
        cout << "Error during initialization! :\n"
             << toCatch.getMessage() << "\n";
        return 1;
    }

    const char* xmlFile = "x1.xml";
    SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
    parser->setFeature(XMLString::transcode("http://xml.org/sax/features/validation"), true);   // optional
    parser->setFeature(XMLString::transcode("http://xml.org/sax/features/namespaces"), true);   // optional

    // DefaultHandler derives from both ContentHandler and ErrorHandler,
    // so one instance can serve as both without a cast.
    DefaultHandler* handler = new DefaultHandler();
    parser->setContentHandler(handler);
    parser->setErrorHandler(handler);

    try {
        parser->parse(xmlFile);
    }
    catch (const XMLException& toCatch) {
        cout << "\nFile not found: '" << xmlFile << "'\n"
             << "Exception message is: \n"
             << toCatch.getMessage() << "\n" ;
        return -1;
    }

    delete parser;
    delete handler;
    return 0;
}

Using the SAX2 API
 

The SAX2 API for XML parsers was originally developed for Java. Please be aware that there is no standard SAX2 API for C++, and that use of the XML4C SAX2 API does not guarantee client code compatibility with other C++ XML parsers.

The SAX2 API presents a callback-based interface to the parser. An application that uses SAX2 provides an instance of a handler class to the parser. When the parser detects an XML construct, it calls the corresponding method of the handler class, passing it information about the construct that was detected. The most commonly used handler classes are ContentHandler, which is called when XML constructs are recognized, and ErrorHandler, which is called when an error occurs. The header files for the various SAX2 handler classes are in '<xml4c-3_5_1>/include/sax2'.

As a convenience, XML4C provides the class DefaultHandler, a single class that is publicly derived from all the Handler classes. Its default implementations of the handler callback methods do nothing. A convenient way to get started with XML4C is to derive your own handler class from DefaultHandler and override only the methods you want to customize. The simple example below shows how to create a handler that prints element names and fatal error messages. The source code for the sample applications shows additional examples of how to write handler classes.

This is the header file MySAX2Handler.hpp:

#include <sax2/DefaultHandler.hpp>

class MySAX2Handler : public DefaultHandler {
public:
    MySAX2Handler();
    void startElement(
        const   XMLCh* const    uri,
        const   XMLCh* const    localname,
        const   XMLCh* const    qname,
        const   Attributes&     attrs
    );
    void fatalError(const SAXParseException&);
};

This is the implementation file MySAX2Handler.cpp:

#include "MySAX2Handler.hpp"
#include <iostream.h>

MySAX2Handler::MySAX2Handler()
{
}

void MySAX2Handler::startElement(const   XMLCh* const    uri,
                                 const   XMLCh* const    localname,
                                 const   XMLCh* const    qname,
                                 const   Attributes&     attrs)
{
    // transcode() is a user-defined function that converts
    // Unicode strings to ordinary 'char *' strings. See the
    // sample program SAX2Count for an example implementation.
    cout << "I saw element: " << transcode(qname) << endl;
}

void MySAX2Handler::fatalError(const SAXParseException& exception)
{
    cout << "Fatal Error: " << transcode(exception.getMessage())
         << " at line: " << exception.getLineNumber()
         << endl;
}

The XMLCh and Attributes types are supplied by XML4C and are documented in the include files. Examples of their usage appear in the source code to the sample applications.


Xerces SAX2 Supported Features
 

The behavior of the SAX2XMLReader depends on the values of the following features. All of them can be set using the SAX2XMLReader::setFeature(XMLCh*,bool) function. None of these features can be modified in the middle of a parse; attempting to do so throws an exception.

http://xml.org/sax/features/namespaces 
true:  Perform Namespace processing (default) 
false:  Optionally do not perform Namespace processing 

http://xml.org/sax/features/namespace-prefixes 
true:  Report the original prefixed names and attributes used for Namespace declarations (default) 
false:  Do not report attributes used for Namespace declarations, and optionally do not report original prefixed names.  

http://xml.org/sax/features/validation 
true:  Report all validation errors. (default) 
false:  Do not report validation errors.  

http://apache.org/xml/features/validation/dynamic 
true:  The parser will validate the document only if a grammar is specified. (http://xml.org/sax/features/validation must be true) 
false:  Validation is determined by the state of the http://xml.org/sax/features/validation feature (default) 

http://apache.org/xml/features/validation/schema 
true:  Enable the parser's schema support. (default)  
false:  Disable the parser's schema support.  

http://apache.org/xml/features/validation/reuse-grammar 
true:  The parser will reuse grammar information from previous parses in subsequent parses.  
false:  The parser will not reuse any grammar information. (default) 

http://apache.org/xml/features/validation/reuse-validator (deprecated) 
true:  The parser will reuse grammar information from previous parses in subsequent parses.  
false:  The parser will not reuse any grammar information. (default) 
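The interaction between the validation and validation/dynamic features can be summarized as a small decision function. This is a sketch of one reading of the descriptions above, not the parser's actual logic: ValidationFeatures and wouldValidate are hypothetical names, and hasGrammar stands for "the document specifies a grammar".

```cpp
#include <cassert>

// Hypothetical summary of the two feature flags that control validation.
struct ValidationFeatures {
    bool validation; // http://xml.org/sax/features/validation
    bool dynamic;    // http://apache.org/xml/features/validation/dynamic
};

// Returns true when, per the feature descriptions, validation errors
// would be reported for a document.
bool wouldValidate(const ValidationFeatures& f, bool hasGrammar) {
    if (!f.validation) return false;   // validation is switched off
    if (f.dynamic) return hasGrammar;  // validate only if a grammar exists
    return true;                       // the validation feature alone decides
}
```

With both features set to true, a document without a grammar is parsed without validation errors, while the same document with a grammar is validated.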


DOM Programming Guide
 
Java and C++ DOM comparisons
 

The C++ DOM API is very similar in design and use to the Java DOM API bindings. As a consequence, converting existing Java code that uses the DOM to C++ is a straightforward process.

This section outlines the differences between Java and C++ bindings.


Accessing the API from application code
 
// C++
#include <dom/DOM.hpp>
// Java
import org.w3c.dom.*;

The header file <dom/DOM.hpp> includes all the individual headers for the DOM API classes.


Class Names
 

The C++ class names are prefixed with "DOM_". The intent is to prevent conflicts between DOM class names and other names that may already be in use by an application or other libraries that a DOM based application must link with.

The use of C++ namespaces would also have solved this conflict problem, but for the fact that many compilers do not yet support them.

DOM_Document   myDocument;   // C++
DOM_Node       aNode;
DOM_Text       someText;
Document       myDocument;   // Java
Node           aNode;
Text           someText;

If you wish to use the Java class names in C++, then you need to typedef them in C++. This is not advisable for the general case - conflicts really do occur - but can be very useful when converting a body of existing Java code to C++.

typedef DOM_Document  Document;
typedef DOM_Node      Node;

Document   myDocument;        // Now C++ usage is
                              // indistinguishable from Java
Node       aNode;

Objects and Memory Management
 

The C++ DOM implementation uses automatic memory management, implemented using reference counting. As a result, the C++ code for most DOM operations is very similar to the equivalent Java code, right down to the use of factory methods in the DOM document class for nearly all object creation, and the lack of any explicit object deletion.

Consider the following code snippets:

// This is C++
DOM_Node       aNode;
aNode = someDocument.createElement("ElementName");
DOM_Node docRootNode = someDocument.getDocumentElement();
docRootNode.appendChild(aNode);
// This is Java
Node       aNode;
aNode = someDocument.createElement("ElementName");
Node docRootNode = someDocument.getDocumentElement();
docRootNode.appendChild(aNode);

The Java and the C++ are identical on the surface, except for the class names, and this similarity remains true for most DOM code.

However, Java and C++ handle objects in somewhat different ways, making it important to understand a little bit of what is going on beneath the surface.

In Java, the variable aNode is an object reference, essentially a pointer. It is initially == null, and references an object only after the assignment statement in the second line of the code.

In C++ the variable aNode is, from the C++ language's perspective, an actual live object. It is constructed when the first line of the code executes, and DOM_Node::operator = () executes at the second line. The C++ class DOM_Node is essentially a form of smart pointer; it implements much of the behavior of a Java object reference variable, and delegates the DOM behaviors to an implementation class that lives behind the scenes.

Key points to remember when using the C++ DOM classes:

  • Create them as local variables, or as member variables of some other class. Never "new" a DOM object into the heap or make an ordinary C pointer variable to one, as this will greatly confuse the automatic memory management.
  • The "real" DOM objects (nodes, attributes, CData sections, and so on) do live on the heap, and are created with the create... methods on class DOM_Document. DOM_Node and the other DOM classes serve as reference variables to the underlying heap objects.
  • The visible DOM classes may be freely copied (assigned), passed as parameters to functions, or returned by value from functions.
  • Memory management of the underlying DOM heap objects is automatic, implemented by means of reference counting. So long as some part of a document can be reached, directly or indirectly, via reference variables that are still alive in the application program, the corresponding document data will stay alive in the heap. When all possible paths of access have been closed off (all of the application's DOM objects have gone out of scope) the heap data itself will be automatically deleted.
  • There are restrictions on the ability to subclass the DOM classes.
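The handle-and-heap-object arrangement described in these points can be sketched with std::shared_ptr. This is not how the DOM classes are implemented; NodeImpl and NodeHandle are hypothetical names, and shared_ptr is swapped in for the library's own reference counting only to make the lifetime behavior visible.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical heap-side implementation object. It counts live instances
// so the effect of reference counting can be observed.
struct NodeImpl {
    static int liveCount;
    std::string name;
    explicit NodeImpl(const std::string& n) : name(n) { ++liveCount; }
    ~NodeImpl() { --liveCount; }
};
int NodeImpl::liveCount = 0;

// Hypothetical handle playing the role of DOM_Node: null by default,
// cheap to copy, and sharing ownership of the heap object.
class NodeHandle {
public:
    NodeHandle() {}
    explicit NodeHandle(const std::string& name)
        : impl_(std::make_shared<NodeImpl>(name)) {}
    bool isNull() const { return !impl_; }
    // Precondition: the handle is not null.
    const std::string& name() const { return impl_->name; }
    // True when both handles refer to the same heap object.
    bool operator==(const NodeHandle& other) const {
        return impl_ == other.impl_;
    }
private:
    std::shared_ptr<NodeImpl> impl_;
};
```

Copying a NodeHandle never copies the NodeImpl behind it. The heap object is destroyed when the last handle referring to it goes out of scope, which is why the handles themselves should live on the stack or as class members rather than being allocated with new.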

DOMString
 

Class DOMString provides the mechanism for passing string data to and from the DOM API. DOMString is not intended to be a completely general string class, but rather to meet the specific needs of the DOM API.

The design derives from two primary sources: the DOM's CharacterData interface and class java.lang.String.

Main features are:

  • It stores Unicode text.
  • Automatic memory management, using reference counting.
  • DOMStrings are mutable - characters can be inserted, deleted or appended.

When a string is passed into a method of the DOM, when setting the value of a Node, for example, the string is cloned so that any subsequent alteration or reuse of the string by the application will not alter the document contents. Similarly, when strings from the document are returned to an application via the DOM API, the string is cloned so that the document can not be inadvertently altered by subsequent edits to the string.

The ICU classes are a more general solution to Unicode character handling for C++ applications. ICU is an Open Source Unicode library, available at the IBM DeveloperWorks website.

Equality Testing
 

The DOMString equality operators (and all of the rest of the DOM class conventions) are modeled after the Java equivalents. The equals() method compares the content of the string, while the == operator checks whether the string reference variables (the application program variables) refer to the same underlying string in memory. This is also true of DOM_Node, DOM_Element, etc., in that operator == tells whether the variables in the application are referring to the same actual node or not. It's all very Java-like.

  • bool operator == () is true if the DOMString variables refer to the same underlying storage.
  • bool equals() is true if the strings contain the same characters.

Here is an example of how the equality operators work:

DOMString a = "Hello";
DOMString b = a;
DOMString c = a.clone();
if (b == a)           //  This is true
if (a == c)           //  This is false
if (a.equals(c))       //  This is true
b = b + " World";
if (b == a)           // Still true, and the string's
                      //    value is "Hello World"
if (a.equals(c))      // false.  a is "Hello World";
                      //    c is still "Hello".
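The difference between identity and content comparison can be reproduced with a small handle-style string. The class below is a hypothetical illustration built on std::shared_ptr, not DOMString itself; in particular, append() is an assumed name for an in-place edit of the shared storage.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical handle-style string: copies share storage, clone() does not.
class SharedText {
public:
    SharedText(const char* s) : data_(std::make_shared<std::string>(s)) {}
    // Returns a new handle with its own copy of the characters.
    SharedText clone() const { return SharedText(data_->c_str()); }
    // Edits the shared storage in place; every copy of this handle sees it.
    void append(const char* s) { data_->append(s); }
    // Identity: do both handles refer to the same storage?
    bool operator==(const SharedText& o) const { return data_ == o.data_; }
    // Content: do both strings hold the same characters?
    bool equals(const SharedText& o) const { return *data_ == *o.data_; }
private:
    std::shared_ptr<std::string> data_;
};
```

Two handles copied from one another compare equal with == and stay that way after an in-place edit, while a clone matches only through equals() and only until one of the two strings changes.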

Downcasting
 

Application code sometimes must cast an object reference from DOM_Node to one of the classes deriving from DOM_Node (DOM_Element, for example). The syntax for doing this in C++ is different from that in Java.

// This is C++
DOM_Node       aNode = someFunctionReturningNode();
DOM_Element    el = (DOM_Element &) aNode;
// This is Java
Node       aNode = someFunctionReturningNode();
Element    el = (Element) aNode;

The C++ cast is not type-safe; the Java cast is checked for compatible types at runtime. If necessary, a type-check can be made in C++ using the node type information:

// This is C++

DOM_Node       aNode = someFunctionReturningNode();
DOM_Element    el;    // by default, el will == null.

if (aNode.getNodeType() == DOM_Node::ELEMENT_NODE)
{
    el = (DOM_Element &) aNode;
}
else
{
    // aNode does not refer to an element.
    // Do something to recover here.
}

Subclassing
 

The C++ DOM classes, DOM_Node, DOM_Attr, DOM_Document, etc., are not designed to be subclassed by an application program.

As an alternative, the DOM_Node class provides a User Data field for use by applications as a hook for extending nodes by referencing additional data or objects. See the API description for DOM_Node for details.



Experimental IDOM Programming Guide
 

The experimental IDOM API is a new design of the C++ DOM API. Please note that this experimental IDOM API is only a prototype and is subject to change.

Constructing a parser
 

To parse XML files with XML4C using IDOM, create an instance of the IDOMParser class, as shown in the example below.

int main (int argc, char* args[]) {

    try {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& toCatch) {
        cout << "Error during initialization! :\n"
             << toCatch.getMessage() << "\n";
        return 1;
    }

    const char* xmlFile = "x1.xml";
    IDOMParser* parser = new IDOMParser();
    parser->setValidationScheme(IDOMParser::Val_Always);    // optional.
    parser->setDoNamespaces(true);    // optional

    ErrorHandler* errHandler = (ErrorHandler*) new HandlerBase();
    parser->setErrorHandler(errHandler);

    try {
        parser->parse(xmlFile);
    }
    catch (const XMLException& toCatch) {
        cout << "\nFile not found: '" << xmlFile << "'\n"
             << "Exception message is: \n"
             << toCatch.getMessage() << "\n" ;
       return -1;
    }

    return 0;
}
      

Comparison of C++ DOM and IDOM
 

This section outlines the differences between the C++ DOM and IDOM APIs.


Motivation behind new design
 

The performance of the C++ DOM has not been as good as it might be, especially for use in server style applications. The DOM's reference counted automatic memory management has been the biggest time consumer. The situation becomes worse when running multi-threaded applications.

The experimental C++ IDOM is a new alternative to the C++ DOM, and aims at meeting the following requirements:

  • Reduced memory footprint.
  • Fast.
  • Good scalability on multiprocessor systems.
  • More C++ like and less Java like.

Class Names
 

The IDOM class names are prefixed with "IDOM_". The intent is to prevent conflicts between IDOM class names and DOM class names that may already be in use by an application or other libraries that a DOM based application must link with.

IDOM_Document*   myDocument;   // IDOM
IDOM_Node*       aNode;
IDOM_Text*       someText;
      
DOM_Document     myDocument;   // DOM
DOM_Node         aNode;
DOM_Text         someText;
      

Objects and Memory Management
 

The C++ IDOM implementation no longer uses reference counting for automatic memory management. The storage for a DOM document is associated with the document node object. Applications would use normal C++ pointers to directly access the implementation objects for Nodes in IDOM C++, while they would use object references in DOM C++.

Consider the following code snippets:

// IDOM C++
IDOM_Node*       aNode;
IDOM_Node* docRootNode;
aNode = someDocument->createElement("ElementName");
docRootNode = someDocument->getDocumentElement();
docRootNode->appendChild(aNode);
      
// DOM C++
DOM_Node       aNode;
DOM_Node docRootNode;
aNode = someDocument.createElement("ElementName");
docRootNode = someDocument.getDocumentElement();
docRootNode.appendChild(aNode);
      

The IDOM C++ uses an independent storage allocator per document. The advantage here is that allocation requires no synchronization in most cases (based on the same threading model that we have now: one thread active per document, but any number of documents running in parallel with separate threads).

The allocator does not support a delete operation at all - all allocated memory would persist for the life of the document, and then the larger blocks would be returned to the system without separately deleting all of the individual nodes and strings within the document.
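A per-document allocator of this kind is often called an arena. The sketch below shows the general technique under stated assumptions: DocumentArena and internString are hypothetical names, the 4096-byte default block size is arbitrary, and nothing here is taken from the IDOM sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-document arena: memory is carved out of large blocks
// and is released only when the arena itself is destroyed.
class DocumentArena {
public:
    explicit DocumentArena(std::size_t blockSize = 4096)
        : blockSize_(blockSize), currentSize_(0), used_(0) {}
    ~DocumentArena() {
        // Whole blocks are freed at once; individual objects never are.
        for (std::size_t i = 0; i < blocks_.size(); ++i)
            delete[] blocks_[i];
    }
    // Returns `size` bytes, aligned for any fundamental type.
    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;   // round up
        if (blocks_.empty() || used_ + size > currentSize_) {
            // Start a new block, large enough for oversized requests.
            currentSize_ = size > blockSize_ ? size : blockSize_;
            blocks_.push_back(new char[currentSize_]);
            used_ = 0;
        }
        void* p = blocks_.back() + used_;
        used_ += size;
        return p;
    }
    std::size_t blockCount() const { return blocks_.size(); }
private:
    DocumentArena(const DocumentArena&);             // non-copyable
    DocumentArena& operator=(const DocumentArena&);
    std::size_t blockSize_;
    std::size_t currentSize_;
    std::size_t used_;
    std::vector<char*> blocks_;
};

// Copies a C string into the arena; the copy lives as long as the arena.
const char* internString(DocumentArena& arena, const char* s) {
    const std::size_t n = std::strlen(s) + 1;
    char* copy = static_cast<char*>(arena.allocate(n));
    std::memcpy(copy, s, n);
    return copy;
}
```

Because each arena is touched by a single thread, allocate() needs no locking, and tearing a document down costs one delete[] per block rather than one per node or string.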

The C++ DOM and IDOM are similar in the use of factory methods in the document class for all object creation. They differ in the object deletion mechanism.

In C++ DOM, there is no explicit object deletion. The deallocation of memory is automatically taken care of by the reference counting.

In C++ IDOM, object deletion is either implicit or explicit. When a document is parsed using an IDOMParser, the allocated storage is deleted automatically when the parser instance is deleted (implicit). If a user is manually building a DOM tree in memory using the document factory methods, then the user needs to explicitly delete the document object to free all allocated memory.

Consider the following code snippets:

// C++ IDOM - explicit deletion
IDOM_Document*   myDocument;
IDOM_Node*       aNode;
myDocument = IDOM_DOMImplementation::getImplementation()->createDocument();
aNode = myDocument->createElement("ElementName");
myDocument->appendChild(aNode);
delete myDocument;
      
// C++ DOM - implicit deletion
DOM_Document    myDocument;
DOM_Node        aNode;
myDocument = DOM_DOMImplementation::getImplementation().createDocument();
aNode = myDocument.createElement("ElementName");
myDocument.appendChild(aNode);
      

Key points to remember when using the C++ IDOM classes:

  • The DOM objects are accessed via C++ pointers.
  • The DOM objects - nodes, attributes, CData sections, etc., are created with the factory methods (create...) in the document class.
  • If you are manually building a DOM tree in memory, you need to explicitly delete the document object. Memory management will be automatically taken care of by the IDOM parser when parsing an instance document.

DOMString vs. XMLCh
 

The IDOM C++ no longer uses DOMString to pass string data to and from the DOM API. Instead, the IDOM C++ uses plain, null-terminated (XMLCh *) UTF-16 strings. The (XMLCh *) UTF-16 string type is much simpler and has lower overhead. All the string data remains in memory until the document object is deleted.

//C++ IDOM
const XMLCh* nodeValue = aNode->getNodeValue();
    
//C++ DOM
DOMString    nodeValue = aNode.getNodeValue();
    

