General Improvements
The new version is improved in many ways. Some general improvements
are: significantly better conformance to the XML spec, cleaner
internal architecture, many bug fixes, and faster speed.

Conformance
Except for a couple of very obscure cases (mostly related to
'standalone' mode), this version should be quite compliant.
We have more than a thousand tests, some collected from various
public sources and some generated by IBM, which are used for
regression testing. The C++ parser now passes all but a
handful of them.

Bug Fixes
This version contains many bug fixes relative to XML4C version 2.x.
Some of these were reported by users and some were uncovered by
the conformance testing.

Performance
Much work was done to speed up this version. Some of the
new features, such as namespace support, and the added conformance
checks ate into these gains, but overall the new version
is significantly faster than previous versions, even while doing
more.

Samples
The sample programs no longer use any of the unsupported
util/xxx classes, which existed only to allow us to write
portable samples. Since the wide character APIs are widely
supported these days, we decided to write the samples
directly in terms of those APIs. If your system does not
support these APIs, you will not be able to build and run the
samples; on some platforms, these APIs may be optional
packages or require runtime updates.
More samples have been added as well, highlighting some
of the new functionality introduced in the new code base, and
the existing samples have been cleaned up.
The new samples are:
The new samples are:
- PParse - Demonstrates 'progressive parse' (see below)
- StdInParse - Demonstrates use of the standard-in input source
- EnumVal - Shows how to enumerate the markup decls in a DTD Validator

Simplified Parser Classes
In the XML4C 2.x code base, there were the following parser
classes (in the src/parsers/ source directory):
NonValidatingSAXParser, ValidatingSAXParser,
NonValidatingDOMParser, and ValidatingDOMParser. The
non-validating ones were the base classes; the validating
ones just derived from them and turned on validation.
This was deemed a bit overblown, considering the tiny
amount of code required to turn on validation and the fact
that it forced people to use a pointer to the parser in most
cases (if they needed to support either validating or
non-validating versions).
The new code base just has SAXParser and DOMParser
classes. These are capable of handling both validating and
non-validating modes, according to the state of a flag that
you can set on them. For instance, here is a code snippet that
shows this in action.
    void ParseThis(const XMLCh* const fileToParse,
                   const bool validate)
    {
        //
        // Create a SAXParser. It can now just be
        // created by value on the stack if we want
        // to parse something within this scope.
        //
        SAXParser myParser;

        // Tell it whether to validate or not
        myParser.setDoValidation(validate);

        // Parse and catch exceptions...
        try
        {
            myParser.parse(fileToParse);
        }
        ...
    }
We feel that this is a simpler architecture, and that it makes things
easier for you. In the above example, for instance, the parser will be
cleaned up for you automatically upon exit since you don't have to
allocate it anymore.
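The DOMParser works the same way. Here is a minimal sketch of the
DOM-based equivalent; it assumes that DOMParser exposes the same
setDoValidation() flag and a getDocument() accessor for the resulting
tree, and it omits headers and error reporting just as the example
above does.
    void ParseThisToDOM(const XMLCh* const fileToParse,
                        const bool validate)
    {
        // Same pattern as the SAX example: create by value, set the flag
        DOMParser myParser;
        myParser.setDoValidation(validate);

        try
        {
            myParser.parse(fileToParse);
        }
        catch (...)
        {
            // Report the error in whatever way suits your application
            return;
        }

        // Assumption: the parser owns the resulting tree and hands it
        // back via getDocument() once the parse completes.
        DOM_Document doc = myParser.getDocument();
    }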

Experimental DOM Level 2 Support
Experimental early support for some parts of the DOM Level
2 specification has been added. This addresses some of the
shortcomings in our DOM implementation, such as the lack of a
simple, standard mechanism for tree traversal.

Progressive Parsing
The new parser classes support, in addition to the
parse() method, two new parsing methods,
parseFirst() and parseNext(). These are
designed to support 'progressive parsing', so that you don't
have to depend upon throwing an exception to terminate the
parsing operation. Calling parseFirst() will cause the DTD (or,
in the future, Schema) to be parsed (both internal and
external subsets), along with any pre-content, i.e. everything
up to but not including the root element. Each subsequent call
to parseNext() will cause one more piece of markup to be parsed
and spit out from the core scanning code to the parser (and
hence either on to you if using SAX, or into the DOM tree if
using DOM). You can quit the parse at any time by simply not
calling parseNext() anymore and breaking out of the loop. When
you call parseNext() and the end of the root element is the
next piece of markup, the parser will continue on to the end
of the file and return false, to let you know that the parse
is done. So a typical progressive parse loop will look like
this:
    // Create a progressive scan token
    XMLPScanToken token;

    if (!parser.parseFirst(xmlFile, token))
    {
        cerr << "parseFirst() failed\n" << endl;
        return 1;
    }

    //
    // We started ok, so let's call parseNext()
    // until we find what we want or hit the end.
    //
    bool gotMore = true;
    while (gotMore && !handler.getDone())
        gotMore = parser.parseNext(token);
In this case, our event handler object (named 'handler',
surprisingly enough) is watching for some criterion and will
return a status from its getDone() method. Since the handler
sees the SAX events coming out of the SAXParser, it can tell
when it has found what it wants. So we loop until we get no
more data or until our handler indicates that it has seen what
it wanted to see.
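For concreteness, here is a rough sketch of what such a handler
might look like. Only the idea of deriving from the SAX HandlerBase
and exposing a getDone() flag comes from the description above; the
stopping condition shown (quit after the first start tag) is a
made-up placeholder for whatever criterion your application cares
about.
    class MyPScanHandler : public HandlerBase
    {
    public:
        MyPScanHandler() : fDone(false) {}

        // Called for each start tag that parseNext() spits out. A real
        // handler would test the element name or attributes here; for
        // illustration we just stop after the first one we see.
        void startElement(const XMLCh* const name, AttributeList& attrs)
        {
            fDone = true;
        }

        // The progressive parse loop polls this to decide when to stop
        bool getDone() const { return fDone; }

    private:
        bool fDone;
    };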
When doing non-progressive parses, the parser can easily
know when the parse is complete and ensure that any
resources used are cleaned up. Even in the case of a fatal parsing
error, it can clean up all per-parse resources. However, when
progressive parsing is done, the client code driving the parse
loop might choose to stop the parse before the end of the
primary file is reached. In such cases, the parser will not
know that the parse has ended, so any resources will not be
reclaimed until the parser is destroyed or another parse is started.
This might not seem like such a bad thing; however, in this case,
the files and sockets that were opened in order to parse the
referenced XML entities will remain open. This could cause
serious problems. Therefore, you should destroy the parser instance
in such cases, or immediately start another parse. In a future
release, a reset method will be provided to do this more cleanly.
Also note that you must create a scan token and pass it
back in on each call. This ensures that things don't get done
out of sequence. When you call parseFirst() or parse(), any
previous scan tokens are invalidated and will cause an error
if used again. This prevents incorrect mixed use of the two
different parsing schemes and incorrect calls to
parseNext().

Loadable Message Text
The system now supports loadable message text, instead of
having it hard coded into the program. The current drop still
supports only English, but it can now support other
languages. Anyone interested in contributing translations
should contact us; this would be an extremely useful
service.
In order to support the local message loading services, we
have created a pretty flexible framework for supporting
loadable text. Firstly, there is now an XML file, in the
src/NLS/ directory, which contains all of the error messages.
There is a simple program, in the Tools/NLSXlat/ directory,
which can spit out that text in various formats. It currently
supports a simple 'in memory' format (i.e. an array of
strings), the Win32 resource format, and the message catalog
format. The 'in memory' format is intended for very simple
installations or for use when porting to a new platform (since
you can use it until you can get your own local message
loading support done.)
In the src/util/ directory, there is now an XMLMsgLoader
class. This is an abstraction from which any number of
message loading services can be derived. Your platform driver
file can create whichever type of message loader it wants to
use on that platform. We currently have versions for the in
memory format, the Win32 resource format, and the message
catalog format. An ICU one is present but not implemented
yet. Some of the platforms can support multiple message
loaders, in which case a #define token is used to control
which one is used. You can set this in your build projects to
control the message loader type used.
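To make that selection concrete, here is a sketch of how a platform
driver's message loader factory typically boils down to a compile-time
switch. The function name and the #define tokens shown are assumptions
made for the purpose of the example; check your own platform driver
file and build projects for the actual names.
    XMLMsgLoader* XMLPlatformUtils::loadAMsgSet(const XMLCh* const msgDomain)
    {
        // Assumed tokens and loader classes, for illustration only
    #if defined(XML_USE_ICU_MESSAGELOADER)
        return new ICUMsgLoader(msgDomain);
    #elif defined(XML_USE_WIN32_MSGLOADER)
        return new Win32MsgLoader(msgDomain);
    #else
        // The simple 'in memory' loader, handy while porting to a new platform
        return new InMemMsgLoader(msgDomain);
    #endif
    }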
Both the Java and C++ parsers emit the same messages for an XML error
since they are being taken from the same message file.

Pluggable Validators
In a preliminary move to support Schemas, and to make them
first class citizens just like DTDs, the system has been
reworked internally to make validators completely pluggable.
So now the DTD validator code is under the src/validators/DTD/
directory, with a future Schema validator probably going into
its own directory under src/validators/. The core scanner
architecture now works completely in terms of the
framework/XMLValidator abstract interface and knows almost
nothing about DTDs or Schemas. For now, if you don't pass a
validator to the parsers, they will just create a
DTDValidator. This means that, theoretically, you could write
your own validator. But we would not encourage this for a
while, until the semantics of the XMLValidator interface are
completely worked out and proven to handle DTDs and Schemas
cleanly.
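As an illustration of the pluggable design, explicitly handing a
validator to a parser might look like the snippet below. It assumes
that the parser constructor accepts (and adopts) an XMLValidator
pointer, which is how we expect the pluggability to be exposed; if
you pass nothing, a DTDValidator is created for you as described
above.
    // Assumption: the parser constructor adopts the validator, so the
    // parser itself will destroy it when it goes away.
    DTDValidator* myValidator = new DTDValidator;
    SAXParser myParser(myValidator);

    myParser.setDoValidation(true);
    myParser.parse(fileToParse);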

Pluggable Transcoders
Another abstract framework, added in the src/util/ directory,
supports pluggable transcoding services. The
XMLTransService class is an abstract API that can be derived
from to support any desired transcoding
service. XMLTranscoder is the abstract API for a particular
instance of a transcoder for a particular encoding. The
platform driver file decides what specific type of transcoder
to use, which allows each platform to use its native
transcoding services, or the ICU service if desired.
Implementations are provided for the Win32 native services, the
ICU services, and the iconv services available on many
Unix platforms. The Win32 version only provides native code
page services, so it can only handle XML text in the intrinsic
encodings ASCII, UTF-8, UTF-16 (Big/Small Endian), and UCS-4
(Big/Small Endian), plus the EBCDIC code pages IBM037 and
IBM1140, ISO-8859-1 (aka Latin1), and Windows-1252. The ICU version
provides all of the encodings that ICU supports. The
iconv version will support the encodings supported
by the local system. You can use the transcoders we provide or
create your own if you feel ours are insufficient in some way,
or if your platform requires an implementation that we do not
provide.
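To illustrate the shape of this two-level abstraction, the sketch
below uses simplified stand-in classes rather than the real
XMLTransService and XMLTranscoder signatures: a per-platform service
object that, given an encoding name, hands back a transcoder
dedicated to that encoding.
    // Stand-in classes for illustration only; the real abstract APIs
    // live in src/util/ as XMLTransService and XMLTranscoder.
    class MyTranscoder
    {
    public:
        virtual ~MyTranscoder() {}

        // Convert a block of externally encoded bytes into XMLCh characters
        virtual unsigned int transcodeFrom(const char* srcData,
                                           unsigned int srcBytes,
                                           XMLCh* toFill,
                                           unsigned int maxChars) = 0;
    };

    class MyTransService
    {
    public:
        virtual ~MyTransService() {}

        // Hand back a transcoder for the named encoding, or 0 if unsupported
        virtual MyTranscoder* makeTranscoderFor(const XMLCh* encodingName) = 0;
    };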