Java XML

Document Object Model (DOM)

DOM is a standard tree structure, where each node contains one of the components from an XML structure. The two most common types of nodes are element nodes and text nodes. Using DOM functions lets you create nodes, remove nodes, change their contents, and traverse the node hierarchy.
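
The operations listed above (create, remove, change, traverse) can be sketched with the org.w3c.dom API that ships with the JDK. This is a minimal illustration, not a recommended design; the class name DomBasicsSketch and the element names library, book and obsolete are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical class name; element names are illustrative only.
class DomBasicsSketch {

    // Builds a small tree, edits it, and returns "name=text;" for each remaining child element.
    static String buildAndEdit() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }

        // Create nodes: an element node with a text node child, plus a sibling element.
        Element root = doc.createElement("library");
        doc.appendChild(root);
        Element book = doc.createElement("book");
        book.appendChild(doc.createTextNode("Draft title"));
        root.appendChild(book);
        Element obsolete = doc.createElement("obsolete");
        root.appendChild(obsolete);

        // Change contents: the text lives in the child text node, not in the element itself.
        book.getFirstChild().setNodeValue("Final title");

        // Remove a node from the hierarchy.
        root.removeChild(obsolete);

        // Traverse the children of the root element.
        StringBuilder out = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                out.append(n.getNodeName()).append('=').append(n.getTextContent()).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildAndEdit()); // prints "book=Final title;"
    }
}
```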

The JAXP 1.4.2 implementation of DOM supports XML Schema.

For simple data structures, consider one of the more object-oriented APIs, such as JDOM or dom4j.
Processing a DOM differs from processing a JDOM or dom4j structure.
DOM was designed as a language-neutral standard, so its API is less object-oriented than JDOM or dom4j.
DOM is a codified standard for an in-memory document model.
DOM is more complex and takes more code, but it is robust and, as a standard, future-proof.

The difference between the document model used in DOM and the data model used in JDOM or dom4j lies in:

  • The kind of node that exists in the hierarchy
  • The capacity for mixed content

Text and elements can be freely intermixed in a DOM hierarchy: <sentence>This is an <bold>important</bold> idea.</sentence>

ELEMENT: sentence
+ TEXT: This is an
+ ELEMENT: bold
+ TEXT: important
+ TEXT: idea.

In DOM, the value of an element node does not hold its text: getNodeValue() returns null for an element, and the node's name and type simply identify the kind of node it is.
In DOM the value of an element is therefore not the same as its content.
In JDOM and dom4j, after you navigate to an element that contains text, you invoke a method such as getText() to get its content. When processing a DOM, though, you must inspect the list of child nodes to "put together" the text of the node.
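
To make the mixed-content point concrete, here is a small sketch that parses the sentence example above and inspects its child nodes. The class and method names (DomMixedContentSketch, parseRoot, directText) are hypothetical; getTextContent() is the DOM Level 3 shortcut that concatenates the text of all descendants.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; the XML is the sentence example from the text.
class DomMixedContentSketch {
    static final String XML = "<sentence>This is an <bold>important</bold> idea.</sentence>";

    // Parses the sample document and returns its root element.
    static Element parseRoot() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects only the text nodes that are direct children of the element.
    static String directText(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Element sentence = parseRoot();
        System.out.println(sentence.getNodeValue());               // null: an element has no value
        System.out.println(sentence.getChildNodes().getLength());  // 3: text, element, text
        System.out.println(directText(sentence));                  // text outside <bold> only
        System.out.println(sentence.getTextContent());             // This is an important idea.
    }
}
```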

The DOM API is generally easier to use than SAX.

Simple API for XML (SAX)

SAX is an event-driven, serial-access mechanism for accessing XML documents. This protocol is frequently used by servlets and network-oriented programs that need to transmit and receive XML documents, because it is the fastest and least memory-intensive mechanism that is currently available for dealing with XML documents, other than the Streaming API for XML (StAX).

SAX is oriented towards state-independent processing, where the handling of an element does not depend on the elements that came before. StAX, on the other hand, is oriented towards state-dependent processing.

Setting up a program to use SAX requires a bit more work than setting up to use the Document Object Model (DOM). SAX is an event-driven model (you provide the callback methods, and the parser invokes them as it reads the XML data), and that makes it harder to visualize.

Developers who are writing a user-oriented application that displays an XML document and possibly modifies it will want to use the DOM mechanism described in Document Object Model.

SAX is fast and efficient, but its event model makes it most useful for state-independent filtering. For example, a SAX parser calls one method in your application when an element tag is encountered and calls a different method when text is found. If the processing you are doing is state-independent (meaning that it does not depend on the elements that have come before), then SAX works fine.
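
A minimal sketch of that callback style, assuming a handler that only counts what it sees (a state-independent task). SaxCountSketch and CountingHandler are made-up names; DefaultHandler, startElement and characters are the standard org.xml.sax callbacks.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class names; the handler counts elements and text characters.
class SaxCountSketch {

    static class CountingHandler extends DefaultHandler {
        int elements;   // number of start tags seen
        int textChars;  // total characters of text seen

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++; // the parser calls this for every element start tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            textChars += length; // may be called several times for one run of text
        }
    }

    // Parses the given XML string and returns the handler holding the counts.
    static CountingHandler count(String xml) {
        CountingHandler handler = new CountingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        CountingHandler h = count("<sentence>This is an <bold>important</bold> idea.</sentence>");
        System.out.println(h.elements + " elements, " + h.textChars + " text characters");
    }
}
```

Neither callback needs to know what came before it, which is why this kind of job suits SAX.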

On the other hand, for state-dependent processing, where the program needs to do one thing with the data under element A but something different with the data under element B, a pull parser such as the Streaming API for XML (StAX) would be a better choice. With a pull parser, you get the next node, whatever it happens to be, at any point in the code that you ask for it. So it is easy to vary the way you process text (for example), because you can process it in multiple places in the program.
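
The following sketch illustrates state-dependent processing with the StAX cursor API. The element names a and b, the class name StaxPullSketch, and the two treatments (upper-casing versus reversing) are invented for the example; the point is only that the text is pulled at two different places in the code.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; <a> and <b> stand in for "element A" and "element B".
class StaxPullSketch {

    // Upper-cases the text under <a> and reverses the text under <b>.
    static String process(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("a")) {
                        // Pull the text here and treat it one way...
                        out.append(reader.getElementText().toUpperCase(Locale.ROOT));
                    } else if (name.equals("b")) {
                        // ...and pull it here to treat it another way.
                        out.append(new StringBuilder(reader.getElementText()).reverse());
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(process("<root><a>abc</a><b>xyz</b></root>")); // prints "ABCzyx"
    }
}
```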

SAX requires much less memory than DOM, because SAX does not construct an internal representation (tree structure) of the XML data, as a DOM does. Instead, SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.

Pull parsers and the SAX API both act like a serial I/O stream. You see the data as it streams in, but you cannot go back to an earlier position or leap ahead to a different position. In general, such parsers work well when you simply want to read data and have the application act on it.

But when you need to modify an XML structure, especially when you need to modify it interactively, an in-memory structure makes more sense. DOM is one such model. However, although DOM provides many powerful capabilities for large-scale documents (like books and articles), it also requires a lot of complex coding. The details of that process are highlighted in When to Use DOM in the next lesson.

Java API for XML Processing (JAXP)

JAXP leverages the parser standards of SAX and DOM so that you can choose to parse your data as a stream of events or to build an object representation of it.
JAXP also supports XSLT transformations, DTD validation, and StAX.

Streaming API for XML (StAX)

StAX is the latest API in the JAXP family, and provides an alternative to SAX, DOM, and TrAX.
StAX provides a standard, bidirectional pull parser interface for streaming XML processing, offering a simpler programming model than SAX and more efficient memory management than DOM. StAX enables developers to parse and modify XML streams as events, and to extend XML information models to allow application-specific additions.

StAX is not as powerful or flexible as TrAX or JDOM.
StAX-enabled clients are generally easier to code than SAX clients.
StAX is a bidirectional API, meaning that it can both read and write XML documents. SAX is read-only.
SAX is a push API, whereas StAX is pull.
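
The write half of that bidirectional API can be sketched with XMLStreamWriter. This is an illustrative example only; the class name StaxWriteSketch is made up, and the document written is the sentence example used earlier on this page.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class name; writes a small mixed-content document as a stream of events.
class StaxWriteSketch {

    // Returns the serialized XML produced by the writer.
    static String write() {
        StringWriter sink = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            writer.writeStartElement("sentence");
            writer.writeCharacters("This is an ");
            writer.writeStartElement("bold");
            writer.writeCharacters("important");
            writer.writeEndElement();            // closes <bold>
            writer.writeCharacters(" idea.");
            writer.writeEndElement();            // closes <sentence>
            writer.close();                      // flushes buffered output to the sink
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(write());
    }
}
```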


Pull parsing vs push parsing

Streaming pull parsing refers to a programming model in which a client application calls methods on an XML parsing library when it needs to interact with an XML infoset—that is, the client only gets (pulls) XML data when it explicitly asks for it.

Streaming push parsing refers to a programming model in which an XML parser sends (pushes) XML data to the client as the parser encounters elements in an XML infoset—that is, the parser sends the data whether or not the client is ready to use it at that time.

Java Architecture for XML Binding (JAXB)

JAXB provides a way to:

  • Bind (transform) XML schemas to Java classes
  • Unmarshal (read) XML instance documents into Java content trees
  • Marshal (write) Java content trees back into XML instance documents
  • Generate XML schemas from annotated Java classes

xml schema -> schema compiler -> jaxb annotated class (+ Object factory) -> schema generator -> xml schema

  • The schema generator does not require annotations on the Java classes, but annotations let you alter the default behaviour and customise the generated schema.
  • xjc is the JAXB schema compiler shipped with the JDK (Java SE 6 through 10; JAXB was removed from the JDK in Java 11).
  • schemagen is the JAXB schema generator shipped with the same JDK releases.

Extensible Stylesheet Language (XSL)

XSL has three major subcomponents:

XSL-FO: The Formatting Objects standard. This standard gives mechanisms for describing font sizes, page layouts, and other aspects of object rendering. This subcomponent is not covered by JAXP.

XSLT: The transformation language, which lets you define a transformation from XML into some other format (HTML, XML, text, etc). JAXP includes an interpreting implementation of XSLT.

XPath: XSLT is a language that lets you specify what to do when a particular element is encountered. But to write a program for different parts of an XML data structure, you need to specify the part of the structure you are talking about at any given time. XPath is that specification language. It is an addressing mechanism that lets you specify a path to an element so that, for example, <article><title> can be distinguished from <person><title>. In that way, you can describe different kinds of translations for the different <title> elements.
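
As a rough illustration of the <article><title> versus <person><title> distinction, the sketch below evaluates two XPath expressions and runs a two-template stylesheet through the JAXP transformation API. The sample document, the stylesheet, and the names XslSketch, xpath and transform are all invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical class name; XML and XSL are made-up sample data.
class XslSketch {
    static final String XML =
        "<doc><article><title>XML Basics</title></article>"
      + "<person><title>Dr</title></person></doc>";

    // One template per kind of <title>, selected by an XPath match pattern.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='article/title'>Article: <xsl:value-of select='.'/>;</xsl:template>"
      + "<xsl:template match='person/title'>Honorific: <xsl:value-of select='.'/>;</xsl:template>"
      + "</xsl:stylesheet>";

    // Evaluates an XPath expression against the sample document and returns its string value.
    static String xpath(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Applies the stylesheet to the sample document and returns the text output.
    static String transform() {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(xpath("/doc/article/title")); // XML Basics
        System.out.println(xpath("/doc/person/title"));  // Dr
        System.out.println(transform());
    }
}
```

The same element name is handled differently because each template's match pattern includes the parent element.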


