XML : Java Glossary


XML
The primary function of XML (Extensible Markup Language) is to consume RAM (Random Access Memory) and data communication bandwidth. Presumably it was promoted to its current frenzy by companies who sell either RAM or bandwidth. Others promoting it have patents they hope to spring on the public once it is entrenched. XML is the biggest con game going in computers. As you have probably guessed, I am known for my rabid dislike of XML.
The Basics
Naming
Encoding
Schemas
Validation
Parsing
XML Benefits
XML Drawbacks
What Should Replace XML?
DTD
Schema
Awkward Characters and Entity References
Quoting
Writing
XML Serialization
Digitally Signing XML
Tools
Books
Learning More
Links

The Basics

XML is a W3C (World Wide Web Consortium) Recommendation. Like HTML (Hypertext Markup Language), XML is based on SGML (Standard Generalised Markup Language), an International Standard (ISO (International Standards Organisation) 8879) for creating markup languages. However, while HTML is a single SGML document type, with a fixed set of element type names (aka tag names), XML is a simplified profile of SGML: you can use it to define many different document types, each of which uses its own element type names (instead of HTML’s html, body, h1, ol, etc.).

In XML, fields that there can be only zero or one of are usually specified as attributes, e.g. unit="box". Fields that there can be many of are enclosed in tags called elements, e.g. <item>…</item>. Just like HTML, comments begin with <!-- and end with -->. You can abbreviate <mytag myattrib="something"></mytag> as <mytag myattrib="something" />.

XML was designed to make it easy to write a parser. I think this was an unfortunate decision. Only a handful of people in the world will ever write an XML parser, but hundreds of thousands have to compose XML. They should have designed it to be easy and terse to write. For example, its mandatory quotes around each attribute value are there solely for the convenience of the parser writer. The tag names in closing tags such as </mytag> are redundant and should be optional. They are not needed at all in XML designed solely for machine consumption. Even in human-read XML, they add nothing on the innermost nest on a single line.


Naming

Pretty well any letter is legal in an element or attribute name. You can use upper or lower case, accented letters and digits, but the only punctuation allowed is _ - . and :. _ is good for separating words. You may not use a space. It is considered poor style to use -, . and : (the colon is reserved for namespaces). A name cannot start with a digit, - or ., and names starting with the letters xml (in any case) are reserved. Names are case-sensitive.


Encoding

UTF-8 is the default encoding, but unfortunately the encoding could be any ruddy encoding ever invented. Using other encodings destroys XML as an interchange format. Don’t do it!

<?xml version="1.0" encoding="UTF-8" ?>
<!-- explicit encoding specification -->
<!-- The space before the ?> is optional -->
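
To see why the declaration matters, here is a minimal sketch (the class and method names are made up for illustration). It encodes a tiny document with one charset, declares a possibly different one, and lets the JDK's DOM parser decode the bytes. When the declared and actual encodings disagree, the parser either rejects the file or silently produces the wrong characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingSketch {
    /**
     * Builds a document that DECLARES one encoding but is written to bytes
     * in another, then parses it. Returns the text of the root element,
     * or null if the parser rejects the byte stream.
     */
    static String read(Charset declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared.name() + "\"?>"
                + "<word>caf\u00e9</word>"; // \u00e9 is e-acute
        byte[] bytes = xml.getBytes(actual);
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            return null; // e.g. a malformed UTF-8 byte sequence
        }
    }

    public static void main(String[] args) {
        // Declaration and bytes agree: the text survives.
        System.out.println(read(StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        System.out.println(read(StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
        // Declared UTF-8 but written as ISO-8859-1: the lone 0xE9 byte is not valid UTF-8.
        System.out.println(read(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```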


Schemas

You describe your little XML subgrammar by writing a DTD (Document Type Definition) file. Optionally, you can include the DTD inline inside your XML file. There are other, more elaborate schema languages, including RELAX NG, Schematron and XSD (XML Schema Definition). I like XSD the best.
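
As a rough illustration of what an XSD buys you, the sketch below checks a document against a tiny schema using the JDK's javax.xml.validation classes. The schema itself (a square element with a required integer width attribute) and the class name are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdValidationSketch {
    // Hypothetical schema: <square> must carry an integer width attribute.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"square\">"
        + "<xs:complexType>"
        + "<xs:attribute name=\"width\" type=\"xs:int\" use=\"required\"/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if the document conforms to the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validate() throws on the first error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<square width=\"100\"/>"));  // conforms
        System.out.println(isValid("<square width=\"wide\"/>")); // width is not an int
        System.out.println(isValid("<square/>"));                // width is missing
    }
}
```

Unlike a DTD, the schema can insist that width is a number, not just any string.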


Validation

Each schema language has its corresponding technique for validating that an XML file conforms to it. If you use a DTD, a validating parser does the check as it reads the file.
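
A minimal sketch of DTD validation with the JDK's DOM parser follows. The inline DTD (a list of item elements with an optional unit attribute) and the class name are assumptions made up for the example. The one non-obvious step is installing an ErrorHandler that rethrows, since validity errors are otherwise not thrown as exceptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Hypothetical inline DTD: a <list> holds one or more <item>s.
    static final String DTD =
          "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item+)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "  <!ATTLIST list unit CDATA #IMPLIED>\n"
        + "]>\n";

    /** Returns true if the document is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Make validity errors fail the parse instead of being merely reported.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<list unit=\"box\"><item>nails</item></list>"));
        System.out.println(isValid(DTD + "<list><bogus/></list>")); // undeclared element
    }
}
```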


Parsing

There are two popular parsing techniques: SAX (Simple API for XML), which hands you each field as it parses, and the W3C DOM (Document Object Model), which creates a complete parse tree you can prune and repeatedly scan.
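
The difference is easiest to see side by side. This sketch pulls the same item values out of a small made-up document twice, once with SAX callbacks and once by walking a DOM tree. The document and class name are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParsingSketch {
    static final String XML =
        "<list unit=\"box\"><item>nails</item><item>screws</item></list>";

    /** SAX: the parser calls you back for each piece; you keep only what you want. */
    static List<String> itemsViaSax() {
        final List<String> items = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0); // start collecting text afresh
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive in several chunks
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("item")) {
                    items.add(text.toString());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(XML)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return items;
    }

    /** DOM: the whole tree is built in RAM first, then you scan it at leisure. */
    static List<String> itemsViaDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = doc.getElementsByTagName("item");
            List<String> items = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                items.add(nodes.item(i).getTextContent());
            }
            return items;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(itemsViaSax());
        System.out.println(itemsViaDom());
    }
}
```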

I personally detest XML. However, it has caught on like a cocaine wave, so it must have some redeeming features.

XML Benefits

XML Drawbacks

Using XML to transmit data is the analog of insisting that all code be passed around as triple spaced Java source files, with added dummy comments, rather than as binary byte code. There is no guarantee a source file is even syntactically correct. It is impossible to create a syntactically incorrect byte code file. Byte code files can be processed without time-consuming parsing. In byte code, repeating strings are naturally specified only once. XML, as it stands, suffers from all those analogous drawbacks and more.

What Should Replace XML?

The characteristics include:

One possible candidate for the XML replacement job is the Java serialized object format. It can handle just about any data structure imaginable. It is platform independent. It has a simple DTD — Java source code for the corresponding class. Some claim it is Java-only. Not so. It is no more difficult for C++ to parse than any other similar newly concocted protocol. It is not tied to any hardware or OS (Operating System). It is just that Java has a head start implementing it. Java can implement it with no extra overhead.

There have been dozens of efforts to patch up the shortcomings of XML. XML is no longer simple. It is a raggedy patchwork quilt. People were sucked in by the initial simplicity, then discovered that it was not really all that useful in its simple form. Schema was added to allow specifying types (but still only permitting strings). Yes, we need a standard interchange format, but XML was only a back-of-the-envelope stab at it. XML was destined to fail since it totally ignored so many factors in coming up with a good design.

One such effort is VTD (Virtual Token Descriptor). A VTD record is a 64-bit integer that encodes the starting offset, length, type and nesting depth of a token in an XML document. Because VTD records don’t contain data fields, they work alongside of the original XML document, which is maintained intact in memory by the processing model.
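
To make the idea concrete, here is a sketch of packing those four fields into one 64-bit long. The field widths (32-bit offset, 20-bit length, 8-bit depth, 4-bit type) are an assumption chosen only so that they total 64 bits. They are not VTD-XML's actual layout, and the sketch indexes a Java String where a real implementation would index the document's bytes.

```java
class TokenRecordSketch {
    // Hypothetical layout, low bits first:
    // offset = bits 0..31, length = bits 32..51, depth = bits 52..59, type = bits 60..63.

    /** Packs the four fields into a single long. */
    static long pack(long offset, int length, int depth, int type) {
        return (offset & 0xFFFFFFFFL)
             | ((long) (length & 0xFFFFF) << 32)
             | ((long) (depth & 0xFF) << 52)
             | ((long) (type & 0xF) << 60);
    }

    static long offset(long record) { return record & 0xFFFFFFFFL; }
    static int length(long record)  { return (int) ((record >>> 32) & 0xFFFFF); }
    static int depth(long record)   { return (int) ((record >>> 52) & 0xFF); }
    static int type(long record)    { return (int) ((record >>> 60) & 0xF); }

    /** The record holds no text of its own; the token is a slice of the original document. */
    static String token(String document, long record) {
        int start = (int) offset(record);
        return document.substring(start, start + length(record));
    }

    public static void main(String[] args) {
        String doc = "<item>nails</item>";
        // "nails" starts at index 6 and is 5 characters long; depth and type values are arbitrary.
        long record = pack(6, 5, 1, 2);
        System.out.println(token(doc, record));
    }
}
```

The point of the design is that the records are just numbers, so the original document must stay in memory for them to mean anything.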

Due to the stupidity, duplicity and/or greed of those promoting XML, we will likely be stuck with some committee-patched variant of it forever — something that will make even HTML look clean. We need a common data interchange format, but not so inept.


DTD

You need to compose a DTD file that describes the format of the XML file. The <!ELEMENT statement is used to list the various tags you will use, which tags may be used inside which tags, how often and in which order. The <!ATTLIST statement is used to list the various attributes (mandatory and optional) of each tag. The <!ENTITY statement lets you make up your own abbreviations.

Here is a simple example:


<!ATTLIST square width CDATA "0">
The CDATA means the value of the attribute is a string. The "0" is the default value used when the attribute is omitted. The default must be quoted.


<square width="100"></square>
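
The following sketch shows the default in action with the JDK's DOM parser. The <!ELEMENT square EMPTY> declaration and the class name are assumptions added so the inline DTD is complete. When the width attribute is left out, the parser supplies the value declared in the ATTLIST.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DtdDefaultSketch {
    /** Parses one square tag under an inline DTD and returns its width attribute. */
    static String widthOf(String squareTag) {
        String xml =
              "<!DOCTYPE square [\n"
            + "  <!ELEMENT square EMPTY>\n"            // assumed for this example
            + "  <!ATTLIST square width CDATA \"0\">\n" // default width is "0"
            + "]>\n"
            + squareTag;
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("width");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(widthOf("<square width=\"100\"></square>")); // explicit value
        System.out.println(widthOf("<square/>"));                       // default from the DTD
    }
}
```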


Schema

A schema is a document that describes what constitutes a legitimate XML document. It might be very generic, describing all XML documents, or some particular class of XML documents, say ones describing an invoice for the XYZ company. The original XML schema was called DTD, inherited from SGML. It was clumsy and did not allow very tight specification. It basically just let you specify the names of the tags and attributes. Since then there have been several other flavours of schema: RELAX NG, Schematron and a new one from W3C called XML Schema. DTDs look nothing like XML itself. XML Schema is itself a flavour of XML. XML Schema is a major advance over DTD. It is described in three documents: Primer, Structures and Data Types. It can define datatypes, ranges, enumerations, dates and complex datatypes to much more rigidly specify what constitutes a valid XML file.

Handling Awkward Characters, XML Entities

XML has a similar problem to HTML with reserved characters. What if < incidentally appears in your data? It would look like the beginning of some </end> tag. There are only two truly awkward characters, namely < and &, and you deal with them the same way you do in HTML, by encoding them as entity references, namely &lt; and &amp;. (They are not called entities in XML since that term is already taken to mean a group of data.)

In English, entity means a thing with a distinct independent existence. It is as meaningful as thingamajig. Had it been my choice, I would have called them stand-ins, locums or deputies.

HTML has scores of entities whereas XML has only five:

< ( &lt; ), & ( &amp; ), > ( &gt; ), " ( &quot; ), ' ( &apos; ).
All of the entity references are optional except for &lt; and &amp;.

But what about awkward non-ASCII characters such as é and Ω? There are six ways around the restriction that XML does not support the full set of HTML character entity references.

  1. If you use UTF-8 encoding, you can use any Unicode characters plain without entification.
  2. If you use an 8-bit encoding such as ISO-8859-1, you can stick to just 256 characters defined in that encoding.
  3. You could use decimal NCEs (Numeric Character Entities), e.g. &#8364; for the euro sign. Values of numeric character references are interpreted as Unicode characters, no matter what encoding you use for your document. To be perverse, you could even use decimal numeric entity references for the basic entity references, i.e.
    < ( &#60; ), & ( &#38; ), > ( &#62; ), " ( &#34; ), ' ( &#39; ).
  4. You could write a DTD to create the additional alphabetic character entity references you need, e.g. &euro;
  5. You could use hexadecimal NCEs, e.g. &#xa9; for ©. Again the values of numeric character references are interpreted as Unicode characters, no matter what encoding you use for your document. XML parsers are required to understand both the decimal and hexadecimal forms.
  6. If you take a depraved pleasure in deformity, you could use the CDATA sandwich. Place pretty well whatever data you want, including raw (un-entified) <, > and &, within a bizarre sandwich of characters, namely: <![CDATA[ … ]]>

    e.g. <caption><![CDATA[Rah! <><><> Rah! & all that.]]> </caption>
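
As a quick check that these spellings really are interchangeable, the sketch below feeds several of them to the JDK's DOM parser and returns the decoded text. The class and method names are made up for illustration, and the inline &euro; declaration is just one possible way to write technique 4.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class AwkwardCharsSketch {
    /** Parses a small document and returns the decoded text of its root element. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The basic entity references and the CDATA sandwich decode to the same text.
        System.out.println(textOf("<c>1 &lt; 2 &amp; 3</c>"));
        System.out.println(textOf("<c><![CDATA[1 < 2 & 3]]></c>"));
        // Decimal, hexadecimal and DTD-defined references all yield the euro sign, U+20AC.
        System.out.println(textOf("<c>&#8364;</c>").equals("\u20ac"));
        System.out.println(textOf("<c>&#x20ac;</c>").equals("\u20ac"));
        System.out.println(textOf(
            "<!DOCTYPE c [<!ENTITY euro \"&#8364;\">]><c>&euro;</c>").equals("\u20ac"));
    }
}
```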

You need to worry about awkward characters only when you compose or parse XML text yourself. Otherwise, the XML package will transparently handle awkward characters for you, both on writing and reading, so you can forget about them.

UTF-8 files using the basic five character entities, or ISO-8859-1 files using the basic five character entities (possibly excluding &apos;) plus decimal NCEs, are the easiest files to read and compose manually, which is XML’s saving grace.

Nearly all XML documents now use UTF-8 encoding, so the usual way to handle awkward characters is to code them with a UTF-8-aware text editor as ordinary characters. That leaves you with only < > " and & to worry about.


Quoting

You must enclose attribute values in either " or '. If the attribute value itself contains "s, you can enclose it in 's. If the attribute value itself contains 's, you can enclose it in "s. What do you do if a string contains both " and '? You must use the entity &quot; for embedded "s and surround the string in "s, e.g.:
<album title="Sergeant Pepper’s Lonely Hearts Club Band">
<album title='The Wall'>
<album title="Peter’s &quot;Weird Songs&quot;">
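
If you write XML through a library, the quoting is done for you. Here is a minimal sketch using the JDK's StAX XMLStreamWriter (the class and method names are invented for the example). It writes a title containing both kinds of quote, then parses the result back to confirm the value survived. The exact escapes the writer chooses are implementation details, so the sketch checks only the round trip.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;

class QuotingSketch {
    /** Writes an album element, letting the library quote and escape the title. */
    static String albumTag(String title) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("album");
            writer.writeAttribute("title", title); // awkward characters are escaped here
            writer.writeEndElement();
            writer.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the element back and returns the decoded title attribute. */
    static String titleOf(String albumTag) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(albumTag)))
                    .getDocumentElement()
                    .getAttribute("title");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String title = "Peter's \"Weird Songs\" & more";
        String tag = albumTag(title);
        System.out.println(tag);                        // the escaped form
        System.out.println(titleOf(tag).equals(title)); // the original value comes back
    }
}
```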


Writing

There are a number of ways of writing XML.

XML Serialization

There is another form of serialization that produces XML instead of the binary streams of ObjectOutputStream. It uses the java.beans.XMLEncoder class. It does not use the Serializable interface, but writes ordinary Objects that have JavaBean-style getter and setter methods and a no-arg constructor. It does not persist fields, but rather properties (in the Delphi sense, not System.setProperty), implemented with get/set. Basically it looks for all the getXXX methods, calls them and emits a stream of tags named after the properties. To reconstitute, XMLDecoder instantiates an Object of the class and calls the corresponding setXXX methods with the values in the XML stream. The source and target classes need not have matching code the way they do with true serialization. Most trouble using this feature comes from thinking it behaves like ordinary serialization. They have almost nothing in common.
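
Here is a small round-trip sketch of that mechanism. The Album bean and the class name are hypothetical. Note that the bean is public, has a public no-arg constructor, and exposes its state only through get/set pairs, as described above.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class XmlEncoderSketch {
    /** Hypothetical JavaBean: public class, no-arg constructor, get/set pairs. */
    public static class Album {
        private String title;
        private int year;

        public Album() { }

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
    }

    /** Serializes the bean's properties (not its fields) to XML bytes. */
    static byte[] encode(Album album) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(album);
        }
        return out.toByteArray();
    }

    /** Rebuilds the bean by instantiating it and calling its setters. */
    static Album decode(byte[] xml) {
        try (XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(xml), null, null, Album.class.getClassLoader())) {
            return (Album) decoder.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Album album = new Album();
        album.setTitle("The Wall");
        album.setYear(1979);

        byte[] xml = encode(album);
        System.out.println(new String(xml, "UTF-8")); // the generated XML
        Album copy = decode(xml);
        System.out.println(copy.getTitle() + " " + copy.getYear());
    }
}
```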


Tools

There are all kinds of tools for reading and writing XML. I am familiar with only a few of them. Please help me fill out this table.

XML Tool Comparison

Hand-written code
  Advantages:
  • A hand-written parser will run quickly.
  • Writing XML by hand is conceptually simpler and faster than doing it with a tool.
  • Writing XML by hand gives you complete control over layout, headers, encoding etc.
  Disadvantages:
  • Not feasible for any but the simplest files.
  • Hard to maintain.

DOM
  Advantages:
  • You can navigate the tree in any way you please in any order.
  Disadvantages:
  • Will not work for large files since the whole tree must reside in RAM.
  • Slow parsing.

SAX
  Advantages:
  • Fast parsing.
  • You can represent the data with a different structure from the XML structure of the file.
  • Uses only a little RAM.
  Disadvantages:
  • Must process sequentially.

JAXB (Java Architecture for XML Binding)
  Advantages:
  • Very little coding needed to read XML. JAXB generates most of the Java code for you from the schema. You deal with Java primitives and ordinary getters and setters.
  Disadvantages:
  • Complicated to write XML files.

Query-based tools
  Advantages:
  • You can avoid the low level details of navigating and specify a search query instead to find what you want.
  Disadvantages:
  • Slow.


Learning More

Oracle’s Javadoc on the Schema class
Oracle’s Javadoc on the SchemaFactory class
Oracle’s Javadoc on the Validator class
Oracle’s Javadoc on the XMLConstants class
Oracle’s Javadoc on the SAXParser class
Oracle’s Javadoc on the XMLEncoder class
Oracle’s Javadoc on the XMLStreamReader class
Altova XMLSpy
Ant: XML validator
Binary XML: unfortunately still in the it-would-be-a-good-idea stage
Caucho Resin
Digitally Signing XML
Digitally Signing XML documents
DOM 1 spec
DTD attributes
DTD: a language for describing XML file layouts
Elliotte Rusty Harold’s XML online book
Fluffiness of various file formats: student project
HTML entities
IBM’s tutorial
IBM’s XML page
JAXP: Oracle’s XML manipulating classes
JNLP (Java Web Start’s XML configuration language)
JUntotal: a more compact XML alternative
Liquid XML: code generator to read/write XML given schema
Mistakes with XML
NotXMLProposal: SDL streamlined XML proposal
online XML validator
Oracle’s Fast Web Services Project
Reading XML with DOM
Reading XML with SAX
RefleX: (XSLT and XQuery)
RELAX NG: a language for describing XML file layouts
Schematron: an XML description and pattern finding language
Stylus Autogen: figures out a schema from sample XML
Stylus Schema Editor
Stylus Studio
Stylus XML tools
UBDDL (a Yahoo group working to define a more efficient replacement for XML)
VTD-XML: faster, more efficient XML parsing
W3 online XML validator: via URL
W3 XML standard: lawyerly document
W3 XML xinclude standard on includes: lawyerly document
W3Schools XML validator
W3schools: XML tutorials
Wattle XML editor and schema converter
Writing XML with a DOM: by myong
Writing XML with DOM
Writing XML with SAX
XML 1.0 spec
XML Compactor
XML databases
XML inventors
XML Validator tools
xmlfiles.com (has lots of examples and tutorials)
XMLFox: free Windows XSD/XML editor/validator
XMLGlobal has some tutorials and information
XSD: schema to describe XML, friendlier and more specific than DTD

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
