XML Compactor XML Compactor
home Student Projects no local find frame, full screen Google search web for topic jump to footer translate with Babelfish by Roedy Green ©1996-2008 Canadian Mind Products
This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. It does not describe an actual complete program. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; I have too many other projects of my own.

I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project any way you please.

This project is similar to the HTML Compactor. XML is catching on like wildfire. Unfortunately is it extremely fluffy. This means in takes many times longer to transmit than it need do. XML files require considerably more disk space to store, RAM to read and CPU time to process. You pay a big penalty for the supposed simplicity of XML. Further, XML requires a major production with considerable futzing about to parse it, deal with entities like " etc. This is not a suitable format for handhelds or cell-phones where every byte is sacred. What you want is a computer-friendly compact XML variant that can be converted to and from regular XML, but that can can be used directly with much less parsing overhead.

These compact, binary XML-on-a-diet files would look to application programs that use tools like SAX as identical to bloated XML. Very few programs deal with XML directly. They would not notice the difference, just faster more efficient processing, and cleaner more consistent data.

The basic idea is to convert the tags to 8/16 bit binary tokens, and the strings to counted UTF-8 strings. Short strings (1 to 255 chars) are preceded by a 1-byte length count. Long strings are preceded by a four byte big-endian length count. You can then rapidly chase through the file jumping over strings you don’t want, and indexing the 16 bit tokens into a table to decide what to do with that tag. You reserve a couple of 8-bit tokens for use in introducing strings. There are no quoted characters. Counted strings don’t need to be parsed to be processed or bypassed.

XML has other features besides simple begin/end tags that you will also have to encode. You might encode the offset to the corresponding virtual end tag for even faster processing. Numeric strings can be encoded instead as binary ints or IEEE. Dates can be encoded as binary int days since 1970.

Since tags are constrained only to appear inside certain other tags, you could usually get away with 8-bit tokens to represent tags. The encoding for a tag is specific to the context. You don’t need to assign a unique number to every possible tag or attribute.

When you are done, you could even run it through a GZIP compression.

Your binary version of the DTD would contain the same information as the usual one, with additional information about which tokens you assigned to which tags. You could assign a token for a tag + int, tag + short string, or tag + long string, high/low bounds on a field, fixed length restrictions or character set restrictions. The binary descriptor for an xml-on-a-diet file could be automatically generated from a traditional XML schema, such as Schematron, RELAX NG or a DTD. The binary scheme might be optimised by letting the optimiser study a number of real world XML files.

You could use a different token to encode the same field as 1, 2, 3, 4 etc. byte integers, signed or unsigned.

You could use a different token to encode a fix-length uncounted string field, e.g. a postal code in Canada that is always 6 long, but allow the same field with a different token to encode it for variable length string contents.

Of course you would mark your files with a format version number so that the binary format could evolve without concerning anyone except people to who write DOM drivers, much the way PKZIP format has evolved.

RELAX NG
Schematron
XML
XSD

CMP_homejump to top
CMP logo
feedback Please email your feedback for publication, errors, omissions, broken/redirected link reports
and suggestions to improve this page to Roedy Green : feedback email
made with CSS
HTML Checked!
ICRA ratings logo
mindprod.com IP:[65.110.21.43]
Your face IP:[38.103.63.18] Greenpeace energy [r]evolution
You are visitor number 3,765.
You can get a fresh copy of this page from: or possibly from your local J: drive (Java virtual drive/Mindprod website mirror)
http://mindprod.com/project/xmlcompactor.html J:\mindprod\project\xmlcompactor.html