image provider

XML Compactor


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

This project is similar to the HTML Compactor. XML (extensible Markup Language) is catching on like wildfire. Unfortunately, it is extremely fluffy. This means in takes many times longer to transmit than it need do. XML files require considerably more disk space to store, RAM (Random Access Memory) to read and CPU (Central Processing Unit) time to process. You pay a big penalty for the supposed simplicity of XML. Further, XML requires a major production with considerable futzing about to parse it, deal with entities like " etc. This is not a suitable format for handhelds or cellphones where every byte is sacred. What you want is a computer-friendly compact XML variant that can be converted to and from regular XML, but that can can be used directly with much less parsing overhead.

These compact, binary XML-on-a-diet files would look to application programs that use tools like SAX as identical to bloated XML. Very few programs deal with XML directly. They would not notice the difference, just faster more efficient processing and cleaner more consistent data.

The basic idea is to convert the tags to 8/16-bit binary tokens and the strings to counted UTF-8 strings. Short strings (1 to 255 chars) are preceded by a 1-byte length count. Long strings are preceded by a four byte big-endian length count. You can then rapidly chase through the file jumping over strings you don’t want and indexing the 16-bit tokens into a table to decide what to do with that tag. You reserve a couple of 8-bit tokens for use in introducing strings. There are no quoted characters. Counted strings don’t need to be parsed to be processed or bypassed.

XML has other features besides simple begin/end tags that you will also have to encode. You might encode the offset to the corresponding virtual end tag for even faster processing. Numeric strings can be encoded instead as binary ints or IEEE (Institute of Electrical & Electronics Engineers). Dates can be encoded as binary int days since 1970.

Since tags are constrained only to appear inside certain other tags, you could usually get away with 8-bit tokens to represent tags. The encoding for a tag is specific to the context. You don’t need to assign a unique number to every possible tag or attribute.

When you are done, you could even run it through a GZIP compression.

Your binary version of the DTD (Document Type Definition) would contain the same information as the usual one, with additional information about which tokens you assigned to which tags. You could assign a token for a tag + int, tag + short string, or tag + long string, high/low bounds on a field, fixed length restrictions or character set restrictions. The binary descriptor for an xml-on-a-diet file could be automatically generated from a traditional XML schema, such as Schematron, RELAX NG or a DTD. The binary scheme might be optimised by letting the optimiser study a number of real world XML files.

You could use a different token to encode the same field as 1, 2, 3, 4 etc. byte integers, signed or unsigned.

You could use a different token to encode a fix-length uncounted string field, e.g. a postal code in Canada that is always 6 long, but allow the same field with a different token to encode it for variable length string contents.

Of course, you would mark your files with a format version number so that the binary format could evolve without concerning anyone except people to who write DOM (Document Object Model) drivers, much the way PKZIP format has evolved.

RELAX NG
Schematron
XML
XSD

This page is posted
on the web at:

http://mindprod.com/project/xmlcompactor.html

Optional Replicator mirror
of mindprod.com
on local hard disk J:

J:\mindprod\project\xmlcompactor.html
Canadian Mind Products
Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy. Please feel free to link to this page without explicit permission.

IP:[65.110.21.43]
Your face IP:[3.91.19.28]
You are visitor number