I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
To application programs that use tools like SAX, these compact, binary XML-on-a-diet files would look identical to bloated XML. Very few programs deal with XML directly, so they would not notice the difference, apart from faster, more efficient processing and cleaner, more consistent data.
The basic idea is to convert the tags to 8- or 16-bit binary tokens, and the strings to counted UTF-8 strings. Short strings (1 to 255 chars) are preceded by a 1-byte length count. Long strings are preceded by a four-byte big-endian length count. You can then rapidly chase through the file, jumping over strings you don’t want and indexing the 16-bit tokens into a table to decide what to do with each tag. You reserve a couple of 8-bit tokens to introduce strings. There are no quoted characters. Counted strings don’t need to be parsed to be processed or bypassed.
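Here is a minimal sketch of counted strings, not a finished design. The token values 0xFE and 0xFF and the function names are invented for illustration, and it assumes the length counts are in UTF-8 bytes rather than characters, since that is what lets you jump over a string with simple arithmetic.

```python
import struct

# Hypothetical token values; the scheme only says a couple of 8-bit
# tokens are reserved for introducing strings.
SHORT_STRING = 0xFE  # followed by a 1-byte length
LONG_STRING = 0xFF   # followed by a 4-byte big-endian length


def encode_string(text):
    """Encode text as token, byte count, then the raw UTF-8 bytes."""
    data = text.encode("utf-8")
    if 1 <= len(data) <= 255:
        return bytes([SHORT_STRING, len(data)]) + data
    # Empty and long strings both use the 4-byte count in this sketch.
    return bytes([LONG_STRING]) + struct.pack(">I", len(data)) + data


def _payload_bounds(buf, pos):
    """Return (start, end) offsets of the string payload starting at pos."""
    token = buf[pos]
    if token == SHORT_STRING:
        start = pos + 2
        return start, start + buf[pos + 1]
    if token == LONG_STRING:
        (length,) = struct.unpack_from(">I", buf, pos + 1)
        start = pos + 5
        return start, start + length
    raise ValueError("not a string token: 0x%02X" % token)


def read_string(buf, pos):
    """Decode the string at pos; return it and the offset just past it."""
    start, end = _payload_bounds(buf, pos)
    return buf[start:end].decode("utf-8"), end


def skip_string(buf, pos):
    """Jump over the string at pos without examining its contents."""
    return _payload_bounds(buf, pos)[1]
```

Note that skip_string never touches the payload bytes. That is the whole point of counting instead of quoting.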
XML has other features besides simple begin/end tags that you will also have to encode. You might encode the offset to the corresponding virtual end tag for even faster processing. Numeric strings can be encoded instead as binary ints or IEEE floating-point values. Dates can be encoded as a binary int count of days since 1970.
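As a rough sketch of the numeric and date encodings, assuming 4-byte big-endian ints, 8-byte IEEE doubles, and dates that arrive as ISO yyyy-mm-dd text (none of which the scheme dictates), you might write something like this. The function names are invented.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def encode_int(text):
    """Turn a numeric string into a 4-byte big-endian signed int."""
    return struct.pack(">i", int(text))


def encode_double(text):
    """Turn a numeric string into an 8-byte big-endian IEEE double."""
    return struct.pack(">d", float(text))


def encode_date(iso_text):
    """Turn 'yyyy-mm-dd' into a 4-byte count of days since 1970-01-01."""
    days = (date.fromisoformat(iso_text) - EPOCH).days
    return struct.pack(">i", days)


def decode_date(data):
    """Turn a 4-byte day count back into 'yyyy-mm-dd'."""
    (days,) = struct.unpack(">i", data)
    return (EPOCH + timedelta(days=days)).isoformat()
```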
Since tags are constrained to appear only inside certain other tags, you could usually get away with 8-bit tokens to represent tags. The encoding for a tag is specific to the context. You don’t need to assign a unique number to every possible tag or attribute.
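To illustrate context-specific tokens, here is a small sketch. The order/customer/item vocabulary is made up, and the cap of 254 children per parent is my assumption, chosen to leave two byte values free for introducing strings.

```python
# Hypothetical schema: for each parent tag, the child tags allowed inside it.
ALLOWED_CHILDREN = {
    "order": ["customer", "item", "note"],
    "customer": ["name", "address"],
    "item": ["sku", "quantity", "price"],
}

MAX_TOKENS_PER_CONTEXT = 254  # assumption: two byte values kept for strings


def build_token_tables(allowed):
    """Number each parent's children from 0, independently of other parents."""
    encode = {}
    decode = {}
    for parent, children in allowed.items():
        if len(children) > MAX_TOKENS_PER_CONTEXT:
            raise ValueError("too many children for 8-bit tokens: " + parent)
        encode[parent] = {tag: i for i, tag in enumerate(children)}
        decode[parent] = list(children)
    return encode, decode


ENCODE, DECODE = build_token_tables(ALLOWED_CHILDREN)
```

Token 0 means customer inside order but name inside customer. The reader always knows which table to index because it knows which tag it is currently inside.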
When you are done, you could even run the result through GZIP compression.
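The final GZIP pass is a one-liner with a standard library. In this sketch the bytes being compressed are a made-up stand-in for a compacted file, not a real encoding.

```python
import gzip

# Made-up stand-in for a compacted file: the same small record repeated.
compact = (bytes([0x01, 0xFE, 5]) + b"hello") * 200

packed = gzip.compress(compact)      # what you would store or transmit
restored = gzip.decompress(packed)   # what the reader sees before decoding
```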
Your binary version of the DTD would contain the same information as the usual one, with additional information about which tokens you assigned to which tags. You could assign a token for a tag + int, tag + short string, or tag + long string, high/low bounds on a field, fixed length restrictions or character set restrictions. The binary descriptor for an xml-on-a-diet file could be automatically generated from a traditional XML schema, such as Schematron, RELAX NG or a DTD. The binary scheme might be optimised by letting the optimiser study a number of real world XML files.
You could use a different token to encode the same field as 1, 2, 3, 4 etc. byte integers, signed or unsigned.
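One way to picture this is an encoder that picks the narrowest integer that fits and announces its choice through the token. The token range 0x10 to 0x17 and its bit layout are invented for this sketch.

```python
def token_for(width, signed):
    """Hypothetical layout: 0x10 to 0x17, low bit = signed, next bits = width - 1."""
    return 0x10 | ((width - 1) << 1) | int(signed)


def encode_int(value):
    """Encode value in the fewest bytes (1 to 4), big-endian, after its token."""
    signed = value < 0
    for width in (1, 2, 3, 4):
        try:
            body = value.to_bytes(width, "big", signed=signed)
        except OverflowError:
            continue  # does not fit in this width, try the next one
        return bytes([token_for(width, signed)]) + body
    raise ValueError("value does not fit in 4 bytes: %d" % value)


def decode_int(buf, pos=0):
    """Decode the integer at pos; return it and the offset just past it."""
    token = buf[pos]
    if not 0x10 <= token <= 0x17:
        raise ValueError("not an integer token: 0x%02X" % token)
    signed = bool(token & 1)
    width = ((token - 0x10) >> 1) + 1
    end = pos + 1 + width
    return int.from_bytes(buf[pos + 1:end], "big", signed=signed), end
```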
You could use one token to encode a fixed-length uncounted string field, e.g. a postal code in Canada that is always six characters long, and a different token to encode the same field when its contents are a variable-length string.
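A sketch of the postal code case might look like this. The token values 0x20 and 0x21 and the function names are invented, and it assumes the fixed form holds exactly six bytes with no embedded space.

```python
# Hypothetical tokens for one field in two encodings.
FIXED_POSTAL = 0x20    # exactly 6 bytes follow, no length count
COUNTED_POSTAL = 0x21  # a 1-byte length count follows, then the bytes
POSTAL_LENGTH = 6


def encode_postal(code):
    """Use the uncounted form when the code is exactly 6 bytes long."""
    data = code.encode("utf-8")
    if len(data) == POSTAL_LENGTH:
        return bytes([FIXED_POSTAL]) + data
    if len(data) > 255:
        raise ValueError("postal code too long for a 1-byte count")
    return bytes([COUNTED_POSTAL, len(data)]) + data


def decode_postal(buf, pos=0):
    """Decode either form; return the code and the offset just past it."""
    token = buf[pos]
    if token == FIXED_POSTAL:
        start, length = pos + 1, POSTAL_LENGTH
    elif token == COUNTED_POSTAL:
        start, length = pos + 2, buf[pos + 1]
    else:
        raise ValueError("not a postal code token: 0x%02X" % token)
    end = start + length
    return buf[start:end].decode("utf-8"), end
```

A Canadian code costs seven bytes in this sketch, while a five-digit American ZIP code falls back to the counted form and also costs seven.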
Of course you would mark your files with a format version number so that the binary format could evolve without concerning anyone except the people who write DOM drivers, much the way the PKZIP format has evolved.
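For the version marking, a file header could be as simple as a signature followed by a version number. The signature bytes XDIET, the 2-byte big-endian version field and the set of supported versions are all invented for this sketch.

```python
import struct

MAGIC = b"XDIET"            # hypothetical file signature
SUPPORTED_VERSIONS = {1, 2}  # versions this imaginary reader understands


def write_header(version):
    """Signature followed by a 2-byte big-endian format version."""
    return MAGIC + struct.pack(">H", version)


def read_header(buf):
    """Check signature and version; return the version and the body offset."""
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("not an XML-on-a-diet file")
    (version,) = struct.unpack_from(">H", buf, len(MAGIC))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported format version %d" % version)
    return version, len(MAGIC) + 2
```

Only the driver ever looks at the version. Applications above the SAX or DOM layer need never know the format changed.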
Send suggestions to improve this page to Roedy Green, Canadian Mind Products, mindprod.com.
You can get a fresh copy of this page from: http://mindprod.com/project/xmlcompactor.html