This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful to implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the
define the end point, or a series of ever more difficult versions of this project and research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
Ordinary ZIP (Lempel, Ziv Welch) compression depends on
finding repetition and, roughly speaking, creating abbreviations for common words. This
approach does not do well on small files. Here are some ideas for super compression.
- Imagine you had a dictionary of 64,000 words. You
numbered them 1 to 64,000. Before you compress a text file, you look up each word and
replace the word and its following space by a 16-bit binary
number — the word’s index in the dictionary. You create specialized
dictionaries for compressing/decompressing different kinds of files. You assign the
most frequently used words the lowest index numbers. This is a pre-process to ordinary
- Imagine the state of a compression engine just after it had finished compressing
the entire Java version 1.4 Java source code. What if you preloaded
that state just before compressing any Java source code file. You would then have a set
of common Java abbreviations all ready to go. There would be no penalty for compressing
a short file. You need to create loadable states for various types of files and
distribute them along with the compression/decompression engine. java.util.zip.Deflater.setDictionary gives you just that ability.
Compression is even tighter if you use a dictionary compiled from sampled words of
a single author. There is research on how text correction can be done with a
dictionary that knows the relative probabilities of various words occuring.
- Imagine you had fixed length records to compress. If you transposed the fields and
records so that all the names, then all the addresses, then all the postal codes, then
all the phone numbers… came in order, the compression engine could do much
better because the repetition is clearer. Further, you might recognize certain field
types such as YYYY/MM/DD and replace them with
16-bit days since 1970 or some such.
- Write a compressor that uses information it gleaned from compressing other files in
the archive. This gets tricky. If you delete an item, you may have to adjust the
- You can use variable length unsigned tokens. For example, token numbers 0..127 are
1 byte long. They represent your 128 most common words/punctuation etc. Token numbers
128..32768 are 2 bytes long. They represent your next 32K of
most common words. Token numbers 32768..16,777,215 are three bytes long. They represent
what’s left, possibly including common phrases as well. You might experiment with
reserving some of the prime 1 or 2 byte tokens for encoding unique words not in the
- What we are after here are ways to compress short messages that use the same
vocabulary. We don’t need to send the dictionary along with each message.
Everyone gets a copy of the needed dictionaries when they install the software. You
will have to send a short dictionary for any novel words not in the official dictionary