Super Compressor  Super Compressor

go to home page Student Projects full screen, hide local find menu Google search web for more information on this topic jump to foot of page translate this page with Babelfish by Roedy Green ©1996-2008 Canadian Mind Products
This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. It does not describe an actual complete program. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; I have too many other projects of my own.

I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project any way you please.

Ordinary ZIP (Lempel, Ziv Welch) compression depends on finding repetition and, roughly speaking, creating abbreviations for common words. This approach does not do well on small files. Here are some ideas for super compression.
  1. Imagine you had a dictionary of 64,000 words. You numbered them 1 to 64,000. Before you compress a text file, you look up each word and replace the word and its following space by a 16-bit binary number — the word’s index in the dictionary. You create specialized dictionaries for compressing/decompressing different kinds of files. You assign the most frequently used words the lowest index numbers. This is a pre-process to ordinary compression.
  2. Imagine the state of a compression engine just after it had finished compressing the entire JDK 1.4 Java source code. What if you preloaded that state just before compressing any Java source code file. You would then have a set of common Java abbreviations all ready to go. There would be no penalty for compressing a short file. You need to create loadable states for various types of files and distribute them along with the compression/decompression engine. java.util.zip.Deflater.setDictionary gives you just that ability.

    Compression is even tighter if you use a dictionary compiled from sampled words of a single author. There is research on how text correction can be done with a dictionary that knows the relative probabilities of various words occuring.

  3. Imagine you had fixed length records to compress. If you transposed the fields and records so that all the names, then all the addresses, then all the postal codes, then all the phone numbers… came in order, the compression engine could do much better because the repetition is clearer. Further you might recognize certain field types such as YYYY/MM/DD and replace them with 16-bit days since 1970 or some such.
  4. Write a compressor that uses information it gleaned from compressing other files in the archive. This gets tricky. If you delete a item, you may have to adjust the remaining entries.
  5. You can use variable length unsigned tokens. For example, token numbers 0..127 are 1 byte long. They represent your 128 most common words/punctuation etc. Token numbers 128..32768 are 2 bytes long. They represent your next 32K of most common words. Token numbers 32768..16,777,215 are three bytes long. They represent what’s left, possibly including common phrases as well. You might experiment with reserving some of the prime 1 or 2 byte tokens for encoding unique words not in the standard dictionary.
  6. What we are after here are ways to compress short messages that use the same vocabulary. We don’t need to send the dictionary along with each message. Everyone gets a copy of the needed dictionaries when they install the software. You will have to send a short dictionary for any novel words not in the official dictionary however.
Delta Creator
HTML Compactor
voice compression

CMP homejump to top
CMP logo
feedback Please email your feedback for publication, errors, omissions, broken/redirected link reports
and suggestions to improve this page to Roedy Green : feedback email
made with CSS
HTML Checked!
ICRA ratings logo
mindprod.com IP:[65.110.21.43]
Your face IP:[38.103.63.18] Spread the Net
You are visitor number 5,874.
You can get a fresh copy of this page from: or possibly from your local J: drive (Java virtual drive/mindprod.com website mirror)
http://mindprod.com/project/supercompressor.html J:\mindprod\project\supercompressor.html