I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
I have written an HTML compactor that implements measures 1 through 7. I will be putting it up for distribution as shareware shortly.
All browsers will handle the compacted format just fine without any special tool to fluff it up first. If you wanted to restore it to the original fluffy form, you could use a tool such as the SlickEdit beautifier, which could restore it to a close approximation of the original. You could use a Funduc Search and Replace macro to insert extra blank lines before whichever tags you chose. To get back exactly where you started, it would be simplest just to save a backup copy prior to compressing, or to save a delta file of the differences (e.g. a list of offsets in the compressed file and what had been deleted at each).
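The delta-file idea can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the class `DeltaRestore`, its nested `Deletion` class and the in-memory list are hypothetical names, the deletions are assumed to be sorted by ascending offset, and nothing here says how the delta would be stored on disk.

```java
import java.util.List;

/** Hypothetical sketch: puts deleted text back into a compacted file. */
final class DeltaRestore {
    /** One deletion: the text that was removed and the offset in the compacted file where it belongs. */
    static final class Deletion {
        final int offset;
        final String text;

        Deletion(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Assumes deletions are sorted by ascending offset into the compacted text. */
    static String restore(String compacted, List<Deletion> deletions) {
        StringBuilder sb = new StringBuilder(compacted);
        // Work backwards so that earlier offsets are not shifted by later insertions.
        for (int i = deletions.size() - 1; i >= 0; i--) {
            Deletion d = deletions.get(i);
            sb.insert(d.offset, d.text);
        }
        return sb.toString();
    }
}
```

Working from the last deletion to the first means every offset can be used exactly as recorded, with no running adjustment.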
However, even if you took measures 1 through 8, you would save only about 20%. To get five-fold compression you need to invent a new, more compact format and use the last two drastic measures, 11 and 12. The problem is that you then need a plug-in for the browser to fluff it back up again. Ideally you would want the browsers to ship with your plug-in built in.
This coding is extremely simple. The hard part is political, persuading people to use your program once you have it written.
This project defines a compact, Java-friendly format for HTML files. The compact form is much easier for computers and browsers to process since it is pre-tokenized. It is also much more compact. Your job is simply to write a program to convert HTML to compact form and back again. You might use a parser or StringTokenizer.
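As a starting point for the tokenizing step, here is a minimal sketch. It uses `java.util.regex` rather than a parser or StringTokenizer, purely because it is short. The class name `HtmlTokens` and the convention of marking trailing whitespace with a `_` suffix on the token string are my own assumptions for illustration. It treats every complete `<...>` as one token, so it does not split tags with attributes into a `<p` prefix, and it silently skips a stray `<` that has no closing `>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: splits HTML into tag and word tokens. */
final class HtmlTokens {
    // Group 1: a complete tag <...> or a run of characters that are neither whitespace nor '<'.
    // Group 2: any whitespace that follows the token.
    private static final Pattern TOKEN = Pattern.compile("(<[^>]*>|[^\\s<]+)(\\s*)");

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            // One or more trailing whitespace characters collapse to a single '_' marker.
            tokens.add(m.group(2).isEmpty() ? m.group(1) : m.group(1) + "_");
        }
        return tokens;
    }
}
```

For example, `HtmlTokens.tokenize("<p>The  cat</p>\n<p>sat")` yields the tokens `<p>`, `The_`, `cat`, `</p>_`, `<p>` and `sat`.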
Example Codings

| Code | Meaning | Notes |
|---|---|---|
| -2 | <p> | All the usual HTML tags are predefined. |
| -3 | <p>_ | <p> followed by one or more spaces. |
| -4 | </p> | |
| -5 | </p>_ | |
| -6 | <p | |
| -7 | <p_ | |
| -8 | nonbreaking space | |
| -9 | nonbreaking space_ | nonbreaking space followed by one or more regular spaces. |
| -10 | é | The `&eacute;` entity. |
| -11 | é_ | é followed by one or more spaces. |
| -1000 | the | The 30,000 most commonly used words discovered by a random sampling of Internet web files rate their own special pre-defined code. For reasons I don’t want to go into here, assign the most common words the lowest numbers. In other words sort the list by frequency before you assign the codes. |
| -1001 | the_ | the followed by one or more spaces |
| -32000 | the next word should be capitalised, if it is not already. | |
| -32001 | the next word should be lower case, if it is not already. | |
| +2 | sensibility | This word is not in the standard dictionary, so it must be defined in this document. You define it at the point you first use it. Alternatively, you could put all the definitions at the head of the document without leading codes; dense encoding would be presumed. The structure of the document then becomes even simpler: a batch of strings followed by a batch of shorts, not interleaved. You would then need a leading 32-bit count on the document of how many strings there are and how many tokens there are. With the interleaved format, you would not need that. |
| +3 | sensibility_ | You only have to define the even +2 or odd +3 once. Defining either defines both. Thereafter you can use either +2 or +3 without defining it. |
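
The numbering in the table can be sketched as follows. This is a hedged illustration, not a finished encoder. The class `CodeBook` and its methods are hypothetical names. I am assuming that the standard words take consecutive pairs of codes counting down from -1000 in frequency order, and that document-defined words take consecutive pairs counting up from +2; the table shows only the first pair of each. The sketch ignores tags, the capitalisation codes -32000 and -32001, and the question of how the definitions themselves are stored, and it does no range check to keep codes within a short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: assigns codes to words, with or without trailing space. */
final class CodeBook {
    static final int FIRST_STANDARD = -1000; // code of the most common word, per the table
    static final int FIRST_LOCAL = 2;        // code of the first document-defined word, per the table

    private final Map<String, Integer> standard = new HashMap<>();
    private final Map<String, Integer> local = new HashMap<>();
    private final List<String> localWords = new ArrayList<>();

    /** Sorts words most common first; ties are broken alphabetically so the result is repeatable. */
    static List<String> byFrequency(Map<String, Integer> counts) {
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return words;
    }

    /** wordsByFrequency must already be sorted, most common first. */
    CodeBook(List<String> wordsByFrequency) {
        for (int rank = 0; rank < wordsByFrequency.size(); rank++) {
            // Assumption: each word takes two codes, so ranks step down by two.
            standard.put(wordsByFrequency.get(rank), FIRST_STANDARD - 2 * rank);
        }
    }

    /** A token is a word, optionally ending in '_' to mean "followed by one or more spaces". */
    short encode(String token) {
        boolean spaced = token.endsWith("_");
        String word = spaced ? token.substring(0, token.length() - 1) : token;
        Integer base = standard.get(word);
        if (base != null) {
            return (short) (spaced ? base - 1 : base); // -1000 the, -1001 the_
        }
        base = local.get(word);
        if (base == null) {
            // First use defines the word; defining either form defines both.
            base = FIRST_LOCAL + 2 * localWords.size();
            local.put(word, base);
            localWords.add(word);
        }
        return (short) (spaced ? base + 1 : base); // +2 sensibility, +3 sensibility_
    }

    /** Document-defined words in the order they were first used. */
    List<String> localWords() {
        return localWords;
    }
}
```

With a standard list beginning with "the", `encode("the_")` gives -1001, and the first unknown word encoded gets +2 or +3 depending on whether it is followed by space.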
The key thing to understand about this project is that compact format is very computer friendly. It takes less work to read it than uncompressed HTML! It is not like a traditional compressed format that takes considerable CPU power to uncompress first, at least if you leave off the final PKZIP step. Ideally the browser would not fluff CBF up to HTML and then parse the HTML. Instead it would take the CBF directly to its internal tables, since it is pre-tokenised. CPUs are evolving faster than communication lines, so the overhead of the UNZIP step may be negligible on modern computers.
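To show why reading the compact form is so cheap, here is a minimal decoding sketch: each code is just an array lookup. The class `CompactDecoder`, the three separate tables and the way codes map to table slots are my assumptions for illustration. I take tags to start at -2, standard words at -1000 and document-defined words at +2, with an odd code meaning "followed by space", which matches every row of the example table. The capitalisation codes are not handled, and a run of spaces comes back as a single space.

```java
/** Hypothetical sketch: expands a stream of short codes back to HTML text. */
final class CompactDecoder {
    static final int FIRST_TAG = -2;           // <p>, per the table
    static final int FIRST_WORD = -1000;       // the, per the table
    static final int FIRST_LOCAL = 2;          // first document-defined word, per the table
    static final int FIRST_CASE_FLAG = -32000; // capitalisation codes, not handled here

    static String expand(short[] codes, String[] tags, String[] words, String[] local) {
        StringBuilder html = new StringBuilder();
        for (short code : codes) {
            String text;
            if (code >= FIRST_LOCAL) {
                text = local[(code - FIRST_LOCAL) / 2];
            } else if (code <= FIRST_CASE_FLAG || code > FIRST_TAG) {
                // Case flags and the unused codes -1, 0 and 1 are outside this sketch.
                throw new IllegalArgumentException("unsupported code " + code);
            } else if (code <= FIRST_WORD) {
                text = words[(FIRST_WORD - code) / 2];
            } else {
                text = tags[(FIRST_TAG - code) / 2];
            }
            html.append(text);
            if ((code & 1) != 0) {
                html.append(' '); // odd codes carry "one or more spaces"; we emit exactly one
            }
        }
        return html.toString();
    }
}
```

There is no scanning for angle brackets or word boundaries here, which is the whole point: the tokenizing work was done once, by whoever compacted the file.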
You can extend this project to an HTML Tidier that indents and inserts the correct number of \ns before and after each tag.
Send suggestions to improve this page to Roedy Green, Canadian Mind Products.
You can get a fresh copy of this page from http://mindprod.com/project/htmlcompactor.html or possibly from your local J: drive (Java virtual drive/Mindprod website mirror) at J:\mindprod\project\htmlcompactor.html