
HTML Compactor


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

What is this project for?
  1. Quintuple the effective bandwidth of the Internet.
  2. Make HTML (Hypertext Markup Language) files download five times faster.
  3. Speed up browser rendering.
  4. Encourage everyone to post only valid HTML on the web. This leads to HTML that will render properly in all browsers, not just one.
  5. Make it trivially easy to write your own programs that do custom processing on HTML files, e.g. replacing boilerplate, patching broken/disturbed links, splitting large HTML files into smaller ones, tidying HTML, table sorting and table reorganising.
  6. Make a not-so-subtle hint to the XML (eXtensible Markup Language) people about what they should do to XML to put it on a diet. See the XML compactor project.

Types of Compression

There are a number of simple things you might consider doing to make HTML more compact. I list them in order of how drastic the changes are to the source files, not how difficult they are to do:
  1. Remove trailing spaces on lines.
  2. Remove leading spaces on lines (except inside <pre>..</pre>).
  3. Convert \r \n to plain \n.
  4. Convert multiple spaces on a line to a single space (except inside <pre>..</pre>).
  5. Convert multiple spaces on a line to a single space. (even inside <!--… -->)
  6. Convert three or more \n in a row to two \n. (remove excess blank lines)
  7. Convert two or more \n in a row to one \n. (remove all blank lines)
  8. Remove all comments (<!--… -->), except ones needed for SSI (Server Side Includes) like <!--#CONFIG TIMEFMT=%Y-%m-%d-->
  9. Remove unnecessary space on either side of tags. It depends on the tag and the amount of space on the other side of the tag whether space can be completely removed.
  10. Consolidate tags, e.g. <span class=x>this </span><span class=x>and that</span> can be collapsed to <span class=x>this and that</span>.
  11. Convert to CBF (Compact Binary Format) : 16-bit tokens.
  12. Zip the tokens.
All but the last two of these compactions don’t change the meaning of the HTML.
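
As a rough illustration of the gentler compactions, here is a minimal sketch of steps 1 through 4 and 6, assuming the whole file fits in a String. The class name WhitespaceCompactor and the method compact are hypothetical, invented for this example. The <pre> detection is deliberately naive: it only looks for the text "<pre" and "</pre" on each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper sketching compactions 1, 2, 3, 4 and 6 from the list above. */
class WhitespaceCompactor {

    static String compact(String html) {
        List<String> kept = new ArrayList<>();
        boolean inPre = false;
        // Step 3: convert \r\n to plain \n, then work line by line.
        for (String line : html.replace("\r\n", "\n").split("\n", -1)) {
            String lower = line.toLowerCase(Locale.ROOT);
            int open = lower.lastIndexOf("<pre");   // naive: would also match e.g. <prefix>
            int close = lower.lastIndexOf("</pre");
            // Leave any line that is inside, or opens, a <pre> block mostly alone.
            boolean protectedLine = inPre || open >= 0;

            // Step 1: remove trailing spaces and tabs.
            line = line.replaceAll("[ \\t]+$", "");
            if (!protectedLine) {
                // Step 2: remove leading spaces and tabs.
                line = line.replaceAll("^[ \\t]+", "");
                // Step 4: collapse runs of spaces to a single space.
                line = line.replaceAll(" {2,}", " ");
            }
            // Whichever of <pre / </pre appears last on the line decides the new state.
            if (open >= 0 || close >= 0) {
                inPre = open > close;
            }
            kept.add(line);
        }
        // Step 6: three or more \n in a row become two. Like the list above,
        // this sketch makes no exception for blank lines inside <pre>.
        return String.join("\n", kept).replaceAll("\\n{3,}", "\n\n");
    }

    public static void main(String[] args) {
        String sample = "  <p>a   b</p>  \r\n\r\n\r\n\r\n<pre>\r\n  x   y\r\n</pre>\r\n";
        System.out.print(compact(sample));
    }
}
```

Steps 7 through 10 need more knowledge of HTML structure than a line-by-line pass has, so they are left out of this sketch.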

I have written an HTML compactor that implements steps 1 through 7. I will be putting it up for distribution as shareware shortly.

All browsers will handle the compacted format just fine without any special tool to fluff it up first. If you wanted to restore it to the original fluffy form, you could use a tool such as the SlickEdit beautifier, which could restore it to a close approximation of the original. You could use a Funduc Search and Replace macro to insert extra blank lines before whichever tags you chose. To get back exactly where you started, it would be simplest just to save a backup copy prior to compressing, or to save a delta file of the differences (e.g. a list of offsets in the compressed file and what had been deleted).

However, even if you took measures 1 through 8, you would save only about 20%. To get the five times compression you need to invent a new, more compact format and use the last two drastic measures 11 and 12. The problem is, you then need a plug-in for the browser to fluff it back up again. Ideally you would want the browsers to ship with your plug-in built-in.

This coding is extremely simple. The hard part is political, persuading people to use your program once you have it written.

This project defines a compact, Java-friendly format for HTML files. The compact form is much easier for computers and browsers to process since it is pre-tokenized. It is also much more compact. Your job is simply to write a program to convert HTML to compact form and back again. You might use a parser or StringTokenizer.
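
To make the StringTokenizer suggestion concrete, here is a minimal sketch of a tokeniser that looks only for space, < and >. The class name SimpleHtmlTokeniser is hypothetical. It assumes one convention of its own: a token that was followed by one or more spaces is returned with a single trailing space, which corresponds to the "_" variants in the example codings further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical simple-minded tokeniser: splits HTML on space, < and > only. */
class SimpleHtmlTokeniser {

    static List<String> tokenise(String html) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true, so spaces and angle brackets come back as tokens too.
        StringTokenizer st = new StringTokenizer(html, " <>", true);
        StringBuilder tag = null; // non-null while we are between < and >
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            if (tag != null) {
                // Inside a tag: keep everything, including spaces between attributes.
                tag.append(t);
                if (t.equals(">")) {
                    tokens.add(tag.toString());
                    tag = null;
                }
            } else if (t.equals("<")) {
                tag = new StringBuilder("<");
            } else if (t.equals(" ")) {
                // Mark the previous token as "followed by one or more spaces".
                int last = tokens.size() - 1;
                if (last >= 0 && !tokens.get(last).endsWith(" ")) {
                    tokens.set(last, tokens.get(last) + " ");
                }
            } else {
                tokens.add(t);
            }
        }
        if (tag != null) {
            tokens.add(tag.toString()); // unterminated tag: keep what we have
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("<p>the  cat <b>sat</b></p>"));
    }
}
```

This sketch ignores newlines, tabs, comments, and > characters inside attribute values. A real compactor would need to handle all of those, which is where a proper parser earns its keep.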

HTML Compact Binary Format

All the 32,765 possible negative codes are predefined. The definitions are not part of the compact document. All the 32,765 possible positive codes are reserved for words not in the standard dictionary. They have to be defined inside this document the first time they are used by following the even or odd code with the UTF-8 encoding of the Unicode String, without any trailing spaces. Here is a very approximate example of how you might assign the codes:
Example Codings

| Code | Meaning | Notes |
|---|---|---|
| -2 | <p> | All the usual HTML tags are predefined. |
| -3 | <p>_ | <p> followed by one or more spaces. |
| -4 | </p> | |
| -5 | </p>_ | |
| -6 | <p | |
| -7 | <p_ | |
| -8 | &nbsp; | nonbreaking space |
| -9 | &nbsp;_ | nonbreaking space followed by one or more regular spaces. |
| -10 | &eacute; | é |
| -11 | &eacute;_ | é followed by one or more spaces. |
| -1000 | the | The 30,000 most commonly used words discovered by a random sampling of Internet web files rate their own special pre-defined code. For reasons I don’t want to go into here, assign the most common words the lowest numbers. In other words, sort the list by frequency before you assign the codes. |
| -1001 | the_ | the followed by one or more spaces |
| -32000 | the next word should be capitalised, if it is not already. | |
| -32001 | the next word should be lower case, if it is not already. | |
| +2 | sensibility | This word is not in the standard dictionary, so it must be defined in this document. You define it at the point you first use it. Alternatively, you could put all the definitions at the head of the document without leading codes. Dense encoding would be presumed. The structure of the document becomes even simpler: a batch of strings followed by a batch of shorts, not interleaved. You would then need a lead 32-bit count on the document of how many strings there are and how many tokens there are. With the interleaved format, you would not need that. |
| +3 | sensibility_ | You only have to define the even +2 or odd +3 once. Defining either defines both. Thereafter you can use either +2 or +3 without defining it. |

You may recognise this as a special implementation of the supercompressor. After you are finished, you can further compact the file with traditional PkZip-style compression. However, this format is already quite compact and is very easy for a program to process. Any program that wants to analyse the file or decompact it merely has to use two arrays of strings, one for positive codes and one for negative, and a switch statement with the tags of interest defined with special processing. Lookup is extremely fast. Compacting is done with a simple tokeniser. You can get fancy with a parser to validate syntax, or you can use a simple-minded tokeniser that just looks for space, < and >.
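
The interleaved format might be sketched as follows. Everything specific here is an assumption made for illustration: the class name CbfSketch, the three-entry PREDEFINED list standing in for the real 30,000-word dictionary, the convention that a token with a trailing space means "followed by one or more spaces", and the use of DataOutputStream.writeUTF (a 2-byte length plus Java's modified UTF-8) to delimit each definition. Even codes are the bare token and odd codes are the token followed by spaces, as in the example codings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of interleaved CBF: 16-bit codes, words defined on first use. */
class CbfSketch {
    // Tiny stand-in for the predefined dictionary: entry i gets codes -(2i+2) and -(2i+3).
    // Here "the" lands on -6/-7 rather than -1000/-1001 only because the list is so short.
    static final List<String> PREDEFINED = Arrays.asList("<p>", "</p>", "the");

    static byte[] encode(List<String> tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Map<String, Integer> docWords = new HashMap<>(); // word -> index, in order of first use
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String token : tokens) {
                boolean spaced = token.endsWith(" ");
                String word = spaced ? token.substring(0, token.length() - 1) : token;
                int low = spaced ? 1 : 0; // odd code = followed by spaces
                int pre = PREDEFINED.indexOf(word);
                if (pre >= 0) {
                    out.writeShort(-(2 * (pre + 1) + low));
                    continue;
                }
                Integer known = docWords.get(word);
                int index = (known != null) ? known : docWords.size();
                out.writeShort(2 * (index + 1) + low); // no overflow check in this sketch
                if (known == null) {
                    docWords.put(word, index);
                    out.writeUTF(word); // definition follows the first use of the code
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<String> decode(byte[] cbf) {
        List<String> docWords = new ArrayList<>();
        List<String> tokens = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(cbf))) {
            while (in.available() > 0) {
                short code = in.readShort();
                int magnitude = Math.abs(code);
                int index = magnitude / 2 - 1;
                String word;
                if (code < 0) {
                    word = PREDEFINED.get(index);
                } else {
                    // Codes are assigned densely, so an unseen index means a definition follows.
                    if (index == docWords.size()) {
                        docWords.add(in.readUTF());
                    }
                    word = docWords.get(index);
                }
                tokens.add((magnitude & 1) == 1 ? word + " " : word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "the ", "sensibility ", "sensibility", "</p>");
        System.out.println(decode(encode(tokens)));
    }
}
```

A fuller version would add the switch statement on the tag codes of interest and the capitalisation codes, neither of which this sketch attempts.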

The key thing to understand about this project is that the compact format is very computer friendly. It takes less work to read it than uncompressed HTML! It is not like a traditional compressed format that takes considerable CPU (Central Processing Unit) power to uncompress first, at least if you leave off the final PkZip step. Ideally the browser would not fluff CBF up to HTML then parse the HTML. Instead it would take the CBF directly to its internal tables since it is pre-tokenized. CPUs are evolving faster than communication lines, so the overhead of the UNZIP step may be negligible on modern computers.
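
For the optional final zip step, one possible approach is the deflate support in java.util.zip. This is a sketch only, assuming the tokens are already in a byte array. The class name TokenZipper is hypothetical, and the output is a raw zlib stream rather than a PkZip archive with directory entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for step 12: deflate the token bytes and inflate them again. */
class TokenZipper {

    static byte[] zip(byte[] tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the stream flushes the final compressed block.
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(tokens);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static byte[] unzip(byte[] zipped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(zipped))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] tokens = new byte[2000]; // stand-in for 1000 identical 16-bit codes
        byte[] zipped = zip(tokens);
        System.out.println(tokens.length + " bytes -> " + zipped.length + " bytes");
    }
}
```

How much this step gains on real token streams is something you would have to measure. The sample data in main is artificially repetitive.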

You can extend this project to an HTML Tidier that indents and inserts the correct number of \ns before and after each tag.

SPDY

This page is posted on the web at: http://mindprod.com/project/htmlcompactor.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
