HTML Compactor

This essay does not describe an existing computer program, just one that should exist. It outlines a suggested student project in Java programming and gives a rough overview of how it might work. I have no source code, object code, specifications, file layouts or anything else useful for implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Types of Compression

I have written an HTML compactor that implements measures 1 through 7. I will be putting it up for distribution as shareware shortly.

All browsers will handle the compacted format just fine without any special tool to fluff it up first. If you wanted to restore it to the original fluffy form, you could use a tool such as the SlickEdit beautifier, which could restore it to a close approximation of the original. You could use a Funduc Search and Replace macro to insert extra blank lines before whichever tags you chose. To get back exactly where you started, it would be simplest just to save a backup copy prior to compressing, or to save a delta file of the differences (e.g. a list of offsets in the compressed file and what had been deleted at each).
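The delta-file idea can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the class DeltaRestorer, the Deletion entry and the rule that entries are sorted by ascending offset are all hypothetical, not part of any existing tool.

```java
import java.util.List;

/** Sketch: rebuild the original text from the compacted text plus a list of deletions. */
final class DeltaRestorer {
    /** One entry of the hypothetical delta file. */
    static final class Deletion {
        final int offset;      // position in the compacted text where characters were removed
        final String deleted;  // the characters that were removed at that position

        Deletion(int offset, String deleted) {
            this.offset = offset;
            this.deleted = deleted;
        }
    }

    /** Re-inserts every deletion. Assumes the list is sorted by ascending offset. */
    static String restore(String compacted, List<Deletion> deltas) {
        StringBuilder original = new StringBuilder();
        int pos = 0;
        for (Deletion d : deltas) {
            original.append(compacted, pos, d.offset); // copy the untouched run
            original.append(d.deleted);                // put back what was stripped
            pos = d.offset;
        }
        original.append(compacted, pos, compacted.length());
        return original.toString();
    }

    public static void main(String[] args) {
        String compacted = "<p>Hello</p><p>World</p>";
        List<Deletion> deltas = List.of(new Deletion(12, "\n\n"), new Deletion(24, "\n"));
        System.out.print(restore(compacted, deltas));
    }
}
```

The compactor would write one such entry every time it throws characters away, so the delta costs nothing extra to compute.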

However, even if you took measures 1 through 8, you would save only about 20%. To get five-fold compression, you need to invent a new, more compact format and use the last two drastic measures, 11 and 12. The problem is that you then need a plug-in for the browser to fluff it back up again. Ideally, you would want the browsers to ship with your plug-in built in.

The coding is extremely simple. The hard part is political: persuading people to use your program once you have it written.

This project defines a compact, Java-friendly format for HTML files. The compact form is much easier for computers and browsers to process since it is pre-tokenized. It is also much more compact. Your job is simply to write a program to convert HTML to compact form and back again. You might use a parser or StringTokenizer.
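To give a feel for the tokenizing step, here is a rough sketch that uses a regular expression instead of a parser or StringTokenizer. The class HtmlTokenizer and its Token type are hypothetical names of my own. It splits the text into whole tags, entities and words, and records whether whitespace followed each one. It makes no attempt to handle comments, scripts or pre blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: split HTML into tags, entities and words, noting trailing whitespace. */
final class HtmlTokenizer {
    /** A token plus a flag recording whether one or more spaces followed it. */
    static final class Token {
        final String text;
        final boolean spaceAfter;

        Token(String text, boolean spaceAfter) {
            this.text = text;
            this.spaceAfter = spaceAfter;
        }

        @Override
        public String toString() {
            return spaceAfter ? text + "_" : text; // same underscore notation as the codings table
        }
    }

    // Group 1: a whole tag, an entity such as &eacute;, a run of ordinary characters,
    // or a stray '<' or '&'. Group 2: any whitespace that follows.
    private static final Pattern TOKEN =
            Pattern.compile("(<[^>]*>|&[A-Za-z0-9#]+;|[^<&\\s]+|[<&])(\\s*)");

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(new Token(m.group(1), !m.group(2).isEmpty()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>the caf&eacute; </p>"));
    }
}
```

Each token, together with its trailing-space flag, is what later gets mapped to a single short code.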

HTML Compact Binary Format

Example Codings
Code	Meaning	Notes
-2	<p>	All the usual HTML tags are predefined.
-3	<p>_	<p> followed by one or more spaces.
-4	</p>
-5	</p>_
-6	<p
-7	<p_
-8	&nbsp;	nonbreaking space
-9	&nbsp;_	nonbreaking space followed by one or more regular spaces.
-10	&eacute;	é
-11	&eacute;_	é followed by one or more spaces.
-1000	the	The 30,000 most commonly used words discovered by a random sampling of Internet web files rate their own special pre-defined code. For reasons I don’t want to go into here, assign the most common words the lowest numbers. In other words, sort the list by frequency before you assign the codes.
-1001	the_	the followed by one or more spaces.
-32000		The next word should be capitalised, if it is not already.
-32001		The next word should be lower case, if it is not already.
+2	sensibility	This word is not in the standard dictionary, so it must be defined in this document. You define it at the point you first use it. Alternatively, you could put all the definitions at the head of the document without leading codes. Dense encoding would be presumed. The structure of the document becomes even simpler: a batch of strings followed by a batch of shorts, not interleaved. You would then need a lead 32-bit count on the document of how many strings there are and how many tokens there are. With the interleaved format, you would not need that.
+3	sensibility_	You only have to define the even +2 or odd +3 once. Defining either defines both. Thereafter you can use either +2 or +3 without defining it.
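The pairing rule in the table (one code for the bare token, the next code along for the token followed by spaces) might be coded like this. It is a sketch only. CodeAssigner is a hypothetical name, the three predefined entries are a tiny stand-in for the real dictionary, and handing out +4/+5, +6/+7 and so on to later document-defined words is my assumption; the table shows only +2 and +3.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: map a token and its trailing-space flag to a 16-bit code. */
final class CodeAssigner {
    // Tiny stand-in for the predefined dictionary; values copied from the example table.
    private static final Map<String, Integer> PREDEFINED = new HashMap<>();
    static {
        PREDEFINED.put("<p>", -2);
        PREDEFINED.put("</p>", -4);
        PREDEFINED.put("the", -1000);
    }

    // Words defined by this document, in order of first use.
    // ASSUMPTION: they receive +2, +4, +6, ... in that order.
    private final Map<String, Integer> local = new LinkedHashMap<>();

    short codeFor(String token, boolean spaceAfter) {
        Integer base = PREDEFINED.get(token);
        if (base != null) {
            // Predefined codes are negative; the "followed by spaces" form is one lower.
            return (short) (spaceAfter ? base - 1 : base);
        }
        Integer code = local.get(token);
        if (code == null) {
            code = 2 + 2 * local.size(); // first use defines both the even and the odd form
            local.put(token, code);
        }
        // Document-defined codes are positive; the "followed by spaces" form is one higher.
        return (short) (spaceAfter ? code + 1 : code);
    }

    public static void main(String[] args) {
        CodeAssigner codes = new CodeAssigner();
        System.out.println(codes.codeFor("the", true));
        System.out.println(codes.codeFor("sensibility", false));
    }
}
```

A real encoder would also have to emit the spelling of each new word at the point of first use, or collect the spellings at the head of the file.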

The key thing to understand about this project is that the compact format is very computer friendly. It takes less work to read than uncompressed HTML! It is not like a traditional compressed format that takes considerable CPU (Central Processing Unit) power to uncompress first, at least if you leave off the final PkZip step. Ideally, the browser would not fluff CBF (Compact Binary Format) up to HTML and then parse the HTML. Instead, it would take the CBF directly into its internal tables, since it is pre-tokenized. CPUs are evolving faster than communication lines, so the overhead of the UNZIP step may be negligible on modern computers.
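To show how little work reading takes, here is a sketch of the non-interleaved layout mentioned in the table notes: two lead 32-bit counts, then a batch of strings, then a batch of shorts. The class name CompactFile, the order of the two counts and the use of writeUTF for the strings are my assumptions, not a specification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch: counts, then document-defined words, then token codes. */
final class CompactFile {
    final String[] words; // words defined by this document
    final short[] tokens; // one 16-bit code per token

    CompactFile(String[] words, short[] tokens) {
        this.words = words;
        this.tokens = tokens;
    }

    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(words.length);  // lead 32-bit count of strings
            out.writeInt(tokens.length); // lead 32-bit count of tokens
            for (String w : words) {
                out.writeUTF(w);
            }
            for (short t : tokens) {
                out.writeShort(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static CompactFile fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] words = new String[in.readInt()];
            short[] tokens = new short[in.readInt()];
            for (int i = 0; i < words.length; i++) {
                words[i] = in.readUTF();
            }
            for (int i = 0; i < tokens.length; i++) {
                tokens[i] = in.readShort();
            }
            return new CompactFile(words, tokens);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading is two counted loops with no scanning for angle brackets, which is the whole point of a pre-tokenized format.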

You can extend this project to an HTML Tidier that indents and inserts the correct number of \ns before and after each tag.
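A first cut at the Tidier could look something like the sketch below. HtmlTidier is a hypothetical name, and the policy (every tag and every text run on its own line, two spaces of indent per nesting level, a short hard-coded list of void tags) is my own simplification. Note that adding newlines inside inline text can change how a browser renders it, so a serious version would need per-tag rules.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: put each tag and text run on its own line, indented by nesting depth. */
final class HtmlTidier {
    // Either a whole tag or a run of text between tags.
    private static final Pattern PIECE = Pattern.compile("<[^>]*>|[^<]+");
    // Partial list of tags that never have a closing tag, so they must not indent.
    private static final Set<String> VOID_TAGS = Set.of("br", "hr", "img", "input", "link", "meta");

    static String tidy(String html) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        Matcher m = PIECE.matcher(html);
        while (m.find()) {
            String piece = m.group().trim();
            if (piece.isEmpty()) {
                continue; // whitespace between tags
            }
            boolean closing = piece.startsWith("</");
            if (closing && depth > 0) {
                depth--; // a closing tag lines up with its opening tag
            }
            out.append("  ".repeat(depth)).append(piece).append('\n');
            boolean opening = piece.startsWith("<") && !closing
                    && !piece.startsWith("<!") && !piece.endsWith("/>")
                    && !VOID_TAGS.contains(tagName(piece));
            if (opening) {
                depth++;
            }
        }
        return out.toString();
    }

    /** Extracts the lower-case element name from an opening tag such as <P class=x>. */
    private static String tagName(String tag) {
        int end = 1;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(1, end).toLowerCase();
    }

    public static void main(String[] args) {
        System.out.print(tidy("<div><p>Hello<br>world</p></div>"));
    }
}
```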

	This page is posted on the web at:	http://mindprod.com/project/htmlcompactor.html

