This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful to implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, and the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to define
the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
The tidier makes it easier to perform edits on the
HTML (Hypertext Markup Language) source with a text editor.
The HTML editor
features of Swing make this a rather trivial project. Some say that MS Front Page does
this already. Not so. It mangles your HTML.
The tidier might do things such as:
- Put 0, 1 or 2 newline characters before and after each tag as appropriate. If there
are newline characters or whitespace characters that should not appear before or after
a tag, remove them to tighten up. You will probably need to make this table-driven
since not everyone will agree which tags deserve a new line and blank lines before or
after. Some tags such as <font> must be left alone other
than possibly consolidating whitespace inside, before or after, into a single newline or
space. If you examine the HTML source of this document you are reading, you can see it
has been tidied with a crude search/replace script for consistent newline behaviour.
By default you might make sure every <p>,
<ol>, <ul> and
<li> tag starts on a new line with a blank line
ahead of it.
- Remove excess blank lines. Allow no more than one blank line in a row.
- Consistently convert to the desired line end convention: Unix = Lf,
Windows = CrLf, or Mac OS (Operating System)
1-9 = Cr.
- Use lower case consistently for all tags.
- Remove leading and trailing blanks inside <li>…</li>.
- Avoid lines longer than 60 characters. Break lines at
auspicious points — but never in the middle of a quoted string. Avoid reflowing
- Sort meta tags in alphabetical order.
- Put parameters of <img> in some standard order,
probably not alphabetical, e.g. SRC then width, height, border then alt. Do the same for
other tags with parameters such as font, dt, table etc.
- Generate a new KEYWORDS metatag based on scooping the phrases in h1, h2 and dt tags
(avoiding font tags of course). Combine it with the existing KEYWORDS metatag, removing
duplicates and putting the phrases in alphabetical order.
- Indent to show structure.
- Insert missing </li>, </dt>, </dd>, </h?> etc. tags. Exactly which ones you want should be
configurable. You might also remove end tags you consider unwanted clutter.
- Replace " with &quot;, & with &amp;, < with &lt; etc. where they occur in the context of ordinary text.
- Convert css tags to old style on request. That way you can have the convenience of
composing with css, but broadcast compatible old style tags. Don’t get too
carried away, but you might add some user macros too.
- Consolidate tags, e.g.
<b>both</b> <b>bold</b> => <b>both bold</b>.
Similarly, separate <font> tags could merge into a single tag such as
<font size=+1 color=red>… </font> .
- Convert colour names to hex or vice versa. See the chart of Netscape colour names, e.g. convert
<font color=papayawhip> to
<font color=#ffefd5> .
- Remove redundant tags that are already handled by CSS (Cascading Style Sheets)
- Convert stray high-bit characters to the equivalent alphabetic entity or a numeric
entity if there is no alphabetic entity. Convert numeric entities to alphabetic
entities where possible. Standardise on either decimal or hex entities. Also permit
documents encoded with entities to be converted to UTF-8
encoding without entities.
- A parser such as JavaCC or ANTLR (Another Tool for Language Recognition)
might be useful. See parser in the Java glossary.
- Check all IMG statements to ensure the width and height parameters actually match
the size of the *.png, *.gif or
*.jpg file. If the image file is missing, insert an easily
found comment in the HTML. If the size is wrong, correct it, or alternatively,
insert the new size information as an easily found comment, so it can be manually
corrected later, or left as is if image resizing was actually intended. Invent a
comment to put in deliberately resized images to bypass this check.
- Integrate the Compactor
so that the user can alternate between efficient and fluffy easy-to-edit forms.
- Provide an option to convert HTML
to plain text by stripping out all the tags. There are a few wrinkles to watch out for:
- > may occur improperly unpaired with <. In which case treat it as
- < and > may appear inside quotes inside <..> in which case they
should be treated as ordinary characters.
- References should optionally be converted to English so they are not lost, e.g. a link
could be rendered as:
You can read up on the rules [see
http://www.cwi.nl/~dik/english/codes/isbn.html] for calculating the expected
check digit.
If you wanted to get really clever, you would avoid mentioning the reference
if it were already present nearby in plain text, e.g.
should be rendered as:
You can read up on the rules for calculating the expected check digit at
- You might also optionally render <b>Wow!</b> as
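The tag-stripping wrinkles above could be handled with a small character-by-character scanner. What follows is only a rough sketch under stated assumptions, not a finished converter: the class and method names are made up for illustration, an unpaired > is simply copied through as ordinary text (one possible reading of that rule), a < only opens a tag when a letter, / or ! follows it, and entities are left undecoded.

```java
/** Hypothetical sketch: strip tags from HTML, honouring quotes inside tags. */
final class TagStripperSketch {

    /** Assumption: a tag starts only when '<' is followed by a letter, '/' or '!'. */
    private static boolean looksLikeTagStart(String html, int i) {
        if (i + 1 >= html.length()) {
            return false;
        }
        char next = html.charAt(i + 1);
        return Character.isLetter(next) || next == '/' || next == '!';
    }

    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted attribute value"
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (quote != 0) {
                    // Inside quotes, < and > are ordinary characters.
                    if (c == quote) {
                        quote = 0;
                    }
                } else if (c == '"' || c == '\'') {
                    quote = c;
                } else if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<' && looksLikeTagStart(html, i)) {
                inTag = true;
            } else {
                // Ordinary text, including any unpaired '>' or stray '<'.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<img alt=\"a > b\" src=x.png>1 > 0 and <i>x</i> < 2"));
    }
}
```

Converting references to English, as suggested above, would need the scanner to remember the href of each link rather than discard it, which this sketch does not attempt.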
I am working on implementing a subset of this problem: reflowing and indenting.
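To give a feel for how small some of the simpler rules in the list are, here is a minimal sketch of four of them: line end conversion, blank line removal, lower-casing tag names and escaping ordinary text. Everything in it is an assumption made for illustration. The class and method names are invented, the entity table is a tiny sample rather than the full set, and the regex-based tag rule does not skip comments or scripts the way a real parser would.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a few simple tidying rules; not a complete tidier. */
final class TidyRulesSketch {

    /** Matches "<" or "</" followed by a tag name. Comments and scripts are not skipped. */
    private static final Pattern TAG_NAME = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)");

    /** Tiny illustrative sample of named entities; a real table would be far larger. */
    private static final Map<Integer, String> NAMED = new HashMap<>();
    static {
        NAMED.put((int) '&', "amp");
        NAMED.put((int) '<', "lt");
        NAMED.put((int) '>', "gt");
        NAMED.put((int) '"', "quot");
        NAMED.put(0xA9, "copy");
        NAMED.put(0xE9, "eacute");
    }

    /** Convert CrLf, lone Cr and lone Lf to the requested line end. */
    static String normaliseLineEnds(String html, String eol) {
        // CrLf is listed first so that it counts as one line end, not two.
        return html.replaceAll("\r\n|\r|\n", Matcher.quoteReplacement(eol));
    }

    /** Allow at most one blank line in a row. Assumes line ends are already Lf. */
    static String collapseBlankLines(String html) {
        // A line end followed by two or more blank (or whitespace-only) lines.
        return html.replaceAll("\n([ \t]*\n){2,}", "\n\n");
    }

    /** Lower-case tag names only; attribute names and values are left alone. */
    static String lowerCaseTagNames(String html) {
        Matcher m = TAG_NAME.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String tag = "<" + m.group(1) + m.group(2).toLowerCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(tag));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /** Escape ordinary text (not markup, and not text that already contains entities). */
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String name = NAMED.get(cp);
            if (name != null) {
                sb.append('&').append(name).append(';');   // alphabetic entity
            } else if (cp > 126) {
                sb.append("&#").append(cp).append(';');    // decimal numeric entity
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tidy = collapseBlankLines(normaliseLineEnds("<P>One</P>\r\n\r\n\r\n<P>Two</P>", "\n"));
        System.out.println(lowerCaseTagNames(tidy));
        System.out.println(escapeText("caf\u00e9 & \"tea\""));
    }
}
```

Keeping each rule as a separate function from string to string makes it easy to switch rules on and off from a configuration table, which suits the point made earlier that not everyone will agree on the details.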