
Regex Proofreader


This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

If you don’t know what regular expressions are, have a look in the Java & Internet Glossary. Basically they are search/replace patterns.

If you have ever used Funduc Search and Replace or SlickEdit or any program using regular expressions, you know what a beast they can be to proofread. You are often quite amazed by what they do to your files when you turn them loose.

The main problem is quoting. Does < mean the literal character < or is it a command? In Funduc Search and Replace, it is a literal in the search argument, but a command in the replace argument. Arrgh! Every implementation of regexes uses a slightly different set of commands and command characters. In Java 1.4, the character \, when used literally to represent data, has to be written as \\\\ because regex and Java string quoting gang up on you.
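To see the two layers of quoting gang up, here is a minimal sketch using java.util.regex. The class and method names (BackslashDemo, containsBackslash) are made up for illustration.

```java
import java.util.regex.Pattern;

/** Illustrative only: why one literal backslash needs four in Java source. */
final class BackslashDemo {
    /** Returns true if the text contains at least one backslash character. */
    static boolean containsBackslash(String text) {
        // The Java literal "\\\\" is a two-character string: two backslashes.
        // The regex engine then reads those two backslashes as one literal backslash.
        return Pattern.compile("\\\\").matcher(text).find();
    }

    public static void main(String[] args) {
        String path = "C:\\temp"; // seven characters: C : \ t e m p
        System.out.println(path.length());
        System.out.println(containsBackslash(path));
        System.out.println(containsBackslash("C:/temp"));
    }
}
```

The Java compiler strips one layer of quoting and the regex engine strips the other, which is why the count doubles twice.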

If a character is reserved as a command, when you want to use it literally, you must precede it with a \. This turns expressions into unreadable nightmares like this:

*[ \\\*<\r\nabc]

A very simple regex proofreader simply colour codes each character as a literal or as a command. For example, in Funduc search expressions the following characters need to be preceded with \ when used as literals:
- + * ? ( ) [ ] \ | $ ^ !
and the following characters in replace expressions need to be preceded with \ :
% \ < >
You must not precede any other literals with \.
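The classification step could be sketched roughly like this. It assumes the list of search-expression characters given above is complete, and it treats t, r and n after a backslash as control letters. The class name, the method name and the one-letter tags are my own inventions; a real proofreader would map the tags to colours.

```java
/** Hypothetical sketch: tags each character of a Funduc-style search expression. */
final class RegexClassifier {
    // Characters listed above as commands in search expressions.
    static final String SEARCH_COMMANDS = "-+*?()[]\\|$^!";

    /**
     * Returns one tag per input character:
     * C = command, L = literal, Q = quoting backslash,
     * X = control letter (t, r or n right after a quoting backslash).
     */
    static String classify(String regex) {
        StringBuilder tags = new StringBuilder(regex.length());
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                tags.append('Q');
                if (i + 1 < regex.length()) {
                    // The character after a backslash is data, never a command.
                    char next = regex.charAt(++i);
                    tags.append("trn".indexOf(next) >= 0 ? 'X' : 'L');
                }
            } else if (SEARCH_COMMANDS.indexOf(c) >= 0) {
                tags.append('C');
            } else {
                tags.append('L');
            }
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        // The example expression from the essay, written as a Java literal.
        String regex = "*[ \\\\\\*<\\r\\nabc]";
        System.out.println(regex);
        System.out.println(classify(regex));
    }
}
```

Printing the tags directly under the expression gives a crude monochrome proofreader before you write any drawing code.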

You might display command characters in red and literal characters in blue. Instead of seeing them like this:

*[ \\\*<\r\nabc]
you would see them like this:

A slightly cleverer proofreader would let you hide the \ characters (except those preceded by another \ ) and just display the raw colour-coded literals. You could think of this as unquoting. The unquoted regex expressions would then look much more like the actual strings they are intended to match. The expression may then look like this:


What then would you do with tab (\t), carriage return (\r) and newline (\n)? You might cook up some special glyphs to represent them more literally, much the way MS Word can be persuaded to make visible the spaces, line ends and paragraph ends with special symbols, like this:

You might just use the letters t, r and n but display them in green.
*[\*<rn abc]
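The unquoting itself is only a few lines. Here is a rough sketch, assuming every backslash quotes exactly the one character after it. The name hideQuotes is hypothetical. Without the colours the result is ambiguous, so in a real proofreader you would keep the classification alongside the text.

```java
/** Hypothetical sketch: drops quoting backslashes, keeping the characters they quote. */
final class RegexUnquoter {
    static String hideQuotes(String regex) {
        StringBuilder out = new StringBuilder(regex.length());
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Skip the backslash and keep whatever it was quoting,
                // including a second backslash.
                out.append(regex.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example expression from the essay, written as a Java literal.
        System.out.println(hideQuotes("*[ \\\\\\*<\\r\\nabc]"));
    }
}
```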
If you can view them that way, why not type them that way? Instead of worrying about which letters need to be quoted, just use ctrl-R to load your pen with red ink (commands), ctrl-B to load it with blue ink (literal data) and ctrl-G to load it with green ink (control chars), or tap one of three coloured inkwells with your mouse, as if dipping an old goose quill pen. Just type your literals naturally and, behind the scenes, your Applet will insert the \ characters as needed. If you typed an invalid command in red-command mode, it would just beep at you. You could even use the Enter key to more directly represent \n or the \r\n pair.
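The behind-the-scenes quoting for text typed in blue ink might look something like the sketch below. It again assumes the Funduc search-expression list given earlier, and the names LiteralQuoter and quoteLiteral are invented for this example.

```java
/** Hypothetical sketch: quotes text typed as literal data for a Funduc-style search expression. */
final class LiteralQuoter {
    // Characters that act as commands unless preceded by a backslash.
    static final String SEARCH_COMMANDS = "-+*?()[]\\|$^!";

    static String quoteLiteral(String typed) {
        StringBuilder out = new StringBuilder(typed.length() * 2);
        for (int i = 0; i < typed.length(); i++) {
            char c = typed.charAt(i);
            switch (c) {
                // Real control characters become their two-character escapes.
                case '\t': out.append("\\t"); break;
                case '\r': out.append("\\r"); break;
                case '\n': out.append("\\n"); break;
                default:
                    if (SEARCH_COMMANDS.indexOf(c) >= 0) {
                        out.append('\\'); // quote a would-be command
                    }
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteLiteral("2*(3+4)\n"));
    }
}
```

Text typed in red ink would be passed through untouched, apart from the validity check that triggers the beep.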

Everything so far is pretty easy. You don’t need to know anything much about regex expressions to write such a simple proofreader. The only problem is that Java version 1.1 does not support rich text (multiple font colours) in a TextArea, though Swing does. Unfortunately, in Java 1.1 you can’t handle the colours by generating HTML (Hypertext Markup Language) <font color= commands and having something render them. You need to handle your colours at a lower level, for example by drawing on a Canvas.

For a more advanced proofreader, things get a little tougher. The user should be able to submit some sample strings and see what the regex would do to them. Do they match the pattern? What do they get converted to? You have to simulate Funduc Search and Replace, SlickEdit or whatever other regex engine you are proofing for. If you dig about the web, you may find source code to do this for you that you could cannibalise. You have one advantage over the authors of the commercial regex programs: your parsing and scanning code can be very slow and no one will mind. Java 1.4.1 also has Perl-like regexes built in, in the java.util.regex package. The user should be able to maintain little libraries of test strings that he can quickly test out. The before and after views should use colour or bold to highlight the changes.
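If the engine you are proofing for happens to be Java's own, a tryout harness is short. The sketch below uses java.util.regex, so it follows Java's regex dialect, not Funduc's or SlickEdit's. The class name, the method name and the report format are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: runs one search/replace pair over a library of sample strings. */
final class RegexTryout {
    /** Returns one line per sample: the sample, whether the pattern was found, and the text afterwards. */
    static List<String> tryOut(String regex, String replacement, List<String> samples) {
        Pattern pattern = Pattern.compile(regex);
        List<String> report = new ArrayList<>();
        for (String sample : samples) {
            Matcher m = pattern.matcher(sample);
            boolean found = m.find();
            // replaceAll resets the matcher, so the earlier find() does not skip anything.
            String after = found ? m.replaceAll(replacement) : sample;
            report.add(sample + " | " + (found ? "match" : "no match") + " | " + after);
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> samples = new ArrayList<>();
        samples.add("my colour");
        samples.add("my hue");
        for (String line : tryOut("colou?r", "COLOR", samples)) {
            System.out.println(line);
        }
    }
}
```

Highlighting exactly which characters changed would need the match positions from Matcher.start() and Matcher.end(), which this sketch ignores.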

Once you have that working, you can try something even tougher: generate a sample set of strings to feed to the regex that exercise each part of the regex expression. Some should match, some should not. By examining the results on that test set of strings, the author of the regex expressions should have a pretty good idea of all the things that regex will do when it is turned loose in the real world on files. To do it very well, your generated sample set of strings should exercise every feature of the regex expression. You should generate strings that pass and fail each filtering point in the regex.

If your regex engine is sufficiently fast, you might consider turning your proofreader into a full-blown search/replace engine that can run scripts of commands or accept them one at a time via a GUI (Graphic User Interface). You might also consider writing your own text editor with this as the crown jewel. For blinding speed, generate byte codes to parse the regex expression. These may then be JITed on the fly, which should give Perl-like speeds. However, the easy way is just to use java.util.regex.

If you want to tackle a simplified version of this project, try creating a proofreader for Java source code string literals, so that instead of seeing C:\\mydir\\myfile.txt you would see C:\mydir\myfile.txt, or abc ¶def instead of abc\ndef.
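A first cut at the simplified version might look like this. It is only a sketch: it handles the doubled backslash, the escaped double quote and \n (shown as a pilcrow), and leaves every other escape visible as typed. Unicode and octal escapes are ignored. The names JavaLiteralViewer and display are hypothetical.

```java
/** Hypothetical sketch: shows the body of a Java string literal closer to the text it denotes. */
final class JavaLiteralViewer {
    /** Takes the characters between the quotes of a Java string literal, as they appear in source. */
    static String display(String literalBody) {
        StringBuilder out = new StringBuilder(literalBody.length());
        for (int i = 0; i < literalBody.length(); i++) {
            char c = literalBody.charAt(i);
            if (c != '\\' || i + 1 == literalBody.length()) {
                out.append(c);
                continue;
            }
            char next = literalBody.charAt(++i);
            switch (next) {
                case '\\': out.append('\\'); break;     // doubled backslash shown once
                case '"':  out.append('"'); break;      // escaped quote shown bare
                case 'n':  out.append('\u00b6'); break; // newline shown as a pilcrow
                default:   out.append('\\').append(next); // leave other escapes as typed
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In source this literal body reads C:\\mydir\\myfile.txt
        System.out.println(display("C:\\\\mydir\\\\myfile.txt"));
    }
}
```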

I thought of an easy way to implement the proofreader. You scan the regex left to right classifying each char as ordinary, quoted, command or backslash and surrounding each character or run in a classifying <span class=??> </span> sandwich. Then you feed the decorated regex string to a JEditorPane to display the HTML. Depending on which styles you stick on the front you will see either:
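Here is a rough sketch of the decorating scan, assuming the Funduc search-expression command list given earlier. For simplicity it wraps every character in its own span instead of grouping runs. The class names ordinary, quoted, command and backslash come from the description above; RegexHtmlDecorator and decorate are names I made up.

```java
/** Hypothetical sketch: wraps each character of a regex in a classifying span. */
final class RegexHtmlDecorator {
    static final String SEARCH_COMMANDS = "-+*?()[]\\|$^!";

    static String decorate(String regex) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                span(html, "backslash", c);
                span(html, "quoted", regex.charAt(++i));
            } else if (SEARCH_COMMANDS.indexOf(c) >= 0) {
                span(html, "command", c);
            } else {
                span(html, "ordinary", c);
            }
        }
        return html.toString();
    }

    private static void span(StringBuilder html, String cssClass, char c) {
        html.append("<span class=\"").append(cssClass).append("\">")
            .append(escape(c)).append("</span>");
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    private static String escape(char c) {
        switch (c) {
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '&': return "&amp;";
            default:  return String.valueOf(c);
        }
    }

    public static void main(String[] args) {
        System.out.println(decorate("*[ \\\\\\*<\\r\\nabc]"));
    }
}
```

You would then prepend a style block defining the four classes and hand the result to something like new JEditorPane("text/html", html). JEditorPane's HTML and CSS support is dated, so check that it honours the styles you pick.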

I think regexes would be much easier to comprehend with these visual character category clues. You could take it further with shades of colour to subliminally reflect even finer categories. Ideally this functionality should be built into the IDE (Integrated Development Environment), perhaps as a plug-in.

Have a look at how RegexBuddy highlights regexes to help you proofread. They highlight space, (), commands and literal characters differently.

Regex Composer
Regex Debugger
Regex Tidier
Regex Utility

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
