a program that analyses syntax. It might for example look at a piece of Java
source code and find all the variable names, method names and operators in order
to compile it into JVM byte code, or it might analyse HTML, or your own invented
language. The original LEX/YACC/Bison generated C code. There are now variants
that generate Java code. My personal favourite, based mainly on its accessible
documentation, is JavaCC. People who write parsers have a strange language all
their own. The Parsifal glossary may help. The writers of these tools are
academics, and are not
interested in teaching you anything, just impressing you with how brilliant
their programs are. This means the manuals are almost useless. You have to study
examples, particularly the simple ones, and gradually the manuals will begin to
make sense. Another learning technique is to examine the Java code generated
from some sample grammars. The authors took six years of university courses to
get to their level of parser understanding. Why should they make it any easier
for you?
Roughly what happens is you describe your grammar in some Mickey Mouse syntax.
Then a utility converts that into a Java program that will analyse text
conforming to that grammar. I must admit I am shocked at how ugly the
specification languages are. I would have thought they would be the most
beautiful and regular of all languages, being composed by aficionados of
language analysis.
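To give a feel for the kind of Java program that comes out the other end, here is a hand-written sketch of a recursive descent parser for a tiny arithmetic grammar. It is not the output of JavaCC or any other generator, whose code is far more elaborate. The class name TinyExprParser and the grammar itself are invented for illustration. The grammar is shown informally in the leading comment, not in any real tool's specification syntax.

```java
// Informal grammar (not JavaCC syntax):
//   expr   := term   ( ('+' | '-') term   )*
//   term   := factor ( ('*' | '/') factor )*
//   factor := NUMBER | '(' expr ')'
class TinyExprParser {
    private final String src;
    private int pos = 0;

    TinyExprParser(String src) {
        this.src = src;
    }

    // Parses and evaluates the whole text, rejecting trailing junk.
    static int evaluate(String text) {
        TinyExprParser p = new TinyExprParser(text);
        int value = p.expr();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw p.error("unexpected trailing input");
        }
        return value;
    }

    // One method per grammar rule: the hallmark of recursive descent.
    private int expr() {
        int value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    private int term() {
        int value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    private int factor() {
        if (accept('(')) {
            int value = expr();
            if (!accept(')')) throw error("expected ')'");
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw error("expected a number");
        return Integer.parseInt(src.substring(start, pos));
    }

    // Consumes the given character if it is next, ignoring leading spaces.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 + 3 * (4 - 1)"));
    }
}
```

Note that this parser, like the traditional ones, throws an exception at the first thing it does not understand.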
Java has four simple built-in parsers: java.util.StringTokenizer,
java.io.StreamTokenizer, java.text.BreakIterator
and java.util.regex.Pattern.
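Here is a minimal sketch of each of the four in action. The class and method names (BuiltInParsersDemo, splitOnDelimiters, classify, words, identifiers) are invented for illustration, and the identifier regex is a simplification that only handles ASCII names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuiltInParsersDemo {

    // StringTokenizer: split on a fixed set of delimiter characters.
    static List<String> splitOnDelimiters(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // StreamTokenizer: classify tokens as words, numbers or single characters.
    static List<String> classify(String text) {
        List<String> out = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER:" + st.nval); // nval is always a double
                        break;
                    default:
                        out.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // BreakIterator: locale-aware word boundaries; punctuation and spaces are skipped.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ENGLISH);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    // Pattern: pull out everything that looks like an (ASCII) identifier.
    static List<String> identifiers(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitOnDelimiters("a,b;c", ",;"));
        System.out.println(classify("x = 42 ;"));
        System.out.println(words("Hello, world!"));
        System.out.println(identifiers("total = count + 1"));
    }
}
```

None of these understands nesting or grammar rules. They only chop text into pieces, which is often all you need.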
Java 1.5+ also has a number of XML parsers built in. Check out the DOM,
SAX, XSD, XPath and Schema entries.
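As a small taste of the built-in XML support, here is a sketch that parses a string into a DOM tree and then queries it with XPath. The class name XmlParseDemo, the helper names and the sample document are invented for illustration. Wrapping failures in IllegalArgumentException is just one possible design choice.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlParseDemo {

    // Parse XML text into an in-memory DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Malformed XML or a misconfigured parser ends up here.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Use XPath to fetch the text of the first book's title.
    static String firstTitle(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/library/book[1]/title", parse(xml));
        } catch (javax.xml.xpath.XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>Compilers</title></book></library>";
        System.out.println(parse(xml).getDocumentElement().getTagName());
        System.out.println(firstTitle(xml));
    }
}
```

Like the generated parsers, these insist on well-formed input. A missing closing tag is enough to make parse fail.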
Hand Rolled Parsers
I wrote a number of parsers as part of JDisplay, the tools that pretty
up listings on this website. The problem I faced was that the parsers had to
work with deliberately erroneous code and code fragments.
The traditional parsers are totally unforgiving. They want perfect, complete
programs or data files to parse and give up totally on the first hiccough.
So I wrote my own as finite state automata, with enum constants representing
each state.
Download and have a look at the source.
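To illustrate the general technique only, here is a minimal sketch of a forgiving scanner built as a finite state automaton with an enum for the states. It is not the JDisplay code. The class name TolerantScanner, the state names and the one-letter labels are all invented for illustration. It labels every character of a Java-like fragment as code (c), string (s) or comment (k). It ignores char literals and other details a real tool would need.

```java
class TolerantScanner {

    // One enum constant per state of the automaton.
    enum State {
        CODE, STRING, STRING_ESCAPE, LINE_COMMENT,
        COMMENT_OPENING, BLOCK_COMMENT, BLOCK_COMMENT_STAR
    }

    // Returns one label per input character: c = code, s = string, k = comment.
    static String classify(String text) {
        StringBuilder labels = new StringBuilder(text.length());
        State state = State.CODE;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            char next = i + 1 < text.length() ? text.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (ch == '"') {
                        state = State.STRING;
                        labels.append('s');
                    } else if (ch == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        labels.append('k');
                    } else if (ch == '/' && next == '*') {
                        state = State.COMMENT_OPENING;
                        labels.append('k');
                    } else {
                        labels.append('c');
                    }
                    break;
                case STRING:
                    labels.append('s');
                    if (ch == '\\') {
                        state = State.STRING_ESCAPE;
                    } else if (ch == '"' || ch == '\n') {
                        // A newline ends an unterminated string: recover, don't give up.
                        state = State.CODE;
                    }
                    break;
                case STRING_ESCAPE:
                    labels.append('s'); // the escaped character, whatever it is
                    state = State.STRING;
                    break;
                case LINE_COMMENT:
                    if (ch == '\n') {
                        state = State.CODE;
                        labels.append('c');
                    } else {
                        labels.append('k');
                    }
                    break;
                case COMMENT_OPENING:
                    labels.append('k'); // the '*' of the opening "/*"
                    state = State.BLOCK_COMMENT;
                    break;
                case BLOCK_COMMENT:
                    labels.append('k');
                    if (ch == '*') state = State.BLOCK_COMMENT_STAR;
                    break;
                case BLOCK_COMMENT_STAR:
                    labels.append('k');
                    if (ch == '/') state = State.CODE;
                    else if (ch != '*') state = State.BLOCK_COMMENT;
                    break;
            }
        }
        // Running out of input mid-string or mid-comment is not an error.
        return labels.toString();
    }

    public static void main(String[] args) {
        System.out.println(classify("x=\"a\"; // hi"));
        System.out.println(classify("s = \"oops"));
    }
}
```

The point of the design is that every character gets some label no matter what state the automaton is in, so a missing quote or an unclosed comment just produces a slightly odd colouring rather than a failure.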