TagSoup is a SAX-compliant parser written in Java
that, instead of parsing well-formed or valid XML (Extensible Markup Language),
parses HTML (Hypertext Markup Language) as it is found in the
wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed
for people who have to process this stuff using some semblance of a rational
application design. By providing a SAX (Simple API for XML)
interface, it allows standard XML tools to
be applied to even the worst HTML.
TagSoup also includes a command-line processor that reads
files and can generate either clean HTML
or well-formed XML that is a close approximation to
XHTML (Extensible Hypertext Markup Language).
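
To illustrate what the SAX interface makes possible, here is a minimal sketch
of a handler that collects link targets from a document. The class name
LinkCollector and the method collect are hypothetical names chosen for this
example. The handler depends only on the standard org.xml.sax interfaces, so
the demonstration in main uses the XML parser bundled with the JDK and
well-formed input. With the TagSoup jar on the classpath, an instance of
org.ccil.cowan.tagsoup.Parser, which implements XMLReader, could be passed in
its place to run the same handler over messy HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers the href of every <a> element seen.
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware readers fill in localName; others only set qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Runs the handler over the given markup using any SAX XMLReader.
    public static List<String> collect(XMLReader reader, String markup)
            throws IOException, SAXException {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK. Assumption: with TagSoup available,
        // replace this with new org.ccil.cowan.tagsoup.Parser().
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String markup = "<html><body>"
                + "<a href=\"index.html\">Home</a>"
                + "<p><a href=\"about.html\">About</a></p>"
                + "</body></html>";
        System.out.println(collect(reader, markup));
    }
}
```

Because the handler never refers to a concrete parser class, the choice of
reader is the only thing that changes between strict XML input and HTML found
in the wild.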