Encoding Identification


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

As the world has become a global village, the problem of file encodings has become more acute. Now a file created on one side of the planet may be read on another. It is not obvious which encoding scheme was used. There are hundreds of possibilities.

Unfortunately, the encoding scheme used is not usually embedded as a signature in the document. See encoding identification for a fuller description of the problem.

Your job is to look at the document and make an educated guess at the encoding scheme used to encode it. You might provide a list of guesses in descending order of probability for someone to make the final decision manually.
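Here is a rough sketch of one way to produce such a list, not a finished guesser. It tries to decode the bytes strictly under each candidate encoding and keeps the ones that decode without error. The class name EncodingGuesser, the short candidate list and its ordering (strictest encodings first, as a crude stand-in for probability) are all my assumptions for illustration. A real guesser would need many more candidates and a proper scoring scheme.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class EncodingGuesser {
    // Hypothetical candidate list. The order acts as a crude prior:
    // encodings that reject the most byte sequences come first, because
    // a clean decode under a strict encoding is stronger evidence.
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16LE,
            StandardCharsets.ISO_8859_1);

    // True if every byte sequence in data is legal in the given encoding.
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the surviving candidates, most plausible first.
    static List<Charset> guess(byte[] data) {
        List<Charset> survivors = new ArrayList<>();
        for (Charset cs : CANDIDATES) {
            if (decodesCleanly(data, cs)) {
                survivors.add(cs);
            }
        }
        return survivors;
    }
}
```

Note that ISO-8859-1 accepts any byte sequence at all, so it always survives. That is why it sits at the bottom of the list, and why strict decoding alone can only eliminate candidates, never confirm one.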

There are two parts to the project.

  1. The viewer, which displays either the entire document in a given encoding, or just selected parts of the document that would render differently in different likely encodings. This is a simple text viewer with no HTML (Hypertext Markup Language) rendering. It can strip tags for HTML and XML (eXtensible Markup Language).
  2. The guesser.

The guesser can use the following clues:

The guesser can also guess the language(s) used simply by looking for common but unique words in each language.
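The word-counting idea might look something like the sketch below. It assumes the bytes have already been decoded to a String under some candidate encoding. The class name LanguageGuesser and the tiny word lists are my own placeholders, not a researched vocabulary. You would have to build real lists yourself.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class LanguageGuesser {
    // Hypothetical marker words: common in one language, absent from the others.
    static final Map<String, Set<String>> MARKER_WORDS = Map.of(
            "English", Set.of("the", "and", "with", "this"),
            "French", Set.of("les", "avec", "dans", "pour"),
            "German", Set.of("der", "und", "nicht", "ist"));

    // Counts how many words of the text are marker words of each language.
    static Map<String, Integer> countHits(String text) {
        Map<String, Integer> hits = new HashMap<>();
        // Split on anything that is not a letter.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (Map.Entry<String, Set<String>> entry : MARKER_WORDS.entrySet()) {
                if (entry.getValue().contains(word)) {
                    hits.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return hits;
    }

    // Returns the language with the most hits, or "unknown" if there are none.
    static String guess(String text) {
        String best = "unknown";
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : countHits(text).entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

The hit counts can feed back into the encoding guess. If decoding under one candidate encoding yields plenty of marker words and another yields none, that is a point in favour of the first.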

You might also want to tackle this as a neural net problem, training it with thousands of documents of known encoding.

If you have control over the source of the documents, you can sidestep the problem by embedding the encoding as the first field followed by a line terminator. Better still, settle on UTF-8 or UTF-16BE as your encoding and be done with the problem.
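Embedding the encoding as the first field could be as simple as the following sketch. The layout (the charset name in ASCII, then a single \n, then the body) and the class name LabelledFile are my assumptions. Any unambiguous header convention would do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LabelledFile {
    // Prefixes the encoded text with the charset name and a line terminator.
    static byte[] write(String text, Charset cs) {
        byte[] header = (cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[header.length + body.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(body, 0, out, header.length, body.length);
        return out;
    }

    // Reads the charset name from the first line, then decodes the rest with it.
    static String read(byte[] data) {
        int eol = 0;
        while (eol < data.length && data[eol] != '\n') {
            eol++;
        }
        if (eol == data.length) {
            throw new IllegalArgumentException("no encoding header found");
        }
        String name = new String(data, 0, eol, StandardCharsets.US_ASCII);
        // Charset.forName throws if the name is illegal or unsupported.
        Charset cs = Charset.forName(name);
        return new String(data, eol + 1, data.length - eol - 1, cs);
    }
}
```

The header is always read as ASCII, no matter how the body is encoded, so the reader never has to guess.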

File Format Identification

There is a related, broader problem: identifying what format a file is in. In general, there is no way to tell what format a file is in, or what program can process it. You must simply remember or guess from the extension. Unfortunately, many extensions like *.doc give little clue. This is a royal mess and one of the side effects of the fact that males designed most of computing. I can’t imagine such a thing would have happened if Martha Stewart had had a hand in it. Files would have been automatically and neatly labeled with the format, creating program and encoding.

Some files identify their format with a signature in the first few bytes.

File Signatures

type      signature
*.class   CAFEBABE in hex
*.gif     GIF87a or GIF89a
*.jpg     FFD8 in hex
*.png     89504E470D0A1A0A in hex

You can collect these signatures and use them to guess what you have. Unfortunately, there is no central registry of signatures and most file formats don’t have them. You will have to discover them yourself, using a hex viewer, for the sorts of file in your universe.
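A minimal sketch of signature matching, using only the four signatures in the table above. The class name SignatureSniffer and its table layout are mine, for illustration only. Your own table would grow as you discover more signatures.

```java
import java.util.Arrays;

class SignatureSniffer {
    // Each row is {type, signature in hex}. Only the signatures listed above.
    static final String[][] TABLE = {
            {"*.class", "CAFEBABE"},
            {"*.gif", "474946383761"},        // ASCII "GIF87a"
            {"*.gif", "474946383961"},        // ASCII "GIF89a"
            {"*.jpg", "FFD8"},
            {"*.png", "89504E470D0A1A0A"},
    };

    // Converts a string of hex digit pairs to the bytes they represent.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Returns the type whose signature matches the start of the file, or "unknown".
    static String identify(byte[] head) {
        for (String[] row : TABLE) {
            byte[] magic = fromHex(row[1]);
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return row[0];
            }
        }
        return "unknown";
    }
}
```

You need only pass in the first dozen or so bytes of the file. A match is still just a guess, since a short signature such as FFD8 can turn up by accident at the start of unrelated data.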

This page is posted on the web at:

http://mindprod.com/project/encodingidentification.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.