Universal Data Format
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful to implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, and the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully
define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
You have probably used a number of universal data formats before and found them
XML (Extensible Markup Language)
very bulky, only handles trees. Format/validation
loosely linked with several schemes.
RTF (Rich Text Format)
bulky, does not preserve structure. Only works for
a lot of geeky work to set up.
PDF (Portable Document Format)
very difficult to extract clean, raw tabular data.
You want a data descriptor language that specifies the data well enough that:
- You can extract just the data you want using standard utilities. You should not
have to write custom code.
- A viewer for the data comes out in the wash, generated from the descriptor, comparable
in quality to one you would custom write.
- An editor for the file comes out in the wash, generated from the descriptor, comparable in
quality to one you would custom write. It uses data about types, ranges and inter-field
assertions when letting you edit.
- You can read/write the data with an API (Application Programming Interface)
that delivers the data as Java objects ready to go without parsing.
- A scheme for embedding a reference to the corresponding format in the data file,
with local caching of the descriptions.
- Custom compression plug-ins to do things like convert tabular data to sorted order
with deltas instead of absolute values.
- Ability to specify types so you don’t have to spell out all the detail on
every field when fields are similar.
- Includes short and long descriptions on each field to help others understand
precisely how it is defined.
- Track units of measure.
- Dates and timestamps have only one format internally. We standardise as much as
possible so that there is no special work needed to merge data from different
- Programmatic API to the descriptions as well so smart programs can handle
- You might steal a plank from the OpenType font format and use existing formats with a
wrapper to stitch together what appears to be a unified format.
- We are not describing arbitrary data files, just ones we produce. We want to be able
to handle the sorts of data people are currently handling, but we don’t have to
handle them the same way. We don’t need to be quite as efficient. We are
interested mainly in data interchange.
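The compression plug-in idea in the list above (sorted order with deltas instead of absolute values) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name DeltaCodec and its two methods are hypothetical, the column is assumed to hold long values, and sorting discards the original row order, so it only suits data where that order carries no meaning.

```java
import java.util.Arrays;

/** Hypothetical plug-in sketch: stores a numeric column as sorted gaps rather than absolute values. */
final class DeltaCodec {
    private DeltaCodec() {
    }

    /**
     * Sorts a copy of the column, then replaces each value with its
     * difference from the previous one. The first delta is measured from zero,
     * so it equals the smallest value itself.
     */
    static long[] encode(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        long[] deltas = new long[sorted.length];
        long previous = 0;
        for (int i = 0; i < sorted.length; i++) {
            // Wrap-around on overflow is harmless: decode wraps back the same way.
            deltas[i] = sorted[i] - previous;
            previous = sorted[i];
        }
        return deltas;
    }

    /** Rebuilds the sorted absolute values by keeping a running sum of the deltas. */
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }
}
```

For the column 1005, 1000, 1002 the encoder produces 1000, 2, 3. The small gaps are what a later general-purpose compressor, or a variable-length integer encoding, could then store in fewer bytes. That second stage is not shown here.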
What You Need To Create
You need to produce:
- Grammar for specifying the data descriptor as a text file.
- Specification of the binary format of the data files we work with.
- Editor for the data descriptor.
- API for building data descriptors.
- API to read/write data files.
- Generator for a viewer, given a description.
- Generator for an editor/validator, given a description.
- Database of sales and other taxes of a country, with historical record.
- Database of postal code/zip code lookup
- Database of phone numbers, names, addresses
- Currency exchange
- Earthquake records
- History of documents posted at whitehouse.gov
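To make the descriptor idea concrete for one of the data sets above, earthquake records, here is a rough Java sketch of what a programmatic descriptor with units, ranges and short descriptions might look like, and how a generated editor could use it to validate input. Everything in it is an assumption made for illustration: the class names FieldDescriptor and RecordDescriptor, the restriction to numeric fields, and the way a record is passed in as a map. It is one possible starting point, not a specification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical description of one numeric field: name, unit of measure, allowed range, short description. */
final class FieldDescriptor {
    final String name;
    final String unit;
    final double min;
    final double max;
    final String shortDescription;

    FieldDescriptor(String name, String unit, double min, double max, String shortDescription) {
        this.name = name;
        this.unit = unit;
        this.min = min;
        this.max = max;
        this.shortDescription = shortDescription;
    }

    /** True when the value lies inside the inclusive range; NaN is always rejected. */
    boolean accepts(double value) {
        return value >= min && value <= max;
    }
}

/** Hypothetical description of a whole record: an ordered set of field descriptors. */
final class RecordDescriptor {
    private final Map<String, FieldDescriptor> fields = new LinkedHashMap<>();

    /** Adds a field and returns this descriptor so calls can be chained. */
    RecordDescriptor add(FieldDescriptor field) {
        fields.put(field.name, field);
        return this;
    }

    /** Returns one message per problem, in field order; an empty list means the record is valid. */
    List<String> validate(Map<String, Double> record) {
        List<String> problems = new ArrayList<>();
        for (FieldDescriptor field : fields.values()) {
            Double value = record.get(field.name);
            if (value == null) {
                problems.add(field.name + ": missing");
            } else if (!field.accepts(value)) {
                problems.add(field.name + ": " + value + " " + field.unit
                        + " outside [" + field.min + ", " + field.max + "]");
            }
        }
        return problems;
    }
}
```

A descriptor for earthquake records might then be built with calls such as new RecordDescriptor().add(new FieldDescriptor("magnitude", "Mw", 0.0, 10.0, "moment magnitude")), where the field name, unit and range are made up for the example. A real design would also need non-numeric types, reusable named types, inter-field assertions and long descriptions, all of which this sketch leaves out.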