Delta Creator

Delta Creator


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project, and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

Let us say you send somebody a file, then you change the original file. Instead of sending the whole new file, you can send just the changes. How do you detect the changes? You could try Funduc Patch. The catch is, often the delta files are almost as big as the original, even if all you did was rearrange the order of the paragraphs of a word processing file, or change a few lines in a compiled exe file.

I suggest inventing a generic way to do this that works with files other than just text files. Current schemes are not very bright and end up creating delta files almost as big as the new file. They are not good at noticing that text has simply been reordered.

Here’s how you might tackle it.

First you have to determine if the file really changed, or just suffered an Oscar Wilde change, e.g. he put in a comma one day into one of his poems, then the next took it out again. So here are the basic tools you have to work with to tell:

  1. Examine the last change date of the file the OS (Operating System) maintains. In Java you can access this timestamp with File.lastModified(). See last modified date.
  2. Check the file size. Most changes will trigger a file size change.
  3. Compute a digest, e.g. a 32-bit Adlerian checksum. If it is the same for both the old and new files chances are nothing really changed. The odds are quite astronomically small you could have a change and not detect it, but it is theoretically possible. You could also use cryptographic digests like MD5 (Message Digest algorithm 5) and SHA-1 (Secure Hash Algorithm 1) but they are much slower to compute even though they make it all but impossible to miss a change. The advantage of this technique is you don’t need to keep the old files around, just compact digests. See digest.
  4. Compare the old and new files with equals() after reading them into RAM (Random Access Memory), possibly doing the I/O in large chunks. The FIleTransfer class for example chunk reading code.
I talk a little more about these issues in the automatic file updater student project.

One you have decide the file actually did change, you can send the changes.

For each file type you write a chunker. It divides the file into chunks. For a text file, it may divide at each newline character. For a MSWord file it may divide at each paragraph. For a FORTH BLK file, it might divide at each 1024 bytes. For a Btree file it chunks out each record…

Then you compute a hash on each chunk of the original file and index it in a Hashtable along with a reference offset and length into the old file. You don’t put the text itself in the Hashtable.

To compose the delta file, you look at each chunk in the new file. You compute its hash, and look it up in your Hashtable. If you find a match, you make a note that the chunk can be had from offset/length of the old file. If you don’t find a match, you have to insert the chunk from the new file into your delta file. Very often you will get a sequential run of matching chunks from the old file. In your notes, you can collapse these entries into one big offset/length reference.

Creating delta files of *.exe and *.com files requires some special handling. Experiment by adding a single instruction to a program and notice how that tiny change ripples and cause changes throughout the rest of the file. The idea is to send only the fact that one extra instruction was added, and somehow generate all the rippling side effects instead of transmitting them.

Once you have these low level tools in place, set up a scheme so that customers are automatically kept up to date from a master copy, using the Internet. The scheme should recover even if customer should accidentally delete files, corrupt files, or restore out of date files.

This project is a subset of a more general problem of distributing file and program updates. See the Automatic Updater project.


This page is posted
on the web at:

http://mindprod.com/project/deltacreator.html

Optional Replicator mirror
of mindprod.com
on local hard disk J:

J:\mindprod\project\deltacreator.html
logo
Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy. Please feel free to link to this page without explicit permission.
Blog
IP:[65.110.21.43]
Your face IP:[23.20.220.79]
You are visitor number