Atomic FTP Uploader
by Roedy Green ©1996-2008 Canadian Mind Products
This essay is about a suggested student project in Java programming. It gives a
rough overview of how the project might work. It does not describe an actual
complete program. I have no source, object code, specifications, file layouts or
anything else useful to implementing this project. Everything I have to say to
help you with this project is written below. I am not prepared to help you
implement it; I have too many other projects of my own.
I do contract work for a living, which could include writing a program such as
this. However, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
I added a new section on implementation details
to this essay on 2005-07-08.
The Problem
FTP software is notoriously difficult to use and notoriously unreliable. I have
tried dozens of packages. FTP clients are all utterly hopeless at the basic task
of keeping a server website identical to the client side. They are really
dinosaurs left over from the days when people downloaded files over dial up
phone lines with FTP.
What are the problems?
- The software gets confused and fails to upload or delete files on the server, or
uploads them when it does not need to.
- When someone out on the net is reading a file on the server, that locks it from
being updated, and bombs the update run.
- If I make a massive set of changes to the website, it may take hours to upload.
During that time people out on the web will see an incompatible mixture of old
and new files. I don’t want the new files to be visible until they are all ready.
Uploads should be atomic.
- Server and workstation clocks may be out of sync. This should not confuse the
software. Usually the workstation is the one in trouble. Its clock should be
reset from an atomic clock with software similar to SetClock.
Your software should work even when the server’s clock is badly out of whack or
in a different time zone.
- You upload entire files even if only a few bytes in them have changed.
Ideally you would like to do this without running any software on the server,
since usually ISPs will not let you, or will charge you considerably more if you
do run your own software.
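One way to stay immune to clock and time zone differences is to ignore timestamps altogether and compare file contents. The sketch below is only an illustration, not part of any existing program: it assumes the uploader keeps a local manifest of SHA-256 digests recorded after the last successful upload. The class name ChangeDetector and its methods are hypothetical. A manifest like this only records what the uploader believes is on the server, so a real tool would still have to compare it against a remote directory listing from time to time.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: decides what to upload by content digest, never by timestamp. */
final class ChangeDetector {

    /** Returns the SHA-256 digest of the given bytes as 64 lowercase hex digits. */
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /**
     * @param lastUploaded relative path -> digest recorded after the last successful upload
     * @param current      relative path -> digest of the file as it is now on the workstation
     * @return paths that are new or whose content differs, in sorted order
     */
    static Set<String> changed(Map<String, String> lastUploaded, Map<String, String> current) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!e.getValue().equals(lastUploaded.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> last = new TreeMap<>();
        last.put("index.html", digest("old page".getBytes(StandardCharsets.UTF_8)));
        last.put("logo.png", digest("image".getBytes(StandardCharsets.UTF_8)));

        Map<String, String> now = new TreeMap<>();
        now.put("index.html", digest("new page".getBytes(StandardCharsets.UTF_8)));
        now.put("logo.png", digest("image".getBytes(StandardCharsets.UTF_8)));
        now.put("faq.html", digest("faq".getBytes(StandardCharsets.UTF_8)));

        System.out.println(changed(last, now)); // prints [faq.html, index.html]
    }
}
```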
Approaches
- Upload files to a different directory branch. When they are all ready, delete
the master, and rename the uploaded files. It might be possible to do this
without server-side code since FTP supports a rename function. I know of no
product that does this. It would be very useful since a website can fail when
you have half old and half new files being served to the public, or old files
referencing images in the process of being deleted or renamed.
- Use the Replicator as a core.
You upload Replicator-style zips to the server, and only once they are all
uploaded, unzip them. If a target file is busy, you get on with unzipping the
next file and put the busy one on a queue to handle later.
- Use the Subversion version control as a core. From the workstation’s point of
view, it is just like checking in changes to a set of source files to version
control. How do you get the files in shape for the web server?
- You write a web server that asks the Subversion server to reconstruct the file
each time.
- You persuade Subversion to export/publish flat files after an update.
- You use a client version of Subversion to extract the changed files, but run it
on the server.
Subversion handles the problems of atomicity, and picking up after a disconnect
in the middle of an upload. It also is smart about only uploading changes, thus
saving bandwidth.
- Look into Rsync for site mirroring. You can use the --delay-updates option for
reasonable atomicity. It has the usual Unix utility problems: novice-unfriendly
documentation and the need to tweak and compile source code for your particular
server. Perhaps you might write a wrapper to hide Rsync’s installation
complexities.
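The first approach above, staging the files and then renaming them into place, might reduce to a short list of standard FTP commands: DELE to delete, and the RNFR/RNTO pair to rename. The following is a minimal sketch of such a plan, not a tested uploader. The class name PublishPlan and the directory names /staging and /www are made up for illustration. It issues DELE before each rename on the assumption that the server may refuse to rename over an existing file; servers differ on this. The paired flag chooses between handling one file at a time and issuing all deletes before all renames.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner: turns a list of staged files into FTP publish commands. */
final class PublishPlan {

    /**
     * @param files      paths relative to both trees, e.g. "index.html"
     * @param stagingDir directory holding the freshly uploaded copies (assumed name)
     * @param liveDir    directory the web server serves (assumed name)
     * @param paired     true: delete and rename one file at a time;
     *                   false: all deletes first, then all renames
     */
    static List<String> commands(List<String> files, String stagingDir,
                                 String liveDir, boolean paired) {
        List<String> plan = new ArrayList<>();
        List<String> deletes = new ArrayList<>();
        List<String> renames = new ArrayList<>();
        for (String f : files) {
            // In paired mode everything goes straight into the plan, in file order.
            List<String> d = paired ? plan : deletes;
            List<String> r = paired ? plan : renames;
            // DELE gets an error reply for a brand-new file; the caller should tolerate that.
            d.add("DELE " + liveDir + "/" + f);
            r.add("RNFR " + stagingDir + "/" + f);
            r.add("RNTO " + liveDir + "/" + f);
        }
        // Both lists are empty in paired mode, so these calls only matter otherwise.
        plan.addAll(deletes);
        plan.addAll(renames);
        return plan;
    }

    public static void main(String[] args) {
        List<String> files = List.of("index.html", "faq.html");
        for (String c : commands(files, "/staging", "/www", true)) {
            System.out.println(c);
        }
    }
}
```

Nothing here touches the network. The point is only that the publish step is a handful of cheap commands, issued after the slow uploads have finished.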
Implementation Details
I would like to see a product specialized for FTP uploads that runs unattended.
It would work in conjunction with my Replicator
software for automatically distributing and keeping large file sets up to date
without needing any server-side software.
However, you don’t have to know a thing about the Replicator to understand this
project. I am just telling you I have a couple of paying customers ready for you
if you decide to write this.
What I need is a streamlined FTP upload-only program designed specifically to
upload website files to a server, with the following features:
- There is no human operator, just a script running it saying which directories to
upload to which websites.
- It must try heroically to do the upload, retrying later any file it has trouble
with. Often a file temporarily cannot be updated because someone is downloading
it.
- It should only upload files if they have changed.
- If someone, behind its back, adds, updates or deletes files on the website,
uploads old files, creates or deletes directories, or restores the website from
backup, this proposed uploader should notice and solve the problem all on its
own without human help. It also deletes files that no longer exist in the master
tree. The script defines rules to prevent it from deleting files on the website
that don’t belong to it, for example a set of wildcards (*.cnt) to tell it which
files to leave alone.
- It should set the ERRORLEVEL so that the program or bat file that spawned it can
tell how successful it was.
- Design the program to expect disconnections. Just pick up and carry on where you
left off. Presume a permanent Internet connection rather than dialup for
simplicity.
- I would like updates to be almost atomic. By that I mean the outside world
viewing my uploaded website sees no changes until the updated files have all
been successfully uploaded. Only then are they revealed, within a few seconds,
by a set of quick deletes and renames.
I want the atomicity because it can take a long time to update the website if
there have been global changes. For an hour or two the entire website is half
working under the old scheme and half the new and nothing works properly.
Further, my webserver is not very clever. It won’t let me upload a file if anyone
out there on the web is reading it. With NetLoad, which I use now, that aborts
the entire run, which forces me to do my big uploads late at night when traffic
is lower.
Doing it with renames makes me much less vulnerable to a file being locked in
use and unchangeable. Further, this way, I take the file out of commission from
the outside world for only a second or two, not for the entire upload.
- You can use two different strategies for doing the deletes and renames.
- Do all the deletes and then all the renames. This takes files out of commission
longer, but never exposes the website in an inconsistent state, just a lot of
file-not-found errors.
- When all the new versions are uploaded, delete each old file and rename its new
version, one pair at a time. This is not quite as atomic, but it leaves webpages
out of commission less time.
- The strategy should be configurable, leaving room for even better ones you may
think of.
- In either case, you have to deal with a stubbornly locked file, one that is
constantly busy being downloaded, or one that the OS wrongly thinks is locked,
and that won’t clear for hours. You have to handle it eventually, even if that
means tomorrow.
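Several of the requirements above (protecting foreign files with wildcards, retrying locked files, and reporting success to the calling script) can be sketched without any FTP code at all. The example below is one possible shape, under stated assumptions: the class UploadRun and its methods are hypothetical, the real transfer is hidden behind a Predicate that returns false when a file is locked, and protection wildcards are matched against the file name only, using Java's standard glob syntax.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical core of an unattended run: plan deletes, retry stubborn files. */
final class UploadRun {

    /** Remote files absent from the master tree, minus those matching a protected glob. */
    static Set<String> planDeletes(Set<String> master, Set<String> remote,
                                   List<String> protectedGlobs) {
        List<PathMatcher> matchers = new ArrayList<>();
        for (String glob : protectedGlobs) {
            matchers.add(FileSystems.getDefault().getPathMatcher("glob:" + glob));
        }
        Set<String> deletes = new TreeSet<>();
        for (String path : remote) {
            if (master.contains(path)) {
                continue; // still wanted
            }
            // Match the file name only, so "*.cnt" protects counters in any directory.
            boolean isProtected = false;
            for (PathMatcher m : matchers) {
                if (m.matches(Paths.get(path).getFileName())) {
                    isProtected = true;
                    break;
                }
            }
            if (!isProtected) {
                deletes.add(path);
            }
        }
        return deletes;
    }

    /**
     * Tries every file, putting failures back on the queue, for at most maxPasses passes.
     * A real uploader would also sleep between passes.
     *
     * @param attempt hypothetical hook that does the real transfer; false means "locked, try later"
     * @return number of files still not uploaded, usable as the process exit status
     */
    static int uploadAll(List<String> files, Predicate<String> attempt, int maxPasses) {
        Deque<String> pending = new ArrayDeque<>(files);
        for (int pass = 0; pass < maxPasses && !pending.isEmpty(); pass++) {
            int count = pending.size();
            for (int i = 0; i < count; i++) {
                String f = pending.removeFirst();
                if (!attempt.test(f)) {
                    pending.addLast(f); // handle it later
                }
            }
        }
        return pending.size();
    }

    public static void main(String[] args) {
        Set<String> doomed = planDeletes(Set.of("index.html"),
                Set.of("index.html", "old.html", "stats/hits.cnt"), List.of("*.cnt"));
        System.out.println("delete: " + doomed); // prints delete: [old.html]

        int[] locked = {2}; // pretend busy.html is locked for the first two passes
        int failures = uploadAll(List.of("a.html", "busy.html"),
                f -> !f.equals("busy.html") || locked[0]-- <= 0, 5);
        System.out.println("files not uploaded: " + failures); // prints files not uploaded: 0
    }
}
```

A real program would pass the failure count to System.exit, which is what a bat file sees as ERRORLEVEL.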
I have been so frustrated with GUI-style FTP programs for uploading websites,
that I put on my to do list the task of writing my own implementation of this
student project. It is a much simpler beast than something like FTP
Voyager. You might use a GUI like FTP Voyager to compose the connection
information and test the configurations out so that you don’t have to compose
that stuff from scratch in your scripts. Hopefully you can find a companion GUI
that will export that information in easy-to-use format.
You can get started with Peter van der Linden’s little LinLyn
FTP class and by watching the conversations back and forth between a GUI-style
FTP client and an FTP server during an upload. It is much simpler than you might
imagine.
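To give a feel for how simple those conversations are: every server reply starts with a three-digit code. A reply that spans several lines puts a hyphen after the code on its first line and repeats the code followed by a space on its last. The first digit tells you the outcome. Here is a minimal sketch of a reply reader along those lines. The class name FtpReply is hypothetical, the sample replies are canned text rather than a capture from a real server, and a real client would read from the control socket instead of a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical value class for one complete FTP server reply. */
final class FtpReply {
    final int code;
    final String text;

    FtpReply(int code, String text) {
        this.code = code;
        this.text = text;
    }

    /** Reads one reply, following a multi-line reply ("230-...") to its last line ("230 ..."). */
    static FtpReply read(BufferedReader in) throws IOException {
        String line = in.readLine();
        if (line == null || line.length() < 3) {
            throw new IOException("connection closed or malformed reply");
        }
        String codeText = line.substring(0, 3);
        int code;
        try {
            code = Integer.parseInt(codeText);
        } catch (NumberFormatException e) {
            throw new IOException("reply does not start with a numeric code: " + line);
        }
        StringBuilder text = new StringBuilder(line);
        boolean multiLine = line.length() > 3 && line.charAt(3) == '-';
        while (multiLine) {
            line = in.readLine();
            if (line == null) {
                throw new IOException("connection closed in mid-reply");
            }
            text.append('\n').append(line);
            if (line.startsWith(codeText + " ")) {
                break; // final line of the multi-line reply
            }
        }
        return new FtpReply(code, text.toString());
    }

    /** 2xx: the command completed. */
    boolean isPositiveCompletion() { return code / 100 == 2; }

    /** 3xx: the server awaits the next command of a sequence, e.g. RNTO after RNFR. */
    boolean isPositiveIntermediate() { return code / 100 == 3; }

    /** 4xx: temporary failure such as a busy file; worth retrying later. */
    boolean isTransientFailure() { return code / 100 == 4; }

    public static void main(String[] args) throws IOException {
        String canned = "230-Welcome\r\n quota ok\r\n230 Logged in\r\n"
                + "350 Ready for RNTO\r\n"
                + "450 File busy\r\n";
        BufferedReader in = new BufferedReader(new StringReader(canned));
        for (int i = 0; i < 3; i++) {
            FtpReply r = read(in);
            System.out.println(r.code + " retry later? " + r.isTransientFailure());
        }
    }
}
```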