Atomic FTP Uploader
by Roedy Green ©1996-2009 Canadian Mind Products
This essay does not describe an existing computer program, just one that should exist. It is about a suggested student
project in Java programming and gives a rough overview of how it might work. I have no source, object, specifications,
file layouts or anything else useful to implementing this project. Everything I have to say to help you with this
project is written below. I am not prepared to help you implement it; I have too many other projects of my own.
I do contract work for a living, which could include writing a program such as
this. However, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and
to keep all the profits from your endeavor.
I added a new section on implementation details to this essay on 2005-07-08.
The Problem
FTP software is notoriously difficult to use and notoriously unreliable. I have tried dozens of packages. FTP clients
are all utterly hopeless at the basic task of keeping a server website identical to the client side. They are really
dinosaurs left over from the days when people downloaded files over dial-up phone lines with FTP.
What are the problems?
- The software gets confused and fails to upload or delete files on the server, or uploads them when it does not need to.
- When someone out on the net is reading a file on the server, that locks it from being updated, and bombs the update run.
- If I make a massive set of changes to the website, it may take hours to upload. During that time people out on the web
will see an incompatible mixture of old and new files. I don’t want the new files to be visible until they are all
ready. Uploads should be atomic.
- Server and workstation clocks may be out of sync. This should not confuse the software. Usually the workstation is the
one in trouble. Its clock should be reset from an atomic clock with software similar to SetClock.
Your software should work even when the server’s clock is badly out of whack or in a different time zone.
- You upload entire files even if only a few bytes in them have changed.
Ideally you would like to do this without running any software on the server, since usually ISPs will not let you, or
will charge you considerably more if you do run your own software.
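One way to sidestep the clock problem without any server-side software is to ignore timestamps altogether and compare
content digests against a manifest kept on the workstation. What follows is only a minimal sketch of that idea. The
class name ManifestDiff, its method names and the idea of a manifest recording what was last uploaded are all my
assumptions for illustration, not part of any existing product. A manifest alone cannot notice changes made on the
server behind its back, so a real program would still have to check it against a server directory listing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: decides what to upload by comparing content digests, not timestamps. */
final class ManifestDiff {

    /** Hex SHA-256 of a file's bytes. The digest depends only on content, never on any clock. */
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Paths that are new, or whose digest differs from the one recorded at the last upload. */
    static List<String> toUpload(Map<String, String> lastUploaded, Map<String, String> current) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(current).entrySet()) {
            if (!e.getValue().equals(lastUploaded.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Paths that were uploaded before but no longer exist in the master tree. */
    static List<String> toDelete(Map<String, String> lastUploaded, Map<String, String> current) {
        List<String> result = new ArrayList<>();
        for (String path : new TreeMap<>(lastUploaded).keySet()) {
            if (!current.containsKey(path)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up file names and contents, standing in for a real directory scan.
        Map<String, String> last = Map.of(
                "index.html", digest("old home".getBytes(StandardCharsets.UTF_8)),
                "about.html", digest("about".getBytes(StandardCharsets.UTF_8)),
                "gone.html", digest("gone".getBytes(StandardCharsets.UTF_8)));
        Map<String, String> now = Map.of(
                "index.html", digest("new home".getBytes(StandardCharsets.UTF_8)),
                "about.html", digest("about".getBytes(StandardCharsets.UTF_8)),
                "fresh.html", digest("fresh".getBytes(StandardCharsets.UTF_8)));
        System.out.println("upload: " + toUpload(last, now)); // upload: [fresh.html, index.html]
        System.out.println("delete: " + toDelete(last, now)); // delete: [gone.html]
    }
}
```

Because only the digests are compared, the result is the same whatever time zone the server is in and however far its
clock has drifted.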
Approaches
- Upload files to a different directory branch. When they are all ready, delete the master, and rename the uploaded files.
It might be possible to do this without server-side code since FTP supports a rename function. I know of no product that
does this. It would be very useful since a website can fail when you have half old and half new files being served to
the public, or old files referencing images in the process of being deleted or renamed.
- Use the Replicator as a core. You upload replicator-style
zips to the server, and only once they are all uploaded, unzip them. If a file is busy, you get on with unzipping the
next one and put the busy file on a queue to handle later.
- Use the Subversion version control as a core. From the workstation’s point of view, it is just like checking in
changes to a set of source files to version control. How do you get the files in shape for the web server?
- You write a web server that asks the Subversion server to reconstruct the file each time.
- You persuade Subversion to export/publish flat files after an update.
- You use a client version of Subversion to extract the changed files, but run it on the server.
Subversion handles the problems of atomicity, and picking up after a disconnect in the middle of an upload. It also is
smart about only uploading changes, thus saving bandwidth.
- Look into Rsync for site mirroring. You can use the --delay-updates
option for reasonable atomicity. It has the usual Unix utility problems: novice-unfriendly documentation and the need
to tweak and compile source code for your particular server. Perhaps you might write a wrapper to hide Rsync’s
installation complexities.
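To make the first approach above more concrete, here is a rough sketch of the FTP control commands it would issue. STOR,
DELE, RNFR and RNTO are standard FTP commands; everything else (the class name StagedUploadPlan, the /site and
/site.new directory names, the idea of passing in the set of files already live) is my assumption for illustration.
The sketch only builds the command list. It does not open a connection, and it leaves out the data-connection setup
that must precede each STOR and the MKD commands needed for new directories. It also assumes the server permits a
rename from one directory branch to another, which you would have to verify on your own host.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical planner for the "upload elsewhere, then rename" approach. */
final class StagedUploadPlan {

    /** Maps a site-relative path to its place under the staging branch. */
    static String staged(String stagingRoot, String relativePath) {
        return stagingRoot + "/" + relativePath;
    }

    /** Phase 1: store every changed file under the staging branch. Live files are untouched. */
    static List<String> uploadPhase(String stagingRoot, List<String> changed) {
        List<String> commands = new ArrayList<>();
        for (String path : changed) {
            commands.add("STOR " + staged(stagingRoot, path));
        }
        return commands;
    }

    /** Phase 2: to be run only once phase 1 has fully succeeded. */
    static List<String> revealPhase(String liveRoot, String stagingRoot,
                                    List<String> changed, Set<String> alreadyLive) {
        List<String> commands = new ArrayList<>();
        for (String path : changed) {
            // Brand-new files have no old version to delete.
            if (alreadyLive.contains(path)) {
                commands.add("DELE " + liveRoot + "/" + path);
            }
            // A rename is always the pair RNFR (rename from) then RNTO (rename to).
            commands.add("RNFR " + staged(stagingRoot, path));
            commands.add("RNTO " + liveRoot + "/" + path);
        }
        return commands;
    }

    public static void main(String[] args) {
        List<String> changed = List.of("index.html", "new.html");
        uploadPhase("/site.new", changed).forEach(System.out::println);
        revealPhase("/site", "/site.new", changed, Set.of("index.html")).forEach(System.out::println);
    }
}
```

The slow part, the STOR commands, happens entirely out of public view. The public only sees the effect of the short
second phase.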
Implementation Details
I would like to see a product specialized for FTP uploads that runs unattended. It would work in conjunction with my Replicator
software for automatically distributing and keeping large file sets up to date without needing any server-side software.
However, you don’t have to know a thing about the Replicator to understand this project. I am just telling you I
have a couple of paying customers ready for you if you decide to write this.
What I need is a streamlined FTP upload-only program designed specifically to upload website files to a server, with the
following features:
- There is no human operator, just a script running it saying which directories to upload to which websites.
- It must try heroically to do the upload, redoing any file it has trouble with later. Often files temporarily cannot be
updated because someone is downloading them.
- It should only upload files if they have changed.
- If someone deletes, adds or updates files, uploads old files, creates or deletes directories, or restores the website
from backup behind its back, this proposed uploader should notice and solve the problem all on
its own without human help. It also deletes files that no longer exist in the master tree. The script defines rules to
prevent it from deleting files from the website that don’t belong to it, e.g. a set of wildcards (*.cnt)
to tell it which files to leave alone.
- It should set the ERRORLEVEL so that the program or bat file that spawned it can tell how successful it was.
- Design the program to expect disconnections. Just pick up and carry on where you left off. Presume a permanent Internet
connection rather than dialup for simplicity.
- I would like updates to be almost atomic. By that I mean the outside world viewing my
uploaded website sees no changes until the updated files have all been successfully uploaded. Only then are
they revealed, within a few seconds, by a set of quick deletes and renames.
I want the atomicity because it can take a long time to update the website if there have been global changes. For an
hour or two the entire website is half working under the old scheme and half the new and nothing works properly.
Further, my webserver is not very clever. It won’t let me upload a file if anyone out there on the web is reading
it. With NetLoad, which I use now, that aborts the entire run, which forces me to do my big uploads late at night when
traffic is lower.
Doing it with renames makes me much less vulnerable to a file being locked in use and unchangeable. Further, this way, I
take the file out of commission from the outside world for only a second or two, not for the entire upload.
- You can use two different strategies for doing the deletes and renames.
- Do all the deletes and then all the renames. This takes files out of commission longer, but never exposes the website
in an inconsistent state, just a lot of “file not found” errors.
- When all the new versions are uploaded, delete each old file and rename its replacement, one pair at a time. This is
not quite as atomic, but it leaves webpages out of commission less time.
- The strategy should be configurable, unless you can think of an even better one.
- In either case, you have to deal with a stubbornly locked file: one that is constantly busy being downloaded, or one
that the OS wrongly thinks is locked, and that won’t clear for hours. You have to handle it eventually, even if that
means tomorrow.
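The "try heroically" and ERRORLEVEL requirements in the list above might be sketched roughly as follows: a file that
cannot be handled goes to the back of a queue and is tried again on a later pass, and whatever is still stuck at the
end decides the exit code. The names RetryQueue, process and exitCode, the pass limit, and the choice of 0 for success
and 1 for leftovers are all my assumptions. The attempt itself is passed in as a predicate, so the sketch runs without
a server; in a real uploader it would perform the FTP delete and rename.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical retry loop: busy files are deferred, not allowed to abort the run. */
final class RetryQueue {

    /**
     * Tries every path once per pass. A path whose attempt fails goes to the back of the queue.
     * Returns the paths still failing after maxPasses passes.
     */
    static List<String> process(List<String> paths, Predicate<String> attempt, int maxPasses) {
        Deque<String> pending = new ArrayDeque<>(paths);
        for (int pass = 0; pass < maxPasses && !pending.isEmpty(); pass++) {
            int count = pending.size();
            for (int i = 0; i < count; i++) {
                String path = pending.removeFirst();
                if (!attempt.test(path)) {
                    pending.addLast(path); // handle it later, after everything else
                }
            }
            // A real uploader would sleep here before the next pass to give locks time to clear.
        }
        return new ArrayList<>(pending);
    }

    /** Value for the spawning program or bat file to test: 0 means everything was handled. */
    static int exitCode(List<String> stillFailing) {
        return stillFailing.isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        // Simulated server: busy.html is locked for the first two attempts, then frees up.
        int[] locksLeft = {2};
        Predicate<String> attempt = path -> {
            if (path.equals("busy.html") && locksLeft[0] > 0) {
                locksLeft[0]--;
                return false;
            }
            return true;
        };
        List<String> stuck = process(List.of("index.html", "busy.html", "about.html"), attempt, 5);
        System.out.println("still stuck: " + stuck);
        // A real program would end with System.exit(exitCode(stuck)) to set the ERRORLEVEL.
        System.out.println("exit code: " + exitCode(stuck));
    }
}
```

Files left over after the last pass could be written to a file and picked up by the next scheduled run, which is one
way of handling them "even if tomorrow".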
I have been so frustrated with GUI-style FTP programs for uploading websites that I put on my to-do list the task of
writing my own implementation of this student project. It is a much simpler beast than something like FTP
Voyager. You might use a GUI like FTP Voyager to compose the connection information and test the configurations out
so that you don’t have to compose that stuff from scratch in your scripts. Hopefully you can find a companion GUI
that will export that information in easy-to-use format.
You can get started with Peter van der Linden’s little LinLyn
FTP class and by watching the conversations back and forth between a GUI-style FTP client and an FTP server during
an upload. It is much simpler than you might imagine.
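When you watch such a conversation, you will see that every server reply starts with a three-digit code, and the first
digit tells you most of what an unattended program needs to know: 2 means success, 3 means the server is waiting for
the next command (as after RNFR), 4 means a temporary failure and 5 a permanent one. Here is a minimal sketch of
classifying the final line of a reply. The class FtpReply and its method names are hypothetical, and the reply texts
in the example are made up, since the wording varies from server to server. Servers also differ on which code they
send for a busy file, so treat the retry rule as a starting point to check against your own host.

```java
/** Hypothetical value class for the final line of an FTP server reply, e.g. "450 File busy". */
final class FtpReply {
    final int code;
    final String text;

    private FtpReply(int code, String text) {
        this.code = code;
        this.text = text;
    }

    /**
     * True if the line ends a reply: three digits followed by a space, or by nothing.
     * A line such as "230-Welcome" (digits then a hyphen) begins a multi-line reply instead.
     */
    static boolean isFinalLine(String line) {
        return line.matches("\\d{3}( .*)?");
    }

    /** Parses the final line of a reply, with the trailing CRLF already stripped. */
    static FtpReply parse(String finalLine) {
        if (!isFinalLine(finalLine)) {
            throw new IllegalArgumentException("not a final reply line: " + finalLine);
        }
        int code = Integer.parseInt(finalLine.substring(0, 3));
        String text = finalLine.length() > 4 ? finalLine.substring(4) : "";
        return new FtpReply(code, text);
    }

    /** 2xx: the command completed. */
    boolean isSuccess() { return code / 100 == 2; }

    /** 3xx: the server is waiting for a follow-up command, e.g. RNTO after RNFR. */
    boolean wantsNextCommand() { return code / 100 == 3; }

    /** 4xx: temporary failure, so the same command may succeed later. */
    boolean isWorthRetrying() { return code / 100 == 4; }

    /** 5xx: permanent failure, so repeating the same command unchanged will not help. */
    boolean isPermanentFailure() { return code / 100 == 5; }

    public static void main(String[] args) {
        // Made-up reply texts; only the leading codes matter.
        String[] samples = {"250 Rename successful", "350 Ready for destination name", "450 File busy"};
        for (String line : samples) {
            FtpReply reply = parse(line);
            System.out.println(reply.code + " retry=" + reply.isWorthRetrying()
                    + " success=" + reply.isSuccess());
        }
    }
}
```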