Atomic FTP Uploader
©1996-2012 Roedy Green, Canadian Mind Products
This essay does not describe an
existing computer program, just one that should exist. This
essay is about a suggested student project in
Java programming. This essay gives a
rough overview of how it might work. I have no source, object,
specifications, file layouts or anything else useful to implementing this
project.
This project outline is not like the artificial, tidy little problems you
are spoon-fed in school, when all the facts you need are included, nothing
extraneous is mentioned, the answer is fully specified, along with hints
to nudge you toward a single expected canonical solution. This project is
much more like the real world of messy problems where it is up to you to
fully define the end point, or a series of ever more difficult versions
of this project, and research the information yourself to solve them.
Everything I have to say to help you with this project is written below.
I am not prepared to help you implement it or give you any additional
materials. I have too many other projects of my own.
Though I am a programmer, I don’t do people’s homework for
them. That just robs them of an education.
You have my full permission to implement this project in any way you please
and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
I added a new section on implementation details to this essay on 2005-07-08.
The Problem
FTP (File Transfer Protocol) software is notoriously difficult to use and notoriously unreliable. I have tried dozens of packages. FTP clients
are all utterly hopeless at the basic task of keeping a server website identical to the client side. They are really
dinosaurs left over from the days when people downloaded files over dial-up phone lines with FTP.
What are the problems?
- The software gets confused and fails to upload or delete files on the server, or uploads them when it does not
need to.
- When someone out on the net is reading a file on the server, that locks it from being updated, and bombs the
update run.
- If I make a massive set of changes to the website, it may take hours to upload. During that time people out on
the web will see an incompatible mixture of old and new files. I don’t want the new files to be visible until
they are all ready. Uploads should be atomic.
- Server and workstation clocks may be out of sync. This should not confuse the software. Usually the workstation
is the one in trouble. Its clock should be reset from an atomic clock with software similar to SetClock. Your software should work even when the server’s clock is badly out
of whack or in a different time zone.
- You upload entire files even if only a few bytes in them have changed.
Ideally you would like to do this without running any software on the server, since ISPs (Internet Service Providers) usually
will not let you, or will charge you considerably more if you do run your own software.
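One way to avoid both needless uploads and clock confusion is to ignore timestamps entirely and compare content digests. The following is only a minimal sketch under assumptions that are not part of this essay: the uploader keeps a manifest (relative path to SHA-256 digest) recording what it last uploaded, and the names SyncPlanner, toUpload and toDelete are hypothetical. It does not detect changes someone made on the server behind the uploader's back; that would need a server listing as well.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: decide what to upload or delete from content digests, not timestamps. */
class SyncPlanner {
    /** Hex SHA-256 digest of a file's bytes. */
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 must be present on every Java platform
        }
    }

    /** Paths that are new, or whose digest differs from the one recorded at the last upload. */
    static SortedSet<String> toUpload(Map<String, String> local, Map<String, String> lastUploaded) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : local.entrySet()) {
            if (!e.getValue().equals(lastUploaded.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Paths recorded at the last upload that no longer exist locally. */
    static SortedSet<String> toDelete(Map<String, String> local, Map<String, String> lastUploaded) {
        SortedSet<String> result = new TreeSet<>(lastUploaded.keySet());
        result.removeAll(local.keySet());
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> last = Map.of("index.html", digest("old".getBytes(StandardCharsets.UTF_8)));
        Map<String, String> local = Map.of("index.html", digest("new".getBytes(StandardCharsets.UTF_8)));
        System.out.println("upload " + toUpload(local, last) + ", delete " + toDelete(local, last));
    }
}
```

Because the decision depends only on file contents, it gives the same answer whatever the server's clock or time zone.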
Approaches
- Upload files to a different directory branch. When they are all ready, delete the master, and rename the
uploaded files. It might be possible to do this without server-side code since FTP supports a rename function. I
know of no product that does this. It would be very useful since a website can fail when you have half old and half
new files being served to the public, or old files referencing images in the process of being deleted or
renamed.
- Use the Replicator as a core. You upload
replicator-style zips to the server, and only once they are all uploaded, unzip them. If a file is busy, you get on
with unzipping the next one and put the busy file on a queue to handle later.
- Use the Subversion version control as a core. From the workstation’s point of view, it is just like
checking in changes to a set of source files to version control. How do you get the files in shape for the HTML (Hypertext Markup Language)
server?
- You write an HTML server that asks the Subversion server to reconstruct the file each time.
- You persuade Subversion to export/publish flat files after an update.
- You use a client version of Subversion to extract the changed files, but run it on the server.
Subversion handles the problems of atomicity, and picking up after a disconnect in the middle of an upload. It
also is smart about only uploading changes, thus saving bandwidth.
- Look into Rsync for site mirroring. You can use the
--delay-updates option for reasonable atomicity. It has the usual Unix utility problems:
novice-unfriendly documentation and the need to tweak and compile source code for your particular server. Perhaps
you might write a wrapper to hide Rsync’s installation complexities.
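The first approach above, uploading to a separate branch and then deleting and renaming, could be driven by a short list of FTP commands once every file is safely on the server. This is a rough sketch, not a tested client: DELE, RNFR and RNTO are standard FTP commands, but the class StagedReveal, the directory names and the assumption that the server lets you rename a file from one directory into another are all mine and need checking against your server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: reveal staged uploads with FTP delete and rename commands. */
class StagedReveal {
    /**
     * Builds the control-connection commands that move each staged file into the live tree.
     * A DELE is issued only for files that already exist in the live tree.
     */
    static List<String> revealCommands(String liveDir, String stagingDir,
                                       List<String> stagedFiles, Set<String> existingLiveFiles) {
        List<String> commands = new ArrayList<>();
        for (String file : stagedFiles) {
            if (existingLiveFiles.contains(file)) {
                commands.add("DELE " + liveDir + "/" + file); // remove the old version first
            }
            commands.add("RNFR " + stagingDir + "/" + file);  // rename from the staging copy...
            commands.add("RNTO " + liveDir + "/" + file);     // ...to the public name
        }
        return commands;
    }

    public static void main(String[] args) {
        revealCommands("/www", "/staging", List.of("index.html", "new.html"), Set.of("index.html"))
            .forEach(System.out::println);
    }
}
```

Building the command list separately from sending it makes the plan easy to log and to test without a server.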
Implementation Details
I would like to see a product specialized for FTP uploads that runs unattended. It would work in conjunction with my
Replicator software for automatically distributing and keeping
large file sets up to date without needing any server-side software.
However, you don’t have to know a thing about the Replicator to understand this project. I am just telling
you I have a couple of paying customers ready for you if you decide to write this.
What I need is a streamlined FTP upload-only program designed specifically to upload website files to a server,
with the following features:
- There is no human operator, just a script running it saying which directories to upload to which websites.
- It must try heroically to do the upload, redoing any file it has trouble with later. Often files temporarily
cannot be updated because someone is downloading them.
- It should only upload files if they have changed.
- If someone adds, updates or deletes files, uploads old files, creates or deletes directories,
or restores the website from backup behind its back, this proposed uploader should notice and solve the problem
all on its own without human help. It also deletes files that no longer exist in the master tree. The script
defines rules to prevent it from deleting files from the website that don’t belong to it, e.g. a set of
wildcards (*.cnt) to tell it which files to leave alone.
- It should set the ERRORLEVEL so that the program or bat file that spawned it can tell how successful it
was.
- Design the program to expect disconnections. Just pick up and carry on where you left off. Presume a permanent
Internet connection rather than dialup for simplicity.
- I would like updates to be almost atomic. By that I mean, the outside world viewing
my uploaded website sees no changes until the updated files have all been successfully uploaded. Only then
are they revealed, within a few seconds, by a set of quick deletes and renames.
I want the atomicity because it can take a long time to update the website if there have been global changes.
For an hour or two the entire website is half working under the old scheme and half the new and nothing works
properly.
Further, my webserver is not very clever. It won’t let me upload a file if anyone out there on the web is
reading it. With NetLoad, which I use now, that aborts the entire run, which forces me to do my big uploads late
at night when traffic is lower.
Doing it with renames makes me much less vulnerable to a file being locked in use and unchangeable. Further,
this way, I take the file out of commission from the outside world for only a second or two, not for the entire
upload.
- You can use two different strategies for doing the deletes and renames.
- Do the deletes and then all renames. This takes files out of commission longer, but never exposes the
website in an inconsistent state, just a lot of file-not-found errors.
- When all the new versions are updated, delete the old, then rename the new in pairs. This is not quite as
atomic, but it leaves webpages out of commission less time.
- The strategy should be configurable, if you can’t think of even better ones.
- In either case, you have to deal with a stubbornly locked file that won’t clear for hours, either because it is
constantly busy being downloaded or because the OS (Operating System) wrongly thinks it is locked. You have to
handle it eventually, even if tomorrow.
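The requirements above to try heroically, never abort the run over one busy file, and report success through ERRORLEVEL might be sketched as a simple retry queue. This is an illustration under my own assumptions: the class RetryingUploader, the tryUpload callback and the pass limit are hypothetical, and the fake upload in main only stands in for a real FTP transfer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: keep retrying busy files instead of aborting the whole run. */
class RetryingUploader {
    /**
     * Attempts every file. A file that fails goes to the back of the queue and is
     * tried again after the others, for at most maxPasses passes over the queue.
     * Returns the files that still failed, so the caller can report them.
     */
    static List<String> uploadAll(List<String> files, Predicate<String> tryUpload, int maxPasses) {
        Deque<String> pending = new ArrayDeque<>(files);
        for (int pass = 0; pass < maxPasses && !pending.isEmpty(); pass++) {
            int count = pending.size();
            for (int i = 0; i < count; i++) {
                String file = pending.removeFirst();
                if (!tryUpload.test(file)) {
                    pending.addLast(file); // busy or locked: defer it, do not abort
                }
            }
            // A real tool would pause here before starting the next pass.
        }
        return new ArrayList<>(pending);
    }

    /** Exit code for the calling script: 0 if everything uploaded, 1 otherwise. */
    static int exitCode(List<String> stillFailing) {
        return stillFailing.isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        // Fake server: busy.html is locked for its first two attempts.
        Map<String, Integer> failuresLeft = new HashMap<>(Map.of("busy.html", 2));
        Predicate<String> fakeUpload = f -> failuresLeft.merge(f, -1, Integer::sum) < 0;
        List<String> failed = uploadAll(List.of("a.html", "busy.html", "b.html"), fakeUpload, 3);
        System.out.println("still failing: " + failed);
        System.exit(exitCode(failed));
    }
}
```

The value passed to System.exit is what a bat file sees as ERRORLEVEL, so the script that spawned the uploader can decide whether to schedule another run.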
I have been so frustrated with GUI-style FTP programs for uploading websites that I put on my to-do list the task
of writing my own implementation of this student project. It is a much simpler beast than something like FTP Voyager. You might use a GUI (Graphical User Interface) like FTP Voyager to compose the
connection information and test the configurations out so that you don’t have to compose that stuff from
scratch in your scripts. Hopefully you can find a companion GUI that will export that information in an easy-to-use
format.
You can get started with Peter van der Linden’s little LinLyn FTP
class and by watching the conversations back and forth between a GUI-style FTP client and an FTP server during an
upload. It is much simpler than you might imagine.
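When you watch such a conversation, you will see that every server reply starts with a three-digit code, and the first digit tells you most of what an unattended uploader needs to know. Here is a small sketch of classifying replies that way. The class FtpReply is hypothetical, it handles only single-line replies (multi-line replies, which put a hyphen after the code, are left out), and it has nothing to do with the Linlyn class.

```java
/** Hypothetical sketch: classify single-line FTP server replies by their three-digit code. */
class FtpReply {
    final int code;
    final String text;

    FtpReply(int code, String text) {
        this.code = code;
        this.text = text;
    }

    /** Parses a single-line reply such as "226 Transfer complete". */
    static FtpReply parse(String line) {
        if (line.length() < 3) {
            throw new IllegalArgumentException("Reply too short: " + line);
        }
        int code = Integer.parseInt(line.substring(0, 3));
        String text = line.length() > 4 ? line.substring(4) : "";
        return new FtpReply(code, text);
    }

    boolean isSuccess() { return code / 100 == 2; }          // 2yz: command completed
    boolean needsMoreInput() { return code / 100 == 3; }     // 3yz: e.g. 350 after RNFR, waiting for RNTO
    boolean isTransientFailure() { return code / 100 == 4; } // 4yz: e.g. 450 file busy, worth retrying later
    boolean isPermanentFailure() { return code / 100 == 5; } // 5yz: repeating the same command will not help

    public static void main(String[] args) {
        FtpReply reply = parse("450 Requested file action not taken");
        System.out.println(reply.code + " retry later? " + reply.isTransientFailure());
    }
}
```

Treating 4yz replies as "try again later" and 5yz replies as "give up and report" fits the usual meaning of those code classes in the FTP specification.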
From Scratch, FTP Replacement
The FTP protocol is old and has a number of disadvantages:
- It does not preserve timestamps.
- It gets confused by time zones and DST (Daylight Saving Time).
- It is slow at uploading a number of small files.
- It is not secure.
- It does no compression.
- It does not do deltas. It always sends entire files even if just one byte has changed.
- It is not atomic. An upload is revealed to the public a file at a time. This can lead to files pointing to ones
that have not been uploaded yet. Ideally the upload should appear to the public all at once.
- If someone is downloading a file that is being uploaded, the entire session aborts.
Perhaps what is needed is a completely fresh start. The catch is then you need to write software both for the client
and server. It might work something like the Replicator.
You might implement deltas, compression, UDP (User Datagram Protocol), SAX-like protocol, automatic recovery from disconnect…
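To give a feel for the delta idea, here is a deliberately simple sketch that splits a file into fixed-size blocks and reports which blocks differ from the previous version, so that only those need sending. It is an illustration under my own assumptions, not a design: the class BlockDelta is hypothetical, the client is assumed to have the previously uploaded bytes at hand, and a fixed-block scheme like this one treats every later block as changed when bytes are inserted near the start of the file. Tools such as Rsync use a cleverer rolling checksum to cope with that.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: find which fixed-size blocks of a file changed since the last upload. */
class BlockDelta {
    /**
     * Returns the indexes of the blocks of newData that differ from the same
     * block of oldData, including blocks past the end of oldData.
     * The new total length must be sent as well, to handle files that shrank.
     */
    static List<Integer> changedBlocks(byte[] oldData, byte[] newData, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Integer> changed = new ArrayList<>();
        int blocks = (newData.length + blockSize - 1) / blockSize;
        for (int i = 0; i < blocks; i++) {
            int from = i * blockSize;
            int newTo = Math.min(from + blockSize, newData.length);
            int oldTo = Math.min(from + blockSize, oldData.length);
            boolean same = from < oldData.length
                && Arrays.equals(newData, from, newTo, oldData, from, oldTo);
            if (!same) {
                changed.add(i);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        byte[] oldData = "aaaabbbbcccc".getBytes(StandardCharsets.US_ASCII);
        byte[] newData = "aaaaXbbbcccc".getBytes(StandardCharsets.US_ASCII);
        System.out.println("blocks to send: " + changedBlocks(oldData, newData, 4));
    }
}
```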
You can get the freshest copy of this page from: http://mindprod.com/project/smartftp.html