I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
The point of this program is to check the HREF= links in your website, to make sure they are valid.
It should handle local websites, e.g. URL of the form file://localhost/E:/mindprod/index.html. Checking the local hard disk copy could be hundreds of times faster than checking the ISP’s copy, at least for checking the internal links.
The process should be restartable. Further, it should retain what it has learned for future scans so it can save work rescanning unchanged pages.
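One way to make the scan restartable is to checkpoint the work queue and the results to disk at intervals. The following is only a sketch under assumptions: the JSON layout, the `pending` and `checked` keys and the function names are all invented for illustration, not part of the proposal.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Write the state atomically so an interrupted run never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint

def load_state(path):
    """Return the saved state, or a fresh empty state on the first run."""
    if not os.path.exists(path):
        return {"pending": [], "checked": {}}
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

On restart, the checker would call `load_state`, refill its queue from `pending`, and skip anything already recorded in `checked`.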
It should not have a heart attack if someone uploads a new file to the website while you are scanning it.
It should also produce its reports in a very simple ASCII file format so that you can write your own programs to process the report file and mark or delete the links. Alternatively, it could export its findings in CSV (comma-separated values) format.
There should be several threads simultaneously checking URLs, each working on a URL to a different site. This way you can get on with checking something else while you wait for a slow site to respond. Your program can monitor itself to home in on the optimal number of threads. It tries adding a thread or subtracting a thread and sees if that makes things faster or slower. It should then jitter about the optimal number of threads, which may change over time.
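The self-tuning idea can be sketched as a simple hill climb. This is an illustration only: the function name, the throughput measure (links checked per second) and the bounds of 1 and 64 threads are assumptions, not part of the proposal.

```python
def next_thread_count(current, previous, rate_now, rate_before, lo=1, hi=64):
    """Hill-climb: keep moving in the direction that raised throughput, else reverse.

    current/previous are the thread counts for the latest and the prior
    measurement interval; rate_now/rate_before are the throughputs observed.
    """
    step = (current - previous) or 1        # direction of the last move, default up
    direction = 1 if step > 0 else -1
    if rate_now < rate_before:
        direction = -direction              # last move made things slower: back off
    return max(lo, min(hi, current + direction))
```

Because the function always moves by one thread, it never settles. It oscillates around the best count it can find, which is the jitter described above, and it follows the optimum if that drifts over time.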
When a link automatically takes you somewhere else, your clone should correct your original HTML to point directly to the new location. It has to be a bit clever. You don’t want to replace valid URLs with 500-character CGI references that will change the next minute. This applies only to permanent redirects, not temporary ones.
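A rough sketch of that decision follows. HTTP marks permanent redirects with status 301 or 308, and temporary ones with 302, 303 or 307. The 100-character limit and the rule of rejecting any URL with a query string are assumptions chosen for illustration. A real tool would want something subtler.

```python
from urllib.parse import urlsplit

PERMANENT = {301, 308}  # HTTP status codes for permanent redirects

def should_rewrite(status, new_url, max_len=100):
    """Return True if the HTML should be updated to point at new_url."""
    if status not in PERMANENT:
        return False   # 302, 303 and 307 are temporary: leave the link alone
    if urlsplit(new_url).query:
        return False   # assumed heuristic: query strings often carry transient data
    return len(new_url) <= max_len  # assumed heuristic: reject very long targets
```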
When a link is broken, your clone should try to fix it for you. For example if http://oberon.ark.com/~Zeugma has disappeared, it should try http://www.zeugma.com and http://www.zeugma.org. Failing that it might be able to make some guesses by combing the URL names in some search engine results. It marks its corrections with a special *.gif. Someone can then manually check these corrections out.
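The first kind of guess, based on the ~username convention in the example, might look like the sketch below. The function name and the default list of top-level domains are assumptions. The results are only candidates, and each one still has to be probed before it is offered as a correction.

```python
from urllib.parse import urlsplit

def candidate_urls(dead_url, tlds=("com", "org")):
    """Guess replacement home pages from a ~username path segment."""
    path = urlsplit(dead_url).path
    names = [seg[1:] for seg in path.split("/")
             if seg.startswith("~") and len(seg) > 1]
    return [f"http://www.{name.lower()}.{tld}" for name in names for tld in tlds]
```

For `http://oberon.ark.com/~Zeugma` this yields `http://www.zeugma.com` and `http://www.zeugma.org`, in that order.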
You want to be able to review any changes to your HTML before they are applied, and selectively turn off ones you don’t want. There are three types of change:
You want to be able to use it without Internet access, just checking the files available on local hard disk. Similarly you want to be able to limit it to just checking links within a website, and to exclude regions of that website.
You should be able to give it a list of pages to check (or wildcards, or lists of directories), a list of pages to avoid (or negative wildcards), and whether you want the indirectly-linked pages also checked (/I) and whether you want subdirectories checked (/S). So for example, you might want it just to do a quick check of all your offsite amazon.com links, just checking internal links enough to effect that, i.e. not checking any gif links, internal # links, or other offsite links. It takes a long time to manually research a dead link. You don’t want to be told again about dead links you already know about.
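The include and exclude lists could be applied with shell-style wildcards, as in this minimal sketch. The function name and the pattern syntax are assumptions. The proposal does not say what the wildcards should look like.

```python
from fnmatch import fnmatchcase

def selected(page, include, exclude):
    """True if page matches some include pattern and no exclude pattern."""
    wanted = any(fnmatchcase(page, pattern) for pattern in include)
    avoided = any(fnmatchcase(page, pattern) for pattern in exclude)
    return wanted and not avoided
```

Note that in `fnmatchcase` a `*` also matches `/`, so `project/*.html` covers subdirectories of `project` too. Honouring a switch like /S would need an explicit test on directory depth.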
You want it to be able to find orphans, files on your website with nothing pointing to them. To find these, you need to specify a list of root landing points, e.g. index.html where visitors start off. If you can’t get to a file indirectly from one of these landing points, it is an orphan. Further, all links you can get to from a landing point should be checked.
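Orphan detection is a reachability search over the link graph. The sketch below assumes the spider has already produced `links`, a mapping from each page to the pages it points to. That name, like the others here, is hypothetical.

```python
from collections import deque

def find_orphans(all_files, links, roots):
    """Return the files that cannot be reached from any root landing point."""
    reachable = set(roots)
    queue = deque(roots)
    while queue:                       # breadth-first walk from the landing points
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    return sorted(set(all_files) - reachable)
```

A file that links to other pages but has nothing pointing at it still counts as an orphan, since only inbound paths from the landing points matter.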
The basic way you spider is to add the master root web page URLs to a queue of web pages to be checked. Then you spawn N threads (you determine the optimal value for N by experiment) that start working, each grabbing an item off the queue to process, or waiting for one to be available. The thread then reads that web page with an HTTP GET. See file I/O Amanuensis for how. Then it uses a regex to find all the <a href="xxxxx"> on that page. Keep in mind that HTML can be ugly with extra blank spaces and unexpected attributes. As it finds each href link, it adds it to the queue, but only if it is not already there. The process stops when there are no more links in the queue to check.

You might want to randomise the queue so that you don’t repetitively hammer one site. Another optimisation is that when you find that a domain can’t be found, you can make a note of that, and avoid testing any links to it.

A very clever version might validate mailto links first by ensuring the names have the proper format and a registered domain under DNS, then by starting a conversation with the target’s mail server. This could be quite tricky since your software has to simulate some of the functions of a mailserver. You don’t actually want to send mail, just find out if you probably could. You have to be able to talk to any flavour of mailserver. At the very least you could ensure the email address conforms to RFC 822. I have code for such email validation as part of a bulk email program I wrote.
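The queue-and-threads spidering described above might be sketched as follows. This is an illustration under assumptions, not a finished crawler: `fetch` is a caller-supplied function that returns the HTML for a URL, the regex is deliberately rough, and randomising the queue, skipping dead domains and validating mailto links are all left out.

```python
import re
import threading
from queue import Queue
from urllib.parse import urldefrag, urljoin

# Tolerates extra spaces, other attributes, mixed case and missing quotes.
HREF = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def extract_links(base_url, html):
    """Return absolute URLs, minus #fragments, for every <a href> in html."""
    return [urldefrag(urljoin(base_url, href))[0] for href in HREF.findall(html)]

def crawl(roots, fetch, n_threads=4):
    """Visit everything reachable from roots; return the set of URLs seen."""
    seen = set(roots)
    lock = threading.Lock()
    todo = Queue()
    for url in roots:
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            if url is None:            # sentinel: no more work
                break
            try:
                try:
                    html = fetch(url)
                except Exception:
                    html = ""          # a real checker would record the failure
                for link in extract_links(url, html):
                    with lock:         # add to the queue only if not already there
                        if link not in seen:
                            seen.add(link)
                            todo.put(link)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    todo.join()                        # returns once every queued URL is processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return seen
```

Passing `fetch` in as a parameter keeps the sketch testable without a network. A real checker would supply a function that does the HTTP GET, or reads the local file for file:// URLs.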
You should be able to control it completely either from the command line or from a GUI.
Before declaring a link dead it should probe it several ways:
Broken Link Handling Options

| Command | How to Process the Link | Display | HTML |
|---|---|---|---|
| | Original broken link | Defunct Inc. | `<a class="offsite" href="http://www.defunct.com">Defunct Inc.</a>` |
| L | Leave the link alone, just mark it with a comment. | Defunct Inc. | `<!-- broken link L "http://www.defunct.com" --> <a class="offsite" href="http://www.defunct.com">Defunct Inc.</a>` |
| R | Repair the link, replacing it with a new one. | Defunct Inc. | `<!-- repaired link R "http://www.defunct.com" --> <a class="offsite" href="http://www.defunct.org">Defunct Inc.</a>` |
| F | Flag the link as broken. | Defunct Inc. | `<!-- flagged link F "http://www.defunct.com" --> <a class="broken" href="http://www.defunct.com">Defunct Inc.</a>` |
| D | Deactivate the link, namely, flag the link as broken and remove it. | (broken-link icon) Defunct Inc. | `<!-- broken link D "http://www.defunct.com" --> <img src="../image/stylesheet/brokenlink.png" width="32" height="32" alt="broken_link" border="0" /> Defunct Inc.` |
| W | Wipe out the link entirely. | Defunct Inc. | `<!-- broken link W "http://www.defunct.com" --> Defunct Inc.` |
The commands to the link fixing utility might consist of comma-separated values: a command letter, the filename where the link occurs, and the link itself, i.e. what goes inside the <a href=" ">. The utility would sort them by filename.
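Reading such a command file could look like the sketch below. The three-field layout follows the description above, while the function name and the error handling are assumptions. The proposal does not say where the replacement URL for an R command comes from, so the sketch does not handle that.

```python
import csv
import io

VALID_COMMANDS = {"L", "R", "F", "D", "W"}

def read_commands(text):
    """Parse 'letter,filename,link' lines and return them sorted by filename."""
    commands = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue                    # skip blank lines
        letter, filename, link = (field.strip() for field in row)
        letter = letter.upper()
        if letter not in VALID_COMMANDS:
            raise ValueError(f"unknown command letter: {letter!r}")
        commands.append((letter, filename, link))
    # Sorting by filename lets the utility open each HTML file only once.
    return sorted(commands, key=lambda command: command[1])
```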
The point of adding the <!-- broken link comments is threefold: to keep an audit trail of what you did, to help the link checker avoid pestering you about broken links that you have already dealt with, and to make it easier to reprocess a link.
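To avoid reporting the same dead link twice, the checker could scan each page for those marker comments before it complains. The sketch below assumes the comment wording shown in the table of handling options. The regex and the function name are illustrative only.

```python
import re

# Matches e.g. <!-- broken link L "http://www.defunct.com" -->
MARKER = re.compile(
    r'<!--\s*(?:broken|repaired|flagged) link ([LRFDW]) "([^"]*)"\s*-->'
)

def handled_links(html):
    """Map each already-processed URL to the command letter applied to it."""
    return {url: command for command, url in MARKER.findall(html)}
```

Any broken URL that appears as a key in the result has already been dealt with and can be left out of the report.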
Since I wrote this proposal, I have discovered a number of link checking utilities. None conforms 100% to this wish list, but Xenu Link Sleuth comes close. Its big problems are that it ignores <applet tags and that it does not mark broken links. It just finds them.
| Product | Notes |
|---|---|
| CyberSpyder | shareware |
| HTMLValidator | trialware. Only checks links within a page. Primarily checks for other HTML syntax errors. |
| Linkcop | Duke Engineering no longer supports it. It throws everything away and starts from scratch if you have to stop and restart it. |
| NetMechanic | commercial. Variable price per URL, depending on your website size. Also checks and repairs HTML syntax. |
| Xenu Link Sleuth | free. Best program of the lot. Uses multiple threads. Will recheck links over a period of days. Lets you configure external link checking on or off. Will work off a local hard disk if you give it the index.html file to check, or a URL of the form: file:///E:\mindprod\index.html Will export to a tab separated file. Extremely fast checking of local links. Erroneously reports every APPLET reference to a class or jar as broken. The author has no plans to correct this. Does not fix or mark broken links, just lists them for manual attention. On a local site, it can detect orphans, files that nothing points to. It can even detect orphan files on a website if you give it your FTP password. |
This is one reason I tend to use specific links rather than home page links, even though they go out of date faster.
To deal with these, what you need to do is maintain a private cache of the offsite pages your links point to. Then you need to periodically check them with some sort of automated tool to make sure the link points to something roughly the same as before. If it does not, the automated tool at least has a target of the sort of page it is looking for. It might even find a similar one on a totally unrelated site, using search engines to help. The replacement(s) it finally suggests might not even be from the original author, just on the same subject.
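The "roughly the same as before" test could start as a crude text comparison between the cached copy and the live page. In this sketch the tag-stripping regex, the 0.6 threshold and the function names are all assumptions. A real tool would need a proper HTML parser and tuning against actual pages.

```python
import re
from difflib import SequenceMatcher

def visible_text(html):
    """Crudely strip tags and collapse whitespace so markup changes are ignored."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split()).lower()

def still_similar(cached_html, current_html, threshold=0.6):
    """True if the live page still resembles the cached copy."""
    matcher = SequenceMatcher(None, visible_text(cached_html), visible_text(current_html))
    return matcher.ratio() >= threshold
```

When `still_similar` returns False, the cached text is what the tool would use as its target when hunting for a replacement page.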
Send suggestions to improve this page to Roedy Green, Canadian Mind Products (mindprod.com). The information on this page is for non-military use only. Military use includes use by defence contractors.

You can get a fresh copy of this page from http://mindprod.com/project/htmlbrokenlink.html.