HTML Broken Link Finder/Fixer

Disclaimer

This essay does not describe an existing computer program, just one that should exist. It is about a suggested student project in Java programming and gives a rough overview of how the program might work. I have no source, object code, specifications, file layouts or anything else useful for implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point (or a series of ever more difficult versions of this project) and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

This project is vaguely related to the HTML Disturbed Link Patcher. This project finds broken links, whereas the link patcher prevents them from being created in the process of reorganising your website.

The point of this program is to check the HREF= links in your website, to make sure they are valid.

It should handle local websites, e.g. URL (Uniform Resource Locator) of the form file://localhost/E:/mindprod/index.html. Checking the local hard disk copy could be hundreds of times faster than checking the ISP (Internet Service Provider) ’s copy, at least for checking the internal links.

The process should be restartable. Further, it should retain what it has learned for future scans so it can save work rescanning unchanged pages.
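
As a sketch of one way to make the scan restartable, the checker could persist what it has learned in a simple java.util.Properties file keyed by URL. The file name and the status strings below are assumptions, not part of any specification:

    import java.io.*;
    import java.util.Properties;

    /** Sketch: remember link-check results between runs so a restart loses little work. */
    public class LinkStatusCache {
        private final Properties cache = new Properties();
        private final File file;

        public LinkStatusCache( File file ) throws IOException {
            this.file = file;
            if ( file.exists() ) {
                try ( InputStream in = new FileInputStream( file ) ) {
                    cache.load( in );   // reload what we learned last run
                }
            }
        }

        /** Record the result of checking one URL, e.g. "ok 2024-05-01" or "dead". */
        public void put( String url, String status ) {
            cache.setProperty( url, status );
        }

        public String get( String url ) {
            return cache.getProperty( url );
        }

        /** Save after each batch so an interrupted run can pick up where it left off. */
        public void save() throws IOException {
            try ( OutputStream out = new FileOutputStream( file ) ) {
                cache.store( out, "link check results" );
            }
        }
    }

Before rescanning a page, the checker can consult this cache and skip links whose recorded status is recent enough.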

It should not have a heart attack if someone uploads a new file to the website while you are scanning it.

It should also produce its reports in a very simple ASCII (American Standard Code for Information Interchange) file format so that you can write your own programs to process the report file and mark or delete the links. Alternatively, export your findings in CSV (Comma Separated Value) format.
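
Purely as an illustration, a report line might carry a status, the file the link was found in and the link itself; these particular fields and the CSV layout are just one possible choice:

    dead,products/index.html,http://www.defunct.com/products.html
    redirect,links.html,http://oldhost.example.com/page.html,http://newhost.example.com/page.html
    timeout,feedback.html,http://slowsite.example.com/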

There should be several threads simultaneously checking URLs, each working on a URL to a different site. This way you can get on with checking something else while you wait for a slow site to respond. Your program can monitor itself to home in on the optimal number of threads: it tries adding or subtracting a thread and sees whether that makes things faster or slower. It should then jitter about the optimal number of threads, which may change over time.

Oracle’s Javadoc on Executors.newCachedThreadPool is available online.
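
Here is a minimal sketch of the threading, using a plain ExecutorService from java.util.concurrent. The fixed pool size of 8 and the sample URLs are placeholders, and the adaptive tuning described above is left out:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    /** Sketch: check a batch of URLs on a pool of worker threads. */
    public class ParallelChecker {
        public static void main( String[] args ) throws InterruptedException {
            // placeholder URLs; in the real program these come off the spider queue
            List<String> urls = Arrays.asList(
                    "http://mindprod.com/index.html",
                    "http://example.com/" );
            ExecutorService pool = Executors.newFixedThreadPool( 8 );  // tune N by experiment
            for ( final String url : urls ) {
                pool.submit( () -> System.out.println( url + " : " + checkLink( url ) ) );
            }
            pool.shutdown();
            pool.awaitTermination( 10, TimeUnit.MINUTES );
        }

        /** Placeholder for the real probe logic described later in this essay. */
        static boolean checkLink( String url ) {
            return true;
        }
    }

Executors.newCachedThreadPool would grow and shrink the pool for you; homing in on the best fixed N yourself is part of the exercise.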

When a link automatically takes you somewhere else, your clone should correct your original HTML (Hypertext Markup Language) to point directly to the new location. It has to be a bit clever. You don’t want to replace valid URLs (Uniform Resource Locators) with 500 character CGI (Common Gateway Interface) references that will change the next minute. This applies only to permanent redirects, not temporary ones.
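
A minimal sketch of detecting a permanent redirect with java.net.HttpURLConnection follows; error handling is omitted, and treating only 301 (and its newer cousin 308) as rewrite-worthy reflects the permanent-versus-temporary rule above:

    import java.net.HttpURLConnection;
    import java.net.URL;

    /** Sketch: return the new target for a permanently redirected link, or null to leave it alone. */
    public class RedirectProbe {
        static String permanentTarget( String link ) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL( link ).openConnection();
            conn.setInstanceFollowRedirects( false );   // we want to see the redirect ourselves
            conn.setRequestMethod( "HEAD" );
            int code = conn.getResponseCode();
            if ( code == HttpURLConnection.HTTP_MOVED_PERM || code == 308 /* Permanent Redirect */ ) {
                return conn.getHeaderField( "Location" );
            }
            return null;   // 302/307 are temporary: do not touch the HTML
        }
    }

A returned Location that is a 500-character CGI-style URL should still be rejected by the caller, for the reason given above.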

When a link is broken, your clone should try to fix it for you. For example if http://oberon.ark.com/~Zeugma has disappeared, it should try http://www.zeugma.com and http://www.zeugma.org. Failing that it might be able to make some guesses by combing the URL names in some search engine results. It marks its corrections with a special *.gif. Someone can then manually check these corrections out.

You want to be able to review any changes to your HTML before they are applied and selectively turn off ones you don’t want. There are three types of change:

  1. marking broken links. Sometimes you know a link could not really be broken, just temporarily down.
  2. replacing redirects with their targets. Sometimes the redirect is purely internal.
  3. replacing broken links with guessed new targets. Sometimes the guess may be out to lunch.

When it creates lists of broken links, it should include the text ahead of the <a> so the user has an idea what the link is for. Usually there are too many to clean up at once and you must prioritise.

You want to be able to use it without Internet access, just checking the files available on local hard disk. Similarly you want to be able to limit it to just checking links within a website and to exclude regions of that website.

You should be able to give it a list of pages to check (or wildcards, or lists of directories), a list of pages to avoid (or negative wildcards), whether you want the indirectly-linked pages also checked (/I) and whether you want subdirectories checked (/S). So for example, you might want it just to do a quick check of all your offsite amazon.com links, checking internal links only as far as needed to effect that, i.e. not checking any gif links, internal # links or other offsite links. It takes a long time to manually research a dead link. You don’t want to be told again about dead links you already know about.

You want it to be able to find orphans, files on your website with nothing pointing to them. To find these, you need to specify a list of root landing points, e.g. index.html where visitors start off. If you can’t get to a file indirectly from one of these landing points, it is an orphan. Further, all links you can get to from a landing point should be checked.
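
A minimal sketch of the orphan test, assuming the spider has already collected the set of local pages reachable from the landing points; the webroot path is just an example:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    /** Sketch: any *.html file under the mirror that the spider never reached is an orphan. */
    public class OrphanFinder {
        static List<Path> findOrphans( Path webroot, Set<Path> reachable ) throws IOException {
            try ( Stream<Path> all = Files.walk( webroot ) ) {
                return all.filter( p -> p.toString().endsWith( ".html" ) )
                          .filter( p -> ! reachable.contains( p ) )
                          .collect( Collectors.toList() );
            }
        }

        public static void main( String[] args ) throws IOException {
            Set<Path> reachable = Collections.emptySet();   // in real use, filled in by the spider
            findOrphans( Paths.get( "E:/mindprod" ), reachable ).forEach( System.out::println );
        }
    }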

The basic way you spider is to add the master root web page URLs to a queue of web pages to be checked. Then you spawn N threads (you determine the optimal value for N by experiment) that start working, each grabbing an item off the queue to process, or waiting for one to become available. The thread reads that web page with an HTTP GET. See the File I/O Amanuensis for how. Then it uses a regex to find all the <a href=xxxxx> links on that page. Keep in mind that HTML can be ugly, with extra blank spaces and unexpected attributes. As it finds each href link, it adds it to the queue, but only if it is not already there. The process stops when there are no more links in the queue to check.

You might want to randomise the queue so that you don’t repetitively hammer one site. Another optimisation: when you find that a domain can’t be found, make a note of that and avoid testing any further links to it.

A very clever version might validate mailto links: first by ensuring the names have the proper format and a registered domain under DNS (Domain Name Service), then by starting a conversation with the target’s mail server. This could be quite tricky, since your software has to simulate some of the functions of a mailserver. You don’t actually want to send mail, just find out if you probably could. You have to be able to talk to any flavour of mailserver. At the very least you could ensure the email address conforms to RFC 5322. I have code for such email validation as part of a bulk email program I wrote.
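
A minimal sketch of the href harvesting step follows. The regex is deliberately loose (case-insensitive, tolerant of extra attributes and of quoted or unquoted values), but it is not a full HTML parser, so treat it as a starting point:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Sketch: pull the href targets out of one page of HTML. */
    public class HrefExtractor {
        private static final Pattern HREF = Pattern.compile(
                "<a\\b[^>]*?href\\s*=\\s*[\"']?([^\"'\\s>]+)",
                Pattern.CASE_INSENSITIVE );

        static List<String> links( String html ) {
            List<String> found = new ArrayList<>();
            Matcher m = HREF.matcher( html );
            while ( m.find() ) {
                found.add( m.group( 1 ) );   // candidate for the queue, if not already there
            }
            return found;
        }
    }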

You should be able to control it completely either from the command line or from a GUI (Graphic User Interface).

Before declaring a link dead it should probe it several ways:

  1. First with HEAD. This is the most efficient since the website returns just the header, not the whole document. You can also tell if the document has changed since the previous time it was polled.
  2. Next with GET. Some sites won’t return just headers; you have to ask for the whole page. You can abort once the header portion arrives. A clever program would remember which links could be successfully polled with HEAD and which needed GET, so that on subsequent probings it could save time. (A sketch of this two-step probe appears after this list.)
  3. Try polling again after all the other pages have been polled. This may give the site time to clear a temporary backlog or hardware problems.
  4. Optionally just put the non-responding sites in a list to be automatically rechecked on several successive days. Only if URLs stay dead do you bother notifying the user about them. Of course, if a site is up but there is no such document, there is not much point in retrying or delaying notification of the problem.
  5. Often the problem is that the link was misspelled. If you correct the original, then on retries the program should be smart enough to notice the correction and not complain that the original link does not work. It never will. It is not supposed to!
  6. In particular pay attention to <applet links which Xenu ignores. Make sure the jar exists and the main class exists in the correct package in the jar.
  7. Normally you will have several links to the same spot. There is no need to check the link more than once.
  8. Checking images, and checking Applets to make sure the class file is indeed inside the jar as advertised, requires special handling. You only spider HTML files, not PDF (Portable Document Format), sound, movie and other files.
  9. There are three kinds of links: local to the hard disk, internal to the website being tested and external to other websites. You only spider local and internal pages.
  10. You probably want to configure which check you will do so the process can be sped up: local, internal, external, images, Applets, pdfs, downloads, various other extensions …
  11. When you are checking links, you want to avoid reloading a page when you have several links to it, even with several different anchors on the page.
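
Here is a minimal sketch of the two-step HEAD-then-GET probe from items 1 and 2 above, using java.net.HttpURLConnection; the 10-second timeouts and the rule that any status below 400 counts as alive are arbitrary assumptions:

    import java.net.HttpURLConnection;
    import java.net.URL;

    /** Sketch: probe a link with HEAD first, fall back to GET for servers that refuse HEAD. */
    public class LinkProbe {
        static boolean isAlive( String link ) {
            return respondsTo( link, "HEAD" ) || respondsTo( link, "GET" );
        }

        private static boolean respondsTo( String link, String method ) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL( link ).openConnection();
                conn.setRequestMethod( method );
                conn.setConnectTimeout( 10_000 );   // don't hang forever on a sluggish site
                conn.setReadTimeout( 10_000 );
                int code = conn.getResponseCode();
                conn.disconnect();                  // for GET, we never bother reading the body
                return code < 400;
            } catch ( Exception e ) {
                return false;   // unknown host, refused connection, timeout…
            }
        }
    }

A smarter version would record which method finally worked for each site, as item 2 suggests, and schedule retries instead of declaring the link dead on the first failure.
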
There should be a GUI to let you decide what to do with the various broken links it finds. The following table shows how a broken link can be transformed.

Broken Link Handling Options

  (no command)  Original broken link.
     Displays as: Defunct Inc.
     HTML: <a class=offsite href=http://www.defunct.com>Defunct Inc.</a>

  L  Leave the link alone, just mark it with a comment.
     Displays as: Defunct Inc.
     HTML: <!-- broken link L http://www.defunct.com -->
           <a class=offsite href=http://www.defunct.com>Defunct Inc.</a>

  R  Repair the link, replacing it with a new one.
     Displays as: Defunct Inc.
     HTML: <!-- repaired link R http://www.defunct.com -->
           <a class=offsite href=http://www.defunct.org>Defunct Inc.</a>

  F  Flag the link as broken.
     Displays as: Defunct Inc.
     HTML: <!-- flagged link F http://www.defunct.com -->
           <a class=broken href=http://www.defunct.com>Defunct Inc.</a>

  D  Deactivate the link, namely flag the link as broken and remove it.
     Displays as: broken_link icon, then Defunct Inc.
     HTML: <!-- broken link D http://www.defunct.com -->
           <img src=../image/stylesheet/brokenlink.png width=32 height=32 alt=broken_link border=0 />
           Defunct Inc.

  W  Wipe out the link entirely.
     Displays as: Defunct Inc.
     HTML: <!-- broken link W http://www.defunct.com -->
           Defunct Inc.
You want to be able to manually add links to be fixed. The link checker may falsely think some links are fine, e.g. ones that point to a Domains For Sale site, when the original company has gone belly up.

The commands to the link fixing utility might consist of comma separated values: a command letter, a filename where the link occurs and the link itself, what goes inside the <a href=>. The utility would sort them by filename.
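
Purely as an illustration, a command file might look like this: a command letter, the file containing the link, the link itself and, for R, a replacement link. The exact layout is up to you:

    F,products/index.html,http://www.defunct.com
    R,products/index.html,http://www.defunct.com,http://www.defunct.org
    D,links.html,http://oberon.ark.com/~Zeugma
    W,feedback.html,http://www.gonebust.com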

The point of adding the <!-- broken link comments is threefold: to keep an audit trail of what you did, to help the link checker avoid pestering you about broken links that you have already dealt with and to make it easier to reprocess a link.

Since I wrote this proposal, I have discovered a number of link checking utilities. None conform 100% to this wish list, but Xenu Link Sleuth comes close. Its big problem is that it ignores <applet tags and does not mark broken links. It just finds them.

Product Notes

  HTMLValidator  trialware. Only checks links within a page. Primarily checks for other HTML syntax errors.

  Linkcop  Duke Engineering no longer supports it. It throws everything away and starts from scratch if you have to stop and restart it.

  NetMechanic  commercial. Variable price depending on your website size, $35.00 USD to $200.00 USD per URL. Also checks and repairs HTML syntax.

  Xenu Link Sleuth  free. Best program of the lot. Uses multiple threads. Will recheck links over a period of days. Lets you configure external link checking on or off. Will work off a local hard disk if you give it the index.html file to check, or a URL of the form file://localhost/E:\mindprod\index.html. Will export to a tab-separated file. Extremely fast checking of local links. Erroneously reports every APPLET reference to a class or jar as broken. The author has no plans to correct this. Does not fix or mark broken links, just lists them for manual attention. On a local site, it can detect orphans, files that nothing points to. It can even detect orphan files on a website if you give it your FTP (File Transfer Protocol) password.

Technically Not Broken Links

The most embarrassing sort of broken link happens when a company goes out of business and some other company, particularly a porn company, buys up their domain name. Now your link inadvertently points to something quite different from what it did originally and you get no warning, just a furious phone call or email out of the blue. This link technically is still functioning; Xenu won’t report it as broken. You get blamed for maliciously leading children to porn sites or trying to sell penis enlargement pills etc.

This is one reason I tend to use specific links rather than home page links, even though they go out of date faster.

To deal with these, what you need to do is maintain a private cache of the offsite pages your links point to. Then you need to periodically check them with some sort of automated tool to make sure the link points to something roughly the same as before. If it does not, the automated tool at least has a target of the sort of page it is looking for. It might even find a similar one on a totally unrelated site, using search engines to help. The replacement(s) it finally suggests might not even be from the original author, just on the same subject.
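
A minimal sketch of the "roughly the same as before" test: compare the words of the cached copy with the words of a fresh fetch and compute their Jaccard similarity. The crude tag stripping and the 0.5 threshold are assumptions to tune:

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch: decide whether a page still covers roughly the same material as the cached copy. */
    public class PageSimilarity {
        static boolean roughlySame( String cachedHtml, String freshHtml ) {
            Set<String> a = words( cachedHtml );
            Set<String> b = words( freshHtml );
            Set<String> intersection = new HashSet<>( a );
            intersection.retainAll( b );
            Set<String> union = new HashSet<>( a );
            union.addAll( b );
            double jaccard = union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
            return jaccard >= 0.5;   // below this, flag the link for manual review
        }

        private static Set<String> words( String html ) {
            Set<String> out = new HashSet<>();
            for ( String w : html.replaceAll( "<[^>]*>", " " )   // crude tag stripping
                                 .toLowerCase()
                                 .split( "\\W+" ) ) {
                if ( w.length() > 3 ) out.add( w );               // skip tiny words
            }
            return out;
        }
    }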

Global Broken Link Solution

Life could be ever so much easier if we co-operated on the problem of fixing broken links. What we need are lists of old and new links with a date. There would also be wildcard entries so that you could say that all files beginning with proj*.* in directory x were moved to directory y. Such lists, possibly maintained in database servers, could be used in several ways:

Implementation

I have implemented a simplified, unreleased version of this project that works as an adjunct to Xenu. Mainly it just keeps track of each link and when it was last successfully and unsuccessfully negotiated. I discovered that links temporarily stop working all the time. The site might be down for maintenance. The link to that part of the Internet may be down. The site or net may be unusually sluggish, resulting in timeouts. So it is important to keep re-testing links and only do the work of researching them if they stay broken for a week or so.

I have also discovered that links to news stories are notoriously volatile. Almost all of them stop working within a year or two, and it is even worse for controversial stories. It is an Orwellian world when the history of what happened melts away. This is very distressing when you use newspaper stories to back up your assertions. You are left dangling, looking as if you made it all up.

BrokenLink Implementation

I wrote a broken link checker that works as a back end to Xenu. It is not quite ready for prime time, but it saves me hundreds of hours.