Sitemaps were originally created by Google as a way to help it spider websites more efficiently. Google designed them in a way that made it difficult for other search engines to use them. Since then, sitemaps have been opened up and made accessible to all search engines. Basically, a sitemap is just an XML (eXtensible Markup Language) file that lists all the spiderable files on your website, when you last updated each one, and how important you think it is to keep the spidering of each one up to date. A typical sitemap file might look something like this:
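As a rough illustration of that shape, here is a minimal Java sketch that prints a two-entry sitemap. It assumes the element names and namespace of the sitemaps.org 0.9 protocol; the class name SitemapXmlSketch and the last-modified dates are invented for the example, and this is not the code of any particular generator.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Locale;

/** Minimal sketch: renders sitemap XML for a few pages (sitemaps.org 0.9 layout assumed). */
class SitemapXmlSketch {

    /** One page of the site. Field names are hypothetical. */
    static final class Entry {
        final String loc;          // absolute URL of the page
        final LocalDate lastmod;   // date the page was last updated
        final String changefreq;   // e.g. daily, weekly, never
        final double priority;     // relative importance, 0.0 to 1.0

        Entry(String loc, LocalDate lastmod, String changefreq, double priority) {
            this.loc = loc;
            this.lastmod = lastmod;
            this.changefreq = changefreq;
            this.priority = priority;
        }
    }

    /** Escapes the five characters XML reserves, so URLs with & stay well-formed. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds the whole document as a string, one url element per entry. */
    static String render(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (Entry e : entries) {
            sb.append("  <url>\n");
            sb.append("    <loc>").append(escape(e.loc)).append("</loc>\n");
            // LocalDate prints as yyyy-MM-dd
            sb.append("    <lastmod>").append(e.lastmod).append("</lastmod>\n");
            sb.append("    <changefreq>").append(e.changefreq).append("</changefreq>\n");
            sb.append("    <priority>")
              .append(String.format(Locale.ROOT, "%.1f", e.priority))
              .append("</priority>\n");
            sb.append("  </url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The dates below are made up for illustration.
        System.out.print(render(List.of(
                new Entry("http://mindprod.com/whatsnew.html", LocalDate.of(2008, 1, 15), "daily", 0.9),
                new Entry("http://mindprod.com/project/projects.html", LocalDate.of(2008, 1, 2), "weekly", 0.7))));
    }
}
```

Each url element carries the three facts described above: where the file is, when it last changed, and how much you care about it being respidered.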
Normally the file itself is gzip-compressed; because XML is so fluffy, that gets about 25 to 1 compression. You should validate your XML files against the XSD schemas for sitemaps before submitting them.
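The compression step can be sketched with the standard java.util.zip classes. The class name GzipSketch and the repetitive sample data are assumptions for illustration; the ratio you actually get depends on your own file, so the sketch simply prints the before and after sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Minimal sketch: gzip a sitemap held in memory and report the sizes. */
class GzipSketch {

    /** Compresses raw bytes into gzip format. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed by now, so the gzip trailer has been written.
        return buf.toByteArray();
    }

    /** Reverses gzip(), mainly to check that nothing was lost. */
    static byte[] gunzip(byte[] packed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, highly repetitive sample standing in for a real sitemap.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("<url><loc>http://example.com/page").append(i).append(".html</loc></url>\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.printf("%d bytes -> %d bytes%n", raw.length, packed.length);
    }
}
```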
Search engines find your sitemap through a Sitemap line in your robots.txt file:
# parts of the mindprod.com website not indexed
user-agent: *
disallow: /include/
disallow: /jgloss/include/
disallow: /image/restricted/
Sitemap: http://mindprod.com/sitemap.gz
Note how Sitemap takes a full URL (Uniform Resource Locator), unlike the others.
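To make the lookup concrete, here is a small Java sketch of how a crawler might pull the Sitemap URLs out of robots.txt text. The class name RobotsSitemapSketch is hypothetical and the parsing is deliberately simplified (it treats everything after a # as a comment); real crawlers may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: extracts Sitemap URLs from the text of a robots.txt file. */
class RobotsSitemapSketch {

    static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);   // drop the comment part
            }
            int colon = line.indexOf(':');        // first colon ends the field name
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("sitemap") && !value.isEmpty()) {
                urls.add(value);                  // a full URL, not a site-relative path
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "# parts of the mindprod.com website not indexed",
                "user-agent: *",
                "disallow: /include/",
                "Sitemap: http://mindprod.com/sitemap.gz");
        System.out.println(sitemapUrls(robotsTxt));
    }
}
```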
Many people use a program called SOFTPlus GSiteCrawler to create their sitemaps. It works just as the Google spider does, chasing links in your online website to find all the pages. This is quite time-consuming, since it has to download every one of your website’s pages individually. There is a much faster way, one that takes under two seconds to prepare a sitemap for a website of 10,000 files: use an offline site preparation utility that does not even need to read any of your files, just the directory entries. Obviously there are side benefits to GSiteCrawler’s spidering, e.g. broken link detection, but you can get that much faster with offline spidering with Xenu. Many FTP (File Transfer Protocol) utilities, such as NetLoad and FTP Voyager, will verify the consistency of the offline and online versions of your files without the heavy overhead of spidering.
I have written a Java program called SiteMap that you can use to generate a Google sitemap file for your own website, offline. It can prepare a sitemap for a website of 10,000 files in under 2 seconds. This is fast enough that you can run it before every upload, ensuring your Google sitemap is 100% up to date when the Google spider of opportunity knocks to spider your site.
You control the utility by composing three files with a text editor. The first, siteconfig.properties, looks like this:
You also compose a list of entire directories to specially catalog: directories.csv, containing the directory name, frequency of update and relative importance when it comes to keeping it up to date, expressed
You then compose a list of exceptions, files.csv: the files to catalog differently from the default for their directory. It contains directory name, filename, frequency updated, and
# individual file exceptions to the directory rules
# directory, file, frequency, spidering importance.
, whatsnew.html, daily, .9
jgloss, deadpadsites.html, never, 0
project, projects.html, weekly, .7
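For readers curious how such a file could be read, here is a minimal Java sketch that parses lines in the four-field layout shown above. It is only an illustration, not the SiteMap utility's own code; the class FileRulesSketch, the Rule type and its field names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: parses files.csv-style exception rules from text. */
class FileRulesSketch {

    /** One exception rule. Field names are hypothetical. */
    static final class Rule {
        final String directory;   // empty string means the root of the site
        final String file;
        final String frequency;   // e.g. daily, weekly, never
        final double importance;  // spidering importance, e.g. .9

        Rule(String directory, String file, String frequency, double importance) {
            this.directory = directory;
            this.file = file;
            this.frequency = frequency;
            this.importance = importance;
        }
    }

    static List<Rule> parse(String csv) {
        List<Rule> rules = new ArrayList<>();
        for (String line : csv.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;                          // skip blank lines and comments
            }
            String[] f = trimmed.split(",", -1);   // -1 keeps empty fields
            if (f.length != 4) {
                throw new IllegalArgumentException("expected 4 fields: " + line);
            }
            rules.add(new Rule(f[0].trim(), f[1].trim(), f[2].trim(),
                    Double.parseDouble(f[3].trim())));
        }
        return rules;
    }

    public static void main(String[] args) {
        String csv = String.join("\n",
                "# directory, file, frequency, spidering importance.",
                ", whatsnew.html, daily, .9",
                "jgloss, deadpadsites.html, never, 0",
                "project, projects.html, weekly, .7");
        for (Rule r : parse(csv)) {
            System.out.println(r.directory + "/" + r.file + " " + r.frequency + " " + r.importance);
        }
    }
}
```

Note that the first sample line starts with a comma: its directory field is empty, which the sketch treats as the root of the site.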
My utility then scans your disk, prepares a Google sitemap entry for every individual file that meets those criteria, and compresses the resulting sitemap. You then upload it to your website. The first time, you must also register that file’s name with Google.
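The speed comes from looking only at directory entries. A sketch of that idea in Java, under my own assumptions, might look like the following: it walks a local copy of the site and collects each file's last-modified date without ever opening a file. The class name DirectoryScanSketch and the .html filter are hypothetical, and this is not the SiteMap utility's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Minimal sketch: collects last-modified dates from directory entries only. */
class DirectoryScanSketch {

    /** Maps each .html file's site-relative path to its last-modified date (UTC). */
    static Map<String, LocalDate> scan(Path siteRoot) {
        Map<String, LocalDate> found = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(siteRoot)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".html"))
                 .forEach(p -> {
                     try {
                         // Reads file metadata only; the file contents are never opened.
                         LocalDate modified = Files.getLastModifiedTime(p)
                                 .toInstant().atZone(ZoneOffset.UTC).toLocalDate();
                         String relative = siteRoot.relativize(p).toString().replace('\\', '/');
                         found.put(relative, modified);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Pass the root of your offline copy of the site; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        scan(root).forEach((path, date) -> System.out.println(date + " " + path));
    }
}
```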
You want to regenerate your sitemap just prior to every upload; otherwise, if the Google spider comes, it will miss some of your recently updated files.
You can validate a sitemap file with a sitemap XML schema.
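One way to do that validation from Java is the standard javax.xml.validation API. The sketch below is a minimal illustration: the class name XsdValidateSketch is hypothetical, and to stay self-contained it takes the schema and the document as strings and demonstrates with a toy one-element schema. In practice you would load the real sitemap XSD and your uncompressed sitemap from files instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Minimal sketch: checks an XML document against an XSD schema. */
class XsdValidateSketch {

    /** Returns true if xml is well-formed and conforms to the schema text in xsd. */
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            // With no custom error handler, validate() throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Toy schema for demonstration only; it is not the sitemap schema.
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"note\" type=\"xs:string\"/>"
                + "</xs:schema>";
        System.out.println(isValid(xsd, "<note>hello</note>"));
        System.out.println(isValid(xsd, "<other/>"));
    }
}
```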
There is another kind of site map, one that tries to give you a bird’s-eye view of the entire website, so you can jump directly to the section you need. You can prepare these manually using the usual HTML (Hypertext Markup Language) editors, combined with directory listings and search/replace, or you can use a utility, such as Coffee Cup SiteMapper, to build one for you. I created a primitive one manually for my own website.