SiteMaps


Purpose of Sitemaps
robots.txt
Configuring the SiteMap utility
  sitemapconfig.properties
  directories.csv
  files.csv
  includes.csv
  excludes.csv
Invoking the Utility
Installing The SiteMap Utility
Jet Accelerator
Spider Icon
Ensuring All Is Working Properly

Purpose of Sitemaps

You use a sitemap to encourage Google or other search engines to more frequently and efficiently index your website.

If you are not familiar with sitemaps, see the overview information about sitemaps.

The layout of the files that SiteMap generates is defined at Sitemaps.org.

You use this sitemap utility because it is so quick that you can run it before every upload. Your sitemap is then always 100% up-to-date, ready for whenever opportunity knocks in the form of the Google spider.

This approach is orders of magnitude quicker than actually spidering the site yourself with a tool like Xenu or GSiteCrawler. With the sitemap utility, you can prepare a fresh sitemap in a couple of seconds.

To speed spidering, and ensure the most important files get spidered frequently, Google has created a system where you leave a compressed catalog of all your files on the website for it to find. The CMP (Canadian Mind Products) SiteMap utility creates that file.
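To make the idea of a compressed catalog concrete, here is a minimal Java sketch of what such a file contains. It is not the CMP SiteMap source. The class and method names are hypothetical, and the lastmod date is a placeholder. It builds a one-entry document in the Sitemaps.org 0.9 layout and gzips it in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch, not the CMP SiteMap source. */
class SitemapGzSketch {

    /** Escapes the characters that XML element text cannot hold literally. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds one url entry in the Sitemaps.org 0.9 layout. */
    static String urlEntry(String loc, String lastmod, String changefreq, String priority) {
        return "  <url>\n"
                + "    <loc>" + escape(loc) + "</loc>\n"
                + "    <lastmod>" + lastmod + "</lastmod>\n"
                + "    <changefreq>" + changefreq + "</changefreq>\n"
                + "    <priority>" + priority + "</priority>\n"
                + "  </url>\n";
    }

    /** Wraps the entries in the urlset root element. */
    static String buildXml(String... entries) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String entry : entries) {
            xml.append(entry);
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    /** Compresses the XML text the way a .gz file stores it. */
    static byte[] gzip(String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The date below is a placeholder, not a real modification date.
        String xml = buildXml(urlEntry("http://mindprod.com/whatsnew.html", "2010-12-01", "daily", "0.9"));
        System.out.print(xml);
        System.out.println("compressed size in bytes: " + gzip(xml).length);
    }
}
```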

robots.txt

For a sitemap to take effect, you must upload the generated compressed sitemap.gz file to the root directory of your website, and you must register the name of the file once with Google ⇒ Tools ⇒ Add Site so they will know what you called it and where to look for it. You can check it is registered properly in your Google ⇒ Webmaster ⇒ Tools ⇒ Dashboard. The sitemap.gz file can be used by all search engines, not just Google, so long as they know to look for it. You can tell all the search engines where to find your sitemap by adding a line to your robots.txt file like this:

# robots.txt. Lives in root directory of the website
# parts of the mindprod.com website not indexed
user-agent: *
disallow: /include/
disallow: /jgloss/include/
disallow: /image/restricted/
Sitemap: http://mindprod.com/sitemap.gz

See more information on robots.txt. You can use it to control which parts of your website get indexed. You can also use the robots meta tag.
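If you want to confirm that your robots.txt really advertises the sitemap, a few lines of Java will pull out the Sitemap: entries. This is only a sketch. The class name RobotsSitemapFinder is hypothetical, and it assumes the field name may appear in any letter case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the SiteMap utility. */
class RobotsSitemapFinder {

    /** Returns the value of every "Sitemap:" line found in the robots.txt text. */
    static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // Compare the first 8 characters with "sitemap:", ignoring case.
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                urls.add(trimmed.substring(8).trim());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "# robots.txt. Lives in root directory of the website",
                "user-agent: *",
                "disallow: /include/",
                "Sitemap: http://mindprod.com/sitemap.gz");
        System.out.println(sitemapUrls(robotsTxt));
    }
}
```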

It might be wise to put a link to your sitemap somewhere on your website so all the search engines could in theory find it by spidering, without other help.

Configuring the SiteMap utility

I wrote SiteMap back in 2006-01 before I knew of the existence of any competition, so I just called it SiteMap, which sounds a bit grandiose now. To distinguish it from others, you might refer to it as CMP SiteMap.

To use the program, you must configure five files:

  1. sitemapconfig.properties
  2. directories.csv
  3. files.csv
  4. includes.csv
  5. excludes.csv

You also need robots.txt, but that is not for the SiteMap utility.

sitemapconfig.properties

defines where to find your website html files.

directories.csv

defines which directories you want Google to look at, and some facts about them. All files in each directory mentioned will be treated the same way, unless there is a special entry for a file in files.csv. It has comma-separated fields:

  1. directory (relative to root of website, root is blank)
  2. frequency of update, e.g. always, hourly, daily, weekly, monthly, yearly, never. never lets you suppress indexing of a directory, or you can just not mention that directory.
  3. priority.
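The three fields above can be checked mechanically. The following Java sketch parses one directories.csv record. It is not the utility's own parser. The class names and the sample line are hypothetical. The frequency keywords come from the list above, and the 0.0 to 1.0 range for priority comes from the Sitemaps.org protocol.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a directories.csv record, not the utility's own parser. */
class DirectoryRuleSketch {

    static final Set<String> FREQUENCIES =
            Set.of("always", "hourly", "daily", "weekly", "monthly", "yearly", "never");

    /** One parsed record: directory, frequency of update, priority. */
    static final class DirectoryRule {
        final String directory;
        final String frequency;
        final double priority;

        DirectoryRule(String directory, String frequency, double priority) {
            this.directory = directory;
            this.frequency = frequency;
            this.priority = priority;
        }
    }

    /** Parses a line such as "jgloss, weekly, .5"; a blank directory means the root. */
    static DirectoryRule parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        String frequency = fields[1].trim().toLowerCase(Locale.ROOT);
        if (!FREQUENCIES.contains(frequency)) {
            throw new IllegalArgumentException("unknown frequency: " + frequency);
        }
        double priority = Double.parseDouble(fields[2].trim());
        if (priority < 0.0 || priority > 1.0) {
            throw new IllegalArgumentException("priority must lie in 0.0 to 1.0: " + priority);
        }
        return new DirectoryRule(fields[0].trim(), frequency, priority);
    }

    public static void main(String[] args) {
        DirectoryRule rule = parse("jgloss, weekly, .5"); // made-up sample line
        System.out.println(rule.directory + " " + rule.frequency + " " + rule.priority);
    }
}
```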

files.csv

defines files you want Google to handle specially. It has comma-separated fields:

  1. directory (relative to root of website, root is blank)
  2. file (including .html)
  3. frequency of update, e.g. always, hourly, daily, weekly, monthly, yearly, never. never lets you suppress indexing of a single file.
  4. priority.

You may optionally include # comments. The meaning of the frequency and priority fields is defined in the Google FAQ. The program automatically generates the lastmod.

# individual file exceptions to the directory rules
# directory, file, frequency, spidering importance.
,        whatsnew.html,     daily,  .9
jgloss,  deadpadsites.html, never,   0
project, projects.html,     weekly, .7
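How a file-specific entry overrides the directory rule can be sketched in Java as a simple lookup. This is an illustration under stated assumptions, not the utility's code. The names are hypothetical, only whole-line # comments are skipped, and the directory default passed in is made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of files.csv exceptions overriding directory rules. */
class FileOverrideSketch {

    /** Maps "directory/file" to a {frequency, priority} pair. */
    static Map<String, String[]> parseFilesCsv(String text) {
        Map<String, String[]> overrides = new HashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and whole-line comments
            }
            String[] fields = trimmed.split(",", -1);
            if (fields.length != 4) {
                throw new IllegalArgumentException("expected 4 fields: " + line);
            }
            String key = fields[0].trim() + "/" + fields[1].trim();
            overrides.put(key, new String[] { fields[2].trim(), fields[3].trim() });
        }
        return overrides;
    }

    /** The file's own entry wins; otherwise the directory default applies. */
    static String[] ruleFor(String directory, String file,
                            Map<String, String[]> overrides, String[] directoryDefault) {
        return overrides.getOrDefault(directory + "/" + file, directoryDefault);
    }

    public static void main(String[] args) {
        Map<String, String[]> overrides = parseFilesCsv(String.join("\n",
                "# directory, file, frequency, spidering importance.",
                ",        whatsnew.html,     daily,  .9",
                "project, projects.html,     weekly, .7"));
        String[] directoryDefault = { "monthly", ".5" }; // made-up directory rule
        String[] special = ruleFor("", "whatsnew.html", overrides, directoryDefault);
        String[] ordinary = ruleFor("project", "other.html", overrides, directoryDefault);
        System.out.println(special[0] + " " + special[1]);
        System.out.println(ordinary[0] + " " + ordinary[1]);
    }
}
```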

includes.csv

A list of records that look like this:

#includes.csv file, which extensions to include in the sitemap
startWith,xxx
endsWith,.html
endsWith,.txt
regexMatch,xxxxx

You can specify strings that file names (not path names) start with, end with, or match as a regex pattern. You can put only one string per line. If you don’t provide an includes.csv file, it presumes endsWith .html .htm .txt .pdf. These are the files you want to include in the list you leave for Google.

excludes.csv

A list of records that look like this:

# excludes.csv which extensions you don't want to include in the sitemap
startWith,xxx
endsWith,.html
endsWith,.txt
regexMatch,xxxxx

You can specify strings that file names (not path names) start with, end with, or match as a regex pattern. You can put only one string per line. If you don’t provide an excludes.csv file, it presumes no excludes. These are the files you want to exclude from the list you leave for Google.
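The include and exclude rules can be pictured as a pair of filters on the bare file name. The Java sketch below is a guess at the behaviour, not the utility's code. It assumes a file is listed when it matches at least one include rule and no exclude rule, that regexMatch must match the whole file name, and that keywords are case-insensitive. The class name is hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of includes.csv and excludes.csv matching. */
class NameFilterSketch {

    /** The defaults the manual says apply when there is no includes.csv. */
    static final List<String> DEFAULT_INCLUDES =
            List.of("endsWith,.html", "endsWith,.htm", "endsWith,.txt", "endsWith,.pdf");

    /** Tests one "keyword,string" rule against a file name (not a path). */
    static boolean matches(String rule, String fileName) {
        int comma = rule.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected keyword,string: " + rule);
        }
        String keyword = rule.substring(0, comma).trim().toLowerCase(Locale.ROOT);
        String value = rule.substring(comma + 1).trim();
        switch (keyword) {
            case "startwith":
                return fileName.startsWith(value);
            case "endswith":
                return fileName.endsWith(value);
            case "regexmatch":
                return Pattern.matches(value, fileName); // assumption: whole-name match
            default:
                throw new IllegalArgumentException("unknown keyword: " + rule);
        }
    }

    static boolean matchesAny(List<String> rules, String fileName) {
        for (String rule : rules) {
            if (matches(rule, fileName)) {
                return true;
            }
        }
        return false;
    }

    /** Assumption: listed when some include matches and no exclude matches. */
    static boolean isListed(String fileName, List<String> includes, List<String> excludes) {
        return matchesAny(includes, fileName) && !matchesAny(excludes, fileName);
    }

    public static void main(String[] args) {
        List<String> excludes = List.of("startWith,draft");
        System.out.println(isListed("index.html", DEFAULT_INCLUDES, excludes));
        System.out.println(isListed("draft.html", DEFAULT_INCLUDES, excludes));
        System.out.println(isListed("archive.zip", DEFAULT_INCLUDES, excludes));
    }
}
```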

Invoking the Utility

The program looks in the current working directory for all five files.

You invoke it without parameters:

rem invoke the sitemap utility to create a sitemap.
rem it reads five configuration files from the current directory
sitemap.jar

You can view the generated sitemap.gz with WinZip. Tell it that the internal file ends in .xml. You can also view a sitemap.proof file, which contains a decompressed version of what is inside the sitemap.gz file.
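If you have no unzip tool handy, you can also peek inside sitemap.gz with a few lines of Java. This is a sketch with a hypothetical class name. It assumes the file holds UTF-8 XML and sits in the current directory unless you pass another path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical viewer for a gzip-compressed sitemap. */
class SitemapGzViewer {

    /** Decompresses the whole file and returns its text. */
    static String readGz(Path gzFile) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path gzFile = Path.of(args.length > 0 ? args[0] : "sitemap.gz");
        System.out.println(readGz(gzFile));
    }
}
```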

Installing The SiteMap Utility

To install the sitemap utility, extract the zip download with WinZip, available from WinZip.com (or a similar unzip utility), into any directory you please, often C:\, ticking off the use folder names option. To run as an application, type:

rem invoke the sitemap utility to create a sitemap.
rem it reads five configuration files from the current directory
sitemap.jar

adjusting as necessary to account for where the jar file is.

See notes in the program and in the sample files.

Note that you normally have both a robots.txt file and a Google sitemap.gz file. The directories you exclude in robots.txt trump inclusion in the sitemap.
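To see which sitemap entries robots.txt would trump, you can compare each path against the disallow prefixes. The Java sketch below uses a hypothetical class name and made-up sample paths. It assumes a simple prefix comparison and ignores user-agent groups and wildcards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check of sitemap paths against robots.txt disallow prefixes. */
class RobotsTrumpSketch {

    /** True when the path starts with any non-empty disallow prefix. */
    static boolean isDisallowed(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the sitemap paths that robots.txt does not exclude. */
    static List<String> keepAllowed(List<String> sitemapPaths, List<String> disallowPrefixes) {
        List<String> allowed = new ArrayList<>();
        for (String path : sitemapPaths) {
            if (!isDisallowed(path, disallowPrefixes)) {
                allowed.add(path);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        List<String> disallow = List.of("/include/", "/jgloss/include/", "/image/restricted/");
        List<String> paths = List.of("/whatsnew.html", "/include/footer.html"); // made-up paths
        System.out.println(keepAllowed(paths, disallow));
    }
}
```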

Jet Accelerator

For superfast execution, compile the jar with Jet to create a highly optimised sitemap.exe file. Then you can invoke it with just sitemap on the command line.

Spider Icon

Why the spider icon? The sitemap helps Google rapidly spider the website, visiting all the files.

Ensuring All Is Working Properly

The file created for Google contains a list of all the individual *.html and *.txt files on the website, and when they were last updated. It is fairly easy to modify the program to include other types of files. There is no point in including types Google does not index, such as zip.

You can tell how often Google is actually spidering your files by looking your files up in Google and noting the date of the cached version.

Technophiles might want to validate the sitemap.xml that the sitemap utility generates and compresses into the sitemap.gz, just to make sure it is completely compliant with the sitemap standards. Use the XSD (XML Schema Definition) Sitemap Schema whose link is embedded in the sitemap.proof file, namely http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd. You can use a utility like Stylus Studio to do the validation. This would be just for your own reassurance. The code generated is always compliant.
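If you do not have Stylus Studio, the validator built into Java can do a similar check. This is a minimal sketch using the standard javax.xml.validation API. The class name is hypothetical, the main method needs network access to fetch the schema, and it assumes sitemap.proof is in the current directory unless you pass another path.

```java
import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical XSD check for a decompressed sitemap. */
class SitemapXsdCheck {

    /** Returns empty when the document is valid, otherwise the first complaint. */
    static Optional<String> validate(Source xml, Source xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(xsd);
            Validator validator = schema.newValidator();
            validator.validate(xml);
            return Optional.empty();
        } catch (SAXException | IOException e) {
            return Optional.of(String.valueOf(e.getMessage()));
        }
    }

    public static void main(String[] args) {
        // The schema is fetched from the URL given in the text, so this needs network access.
        Source xsd = new StreamSource("http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd");
        Source xml = new StreamSource(new File(args.length > 0 ? args[0] : "sitemap.proof"));
        Optional<String> problem = validate(xml, xsd);
        System.out.println(problem.isPresent() ? "invalid: " + problem.get() : "valid");
    }
}
```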

Acquiring SiteMap

Package:  sitemap (SiteMap For Google)
Version:  1.9
Released: 2010-12-01
Licence:  free
Language: Java
Notes:    Build a Sitemap for a website for more efficient spidering by Google and other search engines.

Links for the current version of SiteMap For Google: more info, precis, manual, screenshot, browse source repository.

Download: 2.4MB zip for SiteMap For Google containing Java source, compiled class files, jar and documentation to run on your own machine as an application.

Runs on any OS that supports Java e.g. W2K, XP, W2003, Vista, W2008, W7-32, W7-64, W8-32, W8-64, W2012, Linux, LinuxARM, LinuxX86, LinuxX64, Ubuntu, Solaris, SolarisSPARC, SolarisSPARC64, SolarisX86, SolarisX64 and OSX.

First install the most recent Java.

To install, extract the zip download with WinZip (or a similar unzip utility) into any directory you please, often J:\, ticking off the use folder names option.

To check out the corresponding source from the Subversion repository, use the TortoiseSVN repo-browser, or another Subversion client, to access the sitemap source at wush.net/svn/mindprod/com/mindprod/sitemap/.

After you have installed the jar, you can run it as an application. Type:

java -jar J:\com\mindprod\sitemap\sitemap.jar parms

adjusting as necessary to account for where the jar file is.

download ASP PAD XML program description for the current version of SiteMap For Google.

$1989.00 US donated so far. If the CMP utilities solved your problem, please donate a buck or two, or donate to one of the charities featured in the footer public service ads throughout the website and get a tax receipt.

SiteMap For Google is free. Full source included. You may even include the source code, modified or unmodified in free/commercial open source/proprietary programs that you write and distribute. Non-military use only.
 
 
Details of the Sitemaps XML protocol
Google’s List of Sitemap Preparation Software

This page is posted
on the web at:

http://mindprod.com/application/sitemap.manual.html

Contact Roedy. Please feel free to link to this page without explicit permission.