!Help file for !SiteMap v1.10 (5th Feb 2009)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Purpose
~~~~~~~
To quickly and easily create site maps for websites and additionally check
the integrity of inter-page links.


Licence
~~~~~~~
SiteMap is SHAREWARE. If you use it and find it useful, you should register
by sending a cheque for £10.00 to

Digital Phenomena Ltd.,
104 Manners Road,
Southsea,
Hampshire,
PO4 0BG.

Please make cheques payable to "Digital Phenomena Limited" which is the
trading name for my RISC OS software company.

Unregistered copies are fully functional, but due to time constraints I'm
only able to offer technical support to registered users or beta testers.

You can contact me at nospam@vigay.com


Use
~~~
* UNREGISTERED DEMO VERSIONS HAVE A RESTRICTION ON THE NUMBER OF FILES YOU
CAN SCAN *

Double-click on !SiteMap to load and open the main window.

Drag the directory containing your website to the 'Source directory' icon.
This will then set the file path to the root of your offline website.

N.B. From version 1.05 you can link check individual files by dragging them
to the main window. See below.

In the 'Site Map' icon, enter the name of your sitemap file. By default this
is sitemap/html and will be created in the root directory of your website,
unless you drag the HTML icon somewhere else on your hard drive.

The 'Default index files' icon is writable and can be changed to match the
filename you've used for your directory index files. Normally a web server
returns something like www.mywebsite.com/data/index.html if you just enter
www.mywebsite.com/data/ but some web servers allow you to change this to
"index.htm", "default.htm" or similar. Enter the RISC OS name rather than the
web name, i.e. with / in place of . (index/html, not index.html).

N.B. SiteMap always indexes the index/html files first when it encounters a
new directory. All other files are scanned in alphabetical order. The reason
for scanning index/html files first is because SiteMap makes a single
recursive pass through your entire website, and the index file is generally
the 'contents' or 'index' file for each directory, with other pages linked as
sub-sections.
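
As a rough illustration only (this is not SiteMap's own code), the scanning
order described above can be sketched in Python. The names scan_order and
walk, the nested dictionary standing in for a directory, and the choice to
sort sub-directories alphabetically alongside files are all assumptions made
for the example.

```python
def scan_order(names, index_name="index/html"):
    """Return leaf names with the index file first, the rest alphabetically."""
    rest = sorted((n for n in names if n != index_name), key=str.lower)
    return ([index_name] if index_name in names else []) + rest


def walk(tree, index_name="index/html", prefix=""):
    """Make one recursive pass over a nested dict standing in for a directory.

    A dict value is a sub-directory; any other value is a file.
    Yields RISC OS style paths, with '.' separating directories.
    """
    for name in scan_order(list(tree), index_name):
        path = prefix + name
        if isinstance(tree[name], dict):
            yield from walk(tree[name], index_name, path + ".")
        else:
            yield path


# Hypothetical website used only to demonstrate the ordering.
site = {
    "index/html": None,
    "contact/html": None,
    "news": {"zebra/html": None, "index/html": None, "about/html": None},
}
for path in walk(site):
    print(path)
```

In this sketch each directory's index file is always visited before its
siblings, so its description is available as the heading for that
sub-section.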

The 'Template' icon gives a drop-down menu of all the available templates
(directories inside !SiteMap.Templates). Each template directory can consist
of up to three files:-

	exclusions	A list of files/directories to EXCLUDE when creating
                        your sitemap

	footer		A file to automatically add to the bottom of your
                        sitemap file.

	header		A file to automatically add to the top of your
                        sitemap file.

	
'Scan' allows you to select the type of scan that SiteMap will perform. This
determines where SiteMap will find descriptions for your web pages. At
present the following options are provided:-

	Title		From the <title>...</title> tags. The description
                        will be extracted from within these tags.
			
			An optional "Include blank <title>'s" option will
                        allow SiteMap to automatically include "-no title-" text
                        for any files which don't have valid <title>
                        descriptions.

	SSIs		From SSI parameters, i.e. any text entered after a ?
	                character as part of the SSI parameter.
			
	Filename	Use the filename as a description - useful if you've not
	                used meaningful <title> tags in your HTML pages.
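
To make the 'Title' scan concrete, here is a minimal Python sketch of pulling
a description out of the <title> tags, with the optional -no title- fallback.
The function name describe and the regular expression are assumptions for
illustration; SiteMap's own parser may behave differently on unusual markup.

```python
import re

# Matches the first <title>...</title> pair, ignoring case and line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def describe(html, include_blank=False):
    """Return the page description taken from the <title> tags.

    If no usable title is found, return "-no title-" when include_blank
    is set, otherwise an empty string.
    """
    match = TITLE_RE.search(html)
    # Collapse runs of whitespace so multi-line titles become one line.
    title = " ".join(match.group(1).split()) if match else ""
    if title:
        return title
    return "-no title-" if include_blank else ""


print(describe("<html><head><TITLE> My  Page </TITLE></head></html>"))
print(describe("<html><body>No title here</body></html>", include_blank=True))
```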
	                
			
'Style' allows you to choose the style of your final sitemap file. At
present the following styles are implemented:-
	                
	Listing		A straight-forward listing of all files in your
	                webspace, nested according to directory structure.
	                                
	XML Sitemap	An XML based sitemap. Extremely preliminary support so
	                far though!! See
	                https://www.google.com/webmasters/tools/docs/en/protocol.html
	                for more information.
	                Selecting this option automatically opens another mini
	                window in which you can set the domain URL prefix for
	                your website, as required by the XML sitemap specs.
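
The following Python sketch shows why the domain URL prefix is needed: every
entry in an XML sitemap must be a full URL. It assumes the public
sitemaps.org 0.9 format; the function name build_xml_sitemap and the example
domain are illustrative only, and the exact output SiteMap writes may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(url_prefix, web_paths):
    """Return an XML sitemap string listing each path as a full URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in web_paths:
        url = ET.SubElement(urlset, "url")
        # Join prefix and path with exactly one slash between them.
        ET.SubElement(url, "loc").text = (
            url_prefix.rstrip("/") + "/" + path.lstrip("/")
        )
    return ET.tostring(urlset, encoding="unicode")


print(build_xml_sitemap("http://www.mywebsite.com/",
                        ["index.html", "data/index.html"]))
```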

Clicking on 'Edit Exclusions' will let you edit the list of files and/or
directories that will be ignored when creating your sitemap file. Simply add
a list of words, one on each line, that will be ignored. For example, if you
have a directory called "private" that you don't want indexed in your sitemap
file, just enter "private" (without the quotes) on its own line. Then save
the file. Make sure any filenames are in RISC OS format, NOT web.
Thus, if you want to exclude something like website/data/private then you
would add the following line to the exclusions file:

	data.private
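
If you are unsure how a web path translates into RISC OS format, the rule is
simply that . and / swap places. A small Python sketch, with hypothetical
helper names web_to_riscos and riscos_to_web, taking paths relative to the
root of your website:

```python
# Translation table that swaps '.' and '/' in a single pass.
_SWAP = str.maketrans("./", "/.")


def web_to_riscos(web_path):
    """Convert a web path (relative to the site root) to RISC OS format."""
    return web_path.strip("/").translate(_SWAP)


def riscos_to_web(riscos_path):
    """Convert a RISC OS path (relative to the site root) to web format."""
    return riscos_path.translate(_SWAP)


print(web_to_riscos("data/private"))     # the exclusion used above
print(web_to_riscos("data/index.html"))
print(riscos_to_web("data.index/html"))
```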

Ticking the 'Include blank <title>s' option will label any HTML files with an
undefined or undetectable title as -no title- so that you can easily spot
files which don't have suitable descriptions. If the option is off, the
description of such files will be left blank.

'Add <base> tag' will automatically write a <base href> tag to the header of
the site map file so that you can follow the links correctly when viewing
offline. This will effectively convert the RISC OS root directory of your
offline website into a web browsable URL.

'Add indents' will automatically insert extra spaces into the sitemap file in
order to make the source code a bit easier on the eye.

This option affects the listing order of the files in your website hierarchy. By
default, SiteMap will read any index files before the rest of the files in a
directory. This is because the index file is generally the default file served
by a web server if you only supply the directory name.
Thus, the description of the index file will be used as the contents of the
relevant sub-section when producing the sitemap. The other files will then be
listed below this, in the form of a sub-section. This is effectively promoting
the index file above the other files.

If you tick this option, it will NOT promote the index files, which will just be
treated like any other file and indexed alphabetically along with the other
files in each directory.

Experiment with this setting to suit your own requirements.


Link Checking
~~~~~~~~~~~~~
SiteMap will check the validity of links within all your HTML files if you
enable this function by toggling ON the 'Link check' icon.

Turning ON 'Check external' will also check any external links. This uses a
third-party utility called Wget and will be greyed out if this is not
installed.

To install Wget, you can download it from
http://www.riscos.info/packages/NetworkDetails.html#wget and then install it
in your 'Library' directory. This is usually something like !Boot.Library but
may vary if you have Select installed. If in doubt, check with your OS
supplier.

WARNING: Turning on external link checking can take a *LONG* time, depending
upon how many links you have within your site. Each link check has an
overhead of a few seconds to allow for the remote site to respond within a
suitable time period.
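
The 'external' report (see 'Reports' below) treats a link as good if the
remote server answers 200, 301 or 302, and as broken otherwise. That rule can
be sketched in Python as follows. The function names and the idea of passing
in a ready-made table of responses are assumptions for the example; how
SiteMap actually drives Wget is not shown here.

```python
# Response codes that count as "the link exists", as listed under Reports.
ACCEPTED_STATUSES = {200, 301, 302}


def external_link_ok(status_code):
    """Return True if the response code means the link is deemed to exist."""
    return status_code in ACCEPTED_STATUSES


def broken_links(results):
    """Return the sorted URLs whose response marks them as broken.

    results maps each URL to its status code, or to None if the remote
    site did not respond in time.
    """
    return sorted(url for url, status in results.items()
                  if not external_link_ok(status))


# Hypothetical responses, for demonstration only.
responses = {
    "http://a.example/": 200,
    "http://b.example/": 404,
    "http://c.example/": None,
    "http://d.example/": 301,
}
print(broken_links(responses))
```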

Turning 'Ignore cgi files' ON will ignore any references to cgi-bin files.
This can be useful because cgi files are often stored outside your normal
webspace, so checking links to them would lead to misleading error reports.

'Use throwback' requires the DDEUtils module, which should be loaded
automatically when you load !SiteMap.

If on, SiteMap will automatically list lines containing errors into a 'throwback
window' provided by your text editor. This allows you to double-click on any
errors and be taken automatically to the relevant line in the source file
containing the error(s).

Turn on the 'Count bandwidth' option to generate a report detailing the 
approximate amount of bandwidth required by each page. This will allow you to
judge whether or not any of your pages might pose a problem for people on
dial-up or with slow internet connections.
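
To give a feel for what the bandwidth figures mean, the Python sketch below
adds up a page and the files it embeds and estimates the transfer time on a
56 kbit/s modem. Both helper names and the simple "size times eight divided
by line speed" estimate are assumptions for illustration; they are not taken
from SiteMap's own report.

```python
def page_weight(html_size, resource_sizes):
    """Total bytes for a page: the HTML plus every embedded file."""
    return html_size + sum(resource_sizes)


def download_seconds(total_bytes, bits_per_second=56_000):
    """Idealised transfer time, ignoring latency and protocol overhead."""
    return total_bytes * 8 / bits_per_second


# Hypothetical page: 12,000 bytes of HTML plus two images.
total = page_weight(12_000, [30_000, 28_000])
print(total, "bytes, about", download_seconds(total), "seconds on dial-up")
```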

By default SiteMap will generate reports in plain text format, which can be
viewed in your favourite text editor. However, you can use the drop-down
menu under 'Report format' to select HTML format instead if you prefer. The
reports will then be cross-linked to each other, with hyperlinks to any
files with errors in them.

Turning ON 'Auto open reports' will automatically open any reports which
contain one or more errors when the scanning is complete.

Clicking on 'View Reports' will open the 'reports' directory so that you can
manually examine the various report files produced by SiteMap. These are in
plain text format unless you have selected HTML under 'Report format', and
are described below.

'Report spaces' will indicate any filenames/links which contain spaces. This is
because using spaces in filenames is generally frowned upon and some file 
systems may give an error.

'Check for Orphans' will automatically check for any NON HTML files that are
NOT linked to from any HTML file. This can be useful for spotting redundant
files. HTML files themselves are not checked, because you may deliberately
leave some HTML pages unlinked, whereas it is easy to replace a graphic or
other file and forget to delete the older, unused, version.

Note. Turning this option on may slow down the scan considerably because it
needs to make two passes of the website; once to populate the file database
and again to cross-reference all the links.

Note 2. Checking for orphaned files will not log some common 'system' files,
such as .htaccess, favicon.ico, robots.txt and index.rss files. This is because
many websites include these files but don't actually link to them from anywhere.
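
The two-pass orphan check can be pictured with the following Python sketch.
It works on web-style paths held in a dictionary rather than on real files,
and the names find_orphans, is_html and LINK_RE, together with the simple
href/src pattern, are assumptions for the example rather than SiteMap's
actual method.

```python
import posixpath
import re

# Files that are never reported as orphans (see Note 2 above).
SYSTEM_FILES = {".htaccess", "favicon.ico", "robots.txt", "index.rss"}

# Very simple pattern for double-quoted href="..." and src="..." values.
LINK_RE = re.compile(r'(?:href|src)\s*=\s*"([^"#?]+)', re.IGNORECASE)


def is_html(path):
    return path.lower().endswith((".html", ".htm"))


def find_orphans(site):
    """site maps each web path to its text (ignored for non-HTML files)."""
    # Pass 1: populate the file database.
    all_files = set(site)
    # Pass 2: cross-reference every local link found in the HTML files.
    linked = set()
    for path, text in site.items():
        if not is_html(path):
            continue
        base = posixpath.dirname(path)
        for target in LINK_RE.findall(text):
            if "://" in target:
                continue  # external link, not part of the local site
            if target.startswith("/"):
                resolved = posixpath.normpath(target.lstrip("/"))
            else:
                resolved = posixpath.normpath(posixpath.join(base, target))
            linked.add(resolved)
    return sorted(f for f in all_files
                  if not is_html(f)
                  and posixpath.basename(f) not in SYSTEM_FILES
                  and f not in linked)


# Hypothetical site, for demonstration only.
site = {
    "index.html": '<a href="data/page.html">x</a><img src="i/logo.gif">',
    "data/page.html": '<img src="../i/new.gif">',
    "i/logo.gif": "", "i/new.gif": "", "i/old.gif": "",
    "robots.txt": "", "favicon.ico": "",
}
print(find_orphans(site))
```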


Link Checking Individual Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(added at version 1.05)
If you drag an individual file to !SiteMap's main window it will
automatically link check it and produce a set of reports for you. The overall
settings used are those currently set in the main window. However, you don't
specifically need to turn on 'Link check' as it will assume that dragging a
single file directly to this window requires it. After all, a sitemap for a
single file would be a bit pointless. :-)

SiteMap will only link check HTML (filetype &FAF) files. Dragging other
filetypes will give an error. N.B. SiteMap will, however, recognise TEXT
(filetype &FFF) files being dragged to the window. These are treated as saved
choices files (see 'Multiple Choices' below) and will replace your current
choices, so be careful of the filetype when you drag a file to SiteMap.


Reports
~~~~~~~
When you run SiteMap with the 'Link check' option enabled, it will
automatically create a number of reports, giving information on the
integrity of all the files comprising your website.

The report files are automatically created inside a directory called
"SiteMap" in your <Wimp$Scrap> directory.

In the current version of SiteMap the following report files are created:-

	analysis	This is a short file giving various statistics about
	                your website.
	
	empty		Lists all the files that contain empty links.
			An empty link is some code such as <a href="">....</a>
			
	external	(if 'Check external' is enabled - see above)
	                Lists all the broken external links. SiteMap will query
	                the remote server and if it responds with a "200 OK",
	                "301 Moved Permanently" or a "302 Found" response, the
	                external link will be deemed to exist, otherwise it will
	                be flagged as a broken link and listed in this report
	                file.
			
	mismatch	Lists all the files that have mismatched <...> tags.
			If you've accidentally entered something such as
			<<strong> as a tag then you will end up with more
			opening < symbols than closing > symbols. SiteMap counts
			all the references and gives you a handy analysis.
			The filename is given along with the numbers counted of
			the < and > symbols. Lastly, a byte offset (in
			hexadecimal) is given to the first detected occurrence.
			
	notfound	Provides a list of all files which contain broken links.
	                Each file is listed alphabetically, followed by a list
	                of the relevant links which cannot be found. You can
	                then manually edit each of the files in order to fix the
	                relevant links.
			
	outside		Similar to the 'notfound' report but this one lists all 
	                the links which point outside your website. For instance
	                a link such as <a href="../../page.html">link</a> will
	                generate an error if the file containing this link is
	                only nested one directory down from your root directory.
	                This is because ../../ will point above your root
	                directory.
			N.B. Many web servers will ignore references above your
			root web space and just link to the root directory.
			However, SiteMap will pick them up, because this can
			lead to confusion, especially if you're browsing your
			website offline from your hard drive.
			
	spaces		If the option is turned on, this report file will be
	                produced, highlighting all the links which contain
	                spaces in them. Filenames with spaces are generally 
	                frowned upon and may be invalid on some file systems.
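
The test behind the 'outside' report can be sketched in a few lines of
Python: walk along the link, counting how many directories deep you are, and
flag the link as soon as a ../ would climb above the root. The function name
points_outside_root and the use of web-style paths are assumptions for the
example.

```python
def points_outside_root(page_path, link):
    """Return True if link, followed from page_path, climbs above the root.

    page_path is the web path of the page relative to the site root,
    e.g. "data/page1.html"; link is the href value found in that page.
    """
    # A link starting with / is resolved from the root, not from the page.
    depth = 0 if link.startswith("/") else page_path.strip("/").count("/")
    for part in link.split("/"):
        if part == "..":
            depth -= 1
            if depth < 0:
                return True
        elif part not in ("", "."):
            depth += 1
    return False


# The example from the report description: a page one directory down.
print(points_outside_root("data/page1.html", "../../page.html"))
print(points_outside_root("data/page1.html", "../page.html"))
```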
			
Click 'Quit' to quit SiteMap. You can also close the window to quit the
application.

Click 'Start' to start processing and generate your sitemap file and check
links.


Multiple Choices
~~~~~~~~~~~~~~~~
(added at version 1.05)
As more options have been added to SiteMap you may wish to automatically
remember certain combinations of settings for different websites. You can now
save the current choices by clicking MENU and selecting the 'Save choices'
option from the main menu. This will provide a dialogue window in which you
can enter a filename and then save a backup copy of all the current options
and settings.

To reload a previous set of choices, simply drag a saved choices file back to
the main window and SiteMap will replace the current choices with the
previously saved ones. If you do this though, note that the CURRENT choices
will be lost (unless you save them first!).

Although the choices file is specifically saved in plain text format with
comments, you *really* shouldn't edit settings by hand. If SiteMap
subsequently can't understand an option it will be ignored (and thus set to
OFF).


Miscellaneous
~~~~~~~~~~~~~
* The current settings are automatically stored when you quit !SiteMap.

* The window positions are automatically stored, so that if you move the
  windows, they'll re-open in the same place, even if you quit and reload
  !SiteMap. If the debug window was open when you quit, it will automatically
  re-open when you next load !SiteMap.


robots.txt files
~~~~~~~~~~~~~~~~
If you enable the 'Obey robots.txt' option in the main window, SiteMap will
automatically detect any robots.txt files stored in the root directory of your
website (i.e. in the directory you drag to SiteMap, not its sub-directories)
and exclude any files listed in the "User-Agent: *" section.
Additionally, you can exclude files specifically for SiteMap by specifying them
after a "User-Agent: pv_sitemap" declaration.
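
As an illustration of the combined effect of the two sections, the Python
sketch below uses the standard urllib.robotparser module. Python's parser
lets a specific User-Agent section override the * section, so the sketch
checks both explicitly in order to exclude a file listed in either. The
function name excluded_by_robots and the sample robots.txt are assumptions,
and SiteMap's own parsing may differ in detail.

```python
from urllib.robotparser import RobotFileParser


def excluded_by_robots(robots_text, web_path, agent="pv_sitemap"):
    """Return True if web_path is disallowed for '*' or for agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    path = "/" + web_path.lstrip("/")
    # Excluded if either the general section or the agent's section says so.
    return not (parser.can_fetch("*", path) and parser.can_fetch(agent, path))


# Hypothetical robots.txt, for demonstration only.
ROBOTS = """\
User-Agent: *
Disallow: /private/

User-Agent: pv_sitemap
Disallow: /drafts/
"""

for page in ("index.html", "private/notes.html", "drafts/new.html"):
    print(page, excluded_by_robots(ROBOTS, page))
```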


The sitemap.html file
~~~~~~~~~~~~~~~~~~~~~
By default !SiteMap will produce a file called "sitemap/html" containing a
map of your site. Although this is designed to be incorporated into your
site by adding the relevant site specific headers and footers (see the
templates options), it also uses some custom stylesheet classes. These are
included in the default (-none-) template, but for reference, they are listed
below. Each nested level of list item has its own class, so that you can
customise the output by editing the relevant CSS file.

	ul.sitemap { margin: 0px; padding: 0px;	background: #ffffff; }
	ul.sitem1 { margin: 4px; padding: 4px;	background: #ffffff; }

	li.sitem1 {
		font-weight: bold;
		list-style-type: none;
		list-style-image: none;
		line-height: 1.5em;
		margin: 0px 8px 4px 12px;
		padding: 2px 4px 4px 4px;
		border-style: outset;
		border-width: 1px;
		border-color: #22ff66;
		background: #ddffee;
	}

	li.sitem2 {
		font-weight: normal;
		list-style-type: circle;
		list-style-image: url(/i/bullet2.gif);
		line-height: 120%;
		margin: 0px 0px 4px 12px;
		padding: 0px;
		border-style: none;
	}

	(These are the ones I use for my www.vigay.com site)
	
	If you have more than 2 levels of nested lists, you may want to create
	more stylesheet definitions, using "ul.sitemX" and "li.sitemX" where X
	is changed to the level of nesting you require. Hopefully you'll be able
	to understand the method used from the example code above.


Remote Filing Systems
~~~~~~~~~~~~~~~~~~~~~
There is no reason why SiteMap shouldn't work across network and shared filing
systems. I've tested !SiteMap as working normally on the following systems, but
please contact me if you encounter any problems or compatibility issues.

LanMan98 on a Buffalo Network Attached Storage (NAS) device.
ShareFS (SiteMap on Iyonix, scanning website stored on a Risc PC)


Debugging
~~~~~~~~~
Click on the 'Debug...' option from the main menu to open the debug/logging
options window. This window has a number of additional options, all of which
should be OFF for normal use.

'Debug' (back in the main window) is the main control and turns logging on or
off.

'Log all links detected' will give a running total of all links detected in
each file. These will be stored in the main log file, which will grow to
become quite large if you have lots of links.

'Cumulative <> count' will give a running total of the number of < and >
characters for every line in each HTML file detected. This will dramatically
enlarge the log file and also slow down the scanning speed. However, it can
be useful if you need to track down a rogue mismatch of characters.
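
The kind of counting involved, both here and in the 'mismatch' report, can be
sketched in Python as below. The function name count_angle_brackets is
hypothetical, and the rule used to pick the "first occurrence" (the first
point where a < appears inside an unclosed tag, or a > appears with no tag
open) is an assumption; SiteMap may define it differently. The sketch also
makes no allowance for < or > inside comments or scripts.

```python
def count_angle_brackets(html):
    """Return (number of <, number of >, offset of first mismatch or None)."""
    opens = closes = 0
    depth = 0            # 1 while inside a tag, 0 while outside
    first_offset = None
    for offset, ch in enumerate(html):
        if ch == "<":
            opens += 1
            depth += 1
        elif ch == ">":
            closes += 1
            depth -= 1
        # Anything other than 0 or 1 means the brackets have gone astray.
        if first_offset is None and depth not in (0, 1):
            first_offset = offset
    return opens, closes, first_offset


opens, closes, where = count_angle_brackets("<p><<strong>text</strong></p>")
print(opens, closes, None if where is None else "&%X" % where)
```

For plain ASCII files the character offset returned here is the same as the
byte offset.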

'Comment nesting' will additionally give a running count of the number of
nested comments or SSIs encountered.


To Do List
~~~~~~~~~~
There are a number of features that I'd like to add at some point, but
firstly I'd like to get everything working and tested. Once it's "doing what
it says on the tin" I hope to look at the possibility of adding the following
functions:-

 * Work out what to do about sub-directories with pages in them, but no
   index.html file.
  
 * Test with sites stored on remote filing systems (such as SunFish)
   

Known Bugs
~~~~~~~~~~
SiteMap is quite a complex application, and consists of a very efficient but
tightly recursive search algorithm. It's quite likely that if anything causes
it to go wrong whilst scanning, a file may be accidentally left open -
although I've included quite stringent error checking. If a file does
accidentally get left open, you may find my !CloseFiles utility useful. This
can be downloaded from http://www.vigay.com/software/closefiles.html


Author
~~~~~~
SiteMap is the copyright of Paul Vigay.

Email: nospam@vigay.com
  Web: http://www.vigay.com/software/


© 2009 Paul Vigay
