Site File Parameters Reference


Value Types

Most parameters take one of the following value types:

All parameters use the format Parameter: value. Each line starts with optional whitespace, followed by the parameter name, followed immediately by a colon and more whitespace, followed by the value.
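As a rough illustration of that line format, here is a minimal Python sketch of a parser for a single parameter line. It is not sitescooper's own parser (which is written in Perl); the name parse_param_line and the set of characters allowed in a parameter name are assumptions made for the example.

```python
import re

# Assumed reading of the "Parameter: value" rule: optional whitespace,
# the parameter name, a colon, more whitespace, then the value.
# The character class for names is a guess, not taken from sitescooper.
PARAM_LINE = re.compile(r"^\s*([A-Za-z0-9_]+):\s+(.*)$")


def parse_param_line(line):
    """Return (name, value) for a parameter line, or None if it does not match."""
    match = PARAM_LINE.match(line)
    if match is None:
        return None
    name, value = match.groups()
    return name, value.strip()
```

For instance, parse_param_line("  Name: Example Site") gives the pair ("Name", "Example Site"), and a line with no colon gives None.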


The Details

Here are the details on each parameter and what it does. Note that only the URL parameter is required; all the others are optional and have default values.

URL

Top-level URL of the site. Sitescooper will always start scooping the site by requesting this URL. This is required, and must appear before any other parameters.


Name

Name of the site. This name will be used in both the filename of the output file and the name of the resulting PRC database for Palm conversion, if needed. This is pretty much required.


Description

A one-line description of the site. Optional, but a great help!


AuthorName

The name of the site file's author.


AuthorEmail

The email address of the site file's author.


Active

Whether this site is active, i.e. should be scooped.


SizeLimit

The top size limit, in kilobytes, of a scoop from this site. By default, this is left at 0, which means that the process-wide limit is inherited: 300K by default, or whatever the user specified on the command line.
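The inheritance rule can be sketched in a few lines of Python. This only restates the rule above; the function name effective_size_limit is hypothetical, and the 300K figure is the default quoted in the text.

```python
# Process-wide default quoted in the text above, in kilobytes.
DEFAULT_PROCESS_LIMIT_KB = 300


def effective_size_limit(site_limit_kb=0, process_limit_kb=DEFAULT_PROCESS_LIMIT_KB):
    """Return the limit that applies to a site, in kilobytes.

    A SizeLimit of 0 means "inherit the process-wide limit".
    """
    return site_limit_kb if site_limit_kb > 0 else process_limit_kb
```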


MinPages

The minimum number of pages that this site requires. If a scoop generates fewer than this number of pages, it is assumed the site has not been updated and the scoop is ignored. By default, this is set to 1.


Levels

How many levels this site has. If unspecified, the site is assumed to be made up of one page, i.e. a 1-level site.


AddURL

An easy way to add additional URLs to a site, along with the top-level URL.


LayoutURL

LayoutURL is similar to URL, but defines a layout for a specific pattern. If a page's URL falls within this pattern, and parameters are defined for this layout, but not defined by the site file, the layout parameters will be used.

This allows an easy way to specify default values that several sites can use; just define them in a layouts file such as lib/layouts.site and they will be inherited.


ExceptionURL

ExceptionURL is like LayoutURL, but it takes priority over both LayoutURL and the normal site file rules. The idea is that you can define a site using URL, then after defining the rules for the main site pages, you can specify rules for a different set of pages that will be encountered while scooping the site.


RequireCookie

If a site requires that an HTTP cookie be set before it can be accessed, use this parameter. It takes a two-part value, consisting of the cookie's hostname and key separated by whitespace. For example, the economist_full.site site file uses RequireCookie: www.economist.com econ-key.
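To make the two-part value concrete, here is a small Python sketch that splits it into hostname and key. The function name parse_require_cookie is hypothetical, and raising an error on a malformed value is an assumption; how sitescooper itself reacts to a bad value is not covered here.

```python
def parse_require_cookie(value):
    """Split a RequireCookie value into (hostname, key).

    Assumes exactly two whitespace-separated parts, as described above.
    """
    parts = value.split()
    if len(parts) != 2:
        raise ValueError("expected 'hostname key', got: %r" % value)
    host, key = parts
    return host, key
```

With the value from economist_full.site, this returns ("www.economist.com", "econ-key").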


Rights

The rights of reproduction for that site, in addition to whatever will be scooped. This allows you to append arbitrary copyright text to the output, instead of the default "End of snarf - copyright retained by original providers." message.


TableRender

How tables should be rendered in the output. There are 3 possible values:


LevelNLinksStart / IssueLinksStart / ContentsStart / StoryStart

Specify the start-of-links-area or start-of-story-area pattern for a page at that level.


LevelNLinksEnd / IssueLinksEnd / ContentsEnd / StoryEnd

Specify the end-of-links-area or end-of-story-area pattern for a page at that level.


LevelNLinksIncludeStartPattern / IssueLinksIncludeStartPattern / ContentsIncludeStartPattern / StoryIncludeStartPattern

Causes the start-of-links-area or start-of-story-area pattern to be included in the resulting scooped HTML. By default, it is not.


LevelNLinksIncludeEndPattern / IssueLinksIncludeEndPattern / ContentsIncludeEndPattern / StoryIncludeEndPattern

Causes the end-of-links-area or end-of-story-area pattern to be included in the resulting scooped HTML. By default, it is not.
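The interaction between the Start, End and Include parameters can be sketched as follows. This is an illustration in Python, not sitescooper's code: it assumes the patterns are regular expressions, that the end pattern is searched for after the start pattern, and that a missing pattern yields no result. The name extract_area is hypothetical.

```python
import re


def extract_area(html, start_pattern, end_pattern,
                 include_start=False, include_end=False):
    """Return the text between the start and end patterns, or None.

    include_start / include_end mirror the ...IncludeStartPattern and
    ...IncludeEndPattern parameters: by default the matched patterns
    themselves are left out of the result.
    """
    start = re.search(start_pattern, html)
    if start is None:
        return None
    # Look for the end pattern only after the start pattern (assumption).
    offset = start.end()
    end = re.search(end_pattern, html[offset:])
    if end is None:
        return None
    begin = start.start() if include_start else start.end()
    finish = offset + (end.end() if include_end else end.start())
    return html[begin:finish]
```

For a page containing "<!-- begin --><p>Story</p><!-- end -->", the default call returns just "<p>Story</p>", while include_start=True keeps the "<!-- begin -->" marker in the result.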


LevelNPrint / IssuePrint / ContentsPrint

Specify whether a links-level page should be printed, i.e. output. The default is 0 for text-style output, or 1 for HTML-style output. (HTML-style in this case is defined as supporting hyperlinks, including iSilo etc.)

There is no StoryPrint, as stories are always printed.


LevelNCacheable / LevelNCachable / IssueCacheable / ContentsCacheable / StoryCacheable

Whether pages at that level should be cached. The default is 0, meaning they are not cached (i.e. it is assumed that a link to a file with the same URL may not contain the same text next time around). Both Cacheable and Cachable can be used, because it's a tricky word to spell ;)


LevelNDiff / IssueDiff / ContentsDiff / StoryDiff

Whether pages at that level should be diffed, i.e. their contents compared against those of a previous run, and only the new elements used.
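The idea of keeping only the new elements can be illustrated with a very small Python sketch. It treats each line as one element and keeps the lines that did not appear in the previous run; that granularity is an assumption for the example, and sitescooper's actual diffing may work differently. The name new_elements is hypothetical.

```python
def new_elements(previous_lines, current_lines):
    """Return the lines of the current run that were absent from the previous run.

    Order is preserved; each line counts as one "element" (an assumption).
    """
    seen = set(previous_lines)
    return [line for line in current_lines if line not in seen]
```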


LevelNUseTableSmarts / IssueUseTableSmarts / ContentsUseTableSmarts / UseTableSmarts

Should the automatic trimming of narrow tables take place? Narrow tables are defined as tables with a width of less than 40% or less than 250 pixels. The default is 1.

UseTableSmarts can also be called StoryUseTableSmarts for clarity.
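The definition of a narrow table given above can be written as a small predicate. This Python sketch only encodes the two thresholds from the text; the function name is_narrow_table, the handling of the width attribute as a string, and the treatment of tables without a width are assumptions.

```python
def is_narrow_table(width):
    """Return True if a table width attribute counts as "narrow".

    width is the raw attribute value, e.g. "35%" or "200".
    Narrow means less than 40% or less than 250 pixels.
    A table with no width attribute is treated as not narrow (assumption).
    """
    if width is None:
        return False
    width = width.strip()
    if width.endswith("%"):
        return float(width[:-1]) < 40
    return float(width) < 250
```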


LevelNFollowLinks / IssueFollowLinks / ContentsFollowLinks / StoryFollowLinks

Should links which fit into the StoryURL, etc. pattern for that level be followed to parallel pages, i.e. other pages at the same level? This allows a site to handle situations where stories or links to stories are split into "page 1 of 4" etc.

The default is 0.


LevelNAddURL / IssueAddURL / ContentsAddURL / StoryAddURL

Add a URL to the list that needs to be scooped at that level.


LevelNURL / IssueURL / ContentsURL / StoryURL

The URL pattern that a page for that level must fit into. Multiple URLs can be specified on multiple lines.

Generally, pages at the highest level do not need this to be specified, unless the FollowLinks parameter for that level is turned on.


ContentsFormat

The format for the links-level pages. Currently either html or rss can be used; HTML is the default, RSS indicates that the XML RSS format is used by that site.


LevelNSkipURL / IssueSkipURL / ContentsSkipURL / StorySkipURL

If a URL at the given level matches this URL pattern, it will not be examined by sitescooper. (Note: versions of sitescooper before 2.3.x do not support LevelNSkipURL or IssueSkipURL.)


StoryHeadline

This specifies a regular expression pattern used to search for the story's headline or title. This is primarily useful for DOC-format output, where a bookmark is created at the start of each story using the headline as a bookmark title.

The story HTML is searched for this pattern before StoryStart and StoryEnd stripping takes place.

It should be specified as a regular expression containing a single (pattern) subexpression; the text that matches the section between the parentheses is used as the headline text.
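Here is a sketch of how a pattern with a single subexpression yields a headline, using Python's re module in place of Perl. The pattern shown is a made-up example for a site that puts its headline in an <h1> tag, and the case-insensitive, multi-line matching is an assumption rather than documented sitescooper behaviour.

```python
import re

# Hypothetical pattern: the single (...) group captures the headline text.
HEADLINE_PATTERN = r"<h1[^>]*>(.*?)</h1>"


def extract_headline(html, pattern):
    """Return the text matched by the pattern's subexpression, or None."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()
```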


StoryToPrintableSub

A Perl regular expression substitution used to convert story links to a form more suitable for sitescooper output. For example, many sites provide multiple views of a story, including a "printable" view for printing, and often the "printable" view is more amenable to scooping than the non-printable version. StoryToPrintableSub allows you to convert the story URLs to this "printable" format.

The StoryURL pattern must match the "printable" version. It does not need to match the original, "non-printable" format.

The format of a Perl substitution is s,from-pattern,replacement, where from-pattern is a Perl regexp pattern, generally containing (pattern) subexpressions, and replacement is the replacement text, containing \number markers where the strings matched by the parenthesized subexpressions are inserted.
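To show what such a substitution does to a URL, here is a Python sketch that splits an s,from-pattern,replacement, value and applies it with re.sub. It is an approximation: Python's regular expressions differ from Perl's in places, the splitting is naive (it breaks if the pattern itself contains the delimiter), and the example URLs and function names are hypothetical.

```python
import re


def parse_substitution(spec):
    """Split 's,from-pattern,replacement,' into (from_pattern, replacement).

    The character after the leading 's' is taken as the delimiter.
    """
    if len(spec) < 2 or spec[0] != "s":
        raise ValueError("not a substitution: %r" % spec)
    delimiter = spec[1]
    parts = spec[2:].split(delimiter)
    if len(parts) != 3 or parts[2] != "":
        raise ValueError("expected s,from,replacement, but got: %r" % spec)
    return parts[0], parts[1]


def to_printable(url, spec):
    """Apply the substitution once, as Perl's s,,, does without the g flag."""
    pattern, replacement = parse_substitution(spec)
    return re.sub(pattern, replacement, url, count=1)
```

With the made-up value s,/story/(\d+)\.html$,/print/\1.html, the URL http://www.example.com/story/1234.html becomes http://www.example.com/print/1234.html; a URL that does not match the from-pattern is returned unchanged.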

See the FAQ entry on multi-page stories in the Writing a .site File document for more information.


StoryLifetime

Very old stories, by default older than 90 days, are not scooped. This limit can be changed using this parameter.
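The age check can be sketched like this in Python. It assumes the lifetime is counted in days, matching the 90-day default mentioned above, and that a story's age is measured from some last-modified timestamp; both the function name is_too_old and the choice of timestamp are assumptions.

```python
from datetime import datetime, timedelta


def is_too_old(last_modified, now, lifetime_days=90):
    """Return True if a story is older than the allowed lifetime (in days)."""
    return now - last_modified > timedelta(days=lifetime_days)
```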


StoryHTMLHeader

Additional HTML which should be added to the top of any story page.


StoryHTMLFooter

Additional HTML which should be added to the bottom of any story page.


UseAltTagForURL

If an <img> tag refers to an image which matches this URL pattern, its ALT text will be used in place of the image. The default is that no ALT text is used. Note: this URL pattern is for the image's URL, not the URL it may be linked to (if there is one).


NeedLoginURL

If a page requires HTTP authentication to access, you can specify its pattern here to avoid a needless HTTP transaction. Normally, the page is requested first, and the server responds with a request for authentication; then the page is re-requested. This allows you to skip the initial request for a minor speed-up.


ImageURL

If an <img> tag refers to an image which matches this URL, that image tag will be left in the scooped document. Normally all images are stripped. Note, however, that not all output formats support images.


ImageOnlySite

Specify to sitescooper that no text is expected to appear on the resulting page; the only thing scooped is the image.


ImageScaleToMaxWidth

Specify the maximum width of an image. By default, this is 300, the rough width in pixels of the Palm handheld's screen; sites with large images, such as comics, can specify a larger value, which requires the user to scroll around the image but generally improves the readability of the picture.

This is not the way to solve the problem, by the way, so this parameter may go away or change in some way in the future...
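For a feel of what a maximum width means for an image's dimensions, here is a Python sketch that shrinks an image to fit the limit. Preserving the aspect ratio and leaving smaller images untouched are assumptions for the example; the function name scale_to_max_width is hypothetical, and 300 is the default quoted above.

```python
def scale_to_max_width(width, height, max_width=300):
    """Return (new_width, new_height) so that new_width <= max_width.

    Images already within the limit are returned unchanged; larger ones
    are shrunk proportionally (aspect ratio preserved by assumption).
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))
```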


ImageProcess

A chunk of Perl code which will be used to transform every image that sitescooper downloads.

The filename of the image downloaded from the website is passed in as $img_in, and the processed image should be written to the file named in $img_out. Set $img_out to the undef value if you want to skip that image.

This parameter is intended to allow the use of image rotation, resizing or quantizing code. For these purposes, the PerlMagick module may prove very useful.


URLProcess

A chunk of Perl code which will be used to transform every URL that sitescooper needs to download. This allows a huge degree of control over the links that sitescooper operates on.

The URL to operate on is passed in as $_, and the post-processed URL is expected to be in $_ afterwards. Set $_ to the undef value if you want to skip that URL.

Note that links which do not pass the StoryURL, etc. patterns will be dropped before URLProcess takes effect, so make sure those patterns are open enough for this.
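The ordering described in that note can be sketched in Python: links are first checked against the URL pattern, and only the survivors reach the URL-processing step, which may rewrite a URL or skip it. This is an analogy only, since the real URLProcess value is Perl code operating on $_; the function names, the example rewrite, and the use of a whole-string match for the pattern are all assumptions.

```python
import re


def strip_query(url):
    """Hypothetical URLProcess step: drop the query string, skip ad links."""
    if "/ads/" in url:
        return None  # plays the role of setting $_ to undef
    return url.split("?", 1)[0]


def select_urls(links, story_url_pattern, url_process):
    """Filter links by pattern first, then apply the processing step."""
    selected = []
    for link in links:
        # Links that fail the pattern are dropped before url_process runs.
        if re.fullmatch(story_url_pattern, link) is None:
            continue
        processed = url_process(link)
        if processed is None:
            continue
        selected.append(processed)
    return selected
```

Note that the pattern here has to be loose enough to accept the unprocessed links (query string included), which is the point the note above makes.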


LevelNHTMLPreProcess / IssueHTMLPreProcess / ContentsHTMLPreProcess / StoryHTMLPreProcess

A chunk of Perl code which will be used to transform HTML pages before sitescooper operates on them. This takes place after sitescooper strips the StoryStart and StoryEnd sections, and after the StoryHTMLHeader and StoryHTMLFooter sections are added (where applicable).

The text to operate on is passed in as $_, and the output is expected to be in $_ afterwards.


StoryPostProcess

A chunk of Perl code which will be used to transform every piece of text that sitescooper outputs. Confusingly, this operates on pages at all levels, not just story-level pages; sorry about that! This takes place after sitescooper performs its own cleanup, StoryStart and StoryEnd stripping, table-stripping, etc.

This parameter is deprecated, since the same processing is run for HTML output, text output, DOC format etc., and levels are not differentiated. Using the LevelNHTMLPreProcess parameters is recommended instead.


EvaluatePerl

Evaluate some arbitrary Perl code before running that site. This takes Perl code as a value.

If the $skip_site variable is set to a non-zero value after the EvaluatePerl code is run, the site is skipped.


ExtraISiloIxlTags

Extra XML tags for the iSilo .ixl file generated when using the -isilox or -misilo option. This parameter is ignored when using -isilo (no 'x') or any other output format. The value specified is inserted at the end of the temporary .ixl file which is generated when using iSiloXC to create the Palm document.

This allows you to override the default values to give, for example, different image processing commands or category commands. The value must be a complete top-level tag for an .ixl file. Any tags here override default sections of the same name but augment the rest of the default .ixl. (The default .ixl file is either internally generated or defined in a file specified by the ISiloDefaultIxlFile tag in sitescooper.cf.) This takes advantage of the fact that iSiloXC will accept multiply specified options in the .ixl file and simply use the last one.

To get an idea of how to specify these tags, create an .ixl file with the iSiloX GUI tool and load it into a text editor. Alternatively, learn the format from the .ixl file format documentation on the iSilo website.

Note that not all tags are useful. For example, the default generated .ixl file sets the link depth to 9 (using <MaximumDepth value="9"/>), but the link depth is really determined by the normal sitescooper configuration. This is because iSiloXC is only used to create the doc for the already-scooped (hence already pruned) web site.

Example:
The following set of tags overrides the image processing to force 16-bit depth, no alt-text, no dithering, and a maximum width x height of 900 x 600 (among others). I use it in my own version of Dilbert.site to get the best image for my Palm T|T. (Otherwise, the default is 4-bit, dithered, with alt-text and a maximum width x height of 224 x 198.)

  ExtraISiloIxlTags: {
    <ImageOptions>
      <AltText value="exclude"/>
      <Images value="include"/>
      <ResizeLargeImages value="yes"/>
      <ImproveContrast value="no"/>
      <Dither value="no"/>
      <MaximumWidth value="900"/>
      <MaximumHeight value="600"/>
      <Compress value="yes"/>
      <BitDepth1 value="exclude"/>
      <BitDepth2 value="exclude"/>
      <BitDepth4 value="exclude"/>
      <BitDepth8 value="exclude"/>
      <BitDepth16 value="include"/>
    </ImageOptions>
  }

