Writing a Sitescooper .site File

This is a step-by-step guide to writing a .site file for your favourite site. I'll use an imaginary news site, www.whizzynews.com, to demonstrate.

The Basics
Multi-Level Sites
StoryURL and Regular Expressions
Diffing
Trimming
Headlines
3-Level Sites
Including Images
Frequently Asked Questions

The Basics

The basic format of the .site files is like this. Comments are started with a # sign, and continue up until the end of that line. The site parameters generally start with a capital letter and are separated from their value by a colon and a space.

# a comment
#
URL: xxxx
Parameter1: value
Parameter2: value
[... etc.]

First, start with the URL. This is the URL that sitescooper will start crawling that site from. Also you might as well add a readable name for the site, and a description of what's on offer at the site.

URL: http://www.whizzynews.com/
Name: Whizzy News
Description: News about whizzy stuff, updated daily

You can leave out the http:// bit in the URL if you like, but I think it's a little more readable if it's there. The URL line must always be first in a .site file.

Finding a good URL can be tricky -- the front page isn't always the best one, as it's sometimes "optimized" for MSIE and/or Netscape. If the site has a "Palmpilot version", maybe that would be a better URL to start with.

A handy way to write a site file is to use the "AvantGo version" of many sites; quite often these are not easy to track down, but if you search AltaVista using the keywords url:avantgo -url:avantgo.com sitename, or link:.subs avantgo sitename, it may find it. Another way to do it is to search for link:avantgo.com/mydevice/autoadd.html, which is AvantGo's interface allowing site authors to add their own sites, providing the details in the URL.

Alternatively if you can find the .subs file for that site, I've included a conversion script called subs-to-site.pl which does a good job of converting them into a sitescooper site file.

The next most important thing is the number of levels to the site. A 1-level site will only have the initial page downloaded. Typically this would be something like NTK, Slashdot, Memepool, RobotWisdom etc. To indicate a site like this, use the line:

Levels: 1

in your .site file.

A 2-level site would have an index page with links to the full stories elsewhere. This is quite common too: LinuxToday, The Register, Wired News (when you're downloading it section-by-section that is). To do this you'd use "Levels: 2", unsurprisingly.

A 3-level site is something like The Onion, New Scientist or other periodicals, which come out with "issues", each of which is a page of links to stories: "Levels: 3".

If you're writing a 1-level site, or you're using a "Palmpilot version", you could almost stop here. You may want to trim off bits from the top and bottom, in which case check the Trimming section; also, you might want to only download the differences between the current page and what you'd previously downloaded; check Diffing in that case. Otherwise carry on to...

Multi-Level Sites

So far, you've got something like

URL: http://www.whizzynews.com/
Name: Whizzy News
Levels: 2

This may do the trick for you -- the default behaviour is for sitescooper to get the index page ("URL"), then follow any links that go to another page on the same host (using the same protocol and port if present). In the example above, that means any other page under http://www.whizzynews.com/.
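To make the default link-following rule concrete, here is a minimal Python sketch of a "same host, protocol and port" test. It is only an illustration of the rule described above, not sitescooper's own code (sitescooper is written in Perl); the function name same_site and the example links are my own inventions.

```python
from urllib.parse import urljoin, urlsplit


def same_site(start_url, link):
    """Return True if link uses the same protocol, host and port as start_url."""
    base = urlsplit(start_url)
    # Resolve relative links such as "/stories/1.html" against the start URL.
    target = urlsplit(urljoin(start_url, link))
    return (base.scheme, base.hostname, base.port) == (
        target.scheme, target.hostname, target.port)


start = "http://www.whizzynews.com/"
links = [
    "/stories/1.html",                       # relative link, same host
    "http://www.whizzynews.com/about.html",  # same host
    "http://ads.example.com/banner.html",    # different host
    "https://www.whizzynews.com/login",      # different protocol
]
followed = [link for link in links if same_site(start, link)]
print(followed)
```

Only the first two links survive. Note that this sketch treats an explicit ":80" and a missing port as different, which a real crawler may or may not do.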

In sitescooper terms, in the example above, the URL http://www.whizzynews.com/ is the "contents" page; any pages linked to from that page are the "story" pages. Sitescooper is optimised to treat story pages on a multi-level site as static pages (i.e. their content does not change), while contents pages are dynamic (their content may change over time).

StoryURL and Regular Expressions

If you want to avoid downloading the masthead, search page, site map, RealAudio files, and whatever other clutter may be linked to from a site's contents page, you'll need to specify a StoryURL. This is a Perl regular expression which a link needs to match to be downloaded.

Regular expressions are a very powerful way to match text strings, and once you get the hang of it you'll wonder how you ever got by without them. If you're a regexp newbie, don't worry, I'll give a quick overview as we go by.

Anyway, StoryURL is a regexp, and it must match the whole URL of a link, not just part of it, i.e. specifying

StoryURL: http://www.whizzynews.com/stories/

will *only* match "http://www.whizzynews.com/stories/", which is not what you want. If you want to match any page under /stories, do this:

StoryURL: http://www.whizzynews.com/stories/.*

(".*" means "match any number of characters"). Even better, narrow it down to avoid non-HTML pages like this:

StoryURL: http://www.whizzynews.com/stories/.*\.html

("\." means "match a literal dot"; an unescaped "." has a special meaning in regexps.) Just for convenience, StoryURL (and ContentsURL, see later) allow you to leave off the hostname and protocol if they're the same as the URL, so you could write the above line as:

StoryURL: /stories/.*\.html

See, easy. Some other quick regexp tips:

. matches one character, regardless of what it is
x+ matches one or more 'x' in a row (x could be a character, a character class, or . to match anything)
x* matches zero or more 'x' in a row
x? matches zero or one 'x's
\d matches a single digit (this is a character class)
\d+ matches one or more digits in a row
(foo|somethingelse) matches either "foo" or "somethingelse" (and, optionally, stores the matched value for later use)

Some symbol characters (. + * ( | ) [ and ] for example) have special meanings. To just use them as normal characters, prefix them with a backslash, e.g. \., \+, \*.
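If you'd like to try these tips out, here is a small sketch using Python's re module, which agrees with Perl for all of the constructs listed above. The helper name matches is my own, and each pattern is tested against the whole string.

```python
import re


def matches(pattern, text):
    """True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, should it match?)
examples = [
    ("a.c", "abc", True),                  # . matches any one character
    ("x+", "", False),                     # x+ needs at least one x
    ("x+", "x", True),
    ("x*", "", True),                      # x* is happy with none at all
    ("x?", "xx", False),                   # x? allows at most one x
    (r"\d", "7", True),                    # a single digit
    (r"\d+", "2024", True),                # one or more digits
    ("(foo|bar)", "bar", True),            # either alternative
    (r"index\.html", "indexXhtml", False), # \. only matches a real dot
]
for pattern, text, expected in examples:
    assert matches(pattern, text) == expected
    print(f"{pattern!r:16} vs {text!r:14} -> {expected}")
```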

So a really advanced StoryURL example would be:

StoryURL: http://www.whizzynews.com/stories/.+/\d+_\d+\.(html|htm)

Which means "match any document under a subdirectory of /stories, with a filename consisting of 2 numbers separated by an underscore, ending in either .html or .htm".
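The "must match the whole URL" rule is easy to check for yourself. The following is a rough sketch in Python (not sitescooper's Perl code); the helper story_url_matches and the sample links are made up for the demonstration, and re.fullmatch stands in for the whole-URL match described above.

```python
import re


def story_url_matches(pattern, link):
    """True if the StoryURL-style pattern matches the entire link."""
    return re.fullmatch(pattern, link) is not None


base = "http://www.whizzynews.com/stories/"
advanced = r"http://www.whizzynews.com/stories/.+/\d+_\d+\.(html|htm)"

story = base + "tech/1999_42.html"    # hypothetical story page
index = base + "tech/index.html"      # hypothetical section index
top = base + "1999_42.html"           # not in a subdirectory

print(story_url_matches(base, story))         # False: pattern stops at /stories/
print(story_url_matches(base + ".*", story))  # True
print(story_url_matches(advanced, story))     # True
print(story_url_matches(advanced, index))     # False: no number_number filename
print(story_url_matches(advanced, top))       # False: no subdirectory
```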

Check the .site files in the "site_samples" directory for good examples of StoryURLs.

For more information on regular expressions in general, and the Perl variety that sitescooper uses, I recommend checking out some of these pages. First of all, Steve Litt at Troubleshooters.com has put together a great guide: Steve Litt's PERLs of Wisdom: PERL Regular Expressions. Highly recommended. Next up, there's A Tao Of Regular Expressions, which is quite good. Also, weblogger Jorn Barger has written up a good page of links on regexps at Regular Expressions Resources on the Web; the "Perl" section is most relevant to sitescooper's regexp format. The About.com guide to Perl has a great tutorial on them too.

Diffing

In general, you'll be downloading a site which has at least one page that changes over time, with new links being added to it continually (in sitescooper's terms, this page is "dynamic"). Sitescooper handles this for most sites, by tracking which story pages have already been read, and not re-downloading them or, in the worst case, not converting the same text for a given URL twice. Here are the mechanisms it uses to avoid this:

if the page has already been read, and sitescooper's heuristics indicate that that page is a static "story" page, it will not be downloaded again;
if the heuristics indicate that a page is dynamic, but the Last-Modified date is the same as for the previous time it was downloaded, it will not be output again;
and finally if the page is dynamic but the converted text is exactly the same as it was last time it was downloaded, it will not be re-output.
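The three checks above can be sketched as a single decision function. This is a simplified Python illustration, not sitescooper's implementation: the Seen record, the should_output name and the cache dictionary are all hypothetical, and the sketch folds "don't download" and "don't output" into one yes/no answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seen:
    """What a previous run remembered about a URL (hypothetical record)."""
    last_modified: Optional[str]
    text: str


def should_output(url, is_static, last_modified, text, cache):
    """Decide whether a page should appear in this run's output."""
    previous = cache.get(url)
    if previous is None:
        return True        # never seen before
    if is_static:
        return False       # check 1: static story page already read
    if last_modified is not None and last_modified == previous.last_modified:
        return False       # check 2: dynamic page, same Last-Modified date
    return text != previous.text  # check 3: dynamic page, same converted text


front = "http://www.whizzynews.com/"
cache = {front: Seen("Mon, 01 Jan 2001 00:00:00 GMT", "old front page")}

print(should_output(front, False, "Tue, 02 Jan 2001 00:00:00 GMT", "new front page", cache))
print(should_output(front, False, "Mon, 01 Jan 2001 00:00:00 GMT", "new front page", cache))
```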

However, some sites, such as Slashdot, Ars Technica, BluesNews and others, display their stories on one big page, with new stories being added to the existing page as time goes on. Obviously you don't want all the old stories every time a new one appears, so sitescooper also supports "diffing".

"Diffing" is when sitescooper compares the page with what it saw on a previous snarfing run, and only reports the differences. (The word "diffing" comes from the UNIX tool, diff(1), BTW.)

Quite frequently sites that need this are 1-level sites, although sometimes it can be handy to use diffing on 2-level or 3-level sites' contents pages, if the text from these pages is downloaded as well as the links to stories.
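To give a feel for what diffing does, here is a rough Python sketch that keeps only the lines which are new since the last run. It uses Python's standard difflib module rather than whatever sitescooper does internally, and the function name new_lines and the sample stories are invented for the example.

```python
import difflib


def new_lines(previous, current):
    """Return the lines of current that were not in previous, in order."""
    old, new = previous.splitlines(), current.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute lines that only exist in new.
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


last_run = "Story B: widgets recalled\nStory A: widgets launched"
this_run = "Story C: widgets relaunched\n" + last_run

print(new_lines(last_run, this_run))
```

Only "Story C" is reported, since stories A and B were already seen on the previous run.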

Diffing is enabled for a site by specifying one of the following parameters:

StoryDiff: 1

Enable diffing of story pages -- typically used for 1-level sites, where the URL specifies the story page.

ContentsDiff: 1

Enable diffing of the contents page. As explained above, this is typically only needed if you will be printing the text from the contents page (see the ContentsPrint parameter, described later). Many of the parameters that affect story pages can be applied to contents pages as well, by the way -- more on this later.

Trimming

Most story pages feature some kind of excess clutter that you don't need when you're browsing them offline, such as navigation bars, etc. In addition, the contents pages may link to certain pages that match the StoryURL, but you don't want them downloaded.

To work around this, sitescooper supports the ContentsStart/ContentsEnd and StoryStart/StoryEnd parameters. These are Perl regular expressions (again) which should match the beginning and end of the contents or stories. E.g. if you want to trim the retrieved stories so that the text output only runs from the end of the navigation bar, which is helpfully marked in the HTML with the comment "end of nav bar", to the HTML comment "end of story text", you could use the lines:

StoryStart: -- end of nav bar --
StoryEnd: -- end of story text --
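Here is a minimal Python sketch of the trimming idea, using the two patterns above on a made-up page. I'm assuming that the text kept is whatever lies between the end of the start match and the beginning of the end match; sitescooper's exact behaviour at the boundaries may differ, and the trim function is purely illustrative.

```python
import re


def trim(html, start_pattern, end_pattern):
    """Keep only the text between the start and end patterns (assumed exclusive)."""
    begin = 0
    start = re.search(start_pattern, html)
    if start:
        begin = start.end()
    finish = len(html)
    end = re.search(end_pattern, html[begin:])
    if end:
        finish = begin + end.start()
    return html[begin:finish]


page = (
    "<html><body>NAV | HOME | SEARCH\n"
    "<!-- end of nav bar -->\n"
    "<p>Whizzy widgets launched today.</p>\n"
    "<!-- end of story text -->\n"
    "<p>Copyright Whizzy News</p></body></html>"
)
story = trim(page, "-- end of nav bar --", "-- end of story text --")
print(story)
```

The navigation bar and the copyright footer are gone. In this sketch the leftover ">" and "<!" from the comment delimiters stay in the result; if a pattern is not found, nothing is trimmed on that side.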

Headlines

Sitescooper will, by default, treat the first line of each story as a bookmark, and mark it as such so you can jump straight to that story in your DOC reader using its bookmark feature. However, you can optionally indicate a headline pattern, like so:

StoryHeadline: -- HEADLINE=(.*?) --

This will cause sitescooper to look for that pattern in the HTML source of the story page and use whatever is in the brackets, namely .*? in the example above, as a headline for the bookmark.
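The "?" after ".*" matters: it makes the match stop at the first " --" instead of the last one on the page. A quick Python sketch shows the difference (the HTML snippet is made up, and Python's re behaves like Perl here):

```python
import re

html = ("<!-- HEADLINE=Whizzy Widgets Launched --><p>text</p>"
        "<!-- end of story text -->")

lazy = re.search(r"-- HEADLINE=(.*?) --", html).group(1)
greedy = re.search(r"-- HEADLINE=(.*) --", html).group(1)

print(lazy)    # Whizzy Widgets Launched
print(greedy)  # runs on into the rest of the page, up to the last " --"
```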

Sitescooper has built-in support for some headline mechanisms, namely PointCast headline tags in the HTML, and the headline tag used in My Netscape-style RSS files.

3-Level Sites

So far, 2-level sites have been covered. 3-level sites are essentially the same, but they have an additional page at the front which links to a contents page for that "issue":

URL: http://www.whizzynews.com
Name: Whizzy News
Levels: 3
IssueLinksStart: Latest issue is
IssueLinksEnd: Back issues
ContentsURL: http://www.whizzynews.com/issue\d+/.*
[... usual ContentsStart etc. from here on]

As you can see, you can specify IssueLinksStart and IssueLinksEnd parameters, which work in a similar fashion to ContentsStart and ContentsEnd, and a ContentsURL keyword which works a la StoryURL. Most of the keywords used for 2-level sites have a parallel keyword for 3-level sites too.

Including Images

To include images, add an ImageURL line. This takes the same format as StoryURL, etc., but applies to all levels of the site; an image file that matches that URL will be included in the output HTML or iSilo file.

Frequently Asked Questions

I used the # character in my site file, and it isn't used / has no effect / weird things happen. Why?

# is the site file comment character, so anything after that character is ignored by sitescooper. Try to avoid using it if possible. In patterns, use the . character (which matches any character) instead.

How do I handle sites with more than one possible URL pattern for stories?

Just specify more than one StoryURL line; sitescooper will know what to do.

How do I print out the contents- or issue-level pages as well as the story pages?

Set the ContentsPrint parameter to 1. Note that when you output in HTML format, or an HTML-based format like iSilo, this happens automatically -- but for text or DOC output it needs to be switched on by hand.

How do I handle sites where the story's URL stays the same, but I don't want it cached, or the contents page can use a cached version if its URL is the same as on a previous run?

Use the StoryCachable or ContentsCacheable keywords. A value of 0 indicates that the page should not be cached, and a value of 1 means that it can.

My site is downloaded OK, but some of the text is lost because it's in a narrow table, and sitescooper is trimming it out automatically. How do I stop this?

Set StoryUseTableSmarts or ContentsUseTableSmarts to 0.

Is there any way to snarf a site in frames?

Yes. You can either treat it as a site with an extra level (sitescooper will follow FRAME SRC tags as if they were A HREF links), or, if the framed document has a static URL, just treat it as a site with the right number of levels and start crawling the site from the framed document's URL.

How do I skip some stories, even though they match the StoryURL?

Use StorySkipURL (or ContentsSkipURL if you need to skip contents pages). This should contain the regexp pattern of URLs you want to skip.
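As a rough picture of how StoryURL and StorySkipURL interact, here is a Python sketch. It assumes that a link is taken if it matches any of the StoryURL patterns and none of the StorySkipURL patterns, and that skip patterns are matched against the whole URL in the same way as StoryURL; both assumptions are mine, as are the example patterns and the wanted function.

```python
import re

# Hypothetical patterns, as they might appear on StoryURL and StorySkipURL lines.
STORY_URLS = [
    r"http://www\.whizzynews\.com/stories/.*\.html",
    r"http://www\.whizzynews\.com/features/\d+\.html",
]
STORY_SKIP_URLS = [
    r".*/stories/index\.html",
]


def wanted(link):
    """True if link matches some StoryURL pattern and no StorySkipURL pattern."""
    is_story = any(re.fullmatch(p, link) for p in STORY_URLS)
    is_skipped = any(re.fullmatch(p, link) for p in STORY_SKIP_URLS)
    return is_story and not is_skipped


print(wanted("http://www.whizzynews.com/stories/widgets.html"))
print(wanted("http://www.whizzynews.com/stories/index.html"))
```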

Can I add start URLs to a site, in addition to the one in the URL parameter?

Yep, use the AddURL keyword in the site file. Each additional URL will start the downloading again, and the output will all be in one file.

I want to do some mangling of the output from sitescooper, after it's been converted from HTML to text, but before it gets sent to MakeDoc. Can I do this?

Sure, but you need to know some Perl. Sitescooper provides the StoryPostProcess keyword to do this. For example, here's one I use on one of my personal site files:
StoryPostProcess: {
    s/^\s+//gm;
    s/^(\d)/\n$1/gm;
    s/^(Please reload |MAIN INDEX |MAIN NEWS INDEX ).*$//gm;
    s/^(TV Extra |ENTERTAINMENT INDEX |Last Updated: ).*$//gm;
    s/^(NEWS HEADLINES ).*$//gm;
    s/\n\s*\n+/\n\n/gs;
}

There are several things you should note here. Firstly, the entire story is passed in as one string in the Perl default variable, $_, so you need to use /gm on your substitutions to modify it as a multi-line string. Secondly, the StoryPostProcess parameter is a multi-line parameter, so you need to scope it in a vaguely C-like style with { squiggly brackets }. You can use squiggly brackets inside the post-processing code, and sitescooper will keep track of them correctly.
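If you're wondering why the multi-line flag matters, this Python sketch shows the same effect with Python's re.MULTILINE flag, which plays the role of Perl's /m here. The story text is invented, and the pattern is a cut-down version of the ones in the example above.

```python
import re

story = ("MAIN INDEX | NEWS | SPORT\n"
         "Whizzy widgets launched today.\n"
         "Last Updated: Monday\n")

pattern = r"^(MAIN INDEX |Last Updated: ).*$"

# Without the flag, ^ and $ only anchor at the ends of the whole string,
# so nothing matches and the story comes back unchanged.
without_m = re.sub(pattern, "", story)

# With the flag, ^ and $ anchor at every line, so both clutter lines go.
with_m = re.sub(pattern, "", story, flags=re.MULTILINE)

print(repr(without_m))
print(repr(with_m))
```

The emptied lines leave blank lines behind, which is why the Perl example above finishes by collapsing runs of blank lines.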

The site I'm crawling has multi-page stories. How do I deal with this?

Sitescooper will look for a "next page" link using a built-in pattern, and follow it if it links to another article which matches the StoryURL pattern. Alternatively, you can specify a regexp substitution to perform on the story's URL using the StoryToPrintableSub keyword, like so:
StoryToPrintableSub: s,^(http://.+.wired.com/news)/news/(.+\.html)\S*,\1/print_version/\2?wnpg=all,
This (quite complicated) example is for Wired News. It takes the first bit of the URL, from "http://" to "/news", and the second bit, after the second "/news/" string, and converts it to "(first bit)/print_version/(second bit)?wnpg=all". Note the use of \1 and \2 to do this. A Perl regexp substitution generally looks like s,from-pattern,replacement, where the from-pattern contains (bracketed bits) to catch the important parts of the URL, and the replacement string contains \number markers where the strings matched by the bracketed bits are inserted. The \number markers match up to the (bracketed bits) from left to right, so the first one is \1, the second is \2, etc.
You can also use the StoryFollowLinks parameter to follow the links to the other story pages, if that would work.
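To see what the StoryToPrintableSub substitution above does to a URL, here is the same pattern and replacement run through Python's re.sub, which also uses \1 and \2 in the replacement. The story URLs are made-up examples in the shape the pattern expects, not real Wired News addresses.

```python
import re

# The from-pattern and replacement from the StoryToPrintableSub line above.
from_pattern = r"^(http://.+.wired.com/news)/news/(.+\.html)\S*"
replacement = r"\1/print_version/\2?wnpg=all"

story_url = "http://www.wired.com/news/news/politics/story/20000.html"
printable = re.sub(from_pattern, replacement, story_url)
print(printable)
# http://www.wired.com/news/print_version/politics/story/20000.html?wnpg=all

# Anything after ".html" is swallowed by \S* and dropped.
with_query = re.sub(from_pattern, replacement, story_url + "?tw=frontdoor")
print(with_query == printable)
```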

Is there an easy way to scoop sites that provide links to new content using the My Netscape or Scripting News RSS XML formats?

Yep. Set the ContentsFormat parameter to rss, and the URL will be treated as an RSS file instead of a normal HTML page.

Sitescooper supports My Netscape RSS (Rich Site Summary) channels, as seen on my.netscape.com and my.userland.com; check out the tbtf.site file for an example. Netscape's documentation explains what they are and how they're made.

RSS takes care of most of the hard work of writing a .site file. You need to run the "rss-to-site.pl" script, included, and provide the URL of the RSS file as the command-line argument. rss-to-site.pl will spit out a rudimentary site file for that site.

Some sites will need StoryURL changed; also sometimes StoryStart and StoryEnd patterns need to be added to clean up the resulting story HTML.

I have not yet worked out a way of winkling the RSS file's location from the sites, BTW; most of them do not link to it! However, you can try to find the site in My Userland (http://my.userland.com/) or TheWeb.StartsHere (http://theweb.startshere.net/backend/) -- both of these have a "Backend" section where you can find out the URLs they use to get the RSS file.

One of my sites produces lots and lots of output on its first run, because the contents pages have links to really old articles. How can I fix this?

Set the StoryLifetime parameter to something lower than the default, which is 60. The parameter is specified in days. Alternatively, try using the [[MM]], [[YYYY]] and [[DD]] parameters in the URLs to restrict the months, years, and days that articles should be chosen from.