Once you run it, sitescooper will ask you to select your default sites to scoop. See Typical Daily Use, below.
The first time you run sitescooper, it will pop up a list of sites in your editor, and ask you to pick which sites you wish to scoop. This creates a file in your temporary directory called site_choices.txt with these choices. Your temporary directory is the .sitescooper subdirectory of your home directory on UNIX, or C:\Windows\TEMP for Windows users; this can be changed by editing the built-in configuration in the script.
Once you've chosen some sites, it'll run through them, retrieve the pages, and convert them into iSilo format (which is the default). See Changing Output Format if you wish to change this.
Note that the resulting PRC files may be well under 300Kb in size: sitescooper imposes the limit on the raw HTML or text as it goes along, and the conversion tools may compress that data considerably.
Often, when you hit the limit on a site, the missed stories will simply be scooped the next time you run the script, although this depends on the site file.
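As a rough illustration of why the output can end up smaller than the limit, here is a minimal Python sketch of a size cap applied to raw story text before any conversion. It is not sitescooper's own code: the function name scoop_until_limit and the rule that every story after the first oversized one is deferred are assumptions made for the example.

```python
def scoop_until_limit(stories, limit_bytes):
    """Split stories into (kept, deferred) using a cap on raw size.

    Hypothetical helper: the cap is applied to the raw text, before any
    conversion or compression, so the converted file may be much smaller.
    """
    kept, deferred = [], []
    total = 0
    for story in stories:
        size = len(story.encode("utf-8"))
        # Once one story has been deferred, defer the rest too, so that
        # stories are never picked up out of order (an assumption).
        if deferred or total + size > limit_bytes:
            deferred.append(story)
        else:
            kept.append(story)
            total += size
    return kept, deferred
```

In this sketch the deferred stories are simply the ones left for the next run, which mirrors the behaviour described above for sites whose site file allows it.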
If you want to increase the limit, use the -limit argument.
-text: plain text, with all the articles listed one after the other
-doc: DOC format, for reading with AportisDoc or another DOC reader on a Palm handheld. This is essentially plain-text format converted using MakeDoc.
-html: HTML format, with all the indexes and articles listed one after the other in one file, with hyperlinks between them
-mhtml: M-HTML (multiple-page HTML) format, which is the same as HTML except it separates the stories and indexes into separate files with hyperlinks between them
-isilo (the default format): iSilo format, for reading with iSilo on a Palm handheld. This is HTML converted using the iSilo conversion tool
-misilo: M-iSilo format, iSilo format files with multiple pages. This is an M-HTML article converted using the iSilo tool
-plucker: Plucker format, for reading with Plucker on a Palm handheld. This is HTML converted using the plucker-build conversion tool
-mplucker: Multi-page Plucker format, for reading with Plucker on a Palm handheld. Again, the plucker-build conversion tool is used. This format is recommended over the -plucker format, as Plucker has a page-length limitation, which results in pages being split in an ugly fashion.
-richreader: RichReader format, for reading with RichReader on a Palm. This is HTML converted using the RichReader conversion tool
If you want to convert to multiple output formats, you need to run sitescooper once for each output format, and use a shared cache between the separate invocations. Ask on the mailing list for more information on this.
Here's a sample profile file, as an example:
    Name: Bay Area Earthquakes
    Score: 10
    Required: san jose, earthquake.*, (CA|California)
    Desired: Richter scale, magnitude, damage, destruction, fire, flood, emergency services, shaking, shock wave
    Exclude: soccer

And here's James' description of the format:
A profile contains the following:
Name - appears with the output story to identify which profile was matched (required)
Score - indicates how well the profile has to match; higher numbers filter out more stories (optional: default 30)
Required - words that are required to be in the story or it doesn't match (semi-optional)
Desired - words that might be in the story for it to match (semi-optional)
Exclude - words that cannot appear in the story, otherwise it doesn't match (optional)
Obviously, at least one of 'Required' or 'Desired' must be present, or the profile won't match anything.
To turn on Profile mode, use the -grep argument when running sitescooper; any sites that do not contain IgnoreProfiles: 1 will then be searched for the active profiles.
The score is a minimum value that must be matched (basically a percentage of keyword hits vs. number of sentences). The required keywords must be present or the story does not match. The desired keywords give hints about what is interesting: the more desired keywords that match, the higher the story scores. The exclude keywords will cause a story not to match if they are present.
All of the keywords (required, desired, exclude) can be phrases, and all are processed as Perl regular expressions, so they can be quite complex if needed. Keywords are separated by either a comma or a newline. Scouts.nhp is probably the richest example of what can be done with a profile (it includes regular expressions).
I added an "IgnoreProfiles" command to the site file definition to allow users to scoop the entire site rather than just the stories that match.
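The matching rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code sitescooper runs: the function names are hypothetical, matching is assumed to be case-insensitive, sentences are split naively on terminal punctuation, and the score is taken as 100 times keyword hits divided by sentence count, which is one reading of "a percentage of keyword hits vs. number of sentences". Python's re syntax also differs from Perl's in places, and splitting keywords on commas would break a pattern that itself contains a comma.

```python
import re

KEYWORD_FIELDS = ("Required", "Desired", "Exclude")


def parse_profile(text):
    """Parse 'Field: value' lines; keywords are split on commas or newlines."""
    profile = {"Name": "", "Score": 30, "Required": [], "Desired": [], "Exclude": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, rest = line.partition(":")
        if sep and key.strip() in profile:
            # A recognised field name starts a new field; any other line
            # is treated as a continuation of the current field.
            current, line = key.strip(), rest.strip()
        if current == "Name":
            profile["Name"] = line
        elif current == "Score":
            profile["Score"] = int(line)
        elif current in KEYWORD_FIELDS:
            profile[current].extend(k.strip() for k in line.split(",") if k.strip())
    return profile


def count_hits(patterns, story):
    """Count regex matches of all patterns (case-insensitive: an assumption)."""
    return sum(1 for p in patterns for _ in re.finditer(p, story, re.IGNORECASE))


def story_matches(profile, story):
    """Apply the Exclude, Required and Score rules in that order."""
    if count_hits(profile["Exclude"], story) > 0:
        return False
    if any(count_hits([p], story) == 0 for p in profile["Required"]):
        return False
    sentences = [s for s in re.split(r"[.!?]+\s+", story.strip()) if s]
    hits = count_hits(profile["Required"] + profile["Desired"], story)
    score = 100.0 * hits / max(len(sentences), 1)  # assumed scoring formula
    return score >= profile["Score"]
```

With neither Required nor Desired keywords the hit count is always zero, so a profile with a positive Score never matches, which is consistent with the note above that one of the two must be present.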
To use a profile, create a directory called profiles, and set the ProfilesDir parameter in the sitescooper.cf file to point to that directory. Now copy in the profiles you are interested in from the profile_samples directory of the distribution. UNIX users should look in /usr/share/sitescooper or /usr/local/share/sitescooper if you're not sure where sitescooper has been installed. Edit the profiles to taste, and run sitescooper with the -grep argument.
There's also a -nowrite argument which will stop sitescooper from writing cache files, already_seen entries, and output files.
If the worst comes to the worst, you can get sitescooper to copy the HTML of every page accessed to a journal file using the -admin journal switch. Each page is logged three times: first, in its initial form straight from the website; secondly, after the StoryStart and StoryEnd have been stripped from the page; and finally, as text. This is handy for debugging a site file, but is definitely not recommended during normal use, as a big site will produce a lot of journal output.
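To make the three journal stages concrete, here is a small Python sketch of the same progression. It is a simplified stand-in rather than sitescooper's own logic: journal_stages is a hypothetical name, StoryStart and StoryEnd are assumed to be regular expressions marking the boundaries of the story body, and the HTML-to-text step is a crude tag strip.

```python
import re
from html import unescape


def journal_stages(raw_html, story_start, story_end):
    """Return the page as (label, content) pairs for each logged stage."""
    # Stage 2: drop everything up to and including the StoryStart match,
    # and everything from the StoryEnd match onwards (assumed semantics).
    begin = 0
    match = re.search(story_start, raw_html)
    if match:
        begin = match.end()
    finish = len(raw_html)
    match = re.compile(story_end).search(raw_html, begin)
    if match:
        finish = match.start()
    stripped = raw_html[begin:finish]

    # Stage 3: a very rough HTML-to-text conversion.
    text = unescape(re.sub(r"<[^>]+>", " ", stripped))
    text = " ".join(text.split())

    return [("raw", raw_html), ("stripped", stripped), ("text", text)]
```

Comparing the "raw" and "stripped" entries is usually the quickest way to see whether a StoryStart or StoryEnd pattern is cutting in the wrong place.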
If you have all the files in your cache, use the -fromcache switch and network accesses will be avoided entirely. This is handy for debugging your site offline, or for producing output in multiple formats from the same files, if you have a shared cache set up.
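For producing several formats from one set of downloads, the sequence of invocations can be sketched as below. This only builds argument lists and runs nothing. The executable name "sitescooper" and the idea that the first run fetches while later runs add -fromcache are assumptions for illustration; the sketch also presumes a shared cache has already been configured, which is not shown here.

```python
def multi_format_commands(formats, executable="sitescooper"):
    """Build one argument list per output format.

    Hypothetical helper: the first invocation is allowed to use the
    network; every later one adds -fromcache so it reads cached pages.
    """
    commands = []
    for index, fmt in enumerate(formats):
        argv = [executable]
        if index > 0:
            argv.append("-fromcache")
        argv.append("-" + fmt)
        commands.append(argv)
    return commands
```

For example, multi_format_commands(["html", "isilo"]) yields one plain -html run followed by a -fromcache -isilo run, each of which could be passed to subprocess.run once the executable path is known.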
On UNIX, it'll be copied into the ~/pilot/install directory if it exists, or ~/pilot otherwise. Users of gnome-pilot, PilotManager, or JPilot should enter gnome-pilot, PilotManager, or JPilot in the PilotInstallApp field of the configuration, as sitescooper includes built-in support for those tools. (If you use KPilot, mail me and tell me where it should go!)
Windows users have it easy, as sitescooper will automatically install PRC files into the Pilot Desktop Install directory. If you use multiple Palm devices, however, you will still need to edit the configuration to name the correct directory.
Mac users get the worst of all worlds, as the output is simply left in the txt sub-folder of the temporary folder. They need to copy it over themselves manually (sorry).
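The UNIX rule above (use ~/pilot/install if it exists, otherwise ~/pilot) is simple enough to express directly. The following Python sketch shows that lookup only; unix_install_dir is a hypothetical name, and the Windows, Mac and PilotInstallApp cases are deliberately left out.

```python
from pathlib import Path


def unix_install_dir(home):
    """Pick the default install directory under the given home directory."""
    install = Path(home) / "pilot" / "install"
    # Prefer ~/pilot/install when it exists; otherwise fall back to ~/pilot.
    return install if install.is_dir() else Path(home) / "pilot"
```

Calling unix_install_dir(Path.home()) shows where output would be expected to land for the current user under this rule.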
-dumpprc does the same thing for the binary formats, such as DOC, iSilo, M-iSilo or RichReader. Note that multi-file formats such as M-HTML don't get dumped either way; the path to the file which contains the first page of the output is printed instead.
Some versions of Windows perl have difficulty redirecting stdout, so the -stdout-to argument allows the same thing to be done from within the script itself.