This is sitescooper, a perl script which you run on your Palm Computing handheld organizer's hotsync machine. It will retrieve news stories automatically from various news websites and convert them into Palm DOC, iSilo, RichReader or text format; in addition, it can now convert into any other format for which you have a conversion program that takes text or HTML input.
(If you've just installed sitescooper, you probably don't want to read the blurb again; so just go straight to the Installation page.)
HTTP and local files, using the file:/// protocol, are both supported.
Multiple types of sites can be snarfed:
1-level sites, where the text to be converted is all present on one page (such as Slashdot, Linux Weekly News, BluesNews, NTKnow, and Ars Technica);
2-level sites, where the text to be converted is linked to from a Table of Contents page (such as Wired News, BBC News, and I, Cringely);
3-level sites, where the text to be converted is linked to from a Table of Contents page, which in turn is linked to from a list of issues page (such as PalmPower or New Scientist).
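The three site types above differ only in how many layers of links sit between the starting page and the story text. A minimal sketch of that idea follows, in Python for illustration only (sitescooper itself is a perl script, and this is not its code). The function name `collect_story_pages` and the `links_of` callback are hypothetical; a real run would fetch each page and pick links according to the site file.

```python
from typing import Callable, List


def collect_story_pages(start_url: str, levels: int,
                        links_of: Callable[[str], List[str]]) -> List[str]:
    """Return the URLs of the pages that hold story text.

    levels=1: the start page itself holds the text.
    levels=2: the start page is a table of contents linking to stories.
    levels=3: the start page lists issues, each linking to a contents page.
    `links_of` is a hypothetical callback returning the links to follow.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    frontier = [start_url]
    # Each extra level means following one more layer of links.
    for _ in range(levels - 1):
        seen = set()
        next_frontier = []
        for url in frontier:
            for link in links_of(url):
                if link not in seen:  # skip stories linked more than once
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return frontier


# A made-up 3-level site: issues page -> contents pages -> stories.
FAKE_SITE = {
    "issues": ["toc-1", "toc-2"],
    "toc-1": ["story-a", "story-b"],
    "toc-2": ["story-b", "story-c"],
}


def fake_links(url: str) -> List[str]:
    return FAKE_SITE.get(url, [])


print(collect_story_pages("issues", 3, fake_links))
```

With the made-up site above, starting at the issues page with three levels yields each story once, even though one story is linked from both contents pages.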
In addition, sites that post news as items on one big page, such as Slashdot, Ars Technica, and BluesNews, are supported using diff.
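To illustrate how a diff helps with sites that keep all their news on one big page, here is a rough sketch, assuming the page has already been converted to text and that a copy from the previous run was kept. It uses Python's standard `difflib` module rather than the diff sitescooper actually uses, and the function name `new_lines` is hypothetical.

```python
import difflib
from typing import List


def new_lines(previous: str, current: str) -> List[str]:
    """Return the lines of `current` that were not in `previous`."""
    old = previous.splitlines()
    new = current.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mark text that is new in `current`.
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


yesterday = "Story B\nStory A"
today = "Story C\nStory B\nStory A"
print(new_lines(yesterday, today))
```

Only the item added since the last run comes back, so text that was already converted is not converted again.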
It even trims out sidebar tables automatically, by assuming that tables narrower than 30% of the average browser width are not part of the news story. Effectively, sitescooper is a transcoder for handheld PCs.
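The width heuristic can be sketched as follows, again in Python for illustration only. The 30% threshold comes from the text above; the 800-pixel "average browser width", the function name `is_sidebar_table`, and the choice to keep tables with no usable width are all assumptions made for this example, not values taken from sitescooper.

```python
import re

ASSUMED_BROWSER_WIDTH_PX = 800  # assumption: not a value from sitescooper
SIDEBAR_FRACTION = 0.30         # the 30% threshold described above


def is_sidebar_table(width_attr: str,
                     browser_width_px: int = ASSUMED_BROWSER_WIDTH_PX) -> bool:
    """Guess whether a table is a sidebar from its width attribute.

    `width_attr` is the value of a <table width=...> attribute,
    either in pixels ("150") or as a percentage ("25%").
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(%?)", width_attr.strip())
    if match is None:
        return False  # no usable width: keep the table to be safe
    number = float(match.group(1))
    if match.group(2):
        fraction = number / 100.0
    else:
        fraction = number / browser_width_px
    return fraction < SIDEBAR_FRACTION


for width in ("150", "25%", "600", "100%"):
    print(width, is_sidebar_table(width))
```

Under these assumptions a 150-pixel or 25% table is treated as a sidebar and dropped, while a 600-pixel or full-width table is kept as story content.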
The script should run easily on most UNIX variants that support perl, as well as the Win32 platform, even Windows 95 (tested with ActivePerl 5.00502 build 509). It has been reported to work on a Mac, using MacPerl 5.1.9r4.
Output is supported in the following formats:
plain text
Plucker, an HTML-based format for Palm Computing organizers. Plucker is free software licensed under the GPL, like sitescooper.
iSilo, an HTML-based format for Palm Computing organizers, from DC and Co., available from http://www.isilo.com/
RichReader format, an RTF-based format with formatting, see http://users.erols.com/arenakm/palm/RichReader.html
DOC format, as used by AportisDoc, TealDoc, CSpotRun, etc.
any other format using the -pipe switch.
DOC format, Plucker format, and text are all free. RichReader is shareware, and iSilo has both shareware and free readers available.
You may ask, "why not just use AvantGo, 'lynx -dump' and 'makedoc', or some other web-page-downloading software?" Well, sitescooper has several advantages:
it will follow links, and has a sophisticated set of mechanisms to follow the right links and use the "printing version" of a story;
it can use heuristics to trim out irrelevant tables;
the HTML rendering code is optimised for viewing on a Palm handheld, by trimming all images (even their ALT tags), forms, and extraneous headers and footers (based on the .site file), resulting in much more space free on your handheld;
it's very configurable for each target site -- you can even use Perl code in a site file to rewrite the HTML as it's scooped;
it tracks what stories you've already read, and is quite sophisticated about removing text you've seen before;
it's portable to UNIX, Win32, Mac, and any other perl-supporting platform;
it's free software, distributed under the GNU GPL.
In short, it's pretty neat.
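As a rough illustration of the HTML trimming mentioned in the list above (dropping images, their ALT text, and forms), here is a small Python sketch built on the standard `html.parser` module. It is not sitescooper's rendering code, which is written in perl and driven by the .site file; the class name `StoryTextExtractor` and the function `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collect page text, skipping images (with ALT text) and forms."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._form_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # <img> tags carry no text, so ignoring attributes drops ALT too.
        if tag == "form":
            self._form_depth += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._form_depth > 0:
            self._form_depth -= 1

    def handle_data(self, data):
        if self._form_depth == 0:  # ignore anything inside a form
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = StoryTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ('<p>Hello <img src="x.gif" alt="logo"> world</p>'
          '<form><input name="q">Search</form>')
print(html_to_text(sample))  # prints "Hello world"
```

Dropping this markup before conversion is what saves space on the handheld; trimming site-specific headers and footers would need extra per-site rules that this sketch does not attempt.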
Pick up the latest version of sitescooper at the following URL:
http://sitescooper.org/
Sitescooper is distributed under the GNU GPL, and as such is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The full text of the GPL is available via the GPL link below.
The next thing to do is to follow the links below to the next section, Installing.
[ Installing ] | [ on UNIX ] | [ on Windows ] | [ on a Mac ]
[ Running ] | [ Command-line Arguments Reference ]
[ Writing a Site File ] | [ Site File Parameters Reference ]
[ The rss-to-site Conversion Tool ] | [ The subs-to-site Conversion Tool ]
[ Contributing ] | [ GPL ] | [ Home Page ]