There's a mailing list for discussion of sitescooper. To join, send a mail to <sitescooper-request /at/ netnoteinc.com> with the single word subscribe in the message body. To leave the list, send a mail to the same address with the word unsubscribe in the message body. Note: the mail addresses above are "spam-protected", so you need to change the " /at/ " part to an @ sign before you can send mail to them!

If you have a site you think others will like, mail the .site file to the list, and I'll stick it in the distribution -- and list your name in the CREDITS section. Same goes for bug patches!


If you find a bug, send a bug report to the list (or to me) and I'll try to get around to fixing it. It could take a while though, as I don't get paid for this stuff. BTW I really like bugfix patches, if you feel like submitting one after finding a bug ;)


Some of the post-processing and HTML cleanup code includes ideas and code shamelessly stolen from http://pilot.screwdriver.net/ , Christopher Heschong's <chris at screwdriver.net> webpage-to-pilot conversion tool.

Included in the distribution is a copy of Algorithm::Diff, an implementation of the Longest Common Subsequence algorithm, Copyright 1998, 1999 M-J. Dominus (mjd-perl-diff /at/ plover.com).

Also Robb Canfield (robbc /at/ canfield.com) has kindly provided Table.pm, "a general purpose HTML table converter that tries, usually successfully, to convert wide tables to long lists. In general it copies the table headers and rotates them down for each row." It's used when you set "TableRender: list".

James Brown (jbrown /at/ burgoyne.com) has also contributed NewsHound.pm, which "adds what I call story profiles to sitescooper. Basically you tell sitescooper what sort of stories you are interested in by describing them in one or more profiles. Then the system only scoops stories that interest you. Obviously this works better on either 2 or 3 level sites where stories are encapsulated in a single file. You can also disable the profiles for a particular site (for example, a headlines page where you want everything)." It's used when you use the -grep command-line argument.

Both are free software; you can redistribute them and/or modify them under the same terms as Perl itself.

If you've downloaded the "full" version of sitescooper, the following are also included under the Artistic license:

HTML-Parser 2.23, by Gisle Aas; Copyright 1995-1999 Gisle Aas. All rights reserved.
Libwww-perl 5.45, Copyright 1995-1999 Gisle Aas. All rights reserved, and Copyright 1995 Martijn Koster. All rights reserved.
MIME-Base64 2.11, Copyright 1995-1999 Gisle Aas <gisle /at/ aas.no>.
URI 1.04, Copyright 1998-1999 Gisle Aas, Copyright 1998 Graham Barr.

These are included to ease the task of installation.

Here's a list of people who've contributed to sitescooper, either with .site files, patches, or suggested fixes and functionality:

Carsten Clasohm, <cc /at/ clasohm.com>: fix for diffing sites with newlines in the href tags, regional_germany sites.
michael d. ivey <ivey /at/ gweezlebur.com>: packaging sitescooper as a .deb, and general Debian compliance -- thanks!
Stefan Schwingeler <stefan /at/ schwingeler.de>: fix for ContentsSkipURL, regional_germany sites. Stefan and Carsten are responsible, between them, for all the sites in the regional_germany category -- thanks guys!
Pierre-Yves Letournel <e-py.letournel /at/ wanadoo.fr>: regional_francais: afp.site, le_monde.site, 01_informatique.site, lmi_hebdo.site, lmi_quotidien.site.
Jacques Turbé <jturbe /at/ cybercable.fr>: regional_francais: lemondecomplet.site nouvelobs.site libe_portrait_du_jour.site libe_rebonds.site libe_q.site journaldunet_dossiers.site echos_infos.site, and journaldunet.site. Jacques and Pierre-Yves have, between them, provided all the sites in regional_francais, which is great!
Jason Simpson <jason /at/ xio.com>: contributed seattletimes.site
Joe Pfeiffer <pfeiffer /at/ cs.nmsu.edu>: HTML rendering fixes, lots of sites
Mike Miller <mmiller /at/ mediageneral.com>: several sites
dLux <dlux /at/ dlux.hu>: sites for Debian Weekly News, Freshmeat, Hirnet, Linux.Hu, Palmcentral, updated Linux Today
Andrew Fletcher <fletch /at/ computer.org>: MacOS support
<spacehog /at/ knowfear.knowfear.net>: yahoo_top_stories.site
Jason C. Axley <jason /at/ axley.net>: installation instructions update for RedHat 6.0, and SRPM for the URI module.
Kennis Koldewyn <kennis.koldewyn /at/ wcom.com>: NY Times sites.
Michael Lapsley <mlapsley /at/ ndirect.co.uk>: fixed bug with "-refresh -fromcache".
Jason Yanowitz <yanowitz /at/ poboxes.com>: site file for The Guardian.
Kevin Olson <kevolson /at/ visi.com>: fixed bug with RichReader command-line.
Vince <reverso /at/ club-internet.fr>: contributed le_temps.site.
Dave Collins <Dave.Collins /at/ tiuk.ti.com>: fix for (no text to write) when text started with a quote char.
Albert K T Hui <avatar /at/ deva.net>: lots of regional_hk site files, and fixes to allow more 8-bit text; also worked around HTML abuse by Sing Tao Daily.
Alastair Rankine <arankine /at/ lucent.com>: fairfax_it.site
Kevin L. Dupree <kdupree /at/ flash.net>: image-only site support.
Andy Rabagliati <andyr /at/ wizzy.com>: csmonitor.site and KPilot support.
Memeteau, Michael <Michael.Memeteau /at/ autoeuropa.pt>: site files.
Derek Glidden <dglidden /at/ illusionary.com>: fixed lots of delinquent site files, added science_daily.site, spaceref.site.
Justin Henry <jhenry /at/ fjicl.com>: A fine selection of sites: updated salon.site; gist_tv.site; cats_cradle.site; clark_howard.site; morbid_fact_du_jour.site; news_observer.site; ny_times_handheld.site; roger_ebert.site; usa_today.site; weather24.site, wral_tv.site, and movietickets.site.
Sergi Puso Gallart <sergi /at/ iAgora.net>: elmundo_* and marca_* sites, creating the new regional_spain category.
Lim Swee Tat <st_lim /at/ 3ui.com>: samba_traffic.site, wine_traffic.site, techweb.site added; webmonkey.site, javaworld.site fixed; AnywhereYouGo sites contributed.
Marko Bozikovic <redbyron /at/ fly.srk.fer.hr>: All of regional_croatia; a lot of comics, and several science sites.
Thean Yoon Fui <yoonfui /at/ bigfoot.com>: Lots of comics sites, and updates to the visorcentral site.
Peter Marschall <peter.marschall /at/ mayn.de>: updates to de_sz, de_heise, the_register and de_zeit sites; pointed out that .pdb was correct extn for iSilo output; support for multiple site choices files and FHS conformance.
David Czerwinski: Chicago Tribune sites
Wari Wahab: concept and implementation of the index page for HTML and M-HTML output
David A. Desrosiers <hacker /at/ gnu-designs.com>: Lots of new "Palm version" site file URLs
Robert Edmonds <stu /at/ brainfood.com>: Several "humor" sites: BOFH, ditherati, pigdog.site.
