What it costs: WebFetcher is Shareware. THIS IS NOT PUBLIC DOMAIN SOFTWARE. After 30 days, educational and nonprofit institutions must send a postcard. Others must pay for WebFetcher to use it after 30 days. Checks should be in US funds drawn on a US bank. To license one copy, remit $35 to:
123 University Place
Pittsburgh, PA 15213
What it does: Downloads World Wide Web pages to your local hard disk for
offline viewing. Pages are updated on a regular schedule
set by you.
How it works: You supply a list of http URLs (in a file called
schedule.txt) and desired download
times. At those times, WebFetcher downloads the associated documents.
Embedded images and hyperlinked pages (down to a certain depth) can be
downloaded as well. You view these pages offline using your favorite
Web browser. WebFetcher periodically checks the original site and
automatically downloads any new or updated pages.
A live Internet connection. (Direct connection, SLIP or PPP.)
Macintosh (fat binary), System 7.0.1 or later.
UNIX versions: NeXT, SunOS, Solaris 2.3, OSF/1 (for Dec Alpha), Dec
Windows '95 or NT.
Unhqx and expand the file
expands into its own folder,
containing the application and other files.
Gunzip and tar -xvf the file
directory called WebFetcher is created,
containing the WebFetcher application and other files.
By default WebFetcher installs here:
C:\Program Files\Television Computer\WebFetcher\
The executable program is in the
All other files, including the sample schedule file (schedule.txt) and the
WebFetcher Master Index (index.html), are in the Data subdirectory.
Here's a quick overview. Detailed instructions appear below under
Lines in a schedule file fit this pattern:
1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1
The encoding is what you'd expect: starting date and time,
url, and some "detail codes". The "detail codes" are repeat
interval, fetch depth, and graphics flag (1=yes,
0=no). (The '1' at the start of the line is a format code. It needs to
The above line is interpreted as: "Fetch the page http://www.ontv.com/
every 12 hours starting at 7:30 am on March 20, 1996. Go two levels
deep (beyond the first page), and include graphics."
For a one time only fetch, use a repeat interval of zero, e.g.
0 h like this:
1 3/20/96 7:30 am http://www.ontv.com/ 0 h 2 1
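The record layout shown above can be read mechanically. The following is a minimal Python sketch of one way to split such a line into its fields; it is not WebFetcher's own parser. The names ScheduleEntry and parse_schedule_line are hypothetical, and the sketch assumes an English locale so that "am" and "pm" are recognized.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScheduleEntry:
    """One schedule record (hypothetical container, not part of WebFetcher)."""
    frmt: int
    start: datetime
    url: str
    repeat_interval: int
    interval_type: str
    fetch_depth: int
    graphics: bool


def parse_schedule_line(line: str) -> ScheduleEntry:
    """Split one space-separated schedule record into typed fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 space-separated tokens, got {len(parts)}")
    frmt, date, clock, ampm, url, interval, unit, depth, graphics = parts
    # %y reads the two-digit year; %I with %p handles the 12-hour am/pm clock.
    start = datetime.strptime(f"{date} {clock} {ampm}", "%m/%d/%y %I:%M %p")
    return ScheduleEntry(
        frmt=int(frmt),
        start=start,
        url=url,
        repeat_interval=int(interval),
        interval_type=unit.lower(),
        fetch_depth=int(depth),
        graphics=(graphics == "1"),
    )


entry = parse_schedule_line("1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1")
print(entry.start, entry.url, entry.repeat_interval, entry.interval_type)
```

A one-time fetch parses the same way; its repeat_interval simply comes back as 0.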
Connect to the internet, then launch the WebFetcher application. You
will be prompted for a schedule file
to load. Try the default schedule
WebFetcher stores fetched files in the application directory.
(Now read the Windows section below regarding the Master Index and
creating your own schedule...)
WebFetcher is best run as a background process. Recommended usage:
$ WebFetcher [-s schedule_file ] [-d data_directory ] [&]
Schedule_file points to your WebFetcher schedule. The
in the current directory.
Data_directory points to the
you want WebFetcher to store your fetched files. The default is the
All messages are written to the file
log.txt in the data
directory. To see what
WebFetcher is doing "right now", cd to that directory and say this:
$ tail log.txt
Under UNIX, there's nothing to prevent you from running multiple copies
simultaneously. Also note that once in the background WebFetcher will
not terminate on
its own: you'll have to kill it yourself from the shell. (Do the
(Now read the Windows section below regarding the Master Index and
creating your own schedule...)
After connecting to the Internet, just launch WebFetcher. It loads the
schedule.txt file and
fetches the files listed there. The main window displays a log of
After a minute or two (give WebFetcher a chance
to fetch the files!), launch your favorite web browser and open the
index.html in the WebFetcher folder.
Follow the links: the pages you see have been fetched to your local hard disk.
Now exit WebFetcher and edit the file
your own schedule --
see the notes below under Schedule File for help. List the URLs
you'd like to fetch and set their download times. Delete the lines you
Relaunch WebFetcher and you're on your way!
Configure WebFetcher to fetch your
sports, and weather pages every few hours. Have it check your favorite
for updates. Mirror important documentation to your own hard disk.
Check monthly for
updates. Have WebFetcher make daily checks for important press
releases. Keep an eye on your
Below are the items found in the WebFetcher subdirectory. Some items appear
after installation, others after WebFetcher is first run.
- WebFetcher Program: The executable, WebFetcher or
- Master Index : index.htm
An HTML page with hyperlinks to your successfully fetched pages. As WebFetcher
runs, it appends new links to this page that point to new downloads.
This is a "top
level" index: it lists only the pages you've explicitly asked for and
that have actually been fetched. You're welcome to edit this
file with your favorite text editor if you wish to reorder the listing
- Daily Index : di<datecode>.htm
Daily HTML pages generated by WebFetcher with hyperlinks to ALL the new pages
downloaded that day. (These indices may be deleted if they take up too much
space. They're basically here to provide a "What's New" function.)
- WebFetcher Update Page : update.htm
A page reachable from the Master Index and fetched fresh from our site
every time you
run WebFetcher. It contains information on WebFetcher updates and other
news items. Later, if demand warrants, the Update Page will announce
services such as
"fetch profiling" for congestion-avoidance.
- Help File : help.htm
This file, also available in WebFetcher's Help menu.
- Log File : log.txt
A disk-based copy of the information written to WebFetcher's main window.
(When the log reaches 256K in length, it truncates itself to 32K. The log
will not fill up your hard disk...)
- Data Files : various subfolders
WebFetcher creates a subfolder for each host visited. Pages for a
particular host are stored in that host's folder.
You may delete these folders whenever you wish, but deleting them
will obviously destroy the pages they contain. If you wish to view the
deleted pages again, you'll have to refetch them.
- Schedule File: schedule.txt
The default schedule WebFetcher automatically loads at startup.
You can modify this schedule or create new schedules using your favorite
You designate which schedule to load at startup using the Load New
in the File menu.
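Of the items listed above, the log file's self-truncation rule (256K down to 32K) is easy to picture in code. The sketch below is only an illustration of that rule in Python, not WebFetcher's implementation. The function name trim_log is hypothetical, and keeping the newest 32K (rather than the oldest) is an assumption; the text above gives only the two sizes.

```python
import os

MAX_LOG_BYTES = 256 * 1024   # size at which the log is cut back
KEEP_LOG_BYTES = 32 * 1024   # size the log is cut back to


def trim_log(path: str) -> bool:
    """Cut the log back to KEEP_LOG_BYTES once it reaches MAX_LOG_BYTES.

    Returns True if the file was trimmed. Keeping the most recent bytes
    is an assumption made for this sketch.
    """
    if os.path.getsize(path) < MAX_LOG_BYTES:
        return False
    with open(path, "rb") as f:
        # Position the read 32K before the end of the file.
        f.seek(-KEEP_LOG_BYTES, os.SEEK_END)
        tail = f.read()
    with open(path, "wb") as f:
        f.write(tail)
    return True
```

Calling trim_log("log.txt") after each write would keep the file bounded, which matches the promise that the log will not fill up your hard disk.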
The schedule file is a simple text file that contains records, one per line, in
exactly this form:
frmt date-time URL repeat-interval interval-type fetch-depth graphics-flag
1 11/20/97 12:00 pm http://www.ontv.com/ 2 w 5 1
This translates into English as
"Fetch the page at
http://www.ontv.com/ on November
20, 1997 at 12:00 pm. Fetch all the attached pages up to 5 links away, and
all embedded graphics. Every two weeks, check to see if anything has changed, and
download only the pages and graphics that have changed."
The fields in each record are defined as follows (note that each
field is separated by a space, and the record ends with a carriage return):
- frmt : A record format flag, for now always 1.
- date-time : The local date and time to fetch this URL,
mm/dd/yy hh:mm followed by either am or pm.
You must use either am or pm; military time (24 hour
clock) is not recognized.
- URL : An ordinary Uniform Resource Locator, as described
in RFC 1738.
This web page,
along with all its embedded images and hyperlinked pages (if
requested) will be
fetched to your hard disk. (For now, only URLs of the scheme 'http' are accepted.)
- repeat-interval and interval-type: Together these two
codes describe how often to check the
original source URL for new updates. If repeat-interval is zero,
interval-type is ignored and the
page is fetched "one time only". Otherwise, a positive integer gives the
number of intervals of interval-type before the next periodic
fetch occurs. The codes are
m = minutes, h = hours, d = days, w = weeks.
Examples clarify their use:
90 m would mean check every 90 minutes.
10 d would mean check every ten days.
3 w would mean check every three weeks.
0 w would mean check "one time only" (no repeat
It follows that the
following codes are equivalent:
60 m = 1 h, 24 h = 1 d, 1 w = 7 d.
As another example:
1 1/20/96 7:00 am http://www.yahoo.com/headlines/summary.html 3 h
Means "fetch Yahoo's News Summary every three hours, starting January
20, 1996 at 7 am."
- fetch-depth :
The maximum number of links to follow away from the page initially requested.
The above http://www.ontv.com/ example will fetch
any files within five links (jumps) of the requested page.
Generally speaking, large sites like CERN, Microsoft, Netscape, Yahoo,
etc., tend to
"fan out" quickly, so start with very small numbers.
Good choices are
0 (fetch only the requested page), 1 (fetch the requested page
and its attached pages) and 2 (fetch the requested page and all the
attached pages, and all their
attached pages). Numbers bigger than 3 or 4 should be used with extreme caution.
A good technique is to schedule
a fetch at depth 1 and examine the results. Pick out the interesting
branches and focus new, deeper fetches on those branches only.
- graphics-flag : 1 or 0.
1 means fetch any inline graphics (GIFs and JPEGs), 0 means
ignore (don't fetch) graphics.
(Note that this setting is different from hyperlink depth. With this
flag, you can order
your pages "with or without graphics", so to speak.)
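The repeat codes defined above map directly onto time spans. As a hedged illustration (the helper names repeat_delta and fetch_times are hypothetical, and this is not how WebFetcher itself is written), the Python sketch below converts a repeat-interval and interval-type pair into a duration, checks the stated equivalences, and lists the first few fetch times for a record.

```python
from datetime import datetime, timedelta

# One unit of each interval-type code from the field list above.
UNITS = {
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}


def repeat_delta(interval: int, unit: str):
    """Return the time between fetches, or None for a one-time fetch."""
    if interval < 0:
        raise ValueError("repeat-interval must be zero or a positive integer")
    if interval == 0:
        return None  # zero means "one time only"; the unit is ignored
    return interval * UNITS[unit]


def fetch_times(start: datetime, interval: int, unit: str, count: int):
    """List the first `count` fetch times for a record starting at `start`."""
    delta = repeat_delta(interval, unit)
    if delta is None:
        return [start]
    return [start + i * delta for i in range(count)]


# The equivalences given in the text hold for these durations.
print(repeat_delta(60, "m") == repeat_delta(1, "h"))
print(repeat_delta(24, "h") == repeat_delta(1, "d"))
print(repeat_delta(1, "w") == repeat_delta(7, "d"))

# "3 h" starting January 20, 1996 at 7 am.
for when in fetch_times(datetime(1996, 1, 20, 7, 0), 3, "h", 3):
    print(when)
```

Note that the sketch does not enforce any minimum interval; it only shows how the codes combine.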
- WebFetcher isn't yet intended to be a general-purpose automatic
downloader robot. Its purpose for now is to facilitate caching of WWW
HTML pages for convenient (and fast) offline viewing. It will only
download text and image data,
not compressed files, PostScript files, executable binaries, etc.
- The schedule accepts http URLs only. Gopher, WAIS, news, ftp, etc.
URLs are explicitly disallowed.
- WebFetcher will fetch data on sites other than the one indicated
in the original
request URL, but only one level deep. That is, if an embedded hyperlink
jumps to a
"non-local" site, WebFetcher will follow that hyperlink no deeper than
its first page.
- If a page hasn't been fetched, a local hyperlink to that page will
not work. WebFetcher never leaves your hard disk. It will not go back to the
net to find pages you haven't fetched.
- Server-side image maps won't work. They rely on software running on the
server, so it's nearly impossible for WebFetcher to properly mimic image maps.
- Queries, like
usually won't work.
- WebFetcher will not fetch data from servers running HTTP/0.9.
The server must be running HTTP/1.0 or better.
- WebFetcher endeavours to be a good net citizen. The authors are
sensitive to the
havoc personal robots can wreak on the net. To make WebFetcher more
we enforce a minimum 10 second wait between any two non-graphics fetches
to the same site,
and a minimum 30 minute refetch interval in user schedules.
- WebFetcher does not currently follow the Robots Exclusion Protocol,
but it will
as soon as we can implement it.
- WebFetcher comes as a software daemon and a text-based API. We
realize that our
scheduling method (editing a text file) is a bit arcane. A
GUI will be forthcoming if demand warrants one.
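Several of the limitations above (http URLs only, a fetch-depth limit, and "non-local" sites followed no deeper than their first page) combine into one traversal rule. The Python sketch below is one possible reading of that rule, run over an in-memory link table instead of the network. The function plan_fetch, the LINKS table, and the example hostnames are all hypothetical; WebFetcher's real traversal order and bookkeeping are not documented here.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_fetch(start_url, max_depth, links):
    """Return the URLs that would be fetched, in breadth-first order.

    `links` maps a page URL to the URLs it links to (a stand-in for
    actually downloading and scanning each page).
    """
    home = urlsplit(start_url).hostname
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        parts = urlsplit(url)
        if parts.scheme != "http":
            continue  # only http URLs are fetched
        order.append(url)
        if parts.hostname != home:
            continue  # non-local page: fetch it, but follow it no deeper
        if depth == max_depth:
            continue  # fetch-depth reached on the requested site
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical link table for illustration only.
LINKS = {
    "http://a.example/": ["http://a.example/news.html", "http://b.example/"],
    "http://a.example/news.html": ["http://a.example/story.html", "ftp://a.example/file"],
    "http://b.example/": ["http://b.example/deep.html"],
}

for url in plan_fetch("http://a.example/", 2, LINKS):
    print(url)
```

In this table the page on b.example is fetched but its own link to deep.html is not followed, and the ftp link is skipped entirely. The sketch leaves out the 10 second wait between fetches and the graphics flag.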
Send email to firstname.lastname@example.org. We're
especially interested in
hearing about which of the above limitations you'd like to see removed.
Future development on WebFetcher will be strictly feedback-driven.