smtp-messages/robots-nexor-mbox.txt

From /CN=robots-errors/@nexor.co.uk Wed Jun  1 21:17:14 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 1 Jun 1994 21:17:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 1 Jun 1994 21:17:14 +0100
Date: Wed, 1 Jun 1994 21:17:14 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039230:940601201716]
Content-Identifier: WWW robots di...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 1 Jun 1994 21:17:14 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"3912 Wed Jun  1 21:17:00 1994"@nexor.co.uk>
To: Jonathon Fletcher <J.Fletcher@stirling.ac.uk>, 
    David Eichmann <eichmann@rbse.jsc.nasa.gov>, 
    Oliver McBryan <mcbryan@piper.cs.colorado.edu>, 
    Roy Fielding <fielding@ics.uci.edu>, 
    Brian Pinkerton <bp@biotech.washington.edu>, Fred Barrie <barrie@unr.edu>, 
    Matthew Gray <mkgray@mit.edu>, Paul De Bra <debra@win.tue.nl>, 
    Guido van Rossum <Guido.van.Rossum@cwi.nl>, 
    "James E. Pitkow" <pitkow@aries.colorado.edu>, 
    Andreas Ley <ley@rz.uni-karlsruhe.de>, 
    Christophe Tronche <Christophe.Tronche@lri.fr>, 
    Charlie Stross <charless@sco.com>, L.McLoughlin@doc.imperial.ac.uk, 
    Michael L Mauldin <fuzzy@cmu.edu>
Cc: /CN=robots/@nexor.co.uk
Subject:  WWW robots discussion list
Status: RO
Content-Length: 1305

At the WWW'94 Conference the robot authors present expressed an
interest in some closer collaboration. I volunteered to set up a
mailing list to serve as a platform for these technical discussions.

This list is now active. As you are all developing or administering
robots I'd urge you to make use of this facility; together we 
should be able to reduce the occurence of problems caused by
robots, to reduce some of the duplicate effort, and improve
the service to users of robot-generated facilities.

If you'd like to subscribe, send a message to
robots-request@nexor.co.uk, with the lines

subscribe
help
stop

in the body of the message. The list manager is of course NXDLM, which
we market as product, and is configured to keep an archive of traffic
on the list. This archive is accessible from the Web vie our experimental
gateway reacheable from <http://web.nexor.co.uk/>.

To send messages to the list itself use robots@nexor.co.uk.

Next week (allowing people time to register) I'll post a proposed
charter to the list, and list some issues I'd like to see discussed.

Looking forward to your contributions,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 09:37:38 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 09:38:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 09:37:38 +0100
Date: Mon, 6 Jun 1994 09:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:130800:940606083743]
Content-Identifier: Proposed Char...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:37:38 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"13056 Mon Jun  6 09:37:06 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Proposed Charter
Status: RO
Content-Length: 1401

Welcome to you all...,

Here is the proposed charter for this list, for future
reference by new subscribers. It's straightforward, but
if anybody would like to see any changes let me know.

--

Proposed charter for robots@nexor.co.uk.

This list is intended as a technical forum
for authors, maintainers and administrators of WWW robots. 
Its aim is to maximise the benefits WWW robots can offer
while minimising drawbacks and duplication of effort.
It is intended to address both development and operational
aspects of WWW robots. 

This list is not intended for general discussion of WWW
development efforts, or as a first line of support for
users of robot facilities.

Postings to this list are informal, and decisions and
recommendations formulated here do not constitute any 
official standards. Postings to this list will be made
available publicly through the list-managers archive,
and NEXOR doesn't accept any responsibility for the content
of the postings.

Related lists:
www-talk@info.cern.ch: technical WWW development discussions
www-html@info.cern.ch: HTML specific development discussions
www-cache@info.cern.ch: technical discussions on proxys and caching
comp.infosystems.www.*: WWW discussions

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 09:39:16 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 09:39:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 09:39:16 +0100
Date: Mon, 6 Jun 1994 09:39:16 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:131210:940606083924]
Content-Identifier: Topics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:39:16 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"13118 Mon Jun  6 09:39:07 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Topics
Status: RO
Content-Length: 7634


Here is a long list topics I'd like to see discussed at
some point, in no particular order. I look forward to comments
on these topics, other issues, and what the priorities should be.

* public information

  - robot profile matrix

    It would be nice to have a matrix of certain attributes of
    the various robots that exist, that is available to the Web
    public at large. The list I maintain on
    <http://web.nexor.co.uk/mak/doc/robots/active.html> could
    server as a basis; are there any additions people would like
    to see?
    
* sharing of data

  - format / access protocol of database

     Most indexing robots generate a database of information,
     that can then be searched through publicly accessible
     ISINDEX/Form pages. It would be nice if the actual database
     was publicly available, or where applicable an access
     protocol can be made publicly available (eg. SQL).
     
     Others could then run local mirrors of the search engines,
     write their own search engines, or do analysis of the data.

  - distributed data gathering
  
     If there was a standard database format / access protocol
     the data gathering could be distributed over the net, either
     by separate robots, or multiple copies of the same robot.
      
     Jonathon, you mentioned once you were working on some robot
     database synchronisation scheme. Did you get anywhere?
  
* data analysis

  As robots traverse the Web, they could do a lot of statistical
  analysis, either real-time, or on the resulting database. It seems
  silly that multiple robots go out over the same data, all doing
  slightly different analysis.
  
  It would be really nice to publish:
  
  - a list of servers
  
    Like Mathew Gray's list, but then one that is as up-to-date as
    the latest robot run, has only got hosts that actually exist,
    and are smart about multiple DNS names for the same IP address.
  
  - inverse maps

    Robots can create inverse maps, so that I can find out which pages
    refer to a particular page. Until the Referer HTTP field becomes
    more used this could be very valuable to find bad links. And it'd
    be nice to know the average number of links to a page; how inter-
    linked is the Web? We could have a most-referenced league table;
    which is the most popular page in the web in terms of links?
  
  - general stats like
  
    avergae number of visited documents per server (and min & max),
    total number of documents visited,
    total number of hosts visited,
    percentage of links that are bad,
    percentage of HTML documents that are a tag soup,
    percentage of documents not changed in x days,
    etc. etc.
  
* sharing of operational tips

  All robot maintainers hit the same problems at certain sites,
  and get things like:
  
  - seed documents
  
    What documents are good to start robots from.
    
  - site exclusion lists
  
    Which sites explicitly ask not to be visited.
  
  - black hole lists
  
    Which cgi-scripts create infinitely linked web spaces
  
  - avoidance lists
  
    Which data should be avoided (e.g. the UNIX manual gateways)
  
  - robotsnotwanted proposal
  
    I'd like to get some more discussion on this. As all the robot writers
    are on this list we should be able to decide on something that can 
    easily be implemented by robots and users. The only outstanding issue
    is the name of the file; it is too long for DOS-based servers. Is there
    any problem with changing the filename to robotsp.txt (for robots 
    policy) ?
  
  - scheduled runs
  
    It might be nice to know when which robots are running, just in case
    people start wondering.
  
  - ALIWEB
  
    For those sites that have a /site.idx file it might be worth to take
    the documents referenced in it special consideration.
  
* sharing of algorithms

  All robots have different algorithms for a lot of the same functions.
  It should be possible to find the best algorithm that all robots can use:

  - document selection
  
    Which documents do you visit? A lot of robots to "n levels deep" which
    seems pretty arbitrary to me. Doing "n levels from the root document"
    might make more sense.
  
  - HTML parsing
  
    This is tricky, with so much bad HTML out there. There must be a "best
    way" to extract URL's from documents; I am sure that at the moments 
    some robots barf on some documents.
  
  - load balancing
  
    How do you decide when to query a site as to balance the load most?
    It is by now clear the "visit one site at top speed" approach is
    nasty; what is used now? Roud robin? Can time zones be used? How
    fast do robots run?
    
  - search algorithms
  
    Once you have a database, what algorithm can one use to search it?
    At the moment there are Perl scripts, SQL scripts, WAIS database etc.
    If there was a standard database format these could be benchmarked.
  
  - error recovery
  
    Robots should be restarteable without having to backtrack. How is this
    best achieved?
  
* sharing of code

  There is a lot of duplication of effort in the coding and maintenance
  of robot code. It would be useful if there was one common code base
  for robots to draw from, implementing the separate algorithms used in 
  robots.

  I would really like to see a single robot implemetation (TUM: The Unified
  Robot?), that could run cooperatively around the world. Is this me 
  dreaming, or is it something more of you see as beneficial. If so, how
  can we make this a reality? What language is most suiteable (Perl, 
  surely :-) ? What design allows the most flexibility and safety?
  
* HTML/HTTP extensions

  It maybe that there are things in HTTP/HTML that robots could use but 
  don't at the moment, and it may even be worth extending the protocol
  to put facilities aimed at automated tools in (eg If-modified-since).

  At WWW'94 one idea was for example to implement as server-side facility
  to parse an HTML document, and return only the links. 

* Caching issues

  The increased use of caching presents special problems for robots: 
  how does a robot recognise a cached document sitting in the cache
  data area of a chaching server? Should it document them?
  
  But caches and robot do similar things, a robot uses it's own database
  as a cache (I hope!), but a caching server could also use that data.
  This comes back to standardising the database; maybe the structure
  used by the CERN cache can be used as format for robot gathering output.
    
  Robots can also be useful for pre-loading a cache, to do mirroring, or
  to prepare for off-line demo's. Maybe robots should have command-line 
  options to facilitate this. Then again, robot code should probably not
  be handed out freely.

* Testing

  The person running a robot should keep close tabs on what it is doing
  at any one time. What sort of monitoring tools are used to do that?
  
  Testing robot modifications is another issue. I have noticed in the
  past that a robot did the same run several times in a day, which it
  turned out to do "for testing". Surely tests should be done locally.

Right, I have been waiting to get all these off my chest. I think TUM is
the most challenging long-term topic, but in the short term I think the
standard database(s) is the most important; it would bring immediate benefit,
and a lot of the other issues can follow on from that.

Any comments?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 10:06:46 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 10:07:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 10:06:46 +0100
Date: Mon, 6 Jun 1994 10:06:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136200:940606090649]
Content-Identifier: Inverse Maps
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:06:46 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060905.AA22918@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Inverse Maps
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 770

I am currently working on W4v3.0, and one of the features I have implemented
so far is some inverse mapping features.  It's yielded some interesting 
results.  Not surprisingly, the most pointed to sites in the documents 
examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.

Other highly pointed to sites include nearnet.gnn.com (:-),
www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.

For the initial portion of the implementation, I am only constructing
interconnectivity within sites.  That is, I keep track of what documents
point to site FOO, not what documents point to what documents.  Any ideas
on implemenation of the latter that is reasonable?

Has anyone else done such interconnectivity mapping?

					...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 10:36:42 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 10:36:42 +0100
Date: Mon, 6 Jun 1994 10:36:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140740:940606093644]
Content-Identifier: re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:36:42 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406061033.aa01127@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk, mkgray@MIT.EDU
Subject:  re: Inverse Maps
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1799

mkgray@MIT.EDU writes ...

>I am currently working on W4v3.0, and one of the features I have implemented
>so far is some inverse mapping features.  It's yielded some interesting 
>results.  Not surprisingly, the most pointed to sites in the documents 
>examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.

>Other highly pointed to sites include nearnet.gnn.com (:-),
>www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.

>For the initial portion of the implementation, I am only constructing
>interconnectivity within sites.  That is, I keep track of what documents
>point to site FOO, not what documents point to what documents.  Any ideas
>on implemenation of the latter that is reasonable?

One idea I was playing with when I was working on websnarf 2
(which is currently on the shelf) was the idea of using a whacking
great .dbm file to store either entire HTML files, indexed on their
URL, or a list of URLs extracted from such files. (I ran into a
problem in that the standard dbm and Berkeley dbm libraries have a
maximum record size of 1024 or 2096 bytes respectively; GNU dbm
apparently doesn't have this restriction, but I didn't have time to
rebuild my version of Perl with a new library.) Anyway, the idea
is that keeping such a database would reduce the problem of cross-
referencing large webs; simply read a record, and for each URL
in the record (which contains a list) do a lookup on the database.
(The output could then be turned into input for a graph-generating
program like AT&T's NEATO.)
 

-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 18:28:28 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:46:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 18:28:28 +0100
Date: Mon, 6 Jun 1994 18:28:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:203320:940606172830]
Content-Identifier: Re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 18:28:28 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton <bp@biotech.washington.edu>
Message-ID: <9406061727.AA09398@biotech.washington.edu>
To: mkgray@MIT.EDU
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Inverse Maps
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer                    (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 403

I've done some inverse mapping with the WebCrawler, but not to any great  
extent.  Right now, I just generate the "Top 25" list -- a list of the 25 most  
frequently referenced sites on the Web (at least, based the WebCrawler's  
limited experience).  This turns out to work pretty well -- you can see the  
(predictable) results at
    

	http://www.biotech.washington.edu/WebCrawler/Top25.html.

bri

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 10:14:53 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 10:15:25 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 10:14:53 +0100
Date: Mon, 6 Jun 1994 10:14:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:137830:940606091454]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:14:53 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060914.AA22927@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Avoidance Algorithms
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1101

One of the features that I implemented in W4v1.0 was an avoidance algorithm
I called 'boredom'.  First a brief implementation profile of W4v1.0:

W4v1.0 was written in June of 1993 as a simple depth first search that kept
the entire database in memory of where it had been and dumped to disk when
it had exhausted a document tree.  Very simple.

So, one issue I was concerned about was infinite trees (this is a bad thing
with depth first searches :-) so I added a feature to the Wanderer that allowed
it to 'get bored'.  Specifically, if it retrieved more than N documents with
the same path (except for the last element) and a few other heuristics, it
bailed out and found something more interesting to do.  For the most part
this was very successful.

W4v2.0 was a modification to do breadth first searching, and in that revision
'boredom' got removed, as it was not as useful to the algorithm.  I am planning
on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
based on content parsing.

Suggestions? Comments? Other implementations to avoid large trees?

						...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 10:26:27 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 10:26:27 +0100
Date: Mon, 6 Jun 1994 10:26:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138740:940606092628]
Content-Identifier: Database/memo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:26:27 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060925.AA22934@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Database/memory implementation
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1128

How have people in general implemented the DB?  By the database (DB) I mean
the robot's record of where it has been, not necassarily anything it
constructs for later consumption.

W4v1.0 implemented a completely in memory DB.  This worked fine when there
were 100 sites on the web.  It doesn't work any more :-)  Plus if the
Wanderer crashed, it wouldn't always successfully dump it's DB.

W4v2.0 implemented a disk based DB which has a number of advantages
1) It can get as big as it wants and not kill the machine
2) It saves state, so arbitrary crashes don't lose any substantial data
On the other hand, it is somewhat slower, though most of the time is spent
waiting for HTTP responses.

Currently, it maintains one record of where it has been ('log') and another
record of where it plans on going ('dq') and another set of analogous
in-memory lists which regularly get flushed to disk.

Any other more novel implementations out there?  I've given a passing thought
to trying a heierarchical DB, but I'm not sure it would be useful.

Any ideas on how to make an in-memory DB smaller?  Or a disk DB faster?

						...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 10:34:35 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 10:34:35 +0100
Date: Mon, 6 Jun 1994 10:34:35 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140130:940606093437]
Content-Identifier: Server list
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:34:35 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060934.AA22943@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Server list
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 977

Once I get W4v3.0 finished, I intend to add a number of the modifications
mentioned by Martijn in his initial letter (DNS identification of identical
servers, bogus servers eliminated, etc.)

Additionally, I would welcome any other lists of servers.  I can merge such
lists with the comprehensive list.  I will continue to maintain the 
"Comprehensive List of WWW Sites", so anything to make this as up to date and 
accurate as possible would be great.

Suggestions on other useful techniques for sorting the comprehensive list would
be great too.  If you don't know what I'm talking about, or have lost the URL:

http://www.mit.edu:8001/people/mkgray/compre.bydomain.html

So, please do send me any sitelists.  No desperate need to crosscheck with my
list, I can do that.  Of course, if you want to, that just makes my life
easier.

					...Matthew

BTW, I'm sending all these messages out separately to keep the topic threads
	vaguely separate, in case that wasn't apparent.

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 11:26:46 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:47:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 11:26:46 +0100
Date: Mon, 6 Jun 1994 11:26:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145260:940606102648]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:26:46 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406061120.aa01408@ruddles.sco.com>
To: mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
Subject:  Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 3168


mkgray@MIT.EDU writes ...

>One of the features that I implemented in W4v1.0 was an avoidance algorithm
>I called 'boredom'.  First a brief implementation profile of W4v1.0:

>W4v1.0 was written in June of 1993 as a simple depth first search that kept
>the entire database in memory of where it had been and dumped to disk when
>it had exhausted a document tree.  Very simple.
        :
>W4v2.0 was a modification to do breadth first searching, and in that revision
>'boredom' got removed, as it was not as useful to the algorithm.  I am planning
>on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
>based on content parsing.

>Suggestions? Comments? Other implementations to avoid large trees?

Well, my first cut at websnarf was a recursive depth-first probe.  This
rapidly ran away into the web, and as my bandwidth is limited to my
share of a 64K line this seemed like a bad idea.  It also had a tendency
to dump core due to stack frame overflows.

I went to the bookshelf and was most interested to read the chapter on
graph searching in Sedgewick (Algorithms, 2nd edn, can't remember the
year).  It turns out that you can use a stack to emulate a recursive
depth-first traversal, and a queue to emulate a recursive breadth-first
traversal, both without the need for recursion.  Perl provides a handy
data structure -- the list -- and calls to use a given list as either a
queue or a stack.

I modified websnarf so that it could do both breadth- and depth- 
traversals (with the switchover being handled in a small subroutine
that decided whether to push or shift URLs onto the list, and pop or
unshift them off the list).

Because my nonrecursive implementation created a list of URLs
representing the current state of your tree walk, it was then relatively
trivial to scan along the list.  If you see two occurences of the same
URL in the list, you know there's some danger of getting into a loop,
and you can just prune one of them out of the list.  It's also useful to
store the "depth" (i.e.  number of links away from home) along with each
URL in the list.  If two pointers to the same URL occur at the same
depth, the odds are that they're fairly safe -- just time-consuming.
But if one is above the other, there's the possibility of some kind of
weird loop occuring.


Finally, one thing I'd do immediately if I was working on websnarf right
now [*] would be to ensure that it avoids any URLs that look like search
commands or internal document pointers.  That way lies madness ...


-- Charlie

[*] websnarf is a personal effort, not a company-sanctioned project.
It's on the shelf due to lack of spare time at work.  I'm hoping
(fingers firmly crossed) to get a grant for next year to do the job
professionally, i.e.  to spend all my time on it, not just a couple of
hours a week.  Meantime, I'm spending my time thinking about how to get
right all the things I got wrong the first time round ...
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 11:52:10 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:47:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 11:52:10 +0100
Date: Mon, 6 Jun 1994 11:52:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:148030:940606105212]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:52:10 +0100;
Alternate-Recipient: Allowed
From: Lee McLoughlin <lmjm@doc.imperial.ac.uk>
Message-ID: <"swan.doc.i.352:06.05.94.10.50.10"@doc.ic.ac.uk>
To: Charlie Stross <charless@sco.COM>, mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
In-Reply-To: <charless@sco.COM>
Subject:  Re: Avoidance Algorithms
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
Status: RO
Content-Length: 747

My scanning routine was the usual depth-first search.  But this meant that
certain sites were scanned before others.  Apart from causing sites way down
the list to be left out it also meant that one site would get "soaked".  In
the end I went for random scanning of all stored URLs looking for a URL and
site that I hadn't gone to recently.

My system is also written in perl except that I store all the retrieved data
in dbm files.  Since I have URL and site timers I can now also run multiple
scanning processes at the same time.

-- 
--
Lee McLoughlin.                          Phone: +44 71 589 5111 X 5085
Dept of Computing, Imperial College,     Fax: +44 71 581 8024
180 Queens Gate, London, SW7 2BZ, UK.    Email: L.McLoughlin@doc.ic.ac.uk

From /CN=robots-errors/@nexor.co.uk Wed Jun  8 10:09:49 1994
Replied: Wed, 08 Jun 1994 11:02:20 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: " (Paul De Bra)" <debra@info.win.tue.nl>
Replied: nlc@cs.nott.ac.uk
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 10:11:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 8 Jun 1994 10:09:49 +0100
Date: Wed, 8 Jun 1994 10:09:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:090430:940608090957]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 10:09:49 +0100;
Alternate-Recipient: Allowed
From: " (Paul De Bra)" <debra@info.win.tue.nl>
Message-ID: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Avoidance Algorithms
Status: RO
Content-Length: 857

Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
breadth-first in 2.0. The experiments we ran with the fish-search, both real
and simulated, all showed that depth-first is a better navigation algorithm
than breadth-first.

We also have boredom, set to 1, meaning that we never retrieve the same
url twice.

Another thing the fish-search does is to try to not load url's from the same
host in succession. (It searches among the first 30 or so url's in its list
to find one from another host.)

A feature still missing in the fish-search, which I would like to hear about
from others is the use of ISMAP's.
My idea is to try a limited selection of coordinates first, and put a larger
selection further back in the "queue" of url's to be tried.

How do other robots find out which url's can be reached by clicking in an
ismap?

Paul.

From /CN=robots-errors/@nexor.co.uk Wed Jun  8 11:02:39 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 11:03:19 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 8 Jun 1994 11:02:39 +0100
Date: Wed, 8 Jun 1994 11:02:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:096060:940608100242]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 11:02:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"9595 Wed Jun  8 11:02:26 1994"@nexor.co.uk>
To: " (Paul De Bra)" <debra@info.win.tue.nl>
Cc: /CN=robots/@nexor.co.uk, nlc@computer-science.nottingham.ac.uk
In-Reply-To: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
Subject:  Re: Avoidance Algorithms 
Status: RO
Content-Length: 1994


> Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> breadth-first in 2.0.

Not really; most Web server URL spaces are structured hierarchically,
with growing more specific towards the leaves.  So if you start from a
server root and you do a bread-first search for a limited number of
documents you'll get a broader (and therefore for the purposes of
general indexing better) overview than if you do a depth-first search
for a limited number of documents, which can shoot of down one
specific area (especially in deep trees).

If you use maximum-depth rathyer then maximum-documents it shouldn't
matter much (depending on the structure of the data).

> The experiments we ran with the fish-search, both real
> and simulated, all showed that depth-first is a better navigation algorithm
> than breadth-first.

Can you elaborate on how they were better?

> A feature still missing in the fish-search, which I would like to hear about
> from others is the use of ISMAP's.
> My idea is to try a limited selection of coordinates first, and put a larger
> selection further back in the "queue" of url's to be tried.
> 
> How do other robots find out which url's can be reached by clicking in an
> ismap?

Dave Ragget would refer you to the HTML 3.0 facilities for specifying
links on figures within the figure element. I think that is the only
way; there are an infinite number of coordinates in an ISMAP, you
don't know where they are, and you can't check two locations for
equivalence.

Incidentally, I reckon that it is bad HTML if you provide an ismap as
sole access to a small set of URL's.

A friend of mine is working on a "click on this festival map to show
where you are going to be" service. I'd hate to think what a random
ISMAP coordinate-trying robot would do to that. ;-)

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From debra@info.win.tue.nl Wed Jun  8 12:50:17 1994
Replied: Wed, 08 Jun 1994 13:45:38 +0100
Replied: robots
Replied: debra@info.win.tue.nl (Paul De Bra)
Return-Path: <debra@info.win.tue.nl>
Delivery-Date: Wed, 8 Jun 1994 12:50:28 +0100
Received: from svin04.info.win.tue.nl by lancaster.nexor.co.uk 
          with SMTP (XTPP); Wed, 8 Jun 1994 12:50:17 +0100
Received: from pcpaul.info.win.tue.nl 
          by svin04.info.win.tue.nl (8.6.8/1.45)    id NAA28716;
          Wed, 8 Jun 1994 13:50:05 +0200
Received: from localhost by pcpaul.info.win.tue.nl (8.6.4/1.60)    id LAA20290;
          Wed, 8 Jun 1994 11:52:19 GMT
From: debra@info.win.tue.nl (Paul De Bra)
Message-Id: <199406081152.LAA20290@pcpaul.info.win.tue.nl>
Subject: Re: Avoidance Algorithms
To: m.koster@nexor.co.uk (Martijn Koster)
Date: Wed, 8 Jun 1994 13:52:17 +0200 (MET DST)
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199406081002.MAA06363@svin02.info.win.tue.nl> from "Martijn Koster" at Jun 8, 94 11:02:11 am
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 2835

> > Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> > breadth-first in 2.0.
> 
> Not really; most Web server URL spaces are structured hierarchically,
> with growing more specific towards the leaves.  So if you start from a
> server root and you do a bread-first search for a limited number of
> documents you'll get a broader (and therefore for the purposes of
> general indexing better) overview than if you do a depth-first search
> for a limited number of documents, which can shoot of down one
> specific area (especially in deep trees).

I guess our algorithm avoids shooting down into a specific area by trying
to find links to other sites first.
When you have limited search time you often don't get anywhere with breadth-
first navigation because you don't reach documents that are deep enough
to deal with a specific topic.
A robot that spends a *lot* of time, visiting very many documents, could
work equally well using breadth-first search.

> If you use maximum-depth rathyer then maximum-documents it shouldn't
> matter much (depending on the structure of the data).

We do use maximum-depth to avoid going too far in a non-relevant direction.

> > The experiments we ran with the fish-search, both real
> > and simulated, all showed that depth-first is a better navigation algorithm
> > than breadth-first.
> 
> Can you elaborate on how they were better?

They found more cross-reference links. Which need not mean anything, but
considering that cross-reference links may (in the web) be links leading to
different sites, this suggests a better chance of penetrating into a larger
part of the web. again, the fact that we search for a limited time is important
here.

> ...
> > ismap?
> 
> Dave Ragget would refer you to the HTML 3.0 facilities for specifying
> links on figures within the figure element. I think that is the only
> way; there are an infinite number of coordinates in an ISMAP, you
> don't know where they are, and you can't check two locations for
> equivalence.

nice to here something will be coming along.
there are a finite number of coordinates in an ISMAP, but the number is
large. we would never consider trying all possible coordinates.
which isn't necessary in any ismap i know.

> Incidentally, I reckon that it is bad HTML if you provide an ismap as
> sole access to a small set of URL's.

dunno. the course on hypertext which i have on line does it...
and databases that deal with mostly graphical information, providing ismaps
to zoom in on things and providing information would only have access through
ismaps as well.

> A friend of mine is working on a "click on this festival map to show
> where you are going to be" service. I'd hate to think what a random
> ISMAP coordinate-trying robot would do to that. ;-)

we'll work on it and see how it performs.

From /CN=robots-errors/@nexor.co.uk Wed Jun  8 13:46:24 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 13:47:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 8 Jun 1994 13:46:24 +0100
Date: Wed, 8 Jun 1994 13:46:24 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119080:940608124627]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 13:46:24 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"11890 Wed Jun  8 13:45:48 1994"@nexor.co.uk>
To: " (Paul De Bra)" <debra@info.win.tue.nl>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199406081152.LAA20290@pcpaul.info.win.tue.nl>
Subject:  Re: Avoidance Algorithms 
Status: RO
Content-Length: 1998


> there are a finite number of coordinates in an ISMAP,

I was under the impression you could provide fractional
coordinates, which would make the theoretical address space
infinte, but I'll settle for large :-)

> but the number is large. we would never consider trying all possible
> coordinates.  which isn't necessary in any ismap i know.

Sure, my point as that you don't know which to try, or which map
onto the "same" document, if that concept applies (think about a
click-on-the-world-map-to-get-lat-long ismap server)

> > Incidentally, I reckon that it is bad HTML if you provide an ismap as
> > sole access to a small set of URL's.
> 
> dunno. the course on hypertext which i have on line does it...
> and databases that deal with mostly graphical information, providing ismaps
> to zoom in on things and providing information would only have access through
> ismaps as well.

And they all rule out non-graphical displays :-( It isn't always possible/
useful to provide textual links in addition, but it is quite often.

> > A friend of mine is working on a "click on this festival map to show
> > where you are going to be" service. I'd hate to think what a random
> > ISMAP coordinate-trying robot would do to that. ;-)
> 
> we'll work on it and see how it performs.

It's not the performance I'm worried about in this example, but the
fact there is a semantic associated with this action. Imagine this
cool hypothetical ismap server that uses a slide to register your
appreciation with a page, or even a graphical green(yes), red(no)
voting card. If a robot tries some random clicks in these ismaps this
could have a nasty effect. Of course all links in the Web have this
danger, but especially with graphical user-interface things like maps
and forms you expect to be interfacing with a user...

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun  9 16:06:34 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 9 Jun 1994 16:13:31 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 9 Jun 1994 16:06:34 +0100
Date: Thu, 9 Jun 1994 16:06:34 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:282050:940609150638]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 16:06:34 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406091059.aa02900@ruddles.sco.com>
To: debra@info.win.tue.nl, m.koster@nexor.co.uk
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 2508


(Paul De Bra) <debra@info.win.tue.nl> writes:

>> > The experiments we ran with the fish-search, both real
>> > and simulated, all showed that depth-first is a better navigation algorithm
>> > than breadth-first.
>> 
>> Can you elaborate on how they were better?

>They found more cross-reference links. Which need not mean anything, but
>considering that cross-reference links may (in the web) be links leading to
>different sites, this suggests a better chance of penetrating into a larger
>part of the web. again, the fact that we search for a limited time is important
>here.

A minor point may be of interest to you: here at SCO we distinguish
between "navigation nodes" and "information nodes" in our
formally-constructed web pages. An information node may be linked
to other nodes, but its primary function is to store text; in general
it has a link to <previous> and <next> items for linear browsing. A
navigation node, on the other hand, has scads of URLs pointing both
to information nodes and to other navigation nodes. 

This seems to be a fairly common distinction between web pages,
as it naturally falls out of most methods of structuring information
in hypertext; and it provides a clue for  ways to avoid
flooding servers while doing a depth-first search. When grabbing
a page and searching it for URLs, count the URLs on the page; if
there're three or less, the page is probably an information node
and the pointers probably point to adjacent nodes in the same
document, while four or more URLs suggest a navigation node (with
URLs that could point anywhere on the net). Put URLs from information
nodes on one stack, and URLs from navigation nodes on another stack.
When selecting the next URL to explore, alternate between the two
stacks. Alternatively, maintain a list of local stacks -- one per
server polled -- and work along the list, taking a new URL from
each stack in turn, so that no server is ever polled twice in succession.

There are other ways of avoiding flooding a server, but these 
should be pretty easy to implement and will, most of the time, ensure
that a local lookup will be followed by a remote one.


-- Charlie

PS : Anyone else run across: 

http://www.biotech.washington.edu/WebCrawler/WebCrawler.html

yet? I'm most impressed ...
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Thu Jun  9 18:27:13 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 9 Jun 1994 18:27:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 9 Jun 1994 18:27:13 +0100
Date: Thu, 9 Jun 1994 18:27:13 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019820:940609172715]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 18:27:13 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton <bp@biotech.washington.edu>
Message-ID: <9406091726.AA02507@biotech.washington.edu>
To: Charlie Stross <charless@sco.COM>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Avoidance Algorithms
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer                    (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 1024

I like the idea of distinguishing among nodes based on the number of links  
they have.  For the WebCrawler, this would be a good way to reduce the number  
of nodes that need to be considered when deciding which nodes to visit next.

When it's running in breadth-first mode (and generating an index), the  
WebCrawler doesn't do any kind of avoidance -- it just visits each server in  
succession, giving priority to servers that have never been visited before.   
When the number of known servers is bigger than 100 or so, then there's no  
chance the WebCrawler will get back to a server before a reasonable amount of  
time has passed.

It its "directed searching" mode, the WebCrawler will avoid a server for some  
period of time after visiting it once, because there's a good chance its  
search criteria will want to grab another document from that server.  Right  
now, it think I set that time period to 60 seconds, which roughly corresponds  
to my intuition of how fast a human would do the same operation.

bri

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 12:01:53 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:03 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 12:01:53 +0100
Date: Mon, 6 Jun 1994 12:01:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:150550:940606110159]
Content-Identifier: Best environm...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 12:01:53 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406061158.aa01564@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk
Subject:  Best environment for knowbot development?
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1029


I notice that an awful lot of knowbots seem to be being developed in
Perl. There's at least one written in Python, probably a couple in C
...  but there are Perl applications crawling all over the place (or
so it feels at times!).

Off the cuff, I'd attribute this to the rich string-manipulation
features and accessible TCP/IP sockets provided by Perl, along with its
fairly high execution speed (for an interpreted language).  On the other
hand, Perl is complex and syntactically dense -- a nightmare for
non-UNIXheads.

Has anyone given any serious thought to the optimal development 
environment for knowbots? Apart from Perl, are there any languages/-
platforms you'd consider to be specially suitable for developing
knowbots? And if so, what are their salient characteristics?


-- Charlie


--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Mon Jun  6 14:31:59 1994
Replied: Tue, 07 Jun 1994 11:54:16 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: Michael.Mauldin@NL.CS.CMU.EDU
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:30:07 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 6 Jun 1994 14:31:59 +0100
Date: Mon, 6 Jun 1994 14:31:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:174170:940606133201]
Content-Identifier: Description o...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 14:31:59 +0100;
Alternate-Recipient: Allowed
From: Michael.Mauldin@NL.CS.CMU.EDU
Message-ID: <"17413 Mon Jun  6 14:31:48 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Description of the Lycos WW searcher at CMU
Status: RO
Content-Length: 5944

The Lycos project at Carnegie Mellon is in the early stage, we have a Web
explorer in operation, and our indexer will come on-line later this month.
We will use the SCOUT indexer which has an HTTP gateway (a set Sample
database of the Tipster corpus from Wall Street Journal is available
intermittently from http://fuzine.mt.cs.cmu.edu/scout/home.html).

Lycos is written in Perl, but uses a C program based on CERN's libwww to
fetch URLs.  It uses a random search, keeps its record of URLs visited in a
Perl assoc list stored in DBM (thanks to Charlie Stross for the tip that Gnu
DBM doesn't have arbitrary limits!).  It searches HTTP, FTP, and GOPHER
sites, ignoreing TELNET, MAILTO, and WAIS.  Lycos uses a data reduction
scheme to reduce the stored information about each document:

	Title
	Headings and Subheadings
	100 most "weighty" words (using Tf*IDf, Term freq / Inverse doc freq)
	First 20 lines
	Size in bytes
	Number of words

Lycos keeps a word frequency count as it runs...it has read over 25
million words.  A list of the most frequent words found after searching
6.3 million words is available off the Lycos home page.

So far, Lycos has run for less than a month

	URLs found:	313,468
	URLs fetched:	41,391 (35,382 successful)
	HTTP servers:	3,138

Citation counting (number of "parents" by URL): this is the first 50
URLs sorted by number of documents that reference that URL.  What
I did not do was to count only references from different sites (the
I'm sure that 99% of the refs to http://gdbwww.gdb.orf/omim come
from the Genome Database server itself.

------------------------------------------------------------------------

1703 http://gdbwww.gdb.org/omim/
1578 http://cossack.cosmic.uga.edu/keywords.html
 692 ftp://ftp.network.com/IPSEC/rfcindex4.html
 421 ftp://ftp.network.com/IPSEC/rfcindex3.html
 322 ftp://ftp.network.com/IPSEC/rfcauthor.html
 319 ftp://ftp.network.com/IPSEC/rfcindex5.html
 234 ftp://ftp.network.com/IPSEC/rfcindex2.html
 202 ftp://ftp.network.com/IPSEC/rfcindex1.html
 177 http://info.cern.ch/hypertext/WWW/TheProject.html
 166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
 135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
 133 http://www.cs.columbia.edu/~radev/
 133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
 118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
 108 http://www.mcs.anl.gov/home/gropp/
 107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
 105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
 101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
  85 http://cui_www.unige.ch/w3catalog
  84 http://wings.buffalo.edu/world
  82 http://sass577.endo.sandia.gov/SEACAS/CUBIT/Developers/
  80 http://cui_www.unige.ch/OSG/MultimediaInfo/mmsurvey/
  79 http://www.nta.no/telektronikk/4.93.dir/
  76 http://asp.esam.nwu.edu/chris/dce_prodlist.html
  76 http://hypatia.gsfc.nasa.gov/NASA_homepage.html
  76 http://info.cern.ch/hypertext/DataSources/WWW/Servers.html
  75 http://www.ncsa.uiuc.edu/demoweb/demo.html
  75 http://www.rtd.com/people/rawn/
  74 ftp://ftp.network.com/IPSEC/rfcindex0.html
  74 http://tns-www.lcs.mit.edu/cgi-bin/value-added/sports/register.sos.texas.gov/texreg/
  73 http://rs560.cl.msu.edu/weather/getmegif.html
  71 http://rs560.cl.msu.edu/weather/interactive.html
  70 http://rs560.cl.msu.edu/weather/textindex.html
  70 http://rs560.cl.msu.edu/~henrich/
  70 http://www.seas.upenn.edu/~mengwong/
  68 http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html
  68 http://rs560.cl.msu.edu/weather/uscmp.gif
  66 http://rs560.cl.msu.edu/weather/uscmp.mpg
  66 http://www.cso.uiuc.edu/~kline/cvk.html
  65 ftp://cs.nott.ac.uk/pub/sat-images/
  65 http://rs560.cl.msu.edu/weather/goes7ir.mpg
  65 http://rs560.cl.msu.edu/weather/worldir.mpg
  65 http://www.hmc.edu/~irilyth/diplomacy/
  64 gopher://burrow.cl.msu.edu/00/news/weather/lan
  64 gopher://ssec.wisc.edu
  64 http://rs560.cl.msu.edu/weather/6panel.mpg
  64 http://rs560.cl.msu.edu/weather/d2.jpg
  64 http://rs560.cl.msu.edu/weather/gmsvis.mpg
  63 http://cui_www.unige.ch/meta-index.html
  63 http://rd13doc.cern.ch/public/doc/Rd13StatusReport.html

------------------------------------------------------------------------

The Lycos philosophy is to keep a finite model of the web that enables
subsequent searches to proceed more rapidly.  The idea is to prune
the "tree" of documents and to represent the clipped ends with a summary
of the documents found under that node.  The 100 most important words
lists from several documents can be combined to produce a list of 
the 100 most important words in the set of documents.

Alternative fixed representations of documents or document sets include
the vector models such as Dumais at BellCore and Gallant & Caid at
Hecht-Neilson Corp.  The number 100 was chosen arbitarily, so we will
need to investigate to find whether than number is too high or too low.

I also subscribe to the dream of a single format and indexing scheme
that each server runs on its own data, but given the current state of
the community I believe it is premature to settle on a single format.
Various information retrieval schemes depend on wildly different kinds
of data.  We should try out more ideas and evaluate them carefully
and only then should we try to settle on a single format.

Resources:

I have agreed to share my code for research and educational users.
Should I make a requirement that recipients of the code post to this
mailing list so we can keep track of its proliferation?  I already
have promised code to two people.

I will make lists, statistics, reports, and the index server
accessible off the Lycos home page as they become available.

--Michael L. Mauldin
  Carnegie Mellon University
  Center for Machine Translation
  5000 Forbes Avenue
  Pittsburgh, PA 15213-3890
  fuzzy@cmu.edu


From /CN=robots-errors/@nexor.co.uk Tue Jun  7 11:55:08 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 11:56:17 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 7 Jun 1994 11:55:08 +0100
Date: Tue, 7 Jun 1994 11:55:08 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273170:940607105511]
Content-Identifier: Data formats ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 11:55:08 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"27301 Tue Jun  7 11:54:46 1994"@nexor.co.uk>
To: Michael.Mauldin@NL.CS.CMU.EDU
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"17413 Mon Jun 6 14:31:48 1994"@nexor.co.uk>
Subject:  Data formats (was Re: Description of the Lycos WW searcher at CMU)
Status: RO
Content-Length: 2295


> I also subscribe to the dream of a single format and indexing scheme
> that each server runs on its own data,

That is one step further then what I proposed; I was talking about a
single format for the database the robot uses to store its information
in locally. This should be achievable before suggesting any Web-wide
solution.

> but given the current state of the community I believe it is
> premature to settle on a single format.  Various information
> retrieval schemes depend on wildly different kinds of data.  We
> should try out more ideas and evaluate them carefully and only then
> should we try to settle on a single format.

Sure, we don't know exactly what is required yet, and identifying all
our present and future requirements is probabaly an impossible task
anyway. I believe in a gradual approach.

The fact remains that almost all robots keep a same set of core
information: which URL's were visited, which are going to be visited.
Probably when URL's were retrieved, what Last-modified time was, what
headings/keywords are, which URL's are referenced in which documents,
etc. It should be possible to extract these common elements from the
internal robot database, in a standard format. Maybe the word "data
exchange format" is more applicable than "database format"

I'd love to suggest something more concrete myself, but I can't use
personal first-hand experience as I don't have a robot. What concrete
data formats do people use?

> I have agreed to share my code for research and educational users.
> Should I make a requirement that recipients of the code post to this
> mailing list so we can keep track of its proliferation?  I already
> have promised code to two people.

Handing out robot code is exactly what needn't be required if the 
data gathered by a robot was accessible in a mungeable format.
If everybody who'd like to do some anaylisis of Web data was running
their own robot we're wasting a lot of bandwidth.
 
> I will make lists, statistics, reports, and the index server
> accessible off the Lycos home page as they become available.

Great, keep us posted.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From charless@sco.COM Tue Jun  7 14:59:24 1994
Return-Path: <charless@sco.COM>
Delivery-Date: Tue, 7 Jun 1994 14:59:57 +0100
Received: from relay1.UU.NET by lancaster.nexor.co.uk with SMTP (XTPP);
          Tue, 7 Jun 1994 14:59:24 +0100
Received: from sco.sco.COM by relay1.UU.NET with SMTP 	(rama) id QQwtgp28331;
          Tue, 7 Jun 1994 09:59:00 -0400
Received: from scol.london.sco.COM by sco.sco.COM	id aa06700;
          Tue, 7 Jun 94 7:05:39 PDT
Received: from ruddles.london.sco.com by scol.sco.COM	id ab07727;
          Tue, 7 Jun 94 14:57:54 BST
From: Charlie Stross <charless@sco.COM>
To: Michael.Mauldin@NL.CS.CMU.EDU, m.koster@nexor.co.uk
Subject: re: Data formats 
Cc: /CN=robots/@nexor.co.uk
X-Mailer: SCO Portfolio 2.0
Date: Tue, 7 Jun 1994 14:53:43 +0100 (BST)
Message-ID:  <9406071455.aa14751@ruddles.sco.com>
Status: RO
Content-Length: 8203

Martijn Koster <m.koster@nexor.co.uk> writes ...

>> I also subscribe to the dream of a single format and indexing scheme
>> that each server runs on its own data,

>That is one step further then what I proposed; I was talking about a
>single format for the database the robot uses to store its information
>in locally. This should be achievable before suggesting any Web-wide
>solution.
        :
>I'd love to suggest something more concrete myself, but I can't use
>personal first-hand experience as I don't have a robot. What concrete
>data formats do people use?

I don't use this yet, but it intrigues me as a possible future route,
for reasons that should be obvious ...

Here's the readme file from GlimpseHTTP 1.0, released earlier this
week:

--------------------------- cut here ----------------------------------

NAME
  GlimpseHTTP

WHAT IS GLIMPSE
  Glimpse (which stands for GLobal IMPlicit SEarch) is an indexing and
  query system that allows you to search through lots of files in many
  (possibly nested) directories very quickly.
  Glimpseindex, which you run by saying glimpseindex <directory(ies)>
  builds a very small index (2-5% of the text).
  With it, glimpse can search through all the files in these directories
  much the same way as grep, except that you don't have to specify file 
  names.  Glimpse supports most of agrep's options (agrep is our
  powerful version of grep, and it is part of glimpse) including
  approximate matching (e.g., finding misspelled words), Boolean queries,
  and even some limited forms of regular expressions.

DESCRIPTION
  GlimpseHTTP is a collection of tools that allows you to incorporate
  glimpse in WWW documents.  With it, you can provide general
  search capabilities to any user without incurring too much space 
  overhead.  Furthermore, these tools allow you to integrate search with 
  browsing.  If you have several nested directories which the user may
  browse, you can include the glimpse interface in each document such that
  only the relevant directories will be included in the search.  More
  details are given below.
  The current version of GlimpseHTTP was
  tested under httpd 1.2 HTML server from NCSA and
  Glimpse currently works on many Unix platforms.
  To search and browse the information any HTML browser can be used
  (this includes NCSA Mosaic for X-Windows, MS-Windows and
  Macintosh, Lynx and other browsers. For maximum convenience
  your browser should support forms, although minimal
  functionality can be achieved with any browser).

  Since GlimpseHTTP uses Glimpse, this provides some unique features

  - A very small index (3-5% of the total text).
  - Reasonably fast search.
  - Search for approximate match allowing errors.

  In addition, GlimpseHTTP provides you with the following
  capabilities:

  - You can use a combination of browsing and searching:
    first, you locate the directory where the relevant
    information can be stored, then you can use search
    to locate specific files.
  - The result of the search is a nicely formatted hypertext with
    hyperlinks to matching documents.
  - Following the hyperlink leads you not only to a particular
    file, but also to the exact place where the match occured.
  - Hyperlinks in the documents are converted on the fly to
    actual hyperlinks, which you can follow immediately. This
    makes the GlimpseHTTP particularily suitable for searching
    meta-information (Internet directories etc.).
  - Similar tools are provided for archiving and searching
    USENET newsgroups. You can maintain the archive of news articles
    and allow people to search your archive using the
    same interface. Features supported include kill-file for articles
    and fast search for particular posters. Since news archiver uses
    NNTP interface, you can archive news articles from remote
    news servers. (Browse and search for news is yet to be
    implemented: browsing in this case means selection of pertinent
    newsgroup(s), currently supported is only the search within
    one newsgroup a time)

  Among the possible applications of GlimpseHTTP we envision:

  - FTP sites with search possibilities;
  - news archiving sites;
  - any search application which should be accessed over local
    or global network where searching for approximate match and/or
    saving of disk space for indices is an issue.

GlimpseHTTP components:

  1. aglimpse - "Archive Glimpse" - a tool for searching file
     hierarchies indexed for Glimpse. aglimpse is a CGI-compliant
     program which performs the search and formats the output as 
     HTML document with hyperlinks to the matches.

  2. Administrative tools which facilitate maintaining and
     indexing of Glimpse archives. One of the programs is the
     HTML indexer which prepares hypertext indices for
     each searchable directory - this supports the concept
     of combined browsing and searching.

  3. GlimpseNews - a collection of tools for archiving and
     searching newsgroups archives.

SEE ALSO
        http://glimpse.cs.arizona.edu:1994/glimpsehttp.html -
                GlimpseHTTP home page.
        http://glimpse.cs.arizona.edu:1994 - Glimpse
                developers home page.
        README.install - directions on installing GlimpseHTTP
                on your server.
        README.amgr - description of Archive Manager.
        README.indexing - descriptioN of HTML indexer.

AUTHORS
        Paul Klark (GlimpseHTTP)
        Udi Manber, Sun Wu, and Burra Gopal (Glimpse)
        University of Arizona, Department of Computer Science
        To be put on glimpse mailing list, send mail to
        glimpse-request@cs.arizona.edu


-------------------------- CUT HERE ------------------------------

If you're still reading, what I'm thinking is:

A URL is effectively a "word"; any given document has a unique
URL. Indeed, an inverted-text index of URLs is a fairly sensible
way to keep track of large maps of the web, where the same URLs
may be replicated frequently.

GlimpseHTTP is a really convenient alternative to WAIS or Z39.50
retreival systems for providing access to stored text -- which is 
what HTML indexers need to do.

What's got me interested is this:

Imagine a web traversing 'bot that does a breadth-first search of the
web.  Every time it retrieves an HTML document, it indexes it using
glimpse.  However, rather than storing a pointer to the disk block where
the file resides, as glimpse does when it's indexing text files, the
knowbot stores a pointer to the URL of the file.  (URLs are stored in a
separate table, so that only an index pointer -- say, a 24-bit integer -- is
required to represent a URL in the actual text database.) When you do a
search for some words, the glimpse system finds the best match, then
dereferences the stored URLs to retrieve the documents containing
those words, then does an agrep search on it for context.  

This schema gives you a good compromise between index size and storage
space.  A 24-bit integer -- enough to index 16 million URLs -- is of the
same order of magnitude as the index pointers that glimpse already
provides for pointing to blocks: it would be a bit slower, but the
problem is not insuperable.

So in return for, say, 50Mb of disk space devoted to an index, you could
have a complete inverted-text database refering to 500-1000 Mb of HTML
files on the web; these would be available for retrieval with a single
URL lookup.

Now layer a client-server lookup mechanism like Alibi -- the UberNet
system -- on top of the glimpse/knowbot combo, and you have a mechanism
for propagating queries between index servers.  A properly designed
system could answer queries on a huge information domain without doing
any off-site lookups, or (if the information is not found locally)
forward the query to other servers. The result would be something like
Veronica, only with full free-text search capacity over the whole
of WebSpace.


-- Charlie


--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Tue Jun  7 15:00:04 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 15:00:43 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 7 Jun 1994 15:00:04 +0100
Date: Tue, 7 Jun 1994 15:00:04 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294320:940607140005]
Content-Identifier: ALIBI release...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 15:00:04 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406071455.aa14752@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk
Subject:  ALIBI release readme
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 10614

Please accept my apologies if you've already read this. It's the
readme for Alibi, a new resource retrieval system written at NIST,
and released earlier this week; I think it's relevent to this list ...


--------------------------- CUT HERE -----------------------------

Alibi/Unetd README (C) 1994 David Flater, dave@case50.ncsl.nist.gov
VERSION:  BETA 001


Quick Summary
-------------

Alibi is a new resource discovery and information retrieval system for the
Internet.  The acronym stands for Adaptive Location of Internetworked Bases of
Information.  Alibi provides a query interface that allows users to retrieve
information with keyword queries, without contacting remote servers or
navigating.  The resource discovery is fully automated, and the retrieval is
truly location independent.


Using Alibi
-----------

The source to the client is called alibi.c.  It's in a subdirectory called
clients and is also available separately.  The client program alibi can talk
with any Unetd anywhere, but you should talk to a local daemon if you have one
and let the system itself worry about remote access.  Alibi requires as its one
and only command line argument the name of the machine running the Unetd.  If
all is well, you will get a prompt asking for a query.  Typing 'help' or
anything else that is not a well-formed query will produce a brief help screen.
A basic query is a group of keywords in parenthesis, e.g. (cache software).
The special command 'more' (no parens) will retrieve something else like what
you just got.  More complex options are described in the help.  'quit' gets you
out of the client.  The client can be suspended and brought back without
interfering with the processing of the query by the information system; only
the final delivery will be delayed.


System Description (Read Before Installing Unetd)
-------------------------------------------------

The Ubernet is the information network used by Alibi.  Alibi is the name of the
entire system, including the simple client that allows users to retrieve
information.  Alibi is neither a navigational system like WWW nor a resource
catalog like Archie.  It is a fully distributed, fully automatic resource
discovery and information retrieval system with a query-based user interface.

The client (called alibi) contacts a Unetd, submits queries on behalf of the
user, and processes replies from the Unetd.  Unetd maintains Internet
connections with other Unetds at other sites, and it may also communicate with
mediators / resource managers at the local site to retrieve information.  When
a Unetd receives a query, it either generates a response using its local
resources or forwards the query to another Unetd.

Alibi can handle just about any kind of information.  Currently available
resources include the MS-DOS subtree of wuarchive (via NFS), the SEC's EDGAR
database (via FTP), a geographical database for Virginia, and the following
"demo-sized" information bases: a collection of sound files; a group of images
at NASA Goddard Space Flight Center; several Usenet newsgroups; the Alibi FTP
directory; and a source code reuse library.

You do not need to have an information base to run Unetd, but it would be very
nice if you would contribute what you have to the Alibi information network.
It is preferable to run a Unetd at the site having the information base than to
make Unetd access the data remotely, since Unetd handles distributed
information retrieval much better than anything else.


Installing Unetd
----------------

You do not need root privileges to run Unetd, but root should add it to
rc.local to keep it up through power failures.  As a last resort, a utility
called cron_rc has been included in the utils subdirectory to restart the
daemons after power failures without needing any privileges except access to
crontab.

Several makefiles are provided.  The default makefile assumes that you have an
ANSI C compiler, finds the maximum level of optimization it supports, and
compiles the daemon.  knrmakefile assumes that you have a K&R C compiler and
the ansi2knr utility.  makefile.sun.cc is a hacked version of knrmakefile that
bypasses the compiler-finding script, which is necessary, for example, if your
system administrator installed gcc (the preferred compiler) in a bogus way that
makes it so that you can't actually compile anything.

alibi.h contains a small number of #defines that can be altered to set the
working directory of the daemon and to bypass C library functions that are
missing on your machine.  By default, Unetd does a 'cd FQDN' (filling in the
FQDN of the local machine) when it is started.  In that directory you need to
put a file called bozo.txt (you can use the one provided in the source
directory) and a file called peers that lists the FQDN's of other hosts running
Unetds with which you want to connect.  When bringing up Unetd for the first
time, forget about the peers file and just run Unetd as an isolated daemon to
make sure it's working.  When you do eventually choose other Unetds to connect
to, choose some small number of sites that are geographically close.  Three is
a good number; six is okay; ten is getting excessive.  You won't gain anything
from creating too many links except lots of overhead.  Keep in mind that other
sites can put YOU in their peers file without telling you (just like you were
about to do to someone else) and make you have more links than you thought.

Unetd writes logging information to stdout and stderr, so redirect them to a
file.  Among the first information written is the FQDN of the local machine,
the PID of the daemon, and so on.  If the FQDN is wrong, you must put a file
called FQDN in the directory from which Unetd is started containing the correct
FQDN.  FQDN is the ONLY file that is read from the initial directory before 'cd
FQDN' happens.  If your Unetd announces itself with an incorrect FQDN, other
daemons will "bozo" it repeatedly (this will be noted in the log file) and you
will not be able to talk to other sites.  The default FQDN will be correct if
your system is correctly configured.

Unetd also writes periodic statistics on things like average response time.
Some of these statistics have known bugs, such as the fact that you can
register more delivered responses than accepted queries.  This logging
information currently accumulates at a rate that will produce 100k of log data
in a few days.  You can truncate the log by sending a HUP signal to Unetd.
(This will not kill the daemon)  After the testing period is over I intend to
greatly reduce the amount of logging that is done.

Verbose logging is also enabled by default for the cache decision function
because I haven't collected enough information to tune it yet.  You might want
to disable that by undoing the #define debugcache in alibi.h.  Queries from
users are logged since I want to know what somebody typed that crashed Unetd
when it happens.  It just logs the queries, not the identity of the users who
entered them.


Providing Information Bases
---------------------------

To provide an information base you need to install a mediator.  If you do this
wrong, you can degrade the performance of the entire system.

A mediator is a separate program and process that creates two named pipes in
the directory used by Unetd.  Unetd opens those pipes and talks with the
mediator using a simple protocol.  Unetd sends subqueries (keyword queries
with no Boolean logic) to the mediator, and the mediator returns OIDs (Object
IDentifiers) of matching data objects.  Unetd might then ask the mediator to
retrieve a data object or send another subquery.

The reasons that incorrectly installed mediators are a danger are as follows:

--  Unetd trusts mediators to provide intelligent classifications for data
    objects that mesh with the generally accepted class hierarchy of the
    Ubernet.  If you start creating lots of bogus data classes, the bogosity
    will propagate into the adaptive query classification heuristics used by
    Unetds all over the place and degrade performance.

--  Unetd trusts mediators not to give stupid answers to good queries.  If a
    mediator says that a data object matches a query, it is assumed that the
    degree of relevance is fairly high and that every keyword in the subquery
    was found to relate.  A mediator MUST NOT simply find the closest thing
    in the database regardless of the magnitude of its irrelevance!  If no
    data are relevant, a null response is expected so that some other mediator
    will be given a chance.

--  Unetd trusts mediators not to act in a Byzantine manner designed to crash
    the system.

Some examples of mediators are provided in a subdirectory called
example_resources.  rblobs.c is the most frequent starting point for building
resources.  rblobs.c will turn a file system subtree into an information base
using index files that you must provide.  A slight variation on rblobs.c was
used to provide the MS-DOS subtree of wuarchive.  rnntp.c shows how you can
overhaul rblobs.c to let Unetd retrieve information from diverse information
sources, and r_c_sources.c shows how automatic indexing can be employed.


Getting Sources and Further Reading
-----------------------------------

Alibi sources and miscellaneous Alibi-related papers are available for
anonymous FTP on speckle.ncsl.nist.gov under the directory called flater.  Of
course, you can also get them through Alibi.


Licensing and All That Jazz
---------------------------

Everything that is shipped in the Alibi/Unetd package is (C) 1994 David Flater,
but permission is granted for free copying.  The sources may be modified,
reused, or rewritten provided that fair credit to David Flater is given where
appropriate, to the extent that is appropriate for the level of reuse.

The right to use this software is granted to the public; the right to misuse it
is not.  Misuse of this software on an open network may degrade the performance
of the entire information system and violate the rights of other users.  Such
misuse is expressly prohibited, and all rights that you have been granted to
this software may be revoked in the event of such misuse.

No warranties of any kind are made with respect to this package.  The author
disclaims any and all responsibility for anything bad that happens as a result
of the use or misuse of this software.
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Mon Jun 13 14:06:48 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 13 Jun 1994 14:07:27 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 13 Jun 1994 14:06:48 +0100
Date: Mon, 13 Jun 1994 14:06:48 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050560:940613130650]
Content-Identifier: new paper
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 13 Jun 1994 14:06:48 
                      +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <charless@sco.COM>
Message-ID: <9406131403.aa24556@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk
Cc: charless@sco.COM
Subject:  new paper
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 394

I've just written a first-draft informal paper discussing knowbots.
It's on the web:

http://gemma.demon.co.uk:8001/~charlie/websearch.html

Comments?


-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
 

From /CN=robots-errors/@nexor.co.uk Tue Jun 14 08:44:58 1994
Replied: Tue, 14 Jun 1994 13:51:37 +0100
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 14 Jun 1994 08:45:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 14 Jun 1994 08:44:58 +0100
Date: Tue, 14 Jun 1994 08:44:58 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:128690:940614074459]
Content-Identifier: libwww-perl: ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 08:44:58 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406140044.aa03471@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, 
    shelden@fatty.law.cornell.edu
Subject:  libwww-perl: A generic WWW interface library for perl tools
Status: RO
Content-Length: 850

Hello all,

After some prompting from Martijn Koster and Oscar Nierstrasz at WWW94,
I decided to rewrite the core of MOMspider so that it can serve
as a generic library for WWW clients written in Perl.  So far it includes
support for all of HTTP and also local file requests. 

I am looking for more contributions to support the many other protocols
and also to provide better HTML libraries.

The distribution site and much more information about the libraries
can be found at 

       <http://www.ics.uci.edu/WebSoft/libwww-perl/>

and also at

       <ftp://liege.ics.uci.edu/pub/arcadia/libwww-perl/>

Please take a look and tell me what you think.


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Tue Jun 14 13:31:03 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 14 Jun 1994 13:32:27 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 14 Jun 1994 13:31:03 +0100
Date: Tue, 14 Jun 1994 13:31:03 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:176910:940614123106]
Content-Identifier: Re: libwww-pe...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 13:31:03 
                      +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9406141228.AA18336@fuzine.mt.cs.cmu.edu>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: libwww-perl: A generic WWW interface library for perl tools
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 815

I have a C program to fetch URL's based on CERN's libwww
that is available on the Web.  The value of using libwww
is that it works with HTTP, Gopher, FTP, and other protocols.
My robot uses this to fetch URLs, but does the text processing
in Perl.

I also have a Perl subroutine that implements the RobotsNot
Wanted function using Martijn's standard.  It caches the
rights file to prevent multiple accesses.

Check out

	http://fuzine.mt.cs.cmu.edu/mlm/scoutget.html
	http://fuzine.mt.cs.cmu.edu/mlm/rnw.html

	Each contains a short description, code, and
	a sample test run.

A question:

	Does anybody know a good way to randomly select
	an entry from a Perl associative list without
	looping through the whole array using 'each'?

--Michael L. Mauldin
  fuzzy@cmu.edu
  http://fuzine.mt.cs.cmu.edu/mlm/home.html

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 14:58:51 1994
Replied: Wed, 15 Jun 1994 17:44:24 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 14:59:34 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 14:58:51 +0100
Date: Wed, 15 Jun 1994 14:58:51 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:025190:940615135853]
Content-Identifier: Proposed name...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 14:58:51 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406150657.aa16431@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
Subject:  Proposed name change for /RobotsNotWanted.txt
Status: RO
Content-Length: 955

Hello all,

I was just editing my MOMspider paper for final submission in the 
WWW94 proceedings (what a pain!) and noticed that I have several
references to the name /RobotsNotWanted.txt in the text.  I would
like to change the name before it gets written in stone (i.e. before
I hand over copyright to Elsevier).

I propose that the name be:   /spiders.txt


Reasons:  1) It fits within the 8.3 filename restrictions for PCs
          2) It is easy to remember and hard to mistake (i.e. no mixed case)
          3) It is more web-ish than /robots.txt
          4) It does not imply that all robots are excluded (/norobots.txt)

So, what's the general consensus?  I need to have a decision within
the next 24 hours in order to get my paper done on time ;-)


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 15:14:59 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 15:15:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 15:14:59 +0100
Date: Wed, 15 Jun 1994 15:14:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:026740:940615141500]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 15:14:59 
                      +0100;
Alternate-Recipient: Allowed
From: Guido.van.Rossum@cwi.nl
Message-ID: <9406151407.AA09156=guido@voorn.cwi.nl>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu>
References: <9406150657.aa16431@paris.ics.uci.edu>
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 
X-Organization: CWI (Centrum voor Wiskunde en Informatica)
X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax)
Status: RO
Content-Length: 245

I vote for /robots.txt.  Seems more neutral (after all the general
term for web crawlers seems to be robots, not spiders).

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL:  <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>


From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:01:23 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 17:01:49 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 17:01:23 +0100
Date: Wed, 15 Jun 1994 17:01:23 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039080:940615160125]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:01:23 
                      +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9406151559.AA24432@fuzine.mt.cs.cmu.edu>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 295

I am in favor of the new name, if only because this is
the chance to put it out on paper, which is hard to change,
and I have no major objections to this name.

There are few enough RNW files out there that we can contact
all known such servers by email...

--Michael L. Mauldin
  fuzzy@cmu.edu

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:44:43 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 17:45:00 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 17:44:43 +0100
Date: Wed, 15 Jun 1994 17:44:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:048580:940615164444]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:44:43 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"4855 Wed Jun 15 17:44:30 1994"@nexor.co.uk>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu>
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 
Status: RO
Content-Length: 1425


> I propose that the name be:   /spiders.txt
>
> Reasons:  1) It fits within the 8.3 filename restrictions for PCs
>           2) It is easy to remember and hard to mistake (i.e. no mixed case)

Agree. The reason for a far-out name was a smaller chance of a name
collision, but the PC's are a problem.

>           4) It does not imply that all robots are excluded (/norobots.txt)

Agree.

>           3) It is more web-ish than /robots.txt

This is hardly a convincing argument. I'd prefer /robots.txt because it
is seems a broader term which can include other automated processes,
such as mirrors. But if my vote results in a hung decision I'll happily
change.
 
> So, what's the general consensus?  I need to have a decision within
> the next 24 hours in order to get my paper done on time ;-)

Yup, This is a good occasion to fix that outstanding issue. So, what have
we got:

	/robots.txt:2	/spiders.txt:1

So Roy, as you've got the clock, let us know which name it is to be.

Another issue with the robots.txt spec; is there any problem with
allowing for shell-like "#" comment lines? This has been suggested by
two other people, and I'd like to add it when I add the new name. 
I'd also like any other comments on the proposal...

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:04:38 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 19:05:01 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 19:04:38 +0100
Date: Wed, 15 Jun 1994 19:04:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:054230:940615180442]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:04:38 
                      +0100;
Alternate-Recipient: Allowed
From: John.R.R.Leavitt@NL.CS.CMU.EDU
Message-ID: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Status: RO
Content-Length: 1236

If anyone gets this twice, I apologize... I got some nasty bounce mail when
I submitted it before, so I am trying again.

I would prefer /robots.txt (or even better something like robots.lmt (limit),
since txt implies human-readable text to me).  My main preference for this
is that my robots are named after ants, not spiders (since they will cooperate
when they are done (someday...)).  Also, there is the world wide web
worm and the webcrawler, none of which seem to use the spider metaphor.

Just my $0.02.

-John.

--------------------------------jrrl@cs.cmu.edu-------------------------------
John R. R. Leavitt                            "Even through the darkest phase
Research Programmer                                       Be it thick or thin
Center for Machine Translation                   Always someone marches brave
Carnegie Mellon University                               Here beneath my skin"
Editor, Omphalos Magazine                         k.d.lang, "Constant Craving"
------------------------------------------------------------------------------
Reading: Little, Big by John Crowley
         Remaking History by Kim Stanley Robinson
------------------------------------------------------------------------------

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:17:41 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 19:18:12 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 19:17:41 +0100
Date: Wed, 15 Jun 1994 19:17:41 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055110:940615181742]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:17:41 
                      +0100;
Alternate-Recipient: Allowed
From: "Tronche Ch. le comique" <Christophe.Tronche@lri.fr>
Message-ID: <9406151820.AA01347@indy1.lri.fr>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Original-Received: from indy1.lri.fr by lri.lri.fr,                    Wed, 15 
                   Jun 1994 20:14:58 +0200 
PP-warning: Illegal Received field on preceding line
Original-Received: by indy1.lri.fr, Wed, 15                    Jun 94 20:20:46 
                   +0200
PP-warning: Illegal Received field on preceding line
X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv 
        
        x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 
        
        "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ<V^N`hYM;pD#v&!` 
        _A?V7^R!
X-Www: http://www-ihm.lri.fr/~tronche/
Status: RO
Content-Length: 1404


What about /racl.txt ? (Robots Access Control List). It is short
enough for MS-DOS and a collision between an acronym and a "real
document" name is unlikely to occur...

John (John.R.R.Leavitt@NL.CS.CMU.EDU) writes:

> I would prefer /robots.txt (or even better something like robots.lmt (limit),
> since txt implies human-readable text to me).  

The file _is_ human-readable, in some sense.

Just $0.02 more.

+--------------------------+------------------------------------+
|                          |                                    |
|    Christophe TRONCHE    |    E-mail : tronche@lri.fr	        |
|                          |                                    |
|        +-=-+-=-+         |    Phone  : 33 - 1 - 69 41 66 25   |
|                          |    Fax    : 33 - 1 - 69 41 65 86   |
+--------------------------+------------------------------------+
|      ######      **                                           |
|     ##     #         Laboratoire de Recherche en Informatique |
|    ##       #   ##   Batiment 490                             |
|   ##       #   ##    Universite de Paris-Sud                  |
|  ##    ####   ##     91405 ORSAY CEDEX                        |
| ######    ## ##      FRANCE                                   |
|######      ###                                                |
+---------------------------------------------------------------+


From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:28:10 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 19:28:43 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 19:28:10 +0100
Date: Wed, 15 Jun 1994 19:28:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055740:940615182811]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:28:10 
                      +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9406151828.AA24805@fuzine.mt.cs.cmu.edu>
To: /CN=robots/@nexor.co.uk
Cc: fuzzy@CMU.EDU
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 1057

Okay, let me modify my earlier vote.  I am still in favor
of doing the name change NOW and picking a DOS compatible
name.

How about

	agents.pol	(For agents policy).

This satisfies a number of criteria

	1. It is neutral, it does not imply that agents
	   are good or bad.

	2. "agent" is a general accepted term for what
	   spiders, worms, ants and robots do.

	3. the .pol extension does not seem to imply
	   human readability

Let me also second (or vote for) the suggestion to
add comments to the spec, with '#' being a perfectly
acceptable comment introduction character.

Finally, let's drop the notion that an empty agents.pol
file has a meaning...given the diversity of server responses
to a non-existant file, let's force someone to use the
exclusion language to deny access to every one:

	Robot: *
	Disallow: /

should be the accepted way to turn off remote agents.
We might as well change the "Robot:" to "Agent:", and
then, we'll even be consistent with the CERN WWW spec
(it is a User-Agent, after all).

--Michael Mauldin <fuzzy@cmu.edu>

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:43:32 1994
Replied: Thu, 16 Jun 1994 09:37:30 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: John.R.R.Leavitt@NL.CS.CMU.EDU
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 19:44:06 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 19:43:32 +0100
Date: Wed, 15 Jun 1994 19:43:32 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:056630:940615184333]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:43:32 
                      +0100;
Alternate-Recipient: Allowed
From: John.R.R.Leavitt@NL.CS.CMU.EDU
Message-ID: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Status: RO
Content-Length: 1537

"Tronche Ch. le comique" <Christophe.Tronche@lri.fr> writes:

>John (John.R.R.Leavitt@NL.CS.CMU.EDU) writes:
>
>> I would prefer /robots.txt (or even better something like robots.lmt (limit),
>> since txt implies human-readable text to me).
>
>The file _is_ human-readable, in some sense.

True.  But then, to the the right people, so are .ps files, .c files, and even
strange things like .dvi and .o files.  Around here, .perl and .lisp files
are considered human readable for the most part.  What I meant, was that .txt
seems to suggest non-computer-readable data (meaning not designed for computer
readability, since I'm sure a computer could read anything I could).  In
the end, the extension really doesn't matter all that much. :^)

a couple more cents (if we keep going, we can all chip in on a soda! :^)

-John.

--------------------------------jrrl@cs.cmu.edu-------------------------------
John R. R. Leavitt                            "Even through the darkest phase
Research Programmer                                       Be it thick or thin
Center for Machine Translation                   Always someone marches brave
Carnegie Mellon University                               Here beneath my skin"
Editor, Omphalos Magazine                         k.d.lang, "Constant Craving"
------------------------------------------------------------------------------
Reading: Little, Big by John Crowley
         Remaking History by Kim Stanley Robinson
------------------------------------------------------------------------------

From /CN=robots-errors/@nexor.co.uk Wed Jun 15 21:13:19 1994
Replied: Thu, 16 Jun 1994 09:38:53 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 15 Jun 1994 21:13:41 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 15 Jun 1994 21:13:19 +0100
Date: Wed, 15 Jun 1994 21:13:19 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:062430:940615201320]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 21:13:19 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406151307.aa07835@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk>
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 
Status: RO
Content-Length: 1923

Hmmm...not a whole lot of consensus out there.

Acronyms such as "racl.txt" are too hard to remember.

I think the extension needs to reflect the content-type, not its purpose.
Specifically, it is not fair to ask people to define a new type just for
this file.

On the other hand, we could always call it "robots.pl" and require
the format to be in Perl4.  ;-)

Yes, the comment syntax should be "all lines starting with # and all
empty lines are ignored".

I would also like to add an "Expires: " entry, e.g.

     Expires: daily       (means don't check me again until tomorrow)
     Expires: weekly      (  "    "     "  "    "     for 7  days)
     Expires: monthly     (  "    "     "  "    "     for 30 days)
     Expires: never       (means never check me again)
     Expires: 27 Jun 1994 (means don't check again until after the given date)


Just my NZ half-penny ...
=======================================================================

Okay, the voting so far, counting my own (I think):

RTF = your's truly                  Y = Yes
GvR = Guido van Rossum              N = No
MLM = Michael L. Mauldin            O = Okay, maybe, don't care, ...
MAK = Martijn Koster
JRL = John R. R. Leavitt
CT  = Christophe Tronche


spiders.txt robots.txt robots.lmt racl.txt agents.pol agents.txt avoidURL.txt
----------- ---------- ---------- -------- ---------- ---------- ------------
RTF   Y         O          N          N        N          O          Y
GvR   N         Y                                                       
MLM   O         O                              Y 
MAK   N         Y
JRL   N         Y          Y
CT                         N          Y


and a grand total of USD $0.04 + FF $0.02 + NZD $0.005


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:39:36 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 16 Jun 1994 09:41:05 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 16 Jun 1994 09:39:36 +0100
Date: Thu, 16 Jun 1994 09:39:36 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:109360:940616083939]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:39:36 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"10917 Thu Jun 16 09:39:14 1994"@nexor.co.uk>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406151307.aa07835@paris.ics.uci.edu>
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 
Status: RO
Content-Length: 301

> I would also like to add an "Expires: " entry, e.g.

Why not rely on the HTTP Expires? That's what it's for...


-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:37:50 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 16 Jun 1994 09:38:09 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 16 Jun 1994 09:37:50 +0100
Date: Thu, 16 Jun 1994 09:37:50 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:108770:940616083751]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:37:50 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"10874 Thu Jun 16 09:37:43 1994"@nexor.co.uk>
To: John.R.R.Leavitt@NL.CS.CMU.EDU
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk>
Subject:  Re: Proposed name change for /RobotsNotWanted.txt 
Status: RO
Content-Length: 2155


Michael Mauldin <fuzzy@cmu.edu> wrote:

>        3. the .pol extension does not seem to imply
>           human readability
...
> In the end, the extension really doesn't matter all that much.

I've had problems in the past with ALIWEB's /site.idx, where servers
(in these cases CERN and the NT one) didn't recognise the extension
and made it application/binary or something. This can be a bit
annoying if the client uses an Accept line with text/plain and
text/html.

So I guess officially we'd like a separate mime type for this and
don't worry about extensions at all, but in practice using .txt
saves you hassle.

>         2. "agent" is a general accepted term for what
>           spiders, worms, ants and robots do.

Well, it's close to user-agent, ie any client. This is a larger
category than the automated robots these policy lines are directed at;
I don't mind manual browser going through all these places I want to
hide from robots. So this may not be appropriate.

On my way home last night I remembered that a while back someone
suggested "/robotsp.txt", with the P for policy.  This is even better
than /robots.txt, as it describes the contents of the file well, and
has less chance of name collision.  But then, I wouldn't want to upset
Roy's chart :-)

> Let me also second (or vote for) the suggestion to
> add comments to the spec, with '#' being a perfectly
> acceptable comment introduction character.

Good. Anybody object?

> Finally, let's drop the notion that an empty agents.pol
> file has a meaning...given the diversity of server responses
> to a non-existant file, let's force someone to use the
> exclusion language to deny access to every one:

OK, it was a bit obscure.

> should be the accepted way to turn off remote agents.
> We might as well change the "Robot:" to "Agent:", and
> then, we'll even be consistent with the CERN WWW spec
> (it is a User-Agent, after all).

"User-agent:" then? Mmm, I think I like the sound of that.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun 16 12:40:49 1994
Replied: Fri, 17 Jun 1994 14:07:21 +0100
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Replied: Thu, 16 Jun 1994 12:44:18 +0100
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 16 Jun 1994 12:41:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 16 Jun 1994 12:40:49 +0100
Date: Thu, 16 Jun 1994 12:40:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:134480:940616114051]
Content-Identifier: Re: Proposed ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 12:40:49 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406160440.aa00301@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Proposed name change for /RobotsNotWanted.txt
Status: RO
Content-Length: 2604


------- Forwarded Message

From:	Peter Beebee <beebee@parc.xerox.com>
To:	fielding@simplon.ICS.UCI.EDU
In-reply-to: "Roy T. Fielding"'s message of Wed, 15 Jun 1994 13:13:19 -0700 <9406151307.aa07835@paris.ics.uci.edu>
Subject: Re: Proposed name change for /RobotsNotWanted.txt 
Reply-to: beebee@parc.xerox.com
Message-Id: <94Jun15.164028pdt.2695@persica.parc.xerox.com>
Date:	Wed, 15 Jun 1994 16:40:23 PDT


Yet more American $$

But first, an introduction:

Hello everybody, my name is Peter Beebee.  I'm an undergraduate at MIT
currently working at Xerox PARC.  One of my recent projects is to implement
an experimental web browser which operates through existing WWW clients but
provides more natural searching options.  For this project I am writing
(in PERL) a robot currently identified as SG-Scout.  The purpose of this
robot is to collect the information needed for the searching algorithms
I will be using.

Actually, the first version of SG-Scout is already written (thanks to the
help of a couple of you); I've gotten it to run inside Xerox, but I've
had problems with our firewall when I've tried to access remote servers.  

I do (and plan to continue to) comply with the proposed standard of exclusion.

As for the name problem, I vote for something like "robots.cnf" or "robots.cfg"
(configure) over "robots.txt".  This way we could avoid creating our own
extension for one file, but we would at the same time reduce the chances
of collision. The RobotsNotWanted.txt file is more of a configuration file
than a text file...

  -- Peter  <beebee@parc.xerox.com, ptbb@ai.mit.edu>


------- End of Forwarded Message

And my reply is:

That sounds a lot like fish-search -- you should talk to Reiner Post.

The libwww-perl code includes the ability to use a proxy server.
See <http://www.ics.uci.edu:80/WebSoft/libwww-perl/>


And there is no defined mime-type for config files, so .cnf and .cfg
would be no better than .lmt in that regard.  Of course, we could always
define one and make it a standard, say

    text/config   cfg

but that would still be somewhat annoying to server maintainers.

Oh, and never mind about the Expires thing -- I agree with Martijn
that we should use the (painfully obvious) existing mechanism.
However, I do not think that "robotsp.txt" more accurately reflects
the purpose of the file -- it sounds like robot's pee (which is not
quite what we had in mind ;-)


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>


From /CN=robots-errors/@nexor.co.uk Fri Jun 17 00:59:37 1994
Replied: Fri, 17 Jun 1994 09:17:21 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: beebee@parc.xerox.com
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 01:00:36 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 00:59:37 +0100
Date: Fri, 17 Jun 1994 00:59:37 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:195960:940616235940]
Content-Identifier: Evolving Stan...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 00:59:37 
                      +0100;
Alternate-Recipient: Allowed
From: Peter Beebee <beebee@parc.xerox.com>
Message-ID: <94Jun16.165832pdt.2695@persica.parc.xerox.com>
To: /CN=robots/@nexor.co.uk
Subject:  Evolving Standard
Reply-To: beebee@parc.xerox.com
Status: RO
Content-Length: 169


Ok.. so how is the standard emerging out of all this turmoil?

"robots.txt"?
empty file = all robots permitted?
'#' = comment character?
no "Expires" lines?

 - Peter


From /CN=robots-errors/@nexor.co.uk Fri Jun 17 10:54:41 1994
Replied: Fri, 17 Jun 1994 14:05:53 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: beebee@parc.xerox.com
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 10:56:08 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 10:54:41 +0100
Date: Fri, 17 Jun 1994 10:54:41 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:247490:940617095444]
Content-Identifier: Re: Evolving ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 10:54:41 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk>
To: beebee@parc.xerox.com
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <94Jun16.165832pdt.2695@persica.parc.xerox.com>
Subject:  Re: Evolving Standard 
Status: RO
Content-Length: 405

> Ok.. so how is the standard emerging out of all this turmoil?
>
> "robots.txt"?
> empty file = all robots permitted?
> '#' = comment character?
> no "Expires" lines?

Yes. I'll be changing the document accordingly.


-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 14:06:28 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 14:07:27 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 14:06:28 +0100
Date: Fri, 17 Jun 1994 14:06:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273620:940617130630]
Content-Identifier: Re: Evolving ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 14:06:28 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk>
Cc: beebee@parc.xerox.com, /CN=robots/@nexor.co.uk
In-Reply-To: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk>
Subject:  Re: Evolving Standard 
Status: RO
Content-Length: 392

I wrote:

> Yes. I'll be changing the document accordingly.

I ahve in fact rewritten it entirely. Please let me know if there's
anything I've missed.

http://web.nexor.co.uk/mak/doc/robots/norobots.html


-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:41:25 1994
Replied: Fri, 17 Jun 1994 17:43:55 +0100
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 17:41:59 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 17:41:25 +0100
Date: Fri, 17 Jun 1994 17:41:25 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007700:940617164126]
Content-Identifier: Re: Evolving ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:41:25 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406170939.aa07770@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk>
Subject:  Re: Evolving Standard 
Status: RO
Content-Length: 396

Martijn wrote:

> I have in fact rewritten it entirely. Please let me know if there's
> anything I've missed.
> 
> http://web.nexor.co.uk/mak/doc/robots/norobots.html

Oooh, very nice.  Looks great,


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:37:38 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 17:38:11 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 17:37:38 +0100
Date: Fri, 17 Jun 1994 17:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007030:940617163739]
Content-Identifier: Re: New code ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:37:38 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk>
To: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406171622.AA09303@fuzine.mt.cs.cmu.edu>
Subject:  Re: New code available to implement the latest standard (Perl) 
Status: RO
Content-Length: 1174


Michael Mauldin wondered about ordering of the lines in the
record. As it stands the ordering isn't explicitly specified,
so that a User-agent line can follow a Disallow line:

>         Disallow: /
>        User-agent: GoodRobot
>         Disallow:
>
> What does this mean?

The same as

User-agent: GoodRobot
Disallow:
Disallow: /

which is silly, but is to be interpreted to mean "Allow all
URL's except those which start with a slash", which in practice
disallows all urls.

> Requiring the robot name before the action allows a simple
> way to determine how to proceed
>        1. Find your name (or *)
>        2. Read all Disallow lines and act on them.
>
> Otherwise you force the robot to read the whole file
> to figure out what to do.

I like unspecified ordering because that is how RFC822 is specified,
and this format looks very much like rfc822. I can't imagine parsing
overhead to really be a problem. But if there is a lot of resistance
to it I'll change it. comments?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 18:38:05 1994
Replied: Sun, 19 Jun 1994 15:39:08 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 17 Jun 1994 18:38:43 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 17 Jun 1994 18:38:05 +0100
Date: Fri, 17 Jun 1994 18:38:05 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:011430:940617173807]
Content-Identifier: Re: New code ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 18:38:05 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406171036.aa11854@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk>
Subject:  Re: New code available to implement the latest standard (Perl) 
Status: RO
Content-Length: 1010

Martijn wrote:

> Michael Mauldin wondered:
>> Requiring the robot name before the action allows a simple
>> way to determine how to proceed
>>        1. Find your name (or *)
>>        2. Read all Disallow lines and act on them.
>>
>> Otherwise you force the robot to read the whole file
>> to figure out what to do.
> 
> I like unspecified ordering because that is how RFC822 is specified,
> and this format looks very much like rfc822. I can't imagine parsing
> overhead to really be a problem. But if there is a lot of resistance
> to it I'll change it. comments?

Nope, it won't work that way.  rfc822 parsers combine identical headers
into a single, comma-separated list, thus causing any blank Disallow:
lines to disappear.

I recommend defining it as ordered (it is less confusing to the reader
that way).

....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Sun Jun 19 15:39:21 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 19 Jun 1994 15:39:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 19 Jun 1994 15:39:21 +0100
Date: Sun, 19 Jun 1994 15:39:21 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:102640:940619143922]
Content-Identifier: Re: New code ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 15:39:21 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"10260 Sun Jun 19 15:39:14 1994"@nexor.co.uk>
To: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406171036.aa11854@paris.ics.uci.edu>
Subject:  Re: New code available to implement the latest standard (Perl) 
Status: RO
Content-Length: 508


Roy wrote:

> Nope, it won't work that way.  rfc822 parsers combine identical headers
> into a single, comma-separated list, thus causing any blank Disallow:
> lines to disappear.

which is consistent with its semantics.

> I recommend defining it as ordered (it is less confusing to the reader
> that way).

alright...

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Sat Jun 18 03:54:36 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sat, 18 Jun 1994 03:55:12 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sat, 18 Jun 1994 03:54:36 +0100
Date: Sat, 18 Jun 1994 03:54:36 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:044150:940618025438]
Content-Identifier: libwww-perl v...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 18 Jun 1994 03:54:36 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9406171953.aa14559@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, 
    shelden@fatty.law.cornell.edu
Subject:  libwww-perl version 0.11
Status: RO
Content-Length: 1134

Hello all,

I made a few bug fixes and upgrades to the libwww-perl in preparation
for a general announcement on www-talk.

Version 0.11                                      June 17, 1994
   Changed environment variable LIBWWW-PERL to LIBWWW_PERL because
   some systems can't handle the dash (Charlie Stross).
   Fixed bug in "get" that caused full pathname to be used as the method
   (Martijn Koster).
   Fixed handling of perverse relative URLs (e.g. ../../) in wwwurl'absolute.


The distribution site and much more information about the libraries
can be found at 

       <http://www.ics.uci.edu/WebSoft/libwww-perl/>

and also at

       <ftp://liege.ics.uci.edu/pub/arcadia/libwww-perl/>
       
If you have already picked up a copy, a patch file is available at
both locations (patch010to011.txt).

After today, I will just announce changes on www-talk (I know how
annoying it is to get several copies of the same announcement).


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Sun Jun 19 16:19:45 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 19 Jun 1994 16:20:09 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 19 Jun 1994 16:19:45 +0100
Date: Sun, 19 Jun 1994 16:19:45 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:104030:940619151947]
Content-Identifier: (q)Version(q)...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 16:19:45 
                      +0100;
Alternate-Recipient: Allowed
From: "Tronche Ch. le comique" <Christophe.Tronche@lri.fr>
Message-ID: <9406191522.AA26207@indy1.lri.fr>
To: /CN=robots/@nexor.co.uk
Subject:  "Version" field in /robots.txt
Original-Received: from indy1.lri.fr by lri.lri.fr,                    Sun, 19 
                   Jun 1994 17:16:43 +0200 
PP-warning: Illegal Received field on preceding line
Original-Received: by indy1.lri.fr, Sun, 19                    Jun 94 17:22:42 
                   +0200
PP-warning: Illegal Received field on preceding line
X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv 
        
        x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 
        
        "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ<V^N`hYM;pD#v&!` 
        _A?V7^R!
X-Www: http://www-ihm.lri.fr/~tronche/
Status: RO
Content-Length: 1332


I've just made my new robot conformant with the new standard. This led
to the following observation:

since the standard is evolving, and is likely to evolve further in the
future, a "Version" field in /robots.txt would allow a robot to easily
detect that it has a possibly out-of-date understanding of the
avoidance file.

Any comments ?

+--------------------------+------------------------------------+
|                          |                                    |
|    Christophe TRONCHE    |    E-mail : tronche@lri.fr	        |
|                          |                                    |
|        +-=-+-=-+         |    Phone  : 33 - 1 - 69 41 66 25   |
|                          |    Fax    : 33 - 1 - 69 41 65 86   |
+--------------------------+------------------------------------+
|      ######      **                                           |
|     ##     #         Laboratoire de Recherche en Informatique |
|    ##       #   ##   Batiment 490                             |
|   ##       #   ##    Universite de Paris-Sud                  |
|  ##    ####   ##     91405 ORSAY CEDEX                        |
| ######    ## ##      FRANCE                                   |
|######      ###                                                |
+---------------------------------------------------------------+


From /CN=robots-errors/@nexor.co.uk Mon Jun 20 13:27:38 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 20 Jun 1994 13:28:25 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 20 Jun 1994 13:27:38 +0100
Date: Mon, 20 Jun 1994 13:27:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:187400:940620122739]
Content-Identifier: Administrativ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 13:27:38 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Administrativa, and norobots.pl
Status: RO
Content-Length: 580


As a number of people have recently asked me about the robots mailing
list I have put up a Web page with some info, which has little news for
you, but you might want to keep for future reference:
<http://web.nexor.co.uk/mak/doc/robots/mailing-list.html>

Jumping on the bandwagon I have also written a /robots.txt parser in Perl,
it's on <http://web.nexor.co.uk/mak/doc/robots/norobots.html>

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From guido@cwi.nl Mon Jun 20 13:53:33 1994
Replied: Mon, 20 Jun 1994 15:15:39 +0100
Replied: Martijn Koster <m.koster@nexor.co.uk>
Replied: /DD.Common=robots/@nexor.co.uk
Replied: Guido.van.Rossum@cwi.nl
Return-Path: <guido@cwi.nl>
Delivery-Date: Mon, 20 Jun 1994 13:53:47 +0100
Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP);
          Mon, 20 Jun 1994 13:53:33 +0100
Received: from voorn.cwi.nl by charon.cwi.nl with SMTP	id <AA27754@cwi.nl>;
          Mon, 20 Jun 1994 14:53:15 +0200
Received: by voorn.cwi.nl with SMTP	id <AA23764@cwi.nl>;
          Mon, 20 Jun 94 14:53:14 +0200
Message-Id: <9406201253.AA23764=guido@voorn.cwi.nl>
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /DD.Common=robots/@nexor.co.uk
Subject: Re: Administrativa, and norobots.pl 
In-Reply-To: Your message of "Mon, 20 Jun 1994 14:27:38 MDT."             <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk> 
References: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk> 
From: Guido.van.Rossum@cwi.nl
X-Organization: CWI (Centrum voor Wiskunde en Informatica)
X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax)
Date: Mon, 20 Jun 1994 14:53:13 +0200
Sender: Guido.van.Rossum@cwi.nl
Status: RO
Content-Length: 1634

Martijn, sorry if any of this has been discussed before:

Looking at your recent norobots.html, I noticed that the spec uses
"User-agent:" but the examples use "Robot:" ...  (I prefer Robot).

I personally have some problems interpreting your format description
precisely -- the language seems to be open for misinterpretation.  (I
want to write a robots.txt parser in Python.  I can't read Perl so
looking at your example parser won't do me much good, and anyway that
shouldn't be necessary :-)

E.g. records are separated by blank lines.  Is this before or after
removing comments?  (This would make a difference regarding Is there
no <optionalspace> allowed after the <value>?  (The example suggests
there is -- between the value and the #comment.)

Also since there's a strict alternation of Robots: and Disallow:
lines, why not use the appearance of a Robots: line to signal the end
of a record?  Then the syntax would be (using my own BNF variant --
hope it's clear):

	file: endline* record*
	record: robotsline+ disallowline+
	robotsline: 'Robots:' sp* value sp* endline+
	disallowline: 'Disallow:' sp* value sp* endline+
	sp: SPACE | TAB
	endline: ['#' comment] (CR | LF | CR LF)
	value: <any non-empty string not containing CR or LF and
		not beginning or ending with SPACE or TAB>
	comment: <any string not containing CR or LF>

with the proviso that 'Robots:' and 'Disallow:' should be parsed case
insensitive.

Parsers could be told to treat unrecognized headers as comments, for
future extensions.

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL:  <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>


From /CN=robots-errors/@nexor.co.uk Mon Jun 20 13:54:12 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 20 Jun 1994 13:54:39 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 20 Jun 1994 13:54:12 +0100
Date: Mon, 20 Jun 1994 13:54:12 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:193310:940620125414]
Content-Identifier: Re: Administr...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 13:54:12 
                      +0100;
Alternate-Recipient: Allowed
From: Guido.van.Rossum@cwi.nl
Message-ID: <9406201253.AA23764=guido@voorn.cwi.nl>
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk>
References: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk>
Subject:  Re: Administrativa, and norobots.pl 
X-Organization: CWI (Centrum voor Wiskunde en Informatica)
X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax)
Status: RO
Content-Length: 1634

Martijn, sorry if any of this has been discussed before:

Looking at your recent norobots.html, I noticed that the spec uses
"User-agent:" but the examples use "Robot:" ...  (I prefer Robot).

I personally have some problems interpreting your format description
precisely -- the language seems to be open for misinterpretation.  (I
want to write a robots.txt parser in Python.  I can't read Perl so
looking at your example parser won't do me much good, and anyway that
shouldn't be necessary :-)

E.g. records are separated by blank lines.  Is this before or after
removing comments?  (This would make a difference regarding Is there
no <optionalspace> allowed after the <value>?  (The example suggests
there is -- between the value and the #comment.)

Also since there's a strict alternation of Robots: and Disallow:
lines, why not use the appearance of a Robots: line to signal the end
of a record?  Then the syntax would be (using my own BNF variant --
hope it's clear):

	file: endline* record*
	record: robotsline+ disallowline+
	robotsline: 'Robots:' sp* value sp* endline+
	disallowline: 'Disallow:' sp* value sp* endline+
	sp: SPACE | TAB
	endline: ['#' comment] (CR | LF | CR LF)
	value: <any non-empty string not containing CR or LF and
		not beginning or ending with SPACE or TAB>
	comment: <any string not containing CR or LF>

with the proviso that 'Robots:' and 'Disallow:' should be parsed case
insensitive.

Parsers could be told to treat unrecognized headers as comments, for
future extensions.

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL:  <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>


From /CN=robots-errors/@nexor.co.uk Mon Jun 20 15:17:46 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 20 Jun 1994 15:19:19 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 20 Jun 1994 15:17:46 +0100
Date: Mon, 20 Jun 1994 15:17:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:213810:940620141748]
Content-Identifier: Re: Administr...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 15:17:46 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"21338 Mon Jun 20 15:16:12 1994"@nexor.co.uk>
To: Guido.van.Rossum@cwi.nl
Cc: Martijn Koster <m.koster@nexor.co.uk>, /CN=robots/@nexor.co.uk
In-Reply-To: <9406201253.AA23764=guido@voorn.cwi.nl>
Subject:  Re: Administrativa, and norobots.pl 
Status: RO
Content-Length: 1662


> Martijn, sorry if any of this has been discussed before:
> 
> Looking at your recent norobots.html, I noticed that the spec uses
> "User-agent:" but the examples use "Robot:" ...  (I prefer Robot).

Yes, we just discussed that :-) It was felt User-agent is closer to
HTTP. It doesn't really matter what the name is...

> I personally have some problems interpreting your format description
> precisely.

OK, let us know, this thing should be interpretable.

> E.g. records are separated by blank lines.  Is this before or after
> removing comments?  (This would make a difference regarding Is there
> no <optionalspace> allowed after the <value>?  (The example suggests
> there is -- between the value and the #comment.)

Did your editor eat something there? The records are separated by
blank lines after removing comments, but I don't really see the
difference.

Yes, there is optionalspace allowed after the value, which has no
meaning, and is stripped (I've added this explicitly).
 
> Also since there's a strict alternation of Robots: and Disallow:
> lines, why not use the appearance of a Robots: line to signal the
> end of a record?

Then it all becomes one big collection of lines which is difficult to
read. With blank lines it is clearer where one record ends and another
begins.

I was hoping not to need a BNF description, but it looks like one is
needed. :-)
 
> Parsers could be told to treat unrecognized headers as comments, for
> future extensions.

OK

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From guido@cwi.nl Mon Jun 20 15:54:14 1994
Replied: Mon, 20 Jun 1994 17:25:07 +0100
Replied: /DD.Common=robots/@nexor.co.uk
Replied: Guido.van.Rossum@cwi.nl
Return-Path: <guido@cwi.nl>
Delivery-Date: Mon, 20 Jun 1994 15:54:30 +0100
Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP);
          Mon, 20 Jun 1994 15:54:14 +0100
Received: from voorn.cwi.nl by charon.cwi.nl with SMTP	id <AA00784@cwi.nl>;
          Mon, 20 Jun 1994 16:54:01 +0200
Received: by voorn.cwi.nl with SMTP	id <AA24145@cwi.nl>;
          Mon, 20 Jun 94 16:54:00 +0200
Message-Id: <9406201454.AA24145=guido@voorn.cwi.nl>
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /DD.Common=robots/@nexor.co.uk
Subject: Re: Administrativa, and norobots.pl 
In-Reply-To: Your message of "Mon, 20 Jun 1994 15:15:33 MDT."             <9406201415.AA29767=m.koster@nexor.co.uk@charon.cwi.nl> 
References: <9406201415.AA29767=m.koster@nexor.co.uk@charon.cwi.nl> 
From: Guido.van.Rossum@cwi.nl
X-Organization: CWI (Centrum voor Wiskunde en Informatica)
X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax)
Date: Mon, 20 Jun 1994 16:54:00 +0200
Sender: Guido.van.Rossum@cwi.nl
Status: RO
Content-Length: 1520

> > Looking at your recent norobots.html, I noticed that the spec uses
> > "User-agent:" but the examples use "Robot:" ...  (I prefer Robot).
> 
> Yes, we just discussed that :-) It was felt User-agent is closer to
> HTTP. It doesn't really matter what the name is...

Well, if my vote still counts, I'd rather see Robot -- basically for
the same reason I prefer "robots.txt" over anything else: it's easiest
to remember.  And on some sense a Robot isn't really a user agent at
all...

> Did your editor eat something there? The records are separated by
> blank lines after removing comments, but I don't really see the
> difference.

Quoting from http://web.nexor.co.uk/mak/doc/robots/norobots.html (is
that the official source?):

 The file consists of one or more records separated by one or more blank lines
 (terminated by CR,CR/NL, or NL). Each record contains lines of the form "
 <field>:<optionalspace><value><optionalspace>". The field
 name is case insensitive.

and further down:

 Comments can be included in file using UNIX bourne shell conventions: the '#'
 character is used to indicate that the remainder of the line is a comment. 

But this doesn't tell me in which order these rules are executed.

Anyway, testing for a blank line after removing comments would mean
that you can't have a whole-line comment in a record.  I prefer
requiring a blank line before comment stripping.

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL:  <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>


From /CN=robots-errors/@nexor.co.uk Mon Jun 20 17:25:29 1994
Replied: Thu, 23 Jun 1994 10:20:41 +0100
Replied: Martijn Koster <m.koster@nexor.co.uk>
Replied: /CN=robots/@nexor.co.uk
Replied: Mon, 20 Jun 1994 18:25:04 +0100
Replied: robots
Replied: Guido.van.Rossum@cwi.nl
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 20 Jun 1994 17:26:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 20 Jun 1994 17:25:29 +0100
Date: Mon, 20 Jun 1994 17:25:29 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:244100:940620162530]
Content-Identifier: Re: Administr...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 17:25:29 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk>
To: Guido.van.Rossum@cwi.nl
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406201454.AA24145=guido@voorn.cwi.nl>
Subject:  Re: Administrativa, and norobots.pl 
Status: RO
Content-Length: 1194


> Well, if my vote still counts, I'd rather see Robot -- basically for
> the same reason I prefer "robots.txt" over anything else: it's easiest
> to remember.  And on some sense a Robot isn't really a user agent at
> all...

Fine, if you really feel it's that important, then let's vote on that
too. The choice is "Robot" vs "User-agent", send votes only to me, not
the entire list. On Wednesday 17:00 my time I'll count the votes, and
change the spec if required.  If there's a tie I'll decide.

> Quoting from http://web.nexor.co.uk/mak/doc/robots/norobots.html (is
> that the official source?):

Well, that's as official as it gets :-)
 
> Anyway, testing for a blank line after removing comments would mean
> that you can't have a whole-line comment in a record.  I prefer
> requiring a blank line before comment stripping.

OK, I see the source for the confusion. When I strip a whole-line
comment I mean strip the entire line, which is the same as what you're
saying. Yes, I'm in favour of that too.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 20 18:25:24 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 20 Jun 1994 18:26:02 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 20 Jun 1994 18:25:24 +0100
Date: Mon, 20 Jun 1994 18:25:24 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:249660:940620172526]
Content-Identifier: Comments (was...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 18:25:24 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"24963 Mon Jun 20 18:25:10 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Cc: Guido.van.Rossum@cwi.nl
In-Reply-To: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk>
Subject:  Comments (was: Re: Administrativa, and norobots.pl )
Status: RO
Content-Length: 841


> OK, I see the source for the confusion. When I strip a whole-line
> comment I mean strip the entire line, which is the same as what
> you're saying. Yes, I'm in favour of that too.

I've changed the page to read:

        <p>

          Comments can be included in file using UNIX bourne shell
          conventions: the '<code>#</code>' character is used to
          indicate that preceding space (if any) and the remainder of
          the line up to the line termination is discarded.
          Lines containing only a comment are discarded completely, 
          and therefore do not indicate a record boundary.      
	</p>

Is that unambiguous?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From guido@cwi.nl Tue Jun 21 00:47:34 1994
Replied: Tue, 21 Jun 1994 09:27:08 +0100
Replied: /DD.Common=robots/@nexor.co.uk
Replied: Guido.van.Rossum@cwi.nl
Return-Path: <guido@cwi.nl>
Delivery-Date: Tue, 21 Jun 1994 00:48:01 +0100
Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP);
          Tue, 21 Jun 1994 00:47:34 +0100
Received: from voorn.cwi.nl by charon.cwi.nl with SMTP	id <AA08339@cwi.nl>;
          Tue, 21 Jun 1994 01:47:21 +0200
Received: by voorn.cwi.nl with SMTP	id <AA25606@cwi.nl>;
          Tue, 21 Jun 94 01:47:18 +0200
Message-Id: <9406202347.AA25606=guido@voorn.cwi.nl>
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /DD.Common=robots/@nexor.co.uk
Subject: norobots.py
In-Reply-To: Your message of "Mon, 20 Jun 1994 16:54:00 MDT."
From: Guido.van.Rossum@cwi.nl
X-Organization: CWI (Centrum voor Wiskunde en Informatica)
X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax)
Date: Tue, 21 Jun 1994 01:47:17 +0200
Sender: Guido.van.Rossum@cwi.nl
Status: RO
Content-Length: 3339

Here's my norobots script in Python.  Note that I haven't been able to
find the Perl code (I can actually read Perl :-) since Martijn's
norobots.html page doesn't seem to have a link to the source code --
or I missed it (using the www linemode browser in an Emacs shell
window :-).

# norobots.py

# Handle /robots.txt files.
#
# Manages a cache of parsed robots.txt files, indexed by host:port.
#
# Proposed usage:
#
#   import norobots
#   if norobots.allowed(url):
#      ...
#      # OK to read this url
#
# Author: Guido van Rossum <guido@cwi.nl>
# Version: 1.0
# Date: 21 June 1994

# XXX Worry about Expires: header later.

import urllib
import string


# Parse a robots.txt file.
# Return a list of records, where each record is represented as a
# dictionary with keys 'robots' and 'disallow' (and possibly others).
# The value for each key is a list of values, with one item for each
# corresponding line (leading and trailing whitespace stripped).

def parse(fp):
	records = []
	current = {}
	while 1:
		line = fp.readline()
		if not line: break
		line = string.strip(line)
		if not line:
			if current:
				records.append(current)
				current = {}
			continue
		i = string.find(line, '#')
		if i >= 0:
			line = line[:i]
			line = string.strip(line)
		i = string.find(line, ':')
		if i < 0: continue # Ignore bad line
		key = string.lower(line[:i])
		value = string.strip(line[i+1:])
		if key in ('robot', 'user-agent'):
			key = 'robot'
			value = string.lower(value)
		if not current.has_key(key):
			current[key] = []
		current[key].append(value)
	if current: records.append(current)
	return records


# Check whether this robot is allowed to read the given URL.

DEFAULT_NAME = 'python'

cache = {}
# Format: {'host:port': {'robot': [name, ...], 'disallow': [path, ...]}, ...}

def allowed(url, my_name = DEFAULT_NAME):
	my_name = string.lower(my_name)	# Substring must occur in record spec
	type, url = urllib.splittype(url)
	if type != 'http': return 1	# Don't mess with other protocols
	host, path = urllib.splithost(url)
	host = string.lower(host)	# Hostnames are case insensitive
	host, port = urllib.splitport(host)
	if not port: port = '80'
	key = host + ':' + port		# Normalized form
	if not cache.has_key(key):
		robots_url = '%s://%s:%s/robots.txt' % (type, host, port)
		records = []
		try:
			fp = urllib.urlopen(robots_url)
			records = parse(fp)
			fp.close()
		except IOError:
			pass
		cache[key] = records
	records = cache[key]
	for record in records:
		if not record.has_key('robot'): continue
		if not record.has_key('disallow'): continue
		specs = record['robot']
		if '*' in specs or my_name in specs:
			for prefix in record['disallow']:
				if path[:len(prefix)] == prefix:
					return 0
	return 1


# Test program

def test():
	url = 'http://web.nexor.co.uk/mak/doc/robots/robots.html'
	print url, allowed(url)
	url = 'http://web.nexor.co.uk/aliweb/data/'
	print url, allowed(url)
	url = 'http://www.cwi.nl/'
	print url, allowed(url)
	for host, records in cache.items():
		print
		print host
		for record in records:
			print
			for key, values in record.items():
				for value in values:
					print '\t' + key + ':', value

if __name__ == '__main__': test()

# end norobots.py

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL:  <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>


From /CN=robots-errors/@nexor.co.uk Tue Jun 21 09:34:45 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 21 Jun 1994 09:35:46 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 21 Jun 1994 09:34:45 +0100
Date: Tue, 21 Jun 1994 09:34:45 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:011860:940621083447]
Content-Identifier: Re: norobots.py
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 21 Jun 1994 09:34:45 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"1082 Tue Jun 21 09:27:42 1994"@nexor.co.uk>
To: Guido.van.Rossum@cwi.nl
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406202347.AA25606=guido@voorn.cwi.nl>
Subject:  Re: norobots.py 
Status: RO
Content-Length: 630


Guido.van.Rossum@cwi.nl wrote:

> Note that I haven't been able to find the Perl code ...  since
> Martijn's norobots.html page doesn't seem to have a link to the
> source code

Oops, I've been typing "emacs norobots.html&" so often I gave the wrong
URL, it is actually in norobots.pl. I have also added a link in the .html.
                                ^^
> (I can actually read Perl :-)

Probably a lot better than I can read python :-)

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun 23 10:20:57 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 23 Jun 1994 10:21:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 23 Jun 1994 10:20:57 +0100
Date: Thu, 23 Jun 1994 10:20:57 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:276390:940623092058]
Content-Identifier: User-agent vs...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 23 Jun 1994 10:20:57 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"27618 Thu Jun 23 10:20:48 1994"@nexor.co.uk>
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk>
Subject:  User-agent vs Robot (was Re: Administrativa, and norobots.pl)
Status: RO
Content-Length: 672


I wrote:

> then let's vote on that too. The choice is "Robot" vs "User-agent",
> send votes only to me, not the entire list. On Wednesday 17:00 my
> time I'll count the votes, and change the spec if required.  If
> there's a tie I'll decide.

Only got 4 votes, so I guess people aren't all that fussed which way:

Robot: Peter Beebee, Guido van Rossum
User-agent: Michael Mauldin, Roy Fielding

There was a tie, I vote for "User-agent" and the status quo, so 
User-agent it remains.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 24 10:57:15 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 24 Jun 1994 10:57:56 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 24 Jun 1994 10:57:15 +0100
Date: Fri, 24 Jun 1994 10:57:15 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138890:940624095717]
Content-Identifier: Any more robo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 24 Jun 1994 10:57:15 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"13879 Fri Jun 24 10:56:43 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Any more robots.txt comments?
Status: RO
Content-Length: 292


Are there any more things I should change to the robots.txt standard
before announcing it to www-talk? 

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 24 12:39:09 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 24 Jun 1994 12:39:56 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 24 Jun 1994 12:39:09 +0100
Date: Fri, 24 Jun 1994 12:39:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:161940:940624113912]
Content-Identifier: quick questio...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 24 Jun 1994 12:39:09 
                      +0100;
Alternate-Recipient: Allowed
From: Paul Harrington <phrrngtn@cs.st-andrews.ac.uk>
Message-ID: <18795.9406241138@tamdhu.cs.st-andrews.ac.uk>
To: /CN=robots/@nexor.co.uk
Cc: afcondon@dsg.cs.tcd.ie
Subject:  quick question on output from robots
Status: RO
Content-Length: 984


[ sent originally to Martijn. submitted here at his suggestion ]

Hi,
just joined the robots list a few hours ago and have been browsing the
archive.

I am interested in visualisation of web structures. I wrote a very
primitive robot last year and used it to automatically (that should
really be "semi-automatically" :-)  generate navigational maps cf.
	http://www.dsg.cs.tcd.ie:1969/afc_draft.html

I would like to know if there has been any work on output formats and
descriptions from the various robots. I would like to take robot
output and write some filters to generate graphical maps. I have been
looking at applying some of Schniederman's metrics to the generation
of visual representations of nodes, subgraphs etc.

apologies for the direct mail but I don't know whether or not this may
have any bearing on the ongoing discussion.

pjjH

Paul Harrington, phrrngtn@dcs.st-andrews.ac.uk  	 +44 334 63261
Division of Computer Science, St Andrews University, Scotland KY16 9SS

From m.koster@nexor.co.uk Fri Jul  8 10:28:29 1994
Return-Path: <m.koster@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 10:28:52 +0100
Received: from nexor.co.uk (actually victor.nexor.co.uk) 
          by lancaster.nexor.co.uk with SMTP (PP);
          Fri, 8 Jul 1994 10:28:29 +0100
To: beebee@parc.xerox.com
cc: /CN=robots/@nexor.co.uk
cc: m.koster@nexor.co.uk
Subject: Re: New Robot: SG-Scout 
In-reply-to: Your message of "Fri, 08 Jul 1994 01:55:59 PDT."             <94Jul8.015610pdt.18822@kolsaas.parc.xerox.com> 
Date: Fri, 08 Jul 1994 10:27:58 +0100
From: Martijn Koster <m.koster@nexor.co.uk>
Status: RO
Content-Length: 2335


[Peter, I've cc'ed the robots list on this]

Peter Beebee wrote:

> I noticed in my searches that only 8 of the 6425 servers I inspected
> have a /robots.txt file ...  Is there any way we can make server
> operators more aware of the standard?  ...  Perhaps there is another
> forum which would reach more server admins...

Yes, I'm planning to do that. Periodic postings to c.i.www.providers
and updating FAQ's would be a start. A conference presentation or
something wouldn't go amiss either...

As always, it's the time, the time...

> 6425 servers

Ehr? That's about twice the number I know exist. I guess there may be
some duplicates there.

|   http://vulcan.nexor.co.uk:8001 
|   http://web.nexor.co.uk:80 
|   http://wellington.nexor.co.uk:80

Yup, here we go, web=wellington (grr, who uses wellington :-/)

I wonder how many robots don't recognise duplicates...

This sounds like a perfect start for starting to come up with some
common formats for sharing data; every robot at one point uses a list
of servers, so we might as well keep a comprehensive list in a common
format, so that each robot can convert it to their own format and use
it as required.

Something like (let's call it WWW-SERVER):

| URL:		<shortest referenced URL for this host>
| Host-Name: 	<best guess of preferred name of the host>
| Host-Port:	<port number for this server>
| Alias:	<referenced aliases for this host, separated by ','>
| IP-Address:	<IP address for this host>
| <newline>
| ...

Note that a server on a different port is considered a different server,
which happens to share Alias and IP-Address with other records.

The shortest referenced URL is an attempt to do better than "somewhere
here is a server"; for example, one of my servers (vulcan above)
"starts" at a specific path, the home page itself is not referenced
(and is empty).

The best-guess would be "www.domain" followed by "web.domain" etc, in
the hope to get the generic DNS alias, not the hostname itself.

I can adapt my Matthew-Gray's-List-parsing script to produce that
format, which would get Lycos and WWW Wanderer host as a start, if
that is required.

Good idea? Bad idea?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jul  8 10:30:15 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 10:32:51 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 8 Jul 1994 10:30:15 +0100
Date: Fri, 8 Jul 1994 10:30:15 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:228330:940708093018]
Content-Identifier: Re: New Robot...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 10:30:15 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"22775 Fri Jul  8 10:28:45 1994"@nexor.co.uk>
To: beebee@parc.xerox.com
Cc: /CN=robots/@nexor.co.uk, m.koster@nexor.co.uk
In-Reply-To: <94Jul8.015610pdt.18822@kolsaas.parc.xerox.com>
Subject:  Re: New Robot: SG-Scout 
Status: RO
Content-Length: 2335


[Peter, I've cc'ed the robots list on this]

Peter Beebee wrote:

> I noticed in my searches that only 8 of the 6425 servers I inspected
> have a /robots.txt file ...  Is there any way we can make server
> operators more aware of the standard?  ...  Perhaps there is another
> forum which would reach more server admins...

Yes, I'm planning to do that. Periodic postings to c.i.www.providers
and updating FAQ's would be a start. A conference presentation or
something wouldn't go amiss either...

As always, it's the time, the time...

> 6425 servers

Ehr? That's about twice the number I know exist. I guess there may be
some duplicates there.

|   http://vulcan.nexor.co.uk:8001 
|   http://web.nexor.co.uk:80 
|   http://wellington.nexor.co.uk:80

Yup, here we go, web=wellington (grr, who uses wellington :-/)

I wonder how many robots don't recognise duplicates...

This sounds like a perfect start for starting to come up with some
common formats for sharing data; every robot at one point uses a list
of servers, so we might as well keep a comprehensive list in a common
format, so that each robot can convert it to their own format and use
it as required.

Something like (let's call it WWW-SERVER):

| URL:		<shortest referenced URL for this host>
| Host-Name: 	<best guess of preferred name of the host>
| Host-Port:	<port number for this server>
| Alias:	<referenced aliases for this host, separated by ','>
| IP-Address:	<IP address for this host>
| <newline>
| ...

Note that a server on a different port is considered a different server,
which happens to share Alias and IP-Address with other records.

The shortest referenced URL is an attempt to do better than "somewhere
here is a server"; for example, one of my servers (vulcan above)
"starts" at a specific path, the home page itself is not referenced
(and is empty).

The best-guess would be "www.domain" followed by "web.domain" etc, in
the hope to get the generic DNS alias, not the hostname itself.

I can adapt my Matthew-Gray's-List-parsing script to produce that
format, which would get Lycos and WWW Wanderer host as a start, if
that is required.

Good idea? Bad idea?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jul  8 11:05:09 1994
Replied: Fri, 08 Jul 1994 11:19:26 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: beebee@parc.xerox.com
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 11:06:36 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 8 Jul 1994 11:05:09 +0100
Date: Fri, 8 Jul 1994 11:05:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:233720:940708100511]
Content-Identifier: Robotics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 11:05:09 +0100;
Alternate-Recipient: Allowed
From: Peter Beebee <beebee@parc.xerox.com>
Message-ID: <94Jul8.030415pdt.18822@kolsaas.parc.xerox.com>
To: /CN=robots/@nexor.co.uk
Subject:  Robotics
Reply-To: beebee@parc.xerox.com
Status: RO
Content-Length: 780


  Actually, I checked into the situation with the duplicate entries for 
web.nexor.co.uk.  I'm fairly sure that that is a special case.  The duplicate 
only exists because your server was, for some executions of my robot, the root 
node for my search.  I entered the server name manually one way and when the
computer subsequently found links to wellington/web they were stored in the
database for the other server name.  I don't expect there are very many 
duplicates other that 5 or 6 of these exceptions.  

  I've been using the gethostbyname procedure to find what my man page 
says is the 'official' name of each host.  I believe this name is not the
preferred alias.  Is there any simple way to find the TRULY appropriate name
of a server?  (ex: WEB.nexor...)

  -- Peter

From /CN=robots-errors/@nexor.co.uk Fri Jul  8 11:19:39 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 11:21:10 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 8 Jul 1994 11:19:39 +0100
Date: Fri, 8 Jul 1994 11:19:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:237660:940708101940]
Content-Identifier: Re: Robotics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 11:19:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"23764 Fri Jul  8 11:19:34 1994"@nexor.co.uk>
To: beebee@parc.xerox.com
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <94Jul8.030415pdt.18822@kolsaas.parc.xerox.com>
Subject:  Re: Robotics 
Status: RO
Content-Length: 1928


> Actually, I checked into the situation with the duplicate entries
> for web.nexor.co.uk.  I'm fairly sure that that is a special case.
> The duplicate> only exists because your server was, for some
> executions of my robot, the roo t node for my search.  I entered the
> server name manually one way and when the computer subsequently
> found links to wellington/web they were stored in the database for
> the other server name.  I don't expect there are very many
> duplicates other that 5 or 6 of these exceptions.

I don't think it's that unique a case (but as said am eager to find
out). Have a look at <http://web.nexor.co.uk/mak/compre/compre.html>,
especially compre.out. This seems to happen regularly.

>   I've been using the gethostbyname procedure to find what my man
> page says is the 'official' name of each host.  I believe this name
> is not the preferred alias.

No, and this is how people get "welington" instead of "web.nexor",
which has swapped machines twice alreayd and will probably do so
again. (Doesn't anyone else have this problem of being shoved around?
:-)

> Is there any simple way to find the TRULY appropriate name of a
> server?  (ex: WEB.nexor...)

Well, I did say "best guess" :-) I use:

# decide on the "best" name for a www host
sub bestname {
        local(@hosts) = sort(@_);
        for (@hosts) { return $_ if (/^(www|web)\./); }
        for (@hosts) { return $_ if (/(www|web)/); }
        for (@hosts) { return $_ if (/(gopher)/); }
        for (@hosts) { return $_ if (/(ftp)/); }
        for (@hosts) { return $_ if (/(veronica)/); }
        for (@hosts) { return $_ if (/(gate)/); }
        return $hosts[0];
}

and update it when I see something obvious in the logs. Suggestions
welcome.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jul  8 18:00:08 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 18:01:11 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 8 Jul 1994 18:00:08 +0100
Date: Fri, 8 Jul 1994 18:00:08 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296500:940708170009]
Content-Identifier: Re: Server Co...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 18:00:08 +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9407081659.AA27195@fuzine.mt.cs.cmu.edu>
To: " (Mary Morris)" <marym@finesse.COM>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: Server Count
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 175

I currently show about 3000 servers in my list

	http://fuzine.mt.cs.cmu.edu/mlm/servers.html

I am updating it right now.

--Fuzzy

http://fuzine.mt.cs.cmu.edu/mlm/home.html

From /CN=robots-errors/@nexor.co.uk Fri Jul  8 22:11:43 1994
Replied: Mon, 11 Jul 1994 10:17:28 +0100
Replied: beebee@parc.xerox.com
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 8 Jul 1994 22:12:44 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 8 Jul 1994 22:11:43 +0100
Date: Fri, 8 Jul 1994 22:11:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:016250:940708211144]
Content-Identifier: Re: Server Co...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 22:11:43 +0100;
Alternate-Recipient: Allowed
From: Peter Beebee <beebee@parc.xerox.com>
Message-ID: <94Jul8.141047pdt.18822@kolsaas.parc.xerox.com>
To: marym@finesse.com, /CN=robots/@nexor.co.uk
In-Reply-To: <9407081629.AA01018@thyme.finesse.com>
Subject:  Re: Server Count
Reply-To: beebee@parc.xerox.com
Status: RO
Content-Length: 845


I recently ran a script on my database to translate the names of the servers
I've explored to the appropriate aliases ( ex:wellington.nexor.co.uk to web.nexor.co.uk ).
I just updated the html accordingly.  
The names of the 6412 servers can be found at http://www-swiss.ai.mit.edu/~ptbb/servers.html.

If you need a single file containing only all the server names (the html list is, obviously, 
cluttered with html code, and it's separated into files by domain) I'll gladly
compile one and mail it to you... maybe I'll make it available from the main list page...

My robot only follows links with the http scheme, so if my list contains gopher servers 
(as I'm sure it does...) then they were improperly linked to by someone.


Does anyone know a way to find the aliases of a server without using gethostbyname/addr 
or nslookup?

 -- Peter


From beebee@parc.xerox.com Mon Jul 11 16:18:50 1994
Return-Path: <beebee@parc.xerox.com>
Delivery-Date: Mon, 11 Jul 1994 16:19:06 +0100
Received: from alpha.xerox.com by lancaster.nexor.co.uk with SMTP (XTPP);
          Mon, 11 Jul 1994 16:18:50 +0100
Received: from skye.parc.xerox.com ([13.1.102.95]) by alpha.xerox.com 
          with SMTP id <14441(8)>; Mon, 11 Jul 1994 08:18:15 PDT
Received: by skye.parc.xerox.com id <32262>; Mon, 11 Jul 1994 08:17:59 -0700
From: Peter Beebee <beebee@parc.xerox.com>
To: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk
In-reply-to: Martijn Koster's message of Mon, 11 Jul 1994 02:17:23 -0700 <94Jul11.021746pdt.14437(8)@alpha.xerox.com>
Subject: Re: Server Count 
Reply-to: beebee@parc.xerox.com
Message-Id: <94Jul11.081759pdt.32262@skye.parc.xerox.com>
Date:	Mon, 11 Jul 1994 08:17:44 PDT
Status: RO
Content-Length: 1585


I posted my 'complete' list from my home page... the URL for the list is 
http://www-swiss.ai.mit.edu/~ptbb/servers.txt .

The c calls gethostbyname/addr don't seem to work on any of the three 
systems I have accounts on (Xerox, MIT, or MIT AI) in that they all 
yield the results (or lack of results) that I mentioned on Friday.  
Also, I can't seem to find nslookup on the system here at Xerox.

I sorted for duplicate/inappropriate names with the following inadequate
code, so there are certainly still some errors in the list.
I'll correct everything as soon as I can find another way to 
determine the 'true' name of a server. (Thanks for the suggestion, 
Mary.  I'm checking it out..)

-- Peter

code:

sub getbestname {
### this is really hurting, but I don't know a better way that works on these machines.
  local($tmp, $*, $name) = ('', 1, @_);
  local($_, $name) = ($name, gethostbyname($name));
  s/^[^\.]+\.(.+)$/www\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/web\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/gopher\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/ftp\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/veronica\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/gate\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/wais\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  return($name);
  }


From /CN=robots-errors/@nexor.co.uk Mon Jul 11 16:19:30 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 11 Jul 1994 16:20:23 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 11 Jul 1994 16:19:30 +0100
Date: Mon, 11 Jul 1994 16:19:30 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:250560:940711151931]
Content-Identifier: Re: Server Co...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 11 Jul 1994 16:19:30 
                      +0100;
Alternate-Recipient: Allowed
From: Peter Beebee <beebee@parc.xerox.com>
Message-ID: <94Jul11.081759pdt.32262@skye.parc.xerox.com>
To: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk
In-Reply-To: <94Jul11.021746pdt.14437(8)@alpha.xerox.com>
Subject:  Re: Server Count 
Reply-To: beebee@parc.xerox.com
Status: RO
Content-Length: 1585


I posted my 'complete' list from my home page... the URL for the list is 
http://www-swiss.ai.mit.edu/~ptbb/servers.txt .

The c calls gethostbyname/addr don't seem to work on any of the three 
systems I have accounts on (Xerox, MIT, or MIT AI) in that they all 
yield the results (or lack of results) that I mentioned on Friday.  
Also, I can't seem to find nslookup on the system here at Xerox.

I sorted for duplicate/inappropriate names with the following inadequate
code, so there are certainly still some errors in the list.
I'll correct everything as soon as I can find another way to 
determine the 'true' name of a server. (Thanks for the suggestion, 
Mary.  I'm checking it out..)

-- Peter

code:

sub getbestname {
### this is really hurting, but I don't know a better way that works on these machines.
  local($tmp, $*, $name) = ('', 1, @_);
  local($_, $name) = ($name, gethostbyname($name));
  s/^[^\.]+\.(.+)$/www\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/web\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/gopher\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/ftp\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/veronica\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/gate\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  s/^[^\.]+\.(.+)$/wais\.$1/;
  ($tmp) = gethostbyname($_);
  return($_) if ($tmp eq $name);
  return($name);
  }


From /CN=robots-errors/@nexor.co.uk Sun Jul 10 19:47:57 1994
Replied: Thu, 14 Jul 1994 11:18:57 +0100
Replied: " (Mary Morris)" <marym@Finesse.COM>
Replied: Mon, 11 Jul 1994 10:13:55 +0100
Replied: " (Mary Morris)" <marym@Finesse.COM>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 10 Jul 1994 19:48:48 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 10 Jul 1994 19:47:57 +0100
Date: Sun, 10 Jul 1994 19:47:57 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:144670:940710184759]
Content-Identifier: Deduping Serv...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 10 Jul 1994 19:47:57 
                      +0100;
Alternate-Recipient: Allowed
From: " (Mary Morris)" <marym@Finesse.COM>
Message-ID: <9407101844.AA00514@thyme.finesse.com>
To: /CN=robots/@nexor.co.uk
Subject:  Deduping Server Counts
X-Sun-Charset: US-ASCII
Status: RO
Content-Length: 1116

Hi

While going through the lists of servers that I picked up
thanks to Peter Beebee, Michael Mauldin, and Matthew Gray,
I noticed that there are times when the same server is listed
twice. The only difference is the port number.

Now in the case where the port numbers are say 70 and 80, I would
say that the port 70 is gopher and eliminate it immediatly.
However, there are some cases where the port numbers are:
80, 8000, 8001, and 8008. Does anyone know what is happening
in that case? I know that 8000 is a pretty common alternate
port for 80. What I am asking here is do you think that all
of those ports are feeding the same data? I could think
of one scenario where CERN's httpd could be on say 8001 and
NCSA's could be on 8000. Should they be considered different
here? If we find common urls should we write them off as the
same or what?

FYI - after merging all of these lists and counting only
one port per server *regardless* of port number, I have a count
of 6855 servers. My next step in the deduping will be
to do an nslookup of each of these servers and compare
IP numbers.

Comments?

Mary Morris


From /CN=robots-errors/@nexor.co.uk Mon Jul 11 10:47:43 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 11 Jul 1994 10:48:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 11 Jul 1994 10:47:43 +0100
Date: Mon, 11 Jul 1994 10:47:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201030:940711094744]
Content-Identifier: Another rapid...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 11 Jul 1994 10:47:43 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"20093 Mon Jul 11 10:47:31 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Another rapid fire attack
Status: RO
Content-Length: 2430


Paul Ginsparg last week reported another rapid fire attack on his
server.  I have included several excerpts from his message, FYI.

> this one had all the classic hallmarks indicted in your
>  http://web.nexor.co.uk/mak/doc/robots/robots.html 
>
> a) no means of determining who was running robot (finger and rusers
>    gave no info on remote machine)
> b) the robot was making parallel rather than sequential requests
>    (I determined this by trapping the $REMOTE_HOST with sleeping processes
>    and saw multiple ones spawned before any timeouts)
> c) after the robot had access cut off, it continued to make rapid-fire
>    requests (in this case it was cut off after 465 requests, and made
>    another 920 requests (total of 1385 in a 3 hour period)
>    before being stopped by a helpful sysadmin at the site in question
> d) the person running the robot was neither monitoring nor available.

This was another attempt to mirror his server. From the robot operator:

: [apology deleted]

: We were preparing the cache data for Interop to demonstrate how
: wonderful the WWW system is, because the place will not have the
: bandwidth enough to make realtime connections.  And your host was,
: unfortunate for you, one of the most fascinating server to
: demonstrate the charm of WWW.

: The fault was caused by our technical mistake and I have to say that
: we were too careless about automated downloading, although I think
: caching and mirroring can be a good technology if they have had done
: in good manner and with deep thought and matured method.

This confirms my belief that the Web would benefit from a well-written,
and maintained caching program. Only today someone posted a request for
such a program to c.i.w.p.

In this case the response was a number of bulk mailings to the
adminstrator of the remote site to attract attention. This had the
desired effect, although I think it is unfortunate this sort of action
is required.

> I plan to add an explict warning on my front page re automated
> downloads (ROBOTS BEWARE) -- I'm now set up to detect automatically
> the above conditions and mail the log of requests from that site
> (typically a 100kb file) to the sysadmin at the site in question in
> response to each subsequent request (I can spare the
> bandwidth). when everyone has form-parsing clients the problem will
> ameliorate since I can hide the database behind an explicit POST or
> two.


From /CN=robots-errors/@nexor.co.uk Thu Jul 14 16:52:00 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 14 Jul 1994 16:52:56 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 14 Jul 1994 16:52:00 +0100
Date: Thu, 14 Jul 1994 16:52:00 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:187660:940714155203]
Content-Identifier: Re: Represent...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 14 Jul 1994 16:52:00 
                      +0100;
Alternate-Recipient: Allowed
From: Paul Harrington <phrrngtn@cs.st-andrews.ac.uk>
Message-ID: <1069.9407141551@tamdhu.cs.st-andrews.ac.uk>
To: Charlie Stross <charless@sco.com>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9407141527.aa23834@ruddles.sco.com>
Subject:  Re: Representing big webs visually 
Status: RO
Content-Length: 1878


Charlie> Has anyone got any ideas about good methods of visually
Charlie> representing large webs?

Lots of them!
... but most are untested. :-)

Myself and a colleague of mine in Trinity College in Dublin developed
a simple web-walker which -- like yourself -- output a description in
neato format and generated an image map. Originally, we intended to
use this as a 'goto' mechanism for home pages: they are very useful
for getting a handle on the structure of a local web and how it is
connected to other webs.

Our visual encoding techniques are _very_ primitive but were
encouraging enough to make us want to do more work on the area.
However, being the lazy people that we are, we are waiting for the
emergence of some kind of web description interchange format from the
spider writers. 

Charlie> Am I reinventing the wheel here, or is this terra incognita?

I think it is mosly terra incognita. There are a few people who have
contacted me over the last year who may be working on it still. I can
put you in touch with them if you mail me.  I am busy trying to write
papers at the moment so treat all of the stuff below as 'aspirational'

We want to take up a lot of the formal work that has been done on
Hypertext systems (See the CACM special issue from some time last year
for a good pointer to surveys) and experiment with applying them to
web structures. Other systems that may be of some use for resource
characterisation are freemont and indie. I have a prejudice that says
that visualisation of an object is no good unless there is some input
from the server(s) which serve the object. e.g. frequency of access,
frequency of remote access, age, frequency of change,

I am hoping that one of the most useful applications of Web
visualisation will be the representation of resource location queries.

How is that for a "pie in the sky" statement?

pjjH


From pp@dfnrelay.d400.de Fri Jul 15 13:52:18 1994
Return-Path: <pp@dfnrelay.d400.de>
Delivery-Date: Fri, 15 Jul 1994 13:52:35 +0100
Received: from ixgate01.dfnrelay.d400.de by lancaster.nexor.co.uk 
          with SMTP (XTPP); Fri, 15 Jul 1994 13:52:18 +0100
Received: from dfnrelay.d400.de by ixgate01.dfnrelay.d400.de           
          id <02764-0@ixgate01.dfnrelay.d400.de>;
          Fri, 15 Jul 1994 14:53:10 +0200
X400-MTS-Identifier: [/PRMD=dfnrelay/ADMD=d400/C=de/;ixgate01.d.757:15.07.94.12.53.08]
From: pp@dfnrelay.d400.de
To: /CN=robots-errors/@nexor.co.uk
Subject: Delivery Report (failure) for webmap@informatik.uni-frankfurt.de
Message-Type: Delivery Report
Date: Fri, 15 Jul 1994 14:53:10 +0200
Message-ID: <"ixgate01.d.757:15.07.94.12.53.08"@dfnrelay.d400.de>
Status: RO
Content-Length: 1607

This report relates to your message with subject: Results of list processing
        of Fri, 15 Jul 1994 14:52:35 +0200

Your message was not delivered to   webmap@informatik.uni-frankfurt.de
        for the following reason:
        Unknown Address
        MTA 'informatik.uni-frankfurt.de' gives error message 
        <webmap@informatik.uni-frankfurt.de>... User unknown 

***** The following information is directed towards the local administrator
***** and is not intended for the end user
* 
* DR generated by: mta d400relay
*         in /PRMD=dfnrelay/ADMD=d400/C=de/
*         at Fri, 15 Jul 1994 14:53:08 +0200
*
* Converted to RFC 822 at dfnrelay.d400.de
*         at Fri, 15 Jul 1994 14:53:10 +0200
*
* Delivery Report Contents:
*
* Subject-Submission-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:014631:940715125034]
* Content-Type: p2
* Original-Encoded-Information-Types: ia5-text
* Subject-Intermediate-Trace-Information:  /PRMD=dfnrelay/ADMD=d400/C=de/arrival Fri, 15 Jul 1994 14:52:35 +0200 action Relayed
* Recipient-Info: webmap@informatik.uni-frankfurt.de,
*         /S=webmap/OU=informatik/PRMD=uni-frankfurt/ADMD=d400-gw/C=de/;
*         FAILURE reason Unable-To-Transfer (1);
*         diagnostic Unrecognised-ORName (0);
*         last trace (ia5-text) Fri, 15 Jul 1994 14:52:35 +0200;
*         converted eits ia5-text;
*         supplementary info "MTA 'informatik.uni-frankfurt.de' gives
*         error message  <webmap@informatik.uni-frankfurt.de>... User
*         unknown";
****** End of administration information 

The return of the original message was not requested

From pp@dfnrelay.d400.de Fri Jul 15 13:58:18 1994
Return-Path: <pp@dfnrelay.d400.de>
Delivery-Date: Fri, 15 Jul 1994 13:58:41 +0100
Received: from ixgate01.dfnrelay.d400.de by lancaster.nexor.co.uk 
          with SMTP (XTPP); Fri, 15 Jul 1994 13:58:18 +0100
Received: from dfnrelay.d400.de by ixgate01.dfnrelay.d400.de           
          id <03061-0@ixgate01.dfnrelay.d400.de>;
          Fri, 15 Jul 1994 14:59:08 +0200
X400-MTS-Identifier: [/PRMD=dfnrelay/ADMD=d400/C=de/;ixgate01.d.037:15.07.94.12.59.04]
From: pp@dfnrelay.d400.de
To: /CN=robots-errors/@nexor.co.uk
Subject: Delivery Report (failure) for webmap@informatik.uni-frankfurt.de
Message-Type: Delivery Report
Date: Fri, 15 Jul 1994 14:59:08 +0200
Message-ID: <"ixgate01.d.037:15.07.94.12.59.04"@dfnrelay.d400.de>
Status: RO
Content-Length: 1596

This report relates to your message with subject: Re: get archive
        of Fri, 15 Jul 1994 14:58:11 +0200

Your message was not delivered to   webmap@informatik.uni-frankfurt.de
        for the following reason:
        Unknown Address
        MTA 'informatik.uni-frankfurt.de' gives error message 
        <webmap@informatik.uni-frankfurt.de>... User unknown 

***** The following information is directed towards the local administrator
***** and is not intended for the end user
* 
* DR generated by: mta d400relay
*         in /PRMD=dfnrelay/ADMD=d400/C=de/
*         at Fri, 15 Jul 1994 14:59:04 +0200
*
* Converted to RFC 822 at dfnrelay.d400.de
*         at Fri, 15 Jul 1994 14:59:08 +0200
*
* Delivery Report Contents:
*
* Subject-Submission-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:015493:940715125526]
* Content-Type: p2
* Original-Encoded-Information-Types: ia5-text
* Subject-Intermediate-Trace-Information:  /PRMD=dfnrelay/ADMD=d400/C=de/arrival Fri, 15 Jul 1994 14:58:11 +0200 action Relayed
* Recipient-Info: webmap@informatik.uni-frankfurt.de,
*         /S=webmap/OU=informatik/PRMD=uni-frankfurt/ADMD=d400-gw/C=de/;
*         FAILURE reason Unable-To-Transfer (1);
*         diagnostic Unrecognised-ORName (0);
*         last trace (ia5-text) Fri, 15 Jul 1994 14:58:11 +0200;
*         converted eits ia5-text;
*         supplementary info "MTA 'informatik.uni-frankfurt.de' gives
*         error message  <webmap@informatik.uni-frankfurt.de>... User
*         unknown";
****** End of administration information 

The return of the original message was not requested

From /CN=robots-errors/@nexor.co.uk Mon Jul 18 21:13:24 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 18 Jul 1994 21:14:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 18 Jul 1994 21:13:24 +0100
Date: Mon, 18 Jul 1994 21:13:24 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:089410:940718201325]
Content-Identifier: Other Web Ser...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 18 Jul 1994 21:13:24 
                      +0100;
Alternate-Recipient: Allowed
From: " (Mary Morris)" <marym@Finesse.COM>
Message-ID: <9407181923.AA02834@thyme.finesse.com>
To: /CN=robots/@nexor.co.uk
Subject:  Other Web Servers
X-Sun-Charset: US-ASCII
Status: RO
Content-Length: 229


Hi

I went through the Net-Happenings mail list and found
~280 servers that aren't on anyone's lists. I have
a list of http pointers that I can send to anyone who
wants to take a robot out and register these guys.


Mary Morris

From /CN=robots-errors/@nexor.co.uk Wed Jul 20 11:33:23 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 20 Jul 1994 11:33:50 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 20 Jul 1994 11:33:23 +0100
Date: Wed, 20 Jul 1994 11:33:23 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:100520:940720103327]
Content-Identifier: IP multicast ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 20 Jul 1994 11:33:23 
                      +0100;
Alternate-Recipient: Allowed
From: Paul Harrington <phrrngtn@cs.st-andrews.ac.uk>
Message-ID: <21786.9407201033@tamdhu.cs.st-andrews.ac.uk>
To: /CN=robots/@nexor.co.uk
Subject:  IP multicast for robot 'map' interchange? 
Status: RO
Content-Length: 499


I read an interesting paper a week or so ago entitled "Drinking for
the Firehose: Multicast USENET News" by Kurt Lidl.('usenix-muse.ps') 

Has any work been done on using a similar mechanism for propogating
robot information? Or page updates?

Would it be possible to reverse the way that some basic robots operate
i.e change them to be passive monitoring agents that listen for
updates/{classifcation information} which would be transmitted by http
servers and/or local robots?

comments?

pjjH


From /CN=robots-errors/@nexor.co.uk Wed Jul 20 16:41:52 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 20 Jul 1994 16:42:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 20 Jul 1994 16:41:52 +0100
Date: Wed, 20 Jul 1994 16:41:52 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:154700:940720154155]
Content-Identifier: Status update...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 20 Jul 1994 16:41:52 
                      +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9407201517.AA00648@fuzine.mt.cs.cmu.edu>
To: /CN=robots/@nexor.co.uk
Cc: fuzzy@CMU.EDU
Subject:  Status update: Lycos search engine now on-line
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 1247

Lycos now supports searches of its database of WWW documents.
Please access the search page from the Lycos Home Page

	http://fuzine.mt.cs.cmu.edu/mlm/lycos-home.html

Because once this service becomes popular, I will probably
move the index and search server to another computer.

The very first anchor in the Lycos Home Page is the SEARCH.

Lycos' database was collected during June, and contains
summaries of 54,000 documents (about 41meg of summaries).

Current plans are

	1. Add the 250,000 documents for which I have
	   only descriptions to the search database.

	2. Resume Lycos exploration.  Lycos has not
	   been fetching documents since June.

	3. Experiment with best-first search.

	4. Release a copy of the PURSUIT search engine
	   for educational and research use.  PURSUIT
	   provides HTML formatted search results from
	   (almost) arbitrary text files.  It is suitable
	   for running via CGI from httpd. If you would
	   like to be a beta tester for PURSUIT, please
	   send email to fuzzy@cmu.edu

--
Michael L. Mauldin	"How big is the Web?  You may think it's
Carnegie Mellon Univ.	 a long way to the chemist's, but that's
fuzzy@cmu.edu		 peanuts compared to the Web, listen..."

http://fuzine.mt.cs.cmu.edu/mlm/home.html

From /CN=robots-errors/@nexor.co.uk Thu Jul 21 15:21:58 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 21 Jul 1994 15:23:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 21 Jul 1994 15:21:58 +0100
Date: Thu, 21 Jul 1994 15:21:58 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:287150:940721142203]
Content-Identifier: ANL/MCS/SIGGR...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 21 Jul 1994 15:21:58 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"28708 Thu Jul 21 15:21:44 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  ANL/MCS/SIGGRAPH/VROOM Walker
Status: RO
Content-Length: 637


And another robot enters (or should I say runs over) the net...
I noticed them in my logs today.

From the active.html:

| ANL/MCS/SIGGRAPH/VROOM Walker
| 
| Owner/Maintainer unknown. 
|
| Identification: sets User-agent to ANL/MCS/SIGGRAPH/VROOM Walker, and
| From to olson.anl.gov 
|
| Another rapid-fire robot that doesn't use the robot exclusion protocol.
| Depressing. 

Anybody got any more details/pointers, or know people at olson.anl.gov ?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Aug  8 18:11:04 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 8 Aug 1994 18:12:12 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 8 Aug 1994 18:11:04 +0100
Date: Mon, 8 Aug 1994 18:11:04 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:167850:940808171105]
Content-Identifier: Re: ANNOUNCE:...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 8 Aug 1994 18:11:04 +0100;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9408081712.AA25100@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: ANNOUNCE: A WWWWorm for the mac!
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 2577


Oh goodie!  Now everyone can be the proud owner of a web-wide process. This
hasn't touched my site... yet.

- Dave

>Received: by rbse.jsc.nasa.gov (4.1/SMI-4.0:RAL-041790)
>        id AA24710; Mon, 8 Aug 94 11:24:41 CDT
>Message-Id: <9408081624.AA24710@rbse.jsc.nasa.gov>
>From: lemieuse@ERE.UMontreal.CA (Lemieux Sebastien)
>Date: Sun, 7 Aug 1994 01:17:37 GMT
>Newsgroup: comp.infosystems.www.providers/2778
>Subject: ANNOUNCE: A WWWWorm for the mac!
>Apparently-To: eichmann
>
>Hi WebSurfers,
>
>  I've just finished the programmation of a WWW Worm for the mac.
>Inspired from the famous WWWWorm that is currently accessible through
>the net, I programmed it to be used by people wishing to "index" the
>web (get URLs) providing informations on a specific topic.
>
>What it does:
>
>  It systematically scan the WWW and evaluate each page according to a
>keyword search.  If the evaluation is successful, then the URL of the
>page is kept.
>
>What are the results and interests:
>
>  If you are making up WWW pages for a specific topic, one of your
>heavier job will be to accumulate enough URL to interesting sites to
>make your page worth for the people that will be using it.  By setting
>the keywords properly, the worm will do the dirty job for you and
>write you automatically a valid HTML document that can be used to test
>the URL founds.
>
>  In about 8 hours of search, it produced me a 359 entries html
>document pointing toward most of the molecular biology pages in the
>world!
>
>----------
>
>The worm is currently all written in french (i'm french speaking!), so
>I don't think it is ready for distribution.  Basically, I'm calling
>for comments on the project and want to know how much people are
>interested in such a project.
>
>Any comments, collaborations, proposals or suggestions are welcomed.
>
>----------
>
>The Web's gonna get scanned!
>
>
>--
>| Sebastien Lemieux, dept. biol.  |  rootPGPCryptoDESEscrowCIAHackDSSRIPEM
>| lemieuse@alize.ERE.UMontre|  NSASkipJackFBIKerberosRSACapstoneNIST
>| PGP public key on finger.        -------------  AnonymousMailClipperChip
>| http://alize.ere.umontreal.ca:8001/~lemieuse/ | UFC-FastCryptCrackPassWd
>
>

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D
Software Engineering Program              Phone: (713) 283-3875
University of Houston - Clear Lake        fax  : (713) 283-3810
Box 113, 2700 Bay Area Blvd.              Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                            or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From m.koster@nexor.co.uk Tue Aug  9 13:56:17 1994
Return-Path: <m.koster@nexor.co.uk>
Delivery-Date: Tue, 9 Aug 1994 14:01:07 +0100
Received: from nexor.co.uk (actually victor.nexor.co.uk) 
          by lancaster.nexor.co.uk with SMTP (PP);
          Tue, 9 Aug 1994 13:56:17 +0100
To: d-garaffa@ski.mskcc.org (Dave Garaffa)
cc: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk
Subject: Re: ANNOUNCE: A WWWWorm for the mac! 
In-reply-to: Your message of "Tue, 09 Aug 1994 08:34:49 CDT."             <m0qXqNd-0009Z7C@research-01.mskcc.org> 
Date: Tue, 09 Aug 1994 13:55:59 +0100
From: Martijn Koster <m.koster@nexor.co.uk>
Status: RO
Content-Length: 2247


d-garaffa@ski.mskcc.org (Dave Garaffa) wrote:

> Martijn,
> 
> I do see your point but we still have the problem of how to best find the
> information that servers our users...

Of course. This problem has always existed, even in pre-electronic times.
 
> Okay so the idea of having 1000's of robots riding that web adinfinitum is
> not a good thing. However who should say that NEXOR *CAN* run a robot and
> Memorial Sloan-Kettering Cancer Center *CANT* ( I don't even know if you
> do run one, this is just an example )  Why should one group have the
> indexes local and someone else not?

[ Just for the record, I don't run a robot at NEXOR. ]

I don't make judgements on how can or cannot run a robot. But i don't
want robots to roam the web looking for _one particular query_. If
they build databases of info which you can later search for queries,
great (well, maybe not great, but fine).

I do try to dissuade people from giving robots away -- that is adding
to the problem, not solving it.

> How about this??  If an institution is going to run a worm then they must
> do the following...
> 
> 1] Make their database searchable to all web-surfers
>
> 2] Make their *DATA* ftpable to all server maintainers so we can index the
>    data in a way we think is best.
>
> Its not too much to ask and 

Yes, Absolutely spot on. This I have been campaigning for for
ages. You will find most robot authors quite willing to share their
data -- it's just that nobody asks.

I personally would like to see a simple standard dataformat for this
exchange -- it's make it all so much easier. However, not having a 
robot of my own (nor time/intention to write and maintain one), there
is little I can do to speed up that progress.

> it would save me the trouble of operating my own worm...  

Sure. It'd save you time, and effort, and it's save the net bandwidth,
and servers load.

> Any thoughts??

Communicate with the authors of robot, and see if together there is
something that can be done. This is part of what the robot mailing
list is meant for.

Cheers,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Tue Aug  9 14:01:39 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 9 Aug 1994 14:02:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 9 Aug 1994 14:01:39 +0100
Date: Tue, 9 Aug 1994 14:01:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:258620:940809130141]
Content-Identifier: Re: ANNOUNCE:...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 Aug 1994 14:01:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"25831 Tue Aug  9 14:01:07 1994"@nexor.co.uk>
To: " (Dave Garaffa)" <d-garaffa@ski.mskcc.org>
Cc: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk
In-Reply-To: <m0qXqNd-0009Z7C@research-01.mskcc.org>
Subject:  Re: ANNOUNCE: A WWWWorm for the mac! 
Status: RO
Content-Length: 2247


d-garaffa@ski.mskcc.org (Dave Garaffa) wrote:

> Martijn,
> 
> I do see your point but we still have the problem of how to best find the
> information that servers our users...

Of course. This problem has always existed, even in pre-electronic times.
 
> Okay so the idea of having 1000's of robots riding that web adinfinitum is
> not a good thing. However who should say that NEXOR *CAN* run a robot and
> Memorial Sloan-Kettering Cancer Center *CANT* ( I don't even know if you
> do run one, this is just an example )  Why should one group have the
> indexes local and someone else not?

[ Just for the record, I don't run a robot at NEXOR. ]

I don't make judgements on how can or cannot run a robot. But i don't
want robots to roam the web looking for _one particular query_. If
they build databases of info which you can later search for queries,
great (well, maybe not great, but fine).

I do try to dissuade people from giving robots away -- that is adding
to the problem, not solving it.

> How about this??  If an institution is going to run a worm then they must
> do the following...
> 
> 1] Make their database searchable to all web-surfers
>
> 2] Make their *DATA* ftpable to all server maintainers so we can index the
>    data in a way we think is best.
>
> Its not too much to ask and 

Yes, Absolutely spot on. This I have been campaigning for for
ages. You will find most robot authors quite willing to share their
data -- it's just that nobody asks.

I personally would like to see a simple standard dataformat for this
exchange -- it's make it all so much easier. However, not having a 
robot of my own (nor time/intention to write and maintain one), there
is little I can do to speed up that progress.

> it would save me the trouble of operating my own worm...  

Sure. It'd save you time, and effort, and it's save the net bandwidth,
and servers load.

> Any thoughts??

Communicate with the authors of robot, and see if together there is
something that can be done. This is part of what the robot mailing
list is meant for.

Cheers,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From mak@nexor.co.uk Tue Aug  9 09:04:52 1994
To: lemieuse@ERE.UMontreal.CA, lemieuse@alize.ERE.UMontreal.Ca
cc: robots
Subject: Your WWW robot for the web
Date: Tue, 09 Aug 1994 09:04:52 +0100
From: Martijn Koster <mak@nexor.co.uk>
Status: RO
Content-Length: 1507


I was alerted to the fact that you are writing a Macintosh robot.

Are you aware of the material on <URL:http://web.nexor.co.uk/mak/doc/
robots/robots.html> ?

The main things it contains are a list of robots, guidelines for robot
writers, a robot exclusion standard, and a mailing-list (to which I
have cc'ed this message)

I have added your robot to the list of robots, but would appreciate
some further details, especially how to identify it (what does it use
for User-agent?).

I also strongly urge you to comply with the guidelines and the
standard for robot exclusion -- not doing so will give people no
control of what your robot does to their server, resources, time and
effort, and they will get rather upset.

You may find it useful to join the robots list -- you may find other
people interested in robots, many of whom run robots, who can help
you in your requirements, without having to resort to scanning the
web yourself.

Finally I'd like to urge you not to distribute your robot -- it is
just too easy to be abused by people. If only 100 people would
regularly run your robot that alone would give a noticeable overhead
to a number of resources.

You write:

> The Web's gonna get scanned!

Sure, but it is very important to be real careful about it...

Regards,

-- Martijn Koster (webmaster for web.nexor.co.uk)
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Tue Aug  9 09:05:09 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 9 Aug 1994 09:06:33 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 9 Aug 1994 09:05:09 +0100
Date: Tue, 9 Aug 1994 09:05:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:220730:940809080510]
Content-Identifier: Your WWW robo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 Aug 1994 09:05:09 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"22063 Tue Aug  9 09:05:03 1994"@nexor.co.uk>
To: lemieuse@ERE.UMontreal.CA, lemieuse@alize.ERE.UMontreal.Ca
Cc: /CN=robots/@nexor.co.uk
Subject:  Your WWW robot for the web
Status: RO
Content-Length: 1507


I was alerted to the fact that you are writing a Macintosh robot.

Are you aware of the material on <URL:http://web.nexor.co.uk/mak/doc/
robots/robots.html> ?

The main things it contains are a list of robots, guidelines for robot
writers, a robot exclusion standard, and a mailing-list (to which I
have cc'ed this message)

I have added your robot to the list of robots, but would appreciate
some further details, especially how to identify it (what does it use
for User-agent?).

I also strongly urge you to comply with the guidelines and the
standard for robot exclusion -- not doing so will give people no
control of what your robot does to their server, resources, time and
effort, and they will get rather upset.

You may find it useful to join the robots list -- you may find other
people interested in robots, many of whom run robots, who can help
you in your requirements, without having to resort to scanning the
web yourself.

Finally I'd like to urge you not to distribute your robot -- it is
just too easy to be abused by people. If only 100 people would
regularly run your robot that alone would give a noticeable overhead
to a number of resources.

You write:

> The Web's gonna get scanned!

Sure, but it is very important to be real careful about it...

Regards,

-- Martijn Koster (webmaster for web.nexor.co.uk)
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From @cs.st-andrews.ac.uk:phrrngtn@cs.st-andrews.ac.uk Wed Aug 10 10:03:25 1994
Return-Path: <@cs.st-andrews.ac.uk:phrrngtn@cs.st-andrews.ac.uk>
Delivery-Date: Wed, 10 Aug 1994 10:03:36 +0100
Received: from cs.st-andrews.ac.uk by lancaster.nexor.co.uk via JANET 
          with NIFTP (XTPP) id <12218-0@lancaster.nexor.co.uk>;
          Wed, 10 Aug 1994 10:03:25 +0100
Message-Id: <6300.9408100903@tamdhu.cs.st-andrews.ac.uk>
Received: from jameson by tamdhu.cs.st-andrews.ac.uk;
          Wed, 10 Aug 94 10:03:53 BST
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: /CN=robots/@nexor.co.uk
Subject: Output from spiders. Was Re: ANNOUNCE: A WWWWorm for the mac! 
In-Reply-To: Your message of "Tue, 09 Aug 1994 14:01:39 BST."             <"25831 Tue Aug 9 14:01:07 1994"@nexor.co.uk> 
Date: Wed, 10 Aug 1994 10:03:52 +0100
From: Paul Harrington <phrrngtn@cs.st-andrews.ac.uk>
Status: RO
Content-Length: 643


>> How about this??  If an institution is going to run a worm then they must
>> do the following...
>> 
>> 1] Make their database searchable to all web-surfers
>> 
>> 2] Make their *DATA* ftpable to all server maintainers so we can index the
>> data in a way we think is best.
>> 
>> Its not too much to ask and 

Martijn> Yes, Absolutely spot on. This I have been campaigning for for
Martijn> ages. You will find most robot authors quite willing to share their
Martijn> data -- it's just that nobody asks.

Ok, I'm asking!
Can anyone give me pointers to the some spider output together with a
description of the format of that output.


From /CN=robots-errors/@nexor.co.uk Wed Aug 10 12:33:21 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 10 Aug 1994 12:34:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 10 Aug 1994 12:33:21 +0100
Date: Wed, 10 Aug 1994 12:33:21 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:149290:940810113325]
Content-Identifier: MOMspider is ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 10 Aug 1994 12:33:21 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID: <9408100430.aa21665@paris.ics.uci.edu>
To: libwww-perl@ics.uci.edu, /CN=robots/@nexor.co.uk
Cc: vg@dcs.edinburgh.ac.uk, bbehlen@soda.csua.berkeley.edu, 
    sugino@bart.sps.mot.com, shelden@fatty.law.cornell.edu, 
    grimes@raison.mro.dec.com, mtaipale@dxcern.cern.ch, mvanheyn@cs.indiana.edu, 
    altis@ibeam.jf.intel.com, casey@ptsun00.cern.ch, jlh@linus.mitre.org, 
    dmk@allegra.att.com, Gary.Adams@east.sun.com
Subject:  MOMspider is now publicly available
Status: RO
Content-Length: 1839

Hello all,

This is just going out to people who I know have been waiting for this
release for much-too-long-a-time.  I have done my best to make the software
as robust as possible and have tested it under a variety of conditions.
Note that this is not an alpha or beta release -- the software is robust
enough to be marketable (even though I am just giving it away).
The only fault that still remains is that the documentation is rather
paltry.  That should be fixed in a future patch.

MOMspider is a bit unlike other Web spiders/robots in that it is not
engaged in resource discovery and thus generates very little load on
remote servers.  However, please let me know immediately if anyone
starts running it in an irresponsible manner.

Please do not rebroadcast this message to the high-profile mailing
lists and especially not to netnews.  If everything looks like it
is going fine, I will send a general announcement to www-talk and,
later, to comp.infosystems.www.providers.

Please do let me know if you have any problems installing/testing it
or you find the documentation incomprehensible or lacking some important
bit.  The earlier I can catch any problems, the more likely they will
be solved before the great unwashed masses get their hands on it. ;-)

MOMspider can be retrieved via HTTP from

       http://www.ics.uci.edu/WebSoft/MOMspider/

or by anonymous ftp from

       ftp://liege.ics.uci.edu/pub/arcadia/MOMspider/

See the file INSTALL.txt for notes on how to install the program
and other info.  In particular, note that I would like to receive a copy
of the initial test results from sites that are using MOMspider.


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From libwww-perl-request@ics.UCI.EDU Wed Aug 10 12:38:45 1994
Return-Path: <libwww-perl-request@ics.UCI.EDU>
Delivery-Date: Wed, 10 Aug 1994 12:39:03 +0100
Received: from binky.ics.uci.edu by lancaster.nexor.co.uk with SMTP (XTPP);
          Wed, 10 Aug 1994 12:38:45 +0100
Received: from ics.uci.edu by binky.ics.uci.edu id aa15790; 10 Aug 94 4:31 PDT
Received: from paris.ics.uci.edu by binky.ics.uci.edu id aa15786;
          10 Aug 94 4:30 PDT
Received: from simplon.ics.uci.edu by paris.ics.uci.edu id aa21665;
          10 Aug 94 4:30 PDT
To: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk
cc: vg@dcs.edinburgh.ac.uk, bbehlen@soda.csua.berkeley.edu, 
    sugino@bart.sps.mot.com, shelden@fatty.law.cornell.edu, 
    grimes@raison.mro.dec.com, mtaipale@dxcern.cern.ch, mvanheyn@cs.indiana.edu, 
    altis@ibeam.jf.intel.com, casey@ptsun00.cern.ch, jlh@linus.mitre.org, 
    dmk@allegra.att.com, Gary.Adams@east.sun.com
Subject: MOMspider is now publicly available
Date: Wed, 10 Aug 1994 04:29:41 -0700
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-ID:  <9408100430.aa21665@paris.ics.uci.edu>
Status: RO
Content-Length: 1839

Hello all,

This is just going out to people who I know have been waiting for this
release for much-too-long-a-time.  I have done my best to make the software
as robust as possible and have tested it under a variety of conditions.
Note that this is not an alpha or beta release -- the software is robust
enough to be marketable (even though I am just giving it away).
The only fault that still remains is that the documentation is rather
paltry.  That should be fixed in a future patch.

MOMspider is a bit unlike other Web spiders/robots in that it is not
engaged in resource discovery and thus generates very little load on
remote servers.  However, please let me know immediately if anyone
starts running it in an irresponsible manner.

Please do not rebroadcast this message to the high-profile mailing
lists and especially not to netnews.  If everything looks like it
is going fine, I will send a general announcement to www-talk and,
later, to comp.infosystems.www.providers.

Please do let me know if you have any problems installing/testing it
or you find the documentation incomprehensible or lacking some important
bit.  The earlier I can catch any problems, the more likely they will
be solved before the great unwashed masses get their hands on it. ;-)

MOMspider can be retrieved via HTTP from

       http://www.ics.uci.edu/WebSoft/MOMspider/

or by anonymous ftp from

       ftp://liege.ics.uci.edu/pub/arcadia/MOMspider/

See the file INSTALL.txt for notes on how to install the program
and other info.  In particular, note that I would like to receive a copy
of the initial test results from sites that are using MOMspider.


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

From /CN=robots-errors/@nexor.co.uk Sun Aug 14 23:34:41 1994
Replied: Mon, 22 Aug 1994 10:06:53 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: chakl <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 14 Aug 1994 23:35:28 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 14 Aug 1994 23:34:41 +0100
Date: Sun, 14 Aug 1994 23:34:41 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:218920:940814223442]
Content-Identifier: new robot pre...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 14 Aug 1994 23:34:41 
                      +0100;
Alternate-Recipient: Allowed
From: chakl <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Message-ID: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE>
To: /CN=robots/@nexor.co.uk
Subject:  new robot pre-announce
X-Mailer: ELM [version 2.4 PL5]
Content-Type: text
Status: RO
Content-Length: 2171

Hi,

I have also written a robot that has just made its first steps out of
my local system (but staying in the local area).  I still find bugs
with the handling of "non-standard" URLs and I won't let the robot go
out of area until these are fixed.  Consider this a pre-announce :-)

Purpose
-------

The robot is intended to retrieve subgraphs of the Web, saving each
document to local disk.  Files ending in ".html" are parsed for further
URLs which are then retrieved recursively.

The robot will NOT follow links to other servers than the initial one,
and will NOT follow links to documents below the directory of the
initial URL.  It will also ignore links it can't handle.

Its main purpose is to retrieve complete hyperdocuments that are spread
over several files on the remote server (typically 3-100 files).

I'm running a non-networked system with dialup IP and a local httpd
(Linux1.0, NCSAhttpd1.1, term).  I find it useful to have local copies
of documents with "long-term value" (eg the httpd docs) rather than
having to establish a dialup connection each time.

Further Plans
-------------

This program is rather a test of the robot mechanics routines I have
written, limiting the possible Web load to a small number of documents.
In the long term, I'd like to couple this 'mechanics' to an AI system.
I'm a student working in Distributed AI, hacking an experimental LISP 
based multiagent testbed for fine food (read money ;-)

I imagine a society of cooperating intelligent agents in the Web domain,
each agent being an expert in some particular area.  So if my-agent gets
a request from another agent (possibly human :-), it might know that
your-agent is specialized in this area and would contact your-agent
directly using some inter-agent language and knowledge-exchange format.

Misc
----

I'm aware of the material on M. Koster's robot page.  The robot follows
the Guidelines and the Exclusion Standard.

Written in perl based on libwww0.12 (wasn't aware of 0.30 then).
Many thanks to the authors.

Comments welcome.

ciao,
chakl

chakl is 	Olaf Schreck, FU Berlin information science, student
		chakl@fu-berlin.de	olafabbe@w250zrz.zrz.tu-berlin.de

From /CN=robots-errors/@nexor.co.uk Mon Aug 15 16:15:43 1994
Replied: Mon, 22 Aug 1994 10:08:10 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: Billy Barron <billy@utdallas.edu>
Replied: " (chakl)" <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 15 Aug 1994 16:17:23 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 15 Aug 1994 16:15:43 +0100
Date: Mon, 15 Aug 1994 16:15:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:006700:940815151545]
Content-Identifier: Re: new robot...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 15 Aug 1994 16:15:43 
                      +0100;
Alternate-Recipient: Allowed
From: Billy Barron <billy@utdallas.edu>
Message-ID: <94Aug15.101514cdt.14417@utdallas.edu>
To: " (chakl)" <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE>
Subject:  Re: new robot pre-announce
X-WWW-Page: http://www.utdallas.edu/acc/billy.html
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 564

In reply to chakl's message:
>
>I imagine a society of cooperating intelligent agents in the Web domain,
>each agent being an expert in some particular area.  So if my-agent gets
>a request from another agent (possibly human :-), it might know that
>your-agent is specialized in this area and would contact your-agent
>directly using some inter-agent language and knowledge-exchange format.
>
Look at http://rd.cs.colorado.edu/harvest/.  Such a system is under
development.

-- 
Billy Barron,  Network Services Manager, Univ of Texas at Dallas
billy@utdallas.edu 

From /CN=robots-errors/@nexor.co.uk Mon Aug 22 10:09:59 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 22 Aug 1994 10:12:43 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 22 Aug 1994 10:09:59 +0100
Date: Mon, 22 Aug 1994 10:09:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:178740:940822091000]
Content-Identifier: Re: new robot...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 22 Aug 1994 10:09:59 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"17869 Mon Aug 22 10:09:47 1994"@nexor.co.uk>
To: Billy Barron <billy@utdallas.edu>
Cc: " (chakl)" <olafabbe@w250zrz.zrz.tu-berlin.d400.de>, 
    /CN=robots/@nexor.co.uk
In-Reply-To: <94Aug15.101514cdt.14417@utdallas.edu>
Subject:  Re: new robot pre-announce 
Status: RO
Content-Length: 361

Billy Barron wrote:

> Look at http://rd.cs.colorado.edu/harvest/. Such a system is under
> development.

Definately required reading, especially the main paper on Harvest.


-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Aug 22 10:07:13 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 22 Aug 1994 10:08:53 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 22 Aug 1994 10:07:13 +0100
Date: Mon, 22 Aug 1994 10:07:13 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:177850:940822090715]
Content-Identifier: Re: new robot...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 22 Aug 1994 10:07:13 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"17781 Mon Aug 22 10:06:58 1994"@nexor.co.uk>
To: chakl <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE>
Subject:  Re: new robot pre-announce 
Status: RO
Content-Length: 558


> Consider this a pre-announce :-)

For the active list: Has it got a name? What does it use as
User-Agent?  Where is it run from? Has it got a page with details?
Have you got a personal page?

> [it] will NOT follow links to documents below the directory of the
> initial URL. 

Ehr... while I always like restrictions I don't understand the
reasoning behind this one?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Sun Sep 18 19:06:49 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 18 Sep 1994 19:07:39 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 18 Sep 1994 19:06:49 +0100
Date: Sun, 18 Sep 1994 19:06:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:088110:940918180650]
Content-Identifier: where are rob...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 18 Sep 1994 19:06:49 
                      +0100;
Alternate-Recipient: Allowed
From: James Binkley <jrb@cs.pdx.edu>
Message-ID: <9409181803.AA18218@cs.pdx.edu>
To: chakl <olafabbe@w250zrz.zrz.tu-berlin.d400.de>
Cc: /CN=robots/@nexor.co.uk, jrb@cs.pdx.edu
In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE>
Subject:  where are robots going anyway? (was Robot pre-announce)
Status: RO
Content-Length: 4197


I've been working with a couple of students at Portland State University
for a number of years (very slowly...) on a internet info-retrieval 
system.  (If you want to read about it -- check out the "rama" TR paper via
my home page: http://www.cs.pdx.edu/~jrb  (the ascii version
is more up to date but there are later developments of course)).  The basic
idea for rama is that you have a server that takes asynchronous queries 
from a number of users on a daily basis, searches
some number of remote information bases (typically USENET news)
and returns results via email.  
The original query mechanism was just pattern matching - but
it is about to be upgraded to a combination of pattern matching
via "agrep" and relevance feedback and "NOT" (probably the
most important feature :->). 
The system is basically used for searching local USENET news.
The server is capable of being extended to search other information
spheres (e.g., we have a ftp "web-walker" but it hasn't been
used much).  Anyway, I've been puzzling over how it might be
extended to work with the WWW.  Right now there isn't a client
interface but something simple with forms could be cobbled together.
Back-end work is more puzzling to me.  (Yes I know about harvest although
I've got to do some reading and some thinking there yet).  
Our search paradigm up until now has been something like this:
	you can search a "remote" directory (for news, read news group)
	you can throw out immediately anything that isn't NEW. (Typically
		we only search for new stuff since yesterday).
	we search on "subject, body, or both (all) and "map", which
		is a fuzzy notion that currently means "give me some
		idea about the terrain I am searching"

		subject for NEWs means the subject line. for ftp
			it means a filename.
NNTP can have the server search since a given time -- which is extremely
useful.  

In reference to Martijn's previous comment about why
searching one HTML doc would be useful.  It depends on what you are doing I
would think.  If you are running something like Lycos - you are walking
a subset of the web/world and trying to build an index for it. So naturally
you want a "wide-area" walker.  On the other hand, I might just want
to search the NCSA "what's new" page to see what's new in it (e.g assume
they add a few new items and don't roll the entire content over every time
they change it...).  I think the question here is: "is there one
web-walker model or many".  And what are those models? 
(And where the heck are web-walkers evolving to anyway?)

My other question is: assume rama, what should a model web-walker be?  
My inclination is to think that rama should have something like
what chakl suggests.  A limited web-walker that sticks only
to one site and to HTML docs at that site only.  It uses HEAD to determine
the date but unfortunately still has to "walk the directory" (that html
doc) to get lower-level URLs (although searching the non-URL contents can
be skipped).  It can optionally not look farther
afield than a given home page if a user wants. (ls, not ls -R)
What is useful about the rama server is that 
queries by a lot of folks can be centralized through it and
query optimization (and cacheing) can take place.  (There is
not much done there now but it is eminently possible and should
be done RSN).  

Right now, I'm not thrilled about the notion of agents communicating
with other agents since I have visions of security problems and more
important Mickey Mouse and all those brooms (:-> by which I mean
scaleability).  I can see something like a centralized lycos system thaat
accepts remote machine "distributed queries".  That seems less of a jump.
Regarding rama, I've wondered if some sort of merger with 
something like the CERN
proxy server would be useful.  The proxy server caches what the
local "affinity" group uses.  Certainly a cache like that is useful.
Maybe a model where remote users walk webs with results cached locally
makes more sense then trying to walk a LOT of webs at once and indexing
all the results.  Just a thought.  Certainly a lot of possibilities :->.  
Comments very welcome.

				Jim Binkley
				cs, Portland State University
				jrb@cs.pdx.edu

From /CN=robots-errors/@nexor.co.uk Mon Aug 15 22:35:42 1994
Replied: Mon, 22 Aug 1994 10:23:38 +0100
Replied: Matthew K Gray <mkgray@MIT.EDU>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 15 Aug 1994 22:36:32 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 15 Aug 1994 22:35:42 +0100
Date: Mon, 15 Aug 1994 22:35:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:042460:940815213544]
Content-Identifier: Wandex, the W...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 15 Aug 1994 22:35:42 
                      +0100;
Alternate-Recipient: Allowed
From: Matthew K Gray <mkgray@MIT.EDU>
Message-ID: <9408152135.AA12434@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  Wandex, the World Wide Web Wanderer Index
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1041

I've finally set up access to the indexes generated by W4 (the World Wide
Web Wanderer).  It indexes 13000 documents from more than 5000 different sites.
Please don't distribute the URL in public forums (such as c.i.www.*) just yet,
but tell whoever you want and use it.  In addition to the full web search,
it allows one to search a number of other documents and their children.
These indices contain more data than the full web index.

All of the other indeces are in the process of being generated right as I
compose this message and may take a while for them to finish, so don't
expect them to be complete, yet.

It currently appears that the web is growing faster than I can update the
comprehensive list, but I'll keep updating it.  (eventually)

Suggestions for other document trees to index are welcome, but I may ignore
them :-)

					...Matthew

Relevant URLs:

Wandex			http://www.mit.edu:8001/cgi/wandex
Comprehensive List	http://www.mit.edu:8001/people/mkgray/compre3.html
Me			http://www.mit.edu:8001/people/mkgray/mkgray.html

From /CN=robots-errors/@nexor.co.uk Tue Aug 30 08:49:51 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 30 Aug 1994 08:51:10 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 30 Aug 1994 08:49:51 +0100
Date: Tue, 30 Aug 1994 08:49:51 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:298170:940830074954]
Content-Identifier: The latest on...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 30 Aug 1994 08:49:51 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"29809 Tue Aug 30 08:49:33 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  The latest on the Mac HC robot...
Status: RO
Content-Length: 3166


Found this in my mailbox, may be of interest to the list,
as this robot was discussed here before...


------- Forwarded Message

Return-Path: <lemieuse@ERE.UMontreal.CA>
Delivery-Date: Mon, 29 Aug 1994 00:05:27 +0100
Received: from condor.CC.UMontreal.CA by lancaster.nexor.co.uk 
          with SMTP (XTPP); Mon, 29 Aug 1994 00:05:15 +0100
Received: from eole.ERE.UMontreal.CA by condor.CC.UMontreal.CA with SMTP 
          id AA05758  (5.65c/IDA-1.4.4 for m.koster@nexor.co.uk);
          Sun, 28 Aug 1994 19:03:40 -0400
Received: from alize.ERE.UMontreal.CA 
          by eole.ERE.UMontreal.CA (940406.SGI/5.17)	id AA09306;
          Sun, 28 Aug 94 19:03:39 -0400
Received: by alize.ERE.UMontreal.CA (940406.SGI/5.17)	id AA17708;
          Sun, 28 Aug 94 19:03:38 -0400
Message-Id: <9408282303.AA17708@alize.ERE.UMontreal.CA>
Subject: [announce] Mac WWW Worm
From: lemieuse@ERE.UMontreal.CA (Mac WWW Worm)
Date: Sun, 28 Aug 1994 19:03:37 -0400 (EDT)
Reply-To: lemieuse@ERE.UMontreal.CA (Sebastien Lemieux)
To: m.koster@nexor.co.uk
X-Mailer: fastmail [version 2.4 PL21]

     First, sorry for my french colleagues for this english answer.  I
     just didn't want to write it twice...  

			      ----------

Here are my presents thoughts about that:

1- Due to the net traffic that would be produce by such an easy-to-use
   'bot, I first decided that it should _never_ be widely released.

2- My Mac WWW worm was an engine designed to search for specific
   topics.  He was downloading lots of pages, but kept informations
   only about a little portion of them.  This way there's a lot of
   wasting in net resource.

   So, if you were striving to get such a tool, you should consider
   using one of the publicly accessible WWW Database.

3- Everyone running a bot without letting other people acces the data
   is _wasting_ resources, and should not be permitted to do that...


Anyone interested in the subject of WWW Robot should consider reading
the following document:

  http://web.nexor.co.uk/mak/doc/robots/robots.html

Before flaming me for not releasing the 'bot, read every thing you can
find under that URL.


			      ----------

Beside that, the MacWWW worm program still contains lots of neat
HyperCard script that can be easily recycled for any internet based
material...  I would accept to share all this material with any other
HC-minded people.

Be aware that building net program is not a little thing.  Even if HC
permit it to be really easy, you should always keep in mind that the
internet is a _public_ network.  Don't waste other's resources...


Anyway, thanks for your interest.

| Sebastien Lemieux, dept. biol.  || rootPGPCryptoDESEscrowCIAHackDSSRIPEM
| lemieuse@alize.ERE.UMontreal.CA || NSASkipJackFBIKerberosRSACapstoneNIST
| PGP public key on finger.       || AnonymousMailUFC-FastCryptCrackPassWd
  http://alize.ere.umontreal.ca:8001/~lemieuse/


- ----------------------------------------------------------------------
	    Ce message a ete reposte par le reposteur TCL
		 Pour info: lemieuse@ere.umontreal.ca
- ----------------------------------------------------------------------


------- End of Forwarded Message


From clv2m@server.cs.virginia.edu Wed Sep  7 19:37:41 1994
Return-Path: <clv2m@server.cs.virginia.edu>
Delivery-Date: Wed, 7 Sep 1994 19:37:55 +0100
Received: from virginia.edu (actually host uvaarpa.Virginia.EDU) 
          by lancaster.nexor.co.uk with SMTP (XTPP);
          Wed, 7 Sep 1994 19:37:41 +0100
Received: from server.cs.virginia.edu by uvaarpa.virginia.edu id aa21729;
          7 Sep 94 14:37 EDT
Received: from mamba.cs.Virginia.EDU (mamba-fo.cs.Virginia.EDU) 
          by uvacs.cs.virginia.edu (4.1/5.1.UVA)	id AA27582;
          Wed, 7 Sep 94 14:36:55 EDT
Posted-Date: Wed, 7 Sep 1994 14:36:33 +0500
Return-Path: <clv2m@mamba.cs.Virginia.EDU>
Received: by mamba.cs.Virginia.EDU (5.0/SMI-2.0)	id AA17297;
          Wed, 7 Sep 1994 14:36:33 +0500
Date: Wed, 7 Sep 1994 14:36:33 +0500
From: Charles Viles <clv2m@uvacs.cs.virginia.edu>
Message-Id: <9409071836.AA17297@mamba.cs.Virginia.EDU>
To: hogeveen@fys.ruu.nl, ejk@ux2.cso.uiuc.edu, 
    phillips.cs.ubc.ca@uvacs.cs.virginia.edu, sfsh@rome.classics.lsa.umich.edu, 
    benw@chemistry.leeds.ac.uk, bob@num-alg-grp.co.uk, mike@arl.mil, 
    mln@blearg.larc.nasa.gov, warnock@hypatia.gsfc.nasa.gov, 
    m.koster@nexor.co.uk
Cc: /CN=robots/@nexor.co.uk
Subject: Latency measurements: TR Available
Reply-To: clv2m@uvacs.cs.virginia.edu
Status: RO
Content-Length: 2248

You are receiving this because you communicated with me at some time
during the "TESTCOMMAND" experiment or might otherwise be interested
in the paper. 

The following paper is now available via ftp/WWW.

Viles, Charles L. and James C. French. "Availability and Latency of World
Wide Web Information Servers", Technical Report CS-94-36, Department
of Computer Science, University of Virginia.

The report is available in postscript form at
ftp://uvacs.cs.virginia.edu/pub/techreports/CS-94-36.ps.Z

Abstract: During a 90 day period in 1994, we measured the availability and 
connection latency of HTTP (hypertext transport protocol) information 
servers. These measurements were made from an Eastern United States site. 
The list of servers included 192 servers from Europe and 321 servers from 
North America. Our measurements indicate that on average, 4.6% of North 
American servers and 5.9% of European servers were unavailable from the 
measurement site on any given day. As seen from the measurement site, the 
day-to-day variation in availability was much greater for the European 
servers than for the North American servers. The measurements also show 
a wide variation in availability for individual information servers. For 
example, more than 80% of all North American servers were available at 
least 95% of the time, but 5% of the servers were available less than 80% of 
the time. The pattern of unavailability suggests a strong correlation between 
unavailability and geographic location. Median connection latency from the 
measurement site was in the 0.2 - 0.5s range to other North American sites 
and the 0.4 - 2.5s to European sites, depending upon the day of the week. 
Latencies were much more variable to Europe than to North America. The 
magnitude of the latencies suggest the addition of an MGET method to 
HTTP to help alleviate large TCP set-up times associated with the retrieval 
of web pages with embedded images. The data show that 97% and 99% of 
all successful connections from the measurement site to Europe and North 
America respectively were made within the first 10 s. This suggests the 
establishment of client-side time-out intervals much shorter than those used 
for normal TCP connection establishment.


From /CN=robots-errors/@nexor.co.uk Mon Sep 12 15:59:07 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 12 Sep 1994 16:02:26 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 12 Sep 1994 15:59:07 +0100
Date: Mon, 12 Sep 1994 15:59:07 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:060560:940912145919]
Content-Identifier: WWW Worm for ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 12 Sep 1994 15:59:07 
                      +0100;
Alternate-Recipient: Allowed
From: Billy Barron <billy@utdallas.edu>
Message-ID: <94Sep12.095839cdt.14462@utdallas.edu>
To: /CN=robots/@nexor.co.uk
Subject:  WWW Worm for Tcl? (fwd)
X-WWW-Page: http://www.utdallas.edu/acc/billy.html
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 506

I caught wind of this from a friend:

>
>Newsgroups: comp.lang.tcl
>From: msh@mserv1.dl.ac.uk (M.S. Smith)
>Date: 9 Sep 1994 08:15:21 GMT
>Organization: Daresbury Laboratory, UK 
>Subject: WWW Worm for Tcl?
>
>Hello, I'm about to embark on writing a WWW worm using Tcl, what I
>would like to know is if one already exists in Tcl or if anyone has
>already done any groundwork on one?
>
>Any help will be appreciated, and I would prefer it if you could mail
>me directly.  Thanks!
>
>
>		Mark	msh@dl.ac.uk
>

From /CN=robots-errors/@nexor.co.uk Fri Oct 14 07:45:13 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 14 Oct 1994 07:46:02 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 14 Oct 1994 07:45:13 +0100
Date: Fri, 14 Oct 1994 07:45:13 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:254370:941014064514]
Content-Identifier: Anybody know ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Oct 1994 07:45:13 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"25426 Fri Oct 14 07:41:26 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Anybody know either of these two robots?
Status: RO
Content-Length: 2492


------- Forwarded Message

Replied: Thu, 13 Oct 1994 09:49:41 +0100
Replied: lamontg@u.washington.edu
Return-Path: <lamontg@u.washington.edu>
Delivery-Date: Wed, 12 Oct 1994 23:25:58 +0100
Received: from saul5.u.washington.edu by lancaster.nexor.co.uk 
          with SMTP (XTPP); Wed, 12 Oct 1994 23:25:38 +0100
Received: by saul5.u.washington.edu	(5.65+UW94.4/UW-NDC Revision: 2.30 ) 
          id AA29545;	Wed, 12 Oct 94 15:25:26 -0700
Date: Wed, 12 Oct 94 15:25:26 -0700
Message-Id: <9410122225.AA29545@saul5.u.washington.edu>
X-Sender: lamontg@saul5.u.washington.edu
To: m.koster@nexor.co.uk
X-Url: 
X-Mailer: Lynx, Version 2.3 BETA
X-Personal_Name: Lamont Granquist
From: lamontg@u.washington.edu
Subject: Putative New Webcrawlers

I run http://stein1.u.washington.edu:2012/ at the University of Wash
in Seattle, which is not an 'offical' UW server -- I'm just a student.
I've got a stats page on this at:

http://stein1.u.washington.edu:2012/admin/wwwstats.html

Which was useful in identifying two putative new webcrawlers that
i don't see on your list, the information from my logfile indicates:

broo.tele.nokia.fi on 11 Oct 94 between 4:02 am and 4:49 am (47 mins)
accessed 367 documents (i believe that this is the entire site) and
downloaded 3 megabytes in 47 minutes.

lanczos.maths.tcd.ie on 4 Oct 94 between 7:43 am and 8:43 am (exactly
1 hour) downloaded 109 requests and then on 10 Oct 94 between 4:22am
and 4:50am downloaded 93 requests for a cumulative total of 1.6 megs
of files.  I also believe that this putative bot isn't following links
which are on forms pages, while the finnish putative bot is following
links which are on forms pages (although i think this might be
over-cautious behavior on the part of the lanczos bot by simply ending
the search on any interactive page...  not sure...).  Of course it
may come back and hit me again.  There was very little duplication
between the two passes.

Doesn't bug me much because 3 megs is just a *blip* across the
university ethernet, and it was done at reasonably decent times of day...

I don't have any ID fields off of these bots -- i don't know how to
configure my server to tell me that.  I'm running CERN httpd and if
you could tell me how i could modify it to generate reports on bots
who are nice and ID themselves i'd appreciate it (although if you've
got this info on your WWW site and i just haven't seen it, don't
waste your time telling me more than just the URL...)


------- End of Forwarded Message


From /CN=robots-errors/@nexor.co.uk Fri Oct 14 18:10:07 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 14 Oct 1994 18:16:24 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 14 Oct 1994 18:10:07 +0100
Date: Fri, 14 Oct 1994 18:10:07 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077960:941014171008]
Content-Identifier: Re: What if w...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Oct 1994 18:10:07 
                      +0100;
Alternate-Recipient: Allowed
From: bcutter@pdn.paradyne.com
Message-ID: <9410141550.AA01494@paradyne.com>
To: uunet!robots <uunet!robots%nexor.co.uk@uunet.uu.net>
Subject:  Re: What if we offered a local spider?
Reply-To: bcutter@paradyne.att.com
X-Sun-Charset: US-ASCII
Status: RO
Content-Length: 2626

> > The robots discussion that I prompted with my indexing offer gave me an idea.
> > 
> > If we built a free spider that operated only via the file system, which
> > would build an index mapped to URL-space,
> 
> I suggested this to at least one robot author a while ago in the
> context of URL checking (Hi Roy :-), but there are a number of
> problems: CGI-script generated pages are excluded, access
> authorisation is ignored, and you need to parse server config files to
> look at URL mappings.

Roy Fielding's MOMSpider would definately be the best way to do this,
because it does walk the web..

However, if you definately want the bot to walk the filesystem, and
you're willing to live with broken URL's due to URL->directory
remapping in the server (like /icons in NCSA), and no CGI output..

Back before MOMSpider, I implemented a crude link checker called "checkweb"
which walked the filesystem checking the integrity of links, and as
a side effect it also created a map which listed all the page links to
other pages...  When run with a flag, it produced a experimental (and
crude) table of contents..

I've long since abandoned it, but if you must parse by the filesystem
rather than the web, you may want to start there:

http://www.stuff.com/~bcutter/home/programs/checkweb.html


> > then offered to serve those indexes from here, would people use it?
> 
> Well, by just making the file available on a well-known place anybody
> can use locally-generated map. Ehr /ls-R.txt ?

It would be nice to provide in a flat file a list of all files, ala ls-lR,
so rather than doing multiple HEAD's against a site, I can pull down the
single file and get my last modified and sizes from there..  

However, there may be some issues of security/privacy... Most web
sites put a "index.html" (or Welcome.html") file in place so you
can't browse the directory structures, and in effect use security
through obscurity to in effect force people to access only those
pages which are linked.  Providing a master ls-lR file would provide
a way so that I could find out what pages existed in the filesystem,
regardless of links.  (I'd like to do this on hooho.ncsa.uiuc.edu,
which has a number of interesting html pages, most of which are not
listed off the home page.)

If we can solve this problem, it would be nice to also regularly
generate a file showing the relationship of document links, so
robots won't have to walk the web to find this out... (some could
prune their search looking just at the files - and those that just
walk rather than index won't need to retrieve the HTML files)

-Brooks
bcutter@paradyne.att.com


From narnett@verity.com Sat Oct 15 15:13:27 1994
Return-Path: <narnett@verity.com>
Delivery-Date: Sat, 15 Oct 1994 15:13:42 +0100
Received: from verity.com (actually host unknown-143-5.verity.com) 
          by lancaster.nexor.co.uk with SMTP (XTPP);
          Sat, 15 Oct 1994 15:13:27 +0100
Received: from nasty.verity.com (nasty.verity.com [192.187.143.63]) 
          by verity.com (8.6.6.Beta9/8.6.6.Beta9) with SMTP id HAA15198;
          Sat, 15 Oct 1994 07:15:38 -0700
Received: from [192.187.143.12] (portanick) 
          by nasty.verity.com (4.1/SMI-4.1)	id AA18229;
          Sat, 15 Oct 94 07:11:47 PDT
Message-Id: <9410151411.AA18229@nasty.verity.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 15 Oct 1994 10:14:32 -0500
To: bcutter@paradyne.att.com, m.koster@nexor.co.uk
From: narnett@verity.com (Nick Arnett)
Subject: Re: What if we offered a local spider?
Cc: /CN=robots/@nexor.co.uk
Status: RO
Content-Length: 1354

At 11:46 AM 10/14/94 -0400, bcutter@pdn.paradyne.com wrote:

>However, if you definately want the bot to walk the filesystem, and
>you're willing to live with broken URL's due to URL->directory
>remapping in the server (like /icons in NCSA), and no CGI output..

I think our main goal would be to design a spider that would be very useful
for indexing one site, but couldn't be used (easily, anyway) to index
remote sites. So we'd obviously rather not end up with broken URLs, etc.

It seems that we'd want some combination of each...  There's probably no
way to avoid using HTTP for part of the indexing, but perhaps its utility
could be limited.

>It would be nice to provide in a flat file a list of all files, ala ls-lR,
>so rather than doing multiple HEAD's against a site, I can pull down the
>single file and get my last modified and sizes from there..

Yes, absolutely.  I'm considering a feature for our server that would do
this *and* given a date as a parameter, would generate a file that only
contained the meta-data for the files that have changed since that date.
That would greatly simplify updates.  I can't remember if Harvest has
something like that, but I have the Harvest paper with me (I'm on the
road).

Your other suggestions would add a lot of utility to the indexing spider at
a very low cost of programming, I suspect.

Nick


From /CN=robots-errors/@nexor.co.uk Thu Oct 20 16:53:43 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Oct 1994 16:55:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Oct 1994 16:53:43 +0100
Date: Thu, 20 Oct 1994 16:53:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294220:941020155345]
Content-Identifier: Web Navigatio...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Oct 1994 16:53:43 
                      +0100;
Alternate-Recipient: Allowed
From: Bowden Wise <wiseb@cs.rpi.edu>
Message-ID: <199410201551.AA02397@cs.rpi.edu>
To: /CN=robots/@nexor.co.uk
Subject:  Web Navigational Aids
X-Mailer: MH 6.7.1/exmh version 1.5beta 8/10/94
Status: RO
Content-Length: 1633

Hi, 

I am interested in maintaining a map of a user's current document for
navigational purposes.  Does anyone know of any techniques used to
visualize the hypertext structure of the Web?  

I would like to incorporate these ideas into a Web robot first, and then
add them to a browser as a navigational aid.  Does anyone know of any
existing source code that shows how to set up a search?  I prefer C or C++
code. 

You can think of the map as a grah with the start document as the initial
node.  As links are found, they are added to the graph itself.

A breadth first search of such a graph will visit nodes level by level,
first by visiting all links from the inital node before proceeding to the
next level.

A depth first search would visit deeper into the graph.

Obviously, the search must be limited, because you could traverse links
forever.  One could stop the search once a link is followed that goes off
of the current document's server, or when the same node is visited again
(for the case of a circular path), and by only searching to a certain
level from an initial node.

I would like to write a Web robot to build such a graph given a starting
page.  I prefer to write my code in C or C++.  I would be grateful to any
pointers to algorithms or code for doing similar searches.

Where can I find a minimal robot written in C or C++ that uses the www
library? 

--------------------------------------------------------------------
- G. Bowden Wise                    Computer Science Department
  Internet: wiseb@cs.rpi.edu        Rensselaer Polytechnic Institute
                                    Troy, NY 12180


From /CN=robots-errors/@nexor.co.uk Fri Oct 21 21:40:26 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 21 Oct 1994 21:42:22 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 21 Oct 1994 21:40:26 +0100
Date: Fri, 21 Oct 1994 21:40:26 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:113970:941021204027]
Content-Identifier: Re: Web Navig...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 21 Oct 1994 21:40:26 
                      +0100;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9410212042.AA11639@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Web Navigational Aids
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="iso-8859-1"
Content-transfer-encoding: quoted-printable
Status: RO
Content-Length: 1080

At 10:55 AM 10/20/94 +0000, Bowden Wise wrote:
...
>I am interested in maintaining a map of a user's current document for
>navigational purposes.  Does anyone know of any techniques used to
>visualize the hypertext structure of the Web?
>
>I would like to incorporate these ideas into a Web robot first, and then
>add them to a browser as a navigational aid.  Does anyone know of any
>existing source code that shows how to set up a search?  I prefer C or C++
>code.

This concept was demo'ed at the 2nd WWW Conference in Chicago on Tuesday by
Peter D=F6mel - "Webmap - A Graphical Hypertext Navigation Tool."  Peter's
email address is doemel@informatik.uni-frankfurt.de.

- Dave

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D
Software Engineering Program              Phone: (713) 283-3875
University of Houston - Clear Lake        fax  : (713) 283-3810
Box 113, 2700 Bay Area Blvd.              Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                            or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From /CN=robots-errors/@nexor.co.uk Fri Oct 21 06:01:47 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 21 Oct 1994 06:03:02 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 21 Oct 1994 06:01:47 +0100
Date: Fri, 21 Oct 1994 06:01:47 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:074790:941021050149]
Content-Identifier: Robots in Int...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 21 Oct 1994 06:01:47 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"7476 Fri Oct 21 06:01:37 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Robots in Internet Courses?
Status: RO
Content-Length: 394


I've heard rumours that among the many new Internet courses at the
various universities, some have "write a robot" assignments/
projects. Can anyone substantiate these rumours? A worrying thought
really...

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Nov  3 14:26:57 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 3 Nov 1994 14:29:21 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 3 Nov 1994 14:26:57 +0000
Date: Thu, 3 Nov 1994 14:26:57 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:026870:941103142658]
Content-Identifier: Mechanical Sp...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Nov 1994 14:26:57 +0000;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"2682 Thu Nov  3 14:26:37 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Mechanical Spider Image
Status: RO
Content-Length: 509


Hardly a real robot question, but does anyone know where I could get
an image of that small spider robot done at MIT? I have seen it in
print somewhere but cannot possibly remember, and I browsed the
various robotics departments a while back without success.
It would look better on the robot page than the robot arm :-)

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Nov  3 14:42:33 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 3 Nov 1994 14:45:53 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 3 Nov 1994 14:42:33 +0000
Date: Thu, 3 Nov 1994 14:42:33 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:030640:941103144235]
Content-Identifier: code implemen...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Nov 1994 14:42:33 +0000;
Alternate-Recipient: Allowed
From: Mike Schwartz <schwartz@latour.cs.colorado.edu>
Message-ID: <199411031438.HAA23590@latour.cs.colorado.edu>
To: /CN=robots/@nexor.co.uk
Subject:  code implementing robots.txt "Disallow" mechanism?
Status: RO
Content-Length: 135

Does anyone have some C code implementing the robots.txt "Disallow"
mechanism?  Thanks
 - Mike Schwartz
   Univ. of Colorado - Boulder

From /CN=robots-errors/@nexor.co.uk Tue Nov 15 08:09:53 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 15 Nov 1994 08:18:21 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 15 Nov 1994 08:09:53 +0000
Date: Tue, 15 Nov 1994 08:09:53 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:078320:941115080954]
Content-Identifier: Storing un-ch...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 15 Nov 1994 08:09:53 
                      +0000;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"7828 Tue Nov 15 08:09:44 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Cc: Rob Hartill <hartill@ooo.lanl.gov>, 
    Jack Applin <neutron@swttools.fc.hp.com>, warrenw@hp10cux8.nsr.hp.com
Subject:  Storing un-checked links
Status: RO
Content-Length: 1297


Hi all,

Robert Hartill brought a problem to my attention where a robot stored
an invalid link to server A, found on a host B, without checking it
was still valid, or that it wasn't governed by /robots.txt.

I'm considering adding a note to guidelines.html that specifies that
URL's shouldn't be listed unless their existence has been validated by
an explicit retrieval (which of course falls under the /robots.txt
restrictions). This is especially a good idea when the links are found
on remote servers.

This is vaguely related to an issue Jack Applin and Warren Waldo
mailed me about a while ago: Does the robots.txt "disallow" mean
"don't retrieve this tree", or "don't list a link to this tree". There
is no distinction in norobots.html between these cases, in my mind
robots.txt covered the first case, and the second case shouldn't
happen.

These extra retrievals reduce the harvest of links per document from
many to one, but increases the quality of the robot's output. I guess at
the very least non-checked links should be marked as such.

Any thoughts? How do robots work in these cases currently?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From narnett@verity.com Thu Nov 17 22:38:21 1994
Replied: Thu, 17 Nov 1994 22:50:46 +0000
Replied: Martijn Koster <m.koster@nexor.co.uk>
Replied: /DD.Common=robots/@nexor.co.uk
Replied: narnett@verity.com (Nick Arnett)
Return-Path: <narnett@verity.com>
Delivery-Date: Thu, 17 Nov 1994 22:39:01 +0000
Received: from verity.com (actually host unknown-143-5.verity.com) 
          by lancaster.nexor.co.uk with SMTP (XTPP);
          Thu, 17 Nov 1994 22:38:21 +0000
Received: from nasty.verity.com (nasty.verity.com [192.187.143.63]) 
          by verity.com (8.6.6.Beta9/8.6.6.Beta9) with SMTP id OAA18655;
          Thu, 17 Nov 1994 14:40:57 -0800
Received: from  by nasty.verity.com (4.1/SMI-4.1)	id AB05203;
          Thu, 17 Nov 94 14:36:34 PST
Message-Id: <aaf18aab160210040e68@[192.187.143.12]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 17 Nov 1994 14:39:23 -0800
To: Martijn Koster <m.koster@nexor.co.uk>, /DD.Common=robots/@nexor.co.uk
From: narnett@verity.com (Nick Arnett)
Subject: Re: Storing un-checked links
Status: RO
Content-Length: 671

At 8:01 AM 11/17/94, Martijn Koster wrote:
>Hi all,
>
>Robert Hartill brought a problem to my attention where a robot stored
>an invalid link to server A, found on a host B, without checking it
>was still valid, or that it wasn't governed by /robots.txt.

The browser should be sending referrer information.  I must admit that I
don't know the mechanism, but apparently there's a means for the client to
tell "host B" in essence "host A sent me."

Can anyone offer details?  I only know this in theory at the moment.

I dislike the idea of a robot that has to validate every link.  That's a
lot of overhead for a piece of information whose life is quite limited.

Nick


From /CN=robots-errors/@nexor.co.uk Thu Nov 17 22:39:35 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 22:43:33 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 17 Nov 1994 22:39:35 +0000
Date: Thu, 17 Nov 1994 22:39:35 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:288830:941117223939]
Content-Identifier: Re: Storing u...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 22:39:35 
                      +0000;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@verity.com>
Message-ID: <aaf18aab160210040e68@[192.187.143.12]>
To: Martijn Koster <m.koster@nexor.co.uk>, /CN=robots/@nexor.co.uk
Subject:  Re: Storing un-checked links
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 671

At 8:01 AM 11/17/94, Martijn Koster wrote:
>Hi all,
>
>Robert Hartill brought a problem to my attention where a robot stored
>an invalid link to server A, found on a host B, without checking it
>was still valid, or that it wasn't governed by /robots.txt.

The browser should be sending referrer information.  I must admit that I
don't know the mechanism, but apparently there's a means for the client to
tell "host B" in essence "host A sent me."

Can anyone offer details?  I only know this in theory at the moment.

I dislike the idea of a robot that has to validate every link.  That's a
lot of overhead for a piece of information whose life is quite limited.

Nick


From m.koster@nexor.co.uk Thu Nov 17 22:50:45 1994
Return-Path: <m.koster@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 22:50:52 +0000
Received: from nexor.co.uk (actually host victor.nexor.co.uk) 
          by lancaster.nexor.co.uk with SMTP (PP);
          Thu, 17 Nov 1994 22:50:45 +0000
To: narnett@verity.com (Nick Arnett)
cc: Martijn Koster <m.koster@nexor.co.uk>, /DD.Common=robots/@nexor.co.uk
Subject: Re: Storing un-checked links 
In-reply-to: Your message of "Thu, 17 Nov 1994 14:39:23 PST."             <aaf18aab160210040e68@[192.187.143.12]> 
Date: Thu, 17 Nov 1994 22:50:41 +0000
From: Martijn Koster <m.koster@nexor.co.uk>
Status: RO
Content-Length: 1982


> At 8:01 AM 11/17/94, Martijn Koster wrote:
> >Hi all,
> >
> >Robert Hartill brought a problem to my attention where a robot stored
> >an invalid link to server A, found on a host B, without checking it
> >was still valid, or that it wasn't governed by /robots.txt.
> 
> The browser should be sending referrer information.  I must admit that I
> don't know the mechanism, but apparently there's a means for the client to
> tell "host B" in essence "host A sent me."

There is, but that doesn't help here. The problem is that the dead
link is never retrieved by the robot, so it doesn't know it is dead,
and it can't tell anyone if it wanted to. The only place where
Referer comes in is when a client uses the Robot-served dead link, in
which case server A is told "Robot database xxx sent me", which
doesn't help much.

There might be scope for a new HTTP method DEADLINK or something,
where a client explicitly goes back to the server that served the
documents containing the dead (or moved) link, and notfies it of this
fact. Of course it should only do this on a positive refusal (when it
gets a "not found") not just on any failure (when it can't get to the
host). But somehow I don't expect that to get into HTTP anytime soon.

> Can anyone offer details?  I only know this in theory at the moment.

Refere is useful, but not with a third party. Oh, and lots of clients
lie about Referer too :-)

> I dislike the idea of a robot that has to validate every link.  That's a
> lot of overhead for a piece of information whose life is quite limited.

That depends, it may not be that limited, it may go for months without
being refreshed. I suppose you could even do the validation based on
some usage pattern: "this URL has been found during the last n
searches, let's make sure it exists."

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From mcbryan@cs.colorado.edu Thu Nov 17 23:37:47 1994
Return-Path: <mcbryan@cs.colorado.edu>
Delivery-Date: Thu, 17 Nov 1994 23:38:06 +0000
Received: from piper.cs.colorado.edu by lancaster.nexor.co.uk with SMTP (XTPP);
          Thu, 17 Nov 1994 23:37:47 +0000
Received: from [198.11.16.30] (mac3bryan.cs.colorado.edu [198.11.16.30]) 
          by piper.cs.colorado.edu (8.6.9/8.6.9) with SMTP id QAA28066;
          Thu, 17 Nov 1994 16:37:33 -0700
Date: Thu, 17 Nov 1994 16:37:33 -0700
X-Sender: mcbr@piper.cs.colorado.edu
Message-Id: <aaf133690b02100406df@[198.11.16.30]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Martijn Koster <m.koster@nexor.co.uk>, 
    " (Nick Arnett)" <narnett@verity.com>
From: mcbryan@cs.colorado.edu (Oliver A. McBryan)
Subject: Re: Storing un-checked links
Cc: Martijn Koster <m.koster@nexor.co.uk>, /DD.Common=robots/@nexor.co.uk
Status: RO
Content-Length: 466

The Worm regards every link it cannot reach.
It also records every host it cannot reach
Finally it records the date of each attempted acccess.

Those data are used later to decide whether and when to revisit a link and
whatto output to the publicly accessible archive.

Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844
Dept of Computer Science, University of Colorado, Boulder, CO 80309.
WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html


From mlm@FUZINE.MT.CS.CMU.EDU Thu Nov 17 23:44:09 1994
Return-Path: <mlm@FUZINE.MT.CS.CMU.EDU>
Delivery-Date: Thu, 17 Nov 1994 23:46:29 +0000
Received: from fuzine.mt.cs.cmu.edu by lancaster.nexor.co.uk with SMTP (XTPP);
          Thu, 17 Nov 1994 23:44:09 +0000
Received: by fuzine.mt.cs.cmu.edu (NeXT-1.0 (From Sendmail 5.52)/NeXT-0.9)	id AA05964;
          Thu, 17 Nov 94 18:42:50 EST
Date: Thu, 17 Nov 94 18:42:50 EST
From: mlm@FUZINE.MT.CS.CMU.EDU (Michael Mauldin)
Message-Id: <9411172342.AA05964@fuzine.mt.cs.cmu.edu>
Original-Received:  by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
To: Martijn Koster <m.koster@nexor.CO.UK>
Subject: Re: Storing un-checked links
Cc: /DD.Common=robots/@nexor.CO.UK, Rob Hartill <hartill@ooo.LANL.GOV>, 
    Jack Applin <neutron@swttools.fc.hp.com>, warrenw@hp10cux8.nsr.HP.com
Status: RO
Content-Length: 1200

Lycos is the Robot in question, and I have already been in
touch with Rob Hartill about the answer.

One feature of Lycos is that it collects all the descriptions
of a page and brings them together in a single retrieval even
if Lycos has not retrieved the document in question.

Informal polls of my users indicate that the increased data
coverage of this approach is extremely useful, but it does
mean that the link is not known to be valid.

One thought that he suggested that I do like is to not store
these indirect documents if the robots.txt file for that
server would disallow the robot from accessing that file.

This is not nearly as expensive as checking the existence of
every URL, because there are only 9-10 thousand HTTP servers,
and Lycos at least caches the robots.txt file for each of
them.  That means that Lycos can (in principal and soon in
reality) check the publication status of a URL before including
it in the database.

I would not be willing to go farther and state that a robot
must verify the existence of a URL before revealing the URL
pointer.  That reduces the usefulness of my robot far too
much.

--Dr. Michael L. Mauldin <fuzzy@cmu.edu>
  http://lycos.cs.cmu.edu/


From /CN=robots-errors/@nexor.co.uk Tue Nov 15 18:47:17 1994
Replied: Wed, 16 Nov 1994 08:24:29 +0000
Replied: /CN=robots/@nexor.co.uk
Replied: "David M. Chess" <chess@watson.ibm.com>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 15 Nov 1994 18:48:52 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 15 Nov 1994 18:47:17 +0000
Date: Tue, 15 Nov 1994 18:47:17 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:194730:941115184718]
Content-Identifier: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 15 Nov 1994 18:47:17 
                      +0000;
Alternate-Recipient: Allowed
From: "David M. Chess" <chess@watson.ibm.com>
Message-ID: <"19470 Tue Nov 15 18:47:08 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Hello!
Status: RO
Content-Length: 1627

I've just joined the list, and thought I'd introduce myself
very briefly and ask some stupid questions.

I'm David M. Chess, and I work in the High-Integrity Computing
Lab at IBM's T. J. Watson Research Center in Westchester County,
New York, USA.  We work on computer viruses and related
replicating things; we are the R&D staff for IBM AntiVirus,
and we do research into virus-like problems on current and
future highly-distributed systems.  We're currently starting
to look into the Web and electronic commerce.  Just for exercise,
I wrote a web-walker in REXX for OS/2, and while thinking about
the various positive and negative aspects of the beasts, I
stumbled across this list.

  - Should I send in a semi-formal description of my
    OS/2 REXX robot, for the list?  I don't imagine it's
    ever been used outside the IBM Internal web so far,
    but it might be eventually.  If so, just what information
    should I send, and to whom?

  - There are some reasonably obvious-looking things one
    could do to http to make at least some kinds of robots
    more efficient.  For instance, if one could say "give
    me just the <title> text and all <a> tags in this document",
    some robots could avoid reading all the text of an
    html document just to find those things.  Is this a
    good place to toss around that sort of idea, or would
    one of the general www lists be better?

- -- -
David M. Chess                 | "Master, how may I comprehend the One?"
High Integrity Computing Lab   | "Have you finished your coding?"  "Yes."
IBM Watson Research            | "Then go and compile!"   -- Hacker Koan

From /CN=robots-errors/@nexor.co.uk Wed Nov 16 08:24:41 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 16 Nov 1994 08:26:29 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 16 Nov 1994 08:24:41 +0000
Date: Wed, 16 Nov 1994 08:24:41 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273850:941116082442]
Content-Identifier: Re: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 16 Nov 1994 08:24:41 
                      +0000;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"27382 Wed Nov 16 08:24:35 1994"@nexor.co.uk>
To: "David M. Chess" <chess@watson.ibm.com>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"19470 Tue Nov 15 18:47:08 1994"@nexor.co.uk>
Subject:  Re: Hello! 
Status: RO
Content-Length: 2352


> We're currently starting
> to look into the Web and electronic commerce.  Just for exercise,
> I wrote a web-walker in REXX for OS/2, and while thinking about
> the various positive and negative aspects of the beasts, I
> stumbled across this list.
> 
>   - Should I send in a semi-formal description of my
>     OS/2 REXX robot, for the list?  I don't imagine it's
>     ever been used outside the IBM Internal web so far,
>     but it might be eventually.  If so, just what information
>     should I send, and to whom?

Have a look at http://web.nexor.co.uk/mak/doc/robots/robots.html 
there is a list of robots there, with basic information to enable
system admins to recognise them visiting, and to serve a central
list with pointers to a Web robot material; I would suggest putting
a full description of your robot up in the web, and then give me a
short summary, with a pointer to more info.

>   - There are some reasonably obvious-looking things one
>     could do to http to make at least some kinds of robots
>     more efficient.  For instance, if one could say "give
>     me just the <title> text and all <a> tags in this document",
>     some robots could avoid reading all the text of an
>     html document just to find those things.  Is this a
>     good place to toss around that sort of idea, or would
>     one of the general www lists be better?

This is probably a good place to hash out any requirements and the
issues, before proposing any formal protocol changes to www-talk.

As for the "Only title and links" idea, this is generally no thought
be sufficient for indexing purposes (see eg the WebWalker paper at the
Chicago conference), and obviously not enough for content-checking
purposes. That leaves "dead link" checking and simple statistics, and
it is my opinion that no robot should only do those. To implement it
would increase server complexity, and it wouldn't work for CGI
generated pages (unless the server does even more work). So I don't
think it'd be worth the effort.

The obvious way to make robots more efficient is for them to share
results, maybe by acting as a Harvest gatherer.

Regards and welcome to the club,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Nov 17 22:39:15 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 22:43:24 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 17 Nov 1994 22:39:15 +0000
Date: Thu, 17 Nov 1994 22:39:15 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:288810:941117223916]
Content-Identifier: Re: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 22:39:15 
                      +0000;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@verity.com>
Message-ID: <aaf1896715021004c275@[192.187.143.12]>
To: "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk
Subject:  Re: Hello!
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 1644

At 8:03 AM 11/17/94, David M. Chess wrote:

>  - Should I send in a semi-formal description of my
>    OS/2 REXX robot, for the list?  I don't imagine it's
>    ever been used outside the IBM Internal web so far,
>    but it might be eventually.  If so, just what information
>    should I send, and to whom?

I, for one, am always interested in design comparison.  Our spider is still
fairly crude, but getting smarter every day.

>  - There are some reasonably obvious-looking things one
>    could do to http to make at least some kinds of robots
>    more efficient.  For instance, if one could say "give
>    me just the <title> text and all <a> tags in this document",
>    some robots could avoid reading all the text of an
>    html document just to find those things.  Is this a
>    good place to toss around that sort of idea, or would
>    one of the general www lists be better?

I think this list is the ideal place to discuss such ideas.  I'm afraid
that the more general the list, the more likely you'll get flamed just for
having written a spider at all.  Makes it hard to have a rational
discussion.

Do you have any empirical evidence that getting the title and anchors would
yield useful indexes?  It sounds like a good idea, but my guess is that
it's not going to work well, since those pieces wouldn't add up to an
abstract of a typical document.  A standard name for an abstract would be
great, especially if it were in the header so that the HEAD command in HTTP
would retrieve it along with other meta-information.  We're supporting the
META element in the HTML 2.0 RFC for encoding custom document attributes.

Nick


From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:04:35 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 23:05:46 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 17 Nov 1994 23:04:35 +0000
Date: Thu, 17 Nov 1994 23:04:35 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:293220:941117230436]
Content-Identifier: Re: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:04:35 
                      +0000;
Alternate-Recipient: Allowed
From: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu>
Message-ID: <aaf12df809021004bf6f@[198.11.16.30]>
To: " (Nick Arnett)" <narnett@verity.com>, 
    "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk
Subject:  Re: Hello!
X-Sender: mcbr@piper.cs.colorado.edu
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 391

The WWWW - World Wide WEb Worm - retrieves only Title and anchors
(incvluding their associated text or icons).  It has been quite useful -
usage is at 1/3 million per month last time I checked.

Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844
Dept of Computer Science, University of Colorado, Boulder, CO 80309.
WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html


From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:13:53 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 23:15:39 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 17 Nov 1994 23:13:53 +0000
Date: Thu, 17 Nov 1994 23:13:53 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:295000:941117231401]
Content-Identifier: Re: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:13:53 
                      +0000;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@verity.com>
Message-ID: <aaf1935d190210041989@[192.187.143.12]>
To: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu>, 
    "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk
Subject:  Re: Hello!
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 346

At 3:01 PM 11/17/94, Oliver A. McBryan wrote:
>The WWWW - World Wide WEb Worm - retrieves only Title and anchors
>(incvluding their associated text or icons).  It has been quite useful -
>usage is at 1/3 million per month last time I checked.

Forgive my glibness, but I think that's a measure of its usability, not its
usefulness... ;-)

Nick


From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:40:50 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 17 Nov 1994 23:44:00 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 17 Nov 1994 23:40:50 +0000
Date: Thu, 17 Nov 1994 23:40:50 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:299200:941117234059]
Content-Identifier: Re: Hello(b)
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:40:50 
                      +0000;
Alternate-Recipient: Allowed
From: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu>
Message-ID: <aaf1341a0c021004304a@[198.11.16.30]>
To: " (Nick Arnett)" <narnett@verity.com>, 
    "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk
Subject:  Re: Hello!
X-Sender: mcbr@piper.cs.colorado.edu
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 829

At 11:13 PM 11/17/94, Nick Arnett wrote:
>At 3:01 PM 11/17/94, Oliver A. McBryan wrote:
>>The WWWW - World Wide WEb Worm - retrieves only Title and anchors
>>(incvluding their associated text or icons).  It has been quite useful -
>>usage is at 1/3 million per month last time I checked.
>
>Forgive my glibness, but I think that's a measure of its usability, not its
>usefulness... ;-)
>
>Nick

Agreed.  However an awful lot of accesses are repeat from the same machine,
suggesting usefulness as well.

However I'll be the first to agree that full text search is needed.  I just
was not interested in providing that service myself.

Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844
Dept of Computer Science, University of Colorado, Boulder, CO 80309.
WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html


From /CN=robots-errors/@nexor.co.uk Fri Nov 18 22:51:18 1994
Replied: Mon, 21 Nov 1994 08:38:00 +0000
Replied: " (Brian Pinkerton)" <bp@cs.washington.edu>
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 18 Nov 1994 22:53:05 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 18 Nov 1994 22:51:18 +0000
Date: Fri, 18 Nov 1994 22:51:18 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:206000:941118225119]
Content-Identifier: Web-wide Inde...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 18 Nov 1994 22:51:18 
                      +0000;
Alternate-Recipient: Allowed
From: " (Brian Pinkerton)" <bp@cs.washington.edu>
Message-ID: <199411182250.OAA22744@june.cs.washington.edu>
To: /CN=robots/@nexor.co.uk
Subject:  Web-wide Indexing Workshop
Return-Path: <bp@cs.washington.edu>
X-Sender: bp@fishtail.biotech.washington.edu (Unverified)
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 1140

I am trying to gauge the interest for a web-wide indexing workshop at WWW95
in Darmstadt next April.

The motivation for this workshop is to find ways that we can share indexing
and Web-structure information.  The idea is not to change the way any of
you offer indexes, but rather to find ways to cooperate on finding the
information that goes into building them.

I think we could focus on two specific goals: 1) take a hard look at some
of the existing protocols for sharing information, and see what we can use
and what we need to build and 2) come up with an operational plan for
putting these tools to use on a production basis.

The workshop environment is the ideal place to do this work: we will bring
together people with lots of experience building and running Web-wide
indexes, and hopefully some who are experts in information retrieval.  We
will keep the numbers small, so we can actually make some progress.

If you're interested, or if you would like to come but can't make
Darmstadt, send me mail.  I will collect comments, see if this will be
worth doing, and submit a proposal to the conference oganizers if it is.

bri


From /CN=robots-errors/@nexor.co.uk Sat Nov 19 03:47:09 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sat, 19 Nov 1994 03:49:36 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sat, 19 Nov 1994 03:47:09 +0000
Date: Sat, 19 Nov 1994 03:47:09 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:234010:941119034711]
Content-Identifier: Re: Web-wide ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 19 Nov 1994 03:47:09 
                      +0000;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9411190348.AB25045@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Web-wide Indexing Workshop
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 2078

At  4:52 PM 11/18/94 +0000,  (Brian Pinkerton) wrote:
>I am trying to gauge the interest for a web-wide indexing workshop at WWW95
>in Darmstadt next April.

  I believe that the time for some initial consensus is upon us, for if no
other reason than it can act as a driver for additional research beyond the
simple application of indexing schemes out of the IR community to Web
documents.

>
>The motivation for this workshop is to find ways that we can share indexing
>and Web-structure information.  The idea is not to change the way any of
>you offer indexes, but rather to find ways to cooperate on finding the
>information that goes into building them.

  There are some heavy-duty projects working on deep semantics approaches
to this kind of thing (e.g., the ARPA Knowledge Sharing project) as well as
some fairly pragmatic collaborations on sharing metadata (e.g., the DoD
Reuse Interoperability Group (RIG)).  The last I heard, Joe Nieten was
heading up an AIAA task force to create a standard for repository
interoperation that would also be relevant.  There's also an IEEE working
group in metadata with their own mailing list and workshop series.

   The trick here will be to build an exchange structure and protocol that
is sufficiently extensible as to support a fairly limited skeleton (i.e., a
URL graph) and something as sophisticated as a semantic net.  The Harvest
group is part-way down this path with broker/broker interaction.  The
distributed database community has worked similar problems in the areas of
heterogeneous and federated databases.

- Dave

p.s.  Hmmm... I guess that the above is a 'yes' vote for a workshop.

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D     Web: http://ricis.cl.uh.edu/eichmann/
Software Engineering Program           Phone: (713) 283-3875
University of Houston - Clear Lake       fax: (713) 283-3810
Box 113, 2700 Bay Area Blvd.           Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                         or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From /CN=robots-errors/@nexor.co.uk Tue Dec 27 16:05:47 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 27 Dec 1994 16:07:16 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 27 Dec 1994 16:05:47 +0000
Date: Tue, 27 Dec 1994 16:05:47 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201110:941227160549]
Content-Identifier: Another page ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 27 Dec 1994 16:05:47 
                      +0000;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"20101 Tue Dec 27 16:05:27 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Another page to avoid
Status: RO
Content-Length: 291

http://csclub.uwaterloo.ca/u/zblaxell/useless.html contains
deliberately broken links. How ridiculous.


-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Tue Jan  3 02:40:46 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 3 Jan 1995 02:42:20 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 3 Jan 1995 02:40:46 +0000
Date: Tue, 3 Jan 1995 02:40:46 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:162490:950103024047]
Content-Identifier: Forms-based e...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 3 Jan 1995 02:40:46 +0000;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@verity.com>
Message-ID: <ab2e5aa20602100451eb@[192.187.143.12]>
To: /CN=robots/@nexor.co.uk
Subject:  Forms-based editor for robots.txt?
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 323

Has anyone taken a stab at an HTML forms-based editor for robots.txt?  We
plan to make use of it for our indexing agents, so we'd like to make it
easy for administrators to modify it.  In keeping with an overall strategy
of using CGI-based admin tools, we'd like to have something along those
lines for robots.txt.

Nick


From /CN=robots-errors/@nexor.co.uk Tue Jan  3 19:09:23 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 3 Jan 1995 19:11:07 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 3 Jan 1995 19:09:23 +0000
Date: Tue, 3 Jan 1995 19:09:23 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:052620:950103190924]
Content-Identifier: Another page ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 3 Jan 1995 19:09:23 +0000;
Alternate-Recipient: Allowed
From: "David M. Chess" <chess@watson.ibm.com>
Message-ID: <"5243 Tue Jan  3 19:09:05 1995"@nexor.co.uk>
To: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk
Subject:  Another page to avoid
Status: RO
Content-Length: 125

Hey, if the Web didn't have all sorts of odd things in it,
think how much harder stress-testing robots would be!  *8)     DC

From /CN=robots-errors/@nexor.co.uk Sun Feb  5 10:49:06 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 5 Feb 1995 10:50:37 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 5 Feb 1995 10:49:06 +0000
Date: Sun, 5 Feb 1995 10:49:06 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201910:950205104908]
Content-Identifier: bad robot?
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 5 Feb 1995 10:49:06 +0000;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>
Message-ID: <9502050205.aa11376@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
Subject:  bad robot?
Status: RO
Content-Length: 533

Hi all,

I've been getting bad requests like the following at my site:

nirvana.rns.com - - [31/Jan/1995:19:14:41 -0800] "GET /WebSoft/wwwstat/" ADD_DATE="786312654" LAST_VISIT="786312648 HTTP/1.0" 404 -

It looks like a robot, but is not on the list and I don't know anyone
from rns.com.  Does anyone know who it is?

......Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                                     <fielding@ics.uci.edu>
                     <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>

From /CN=robots-errors/@nexor.co.uk Wed Feb 22 18:55:16 1995
Replied: Mon, 27 Mar 1995 15:38:11 +0100
Replied: m.koster
Replied: bp@haole.cs.washington.edu
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 22 Feb 1995 18:57:22 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 22 Feb 1995 18:55:16 +0000
Date: Wed, 22 Feb 1995 18:55:16 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:174850:950222185517]
Content-Identifier: WWW95 Indexin...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 22 Feb 1995 18:55:16 
                      +0000;
Alternate-Recipient: Allowed
From: Brian Pinkerton <bp@haole.cs.washington.edu>
Message-ID: <9502221853.AA08285@haole.cs.washington.edu>
To: /CN=robots/@nexor.co.uk
Subject:  WWW95 Indexing Workshop
Reply-To: bp@haole.cs.washington.edu
Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
Original-Received: by NeXT.Mailer (1.118.2)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 1155

There will be an indexing workshop at the upcoming WWW conference in
Darmstadt.  The intent of the workshop is to get together those people
who are actively involved in building or providing resource discovery
systems for the Web.  My intent is that we will discuss the state of
the art, talk about some of its current limitations, and focus on some
proposed solutions.  It would be great to come out of this with a real
plan to share data, or at least with the understanding that will enable
us to do that in the future.

Since the workshop is short, we hope to spend most of our time in
discussion.  For more information on the workshop, see the description at

  http://www.igd.fhg.de/www/www95/workshops/work-a.html

General conference information is available at

  http://www.igd.fhg.de/www/www95/general.html

If you would like to participate in the workshop, send either Bipin or
I a short statement of why you'd like to be there.  This statement
doesn't have to be long -- 1 page will be fine.

Deadlines:

  March 1: Workshop submission deadline
  March 15: early-bird conference registration deadline

Brian Pinkerton
University of Washington


From /CN=robots-errors/@nexor.co.uk Tue Mar  7 13:57:53 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Mar 1995 14:00:49 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 7 Mar 1995 13:57:53 +0000
Date: Tue, 7 Mar 1995 13:57:53 +0000
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273570:950307135755]
Content-Identifier: CFP: Intellig...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Mar 1995 13:57:53 +0000;
Alternate-Recipient: Allowed
From: Michael Wooldridge <M.Wooldridge@doc.manchester-metropolitan-university.ac.uk>
Message-ID: <9503071353.AA13738@patsy.doc.aca.mmu.ac.uk>
To: /CN=robots/@nexor.co.uk
Subject:  CFP: Intelligent Agents and the Next Information Revolution
X-Mailer: ELM [version 2.3 PL3]
Status: RO
Content-Length: 5133


       INTELLIGENT AGENTS AND THE NEXT INFORMATION REVOLUTION

           A One-Day Meeting of the International CKBS-SIG
              Manchester, UK --- Tuesday May 9th, 1995

           http://www.doc.mmu.ac.uk/STAFF/mike/ckbs95.html

         * Call for Participation * Call for Participation * 


INTRODUCTION

It is by now a cliche that the widespread use of distributed
information services will radically alter the way in which both
organisations and individuals work. There are many indicators of this
coming information revolution. The growth of network technology in
commercial organisations, the routine use of email within academia,
and the astonishing extent of interest in the World Wide Web are three
obvious examples. Yet while the enormous potential presented by
distributed information services is widely recognised, the software
required to fully realise this potential is not yet available. There
are several reasons for this, but among the most important is that
current software paradigms simply do not lend themselves to developing
the kind of applications required. In order to build computer systems
that must operate in large, open, distributed, highly heterogeneous
environments, we must make use of entirely new software technologies.
The concept of an intelligent agent, that can operate autonomously and
rationally on behalf of some user in such complex environments, is
increasingly promoted as the foundation upon which to construct such a
technology. The purpose of this meeting is to bring together
researchers and practitioners interested in realising and exploiting
this important emerging technology.

MEETING STRUCTURE

The day-long meeting will consist of: an introductory overview of the
area and issues; keynote presentations from influential researchers;
long presentations describing major applications, projects, and
research results; and short presentations describing ongoing work. The
emphasis throughout the day will be on informality, discussion, and
informed speculation.

HOW TO PARTICIPATE

If you would like to give a presentation, then email one of the
organisers (below) enclosing your full contact details and a short
(one paragraph) summary of your intended presentation. Topics of
interest include, but are by no means limited to, the following:

  network agents ** WWW agents ** agent-based information systems **
  agents in decision support ** agents for resource location  **
  distributed information services ** software agents/softbots **
  knowbots ** authentication and security issues in cooperative
  systems ** agent communication languages ** KQML and KIF **
  cooperative information retrieval and management ** shared 
  ontologies ** the electronic marketplace ** information management 
  and filtering agents ** knowledge sharing

The deadline for presentation proposals is Friday March 17th.
Presenters will have the opportunity to publish their work in the
CKBS-SIG 1995 proceedings volume.

If you would like to attend without giving a presentation, then please
register by simply emailing one of the organisers, enclosing your full
contact and affiliation details. All are welcome. No charge will be
made for attendance. Please do *not* turn up without registering.

WHEN AND WHERE

The meeting will be held on Tuesday May 9th, 1995, in the Department
of Computing at Manchester Metropolitan University. The Department is
located in the centre of Manchester, an industrial city in the
north-west of England. Manchester has excellent public transport links
(with hourly trains to London), and is served by an international
airport with scheduled flights to all major European centres. Contact
the organisers for more information. Full details (including precise
location and schedule) are available via the meeting WWW page (see
above), and will be provided upon registration.

ORGANISERS

  Michael Wooldridge and Michael Fisher

  Department of Computing, Manchester Metropolitan University
  Chester Street, Manchester M1 5GD, United Kingdom

  email {M.Wooldridge, M.Fisher}@doc.mmu.ac.uk
  tel   (+44 1 61) 247 {1531, 1488}
  fax   (+44 1 61) 247 1483


ABOUT THE CKBS-SIG

The Special Interest Group (SIG) on Cooperating Knowledge Based
Systems (CKBS) was established in 1990, and is funded by the DTI to
provide a focus for UK activities in this area through the
organisation of regular meetings. In addition to the meeting described
above, three other CKBS-SIG events are planned for 1995:

**   The University of Newcastle Upon Tyne is holding a one and a half day
     meeting in June (to be arranged by Andrew Blyth).

**   Loughborough University is holding a one day meeting in September (to
     be arranged by Rachel Jones).

**   Glasgow Caledonian University is holding a one day round table 
     discussion in December (to be arranged by Cherif Branki).

-- 
Michael Wooldridge                 | email M.Wooldridge@doc.mmu.ac.uk
Department of Computing            | http://www.doc.mmu.ac.uk/STAFF/mikew.html
Manchester Metropolitan University | tel  (+44 161) 247 1531 
Chester St., Manchester M1 5GD, UK | fax  (+44 161) 247 1483


From /CN=robots-errors/@nexor.co.uk Thu Apr 13 23:38:16 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 13 Apr 1995 23:40:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 13 Apr 1995 23:38:16 +0100
Date: Thu, 13 Apr 1995 23:38:16 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:078500:950413223817]
Content-Identifier: Want specific...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Apr 1995 23:38:16 
                      +0100;
Alternate-Recipient: Allowed
From: " (Bob Carter)" <carter@cs.bu.edu>
Message-ID: <199504132236.SAA18398@csb.bu.edu>
To: /CN=robots/@nexor.co.uk
Subject:  Want specific size files on various WWW servers
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Status: RO
Content-Length: 422

I am interested in finding objects of specific sizes (1k, 5k ...) on various
WWW servers. 

I have started writing a script to gather this info but having found this
list it seems that one of the other robots may well have generated a log I 
can use instead. 

So, if such a list of HTTP documents and their sizes for some set of WWW 
servers exists please let me know before I duplicate the effort.

Thanks,

 Bob Carter

From /CN=robots-errors/@nexor.co.uk Fri Apr 14 09:57:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 14 Apr 1995 10:01:30 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 14 Apr 1995 09:57:49 +0100
Date: Fri, 14 Apr 1995 09:57:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:160690:950414085755]
Content-Identifier: Re: (fwd) Har...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Apr 1995 09:57:49 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>
Message-ID: <9504140155.aa09734@paris.ics.uci.edu>
To: Partl <partl@hp01.boku.ac.at>
Cc: /CN=robots/@nexor.co.uk, hgsupport@iicm.tu-graz.ac.at
In-Reply-To: <"2235 Wed Apr 12 11:01:44 1995"@nexor.co.uk>
Subject:  Re: (fwd) Harvesters/Spiders/Crawlers/Lycos vs. Hyper-G 
Status: RO
Content-Length: 1372

> Hyper-G has a lot of very good features: Hypermedia in hierarchical
> structures, keyword searches and fulltext searches over sub-hierarchies,
> multi-lingual documents, bi-directional links...
> The WWW interface of Hyper-G ("wwwmaster") introduces the necessary
> "state" of a user's session (the preferred language etc.) into the
> "stateless" HTTP protocol by using dynamic URLs of the form
>    http://host:port/session-id/object-id

Well, now that's a dumb idea -- the state doesn't belong in the URL.

> ...
> Since search engines / spiders / harvesters are most essential for
> finding information in the vast worlwide web, this problem must be
> solved as fast as possible! I suggest that the maintainers of all
> search engines and spiders and harvesters like Lycos, WebCrawler, 
> WWW-Worm, Aliweb, Yahoo etc. etc. contact the maintainers of the
> Hyper-G server at
>    hgsupport@iicm.tu-graz.ac.at
> and agree on a solution to this problem.

The solution is simple -- don't include the state in the URL.
It doesn't belong there and will break far more than just spiders.
Global history lists and hotlists will also fail to work properly.

 ....Roy T. Fielding  Department of ICS, University of California, Irvine USA
                                       <fielding@ics.uci.edu>
                      <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>

From /CN=robots-errors/@nexor.co.uk Mon Apr 17 16:52:35 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 17 Apr 1995 16:54:40 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 17 Apr 1995 16:52:35 +0100
Date: Mon, 17 Apr 1995 16:52:35 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:236170:950417155242]
Content-Identifier: YA mirroring ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 17 Apr 1995 16:52:35 
                      +0100;
Alternate-Recipient: Allowed
From: Cyril Slobin <slobin@feast.fe.msk.ru>
Message-ID: <PK4vealWw7@feast.fe.msk.ru>
To: /CN=robots/@nexor.co.uk
Subject:  YA mirroring robot
Organization: Institute for Commercial Engineering
X-Mailer: Mail/@ [v2.28 FreeBSD]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 1632
Lines: 46

Hello, World!

Yes, I'm writing Yet Another Mirroring Robot.

User-Agent: w3steal/2.1 libwww_perl/0.40

(Version may change of course)

Let me try to discharge myself:

# 10. You will visit the list of known robots before writing a new one.
#     Look for one you can use or modify if necessary before writing a

I have explore three or four of them and wasn't statisfied. My first attempt
was to modify htget/w3mir, but after fixing some bugs in it and finding more
ones I have change my mind. libwww_perl seems much more reliable and much
more understandable then scg libs. Really I'm to rewrite w3mir using
libwww_perl and make it a bit smarter.

# 8.  Post a message to comp.infosystems.www.providers and send mail to
#     robots@nexor.co.uk announcing your intentions to write a robot.

Here is it.

# 6.  Make you set informative headers like 'User-Agent' with
#     the name and version of your robot and 'From' with your email

See above.

# 3.  [Shameless plug] Use wwwbot.pl in Roy Fielding's excellent libwww-perl
#     package, because it implements the latest Robot exclusion protcol

Yes, it's REALLY excellent library.

#     Of course, we have No doubts that you will joyfully provide this
#     information to everyone on the net for free, since most of the

Sources will be put on the Net when debugged.
I will announce them here.

PS - Sorry for my terrible english.
I hope my perl writings are more readable :-)

-- 
Cyril Slobin <slobin@fe.msk.ru> | `And what is the use of a book,' thought
<http://www.fe.msk.ru/~slobin/> | Alice `without pictures or conversation?'

From /CN=robots-errors/@nexor.co.uk Mon Apr 17 17:43:14 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 17 Apr 1995 17:52:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 17 Apr 1995 17:43:14 +0100
Date: Mon, 17 Apr 1995 17:43:14 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:243830:950417164315]
Content-Identifier: Simple robots
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 17 Apr 1995 17:43:14 
                      +0100;
Alternate-Recipient: Allowed
From: " (Nancy Lehrer)" <nlehrer@isx.com>
Message-ID: <9504171642.AA11917@isx.com>
To: /CN=robots/@nexor.co.uk
Subject:  Simple robots
Status: RO
Content-Length: 351

I'm looking for a simple robot that I can modify to take a set of
URLs, traverse and index these pages and create a new set of URL's
which includes the first set's links. Language preference would be
C++/C, Tcl and possibly Perl.

Certainly, there is lots of code out there. Any recommendations?

Thanks,
Nancy Lehrer
ISX Corporation
nlehrer@isx.com


From /CN=robots-errors/@nexor.co.uk Tue Apr 18 06:52:48 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 18 Apr 1995 06:56:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 18 Apr 1995 06:52:48 +0100
Date: Tue, 18 Apr 1995 06:52:48 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039250:950418055252]
Content-Identifier: Re: Simple ro...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 18 Apr 1995 06:52:48 
                      +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>
Message-ID: <9504172250.aa05277@paris.ics.uci.edu>
To: " (Nancy Lehrer)" <nlehrer@isx.com>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9504171642.AA11917@isx.com>
Subject:  Re: Simple robots 
Status: RO
Content-Length: 552

> I'm looking for a simple robot that I can modify to take a set of
> URLs, traverse and index these pages and create a new set of URL's
> which includes the first set's links. Language preference would be
> C++/C, Tcl and possibly Perl.

That is one of the side-effects of MOMspider.

    http://www.ics.uci.edu/WebSoft/MOMspider/


 ....Roy T. Fielding  Department of ICS, University of California, Irvine USA
                                       <fielding@ics.uci.edu>
                      <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>

From /CN=robots-errors/@nexor.co.uk Wed Apr 19 18:17:25 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 19 Apr 1995 18:20:51 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 19 Apr 1995 18:17:25 +0100
Date: Wed, 19 Apr 1995 18:17:25 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:128530:950419171727]
Content-Identifier: Basic clueles...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 19 Apr 1995 18:17:25 
                      +0100;
Alternate-Recipient: Allowed
From: "David M. Chess" <chess@watson.ibm.com>
Message-ID: <"12851 Wed Apr 19 18:17:11 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Basic clueless questions
Status: RO
Content-Length: 2175

I looked over the Robots and /robots.txt stuff awhile back,
and didn't quite understand it at the time.  I've now let it
percolate around a bit in my brain, and since I still don't
quite understand it, I thought I'd expose my basic cluelessness
here by asking if anyone can explain it to me.

As I see it, there are three basic problems that a particular
spider/robot/wwworm can cause:

 1) They can overload some server, or set of servers, by
    doing more or more frequent accesses than a human with
    a normal client would,

 2) They can mess up things like votes, by following "Click
    here if you love artichokes" links without understanding
    what they're doing,

 3) They can increase the general web load to no good
    purpose, by repeatedly asking for lots of information,
    much of which is thrown away (of course, human surfers
    can be guilty of this, too!).

The solution to (1) would seem to be to have some rough guidelines
for how often and how much a spider should hit a given server,
and I see y'all have done that, and that's good.  The solution
to (2) would seem to be some sort of marker in the <A> tag
that means "follow this link only if you really understand
it, and aren't some robot or something".  Sort of like
<a href="votes/artichokes/love" robots=NO> or whatever.
Has there been any talk of that?  The solution to (3) is
at least partly to make robots efficient, not run too
often, make results public, and the other good things that
have been suggested here.  An HTTP verb meaning "give me
only the stuff between the following tags in the following
document" (and/or a more powerful server-side-search-and-extract
protocol) would also help.

The /robots.txt idea doesn't seem designed to solve (1), (2),
or (3), and I'm somewhat unclear on what problems it really
is designed to solve.  Why would I want to exclude certain
web-crawlers from certain subtrees of my server, any more than
I'd want to exclude (say) users of certain browsers, or people
with certain middle names?

- -- -
David M. Chess                     /    Invest for the Nanotech Era:
High Integrity Computing Lab      /             Buy Atoms!
IBM Watson Research

From /CN=robots-errors/@nexor.co.uk Thu Apr 20 04:42:33 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Apr 1995 04:46:44 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Apr 1995 04:42:33 +0100
Date: Thu, 20 Apr 1995 04:42:33 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:206300:950420034236]
Content-Identifier: Re: Basic clu...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 04:42:33 
                      +0100;
Alternate-Recipient: Allowed
From: JamesB@werple.mira.net.au
Message-ID: <208a2303.83d61-JamesB@ArtWorks.mira.net.au>
To: /CN=robots/@nexor.co.uk
References: <"12851 Wed Apr 19 18:17:11 1995"@nexor.co.uk>,
            <chess@watson.ibm.com>
Subject:  Re: Basic clueless questions
Reply-To: JamesB@werple.mira.net.au
X-Mailer: //\\miga Electronic Mail (AmiElm 5.42)
Organization: Melbourne ArtWorks
Status: RO
Content-Length: 1673

Hi David,

[...]
> As I see it, there are three basic problems that a particular
> spider/robot/wwworm can cause:
> 
>  1) They can overload some server, or set of servers, by
>     doing more or more frequent accesses than a human with
>     a normal client would,
> 
>  2) They can mess up things like votes, by following "Click
>     here if you love artichokes" links without understanding
>     what they're doing,

Mostly this can be solved by searching for a '?' character in the URL
If it contains one, it's a query of some sort and isn't much use to
the robot. Pretty crude, but in most circumstances a reasonable restriction.

[...]
> The /robots.txt idea doesn't seem designed to solve (1), (2),
> or (3), and I'm somewhat unclear on what problems it really
> is designed to solve.  Why would I want to exclude certain
> web-crawlers from certain subtrees of my server, any more than
> I'd want to exclude (say) users of certain browsers, or people
> with certain middle names?

I must admit now that you've described it like that it doesn't
seem to make sense. All it will do is give the site maintainer control
over what NICE robots do when run by NICE people. It seems to me that 
NICE people are not the problem. (NICE == thoughtful, responsible, don't
arbitrarily waste bandwidth...). On the good side it doesn't require
modifying any 'standards' and therefore it doesn't require modifying
any software (other than those naughty robots).

Anybody else?

james
-- 
James Burton                                    | 
EMail: JamesB@werple.mira.net.au                | Latrobe University
WWW  : http://www.cs.latrobe.edu.au/~burton/    | Melbourne, Australia


From /CN=robots-errors/@nexor.co.uk Thu Apr 20 12:49:03 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Apr 1995 12:56:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Apr 1995 12:49:03 +0100
Date: Thu, 20 Apr 1995 12:49:03 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296250:950420114923]
Content-Identifier: Re: Basic clu...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 12:49:03 
                      +0100;
Alternate-Recipient: Allowed
From: Gorm Haug Eriksen <g.h.eriksen@usit.uio.no>
Message-ID: <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no>
To: JamesB@werple.mira.net.au
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <208a2303.83d61-JamesB@ArtWorks.mira.net.au>
Subject:  Re: Basic clueless questions
X-Mailer: exmh version 1.5.3 12/28/94
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 1192

JamesB@werple.mira.net.au said on the mailinglist robots@..uk:
-> I must admit now that you've described it like that it doesn't
-> seem to make sense. All it will do is give the site maintainer control
-> over what NICE robots do when run by NICE people. It seems to me that 
-> NICE people are not the problem. (NICE == thoughtful, responsible, don't
-> arbitrarily waste bandwidth...). On the good side it doesn't require
-> modifying any 'standards' and therefore it doesn't require modifying
-> any software (other than those naughty robots).
-> 
-> Anybody else?
-> james

This isn't primarly a robot problem, but it gets to be:

We have seen a couple of commercial products out there that uses 
WWW-wanders/robots to gain and index information on The Web. This information 
is presented to users, that needs to pay for it. I think we will see much more 
of this in the future because people doesn't want to share the information 
their 
wanders has collected. 

What do you people think about this? 
Can we do anything to help? Perhaps make a sort of ultimate index that will be 
saved to disk in a nice format, and let everyone mirror it?

Gorm Haug Eriksen   
USIT / UiO / Norway  


From /CN=robots-errors/@nexor.co.uk Thu Apr 20 13:23:22 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Apr 1995 13:38:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Apr 1995 13:23:22 +0100
Date: Thu, 20 Apr 1995 13:23:22 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007560:950420122335]
Content-Identifier: Re: Basic clu...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 13:23:22 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"636 Thu Apr 20 13:17:46 1995"@nexor.co.uk>
To: Gorm Haug Eriksen <g.h.eriksen@usit.uio.no>
Cc: JamesB@werple.mira.net.au, /CN=robots/@nexor.co.uk
In-Reply-To: <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no>
Subject:  Re: Basic clueless questions 
Status: RO
Content-Length: 1600

In message <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no>, Gorm Haug Eriksen 
writes:

> We have seen a couple of commercial products out there that uses
> WWW-wanders/robots to gain and index information on The Web. This
> information is presented to users, that needs to pay for it. I think
> we will see much more of this in the future because people doesn't
> want to share the information their wanders has collected.

> What do you people think about this?

I personally think this is inevitable. There was a big debate about
this during the Web Conference last week. Should these guys be paying
me for allowing them to use my data as content, or should I pay them
for including me in their service? Endless fun, no solution :-)

> Can we do anything to help? Perhaps make a sort of ultimate index
> that will be saved to disk in a nice format, and let everyone mirror
> it?

This was discussed also. The basic problem is that all these indexers
really want the full content of your pages, so they can differentiate
themselves from others by designing a super-duper
html-parsing/linguistic-analysing/AI-based sumarisers etc.

This kind of limits you to sharing only the URL structure with update
information. Robots are now starting to do that, but with today's
servers it's difficult for maintainers to do this. You'd need
something wonderful like the system described by Pitkow&Jones in
"Towards an intelligent publishing environment" at WWW'95

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Apr 20 06:03:52 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Apr 1995 06:08:41 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Apr 1995 06:03:52 +0100
Date: Thu, 20 Apr 1995 06:03:52 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:217390:950420050355]
Content-Identifier: Re: Basic clu...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 06:03:52 
                      +0100;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9504200505.AA11072@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: Basic clueless questions
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 3021

At  4:42 AM 4/20/95 +0100, JamesB@werple.mira.net.au wrote:
...
>> As I see it, there are three basic problems that a particular
>> spider/robot/wwworm can cause:
>>
>>  1) They can overload some server, or set of servers, by
>>     doing more or more frequent accesses than a human with
>>     a normal client would,
>>
>>  2) They can mess up things like votes, by following "Click
>>     here if you love artichokes" links without understanding
>>     what they're doing,
>
>Mostly this can be solved by searching for a '?' character in the URL
>If it contains one, it's a query of some sort and isn't much use to
>the robot. Pretty crude, but in most circumstances a reasonable restriction.

At this point in the growth/popularity of the Web, it would take a pretty
persistent spider to generate the load that the Web community by mass
inquistiveness can accomplish.  In real practice, a well-run spider doesn't
come close to the bandwidth consumption that idle browsing generates -
particularly when automatic image loading is the norm.  Active servers
aren't really going to notice spider activity unless they go looking for it
(at least for the spiders that are currently active).  I seem to remember
Martijn observing in Darmstadt that we'd outgrown the instability and
recklessness of our youth (well maybe not *exactly* that wording, but hey -
it's late in my timezone...)

>
>[...]
>> The /robots.txt idea doesn't seem designed to solve (1), (2),
>> or (3), and I'm somewhat unclear on what problems it really
>> is designed to solve.  Why would I want to exclude certain
>> web-crawlers from certain subtrees of my server, any more than
>> I'd want to exclude (say) users of certain browsers, or people
>> with certain middle names?
>
>I must admit now that you've described it like that it doesn't
>seem to make sense. All it will do is give the site maintainer control
>over what NICE robots do when run by NICE people. It seems to me that
>NICE people are not the problem. (NICE == thoughtful, responsible, don't
>arbitrarily waste bandwidth...). On the good side it doesn't require
>modifying any 'standards' and therefore it doesn't require modifying
>any software (other than those naughty robots).

The file is intended *precisely* to allow well-behaved agents to be able to
discern what to avoid, with the clear recognition that given an open and
unauthenticated (in general) Web, there is no stopping rogues.  Definition
by agent is useful because not all agents are created equally, and some are
more robust about interesting "gravitation wells" than others...

- Dave

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D     Web: http://ricis.cl.uh.edu/eichmann/
Software Engineering Program           Phone: (713) 283-3875
University of Houston - Clear Lake       fax: (713) 283-3869
Box 113, 2700 Bay Area Blvd.           Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                         or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From /CN=robots-errors/@nexor.co.uk Thu Apr 20 09:29:36 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 20 Apr 1995 09:32:59 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 20 Apr 1995 09:29:36 +0100
Date: Thu, 20 Apr 1995 09:29:36 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:244960:950420082939]
Content-Identifier: Re: Basic clu...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 09:29:36 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"24493 Thu Apr 20 09:29:14 1995"@nexor.co.uk>
To: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9504200505.AA11072@rbse.jsc.nasa.gov>
Subject:  Re: Basic clueless questions 
Status: RO
Content-Length: 2799

In message <9504200505.AA11072@rbse.jsc.nasa.gov>, " (David Eichmann)" writes:

> At this point in the growth/popularity of the Web, it would take a pretty
> persistent spider to generate the load that the Web community by mass
> inquistiveness can accomplish.  In real practice, a well-run spider doesn't
> come close to the bandwidth consumption that idle browsing generates -
> particularly when automatic image loading is the norm.

This seems true for the major spiders currently running. Spiders that
traverse the entire tree, and would visit regularly (say once every 
three days) would of course be noticed.

> Active servers aren't really going to notice spider activity unless
> they go looking for it (at least for the spiders that are currently
> active).  I seem to remember Martijn observing in Darmstadt that
> we'd outgrown the instability and recklessness of our youth (well
> maybe not *exactly* that wording, but hey - it's late in my
> timezone...)

I don't think I was quite that eloquent :-)

> The file is intended *precisely* to allow well-behaved agents to be
> able to discern what to avoid, with the clear recognition that given
> an open and unauthenticated (in general) Web, there is no stopping
> rogues.  Definition by agent is useful because not all agents are
> created equally, and some are more robust about interesting
> "gravitation wells" than others...

To illustrate, take my server. We have some company marketing stuff,
some public services, and some company support stuff. Before
/robots.txt I had spiders happily index all the subfiles of my Mac
archive, and nothing else. This might make my company look like a Mac
house, which we're not. I also had spiders happily index all our bug
reports, and nothing else. This isn't so good for our imago either
:-)

With robots.txt I can control that better. Together with /site.idx this
gives me a reasoneable way of informing robots what they ought to index
on my server, with very little cost on the administrators part and on
the robots part.

There are other cases, especially in large archive servers, where the
subtree exclusion is really useful.

As for:

>>  1) They can overload some server...
>>  2) They can mess up things like votes...
>>  3) They can increase the general web load...

To fix 1, not just for robots but also general use, servers should
have load measurements, and return "Too Busy"... For 2, well, people
who use GET links for non idempotent actions deserve what they get,
they should use POST. Spiders that POST should be shot :-) There isn't
much anyone can enforce much against 3 other than by peer pressure and
charging for bandwidth.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Sat Apr 22 16:43:09 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sat, 22 Apr 1995 16:45:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sat, 22 Apr 1995 16:43:09 +0100
Date: Sat, 22 Apr 1995 16:43:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:144710:950422154311]
Content-Identifier: The indexing ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 22 Apr 1995 16:43:09 
                      +0100;
Alternate-Recipient: Allowed
From: " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA>
Message-ID: <9504221541.AA11731@alize.ERE.UMontreal.CA>
To: /CN=robots/@nexor.co.uk
Subject:  The indexing problem...
X-Mailer: ELM [version 2.4 PL21]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 1221

Hi,

  Here is my suggestion for solving the problem of indexing the web.
It should simplify the works to be done by agents and minimize load on
the net...  It should also provide much more consistent results.

  The HTML should be enhanced with a 'KeyWord' tag that should appear
inside the <HEAD> of the document, something like:

<HEAD>
<KEYWORD list="biology,computing,protein">
</HEAD>

  These keywords would have been choosen by the author of the page
which is the best person to tell what his page is about!

  Also, to avoid transfering the whole page, the servers should be
able to answer to a special query (EXPLORE?  or PROBE?) that would
return two lists:

  - The list of keywords extracted from the KEYWORD tag in the HEAD.
  - And a list of all the 'href' items of the <A> tags.

  It could also send back some info about the pages (length, number of
images, author, last date of modification...).

Are those changes really complicated to implement?

-- 
| Sebastien Lemieux, dept. biol.  ||  Look behind the wave of changes
| lemieuse@alize.ERE.UMontreal.CA ||  Feel the future taking shape
| PGP public key on finger.       ||  I can see the world to come (KJ)
  http://alize.ere.umontreal.ca/~lemieuse/
From /CN=robots-errors/@nexor.co.uk Mon Apr 24 05:42:38 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 24 Apr 1995 05:45:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 24 Apr 1995 05:42:38 +0100
Date: Mon, 24 Apr 1995 05:42:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:004760:950424044247]
Content-Identifier: Re: The index...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 05:42:38 
                      +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU>
Message-ID: <9504240441.AA11998@fuzine.mt.cs.cmu.edu>
To: " (Lemieux Sebastien)" <lemieuse@ere.UMONTREAL.CA>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: The indexing problem...
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 202

I completely disagree.  The author is that last person who
should index the document.

Lycos' philosophy is that a single agent do the indexing,
thus assuring a consistent level of mediocrity.

--Fuzzy

From /CN=robots-errors/@nexor.co.uk Mon Apr 24 11:05:26 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 24 Apr 1995 11:10:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 24 Apr 1995 11:05:26 +0100
Date: Mon, 24 Apr 1995 11:05:26 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050880:950424100528]
Content-Identifier: Re: The index...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 11:05:26 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"5050 Mon Apr 24 11:04:00 1995"@nexor.co.uk>
To: " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9504221541.AA11731@alize.ERE.UMontreal.CA>
Subject:  Re: The indexing problem... 
Status: RO
Content-Length: 1756

In message <9504221541.AA11731@alize.ERE.UMontreal.CA>, " (Lemieux Sebastien)" 
writes:

>  The HTML should be enhanced with a 'KeyWord' tag that should appear
>inside the <HEAD> of the document, something like:
>
><HEAD>
><KEYWORD list="biology,computing,protein">
></HEAD>
>
>  These keywords would have been choosen by the author of the page
>which is the best person to tell what his page is about!

You'll be pleased to know you can already do this with HTML,
using the META tag.

As it happens it's not all that great, as the choices of keywords
are not always easy, and people don't do it anyway :-)

>  Also, to avoid transfering the whole page, the servers should be
>able to answer to a special query (EXPLORE?  or PROBE?) that would
>return two lists:
>
>  - The list of keywords extracted from the KEYWORD tag in the HEAD.
>  - And a list of all the 'href' items of the <A> tags.

If you want this on a per-document basis you should probably do it with
a content-type: text/urls-only. This has been discussed before, and it
appears robots want the whole document anyway...

>  It could also send back some info about the pages (length, number of
>images, author, last date of modification...).

Yes, and the <H?> so you know the structure of the document, and
the first paragraphs, so you have quick introductions, and ... :-)
You pretty quickly end up shipping the whole document.

>Are those changes really complicated to implement?

Changing servers is more an issue than simply coding it up:
you need to provide code for _all_ servers, and then get the new
servers deployed. That's not so easy.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From narnett@Verity.COM Mon Apr 24 17:07:32 1995
Return-Path: <narnett@Verity.COM>
Delivery-Date: Mon, 24 Apr 1995 17:07:45 +0100
Received: from verity.com (actually host unknown-143-5.verity.com) 
          by lancaster.nexor.co.uk with SMTP (XTPP);
          Mon, 24 Apr 1995 17:07:32 +0100
Received: from  by verity.com (4.1/SMI-4.1_Verity-Main-950202)	id AB05399;
          Mon, 24 Apr 95 09:04:57 PDT
X-Sender: narnett@hawaii.verity.com
Message-Id: <abc17b8103021004e1c9@[192.187.143.12]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 24 Apr 1995 09:04:31 -0700
To: Martijn Koster <m.koster@nexor.co.uk>, 
    " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA>
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: The indexing problem...
Cc: /CN=robots/@nexor.co.uk
Status: RO
Content-Length: 598

At 3:05 AM 4/24/95, Martijn Koster wrote:

>As it happens it's not all that great, as the choices of keywords
>are not always easy, and people don't do it anyway :-)

It's also becoming clear that search engines with knowledgebases can do a
better job of assigning keywords than humans can (or will, anyway).  What's
more, such a gizmo can go back and assign new keywords to old articles,
which is impractical for humans, given the volume of documents being
produced.

Even more to the point, keyword searching doesn't provide nearly the
accuracy that is possible with full-text searching.

Nick


From /CN=robots-errors/@nexor.co.uk Mon Apr 24 17:08:07 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 24 Apr 1995 17:12:45 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 24 Apr 1995 17:08:07 +0100
Date: Mon, 24 Apr 1995 17:08:07 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140220:950424160809]
Content-Identifier: Re: The index...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 17:08:07 
                      +0100;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@Verity.COM>
Message-ID: <abc17b8103021004e1c9@[192.187.143.12]>
To: Martijn Koster <m.koster@nexor.co.uk>, 
    " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA>
Cc: /CN=robots/@nexor.co.uk
Subject:  Re: The indexing problem...
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 598

At 3:05 AM 4/24/95, Martijn Koster wrote:

>As it happens it's not all that great, as the choices of keywords
>are not always easy, and people don't do it anyway :-)

It's also becoming clear that search engines with knowledgebases can do a
better job of assigning keywords than humans can (or will, anyway).  What's
more, such a gizmo can go back and assign new keywords to old articles,
which is impractical for humans, given the volume of documents being
produced.

Even more to the point, keyword searching doesn't provide nearly the
accuracy that is possible with full-text searching.

Nick


From /CN=robots-errors/@nexor.co.uk Sun Apr 30 19:38:23 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 30 Apr 1995 19:40:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 30 Apr 1995 19:38:23 +0100
Date: Sun, 30 Apr 1995 19:38:23 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:103490:950430183824]
Content-Identifier: Re: The index...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 30 Apr 1995 19:38:23 
                      +0100;
Alternate-Recipient: Allowed
From: " (Tim Bray)" <tbray@opentext.com>
Message-ID: <m0s5cvs-0001syC@giant.mindlink.net>
To: /CN=robots/@nexor.co.uk
Subject:  Re: The indexing problem...
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Version 2.0.3
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 1211

Nick Arnett @ Verity writes:

>It's also becoming clear that search engines with knowledgebases can do a
>better job of assigning keywords than humans can (or will, anyway).

Just to be fair, it should be noted that lots of pepole disagree with this 
assertion.

What is pretty clear from IR research is that search engines can cluster 
text objects by degree of similarity.  The ability to assign human-credible
keywords is a *VERY* strong claim and one that I haven't seen running code
to support.  There are vendors whose marketing literature makes precisely
this claim, although to be fair to Verity, I don't think they're one of them.

>Even more to the point, keyword searching doesn't provide nearly the
>accuracy that is possible with full-text searching.

I disagree completely.  I think that subject keywords, assigned by unhurried,
professional, intelligent humans, with subject matter expertise, will support
a much greater degree of search accuracy - and I'm a full-text search vendor!
Unfortunately, there is too much material and not enough people to do such
keywording, so we full-text vendors are the only alternative for retrieval
in many situations.

Cheers, Tim Bray, Open Text Corporation


From /CN=robots-errors/@nexor.co.uk Tue May  2 21:10:40 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 2 May 1995 21:12:45 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 2 May 1995 21:10:40 +0100
Date: Tue, 2 May 1995 21:10:40 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:018650:950502201042]
Content-Identifier: unsubsrci(008...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 2 May 1995 21:10:40 +0100;
Alternate-Recipient: Allowed
From: "Kilaman........." <ktutu@meceng.coe.neu.edu>
Message-ID: <199505022010.QAA01821@splinter.coe.neu.edu>
To: /CN=robots/@nexor.co.uk
Subject:  unsubsrci[4?[4?[4?[4?
Status: RO
Content-Length: 12

unsubscribe

From /CN=robots-errors/@nexor.co.uk Tue May  2 22:43:31 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 2 May 1995 22:48:14 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 2 May 1995 22:43:31 +0100
Date: Tue, 2 May 1995 22:43:31 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:030220:950502214332]
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 2 May 1995 22:43:31 +0100;
Alternate-Recipient: Allowed
From: " (Drew Dupont)" <dsdupont@indiana.edu>
Message-ID: <v01510100abcc57a40525@[129.79.18.32]>
To: /CN=robots/@nexor.co.uk
X-Sender: dsdupont@silver.ucs.indiana.edu
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 14

unsubscribe


From /CN=robots-errors/@nexor.co.uk Wed May  3 04:47:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 3 May 1995 04:52:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 3 May 1995 04:47:49 +0100
Date: Wed, 3 May 1995 04:47:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:080330:950503034752]
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 04:47:49 +0100;
Alternate-Recipient: Allowed
From: JERATCLIFF@vaxsar.vassar.edu
Message-ID: <01HQ1NSLMKJ6002WU0@vassar.edu>
To: /CN=robots/@nexor.co.uk
X-Envelope-to: /CN=robots/@nexor.co.uk
X-VMS-To: IN%"/CN=robots/@nexor.co.uk"
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 12

UNSUBSCRIBE

From /CN=robots-errors/@nexor.co.uk Wed May  3 05:11:01 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 3 May 1995 05:17:32 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 3 May 1995 05:11:01 +0100
Date: Wed, 3 May 1995 05:11:01 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:086500:950503041103]
Content-Identifier: The (036)50 p...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 05:11:01 +0100;
Alternate-Recipient: Allowed
From: Atif Ahmad Khan <aak2@Ra.MsState.Edu>
Message-ID: <199505030408.XAA21676@Ra.MsState.Edu>
To: /CN=robots/@nexor.co.uk
Subject:  The $50 prize - results
X-Mailer: ELM [version 2.4 PL22]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 815


Sorry folks, for taking so long to get back.  Had to drive
out of town for the weekend.  Here are the results of the
$50 Prize contest.  I received a total of 13 solutions from
people all around the world.  I still say,  Wow !

First entry was by :

Victor A. Parada
vparada@inf.utfsm.cl

Victor's script works fine and therefore Victor has won the
prize.  I have included his name after getting his permission.

Some people also suggested the "lynx -source" solution that is
simple yet works great.

I am now also looking to be able to submit a "form" using a script.
For example is there a way to submit data using a script or any 
other automated means to a simple form at : 
http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/example-1.html 
?

Thanx a million.

Atif Khan 
aak2@ra.msstate.edu


From /CN=robots-errors/@nexor.co.uk Wed May  3 16:11:03 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 3 May 1995 16:15:41 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 3 May 1995 16:11:03 +0100
Date: Wed, 3 May 1995 16:11:03 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050920:950503151108]
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 16:11:03 +0100;
Alternate-Recipient: Allowed
From: William Glantz <wglantz@VNET.IBM.COM>
Message-ID: <kjdthAs91H4yBiXY52@rchland.ibm.com>
To: /CN=robots/@nexor.co.uk
Reply-To: William Glantz <wglantz@VNET.IBM.COM>
Status: RO
Content-Length: 112

unsubscribe 
. 
Wm Glantz [An Andrew ToolKit view (a footnote) was included here, but
could not be displayed.] 

From /CN=robots-errors/@nexor.co.uk Thu May  4 17:02:44 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 4 May 1995 17:09:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 4 May 1995 17:02:44 +0100
Date: Thu, 4 May 1995 17:02:44 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019070:950504160246]
Content-Identifier: list
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 17:02:44 +0100;
Alternate-Recipient: Allowed
From: KHANCOCK@acc.rwu.edu
Message-ID: <7210061645.ccSSN652@acc.rwu.edu>
To: /CN=robots/@nexor.co.uk
Subject:  list
Status: RO
Content-Length: 13

unsubscribe


From /CN=robots-errors/@nexor.co.uk Thu May  4 17:26:50 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 4 May 1995 17:36:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 4 May 1995 17:26:50 +0100
Date: Thu, 4 May 1995 17:26:50 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:024500:950504162651]
Content-Identifier: Administrativ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 17:26:50 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"2384 Thu May  4 17:25:31 1995"@nexor.co.uk>
To: KHANCOCK@acc.rwu.edu
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <7210061645.ccSSN652@acc.rwu.edu>
Subject:  Administrativa (was Re: list )
Status: RO
Content-Length: 974

In message <7210061645.ccSSN652@acc.rwu.edu>, KHANCOCK@acc.rwu.edu writes:

> unsubscribe

Please stop sending unsubscribe messages to the list.
Like thousands of mailing lists, you send subscribe
and unsubscribe requests to '-request', i.e.
robots-request@nexor.co.uk. If you send it to the
list, the message will be distributed to all hundreds
of subscribed people.

With this particular list server, you just put the
words "unsubscribe", "stop" on separate lines in the
body of the message, and it all happens automatically.

If you have problems, mail the list owner (e) direct.
I usually respond within minutes of receiving them...

Yes, I know this is going to whole list too, I am just
trying to prevent other people also making that mistake.
No, I my list manager can't filter them out, and I can't
switch list managers.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu May  4 19:35:43 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 4 May 1995 19:42:55 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 4 May 1995 19:35:43 +0100
Date: Thu, 4 May 1995 19:35:43 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:049980:950504183553]
Content-Identifier: ditto...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 19:35:43 +0100;
Alternate-Recipient: Allowed
From: wlam@MIT.EDU
Message-ID: <9505041637.AA06666@m33-222-3.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject:  ditto...
Status: RO
Content-Length: 12

unsubscribe

From /CN=robots-errors/@nexor.co.uk Thu May  4 21:59:44 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 4 May 1995 22:06:14 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 4 May 1995 21:59:44 +0100
Date: Thu, 4 May 1995 21:59:44 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:075070:950504205946]
Content-Identifier: Another (036)...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 21:59:44 +0100;
Alternate-Recipient: Allowed
From: Atif Ahmad Khan <aak2@Ra.MsState.Edu>
Message-ID: <199505042058.PAA06365@Ra.MsState.Edu>
To: /CN=robots/@nexor.co.uk
Subject:  Another $50 challenge !
X-Mailer: ELM [version 2.4 PL22]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 399


If you can send me a script/program/utility that can be run through
cron and can submit some data to the following form and get me the
results, you win $50.

http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/example-1.html

I will try to monitor incoming mail after every hour and will send a 
message to the mailing list as soon as we have a winner. 

Atif Khan
aak2@ra.msstate.edu

From /CN=robots-errors/@nexor.co.uk Thu May  4 22:11:45 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 4 May 1995 22:15:46 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 4 May 1995 22:11:45 +0100
Date: Thu, 4 May 1995 22:11:45 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077630:950504211147]
Content-Identifier: unsubscribe
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 22:11:45 +0100;
Alternate-Recipient: Allowed
From: LJLJ@aol.com
Message-ID: <950504171046_106926701@aol.com>
To: /CN=robots/@nexor.co.uk
Subject:  unsubscribe
Status: RO
Content-Length: 12

unsubscribe

From /CN=robots-errors/@nexor.co.uk Fri May  5 04:39:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 5 May 1995 04:53:26 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 5 May 1995 04:39:49 +0100
Date: Fri, 5 May 1995 04:39:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:142920:950505034020]
Content-Identifier: Rash of unsub...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 04:39:49 +0100;
Alternate-Recipient: Allowed
From: " (Dave Balderstone)" <balderd@crocus.sasknet.sk.ca>
Message-ID: <v02110102abcf594fd0a7@[142.165.5.138]>
To: /CN=robots/@nexor.co.uk
Subject:  Rash of unsubscribes
X-Sender: balderd@mailhost.sasknet.sk.ca
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Status: RO
Content-Length: 921

I have not sent any "unsubscribe" messages to the list, but my attempts at
convincing the listserver to allow me to unsubscribe have been futile. I am
closing this account in 24 hours, so messages will simply bounce back.
C'est la vie. Goodbye.
I'm commenting on this just to say that as an experienced user, this
listserver appears to have some serious problems. It is bouncing my
*repeated* requests to unsubscribe (says I am not a member) yet keeps
sending me messages. I can understand others' frustrations.

Dave Balderstone
(new address: balderstone@producer.com)

Dave Balderstone, Manager Business Analysis |    balderd@producer.com
Western Producer Publications               |    --------------------
2310 Millar Ave, Saskatoon, Canada S7K 2C4  |            OR
Voice 306-665-3545, Fax 306-665-9614        |  75211.3630@compuserve.com
--------------------------------------------------------------------------


From /CN=robots-errors/@nexor.co.uk Fri May  5 07:52:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 5 May 1995 07:59:09 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 5 May 1995 07:52:59 +0100
Date: Fri, 5 May 1995 07:52:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:159710:950505065300]
Content-Identifier: Results of th...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 07:52:59 +0100;
Alternate-Recipient: Allowed
From: Atif Ahmad Khan <aak2@Ra.MsState.Edu>
Message-ID: <199505050606.BAA11080@Ra.MsState.Edu>
To: /CN=robots/@nexor.co.uk
Subject:  Results of the 2nd challenge !
X-Mailer: ELM [version 2.4 PL22]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 191


I received only 1 solution and that too was from Victor A. Parada
(vparada@inf.utfsm.cl).  It works wonderfully. 

I'll be back soon with a third challenge :)

Atif Khan
aak2@ra.msstate.edu

From /CN=robots-errors/@nexor.co.uk Fri May  5 09:57:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 5 May 1995 10:02:17 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 5 May 1995 09:57:59 +0100
Date: Fri, 5 May 1995 09:57:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:191960:950505085800]
Content-Identifier: Amusing robot...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 09:57:59 +0100;
Alternate-Recipient: Allowed
From: m.koster@nexor.co.uk
Message-ID: <"19193 Fri May  5 09:57:40 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Amusing robots.txt post - comp.infosystems.www.providers #17113
Status: RO
Content-Length: 2225

For those who missed it:

In article <3o8po7$g7l@news.cerf.net>, paulp@nic.cerf.net (Paul Phillips) writes:
|> Found on comp.security.unix.  Especially note his last line.
|> 
|>  -PSP
|> 
|> From dave@maths.newcastle.edu.au Wed May  3 13:38:38 PDT 1995
|> Newsgroups: comp.security.unix
|> Subject: Re: httpd security
|> Message-ID: <3o1doa$hog@seagoon.newcastle.edu.au>
|> From: dave@maths.newcastle.edu.au (David M. Williams)
|> Date: Mon, 1 May 95 02:32:58 BST
|> Organization: The University of Newcastle
|> Lines: 41
|> 
|> kevin kohn (ess2426) wrote:
|> : I've noticed several httpd request coming in requesting this file:
|> 
|> :    robots.txt
|> 
|> : Does anyone know if this is a possible file inserted or created by a
|> : hack attempt?  The first time is saw it, I really didn't pay attention.
|> : But I've had atleast 10 requests from about 4 different sources for this
|> : file.
|> 
|>  Yes - I would like to know the answer to this question also.  I have had
|> hundreds of requests for this file on my httpd server.  Eventually I created
|> the following file robots.txt ...
|> 
|> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|>  So many people seem to want to access the file "robots.txt" that I thought
|> I'd create one for people to look at.
|> 
|>  Get a life pal - you'll never make a hacker - and I'll always be a step ahead
|> of you.
|> 
|> Regards,
|>  System Manager
|> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|> 
|>  Since I did this the number of requests has been dramatically reduced!
|> 
|> Regards,
|>  David Williams
|> 
|> --
|>       .---.        .-----------
|>      /     \  __  /    ------
|>     / /     \(  )/    -----         David M. Williams
|>    //////   ' \/ `   ---            System Manager
|>   //// / // :    : ---              Computing Services
|>  // /   /  /`    '--                University of Newcastle
|> //          //..\\                  dave@maths.newcastle.edu.au
|>        ----UU----UU-------------------------------------------------
|>            '//||\\`
|>              ''``

-- Martijn
__________
Internet: m.koster@nexor.co.uk
WWW: http://web.nexor.co.uk/users/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Sat May  6 02:51:39 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sat, 6 May 1995 02:53:50 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sat, 6 May 1995 02:51:39 +0100
Date: Sat, 6 May 1995 02:51:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:051140:950506015141]
Content-Identifier: Re: Amusing r...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 6 May 1995 02:51:39 +0100;
Alternate-Recipient: Allowed
From: " (James Burton)" <JamesB@werple.mira.net.au>
Message-ID: <209f218e.222e0-JamesB@ArtWorks.mira.net.au>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"19193 Fri May  5 09:57:40 1995"@nexor.co.uk>
Subject:  Re: Amusing robots.txt post - comp.infosystems.www.providers #17113
Reply-To: JamesB@werple.mira.net.au
X-Mailer: //\\miga Electronic Mail (AmiElm 5.42)
Organization: Melbourne ArtWorks
Status: RO
Content-Length: 1756

> |> kevin kohn (ess2426) wrote:
> |> : I've noticed several httpd request coming in requesting this file:
> |> 
> |> :    robots.txt
> |> 
> |> : Does anyone know if this is a possible file inserted or created by a
> |> : hack attempt?  The first time is saw it, I really didn't pay attention.
> |> : But I've had atleast 10 requests from about 4 different sources for this
> |> : file.
> |> 
> |>  Yes - I would like to know the answer to this question also.  I have had
> |> hundreds of requests for this file on my httpd server.  Eventually I created
> |> the following file robots.txt ...
> |> 
> |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> |>  So many people seem to want to access the file "robots.txt" that I thought
> |> I'd create one for people to look at.
> |> 
> |>  Get a life pal - you'll never make a hacker - and I'll always be a step ahead
> |> of you.
> |> 
> |> Regards,
> |>  System Manager
> |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> |> 
> |>  Since I did this the number of requests has been dramatically reduced!

You guys may laugh. But I did almost exectly this when I first set up my
HTTPd (at a place I worked at until Jan95). I kept getting these requests
for a strangley named non-existent file. I too wondered whether somebody
was trying to exploit a 'well known' security hole. I eventually created
the file with a message asking why they hell were these people were trying
to access the file.

Strangely enough I never got a reply :-)

James
-- 
James Burton                                    | 
EMail: JamesB@werple.mira.net.au                | Latrobe University
WWW  : http://www.cs.latrobe.edu.au/~burton/    | Melbourne, Australia


From /CN=robots-errors/@nexor.co.uk Tue May  9 11:53:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 9 May 1995 11:57:46 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 9 May 1995 11:53:59 +0100
Date: Tue, 9 May 1995 11:53:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019920:950509105400]
Content-Identifier: any comments ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 May 1995 11:53:59 +0100;
Alternate-Recipient: Allowed
From: " (Reinier Post)" <reinpost@win.tue.nl>
Message-ID: <199505091053.MAA12536@wsinis10.win.tue.nl>
To: /CN=robots/@nexor.co.uk
Subject:  any comments on AUtoWinNet?
Reply-To: reinpost@win.tue.nl
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Status: RO
Content-Length: 707

Just thought you might be interested in hearing about a new shareware tool
for MS-Windows: AutoWinNet, it will graze the Internet while you're playing
tennis.  Quote: 'Now you can have unlimited access to TERABYTES of Internet
files without forcing you to sit in front of your computer for hours, upload
files, hammer at FTP sites that are busy, get updates to your favorite
programs automatically - direct from their support site.'

I found this at

  http://www.computek.net:80/physics/

and there is no mention of it in the list of robots,

  http://web.nexor.co.uk/mak/doc/robots/active.html

Any comments?  (I'm hoping for another of Martijn's outbursts ;-)

-- 
Reinier Post						 reinpost@win.tue.nl

From /CN=robots-errors/@nexor.co.uk Tue May  9 18:33:57 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 9 May 1995 18:37:02 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 9 May 1995 18:33:57 +0100
Date: Tue, 9 May 1995 18:33:57 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:112980:950509173358]
Content-Identifier: Re: any comme...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 May 1995 18:33:57 +0100;
Alternate-Recipient: Allowed
From: Billy Barron <billy@utdallas.edu>
Message-ID: <199505091732.MAA29086@utdallas.edu>
To: reinpost@win.tue.nl
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199505091053.MAA12536@wsinis10.win.tue.nl>
Subject:  Re: any comments on AUtoWinNet?
X-WWW-Page: http://www.utdallas.edu/acc/billy.html
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 842

In reply to /CN=robots-errors/@nexor.co.uk's message:
>
>Just thought you might be interested in hearing about a new shareware tool
>for MS-Windows: AutoWinNet, it will graze the Internet while you're playing
>tennis.  Quote: 'Now you can have unlimited access to TERABYTES of Internet
>files without forcing you to sit in front of your computer for hours, upload
>files, hammer at FTP sites that are busy, get updates to your favorite
>programs automatically - direct from their support site.'
>
>I found this at
>
>  http://www.computek.net:80/physics/
>
>Any comments?  (I'm hoping for another of Martijn's outbursts ;-)
>
Right in my backyard too (oh boy!).  It looks no more dangerous than
FTP Mirror (if it was easier to use).  I think all it does is grab
files at off hours.  I don't see anything it in that looks
like a robot.

Billy

From /CN=robots-errors/@nexor.co.uk Wed May 10 12:20:55 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 10 May 1995 12:26:39 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 10 May 1995 12:20:55 +0100
Date: Wed, 10 May 1995 12:20:55 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:015750:950510112057]
Content-Identifier: Re: any comme...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 10 May 1995 12:20:55 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"1551 Wed May 10 12:19:54 1995"@nexor.co.uk>
To: Billy Barron <billy@utdallas.edu>
Cc: reinpost@win.tue.nl, /CN=robots/@nexor.co.uk
In-Reply-To: <199505091732.MAA29086@utdallas.edu>
Subject:  Re: any comments on AUtoWinNet? 
Status: RO
Content-Length: 643

In message <199505091732.MAA29086@utdallas.edu>, Billy Barron writes
[about AutoWinNet]:

> Right in my backyard too (oh boy!).  It looks no more dangerous than
> FTP Mirror (if it was easier to use).  I think all it does is grab
> files at off hours.  I don't see anything it in that looks
> like a robot.

I couldn't actually get it to work (complains about not being 
registered when trying WWW access, and ignores my FTP requests :-).
Anyway, it looks like it only does single files/documents.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu May 11 16:39:20 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 11 May 1995 16:43:23 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 11 May 1995 16:39:20 +0100
Date: Thu, 11 May 1995 16:39:20 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296490:950511153921]
Content-Identifier: Please let me...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 11 May 1995 16:39:20 
                      +0100;
Alternate-Recipient: Allowed
From: Judy Feder <Judy_Feder@cq.com>
Message-ID: <9505111727.AA3827@worldcom-18.worldcom.com>
To: /CN=robots/@nexor.co.uk
Subject:  Please let me know how I can subscribe to this group
Mime-Version: 1.0
Content-Type: Text/Plain
Status: RO
Content-Length: 75

Thanks!

Judith Feder
Director of Market Relations
ConQuest Software, Inc.

From /CN=robots-errors/@nexor.co.uk Tue May 23 09:22:17 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 23 May 1995 09:28:23 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 23 May 1995 09:22:17 +0100
Date: Tue, 23 May 1995 09:22:17 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:277250:950523082220]
Content-Identifier: url mirror
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 23 May 1995 09:22:17 
                      +0100;
Alternate-Recipient: Allowed
From: Loic Dachary <loic@afp.com>
Message-ID: <199505230812.EAA12359@pinot.par.afp.com>
To: /CN=robots/@nexor.co.uk
Subject:  url mirror
Status: RO
Content-Length: 1628


	Hi,

	I'm writing a robot to maintain my emacs-w3 cache directory in
synch with the net.
	The emacs-w3 cache is a file tree that maps the url name space
on a file name space.

	Here is a short example that can give you an idea.

w3cache-1/loic/http/org/w3/www/hypertext/DataSources/
w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/
w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html
w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html.hdr
w3cache-1/loic/http/org/w3/www/Overview.html.hdr
w3cache-1/loic/http/org/cnidr/
w3meta/loic/http/org/w3/www/hypertext/DataSources/
w3meta/loic/http/org/w3/www/hypertext/DataSources/bySubject/
w3meta/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html.header
w3meta/loic/http/org/w3/www/Overview.html.error
w3meta/loic/http/org/cnidr/
	
	To prevent obsolescence of this tree I need a robot that will
check if each file still mimics a valid URL or not.
	I've written a small perl script that does the job, using the
libwww-perl library. I'm writing the robot exclusion code now.

	Please keep me informed if you know about a robot that
would be able to map the URL name space into the file name space,
gracefully handling the errors (redirection, time out etc.). Ideally
I'd like a file system that would transparently do this for me. I'd be
able to use the thousand tools I have to manipulate files instead of
the few that know about URLs. *sigh*

	Thanks,

	Loic

P.S. I will be activating my robot every week or every two weeks. I
guess that it will do an average of 500 requests each time it is run.

From /CN=robots-errors/@nexor.co.uk Tue May 23 09:54:09 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 23 May 1995 11:50:34 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 23 May 1995 09:54:09 +0100
Date: Tue, 23 May 1995 09:54:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:284990:950523085413]
Content-Identifier: Re: url mirror
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 23 May 1995 09:54:09 
                      +0100;
Alternate-Recipient: Allowed
From: " (Reinier Post)" <reinpost@win.tue.nl>
Message-ID: <199505230840.KAA05341@wsinis10.win.tue.nl>
To: " (Loic Dachary)" <loic@afp.com>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199505230812.EAA12359@pinot.par.afp.com>
Subject:  Re: url mirror
Reply-To: reinpost@win.tue.nl
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Status: RO
Content-Length: 234

You (Loic Dachary) write:

>	Hi,
>
>	I'm writing a robot to maintain my emacs-w3 cache directory in
>synch with the net.

Why don't you use CERN httpd's proxy cache and some existing robot?

-- 
Reinier Post						 reinpost@win.tue.nl

From /CN=robots-errors/@nexor.co.uk Fri May 26 00:22:55 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 26 May 1995 00:25:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 26 May 1995 00:22:55 +0100
Date: Fri, 26 May 1995 00:22:55 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:182460:950525232257]
Content-Identifier: Indexing non-...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 26 May 1995 00:22:55 
                      +0100;
Alternate-Recipient: Allowed
From: " (Tim Bray)" <tbray@opentext.com>
Message-ID: <m0sElGL-0001wpC@giant.mindlink.net>
To: /CN=robots/@nexor.co.uk
Cc: lauren@sqwest.bc.ca
Subject:  Indexing non-ASCII characters in Web pages
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Version 2.0.3
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Status: RO
Content-Length: 1948

One of the problems that comes up in building/maintaining our Web Index
(or any other) is the proper handling of non-ASCII characters.  Right now,
such characters are de facto stored in a fairly random assortment of=20
8-bit ISO-Latin codes, HTML/SGML entity references, and then there's a
certain amount of JIS for Japanese, which may be New-JIS, Shift-JIS, or
EUC-JIS (some of which are easy to handle, others of which conflict with
ISO-Latin, sigh...).

Anyhow, it seems that at indexing time, when you've copied in someone
else's page, you ought to convert all characters to some canonical form
before indexing them, to make the lives of people searching your index
easier.  One would like this form to be:

 (a) compact
 (b) a stable vendor-neutral standard
 (c) easy to type in
 (d) workable in combination with popular browser technology

The most standard way would be to use entities for everything, which=20
conflicts with (a) and (c), and the current repertoire is not=20
comprehensive, and if they can't agree on basic HTML extensions, why should=
 we=20
expect them to agree on hard/political stuff like char entities?  The most=
=20
compact way  would be to just use ISO Latin, and char. entities for those=20
*really* hard characters (you'll always need some if only for &gt;,=20
&amp;, and so on).

As for which form best satisfies (d), all bets are off; I'd like to lock
the vendors in a room and not provide a toilet until they'd agreed on
how someone should type the French word for summer (&eacu;t&eacu;, =E9t=E9,=
 take
your pick) into (a) an HTML editor and (b) an HTML form, and what the client=
=20
sends up the pipe when said form is GET/POSTed.

Unicode may be the way to go, but for now they ain't no real software and
furthermore a lot of people in Japan are against it.

Don't suppose anyone out there happened to solve this one in his/her
spare time...=20

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)

From /CN=robots-errors/@nexor.co.uk Tue May 30 08:46:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 30 May 1995 08:47:14 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 30 May 1995 08:46:59 +0100
Date: Tue, 30 May 1995 08:46:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:204740:950530074704]
Content-Identifier: Admin
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 30 May 1995 08:46:59 
                      +0100;
Alternate-Recipient: Allowed
From: World Wide Web <www@nexor.co.uk>
Message-ID: <"20467 Tue May 30 08:46:41 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Admin
Status: RO
Content-Length: 134


I'm switching this list to moderated for now, to prevent all
these subscription requests etc. If you have any problems let
me know.


From /CN=robots-errors/@nexor.co.uk Fri Jun  9 10:48:31 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 9 Jun 1995 10:48:51 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 9 Jun 1995 10:48:31 +0100
Date: Fri, 9 Jun 1995 10:48:31 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: m.koster@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294190:950609094833]
Content-Identifier: HTTP library
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 10:48:31 +0100;
Alternate-Recipient: Allowed
From: Luc Ihli <Luc.Ihli@cginn.cgs.fr>
Message-ID: <"29345 Fri Jun  9 10:47:22 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  HTTP library
Status: RO
Content-Length: 392


Can someone tell where I could find a PERL or C HTTP library.
I would like to have a function to issue HTTP requests to HTTP servers.

It's urgent.

Thanks.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Luc Ihli				Luc.Ihli@cginn.cgs.fr
Cap Gemini Innovation			Tel: +33 76.76.47.37
7 Chemin du Vieux Chene			Fax: +33 76.76.47.48
ZIRST  4206
38942  Meylan  CEDEX
FRANCE

From /CN=robots-errors/@nexor.co.uk Fri Jun  9 14:40:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 9 Jun 1995 14:41:13 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 9 Jun 1995 14:40:59 +0100
Date: Fri, 9 Jun 1995 14:40:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: m.koster@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:051970:950609134100]
Content-Identifier: Libraries
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 14:40:59 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"5165 Fri Jun  9 14:39:50 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Libraries
Status: RO
Content-Length: 1192


------- Forwarded Message

Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 9 Jun 1995 10:48:51 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 9 Jun 1995 10:48:31 +0100
Date: Fri, 9 Jun 1995 10:48:31 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: m.koster@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294190:950609094833]
Content-Identifier: HTTP library
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 10:48:31 +0100;
Alternate-Recipient: Allowed
From: Luc Ihli <Luc.Ihli@cginn.cgs.fr>
Message-ID: <"29345 Fri Jun  9 10:47:22 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  HTTP library


Can someone tell where I could find a PERL or C HTTP library.
I would like to have a function to issue HTTP requests to HTTP servers.

It's urgent.

Thanks.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Luc Ihli				Luc.Ihli@cginn.cgs.fr
Cap Gemini Innovation			Tel: +33 76.76.47.37
7 Chemin du Vieux Chene			Fax: +33 76.76.47.48
ZIRST  4206
38942  Meylan  CEDEX
FRANCE

------- End of Forwarded Message


From /CN=robots-errors/@nexor.co.uk Mon Jun 12 09:53:01 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 12 Jun 1995 09:55:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 12 Jun 1995 09:53:01 +0100
Date: Mon, 12 Jun 1995 09:53:01 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 marques@vax.oxford.ac.uk, s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 treadwell@maine.com, tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, 
                 skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rick@vivo.com, 
                 rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, miyata@musys.com, 
                 mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 mdmays@server.iadfw.net, mdg1@hidec01.engr.uark.edu, 
                 mcbryan@cs.colorado.edu, marym@Finesse.COM, 
                 marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kumarv@apple.com, KUL@FSC.CA, 
                 kstamper@cybernetics.net, konop@techunix.technion.ac.il, 
                 kojo@ccs.mt.nec.co.jp, kljackso@cerfnet.com, 
                 king@hobart.cs.umass.edu, kimoto@people.flab.fujitsu.co.jp, 
                 kcoffee@panix.com, kball@Novell.COM, jvisagie@active.co.za, 
                 Judy_Feder@cq.com, jswift@timber.infohwy.com, 
                 jsteer@BitScout.com, jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, 
                 hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, 
                 helper@law.uark.edu, hardy@powell.cs.colorado.edu, 
                 hajime@st.rim.or.jp, habermann@dow.com, 
                 gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, 
                 gregg@fly2.berkeley.edu, grayson@char.vnet.net, 
                 grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, 
                 gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077150:950612085303]
Content-Identifier: Robot Announce
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 12 Jun 1995 09:53:01 
                      +0100;
Alternate-Recipient: Allowed
From: " (Yoshihiko HAYASHI)" <hayashi@nttnly.isl.ntt.jp>
Message-ID: <199506120855.RAA11047@nttnly.isl.ntt.jp>
To: /CN=robots/@nexor.co.uk
Cc: titan-admin@nttnly.isl.ntt.jp
Subject:  Robot Announce
Return-Path: <hayashi@nttnly.isl.ntt.jp>
Status: RO
Content-Length: 609

Greetings.

We are going to run a robot program named TITAN/0.1 in order to
collect text (html/plain) files from the WWW space. Our primary
purpose is to develop an advanced method for analyzing and indexing
the documents on the WWW. The robot is based on a perl library,
libwww-perl (http://www.ics.uci.edu/WebSoft/libwww-perl/), which is
being developed by the Arcadia Project at UCI. If the robot, by
chance, visits your site, please let him in.

Comments and suggestions should go to "titan-admin@nttnly.isl.ntt.jp".
Thank you.

--
Yoshihiko HAYASHI
NTT Information and Communication Systems Laboratories

From /CN=robots-errors/@nexor.co.uk Wed Jun 14 02:05:57 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 14 Jun 1995 02:11:58 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 14 Jun 1995 02:05:57 +0100
Date: Wed, 14 Jun 1995 02:05:57 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 marques@vax.oxford.ac.uk, s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 treadwell@maine.com, tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, 
                 skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rick@vivo.com, 
                 rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, miyata@musys.com, 
                 mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 medlar@ua.com, mdmays@server.iadfw.net, 
                 mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, 
                 marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kumarv@apple.com, KUL@FSC.CA, 
                 kstamper@cybernetics.net, konop@techunix.technion.ac.il, 
                 kojo@ccs.mt.nec.co.jp, kljackso@cerfnet.com, 
                 king@hobart.cs.umass.edu, kimoto@people.flab.fujitsu.co.jp, 
                 Kelly_Carney@stortek.com, kcoffee@panix.com, kball@Novell.COM, 
                 jvisagie@active.co.za, Judy_Feder@cq.com, 
                 jswift@timber.infohwy.com, jsteer@BitScout.com, 
                 jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, 
                 hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, 
                 helper@law.uark.edu, hardy@powell.cs.colorado.edu, 
                 hajime@st.rim.or.jp, habermann@dow.com, 
                 gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, 
                 gregg@fly2.berkeley.edu, grayson@char.vnet.net, 
                 grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, 
                 gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:158720:950614010600]
Content-Identifier: how many glob...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 02:05:57 
                      +0100;
Alternate-Recipient: Allowed
From: " (Paul Francis)" <francis@cactus.slab.ntt.jp>
Message-ID: <9506140103.AA26385@cactus.slab.ntt.jp>
To: /CN=robots/@nexor.co.uk
Subject:  how many global robots do we need?
Status: RO
Content-Length: 449


Hi,

Like a number of you out there, I would also like to get a
list of as many of the world's URLs as I can.  But, since
a number of robots are already out collecting this stuff,
it seems silly to add yet another.  So, what I'm wondering
is:

1.  Is there anybody out there with a large list of URLs
    willing to share it?
2.  If so, is anybody set up to send out periodic updates
    of their lists (what's new, what's obsolete)?

Thanks,

PF


From /CN=robots-errors/@nexor.co.uk Wed Jun 14 05:23:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 14 Jun 1995 05:26:16 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 14 Jun 1995 05:23:49 +0100
Date: Wed, 14 Jun 1995 05:23:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 marques@vax.oxford.ac.uk, s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 treadwell@maine.com, tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, 
                 skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rick@vivo.com, 
                 rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, miyata@musys.com, 
                 mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 medlar@ua.com, mdmays@server.iadfw.net, 
                 mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, 
                 marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kumarv@apple.com, KUL@FSC.CA, 
                 kstamper@cybernetics.net, konop@techunix.technion.ac.il, 
                 kojo@ccs.mt.nec.co.jp, kljackso@cerfnet.com, 
                 king@hobart.cs.umass.edu, kimoto@people.flab.fujitsu.co.jp, 
                 Kelly_Carney@stortek.com, kcoffee@panix.com, kball@Novell.COM, 
                 jvisagie@active.co.za, Judy_Feder@cq.com, 
                 jswift@timber.infohwy.com, jsteer@BitScout.com, 
                 jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, 
                 hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, 
                 helper@law.uark.edu, hardy@powell.cs.colorado.edu, 
                 hajime@st.rim.or.jp, habermann@dow.com, 
                 gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, 
                 gregg@fly2.berkeley.edu, grayson@char.vnet.net, 
                 grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, 
                 gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:190970:950614042351]
Content-Identifier: Re: how many ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 05:23:49 
                      +0100;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9506140422.AA16671@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: how many global robots do we need?
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 1576

At  2:05 AM 6/14/95 +0100,  (Paul Francis) wrote:
>Like a number of you out there, I would also like to get a
>list of as many of the world's URLs as I can.  But, since
>a number of robots are already out collecting this stuff,
>it seems silly to add yet another.  So, what I'm wondering
>is:
>
>1.  Is there anybody out there with a large list of URLs
>    willing to share it?

Our current list is rather dated.  We have version two of the RBSE Spider
just about ready for prime-time - we're currently porting to Postgres95 to
see if we can achieve better stability of the database engine.  When we get
the port done, we'll be building up an incrementally updateable index.

>2.  If so, is anybody set up to send out periodic updates
>    of their lists (what's new, what's obsolete)?

One of the features of version 2 will be an interface allowing folks to
execute raw queries - supporting pretty much anything you wish to
interrogate the database about.  More details when we go public with the
interface.  This will include the ability to query as of any given
time/date in the life of the Spider, using Postgres' temporal features.

- Dave

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D     Web: http://ricis.cl.uh.edu/eichmann/
Software Engineering Program           Phone: (713) 283-3875
University of Houston - Clear Lake       fax: (713) 283-3869
Box 113, 2700 Bay Area Blvd.           Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                         or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From /CN=robots-errors/@nexor.co.uk Wed Jun 14 09:01:27 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 14 Jun 1995 09:05:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 14 Jun 1995 09:01:27 +0100
Date: Wed, 14 Jun 1995 09:01:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 marques@vax.oxford.ac.uk, s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 treadwell@maine.com, tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, 
                 skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rick@vivo.com, 
                 rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, miyata@musys.com, 
                 mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 medlar@ua.com, mdmays@server.iadfw.net, 
                 mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, 
                 marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kumarv@apple.com, KUL@FSC.CA, 
                 kstamper@cybernetics.net, konop@techunix.technion.ac.il, 
                 kojo@ccs.mt.nec.co.jp, kljackso@cerfnet.com, 
                 king@hobart.cs.umass.edu, kimoto@people.flab.fujitsu.co.jp, 
                 Kelly_Carney@stortek.com, kcoffee@panix.com, kball@Novell.COM, 
                 jvisagie@active.co.za, Judy_Feder@cq.com, 
                 jswift@timber.infohwy.com, jsteer@BitScout.com, 
                 jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, 
                 hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, 
                 helper@law.uark.edu, hardy@powell.cs.colorado.edu, 
                 hajime@st.rim.or.jp, habermann@dow.com, 
                 gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, 
                 gregg@fly2.berkeley.edu, grayson@char.vnet.net, 
                 grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, 
                 gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:219190:950614080131]
Content-Identifier: Re: how many ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 09:01:27 
                      +0100;
Alternate-Recipient: Allowed
From: " (Paul Francis)" <francis@cactus.slab.ntt.jp>
Message-ID: <9506140759.AA03235@cactus.slab.ntt.jp>
To: /CN=robots/@nexor.co.uk, eichmann@rbse.jsc.nasa.gov
Subject:  Re: how many global robots do we need?
Status: RO
Content-Length: 706

>  
>  One of the features of version 2 will be an interface allowing folks to
>  execute raw queries - supporting pretty much anything you wish to
>  interrogate the database about.  More details when we go public with the
>  interface.  This will include the ability to query as of any given
>  time/date in the life of the Spider, using Postgres' temporal features.
>  

So, someone will be able to make a query of the sort:

  "get me all URLs entered into the database since Wed Jun 14 13:12:53 1995"?

(understanding, of course, that you may not support natural
language query   :-)

When will such a thing be available, and how many of URLs will you
expect it to contain steady state?

Thanks,

PF


From /CN=robots-errors/@nexor.co.uk Fri Jun 16 04:31:53 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 16 Jun 1995 04:41:31 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 16 Jun 1995 04:31:53 +0100
Date: Fri, 16 Jun 1995 04:31:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 viral@dcs.qmw.ac.uk, marques@vax.oxford.ac.uk, 
                 s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 tsmith@cmp.com, treadwell@maine.com, 
                 tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, 
                 skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rlandon@scruznet.com, 
                 rick@vivo.com, rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, mjoly@insa.insa-lyon.fr, 
                 miyata@musys.com, mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 medlar@ua.com, mdmays@server.iadfw.net, 
                 mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, 
                 marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kvale@ivy.physics.mcmaster.ca, 
                 kumarv@apple.com, KUL@FSC.CA, kstamper@cybernetics.net, 
                 konop@techunix.technion.ac.il, kojo@ccs.mt.nec.co.jp, 
                 kljackso@cerfnet.com, king@hobart.cs.umass.edu, 
                 kimoto@people.flab.fujitsu.co.jp, Kelly_Carney@stortek.com, 
                 kcoffee@panix.com, kball@Novell.COM, jvisagie@active.co.za, 
                 Judy_Feder@cq.com, jswift@timber.infohwy.com, 
                 jsteer@BitScout.com, jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, 
                 hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, 
                 helper@law.uark.edu, hardy@powell.cs.colorado.edu, 
                 hajime@st.rim.or.jp, habermann@dow.com, 
                 gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, 
                 gregg@fly2.berkeley.edu, grayson@char.vnet.net, 
                 grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, 
                 gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, amills@rmplc.co.uk, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:074660:950616033203]
Content-Identifier: Re: how many ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 16 Jun 1995 04:31:53 
                      +0100;
Alternate-Recipient: Allowed
From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov>
Message-ID: <9506160329.AB18212@rbse.jsc.nasa.gov>
To: /CN=robots/@nexor.co.uk
Subject:  Re: how many global robots do we need?
X-Sender: eichmann@192.88.42.10
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: 7BIT
Status: RO
Content-Length: 1796

At  4:59 PM 6/14/95 +0900, Paul Francis wrote:
>>
>>  One of the features of version 2 will be an interface allowing folks to
>>  execute raw queries - supporting pretty much anything you wish to
>>  interrogate the database about.  More details when we go public with the
>>  interface.  This will include the ability to query as of any given
>>  time/date in the life of the Spider, using Postgres' temporal features.
>>
>
>So, someone will be able to make a query of the sort:
>
>  "get me all URLs entered into the database since Wed Jun 14 13:12:53 1995"?
>
>(understanding, of course, that you may not support natural
>language query   :-)

Yes, as well as
        "get me all URLs in the database that have changed since
         Wed Jun 14 13:12:53 1995"
among other things.

>When will such a thing be available, and how many of URLs will you
>expect it to contain steady state?

We'll be going public as soon as we get our database engine stable.
Postgres 4.2 has been proving problematic for us.  Postgres95 appears to
have resolved our problems, but uses a SQL variant instead of TQUEL as its
programming language.  We haven't decided just how wide a net to throw.
We're planning on attempting to generate coverage of one or more conceptual
areas (e.g. computer science / software engineering) rather than making a
run at Lycos' total URL count.

- Dave

-----------
David Eichmann
Asst. Prof. / RBSE Director of R & D     Web: http://ricis.cl.uh.edu/eichmann/
Software Engineering Program, Box 113  Phone: (713) 283-3875
University of Houston - Clear Lake       fax: (713) 283-3869
2700 Bay Area Blvd.                    Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058                         or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html


From /CN=robots-errors/@nexor.co.uk Fri Jun 16 22:59:54 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 16 Jun 1995 23:04:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 16 Jun 1995 22:59:54 +0100
Date: Fri, 16 Jun 1995 22:59:54 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, 
                 fred@nvg.unit.no, christen.krogh@filosofi.uio.no, 
                 Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, 
                 floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, 
                 andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, 
                 musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, 
                 altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, 
                 Sean.McGrath@UCG.ie, 
                 rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, 
                 Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, 
                 sac@compsci.stirling.ac.uk, 
                 phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, 
                 R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, 
                 viral@dcs.qmw.ac.uk, marques@vax.oxford.ac.uk, 
                 s.j.masterton@open.ac.uk, 
                 username%cerfnet.com@nsfnet-relay.ac.uk, 
                 murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, 
                 line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, 
                 kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, 
                 isamu%hopf.dnai.com@nsfnet-relay.ac.uk, 
                 gorme%ifi.uio.no@nsfnet-relay.ac.uk, 
                 bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, 
                 oklee@computer-science.nottingham.ac.uk, 
                 darren.sanders@northumbria.ac.uk, 
                 mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, 
                 lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, 
                 js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, 
                 jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, 
                 howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, 
                 gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, 
                 alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, 
                 M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, 
                 brownc@computer-science.manchester.ac.uk, 
                 csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, 
                 kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, 
                 ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, 
                 David.Halls@computer-lab.cambridge.ac.uk, 
                 omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , 
                 zandy@cedar.buffalo.edu, wyliamh@sco.COM, 
                 wozz@chewy.wookie.net, wlee@chaph.usc.edu, 
                 wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, 
                 wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, 
                 username@cerfnet.com, ulla@stupi.se, 
                 twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, 
                 tsmith@cmp.com, treadwell@maine.com, 
                 tong@numbat.cs.rmit.edu.au, tomd@bga.com, 
                 tomasic@almaden.ibm.com, tom@jax-inter.net, 
                 tom@iiidns.iii.org.tw, tom@amnh.org, 
                 tobor@psycco.msae.wisc.edu, tmorling@rtm.com, 
                 tmaslen@verity.com, tm3k+@andrew.cmu.edu, 
                 tkoyt@theseas.ntua.gr, tjc4@cornell.edu, 
                 TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, 
                 tierno.bah@his.com, thrift@osage.csc.ti.com, 
                 thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, 
                 therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, 
                 telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, 
                 tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, 
                 swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, 
                 support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, 
                 ssutarwala@cc1.dttus.com, ssandke@Verity.COM, 
                 srinivas@cs.ualberta.ca, slobin@feast.fe.msk.ru, 
                 skipmoon@chattanooga.net, skg@shiva.pls.com, 
                 SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
                 shah2170@css1s0.engr.ccny.cuny.edu, 
                 sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, 
                 seb1@gate.net, schwartz@latour.cs.colorado.edu, 
                 sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, 
                 satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, 
                 samy@ru.cs.gmr.com, sallystan@aol.com, 
                 s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, 
                 rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, 
                 Rshearer@cris.com, rs_butner@ccmail.pnl.gov, 
                 rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, 
                 ROMANBP@delphi.com, roland@technet.sg, 
                 roger@hazelton.demon.co.uk, rob@iconics.com, 
                 rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rlandon@scruznet.com, 
                 rick@vivo.com, rhb@hotsand.att.com, rgh@slc.unisys.com, 
                 reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, 
                 reader@server.blueline.com, rcalder@cfara1.harvard.edu, 
                 rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, 
                 pypodima@athena.auth.gr, pvp@intgp1.att.com, 
                 pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, 
                 pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, 
                 plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, 
                 phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, 
                 pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, 
                 ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, 
                 ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, 
                 nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, 
                 neutron@swttools.fc.hp.com, narnett@verity.com, 
                 naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, 
                 mowens@advtech.uswest.com, mouche@metronet.com, 
                 moriya@st.rim.or.jp, mkgray@MIT.EDU, mjoly@insa.insa-lyon.fr, 
                 miyata@musys.com, mikeo@world.std.com, mikeg@innovation.com, 
                 Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, 
                 mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, 
                 medlar@ua.com, mdmays@server.iadfw.net, 
                 mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, 
                 marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, 
                 Mark_M_Lee@ccm.ch.intel.com, 
                 mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, 
                 mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, 
                 lwarne01@ccsf.cc.ca.us, luciw@starwave.com, 
                 loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, 
                 loic@afp.com, logan@cs.cornell.edu, 
                 lists@konishiki.stanford.edu, lgg@cs.brown.edu, 
                 lentz@annie.astro.nwu.edu, leefi@microsoft.com, 
                 LBURKE@gco5.pb.gov.bc.ca, kvale@ivy.physics.mcmaster.ca, 
                 kumarv@apple.com, KUL@FSC.CA, kstamper@cybernetics.net, 
                 konop@techunix.technion.ac.il, kojo@ccs.mt.nec.co.jp, 
                 kljackso@cerfnet.com, king@hobart.cs.umass.edu, 
                 kimoto@people.flab.fujitsu.co.jp, Kelly_Carney@stortek.com, 
                 kcoffee@panix.com, kball@Novell.COM, jvisagie@active.co.za, 
                 Judy_Feder@cq.com, jswift@timber.infohwy.com, 
                 jsteer@BitScout.com, jshakes@cs.washington.edu, jrb@cs.pdx.edu, 
                 joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, 
                 jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, 
                 joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, 
                 jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, 
                 jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, 
                 jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, 
                 jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, 
                 jamesb@werple.mira.net.au, jamesb@optical.fiber.net, 
                 jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, 
                 its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, 
                 ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, 
                 hseuping@cs.utexas.edu, hschilling@lerc.nasa.gov, 
                 HotShield@aol.com, horsager@wln.com, hornlo@okra.millsaps.edu, 
                 hewett@cs.utexas.edu, helper@law.uark.edu, 
                 hardy@powell.cs.colorado.edu, hajime@st.rim.or.jp, 
                 habermann@dow.com, gsk@edgtuws1.qed.qld.gov.au, 
                 gregors@edo032pc.pipe.nova.ca, gregg@fly2.berkeley.edu, 
                 grayson@char.vnet.net, grahamh@mail.mpx.com.au, 
                 gr@pogo.ccd.bnl.gov, gpl53044@uxa.cso.uiuc.edu, gjv@io.org, 
                 gfowler@wilkins.iaims.bcm.tmc.edu, 
                 garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, 
                 FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, 
                 francis@cactus.slab.ntt.jp, flash@cyber.net, 
                 finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, 
                 fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, 
                 ew974@nextsun.ins.cwru.edu, 
                 essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, 
                 eoh@hacom.nl, emery@squawfish.fsr.com, 
                 eichmann@rbse.jsc.nasa.gov, efinet@insist.com, 
                 edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, 
                 duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, 
                 dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, 
                 dhart@titanic.cs.umass.edu, detter@databank.com, 
                 dcornwal@mail.utexas.edu, dbakin@sybase.com, 
                 DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, 
                 david@police.tas.gov.au, daveg@fultech.com, 
                 Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, 
                 cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, 
                 close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, 
                 chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, 
                 choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, 
                 charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, 
                 carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, 
                 bynum@CS.WM.EDU, burkhart@tis.andersen.com, 
                 BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, 
                 brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, 
                 bonnie@dev.prodigy.com, bonini@panix.com, 
                 bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, 
                 bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, 
                 benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, 
                 beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, 
                 barrie@scs.unr.edu, barrett@almaden.ibm.com, 
                 bal@martigny.ai.mit.edu, atc@ornl.gov, 
                 ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, 
                 amonge@cs.ucsd.edu, amohesky@itg.ti.com, amills@rmplc.co.uk, 
                 allsop@swttools.fc.hp.com, allmedia@world.std.com, 
                 allied@biddeford.com, allain@waiter.ira.rl.af.mil, 
                 Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, 
                 ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, 
                 adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, 
                 a-mikebi@microsoft.com, 100627.2502@compuserve.com, 
                 0004103477@mcimail.com, 
                 igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, 
                 jims%globalvillag.com@pmail.globalvillag.com, 
                 NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, 
                 PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, 
                 BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, 
                 Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, 
                 ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, 
                 adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, 
                 nwoh@software-ag.de, ralf@egd.igd.fhg.de, 
                 stegmann@rzo2.sari.fh-wuerzburg.de, 
                 Andreas.Ley@rz.uni-karlsruhe.de, 
                 olafabbe@w250zrz.zrz.tu-berlin.d400.de, 
                 pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, 
                 jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, 
                 casey@ptsun00.cern.ch, pam@sunbim.be, 
                 Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, 
                 /CN=robots-archive/@nexor.co.uk
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:289590:950616215955]
Content-Identifier: ANNOUNCE: Web...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 16 Jun 1995 22:59:54 
                      +0100;
Alternate-Recipient: Allowed
From: "Victor Parada G." <vparada@inf.utfsm.cl>
Message-ID: <199506162158.AA19839@camahueto.inf.utfsm.cl>
To: /CN=robots/@nexor.co.uk
Subject:  ANNOUNCE: WebCopy 0.97b now available.
Status: RO
Content-Length: 735

Hola mundo.

I've just released a new version of WebCopy (0.97b), a command-line
http-protocol file retriever with recursivity.

It includes some new features:

- better and more flexible code
- proxy support
- POST method

The new on-line documentation is at the same location as ever:

<URL:http://www.inf.utfsm.cl/~vparada/webcopy.html>

I'd like some feedback about it, to release version 1.0 as soon as
posible.

Bye...   ++Vitoco
--
Lic. Victor A. Parada                  __     __     Universidad Tecnica
Ingenieria Civil en Informatica     o-''))_____\\    Federico Santa Maria,
mailto:vparada@inf.utfsm.cl         "--__/ * * * )   Valparaiso, CHILE.
http://www.inf.utfsm.cl/~vparada/   c_c__/-c____/    +56 32 626364 x431 :-)

From /CN=robots-errors/@nexor.co.uk Fri Jun 23 15:42:02 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 23 Jun 1995 15:47:49 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 23 Jun 1995 15:42:02 +0100
Date: Fri, 23 Jun 1995 15:42:02 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:028450:950623144204]
Content-Identifier: weblayers
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 23 Jun 1995 15:42:02 
                      +0100;
Alternate-Recipient: Allowed
From: Loic Dachary <loic@afp.com>
Message-ID: <199506231435.QAA29867@pinot.par.afp.com>
To: /CN=robots/@nexor.co.uk
Cc: hugues@afp.com, thy@univ-paris8.fr, zull@coplanet.fr, ps@shiraz.uplift.fr, 
    jcc@france.sun.com
Subject:  weblayers
Status: RO
Content-Length: 1335


  I sent a mail dated Tue May 23 10:12:26 +0200 1995 to announce that
a robot to maintain a emacs-w3 cache directory in
synch with the net was under construction. 
  I finally released it under the name weblayers.

  Here is a http://web.nexor.co.uk/mak/doc/robots/active.html like
entry that describes it:

<hr>

    <h2>weblayers</h2>
    
    <a href="http://www.univ-paris8.fr/~loic/weblayers/">weblayers</a>
    
    is maintained by <a href="http://www.univ-paris8.fr/~loic/">Loic
    Dachary</a> <a
    href="mailto:loic@afp.com">&lt;loic@afp.com&gt;</a>.
    <p>
    Its purpose is to validate, cache and maintain links.
    <p>
    The HTTP <code>User-agent</code> field is set to
    'weblayers/0.0'.
    <p>
    The
    <a href="http://web.nexor.co.uk/users/mak/doc/robots/norobots.html">
    Proposed
    Standard for Robot Exclusion</a>
    is supported.<p>It is a standalone program
written in <a href="http://web.nexor.co.uk/public/perl/perl.html">Perl
5</a>.<p>
    It is designed to maintain the cache generated by the emacs
<a href="http://www.cs.indiana.edu/elisp/w3/docs.html">w3 mode</a>
(N*tscape replacement) and to support annotated documents (keep them
in sync with the original document via diff/patch).
    <p>This information was last updated on
      Fri Jun 23 16:30:42 FRE 1995.
<hr>


	Cheers,
			Loic

From /CN=robots-errors/@nexor.co.uk Fri Jun 23 18:16:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 23 Jun 1995 18:20:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 23 Jun 1995 18:16:49 +0100
Date: Fri, 23 Jun 1995 18:16:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:067900:950623171653]
Content-Identifier: Re: weblayers
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 23 Jun 1995 18:16:49 
                      +0100;
Alternate-Recipient: Allowed
From: Brian Joseph Starr <bstarr@monet.ICS.UCI.EDU>
Message-ID: <9506231011.aa01907@paris.ics.uci.edu>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199506231435.QAA29867@pinot.par.afp.com>
Subject:  Re: weblayers 
Status: RO
Content-Length: 107

I wonder how I can get off this mailing list?  I've tried
unsubscribe, but it doesn't seem to work.

Brian

From /CN=robots-errors/@nexor.co.uk Sun Jun 25 13:54:23 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Sun, 25 Jun 1995 13:57:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Sun, 25 Jun 1995 13:54:23 +0100
Date: Sun, 25 Jun 1995 13:54:23 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:247000:950625125424]
Content-Identifier: How to unsubs...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 25 Jun 1995 13:54:23 
                      +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"24698 Sun Jun 25 13:54:03 1995"@nexor.co.uk>
To: Brian Joseph Starr <bstarr@monet.ICS.UCI.EDU>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9506231011.aa01907@paris.ics.uci.edu>
Subject:  How to unsubscribe (was Re: weblayers )
Status: RO
Content-Length: 923

In message <9506231011.aa01907@paris.ics.uci.edu>, Brian Joseph Starr writes:

> I wonder how I can get off this mailing list? 

Before anyone else asks:

http://web.nexor.co.uk/users/mak/doc/robots/mailing-list.html:

  To unsubscribe, DO NOT send an unsubscribe message to
  robots@nexor.co.uk, but send a message to robots-request@nexor.co.uk
  with the words "unsubscribe", "stop" on separate lines in the body.
  If you have problems, mail the list owner <m.koster@nexor.co.uk>

> I've tried unsubscribe, but it doesn't seem to work.

It doesn't work for you because you subscribed as bstarr@liege.ICS.UCI.EDU,
not bstarr@monet.ICS.UCI.EDU, and apart from that you got stung by a mail
routing problem...

> Brian

I have unsubscribed you.

RSN these issues will get fixed)

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Tue Jul  4 10:07:50 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 4 Jul 1995 10:14:27 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 4 Jul 1995 10:07:50 +0100
Date: Tue, 4 Jul 1995 10:07:50 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:065790:950704090803]
Content-Identifier: web searches ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 10:07:50 +0100;
Alternate-Recipient: Allowed
From: James Lick <jlick@shoreside.com>
Message-ID: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>
To: /CN=robots/@nexor.co.uk
Cc: wwwstaff@qrd.org, hawkeye@tcp.com
Subject:  web searches index under wrong hostname
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 4025


Hello Robot Gurus (plus cc folks),

	I've perused the standard sources for information on the problem I
have and haven't come up with any solutions as yet.  I am hoping that
through this forum I may discover the information I need to fix this or if
not possible to help bring out changes in the search engines to make it
possible. 

	Basically, I have a web server which is on a machine known under
several names.  Due to this, various web search robots have indexed pages
under pretty much all of the various names, so that some pages show under
one host name, some under another, and some even under several hostnames. 

	In general this is ugly, but not of too much concern.  Most of the
hostnames it is known by always point to the same host at all times. 
E.G., tcp.com and www.tcp.com and venice.tcp.com are always 128.95.44.29. 
However, I also do a mirror of the Queer Resources Directory which has an
hostname www.qrd.org which cycles randomly through the various servers the
QRD is mirrored on.  One time you may get tcp.com's address, other times
the server in Israel, or the one in San Francisco. 

	Now the big problem is that the web search engines are going and
finding a server "www.qrd.org" which at the moment points to tcp.com, and
goes to index all the pages on there, including all the pages which are
not part of the QRD, but are other archives or personal pages, etc. 

	Later on, someone executes a search for one of our non-QRD pages
and finds a reference with url pointing to www.qrd.org, which they follow,
and due to the luck of the draw get the server in Iowa which never heard
of that page. 

	Needless to say, my users who are not in the QRD section are
getting quite irked that they are being indexed under a name that only
works maybe 10% of the time. 

	I've tried out some things to see if I could force the server to
give out information to force the client to pick up what I consider to be
the "correct" hostname in the url which is "http://www.tcp.com/...". 

	Unfortunately this does not seem to be possible in the current
scheme of things. 

	My first attempt was adding in "URI:" and "Location:" fields in
the meta-headers.  No go, the clients don't care about these unless they
get a 3xx type response, i.e. a redirect or moved, etc. 

	OK, so why not just send a redirect?  Oops, can't do that.  The
part we care about changing is the host part, and that has been stripped
off by the time the http server gets it, so it doesn't know whether you
used a "correct" hostname or not.  You can't just do a global redirect
either since this will just loop.  (Fortunately the client is smart enough
to abort this loop.)

	The only progress I've made at all is finding the HTML BASE tag. 
Unfortunately it is only half a solution, and I'm not even sure if the web
searchers interpret it correctly or at all.  What the BASE tag does is
specify the base URL to use when interpreting any relative URLs in the
page.  For example, you can put in: 

<BASE HREF="http://www.tcp.com/">

	Then if there is a link further down the page such as: 

<A HREF="~jlick/">Foo!</a>

	Then clicking on it would load "http://www.tcp.com/~jlick/" no
matter what hostname was used to originally get there.  As mentioned this
is only a halfway solution since it only affects links made from that page
instead of the page itself. 

	Another possibility is to split to two servers, using the cloned
network interface method or seperate machines.  Unfortunately in this case
it is not possible since this server is a guest on another's network, and
getting another IP address is not possible at this time. 

	Another possibility is to get robot admins to purge their server
lists of "floating hosts" such as www.qrd.org and mirror only the actual
hosts.  This might be the best short-term solution but I'm not clear of
the feasibility of this. 

	Thanks for the consideration of this problem and I look forward 
to any responses.

--- James Lick -- jlick@tcp.com -- http://www.tcp.com/~jlick/ for more info ---


From /CN=robots-errors/@nexor.co.uk Tue Jul  4 11:45:31 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 4 Jul 1995 11:49:39 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 4 Jul 1995 11:45:31 +0100
Date: Tue, 4 Jul 1995 11:45:31 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:087550:950704104533]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 11:45:31 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"8700 Tue Jul  4 11:44:48 1995"@nexor.co.uk>
To: James Lick <jlick@shoreside.com>
Cc: /CN=robots/@nexor.co.uk, wwwstaff@qrd.org, hawkeye@tcp.com
In-Reply-To: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>
Subject:  Re: web searches index under wrong hostname 
Status: RO
Content-Length: 3028

In message <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>, J
ames Lick writes:

> Basically, I have a web server which is on a machine known under
> several names.  Due to this, various web search robots have indexed
> pages under pretty much all of the various names, so that some pages
> show under one host name, some under another, and some even under
> several hostnames.

You can't just blame that on the robots; I've seen people link to the
wrong names too :-( If a robot then follows that you're out of luck.

I do think robots should at least keep track of IP addresses to
prevent pointless duplicate transfers.

> 	Now the big problem is that the web search engines are going
> and finding a server "www.qrd.org" which at the moment points to
> tcp.com, and goes to index all the pages on there, including all the
> pages which are not part of the QRD, but are other archives or
> personal pages, etc.

That's interesting, because there are no relative links on th QRD
stuff to other stuff on your server. How do robots guess these other
URL's?  By trying an empty path component to get to the root and
working down from there?

> 	Unfortunately this does not seem to be possible in the current
> scheme of things. 

No, I can't think of anything either.
 
> 	My first attempt was adding in "URI:" and "Location:" fields in
> the meta-headers.  No go, the clients don't care about these unless they
> get a 3xx type response, i.e. a redirect or moved, etc. 

This is something I've been wondering about, and something that should
probably be suggested for HTTP/1.1: It'd be nice if a preferred URL
could be sent along, ie have a Redirect within a server, without an
extra round trip (not unlike the If-modified-since).

> 	Another possibility is to split to two servers, using the cloned
> network interface method or seperate machines.

I think that's the only technical option at the moment.

> 	Another possibility is to get robot admins to purge their server
> lists of "floating hosts" such as www.qrd.org and mirror only the actual
> hosts.  This might be the best short-term solution but I'm not clear of
> the feasibility of this. 

As another short-term solution one could extend the /robots.txt to include
a full URL. Then you could say:

URL-Disallow: http://www.qrd.org/private/
URL-Disallow: http://www.tcp.com/qrd/

Of course you'd need some more logic to ensure that these rules are
only applied to the IP address the /robots.txt came from, to prevent
Microsoft disallowing Apple etc :-)

I guess that needs to go on the wish-list.
 
> Thanks for the consideration of this problem and I look forward to
> any responses.

An optimistic note for the future: I believe passing a full URL
including the access and netloc parts is on the wishlist for HTTP/1.1
This would allow a server to be more precise about what URL's it
serves and denies.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Tue Jul  4 14:27:09 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 4 Jul 1995 14:32:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 4 Jul 1995 14:27:09 +0100
Date: Tue, 4 Jul 1995 14:27:09 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119720:950704132711]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 14:27:09 +0100;
Alternate-Recipient: Allowed
From: " (Reinier Post)" <reinpost@win.tue.nl>
Message-ID: <199507041325.PAA02474@wsinis10.win.tue.nl>
To: " (Martijn Koster)" <m.koster@nexor.co.uk>
Cc: jlick@shoreside.com, /CN=robots/@nexor.co.uk, wwwstaff@qrd.org, 
    hawkeye@tcp.com
In-Reply-To: <"8700 Tue Jul  4 11:44:48 1995"@nexor.co.uk>
Subject:  Re: web searches index under wrong hostname
Reply-To: reinpost@win.tue.nl
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Status: RO
Content-Length: 2056

You (Martijn Koster) write:
>
>In message <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>, J
>ames Lick writes:
>
>> Basically, I have a web server which is on a machine known under
>> several names.  Due to this, various web search robots have indexed
>> pages under pretty much all of the various names, so that some pages
>> show under one host name, some under another, and some even under
>> several hostnames.

This scheme is nice, but it further corrodes the notion of a URL as a
persistent (non-unique) identifier for a document.  If a document was fetched
successfully under a given URL, in my opinion it must be accessible there
forever, unless the document itself expires.  (The same problem arises with
caching script results, and is technically solved with the Expires: header.)

You fail to comply with this criterion, so I would prefer to regard it as an
implementation failure on your side.  If you don't want people or robots to
retrieve certain documents under certain URLs, then don't serve them.  So
the preferred solution, in my opinion, is to teach the server to disallow
requests for documents based on the server name used by the client.  However,
if I understand correctly, there is no way to extract this information from
the connection; the client would need to send it explicitly.  It already
sends this information for the previous document, in the REFERER-header; you
need the information for the current document.  The solution is a new header;
if REFERER headers are set on redirects, you might find a working solution
using redirects and the REFERER header, but it would not work with most
clients.

A better solution, in my opinion: abandon the scheme, and set up www.qrd.org
to serve redirections only.  It's slower, but you'll get the control you need.
You can even set up multiple hosts to serve the redirections.

-- 
Reinier Post						 reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>
[LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK]

From /CN=robots-errors/@nexor.co.uk Tue Jul  4 17:02:47 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 4 Jul 1995 17:06:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 4 Jul 1995 17:02:47 +0100
Date: Tue, 4 Jul 1995 17:02:47 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:150110:950704160249]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 17:02:47 +0100;
Alternate-Recipient: Allowed
From: " (Tim Bray)" <tbray@opentext.com>
Message-ID: <m0sTANi-0001lzC@giant.mindlink.net>
To: /CN=robots/@nexor.co.uk
Subject:  Re: web searches index under wrong hostname
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Version 2.0.3
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 559

James Lick points out that it's hard for robots/indexers to work
around duplicates.  IP address aliases are one problem, but symlinks
and foxy/modified httpd servers and so on all make it impossible in
principle to do this.

However, doesn't the BASE element provide a place to hang a solution
to the problem?  What we, the robot-floggers and indexers of the world,
need to do is get on our high horse and shriek in the conferences and
newsgroups and at the editor vendors and get them to use it.

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)

From /CN=robots-errors/@nexor.co.uk Tue Jul  4 17:23:10 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 4 Jul 1995 17:32:02 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 4 Jul 1995 17:23:10 +0100
Date: Tue, 4 Jul 1995 17:23:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:153430:950704162312]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 17:23:10 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton <bp@webcrawler.com>
Message-ID: <9507041621.AA02352@webcrawler.com>
To: James Lick <jlick@shoreside.com>
Cc: /CN=robots/@nexor.co.uk, wwwstaff@qrd.org, hawkeye@tcp.com
References: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>
Subject:  Re: web searches index under wrong hostname
Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
Original-Received: by NeXT.Mailer (1.118.2)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 938

Martijn and Reinier are right -- there's currently no perfect solution to  
this problem.  I'll second Martijn's wish for the ability (though not the  
requirement) to include full URLs in an HTTP request.

There's just no good way to get URLs right all the time because they are  
issued from the beholder's perspective: if it works, it works!  The most  
common request for a change to the WebCrawler index is a change of the  
hostname part of the URL: in the case where multiple names map to a single IP  
address, the WebCrawler is certain to get half the URLs wrong because it  
identifies servers by unique IP address.

The best current solution I know is to make sure that if a URL works, it will  
always work, and to use virtual hosts (aka APB patches) where more than one  
hostname per physical server is desired.  Virtual hosts are supported by  
Apache (see http://www.apache.org/), and can be hacked in to NCSA httpd.

bri

From /CN=robots-errors/@nexor.co.uk Wed Jul  5 04:53:41 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 5 Jul 1995 04:57:07 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 5 Jul 1995 04:53:41 +0100
Date: Wed, 5 Jul 1995 04:53:41 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:235090:950705035342]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 04:53:41 +0100;
Alternate-Recipient: Allowed
From: " (James Burton)" <james@Snark.apana.org.au>
Message-ID: <20ee5915.be6e0-james@Snark.apana.org.au>
To: /CN=robots/@nexor.co.uk
References: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>,
            <jlick@shoreside.com>
Subject:  Re: web searches index under wrong hostname
X-Mailer: //\\miga Electronic Mail (AmiElm 5.42)
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable
Organization: Melbourne ArtWorks
Status: RO
Content-Length: 1926

> Hello Robot Gurus (plus cc folks),
>=20
>       Basically, I have a web server which is on a machine known unde=
r
> several names.  Due to this, various web search robots have indexed p=
ages
> under pretty much all of the various names, so that some pages show u=
nder
> one host name, some under another, and some even under several hostna=
mes.=20
>=20
>       In general this is ugly, but not of too much concern.  Most of =
the
> hostnames it is known by always point to the same host at all times.=20
> E.G., tcp.com and www.tcp.com and venice.tcp.com are always 128.95.44=
=2E29.=20
> However, I also do a mirror of the Queer Resources Directory which ha=
s an
> hostname www.qrd.org which cycles randomly through the various server=
s the
> QRD is mirrored on.  One time you may get tcp.com's address, other ti=
mes
> the server in Israel, or the one in San Francisco.=20
[...]
>       Thanks for the consideration of this problem and I look forward=
=20
> to any responses.
>=20
> --- James Lick -- jlick@tcp.com -- http://www.tcp.com/~jlick/ for mor=
e info ---

Call me thick but why doesn't the following work.
On all the possible servers of www.qrd.org configure exactly which
URLs are to be accepted. e.g. on the CERN server I can do (in /etc/http=
d.conf)

Pass    /httpd-internal-icons/*         /icons/*
Pass    /*                              /home/WWW/*
Pass    http:*
Pass    ftp:*
Pass    gopher:*
Pass    wais:*
Pass    news:*

Now if I was to (on www.tcp.com) change this to

Pass http://www.tcp.com/*   /home/WWW/*
Pass http://www.qrd.org/*   /home/WWW/QRD/*

and nothing else. Then the indexing robot would never find any wrong UR=
Ls
unless somebody has put a nasty absolute URL in a link

James
--=20
James Burton                                    |=20
EMail: James@Snark.apana.org.au                 | Latrobe University
WWW  : http://www.cs.latrobe.edu.au/~burton/    | Melbourne, Australia


From /CN=robots-errors/@nexor.co.uk Wed Jul  5 08:57:26 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 5 Jul 1995 09:06:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 5 Jul 1995 08:57:26 +0100
Date: Wed, 5 Jul 1995 08:57:26 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:262710:950705075728]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 08:57:26 +0100;
Alternate-Recipient: Allowed
From: " (Reinier Post)" <reinpost@win.tue.nl>
Message-ID: <199507050756.JAA04177@wsinis10.win.tue.nl>
To: james@Snark.apana.org.au
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <20ee5915.be6e0-james@Snark.apana.org.au>
Subject:  Re: web searches index under wrong hostname
Reply-To: reinpost@win.tue.nl
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Status: RO
Content-Length: 357

You (/CN=robots-errors/@nexor.co.uk) write:

>Now if I was to (on www.tcp.com) change this to
>
>Pass http://www.tcp.com/*   /home/WWW/*
>Pass http://www.qrd.org/*   /home/WWW/QRD/*

Is this allowed at all?  The problem is, the server has no way of knowing
by what name it was called, www.tcp.com or www.qrd.org.

-- 
Reinier Post						 reinpost@win.tue.nl

From /CN=robots-errors/@nexor.co.uk Wed Jul  5 09:19:19 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 5 Jul 1995 09:32:30 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 5 Jul 1995 09:19:19 +0100
Date: Wed, 5 Jul 1995 09:19:19 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:267500:950705081931]
Content-Identifier: Re: web searc...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 09:19:19 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"26729 Wed Jul  5 09:18:16 1995"@nexor.co.uk>
To: reinpost@win.tue.nl
Cc: james@Snark.apana.org.au, /CN=robots/@nexor.co.uk
In-Reply-To: <199507050756.JAA04177@wsinis10.win.tue.nl>
Subject:  Re: web searches index under wrong hostname 
Status: RO
Content-Length: 706

In message <199507050756.JAA04177@wsinis10.win.tue.nl>, " (Reinier Post)" write
s:

> >Pass http://www.tcp.com/*   /home/WWW/*
> >Pass http://www.qrd.org/*   /home/WWW/QRD/*
> 
> Is this allowed at all?

I think the above configuration is probably for the proxy side
of the CERN server; a proxy gets a full target URL, complete with
access scheme and hostname, which you can configure access
control on.

> The problem is, the server has no way of knowing
> by what name it was called, www.tcp.com or www.qrd.org.

Indeed, unless they're separate IP addresses.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jul  6 16:14:41 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 6 Jul 1995 16:18:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 6 Jul 1995 16:14:41 +0100
Date: Thu, 6 Jul 1995 16:14:41 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:297060:950706151443]
Content-Identifier: How big is th...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 16:14:41 +0100;
Alternate-Recipient: Allowed
From: Josef Pellizzari <jpellizz@afsmail.cern.ch>
Message-ID: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch>
To: /CN=robots/@nexor.co.uk
Subject:  How big is the Web?
X-Sender: jpellizz@sp066.cern.ch
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 678


I have three simple questions, to which only the Masters of the Robots 
may have the answers:

* I need rough estimates of the current number of 
  servers and URLs.

* In case someone has the data that document the growth of the Web
  ready at hand, I would be happy to receive them.

* Does anyone have an idea when a robot the last time traversed the 
  whole Web?

Thanks for your help!

Josef

----------------------------------------------------------------
 Josef PELLIZZARI		tel : +41 22 767 9627
 CN Division 31 2-013		fax : +41 22 767 7155
 CERN				mail: Josef.Pellizzari@cern.ch
 CH-1211 Geneve 23
----------------------------------------------------------------


From /CN=robots-errors/@nexor.co.uk Thu Jul  6 16:37:06 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 6 Jul 1995 16:43:07 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 6 Jul 1995 16:37:06 +0100
Date: Thu, 6 Jul 1995 16:37:06 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:005060:950706153708]
Content-Identifier: Re: How big i...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 16:37:06 +0100;
Alternate-Recipient: Allowed
From: Chris Eborn <chris@dcs.kingston.ac.uk>
Message-ID: <199507061535.QAA25785@kite.dcs.king.ac.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Re: How big is the Web?
Status: RO
Content-Length: 736


> 
> I have three simple questions, to which only the Masters of the Robots 
> may have the answers:
> 
> * I need rough estimates of the current number of 
>   servers and URLs.

According to Lycos - around 4 million URLs on (at least) 23,550 http servers
(see the pages mentioned below)

> 
> * In case someone has the data that document the growth of the Web
>   ready at hand, I would be happy to receive them.
> 
> * Does anyone have an idea when a robot the last time traversed the 
>   whole Web?
> 
> Thanks for your help!
> 
> Josef

The lycos project has produced some estimates of the size of the web.
Try:

http://lycos.cs.cmu.edu/lycos-websize.html

or more generally:

http://lycos.cs.cmu.edu/lycos-websize.html


-chris

From /CN=robots-errors/@nexor.co.uk Thu Jul  6 20:06:44 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 6 Jul 1995 20:09:40 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 6 Jul 1995 20:06:44 +0100
Date: Thu, 6 Jul 1995 20:06:44 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:024820:950706190645]
Content-Identifier: Re: How big i...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 20:06:44 +0100;
Alternate-Recipient: Allowed
From: Matthew Gray <mkgray@netgen.com>
Message-ID: <Pine.OSF.3.91.950706145934.4743B-100000@thoth.netgen.com>
To: Josef Pellizzari <jpellizz@afsmail.cern.ch>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch>
Subject:  Re: How big is the Web?
Organization: net.Genesis
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 1052

> * I need rough estimates of the current number of 
>   servers and URLs.

More than 23,000 servers.

> * In case someone has the data that document the growth of the Web
>   ready at hand, I would be happy to receive them.

I have monthly data on the number of servers since June of 1993(*) and will 
announce it to the list when I make all the figures available.

Below are figures for every 6 months.  Feel free to redistribute these 
figures, but please keep the attribution of "Matthew Gray 
<mkgray@netgen.com> of net.Genesis Corp" with any graphs or 
representations of the data.

Matthew Gray --------------------------------- voice: (617) 577-9800
net.Genesis                                      fax: (617) 577-9850
56 Rogers St.                                      mkgray@netgen.com 
Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray

Growth of the web, number of sites over time
 6/93     130
12/93     623
 6/94	 2738
12/94   10022
 6/95	23517


(*) Based on the results of my Wanderer, the first wandering web robot.

From /CN=robots-errors/@nexor.co.uk Fri Jul  7 08:55:38 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 7 Jul 1995 09:00:46 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 7 Jul 1995 08:55:38 +0100
Date: Fri, 7 Jul 1995 08:55:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:194810:950707075539]
Content-Identifier: Re: How big i...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 08:55:38 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton <bp@webcrawler.com>
Message-ID: <9507070755.AA08058@webcrawler.com>
To: /CN=robots/@nexor.co.uk
References: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch>
Subject:  Re: How big is the Web?
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Original-Received: by NeXT.Mailer (1.118.2)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 477

Hmmm.  Those estimates [Lycos, net.Genesis] of the number of Web servers =
seem pretty low.  Check out

	http://webcrawler.com/WebCrawler/WebSize.html

for the WebCrawler's data.  Our latest number is just shy of 39,000 unique =
HTTP servers (by IP address).  We can't really compete with Lycos on the =
number of total URLs, but if you take their figure for the average number of =
URLs per server and multiply by our 39K number, then you get something =
around 7M URLs.

bri

From /CN=robots-errors/@nexor.co.uk Fri Jul  7 13:27:07 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 7 Jul 1995 13:35:08 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 7 Jul 1995 13:27:07 +0100
Date: Fri, 7 Jul 1995 13:27:07 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:260960:950707122709]
Content-Identifier: Re: How big i...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 13:27:07 +0100;
Alternate-Recipient: Allowed
From: Billy Barron <billy@utdallas.edu>
Message-ID: <199507071224.HAA29290@utdallas.edu>
To: " (Brian Pinkerton)" <bp@webcrawler.com>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9507070755.AA08058@webcrawler.com>
Subject:  Re: How big is the Web?
X-WWW-Page: http://www.utdallas.edu/acc/billy.html
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Status: RO
Content-Length: 1003

In reply to Brian Pinkerton's message:
>
>Hmmm.  Those estimates [Lycos, net.Genesis] of the number of Web servers =
>seem pretty low.  Check out
>
>	http://webcrawler.com/WebCrawler/WebSize.html
>
>for the WebCrawler's data.  Our latest number is just shy of 39,000 unique =
>HTTP servers (by IP address).  We can't really compete with Lycos on the =
>number of total URLs, but if you take their figure for the average number of =
>URLs per server and multiply by our 39K number, then you get something =
>around 7M URLs.
>
IP address is not really an accurate measurement.  First, some large
sites (e.g. NCSA) use DNS shuffle records so the IP address changes
possibly on repeated queries.  Second, machines with multiple Ethernets
may shown up more than once.  Apache is making this situation worse too.
Even using the FDQN (eliminating aliases) suffer from this problem
and I don't see a good solution to it.

-- 
Billy Barron,  Network Services Manager, Univ of Texas at Dallas
billy@utdallas.edu 

From /CN=robots-errors/@nexor.co.uk Fri Jul  7 13:54:36 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 7 Jul 1995 14:33:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 7 Jul 1995 13:54:36 +0100
Date: Fri, 7 Jul 1995 13:54:36 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:268020:950707125445]
Content-Identifier: Lycos Answer:...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 13:54:36 +0100;
Alternate-Recipient: Allowed
From: " (Michael Mauldin)" <mlm@fuzine.mt.cs.cmu.edu>
Message-ID: <9507071248.AA19441@fuzine.mt.cs.cmu.edu>
To: Brian Pinkerton <bp@webcrawler.com>
Cc: /CN=robots/@nexor.co.uk
Subject:  Lycos Answer: 6.9 million URLs, 57k servers
Original-Received: by NeXT Mailer (1.63)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 232

But that's a lower bound.  Also, since we don't track
servers by IP address, there may well be only 39k machines
hosting those 57k servers.

Our current count is 4.49 million URLs located and 1.015 million
URLs downloaded.

--Fuzzy

From /CN=robots-errors/@nexor.co.uk Thu Jul 13 13:14:03 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 13 Jul 1995 13:17:29 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 13 Jul 1995 13:14:03 +0100
Date: Thu, 13 Jul 1995 13:14:03 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145380:950713121406]
Content-Identifier: unsubscribe
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Jul 1995 13:14:03 
                      +0100;
Alternate-Recipient: Allowed
From: Chung.Kang.Tsen@OZARK.EDRC.CMU.EDU
Message-ID: <"14528 Thu Jul 13 13:13:23 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  unsubscribe
Status: RO
Content-Length: 12

unsubscribe

From /CN=robots-errors/@nexor.co.uk Thu Jul 13 19:13:02 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 13 Jul 1995 19:16:39 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 13 Jul 1995 19:13:02 +0100
Date: Thu, 13 Jul 1995 19:13:02 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:219710:950713181305]
Content-Identifier: unsubscribe
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Jul 1995 19:13:02 
                      +0100;
Alternate-Recipient: Allowed
From: Peter A Schwartz <pschwar@world.std.com>
Message-ID: <Pine.3.89.9507131025.A20327-0100000@world.std.com>
To: /CN=robots/@nexor.co.uk
Subject:  unsubscribe
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 12

unsubscribe

From /CN=robots-errors/@nexor.co.uk Fri Jul 14 18:10:10 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 14 Jul 1995 18:13:53 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 14 Jul 1995 18:10:10 +0100
Date: Fri, 14 Jul 1995 18:10:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:167900:950714171014]
Content-Identifier: Opportunities...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Jul 1995 18:10:10 
                      +0100;
Alternate-Recipient: Allowed
From: John.R.R.Leavitt@NL.CS.CMU.EDU
Message-ID: <"16777 Fri Jul 14 18:09:37 1995"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject:  Opportunities at Lycos, Inc.
Status: RO
Content-Length: 1574

[I hope this doesn't offend anyone.  If it does, I apologize.  Hi, Martijn!]

Employment Opportunities

Lycos, Inc builds, licenses and serves the catalog of the internet. We
are currently seeking individuals for a variety of positions in an
exciting but demanding startup environment:

Technical Positions (Pittsburgh, PA)

Highly skilled computer scientists and MIS professionals. Skills
sought include experience programming networking and operating systems
level (Solaris, Windows NT, OSF/1, SunOS, Windows 3.1, MacOS, ...),
database management, Web Services (HTTP, HTML, CGI, Perl), software
engineering and performance analysis. Technical positions are in the
Pittsburgh area.

Electronic resumes may be mailed to jobs@www.lycos.com. 

Physical resumes may be sent to:

     Lycos, Inc.
     c/o Center for Machine Translation
     Carnegie Mellon University
     4910 Forbes Avenue
     Cyert Hall 2nd floor
     Pittsburgh, PA 15213-3890 


Sales and Marketing (Boston, MA) 

We have a variety of assignments available for net savvy marketing
professionals.

Electronic resumes may be mailed to bdavis@www.lycos.com. 

Physical resumes may be sent to:

     Lycos, Inc.
     187 Ballardvale St.
     Suite B110
     Wilmington, MA 01887-7000 

     Technical Positions 
     Sales and Marketing 


     John R. R. Leavitt | Director, Product Development | Lycos, Inc.
   412 268 7282 | jrrl@lycos.com | http://thule.mt.cs.cmu.edu:8001/jrrl/
            Editor: Omphalos | Member: Pittsburgh Worldwrights
             Reading: All My Sins Remembered by Joe Haldeman

From /CN=robots-errors/@nexor.co.uk Mon Jul 24 01:42:42 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 24 Jul 1995 01:46:56 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Mon, 24 Jul 1995 01:42:42 +0100
Date: Mon, 24 Jul 1995 01:42:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136730:950724004244]
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Jul 1995 01:42:42 
                      +0100;
Alternate-Recipient: Allowed
From: " (Nick Arnett)" <narnett@Verity.COM>
Message-ID: <ac3898f101021004f2a4@[192.187.143.12]>
To: /CN=robots/@nexor.co.uk
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 5452

FYI -- Here's the press release that we'll send out tomorrow morning
regarding our spider.  As you'll see, it not only follows "robots.txt," by
default it requires the presence of a "robots.txt" file.  We thought this
was a safer way to go in a commercial products.  As it also says below, but
I think particularly relevant to the people on this list, it's designed to
index only one site at a time (and by default, only documents "below" the
starting point in the directory tree), so that it won't go wandering off to
discover new resources.  That's not the intent of this product.  It's
designed for companies that have a number of servers for which they want to
build a single index.

I'll be happy to answer whatever questions people on this list might have.

Nick


     VERITY ANNOUNCES REMOTE INDEXING CAPABILITIES
                  FOR TOPIC(r) WEBSEARCHER

     Introducing the First Commercial Web "spider"


MOUNTAIN VIEW, Calif. -- July 24, 1995 -- Verity, Inc., the leading
developer of search and retrieval software for the enterprise and the
Internet, today announced that it is enhancing its Topic WebSearcher
product with remote indexing capabilities. This new Remote Web Indexer will
be the first commercially available indexing robot, or "spider" for the
Web, allowing customers to build a searchable full-text index of any Web
site via the Internet.

"This new indexer makes it easy for our customers to work with distributed
information," said Philippe Courtot, Verity's chairman and CEO.  "The value
of the Web is multiplied by giving users the power to search information
across the organization as well as external sources."

Topic WebSearcher is a sophisticated search and retrieval tool that
incorporates Verity's powerful Topic technology for concept-based search
and relevancy-ranked results.  It is designed to work with Web servers via
a gateway application, giving access to documents from the Web, file
systems, databases and other repositories, supporting multiple formats
including the Web's HyperText Markup Language (HTML), native support for
Adobe Acrobat indexes and more than 50 standard word processing formats.
The new remote Web indexer follows the widely used "robots.txt" exclusion
file convention, which allows Web server administrators to restrict or deny
access to their servers.

"Topic Remote Web Indexer makes it easy for us to build and maintain the
tools that our researchers and others will use to find research
collaborators, sponsors and technology licensees," said Jay Creutz, program
manager at SAIC.  "Verity's search tools are a powerful complement to the
Web's browsing capabilities."

SAIC, a large systems integrator, is using Topic WebSearcher and Topic
Remote Web Indexer to create powerful search for UC-ACCESS, an on-line
system of databases that includes information about technologies,
researchers and data from the nine campuses of the UC system and the UC
Office of the President.

"Our organization has troubles with the labor intensive task of collecting
and organizing all the information we are responsible for so that anyone
can find it whenever they require it," said Brent Allsop of the Support
Technology Center at Hewlett Packard. "This is the tool that will make this
task easily and automatically possible."

Key features of the Remote Web Indexer include:

* Automatic indexing of HyperText Markup Language and text files.

* Built-in capture of fielded and zone information such as titles,
headlines and more, Web page modification dates, and Uniform Resource
Locators for more precise searching.

* Observes "Safeguard" default behaviors to ensure only authorized sites
are indexed: will only index if a "robots.txt" file is present; will not
jump hypertext links between servers.

Topic Remote Web Indexer is in beta test now at more than 10 customer
sites.  Topic WebSearcher version 1.1, which will include the Remote Web
Indexer, is scheduled for release in August, 1995 and is priced at $9,995.

For a demonstration of the Topic search engine and databases built with
Topic Remote Web Indexer, see Verity's Web Publishers Virtual Library at
http://www.verity.com/library.html.

Verity, Inc., headquartered in Mountain View, CA, develops and markets the
Topic family of information retrieval tools and applications for publishing
and disseminating information across the enterprise, the Internet and
CD-ROM.  The company's products and services are used by more than 650
corporations and organizations worldwide as well as by hundreds of
development partners.  Verity's Topic search engine is the engine of choice
for Adobe Systems, Lotus Development Corporation, Netscape Communications,
Quarterdeck Corporation, Frame Technology Corporation, Saros Corporation,
PC Docs, Odesta Systems Corporation, Documentum, Inc. Restrac, and many
other software developers.

                       ###

For more information contact Verity at info@verity.com or at the World Wide
Web site http://www/verity.com/ or by calling 415/960-7600.

Verity and TOPIC are registered trademarks of Verity, Inc. in the United
States and other countries.  Verity, Inc. is not related to the
International Stock Exchange of the United Kingdom and the Republic of
Ireland Limited, which provide computerized information under the name
Topic.  All other trademarks are the property of their respective holders.

--
Verity Inc.
narnett@verity.com
<URL:http://www.verity.com/>
(415) 960-7660


From /CN=robots-errors/@nexor.co.uk Fri Jul 28 14:51:53 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 28 Jul 1995 14:56:17 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 28 Jul 1995 14:51:53 +0100
Date: Fri, 28 Jul 1995 14:51:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:124010:950728135155]
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 28 Jul 1995 14:51:53 
                      +0100;
Alternate-Recipient: Allowed
From: Ivan_Lindenfeld@pcmailgw.ml.com
Message-ID: <9506288069.AA806950406@pcmailgw.ml.com>
To: /CN=robots/@nexor.co.uk
Encoding: 3 Text
Return-Receipt-To: Ivan_Lindenfeld@pcmailgw.ml.com
Status: RO
Content-Length: 30

     
     unsubscribe


From /CN=robots-errors/@nexor.co.uk Wed Aug  2 14:41:27 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 2 Aug 1995 14:47:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 2 Aug 1995 14:41:27 +0100
Date: Wed, 2 Aug 1995 14:41:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:266080:950802134130]
Content-Identifier: reindexing pa...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 14:41:27 +0100;
Alternate-Recipient: Allowed
From: ".. G. Edward Johnson" <lorax@speckle.ncsl.nist.gov>
Message-ID: <Pine.3.89.9508020937.B6175-0100000@speckle>
To: /CN=robots/@nexor.co.uk
Subject:  reindexing pages.
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 392


I am interested to know what stratigies web databases use for updating 
their indexes.  If a page has changed, and is reindexed, will it match both
what used to be on the page and what is now on the page or just what is 
currently on the page?  Also, on a related note is there a way to remove 
a page from the index (if for instance it no longer exists, or has moved?)

Thanks.
  Edward.


From /CN=robots-errors/@nexor.co.uk Wed Aug  2 15:53:38 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 2 Aug 1995 16:03:24 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 2 Aug 1995 15:53:38 +0100
Date: Wed, 2 Aug 1995 15:53:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:287850:950802145353]
Content-Identifier: Re: reindexin...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 15:53:38 +0100;
Alternate-Recipient: Allowed
From: " (Tim Bray)" <tbray@opentext.com>
Message-ID: <m0sdf7n-0003CUC@giant.mindlink.net>
To: /CN=robots/@nexor.co.uk
Subject:  Re: reindexing pages.
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Version 2.0.3
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Status: RO
Content-Length: 650

>I am interested to know what stratigies web databases use for updating 
>their indexes.  If a page has changed, and is reindexed, will it match both
>what used to be on the page and what is now on the page or just what is 
>currently on the page?  Also, on a related note is there a way to remove 
>a page from the index (if for instance it no longer exists, or has moved?)

Ours (Open Text http://www.opentext.com:8080), when it detects that a
page has changed, indexes only the new version.  Various different indexes
all have their own methods/tools for requesting deletion/update.  

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)

From /CN=robots-errors/@nexor.co.uk Wed Aug  2 18:23:45 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 2 Aug 1995 18:27:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Wed, 2 Aug 1995 18:23:45 +0100
Date: Wed, 2 Aug 1995 18:23:45 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:033570:950802172351]
Content-Identifier: I need a robot
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 18:23:45 +0100;
Alternate-Recipient: Allowed
From: " (Ryan Waldron)" <rew@CrystalData.COM>
Message-ID: <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>
To: /CN=robots/@nexor.co.uk
Subject:  I need a robot
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 1948

Hi, all.  I have a client (well, actually a client's client, but it's
effectively mine here) that wants to do a searchable index on the
contents of about 10 sites.  These sites are not big, they are all on
a similar topic, and the 'bot doesn't need to wander off the sites
following links to other hosts and documents.  In short, this robot
needs to grab a relatively small number of documents (compared to
normal resource-discovery robots) and then make a searchable index of
certain key words and phrases of interest to my client.  They have
been very specific that they want their own search facility, and are
even willing to pay for us to set up a dedicated machine to do it, if
need be.  So I can't very well tell them, "Just go look at Harvest."
They don't want the whole Web searched, just very specific searches
from just these few sites.

Now, I've never written a robot, though I've written lots of little
utilities to grab this and that.  I've read the robot exclusion
standards, I've looked at the big robots' sites, and so on.  I've done
what I could to educate myself on how this works.

I've grabbed MOMspider and libwww and started hacking away.  But I'm
having difficulty getting it to do exactly what I need, so I'm asking
here, just out of hope that someone can help me:

Is there anywhere a robot that already does this, for which I could
get the source?  I'd a whole lot rather use the code written by someone
who knew *exactly* what they're doing than risk my silly robot getting
loose and making people mad at me.

If not, I will do my utmost to make a very well-behaved robot out of
the MOMspider and libwww stuff I have.

-- 
Ryan Waldron   |||   http://www.traveller.com/~rew   |||   rew@traveller.com
The Software Tailors         (205) 232-2706             "Software that fits"
Consulting & Contract programming                       Unix / Windows / DOS
           C / C++ / XVT / OWL / MFC / E-Mail / News / WWW / HTML

From /CN=robots-errors/@nexor.co.uk Thu Aug  3 09:06:06 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 3 Aug 1995 09:09:14 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 3 Aug 1995 09:06:06 +0100
Date: Thu, 3 Aug 1995 09:06:06 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:224390:950803080608]
Content-Identifier: Re: I need a ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Aug 1995 09:06:06 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <m.koster@nexor.co.uk>
Message-ID: <"22432 Thu Aug  3 09:05:41 1995"@nexor.co.uk>
To: " (Ryan Waldron)" <rew@CrystalData.COM>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>
Subject:  Re: I need a robot 
Status: RO
Content-Length: 1013

In message <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>, " (Ryan Waldron)" writes:

> In short, this robot
> needs to grab a relatively small number of documents (compared to
> normal resource-discovery robots) and then make a searchable index of
> certain key words and phrases of interest to my client.  They have
> been very specific that they want their own search facility, and are
> even willing to pay for us to set up a dedicated machine to do it, if
> need be.  So I can't very well tell them, "Just go look at Harvest."
> They don't want the whole Web searched, just very specific searches
> from just these few sites.

Maybe I don't quite understand your requirement, but why can you not
use Harvest for this purpose? You can configure that to stay within
bounds, and once you have the data in SOIF format you can pretty 
much do anything with it what you want.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Aug  3 15:38:15 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Thu, 3 Aug 1995 15:45:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Thu, 3 Aug 1995 15:38:15 +0100
Date: Thu, 3 Aug 1995 15:38:15 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:295630:950803143816]
Content-Identifier: Re: I need a ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Aug 1995 15:38:15 +0100;
Alternate-Recipient: Allowed
From: Tim Jung <tjung@i1.net>
Message-ID: <Pine.BSD/.3.91.950803093113.638C-100000@mail1.i1.net>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"22432 Thu Aug 3 09:05:41 1995"@nexor.co.uk>
Subject:  Re: I need a robot
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
Content-Length: 2006


On Thu, 3 Aug 1995, Martijn Koster wrote:

> In message <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>, " (Ryan Waldron)" writes:
> 
> > In short, this robot
> > needs to grab a relatively small number of documents (compared to
> > normal resource-discovery robots) and then make a searchable index of
> > certain key words and phrases of interest to my client.  They have
> > been very specific that they want their own search facility, and are
> > even willing to pay for us to set up a dedicated machine to do it, if
> > need be.  So I can't very well tell them, "Just go look at Harvest."
> > They don't want the whole Web searched, just very specific searches
> > from just these few sites.
> 
> Maybe I don't quite understand your requirement, but why can you not
> use Harvest for this purpose? You can configure that to stay within
> bounds, and once you have the data in SOIF format you can pretty 
> much do anything with it what you want.
> 

Yes this is true. I already sent him private email telling him the same 
thing. "Harvest" is actually quite a nice package. It is one of the few 
indexing and search engines on the net for use with web where you can 
define a narrow search/retrieve parameters so you are not trying to index 
the whole internet but rather a very small detailed section of it on 
what you are interested in. It is also unusual in the fact that you don't 
need to maintain awhole set of robots, you can instead share them with 
neighbors around you so as to reduce the traffic load on the net. It is 
actually a very nice engine, and should be one of the better things to 
come out for this purpose in the last few months that I have heard of. It 
also allows you to share your index databases, not just your robot 
information. So if a site didn't want to maintain their own set of 
indexing routines, and qualifiers they could just get them from another 
site. All in all I think this program will be one of the greatest things 
to reduce traffic on the net in a long time.

From /CN=robots-errors/@nexor.co.uk Tue Aug  8 16:26:42 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 8 Aug 1995 16:30:33 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 8 Aug 1995 16:26:42 +0100
Date: Tue, 8 Aug 1995 16:26:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:205940:950808152643]
Content-Identifier: Harvester for...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 16:26:42 +0100;
Alternate-Recipient: Allowed
From: " (Bonnie Scott)" <bonnie@dev.prodigy.com>
Message-ID: <199508081517.LAA31319@tinman.dev.prodigy.com>
To: /CN=robots/@nexor.co.uk
Subject:  Harvester for AIX?
X-Sender: bonnie@tinman.dev.prodigy.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Mailer: <Windows Eudora Version 2.0.2>
Status: RO
Content-Length: 345

I read at 

http://harvest.cs.colorado.edu/harvest/FAQ.html#platforms

that someone had ported Harvester to AIX 3.2. Does anyone know where to get
this or whom to write to? Do you think it would run under AIX 4.1? 

Also, what do you all think of NASA's MORE (extension of their RBSE project)?

Thank you,
Bonnie Scott
Prodigy Services Company


From /CN=robots-errors/@nexor.co.uk Tue Aug  8 17:39:21 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 8 Aug 1995 17:45:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 8 Aug 1995 17:39:21 +0100
Date: Tue, 8 Aug 1995 17:39:21 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:223300:950808163923]
Content-Identifier: Re: Harvester...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 17:39:21 +0100;
Alternate-Recipient: Allowed
From: " (Mike Schwartz)" <mfs@cse.ogi.edu>
Message-ID: <m0sfrVq-00000eC@latour.cse.ogi.edu>
To: bonnie@dev.prodigy.com
Cc: /CN=robots/@nexor.co.uk
Subject:  Re:  Harvester for AIX?
Status: RO
Content-Length: 667

> Date: Tue, 8 Aug 1995 16:26:42 +0100
> From: " (Bonnie Scott)" <bonnie@dev.prodigy.com>
> To: /CN=robots/@nexor.co.uk
> Subject:  Harvester for AIX?
> 
> I read at 
> 
> http://harvest.cs.colorado.edu/harvest/FAQ.html#platforms
> 
> that someone had ported Harvester to AIX 3.2. Does anyone know where to get
> this or whom to write to? Do you think it would run under AIX 4.1? 

Bonnie,

See ftp://ftp.cs.colorado.edu/pub/distribs/harvest/contrib/AIX-binaries/

More generally, if you would like technical support for Harvest, please see
http://harvest.cs.colorado.edu/harvest/support-policy.html

Finally, please note that it's "Harvest", not "Harvester"
 - Mike

From /CN=robots-errors/@nexor.co.uk Tue Aug  8 21:03:33 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 8 Aug 1995 21:06:55 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Tue, 8 Aug 1995 21:03:33 +0100
Date: Tue, 8 Aug 1995 21:03:33 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:252930:950808200334]
Content-Identifier: Harvest on So...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 21:03:33 +0100;
Alternate-Recipient: Allowed
From: " (Kelly Carney)" <kcarney@magellan.teq.stortek.com>
Message-ID: <9508082001.AA04406@gomer.teq.stortek.com>
To: /CN=robots/@nexor.co.uk
Cc: Kelly_Carney@stortek.com
Subject:  Harvest on Solaris?
X-Sun-Charset: US-ASCII
Status: RO
Content-Length: 629

While waiting for the new newsgroup for
Harvest to come online (comp.infosystems.harvest),
I thought I ask this forum a question...

Anyone RUNNING harvest under Solaris?

I've gotten binary as well as source distributions
to execute under Solaris 2.4, but I've never been
able to make it work correctly.  It seems to get
confused while looking through links to directories
and eventually locks up my host.

I know there is a new beta ready (1.3), but before
going to the trouble of building it, I'd appreciate
hearing if ANYONE has older versions working under
Solaris 2.4

                 Much obliged,
                 Kelly

From /CN=robots-errors/@nexor.co.uk Fri Aug 18 18:18:12 1995
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Fri, 18 Aug 1995 18:21:30 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
               Relayed; Fri, 18 Aug 1995 18:18:12 +0100
Date: Fri, 18 Aug 1995 18:18:12 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055900:950818171813]
Content-Identifier: Prototype web...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 18 Aug 1995 18:18:12 
                      +0100;
Alternate-Recipient: Allowed
From: Razzakul Haider Chowdhury <a94385@cs.ait.ac.th>
Message-ID: <Pine.SUN.3.91.950819000052.14768A-100000@cs4.cs.ait.ac.th>
To: /CN=robots/@nexor.co.uk
Subject:  Prototype web robot
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Length: 1237

As pert of my thesis for M.Engg. in Computer Science, I have implemented 
a prototype web robot. Using Perl and its rich library robot is created 
to generate an index of HTML documents in the web servers. Though the robot 
exclusion protocol is not implemented yet, it is expected to implement in 
future before the test run in the net.

A form-based search interface is available which will provide boolean (OR 
only) keyward search facility on the Html Index (HI testbed, try with 
energy, environment, power sector etc. keywords). The URL is:

             http://www.cs.ait.ac.th/~a94385/pa.html


	Razzakul Haider Chowdhury
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
		E-mail: a94385@cs.ait.ac.th
		Home Page:  http://www.cs.ait.ac.th/~a94385/index.html
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Mailing Address:                |FAX Address:
Mail Box-171,                   |      Dormitory- R106B
AIT,          			| Fax# (66-2) 524-2126 & 516-1418
G.P.O. Box-2754,		| Tel: (66-2) 524-5980
Bangkok-10501,			|        "    524-6170  (8:00 to 12:00 pm BK)
THAILAND			|        "    524-6171
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=