From /CN=robots-errors/@nexor.co.uk Wed Jun 1 21:17:14 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 1 Jun 1994 21:17:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Wed, 1 Jun 1994 21:17:14 +0100
Date: Wed, 1 Jun 1994 21:17:14 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039230:940601201716]
Content-Identifier: WWW robots di...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 1 Jun 1994 21:17:14 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <[email protected]>
Message-ID: <"3912 Wed Jun 1 21:17:00 1994"@nexor.co.uk>
To: Jonathon Fletcher <[email protected]>,
David Eichmann <[email protected]>,
Oliver McBryan <[email protected]>,
Roy Fielding <[email protected]>,
Brian Pinkerton <[email protected]>, Fred Barrie <[email protected]>,
Matthew Gray <[email protected]>, Paul De Bra <[email protected]>,
Guido van Rossum <[email protected]>,
"James E. Pitkow" <[email protected]>,
Andreas Ley <[email protected]>,
Christophe Tronche <[email protected]>,
Charlie Stross <[email protected]>, [email protected],
Michael L Mauldin <[email protected]>
Cc: /CN=robots/@nexor.co.uk
Subject: WWW robots discussion list
Status: RO
Content-Length: 1305
At the WWW'94 Conference the robot authors present expressed an
interest in some closer collaboration. I volunteered to set up a
mailing list to serve as a platform for these technical discussions.
This list is now active. As you are all developing or administering
robots I'd urge you to make use of this facility; together we
should be able to reduce the occurrence of problems caused by
robots, to reduce some of the duplicate effort, and improve
the service to users of robot-generated facilities.
If you'd like to subscribe, send a message to
[email protected], with the lines
subscribe
help
stop
in the body of the message. The list manager is of course NXDLM, which
we market as a product, and is configured to keep an archive of traffic
on the list. This archive is accessible from the Web via our experimental
gateway reachable from <http://web.nexor.co.uk/>.
To send messages to the list itself use [email protected].
Next week (allowing people time to register) I'll post a proposed
charter to the list, and list some issues I'd like to see discussed.
Looking forward to your contributions,
-- Martijn
__________
Internet: [email protected]
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:37:38 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 09:38:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 09:37:38 +0100
Date: Mon, 6 Jun 1994 09:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:130800:940606083743]
Content-Identifier: Proposed Char...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:37:38 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <[email protected]>
Message-ID: <"13056 Mon Jun 6 09:37:06 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Proposed Charter
Status: RO
Content-Length: 1401
Welcome to you all...,
Here is the proposed charter for this list, for future
reference by new subscribers. It's straightforward, but
if anybody would like to see any changes let me know.
--
Proposed charter for [email protected].
This list is intended as a technical forum
for authors, maintainers and administrators of WWW robots.
Its aim is to maximise the benefits WWW robots can offer
while minimising drawbacks and duplication of effort.
It is intended to address both development and operational
aspects of WWW robots.
This list is not intended for general discussion of WWW
development efforts, or as a first line of support for
users of robot facilities.
Postings to this list are informal, and decisions and
recommendations formulated here do not constitute any
official standards. Postings to this list will be made
available publicly through the list-manager's archive,
and NEXOR doesn't accept any responsibility for the content
of the postings.
Related lists:
[email protected]: technical WWW development discussions
[email protected]: HTML specific development discussions
[email protected]: technical discussions on proxies and caching
comp.infosystems.www.*: WWW discussions
-- Martijn
__________
Internet: [email protected]
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:39:16 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 09:39:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 09:39:16 +0100
Date: Mon, 6 Jun 1994 09:39:16 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:131210:940606083924]
Content-Identifier: Topics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:39:16 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <[email protected]>
Message-ID: <"13118 Mon Jun 6 09:39:07 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Topics
Status: RO
Content-Length: 7634
Here is a long list of topics I'd like to see discussed at
some point, in no particular order. I look forward to comments
on these topics, other issues, and what the priorities should be.
* public information
- robot profile matrix
It would be nice to have a matrix of certain attributes of
the various robots that exist, that is available to the Web
public at large. The list I maintain on
<http://web.nexor.co.uk/mak/doc/robots/active.html> could
serve as a basis; are there any additions people would like
to see?
* sharing of data
- format / access protocol of database
Most indexing robots generate a database of information
that can then be searched through publicly accessible
ISINDEX/Form pages. It would be nice if the actual database
were publicly available, or, where applicable, if an access
protocol could be made publicly available (e.g. SQL).
Others could then run local mirrors of the search engines,
write their own search engines, or do analysis of the data.
- distributed data gathering
If there was a standard database format / access protocol
the data gathering could be distributed over the net, either
by separate robots, or multiple copies of the same robot.
Jonathon, you mentioned once you were working on some robot
database synchronisation scheme. Did you get anywhere?
* data analysis
As robots traverse the Web, they could do a lot of statistical
analysis, either real-time, or on the resulting database. It seems
silly that multiple robots go out over the same data, all doing
slightly different analysis.
It would be really nice to publish:
- a list of servers
Like Matthew Gray's list, but one that is as up-to-date as
the latest robot run, only has hosts that actually exist,
and is smart about multiple DNS names for the same IP address.
- inverse maps
Robots can create inverse maps, so that I can find out which pages
refer to a particular page. Until the Referer HTTP field becomes
more used this could be very valuable to find bad links. And it'd
be nice to know the average number of links to a page; how inter-
linked is the Web? We could have a most-referenced league table;
which is the most popular page in the web in terms of links?
- general stats like
average number of visited documents per server (and min & max),
total number of documents visited,
total number of hosts visited,
percentage of links that are bad,
percentage of HTML documents that are a tag soup,
percentage of documents not changed in x days,
etc. etc.
* sharing of operational tips
All robot maintainers hit the same problems at certain sites,
and get things like:
- seed documents
What documents are good to start robots from.
- site exclusion lists
Which sites explicitly ask not to be visited.
- black hole lists
Which cgi-scripts create infinitely linked web spaces
- avoidance lists
Which data should be avoided (e.g. the UNIX manual gateways)
- robotsnotwanted proposal
I'd like to get some more discussion on this. As all the robot writers
are on this list we should be able to decide on something that can
easily be implemented by robots and users. The only outstanding issue
is the name of the file; it is too long for DOS-based servers. Is there
any problem with changing the filename to robotsp.txt (for robots
policy) ?
- scheduled runs
It might be nice to know when which robots are running, just in case
people start wondering.
- ALIWEB
For those sites that have a /site.idx file it might be worth giving
the documents referenced in it special consideration.
* sharing of algorithms
All robots have different algorithms for a lot of the same functions.
It should be possible to find the best algorithm that all robots can use:
- document selection
Which documents do you visit? A lot of robots go "n levels deep", which
seems pretty arbitrary to me. Doing "n levels from the root document"
might make more sense.
- HTML parsing
This is tricky, with so much bad HTML out there. There must be a "best
way" to extract URL's from documents; I am sure that at the moments
some robots barf on some documents.
- load balancing
How do you decide when to query a site so as to balance the load best?
It is by now clear the "visit one site at top speed" approach is
nasty; what is used now? Round robin? Can time zones be used? How
fast do robots run?
- search algorithms
Once you have a database, what algorithm can one use to search it?
At the moment there are Perl scripts, SQL scripts, WAIS database etc.
If there was a standard database format these could be benchmarked.
- error recovery
Robots should be restartable without having to backtrack. How is this
best achieved?
* sharing of code
There is a lot of duplication of effort in the coding and maintenance
of robot code. It would be useful if there was one common code base
for robots to draw from, implementing the separate algorithms used in
robots.
I would really like to see a single robot implementation (TUM: The Unified
Robot?) that could run cooperatively around the world. Is this me
dreaming, or is it something more of you see as beneficial? If so, how
can we make this a reality? What language is most suitable (Perl,
surely :-) ? What design allows the most flexibility and safety?
* HTML/HTTP extensions
It may be that there are things in HTTP/HTML that robots could use but
don't at the moment, and it may even be worth extending the protocol
to put in facilities aimed at automated tools (e.g. If-Modified-Since).
At WWW'94 one idea was, for example, to implement a server-side facility
to parse an HTML document and return only the links.
* Caching issues
The increased use of caching presents special problems for robots:
how does a robot recognise a cached document sitting in the cache
data area of a caching server? Should it document them?
But caches and robots do similar things: a robot uses its own database
as a cache (I hope!), but a caching server could also use that data.
This comes back to standardising the database; maybe the structure
used by the CERN cache can be used as the format for robot gathering output.
Robots can also be useful for pre-loading a cache, to do mirroring, or
to prepare for off-line demos. Maybe robots should have command-line
options to facilitate this. Then again, robot code should probably not
be handed out freely.
* Testing
The person running a robot should keep close tabs on what it is doing
at any one time. What sort of monitoring tools are used to do that?
Testing robot modifications is another issue. I have noticed in the
past that a robot did the same run several times in a day, which it
turned out to do "for testing". Surely tests should be done locally.
Right, I have been waiting to get all these off my chest. I think TUM is
the most challenging long-term topic, but in the short term I think the
standard database(s) is the most important; it would bring immediate benefit,
and a lot of the other issues can follow on from that.
Any comments?
-- Martijn
__________
Internet: [email protected]
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:06:46 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 10:07:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 10:06:46 +0100
Date: Mon, 6 Jun 1994 10:06:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136200:940606090649]
Content-Identifier: Inverse Maps
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:06:46 +0100;
Alternate-Recipient: Allowed
From: [email protected]
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk
Subject: Inverse Maps
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 770
I am currently working on W4v3.0, and one of the features I have implemented
so far is some inverse mapping features. It's yielded some interesting
results. Not surprisingly, the most pointed to sites in the documents
examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.
Other highly pointed to sites include nearnet.gnn.com (:-),
www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.
For the initial portion of the implementation, I am only constructing
interconnectivity within sites. That is, I keep track of what documents
point to site FOO, not what documents point to what documents. Any ideas
on implementation of the latter that is reasonable?
Has anyone else done such interconnectivity mapping?
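A rough Perl sketch of the site-level counting idea just described; the
regexp, hash, and reporting loop are illustrative guesses rather than W4 code:

  # Count, for each target host, how many documents point at it.
  my %inbound;                       # host => number of referring documents

  sub note_link {
      my ($from_doc, $to_url) = @_;  # $from_doc kept for a later doc-to-doc map
      if (my ($host) = $to_url =~ m{^http://([^/:]+)}i) {
          $inbound{lc $host}++;
      }
  }

  # After a run, list the most pointed-to sites first.
  sub report {
      for my $host (sort { $inbound{$b} <=> $inbound{$a} } keys %inbound) {
          print "$inbound{$host}\t$host\n";
      }
  }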
...Matthew
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:36:42 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 10:36:42 +0100
Date: Mon, 6 Jun 1994 10:36:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140740:940606093644]
Content-Identifier: re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:36:42 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <[email protected]>
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk, [email protected]
Subject: re: Inverse Maps
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1799
[email protected] writes ...
>I am currently working on W4v3.0, and one of the features I have implemented
>so far is some inverse mapping features. It's yielded some interesting
>results. Not surprisingly, the most pointed to sites in the documents
>examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.
>Other highly pointed to sites include nearnet.gnn.com (:-),
>www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.
>For the initial portion of the implementation, I am only constructing
>interconnectivity within sites. That is, I keep track of what documents
>point to site FOO, not what documents point to what documents. Any ideas
>on implementation of the latter that is reasonable?
One idea I was playing with when I was working on websnarf 2
(which is currently on the shelf) was the idea of using a whacking
great .dbm file to store either entire HTML files, indexed on their
URL, or a list of URLs extracted from such files. (I ran into a
problem in that the standard dbm and Berkeley dbm libraries have a
maximum record size of 1024 or 2096 bytes respectively; GNU dbm
apparently doesn't have this restriction, but I didn't have time to
rebuild my version of Perl with a new library.) Anyway, the idea
is that keeping such a database would reduce the problem of cross-
referencing large webs; simply read a record, and for each URL
in the record (which contains a list) do a lookup on the database.
(The output could then be turned into input for a graph-generating
program like AT&T's NEATO.)
-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is [email protected], SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 18:28:28 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:46:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 18:28:28 +0100
Date: Mon, 6 Jun 1994 18:28:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:203320:940606172830]
Content-Identifier: Re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 18:28:28 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton <[email protected]>
Message-ID: <[email protected]>
Cc: /CN=robots/@nexor.co.uk
Subject: Re: Inverse Maps
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 403
I've done some inverse mapping with the WebCrawler, but not to any great
extent. Right now, I just generate the "Top 25" list -- a list of the 25 most
frequently referenced sites on the Web (at least, based on the WebCrawler's
limited experience). This turns out to work pretty well -- you can see the
(predictable) results at
http://www.biotech.washington.edu/WebCrawler/Top25.html.
bri
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:14:53 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Mon, 6 Jun 1994 10:15:25 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 10:14:53 +0100
Date: Mon, 6 Jun 1994 10:14:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:137830:940606091454]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:14:53 +0100;
Alternate-Recipient: Allowed
From: [email protected]
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1101
One of the features that I implemented in W4v1.0 was an avoidance algorithm
I called 'boredom'. First a brief implementation profile of W4v1.0:
W4v1.0 was written in June of 1993 as a simple depth first search that kept
the entire database in memory of where it had been and dumped to disk when
it had exhausted a document tree. Very simple.
So, one issue I was concerned about was infinite trees (this is a bad thing
with depth first searches :-) so I added a feature to the Wanderer that allowed
it to 'get bored'. Specifically, if it retrieved more than N documents with
the same path (except for the last element), and a few other heuristics fired, it
bailed out and found something more interesting to do. For the most part
this was very successful.
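A minimal Perl sketch of that 'boredom' check: the prefix rule (path minus its
last element) follows the description above, while the threshold and names are
illustrative assumptions rather than W4's real settings:

  my %retrieved_under;        # path prefix => number of documents fetched there
  my $BOREDOM_LIMIT = 50;     # "N" -- an assumed value, not W4's actual setting

  sub bored_with {
      my ($url) = @_;
      (my $prefix = $url) =~ s{/[^/]*$}{};     # drop the last path element
      return ++$retrieved_under{$prefix} > $BOREDOM_LIMIT;
  }

  # In the crawl loop, skip anything under a prefix we are bored with:
  #   next if bored_with($url);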
W4v2.0 was a modification to do breadth first searching, and in that revision
'boredom' got removed, as it was not as useful to the algorithm. I am planning
on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
based on content parsing.
Suggestions? Comments? Other implementations to avoid large trees?
...Matthew
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:26:27 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 10:26:27 +0100
Date: Mon, 6 Jun 1994 10:26:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138740:940606092628]
Content-Identifier: Database/memo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:26:27 +0100;
Alternate-Recipient: Allowed
From: [email protected]
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk
Subject: Database/memory implementation
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1128
How have people in general implemented the DB? By the database (DB) I mean
the robot's record of where it has been, not necessarily anything it
constructs for later consumption.
W4v1.0 implemented a completely in-memory DB. This worked fine when there
were 100 sites on the web. It doesn't work any more :-) Plus if the
Wanderer crashed, it wouldn't always successfully dump its DB.
W4v2.0 implemented a disk based DB which has a number of advantages
1) It can get as big as it wants and not kill the machine
2) It saves state, so arbitrary crashes don't lose any substantial data
On the other hand, it is somewhat slower, though most of the time is spent
waiting for HTTP responses.
Currently, it maintains one record of where it has been ('log') and another
record of where it plans on going ('dq') and another set of analogous
in-memory lists which regularly get flushed to disk.
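A small Perl sketch of that log/dq arrangement; the file names follow the
description above, but the buffer size and flat-file format are illustrative
assumptions:

  my (@log_buf, @dq_buf);          # in-memory tails of the two records
  my $FLUSH_AT = 100;              # assumed buffer size

  sub visited  { push @log_buf, $_[0]; flush() if @log_buf >= $FLUSH_AT; }
  sub to_visit { push @dq_buf,  $_[0]; flush() if @dq_buf  >= $FLUSH_AT; }

  sub flush {
      append_to('log', \@log_buf);
      append_to('dq',  \@dq_buf);
  }

  sub append_to {
      my ($file, $buf) = @_;
      return unless @$buf;
      open my $fh, '>>', $file or die "$file: $!";
      print $fh map { "$_\n" } @$buf;
      close $fh;
      @$buf = ();                  # buffer is now safely on disk
  }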
Any other more novel implementations out there? I've given a passing thought
to trying a hierarchical DB, but I'm not sure it would be useful.
Any ideas on how to make an in-memory DB smaller? Or a disk DB faster?
...Matthew
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:34:35 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:48:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 10:34:35 +0100
Date: Mon, 6 Jun 1994 10:34:35 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140130:940606093437]
Content-Identifier: Server list
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:34:35 +0100;
Alternate-Recipient: Allowed
From: [email protected]
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk
Subject: Server list
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 977
Once I get W4v3.0 finished, I intend to add a number of the modifications
mentioned by Martijn in his initial letter (DNS identification of identical
servers, bogus servers eliminated, etc.)
Additionally, I would welcome any other lists of servers. I can merge such
lists with the comprehensive list. I will continue to maintain the
"Comprehensive List of WWW Sites", so anything to make this as up to date and
accurate as possible would be great.
Suggestions on other useful techniques for sorting the comprehensive list would
be great too. If you don't know what I'm talking about, or have lost the URL:
http://www.mit.edu:8001/people/mkgray/compre.bydomain.html
So, please do send me any site lists. No desperate need to cross-check with my
list; I can do that. Of course, if you want to, that just makes my life
easier.
...Matthew
BTW, I'm sending all these messages out separately to keep the topic threads
vaguely separate, in case that wasn't apparent.
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:26:46 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:47:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 11:26:46 +0100
Date: Mon, 6 Jun 1994 11:26:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145260:940606102648]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:26:46 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross <[email protected]>
Message-ID: <[email protected]>
To: [email protected], /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 3168
[email protected] writes ...
>One of the features that I implemented in W4v1.0 was an avoidance algorithm
>I called 'boredom'. First a brief implementation profile of W4v1.0:
>W4v1.0 was written in June of 1993 as a simple depth first search that kept
>the entire database in memory of where it had been and dumped to disk when
>it had exhausted a document tree. Very simple.
:
>W4v2.0 was a modification to do breadth first searching, and in that revision
>'boredom' got removed, as it was not as useful to the algorithm. I am planning
>on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
>based on content parsing.
>Suggestions? Comments? Other implementations to avoid large trees?
Well, my first cut at websnarf was a recursive depth-first probe. This
rapidly ran away into the web, and as my bandwidth is limited to my
share of a 64K line this seemed like a bad idea. It also had a tendency
to dump core due to stack frame overflows.
I went to the bookshelf and was most interested to read the chapter on
graph searching in Sedgewick (Algorithms, 2nd edn, can't remember the
year). It turns out that you can use a stack to emulate a recursive
depth-first traversal, and a queue to emulate a recursive breadth-first
traversal, both without the need for recursion. Perl provides a handy
data structure -- the list -- and calls to use a given list as either a
queue or a stack.
I modified websnarf so that it could do both breadth-first and depth-first
traversals (with the switchover being handled in a small subroutine
that decided whether to push or unshift URLs onto the list, and pop or
shift them off the list).
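A bare-bones Perl sketch of that single-list traversal; the seed URL and the
fetch_and_extract_links() stub are placeholders for whatever retrieval and
parsing a real robot does:

  my $start_url = 'http://www.example.org/';   # illustrative seed
  my @todo = ($start_url);     # one list serves as both stack and queue
  my %seen;
  my $breadth_first = 1;       # set to 0 for a depth-first walk

  # stub: a real robot would fetch the page and return the URLs found in it
  sub fetch_and_extract_links { return () }

  while (@todo) {
      # taking from the front (shift) makes the list a queue -> breadth-first;
      # taking from the end (pop) makes it a stack -> depth-first.
      my $url = $breadth_first ? shift @todo : pop @todo;
      next if $seen{$url}++;                      # prune URLs already seen
      push @todo, fetch_and_extract_links($url);
  }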
Because my nonrecursive implementation created a list of URLs
representing the current state of your tree walk, it was then relatively
trivial to scan along the list. If you see two occurrences of the same
URL in the list, you know there's some danger of getting into a loop,
and you can just prune one of them out of the list. It's also useful to
store the "depth" (i.e. number of links away from home) along with each
URL in the list. If two pointers to the same URL occur at the same
depth, the odds are that they're fairly safe -- just time-consuming.
But if one is above the other, there's the possibility of some kind of
weird loop occurring.
Finally, one thing I'd do immediately if I was working on websnarf right
now [*] would be to ensure that it avoids any URLs that look like search
commands or internal document pointers. That way lies madness ...
-- Charlie
[*] websnarf is a personal effort, not a company-sanctioned project.
It's on the shelf due to lack of spare time at work. I'm hoping
(fingers firmly crossed) to get a grant for next year to do the job
professionally, i.e. to spend all my time on it, not just a couple of
hours a week. Meantime, I'm spending my time thinking about how to get
right all the things I got wrong the first time round ...
--------------------------------------------------------------------------------
Charlie Stross is [email protected], SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+
w++ t-(---) r-(++) y+
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:52:10 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Tue, 7 Jun 1994 09:47:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 6 Jun 1994 11:52:10 +0100
Date: Mon, 6 Jun 1994 11:52:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:148030:940606105212]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:52:10 +0100;
Alternate-Recipient: Allowed
From: Lee McLoughlin <[email protected]>
Message-ID: <"swan.doc.i.352:06.05.94.10.50.10"@doc.ic.ac.uk>
To: Charlie Stross <[email protected]>, [email protected], /CN=robots/@nexor.co.uk
In-Reply-To: <[email protected]>
Subject: Re: Avoidance Algorithms
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
Status: RO
Content-Length: 747
My scanning routine was the usual depth-first search. But this meant that
certain sites were scanned before others. Apart from causing sites way down
the list to be left out it also meant that one site would get "soaked". In
the end I went for random scanning of all stored URLs looking for a URL and
site that I hadn't gone to recently.
My system is also written in perl except that I store all the retrieved data
in dbm files. Since I have URL and site timers I can now also run multiple
scanning processes at the same time.
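A rough Perl sketch of that random pick with per-site timers; the 60-second
interval and the plain hashes (rather than the dbm files mentioned) are
illustrative simplifications:

  use List::Util 'shuffle';

  my (%fetched, %site_last);       # URL already done; host => last visit time
  my $SITE_DELAY = 60;             # assumed politeness interval, in seconds

  sub pick_url {
      my @stored = @_;             # all stored URLs
      for my $url (shuffle @stored) {
          next if $fetched{$url};
          my ($site) = $url =~ m{^http://([^/:]+)}i or next;
          next if time() - ($site_last{$site} || 0) < $SITE_DELAY;
          $site_last{$site} = time();
          $fetched{$url}    = 1;
          return $url;             # a URL and site not visited recently
      }
      return;                      # nothing eligible right now
  }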
--
--
Lee McLoughlin. Phone: +44 71 589 5111 X 5085
Dept of Computing, Imperial College, Fax: +44 71 581 8024
180 Queens Gate, London, SW7 2BZ, UK. Email: [email protected]
From /CN=robots-errors/@nexor.co.uk Wed Jun 8 10:09:49 1994
Replied: Wed, 08 Jun 1994 11:02:20 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: " (Paul De Bra)" <[email protected]>
Replied: [email protected]
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 10:11:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Wed, 8 Jun 1994 10:09:49 +0100
Date: Wed, 8 Jun 1994 10:09:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:090430:940608090957]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 10:09:49 +0100;
Alternate-Recipient: Allowed
From: " (Paul De Bra)" <[email protected]>
Message-ID: <[email protected]>
To: /CN=robots/@nexor.co.uk
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 857
Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
breadth-first in 2.0. The experiments we ran with the fish-search, both real
and simulated, all showed that depth-first is a better navigation algorithm
than breadth-first.
We also have boredom, set to 1, meaning that we never retrieve the same
url twice.
Another thing the fish-search does is to try not to load URLs from the same
host in succession. (It searches among the first 30 or so URLs in its list
to find one from another host.)
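A small Perl sketch of that selection rule (scan the first 30 or so queued
URLs for one on a different host); the host-extraction regexp and the fallback
to the head of the list are illustrative guesses:

  sub next_url {
      my ($last_host, $queue) = @_;          # $queue: array ref of pending URLs
      my $window = @$queue < 30 ? scalar @$queue : 30;
      for my $i (0 .. $window - 1) {
          my ($host) = $queue->[$i] =~ m{^http://([^/:]+)}i;
          if (defined $host && lc $host ne lc $last_host) {
              return splice @$queue, $i, 1;  # take it out of the queue
          }
      }
      return shift @$queue;                  # no other host near the front
  }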
A feature still missing in the fish-search, which I would like to hear about
from others, is the use of ISMAPs.
My idea is to try a limited selection of coordinates first, and put a larger
selection further back in the "queue" of URLs to be tried.
How do other robots find out which URLs can be reached by clicking in an
ismap?
Paul.
From /CN=robots-errors/@nexor.co.uk Wed Jun 8 11:02:39 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 11:03:19 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Wed, 8 Jun 1994 11:02:39 +0100
Date: Wed, 8 Jun 1994 11:02:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:096060:940608100242]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 11:02:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <[email protected]>
Message-ID: <"9595 Wed Jun 8 11:02:26 1994"@nexor.co.uk>
To: " (Paul De Bra)" <[email protected]>
Cc: /CN=robots/@nexor.co.uk, [email protected]
In-Reply-To: <[email protected]>
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 1994
> Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> breadth-first in 2.0.
Not really; most Web server URL spaces are structured hierarchically,
with documents growing more specific towards the leaves. So if you start from a
server root and you do a breadth-first search for a limited number of
documents you'll get a broader (and therefore for the purposes of
general indexing better) overview than if you do a depth-first search
for a limited number of documents, which can shoot off down one
specific area (especially in deep trees).
If you use maximum-depth rather than maximum-documents it shouldn't
matter much (depending on the structure of the data).
> The experiments we ran with the fish-search, both real
> and simulated, all showed that depth-first is a better navigation algorithm
> than breadth-first.
Can you elaborate on how they were better?
> A feature still missing in the fish-search, which I would like to hear about
> from others is the use of ISMAP's.
> My idea is to try a limited selection of coordinates first, and put a larger
> selection further back in the "queue" of url's to be tried.
>
> How do other robots find out which url's can be reached by clicking in an
> ismap?
Dave Raggett would refer you to the HTML 3.0 facilities for specifying
links on figures within the figure element. I think that is the only
way; there are an infinite number of coordinates in an ISMAP, you
don't know where they are, and you can't check two locations for
equivalence.
Incidentally, I reckon that it is bad HTML if you provide an ismap as
sole access to a small set of URL's.
A friend of mine is working on a "click on this festival map to show
where you are going to be" service. I'd hate to think what a random
ISMAP coordinate-trying robot would do to that. ;-)
-- Martijn
__________
Internet: [email protected]
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
From [email protected] Wed Jun 8 12:50:17 1994
Replied: Wed, 08 Jun 1994 13:45:38 +0100
Replied: robots
Replied: [email protected] (Paul De Bra)
Return-Path: <[email protected]>
Delivery-Date: Wed, 8 Jun 1994 12:50:28 +0100
Received: from svin04.info.win.tue.nl by lancaster.nexor.co.uk
with SMTP (XTPP); Wed, 8 Jun 1994 12:50:17 +0100
Received: from pcpaul.info.win.tue.nl
by svin04.info.win.tue.nl (8.6.8/1.45) id NAA28716;
Wed, 8 Jun 1994 13:50:05 +0200
Received: from localhost by pcpaul.info.win.tue.nl (8.6.4/1.60) id LAA20290;
Wed, 8 Jun 1994 11:52:19 GMT
From: [email protected] (Paul De Bra)
Message-Id: <[email protected]>
Subject: Re: Avoidance Algorithms
To: [email protected] (Martijn Koster)
Date: Wed, 8 Jun 1994 13:52:17 +0200 (MET DST)
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <[email protected]> from "Martijn Koster" at Jun 8, 94 11:02:11 am
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 2835
> > Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> > breadth-first in 2.0.
>
> Not really; most Web server URL spaces are structured hierarchically,
> with documents growing more specific towards the leaves. So if you start from a
> server root and you do a breadth-first search for a limited number of
> documents you'll get a broader (and therefore for the purposes of
> general indexing better) overview than if you do a depth-first search
> for a limited number of documents, which can shoot off down one
> specific area (especially in deep trees).
I guess our algorithm avoids shooting down into a specific area by trying
to find links to other sites first.
When you have limited search time you often don't get anywhere with breadth-
first navigation because you don't reach documents that are deep enough
to deal with a specific topic.
A robot that spends a *lot* of time, visiting very many documents, could
work equally well using breadth-first search.
> If you use maximum-depth rather than maximum-documents it shouldn't
> matter much (depending on the structure of the data).
We do use maximum-depth to avoid going too far in a non-relevant direction.
> > The experiments we ran with the fish-search, both real
> > and simulated, all showed that depth-first is a better navigation algorithm
> > than breadth-first.
>
> Can you elaborate on how they were better?
They found more cross-reference links. That need not mean anything, but
considering that cross-reference links may (in the web) be links leading to
different sites, this suggests a better chance of penetrating into a larger
part of the web. Again, the fact that we search for a limited time is
important here.
> ...
> > ismap?
>
> Dave Raggett would refer you to the HTML 3.0 facilities for specifying
> links on figures within the figure element. I think that is the only
> way; there are an infinite number of coordinates in an ISMAP, you
> don't know where they are, and you can't check two locations for
> equivalence.
nice to hear something will be coming along.
there are a finite number of coordinates in an ISMAP, but the number is
large. we would never consider trying all possible coordinates.
which isn't necessary in any ismap i know.
> Incidentally, I reckon that it is bad HTML if you provide an ismap as
> sole access to a small set of URL's.
dunno. the course on hypertext which i have on line does it...
and databases that deal with mostly graphical information, providing ismaps
to zoom in on things and providing information would only have access through
ismaps as well.
> A friend of mine is working on a "click on this festival map to show
> where you are going to be" service. I'd hate to think what a random
> ISMAP coordinate-trying robot would do to that. ;-)
we'll work on it and see how it performs.
From /CN=robots-errors/@nexor.co.uk Wed Jun 8 13:46:24 1994
Return-Path: </CN=robots-errors/@nexor.co.uk>
Delivery-Date: Wed, 8 Jun 1994 13:47:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Wed, 8 Jun 1994 13:46:24 +0100
Date: Wed, 8 Jun 1994 13:46:24 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119080:940608124627]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 13:46:24 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster <[email protected]>
Message-ID: <"11890 Wed Jun 8 13:45:48 1994"@nexor.co.uk>
To: " (Paul De Bra)" <[email protected]>
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <[email protected]>
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 1998
> there are a finite number of coordinates in an ISMAP,
I was under the impression you could provide fractional
coordinates, which would make the theoretical address space
infinite, but I'll settle for large :-)
> but the number is large. we would never consider trying all possible
> coordinates. which isn't necessary in any ismap i know.
Sure, my point was that you don't know which to try, or which map
onto the "same" document, if that concept applies (think about a
click-on-the-world-map-to-get-lat-long ismap server)
> > Incidentally, I reckon that it is bad HTML if you provide an ismap as
> > sole access to a small set of URL's.
>
> dunno. the course on hypertext which i have on line does it...