On the collections table we have the following entry for total collected files for the collection IA:
However, counting the number of status 200 CDXJ entries for collection IA we get 132113441
$ grep -a 'status": "200"' IA.cdxj | wc -l
132113441
This discrepancy could be due to duplication, but more likely the figures in the table are outdated, dating from before the shift from NutchWax to pywb for URL indexing.
We need to confirm for all collections whether this number is correct, and rectify it where needed.
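To help separate the duplication hypothesis from the outdated-numbers one, the count above could be extended to also report how many status 200 entries have a distinct (urlkey, timestamp) pair. The following is only a rough sketch: the function name `count_cdxj_200` is hypothetical, it assumes each CDXJ line has the form `urlkey timestamp {json}`, and it assumes the index is sorted so that exact duplicates sit on adjacent lines.

```shell
# Hypothetical helper: prints "<total> <unique>" for status-200 entries.
# Assumes a sorted CDXJ index, so duplicate (urlkey, timestamp) pairs
# appear on adjacent lines and can be detected without a lookup table.
# Usage: count_cdxj_200 IA.cdxj
count_cdxj_200() {
    # LC_ALL=C treats the input as raw bytes, similar in spirit to grep -a.
    LC_ALL=C awk '
        /"status": *"200"/ {
            total++
            key = $1 " " $2            # urlkey and timestamp fields
            if (key != prev) unique++  # first time this pair is seen
            prev = key
        }
        END { printf "%d %d\n", total, unique }
    ' "$1"
}
```

If the two numbers printed are close to each other, duplication alone is unlikely to explain the gap with the collections table. The same function could be run over each collection's index to check the remaining entries.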
VascoRatoFCCN changed the title from 'Verify that the "Total collected files" number from our collections table to reflect the number of CDX entries with status 200' to 'Verify that the "Total collected files" number from our collections table reflects the number of CDX entries with status 200' on Sep 21, 2023
The "Total collected files" figure is usually counted from the crawl logs. However, the IA collection was donated rather than crawled, so this number was probably obtained from the initial NutchWax indexing. The CDX count is more reliable.