Data:

There are three datasets:

UNT.edu
Texas.gov
USDA.gov

All three datasets share the following directory structure:

dataset/
    docs/            <- PDF documents
    split_ids/       <- three train-dev-test splits of document IDs (append .pdf to an ID to locate its file in the `docs/` directory)
    positives.txt    <- list of IDs for the documents in the positive class
    negatives.txt    <- list of IDs for the documents in the negative class
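
The layout above could be read along the following lines. This is only a minimal sketch: it assumes that `positives.txt` and `negatives.txt` hold one document ID per line, which this README does not state, and the function names `read_ids` and `load_labeled_pdfs` are hypothetical helpers, not part of the datasets.

```python
from pathlib import Path


def read_ids(path):
    """Read document IDs, assuming one ID per line; blank lines are skipped."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def load_labeled_pdfs(dataset_dir):
    """Return (pdf_path, label) pairs: 1 for positives.txt, 0 for negatives.txt."""
    root = Path(dataset_dir)
    pairs = []
    for filename, label in (("positives.txt", 1), ("negatives.txt", 0)):
        for doc_id in read_ids(root / filename):
            # Append .pdf to the ID to locate the file in docs/.
            pairs.append((root / "docs" / f"{doc_id}.pdf", label))
    return pairs
```

For example, `load_labeled_pdfs("UNT.edu")` would list the PDF paths of that dataset with their class labels. The files under `split_ids/` are not read here because their names and format are not described in this README.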

Cite

If you use the datasets, please cite the following paper:

Krutarth Patel, Cornelia Caragea, and Mark E. Phillips. Dynamic Classification in Web Archiving Collections. Proceedings of the Twelfth International Conference on Language Resources and Evaluation, 2020.
