What's the format of the raw data? #3
Comments
@Howal The data is in raw text format. The wiki data is the output of wikiextractor. Unfortunately, we cannot distribute the data because of licensing issues with the Book Corpus data.
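For readers who want to inspect wikiextractor output themselves, here is a minimal Python sketch. It assumes the default wikiextractor layout, in which each article is wrapped in a `<doc id="..." url="..." title="...">` line and a closing `</doc>` line. The function name `iter_articles` is hypothetical, and the thread does not say whether the maintainers post-process the output further (for example, into one sentence per line).

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumption: default wikiextractor output, where each article starts with a
# line like <doc id="12" url="..." title="Anarchism"> and ends with </doc>.
DOC_OPEN = re.compile(r'<doc\s+[^>]*title="([^"]*)"[^>]*>')


def iter_articles(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, non-empty text lines) for each <doc> ... </doc> block."""
    title = None
    body: List[str] = []
    for raw in lines:
        line = raw.strip()
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article: remember its title, reset the body.
            title, body = match.group(1), []
        elif line == "</doc>":
            # End of the current article.
            if title is not None:
                yield title, body
            title = None
        elif title is not None and line:
            # Keep only non-empty lines that fall inside an article.
            body.append(line)


if __name__ == "__main__":
    sample = [
        '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">',
        "Anarchism",
        "",
        "Anarchism is a political philosophy.",
        "</doc>",
    ]
    for article_title, text_lines in iter_articles(sample):
        print(article_title, len(text_lines))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a large dump does not have to be loaded into memory at once.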
Thank you!
Hi @guolinke, in ./preprocess/pretrain/process.sh I saw that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt"
@sowhatyc We crawled the book corpus ourselves. The books come in two formats, txt and epub, and we saved them separately.
Nice work!
I wonder what the format of the raw data (wiki and bc) is. Is it one sentence per line, with an empty line between different articles?
It would be great if you could share the two raw files mentioned in ./preprocess/pretrain/process.sh.