What's the format of the raw data? #3

Open
Howal opened this issue Aug 3, 2020 · 4 comments

Comments

@Howal

Howal commented Aug 3, 2020

Nice work!

What is the format of the raw data (wiki and bc)? Is it one sentence per line, with an empty line between articles?

It would be great if you could share the two raw files you mention in ./preprocess/pretrain/process.sh.

@guolinke
Owner

guolinke commented Aug 3, 2020

@Howal It is plain raw text. The wiki data is the output of wikiextractor.
We don't treat the newline token specially; we keep it as it is.
The first lines of the wiki data:
[image]

Unfortunately, we cannot distribute the data because of licensing issues with the Book Corpus data.
The wiki data is easy to download yourself.
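
To check a wiki dump against the format described above, here is a minimal sketch for reading wikiextractor-style output. It assumes the default wikiextractor layout, where each article is wrapped in `<doc ...>` and `</doc>` lines; the thread does not say whether this repository's preprocessing keeps or strips those wrapper lines. `iter_documents` is a hypothetical helper, not part of the repository. Following the comment above, lines inside an article (blank ones included) are kept as they are.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: default wikiextractor output wraps each article in
# <doc id="..." url="..." title="..."> ... </doc> lines.
DOC_OPEN = re.compile(r"^<doc\b[^>]*>\s*$")
DOC_CLOSE = re.compile(r"^</doc>\s*$")


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each article as the list of lines between its <doc> tags.

    Lines are kept as they are (blank lines included); only the
    trailing newline character is stripped.
    """
    current = None  # None means "not inside an article"
    for line in lines:
        line = line.rstrip("\n")
        if DOC_OPEN.match(line):
            current = []
        elif DOC_CLOSE.match(line):
            if current is not None:
                yield current
            current = None
        elif current is not None:
            current.append(line)


if __name__ == "__main__":
    sample = [
        '<doc id="1" url="u1" title="A">\n',
        "A\n",
        "\n",
        "First paragraph.\n",
        "</doc>\n",
    ]
    for doc in iter_documents(sample):
        print(len(doc), "lines; first line:", doc[0])
```

For a real file, pass an open file handle (for example one opened with `encoding="utf-8"`) to `iter_documents` and inspect the first few articles.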

@Howal
Author

Howal commented Aug 4, 2020

Thank you!

@sowhatyc

Hi @guolinke, in ./preprocess/pretrain/process.sh I saw that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt"
I have obtained one version of the bookcorpus data, but its format looks different from yours.
Could you tell me whether there is any relationship between these two files, or whether the whole bookcorpus data is simply split across them?

@guolinke
Owner

guolinke commented Jun 2, 2021

@sowhatyc We crawled the Book Corpus ourselves. It comes in two formats, txt and epub, and we saved them separately.
