What's the format of the raw data? #3

Howal · 2020-08-03T09:26:38Z

Nice work!

What is the format of the raw data (wiki and bc)? Is it one sentence per line, with an empty line between different articles?

It would be great if you could share the two raw files you mentioned in ./preprocess/pretrain/process.sh.

guolinke · 2020-08-03T12:21:54Z

@Howal It is the raw text format. The wiki data is the output of wikiextractor.
We don't handle the newline token specially; we just keep it as it is.
The first lines of wiki data.

Unfortunately, we cannot distribute the data because of the license issue with the Book Corpus data.
The wiki data, however, you can easily download yourself.
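
For readers who want to inspect such a file, here is a minimal sketch of iterating over articles. It assumes the file is wikiextractor's default plain-text output, in which each article sits between a `<doc id="..." url="..." title="...">` line and a `</doc>` line. Whether the file used in this repo still contains those tags is not stated in this thread, and `iter_articles` is a hypothetical helper, not part of the repo. Lines inside an article, including empty ones, are kept unchanged, matching the "keep it as it is" handling described above.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumption: articles are wrapped in <doc ... title="..."> ... </doc>,
# as in wikiextractor's default (non-JSON) output.
DOC_OPEN = re.compile(r'^<doc\b[^>]*\btitle="([^"]*)"[^>]*>\s*$')
DOC_CLOSE = "</doc>"


def iter_articles(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, body_lines) for each <doc> block (hypothetical helper)."""
    title = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article: remember its title, reset the body.
            title, body = match.group(1), []
        elif line.strip() == DOC_CLOSE:
            # End of the current article.
            if title is not None:
                yield title, body
            title = None
        elif title is not None:
            # Keep every line as it is, including empty ones.
            body.append(line)
```

With a real file this could be used as `for title, body in iter_articles(open(path, encoding="utf-8")): ...`, where `path` points to your own wikiextractor output.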

Howal · 2020-08-04T04:32:03Z

Thank you!

sowhatyc · 2021-05-31T10:13:29Z

Hi @guolinke, in ./preprocess/pretrain/process.sh I see that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt"
I have one version of the bookcorpus data, but its format looks different from yours.
Could you tell me whether there is any relationship between these two files, or is the whole bookcorpus data simply stored separately in them?

guolinke · 2021-06-02T02:17:41Z

@sowhatyc We crawled the book corpus on our own. The books come in two formats, txt and epub, and we saved them separately.
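
Since the two files are just separate dumps of the same corpus, they can be read as one stream of raw lines. The sketch below assumes UTF-8 plain-text files. `iter_raw_lines` is a hypothetical helper for illustration, and how process.sh itself combines the files is not described in this thread.

```python
from typing import Iterator, Sequence


def iter_raw_lines(paths: Sequence[str]) -> Iterator[str]:
    """Yield lines from several raw text files, one file after another."""
    for path in paths:
        # Assumption: the raw files are UTF-8 encoded plain text.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

For example, `iter_raw_lines(["book_corpus_epub.txt", "book_corpus_txt.txt"])` would yield the epub-derived lines first and the txt-derived lines second, given local copies of files with those names.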

guolinke mentioned this issue Aug 10, 2020

I'm running your code; how do I set up the datasets and load the dictionary? I don't have dict.txt #2

Open
