Questions about training data #14

JunZhan2000 · 2023-08-25T09:16:03Z

What type of in-house data is used in the pre-training phase?
In the multi-task pre-training stage, OCR data is used, including SynthDoG-en & zh and Common Crawl PDF & HTML. How was the Common Crawl PDF & HTML data obtained? Does it come from an existing dataset, or did you build it yourselves? If you built it yourselves, how was it done?

Thanks for your work!

tinytangent · 2023-08-25T09:37:50Z

The Common Crawl PDFs were obtained from https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated, as indicated in footnote 3 on page 6 of the paper.

simonJJJ · 2023-08-25T09:53:46Z

Hi @Guokr233, the in-house data is just image-text pairs.

JunZhan2000 closed this as completed Aug 25, 2023
