What type of in-house data is used in the pre-training phase?
The multi-task pre-training stage uses OCR data, including SynthDoG-en & zh and Common Crawl PDF & HTML. How was the Common Crawl PDF & HTML data obtained? Does it come from an existing dataset, or did you build it yourselves? If you built it yourselves, how was it done?
Thanks for your work!