GPT Pre-training Data #3
- Common Crawl
- WebText 1 (from "Language Models are Unsupervised Multitask Learners")
- WebText 2 (from "Scaling Laws for Neural Language Models")
- Book Corpus: the paper does not clearly explain Books1 and Books2, so we can only speculate about them. Several articles offer interesting conjectures.
Other References:
- GPT-3 data mix
- High-quality datasets
- Data preparation process
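To make the data mix concrete, here is a minimal sketch of sampling a source corpus in proportion to its training-mix weight. The weights are the fractions reported in the GPT-3 paper (Table 2.2: Common Crawl 60%, WebText2 22%, Books1 8%, Books2 8%, Wikipedia 3%); the dataset names and the `sample_dataset` helper are placeholders for illustration, not part of any released codebase.

```python
import random

# Training-mix fractions as reported in the GPT-3 paper (Table 2.2).
# They sum to 1.01 due to rounding in the paper; random.choices
# normalizes weights, so this is harmless.
MIX_WEIGHTS = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick a source corpus with probability proportional to its mix weight."""
    names = list(MIX_WEIGHTS)
    weights = [MIX_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw 10,000 samples; empirical frequencies should approximate the weights.
rng = random.Random(0)
counts = {name: 0 for name in MIX_WEIGHTS}
for _ in range(10_000):
    counts[sample_dataset(rng)] += 1
```

Note that the mix is defined over training tokens seen, not raw corpus size: Common Crawl is far larger than 60% of the raw data, so the high-quality corpora are effectively upsampled (seen more than once per epoch of Common Crawl).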