Gordicaleksa fix dataloader2 #740

Merged: 13 commits, Aug 13, 2024

Conversation

karpathy (Owner)

Commit on top of @gordicaleksa's PR that makes a bunch of bugfixes:

  • be more explicit with treatment of EOT token
  • be careful with API for AutoTokenizer
  • bugfix on dtype in fineweb.py
  • use model_desc instead of model, because the latter is usually an nn.Module
  • a few more

for i, s in enumerate(sections):
    tokens.append(eot)
    # there was a mild bug where I originally intended to remove \n\n, but instead just added
    # the EOT right after each \n\n, so I'm keeping that behavior for backwards compatibility
Contributor

backwards compatibility with what code?

Owner Author

the previous version of this code
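
For readers following this thread, here is a minimal sketch of the behavior the code comment describes: an EOT token starts every section, and the "\n\n" separator is kept rather than removed. It is an illustration under assumptions, not the code from this PR. `toy_encode`, the `EOT` value, and `tokenize_sections` are hypothetical names, and the excerpt above shows neither how `sections` is built nor how the "\n\n" is re-attached, so both are guesses here.

```python
# Hypothetical end-of-text token id; a real tokenizer defines its own value.
EOT = -1


def toy_encode(s):
    """Hypothetical stand-in for a real tokenizer: one token id per character."""
    return [ord(c) for c in s]


def tokenize_sections(text, encode=toy_encode, eot=EOT):
    """Prefix every section with EOT while keeping the "\\n\\n" separators."""
    # Assumption: sections are delimited by blank lines.
    sections = text.split("\n\n")
    tokens = []
    for i, s in enumerate(sections):
        tokens.append(eot)
        # Re-attach "\n\n" to every section except the last, so each EOT
        # lands right after a "\n\n" instead of replacing it.
        spad = s + "\n\n" if i != len(sections) - 1 else s
        tokens.extend(encode(spad))
    return tokens


if __name__ == "__main__":
    print(tokenize_sections("a\n\nb"))
```

With the toy encoder, the input "a\n\nb" yields EOT, the codes for "a" and two newlines, EOT again, then the code for "b". That matches what prepending EOT to the text and inserting EOT after each "\n\n" would give.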

@karpathy merged commit 4c84bc7 into master on Aug 13, 2024
13 checks passed