Add support for text data & tokenization #577
Tokenize = 'tokenize'
}

export function getPreprocessImage (task: Task): PreprocessText {
should this one really be called Image? the return type is PreprocessText
also, would you mind adding a comment saying whether this outputs a stream of token ids?
for LLMs, we could then also support datasets that don't need any labels
let's also document where/how people can load different tokenizers (task config or hardcoded, either is fine)
I'll make sure that the PR follows your comments once it's out of the "draft" stage!
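To make the review points above concrete, here is a minimal sketch of a preprocessing step that outputs token ids, with the tokenizer selected by name from a task config. All names here (`TextTaskConfig`, `TOKENIZERS`, `loadTokenizer`, `nextTokenPairs`) are hypothetical and not taken from this PR, and the tokenizers are toys standing in for a real one. The last helper shows one common way to train without labels: the target sequence is the input shifted by one token.

```typescript
// Hypothetical sketch: none of these names come from the PR itself.
type Tokenizer = (text: string) => number[]

// Toy whitespace tokenizer: ids are assigned in order of first appearance.
function makeWhitespaceTokenizer (): Tokenizer {
  const vocab = new Map<string, number>()
  return (text: string): number[] =>
    text.split(/\s+/).filter((w) => w.length > 0).map((w) => {
      let id = vocab.get(w)
      if (id === undefined) {
        id = vocab.size
        vocab.set(w, id)
      }
      return id
    })
}

// Toy character-level tokenizer: one id per UTF-16 code unit.
const charTokenizer: Tokenizer = (text) =>
  text.split('').map((c) => c.charCodeAt(0))

// Assumed registry keyed by a name stored in the task config.
const TOKENIZERS: Record<string, () => Tokenizer> = {
  whitespace: makeWhitespaceTokenizer,
  char: () => charTokenizer
}

interface TextTaskConfig { tokenizer: string }

function loadTokenizer (config: TextTaskConfig): Tokenizer {
  const factory = TOKENIZERS[config.tokenizer]
  if (factory === undefined) {
    throw new Error(`unknown tokenizer: ${config.tokenizer}`)
  }
  return factory()
}

// Label-free language modelling: predict token i+1 from tokens up to i.
function nextTokenPairs (ids: number[]): { inputs: number[], targets: number[] } {
  return { inputs: ids.slice(0, -1), targets: ids.slice(1) }
}

const tokenize = loadTokenizer({ tokenizer: 'whitespace' })
const ids = tokenize('the cat saw the dog')
const pairs = nextTokenPairs(ids)
```

Keeping the tokenizer name in the task config would let each task pick its own tokenizer, while a hardcoded default is simpler to ship first; either fits the sketch.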
very cool, thanks for getting this started!
closes #572 and closes #491
add support for text data & tokenization:
this PR includes a rework of the data preprocessing pipeline, which is much more modular and makes it easy to add new preprocessing functions!
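As an illustration of what a modular pipeline can look like (a sketch only, not the code in this PR), each preprocessing function can be a small step with the same input and output type, and the pipeline is their composition. `Step`, `pipeline`, `lowercase`, `collapseWhitespace`, and `truncate` are hypothetical names chosen for this example.

```typescript
// Hypothetical sketch: the PR's actual types and function names may differ.
type Step<T> = (input: T) => T

// Compose steps left to right into a single preprocessing function.
function pipeline<T> (...steps: Array<Step<T>>): Step<T> {
  return (input: T): T => steps.reduce((acc, step) => step(acc), input)
}

const lowercase: Step<string> = (s) => s.toLowerCase()
const collapseWhitespace: Step<string> = (s) => s.trim().replace(/\s+/g, ' ')
const truncate = (maxChars: number): Step<string> => (s) => s.slice(0, maxChars)

// Adding a new preprocessing function means adding one more argument here.
const preprocessText = pipeline(lowercase, collapseWhitespace, truncate(11))
const cleaned = preprocessText('  Hello   WORLD again ')
```

With this shape, a new step only has to satisfy `Step<T>`; nothing else in the pipeline needs to change.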
it also fixes the CI by: