Add support for text data & tokenization #577

s314cy · 2023-04-20T09:15:35Z

closes #572 and closes #491

add support for text data & tokenization:

tokenize samples
load labelled text data in the browser/node
load unlabelled text data in the browser/node
lazily load text data in node
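
For readers unfamiliar with the term, "tokenize samples" means turning each text sample into a list of integer token ids. The sketch below is only an illustration under simple assumptions (whitespace splitting, a vocabulary built from the corpus, id 0 reserved for unknown words). The names `buildVocabulary`, `tokenize` and `UNKNOWN_ID` are hypothetical and are not taken from this PR or from the Disco.js API.

```typescript
// Hypothetical sketch: a whitespace tokenizer, not the tokenizer used in this PR.
type Vocabulary = Map<string, number>

// Assumption: id 0 is reserved for words missing from the vocabulary.
const UNKNOWN_ID = 0

// Lowercase the text and split it on runs of whitespace.
function splitWords (text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0)
}

// Assign ids 1, 2, 3, ... to words in order of first appearance.
function buildVocabulary (corpus: string[]): Vocabulary {
  const vocab: Vocabulary = new Map()
  for (const sample of corpus) {
    for (const word of splitWords(sample)) {
      if (!vocab.has(word)) vocab.set(word, vocab.size + 1)
    }
  }
  return vocab
}

// Map each word of a sample to its id, falling back to UNKNOWN_ID.
function tokenize (text: string, vocab: Vocabulary): number[] {
  return splitWords(text).map((word) => vocab.get(word) ?? UNKNOWN_ID)
}
```

With a vocabulary built from `['the cat', 'the dog']`, `tokenize('The dog barks', vocab)` yields `[1, 3, 0]`: "barks" is out of vocabulary and maps to the reserved id.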

this PR includes a rework of the data preprocessing pipeline, which is now much more modular and makes it easy to add new preprocessing functions!
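
One way to picture a modular pipeline is as a registry of small named steps that get composed in a chosen order. The sketch below is a guess at the general shape, not the code from this PR: `Preprocessing`, `steps` and `buildPipeline` are hypothetical names, and the three string steps are placeholders chosen for illustration.

```typescript
// Hypothetical sketch of a modular preprocessing pipeline; not the PR's implementation.
type Preprocessing = (text: string) => string

// Registry of available steps, keyed by name. Adding a new step is one new entry.
const steps = new Map<string, Preprocessing>([
  ['trim', (text) => text.trim()],
  ['lowercase', (text) => text.toLowerCase()],
  ['collapseWhitespace', (text) => text.replace(/\s+/g, ' ')]
])

// Build a single function that applies the named steps in the given order.
function buildPipeline (names: string[]): Preprocessing {
  const selected = names.map((name) => {
    const step = steps.get(name)
    if (step === undefined) throw new Error(`unknown preprocessing step: ${name}`)
    return step
  })
  return (text) => selected.reduce((acc, step) => step(acc), text)
}

// Assumption: the list of step names would come from something like a task config.
const preprocess = buildPipeline(['trim', 'lowercase', 'collapseWhitespace'])
```

Here `preprocess('  Hello   WORLD ')` returns `'hello world'`, and asking for a step that is not registered fails early instead of being silently skipped.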

it also fixes the CI by:

making the github actions data cache run-specific
ensuring the data download script bypasses the gbucket cache
replacing the example data's archive, a BSD tar (created on macOS) that caused issues in the Linux CI, with a GNU tar

martinjaggi · 2023-04-20T09:37:20Z

discojs/discojs-core/src/dataset/preprocessing/text_preprocessing.ts

+  Tokenize = 'tokenize'
+}
+
+export function getPreprocessImage (task: Task): PreprocessText {


should this one be called image?

also, would you mind adding a comment saying whether this outputs a stream of token ids?

for LLMs, we could then also support datasets that don't need any labels

also, let's document where/how people can load different tokenizers (task config or hardcoded, either is fine)

s314cy · 2023-04-20

I'll make sure that the PR follows your comments once it's out of the "draft" stage!
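
To illustrate the point above about unlabelled datasets for LLMs: with a stream of token ids, the training target can be the same stream shifted by one position, so no separate label is needed. The following is a minimal sketch under that assumption. `Example`, `nextTokenExamples` and `blockSize` are hypothetical names, not part of this PR.

```typescript
// Hypothetical sketch: next-token prediction examples from unlabelled token ids.
interface Example {
  input: number[]
  target: number[]
}

// Cut the token ids into non-overlapping blocks of `blockSize` tokens.
// Each target is its input shifted one position to the right.
function nextTokenExamples (tokens: number[], blockSize: number): Example[] {
  if (blockSize < 1) throw new Error('blockSize must be at least 1')
  const examples: Example[] = []
  // `start + blockSize < tokens.length` guarantees the shifted target fits.
  for (let start = 0; start + blockSize < tokens.length; start += blockSize) {
    examples.push({
      input: tokens.slice(start, start + blockSize),
      target: tokens.slice(start + 1, start + blockSize + 1)
    })
  }
  return examples
}
```

For the ids `[5, 6, 7, 8, 9]` and a `blockSize` of 2, this gives two examples: input `[5, 6]` with target `[6, 7]`, and input `[7, 8]` with target `[8, 9]`.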

martinjaggi · 2023-04-20T09:39:20Z

very cool, thanks for getting this started!

s314cy added the feature (New feature or request) and discojs (Related to Disco.js) labels Apr 20, 2023

s314cy self-assigned this Apr 20, 2023

martinjaggi reviewed Apr 20, 2023

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 283003b to 8834d99 Compare April 24, 2023 14:26

s314cy mentioned this pull request May 4, 2023

add documentation in TASK for nlp and lstm task #565

Closed

s314cy force-pushed the 572-tokenizer-support-s314cy branch from 57803d6 to e8c307f Compare May 4, 2023 10:37

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 55be642 to 6711773 Compare May 23, 2023 12:54

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 699116f to 9c96d71 Compare July 6, 2023 12:13

s314cy added 2 commits July 31, 2023 13:48

feat: add support for text data & tokenization

0892e40

fix export of immutable collections

f78a7e9

s314cy force-pushed the 572-tokenizer-support-s314cy branch from bce22ad to f78a7e9 Compare July 31, 2023 11:48

s314cy marked this pull request as ready for review July 31, 2023 11:48

s314cy merged commit acd4250 into develop Jul 31, 2023

s314cy deleted the 572-tokenizer-support-s314cy branch July 31, 2023 11:49
