Sync fork #2

vsabolcec · 2024-03-01T14:14:32Z

No description provided.

* Add files via upload
  Changed fsx default filepath for logging output to user's home.
* Update process_common_crawl_dump.py
* Update process_common_crawl_dump.py
  Changed fsx default filepath for logging output to user's home. Default file path uses a personal fsx instance, which breaks if the user does not have access to it.
* make path relative

---------

Co-authored-by: Guilherme Penedo <[email protected]>

…he HuggingFace hub (#105)

* added parquet writer
* nit
* first version of HF writer
* bugfix
* fix order
* bugfix
* bugfix repo type
* bugfix repo type #2
* change to PRs
* fix super call
* fix super call #2
* fix super call #3
* fix super call #4
* DEBUG MESSAGES
* DEBUG MESSAGES 2
* DEBUG MESSAGES 3
* fix generator issue
* added backoff retries to hf upload
* added expand_metadata option
* added splitting files into at most 5gb
* fix tests
* nit
* bugfix
* bugfix
* bugfix filename change and default max huggingface filesize
* nit
* nit

* added upload_block_size parameter
* small fix
* adding docstrings
* FileTokenizerMerger
* update more doc-strings
* adding tests
* fix quality + tests
* fix comments
* start method back to forkserver
* style
* update
* quality
* cleaning up
* remove FileTokenizerMerger and corresponding test
* Update src/datatrove/io.py
* Update src/datatrove/io.py
* Update src/datatrove/io.py
* Update src/datatrove/pipeline/base.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/readers/jsonl.py
* Update src/datatrove/pipeline/stats/urls.py
* Update src/datatrove/pipeline/tokens/tokenizer.py
* Update src/datatrove/pipeline/tokens/tokenizer.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/minhash.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/dedup/sentence_dedup.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/base.py
* Update src/datatrove/pipeline/readers/csv.py
* Update src/datatrove/pipeline/readers/csv.py
* Update src/datatrove/pipeline/readers/csv.py
* Update src/datatrove/pipeline/readers/huggingface.py
* Update src/datatrove/pipeline/readers/huggingface.py
* Update src/datatrove/pipeline/readers/ipc.py
* Update src/datatrove/pipeline/readers/ipc.py
* Update src/datatrove/pipeline/readers/jsonl.py
* Update src/datatrove/pipeline/readers/jsonl.py
* Update src/datatrove/pipeline/readers/parquet.py
* Update src/datatrove/pipeline/readers/parquet.py
* Update src/datatrove/pipeline/readers/warc.py
* Update src/datatrove/pipeline/readers/warc.py
* Update src/datatrove/pipeline/readers/warc.py
* Update src/datatrove/pipeline/readers/parquet.py
* fix style
* small changes
* nit

---------

Co-authored-by: guipenedo <[email protected]>

* update
* add docstring
* Update src/datatrove/pipeline/filters/fasttext_filter.py
* nit
* fix pyproject.toml deprecation warning
* fix pyproject.toml deprecation warning
* fix style on ruff 2.0

---------

Co-authored-by: guipenedo <[email protected]>

guipenedo and others added 17 commits February 17, 2024 16:16

added multi node parallelism to local executor (#85)

598bb54

Fix typos and formatting in README.md (#91)

da0cb5d

bugfix stats file not being saved to s3 (#92)

45d8af2

Fix url stats (#89)

3f2d8e4

* added upload_block_size parameter
* small fix

---------

Co-authored-by: guipenedo <[email protected]>

Efficiency: np.fromiter instead of np.array (#88)

4645788

* np.fromiter instead of np.array
* replaced fromiter on some other files

---------

Co-authored-by: Giorgio Angelotti <[email protected]>
Co-authored-by: guipenedo <[email protected]>
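For readers unfamiliar with the change named in this commit, here is a minimal sketch of the general idea, not the code from the commit itself. The function names and the sample values are hypothetical. `np.array` needs a materialized sequence to build a numeric array from a generator, while `np.fromiter` consumes the iterable directly, which avoids the intermediate Python list.

```python
import numpy as np


def squares_via_array(values):
    # np.array cannot build a numeric array from a generator directly,
    # so the values are first collected into a temporary Python list.
    return np.array(list(values), dtype=np.uint64)


def squares_via_fromiter(values):
    # np.fromiter fills a 1-D array straight from the iterable,
    # skipping the intermediate list.
    return np.fromiter(values, dtype=np.uint64)


# Hypothetical sample data: both paths should produce the same array.
a = squares_via_array(x * x for x in range(5))
b = squares_via_fromiter(x * x for x in range(5))
```

When the length is known in advance, passing `count=` to `np.fromiter` lets NumPy preallocate the output instead of growing it.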

add language option to punkt (#94)

9e3dc41

Fix compression type (#95)

8f1c3b7

* Fix compression type
* fix type hints for compression

---------

Co-authored-by: guipenedo <[email protected]>

decoupled reading logic from DedupReader (#98)

a49b93c

Support for arbitrary fasttext models (#99)

1c80d0c

* started work on fasttext filter support; improved download of the model
* added filtering logic
* adds docs link
* added to init

Adds citation (#101)

795e542

* added citation
* added to toc
* updated author list

Adds parquet writer (#103)

d4cf053

* added parquet writer
* nit
* Update src/datatrove/pipeline/writers/parquet.py
  Co-authored-by: Mario Šaško <[email protected]>
* updated test
* nit

---------

Co-authored-by: Mario Šaško <[email protected]>

speed up hf listing

576278f

make it possible to skip index matches on stage3

b4b77f2

vsabolcec merged commit dd77a00 into beme248:main Mar 1, 2024

vsabolcec added a commit that referenced this pull request May 14, 2024

Update stats and filters #2

4f5d07e
