Extend randomize_start feature to local executor #193
Merged: guipenedo merged 2 commits into huggingface:main from justHungryMan:dev/local_randomize_start on May 24, 2024
Maybe we could move the sleep inside the base class in this case, since both the local and slurm executors will be using it?
@guipenedo That could also be a great choice. I'll fix it.
hynky1999 pushed a commit to hynky1999/datatrove that referenced this pull request on Jun 13, 2024
commit dc2e34cdfeffb51efe019fb4ebdd8021c31d7d2c Merge: 048173a 7693b75 Author: Hynek Kydlicek Date: Thu Jun 13 13:16:29 2024 +0000 Merge remote-tracking branch 'upstream/main' into summary_stats
commit 048173a51d7a9bfa72df8fd1a24f08ec2a39d829 Merge: 30b0e5f a118cdb Author: Hynek Kydlicek Date: Thu Jun 13 13:12:15 2024 +0000 Merge remote-tracking branch 'upstream/main' into summary_stats
commit 7693b75 Author: sungjun lee Date: Wed Jun 12 16:41:49 2024 +0900 Add label_only option to LanguageFilter (huggingface#210) * Add label_only option to LanguageFilter * Update src/datatrove/pipeline/filters/language_filter.py --------- Co-authored-by: sungjun.lee Co-authored-by: Guilherme Penedo
commit 8c5df55 Author: Luc Georges Date: Mon Jun 10 11:33:28 2024 +0200 fix(ci): remove unnecessary permissions (huggingface#212)
commit bb3c4fe Author: Luc Georges Date: Mon Jun 10 10:36:57 2024 +0200 feat(ci): add trufflehog secrets detection (huggingface#211)
commit a118cdb Author: sungjun lee Date: Tue Jun 4 22:57:57 2024 +0900 follow up recent commit about randomize_start_duration (huggingface#207)
commit e9963f6 Author: Antoni-Joan Solergibert Date: Sat Jun 1 18:10:54 2024 +0200 Issues w/ DatatroveFolderDataset (huggingface#203) * Pattern str * Fixed filename_pattern str * removed cyclic indexing of dataset * Added __del__ method to close files
commit c15dafd Author: Thomas Wolf Date: Sat Jun 1 17:54:33 2024 +0200 Update README.md
commit 0f2c69f Author: sungjun lee Date: Tue May 28 23:58:50 2024 +0900 Update the randomize_start argument to randomize_start_duration to accept an integer specifying the maximum number of seconds to delay the start of each task. (huggingface#199)
commit 30b0e5f Author: Hynek Kydlicek Date: Tue May 28 06:09:57 2024 +0000 cache stats
commit 6f6deeb Author: Guilherme Penedo Date: Mon May 27 14:16:47 2024 +0200 Update CITATION.cff
commit 530aa9c Author: Guilherme Penedo Date: Mon May 27 14:16:20 2024 +0200 Update README.md
commit 8dbbb41 Author: sungjun lee Date: Mon May 27 17:51:07 2024 +0900 Add description for randomize_start (huggingface#194) * Add ramdomize_start feature in local executor * move randomize_start to global class in base, PipelineExecutor * Update randomize_start description in README.md and executor/base.py
commit fa663fa Author: sungjun lee Date: Fri May 24 18:43:45 2024 +0900 Extend randomize_start feature to local executor (huggingface#193) * Add ramdomize_start feature in local executor * move randomize_start to global class in base, PipelineExecutor
commit 01ea445 Merge: 202f99e 83dc55c Author: Hynek Kydlíček Date: Fri May 24 09:26:11 2024 +0200 Merge pull request huggingface#192 from justHungryMan/dev/typo_fix Fix snapshot representation and numeric conversion in example Code (fineweb)
commit 83dc55c Author: justhungryman Date: Fri May 24 16:11:00 2024 +0900 Corrected the snapshot from '2O23-5O' to '2023-50' and converted alphabetic representations to numeric to avoid confusion.
commit df62cb6 Author: Hynek Kydlíček Date: Thu May 23 18:37:14 2024 +0200 fix tests
commit db37c84 Author: Hynek Kydlíček Date: Thu May 23 14:29:51 2024 +0200 kenlm
commit a00f79f Author: Hynek Kydlíček Date: Thu May 23 12:27:47 2024 +0200 readable deps
commit 202f99e Author: Guilherme Penedo Date: Wed May 22 21:15:54 2024 +0200 Migrate pipeline blocks to new word tokenizers (huggingface#189) * add dependencies and lazy load tokenizers * clear up sent dedup * rename load_tokenizer and ensure dependencies * change all nltk usages to multilingual word tokenizer
commit 912b876 Author: Hynek Kydlíček Date: Wed May 22 18:21:25 2024 +0200 migrate to new tokenizer
commit 498f321 Merge: b160af0 6bc8144 Author: Hynek Kydlíček Date: Wed May 22 17:52:39 2024 +0200 Merge pull request huggingface#191 from huggingface/url_dedup_index Url Index + missing hash_config struct inference
commit 08f45a6 Merge: e982471 3647626 Author: Hynek Kydlíček Date: Wed May 22 17:46:07 2024 +0200 Merge remote-tracking branch 'origin/use-multilingual-tokenizers' into summary_stats
commit 6bc8144 Author: Hynek Kydlíček Date: Wed May 22 17:42:38 2024 +0200 add url index, use hash_cfg in sentence dedup
commit 3647626 Author: guipenedo Date: Wed May 22 16:24:57 2024 +0200 change all nltk usages to multilingual word tokenizer
commit e982471 Merge: 890df3f b160af0 Author: Hynek Kydlíček Date: Wed May 22 16:23:14 2024 +0200 Merge remote-tracking branch 'origin' into summary_stats
commit 890df3f Author: Hynek Kydlíček Date: Wed May 22 16:20:40 2024 +0200 fmt
commit 816f1fa Author: Hynek Kydlíček Date: Wed May 22 16:19:47 2024 +0200 add kenlm deps
commit ee9ea79 Author: Hynek Kydlicek Date: Wed May 22 14:08:20 2024 +0000 fix paths + perplexity counting utils
commit f35e8e7 Author: guipenedo Date: Wed May 22 15:35:08 2024 +0200 rename load_tokenizer and ensure dependencies
commit 8a7eda5 Author: guipenedo Date: Wed May 22 15:22:58 2024 +0200 clear up sent dedup
commit 9b8ce85 Author: guipenedo Date: Wed May 22 11:01:36 2024 +0200 add dependencies and lazy load tokenizers
commit b160af0 Author: Guilherme Penedo Date: Wed May 22 15:21:22 2024 +0200 add uv to resolve dependencies (huggingface#188)
commit d62da07 Author: vsabolcec Date: Wed May 22 15:17:39 2024 +0200 Add more word tokenizers (huggingface#187) * Add more languages * Merge pull request huggingface#6 from Kesta-bos/multilingual Add KiwiTokenizer for Korean Tokenizing * Add sent and span tokenize to KiwiTokenizer * Catalan as proxy for Occitan Co-authored-by: Guilherme Penedo * siple_span_tokenize generator * Note for better tokenizers * Stanza don't redownload models * Remove GeorgianTokenizer --------- Co-authored-by: beme248 Co-authored-by: Guilherme Penedo
commit ad7a3d2 Merge: b212336 c42ee2b Author: Hynek Kydlicek Date: Wed May 22 13:08:08 2024 +0000 Merge remote-tracking branch 'upstream/summary_stats' into summary_stats
commit b212336 Author: Hynek Kydlicek Date: Wed May 22 13:04:39 2024 +0000 update kenlm
commit 20e3f79 Author: Hynek Kydlíček Date: Wed May 22 15:02:41 2024 +0200 new stats
commit 009d392 Author: Hynek Kydlíček Date: Wed May 22 11:45:05 2024 +0200 tmp sync
commit 71c77a4 Author: beme248 Date: Tue May 21 18:30:18 2024 +0200 [WIP] Multi-Lingual Tokenization (huggingface#147) * Add tokenizers * Type fix * English- and Korean-only tokenization * English tokenizer test * Add multilang tokenizer to Gopher quality filter * Require language metadata in Gopher quality * Lazy-load and separate dependencies * Move top-level import to classes * Remove print in tests * pyproject.toml: tokenization -> multilingual * Add sent_tokenize, and strip whitespaces * Move word_tokenizers.py, remove MultilingualWordTokenizer, add load_tokenizer * Add span_tokenize --------- Co-authored-by: vsabolcec Co-authored-by: vsabolcec
commit 1026804 Author: Guilherme Penedo Date: Tue May 21 18:30:01 2024 +0200 Migrate dedup to xxhash (huggingface#179) * Extract hash config * fix sent dedup * add hashing configs to the tests * Refactor MinhashConfig and HashConfig classes * fix xxhash optionality * remove duplicate sha * Delete remove_duplicate_k_gram.py * consistent name for file_name * remove done todo * Fix file_stem type conversion in MinhashDedupBuckets * readd path compression * removed DEFAULT_ objects and fixed missing np datatype from hashconfig for decont * fix tests * Update src/datatrove/pipeline/dedup/minhash.py Co-authored-by: Guilherme Penedo * Update src/datatrove/pipeline/dedup/url_dedup.py * Update src/datatrove/pipeline/dedup/sentence_dedup.py --------- Co-authored-by: Hynek Kydlíček
commit 777352d Author: Guilherme Penedo Date: Fri May 17 16:35:55 2024 +0200 Make colorization configurable for both files and console output (huggingface#185) * add colorize toggles for both console output and log files * fix colorization of final output messages * globally change colorization on executor init * fix duplicated logging sink * fix duplicated logging sink * document new colorize options * change colorization to env variables only * bugfix * nit * nit
commit c42ee2b Author: Hynek Kydlíček Date: Wed May 15 19:11:41 2024 +0200 consistent params
commit 122ef2f Author: Hynek Kydlíček Date: Wed May 15 18:49:32 2024 +0200 update lang_stats to usse metadata
commit 4879162 Author: Hynek Kydlíček Date: Wed May 15 18:34:09 2024 +0200 update doc metadata with stats
commit bd6d3a6 Author: Hynek Kydlíček Date: Wed May 15 18:32:54 2024 +0200 Replace old stats with new summary stats
commit 0af2821 Author: Hynek Kydlíček Date: Wed May 15 18:20:30 2024 +0200 Update tests/pipeline/test_stats.py Co-authored-by: Guilherme Penedo
commit b96922b Author: Hynek Kydlíček Date: Wed May 15 18:18:16 2024 +0200 review fixes
commit e4326f5 Author: Hynek Kydlíček Date: Fri May 10 17:03:26 2024 +0200 fmt
commit 224f4f8 Author: Hynek Kydlicek Date: Fri May 10 14:56:36 2024 +0000 fix termin mark ratio
commit 9f45c62 Merge: 9cb9c46 3033936 Author: Hynek Kydlicek Date: Fri May 10 14:55:30 2024 +0000 Merge branch 'summary_stats' of github.com:hynky1999/datatrove into summary_stats
commit 3033936 Author: Hynek Kydlíček Date: Fri May 10 15:58:11 2024 +0200 fix readme
commit 2b7f87f Author: Hynek Kydlíček Date: Fri May 10 15:57:59 2024 +0200 refactor config
commit 828d367 Merge: 3c9be60 4d83342 Author: Hynek Kydlíček Date: Fri May 10 15:20:54 2024 +0200 Merge remote-tracking branch 'origin/main' into pr/hynky1999/158
commit 3c9be60 Author: Hynek Kydlíček Date: Fri May 10 15:19:35 2024 +0200 reuse lid also for filtering
commit 9cb9c46 Author: Hynek Kydlicek Date: Mon May 6 21:17:48 2024 +0000 fff
commit a4849df Author: Hynek Kydlíček Date: Mon Apr 29 13:57:07 2024 +0200 lang stats
commit f0982b7 Author: Hynek Kydlíček Date: Mon Apr 29 13:28:29 2024 +0200 lang stats
commit 4c7d2dd Author: Hynek Kydlíček Date: Sat Apr 20 15:52:12 2024 +0200 fix toc
commit f6723ad Author: Hynek Kydlíček Date: Sat Apr 20 15:46:13 2024 +0200 better docs + requirments update for blocks
commit e571b8d Author: Hynek Kydlíček Date: Sat Apr 20 15:22:51 2024 +0200 fix readme addition to toc
commit b09dd56 Author: Hynek Kydlíček Date: Sat Apr 20 15:15:03 2024 +0200 Add stats section to readme
commit de83657 Author: Hynek Kydlíček Date: Sat Apr 20 14:59:58 2024 +0200 Refactor summary_stats module
commit 392c7bf Author: Hynek Kydlíček Date: Sat Apr 20 14:32:56 2024 +0200 Add new summary stats classes
Extend the randomize_start option, which is already available in the slurm executor, to the local executor. At first glance, randomize_start may seem unnecessary for local runs. However, some scenarios, such as processing WARCs directly from Common Crawl's S3, show why it matters. If too many S3 API calls are issued simultaneously, the permitted request rate limit can be exceeded, which may result in requests being blocked. Running tasks sequentially or with randomized start times through the executor mitigates this issue.
The addition of this feature to the local executor was prompted by running the fineweb examples locally.
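To illustrate the idea discussed in this PR, here is a minimal sketch of a randomized start delay that lives in a shared base class so every executor subclass inherits it. This is not datatrove's actual implementation: `BaseExecutor`, `LocalLikeExecutor`, `max_delay_seconds`, and the injectable `sleep` and `rng` parameters are hypothetical names chosen for the example.

```python
import random
import time
from abc import ABC, abstractmethod


class BaseExecutor(ABC):
    """Hypothetical base class; NOT datatrove's real PipelineExecutor."""

    def __init__(self, randomize_start=False, max_delay_seconds=180.0,
                 sleep=time.sleep, rng=None):
        self.randomize_start = randomize_start
        self.max_delay_seconds = max_delay_seconds  # assumed upper bound for the delay
        self._sleep = sleep                         # injectable so tests need not wait
        self._rng = rng or random.Random()

    def _maybe_delay_start(self):
        """Sleep for a random duration so tasks do not all hit S3 at the same moment."""
        if not self.randomize_start:
            return 0.0
        delay = self._rng.uniform(0, self.max_delay_seconds)
        self._sleep(delay)
        return delay

    def run_task(self, rank):
        # The delay lives in the base class, so every subclass gets it for free.
        self._maybe_delay_start()
        return self._process(rank)

    @abstractmethod
    def _process(self, rank):
        """Subclass-specific work for one task."""


class LocalLikeExecutor(BaseExecutor):
    """Hypothetical local executor; a slurm-style subclass would look the same."""

    def _process(self, rank):
        return f"processed shard {rank}"


if __name__ == "__main__":
    # Tiny delay bound so the demo finishes quickly.
    executor = LocalLikeExecutor(randomize_start=True, max_delay_seconds=0.05)
    print(executor.run_task(0))
```

Keeping the sleep in the base class means the local and slurm-style executors cannot drift apart in how they stagger task starts, which is the design choice suggested in the review comment.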