Allow an integer parameter for 'randomize_start' in executor/base.py #199

Merged

justHungryMan
Contributor

Follow up on #197

…cept an integer specifying the maximum number of seconds to delay the start of each task.
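For readers unfamiliar with the idea, here is a minimal sketch of what a randomized start delay with an integer upper bound could look like. It is an illustration only, not the code merged in this PR: the function name `randomized_start_delay` and its `rng` and `sleep` parameters are hypothetical, and only the `randomize_start_duration` name comes from the commit message. The assumption is that a value of 0 disables the delay and that the delay is a whole number of seconds drawn uniformly from 0 up to the given maximum.

```python
import random
import time


def randomized_start_delay(randomize_start_duration=0, rng=None, sleep=time.sleep):
    """Wait a random whole number of seconds before a task starts.

    Hypothetical helper, not datatrove's API. `randomize_start_duration`
    is the maximum delay in seconds; 0 is assumed to mean "no delay".
    Returns the number of seconds that were slept.
    """
    if randomize_start_duration < 0:
        raise ValueError("randomize_start_duration must be >= 0")
    if randomize_start_duration == 0:
        return 0
    if rng is None:
        rng = random  # fall back to the module-level generator
    # randint is inclusive on both ends, so the delay can equal the maximum.
    delay = rng.randint(0, randomize_start_duration)
    sleep(delay)
    return delay
```

Passing `sleep` as a parameter makes the helper easy to exercise without actually waiting, for example `randomized_start_delay(180, sleep=lambda s: None)`. Spreading task start times this way is a common approach to avoid many tasks hitting shared storage at the same instant.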
@guipenedo
Collaborator

LGTM, ty

@guipenedo guipenedo merged commit 0f2c69f into huggingface:main May 28, 2024
4 checks passed
hynky1999 pushed a commit to hynky1999/datatrove that referenced this pull request Jun 13, 2024
commit dc2e34cdfeffb51efe019fb4ebdd8021c31d7d2c
Merge: 048173a 7693b75
Author: Hynek Kydlicek <[email protected]>
Date:   Thu Jun 13 13:16:29 2024 +0000

    Merge remote-tracking branch 'upstream/main' into summary_stats

commit 048173a51d7a9bfa72df8fd1a24f08ec2a39d829
Merge: 30b0e5f a118cdb
Author: Hynek Kydlicek <[email protected]>
Date:   Thu Jun 13 13:12:15 2024 +0000

    Merge remote-tracking branch 'upstream/main' into summary_stats

commit 7693b75
Author: sungjun lee <[email protected]>
Date:   Wed Jun 12 16:41:49 2024 +0900

    Add label_only option to LanguageFilter (huggingface#210)

    * Add label_only option to LanguageFilter

    * Update src/datatrove/pipeline/filters/language_filter.py

    ---------

    Co-authored-by: sungjun.lee <[email protected]>
    Co-authored-by: Guilherme Penedo <[email protected]>

commit 8c5df55
Author: Luc Georges <[email protected]>
Date:   Mon Jun 10 11:33:28 2024 +0200

    fix(ci): remove unnecessary permissions (huggingface#212)

commit bb3c4fe
Author: Luc Georges <[email protected]>
Date:   Mon Jun 10 10:36:57 2024 +0200

    feat(ci): add trufflehog secrets detection (huggingface#211)

commit a118cdb
Author: sungjun lee <[email protected]>
Date:   Tue Jun 4 22:57:57 2024 +0900

    follow up recent commit about randomize_start_duration (huggingface#207)

commit e9963f6
Author: Antoni-Joan Solergibert <[email protected]>
Date:   Sat Jun 1 18:10:54 2024 +0200

    Issues w/ DatatroveFolderDataset (huggingface#203)

    * Pattern str

    * Fixed filename_pattern str

    * removed cyclic indexing of dataset

    * Added __del__ method to close files

commit c15dafd
Author: Thomas Wolf <[email protected]>
Date:   Sat Jun 1 17:54:33 2024 +0200

    Update README.md

commit 0f2c69f
Author: sungjun lee <[email protected]>
Date:   Tue May 28 23:58:50 2024 +0900

    Update the randomize_start argument to randomize_start_duration to accept an integer specifying the maximum number of seconds to delay the start of each task. (huggingface#199)

commit 30b0e5f
Author: Hynek Kydlicek <[email protected]>
Date:   Tue May 28 06:09:57 2024 +0000

    cache stats

commit 6f6deeb
Author: Guilherme Penedo <[email protected]>
Date:   Mon May 27 14:16:47 2024 +0200

    Update CITATION.cff

commit 530aa9c
Author: Guilherme Penedo <[email protected]>
Date:   Mon May 27 14:16:20 2024 +0200

    Update README.md

commit 8dbbb41
Author: sungjun lee <[email protected]>
Date:   Mon May 27 17:51:07 2024 +0900

    Add description for randomize_start (huggingface#194)

    * Add randomize_start feature in local executor

    * move randomize_start to global class in base, PipelineExecutor

    * Update randomize_start description in README.md and executor/base.py

commit fa663fa
Author: sungjun lee <[email protected]>
Date:   Fri May 24 18:43:45 2024 +0900

    Extend randomize_start feature to local executor (huggingface#193)

    * Add randomize_start feature in local executor

    * move randomize_start to global class in base, PipelineExecutor

commit 01ea445
Merge: 202f99e 83dc55c
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 24 09:26:11 2024 +0200

    Merge pull request huggingface#192 from justHungryMan/dev/typo_fix

    Fix snapshot representation and numeric conversion in example Code (fineweb)

commit 83dc55c
Author: justhungryman <[email protected]>
Date:   Fri May 24 16:11:00 2024 +0900

    Corrected the snapshot from '2O23-5O' to '2023-50' and converted alphabetic representations to numeric to avoid confusion.

commit df62cb6
Author: Hynek Kydlíček <[email protected]>
Date:   Thu May 23 18:37:14 2024 +0200

    fix tests

commit db37c84
Author: Hynek Kydlíček <[email protected]>
Date:   Thu May 23 14:29:51 2024 +0200

    kenlm

commit a00f79f
Author: Hynek Kydlíček <[email protected]>
Date:   Thu May 23 12:27:47 2024 +0200

    readable deps

commit 202f99e
Author: Guilherme Penedo <[email protected]>
Date:   Wed May 22 21:15:54 2024 +0200

    Migrate pipeline blocks to new word tokenizers (huggingface#189)

    * add dependencies and lazy load tokenizers

    * clear up sent dedup

    * rename load_tokenizer and ensure dependencies

    * change all nltk usages to multilingual word tokenizer

commit 912b876
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 18:21:25 2024 +0200

    migrate to new tokenizer

commit 498f321
Merge: b160af0 6bc8144
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 17:52:39 2024 +0200

    Merge pull request huggingface#191 from huggingface/url_dedup_index

    Url Index + missing hash_config struct inference

commit 08f45a6
Merge: e982471 3647626
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 17:46:07 2024 +0200

    Merge remote-tracking branch 'origin/use-multilingual-tokenizers' into summary_stats

commit 6bc8144
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 17:42:38 2024 +0200

    add url index, use hash_cfg in sentence dedup

commit 3647626
Author: guipenedo <[email protected]>
Date:   Wed May 22 16:24:57 2024 +0200

    change all nltk usages to multilingual word tokenizer

commit e982471
Merge: 890df3f b160af0
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 16:23:14 2024 +0200

    Merge remote-tracking branch 'origin' into summary_stats

commit 890df3f
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 16:20:40 2024 +0200

    fmt

commit 816f1fa
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 16:19:47 2024 +0200

    add kenlm deps

commit ee9ea79
Author: Hynek Kydlicek <[email protected]>
Date:   Wed May 22 14:08:20 2024 +0000

    fix paths + perplexity counting utils

commit f35e8e7
Author: guipenedo <[email protected]>
Date:   Wed May 22 15:35:08 2024 +0200

    rename load_tokenizer and ensure dependencies

commit 8a7eda5
Author: guipenedo <[email protected]>
Date:   Wed May 22 15:22:58 2024 +0200

    clear up sent dedup

commit 9b8ce85
Author: guipenedo <[email protected]>
Date:   Wed May 22 11:01:36 2024 +0200

    add dependencies and lazy load tokenizers

commit b160af0
Author: Guilherme Penedo <[email protected]>
Date:   Wed May 22 15:21:22 2024 +0200

    add uv to resolve dependencies (huggingface#188)

commit d62da07
Author: vsabolcec <[email protected]>
Date:   Wed May 22 15:17:39 2024 +0200

    Add more word tokenizers (huggingface#187)

    * Add more languages

    * Merge pull request huggingface#6 from Kesta-bos/multilingual

    Add KiwiTokenizer for Korean Tokenizing

    * Add sent and span tokenize to KiwiTokenizer

    * Catalan as proxy for Occitan

    Co-authored-by: Guilherme Penedo <[email protected]>

    * siple_span_tokenize generator

    * Note for better tokenizers

    * Stanza don't redownload models

    * Remove GeorgianTokenizer

    ---------

    Co-authored-by: beme248 <[email protected]>
    Co-authored-by: Guilherme Penedo <[email protected]>

commit ad7a3d2
Merge: b212336 c42ee2b
Author: Hynek Kydlicek <[email protected]>
Date:   Wed May 22 13:08:08 2024 +0000

    Merge remote-tracking branch 'upstream/summary_stats' into summary_stats

commit b212336
Author: Hynek Kydlicek <[email protected]>
Date:   Wed May 22 13:04:39 2024 +0000

    update kenlm

commit 20e3f79
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 15:02:41 2024 +0200

    new stats

commit 009d392
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 22 11:45:05 2024 +0200

    tmp sync

commit 71c77a4
Author: beme248 <[email protected]>
Date:   Tue May 21 18:30:18 2024 +0200

    [WIP] Multi-Lingual Tokenization (huggingface#147)

    * Add tokenizers

    * Type fix

    * English- and Korean-only tokenization

    * English tokenizer test

    * Add multilang tokenizer to Gopher quality filter

    * Require language metadata in Gopher quality

    * Lazy-load and separate dependencies

    * Move top-level import to classes

    * Remove print in tests

    * pyproject.toml: tokenization -> multilingual

    * Add sent_tokenize, and strip whitespaces

    * Move word_tokenizers.py, remove MultilingualWordTokenizer, add load_tokenizer

    * Add span_tokenize

    ---------

    Co-authored-by: vsabolcec <[email protected]>
    Co-authored-by: vsabolcec <[email protected]>

commit 1026804
Author: Guilherme Penedo <[email protected]>
Date:   Tue May 21 18:30:01 2024 +0200

    Migrate dedup to xxhash (huggingface#179)

    * Extract hash config

    * fix sent dedup

    * add hashing configs to the tests

    * Refactor MinhashConfig and HashConfig classes

    * fix xxhash optionality

    * remove duplicate sha

    * Delete remove_duplicate_k_gram.py

    * consistent name for file_name

    * remove done todo

    * Fix file_stem type conversion in MinhashDedupBuckets

    * readd path compression

    * removed DEFAULT_ objects and fixed missing np datatype from hashconfig for decont

    * fix tests

    * Update src/datatrove/pipeline/dedup/minhash.py

    Co-authored-by: Guilherme Penedo <[email protected]>

    * Update src/datatrove/pipeline/dedup/url_dedup.py

    * Update src/datatrove/pipeline/dedup/sentence_dedup.py

    ---------

    Co-authored-by: Hynek Kydlíček <[email protected]>

commit 777352d
Author: Guilherme Penedo <[email protected]>
Date:   Fri May 17 16:35:55 2024 +0200

    Make colorization configurable for both files and console output (huggingface#185)

    * add colorize toggles for both console output and log files

    * fix colorization of final output messages

    * globally change colorization on executor init

    * fix duplicated logging sink

    * fix duplicated logging sink

    * document new colorize options

    * change colorization to env variables only

    * bugfix

    * nit

    * nit

commit c42ee2b
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 19:11:41 2024 +0200

    consistent params

commit 122ef2f
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 18:49:32 2024 +0200

    update lang_stats to use metadata

commit 4879162
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 18:34:09 2024 +0200

    update doc metadata with stats

commit bd6d3a6
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 18:32:54 2024 +0200

    Replace old stats with new summary stats

commit 0af2821
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 18:20:30 2024 +0200

    Update tests/pipeline/test_stats.py

    Co-authored-by: Guilherme Penedo <[email protected]>

commit b96922b
Author: Hynek Kydlíček <[email protected]>
Date:   Wed May 15 18:18:16 2024 +0200

    review fixes

commit e4326f5
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 10 17:03:26 2024 +0200

    fmt

commit 224f4f8
Author: Hynek Kydlicek <[email protected]>
Date:   Fri May 10 14:56:36 2024 +0000

    fix termin mark ratio

commit 9f45c62
Merge: 9cb9c46 3033936
Author: Hynek Kydlicek <[email protected]>
Date:   Fri May 10 14:55:30 2024 +0000

    Merge branch 'summary_stats' of github.com:hynky1999/datatrove into summary_stats

commit 3033936
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 10 15:58:11 2024 +0200

    fix readme

commit 2b7f87f
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 10 15:57:59 2024 +0200

    refactor config

commit 828d367
Merge: 3c9be60 4d83342
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 10 15:20:54 2024 +0200

    Merge remote-tracking branch 'origin/main' into pr/hynky1999/158

commit 3c9be60
Author: Hynek Kydlíček <[email protected]>
Date:   Fri May 10 15:19:35 2024 +0200

    reuse lid also for filtering

commit 9cb9c46
Author: Hynek Kydlicek <[email protected]>
Date:   Mon May 6 21:17:48 2024 +0000

    fff

commit a4849df
Author: Hynek Kydlíček <[email protected]>
Date:   Mon Apr 29 13:57:07 2024 +0200

    lang stats

commit f0982b7
Author: Hynek Kydlíček <[email protected]>
Date:   Mon Apr 29 13:28:29 2024 +0200

    lang stats

commit 4c7d2dd
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 15:52:12 2024 +0200

    fix toc

commit f6723ad
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 15:46:13 2024 +0200

    better docs + requirements update for blocks

commit e571b8d
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 15:22:51 2024 +0200

    fix readme addition to toc

commit b09dd56
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 15:15:03 2024 +0200

    Add stats section to readme

commit de83657
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 14:59:58 2024 +0200

    Refactor summary_stats module

commit 392c7bf
Author: Hynek Kydlíček <[email protected]>
Date:   Sat Apr 20 14:32:56 2024 +0200

    Add new summary stats classes