[WIP] Introduce Multi-lingual Support #1

Draft: wants to merge 227 commits into base: main
Commits (227)
77cf624
Wiki-language statistics + example
vsabolcec Feb 12, 2024
4134451
Initial multilingual Gopher quality filter + example
vsabolcec Feb 12, 2024
2d0958e
HuggingFace token via env
vsabolcec Feb 13, 2024
63d2a7a
Rework process_fineweb.py example
vsabolcec Feb 13, 2024
6d0cd05
Datatrove language stats
vsabolcec Feb 13, 2024
23ebd80
Rework examples to new pipeline blocks
vsabolcec Feb 13, 2024
17e3b90
Initial README for multilingual preprocessing
vsabolcec Feb 13, 2024
90f8ad8
Slurm example update
vsabolcec Feb 13, 2024
2a71ad1
README update
vsabolcec Feb 13, 2024
fae2b05
Update examples and lang_stats.json
vsabolcec Feb 14, 2024
5f590d2
Add exclusion_writer to example
vsabolcec Feb 14, 2024
e2d8dbf
Update README
vsabolcec Feb 14, 2024
3fd2e9d
Update README
vsabolcec Feb 14, 2024
ead90bc
Unify tokenization
vsabolcec Feb 15, 2024
59edccf
README
vsabolcec Feb 15, 2024
c35edea
Word tokenizer support
vsabolcec Feb 20, 2024
8fb2133
Formatting
vsabolcec Feb 20, 2024
beeca91
Add stanza as dependency
vsabolcec Feb 20, 2024
ca4dd1d
Additional language statistics
vsabolcec Feb 20, 2024
9746b4a
Formatting
vsabolcec Feb 20, 2024
bbaa097
Lazy load stanza tokenizer
vsabolcec Feb 21, 2024
5279b2d
Word length distribution stat
vsabolcec Feb 21, 2024
9f2e8f9
Move q() in scope
vsabolcec Feb 21, 2024
ccc9347
Word statistics
vsabolcec Feb 22, 2024
1ccfd8e
Typo
vsabolcec Feb 22, 2024
d549b51
Ignore word case in counter
vsabolcec Feb 22, 2024
17b2c6c
Update lang_stats and README
vsabolcec Feb 22, 2024
572a19e
Word counter: optional prune
vsabolcec Feb 26, 2024
74cdba5
Top 10000 words
vsabolcec Feb 26, 2024
695ce52
Additional language stats
vsabolcec Feb 26, 2024
0c7e8d5
Remove util
vsabolcec Feb 26, 2024
2391880
Format
vsabolcec Feb 26, 2024
8e03924
Stats
vsabolcec Feb 26, 2024
2686a2d
cc_news stats
vsabolcec Feb 26, 2024
2d5b7a6
Stats
vsabolcec Feb 26, 2024
53824b6
Word stats to file
vsabolcec Feb 26, 2024
d10e1d7
Updated stats
vsabolcec Feb 26, 2024
c2e4ae5
ShuffledHFDataset, wiki sanity check
vsabolcec Feb 26, 2024
33e225f
Named parameter
vsabolcec Feb 26, 2024
61a957b
Division by zero
vsabolcec Feb 26, 2024
e60c2f0
Division by zero: different
vsabolcec Feb 26, 2024
b07fe8b
Updated stats
vsabolcec Feb 26, 2024
100ad99
Change thresholds
vsabolcec Feb 27, 2024
afdcce6
Update stats
vsabolcec Feb 27, 2024
bcab477
Sanity check and updated stats
vsabolcec Feb 27, 2024
ba8ac7c
Sanity check update
vsabolcec Feb 28, 2024
3d2906f
Stats: count UTF-8 bytes
vsabolcec Feb 29, 2024
a7dc867
Commoncrawl stats, format
vsabolcec Feb 29, 2024
67c49b4
Fix typo
vsabolcec Feb 29, 2024
451470a
Faster doc stats
vsabolcec Feb 29, 2024
8080148
Autoformat
vsabolcec Mar 1, 2024
6282cee
MultilingualGopherFilter test
vsabolcec Mar 1, 2024
b5c208b
Merge remote-tracking branch 'upstream/main' into multilingual
vsabolcec Mar 1, 2024
858585b
adjust multi-lingual
Mar 1, 2024
ff23b94
bugfix mailuser
guipenedo Mar 4, 2024
6baacca
Add `jobs_status` command. (#113)
lvwerra Mar 5, 2024
366c4cc
bugfix recursive
guipenedo Mar 5, 2024
df0e324
nit
guipenedo Mar 5, 2024
7dbf8ec
bugfix tokenization unshuffled cleanup
guipenedo Mar 5, 2024
6a15cab
Re-enable `datasets` test (#114)
mariosasko Mar 5, 2024
486acf0
nltk stats
Mar 5, 2024
fe391a7
Merge branch 'multilingual' into bm/run-pipelines
Mar 5, 2024
e1f9524
Update warc.py (#115)
jordane95 Mar 6, 2024
97090c5
Merge remote-tracking branch 'upstream/main' into multilingual
vsabolcec Mar 7, 2024
a98aafd
bugfix doc_len and doc_len_tokens means and std_dev
guipenedo Mar 8, 2024
ba34a3b
Add std to language stats
vsabolcec Mar 11, 2024
8027b16
Update language stats
vsabolcec Mar 11, 2024
eee0935
Clean-up sanity_check_wiki.py
vsabolcec Mar 11, 2024
9086eff
max_non_alpha_words_ratio per language filter
vsabolcec Mar 11, 2024
4966a6a
Fix typo
vsabolcec Mar 11, 2024
6809caa
Fix no DocumentStats
vsabolcec Mar 11, 2024
c66052d
Sanity check: min. 1 word
vsabolcec Mar 11, 2024
a7ab410
Add new languages
vsabolcec Mar 12, 2024
72c0cb8
Add word tokenizers
vsabolcec Mar 12, 2024
6539ca6
Top 50 language stats
vsabolcec Mar 13, 2024
03272f1
Add more word tokenizers
vsabolcec Mar 13, 2024
9d3e505
Change LanguageStatsReducer behaviour
vsabolcec Mar 13, 2024
f059380
Add LanguageStatsReducer default reduce
vsabolcec Mar 13, 2024
20b4b44
max_non_alpha_words_ratio in language statistics
vsabolcec Mar 13, 2024
f7ffa27
Update language stats
vsabolcec Mar 13, 2024
6ded809
draft
Mar 13, 2024
0cecc70
Merge branch 'multilingual' into bm/run-pipelines
Mar 13, 2024
505dcac
backup
Mar 13, 2024
8b78489
Update language stats
vsabolcec Mar 13, 2024
5ab7f19
add languages
Mar 13, 2024
9bef960
Merge branch 'multilingual' into bm/run-pipelines
Mar 13, 2024
9d0bc00
More tokenizers and SpaCy switch
vsabolcec Mar 14, 2024
38814ae
Add tokenizers, update stat script
vsabolcec Mar 14, 2024
9399c7c
Revert some tokenizers to stanza
vsabolcec Mar 14, 2024
c721385
Tweak SpaCy
vsabolcec Mar 14, 2024
5fa4927
95 tokenizers + test
vsabolcec Mar 14, 2024
47152be
Format code
vsabolcec Mar 14, 2024
18a5441
Update language stats
vsabolcec Mar 14, 2024
37a84f2
Top 100 language stats
vsabolcec Mar 14, 2024
5c9e4d5
draft 2
Mar 14, 2024
7444a56
draft 3
Mar 14, 2024
4a797f1
Spacy tokenizer ignore whitespaces
vsabolcec Mar 15, 2024
23c0e66
Format code
vsabolcec Mar 15, 2024
e2f5096
draft 5
Mar 15, 2024
db248b6
Update language stats
vsabolcec Mar 15, 2024
abdae22
Merge branch 'multilingual' into bm/run-pipelines
Mar 15, 2024
0a54ba6
Update language stats for top 100 languages
vsabolcec Mar 15, 2024
705e412
draft 6
Mar 15, 2024
8e0fa3d
Merge branch 'multilingual' into bm/run-pipelines
Mar 15, 2024
aadb17f
IndicNLP tokenizer fix
vsabolcec Mar 15, 2024
fc229cf
darft 9
Mar 15, 2024
3a00b5a
Merge branch 'multilingual' into bm/run-pipelines
Mar 15, 2024
f6ec925
Word tokenizer test
vsabolcec Mar 15, 2024
895b0e8
run spacy v2 top100
Mar 15, 2024
6d708bd
Bug fix: when file is empty (#126)
jordane95 Mar 15, 2024
f2a0ee0
fix
Mar 16, 2024
ea7c10b
add aggregated stats
Mar 16, 2024
c795863
update stats
Mar 17, 2024
69bbbb8
Korean and Thai faster tokenizers
vsabolcec Mar 18, 2024
21d0607
Split language statistics into multiple files
vsabolcec Mar 18, 2024
b9bd9fb
Split language statistics into multiple files (2)
vsabolcec Mar 19, 2024
f5655cc
Separate filter parameters
vsabolcec Mar 19, 2024
5e673b6
Remove testing language and format code
vsabolcec Mar 19, 2024
b2020a4
Introduce LanguageStats dataclass and clean up
vsabolcec Mar 19, 2024
0bca5a8
Remove alternative lang_stats scripts
vsabolcec Mar 19, 2024
52ee2c5
LanguageStatsReducer output to yml
vsabolcec Mar 19, 2024
b98ef1c
Load tokenizer using `from_file` (#122)
guipenedo Mar 19, 2024
bc57162
add `depends=` to LocalPipelineExecutor (#100)
guipenedo Mar 19, 2024
35fc009
Remove commoncrawl_stats.py
vsabolcec Mar 19, 2024
04fb13f
Update multilingual README.md
vsabolcec Mar 19, 2024
494d0a7
Revert remove most_common(10000)
vsabolcec Mar 19, 2024
0ac5745
Convert .json language stats into .yml
vsabolcec Mar 19, 2024
1d63b82
Merge pull request #5 from beme248/bm/run-pipelines
vsabolcec Mar 19, 2024
2a4aedb
Combine branches and use .yml in process_non_english.py
vsabolcec Mar 19, 2024
644ca12
Wiki language stats RUN_MODE
vsabolcec Mar 20, 2024
e74101a
Remove testing part from Wiki language stats script
vsabolcec Mar 20, 2024
349fdb1
Update README
vsabolcec Mar 20, 2024
ba63075
Clean up
vsabolcec Mar 20, 2024
55c6b1c
Improve C4 filter and dedup (#124)
guipenedo Mar 20, 2024
27e2cea
Adds option to shuffle input files in readers (#128)
guipenedo Mar 20, 2024
b6cd366
update Trafilatura version (#130)
adbar Mar 20, 2024
8355059
Update README
vsabolcec Mar 21, 2024
304d34c
Typo
vsabolcec Mar 21, 2024
6daa5e8
Changes to text normalization + FTFY and lines symbol formatters (#133)
guipenedo Mar 22, 2024
8421fe1
removed debug print
guipenedo Mar 22, 2024
87e30a4
fix symbollinesremover regex hanging
guipenedo Mar 23, 2024
56aa210
Minor Terminology and Documentation Updates for Local Tokenizer Loadi…
justHungryMan Mar 23, 2024
c2d61de
Tweak process_non_english.py
vsabolcec Mar 27, 2024
d44a51c
Merge remote-tracking branch 'upstream/main' into pretokenization
vsabolcec Apr 2, 2024
5c6f1bc
Add tokenizers
vsabolcec Apr 2, 2024
c4d6fce
Type fix
vsabolcec Apr 2, 2024
afadc8f
add requeue and QOS slurm options (#144)
marianna13 Apr 2, 2024
944fb21
Merge branch 'huggingface:main' into pretokenization
vsabolcec Apr 3, 2024
cca0e41
English- and Korean-only tokenization
vsabolcec Apr 3, 2024
5933335
English tokenizer test
vsabolcec Apr 3, 2024
670fc40
Fix substring dedup range (#132)
jordane95 Apr 5, 2024
8c7e052
Line dedup min remove words option (#146)
guipenedo Apr 5, 2024
22cba4c
Add multilang tokenizer to Gopher quality filter
vsabolcec Apr 10, 2024
575d98f
Require language metadata in Gopher quality
vsabolcec Apr 10, 2024
48377af
Merge branch 'huggingface:main' into pretokenization
vsabolcec Apr 10, 2024
1e8c375
Lazy-load and separate dependencies
vsabolcec Apr 10, 2024
5015a4c
fix timeout related issues in extractors
guipenedo Apr 10, 2024
e6b1ccf
Move top-level import to classes
vsabolcec Apr 11, 2024
93d5c2a
Remove print in tests
vsabolcec Apr 11, 2024
005a3ef
pyproject.toml: tokenization -> multilingual
vsabolcec Apr 11, 2024
40a3a1b
New options for FastTextClassifierFilter: apply on sentence or paragr…
guipenedo Apr 11, 2024
209ebec
Url deduplication (#145)
hynky1999 Apr 12, 2024
01175ba
fetch all the labels on fasttextfilter
guipenedo Apr 15, 2024
854389f
add min_num_sentences to line dedup
guipenedo Apr 15, 2024
6614f06
fix tests
guipenedo Apr 16, 2024
7a0f6c4
Fix race conditions during download/extraction (#155)
hynky1999 Apr 16, 2024
9d75443
Adds PII removal (#156)
guipenedo Apr 16, 2024
2aa32d4
fineweb filter
hynky1999 Apr 20, 2024
c4ee193
fw filters
hynky1999 Apr 20, 2024
9a88beb
remove duplicate filter
hynky1999 Apr 20, 2024
05194d3
clean up + fineweb example
guipenedo Apr 20, 2024
447c942
added PII block
guipenedo Apr 20, 2024
50eb055
Pypi Publish Action (#159)
hynky1999 Apr 22, 2024
1724f28
Update pyproject.toml
guipenedo Apr 22, 2024
7042785
Update pypi-release.yml
guipenedo Apr 22, 2024
6d06210
Update pypi-release.yml
guipenedo Apr 22, 2024
4e9235f
Added c4 badwords filter, added batch tokenization to tokenscounter (…
guipenedo Apr 24, 2024
a8d21e2
fix documents with a lot of paragraphs being removed by the repetitio…
guipenedo Apr 26, 2024
d3f8245
FineWeb quality statistics
vsabolcec Apr 26, 2024
1ba51c3
Merge upstream/main
vsabolcec Apr 27, 2024
83077f1
Multilang tokenizer for MinHash
vsabolcec Apr 27, 2024
ae8b4de
Remove print
vsabolcec Apr 27, 2024
0c7e54a
Change Korean Tokenizer with Kiwi
Kesta-bos Apr 28, 2024
cd1587c
fix
Kesta-bos Apr 28, 2024
a687ef0
add kiwipiepy dependency
Kesta-bos Apr 28, 2024
c72b1e4
bugfix pii emails and quality filters default args
guipenedo May 2, 2024
6a4881d
Add a skip parameter to all readers (defaults to zero) (#167)
rantav May 3, 2024
15c8425
Adds n-gram based decontamination (#172)
guipenedo May 4, 2024
d56d3c5
add skip in decont index builder
guipenedo May 4, 2024
b8ce9c1
Add repetition stats and update wiki script
vsabolcec May 6, 2024
22c739e
fix for requeueing code and change minhash default
guipenedo May 6, 2024
750f0fe
Update stats
vsabolcec May 7, 2024
af647a7
Fix division by zero
vsabolcec May 7, 2024
8805e13
Handle non-method cases in to_dict conversion (#139)
justHungryMan May 7, 2024
ff50473
Adds `tasks_per_job` to slurm executor (#153)
guipenedo May 7, 2024
12f79a1
Update statistics
vsabolcec May 7, 2024
b2b96e4
Unsigned int tokenizer and srun args (#154)
marianna13 May 7, 2024
caeca43
Update filters
vsabolcec May 7, 2024
a970249
add linting to the examples folder
guipenedo May 7, 2024
4d83342
Enhance BaseReader to allow custom adapters access to instance variab…
justHungryMan May 7, 2024
b246107
Merge branch 'beme248:multilingual' into multilingual
Kesta-bos May 8, 2024
4218ec8
Merge pull request #6 from Kesta-bos/multilingual
vsabolcec May 8, 2024
9ef64e8
Format
vsabolcec May 8, 2024
a6eca69
Merge branch 'huggingface:main' into multilingual
vsabolcec May 8, 2024
ed5244a
Fix Shuffled HFReader
vsabolcec May 8, 2024
4072131
Update Korean stats and filters
vsabolcec May 8, 2024
abd1534
Better FineWeb
vsabolcec May 12, 2024
331c439
Update statistics
vsabolcec May 12, 2024
220e235
Format
vsabolcec May 12, 2024
87720e5
MultilingualFineWebQualityFilter
vsabolcec May 13, 2024
9e4b222
Merge origin/pretokenization
vsabolcec May 13, 2024
7bd1093
Lazy load and tokenize -> word_tokenize
vsabolcec May 13, 2024
ebb8f0f
sent_tokenize
vsabolcec May 13, 2024
4b97552
Move tokenization dependencies to multilingual
vsabolcec May 13, 2024
a92222a
C4 quality sent_tokenize
vsabolcec May 13, 2024
19e04d8
Langstats tokenizer fix
vsabolcec May 13, 2024
bcac6c5
Langstats tokenizer fix 2
vsabolcec May 13, 2024
545d379
Sent tokenize tweaks
vsabolcec May 13, 2024
7ae2fa7
Whitespace stripping
vsabolcec May 13, 2024
7034978
Fallback tokenizer
vsabolcec May 13, 2024
b2ae282
Khmer punctuation
vsabolcec May 14, 2024
5846fbd
Update Wiki stats script
vsabolcec May 14, 2024
c0505de
Format
vsabolcec May 14, 2024
e3936fb
Update stat and filter #1
vsabolcec May 14, 2024
2c61271
Remove nlpashto
vsabolcec May 14, 2024
4f5d07e
Update stats and filters #2
vsabolcec May 14, 2024
28b0205
Small readme fix
vsabolcec May 28, 2024
61 changes: 61 additions & 0 deletions .github/workflows/pypi-release.yml
@@ -0,0 +1,61 @@
name: PyPI release
on:
  workflow_dispatch:

jobs:
  testing:
    uses: ./.github/workflows/testing.yml
  release:
    needs: testing
    runs-on: ubuntu-latest
    env:
      TWINE_USERNAME: __token__

    steps:
      - name: Checkout Repo
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install build dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -U twine build

      - name: Build the dist files
        run: python -m build .

      - name: Publish to the test PyPI
        env:
          TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN }}
        run: twine upload dist/* --repository=testpypi

      - name: Test installing from test PyPI and running tests
        run: |
          pip install -i https://testpypi.python.org/pypi --extra-index-url https://pypi.org/simple datatrove[testing]
          python -m nltk.downloader punkt
          make test

      - name: Get tag name
        id: get_tag_name
        run: |
          echo TAG_NAME=$(grep '^version' pyproject.toml | head -1 | cut -d '"' -f 2) >> $GITHUB_OUTPUT

      - name: Tag the release
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.git.createRef({
              owner: context.repo.owner,
              repo: context.repo.repo,
              ref: 'refs/tags/v${{ steps.get_tag_name.outputs.TAG_NAME }}',
              sha: context.sha
            })

      - name: Publish to PyPI
        env:
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: twine upload dist/* --repository=pypi
7 changes: 4 additions & 3 deletions .github/workflows/ci.yml → .github/workflows/testing.yml
@@ -1,4 +1,4 @@
-name: CI
+name: Test & Check Code Quality

 on:
   pull_request:
@@ -7,6 +7,7 @@ on:
   push:
     branches:
       - main
+  workflow_call:

 jobs:
   check_code_quality:
@@ -23,8 +24,8 @@ jobs:
         pip install .[quality]
       - name: Check quality
         run: |
-          ruff check tests src # linter
-          ruff format --check tests src # formatter
+          ruff check tests src examples # linter
+          ruff format --check tests src examples # formatter

   test:
     runs-on: ubuntu-latest
176 changes: 176 additions & 0 deletions examples/fineweb.py
@@ -0,0 +1,176 @@
"""
This file contains the code used to process and create the
FineWeb dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb)
"""

from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupCluster, MinhashDedupFilter, MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import MinhashConfig, MinhashDedupBuckets
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
C4QualityFilter,
FineWebQualityFilter,
GopherQualityFilter,
GopherRepetitionFilter,
LanguageFilter,
URLFilter,
)
from datatrove.pipeline.formatters import PIIFormatter
from datatrove.pipeline.readers import JsonlReader, WarcReader
from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.writers.jsonl import JsonlWriter


"""
we first ran the following pipeline for each dump
"""
DUMP_TO_PROCESS = "CC-MAIN-2O23-5O" # example

MAIN_OUTPUT_PATH = "s3://some_s3_bucket"
FILTERING_OUTPUT_PATH = f"{MAIN_OUTPUT_PATH}/base_processing"

main_processing_executor = SlurmPipelineExecutor(
job_name=f"cc_{DUMP_TO_PROCESS}",
pipeline=[
WarcReader(
f"s3://commoncrawl/crawl-data/{DUMP_TO_PROCESS}/segments/",
glob_pattern="*/warc/*", # we want the warc files
default_metadata={"dump": DUMP_TO_PROCESS},
),
URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/1_url/{DUMP_TO_PROCESS}")),
Trafilatura(favour_precision=True),
LanguageFilter(
exclusion_writer=JsonlWriter(
f"{FILTERING_OUTPUT_PATH}/2_non_english/",
output_filename="${language}/" + DUMP_TO_PROCESS + "/${rank}.jsonl.gz",
# folder structure: language/dump/file
)
),
GopherRepetitionFilter(
exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/3_gopher_rep/{DUMP_TO_PROCESS}")
),
GopherQualityFilter(
exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/4_gopher_qual/{DUMP_TO_PROCESS}")
),
C4QualityFilter(
filter_no_terminal_punct=False,
exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/5_c4/{DUMP_TO_PROCESS}"),
),
FineWebQualityFilter(
exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/6_fineweb_qual/{DUMP_TO_PROCESS}")
),
JsonlWriter(f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"),
],
tasks=8000,
time="10:00:00",
logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP_TO_PROCESS}",
slurm_logs_folder=f"logs/base_processing/{DUMP_TO_PROCESS}/slurm_logs", # must be local
randomize_start=True, # don't hit the bucket all at once with the list requests
mem_per_cpu_gb=2,
partition="hopper-cpu",
)
main_processing_executor.run()

"""
we then applied minhash deduplication to each individual dump,
"""

# you can also change ngrams or the number of buckets and their size here
minhash_config = MinhashConfig(
use_64bit_hashes=True, # better precision -> fewer false positives (collisions)
num_buckets=14,
hashes_per_bucket=8,
n_grams=5,
)

S3_MINHASH_BASE_PATH = f"{MAIN_OUTPUT_PATH}/minhash"

S3_LOGS_FOLDER = f"{MAIN_OUTPUT_PATH}/logs/minhash"
LOCAL_LOGS_FOLDER = "logs/minhash"

TOTAL_TASKS = 1000

# this is the original data that we want to deduplicate
INPUT_READER = JsonlReader(
f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"
) # this is the output from the first part

# stage 1 computes minhash signatures for each task (each task gets a set of files)
stage1 = SlurmPipelineExecutor(
job_name=f"mh1_{DUMP_TO_PROCESS}",
pipeline=[
INPUT_READER,
MinhashDedupSignature(
output_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/signatures", config=minhash_config
),
],
tasks=TOTAL_TASKS,
time="5:00:00",
partition="hopper-cpu",
logging_dir=f"{S3_LOGS_FOLDER}/signatures",
slurm_logs_folder=f"{LOCAL_LOGS_FOLDER}/signatures/slurm_logs",
randomize_start=True,
depends=main_processing_executor, # only start after the first one completes
)

stage2 = SlurmPipelineExecutor(
job_name=f"mh2_{DUMP_TO_PROCESS}",
pipeline=[
MinhashDedupBuckets(
input_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/signatures",
output_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/buckets",
config=MinhashConfig(use_64bit_hashes=True),
),
],
tasks=minhash_config.num_buckets * 50, # the code supports parallelizing each bucket. here we run 50
# workers per bucket
randomize_start=True,
logging_dir=f"{S3_LOGS_FOLDER}/buckets",
partition="hopper-cpu",
time="02:00:00",
mem_per_cpu_gb=4,
cpus_per_task=3, # you can add run more (smaller) tasks if you do not have a lot of memory
depends=stage1,
)


stage3 = SlurmPipelineExecutor(
job_name=f"mh3_{DUMP_TO_PROCESS}",
pipeline=[
MinhashDedupCluster(
input_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/buckets",
output_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/remove_ids",
config=minhash_config,
),
],
tasks=1, # this step runs on a single task
logging_dir=f"{S3_LOGS_FOLDER}/clustering",
partition="hopper-cpu",
time="30:00:00", # and can also be quite slow. Usually not this slow though
mem_per_cpu_gb=25,
cpus_per_task=8, # if you dedup a full dump, you do need a lot of memory for this one
depends=stage2,
)


stage4 = SlurmPipelineExecutor(
job_name=f"mh4_{DUMP_TO_PROCESS}",
pipeline=[
INPUT_READER,
TokensCounter(), # you can remove this one, it's just a nice way to know how many tokens we have
# before and after dedup
MinhashDedupFilter(input_folder=f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/remove_ids"),
# run the PII removal
PIIFormatter(),
JsonlWriter(f"{S3_MINHASH_BASE_PATH}/{DUMP_TO_PROCESS}/deduped_output"),
],
tasks=TOTAL_TASKS,
logging_dir=f"{S3_LOGS_FOLDER}/filtering",
partition="hopper-cpu",
time="5:00:00",
mem_per_cpu_gb=4,
depends=stage3,
)

# launch dedup pipelines
stage4.run()
122 changes: 122 additions & 0 deletions examples/multilingual/README.md
@@ -0,0 +1,122 @@
# Multilingual CommonCrawl cleaning pipeline

To extend the [RefinedWeb](https://arxiv.org/pdf/2306.01116.pdf) pipeline to multilingual data, we build on top of the `datatrove` Python library, using per-language word tokenizers and adjusting the Gopher quality filter thresholds for each language. Our implementation and filter thresholds are outlined in the sections below.


## Language-specific word tokenizers

The filters used in the cleaning pipeline are sensitive to word tokenization, which can impact the results. We therefore use several word tokenization libraries to support almost 100 languages. To process large volumes of data efficiently, we rely on fast and reliable libraries: [NLTK](https://www.nltk.org/), [SpaCy](https://spacy.io/), [Stanza](https://stanfordnlp.github.io/stanza/), [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library), [Jieba](https://github.com/fxsjy/jieba), [NLPashto](https://pypi.org/project/nlpashto/), [PyVi](https://pypi.org/project/pyvi/) and [Anbani](https://github.com/Anbani/anbani.py). For languages that aren't officially supported by these libraries, we use the tokenizer of a supported language that is written in the same script and belongs to a closely related language family.

To further analyze the implementation of word tokenizers, inspect the [word tokenizer source code](https://github.com/beme248/datatrove/blob/multilingual/src/datatrove/tools/word_tokenizers.py).
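As a rough, non-authoritative illustration of the dispatch logic, a per-language tokenizer registry with a same-script fallback might look like the sketch below. The names (`WORD_TOKENIZERS`, `register`, `get_word_tokenizer`) and the fallback mapping are assumptions for illustration, not the PR's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a language code to a word-tokenizing callable.
# The actual PR keeps this logic in src/datatrove/tools/word_tokenizers.py.
WORD_TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {}


def register(lang: str):
    def wrapper(fn: Callable[[str], List[str]]):
        WORD_TOKENIZERS[lang] = fn
        return fn

    return wrapper


@register("en")
def english_tokenizer(text: str) -> List[str]:
    from nltk.tokenize import word_tokenize  # lazy import, mirroring the PR's lazy-loading commits

    return word_tokenize(text, language="english")


@register("zh")
def chinese_tokenizer(text: str) -> List[str]:
    import jieba

    return list(jieba.cut(text))


# Languages without an officially supported tokenizer fall back to a related
# language written in the same script (the mapping below is an assumed example).
FALLBACKS = {"sco": "en"}


def get_word_tokenizer(lang: str) -> Callable[[str], List[str]]:
    lang = FALLBACKS.get(lang, lang)
    return WORD_TOKENIZERS.get(lang, WORD_TOKENIZERS["en"])
```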


## Multilingual Gopher quality filter: language-specific adjustments

In our implementation of the multilingual Gopher quality filter, we make language-specific adjustments based on the statistical analysis of the Wikipedia data for the top 100 high-resource languages.

We [extract the statistics](https://github.com/beme248/datatrove/blob/multilingual/examples/multilingual/lang_stats/wiki_lang_stats.py) for each language from its respective [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia) and analyze them with the [language statistics visualization tool](https://huggingface.co/spaces/ZR0zNqSGMI/mlo-language-statistics). By comparing the values across languages, we identify the following filter parameters as ones that should be tuned per language: `stop_words`, `min_avg_word_length`, `max_avg_word_length` and `max_non_alpha_words_ratio`.

The following subsections explain how these threshold values were chosen. All other thresholds keep the default values from the original Gopher quality filter.

To further analyze the implementation of the filters, inspect the [Gopher quality filter source code](https://github.com/beme248/datatrove/blob/multilingual/src/datatrove/pipeline/filters/gopher_quality_filter.py) and [multilingual Gopher quality filter source code](https://github.com/beme248/datatrove/blob/multilingual/src/datatrove/pipeline/filters/multilingual_gopher_quality_filter.py).
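As a hedged sketch of how such per-language thresholds could be consumed, the snippet below loads threshold overrides from a YAML file and applies simplified Gopher-style checks. The `load_language_filters` helper, the key names and the YAML layout are illustrative assumptions; the PR's actual implementation lives in `multilingual_gopher_quality_filter.py`.

```python
import yaml  # pyyaml


def load_language_filters(path: str) -> dict:
    """Load per-language threshold overrides, e.g. files generated by wiki_lang_stats.py."""
    with open(path) as f:
        return yaml.safe_load(f)


def passes_multilingual_gopher(words: list[str], p: dict) -> bool:
    """Simplified stand-in for the per-language Gopher quality checks discussed above."""
    if not words:
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if not (p["min_avg_word_length"] <= avg_len <= p["max_avg_word_length"]):
        return False
    # at least this fraction of words must contain an alphabetic character
    alpha_ratio = sum(any(c.isalpha() for c in w) for w in words) / len(words)
    if alpha_ratio < p["max_non_alpha_words_ratio"]:
        return False
    stop_words = set(p["stop_words"])
    return sum(w.lower() in stop_words for w in words) >= 2  # Gopher-style stop word check
```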

### `stop_words`

To obtain stop words for each language, we count the occurrences of each word in the Wikipedia dataset and take the highest-frequency words as stop word candidates. To account for differences among languages (e.g., English uses "the", while German uses "der", "die" and "das"), we select words whose frequency exceeds 0.8% of the total word count rather than taking a fixed number of the most frequent words. We also remove whitespace and symbols (e.g. "«" and "»") from the stop words.

To reduce the risk of overfiltering the data, if fewer than 8 stop words remain in the cleaned list, we instead select words that appear more frequently than 0.3% of the total word count, and remove whitespace and symbols from them as well.

To further analyze word frequencies, use the [language statistics visualization tool](https://huggingface.co/spaces/ZR0zNqSGMI/mlo-language-statistics) (tab *Word frequency*).
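A compact sketch of this selection rule is shown below, assuming a `Counter` of lowercased word frequencies gathered from the Wikipedia dump. The cleaning step is reduced here to an `isalpha()` check, which is cruder than the actual whitespace and symbol removal.

```python
from collections import Counter


def select_stop_words(word_counts: Counter, primary=0.008, fallback=0.003, min_count=8) -> list[str]:
    """Stop words = words above 0.8% of the total count; fall back to 0.3% if fewer than 8 survive."""
    total = sum(word_counts.values())

    def pick(threshold: float) -> list[str]:
        return [
            w
            for w, c in word_counts.most_common()
            if c / total > threshold and w.strip() and w.isalpha()  # drop whitespace/symbols
        ]

    stop_words = pick(primary)
    if len(stop_words) < min_count:
        stop_words = pick(fallback)
    return stop_words
```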


### `min_avg_word_length` and `max_avg_word_length`

We calculate the language-specific thresholds for `min_avg_word_length` and `max_avg_word_length` as the mean word length minus one standard deviation (for the minimum) and plus one standard deviation (for the maximum), rounded to the nearest integer. When computed for the English language, these values are similar to the original Gopher quality filter thresholds: 2 (for the minimum) and 8 (for the maximum).
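In code, the rule amounts to the following; the numbers in the example are hypothetical, chosen only so the output matches the English values quoted above.

```python
def word_length_thresholds(mean_word_length: float, std_word_length: float) -> tuple[int, int]:
    """min/max average word length = mean -/+ one standard deviation, rounded to the nearest integer."""
    return round(mean_word_length - std_word_length), round(mean_word_length + std_word_length)


# Hypothetical illustration: a mean word length of 5.0 with std 3.0 gives (2, 8).
print(word_length_thresholds(5.0, 3.0))  # (2, 8)
```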


### `max_non_alpha_words_ratio`

We calculate the `max_non_alpha_words_ratio` threshold for each language as the mean `alpha_ratio` minus three standard deviations, rounded to one decimal place. When computed for the English language, this value equals the default Gopher quality filter threshold of 0.8.
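The corresponding rule for this threshold is sketched below, again with hypothetical numbers that happen to reproduce the 0.8 value.

```python
def non_alpha_threshold(mean_alpha_ratio: float, std_alpha_ratio: float) -> float:
    """max_non_alpha_words_ratio = mean alpha_ratio minus three standard deviations, one decimal place."""
    return round(mean_alpha_ratio - 3 * std_alpha_ratio, 1)


# Hypothetical illustration: a mean alpha_ratio of 0.95 with std 0.05 gives 0.8.
print(non_alpha_threshold(0.95, 0.05))  # 0.8
```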

# Running the pipeline

## Install conda

Follow the [Quick command line install](https://docs.anaconda.com/free/miniconda/#quick-command-line-install) tutorial for Linux to set up `conda`.

Restart your shell after running `~/miniconda3/bin/conda init bash` to be able to use `conda`.

## Clone the repository

```bash
git clone -b multilingual https://github.com/beme248/datatrove
cd datatrove
```

## Set up conda environment

```bash
conda create -n datatrove python=3.11
conda activate datatrove
pip install -e ".[all]" # Install dependencies
```

## Run the pipeline

To generate language-specific filter thresholds (optional; filter thresholds are already provided in the `filters` folder), run
```bash
python wiki_lang_stats.py filters
```

To start the CommonCrawl cleaning pipeline, run
```bash
python process_non_english.py DUMP_NAME
```


<!-- ## Running on the CSCS Slurm cluster

### Set up access to CSCS Clariden cluster

Follow the [tutorial](https://github.com/swiss-ai/documentation/blob/main/getting_started_with_clariden/setup_clariden.md) to set up the access to the Clariden cluster.

By the end of the tutorial, you should be able to `ssh` into your account on the cluster.
```bash
ssh clariden
```

### Install conda

Follow [Quick command line install](https://docs.anaconda.com/free/miniconda/#quick-command-line-install) tutorial for Linux to set up `conda` under your user on the cluster.

Restart your shell after running `~/miniconda3/bin/conda init bash` to be able to use `conda`.

### Clone the repository

```bash
git clone -b multilingual https://github.com/beme248/datatrove
cd datatrove
```

### Set up conda environment

```bash
conda create -n datatrove python=3.11
conda activate datatrove
pip install -e ".[all]" # Install dependencies
```

### Run the pipeline


```bash
cd examples/multilingual
```

To generate language statistics (optional, language statistics are already provided), run
```bash
export HF_DATASETS_CACHE="$SCRATCH/hf_datasets"
python wiki_lang_stats.py
```

Note that we change the HuggingFace datasets library cache to the `$SCRATCH` directory because the datasets will not fit in the `$HOME` directory. -->