Evo2 #694 (Open)

Wants to merge 149 commits into base: main.

Changes shown from 76 of 149 commits.

Commits
50db0ca
[cye/evo2-llm-dev] Private internal development branch for Evo2 in Bi…
cspades Nov 16, 2024
737f16c
[cye/evo2-llm-dev] Add rough draft of data preprocessing for Evo2.
cspades Dec 4, 2024
a142109
Add manual data test for evo2
jstjohn Dec 4, 2024
0ad0bee
Change remotes for submodules for now
jstjohn Dec 5, 2024
82c832f
Cye/nemo2 fixes
cspades Dec 5, 2024
945506f
Write model checkpoint context and set Evo2Dataset in the pre-training.
cspades Dec 10, 2024
4fc1d84
Fix inference script to make sense, i.e. no seq parallelism for decod…
cspades Dec 11, 2024
f5adde5
Cye/fix Hyena species biases
cspades Dec 16, 2024
b9dfd5c
Hyena golden value test
jstjohn Dec 19, 2024
e6278d9
[cye/blended-training] Expose blended weights for training Hyena.
cspades Dec 21, 2024
dd0aab1
Changes for 256 node training run
jstjohn Dec 23, 2024
0560ee4
Integrate BioNeMo Noodles into Hyena data preprocessing.
cspades Dec 24, 2024
5511fe7
[cye/lineage-str] Clean up interface for taxonomic lineage tokens in …
cspades Jan 3, 2025
92d0352
Changes made on 256 node branch
jstjohn Jan 3, 2025
923cbdf
Cye/hyena flops
cspades Jan 3, 2025
6460ea3
Fix broken import of blended training config.
cspades Jan 3, 2025
7e72f48
Cye/import fix
cspades Jan 3, 2025
45923c6
Add improved nsys profiling support
jstjohn Jan 6, 2025
c805984
[cye/hyena-doc-update] Add data preprocessing documentation, fix tech…
cspades Jan 7, 2025
f5b15f3
[cye/transcript-readme] Add main documentation snippets for Hyena, an…
cspades Jan 8, 2025
9ba9e07
Bump nemo version to the new context length insensitive code, and upd…
jstjohn Jan 10, 2025
854951f
added flag for tflops callback
dorotat-nv Jan 13, 2025
ada349e
[cye/evo2-ckpt-utils] Add Evo2 ZeRO-1/3 to NeMo checkpointing utils.
cspades Jan 13, 2025
652dfe0
Add test for evo2 tokenizer.
jwilber Jan 14, 2025
265a0be
Fix nemo-savanna repo build in CI
dorotat-nv Jan 14, 2025
fb09377
fixing format issues on evo2-dev
dorotat-nv Jan 14, 2025
9cacf1b
Add tests for parallel hyena operators used in evo2
jwilber Jan 14, 2025
9ac11eb
Rebase on OSS.
cspades Jan 14, 2025
5631b93
[cye/tp-comm-fix] Fix TP communication overlap inconsistency.
cspades Jan 15, 2025
9ae9af0
Add temporary fix for shard-tensor bug in Megatron-LM
dorotat-nv Jan 16, 2025
c032408
Add initial test for preprocess.py
jwilber Jan 17, 2025
b6d238f
Bump NeMo to pick up FLOPS calculations.
cspades Jan 17, 2025
7822c04
[cye/z3-log-fix] Fix parameter count log.
cspades Jan 21, 2025
9378223
[cye/docker-patch-fix] Move Megatron patch to BioNeMo base image in D…
cspades Jan 22, 2025
9b9176a
shipping hotfix for dockers built locally - fix from main 17c6b20513…
dorotat-nv Jan 24, 2025
329548a
[cye/1m-ckpt-config] Add HyenaConfig options for 1M context length di…
cspades Jan 24, 2025
2ca40b0
[cye/fix-tp-comm-overlap] Fix default tp_comm_overlap=True being used…
cspades Jan 24, 2025
a494478
reducing scope of tested folders for evo2-dev
dorotat-nv Jan 27, 2025
72a311e
Adds basic inference test
jomitchellnv Jan 28, 2025
d4cd785
[cye/deactivate-infer-tpcomm] Deactivate TP communication during infe…
cspades Jan 28, 2025
3ba8946
fix: ensure test looks in test file dir for required data
jwilber Jan 28, 2025
34938fc
m2.5 accuracy 7b runs
jstjohn Jan 29, 2025
5fe2576
Fixes `test_evo2.py` unit test and adds enhancements to existing unit…
jomitchellnv Jan 30, 2025
30e71e9
Fix bug in wandb logger argparse.
jstjohn Jan 30, 2025
635a5df
[cye/pad-loss-mask] Fixes TP comm overlap bug with sequence parallel …
cspades Feb 3, 2025
f141830
Add longphase dataset config to repo
jwilber Feb 3, 2025
493d444
bump Megatron-LM, nemo-savanna and rebase to main OSS
dorotat-nv Feb 5, 2025
1df0176
CI hotfix
dorotat-nv Feb 7, 2025
624e797
test: Create tests for Evo2Dataset mask_phylogenetic_tags
jwilber Feb 7, 2025
75205b0
[cye/torch_dist_fix] Remove torch_dist patch and bump Megatron, reorg…
cspades Feb 11, 2025
0f6efeb
Changes related to accuracy and perf with new nemo2 changes
jstjohn Feb 11, 2025
0af5f9e
[cye/tp-comm-fp8-wgrad-fix] Require --fp8-wgrad when using TP communic…
Feb 12, 2025
0917616
Adding evo2 to JET
dorotat-nv Feb 13, 2025
3d1e19e
Remove sample data from evo2-dev branch
dorotat-nv Feb 13, 2025
9811ae4
[BUGFIX] evo2-dev CI
dorotat-nv Feb 13, 2025
f9133f5
Remove test_mask_phylogenetic tags (moving to nemo repo)
jwilber Feb 14, 2025
e83bd26
attempt at merge -- nemo matches github main. Dockerfile has major bu…
skothenhill-nv Feb 14, 2025
a175d5b
Bump nemo to fix forward bug
jstjohn Feb 14, 2025
ea70cde
Add required changes to work with NeMo upstream
jstjohn Feb 14, 2025
dddf9a4
Add back new context manager for parallel state cleanup
jstjohn Feb 14, 2025
e192982
Move test_config into nemo where the code is
jstjohn Feb 14, 2025
d355729
Fix arg name mismatch
jstjohn Feb 15, 2025
d90c10d
add new license
jwilber Feb 15, 2025
a8432a2
remove tab from license
jwilber Feb 15, 2025
4f2ade5
Bump nemo to fix bug in dataset
jstjohn Feb 15, 2025
3965502
Bump NeMo commit for perf improved loss mask
jstjohn Feb 18, 2025
f09aa36
Adding options for controlling dropout to train.py
jstjohn Feb 18, 2025
955978d
Bump nemo and remove nograd decorator
jstjohn Feb 18, 2025
3e14262
Bump nemo with latest tag masking
jstjohn Feb 18, 2025
46baa5f
Cover non-DNA case due to bug in preprocessing, never have non-dna un…
jstjohn Feb 18, 2025
ef3f55e
Try reverting some of the recent fixes related to TP
jstjohn Feb 18, 2025
af9016e
Bump nemo version with better tested
jstjohn Feb 18, 2025
aafb7a3
Revert loss mask updates
jstjohn Feb 18, 2025
a966b8b
handle 0 token case more gracefully
jstjohn Feb 18, 2025
c4ef1f1
bump NeMo with proper handling of control character containing sequen…
jstjohn Feb 19, 2025
0976fac
Update remote pointers to new public NeMo branches
jstjohn Feb 19, 2025
04982ae
Remove unused Megatron torch_dist sizing patch.
cspades Feb 19, 2025
242f3fe
Remove fasta from test and replace with synthetic sequence
jstjohn Feb 19, 2025
22ada77
Move fasta creation utility into testing sub-package
jstjohn Feb 19, 2025
b5bdec8
Add a test that verifies that the new phylo tag masking code is faste…
jstjohn Feb 19, 2025
ac1bd1f
Move phylo tag benchmark to NeMo testing
jstjohn Feb 19, 2025
bfaebd1
Merge in main
jstjohn Feb 20, 2025
0ae0c50
Update Megatron-LM submodule to commit 62529f1d (has 1M context fix) …
jwilber Feb 20, 2025
2ba5da3
fix config typo in test
jstjohn Feb 21, 2025
253a7f2
bump NeMo to latest PR version
jstjohn Feb 21, 2025
82e9c47
Fix issue causing gh-docs-deploy failure (#698)
jwilber Feb 21, 2025
fa73a00
Update nemo pointer with PR updates
jstjohn Feb 21, 2025
f466774
Add new license to new files (failing ci) (#699)
jwilber Feb 21, 2025
39290e4
Change kingdom to domain in tag description
jstjohn Feb 21, 2025
b688975
Merge in upstream
jstjohn Feb 21, 2025
94c4283
Merge branch 'main' of github.com:NVIDIA/bionemo-framework into evo2
jstjohn Feb 21, 2025
15c7dca
Make new versions of the files available freshly converted from HF
jstjohn Feb 22, 2025
3324bd4
bump nemo version to fix broken import
jstjohn Feb 22, 2025
ce133d2
bump nemo to top of tree
jstjohn Feb 22, 2025
78f92b5
Adding in the predict method and test
jstjohn Feb 24, 2025
bb5f5a1
Merge branch 'main' of github.com:NVIDIA/bionemo-framework into evo2
jstjohn Feb 24, 2025
d9e4952
bump NeMo commit
jstjohn Feb 24, 2025
b148750
Fix multipart download naming in nemo
jstjohn Feb 24, 2025
ba1d9bf
Update docs for checkpoint conversion
jstjohn Feb 24, 2025
0af3e0a
shrink tests down to 1b case
jstjohn Feb 24, 2025
c5e42d8
add end to end fine-tuning tutorial
jstjohn Feb 25, 2025
544b7a8
ignore object hashes in precommit
jstjohn Feb 25, 2025
d7a8ea7
Bump nemo pointer to latest PR pointer
jstjohn Feb 25, 2025
07c48b8
Update ci/benchmarks/partial-conv/evo2_pretrain.yaml
jstjohn Feb 25, 2025
e779f60
Update ci/benchmarks/perf/evo2_pretrain.yaml
jstjohn Feb 25, 2025
a1c8048
Slightly smaller test_train.py
jstjohn Feb 25, 2025
46edcb6
Add missing main function for inference cli
jstjohn Feb 25, 2025
e81eef3
Add --batch-size option to predict
jstjohn Feb 25, 2025
4e5acda
Fixing the description of the 1b model
jstjohn Feb 25, 2025
5bd0e2c
remove hard-coded PBSS
jstjohn Feb 26, 2025
ca16c2a
Remove comment block from code
jstjohn Feb 26, 2025
5248e5d
evo2 train unit test (#704)
dorotat-nv Feb 27, 2025
1e7323b
Updates to benchmarks: evo2 (#705)
dorotat-nv Feb 28, 2025
24f1db0
Add brca1 zeroshot example + predict and scoring updates to evo2.
jwilber Mar 4, 2025
e012146
Add vortex style fp8 support to predict
jstjohn Mar 4, 2025
ec662e4
Update the brca notebook with a run on an fp8 supporting machine
jstjohn Mar 4, 2025
aabd6a4
Merge in upstream changes to bionemo
jstjohn Mar 4, 2025
66ead75
add missing/new NGC urls
jstjohn Mar 4, 2025
0c67976
Remove fasta from pre commit
jstjohn Mar 4, 2025
ae81e4d
Remove TODOs related to PBSS
jstjohn Mar 4, 2025
b15cc82
Moved test config into the tests/config dir with the other configs
jstjohn Mar 4, 2025
a752309
Address yaml location feedback
jstjohn Mar 4, 2025
b1bb99d
Add new test covering padding and seq dims
jstjohn Mar 4, 2025
4f67795
Address comments on documentation
jstjohn Mar 4, 2025
97d3845
Run pre-commit on docs
jstjohn Mar 4, 2025
7473e14
Address PR feedback on test naming
jstjohn Mar 4, 2025
6be9801
Refactor out fasta dataset, add tests for it (#716)
jwilber Mar 4, 2025
0e13ad0
Bump nemo commit with predict changes
jstjohn Mar 4, 2025
bf08649
no longer needed since we do not have committed fastas
jstjohn Mar 4, 2025
04914d2
Reformat to pass pre-commit
jstjohn Mar 4, 2025
48cab0a
update readme to mention predict (#717)
jwilber Mar 4, 2025
1d19941
Fix parallel short hyena operator test
jwilber Mar 4, 2025
14ef1ea
Add slow tests for 7b
jstjohn Mar 4, 2025
4f83438
Update faster 1b test with lower precision so it passes in CI
jstjohn Mar 4, 2025
f235b09
Merge branch 'evo2' of github.com:NVIDIA/bionemo-framework into evo2
jstjohn Mar 4, 2025
25fefc8
Address formatting issues
jstjohn Mar 4, 2025
e34c44d
Leave megatron-lm as is and add more stringent slow test along with l…
jstjohn Mar 4, 2025
51a8c7a
Bump nemo as well
jstjohn Mar 4, 2025
19b2289
Merge branch 'main' of github.com:NVIDIA/bionemo-framework into evo2
jstjohn Mar 4, 2025
3204869
Update pointer to evo2 test file
jstjohn Mar 5, 2025
70b266c
only run most stringent comparison with h100
jstjohn Mar 5, 2025
9c7e3f7
add missing ngc link for new 7b-8k checkpoint
jstjohn Mar 5, 2025
01f8f05
Fixing shape issue in parallel short hyena test
jstjohn Mar 5, 2025
88f2c48
Address issue with pycache when there are tests with the same name in…
jstjohn Mar 5, 2025
825879d
Move to per-package tests for slow as well as fast tests
jstjohn Mar 5, 2025
66d99f8
Handle no tests found case
jstjohn Mar 5, 2025
a6b2a4a
Add option for allowing no slow tests for a submodule
jstjohn Mar 5, 2025
70bfee4
Handle exit code capturing within the context of a pipefail script pr…
jstjohn Mar 5, 2025
46fee2c
Merge branch 'main' into evo2
jstjohn Mar 5, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -187,6 +187,7 @@ dist/
 coverage.xml

 # Jupyter Notebook
+notebooks/
 .ipynb_checkpoints

 # System files
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -18,7 +18,7 @@ repos:
   hooks:
     - id: detect-secrets
       name: detect-secrets (everything but notebooks)
-      args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline)$', ]
+      args: ['--baseline', '.secrets.baseline', '--exclude-files', '(.*\.ipynb|.*\.baseline|.*\.fasta)$', ]
       exclude: package.lock.json
     - id: detect-secrets
       name: detect-secrets (notebooks only)
6 changes: 3 additions & 3 deletions .secrets.baseline
@@ -128,7 +128,7 @@
       {
         "path": "detect_secrets.filters.regex.should_exclude_file",
         "pattern": [
-          "(.*\\.ipynb|.*\\.baseline)$"
+          "(.*\\.ipynb|.*\\.baseline|.*\\.fasta)$"
         ]
       }
     ],
@@ -139,9 +139,9 @@
         "filename": "pyproject.toml",
         "hashed_secret": "79670e9c9d1c7ea5b81a96a2053d81437712c78e",
         "is_verified": false,
-        "line_number": 44
+        "line_number": 45
       }
     ]
   },
-  "generated_at": "2025-01-15T19:06:19Z"
+  "generated_at": "2025-01-30T14:18:42Z"
}
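Both of these changes widen the same detect-secrets exclude pattern so that FASTA files are no longer scanned for secrets. A quick, illustrative check of the new regex (not how detect-secrets invokes it internally):

```python
import re

# The updated exclude pattern shared by .pre-commit-config.yaml and .secrets.baseline.
pattern = re.compile(r"(.*\.ipynb|.*\.baseline|.*\.fasta)$")

for name in ["analysis.ipynb", ".secrets.baseline", "chr20.fasta", "train.py"]:
    print(f"{name}: excluded={bool(pattern.match(name))}")
# Only train.py remains subject to secret scanning.
```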
2 changes: 1 addition & 1 deletion 3rdparty/Megatron-LM
Submodule Megatron-LM updated 253 files
2 changes: 1 addition & 1 deletion 3rdparty/NeMo
Submodule NeMo updated 396 files
63 changes: 63 additions & 0 deletions ci/benchmarks/partial-conv/evo2_pretrain.yaml
@@ -0,0 +1,63 @@
scope: partial-conv
time_limit: 14400
script_args:
# All arguments referenced in the script string must be specified here.
# Arguments not referenced in the script string must have the 'arg' field specified.
# See jet/core/configs.py for the specification of the configuration class
workspace:
value: /workspace/bionemo2
key_segment: False
data_path:
value: /data/evo2
key_segment: False
model:
value: evo2
variant:
value: train
config_name:
value: 7b
precision:
value: fp8
nodes:
value: 4
gpus:
value: 8
batch_size:
value: 2
pp:
value: 1
tp:
value: 8
cp:
value: 1
acc_grad:
value: 1
max_steps:
value: 20000
script: |-
WANDB_API_KEY=$BIONEMO_WANDB_API_KEY python ${workspace}/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py \
-d ${workspace}/ci/benchmarks/test_dataset_config.yaml \
--dataset-path ${data_path} \
--grad-acc-batches ${acc_grad} \
--fp8 \
--enable-preemption \
--ckpt-async-save \
--seq-length=8192 \
--tensor-parallel-size=${tp} \
--context-parallel-size=${cp} \
--pipeline-model-parallel-size=${pp} \
--workers 8 \
--num-nodes=${nodes} \
--devices=${gpus} \
--micro-batch-size=${batch_size} \
--model-size=${config_name} \
--max-steps=${max_steps} \
--limit-val-batches=20 \
--log-every-n-steps=50 \
--val-check-interval=500 \
--tflops-callback \
--experiment-dir=${tensorboard_dir}/${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
--wandb-project=${wandb_project_name} \
--wandb-group=${model}_${variant}_${config_name}__${target} \
--wandb-job-type=${pipeline_label} \
--disable-checkpointing;
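For context on how this config becomes a command: the `${...}` placeholders in `script` are filled from the `script_args` values above, plus pipeline-supplied variables such as `tensorboard_dir`, `wandb_project_name`, `target`, and `pipeline_label`. A minimal sketch of that substitution, assuming a plain `string.Template`-style expansion rather than JET's actual resolver (see jet/core/configs.py):

```python
# Illustrative sketch only: resolve ${...} placeholders in the benchmark
# script string. JET's real configuration class handles this for us.
from string import Template

import yaml

with open("ci/benchmarks/partial-conv/evo2_pretrain.yaml") as f:
    cfg = yaml.safe_load(f)

# Flatten {name: {value: ..., ...}} into {name: value}.
args = {name: spec["value"] for name, spec in cfg["script_args"].items()}

# These come from the pipeline, not script_args; placeholder values here.
args.update(tensorboard_dir="/results/tb", wandb_project_name="evo2-demo",
            target="demo", pipeline_label="dev")

# safe_substitute leaves unresolved names (e.g. $BIONEMO_WANDB_API_KEY) as-is.
print(Template(cfg["script"]).safe_substitute(args))
```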
67 changes: 67 additions & 0 deletions ci/benchmarks/perf/evo2_pretrain.yaml
@@ -0,0 +1,67 @@
scope: perf
time_limit: 1800
script_args:
# All arguments referenced in the script string must be specified here.
# Arguments not referenced in the script string must have the 'arg' field specified.
# See jet/core/configs.py for the specification of the configuration class
workspace:
value: /workspace/bionemo2
key_segment: False
data_path:
value: /data/evo2
key_segment: False
model:
value: evo2
variant:
value: train
precision:
value: fp8
gpus:
value: 8
batch_size:
value: 2
max_steps:
value: 100
tp:
value: 8
cp:
value: 1
pp:
value: 1
acc_grad:
value: 1
products:
- nodes: 1
config_name: 7b
- nodes: 2
config_name: 7b
- nodes: 8
config_name: 40b
script: |-
WANDB_API_KEY=$BIONEMO_WANDB_API_KEY python ${workspace}/sub-packages/bionemo-evo2/src/bionemo/evo2/run/${variant}.py \
-d ${workspace}/ci/benchmarks/test_dataset_config.yaml \
--dataset-path ${data_path} \
--grad-acc-batches ${acc_grad} \
--fp8 \
--enable-preemption \
--ckpt-async-save \
--use-megatron-comm-overlap-llama3-8k \
--seq-length=8192 \
--tensor-parallel-size=${tp} \
--context-parallel-size=${cp} \
--pipeline-model-parallel-size=${pp} \
--workers 8 \
--num-nodes=${nodes} \
--devices=${gpus} \
--micro-batch-size=${batch_size} \
--model-size=${config_name} \
--max-steps=${max_steps} \
--limit-val-batches=20 \
--log-every-n-steps=50 \
--val-check-interval=${max_steps} \
--tflops-callback \
--experiment-dir=${tensorboard_dir}/${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
--wandb-project=${wandb_project_name} \
--wandb-group=${model}_${variant}_${config_name}__${target} \
--wandb-job-type=${pipeline_label} \
--disable-checkpointing;
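Unlike the partial-conv config, this one defines a `products` matrix: each entry overrides `nodes` and `config_name` while inheriting the shared arguments, yielding three runs (7b on 1 and 2 nodes, 40b on 8 nodes). A hedged sketch of that expansion, assuming `products` sits alongside the other `script_args` entries as shown:

```python
# Illustrative expansion of the perf-benchmark product matrix into one
# flat argument dict per run; the real JET scheduler does this itself.
import yaml

with open("ci/benchmarks/perf/evo2_pretrain.yaml") as f:
    cfg = yaml.safe_load(f)

script_args = cfg["script_args"]
base = {name: spec["value"] for name, spec in script_args.items()
        if name != "products"}

runs = [{**base, **product} for product in script_args["products"]]
for run in runs:
    print(f"{run['config_name']} on {run['nodes']} node(s), "
          f"tp={run['tp']}, mbs={run['batch_size']}")
# 7b on 1 node(s) ... / 7b on 2 node(s) ... / 40b on 8 node(s) ...
```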
81 changes: 81 additions & 0 deletions ci/benchmarks/test_dataset_config.yaml
@@ -0,0 +1,81 @@
- dataset_prefix: metagenomics/pretraining_data_metagenomics/data_metagenomics_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.18
- dataset_prefix: gtdb_v220/gtdb_v220_imgpr_merged_data/data_gtdb_imgpr_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.24
- dataset_prefix: imgvr/pretraining_data_imgvr/data_imgvr_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.03
- dataset_prefix: ncrna/pretraining_data_ncrna/data_ncrna_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.02
- dataset_prefix: mrna/pretraining_data_mrna/data_mrna_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.09
- dataset_prefix: euk_windows/stitched_transcripts/pretraining_data_stiched_mrna/data_mrna_stitch_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.09
- dataset_prefix: euk_windows/windows_split/5kb_windows_lowercase/5kb_windows_lowercase_pretraining_data/windows_5kb_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.35
- dataset_prefix: promoters/pretraining_data_promoters/data_promoters_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.0003
- dataset_prefix: organelle/pretraining_data_organelle/data_organelle_train_text_CharLevelTokenizer_document
dataset_split: train
dataset_weight: 0.005
- dataset_prefix: metagenomics/pretraining_data_metagenomics/data_metagenomics_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.18
- dataset_prefix: gtdb_v220/gtdb_v220_imgpr_merged_data/data_gtdb_imgpr_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.24
- dataset_prefix: imgvr/pretraining_data_imgvr/data_imgvr_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.03
- dataset_prefix: ncrna/pretraining_data_ncrna/data_ncrna_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.02
- dataset_prefix: mrna/pretraining_data_mrna/data_mrna_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.09
- dataset_prefix: euk_windows/stitched_transcripts/pretraining_data_stiched_mrna/data_mrna_stitch_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.09
- dataset_prefix: euk_windows/windows_split/5kb_windows_lowercase/5kb_windows_lowercase_pretraining_data/windows_5kb_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.35
- dataset_prefix: promoters/pretraining_data_promoters/data_promoters_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.0003
- dataset_prefix: organelle/pretraining_data_organelle/data_organelle_valid_text_CharLevelTokenizer_document
dataset_split: validation
dataset_weight: 0.005
- dataset_prefix: metagenomics/pretraining_data_metagenomics/data_metagenomics_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.18
- dataset_prefix: gtdb_v220/gtdb_v220_imgpr_merged_data/data_gtdb_imgpr_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.24
- dataset_prefix: imgvr/pretraining_data_imgvr/data_imgvr_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.03
- dataset_prefix: ncrna/pretraining_data_ncrna/data_ncrna_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.02
- dataset_prefix: mrna/pretraining_data_mrna/data_mrna_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.09
- dataset_prefix: euk_windows/stitched_transcripts/pretraining_data_stiched_mrna/data_mrna_stitch_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.09
- dataset_prefix: euk_windows/windows_split/5kb_windows_lowercase/5kb_windows_lowercase_pretraining_data/windows_5kb_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.35
- dataset_prefix: promoters/pretraining_data_promoters/data_promoters_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.0003
- dataset_prefix: organelle/pretraining_data_organelle/data_organelle_test_text_CharLevelTokenizer_document
dataset_split: test
dataset_weight: 0.005
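Each entry's `dataset_weight` is a relative sampling proportion for the blended dataset; within a split the weights here sum to roughly 1.005 rather than exactly 1.0, which is fine since blended-dataset weights are typically renormalized downstream. A small sanity-check sketch (assumes only the flat list layout shown above):

```python
# Sanity check: total blended-dataset weight per split.
from collections import defaultdict

import yaml

with open("ci/benchmarks/test_dataset_config.yaml") as f:
    entries = yaml.safe_load(f)

totals = defaultdict(float)
for entry in entries:
    totals[entry["dataset_split"]] += entry["dataset_weight"]

for split, total in sorted(totals.items()):
    print(f"{split}: {total:.4f}")  # each split sums to ~1.0053
```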
32 changes: 32 additions & 0 deletions ci/scripts/megatron-lm-mr2604-torch-dist-ckpt-size.patch
@@ -0,0 +1,32 @@
diff --git a/megatron/core/dist_checkpointing/strategies/filesystem_async.py b/megatron/core/dist_checkpointing/strategies/filesystem_async.py
index 47ab4d112..48de3218b 100644
--- a/megatron/core/dist_checkpointing/strategies/filesystem_async.py
+++ b/megatron/core/dist_checkpointing/strategies/filesystem_async.py
@@ -113,6 +113,18 @@ class FileSystemWriterAsync(FileSystemWriter):
file_count += 1
return file_name

+ def _copy_to_cpu(ten: torch.Tensor):
+ """Pinned D2H copy (or a simple clone() if already on the CPU).
+
+ Makes sure we perform a `clone` only if we detect incontiguous storage,
+ so that we don't blow up host memory unnecessarily.
+ """
+ ten = ten.detach()
+ if ten.device.type != "cpu":
+ return ten.to("cpu", non_blocking=True)
+ is_view = ten.untyped_storage().size() != ten.numel() * ten.itemsize
+ return ten.clone() if is_view else ten
+
# Prepare bytes / tensor data in each bucket, which will be assigned to each writer process
self.write_buckets = []
for group_name, group_buckets in _split_by_separation_hint(
@@ -125,7 +137,7 @@ class FileSystemWriterAsync(FileSystemWriter):
if item.type == WriteItemType.BYTE_IO
]
tensor_data = [
- (item, planner.resolve_data(item).detach().to("cpu", non_blocking=True))
+ (item, _copy_to_cpu(planner.resolve_data(item)))
for item in bucket
if item.type != WriteItemType.BYTE_IO
]
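The heart of this patch is `_copy_to_cpu`: when a tensor is already on the CPU and merely views a larger storage, a naive save would serialize the entire backing storage and inflate checkpoint size, so the `untyped_storage().size() != numel() * itemsize` test detects exactly that case and clones only the views. A standalone demonstration of the predicate (illustrative, outside the Megatron writer):

```python
# Demonstrates the view-detection predicate used in the patch above.
import torch

def is_storage_view(ten: torch.Tensor) -> bool:
    """True when the tensor does not own its full backing storage."""
    return ten.untyped_storage().size() != ten.numel() * ten.itemsize

full = torch.zeros(1024, 1024)
sliced = full[:1]  # shares storage with `full`

print(is_storage_view(full))    # False: storage size matches numel * itemsize
print(is_storage_view(sliced))  # True: 1024 elements over a 1024*1024 storage
```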
1 change: 0 additions & 1 deletion ci/scripts/run_pytest.sh
@@ -97,7 +97,6 @@ echo "Test directories: ${TEST_DIRS[*]}"
 # Run tests with coverage
 for dir in "${TEST_DIRS[@]}"; do
   echo "Running pytest in $dir"
-
   if ! pytest "${PYTEST_OPTIONS[@]}" --junitxml=$(basename $dir).junit.xml -o junit_family=legacy "$dir"; then
     error=true
   fi
5 changes: 3 additions & 2 deletions ci/scripts/utils.sh
@@ -20,10 +20,11 @@ check_git_repository() {
   if ! git diff-index --quiet HEAD --; then
     if [ $? -eq 128 ]; then
       echo "ERROR: Not in a git repository!" >&2
+      return 1
     else
-      echo "ERROR: Repository is dirty! Commit all changes before building the image!" >&2
+      echo "Warning: Repository is dirty! Commit all changes before building the image!" >&2
+      return 0
     fi
-    return 1
   fi
}

65 changes: 61 additions & 4 deletions internal/infra-bionemo/src/infra_bionemo/license_check.py
@@ -44,8 +44,10 @@
     "main",
 )

-LICENSE_HEADER: str = """
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+NVIDIA_COPYRIGHT: str = (
+    "# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
+)
+APACHE_BLOCK: str = """
 # SPDX-License-Identifier: LicenseRef-Apache2
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -61,6 +63,9 @@
 # limitations under the License.
 """.strip()

+# default header (split to allow for intermediate copyright headers)
+LICENSE_HEADER = f"{NVIDIA_COPYRIGHT}\n{APACHE_BLOCK}"
+

 @dataclass(frozen=True)
 class HeaderNotFound(ValueError):
@@ -134,8 +139,60 @@ def is_valid_python(pyfile_contents: str) -> Optional[SyntaxError]:


 def has_header(pyfile_contents: str, *, license_header: str = LICENSE_HEADER) -> bool:
-    """True if the :param:`pyfile_contents` starts with the :param:`license_header`. False otherwise."""
-    return pyfile_contents.startswith(license_header)
+    """Check if file has valid license header.
+
+    First checks if file has multiple copyright lines - if so, validates structure only.
+    If not, and custom license_header provided, does exact string match.
+    Otherwise validates basic structure.
+    """
+    lines = pyfile_contents.split("\n")
+
+    # Count copyright lines at start of file
+    copyright_count = 0
+    for line in lines:
+        if line.strip().startswith("# SPDX-FileCopyrightText: Copyright"):
+            copyright_count += 1
+        else:
+            break
+
+    # If file has multiple copyrights, only validate structure
+    if copyright_count > 1:
+        # Must start with NVIDIA copyright
+        if not lines or not lines[0].strip() == NVIDIA_COPYRIGHT:
+            return False
+
+        # Find where Apache block starts
+        apache_start = None
+        for i, line in enumerate(lines):
+            if line.strip().startswith("# SPDX-License-Identifier: LicenseRef-Apache2"):
+                apache_start = i
+                break
+
+        if apache_start is None:
+            return False
+
+        # All lines between NVIDIA copyright and Apache block must be valid SPDX copyright lines
+        for line in lines[1:apache_start]:
+            if line.strip() and not line.strip().startswith("# SPDX-FileCopyrightText: Copyright"):
+                return False
+
+        # Check Apache block matches exactly
+        apache_lines = APACHE_BLOCK.split("\n")
+        if len(lines[apache_start:]) < len(apache_lines):
+            return False
+
+        for actual, expected in zip(lines[apache_start : apache_start + len(apache_lines)], apache_lines):
+            if actual.strip() != expected.strip():
+                return False
+
+        return True
+
+    # Otherwise, if custom header provided, use exact match
+    if license_header != LICENSE_HEADER:
+        return pyfile_contents.startswith(license_header)
+
+    # Otherwise do basic structure validation
+    return lines[0].strip() == NVIDIA_COPYRIGHT and pyfile_contents.startswith(LICENSE_HEADER)


def append_license_header(pyfile_contents: str, *, license_header: str = LICENSE_HEADER, n_sep_lines: int = 2) -> str:
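To make the new multi-copyright behavior concrete, here is a hedged usage sketch; the import path follows this file's location under internal/infra-bionemo/src/, and the second copyright holder is purely hypothetical:

```python
# Usage sketch for the updated has_header; the extra copyright holder
# below is hypothetical, purely to exercise the multi-copyright path.
from infra_bionemo.license_check import APACHE_BLOCK, NVIDIA_COPYRIGHT, has_header

single = f"{NVIDIA_COPYRIGHT}\n{APACHE_BLOCK}\n\nprint('hi')\n"
multi = (
    f"{NVIDIA_COPYRIGHT}\n"
    "# SPDX-FileCopyrightText: Copyright (c) 2024 Example Labs. All rights reserved.\n"
    f"{APACHE_BLOCK}\n\nprint('hi')\n"
)

print(has_header(single))  # True: exact default header
print(has_header(multi))   # True: structure-only validation kicks in
```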