Custom Dataset refactoring + docs #715
Conversation
This is ready for a review @tianyu-l
The failure in CI seems like a flake with glib 2.8?
This is very cool!
I had some inline suggestions. As part of the test, would you also please share a screenshot of the C4 dataset working properly? In CI, only c4_test is tested.
docs/datasets.md
Outdated
1. Install TorchTitan from source:
```bash
pip install -e .
```
nit: I don't think one needs to pip install to use torchtitan :)
huh interesting this wasn't obvious to me
torchtitan/datasets/hf_datasets.py
Outdated
# allow user to pass in a (local or HF hub) path to use unsupported datasets
# Force lowercase for consistent comparison
dataset_name = dataset_name.lower()

if dataset_name not in _supported_datasets:
the tutorial didn't mention that we need to update _supported_datasets with the new dataset and its path, so it would fail here. I'd recommend we always let the user specify the triple, including a default path (but still with the dataset_path overriding option).
I took your feedback; it helped me refactor the code and make it much simpler.
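To make the "triple" idea above concrete, here is a minimal sketch of what a per-dataset entry could look like, assuming a dataclass that bundles a default path with a loader and a text processor. The names here (`DatasetConfig`, `DATASETS`, the c4 helpers) are illustrative, not necessarily what the PR ends up with.

```python
# Minimal sketch, not the PR's exact code: each dataset name maps to a default
# path plus a loader and a text processor, and a user-supplied dataset_path can
# still override the default path at runtime.
from dataclasses import dataclass
from typing import Any, Callable

from datasets import load_dataset  # HuggingFace datasets


@dataclass
class DatasetConfig:
    path: str                              # default local or HF hub path
    loader: Callable[[str], Any]           # returns a HuggingFace dataset object
    text_processor: Callable[[Any], str]   # maps one sample to a training string


def _load_c4(path: str):
    # streaming avoids materializing the full dataset in memory
    return load_dataset(path, name="en", split="train", streaming=True)


def _process_c4(sample: dict) -> str:
    return sample["text"]


DATASETS = {
    "c4": DatasetConfig(path="allenai/c4", loader=_load_c4, text_processor=_process_c4),
}
```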
Test logs [training]
batch_size = 1
seq_len = 8192
warmup_steps = 200 # lr scheduler warm up
max_norm = 1.0 # grad norm clipping
steps = 1000
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
compile = false
dataset = "c4" rank0]:2024-12-05 13:24:36,127 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2024-12-05 13:24:36,128 - root - INFO - Deterministic training off
[rank0]:2024-12-05 13:24:36,915 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-12-05 13:24:36,917 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building 1-D device mesh with ['dp'], [8]
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2024-12-05 13:24:37,139 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2024-12-05 13:24:37,139 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2024-12-05 13:24:45,960 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-12-05 13:24:46,098 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2024-12-05 13:24:46,099 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-12-05 13:24:46,163 - root - INFO - Applied FSDP to the model
[rank0]:NCCL version 2.21.5+cuda12.4
[rank0]:/home/marksaroufim/.conda/envs/titan/lib/python3.10/site-packages/torch/nn/init.py:51: UserWarning: No PYTORCH_KERNEL_CACHE_PATH or HOME environment variable set! This disables kernel caching. (Triggered internally at ../aten/src/ATen/native/cuda/jit_utils.cpp:1426.)
[rank0]: tensor.erfinv_()
[rank0]:2024-12-05 13:24:54,885 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:2024-12-05 13:24:54,894 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:2024-12-05 13:24:54,895 - root - INFO - Loading the checkpoint at step 200.
Looks really good. Thank you!
Please address the remaining comments before merging.
train_configs/llama3_8b.toml
Outdated
enable_checkpoint = true
folder = "checkpoint"
interval_type = "steps"
- interval = 500
+ interval = 10
These look like accidental changes, pls revert.
docs/datasets.md
Outdated
- `text_processor`: Function to process individual samples
- The loader function should return a HuggingFace dataset object
- The processor function should return a string that combines the relevant fields from your dataset
- Use streaming=True for large datasets to manage memory efficiently
nit:
Original: - Use streaming=True for large datasets to manage memory efficiently
Suggested: - Use `streaming=True` for large datasets to manage memory efficiently
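For readers following the docs/datasets.md guidelines quoted above, a loader/processor pair could look roughly like the sketch below. This is illustrative only; the dataset path, config name, and field names ("title", "text") are assumptions about the wikipedia dataset's layout, not something the docs prescribe.

```python
# Illustrative only: a loader that returns a streaming HuggingFace dataset and a
# processor that combines the relevant fields into a single training string.
from datasets import load_dataset


def load_wikipedia_dataset(dataset_path: str):
    # streaming=True keeps memory bounded for large datasets
    return load_dataset(dataset_path, name="20231101.en", split="train", streaming=True)


def process_wikipedia_text(sample: dict) -> str:
    # combine the title and body text into one string
    return f"{sample['title']}\n\n{sample['text']}"
```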
oh also, since you have a nice
EDIT: removed the specific new functions in hf_datasets.py, kept most of the doc changes, and will not go for a registration-based API.
Fixes #311

This PR describes the status quo of how new datasets should be registered today: there is an implicit assumption that people install torchtitan from source and update hf_datasets.py to support new datasets. As an example, I passed in the wikipedia dataset.

The main "nice" thing about this PR is that class HuggingFaceDataset is now agnostic to the c4 dataset, which makes it easier for new people to add datasets without reading the rest of the file.

There's another direction this PR could have gone in, which was to allow custom dataset registration. The benefit is that people could support new datasets without installing titan from source, but registration APIs can feel kinda "bureaucratic", and presumably people would need to register the dataset somewhere, probably train.py?

Not totally sure which is more in line with the repo's goals, so opening this PR to discuss.
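For context on the registration-based alternative mentioned above (the direction this PR does not take), a sketch of what such an API might look like is below. Every name here is hypothetical; nothing like it exists in torchtitan as of this PR.

```python
# Rough, hypothetical sketch of a registration-based API (not what this PR does).
from typing import Any, Callable, NamedTuple


class DatasetSpec(NamedTuple):
    path: str                             # default local or HF hub path
    loader: Callable[[str], Any]          # returns a HuggingFace dataset object
    text_processor: Callable[[Any], str]  # maps one sample to a training string


_REGISTRY: dict[str, DatasetSpec] = {}


def register_dataset(name: str, spec: DatasetSpec) -> None:
    """Add a user-defined dataset at runtime instead of editing hf_datasets.py."""
    name = name.lower()
    if name in _REGISTRY:
        raise ValueError(f"dataset {name!r} is already registered")
    _REGISTRY[name] = spec


# e.g. in train.py, before building the dataloader:
# register_dataset(
#     "my_corpus",
#     DatasetSpec(path="/data/my_corpus", loader=my_loader, text_processor=my_processor),
# )
```

The trade-off discussed above is visible here: users would no longer need to edit hf_datasets.py, but they would need somewhere, probably train.py, to call the registration helper before the dataloader is built.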