Custom Dataset refactoring + docs #715
Merged 12 commits into main on Dec 5, 2024
Conversation

@msaroufim (Member) commented Dec 4, 2024:

EDIT: removed the specific new functions in hf_datasets.py, kept most of the doc changes, and will not go with a registration-based API

Fixes #311

This PR describes the status quo of how new datasets should be registered today: there's an implicit assumption that people are installing torchtitan from source and updating hf_datasets.py to support new datasets. As an example, I passed in the wikipedia dataset.

The main "nice" thing about this PR is that the HuggingFaceDataset class is now agnostic to the c4 dataset, which makes it easier for new people to add datasets without reading the rest of the file.

There's another direction this PR could have gone in, which was to allow custom dataset registration. The benefit is that people can support new datasets without installing titan from source, but registration APIs can feel kinda "bureaucratic", and presumably people would need to register the dataset somewhere, probably train.py?

Not totally sure which is more in line with the repo's goals, so opening this PR to discuss.

```python
from typing import Any, Callable, Dict, Optional

from datasets import load_dataset

# global registries mapping a dataset name to its loader and text processor
DATASET_LOADERS: Dict[str, Callable] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Any],
    processor: Callable[[Dict[str, Any]], str],
    path: Optional[str] = None,  # optional default hub/local path for the dataset
) -> None:
    """Register a loader/processor pair under a dataset name."""
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def wikipedia_loader(dataset_path: str, **kwargs):
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )


def wikipedia_processor(sample: Dict[str, Any]) -> str:
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset(
    name="wikipedia",
    loader=wikipedia_loader,
    processor=wikipedia_processor,
    path="wikipedia",
)
```
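
If the repo did go this route, a dataset builder could then resolve everything by name. As a minimal sketch, a hypothetical consumer of the registries above (not torchtitan's actual API) might look like:

```python
from typing import Optional

def get_first_sample(name: str, dataset_path: Optional[str] = None) -> str:
    """Resolve a registered dataset by name and process its first sample."""
    loader = DATASET_LOADERS[name]
    processor = DATASET_TEXT_PROCESSORS[name]
    dataset = loader(dataset_path or name)
    # with streaming=True the loader returns an IterableDataset, so we
    # pull one sample lazily instead of downloading the whole dataset
    return processor(next(iter(dataset)))

print(get_first_sample("wikipedia")[:200])
```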

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 4, 2024
@msaroufim changed the title from "Custom Dataset docs" to "Custom Dataset refactoring + docs" on Dec 4, 2024
@msaroufim requested a review from tianyu-l on December 5, 2024 18:48
@msaroufim (Member, Author) commented:
This is ready for a review @tianyu-l

@msaroufim (Member, Author) commented:
The failure in CI seems like a flake with glib 2.8?

@tianyu-l (Contributor) left a comment:

This is very cool!
I had some inline suggestions. As part of the test, would you also please share a screenshot of the C4 dataset working properly? In CI, only c4_test is tested.

docs/datasets.md Outdated
Comment on lines 7 to 10
1. Install TorchTitan from source:
```bash
pip install -e .
```
@tianyu-l (Contributor):

nit: I don't think one needs to pip install to use torchtitan :)

@msaroufim (Member, Author):

huh interesting this wasn't obvious to me

```python
# allow user to pass in a (local or HF hub) path to use unsupported datasets
# Force lowercase for consistent comparison
dataset_name = dataset_name.lower()

if dataset_name not in _supported_datasets:
```
@tianyu-l (Contributor):

The tutorial didn't mention that we need to update _supported_datasets with the new dataset and its path, so it would fail here. I'd recommend we always let the user specify the triple, including a default path (but still with the dataset_path overriding option).

@msaroufim (Member, Author):

I took your feedback and it was helpful to refactor the code and make it much simpler
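
For reference, here is a minimal sketch of what "specifying the triple" could look like, assuming a small DatasetConfig dataclass and reusing the wikipedia_loader/wikipedia_processor from the earlier example (names are illustrative, not necessarily the merged API):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class DatasetConfig:
    path: str  # default local or HF hub path; overridable via dataset_path
    loader: Callable[..., Any]
    text_processor: Callable[[Dict[str, Any]], str]

# each supported dataset supplies the full triple up front,
# so the HuggingFaceDataset class stays dataset-agnostic
DATASETS: Dict[str, DatasetConfig] = {
    "wikipedia": DatasetConfig(
        path="wikipedia",
        loader=wikipedia_loader,
        text_processor=wikipedia_processor,
    ),
}

def resolve_config(dataset_name: str, dataset_path: Optional[str] = None) -> DatasetConfig:
    config = DATASETS[dataset_name.lower()]
    if dataset_path:  # a user-provided path overrides the registered default
        config.path = dataset_path
    return config
```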

@msaroufim (Member, Author) commented Dec 5, 2024:

Test logs

```toml
[training]
batch_size = 1
seq_len = 8192
warmup_steps = 200  # lr scheduler warm up
max_norm = 1.0  # grad norm clipping
steps = 1000
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
compile = false
dataset = "c4"
```

```
[rank0]:2024-12-05 13:24:36,127 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2024-12-05 13:24:36,128 - root - INFO - Deterministic training off
[rank0]:2024-12-05 13:24:36,915 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-12-05 13:24:36,917 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building 1-D device mesh with ['dp'], [8]
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2024-12-05 13:24:37,139 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2024-12-05 13:24:37,139 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2024-12-05 13:24:45,960 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-12-05 13:24:46,098 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2024-12-05 13:24:46,099 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-12-05 13:24:46,163 - root - INFO - Applied FSDP to the model
[rank0]:NCCL version 2.21.5+cuda12.4
[rank0]:/home/marksaroufim/.conda/envs/titan/lib/python3.10/site-packages/torch/nn/init.py:51: UserWarning: No PYTORCH_KERNEL_CACHE_PATH or HOME environment variable set! This disables kernel caching. (Triggered internally at ../aten/src/ATen/native/cuda/jit_utils.cpp:1426.)
[rank0]:  tensor.erfinv_()
[rank0]:2024-12-05 13:24:54,885 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:2024-12-05 13:24:54,894 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:2024-12-05 13:24:54,895 - root - INFO - Loading the checkpoint at step 200.
```
@msaroufim requested a review from tianyu-l on December 5, 2024 21:25
@tianyu-l (Contributor) left a comment:

Looks really good. Thank you!
Please address the remaining comments before merging.

Comment on lines 45 to 48
```diff
 enable_checkpoint = true
 folder = "checkpoint"
 interval_type = "steps"
-interval = 500
+interval = 10
```
@tianyu-l (Contributor):
Looks like accidental changes, pls revert.

docs/datasets.md Outdated
- `text_processor`: Function to process individual samples
- The loader function should return a HuggingFace dataset object
- The processor function should return a string that combines the relevant fields from your dataset
- Use streaming=True for large datasets to manage memory efficiently
@tianyu-l (Contributor):
nit:

Suggested change:
```diff
-- Use streaming=True for large datasets to manage memory efficiently
+- Use `streaming=True` for large datasets to manage memory efficiently
```
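
To make these loader/processor guidelines concrete, here is a minimal hypothetical pair; the dataset path and field names are made up for illustration:

```python
from typing import Any, Dict

from datasets import load_dataset

def my_dataset_loader(dataset_path: str, **kwargs):
    # streaming=True returns an IterableDataset, so samples are
    # fetched lazily instead of downloading the whole dataset up front
    return load_dataset(dataset_path, split="train", streaming=True)

def my_dataset_processor(sample: Dict[str, Any]) -> str:
    # combine the relevant fields into one training string
    return f"{sample['title']}\n\n{sample['body']}"
```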

@tianyu-l (Contributor) commented Dec 5, 2024:

oh also, since you have a nice datasets.md, I think it's worth updating this line to let people know https://github.com/pytorch/torchtitan/blob/main/README.md?plain=1#L61

@msaroufim merged commit 7281e0b into main on Dec 5, 2024 (4 of 5 checks passed)