Custom Dataset refactoring + docs #715
Merged 12 commits into main on Dec 5, 2024
Conversation

@msaroufim (Member) commented Dec 4, 2024:

EDIT: removed the specific new functions in hf_datasets.py, kept most of the doc changes, and will not go with a registration-based API

Fixes #311

This PR describes the status quo of how new datasets should be registered today: there's an implicit assumption that people are installing torchtitan from source and updating hf_datasets.py to support new datasets. As an example, I passed in the wikipedia dataset.

The main "nice" thing about this PR is that the HuggingFaceDataset class is now agnostic to the c4 dataset, which makes it easier for new people to add datasets without reading the rest of the file.

There's another direction this PR could have gone in, which was to allow custom dataset registration. The benefit is that people can support new datasets without installing titan from source, but registration APIs can feel kinda "bureaucratic", and presumably people would need to register the dataset somewhere, probably train.py?

Not totally sure which is more in line with the repo's goals, so opening this PR to discuss.

```python
from typing import Any, Callable, Dict, Optional

from datasets import load_dataset

# global registries mapping a dataset name to its loader and text processor
DATASET_LOADERS: Dict[str, Callable] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Any],
    processor: Callable[[Dict[str, Any]], str],
    path: Optional[str] = None,  # optional default hub/local path for the dataset
) -> None:
    """Register a loader/processor pair under a dataset name."""
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def wikipedia_loader(dataset_path: str, **kwargs):
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )


def wikipedia_processor(sample: Dict[str, Any]) -> str:
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset(
    name="wikipedia",
    loader=wikipedia_loader,
    processor=wikipedia_processor,
    path="wikipedia",
)
```
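
If the repo did go this route, a dataset builder could then resolve everything by name. As a minimal sketch, a hypothetical consumer of the registries above (not torchtitan's actual API) might look like:

```python
from typing import Optional

def get_first_sample(name: str, dataset_path: Optional[str] = None) -> str:
    """Resolve a registered dataset by name and process its first sample."""
    loader = DATASET_LOADERS[name]
    processor = DATASET_TEXT_PROCESSORS[name]
    dataset = loader(dataset_path or name)
    # with streaming=True the loader returns an IterableDataset, so we
    # pull one sample lazily instead of downloading the whole dataset
    return processor(next(iter(dataset)))

print(get_first_sample("wikipedia")[:200])
```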

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 4, 2024
@msaroufim changed the title from "Custom Dataset docs" to "Custom Dataset refactoring + docs" on Dec 4, 2024
@msaroufim requested a review from tianyu-l on December 5, 2024 18:48
@msaroufim (Member, Author) commented:
This is ready for a review @tianyu-l

@msaroufim (Member, Author) commented:
The failure in CI seems like a flake with glib 2.8?

@tianyu-l (Contributor) left a comment:

This is very cool!
I had some inline suggestions. As part of the test, would you also please share a screenshot of the C4 dataset working properly? In CI, only c4_test is tested.

docs/datasets.md Outdated
Comment on lines 7 to 10
1. Install TorchTitan from source:
```bash
pip install -e .
```
@tianyu-l (Contributor):

nit: I don't think one needs to pip install to use torchtitan :)

@msaroufim (Member, Author):

huh interesting this wasn't obvious to me

```python
# allow user to pass in a (local or HF hub) path to use unsupported datasets
# Force lowercase for consistent comparison
dataset_name = dataset_name.lower()

if dataset_name not in _supported_datasets:
```
@tianyu-l (Contributor):

The tutorial didn't mention that we need to update _supported_datasets with the new dataset and its path, so it would fail here. I'd recommend we always let the user specify the triple, including a default path (but still with the dataset_path overriding option).

@msaroufim (Member, Author):

I took your feedback and it was helpful to refactor the code and make it much simpler
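
For reference, here is a minimal sketch of what "specifying the triple" could look like, assuming a small DatasetConfig dataclass and reusing the wikipedia_loader/wikipedia_processor from the earlier example (names are illustrative, not necessarily the merged API):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class DatasetConfig:
    path: str  # default local or HF hub path; overridable via dataset_path
    loader: Callable[..., Any]
    text_processor: Callable[[Dict[str, Any]], str]

# each supported dataset supplies the full triple up front,
# so the HuggingFaceDataset class stays dataset-agnostic
DATASETS: Dict[str, DatasetConfig] = {
    "wikipedia": DatasetConfig(
        path="wikipedia",
        loader=wikipedia_loader,
        text_processor=wikipedia_processor,
    ),
}

def resolve_config(dataset_name: str, dataset_path: Optional[str] = None) -> DatasetConfig:
    config = DATASETS[dataset_name.lower()]
    if dataset_path:  # a user-provided path overrides the registered default
        config.path = dataset_path
    return config
```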

@msaroufim (Member, Author) commented Dec 5, 2024:

Test logs

```toml
[training]
batch_size = 1
seq_len = 8192
warmup_steps = 200  # lr scheduler warm up
max_norm = 1.0  # grad norm clipping
steps = 1000
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
compile = false
dataset = "c4"
```

```
[rank0]:2024-12-05 13:24:36,127 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2024-12-05 13:24:36,128 - root - INFO - Deterministic training off
[rank0]:2024-12-05 13:24:36,915 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-12-05 13:24:36,917 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building 1-D device mesh with ['dp'], [8]
[rank0]:2024-12-05 13:24:36,995 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2024-12-05 13:24:37,139 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2024-12-05 13:24:37,139 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2024-12-05 13:24:45,960 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-12-05 13:24:46,098 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2024-12-05 13:24:46,099 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-12-05 13:24:46,163 - root - INFO - Applied FSDP to the model
[rank0]:NCCL version 2.21.5+cuda12.4
[rank0]:/home/marksaroufim/.conda/envs/titan/lib/python3.10/site-packages/torch/nn/init.py:51: UserWarning: No PYTORCH_KERNEL_CACHE_PATH or HOME environment variable set! This disables kernel caching. (Triggered internally at ../aten/src/ATen/native/cuda/jit_utils.cpp:1426.)
[rank0]:  tensor.erfinv_()
[rank0]:2024-12-05 13:24:54,885 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:2024-12-05 13:24:54,894 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:2024-12-05 13:24:54,895 - root - INFO - Loading the checkpoint at step 200.
```
@msaroufim requested a review from tianyu-l on December 5, 2024 21:25
@tianyu-l (Contributor) left a comment:

Looks really good. Thank you!
Please address the remaining comments before merging.

Comment on lines 45 to 48
```diff
 enable_checkpoint = true
 folder = "checkpoint"
 interval_type = "steps"
-interval = 500
+interval = 10
```
@tianyu-l (Contributor):
Looks like accidental changes, pls revert.

docs/datasets.md Outdated
- `text_processor`: Function to process individual samples
- The loader function should return a HuggingFace dataset object
- The processor function should return a string that combines the relevant fields from your dataset
- Use streaming=True for large datasets to manage memory efficiently
@tianyu-l (Contributor):
nit:

Suggested change:
```diff
-- Use streaming=True for large datasets to manage memory efficiently
+- Use `streaming=True` for large datasets to manage memory efficiently
```
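
To make these loader/processor guidelines concrete, here is a minimal hypothetical pair; the dataset path and field names are made up for illustration:

```python
from typing import Any, Dict

from datasets import load_dataset

def my_dataset_loader(dataset_path: str, **kwargs):
    # streaming=True returns an IterableDataset, so samples are
    # fetched lazily instead of downloading the whole dataset up front
    return load_dataset(dataset_path, split="train", streaming=True)

def my_dataset_processor(sample: Dict[str, Any]) -> str:
    # combine the relevant fields into one training string
    return f"{sample['title']}\n\n{sample['body']}"
```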

@tianyu-l (Contributor) commented Dec 5, 2024:

oh also, since you have a nice datasets.md, I think it's worth updating this line to let people know https://github.com/pytorch/torchtitan/blob/main/README.md?plain=1#L61

@msaroufim merged commit 7281e0b into main on Dec 5, 2024 (4 of 5 checks passed)