add doc for adding custom dataset #311

lessw2020 · 2024-05-05T16:02:21Z

Per user request: we don't currently have any docs on how to add a custom dataset (basically, extend the hf_dataset file).

EDIT: Removed the specific new functions in `hf_datasets.py`, kept most of the doc changes, and will not go for a registration-based API.

Fixes #311

This PR describes the status quo of how new datasets should be registered today: there's an implicit assumption that people install torchtitan from source and update `hf_datasets.py` to support new datasets. As an example, I passed in the wikipedia dataset.

The main "nice" thing about this PR is that `class HuggingFaceDataset` is now agnostic to the c4 dataset, which makes it easier for new people to add datasets without reading the rest of the file.

There's another direction this PR could have gone in, which was to allow custom dataset registration. The benefit is that people can support new datasets without installing titan from source, but registration APIs can feel kinda "bureaucratic", and presumably people would need to register the dataset somewhere, probably `train.py`?

Not totally sure which is more in line with the repo's goals, so opening this PR to discuss.

```python
from typing import Any, Callable, Dict, Optional

from datasets import load_dataset

# Registries mapping a dataset name to its loader and text processor.
DATASET_LOADERS: Dict[str, Callable[..., Any]] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Any],
    processor: Callable[[Dict[str, Any]], str],
    path: Optional[str] = None,
) -> None:
    # Note: `path` is accepted but not stored anywhere in this sketch.
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def wikipedia_loader(dataset_path: str, **kwargs):
    # Stream the English Wikipedia dump instead of downloading it up front.
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )


def wikipedia_processor(sample: Dict[str, Any]) -> str:
    # Join title and body into a single training string.
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset(
    name="wikipedia",
    loader=wikipedia_loader,
    processor=wikipedia_processor,
    path="wikipedia",
)
```
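To make the trade-off concrete, here is a minimal sketch of how a dataset class that is agnostic to any particular dataset might consume such registries. Everything in it is an assumption for illustration: `iter_text`, `toy_loader`, and `toy_processor` are hypothetical names, and the in-memory loader stands in for a real `load_dataset` call so the example runs offline. It is not torchtitan's actual implementation.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical registries, mirroring the names used in the snippet above.
DATASET_LOADERS: Dict[str, Callable[[str], Iterable[Dict[str, Any]]]] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_dataset(
    name: str,
    loader: Callable[[str], Iterable[Dict[str, Any]]],
    processor: Callable[[Dict[str, Any]], str],
) -> None:
    """Record how to load a dataset and how to turn one sample into text."""
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def iter_text(name: str, dataset_path: str) -> Iterator[str]:
    """Yield processed text for a registered dataset; reject unknown names."""
    if name not in DATASET_LOADERS:
        supported = sorted(DATASET_LOADERS)
        raise ValueError(
            f"Dataset {name!r} is not registered. Supported: {supported}"
        )
    process = DATASET_TEXT_PROCESSORS[name]
    for sample in DATASET_LOADERS[name](dataset_path):
        yield process(sample)


def toy_loader(dataset_path: str) -> Iterable[Dict[str, Any]]:
    # Stand-in for a real loader; ignores the path and returns fixed samples.
    return [
        {"title": "A", "text": "alpha"},
        {"title": "B", "text": "beta"},
    ]


def toy_processor(sample: Dict[str, Any]) -> str:
    # Same title-plus-body format as the wikipedia processor above.
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset("toy", toy_loader, toy_processor)
texts = list(iter_text("toy", "unused/path"))
```

The consuming code only ever looks up a name in the two dictionaries, so adding a dataset means adding one loader and one processor. Whether those entries live in `hf_datasets.py` or behind a public `register_dataset` call is exactly the design question raised above.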

lessw2020 mentioned this issue May 5, 2024

Custom dataset for llama 3 finetuning #310

Closed

tianyu-l added the documentation and enhancement labels May 7, 2024

msaroufim mentioned this issue Dec 4, 2024

Custom Dataset refactoring + docs #715

Merged

msaroufim closed this as completed in #715 Dec 5, 2024
