add doc for adding custom dataset #311

Closed
lessw2020 opened this issue May 5, 2024 · 0 comments · Fixed by #715
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@lessw2020
Contributor

Per a user request: we currently have no documentation on how to add a custom dataset (basically, it means extending the hf_datasets.py file).

tianyu-l added the documentation and enhancement labels on May 7, 2024
msaroufim added a commit that referenced this issue Dec 5, 2024
EDIT: I removed the specific new functions in hf_datasets.py and kept most
of the doc changes; this PR will not go for a registration-based API.

Fixes #311

This PR documents the status quo for adding new datasets: the implicit
assumption is that people install torchtitan from source and update
hf_datasets.py to support a new dataset. As an example, I used the
wikipedia dataset.

The main "nice" thing about this PR is that `class HuggingFaceDataset`
is now agnostic to the c4 dataset, which makes it easier for newcomers
to add datasets without reading the rest of the file.
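
To make "agnostic to the c4 dataset" concrete, here is a minimal sketch of the
idea, not the actual torchtitan code. `DatasetConfig`, `DATASETS`, and
`TextDataset` are hypothetical names, and the loaders return in-memory toy
samples instead of calling Hugging Face, so the example runs offline.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator

Sample = Dict[str, Any]


@dataclass(frozen=True)
class DatasetConfig:
    # Hypothetical container: how to load samples and how to turn one into text.
    loader: Callable[[], Iterable[Sample]]
    text_processor: Callable[[Sample], str]


def _load_toy_c4() -> Iterable[Sample]:
    # In-memory stand-in for a real c4 loader.
    return [{"text": "hello world"}]


def _load_toy_wikipedia() -> Iterable[Sample]:
    # In-memory stand-in for a real wikipedia loader.
    return [{"title": "Python", "text": "A programming language."}]


def _process_wikipedia(sample: Sample) -> str:
    return f"{sample['title']}\n\n{sample['text']}"


# Supporting a new dataset means adding one entry here, i.e. editing the source file.
DATASETS: Dict[str, DatasetConfig] = {
    "c4": DatasetConfig(_load_toy_c4, lambda sample: sample["text"]),
    "wikipedia": DatasetConfig(_load_toy_wikipedia, _process_wikipedia),
}


class TextDataset:
    """Yields text for any configured dataset; holds no dataset-specific logic."""

    def __init__(self, name: str) -> None:
        if name not in DATASETS:
            raise ValueError(f"Unsupported dataset {name!r}; known: {sorted(DATASETS)}")
        self._config = DATASETS[name]

    def __iter__(self) -> Iterator[str]:
        for sample in self._config.loader():
            yield self._config.text_processor(sample)
```

With this shape, the class body never mentions c4: everything specific to a
dataset lives in its `DATASETS` entry.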

There's another direction this PR could have gone in: allowing custom
dataset registration. The benefit is that people can support new
datasets without installing titan from source, but registration APIs can
feel kinda "bureaucratic", and presumably people would need to register
the dataset somewhere, probably `train.py`?

I'm not totally sure which is more in line with the repo's goals, so I'm
opening this PR to discuss.

```python
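from typing import Any, Callable, Dict, Optional

from datasets import load_dataset

# Registries keyed by dataset name. Assumed here to be module-level dicts;
# DATASET_PATHS is a hypothetical addition so that `path` is not silently dropped.
DATASET_LOADERS: Dict[str, Callable[..., Any]] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable[[Dict[str, Any]], str]] = {}
DATASET_PATHS: Dict[str, str] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Any],
    processor: Callable[[Dict[str, Any]], str],
    path: Optional[str] = None,
) -> None:
    """Register a loader and a text processor under `name`."""
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor
    # Fall back to the dataset name when no explicit path is given.
    DATASET_PATHS[name] = path if path is not None else name


def wikipedia_loader(dataset_path: str, **kwargs: Any):
    """Stream the English Wikipedia dump via Hugging Face `datasets`."""
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )


def wikipedia_processor(sample: Dict[str, Any]) -> str:
    """Join the article title and body into one training string."""
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset(
    name="wikipedia",
    loader=wikipedia_loader,
    processor=wikipedia_processor,
    path="wikipedia",
)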
```
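
For discussion, this is roughly how the registration approach might look from
the caller's side, for example near the top of a training script. It is only a
sketch: `LOADERS`, `PROCESSORS`, `register`, and `iter_text` are hypothetical
stand-ins for the registries in the snippet above, and the loader returns
in-memory samples so nothing is downloaded.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Sample = Dict[str, Any]

# Hypothetical registries, standing in for the ones the snippet above writes to.
LOADERS: Dict[str, Callable[[str], Iterable[Sample]]] = {}
PROCESSORS: Dict[str, Callable[[Sample], str]] = {}


def register(
    name: str,
    loader: Callable[[str], Iterable[Sample]],
    processor: Callable[[Sample], str],
) -> None:
    """Record the loader and text processor under `name`."""
    LOADERS[name] = loader
    PROCESSORS[name] = processor


def iter_text(name: str, path: str) -> Iterator[str]:
    """Resolve the registered callables by name and yield processed text."""
    if name not in LOADERS:
        raise KeyError(f"Dataset {name!r} is not registered")
    processor = PROCESSORS[name]
    for sample in LOADERS[name](path):
        yield processor(sample)


def toy_loader(path: str) -> Iterable[Sample]:
    # Ignores `path` and returns fixed samples; a real loader would read from it.
    return [
        {"title": "A", "text": "alpha"},
        {"title": "B", "text": "beta"},
    ]


# User code registers its dataset once, before the dataloader is built.
register("toy_wiki", toy_loader, lambda s: f"{s['title']}\n\n{s['text']}")
texts = list(iter_text("toy_wiki", "unused/path"))
```

The trade-off raised above shows up here: no source edit is needed, but the
`register(...)` call has to run somewhere before the dataset name is resolved.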