EDIT: I removed the specific new functions in hf_datasets.py, kept most
of the doc changes, and will not go for a registration-based API.
Fixes #311
This PR documents the status quo for adding new datasets today: the
implicit assumption is that people install torchtitan from source and
update hf_datasets.py to support new datasets. As an example, I used
the wikipedia dataset.
The main "nice" thing about this PR is that `class HuggingFaceDataset`
is now agnostic to the c4 dataset, which makes it easier for new people
to add datasets without reading the rest of the file.
There's another direction this PR could have gone in, which was to allow
custom dataset registration. The benefit is that people can support new
datasets without installing titan from source, but registration APIs can
feel kinda "bureaucratic", and presumably people would need to register
the dataset somewhere, probably `train.py`?
Not totally sure which is more in line with the repo's goals, so opening
this PR to discuss.
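```python
from typing import Any, Callable, Dict, Optional

from datasets import load_dataset

# Assumed here to be module-level dicts mapping a dataset name to its
# loader and text processor.
DATASET_LOADERS: Dict[str, Callable[..., Any]] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Any],
    processor: Callable[[Dict[str, Any]], str],
    path: Optional[str] = None,  # accepted but not stored in this sketch
) -> None:
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def wikipedia_loader(dataset_path: str, **kwargs):
    # Stream the English Wikipedia dump instead of downloading it up front.
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )


def wikipedia_processor(sample: Dict[str, Any]) -> str:
    # Join title and body into the single text string used for tokenization.
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset(
    name="wikipedia",
    loader=wikipedia_loader,
    processor=wikipedia_processor,
    path="wikipedia",
)
```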
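For discussion, here is a minimal offline sketch of how a dataset-agnostic consumer might read from such a registry. The names `iter_texts`, `toy_loader`, and `toy_processor` are hypothetical and not part of torchtitan; the in-memory loader only stands in for `load_dataset` so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical registries, named after the ones in the proposal.
DATASET_LOADERS: Dict[str, Callable[..., Iterable[Dict[str, Any]]]] = {}
DATASET_TEXT_PROCESSORS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_dataset(
    name: str,
    loader: Callable[..., Iterable[Dict[str, Any]]],
    processor: Callable[[Dict[str, Any]], str],
) -> None:
    DATASET_LOADERS[name] = loader
    DATASET_TEXT_PROCESSORS[name] = processor


def iter_texts(name: str, dataset_path: str) -> Iterator[str]:
    """Yield processed text for a registered dataset, with no per-dataset branches."""
    if name not in DATASET_LOADERS:
        raise ValueError(
            f"Dataset {name!r} is not registered; known: {sorted(DATASET_LOADERS)}"
        )
    loader = DATASET_LOADERS[name]
    processor = DATASET_TEXT_PROCESSORS[name]
    for sample in loader(dataset_path):
        yield processor(sample)


def toy_loader(dataset_path: str) -> Iterable[Dict[str, Any]]:
    # Stand-in for load_dataset(...): returns two fixed samples.
    return [
        {"title": "A", "text": "first"},
        {"title": "B", "text": "second"},
    ]


def toy_processor(sample: Dict[str, Any]) -> str:
    return f"{sample['title']}\n\n{sample['text']}"


register_dataset("toy", toy_loader, toy_processor)
texts = list(iter_texts("toy", "unused/path"))
```

The consumer only touches the two dicts, so supporting a new dataset means adding one loader and one processor, whether that happens by editing the file or through a registration call.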
Per user request: we don't currently have any info on how to do this (basically, extend the hf_datasets file).