Updating README and documentation with latest changes (#954)
Summary: Pull Request resolved: #954

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D42621598

Pulled By: NivekT

fbshipit-source-id: 1aef46100798e0a41f80e5c94856f0e2cb121cff
NivekT authored and facebook-github-bot committed Jan 20, 2023
1 parent 5f3e968 commit 0d4c0f1
Showing 4 changed files with 41 additions and 31 deletions.
66 changes: 38 additions & 28 deletions README.md
@@ -1,29 +1,35 @@
# TorchData

[**Why torchdata?**](#why-composable-data-loading) | [**Install guide**](#installation) |
[**Why TorchData?**](#why-composable-data-loading) | [**Install guide**](#installation) |
[**What are DataPipes?**](#what-are-datapipes) | [**Beta Usage and Feedback**](#beta-usage-and-feedback) |
[**Contributing**](#contributing) | [**Future Plans**](#future-plans)

**This library is currently in the Beta stage and currently does not have a stable release. The API may change based on
user feedback or performance. We are committed to bring this library to stable release, but future changes may not be
completely backward compatible. If you install from source or use the nightly version of this library, use it along with
the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a
GitHub issue. We'd love to hear thoughts and feedback.**
**This library is currently in the Beta stage and new features are under active development. The API may change based on
user feedback or performance. We are committed to bring this library to stable release, but a few future changes may not
be completely backward compatible. If you install from source or use the nightly version of this library, use it along
with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open
a GitHub issue. We'd love to hear thoughts and feedback.**

`torchdata` is a library of common modular data loading primitives for easily constructing flexible and performant data
pipelines.

It aims to provide composable Iterable-style and Map-style building blocks called [`DataPipes`](#what-are-datapipes)
that work well out of the box with the PyTorch's `DataLoader`. It contains functionality to reproduce many different
datasets in TorchVision and TorchText, namely including loading, parsing, caching, and several other utilities (e.g.
hash checking). We will continue to expand and harden this set of API based on user feedback.
This library introduces composable Iterable-style and Map-style building blocks called
[`DataPipes`](#what-are-datapipes) that work well out of the box with the PyTorch's `DataLoader`. These built-in
`DataPipes` have the necessary functionalities to reproduce many different datasets in TorchVision and TorchText, namely
loading files (from local or cloud), parsing, caching, transforming, filtering, and many more utilities. To understand
the basic structure of `DataPipes`, please see [What are DataPipes?](#what-are-datapipes) below, and to see how
`DataPipes` can be practically composed together into datasets, please see our
[examples](https://pytorch.org/data/main/examples.html).

To understand the basic structure of `DataPipes`, please see [What are DataPipes?](#what-are-datapipes) below, and to
see how `DataPipes` can be practically composed into datasets, please see our [`examples/`](examples/) directory.
On top of `DataPipes`, this library provides a new `DataLoader2` that allows the execution of these data pipelines in
various settings and execution backends (`ReadingService`). You can learn more about the new version of `DataLoader2` in
our [full DataLoader2 documentation](https://pytorch.org/data/main/dataloader2.html#dataloader2). Additional features
are a work in progress, such as checkpointing and advanced control of randomness and determinism.

Note that because many features of the original DataLoader have been modularized into DataPipes, some now live as
[standard DataPipes in pytorch/pytorch](https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes)
rather than torchdata to preserve BC functional parity within torch.
Note that because many features of the original DataLoader have been modularized into DataPipes, their source code lives
as [standard DataPipes in pytorch/pytorch](https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes)
rather than torchdata to preserve backward-compatibility support and functional parity within `torch`. Regardless, you
can use them by importing them from `torchdata`.

## Why composable data loading?

@@ -33,9 +39,12 @@ Over many years of feedback and organic community usage of the PyTorch `DataLoad
replace. This has created a proliferation of use-case specific `DataLoader` variants in the community rather than an
ecosystem of interoperable elements.
2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over
and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these table-stakes
and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these commonly used
elements.

These reasons inspired the creation of `DataPipe` and `DataLoader2`, with a goal to make data loading components more
flexible and reusable.

## Installation

### Version Compatibility
Expand Down Expand Up @@ -86,17 +95,8 @@ Using conda:
conda install -c pytorch torchdata
```

Run a quick sanity check in python:

```py
from torchdata.datapipes.iter import HttpReader
URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))
agn_batches = ag_news_train.batch(2).map(lambda batch: {'labels': [sample[0] for sample in batch],\
'text': [sample[1].split() for sample in batch]})
batch = next(iter(agn_batches))
assert batch['text'][0][0:8] == ['Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black']
```
You can then proceed to run [our examples](https://github.com/pytorch/data/tree/main/examples), such as
[the IMDb one](https://github.com/pytorch/data/blob/main/examples/text/imdb.py).

### From source

Expand Down Expand Up @@ -162,7 +162,17 @@ reproduce sophisticated data pipelines, with streamed operation as a first-class

Under this naming convention, `Dataset` simply refers to a graph of `DataPipes`, and a dataset module like `ImageNet`
can be rebuilt as a factory function returning the requisite composed `DataPipes`. Note that the vast majority of
initial support is focused on `IterDataPipes`, while more `MapDataPipes` support will come later.
built-in features are implemented as `IterDataPipes`; we encourage using the built-in `IterDataPipes` as much as
possible and converting them to `MapDataPipe` only when necessary.
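
To make the "graph of `DataPipes` returned by a factory function" idea concrete, here is a minimal plain-Python sketch. It deliberately does not import `torchdata`: the names `Source`, `Mapper`, `Batcher`, and `build_dataset` are hypothetical stand-ins invented for illustration, not library APIs. The real library offers analogous built-in operations (for example the `.map()` and `.batch()` forms on `IterDataPipe`).

```python
from typing import Callable, Iterable, Iterator, List


class Source:
    """Hypothetical stand-in for an iterable-style pipe wrapping raw items."""

    def __init__(self, items: Iterable):
        self.items = items

    def __iter__(self) -> Iterator:
        return iter(self.items)


class Mapper:
    """Hypothetical pipe that lazily applies `fn` to every upstream item."""

    def __init__(self, upstream: Iterable, fn: Callable):
        self.upstream = upstream
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.upstream:
            yield self.fn(item)


class Batcher:
    """Hypothetical pipe that groups upstream items into lists of `batch_size`."""

    def __init__(self, upstream: Iterable, batch_size: int):
        self.upstream = upstream
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List]:
        batch = []
        for item in self.upstream:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly smaller, batch
            yield batch


def build_dataset(rows: Iterable, batch_size: int = 2) -> Batcher:
    """Factory function: the "dataset" is just the composed graph of pipes."""
    parsed = Mapper(Source(rows), lambda row: (int(row[0]), " ".join(row[1:])))
    return Batcher(parsed, batch_size)


rows = [["3", "Wall", "St."], ["4", "Oil", "prices"], ["1", "Chip", "sales"]]
batches = list(build_dataset(rows))
print(batches)
```

Each stage only knows about the iterable directly upstream of it, so stages can be reordered, swapped, or reused in other factory functions without touching the rest of the pipeline.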

## DataLoader2

A new, light-weight DataLoader2 is introduced to decouple the overloaded data-manipulation functionalities from
`torch.utils.data.DataLoader` to `DataPipe` operations. Besides, certain features can only be achieved with
`DataLoader2`, such as checkpointing/snapshotting and switching backend services to perform high-performance
operations.

Please read the [full documentation here](https://pytorch.org/data/main/dataloader2.html).
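
The decoupling described above, where the loader stays thin and a swappable backend decides how the pipeline is executed, can be illustrated with a small plain-Python sketch. This is not the `DataLoader2` or `ReadingService` implementation: `Loader`, `InProcessService`, and `PrefetchThreadService` are hypothetical names, and the sketch assumes only that a backend exposes a single `run(pipeline)` method.

```python
import queue
import threading
from typing import Iterable, Iterator


class InProcessService:
    """Hypothetical backend: iterate the pipeline in the calling thread."""

    def run(self, pipeline: Iterable) -> Iterator:
        return iter(pipeline)


class PrefetchThreadService:
    """Hypothetical backend: a worker thread fills a bounded buffer ahead of the consumer."""

    _DONE = object()  # sentinel marking the end of the pipeline

    def __init__(self, buffer_size: int = 4):
        self.buffer_size = buffer_size

    def run(self, pipeline: Iterable) -> Iterator:
        buffer = queue.Queue(maxsize=self.buffer_size)

        def producer() -> None:
            # Error handling is omitted: an exception here would stall the consumer.
            for item in pipeline:
                buffer.put(item)
            buffer.put(self._DONE)

        threading.Thread(target=producer, daemon=True).start()
        while True:
            item = buffer.get()
            if item is self._DONE:
                return
            yield item


class Loader:
    """Hypothetical thin loader: owns no data logic, only delegates execution."""

    def __init__(self, pipeline: Iterable, service):
        self.pipeline = pipeline
        self.service = service

    def __iter__(self) -> Iterator:
        return self.service.run(self.pipeline)


pipeline = [x * 10 for x in range(5)]
same_process = list(Loader(pipeline, InProcessService()))
prefetched = list(Loader(pipeline, PrefetchThreadService(buffer_size=2)))
print(same_process, prefetched)
```

Because the data-manipulation steps live in the pipeline and the execution strategy lives in the service object, switching backends changes one constructor argument rather than the pipeline itself.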

## Tutorial

2 changes: 1 addition & 1 deletion docs/source/dataloader2.rst
@@ -3,7 +3,7 @@ DataLoader2

.. automodule:: torchdata.dataloader2

A light-weight :class:`DataLoader2` is introduced to decouple the overloaded data-manipulation functionalities from ``torch.utils.data.DataLoader`` to ``DataPipe`` operations. Besides, a certain features can only be achieved with :class:`DataLoader2` like snapshotting and switching backend services to perform high-performant operations.
A new, light-weight :class:`DataLoader2` is introduced to decouple the overloaded data-manipulation functionalities from ``torch.utils.data.DataLoader`` to ``DataPipe`` operations. Besides, certain features can only be achieved with :class:`DataLoader2` like snapshotting and switching backend services to perform high-performant operations.

DataLoader2
------------
2 changes: 1 addition & 1 deletion docs/source/torchdata.datapipes.map.rst
@@ -26,7 +26,7 @@ welcomed in that Github issue.

Here is the list of available Map-style DataPipes:

MapDataPipes
List of MapDataPipes
-------------------------

.. autosummary::
2 changes: 1 addition & 1 deletion torchdata/dataloader2/README.md
@@ -1,6 +1,6 @@
# DataLoader2 (Prototype)

Please check [online doc](https://pytorch.org/data/main/dataloader2.html#dataloader2)
Please check out our [full DataLoader2 documentation](https://pytorch.org/data/main/dataloader2.html#dataloader2).

## DataLoader2 Prototype Usage and Feedback

