Updating README and documentation with latest changes (#954)

Summary: Pull Request resolved: #954
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D42621598
Pulled By: NivekT
fbshipit-source-id: 1aef46100798e0a41f80e5c94856f0e2cb121cff
pytorch · Jan 20, 2023 · 0d4c0f1
1 parent 5f3e968
commit 0d4c0f1
Showing 4 changed files with 41 additions and 31 deletions.
diff --git a/README.md b/README.md
@@ -1,29 +1,35 @@
 # TorchData
 
-[**Why torchdata?**](#why-composable-data-loading) | [**Install guide**](#installation) |
+[**Why TorchData?**](#why-composable-data-loading) | [**Install guide**](#installation) |
 [**What are DataPipes?**](#what-are-datapipes) | [**Beta Usage and Feedback**](#beta-usage-and-feedback) |
 [**Contributing**](#contributing) | [**Future Plans**](#future-plans)
 
-**This library is currently in the Beta stage and currently does not have a stable release. The API may change based on
-user feedback or performance. We are committed to bring this library to stable release, but future changes may not be
-completely backward compatible. If you install from source or use the nightly version of this library, use it along with
-the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a
-GitHub issue. We'd love to hear thoughts and feedback.**
+**This library is currently in the Beta stage and new features are under active development. The API may change based on
+user feedback or performance. We are committed to bring this library to stable release, but a few future changes may not
+be completely backward compatible. If you install from source or use the nightly version of this library, use it along
+with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open
+a GitHub issue. We'd love to hear thoughts and feedback.**
 
 `torchdata` is a library of common modular data loading primitives for easily constructing flexible and performant data
 pipelines.
 
-It aims to provide composable Iterable-style and Map-style building blocks called [`DataPipes`](#what-are-datapipes)
-that work well out of the box with the PyTorch's `DataLoader`. It contains functionality to reproduce many different
-datasets in TorchVision and TorchText, namely including loading, parsing, caching, and several other utilities (e.g.
-hash checking). We will continue to expand and harden this set of API based on user feedback.
+This library introduces composable Iterable-style and Map-style building blocks called
+[`DataPipes`](#what-are-datapipes) that work well out of the box with the PyTorch's `DataLoader`. These built-in
+`DataPipes` have the necessary functionalities to reproduce many different datasets in TorchVision and TorchText, namely
+loading files (from local or cloud), parsing, caching, transforming, filtering, and many more utilities. To understand
+the basic structure of `DataPipes`, please see [What are DataPipes?](#what-are-datapipes) below, and to see how
+`DataPipes` can be practically composed together into datasets, please see our
+[examples](https://pytorch.org/data/main/examples.html).
 
-To understand the basic structure of `DataPipes`, please see [What are DataPipes?](#what-are-datapipes) below, and to
-see how `DataPipes` can be practically composed into datasets, please see our [`examples/`](examples/) directory.
+On top of `DataPipes`, this library provides a new `DataLoader2` that allows the execution of these data pipelines in
+various settings and execution backends (`ReadingService`). You can learn more about the new version of `DataLoader2` in
+our [full DataLoader2 documentation](https://pytorch.org/data/main/dataloader2.html#dataloader2). Additional features
+are a work in progress, such as checkpointing and advanced control of randomness and determinism.
 
-Note that because many features of the original DataLoader have been modularized into DataPipes, some now live as
-[standard DataPipes in pytorch/pytorch](https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes)
-rather than torchdata to preserve BC functional parity within torch.
+Note that because many features of the original DataLoader have been modularized into DataPipes, their source codes live
+as [standard DataPipes in pytorch/pytorch](https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes)
+rather than torchdata to preserve backward-compatibility support and functional parity within `torch`. Regardless, you
+can access them by importing them from `torchdata`.
 
 ## Why composable data loading?
 
@@ -33,9 +39,12 @@ Over many years of feedback and organic community usage of the PyTorch `DataLoad
    replace. This has created a proliferation of use-case specific `DataLoader` variants in the community rather than an
    ecosystem of interoperable elements.
 2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over
-   and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these table-stakes
+   and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these commonly used
    elements.
 
+These reasons inspired the creation of `DataPipe` and `DataLoader2`, with a goal to make data loading components more
+flexible and reusable.
+
 ## Installation
 
 ### Version Compatibility
@@ -86,17 +95,8 @@ Using conda:
 conda install -c pytorch torchdata
 ```
 
-Run a quick sanity check in python:
-
-```py
-from torchdata.datapipes.iter import HttpReader
-URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
-ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))
-agn_batches = ag_news_train.batch(2).map(lambda batch: {'labels': [sample[0] for sample in batch],\
-                                      'text': [sample[1].split() for sample in batch]})
-batch = next(iter(agn_batches))
-assert batch['text'][0][0:8] == ['Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black']
-```
+You can then proceed to run [our examples](https://github.com/pytorch/data/tree/main/examples), such as
+[the IMDb one](https://github.com/pytorch/data/blob/main/examples/text/imdb.py).
 
 ### From source
 
@@ -162,7 +162,17 @@ reproduce sophisticated data pipelines, with streamed operation as a first-class
 
 Under this naming convention, `Dataset` simply refers to a graph of `DataPipes`, and a dataset module like `ImageNet`
 can be rebuilt as a factory function returning the requisite composed `DataPipes`. Note that the vast majority of
-initial support is focused on `IterDataPipes`, while more `MapDataPipes` support will come later.
+built-in features are implemented as `IterDataPipes`; we encourage the usage of built-in `IterDataPipe` as much as
+possible and convert them to `MapDataPipe` only when necessary.
+
+## DataLoader2
+
+A new, light-weight DataLoader2 is introduced to decouple the overloaded data-manipulation functionalities from
+`torch.utils.data.DataLoader` to `DataPipe` operations. Besides, certain features can only be achieved with
+`DataLoader2`, such as checkpointing/snapshotting and switching backend services to perform high-performant
+operations.
+
+Please read the [full documentation here](https://pytorch.org/data/main/dataloader2.html).
 
 ## Tutorial
 

diff --git a/docs/source/dataloader2.rst b/docs/source/dataloader2.rst
@@ -3,7 +3,7 @@ DataLoader2
 
 .. automodule:: torchdata.dataloader2
 
-A light-weight :class:`DataLoader2` is introduced to decouple the overloaded data-manipulation functionalities from ``torch.utils.data.DataLoader`` to ``DataPipe`` operations. Besides, a certain features can only be achieved with :class:`DataLoader2` like snapshotting and switching backend services to perform high-performant operations.
+A new, light-weight :class:`DataLoader2` is introduced to decouple the overloaded data-manipulation functionalities from ``torch.utils.data.DataLoader`` to ``DataPipe`` operations. Besides, certain features can only be achieved with :class:`DataLoader2` like snapshotting and switching backend services to perform high-performant operations.
 
 DataLoader2
 ------------

diff --git a/docs/source/torchdata.datapipes.map.rst b/docs/source/torchdata.datapipes.map.rst
@@ -26,7 +26,7 @@ welcomed in that Github issue.
 
 Here is the list of available Map-style DataPipes:
 
-MapDataPipes
+List of MapDataPipes
 -------------------------
 
 .. autosummary::

diff --git a/torchdata/dataloader2/README.md b/torchdata/dataloader2/README.md
@@ -1,6 +1,6 @@
 # DataLoader2 (Prototype)
 
-Please check [online doc](https://pytorch.org/data/main/dataloader2.html#dataloader2)
+Please check out our [full DataLoader2 documentation](https://pytorch.org/data/main/dataloader2.html#dataloader2).
 
 ## DataLoader2 Prototype Usage and Feedback