[doc] [data] Update dataset intro page and fix some typos (#35361)
ericl authored May 16, 2023
1 parent 79784c5 commit 152e06d
Showing 4 changed files with 20 additions and 18 deletions.
20 changes: 10 additions & 10 deletions doc/source/data/batch_inference.rst
@@ -155,7 +155,7 @@ If you're using Ray, the three steps for running batch inference read as follows
across the cluster.
2. Define your model in a class and define a transformation that applies your model to
your data batches (of format ``Dict[str, np.ndarray]`` by default).
-3. Run inference on your data by using the :meth:`ds.map_batches() <ray.data.Datastream.map_batches>`
+3. Run inference on your data by using the :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`
method from Ray Data. In this step you also define how your batch processing job
gets distributed across your cluster.
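As a rough sketch, the three steps map to code like the following. The model class, its weight, and the ``data`` column name are hypothetical stand-ins invented for illustration, and ``ActorPoolStrategy(size=...)`` assumes a recent Ray release:

.. code-block:: python

    from typing import Dict

    import numpy as np
    import ray

    # 1. Load the data as a Dataset (here, a synthetic NumPy array).
    ds = ray.data.from_numpy(np.ones((100, 1)))

    # 2. Wrap the model in a class; __call__ transforms one batch at a time.
    class HypotheticalModel:
        def __init__(self):
            self.weight = 2.0  # stand-in for loading real model weights

        def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
            batch["output"] = batch["data"] * self.weight
            return batch

    # 3. Run inference, distributing the batches over a pool of two actors.
    predictions = ds.map_batches(
        HypotheticalModel,
        compute=ray.data.ActorPoolStrategy(size=2),
    )
    predictions.show(3)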

@@ -182,15 +182,15 @@ leveraging common Python libraries like NumPy and Pandas.

In fact, we use the exact same datasets as in the previous section, but load
them into Ray Data.
-The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
+The result of this step is a Dataset ``ds`` that we can use to run inference on.


.. tabs::

.. group-tab:: HuggingFace

-Create a Pandas DataFrame with text data and convert it to a Ray Datastream
-with the :meth:`ray.data.from_pandas() <ray.data.Datastream.from_pandas>` method.
+Create a Pandas DataFrame with text data and convert it to a Dataset
+with the :meth:`ray.data.from_pandas() <ray.data.Dataset.from_pandas>` method.

.. literalinclude:: ./doc_code/hf_quick_start.py
:language: python
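The included snippet isn't reproduced in this diff; a minimal sketch of the ``from_pandas`` step might look like this (the prompt strings are invented for illustration):

.. code-block:: python

    import pandas as pd
    import ray

    # A small DataFrame of text prompts becomes a distributed Dataset.
    prompts = pd.DataFrame(["Complete my", "sentences for", "me"], columns=["text"])
    ds = ray.data.from_pandas(prompts)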
@@ -200,8 +200,8 @@ The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
.. group-tab:: PyTorch

Create a NumPy array with 100
-entries and convert it to a Ray Datastream with the
-:meth:`ray.data.from_numpy() <ray.data.Datastream.from_numpy>` method.
+entries and convert it to a Dataset with the
+:meth:`ray.data.from_numpy() <ray.data.Dataset.from_numpy>` method.

.. literalinclude:: ./doc_code/pytorch_quick_start.py
:language: python
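The snippet itself is collapsed here as well; a sketch of the ``from_numpy`` step, assuming a recent Ray version that names the resulting column ``data``:

.. code-block:: python

    import numpy as np
    import ray

    # 100 single-feature rows; batches arrive as Dict[str, np.ndarray].
    ds = ray.data.from_numpy(np.ones((100, 1)))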
@@ -211,8 +211,8 @@ The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
.. group-tab:: TensorFlow

Create a NumPy array with 100
-entries and convert it to a Ray Datastream with the
-:meth:`ray.data.from_numpy() <ray.data.Datastream.from_numpy>` method.
+entries and convert it to a Dataset with the
+:meth:`ray.data.from_numpy() <ray.data.Dataset.from_numpy>` method.

.. literalinclude:: ./doc_code/tf_quick_start.py
:language: python
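The TensorFlow case starts from the same ``from_numpy`` call; to see the default ``Dict[str, np.ndarray]`` batch format described above, you can pull one batch out (a sketch, assuming ``iter_batches`` defaults):

.. code-block:: python

    import numpy as np
    import ray

    ds = ray.data.from_numpy(np.ones((100, 1)))
    batch = next(iter(ds.iter_batches(batch_size=4)))
    # batch is a Dict[str, np.ndarray], e.g. {"data": array of shape (4, 1)}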
@@ -278,7 +278,7 @@ Below you find examples for PyTorch, TensorFlow, and HuggingFace.
3. Getting predictions with Ray Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Once you have your Ray Dataset ``ds`` and your predictor class, you can use
+Once you have your Dataset ``ds`` and your predictor class, you can use
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>` to get predictions.
``map_batches`` takes your predictor class as an argument and allows you to specify
``compute`` resources by defining the :class:`ActorPoolStrategy <ray.data.ActorPoolStrategy>`.
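A hedged, self-contained sketch of this step follows; the predictor below is a toy stand-in for the framework-specific predictors, and per-actor GPU counts would be requested via ``num_gpus``:

.. code-block:: python

    from typing import Dict

    import numpy as np
    import ray

    class HypotheticalPredictor:
        """Toy stand-in for the PyTorch/TensorFlow/HuggingFace predictors."""

        def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
            batch["predictions"] = batch["data"] * 2.0
            return batch

    ds = ray.data.from_numpy(np.ones((100, 1)))
    predictions = ds.map_batches(
        HypotheticalPredictor,
        compute=ray.data.ActorPoolStrategy(size=2),  # two inference actors
        batch_size=16,                               # rows per batch
        # num_gpus=1,  # uncomment to give each actor one GPU
    )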
@@ -526,7 +526,7 @@ which defines how many workers to use for inference.

<2> Each actor should use one GPU.

-To summarize, mapping a function over batches is the simplest transform for Ray Datasets.
+To summarize, mapping a function over batches is the simplest transform for Datasets.
The function defines the logic for transforming individual batches of data of the dataset.
Performing operations over batches of data is more performant than single-element
operations as it can leverage the underlying vectorization capabilities of Pandas or NumPy.
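For instance, a vectorized batch transform performs one NumPy operation per batch instead of one Python call per row (a sketch with invented data):

.. code-block:: python

    import numpy as np
    import ray

    ds = ray.data.from_numpy(np.arange(8).reshape(8, 1))

    def double(batch):
        # One vectorized multiply per batch, not one call per element.
        batch["data"] = batch["data"] * 2
        return batch

    ds.map_batches(double).show(2)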
12 changes: 7 additions & 5 deletions doc/source/data/data.rst
@@ -8,9 +8,11 @@ Ray Data: Scalable Datasets for ML

.. _data-intro:

-Ray Data is the standard way to load and exchange data in Ray libraries and applications.
-It provides streaming distributed transformations such as maps
-(:meth:`map_batches <ray.data.Dataset.map_batches>`),
+Ray Data scales common ML data processing patterns that arise in batch inference
+and distributed training applications. These workloads arise when it becomes necessary to
+combine data preprocessing and model computation in the same job. Ray Data supports this by providing
+streaming distributed transformations
+such as maps (:meth:`map_batches <ray.data.Dataset.map_batches>`),
global and grouped aggregations (:class:`GroupedData <ray.data.grouped_data.GroupedData>`), and
shuffling operations (:meth:`random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`sort <ray.data.Dataset.sort>`,
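Chained together, these transformations look roughly like this (a sketch assuming a recent Ray version in which ``ray.data.range`` yields an ``id`` column):

.. code-block:: python

    import ray

    ds = ray.data.range(1000)

    # Map, grouped aggregation, and shuffle, all executed as streaming
    # distributed transformations.
    buckets = ds.map_batches(lambda batch: {"bucket": batch["id"] % 10})
    counts = buckets.groupby("bucket").count()
    shuffled = ds.random_shuffle()
    counts.show(3)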
@@ -29,9 +31,9 @@ Streaming Batch Inference
-------------------------

Ray Data simplifies general purpose parallel GPU and CPU compute in Ray through its
-powerful :ref:`Datastream <datastream_concept>` primitive. Datastreams enable workloads such as
+powerful streaming :ref:`Dataset <dataset_concept>` primitive. Datasets enable workloads such as
:doc:`GPU batch inference <batch_inference>` to run efficiently on large datasets,
-maximizing resource utilization by keeping the working data fitting into Ray object store memory.
+maximizing resource utilization by streaming the working data through Ray object store memory.

.. image:: images/stream-example.png
:width: 650px
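Concretely, consuming a Dataset pulls batches through the pipeline incrementally, so only the in-flight working set needs to occupy object store memory at any moment (a minimal sketch):

.. code-block:: python

    import ray

    ds = ray.data.range(100_000).map_batches(lambda b: {"id": b["id"] + 1})

    # Batches are produced and consumed incrementally rather than
    # materializing the whole dataset at once.
    for batch in ds.iter_batches(batch_size=1024):
        pass  # run per-batch work here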
@@ -6,7 +6,7 @@
"id": "dfdf1047",
"metadata": {},
"source": [
"# Batch Inference with OPT 30B and Ray Dataset\n",
"# Batch Inference with OPT 30B and Ray Data\n",
"\n",
"This notebook was tested on a single p3.16xlarge instance with 8 V100 GPUs.\n",
"\n",
@@ -573,7 +573,7 @@
"id": "ca57e150",
"metadata": {},
"source": [
"## Create a Ray Dataset Pipeline\n",
"## Create a Dataset Pipeline\n",
"\n",
"Finally, we connect all these pieces together, and use a BatchPredictor to run multiple copies of the DeepSpeedPredictor actors.\n",
"\n",
2 changes: 1 addition & 1 deletion doc/source/ray-air/examples/upload_to_comet_ml.ipynb
Expand Up @@ -171,7 +171,7 @@
"COMET WARNING: Failed to add tag(s) None to the experiment\n",
"\n",
"COMET WARNING: Empty mapping given to log_params({}); ignoring\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m UserWarning: Datastream 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001B[2m\u001B[33m(raylet)\u001B[0m 2022-05-19 15:19:24,628\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331069\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m 2022-05-19 15:19:25,961\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001B[2m\u001B[33m(raylet)\u001B[0m 2022-05-19 15:19:26,830\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331069\n",
