[doc] [data] Update dataset intro page and fix some typos (#35361)
ericl authored May 16, 2023
1 parent 79784c5 commit 152e06d
Showing 4 changed files with 20 additions and 18 deletions.
20 changes: 10 additions & 10 deletions doc/source/data/batch_inference.rst
@@ -155,7 +155,7 @@ If you're using Ray, the three steps for running batch inference read as follows
across the cluster.
2. Define your model in a class and define a transformation that applies your model to
your data batches (of format ``Dict[str, np.ndarray]`` by default).
-3. Run inference on your data by using the :meth:`ds.map_batches() <ray.data.Datastream.map_batches>`
+3. Run inference on your data by using the :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`
method from Ray Data. In this step you also define how your batch processing job
gets distributed across your cluster.
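As a rough sketch, the three steps map to code like the following. The model class, its weight, and the ``data`` column name are hypothetical stand-ins invented for illustration, and ``ActorPoolStrategy(size=...)`` assumes a recent Ray release:

.. code-block:: python

    from typing import Dict

    import numpy as np
    import ray

    # 1. Load the data as a Dataset (here, a synthetic NumPy array).
    ds = ray.data.from_numpy(np.ones((100, 1)))

    # 2. Wrap the model in a class; __call__ transforms one batch at a time.
    class HypotheticalModel:
        def __init__(self):
            self.weight = 2.0  # stand-in for loading real model weights

        def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
            batch["output"] = batch["data"] * self.weight
            return batch

    # 3. Run inference, distributing the batches over a pool of two actors.
    predictions = ds.map_batches(
        HypotheticalModel,
        compute=ray.data.ActorPoolStrategy(size=2),
    )
    predictions.show(3)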

@@ -182,15 +182,15 @@ leveraging common Python libraries like NumPy and Pandas.

In fact, we use the exact same datasets as in the previous section, but load
them into Ray Data.
-The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
+The result of this step is a Dataset ``ds`` that we can use to run inference on.


.. tabs::

.. group-tab:: HuggingFace

-Create a Pandas DataFrame with text data and convert it to a Ray Datastream
-with the :meth:`ray.data.from_pandas() <ray.data.Datastream.from_pandas>` method.
+Create a Pandas DataFrame with text data and convert it to a Dataset
+with the :meth:`ray.data.from_pandas() <ray.data.Dataset.from_pandas>` method.

.. literalinclude:: ./doc_code/hf_quick_start.py
:language: python
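The included snippet isn't reproduced in this diff; a minimal sketch of the ``from_pandas`` step might look like this (the prompt strings are invented for illustration):

.. code-block:: python

    import pandas as pd
    import ray

    # A small DataFrame of text prompts becomes a distributed Dataset.
    prompts = pd.DataFrame(["Complete my", "sentences for", "me"], columns=["text"])
    ds = ray.data.from_pandas(prompts)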
@@ -200,8 +200,8 @@ The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
.. group-tab:: PyTorch

Create a NumPy array with 100
-entries and convert it to a Ray Datastream with the
-:meth:`ray.data.from_numpy() <ray.data.Datastream.from_numpy>` method.
+entries and convert it to a Dataset with the
+:meth:`ray.data.from_numpy() <ray.data.Dataset.from_numpy>` method.

.. literalinclude:: ./doc_code/pytorch_quick_start.py
:language: python
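The snippet itself is collapsed here as well; a sketch of the ``from_numpy`` step, assuming a recent Ray version that names the resulting column ``data``:

.. code-block:: python

    import numpy as np
    import ray

    # 100 single-feature rows; batches arrive as Dict[str, np.ndarray].
    ds = ray.data.from_numpy(np.ones((100, 1)))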
@@ -211,8 +211,8 @@ The result of this step is a Ray Datastream ``ds`` that we can use to run inference on.
.. group-tab:: TensorFlow

Create a NumPy array with 100
-entries and convert it to a Ray Datastream with the
-:meth:`ray.data.from_numpy() <ray.data.Datastream.from_numpy>` method.
+entries and convert it to a Dataset with the
+:meth:`ray.data.from_numpy() <ray.data.Dataset.from_numpy>` method.

.. literalinclude:: ./doc_code/tf_quick_start.py
:language: python
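The TensorFlow case starts from the same ``from_numpy`` call; to see the default ``Dict[str, np.ndarray]`` batch format described above, you can pull one batch out (a sketch, assuming ``iter_batches`` defaults):

.. code-block:: python

    import numpy as np
    import ray

    ds = ray.data.from_numpy(np.ones((100, 1)))
    batch = next(iter(ds.iter_batches(batch_size=4)))
    # batch is a Dict[str, np.ndarray], e.g. {"data": array of shape (4, 1)}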
@@ -278,7 +278,7 @@ Below you find examples for PyTorch, TensorFlow, and HuggingFace.
3. Getting predictions with Ray Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Once you have your Ray Dataset ``ds`` and your predictor class, you can use
+Once you have your Dataset ``ds`` and your predictor class, you can use
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>` to get predictions.
``map_batches`` takes your predictor class as an argument and allows you to specify
``compute`` resources by defining the :class:`ActorPoolStrategy <ray.data.ActorPoolStrategy>`.
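A hedged, self-contained sketch of this step follows; the predictor below is a toy stand-in for the framework-specific predictors, and per-actor GPU counts would be requested via ``num_gpus``:

.. code-block:: python

    from typing import Dict

    import numpy as np
    import ray

    class HypotheticalPredictor:
        """Toy stand-in for the PyTorch/TensorFlow/HuggingFace predictors."""

        def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
            batch["predictions"] = batch["data"] * 2.0
            return batch

    ds = ray.data.from_numpy(np.ones((100, 1)))
    predictions = ds.map_batches(
        HypotheticalPredictor,
        compute=ray.data.ActorPoolStrategy(size=2),  # two inference actors
        batch_size=16,                               # rows per batch
        # num_gpus=1,  # uncomment to give each actor one GPU
    )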
@@ -526,7 +526,7 @@ which defines how many workers to use for inference.

<2> Each actor should use one GPU.

-To summarize, mapping a function over batches is the simplest transform for Ray Datasets.
+To summarize, mapping a function over batches is the simplest transform for Datasets.
The function defines the logic for transforming individual batches of data of the dataset.
Performing operations over batches of data is more performant than single-element
operations as it can leverage the underlying vectorization capabilities of Pandas or NumPy.
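For instance, a vectorized batch transform performs one NumPy operation per batch instead of one Python call per row (a sketch with invented data):

.. code-block:: python

    import numpy as np
    import ray

    ds = ray.data.from_numpy(np.arange(8).reshape(8, 1))

    def double(batch):
        # One vectorized multiply per batch, not one call per element.
        batch["data"] = batch["data"] * 2
        return batch

    ds.map_batches(double).show(2)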
12 changes: 7 additions & 5 deletions doc/source/data/data.rst
@@ -8,9 +8,11 @@ Ray Data: Scalable Datasets for ML

.. _data-intro:

-Ray Data is the standard way to load and exchange data in Ray libraries and applications.
-It provides streaming distributed transformations such as maps
-(:meth:`map_batches <ray.data.Dataset.map_batches>`),
+Ray Data scales common ML data processing patterns that arise in batch inference
+and distributed training applications. These workloads arise when it becomes necessary to
+combine data preprocessing and model computation in the same job. Ray Data supports this by providing
+streaming distributed transformations
+such as maps (:meth:`map_batches <ray.data.Dataset.map_batches>`),
global and grouped aggregations (:class:`GroupedData <ray.data.grouped_data.GroupedData>`), and
shuffling operations (:meth:`random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`sort <ray.data.Dataset.sort>`,
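Chained together, these transformations look roughly like this (a sketch assuming a recent Ray version in which ``ray.data.range`` yields an ``id`` column):

.. code-block:: python

    import ray

    ds = ray.data.range(1000)

    # Map, grouped aggregation, and shuffle, all executed as streaming
    # distributed transformations.
    buckets = ds.map_batches(lambda batch: {"bucket": batch["id"] % 10})
    counts = buckets.groupby("bucket").count()
    shuffled = ds.random_shuffle()
    counts.show(3)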
@@ -29,9 +31,9 @@ Streaming Batch Inference
-------------------------

Ray Data simplifies general purpose parallel GPU and CPU compute in Ray through its
-powerful :ref:`Datastream <datastream_concept>` primitive. Datastreams enable workloads such as
+powerful streaming :ref:`Dataset <dataset_concept>` primitive. Datasets enable workloads such as
:doc:`GPU batch inference <batch_inference>` to run efficiently on large datasets,
-maximizing resource utilization by keeping the working data fitting into Ray object store memory.
+maximizing resource utilization by streaming the working data through Ray object store memory.

.. image:: images/stream-example.png
:width: 650px
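Concretely, consuming a Dataset pulls batches through the pipeline incrementally, so only the in-flight working set needs to occupy object store memory at any moment (a minimal sketch):

.. code-block:: python

    import ray

    ds = ray.data.range(100_000).map_batches(lambda b: {"id": b["id"] + 1})

    # Batches are produced and consumed incrementally rather than
    # materializing the whole dataset at once.
    for batch in ds.iter_batches(batch_size=1024):
        pass  # run per-batch work here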
@@ -6,7 +6,7 @@
"id": "dfdf1047",
"metadata": {},
"source": [
"# Batch Inference with OPT 30B and Ray Dataset\n",
"# Batch Inference with OPT 30B and Ray Data\n",
"\n",
"This notebook was tested on a single p3.16xlarge instance with 8 V100 GPUs.\n",
"\n",
@@ -573,7 +573,7 @@
"id": "ca57e150",
"metadata": {},
"source": [
"## Create a Ray Dataset Pipeline\n",
"## Create a Dataset Pipeline\n",
"\n",
"Finally, we connect all these pieces together, and use a BatchPredictor to run multiple copies of the DeepSpeedPredictor actors.\n",
"\n",
2 changes: 1 addition & 1 deletion doc/source/ray-air/examples/upload_to_comet_ml.ipynb
Expand Up @@ -171,7 +171,7 @@
"COMET WARNING: Failed to add tag(s) None to the experiment\n",
"\n",
"COMET WARNING: Empty mapping given to log_params({}); ignoring\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m UserWarning: Datastream 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001B[2m\u001B[33m(raylet)\u001B[0m 2022-05-19 15:19:24,628\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331069\n",
"\u001B[2m\u001B[36m(GBDTTrainable pid=19852)\u001B[0m 2022-05-19 15:19:25,961\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001B[2m\u001B[33m(raylet)\u001B[0m 2022-05-19 15:19:26,830\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331069\n",
