From 830ac9ae527d1d599efd49bd66b2c5ca044ea414 Mon Sep 17 00:00:00 2001
From: AlenkaF
Date: Tue, 21 May 2024 09:09:33 +0200
Subject: [PATCH] Move Overview of Arrow Terminology to the beginning and add Physical layout term

---
 docs/source/format/Intro.rst | 91 +++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 47 deletions(-)

diff --git a/docs/source/format/Intro.rst b/docs/source/format/Intro.rst
index 6adfd4ea22ff3..ee1dd9943f08f 100644
--- a/docs/source/format/Intro.rst
+++ b/docs/source/format/Intro.rst
@@ -67,8 +67,51 @@ instructions.
 .. figure:: ./images/columnar-diagram_3.svg
    :alt: Tabular data being structured column by column in computer memory.
 
+Overview of Arrow Terminology
+=============================
+
+**Physical layout**
+A specification for how to arrange values of an array in memory.
+
+**Buffer**
+A contiguous region of memory with a given length. Buffers are used to store data for arrays.
+
+**Array**
+A contiguous, one-dimensional sequence of values with known length where all values have the
+same type. An array consists of zero or more buffers.
+
+**Chunked Array**
+A discontiguous, one-dimensional sequence of values with known length where all values have
+the same type. Consists of zero or more arrays, the “chunks”.
+
+.. note::
+   Chunked Array is a concept specific to certain implementations such as Arrow C++ and PyArrow.
+
+**RecordBatch**
+A contiguous, two-dimensional data structure which consist of ordered collection of arrays
+of the same length.
+
+**Schema**
+A collection of fields with optional metadata that determines all the data types of an object
+like a RecordBatch or Table.
+
+**Table**
+A discontiguous, two-dimensional chunk of data consisting of an ordered collection of Chunked
+Arrays. All Chunked Arrays have the same length, but may have different types. Different columns
+may be chunked differently.
+
+.. note::
+   Table is a concept specific to certain implementations such as Arrow C++ and PyArrow.
+
+.. image:: ../cpp/tables-versus-record-batches.svg
+   :alt: A graphical representation of an Arrow Table and a
+         Record Batch, with structure as described in text above.
+
+.. seealso::
+   The :ref:`glossary` for more terms.
+
 Support for null values
------------------------
+=======================
 
 Arrow supports missing values or "nulls" for all data types: any value
 in an array may be semantically null, whether primitive or nested type.
@@ -398,52 +441,6 @@ Example:
 
 .. _GeoArrow: https://github.com/geoarrow/geoarrow
 
-Overview of Arrow Terminology
-=============================
-
-Buffer
-------
-A contiguous region of memory with a given length. Buffers are used to store data for arrays.
-
-Array
------
-A contiguous, one-dimensional sequence of values with known length where all values have the
-same type. An array consists of zero or more buffers.
-
-Chunked Array
--------------
-A discontiguous, one-dimensional sequence of values with known length where all values have
-the same type. Consists of zero or more arrays, the “chunks”.
-
-.. note::
-   Chunked Array is a concept specific to certain implementations such as Arrow C++ and PyArrow.
-
-RecordBatch
------------
-A contiguous, two-dimensional data structure which consist of ordered collection of arrays
-of the same length.
-
-Schema
-------
-A collection of fields with optional metadata that determines all the data types of an object
-like a RecordBatch or Table.
-
-Table
------
-A discontiguous, two-dimensional chunk of data consisting of an ordered collection of Chunked
-Arrays. All Chunked Arrays have the same length, but may have different types. Different columns
-may be chunked differently.
-
-.. note::
-   Table is a concept specific to certain implementations such as Arrow C++ and PyArrow.
-
-.. image:: ../cpp/tables-versus-record-batches.svg
-   :alt: A graphical representation of an Arrow Table and a
-         Record Batch, with structure as described in text above.
-
-.. seealso::
-   The :ref:`glossary` for more terms.
-
 The Arrow C Data Interface
 ==========================