From fd76fa3befebb83c38f54f774d178cefaa37d86d Mon Sep 17 00:00:00 2001
From: Ian Cook
Date: Tue, 7 May 2024 14:40:10 -0400
Subject: [PATCH] Document use of self_destruct with toArrowTable

---
 .../docs/source/user_guide/sql/arrow_pandas.rst | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/python/docs/source/user_guide/sql/arrow_pandas.rst b/python/docs/source/user_guide/sql/arrow_pandas.rst
index 3f31f64a0a7ab..dd94ec7a5e9dd 100644
--- a/python/docs/source/user_guide/sql/arrow_pandas.rst
+++ b/python/docs/source/user_guide/sql/arrow_pandas.rst
@@ -435,9 +435,12 @@ be verified by the user.
 Setting Arrow ``self_destruct`` for memory savings
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas DataFrame.
-This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
-Typically, you would see the error ``ValueError: buffer source array is read-only``.
-Newer versions of Pandas may fix these errors by improving support for such cases.
-You can work around this error by copying the column(s) beforehand.
-Additionally, this conversion may be slower because it is single-threaded.
+Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled``
+can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a
+Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas
+DataFrame. This option can also save memory when creating a PyArrow Table via ``toArrowTable``.
+This option is experimental. When used with ``toPandas``, some operations may fail on the resulting
+Pandas DataFrame due to immutable backing arrays. Typically, you would see the error
+``ValueError: buffer source array is read-only``. Newer versions of Pandas may fix these errors by
+improving support for such cases. You can work around this error by copying the column(s)
+beforehand. Additionally, this conversion may be slower because it is single-threaded.
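
Note for readers of this patch: the following is a minimal sketch of how the documented option might be exercised, not part of the change itself. The helper names `enable_self_destruct` and `demo` are hypothetical. The sketch assumes Arrow-based conversion is also turned on through `spark.sql.execution.arrow.pyspark.enabled`, and that the installed PySpark version provides `DataFrame.toArrowTable`.

```python
# Configuration keys; the helper names below are illustrative only.
ARROW_ENABLED = "spark.sql.execution.arrow.pyspark.enabled"
SELF_DESTRUCT = "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"


def enable_self_destruct(spark):
    """Enable Arrow-based conversion and the experimental self_destruct option."""
    spark.conf.set(ARROW_ENABLED, "true")
    spark.conf.set(SELF_DESTRUCT, "true")


def demo():
    # Imported lazily so the helper above does not require PySpark at import time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    enable_self_destruct(spark)
    df = spark.range(1_000_000)

    # Arrow-allocated memory may be freed while the Pandas DataFrame is built.
    pdf = df.toPandas()

    # Workaround described in the docs text: copy a column before running an
    # operation that needs a writable buffer.
    ids = pdf["id"].copy()

    # Assumption: this PySpark version exposes DataFrame.toArrowTable.
    table = df.toArrowTable()
    return ids, table
```

Calling `demo()` requires a working Spark installation. The copy step is only needed if an operation on the resulting Pandas DataFrame raises `ValueError: buffer source array is read-only`.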