Add DeltaLake DeletionVector Scan support for Databricks 14.3 [databricks] #11964

razajafri · 2025-01-14T05:00:48Z

This PR adds support for reading deletion vectors for the PERFILE reader.

contributes to #8654

Signed-off-by: Raza Jafri <[email protected]>

… which will also be based off of 350

…ases end in db Revert the change in pom to remove 350db143 shim

This commit xfails the delta-lake tests that fail on databricks 14.3. This is for the sake of temporary expediency. These tests will be revisited and triaged for actual fixes. Signed-off-by: MithunR <[email protected]>

…tests XFail delta-lake tests failing on databricks 14.3

Signed-off-by: Tim Liu <[email protected]>

pxLi · 2025-01-23T03:00:30Z

build

NvTimLiu · 2025-01-23T03:04:26Z

Enabled the pre-merge CI to run DB14.3 integration tests,
160b746

https://nvidia.slack.com/archives/C021GR1KTDY/p1737601230184499?thread_ts=1737422476.340569&cid=C021GR1KTDY

Please feel free to revert it if you do not want enable it, thanks! @razajafri

Force Delta Lake reads on Databricks 14.3 to bypass perFile check

This reverts commit 160b746.

…be read

… length

razajafri · 2025-01-24T01:52:52Z

build

sameerz · 2025-01-24T18:44:45Z

sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/DeltaLakeUtils.scala

@@ -18,7 +18,6 @@
 {"spark": "330db"}
 {"spark": "332db"}
 {"spark": "341db"}
-{"spark": "350db143"}


Update copyrights

sameerz · 2025-01-24T18:45:38Z

sql-plugin/src/main/spark350db143/scala/com/nvidia/spark/rapids/shims/DeltaLakeUtils.scala

@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2024, NVIDIA CORPORATION.
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION.


Update copyrights

revans2

I have not finished my review yet. I see a lot of formatting changes and function ordering
changes between the open source delta lake file for 31x

https://github.com/delta-io/delta/blob/v3.1.0/spark/src/main/scala/org/apache/spark/sql/delta/DeltaParquetFileFormat.scala

and our new one that makes it difficult to see exactly what is changed/added vs open source. Not the end of the world, it just makes it go more slowly. I also have been comparing it to the 24x version in our repo so it is kind of slow going.

revans2 · 2025-01-24T19:42:52Z

delta-lake/common/src/main/scala/com/nvidia/spark/rapids/delta/DeltaProviderImplBase.scala

@@ -32,8 +32,8 @@ abstract class DeltaProviderImplBase extends DeltaProvider {
      ),
      GpuOverrides.exec[RapidsDeltaWriteExec](
        "GPU write into a Delta Lake table",
-        ExecChecks.hiddenHack(),
-        (wrapped, conf, p, r) => new RapidsDeltaWriteExecMeta(wrapped, conf, p, r)).invisible()
+        ExecChecks(TypeSig.all, TypeSig.all),


Do we really support all of the types that this can support?

revans2 · 2025-01-24T19:43:24Z

...ake/common/src/main/scala/com/nvidia/spark/rapids/delta/GpuDeltaParquetFileFormatUtils.scala

@@ -127,10 +127,10 @@ object GpuDeltaParquetFileFormatUtils {
              }
            }

-              withResource(table) { _ =>


nit: Why make this change at all? It is just white space.

revans2 · 2025-01-24T19:45:55Z

delta-lake/delta-31x/pom.xml

@@ -0,0 +1,102 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2023-2025 NVIDIA CORPORATION.


nit: I know this was mostly copied from another place, but this shows up as a new file and so the copyright should technically start in 2025.

revans2 · 2025-01-24T20:05:13Z

...-lake/delta-31x/src/main/scala/com/nvidia/spark/rapids/delta/delta31x/Delta31xProvider.scala

+  override def tagSupportForGpuFileSourceScan(meta: SparkPlanMeta[FileSourceScanExec]): Unit = {
+    val format = meta.wrapped.relation.fileFormat
+    if (format.getClass == classOf[DeltaParquetFileFormat]) {
+//      val deltaFormat = format.asInstanceOf[DeltaParquetFileFormat]


nit: These comments need to be removed. Similarly are there plans to back port deletion vector support to databricks deltalake versions that support them? Like 23x+?

revans2 · 2025-01-24T20:12:21Z

delta-lake/delta-31x/pom.xml

+        <groupId>com.nvidia</groupId>
+        <artifactId>rapids-4-spark-jdk-profiles_2.12</artifactId>
+        <version>25.02.0-SNAPSHOT</version>
+        <relativePath>../../jdk-profiles/pom.xml</relativePath>


Most versions of the pom.xml have this

spark-rapids/delta-lake/delta-20x/pom.xml

Line 26 in c82a2ea

<relativePath>../../jdk-profiles/pom.xml</relativePath>

and 23x has a different path

spark-rapids/delta-lake/delta-23x/pom.xml

Line 26 in c82a2ea

<relativePath>../../pom.xml</relativePath>

Is that a bug in 23x?

revans2 · 2025-01-24T20:40:26Z