
[SPARK-34215][SQL] Keep tables cached after truncation #31308

Closed

MaxGekk (Member) commented Jan 24, 2021

What changes were proposed in this pull request?

Invoke `CatalogImpl.refreshTable()` instead of the combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This makes it possible to clear the cached table data while keeping the table itself cached.
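For context, a simplified sketch of the difference (identifiers taken from the diff hunks shown in the review below; the "before" uncacheQuery call site is paraphrased, not the exact patch):

```scala
// Before (simplified): invalidate cached metadata, then drop the table's data
// from the cache manager entirely, so the table is no longer cached afterwards.
spark.sessionState.refreshTable(tableName.unquotedString)
spark.sharedState.cacheManager.uncacheQuery(
  spark.table(tableName.unquotedString), cascade = true)

// After: a single call that invalidates stale metadata and lazily re-caches the
// (now empty) table, so spark.catalog.isCached(...) keeps returning true.
spark.catalog.refreshTable(tableIdentWithDB)
```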

Why are the changes needed?

  1. To improve user experience with Spark SQL
  2. To be consistent with other commands, see [SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables #31206

Does this PR introduce any user-facing change?

Yes.

Before:

```scala
scala> sql("CREATE TABLE tbl (c0 int)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT INTO tbl SELECT 0")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []
scala> sql("SELECT * FROM tbl").show(false)
+---+
|c0 |
+---+
|0  |
+---+
scala> spark.catalog.isCached("tbl")
res5: Boolean = true
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = false
```

After:

```scala
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = true
```

How was this patch tested?

Added a new test to `CachedTableSuite`:

```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

MaxGekk (Member Author) commented Jan 24, 2021

@HyukjinKwon @dongjoon-hyun @sunchao @cloud-fan Could you review this PR, please?

SparkQA commented Jan 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38999/

SparkQA commented Jan 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38999/

Review thread on the new code in TruncateTableCommand:

```
}
// After deleting the data, refresh the table to make sure we don't keep around a stale
// file relation in the metastore cache and cached table data in the cache manager.
spark.catalog.refreshTable(tableIdentWithDB)
```
Member:

Hmm... Not sure if we should catch any non-fatal exception like the previous one, since otherwise we'd skip updating stats?

MaxGekk (Member Author) commented Jan 24, 2021:

  1. We don't catch any exceptions in other commands
  2. What kind of exceptions should we hide (catch) from users here?

MaxGekk (Member Author):

It's not clear which exceptions are caught here. uncacheQuery() doesn't throw anything; spark.table(table.identifier).logicalPlan could fail, but in that case it is not clear how we reached this point.

sunchao (Member) commented Jan 25, 2021:

I think someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or the HMS API. This may cause caches that depend on them AND on this truncated table to become invalid, and potentially an analysis exception when recaching them. I haven't had a chance to verify this though.

I feel that overall it would be good practice to recover from unknown errors here and continue; DropTableCommand does this as well.
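For reference, the recover-and-continue pattern being discussed, sketched after the style of DropTableCommand (identifiers illustrative; not the exact Spark source):

```scala
import scala.util.control.NonFatal

// Swallow non-fatal failures during the cache refresh so the command can still
// proceed, e.g. to update table statistics afterwards.
try {
  spark.catalog.refreshTable(tableIdentWithDB)
} catch {
  case NonFatal(e) =>
    logWarning(s"Exception when attempting to refresh table $tableIdentWithDB", e)
}
```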

MaxGekk (Member Author):

The case you describe here applies to any other command like add/drop/rename/recover partitions. I believe we should either "fix" all commands, with tests for this case, or apply the approach without catching exceptions here, as we do in other commands so far (otherwise the implementation looks inconsistent).

Member:

Personally I'd keep just the try-catch logic here, because I think the above does happen and we shouldn't skip updating stats in that case. But I don't have a strong opinion on this.

Contributor:

Can we fix the inconsistency first? I.e., reach an agreement about whether we should add try-catch or not for all other commands.

MaxGekk (Member Author):

> ... someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or HMS API.

If somebody dropped the table in parallel, updating statistics wouldn't matter anymore. We should show the error to the user as soon as possible.

> can we fix the inconsistency first?

@sunchao Can you write a test which reproduces the issue?

sunchao (Member):

We might utilize HiveThriftServer2Suites for this - I can check when I get time.

sunchao (Member):

So I wasn't able to reproduce it with the above example; sorry for the false alarm. It turned out the analysis exception is thrown later, when the cache is actually queried (rather than in recacheByPlan itself), so I think it should be fine in this case.

I do agree we should keep it consistent (whether try-catch or not). IMO that can be done separately though.

SparkQA commented Jan 24, 2021

Test build #134413 has finished for PR 31308 at commit 25b8583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

BTW, do we have documentation for the caching behaviour? It would be great to have one.

MaxGekk (Member Author) commented Jan 25, 2021

> BTW, do we have documentation for the caching behaviour?

@HyukjinKwon Probably not. At least, I'm not aware of any such docs.

> it would be great to have one.

Yes, it would be great.

```
@@ -561,16 +561,9 @@ case class TruncateTableCommand(
}
}
}
// After deleting the data, invalidate the table to make sure we don't keep around a stale
// file relation in the metastore cache.
spark.sessionState.refreshTable(tableName.unquotedString)
```
Contributor:

Can we update the docs of these two refreshTable methods and make it clear when we should use which?

MaxGekk (Member Author):

Let me document them separately from this PR.
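For orientation, the two methods in question as they appear in this PR; the behavior notes below summarize the discussion and are not the official docs:

```scala
// SessionState.refreshTable: invalidates the cached metadata for the table
// (e.g. the file relation in the metastore cache) in this session only.
spark.sessionState.refreshTable("db.tbl")

// CatalogImpl.refreshTable (public Catalog API): invalidates the cached
// metadata too, and additionally re-caches the table's data lazily if the
// table was cached, which is what keeps it cached after TRUNCATE.
spark.catalog.refreshTable("db.tbl")
```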

```
@@ -501,4 +501,17 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
}
}
}

test("SPARK-34215: keep table cached after truncation") {
```
sunchao (Member):

Can we move this to org.apache.spark.sql.CachedTableSuite? The TRUNCATE TABLE command is not limited to Hive.

MaxGekk (Member Author):

Do you mean copy-paste the test there?

MaxGekk (Member Author):

BTW, I am going to write unified tests for the TRUNCATE TABLE command, and move this test there so it runs against both the v1 In-Memory and Hive external catalogs.

sunchao (Member):

OK sounds good. We can keep it for now then.

MaxGekk (Member Author):

@sunchao Here is the PR #31387 which unifies the TRUNCATE TABLE tests.
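For completeness, a sketch of what such a test checks, reconstructed from the Before/After example above (not the exact test body; withTable and checkAnswer come from Spark's test utilities):

```scala
test("SPARK-34215: keep table cached after truncation") {
  withTable("tbl") {
    sql("CREATE TABLE tbl (c0 int)")
    sql("INSERT INTO tbl SELECT 0")
    sql("CACHE TABLE tbl")
    assert(spark.catalog.isCached("tbl"))

    sql("TRUNCATE TABLE tbl")
    // The table must stay cached, and the cached data must now be empty.
    assert(spark.catalog.isCached("tbl"))
    checkAnswer(sql("SELECT * FROM tbl"), Seq.empty)
  }
}
```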

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in ac8307d on Jan 26, 2021
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
Closes apache#31308 from MaxGekk/truncate-table-cached.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>