[SPARK-34215][SQL] Keep tables cached after truncation #31308
Conversation
@HyukjinKwon @dongjoon-hyun @sunchao @cloud-fan Could you review this PR, please?
Kubernetes integration test starting
Kubernetes integration test status success
    }
    // After deleting the data, refresh the table to make sure we don't keep around a stale
    // file relation in the metastore cache and cached table data in the cache manager.
    spark.catalog.refreshTable(tableIdentWithDB)
Hmm... Not sure if we should catch any non-fatal exception like the previous one, since otherwise we'd skip updating stats?
- We don't catch any exceptions in other commands
- What kind of exceptions should we hide (catch) from users here?
It's not clear which exceptions would be caught here. uncacheQuery() doesn't throw anything, and spark.table(table.identifier).logicalPlan could fail, but in that case it is not clear how we reached this point.
I think someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or the HMS API. This may cause caches that depend on them AND on this truncated table to become invalid, and potentially an analysis exception when recaching them. I haven't had a chance to verify this though.
I feel that overall it is good practice to recover from unknown errors here and continue. DropTableCommand does this as well.
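For reference, a rough sketch of the recover-and-continue pattern being referred to (my approximation, not the exact DropTableCommand source, which uses internal cache-manager APIs; `spark` and `tableName` are stand-ins):

```scala
import scala.util.control.NonFatal
import org.apache.spark.sql.SparkSession

// Sketch: swallow non-fatal errors while uncaching so the command can still
// finish its remaining work (e.g. updating table statistics).
def uncacheIgnoringErrors(spark: SparkSession, tableName: String): Unit = {
  try {
    // Drop any cached data for the table.
    spark.catalog.uncacheTable(tableName)
  } catch {
    // The real command logs a warning here rather than printing.
    case NonFatal(e) => println(s"Failed to uncache table $tableName: $e")
  }
}
```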
The case which you described here is applicable to any other command like add/drop/rename/recover partitions. I believe we should either "fix" all commands, with tests for this case, or apply the approach without catching exceptions here, as we do in the other commands so far (otherwise the implementation looks inconsistent).
Personally, I'd keep just the try-catch logic here because I think the above does happen, and we shouldn't skip updating stats in that case. But I don't have a strong opinion on this.
Can we fix the inconsistency first? I.e. reach an agreement about whether we should add try-catch or not for all other commands.
> ... someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or HMS API.

If somebody dropped the table in parallel, updating statistics wouldn't matter any more. We should show the error to the user as soon as possible.

> can we fix the inconsistency first?

@sunchao Can you write a test which reproduces the issue?
We might utilize HiveThriftServer2Suites for this - I can check when I get time.
So I wasn't able to reproduce it with the above example, sorry for the false alarm. It turned out the analysis exception is thrown later, when the cache is actually queried (rather than in recacheByPlan itself). Therefore, I think it should be fine in this case.
I do agree we should keep it consistent (whether try-catch or not). IMO that can be done separately though.
Test build #134413 has finished for PR 31308 at commit
BTW, do we have documentation for the caching behaviour? It would be great to have some.
@HyukjinKwon Probably not. At least, I don't know of any such docs.
Yes, it would be great.
@@ -561,16 +561,9 @@ case class TruncateTableCommand(
        }
      }
    }
-   // After deleting the data, invalidate the table to make sure we don't keep around a stale
-   // file relation in the metastore cache.
-   spark.sessionState.refreshTable(tableName.unquotedString)
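Pieced together from the two snippets in this review, the gist of the change (paraphrased; the surrounding command code is abridged, identifiers as they appear in the diffs):

```scala
// Before: only the metastore-level relation cache was refreshed, and a
// separate uncacheQuery() call dropped the table from the cache manager,
// leaving the table uncached after TRUNCATE TABLE.
spark.sessionState.refreshTable(tableName.unquotedString)

// After: CatalogImpl.refreshTable() both invalidates the stale file relation
// and re-caches the (now empty) table data, so the table stays cached.
spark.catalog.refreshTable(tableIdentWithDB)
```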
Can we update the docs of these two refreshTable methods and make it clear when we should use which?
Let me document them separately from this PR.
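As a rough summary of the distinction under discussion (my reading of the two call sites in this PR, not official documentation; assumes a REPL-style `spark` session):

```scala
// SessionState/SessionCatalog-level refresh: invalidates cached metadata such
// as the file relation in the metastore cache, but does not touch the cache
// manager, so any cached table data is left alone (and may go stale).
spark.sessionState.refreshTable("tbl")

// Public catalog API (CatalogImpl.refreshTable): also invalidates the cached
// data and, if the table was cached, lazily re-caches it afterwards - which
// is what lets TRUNCATE TABLE keep the table cached.
spark.catalog.refreshTable("tbl")
```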
@@ -501,4 +501,17 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with TestHiveSingleton
      }
    }
  }
test("SPARK-34215: keep table cached after truncation") { |
Can we move this to org.apache.spark.sql.CachedTableSuite? The truncate table command is not limited to Hive.
Do you mean copy-paste the test there?
BTW, I am going to write unified tests for the TRUNCATE TABLE command and move the test there, to run it against both the v1 In-Memory and Hive external catalogs.
OK sounds good. We can keep it for now then.
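Judging from the before/after behaviour in the PR description, the new test presumably looks something like this (a sketch, not the exact committed code):

```scala
test("SPARK-34215: keep table cached after truncation") {
  withTable("tbl") {
    sql("CREATE TABLE tbl (c0 int)")
    sql("INSERT INTO tbl SELECT 0")
    sql("CACHE TABLE tbl")
    assert(spark.catalog.isCached("tbl"))

    sql("TRUNCATE TABLE tbl")

    // The table must remain cached, and the cached data must reflect the
    // truncation (i.e. be empty), not the stale pre-truncation rows.
    assert(spark.catalog.isCached("tbl"))
    checkAnswer(sql("SELECT * FROM tbl"), Seq.empty)
  }
}
```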
thanks, merging to master!
### What changes were proposed in this pull request?

Invoke `CatalogImpl.refreshTable()` instead of the combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This allows clearing cached table data while keeping the table cached.

### Why are the changes needed?

1. To improve user experience with Spark SQL
2. To be consistent with other commands, see apache#31206

### Does this PR introduce _any_ user-facing change?

Yes.

Before:
```scala
scala> sql("CREATE TABLE tbl (c0 int)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("INSERT INTO tbl SELECT 0")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("SELECT * FROM tbl").show(false)
+---+
|c0 |
+---+
|0  |
+---+

scala> spark.catalog.isCached("tbl")
res5: Boolean = true

scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.isCached("tbl")
res7: Boolean = false
```

After:
```scala
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.isCached("tbl")
res7: Boolean = true
```

### How was this patch tested?

Added a new test to `CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes apache#31308 from MaxGekk/truncate-table-cached.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>