
[SPARK-34215][SQL] Keep tables cached after truncation #31308

Closed

MaxGekk (Member) commented Jan 24, 2021

What changes were proposed in this pull request?

Invoke `CatalogImpl.refreshTable()` instead of the combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This makes it possible to clear the cached table data while keeping the table itself cached.
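For context, a simplified sketch of the difference (identifiers taken from the diff hunks shown in the review below; the "before" uncacheQuery call site is paraphrased, not the exact patch):

```scala
// Before (simplified): invalidate cached metadata, then drop the table's data
// from the cache manager entirely, so the table is no longer cached afterwards.
spark.sessionState.refreshTable(tableName.unquotedString)
spark.sharedState.cacheManager.uncacheQuery(
  spark.table(tableName.unquotedString), cascade = true)

// After: a single call that invalidates stale metadata and lazily re-caches the
// (now empty) table, so spark.catalog.isCached(...) keeps returning true.
spark.catalog.refreshTable(tableIdentWithDB)
```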

Why are the changes needed?

  1. To improve user experience with Spark SQL
  2. To be consistent with other commands, see [SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables #31206

Does this PR introduce any user-facing change?

Yes.

Before:

```scala
scala> sql("CREATE TABLE tbl (c0 int)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT INTO tbl SELECT 0")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []
scala> sql("SELECT * FROM tbl").show(false)
+---+
|c0 |
+---+
|0  |
+---+
scala> spark.catalog.isCached("tbl")
res5: Boolean = true
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = false
```

After:

```scala
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = true
```

How was this patch tested?

Added a new test to `CachedTableSuite`:

```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

MaxGekk (Member Author) commented Jan 24, 2021

@HyukjinKwon @dongjoon-hyun @sunchao @cloud-fan Could you review this PR, please?

SparkQA commented Jan 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38999/

SparkQA commented Jan 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38999/

Review thread on the new code in TruncateTableCommand:

```
}
// After deleting the data, refresh the table to make sure we don't keep around a stale
// file relation in the metastore cache and cached table data in the cache manager.
spark.catalog.refreshTable(tableIdentWithDB)
```
Member:

Hmm... Not sure if we should catch any non-fatal exception like the previous one, since otherwise we'd skip updating stats?

MaxGekk (Member Author) commented Jan 24, 2021:

  1. We don't catch any exceptions in other commands
  2. What kind of exceptions should we hide (catch) from users here?

MaxGekk (Member Author):

It's not clear which exceptions are caught here. uncacheQuery() doesn't throw anything; spark.table(table.identifier).logicalPlan could fail, but in that case it is not clear how we reached this point.

sunchao (Member) commented Jan 25, 2021:

I think someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or the HMS API. This may cause caches that depend on them AND on this truncated table to become invalid, and potentially an analysis exception when recaching them. I haven't had a chance to verify this though.

I feel that overall it would be good practice to recover from unknown errors here and continue; DropTableCommand does this as well.
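For reference, the recover-and-continue pattern being discussed, sketched after the style of DropTableCommand (identifiers illustrative; not the exact Spark source):

```scala
import scala.util.control.NonFatal

// Swallow non-fatal failures during the cache refresh so the command can still
// proceed, e.g. to update table statistics afterwards.
try {
  spark.catalog.refreshTable(tableIdentWithDB)
} catch {
  case NonFatal(e) =>
    logWarning(s"Exception when attempting to refresh table $tableIdentWithDB", e)
}
```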

MaxGekk (Member Author):

The case you describe here applies to any other command like add/drop/rename/recover partitions. I believe we should either "fix" all commands, with tests for this case, or apply the approach without catching exceptions here, as we do in other commands so far (otherwise the implementation looks inconsistent).

Member:

Personally I'd keep just the try-catch logic here, because I think the above does happen and we shouldn't skip updating stats in that case. But I don't have a strong opinion on this.

Contributor:

Can we fix the inconsistency first? I.e., reach an agreement about whether we should add try-catch or not for all other commands.

MaxGekk (Member Author):

> ... someone could drop a table or permanent view in a different session, or drop a Hive table through beeline or HMS API.

If somebody dropped the table in parallel, updating statistics wouldn't matter anymore. We should show the error to the user as soon as possible.

> can we fix the inconsistency first?

@sunchao Can you write a test which reproduces the issue?

sunchao (Member):

We might utilize HiveThriftServer2Suites for this - I can check when I get time.

sunchao (Member):

So I wasn't able to reproduce it with the above example; sorry for the false alarm. It turned out the analysis exception is thrown later, when the cache is actually queried (rather than in recacheByPlan itself), so I think it should be fine in this case.

I do agree we should keep it consistent (whether try-catch or not). IMO that can be done separately though.

SparkQA commented Jan 24, 2021

Test build #134413 has finished for PR 31308 at commit 25b8583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

BTW, do we have documentation for the caching behaviour? It would be great to have one.

MaxGekk (Member Author) commented Jan 25, 2021

> BTW, do we have documentation for the caching behaviour?

@HyukjinKwon Probably not. At least, I'm not aware of any such docs.

> it would be great to have one.

Yes, it would be great.

```
@@ -561,16 +561,9 @@ case class TruncateTableCommand(
}
}
}
// After deleting the data, invalidate the table to make sure we don't keep around a stale
// file relation in the metastore cache.
spark.sessionState.refreshTable(tableName.unquotedString)
```
Contributor:

Can we update the docs of these two refreshTable methods and make it clear when we should use which?

MaxGekk (Member Author):

Let me document them separately from this PR.
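For orientation, the two methods in question as they appear in this PR; the behavior notes below summarize the discussion and are not the official docs:

```scala
// SessionState.refreshTable: invalidates the cached metadata for the table
// (e.g. the file relation in the metastore cache) in this session only.
spark.sessionState.refreshTable("db.tbl")

// CatalogImpl.refreshTable (public Catalog API): invalidates the cached
// metadata too, and additionally re-caches the table's data lazily if the
// table was cached, which is what keeps it cached after TRUNCATE.
spark.catalog.refreshTable("db.tbl")
```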

```
@@ -501,4 +501,17 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
}
}
}

test("SPARK-34215: keep table cached after truncation") {
```
sunchao (Member):

Can we move this to org.apache.spark.sql.CachedTableSuite? The TRUNCATE TABLE command is not limited to Hive.

MaxGekk (Member Author):

Do you mean copy-paste the test there?

MaxGekk (Member Author):

BTW, I am going to write unified tests for the TRUNCATE TABLE command, and move this test there so it runs against both the v1 In-Memory and Hive external catalogs.

sunchao (Member):

OK sounds good. We can keep it for now then.

MaxGekk (Member Author):

@sunchao Here is the PR #31387 which unifies the TRUNCATE TABLE tests.
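For completeness, a sketch of what such a test checks, reconstructed from the Before/After example above (not the exact test body; withTable and checkAnswer come from Spark's test utilities):

```scala
test("SPARK-34215: keep table cached after truncation") {
  withTable("tbl") {
    sql("CREATE TABLE tbl (c0 int)")
    sql("INSERT INTO tbl SELECT 0")
    sql("CACHE TABLE tbl")
    assert(spark.catalog.isCached("tbl"))

    sql("TRUNCATE TABLE tbl")
    // The table must stay cached, and the cached data must now be empty.
    assert(spark.catalog.isCached("tbl"))
    checkAnswer(sql("SELECT * FROM tbl"), Seq.empty)
  }
}
```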

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in ac8307d on Jan 26, 2021
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
Closes apache#31308 from MaxGekk/truncate-table-cached.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>