-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31709][SQL] Proper base path for database/table location when it is a relative path #28527
Conversation
Test build #122612 has finished for PR 28527 at commit
|
Test build #122614 has started for PR 28527 at commit |
retest this please |
Test build #122622 has finished for PR 28527 at commit
|
retest this please |
Test build #122639 has finished for PR 28527 at commit
|
retest this please |
1 similar comment
retest this please |
Test build #122690 has finished for PR 28527 at commit
|
retest this please |
@@ -231,7 +239,8 @@ class SessionCatalog( | |||
def alterDatabase(dbDefinition: CatalogDatabase): Unit = { | |||
val dbName = formatDatabaseName(dbDefinition.name) | |||
requireDbExists(dbName) | |||
externalCatalog.alterDatabase(dbDefinition.copy(name = dbName)) | |||
externalCatalog.alterDatabase(dbDefinition.copy( | |||
name = dbName, locationUri = makeQualifiedDBPath(dbDefinition.locationUri))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update the location uri here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have ALTER (DATABASE|SCHEMA) database_name SET LOCATION path
syntax that works for hive 3.x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see.
retest this please |
Test build #122764 has finished for PR 28527 at commit
|
cc @cloud-fan @dongjoon-hyun too, thanks |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine, thanks, @yaooqinn
Test build #122827 has finished for PR 28527 at commit
|
retest this please |
Test build #122859 has finished for PR 28527 at commit
|
retest this please |
cc: @cloud-fan |
Test build #123270 has finished for PR 28527 at commit
|
Test build #126862 has finished for PR 28527 at commit
|
retest this please |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Show resolved
Hide resolved
can you rebase/merge with master, to trigger github actions? |
Test build #126876 has finished for PR 28527 at commit
|
Test build #126881 has finished for PR 28527 at commit
|
gentle ping @cloud-fan |
locationUri | ||
} else { | ||
val dbName = formatDatabaseName(database) | ||
val dbLocation = makeQualifiedDBPath(getDatabaseMetadata(dbName).locationUri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about it as it adds an extra database lookup. Is it better to push this work to the underlying external catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that we are calling requireDbExists(dbName)
repeatedly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally createTable
should only do one RPC to the hive metastore. requireDbExists
is one problem but we can simply remove it. However, the new database lookup seems can't be easily removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the new database lookup seems can't be easily removed.
Yes, it seem make no difference even we put this into external catalogs, we have to call the getDatabase
API in order to get the actual path of the database for this particular case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if we don't qualify the path here and leave it to hive metastore? Will it still be a relative path in the hive metastore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, #17254
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok we did the same thing for partition. LGTM then
thanks, merging to master! |
…tter of its path is slash in create/alter table ### What changes were proposed in this pull request? After #28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead. This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table. ### Why are the changes needed? This is to fix the behavior described above. ### Does this PR introduce _any_ user-facing change? Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI. ### How was this patch tested? Updated unit tests. Closes #35462 from bozhang2820/spark-31709. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…tter of its path is slash in create/alter table After apache#28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead. This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table. This is to fix the behavior described above. Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI. Updated unit tests. Closes apache#35462 from bozhang2820/spark-31709. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…new Path(locationUri).isAbsolute" in create/alter table ### What changes were proposed in this pull request? After #28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead. This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table. ### Why are the changes needed? This is to fix the behavior described above. ### Does this PR introduce _any_ user-facing change? Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI. ### How was this patch tested? Updated unit tests. Closes #35591 from bozhang2820/spark-31709-3.2. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…new Path(locationUri).isAbsolute" in create/alter table ### What changes were proposed in this pull request? After #28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead. This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table. ### Why are the changes needed? This is to fix the behavior described above. ### Does this PR introduce _any_ user-facing change? Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI. ### How was this patch tested? Updated unit tests. Closes #35591 from bozhang2820/spark-31709-3.2. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 915f0cc) Signed-off-by: Wenchen Fan <[email protected]>
…new Path(locationUri).isAbsolute" in create/alter table ### What changes were proposed in this pull request? After apache#28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead. This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table. ### Why are the changes needed? This is to fix the behavior described above. ### Does this PR introduce _any_ user-facing change? Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI. ### How was this patch tested? Updated unit tests. Closes apache#35591 from bozhang2820/spark-31709-3.2. Authored-by: Bo Zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Currently, the user home directory is used as the base path for the database and table locations when their locationa are specified with a relative paths, e.g.
The user home is not always warehouse-related, unchangeable in runtime, and shared both by database and table as the parent directory. Meanwhile, we use the table path as the parent directory for relative partition locations.
The config
spark.sql.warehouse.dir` as the base path instead.
spark.sql.warehouse.dir
representsthe default location for managed databases and tables
.For databases, the case above seems not to follow its semantics, because it should use
For tables, it seems to be right but here I suggest enriching the meaning that lets it also be the for external tables with relative paths for locations.
With changes in this PR,
The location of a database will be
warehouseDir/dbpath
whendbpath
is relative.The location of a table will be
dbpath/tblpath
whentblpath
is relative.Why are the changes needed?
bugfix and improvement
Firstly, the databases with relative locations should be created under the default location specified by
spark.sql.warehouse.dir
.Secondly, the external tables with relative paths may also follow this behavior for consistency.
At last, the behavior for database, tables and partitions with relative paths to choose base paths should be the same.
Does this PR introduce any user-facing change?
Yes, this PR changes the
createDatabase
,alterDatabase
,createTable
andalterTable
APIs and related DDLs. If the LOCATION clause is followed by a relative path, the root path will bespark.sql.warehouse.dir
for databases, andspark.sql.warehouse.dir
/dbPath
for tables.e.g.
after
How was this patch tested?
Add unit tests.