GC: fix incorrect paths when listing prefixes with ADLS #8661
Conversation
f -> {
  StorageUri location = StorageUri.of(f.location());
  if (!location.isAbsolute()) {
    location = basePath.resolve("/").resolve(location);
Isn't basePath.relativize(location) below equal to location in this case? (Maybe I'm too tired ATM tho)
Heh no :-)
Here is a typical case with ADLS:
basePath = abfs://container@account.dfs.core.windows.net/warehouse/ns/table/
f.location() = warehouse/ns/table/whatever.parquet
So basePath.relativize(location) would yield warehouse/ns/table/whatever.parquet, which would then be resolved against basePath, giving a wrong result:
abfs://container@account.dfs.core.windows.net/warehouse/ns/table/warehouse/ns/table/whatever.parquet
So we need to first take the "root" URI:
basePath.resolve("/") = abfs://container@account.dfs.core.windows.net/
Then resolve location against it:
root.resolve(location) = abfs://container@account.dfs.core.windows.net/warehouse/ns/table/whatever.parquet
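For illustration, here is the same resolution logic written with plain java.net.URI, which produces the same results as the StorageUri examples above; the container and account names are placeholders, not values from this PR:

import java.net.URI;

public class AdlsResolveExample {
  public static void main(String[] args) {
    URI basePath = URI.create("abfs://container@account.dfs.core.windows.net/warehouse/ns/table/");
    // ADLS list-prefix returns a relative path for the file
    URI location = URI.create("warehouse/ns/table/whatever.parquet");

    // Wrong: resolving the relative location directly against basePath
    // duplicates the warehouse/ns/table/ prefix.
    System.out.println(basePath.resolve(location));
    // -> abfs://container@account.dfs.core.windows.net/warehouse/ns/table/warehouse/ns/table/whatever.parquet

    // Right: first reduce basePath to the filesystem root, then resolve.
    URI root = basePath.resolve("/"); // abfs://container@account.dfs.core.windows.net/
    System.out.println(root.resolve(location));
    // -> abfs://container@account.dfs.core.windows.net/warehouse/ns/table/whatever.parquet
  }
}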
// Enforce a single version of Netty among dependencies
// (Spark, Hadoop and Azure)
implementation(platform("io.netty:netty-bom:4.1.110.Final"))
Renovate will bump this one. You'd need a version range or make the version explicit in another way using "Gradle magic" (IIRC the nessie-client build script has a few of those).
I think it's OK to bump this version, as long as one single version is used. If it ever happens that a new Netty version becomes incompatible with one of the components using it, then we'd need to force it – but for now, I think that's fine?
Oh - then maybe just add it to gradle/libs.versions.toml.
Maybe also make it an enforcedPlatform, which enforces the exact versions of that BOM, whereas versions from a "plain" platform can be bumped.
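For example, a sketch of what that could look like, assuming a netty-bom entry is added to the version catalog (the catalog key and version reference below are hypothetical):

// gradle/libs.versions.toml (hypothetical entry)
// [versions]
// netty = "4.1.110.Final"
// [libraries]
// netty-bom = { module = "io.netty:netty-bom", version.ref = "netty" }

// build.gradle.kts
dependencies {
  // enforcedPlatform pins the exact versions declared in the BOM,
  // whereas a plain platform only contributes versions that other
  // constraints may still override.
  implementation(enforcedPlatform(libs.netty.bom))
}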
It seems that Iceberg's ADLS FileIO implementation behaves differently from the S3, GCS and Hadoop FileIO implementations: S3 and GCS return fully qualified URIs (s3://bucket/path/path/path/file) in FileInfo, but ADLS returns only the path, as a relative URI.
I wonder what else would break in Iceberg, given that ADLSFileIO.listPrefix() has a different behavior.
FYI adutra#1
I built a docker container locally using this fix and deployed it in our K8S lab to test against real data on Azure ADLSv2. Everything worked as expected and the files are now getting deleted.
Built container using...
Fixes #8659.
The core of the issue is that ADLS returns relative paths in response to a prefix-list operation, while S3 and GCS return absolute paths.
Fixing that issue uncovered a second issue: the BlobStorageException with error code PathNotFound wasn't being flagged as a not-found error.
A third issue was that we had multiple SLF4J bindings on the classpath, so no log messages were printed during tests. I fixed that too.
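For illustration, a minimal sketch of the shape of that second fix, not the actual Nessie code (the helper class and method names are hypothetical; it relies on the Azure SDK's BlobStorageException accessors):

import com.azure.storage.blob.models.BlobStorageException;

// Hypothetical helper: treat both a plain 404 and the ADLS-specific
// "PathNotFound" error code as "object not found" instead of a hard failure.
final class AdlsNotFound {
  private AdlsNotFound() {}

  static boolean isNotFound(BlobStorageException e) {
    String code = e.getErrorCode() == null ? null : e.getErrorCode().toString();
    return e.getStatusCode() == 404 || "BlobNotFound".equals(code) || "PathNotFound".equals(code);
  }
}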
It was extremely hard to diagnose the issue given how difficult it is to test with ADLS:
1. ITSparkIcebergNessieAzure had issues with incompatible Netty versions on the classpath; I fixed these, but it seems Azurite is not compatible with the ADLS v2 list-prefix REST endpoint, so it's basically unusable.
2. ITSparkIcebergNessieAzureMock, which uses @snazy's excellent mock server, worked once I got past two more issues:
   a. the serializability issue in AzureProperties (PR by @snazy, still not merged);
   b. an issue with the list-prefix REST endpoint response – there was an undocumented required field (!!).
Basically, if you want to test:
1. ./gradlew clean publishToMavenLocal -DsparkVersions=3.3,3.4,3.5;
2. gradle/libs.versions.toml;
3. remove the @Disabled annotation on ITSparkIcebergNessieAzureMock;
Then run:
One last issue that I decided not to fix is an issue with special chars (test case 3): I think that ADLSFileIO is inherently broken since it passes blob names unencoded to the ADLS client, but the client attempts to URL-decode the name, see com.azure.storage.blob.specialized.BlobAsyncClientBase. This blows up with: