[C++] Implement file reads for Azure filesystem #37511
Comments
cc @srilman since you mentioned you might be able to help out
@Tom-Newton are you taking over the work on #12914?
I wouldn't say I'm taking over, but I'm keen to push it along. So far @srilman and I have both merged PRs that implement a subset of what #12914 implemented.
Thank you! Would you mind rebasing that PR to incorporate the work that has been done on the other PRs?
The plan was to keep merging small sections until it's feature complete. I think that's more likely to get done by extracting small sections from #12914 than by rebasing it and trying to merge it all in one go. This was the approach taken for the GCS filesystem and recommended by @kou.
@Tom-Newton ok, this makes sense. Just make sure any work in progress you have is as visible as possible so we can help along the way.
I'm going to start working on this. |
I think the PR is ready for review: #38269.
### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step.

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure Blob Storage (with or without hierarchical namespace (HNS), a.k.a. Data Lake Gen2). This is largely copied from #12914. Using this `io::RandomAccessFile` implementation, we implement the input file and stream methods of the `AzureFileSystem`.

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK Data Lake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS-enabled accounts, and detecting an HNS-enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding two different code paths for normal blob storage and Data Lake Gen2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to the error handling and to make things more consistent with the S3 and GCS filesystems.

### Are these changes tested?

Yes. The tests are all based on the tests for the GCS filesystem, with minimal changes. I remember reading a review comment on #12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file-info methods, so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the `AzureFileSystem` when the relevant methods are implemented.

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported.

* Closes: #37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
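For readers unfamiliar with the shape this takes, here is a minimal sketch of a blob-backed `io::RandomAccessFile` where `ReadAt()` becomes a ranged download through the Azure Blobs SDK. This is illustrative only, not the merged implementation: the class name is made up, and it assumes the blob size was fetched up front (e.g. via an initial `GetProperties()` call).

```cpp
#include <algorithm>
#include <memory>
#include <utility>

#include <azure/storage/blobs.hpp>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Illustrative only: a blob-backed arrow::io::RandomAccessFile whose
// ReadAt() issues a ranged download through the Azure Blobs SDK.
class AzureBlobInputFile : public arrow::io::RandomAccessFile {
 public:
  // Assumes the caller already knows the blob size (e.g. from GetProperties()).
  AzureBlobInputFile(Azure::Storage::Blobs::BlobClient client, int64_t size)
      : client_(std::move(client)), size_(size) {}

  arrow::Result<int64_t> ReadAt(int64_t position, int64_t nbytes,
                                void* out) override {
    const int64_t to_read = std::min(nbytes, size_ - position);
    if (to_read <= 0) return 0;
    // Restrict the download to the requested byte range.
    Azure::Core::Http::HttpRange range;
    range.Offset = position;
    range.Length = to_read;
    Azure::Storage::Blobs::DownloadBlobToOptions options;
    options.Range = range;
    try {
      client_.DownloadTo(static_cast<uint8_t*>(out),
                         static_cast<size_t>(to_read), options);
    } catch (const Azure::Storage::StorageException& e) {
      return arrow::Status::IOError("Ranged blob download failed: ", e.what());
    }
    return to_read;
  }

  arrow::Result<int64_t> GetSize() override { return size_; }
  arrow::Status Seek(int64_t position) override {
    pos_ = position;
    return arrow::Status::OK();
  }
  arrow::Result<int64_t> Tell() const override { return pos_; }

  // Sequential reads are just positional reads that advance a cursor.
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    ARROW_ASSIGN_OR_RAISE(int64_t n, ReadAt(pos_, nbytes, out));
    pos_ += n;
    return n;
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    ARROW_ASSIGN_OR_RAISE(auto buffer, arrow::AllocateResizableBuffer(nbytes));
    ARROW_ASSIGN_OR_RAISE(int64_t n, Read(nbytes, buffer->mutable_data()));
    ARROW_RETURN_NOT_OK(buffer->Resize(n));
    return std::shared_ptr<arrow::Buffer>(std::move(buffer));
  }

  arrow::Status Close() override {
    closed_ = true;
    return arrow::Status::OK();
  }
  bool closed() const override { return closed_; }

 private:
  Azure::Storage::Blobs::BlobClient client_;
  int64_t size_;
  int64_t pos_ = 0;
  bool closed_ = false;
};
```

Note that a ranged `DownloadTo` is the same blob API whether or not the account has HNS enabled, which is what makes the single read code path described above viable.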
Describe the enhancement requested
Read support probably requires an Azure implementation of `arrow::io::RandomAccessFile`; that can then be used to implement the `OpenInputStream` and `OpenInputFile` methods of the `AzureFileSystem` (see the sketch below). #12914 implemented all of these features, so this will largely be a case of extracting the relevant parts from there. One modification I would suggest compared to that PR is to avoid branching logic based on whether the Azure storage account has the hierarchical namespace enabled. Utilising features of the hierarchical namespace can make renames and listing tasks faster, but for just reading blobs it shouldn't make any difference.
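A minimal sketch of that delegation, with `MakeAzureInputFile` as a hypothetical factory standing in for the real blob-backed implementation:

```cpp
#include <memory>
#include <string>

#include "arrow/io/interfaces.h"
#include "arrow/result.h"

// Hypothetical factory standing in for the blob-backed RandomAccessFile
// implementation; not a real Arrow API.
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> MakeAzureInputFile(
    const std::string& path);

// OpenInputFile can return the RandomAccessFile directly.
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> OpenInputFile(
    const std::string& path) {
  return MakeAzureInputFile(path);
}

// OpenInputStream can reuse the same object, since
// arrow::io::RandomAccessFile already derives from arrow::io::InputStream.
arrow::Result<std::shared_ptr<arrow::io::InputStream>> OpenInputStream(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, MakeAzureInputFile(path));
  return file;
}
```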
If we want to use features of the hierarchical namespace, that adds some complexity: `ServiceClient::GetAccountInfo()` requires Storage Account Contributor permissions (https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization), which is significantly elevated. Hadoop works around this by essentially calling `PathClient::GetAccessControlList()` and, if it raises an exception, concluding that the hierarchical namespace is not enabled: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356-L385
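A rough sketch of that Hadoop-style probe; the client parameter and the blanket exception handling are illustrative, and a real implementation would inspect the specific storage error code rather than treat every failure as "HNS disabled":

```cpp
#include <azure/storage/files/datalake.hpp>

// Probe for hierarchical-namespace support the way Hadoop's ABFS driver
// does: GetAccessControlList() only succeeds on HNS-enabled accounts.
bool DetectHierarchicalNamespace(
    Azure::Storage::Files::DataLake::DataLakePathClient& client) {
  try {
    client.GetAccessControlList();
    return true;
  } catch (const Azure::Storage::StorageException&) {
    // Assumption for this sketch: any failure means a flat-namespace
    // account. A robust check would look at the error code so that
    // transient failures are not mistaken for "HNS disabled".
    return false;
  }
}
```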
Component(s)
C++