[C++] Implement file reads for Azure filesystem #37511
Comments
cc @srilman since you mentioned you might be able to help out
@Tom-Newton are you taking over the work on #12914?
I wouldn't say I'm taking over, but I'm keen to push it along. So far @srilman and I have both merged PRs that implement a subset of what #12914 implemented.
Thank you! Would you mind rebasing that PR to incorporate the work that has been done on the other PRs?
The plan was to keep merging small sections until it's feature complete. I think that's more likely to get done by extracting small sections from #12914 than by rebasing it and trying to merge it all in one go. This was the approach taken for the GCS filesystem and recommended by @kou.
@Tom-Newton ok, this makes sense. Just make sure any work in progress you have is as visible as possible so we can help along the way.
I'm going to start working on this. |
I think the PR is ready for review: #38269.
### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step.

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure Blob Storage (with or without hierarchical namespace (HNS), a.k.a. Data Lake Gen2). This is largely copied from #12914. Using this `io::RandomAccessFile` implementation, we implement the input file and stream methods of the `AzureFileSystem`.

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK Data Lake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS-enabled accounts, and detecting an HNS-enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding two different code paths for normal blob storage and Data Lake Gen2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to the error handling and to make things more consistent with the S3 and GCS filesystems.

### Are these changes tested?

Yes. The tests are all based on the tests for the GCS filesystem, with minimal changes. I remember reading a review comment on #12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file-info methods, so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the `AzureFileSystem` when the relevant methods are implemented.

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported.

* Closes: #37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
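For readers unfamiliar with the shape this takes, here is a minimal sketch of a blob-backed `io::RandomAccessFile` where `ReadAt()` becomes a ranged download through the Azure Blobs SDK. This is illustrative only, not the merged implementation: the class name is made up, and it assumes the blob size was fetched up front (e.g. via an initial `GetProperties()` call).

```cpp
#include <algorithm>
#include <memory>
#include <utility>

#include <azure/storage/blobs.hpp>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Illustrative only: a blob-backed arrow::io::RandomAccessFile whose
// ReadAt() issues a ranged download through the Azure Blobs SDK.
class AzureBlobInputFile : public arrow::io::RandomAccessFile {
 public:
  // Assumes the caller already knows the blob size (e.g. from GetProperties()).
  AzureBlobInputFile(Azure::Storage::Blobs::BlobClient client, int64_t size)
      : client_(std::move(client)), size_(size) {}

  arrow::Result<int64_t> ReadAt(int64_t position, int64_t nbytes,
                                void* out) override {
    const int64_t to_read = std::min(nbytes, size_ - position);
    if (to_read <= 0) return 0;
    // Restrict the download to the requested byte range.
    Azure::Core::Http::HttpRange range;
    range.Offset = position;
    range.Length = to_read;
    Azure::Storage::Blobs::DownloadBlobToOptions options;
    options.Range = range;
    try {
      client_.DownloadTo(static_cast<uint8_t*>(out),
                         static_cast<size_t>(to_read), options);
    } catch (const Azure::Storage::StorageException& e) {
      return arrow::Status::IOError("Ranged blob download failed: ", e.what());
    }
    return to_read;
  }

  arrow::Result<int64_t> GetSize() override { return size_; }
  arrow::Status Seek(int64_t position) override {
    pos_ = position;
    return arrow::Status::OK();
  }
  arrow::Result<int64_t> Tell() const override { return pos_; }

  // Sequential reads are just positional reads that advance a cursor.
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    ARROW_ASSIGN_OR_RAISE(int64_t n, ReadAt(pos_, nbytes, out));
    pos_ += n;
    return n;
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    ARROW_ASSIGN_OR_RAISE(auto buffer, arrow::AllocateResizableBuffer(nbytes));
    ARROW_ASSIGN_OR_RAISE(int64_t n, Read(nbytes, buffer->mutable_data()));
    ARROW_RETURN_NOT_OK(buffer->Resize(n));
    return std::shared_ptr<arrow::Buffer>(std::move(buffer));
  }

  arrow::Status Close() override {
    closed_ = true;
    return arrow::Status::OK();
  }
  bool closed() const override { return closed_; }

 private:
  Azure::Storage::Blobs::BlobClient client_;
  int64_t size_;
  int64_t pos_ = 0;
  bool closed_ = false;
};
```

Note that a ranged `DownloadTo` is the same blob API whether or not the account has HNS enabled, which is what makes the single read code path described above viable.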
Describe the enhancement requested
Read support probably requires an Azure implementation of `arrow::io::RandomAccessFile`; that can then be used to implement the `OpenInputStream` and `OpenInputFile` methods of the `AzureFileSystem` (see the sketch below). #12914 implemented all of these features, so this will largely be a case of extracting the relevant parts from there. One modification I would suggest compared to that PR is to avoid branching logic based on whether the Azure storage account has the hierarchical namespace enabled. Utilising features of the hierarchical namespace can make renames and listing tasks faster, but for just reading blobs it shouldn't make any difference.
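A minimal sketch of that delegation, with `MakeAzureInputFile` as a hypothetical factory standing in for the real blob-backed implementation:

```cpp
#include <memory>
#include <string>

#include "arrow/io/interfaces.h"
#include "arrow/result.h"

// Hypothetical factory standing in for the blob-backed RandomAccessFile
// implementation; not a real Arrow API.
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> MakeAzureInputFile(
    const std::string& path);

// OpenInputFile can return the RandomAccessFile directly.
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> OpenInputFile(
    const std::string& path) {
  return MakeAzureInputFile(path);
}

// OpenInputStream can reuse the same object, since
// arrow::io::RandomAccessFile already derives from arrow::io::InputStream.
arrow::Result<std::shared_ptr<arrow::io::InputStream>> OpenInputStream(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, MakeAzureInputFile(path));
  return file;
}
```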
If we want to use features of the hierarchical namespace, that adds some complexity: `ServiceClient::GetAccountInfo()` requires Storage Account Contributor permissions (https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization), which is significantly elevated. Hadoop works around this by essentially calling `PathClient::GetAccessControlList()` and, if it raises an exception, concluding that the hierarchical namespace is not enabled: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356-L385
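A rough sketch of that Hadoop-style probe; the client parameter and the blanket exception handling are illustrative, and a real implementation would inspect the specific storage error code rather than treat every failure as "HNS disabled":

```cpp
#include <azure/storage/files/datalake.hpp>

// Probe for hierarchical-namespace support the way Hadoop's ABFS driver
// does: GetAccessControlList() only succeeds on HNS-enabled accounts.
bool DetectHierarchicalNamespace(
    Azure::Storage::Files::DataLake::DataLakePathClient& client) {
  try {
    client.GetAccessControlList();
    return true;
  } catch (const Azure::Storage::StorageException&) {
    // Assumption for this sketch: any failure means a flat-namespace
    // account. A robust check would look at the error code so that
    // transient failures are not mistaken for "HNS disabled".
    return false;
  }
}
```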
Component(s)
C++