
Azure concurrent writes #2069

Closed
davidrobbo opened this issue Jan 12, 2024 · 3 comments
Labels
question Further information is requested

Comments

@davidrobbo

Description

I'm looking for advice on concurrent writes against a table in Azure. I'm familiar with the requirement for a DynamoDB table to handle table locks on AWS, but after reading the docs (as someone not very familiar with Azure) I'm unsure whether concurrent writes against a Delta table in Blob Storage/ADLS Gen2 could lead to data loss, or whether Azure storage provides 'put-if-absent' guarantees by default that protect against data loss from multi-app/cluster writes.

Use Case

I would like to use write_deltalake to write to a table on Azure (Blob Storage and/or ADLS Gen2) from a multi-replica Python app, and I want to guarantee no data loss when creating a transaction in the _delta_log directory. The failure I want to rule out is two replicas creating a transaction file with the same name under _delta_log at the same moment, so that one replica overwrites the other's entry.

Related Issue(s)

None.

@davidrobbo davidrobbo added the enhancement New feature or request label Jan 12, 2024
@roeap
Collaborator

roeap commented Jan 13, 2024

Good news: Azure storage supports concurrent writes out of the box.

@ion-elgreco ion-elgreco added question Further information is requested and removed enhancement New feature or request labels Jan 13, 2024
@inigohidalgo
Contributor

Apologies for necro.
@roeap could you give some info on how this is achieved? Is it based on conditional headers for read/write in blob?

I saw that the Azure implementation of the LogStore is basically the default one, whereas AWS has a specific LogStore implementation based on DynamoDB locking.

Thanks!

@inigohidalgo
Contributor

I think I have answered my own question by trawling through various issues and PRs.

Yes, the difference is that the Azure object store implements a simple copy_if_not_exists:

https://github.com/apache/arrow-rs/blob/c2b05cdbcb37f46f170c7b5073ee6bc2178fdace/object_store/src/azure/mod.rs#L134-L137

Whereas the AWS implementation depends on the locking client

https://github.com/apache/arrow-rs/blob/c2b05cdbcb37f46f170c7b5073ee6bc2178fdace/object_store/src/aws/mod.rs#L292-L318
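For anyone else landing here, this is roughly why a put-if-absent primitive is enough to avoid lost commits. The snippet below is only a local-filesystem sketch, not the delta-rs or object_store code: `try_commit` and `commit_with_retry` are hypothetical names, and `O_CREAT | O_EXCL` stands in for the store's "create only if the key does not exist" operation. It also skips the conflict checks a real Delta writer would run before retrying.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def try_commit(log_dir: str, version: int, payload: bytes) -> bool:
    """Try to create the commit file for `version`; False if another writer won."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists, which is
        # the local stand-in for a put-if-absent call on an object store.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True


def commit_with_retry(log_dir: str, payload: bytes, max_attempts: int = 50) -> int:
    """Claim the next free version, re-reading the log whenever a race is lost."""
    for _ in range(max_attempts):
        versions = [
            int(name[:-5]) for name in os.listdir(log_dir) if name.endswith(".json")
        ]
        next_version = max(versions, default=-1) + 1
        if try_commit(log_dir, next_version, payload):
            return next_version
        # Someone else wrote this version first: nothing was overwritten,
        # so list the log again and try the following version.
    raise RuntimeError("could not commit after repeated conflicts")


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(
            pool.map(lambda i: commit_with_retry(log_dir, f"writer-{i}".encode()), range(8))
        )
    print(sorted(results))
```

Each writer ends up with its own version number and no commit file is ever overwritten; a writer that loses the race just retries with the next version. Without an atomic create-if-absent, two writers could both write the same version file and one entry would silently disappear, which is the case the locking client covers on S3.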

ion-elgreco pushed a commit that referenced this issue Jun 1, 2024
- It was unclear to me that concurrent writing was available by default for non-S3 backends, so I am making the language clearer.
- I have also added an extra section showing that R2 and maybe MinIO can enable concurrent writing
- Fixed a couple of unrelated formatting issues in the page I edited

closes #2556 

#2069 also had the same confusion