
File not transferred from Azure to AWS when specifying a keyName for S3 not matching the part name on Azure side #238

Closed
ank19 opened this issue Feb 21, 2024 · 7 comments
Assignees
Labels
bug_report (Suspected bugs, awaiting triage) · triage (all new issues awaiting classification)

Comments

@ank19

ank19 commented Feb 21, 2024

Bug Report

Describe the Bug

Expected Behavior

A consumer that wants to transfer a file to its own AWS S3 bucket from a provider using Azure Data Lake as a source should be able to specify a "keyName" in the S3 data transfer destination that does not have to exactly match the object name from the asset specification used on the provider side within the Azure Data Lake.

Observed Behavior

Assuming the provider created an asset on Azure Data Lake in a directory "Outbound/4MB" and the consumer specifies keyName="Outbound/4MB", then we're able to transfer the file successfully.
If the consumer however changes the keyName to something that does not exactly match, e.g. by removing the directory name or changing the file name, then the transfer process does not succeed (but is marked as COMPLETED).

Steps to Reproduce

Steps to reproduce the behavior:

  1. Create an asset (and the policy, ...) on provider side, using Azure Data Lake as a data source for the asset, e.g.

    "dataAddress": {
        "@type": "DataAddress",
        "type": "AzureStorage",
        "container": "...",
        "account": "...",
        "blobName": "Outbound/4MB.bin"
    }

  2. On consumer side, initiate the negotiation, ..., and start to transfer the file to S3 using a "keyName" NOT matching the "blobName" --> file not transferred
  3. Do the same, but this time use a "keyName" exactly matching the "blobName" --> file transferred successfully
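
For step 2, the consumer-side destination address might look like the following. This is a hypothetical example mirroring the provider-side snippet above: "AmazonS3" is the type used by the EDC AWS data plane extension, the region and bucket values are placeholders, and the keyName here deliberately differs from the provider's "blobName".

```json
"dataDestination": {
    "@type": "DataAddress",
    "type": "AmazonS3",
    "region": "...",
    "bucketName": "...",
    "keyName": "4MB.bin"
}
```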

Context Information

  • EDC 0.6.0: Behaviour as described above (file is not transferred at all, but the transfer is marked as COMPLETED)
  • EDC 0.6.0-rc2: Behaviour as described above, but on provider side an S3 exception "Invalid Upload-ID" is indicated.

Detailed Description

Further notes:

  • Our business partner used an AWS Access Key ID and Secret Access Key for S3
  • On provider side, we're using Azure Default Identity authentication

Possible Implementation

I had a look at the implementation, but I haven't found a hint yet. The S3DataSink does not seem to use the keyName at all (or only for logging purposes before @yurimssilva's fix for the exception handling). The S3DataSinkFactory only seems to use the keyName for accessing the vault, which is not relevant in our case.


Thanks for your contribution 🔥 We will take a look asap 🚀

@ndr-brt added the bug_report and triage labels Feb 27, 2024
@yurimssilva yurimssilva self-assigned this Feb 27, 2024
@hemantxpatel

@ank19 As per the current implementation, S3DataSink always uses the source file name (keyName in the source data address). It ignores the keyName in the destination.
So when the transfer has state COMPLETED, you may find a file in the destination bucket with the same name as the source.

This is the buggy code which doesn't take destination file name keyName into consideration.

String getDestinationObjectName(String partName) {
    if (!StringUtils.isNullOrEmpty(folderName)) {
        return folderName.endsWith("/") ? folderName + partName : folderName + "/" + partName;
    }
    return partName;
}

Azure Transfer works fine as it takes the destination file name blobName into consideration.

https://github.com/eclipse-edc/Technology-Azure/blob/7bb9563855f272448674cfa5ca5030e3fb1b0451/extensions/data-plane/data-plane-azure-storage/src/main/java/org/eclipse/edc/connector/dataplane/azure/storage/pipeline/AzureStorageDataSink.java#L114-L121

@yurimssilva
Contributor

> @ank19 As per the current implementation, S3DataSink always uses the source file name (keyName in the source data address). It ignores the keyName in the destination. So when the transfer has state COMPLETED, you may find a file in the destination bucket with the same name as the source.
>
> This is the buggy code which doesn't take the destination file name keyName into consideration.
>
>     String getDestinationObjectName(String partName) {
>         if (!StringUtils.isNullOrEmpty(folderName)) {
>             return folderName.endsWith("/") ? folderName + partName : folderName + "/" + partName;
>         }
>         return partName;
>     }
>
> Azure Transfer works fine as it takes the destination file name blobName into consideration.
>
> https://github.com/eclipse-edc/Technology-Azure/blob/7bb9563855f272448674cfa5ca5030e3fb1b0451/extensions/data-plane/data-plane-azure-storage/src/main/java/org/eclipse/edc/connector/dataplane/azure/storage/pipeline/AzureStorageDataSink.java#L114-L121

While I haven't delved into the issue yet, there are several aspects to consider:

The S3DataSink hinges on the use of DataSource.Part.name to establish its object key, enabling multiple object transfers. Given the unknown number of objects (Part list) during the transfer request, it becomes impractical for the Consumer to assign names to them. To address this, we introduced the folderName property as a precautionary measure in the destination to prevent collisions. It's important to note that this addition alone isn't the root cause of the issue.

@yurimssilva
Contributor

yurimssilva commented Feb 29, 2024

@ank19 Upon replicating the scenario in a local environment, the transfer process exhibited no issues when both Provider and Consumer were running on tractusx:0.6.0 (EDC and Technology 0.5.1). However, on tractusx:0.6.0-rc2 (EDC and Technology 0.4.1), the observed "Invalid Upload-ID" error surfaced. This suggests that the issue was resolved between Technology versions 0.4.1 and 0.5.1.

The keyName usage is a topic that requires discussion, as I said above:

> The S3DataSink hinges on the use of DataSource.Part.name to establish its object key, enabling multiple object transfers. Given the unknown number of objects (Part list) during the transfer request, it becomes impractical for the Consumer to assign names to them. To address this, we introduced the folderName property as a precautionary measure in the destination to prevent collisions. ...

@ank19
Author

ank19 commented Feb 29, 2024

@yurimssilva - Thanks for the update. I also think the "Invalid Upload-ID" issue from 0.6.0-rc2 is resolved. I guess that was an unlucky coincidence together with the differing behavior of the Azure and AWS data planes.
It might indeed make sense to discuss whether to align both planes or not.

String getDestinationBlobName(String partName) {
    var name = !StringUtils.isNullOrEmpty(blobName) && StringUtils.isNullOrBlank(blobPrefix) ? blobName : partName;
    if (!StringUtils.isNullOrEmpty(folderName)) {
        return folderName.endsWith("/") ? folderName + name : folderName + "/" + name;
    } else {
        return name;
    }
}

https://github.com/eclipse-edc/Technology-Azure/blob/e35acf853d77e50e9826177a62984831aa7e74ca/extensions/data-plane/data-plane-azure-storage/src/main/java/org/eclipse/edc/connector/dataplane/azure/storage/pipeline/AzureStorageDataSink.java#L58C4-L65C6
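
A hypothetical sketch of how the S3 sink could be aligned with the Azure method above, assuming the destination keyName is honored only for single-part transfers so that multi-part transfers keep per-part names and avoid collisions. This is not the actual EDC implementation; the class wrapper, the `partCount` parameter, and the helper are illustrative.

```java
// Hypothetical sketch (not the actual EDC code): an S3-side analogue of
// AzureStorageDataSink#getDestinationBlobName. keyName and folderName would
// come from the destination DataAddress; partCount is assumed to be known.
public class S3DestinationNameSketch {

    static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }

    // Honor the destination keyName only when there is a single part; with
    // multiple parts, fall back to each part's own name to avoid overwrites.
    static String getDestinationObjectName(String partName, String folderName,
                                           String keyName, int partCount) {
        var name = (!isBlank(keyName) && partCount == 1) ? keyName : partName;
        if (!isBlank(folderName)) {
            return folderName.endsWith("/") ? folderName + name : folderName + "/" + name;
        }
        return name;
    }

    public static void main(String[] args) {
        // Single part: the consumer-chosen keyName wins.
        System.out.println(getDestinationObjectName("Outbound/4MB.bin", null, "4MB.bin", 1));
        // Multiple parts: keep part names, prefixed by folderName.
        System.out.println(getDestinationObjectName("part-2.bin", "Outbound", "4MB.bin", 3));
    }
}
```

The single-part condition mirrors the blobName/blobPrefix guard in the Azure method; whether the part count is available at that point in the S3 sink would need verification.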

@rafaelmag110
Contributor

Hi @ank19, +1 on pushing the discussion. Would you like to follow up with that?

Regarding the present issue, do you believe it is still beneficial to keep it open or can it be closed?

@ank19
Author

ank19 commented Mar 4, 2024

Hi @rafaelmag110 - I'd suggest that you start that discussion from your side, mainly because we are not using AWS on our side. I'm closing this issue, but I think it would be beneficial to align the behavior of both data planes to avoid future misunderstandings.

@ank19 ank19 closed this as completed Mar 4, 2024