nxf_s3_download fails to identify directories properly #4757

Closed
Chris-Cherry opened this issue Feb 20, 2024 · 2 comments · Fixed by #5069

Comments

@Chris-Cherry

Bug report

When running Nextflow with AWS Batch, nxf_s3_download() fails to identify an S3 directory as a directory and attempts to download it as a file (i.e. cp dir instead of cp dir/*), causing a 404 error.

Expected behavior and actual behavior

Expected behavior: The directory is staged properly to the ECS container running the process.
Actual behavior: The process is killed due to a 404 error.

Steps to reproduce the problem

main.nf:

process ls {
  input:
  path dir

  output:
  stdout

  script:
  """
  ls
  """
}

workflow {
  // S3 directory
  dir = channel.fromPath('s3://ryft-public-sample-data/esRedditJson')
  ls(dir) | view { it }
}

nextflow.config:

workDir = 's3://bucket/'

plugins {
  id 'nf-amazon'
}

process {
  container = 'public.ecr.aws/l9m5o0x9/cellranger:7.2.0'
  executor = 'awsbatch'
  queue = 'sample'
}

aws {
  region = 'us-east-1'
  batch {
    cliPath = '/usr/bin/aws'
    platformType = 'fargate'
    executionRole = 'arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole'
    jobRole = 'arn:aws:iam::xxxxxxx:role/testAdminAccessEcs'
  }
}

Program output

nextflow.log

Environment

  • Nextflow version: 24.01.0-edge.5903 (I need edge to run batch with fargate)
  • Java version: java-17-amazon-corretto
  • Operating system: Linux (Amazon Linux 2023)
  • Bash version: 5.2.15
  • S5cmd version: 2.2.2

Additional context

Following through .command.run, I was able to trace the relevant error to:

nxf_s3_download() {
    local source=$1
    local target=$2
    local file_name=$(basename $1)
    local is_dir=$(s5cmd ls $source | grep -F "DIR ${file_name}/" -c)
    if [[ $is_dir == 1 ]]; then
        s5cmd cp "$source/*" "$target"
    else
        s5cmd cp "$source" "$target"
    fi
}

In particular, s5cmd ls ${source} returns "DIR  esRedditJson/" (two spaces after DIR) rather than the "DIR esRedditJson/" (one space) that the grep pattern seems to anticipate. As a result, grep -c reports 0 and the directory is copied as if it were a single file.

A minimal snippet to reproduce the error. The first check uses a single space and misses the directory; changing it to a double space, as in the second check, appears to fix the issue.

source='s3://ryft-public-sample-data/esRedditJson'
target=localdir
file_name=$(basename "$source")

# Single space after DIR (the pattern nxf_s3_download uses): no match,
# so the directory is copied as if it were a file
is_dir=$(s5cmd ls "$source" | grep -F "DIR ${file_name}/" -c)
if [[ $is_dir == 1 ]]; then
    s5cmd cp "$source/*" "$target"
else
    s5cmd cp "$source" "$target"
fi

# Double space after DIR: the pattern matches and the contents are copied
is_dir=$(s5cmd ls "$source" | grep -F "DIR  ${file_name}/" -c)
if [[ $is_dir == 1 ]]; then
    s5cmd cp "$source/*" "$target"
else
    s5cmd cp "$source" "$target"
fi
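
Since the exact spacing seems to differ between s5cmd versions, a check that ignores the amount of whitespace may be more robust than a fixed-string grep. The following is only a sketch, not the fix that was merged into Nextflow: is_s3_dir_listing is a hypothetical helper name, it reads listing output from stdin, and it assumes the directory name contains no spaces.

```shell
# Hypothetical helper (not part of Nextflow): reads `s5cmd ls`-style output on
# stdin and succeeds if some line has "DIR" as its first field and "<name>/"
# as its second, regardless of how many spaces separate them.
# Assumption: the directory name contains no whitespace.
is_s3_dir_listing() {
    local name=$1
    awk -v want="${name}/" '
        $1 == "DIR" && $2 == want { found = 1 }
        END { exit !found }
    '
}

# Both spacings are accepted:
printf 'DIR esRedditJson/\n'  | is_s3_dir_listing esRedditJson && echo "single space: dir"
printf 'DIR  esRedditJson/\n' | is_s3_dir_listing esRedditJson && echo "double space: dir"
```

In the download function this could replace the grep test, e.g. if s5cmd ls "$source" | is_s3_dir_listing "$file_name"; then ... fi, under the same assumptions.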
@tzuni

tzuni commented Jun 14, 2024

Just ran into/discovered the same issue.

@pditommaso
Member

Interesting, it looks like this changed in s5cmd 2.2.x
