[FLINK-34569][e2e] fail fast if AWS cli container fails to start (#24491)

* [FLINK-34569][e2e] Fail fast if aws cli container fails to run

Why:
An end-to-end test run failed, and the test logs showed that the AWS cli
container had failed to start. Because the container was started inside
'export AWSCLI_CONTAINER_ID=$(...)', the failure in the subshell did not
fail the script and AWSCLI_CONTAINER_ID was left empty. This led to a loop
trying to docker exec a command in a container named "" and the test taking
15 minutes to time out. This change speeds up the failure.

Note that we use 'return' to prevent an immediate failure of the script so
that we have the potential to implement a simple retry.
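
The masking described above can be reproduced without Docker. The sketch
below is an illustration only: failing_start is a hypothetical stand-in for
the 'docker run' call, and the variable names are invented for the example.
It shows that 'export VAR=$(...)' reports the exit status of export itself,
while a plain assignment reports the exit status of the command substitution.

```shell
# Hypothetical stand-in for a `docker run` that fails and prints nothing.
failing_start() {
  return 1
}

# Old pattern: `export` itself succeeds, so the failure inside $(...) is hidden.
export MASKED_ID=$(failing_start)
masked_status=$?

# New pattern: a plain assignment takes the exit status of the substitution.
# The `|| ...` form keeps the example safe to run under `set -e`.
plain_status=0
plain_id=$(failing_start) || plain_status=$?

echo "export + assignment: status=$masked_status id='$MASKED_ID'"
echo "plain assignment:    status=$plain_status id='$plain_id'"
```

With the status visible, the function can 'return 1' and leave the decision
to retry or abort to its caller.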

Signed-off-by: Robert Young <[email protected]>

* [FLINK-34569][e2e] Add naive retry when creating aws cli container

Why:
An end-to-end test run failed with what looked like a transient network
exception when pulling the aws cli image. This retries once.
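
The retry has the shape 'command || command'. A minimal sketch of that
shape, assuming a hypothetical flaky_start function that stands in for
aws_cli_start and fails only on its first call:

```shell
attempts=0

# Hypothetical stand-in for aws_cli_start: fails on the first call only.
flaky_start() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]
}

# Run once; if that fails, run once more. If the second attempt also
# fails, its exit status is recorded in retry_status.
retry_status=0
flaky_start || flaky_start || retry_status=$?

echo "attempts=$attempts status=$retry_status"
```

The second call runs only when the first one returns non-zero, so a
successful first attempt is not repeated.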

Signed-off-by: Robert Young <[email protected]>

* [FLINK-34569][e2e] Remove jq containers after use

Why:
A large pile of exited jq containers was left in docker after
an operation was retried repeatedly.

Signed-off-by: Robert Young <[email protected]>

* [FLINK-34569][e2e] Clean up after failed awscli container run

Why:
If the command returns a non-zero exit code but has still created a
container, this removes that container so we don't leave an orphan
behind.
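
The cleanup relies on the command substitution still capturing output when
the command fails. A small sketch of that behaviour, where fake_run and
fake_rm are hypothetical stand-ins for 'docker run' and 'docker rm', and the
ID and exit code are made up for the example:

```shell
removed=""

# Hypothetical stand-in for `docker run -d`: prints an ID but still fails.
fake_run() {
  echo "abc123"
  return 125
}

# Hypothetical stand-in for `docker rm`: records which ID it was given.
fake_rm() {
  removed="$1"
}

start_status=0
container_id=$(fake_run) || start_status=$?

# The output is captured even though the command failed, so it can be cleaned up.
if [ "$start_status" -ne 0 ] && [ -n "$container_id" ]; then
  fake_rm "$container_id"
fi

echo "status=$start_status removed='$removed'"
```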

Signed-off-by: Robert Young <[email protected]>

---------

Signed-off-by: Robert Young <[email protected]>
robobario authored and hlteoh37 committed Jun 6, 2024
1 parent 9a69067 commit 5599444
Showing 1 changed file with 18 additions and 3 deletions.
21 changes: 18 additions & 3 deletions flink-end-to-end-tests/test-scripts/common_s3_operations.sh
@@ -29,12 +29,23 @@
 # AWSCLI_CONTAINER_ID
 ###################################
 function aws_cli_start() {
-  export AWSCLI_CONTAINER_ID=$(docker run -d \
+  local CONTAINER_ID
+  CONTAINER_ID=$(docker run -d \
     --network host \
     --mount type=bind,source="$TEST_INFRA_DIR",target=/hostdir \
     -e AWS_REGION -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
     --entrypoint python \
     -it banst/awscli)
+  if [ $? -ne 0 ]; then
+    echo "running aws cli container failed"
+    if [ -n "$CONTAINER_ID" ]
+    then
+      docker kill "$CONTAINER_ID"
+      docker rm "$CONTAINER_ID"
+    fi
+    return 1
+  fi
+  export AWSCLI_CONTAINER_ID="$CONTAINER_ID"

   while [[ "$(docker inspect -f {{.State.Running}} "$AWSCLI_CONTAINER_ID")" -ne "true" ]]; do
     sleep 0.1
@@ -58,7 +69,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
   aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start
+if [ $? -ne 0 ]; then
+  echo "running the aws cli container failed"
+  exit 1
+fi

 ###################################
 # Runs an aws command on the previously started container.
@@ -135,7 +150,7 @@ function s3_get_number_of_lines_by_prefix() {

   # find all files that have the given prefix
   parts=$(aws_cli s3api list-objects --bucket "$IT_CASE_S3_BUCKET" --prefix "$1" |
-    docker run -i stedolan/jq -r '[.Contents[].Key] | join(" ")')
+    docker run -i --rm stedolan/jq -r '[.Contents[].Key] | join(" ")')

   # in parallel (N tasks), query the number of lines, store result in a file named lines
   N=10