
INCIDENT: Prow currently down #2460

Closed
Gregory-Pereira opened this issue Sep 3, 2022 · 12 comments · Fixed by #2514

@Gregory-Pereira
Member

All prow jobs are currently failing with a 503 (Service Unavailable) error, which shows up in the logs as a hex dump (see the attached log below).
prow.log

It just spits out a fat hex dump... More info to come.
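One quick check (just a sketch; I'm assuming the bucket is exposed under this path on the smaug object-storage route) is to see what status code the endpoint itself returns:

# Print only the HTTP status code for a request against the ci-prow bucket.
curl -sS -o /dev/null -w '%{http_code}\n' \
  https://s3-openshift-storage.apps.smaug.na.operate-first.cloud/ci-prow/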

/assign
/cc @harshad16

@harshad16
Member

This is happening because of the switch of the OBC to smaug;
there is now a delay in responses.

time aws s3 --endpoint https://s3-openshift-storage.apps.smaug.na.operate-first.cloud --profile smaug-prow ls s3://ci-prow/

An error occurred (504) when calling the ListObjectsV2 operation (reached max retries: 4): Gateway Timeout
aws s3 --endpoint  --profile smaug-prow ls s3://ci-prow/  0.42s user 0.04s system 0% cpu 2:40.74 total

This would cause issues in PR checks and in finding logs.

@Gregory-Pereira
Member Author

Gregory-Pereira commented Sep 6, 2022

Another instance of this happening despite the OBC being moved back to smaug: https://prow.operate-first.cloud/view/s3/ci-prow/prow-logs/pr-logs/pull/operate-first_apps/2385/kubeval-validation/1567159471027785728

Corresponding PR: #2385

@harshad16
Member

harshad16 commented Sep 12, 2022

The Prow jobs' init containers are having a hard time uploading data to storage.
init-upload is an init-container step in Prow jobs; it times out before it can upload the data needed to start the CI test that the job has to run.
Screenshot from 2022-09-12 13-22-43

From the logs, it is observed that the data is queued for upload, but the container times out due to the delay in responses.

{"component":"initupload","dest":"pr-logs/pull/operate-first_apps/2399/kubeval-validation/1569375544263315456/clone-records.json","file":"prow/pod-utils/gcs/upload.go:86","func":"k8s.io/test-infra/prow/pod-utils/gcs.upload","level":"info","msg":"Queued for upload","severity":"info","time":"2022-09-12T17:21:16Z"}
{"component":"initupload","dest":"pr-logs/pull/operate-first_apps/2399/kubeval-validation/1569375544263315456/started.json","file":"prow/pod-utils/gcs/upload.go:86","func":"k8s.io/test-infra/prow/pod-utils/gcs.upload","level":"info","msg":"Queued for upload","severity":"info","time":"2022-09-12T17:21:16Z"}
{"component":"initupload","dest":"pr-logs/pull/operate-first_apps/2399/kubeval-validation/1569375544263315456/clone-log.txt","file":"prow/pod-utils/gcs/upload.go:86","func":"k8s.io/test-infra/prow/pod-utils/gcs.upload","level":"info","msg":"Queued for upload","severity":"info","time":"2022-09-12T17:21:16Z"}
{"component":"initupload","dest":"pr-logs/pull/operate-first_apps/2399/kubeval-validation/1569375544263315456/started.json","file":"prow/pod-utils/gcs/upload.go:114","func":"k8s.io/test-infra/prow/pod-utils/gcs.upload.func1","level":"info","msg":"Finished upload","severity":"info","time":"2022-09-12T17:21:16Z"}
{"component":"initupload","dest":"pr-logs/pull/operate-first_apps/2399/kubeval-validation/1569375544263315456/clone-log.txt","file":"prow/pod-utils/gcs/upload.go:114","func":"k8s.io/test-infra/prow/pod-utils/gcs.upload.func1","level":"info","msg":"Finished upload","severity":"info","time":"2022-09-12T17:25:20Z"}

Only 2 of the 3 queued files finish uploading.

Screenshot from 2022-09-12 13-52-17

464e54bb-32bf-11ed-a4d6-0a580a80025f-initupload.log
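To reproduce what initupload is hitting, a rough sketch is to time a small upload with the aws CLI, the same way we timed the listing above. I'm assuming the same credentials the jobs use; $S3_ENDPOINT stands in for whichever endpoint the s3-credentials secret currently points at, and the destination key is arbitrary:

# Create a small test object and time how long the upload itself takes.
head -c 1024 /dev/urandom > /tmp/upload-test
time aws s3 --endpoint "$S3_ENDPOINT" --profile smaug-prow \
  cp /tmp/upload-test s3://ci-prow/debug/upload-test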

@Gregory-Pereira
Member Author

Gregory-Pereira commented Sep 13, 2022

Yesterday I tested connectivity to the bucket and found no issues. I tested listing, uploading and downloading:

$ time aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow ls s3://prow
aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow l  2.16s user 0.89s system 38% cpu 7.872 total
$ time aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow cp ./testing.yaml s3://prow/
upload: ./testing.yaml to s3://prow/testing.yaml
aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow c  0.68s user 0.18s system 14% cpu 5.823 total
$ time aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow cp  s3://prow/testing.yaml ./
download: s3://prow/testing.yaml to ./testing.yaml
aws s3 --endpoint http://s3-eu-central-1.ionoscloud.com/ --profile aws-prow c  0.75s user 0.15s system 13% cpu 6.684 total

@durandom
Member

Is everything in Prow reconfigured to the new IONOS storage? Maybe some controllers need to be restarted to pick up the new config?
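If the config did change, a restart along these lines should force the components to re-read it (a sketch; the deployment names are my assumption of the standard Prow ones and may differ in opf-ci-prow):

# Restart the Prow control-plane deployments and wait for one to roll out.
kubectl -n opf-ci-prow rollout restart \
  deployment/deck deployment/hook deployment/crier deployment/plank
kubectl -n opf-ci-prow rollout status deployment/deck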

@harshad16
Member

Each job attaches to the credentials via secrets.
These secrets seem to be correct: https://console-openshift-console.apps.smaug.na.operate-first.cloud/k8s/ns/opf-ci-prow/secrets/s3-credentials
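For the record, one way to double-check what the jobs actually receive from that secret (the key name inside it is an assumption, so list the keys first to confirm):

# List the keys stored in the secret, then decode one of them.
kubectl -n opf-ci-prow get secret s3-credentials \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
kubectl -n opf-ci-prow get secret s3-credentials \
  -o jsonpath='{.data.service-account\.json}' | base64 -d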

@Gregory-Pereira
Member Author

I read through the docs on IONOS storage, but I couldn't find information on configuring two of the five properties available in that secret, namely whether to set "s3_force_path_style" and whether the connection should be secure or insecure. Maybe that could be related?
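For comparison, this is roughly the shape I'd expect the credentials file to have, based on the upstream Prow pod-utilities examples; field names below follow those examples and the values are placeholders, not what is actually deployed:

cat <<'EOF' > service-account.json
{
  "region": "eu-central-1",
  "endpoint": "s3-eu-central-1.ionoscloud.com",
  "insecure": false,
  "s3_force_path_style": true,
  "access_key": "<ACCESS_KEY>",
  "secret_key": "<SECRET_KEY>"
}
EOF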

@codificat
Member

We surely want a secure connection.

A question I have (disclaimer: I don't know/have context about ionos usage, so this might be off-topic): the endpoint is eu-central-1 (i.e. in Europe), while the cluster is in North America (right?). Can we not use a closer location?

It probably does not cause fatal issues, but it might be better to use a "local" endpoint.
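As a rough data point, timing a bare request against the IONOS endpoint from wherever the jobs run would show the round-trip cost of the distance (a sketch; it only measures request latency, not upload throughput):

time curl -sS -o /dev/null -w '%{http_code}\n' https://s3-eu-central-1.ionoscloud.com/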

@Gregory-Pereira
Member Author

Well, we currently have storage issues on smaug (the slow OBC connection to opf-ci-prow times out, producing frequent flakes) and on infra, as we are still trying to hook it up with storage from the NESE folks. It is a bucket, so I want to try to make it work first rather than provision a whole new one, redo all the storage routing changes, and go through this whole debugging jig again.

@durandom
Member

Unfortunately IONOS doesn't have a US S3 location

@codificat
Member

Prow jobs seem to be running successfully at the moment:

Success rate over time: 3h0s: 94%, 12h0s: 95%, 48h0s: 95%

@durandom
Member

@Gregory-Pereira can we close this incident?

@durandom durandom transferred this issue from operate-first/apps Sep 21, 2022
@durandom durandom transferred this issue from operate-first/support Sep 21, 2022
@HumairAK HumairAK moved this from Done to In Progress in Operations Project Board Sep 23, 2022
Repository owner moved this from In Progress to Done in Operations Project Board Oct 4, 2022