[Testing] Use google/cloud-sdk:279.0.0 to resolve workload identity flakiness #3019

Merged
k8s-ci-robot merged 15 commits into kubeflow:master from test_use_1.15.8
Feb 12, 2020

@Bobgy Bobgy commented Feb 8, 2020

Same rationale as #3018,
but this PR tests whether the new version is stable.

Discussion about which versions are good: b/146669263
It also mentions that the client version is related, so I will try upgrading the google/cloud-sdk version soon.

UPDATE 2020.2.10: The client version matters most. After updating to google/cloud-sdk:279.0.0 (the latest version), the flakiness disappears.
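
Since the fix depends on a specific client image version, a small check that an image reference is pinned to that tag can help catch unpinned references. This is a minimal sketch, not code from this PR: the helper names `split_image_ref` and `is_pinned_to` are hypothetical, and the parsing only handles plain `repo:tag` references (it does not handle `@sha256:` digests).

```python
def split_image_ref(ref):
    """Split 'repo:tag' into (repo, tag); tag is None if the ref is unpinned.

    Hypothetical helper for illustration. Digest references
    (repo@sha256:...) are out of scope for this sketch.
    """
    # Only a colon after the last '/' separates the tag; an earlier colon
    # belongs to a registry host:port such as localhost:5000.
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, None


def is_pinned_to(ref, expected_tag):
    """Return True only if the reference carries exactly the expected tag."""
    _, tag = split_image_ref(ref)
    return tag == expected_tag


if __name__ == "__main__":
    print(is_pinned_to("google/cloud-sdk:279.0.0", "279.0.0"))
    print(is_pinned_to("google/cloud-sdk", "279.0.0"))
```

An unpinned reference such as `google/cloud-sdk` resolves to whatever the default tag points at, which is one way a test environment can silently pick up a different client version.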

Requirements to fix GKE workload identity intermittent timeouts:


Bobgy commented Feb 8, 2020

GKE 1.15.8 alone doesn't fix the issue; I will also need to try a newer google/cloud-sdk version.

Bobgy commented Feb 8, 2020

As mentioned in b/148920399, we also need to update the clients to the latest version.

Bobgy commented Feb 8, 2020

/retest
It failed in Cloud Build, which is not our problem.

Bobgy commented Feb 8, 2020

/hold
This doesn't look good; the error messages look a little weird.

Bobgy commented Feb 10, 2020

GKE 1.15.8 doesn't seem compatible with KFP; it consistently triggers some integration test failures.
However, upgrading the gcloud version seems to be the best fix. Everything seems stable, except that the secret sample is failing because pip is not found. After investigating, it seems Google has stopped supporting Python 2, and the google/cloud-sdk image no longer contains pip. I updated the sample to use python3 instead. Let's see if that makes the tests pass.
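
A bare `pip` executable can disappear from an image even when an interpreter is still present, so one defensive pattern is to invoke pip through a specific interpreter and to probe for it first. This is a minimal sketch of that pattern only: the function names are hypothetical, the actual command used in the secret sample is not shown in this thread, and whether the image's python3 ships the pip module is something the probe checks rather than assumes.

```python
import subprocess
import sys


def pip_install_command(package, python="python3"):
    """Build a pip invocation that goes through the chosen interpreter.

    Hypothetical helper: using `<python> -m pip` avoids depending on a
    standalone `pip` executable being on PATH.
    """
    return [python, "-m", "pip", "install", package]


def pip_available(python=sys.executable):
    """Return True if `<python> -m pip --version` exits successfully."""
    try:
        result = subprocess.run(
            [python, "-m", "pip", "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself is missing or not executable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print(pip_install_command("six"))
    print(pip_available())
```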

Bobgy commented Feb 10, 2020

xgboost timed out; we should increase its time limit.

Bobgy commented Feb 10, 2020

/test kubeflow-pipeline-e2e-test
/test kubeflow-pipeline-sample-test
All tests are passing now. Retrying for the first time.

Bobgy commented Feb 10, 2020

/test kubeflow-pipeline-e2e-test
/test kubeflow-pipeline-sample-test
Great, passed again. Retrying a second time.

@Bobgy Bobgy changed the title [WIP] [Testing] Use gke 1.15.8 to mitigate workload identity flakiness [Testing] Use google/cloud-sdk:279.0.0 to mitigate workload identity flakiness Feb 10, 2020
@Bobgy Bobgy changed the title [Testing] Use google/cloud-sdk:279.0.0 to mitigate workload identity flakiness [Testing] Use google/cloud-sdk:279.0.0 to resolve workload identity flakiness Feb 10, 2020

Bobgy commented Feb 10, 2020

Cleaned up unrelated changes

@numerology

Thanks @Bobgy !
Let me give it a third shot :)

@numerology

/test kubeflow-pipeline-e2e-test
/test kubeflow-pipeline-sample-test

@numerology

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: numerology

@numerology

I took a look, and it seems Travis is complaining about the version of six. Perhaps we need to pin it in .travis.yml to get it fixed.

Specifically, the py35 test passed because it got six==1.12.0, while the others failed with six==1.11.0.
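
The observation above amounts to a minimum-version check. A minimal sketch of such a check follows, assuming 1.12.0 is the lowest working version; that threshold is inferred only from the two data points in this comment, not confirmed elsewhere in the thread. The helper names are hypothetical, and the parsing only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version like '1.12.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum.

    Tuples compare element by element, so (1, 12, 0) > (1, 11, 0) even
    though the string '1.12.0' sorts before '1.2.0' lexically.
    """
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # 1.12.0 as the minimum is an assumption drawn from the comment above.
    for version in ("1.11.0", "1.12.0"):
        print(version, meets_minimum(version, "1.12.0"))
```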

@k8s-ci-robot

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 11, 2020

Bobgy commented Feb 11, 2020

@numerology Thank you! Let me try that

@numerology

FYI, the fix is pending in #3035.

Bobgy commented Feb 11, 2020

@numerology I split the Travis test failure fix into a separate PR: #3039

UPDATE: Oh, thanks, I didn't know there was an existing PR to fix it. I will close mine.

Bobgy commented Feb 12, 2020

/hold cancel
I'll self-approve because this seems stable now.

@Bobgy Bobgy added the lgtm label Feb 12, 2020
@k8s-ci-robot k8s-ci-robot merged commit 02fabd3 into kubeflow:master Feb 12, 2020
@Bobgy Bobgy deleted the test_use_1.15.8 branch February 14, 2020 02:21
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this pull request Dec 9, 2020
…lakiness (kubeflow#3019)

* [Testing] Use gke 1.15.8 to mitigate workload identity flakiness

* Upgrade gcloud version

* Update image builder image too

* Turn on workload identity

* Update deploy-cluster.sh

* secret sample uses python3 instead

* Increase xgboost time limit

* Revert files with bad format

* Update component and pipelines to use gcloud 279.0.0

* Fix secret sample using python3

* Upgrade frontend integration test image

* Rebuild frontend integration test image

3 participants