Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Fix flaky IT test cases #8260

Merged
merged 24 commits into from
Sep 12, 2022
Merged

chore: Fix flaky IT test cases #8260

merged 24 commits into from
Sep 12, 2022

Conversation

lqiu96
Copy link
Contributor

@lqiu96 lqiu96 commented Aug 31, 2022

This PR should resolve a bunch of the flaky test cases in the monorepo.

Status:
image

Quota Metrics after VM cleanup (7 Day Graph):
image

@lqiu96 lqiu96 requested review from suztomo and meltsufin September 2, 2022 19:23
@lqiu96 lqiu96 marked this pull request as ready for review September 2, 2022 19:23
@lqiu96 lqiu96 requested a review from ddixit14 September 2, 2022 19:35
.kokoro/build.sh Show resolved Hide resolved
@@ -72,6 +72,8 @@ public static void setUp() throws IOException {
instances = new ArrayList<>();
InstancesSettings instanceSettings = InstancesSettings.newBuilder().build();
instancesClient = InstancesClient.create(instanceSettings);

Util.cleanUpComputeInstances(instancesClient, DEFAULT_PROJECT, DEFAULT_ZONE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a bit too aggressive to delete all compute instances? Why isn't the cleanup in tearDown() sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tearDown() cleanup is sufficient for 99% of the case. There was a significant amount of compute instances that were in there (my guess from previous runs that didn't have the cleanup or when the IT jobs were cancelled halfway).

This project is only used for ITs so it would have 24 hours to report the results before the previous instances are deleted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that someone runs the test on their own project and causes inadvertent deletion of instances. If we want to do this kind of cleanup, maybe we can have instance name prefix that is consistent and can accurately filter instances created by this test.

Copy link
Contributor Author

@lqiu96 lqiu96 Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that someone runs the test on their own project and causes inadvertent deletion of instances.

Actually, I was thinking that this would be fine for the ITs. I was thinking the ideal state for gcloud-devel would have 0 compute instances when there are no active ITs running. If Util.cleanUpComputeInstances (or more filtered version of that function) matches any existing instances, then those should be deleted. 24 hours is 12x as long as the longest GraalVM job.

maybe we can have instance name prefix that is consistent and can accurately filter instances created by this test.

Yep! I was planning on creating another PR that would update the instances for each module to have a more unique name instead test-instance or gapic-instance. i.e. Compute would be test-compute-instance-UUID, Container would be test-container-instance-UUID or something like that.

Copy link
Contributor Author

@lqiu96 lqiu96 Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok nvm, I just re-read the comment.

test on their own project

@@ -90,7 +91,13 @@ public static void beforeClass() throws Exception {

@AfterClass
public static void afterClass() throws Exception {
Thread.sleep(TimeUnit.MINUTES.toMillis(5));
Operation response = client.getOperation(PROJECT_ID, ZONE, operation.getName());
// Sleep for one minute until Cluster CREATE operation is complete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we waiting for cluster creation completion in afterClass()? Shouldn't we wait for it right after creation in beforeClass()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to run operations of the client (get/ list) without the operation having to be complete. It was only when deleting the cluster, I got an error with Operation in progress. I think it would make sense to move this to wait when the operation was run.

I'll change this to to do the busy wait in beforeClass().

@@ -66,6 +66,8 @@ public class ITNotebookServiceClientTest {
public static void setUp() throws IOException, ExecutionException, InterruptedException {
// Create Test Notebook Instance
client = NotebookServiceClient.create();
Util.cleanUpNotebookInstances(client, PARENT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this delete the one created above before it's even used?
Also, why isn't cleanup in tearDown() sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would only delete notebooks that have a creation time more than 24 hours earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a prefix for the notebook names or something to avoid accidentally deleting unrelated notebooks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's similar to the comment above about compute and container. I'll add in more specific names for the IT instances.

@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2022
@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 9, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 9, 2022
@lqiu96 lqiu96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 12, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 12, 2022
@lqiu96
Copy link
Contributor Author

lqiu96 commented Sep 12, 2022

@meltsufin @suztomo Could you re-review this PR?

for (Instance instance : listPagedResponse.iterateAll()) {
if (isCreatedBeforeThresholdTime(
ZonedDateTime.parse(instance.getCreationTimestamp()).toInstant())
&& instance.getName().startsWith(BaseTest.COMPUTE_PREFIX)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both Util classes, I think I would rather have the prefix passed in as an argument into this method, but I guess this is OK too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, this seems better. I'll create a new PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants