Improve visibility on failures in daily job #6071

jsoriano · 2023-05-03T12:50:29Z

We have a daily job that tests all integrations with the latest snapshot version of the stack. Failures on this job are only reported to a Slack channel, but this reporting has been failing for some time, so we have been relying on manually checking the health of the job.

Define a better visibility strategy for this job, so we can detect issues early and notify the affected teams.

Some ideas:

Automatically create an issue when an integration fails on this job, assigned to the team maintaining the package.
Include the failing integrations in the notifications sent to Slack or through other means.

ebeahan · 2023-06-28T19:32:35Z

I opened #6746 as an attempt at one minor improvement to the daily job status reporting.

mrodm · 2024-05-09T16:22:32Z

Report new errors/failures as GitHub Issues

For this purpose, we could take advantage of the xUnit files written by elastic-package for each kind of test. All these files are uploaded as artifacts in Buildkite and can be retrieved by adding one new step at the end of the pipeline.

Any failure or error in the tests would be present there. An example of a real failing test:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="system" tests="3" failures="1">
    <!--test suite for system tests-->
    <testcase name="system test: postgresql" classname="sql_input." time="37.935391511"></testcase>
    <testcase name="system test: mssql" classname="sql_input." time="27.131335712"></testcase>
    <testcase name="system test: mysql" classname="sql_input." time="34.332166741">
      <failure>one or more errors found in documents stored in metrics-sql.sql-ep data stream: [0] found error.message in event: cannot open connection: testing connection: dial tcp 172.18.0.7:3306: connect: connection refused</failure>
    </testcase>
  </testsuite>
</testsuites>
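
A minimal sketch, in Python, of how a file like the one above could be processed to list failed test cases. The helper name `collect_failures` and the keys of the returned dictionaries are hypothetical. It assumes that errors are reported with an `<error>` element (the usual xUnit convention) and that `classname` has the form `<package>.<data_stream>`, which is inferred from this example and not confirmed.

```python
import xml.etree.ElementTree as ET

# Copy of the example above, with the XML declaration and comment omitted.
SAMPLE = """<testsuites>
  <testsuite name="system" tests="3" failures="1">
    <testcase name="system test: postgresql" classname="sql_input." time="37.935391511"></testcase>
    <testcase name="system test: mssql" classname="sql_input." time="27.131335712"></testcase>
    <testcase name="system test: mysql" classname="sql_input." time="34.332166741">
      <failure>one or more errors found in documents stored in metrics-sql.sql-ep data stream: [0] found error.message in event: cannot open connection: testing connection: dial tcp 172.18.0.7:3306: connect: connection refused</failure>
    </testcase>
  </testsuite>
</testsuites>"""


def collect_failures(xml_text):
    """Return one dict per <failure> or <error> found in the test cases."""
    root = ET.fromstring(xml_text)
    results = []
    for suite in root.iter("testsuite"):
        for case in suite.iter("testcase"):
            for kind in ("failure", "error"):
                for node in case.findall(kind):
                    results.append({
                        "test_type": suite.get("name"),
                        # Assumption: the text before the first dot is the package.
                        "package": case.get("classname", "").split(".")[0],
                        "test": case.get("name"),
                        "kind": kind,
                        "message": (node.text or "").strip(),
                    })
    return results


for item in collect_failures(SAMPLE):
    print(item["package"], "|", item["test_type"], "|", item["test"], "|", item["kind"])
```

Passing test cases produce no entries, so an empty list means the file contains nothing to report.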

As these tests are run in the daily job (local Elastic stack), they also run against a specific version of the Elastic stack, for instance 7.17.0-SNAPSHOT or 8.14.0-SNAPSHOT. However, this does not apply to the builds testing with Elastic Serverless.

Having that as a basis, I think of two different options to create GitHub issues:

One issue per package, covering all test errors/failures raised for that package.
- Sometimes, the same error happens in different tests.
One issue per test error or failure raised for each package.
- More flexible for taking action if the errors/failures are independent of each other.

Issues created should have this information available to help the owner team:

Set specific (new?) labels (e.g. failed-test and automation).
Elastic stack version used, if possible.
- There is no such information when testing with Serverless.
Package name and version.
Link to the Buildkite pipeline
- If the link to Buildkite is included, it could complicate checking whether an issue already exists (if the description is used for that check).
If possible, snippet of the errors/failures and which type of test (system, static, pipeline or asset) failed.
Mention of the GitHub team in charge of that package.
- This information should be available through the GitHub CODEOWNERS file.
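
As a rough sketch of that lookup, the snippet below resolves the owners of a package directory from CODEOWNERS content. It is deliberately simplified: patterns are treated as plain path prefixes, so wildcards and the other pattern rules supported by GitHub are not handled. The function name `owners_for` and the team names in the example are invented for illustration.

```python
def owners_for(path, codeowners_text):
    """Return the owners from the last rule whose pattern is a prefix of path."""
    target = path.strip("/")
    owners = []
    for line in codeowners_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        pattern, *rule_owners = line.split()
        prefix = pattern.strip("/")
        if target == prefix or target.startswith(prefix + "/"):
            owners = rule_owners  # in CODEOWNERS, the last matching rule wins
    return owners


# Hypothetical CODEOWNERS content; the team names are not real.
EXAMPLE = """
# owners per package
/packages/sql_input @example-org/team-a
/packages/sql_input/data_stream @example-org/team-b
"""

print(owners_for("packages/sql_input", EXAMPLE))
print(owners_for("packages/unknown", EXAMPLE))
```

An empty result would mean no rule matched, in which case the issue could fall back to a default team.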

How to avoid creating duplicated GitHub Issues

Some checks that can be performed to avoid creating duplicated issues for the same errors/failures:

List only issues with the specific labels.
Check whether an issue with the same title exists:
- Write in the title "Failing tests: <stack_version> - <package_name> - <package_version>"
Check whether an issue with the same title and body exists:
- Write title as above
- Write the body ensuring that tests are written always in the same order.

Some metadata could be added to the issue description, following the example of https://github.com/probot/metadata, to help check whether an issue for that test (or tests) was already created.

In Kibana, there is a similar approach: https://github.com/elastic/kibana/blob/f82d64043155736b6daf3a5c2286fa14417fc19c/packages/kbn-failed-test-reporter-cli/failed_tests_reporter/issue_metadata.ts#L45

Issue description...

<!-- integrationsCI = {"build": "xxx", "package": "my-package", "version": "1.0.0", "failures": ["", ""], "errors": ["", ""]} -->
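
A possible way to write and read back such a comment, sketched in Python. The marker name `integrationsCI` and the JSON keys come from the example above; the function names and the regular expression are assumptions, not an existing implementation.

```python
import json
import re

MARKER = "integrationsCI"
# Matches: <!-- integrationsCI = {...} -->
_PATTERN = re.compile(r"<!--\s*" + MARKER + r"\s*=\s*(\{.*?\})\s*-->", re.DOTALL)


def embed_metadata(description, metadata):
    """Append the metadata to the description as a hidden HTML comment."""
    # sort_keys keeps the serialized form stable between runs.
    payload = json.dumps(metadata, sort_keys=True)
    return f"{description}\n\n<!-- {MARKER} = {payload} -->"


def extract_metadata(body):
    """Return the metadata dict stored in an issue body, or None if absent."""
    match = _PATTERN.search(body)
    return json.loads(match.group(1)) if match else None


body = embed_metadata(
    "Issue description...",
    {"build": "xxx", "package": "my-package", "version": "1.0.0",
     "failures": ["", ""], "errors": ["", ""]},
)
print(body)
print(extract_metadata(body)["package"])
```

One caveat of this approach: a failure message containing `-->` would end the comment early, so messages would need to be sanitized or left out of the metadata.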

All these checks could be performed in a new step in Buildkite:

Download all xml files from tests (build/test-results/*.xml)
Process those files to get the errors or failures per package
Loop over those errors/failures
1. Check if there is an issue for the error(s)/failure(s).
2. If not, create a new issue with all the required data (and metadata?)
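
The loop in the steps above could look roughly like the following sketch. Everything here is hypothetical: `report_failures`, the title format and the `create_issue` callback stand in for whatever the real step would use to talk to GitHub.

```python
def report_failures(failures, existing_titles, create_issue):
    """Create one issue per failure unless an issue with that title exists."""
    seen = set(existing_titles)
    created = []
    for failure in failures:
        # Hypothetical title format, used here only as the duplicate key.
        title = f"Failing test: {failure['package']} - {failure['test']}"
        if title in seen:
            continue  # step 1: an issue already exists for this failure
        create_issue(title, failure)  # step 2: create it with the required data
        seen.add(title)
        created.append(title)
    return created


# Example run with an in-memory stand-in for the issue tracker.
calls = []
failures = [
    {"package": "sql_input", "test": "system test: mysql"},
    {"package": "sql_input", "test": "system test: mysql"},  # repeated failure
]
created = report_failures(failures, existing_titles=[],
                          create_issue=lambda title, data: calls.append(title))
print(created)
```

Passing the creation function as a parameter keeps the duplicate check testable without network access.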

Open Questions

Would this process make sense ?
Should an issue contain just one failure or error ? or all of them ?
Should this run for all jobs triggered daily ? For both Serverless (Observability and Security) and local Elastic stacks (7.17.* and 8.14.0-SNAPSHOT) ?
How to ensure that issues are not duplicated ? WDYT about adding that metadata string into the description?
Once a GH issue is created, is the team owner in charge of closing it ? Even if that error occurred in just a few builds and then the tests ran successfully ?

cc @elastic/ecosystem

mrodm · 2024-05-13T16:22:17Z

Just checked that in the latest daily builds for Elastic stack 8.14 (local), these were the failing packages:

https://buildkite.com/elastic/integrations/builds/11362
- couchbase, reported as [couchbase.xdcr] Flaky test - invalid character 'R' looking for beginning of value #9824
- sql_input, reported as [sql_input] Flaky test - cannot open connection: testing connection #9840
https://buildkite.com/elastic/integrations/builds/11352
- kibana
- sql_input
https://buildkite.com/elastic/integrations/builds/11351
- sql_input
https://buildkite.com/elastic/integrations/builds/11328
- couchbase
- ibmmq
- sql_input
https://buildkite.com/elastic/integrations/builds/11293
- couchbase
- ibmmq
https://buildkite.com/elastic/integrations/builds/11241
- ibmmq
https://buildkite.com/elastic/integrations/builds/11206
- ibmmq
- couchbase
https://buildkite.com/elastic/integrations/builds/11124
- couchbase
- ibmmq

I wonder whether the issue should be created as soon as a package fails, without requiring it to fail N times in a row.
It's true that packages with flaky tests, if they are not reported on the first occurrence, may never be reported if the next build succeeds.
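
To make the trade-off concrete, here is a small sketch that replays the builds listed above under a "report after N failures in a row" rule. It assumes, only for illustration, that the listed builds are consecutive runs of the same job, which the list does not confirm. `HISTORY` and `ever_reported` are hypothetical names.

```python
# Failing packages per build, newest first, copied from the list above.
HISTORY = [
    (11362, {"couchbase", "sql_input"}),
    (11352, {"kibana", "sql_input"}),
    (11351, {"sql_input"}),
    (11328, {"couchbase", "ibmmq", "sql_input"}),
    (11293, {"couchbase", "ibmmq"}),
    (11241, {"ibmmq"}),
    (11206, {"ibmmq", "couchbase"}),
    (11124, {"couchbase", "ibmmq"}),
]


def ever_reported(package, history, threshold):
    """True if the package failed at least `threshold` builds in a row."""
    streak = 0
    for _build, failing in reversed(history):  # walk from oldest to newest
        streak = streak + 1 if package in failing else 0
        if streak >= threshold:
            return True
    return False


for package in ("couchbase", "ibmmq", "kibana", "sql_input"):
    print(package, {n: ever_reported(package, HISTORY, n) for n in (1, 2, 3)})
```

Under these assumptions, kibana would only be reported with a threshold of 1, and couchbase would be missed with a threshold of 3, which supports reporting on the first occurrence.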

As a note, these packages have been already reported:

jsoriano · 2024-05-13T16:37:31Z

Would this process make sense ?

Yes, it looks like a good approach.

We may have failures related to a package that don't generate an xUnit file. But rather than handling these failures specifically, I would prefer to enhance elastic-package so it generates xUnit files for them.

Should an issue contain just one failure or error ? or all of them ?

It would be better to create one issue per failing test, this is more actionable. It would be important to avoid duplicate issues.

Should this run for all jobs triggered daily ? For both Serverless (Observability and Security) and local Elastic stacks (7.17.* and 8.14.0-SNAPSHOT) ?

Yes.

How to ensure that issues are not duplicated ? WDYT about adding that metadata string into the description?

I think we can rely on the information provided in the xunit file, so we'd have titles like "Failing tests: build name - test title in buildkite", so for example: "Failing test: daily 7.x - system test: default (variant: v7.1.0) in couchbase.xdcr", and we can base the duplication checks on these titles. Even if the issue is not exactly the same, this indicates that a given package needs to be reviewed in a given scenario.

The "build name" would indicate if the issue happened in 7.x, 8.x/latest or serverless. I would not reference specific versions here, because the same issue can be happening during multiple release cycles.

For the same reason I would not use the package version as a dimension: a package whose usual tests don't fail could have multiple versions with issues in the daily jobs.

When finding duplicates, it would be nice to include the new failing build in the description of the issue. We could keep links to the original failing build and to the N latest failures found.
If we detect a duplicate that was closed, I would not reopen the closed one, as it might be an old unrelated one. I would create a new issue, and link to the closed ones.
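
The title-based check and the handling of open and closed duplicates described above could be sketched as follows. The function names, the shape of the `issues` list and the issue numbers are hypothetical; only the title format is taken from the example in this comment.

```python
def build_title(build_name, test_title, target):
    """Compose the title used as the duplicate key."""
    return f"Failing test: {build_name} - {test_title} in {target}"


def decide(title, issues):
    """Choose between updating an open issue and creating a new one.

    `issues` is a list of dicts with "number", "title" and "state"
    ("open" or "closed"), e.g. the result of listing issues by label.
    """
    same_title = [i for i in issues if i["title"] == title]
    open_issues = [i for i in same_title if i["state"] == "open"]
    if open_issues:
        # Open duplicate: add the new failing build to the existing issue.
        return {"action": "update", "issue": open_issues[0]["number"]}
    # Closed duplicates are not reopened; the new issue links to them.
    return {"action": "create", "link_closed": [i["number"] for i in same_title]}


title = build_title("daily 7.x", "system test: default (variant: v7.1.0)", "couchbase.xdcr")
known = [{"number": 101, "title": title, "state": "closed"}]  # made-up issue
print(title)
print(decide(title, known))
```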

Once a GH issue is created, is the team owner in charge of closing it ? Even if that error occurred in just a few builds and then the tests ran successfully ?

Yes, the team owner is in principle responsible for closing it. Maybe ecosystem should be pinged too, at least at the beginning, to rule out issues created for reasons not related to the package.

mrodm · 2024-05-15T11:24:44Z

When finding duplicates, it would be nice to include the new failing build in the description of the issue. We could keep links to the original failing build and to the N latest failures found.

Yes, a comment could be added to the issue, or we could try to update the description of the issue to include the latest failures (build links). I don't know whether the description of an issue can be updated with the gh tool or whether it must be done via the API; that is something to check.
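
Independently of how the description ends up being written, the bookkeeping of build links could be a small pure function like this sketch. The name `record_failure` and the default of three recent links are assumptions; the idea of keeping the original build plus the N latest ones comes from the quoted comment.

```python
def record_failure(links, new_link, keep_latest=3):
    """Return the links to show: the original build plus the latest N builds."""
    if new_link in links:
        return list(links)  # this build was already recorded
    updated = list(links) + [new_link]
    original, rest = updated[0], updated[1:]
    if keep_latest <= 0:
        return [original]
    return [original] + rest[-keep_latest:]


links = []
for build in (11124, 11206, 11241, 11293):
    url = f"https://buildkite.com/elastic/integrations/builds/{build}"
    links = record_failure(links, url, keep_latest=2)
print("\n".join(links))
```

The resulting list can then be rendered into the description or posted as a comment, whichever turns out to be supported.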

jsoriano mentioned this issue May 3, 2023

Test integrations with latest stack version on PRs #6072

Open

ebeahan mentioned this issue Jun 28, 2023

[ci] Enable Slack notifications on daily job status #6746

Merged

mrodm self-assigned this May 9, 2024

jsoriano mentioned this issue May 13, 2024

[sql_input] Flaky test - cannot open connection: testing connection #9840

Closed

This was referenced Jun 20, 2024

Change separator in results file written elastic/elastic-package#1925

Merged

Report packages failing in daily CI jobs #10234

Merged

mrodm closed this as completed in #10234 Jul 2, 2024

This was referenced Jul 3, 2024

[CI] Add gh cli version to report failing tests in daily jobs #10347

Merged

[CI] Fix reporting failing tests in serverless pipeline #10377

Merged
