Improve visibility on failures in daily job #6071

Closed
jsoriano opened this issue May 3, 2023 · 5 comments · Fixed by #10234

@jsoriano
Member

jsoriano commented May 3, 2023

We have a daily job that tests all integrations with the latest snapshot version of the stack. Failures on this job are only reported to a Slack channel, but this has been failing for some time, so we have been relying on manually checking the health of the job.

Define a better visibility strategy for this job, so we can detect issues there early and notify the affected teams.

Some ideas:

  • Automatically create an issue when an integration fails on this job, assigned to the team maintaining the package.
  • Include the failing integrations in the notifications in Slack or other channels.
@ebeahan
Member

ebeahan commented Jun 28, 2023

I opened #6746 as an attempt at one minor improvement to the daily job status.

@mrodm mrodm self-assigned this May 9, 2024
@mrodm
Contributor

mrodm commented May 9, 2024

Report new errors/failures as GitHub Issues

For this purpose, we could take advantage of the xUnit files written by elastic-package for each kind of test. All these files are uploaded as artifacts in Buildkite and can be retrieved by adding one new step at the end of the pipeline.

If there is any failure or error in tests, it would be present there. An example of a real failing test:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="system" tests="3" failures="1">
    <!--test suite for system tests-->
    <testcase name="system test: postgresql" classname="sql_input." time="37.935391511"></testcase>
    <testcase name="system test: mssql" classname="sql_input." time="27.131335712"></testcase>
    <testcase name="system test: mysql" classname="sql_input." time="34.332166741">
      <failure>one or more errors found in documents stored in metrics-sql.sql-ep data stream: [0] found error.message in event: cannot open connection: testing connection: dial tcp 172.18.0.7:3306: connect: connection refused</failure>
    </testcase>
  </testsuite>
</testsuites>
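
A minimal sketch of how such a file could be processed, assuming Python and only the standard library. The function name `failed_testcases` and the keys of the returned records are hypothetical; the element and attribute names are the ones from the example above, and the sample is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the xUnit example above (illustrative values only).
SAMPLE = """<testsuites>
  <testsuite name="system" tests="2" failures="1">
    <testcase name="system test: postgresql" classname="sql_input." time="37.9"></testcase>
    <testcase name="system test: mysql" classname="sql_input." time="34.3">
      <failure>cannot open connection: connection refused</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def failed_testcases(root):
    """Return one record per <failure> or <error> child found in a <testcase>."""
    results = []
    for suite in root.iter("testsuite"):
        for case in suite.iter("testcase"):
            for kind in ("failure", "error"):
                node = case.find(kind)
                if node is not None:
                    results.append({
                        "suite": suite.get("name"),
                        "test": case.get("name"),
                        "classname": case.get("classname"),
                        "kind": kind,
                        "message": (node.text or "").strip(),
                    })
    return results


failed = failed_testcases(ET.fromstring(SAMPLE))
for item in failed:
    print(item["suite"], "|", item["test"], "|", item["kind"])
```

For files downloaded from Buildkite, `ET.parse(path).getroot()` could be passed instead of the parsed sample string.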

Because these tests run in the daily job (local Elastic stack), they also run against a specific version of the Elastic stack, for instance 7.17.0-SNAPSHOT or 8.14.0-SNAPSHOT. However, this does not apply to the builds testing against Elastic Serverless.

With that as a basis, I can think of two options for creating GitHub issues:

  • One issue per package, covering all the test errors/failures raised for that package.
    • Sometimes the same error happens in different tests.
  • One issue per test error or failure raised for each package.
    • More flexible for taking action if the errors/failures are independent of each other.

Issues created should have this information available to help the owner team:

  • Specific (new?) labels (e.g. failed-test and automation).
  • Elastic stack version used, if possible.
    • There is no such information when testing with Serverless.
  • Package name and version.
  • Link to the Buildkite pipeline
    • If the Buildkite link is present, it could complicate checking whether an issue already exists (if the description is used for that check).
  • If possible, a snippet of the errors/failures and the type of test (system, static, pipeline or asset) that failed.
  • A mention of the GitHub team in charge of that package.
    • This information should be available through the GitHub CODEOWNERS file.
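
As a rough sketch of the CODEOWNERS lookup, assuming Python: the function below only handles plain path-prefix entries (no globs), and the entries and team names in the sample are hypothetical. It follows the GitHub rule that the last matching entry takes precedence.

```python
# Hypothetical CODEOWNERS content; real entries and team names will differ.
CODEOWNERS = """
# default owners for every package
/packages/ @elastic/example-default-team
/packages/sql_input @elastic/example-sql-team
"""


def owners_for(path, codeowners_text):
    """Return the owners of `path`, using the last matching prefix entry."""
    owners = []
    target = path.strip("/")
    for line in codeowners_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        pattern, *entry_owners = line.split()
        prefix = pattern.strip("/")
        if target == prefix or target.startswith(prefix + "/"):
            owners = entry_owners  # later entries override earlier ones
    return owners


sql_owners = owners_for("packages/sql_input", CODEOWNERS)
other_owners = owners_for("packages/some_other_package", CODEOWNERS)
print(sql_owners, other_owners)
```

A real implementation would need to support the full CODEOWNERS pattern syntax; this only shows the lookup idea.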

How to avoid creating duplicated GitHub Issues

Some checks that can be performed to avoid creating duplicated issues for the same errors/failures:

  • List only issues with the specific labels.
  • Check whether one exists with the same title:
    • Write the title as "Failing tests: <stack_version> - <package_name> - <package_version>"
  • Check whether one exists with the same title and body:
    • Write the title as above
    • Write the body ensuring that tests are always listed in the same order.

Some metadata could be added to the issue description, following the example of https://github.com/probot/metadata, to help check whether an issue for that test (or tests) was already created.

In Kibana, there is a similar approach: https://github.com/elastic/kibana/blob/f82d64043155736b6daf3a5c2286fa14417fc19c/packages/kbn-failed-test-reporter-cli/failed_tests_reporter/issue_metadata.ts#L45

Issue description...

<!-- integrationsCI = {"build": "xxx", "package": "my-package", "version": "1.0.0", "failures": ["", ""], "errors": ["", ""]} -->
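
A small sketch of writing and reading back such a metadata comment, assuming Python. The `integrationsCI` marker and the field names come from the example above; the helper names `embed_metadata` and `extract_metadata` are hypothetical.

```python
import json
import re

MARKER = "integrationsCI"
# Matches: <!-- integrationsCI = {...} -->
_PATTERN = re.compile(r"<!--\s*" + MARKER + r"\s*=\s*(\{.*?\})\s*-->", re.DOTALL)


def embed_metadata(description, data):
    """Append the metadata as a hidden HTML comment at the end of the body."""
    payload = json.dumps(data, sort_keys=True)  # stable key order for comparisons
    return f"{description}\n\n<!-- {MARKER} = {payload} -->"


def extract_metadata(body):
    """Return the metadata dict stored in the body, or None if there is none."""
    match = _PATTERN.search(body)
    return json.loads(match.group(1)) if match else None


body = embed_metadata(
    "Issue description...",
    {"build": "xxx", "package": "my-package", "version": "1.0.0",
     "failures": ["system test: mysql"], "errors": []},
)
metadata = extract_metadata(body)
print(metadata["package"], metadata["failures"])
```

One caveat of this approach: a value containing `-->` would end the HTML comment early, so free-form error messages would need escaping or should be left out of the metadata.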

All these checks could be performed in a new step in Buildkite:

  1. Download all xml files from tests (build/test-results/*.xml)
  2. Process those files to get the errors or failures per package
  3. Loop over those errors/failures
    1. Check if there is an issue for the error(s)/failure(s).
    2. If not, create a new issue with all the required data (and metadata?)
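
Steps 2 and 3 could look roughly like the following, assuming Python. The lookup and creation are passed in as callables (`issue_exists` and `create_issue`, both hypothetical) so the sketch does not depend on any particular GitHub client; downloading the artifacts (step 1) is left out.

```python
from collections import defaultdict


def report_failures(failures, issue_exists, create_issue):
    """Group failures per package and create an issue for each unreported one."""
    per_package = defaultdict(list)
    for failure in failures:
        per_package[failure["package"]].append(failure)

    created = []
    for package in sorted(per_package):  # stable order between runs
        for failure in per_package[package]:
            if issue_exists(failure):
                continue
            create_issue(failure)
            created.append((package, failure["test"]))
    return created


# In-memory stand-ins for the GitHub lookups (illustrative data only).
known = {("sql_input", "system test: mysql")}
new_issues = []
failures = [
    {"package": "sql_input", "test": "system test: mysql"},
    {"package": "couchbase", "test": "system test: default"},
]
created = report_failures(
    failures,
    issue_exists=lambda f: (f["package"], f["test"]) in known,
    create_issue=new_issues.append,
)
print(created)
```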

Open Questions

  • Would this process make sense ?
  • Should an issue contain just one failure or error ? or all of them ?
  • Should this run for all jobs triggered daily ? For both Serverless (Observability and Security) and local Elastic stacks (7.17.* and 8.14.0-SNAPSHOT) ?
  • How to ensure that issues are not duplicated ? WDYT about adding that metadata string into the description?
  • Once a GH issue is created, is the team owner in charge of closing it ? Even if that error failed just in a few builds and then ran successfully ?

cc @elastic/ecosystem

@mrodm
Contributor

mrodm commented May 13, 2024

Just checked that in the latest daily builds for Elastic stack 8.14 (local), these were the failing packages:

Wondering whether the issue should be created as soon as a package fails, without requiring it to fail N times in a row.
It's true that packages with flaky tests, if they are not reported on the first occurrence, may never be reported if the next build succeeds.

As a note, these packages have been already reported:

@jsoriano
Member Author

  • Would this process make sense ?

Yes, it looks like a good approach.

We may have failures related to a package that don't generate an xUnit file. But rather than handling these failures specifically, I would prefer to enhance elastic-package so that it generates xUnit files for them.

  • Should an issue contain just one failure or error ? or all of them ?

It would be better to create one issue per failing test, as this is more actionable. It would be important to avoid duplicate issues.

  • Should this run for all jobs triggered daily ? For both Serverless (Observability and Security) and local Elastic stacks (7.17.* and 8.14.0-SNAPSHOT) ?

Yes.

  • How to ensure that issues are not duplicated ? WDYT about adding that metadata string into the description?

I think we can rely on the information provided in the xUnit file, so we'd have titles like "Failing tests: build name - test title in Buildkite", for example: "Failing test: daily 7.x - system test: default (variant: v7.1.0) in couchbase.xdcr", and we can base the duplication checks on these titles. Even if the issue is not exactly the same, this indicates that a given package needs to be reviewed in a given scenario.

The "build name" would indicate whether the issue happened in 7.x, 8.x/latest or serverless. I would not reference specific versions here because the same issue can happen during multiple release cycles.

For the same reason I would not use the package version as a dimension: a package whose usual tests don't fail could have multiple versions with issues in the daily jobs.
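
A sketch of building such a title, assuming Python. It assumes that the trailing "in couchbase.xdcr" part comes from the `classname` attribute of the xUnit test case, which the comment above does not confirm; the function name `issue_title` and the stripping of a trailing dot are also assumptions.

```python
def issue_title(build_name, test_name, classname):
    """Build a title from the build name and the xUnit test case attributes.

    No stack version or package version is included, so the same title is
    produced across release cycles and package versions.
    """
    # Assumption: a classname such as "sql_input." carries a trailing dot
    # that is not wanted in the title.
    return f"Failing test: {build_name} - {test_name} in {classname.rstrip('.')}"


title = issue_title(
    "daily 7.x",
    "system test: default (variant: v7.1.0)",
    "couchbase.xdcr",
)
print(title)
```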

When finding duplicates, it would be nice to include the new failing build in the description of the issue. We could keep links to the original failing build and to the N latest failures found.
If we detect a duplicate that was closed, I would not reopen the closed one, as it might be an old unrelated one. I would create a new issue, and link to the closed ones.
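
The handling of duplicates described above could be sketched as follows, assuming Python. The issue records (dicts with `number`, `title`, `state` and `builds`), the function name `plan_action` and the `keep_latest` parameter are all hypothetical; the build identifiers stand in for Buildkite links.

```python
def plan_action(title, issues, build_url, keep_latest=3):
    """Decide whether to update an open issue or create a new one.

    Open duplicate: keep the original failing build plus the N latest ones.
    Closed duplicates only: create a new issue that links to the closed ones.
    """
    same_title = [i for i in issues if i["title"] == title]
    open_issues = [i for i in same_title if i["state"] == "open"]
    if open_issues:
        issue = open_issues[0]
        first, later = issue["builds"][:1], issue["builds"][1:]
        latest = (later + [build_url])[-keep_latest:]
        return {"action": "update", "number": issue["number"],
                "builds": first + latest}
    closed = [i["number"] for i in same_title if i["state"] == "closed"]
    return {"action": "create", "builds": [build_url], "related": closed}


issues = [
    {"number": 1, "title": "T", "state": "closed", "builds": ["build-0"]},
    {"number": 2, "title": "T", "state": "open",
     "builds": ["build-1", "build-2", "build-3", "build-4"]},
]
update = plan_action("T", issues, "build-5")
create = plan_action("T", issues[:1], "build-5")
print(update)
print(create)
```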

  • Once a GH issue is created, is the team owner in charge of closing it ? Even if that error failed just in a few builds and then ran successfully ?

Yes, the team owner is in principle responsible for closing it. Maybe ecosystem should be pinged too, at least at the beginning, to discard issues created for reasons not related to the package.

@mrodm
Copy link
Contributor

mrodm commented May 15, 2024

When finding duplicates, it would be nice to include the new failing build in the description of the issue. We could keep links to the original failing build and to the N latest failures found.

Yes, we could add a comment to the issue or try to update its description to include the latest failures (build links). I don't know whether the description of an issue can be updated with the gh tool or whether it must be done via the API; something to check.
