Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worflow-level "error" field does not represent failures in jobs #114

Open
seh opened this issue May 25, 2023 · 10 comments
Open

Worflow-level "error" field does not represent failures in jobs #114

seh opened this issue May 25, 2023 · 10 comments

Comments

@seh
Copy link
Contributor

seh commented May 25, 2023

Even when a job within a workflow fails—with the "github.job.conclusion" field being something other than "success" and the "error" field on the job's span being true—the top-level trace span's "error" field still reports true, making it appear as though the workflow succeeded. Instead, the top-level span's "error" field should bear a false value if any of the constituent jobs failed.

@nikordaris
Copy link
Contributor

Thanks for reporting this. Parent propagation is tricky. Can you provide any more details on what you've seen specifically? Catching all the edge cases is tough!

@seh
Copy link
Contributor Author

seh commented Jul 15, 2023

Can you provide any more details on what you've seen specifically?

We have jobs that fail—often due to a step invoking a shell program that exits with a nonzero code—and the span representing that job indicates that there was an error. However, there's no indication on the top-level workflow run's trace span that a job within it failed, even though the workflow run failed as a consequence.

@nikordaris
Copy link
Contributor

Are you running this action on workflow_run Event? Or is it the final job of the workflow it is tracing? I believe there are some edge cases with when the workflow data becomes available over the API when this action is run within the workflow it is tracing which could be what you are observing?

@seh
Copy link
Contributor Author

seh commented Jul 17, 2023

The event in question is "pull_request", and the otel-export-trace-action action runs as the lone step in a final job within the workflow that "needs" all the preceding jobs, with the if condition of always().

What I didn't say earlier is that we'd like some attribute to be present in the top-level trace span that answers the question of whether all the constituent jobs ran to completion successfully. Whether that's the "error" attribute being true to indicate a failure or a successful job count being lower than a total job count, so long as I could discern it in a query, I'd be satisfied. My goal is to detect and react to workflow runs failing in some way.

@nikordaris
Copy link
Contributor

I would try to separate this trace action in a separate workflow that runs on completion of your PR workflow to see if that solves the error propagation issue.

The GitHub API parent is supposed to set the error of children properly. I believe the issue here is the trace action job at the end isn't complete yet because it is still running the GitHub workflow parent status isn't updated yet. Moving it out of the PR workflow should solve this problem.

@seh
Copy link
Contributor Author

seh commented Jul 19, 2023

I think I understand your suggestion, but that makes it much more difficult to use this action. Instead of this being a job included in any workflow about which we'd like to publish traces, we would now need a second, companion workflow.

Is it feasible to iterate over all of the constituent jobs in the workflow in which this action is running and inspect each job's outcome, such that if any of those outcomes are failures, you could then convey that failure (or two counts, the total number of jobs and the number that either failed or succeeded) in the top-level trace span?

@nikordaris
Copy link
Contributor

So I don't think adding another workflow is a big challenge. It's actually less code as you can then target all the workflows you want to trace in that one otel workflow instead of creating a trace job for every workflow. This is how we do it for all of our repos.

With that said, the nature of this action is to trust the GitHub API and mirror it over otel. So I don't think it will create some risk and complexity to overwrite what the API says based on internal business logic.

TBH, I've been considering explicitly recommending in the readme to not do this approach of embedding in the workflow because of these eventual consistency issues with the GitHub API.

@seh
Copy link
Contributor Author

seh commented Jul 20, 2023

I'm willing to accept that "I'm holding it wrong", but at the time that I started using this action I had followed what was the prescribed approach, which was already so much cleaner than my previous use of Honeycomb's buildevents program. It would help me and I assume many other users tremendously if you'd show an example of how you think this action should be used to best effect, together with a warning against doing what I've been doing.

@nikordaris
Copy link
Contributor

Absolutely, I'm discovering this along the way too. I built this action for myself to solve my pains with using existing tooling like buildevents and otel cli. I shared it with the community to primarily solicit help in maintaining it.

In the readme the first usage example is how we use it successfully. It is the primary way that we test it. Embedding it in the workflow was an afterthought but not the intended usage. If the readme usage for a separate workflow isn't clear we should definitely improve the docs for that. Is there a specific question you have on using it in that way that I can help answer? I also welcome any updates to the documentation once you do end up resolving your issue so that others can benefit from your learnings.

@seh
Copy link
Contributor Author

seh commented Jul 22, 2023

I confess that I completely bypasses reading that first example in the README.md file because I thought, "No, I'm not reacting to the workflow_run event; I'm reacting to push, as in the second example."

Removing the second and third examples would be a good first step, since it's easy to delete code. Harder is writing new prose. I think we should explain that the action doesn't work correctly if run in the same workflow as the workflow run on which it's trying to report, and that the workflow_run event—together with the "completed" type constraint—is the way to sequence one workflow after another correctly.

While we're here, note that there's also a community discussion entry in which lots of people complain about the ambiguity presented by this event.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants