Get failure information on EMR job failure #32151
Conversation
My experience with EMR is that it takes time for logs to be populated. More often the failure details are actually in stdout rather than stderr, and the general error message is not very helpful. I'm curious what exactly is being registered here. Is it possible to share an example?
The error messages might not explain the root cause of the problem, but they will at least be a starting point for an investigation. This is an example where I manually deleted the cluster before the job was run:
This message is logged.
Since the Airflow task failed, you already know the step encountered a terminal state. Either way, I'm OK with it as is; I was just curious to see what is logged in the message.
except AirflowException as ex:
    if "EMR Steps failed" in str(ex):
        resp = self.get_conn().describe_step(ClusterId=job_flow_id, StepId=step_id)
        self.log.error("EMR Steps failed: %s", resp["Step"]["Status"]["StateChangeReason"])
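For context, here is a minimal standalone sketch of the same pattern outside the hook; the boto3 client setup, job_flow_id, and step_id below are placeholders taken from the example responses further down, not the hook's actual wiring.

import logging

import boto3

log = logging.getLogger(__name__)

# Placeholder identifiers; in the hook these come from the operator/sensor arguments.
job_flow_id = "j-2WPQL41AO1169"
step_id = "s-00928142VC3M3WYAG8XN"

emr = boto3.client("emr", region_name="us-west-2")

# describe_step returns the step's current Status, including State,
# StateChangeReason and (when present) FailureDetails.
resp = emr.describe_step(ClusterId=job_flow_id, StepId=step_id)
status = resp["Step"]["Status"]
if status["State"] in ("FAILED", "CANCELLED"):
    log.error("EMR Steps failed: %s", status.get("StateChangeReason"))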
Are there any other portions of the status that are worth logging in addition to the state change reason?
Not really. In the documentation, there is supposed to be a FailureDetails field that should provide details, but I haven't seen that in the actual response body.
When I have a failed job, it takes 3-4 minutes before the logs appear and the issue is discovered. I wonder if the empty body reflects querying too soon, before there is any information from the logs.
The waiter is already querying this endpoint, so you're going to get the same info as what's already printed by the method, since you put "Step.Status.StateChangeReason" in the status args. +1 to using the wait method, but -1 to this except block.
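As a rough illustration of what the waiter-based approach covers, the sketch below uses boto3's built-in step_complete waiter rather than the provider's shared wait helper, so the retry settings and error handling here are assumptions rather than the PR's actual code.

import boto3
from botocore.exceptions import WaiterError

emr = boto3.client("emr", region_name="us-west-2")

# Example identifiers, matching the responses shown further down.
job_flow_id = "j-2WPQL41AO1169"
step_id = "s-00928142VC3M3WYAG8XN"

try:
    # The step_complete waiter polls describe_step until the step completes,
    # and raises WaiterError if the step fails, is cancelled, or times out.
    emr.get_waiter("step_complete").wait(
        ClusterId=job_flow_id,
        StepId=step_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},
    )
except WaiterError as err:
    # err.last_response holds the final describe_step payload, so the status
    # (including StateChangeReason) is already available here without an
    # extra describe_step call.
    print(err.last_response["Step"]["Status"])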
Is there any other source for useful information you know of, @dacort?
@o-nikolas FailureDetails should surface common reasons that a step failed, and it will do so before the logs get synced to S3. I'm not as familiar with StateChangeReason, but I don't think it will have details about why the job failed; I think it's more related to service-specific reasons why a step's state changed (e.g. somebody cancelled it). Here's an example of a job I ran with an invalid entrypoint file. FailureDetails is populated immediately, and the log file it references didn't sync to S3 until 4 minutes after the step failed.
{
    "Step": {
        "Id": "s-00928142VC3M3WYAG8XN",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "FAILED",
            "StateChangeReason": {},
            "FailureDetails": {
                "Reason": "The given path is invalid.",
                "Message": "Exception in thread \"main\" java.io.FileNotFoundException: File file:/mnt/tmp/fail.py does not exist",
                "LogFile": "s3://aws-logs-568026268536-us-west-2/elasticmapreduce/j-2WPQL41AO1169/steps/s-00928142VC3M3WYAG8XN/stderr.gz"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:41:40.728000-07:00",
                "StartDateTime": "2023-07-18T09:41:57.422000-07:00",
                "EndDateTime": "2023-07-18T09:42:03.499000-07:00"
            }
        }
    }
}
And here's one where I cancelled the job before it ran, with StateChangeReason populated.
{
    "Step": {
        "Id": "s-07516842NY2M4IHHH71T",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "CANCELLED",
            "StateChangeReason": {
                "Message": "Cancelled by user"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:48:45.141000-07:00"
            }
        }
    }
}
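Based on the two responses above, here is a hedged sketch of how a caller might prefer FailureDetails and fall back to StateChangeReason when logging; the helper name log_step_failure and the plain-dict argument are illustrative only, not part of this PR.

import logging

log = logging.getLogger(__name__)


def log_step_failure(describe_step_response: dict) -> None:
    """Log the most specific failure information available in a describe_step response."""
    status = describe_step_response["Step"]["Status"]
    failure = status.get("FailureDetails") or {}
    reason = status.get("StateChangeReason") or {}

    if failure:
        # e.g. "The given path is invalid." plus the FileNotFoundException message
        log.error(
            "EMR step failed: %s - %s (logs: %s)",
            failure.get("Reason"),
            failure.get("Message"),
            failure.get("LogFile"),
        )
    elif reason:
        # e.g. "Cancelled by user"
        log.error("EMR step %s: %s", status.get("State"), reason.get("Message"))
    else:
        log.error("EMR step ended in state %s with no further details", status.get("State"))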
Add logs to display failure reason if EMR Job fails
Currently, the logs do not give any information about the progress of an EMR job that has been scheduled. This PR uses the new wait util method to display the status of EMR jobs, and if a job fails, it calls the describe_step method to log the reason for the job failure.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.