Get failure information on EMR job failure #32151

Merged (6 commits) on Aug 8, 2023

Conversation

syedahsn (Contributor)

Currently, the logs give no information about the progress of a scheduled EMR job. This PR uses the new wait util method to log the status of EMR jobs, and, if a job fails, calls the describe_step method to log the reason for the failure.


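To make the description concrete, here is a minimal sketch of the idea in plain boto3 (not the provider code itself; the client setup, placeholder IDs, and waiter config are illustrative assumptions):

import boto3
from botocore.exceptions import WaiterError

emr = boto3.client("emr")
job_flow_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
step_id = "s-XXXXXXXXXXXXX"      # placeholder step ID

# Poll the step until it reaches a terminal state.
waiter = emr.get_waiter("step_complete")
try:
    waiter.wait(
        ClusterId=job_flow_id,
        StepId=step_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},
    )
except WaiterError:
    # On failure, describe the step so the reason lands in the logs.
    status = emr.describe_step(ClusterId=job_flow_id, StepId=step_id)["Step"]["Status"]
    print("EMR Steps failed:", status.get("StateChangeReason"))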

boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels on Jun 26, 2023
eladkal (Contributor) commented Jun 26, 2023

and if a job fails, it calls the describe_step method to log the reason for the job failure.

My experience with EMR is that it takes time for logs to be populated. More often the failure details are actually in stdout rather than stderr and the general error message is not very helpful.

I'm curious what exactly is being logged here. Is it possible to share an example?

syedahsn (Contributor, Author)

The error messages might not explain the root cause of the problem, but they will at least be a starting point for an investigation. This is an example where I manually deleted the cluster before the job was run:

[2023-06-26, 23:47:57 UTC] {emr.py:146} INFO - Adding steps to j-2D6XS5VTI1WGZ
[2023-06-26, 23:47:57 UTC] {emr.py:171} INFO - Steps ['s-08516095GAHC1SG5B4N'] added to JobFlow
[2023-06-26, 23:47:57 UTC] {waiter_with_logging.py:74} INFO - EMR Step status is: PENDING - {}
[2023-06-26, 23:48:36 UTC] {emr.py:188} ERROR - EMR Steps failed: {'Message': 'Job terminated'}
[2023-06-26, 23:48:36 UTC] {taskinstance.py:1916} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 68, in wait
    waiter.wait(**args, WaiterConfig={"MaxAttempts": 1})
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 375, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter StepComplete failed: Waiter encountered a terminal failure state: For expression "Step.Status.State" we matched expected path: "CANCELLED"

The message EMR Steps failed: {'Message': 'Job terminated'} is logged, which provides some indication of what happened.
If a job fails due to some issue with the job itself, then, like you said, that information will not be logged here.
I agree it's not much, but it's better than nothing. What do you think?

eladkal (Contributor) commented Jun 27, 2023

If a job fails due to some issue with the job itself, then, like you said, that information will not be logged here.
I agree it's not much, but it's better than nothing. What do you think?

Since the Airflow task failed, you already know the step encountered a terminal state.
Maybe worth consulting with @dacort to see if somehow we can bring a more informative message?

Either way, I'm OK with it as is; I was just curious to see what is logged in the message.

except AirflowException as ex:
    if "EMR Steps failed" in str(ex):
        # Fetch the step description so the failure reason makes it into the task log.
        resp = self.get_conn().describe_step(ClusterId=job_flow_id, StepId=step_id)
        self.log.error("EMR Steps failed: %s", resp["Step"]["Status"]["StateChangeReason"])
Contributor

Are there any other portions of the status that are worth logging in addition to the state change reason?

syedahsn (Contributor, Author)

Not really. In the documentation, there is supposed to be a FailureDetails field that should provide details, but I haven't seen that in the actual response body.

Contributor

When I have a failed job, it takes 3-4 minutes before the logs appear and the issue can be discovered. I wonder if the empty body reflects this: the field is queried too soon, before there is any information from the logs.

Contributor

The waiter is already querying this endpoint, so you're going to get the same info as what's already printed by the method, since you put "Step.Status.StateChangeReason" in the status args.
+1 to using the wait method, but -1 to this except block.
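For context, here is a rough sketch of how the step waiter is driven through the logging helper. The argument names are inferred from the waiter_with_logging.wait traceback above; treat the exact values and placeholder IDs as assumptions:

import boto3
from airflow.providers.amazon.aws.utils.waiter_with_logging import wait

emr_client = boto3.client("emr")
job_flow_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
step_id = "s-XXXXXXXXXXXXX"      # placeholder step ID

# Every poll logs the JMESPath expressions in status_args, so
# Step.Status.StateChangeReason is already printed on each attempt.
wait(
    waiter=emr_client.get_waiter("step_complete"),
    waiter_delay=30,        # assumed polling interval
    waiter_max_attempts=60, # assumed retry budget
    args={"ClusterId": job_flow_id, "StepId": step_id},
    failure_message="EMR Steps failed",
    status_message="EMR Step status is",
    status_args=["Step.Status.State", "Step.Status.StateChangeReason"],
)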

Contributor

Is there any other source for useful information you know of, @dacort?

Contributor

@o-nikolas FailureDetails should surface common reasons that a step failed, and it will do so before the logs get synced to S3. I'm not as familiar with StateChangeReason, but I'm not sure it will have details about why the job failed; I think it's more related to service-specific reasons why a step's state changed (e.g. somebody cancelled it). Here's an example of a job I ran with an invalid entrypoint file. FailureDetails is populated immediately, and the log file it points to didn't sync to S3 until 4 minutes after the step failed.

{
    "Step": {
        "Id": "s-00928142VC3M3WYAG8XN",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "FAILED",
            "StateChangeReason": {},
            "FailureDetails": {
                "Reason": "The given path is invalid.",
                "Message": "Exception in thread \"main\" java.io.FileNotFoundException: File file:/mnt/tmp/fail.py does not exist",
                "LogFile": "s3://aws-logs-568026268536-us-west-2/elasticmapreduce/j-2WPQL41AO1169/steps/s-00928142VC3M3WYAG8XN/stderr.gz"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:41:40.728000-07:00",
                "StartDateTime": "2023-07-18T09:41:57.422000-07:00",
                "EndDateTime": "2023-07-18T09:42:03.499000-07:00"
            }
        }
    }
}

And here's one where I cancelled the job before it ran with StateChangeReason populated.

{
    "Step": {
        "Id": "s-07516842NY2M4IHHH71T",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "CANCELLED",
            "StateChangeReason": {
                "Message": "Cancelled by user"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:48:45.141000-07:00"
            }
        }
    }
}
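Based on the two responses above, here is a small hedged sketch (not code from this PR) of how a caller could prefer FailureDetails and fall back to StateChangeReason:

def failure_reason(describe_step_response: dict) -> dict:
    """Pick the most informative failure info from an EMR describe_step response."""
    status = describe_step_response["Step"]["Status"]
    # FailureDetails is populated quickly for step-level errors (first example);
    # StateChangeReason covers service-side transitions such as a user cancel (second example).
    return status.get("FailureDetails") or status.get("StateChangeReason") or {}

For the first response this returns the FailureDetails block (reason, message, and log file); for the second it returns {'Message': 'Cancelled by user'}.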

syedahsn force-pushed the syedahsn/emr-logging-on-failure branch from 56adab5 to 39d3572 on August 2, 2023, 16:30