Get failure information on EMR job failure #32151

Merged (6 commits) on Aug 8, 2023

Conversation

syedahsn (Contributor)

Currently, the logs give no information about the progress of a scheduled EMR job. This PR uses the new wait util method to log the status of EMR jobs, and, if a job fails, calls the describe_step method to log the reason for the failure.


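To make the description concrete, here is a minimal sketch of the idea in plain boto3 (not the provider code itself; the client setup, placeholder IDs, and waiter config are illustrative assumptions):

import boto3
from botocore.exceptions import WaiterError

emr = boto3.client("emr")
job_flow_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
step_id = "s-XXXXXXXXXXXXX"      # placeholder step ID

# Poll the step until it reaches a terminal state.
waiter = emr.get_waiter("step_complete")
try:
    waiter.wait(
        ClusterId=job_flow_id,
        StepId=step_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},
    )
except WaiterError:
    # On failure, describe the step so the reason lands in the logs.
    status = emr.describe_step(ClusterId=job_flow_id, StepId=step_id)["Step"]["Status"]
    print("EMR Steps failed:", status.get("StateChangeReason"))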

boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels on Jun 26, 2023
eladkal (Contributor) commented Jun 26, 2023

and if a job fails, it calls the describe_step method to log the reason for the job failure.

My experience with EMR is that it takes time for logs to be populated. More often the failure details are actually in stdout rather than stderr and the general error message is not very helpful.

I'm curious what exactly is being logged here. Is it possible to share an example?

syedahsn (Contributor, Author)

The error messages might not explain the root cause of the problem, but they will at least be a starting point for an investigation. This is an example where I manually deleted the cluster before the job was run:

[2023-06-26, 23:47:57 UTC] {emr.py:146} INFO - Adding steps to j-2D6XS5VTI1WGZ
[2023-06-26, 23:47:57 UTC] {emr.py:171} INFO - Steps ['s-08516095GAHC1SG5B4N'] added to JobFlow
[2023-06-26, 23:47:57 UTC] {waiter_with_logging.py:74} INFO - EMR Step status is: PENDING - {}
[2023-06-26, 23:48:36 UTC] {emr.py:188} ERROR - EMR Steps failed: {'Message': 'Job terminated'}
[2023-06-26, 23:48:36 UTC] {taskinstance.py:1916} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 68, in wait
    waiter.wait(**args, WaiterConfig={"MaxAttempts": 1})
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 375, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter StepComplete failed: Waiter encountered a terminal failure state: For expression "Step.Status.State" we matched expected path: "CANCELLED"

The message EMR Steps failed: {'Message': 'Job terminated'} is logged, which provides some indication of what happened.
If a job fails due to some issue with the job itself, then, like you said, that information will not be logged here.
I agree it's not much, but it's better than nothing. What do you think?

eladkal (Contributor) commented Jun 27, 2023

If a job fails due to some issue with the job itself, then, like you said, that information will not be logged here.
I agree it's not much, but it's better than nothing. What do you think?

Since the Airflow task failed, you already know the step encountered a terminal state.
Maybe worth consulting with @dacort to see if somehow we can bring a more informative message?

Either way, I'm OK with it as is; I was just curious to see what is logged in the message.

except AirflowException as ex:
    if "EMR Steps failed" in str(ex):
        # Fetch the step description so the failure reason makes it into the task log.
        resp = self.get_conn().describe_step(ClusterId=job_flow_id, StepId=step_id)
        self.log.error("EMR Steps failed: %s", resp["Step"]["Status"]["StateChangeReason"])
Contributor

Are there any other portions of the status that are worth logging in addition to the state change reason?

syedahsn (Contributor, Author)

Not really. In the documentation, there is supposed to be a FailureDetails field that should provide details, but I haven't seen that in the actual response body.

Contributor

When I have a failed job, it takes 3-4 minutes before the logs appear and the issue can be discovered. I wonder if the empty body reflects this: the field is queried too soon, before there is any information from the logs.

Contributor

The waiter is already querying this endpoint, so you're going to get the same info as what's already printed by the method, since you put "Step.Status.StateChangeReason" in the status args.
+1 to using the wait method, but -1 to this except block.
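For context, here is a rough sketch of how the step waiter is driven through the logging helper. The argument names are inferred from the waiter_with_logging.wait traceback above; treat the exact values and placeholder IDs as assumptions:

import boto3
from airflow.providers.amazon.aws.utils.waiter_with_logging import wait

emr_client = boto3.client("emr")
job_flow_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
step_id = "s-XXXXXXXXXXXXX"      # placeholder step ID

# Every poll logs the JMESPath expressions in status_args, so
# Step.Status.StateChangeReason is already printed on each attempt.
wait(
    waiter=emr_client.get_waiter("step_complete"),
    waiter_delay=30,        # assumed polling interval
    waiter_max_attempts=60, # assumed retry budget
    args={"ClusterId": job_flow_id, "StepId": step_id},
    failure_message="EMR Steps failed",
    status_message="EMR Step status is",
    status_args=["Step.Status.State", "Step.Status.StateChangeReason"],
)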

Contributor

Is there any other source for useful information you know of, @dacort?

Contributor

@o-nikolas FailureDetails should surface common reasons that a step failed, and it will do so before the logs get synced to S3. I'm not as familiar with StateChangeReason, but I'm not sure it will have details about why the job failed; I think it's more related to service-specific reasons why a step's state changed (e.g. somebody cancelled it). Here's an example of a job I ran with an invalid entrypoint file. FailureDetails is populated immediately, and the log file it points to didn't sync to S3 until 4 minutes after the step failed.

{
    "Step": {
        "Id": "s-00928142VC3M3WYAG8XN",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "FAILED",
            "StateChangeReason": {},
            "FailureDetails": {
                "Reason": "The given path is invalid.",
                "Message": "Exception in thread \"main\" java.io.FileNotFoundException: File file:/mnt/tmp/fail.py does not exist",
                "LogFile": "s3://aws-logs-568026268536-us-west-2/elasticmapreduce/j-2WPQL41AO1169/steps/s-00928142VC3M3WYAG8XN/stderr.gz"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:41:40.728000-07:00",
                "StartDateTime": "2023-07-18T09:41:57.422000-07:00",
                "EndDateTime": "2023-07-18T09:42:03.499000-07:00"
            }
        }
    }
}

And here's one where I cancelled the job before it ran with StateChangeReason populated.

{
    "Step": {
        "Id": "s-07516842NY2M4IHHH71T",
        "Name": "test-fail",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "/tmp/fail.py"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "CANCELLED",
            "StateChangeReason": {
                "Message": "Cancelled by user"
            },
            "Timeline": {
                "CreationDateTime": "2023-07-18T09:48:45.141000-07:00"
            }
        }
    }
}
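Based on the two responses above, here is a small hedged sketch (not code from this PR) of how a caller could prefer FailureDetails and fall back to StateChangeReason:

def failure_reason(describe_step_response: dict) -> dict:
    """Pick the most informative failure info from an EMR describe_step response."""
    status = describe_step_response["Step"]["Status"]
    # FailureDetails is populated quickly for step-level errors (first example);
    # StateChangeReason covers service-side transitions such as a user cancel (second example).
    return status.get("FailureDetails") or status.get("StateChangeReason") or {}

For the first response this returns the FailureDetails block (reason, message, and log file); for the second it returns {'Message': 'Cancelled by user'}.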

syedahsn force-pushed the syedahsn/emr-logging-on-failure branch from 56adab5 to 39d3572 on August 2, 2023, 16:30