Explore options around specifying ways to react to step failures #15
Comments
The idea is that some step "failures" may indicate a no-op rather than a failure per se. For example, if your jobflow step simply checks whether there is new data in HBase or DynamoDB to process, then you probably do want the jobflow to terminate, but you don't want to bubble up an overall jobflow failure to PagerDuty. It would be particularly interesting if we could somehow get different return codes from jobflow steps, so that we could distinguish between dynamodb_has_new_data.jar failing and dynamodb_has_new_data.jar reporting that there is no new data to process.
From my tests, it seems the return code from a step isn't reported back up the EMR chain; cf. StepStateChangeReason's Code, which is always None.
Is there any way of capturing a message?
I haven't dug into the message; it might be capturing stderr, so I'll have to try that out. What we do in EER is inspect which step resulted in a failure and, if it's a no-op-detecting step, respond appropriately. Unfortunately, that's not really generic.
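The step-inspection approach described above could be sketched roughly as follows. This is a minimal illustration, not EER's actual code: it assumes step descriptions shaped like EMR step JSON (a `Name` and a `Status.State`), and the `NOOP_STEP_NAMES` set, the step names and the `classify_jobflow` helper are all hypothetical.

```python
# Hypothetical: names of steps whose failure means "nothing to do".
NOOP_STEP_NAMES = {"Check for new data"}


def classify_jobflow(steps):
    """Return "success", "noop" or "failure" for a list of step dicts.

    Each dict is assumed to carry "Name" and "Status" -> "State",
    mirroring the shape of an EMR step description.
    """
    for step in steps:
        if step["Status"]["State"] == "FAILED":
            # A failed no-op-detecting step is not a real failure.
            if step["Name"] in NOOP_STEP_NAMES:
                return "noop"
            return "failure"
    return "success"


steps = [
    {"Name": "Check for new data", "Status": {"State": "FAILED"}},
    {"Name": "Process data", "Status": {"State": "CANCELLED"}},
]
print(classify_jobflow(steps))
```

The hard-coded set of step names is exactly what makes this approach non-generic: every new no-op-detecting step has to be registered by hand.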
The generic version of what we do in EER is to add a Factotum-like behavior property to each jobflow step definition:
{
  "jarfile": "dynamodb_has_new_data.jar",
  "action_on_failure": "TERMINATE_WITH_FAILURE" <<default>> | "TERMINATE_WITH_SUCCESS"
}
That, combined with a way to provide feedback (maybe through StepStateChangeReason's Message), would indeed solve our issue.
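One way the proposed property could be interpreted is sketched below. The `action_on_failure` key and its two values come from the proposal above; the `jobflow_outcome` function, the `(step_definition, state)` pairing and the state strings are assumptions made for illustration only.

```python
DEFAULT_ACTION = "TERMINATE_WITH_FAILURE"  # the default, per the proposal


def jobflow_outcome(results):
    """Decide the overall outcome from (step_definition, state) pairs.

    `state` is assumed to be an EMR-style step state such as
    "COMPLETED" or "FAILED". The first failed step decides the result.
    """
    for definition, state in results:
        if state != "FAILED":
            continue
        action = definition.get("action_on_failure", DEFAULT_ACTION)
        if action == "TERMINATE_WITH_SUCCESS":
            return "success"  # terminate, but do not raise an alert
        return "failure"
    return "success"


check = {
    "jarfile": "dynamodb_has_new_data.jar",
    "action_on_failure": "TERMINATE_WITH_SUCCESS",
}
print(jobflow_outcome([(check, "FAILED")]))
```

In both cases the jobflow terminates; the property only changes whether that termination is reported as a failure.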
Yes - fingers crossed for StepStateChangeReason's Message being usable!
Unfortunately, EMR doesn't pick up anything from a script step:
{
  "Step": {
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "Config": {
      "Args": [
        "s3://snowplow-hosted-assets-eu-central-1/common/emr/snowplow-check-dir-empty.sh",
        "s3://ben-test-output/processing/raw/"
      ],
      "Jar": "s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar",
      "MainClass": null,
      "Properties": {}
    },
    "Id": "s-14DXZAQ9JXDYD",
    "Name": "Checking that s3://ben-test-output/processing/raw/ is empty",
    "Status": {
      "FailureDetails": {
        "LogFile": "s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
        "Message": null,
        "Reason": "Unknown Error."
      },
      "State": "FAILED",
      "StateChangeReason": {
        "Code": null,
        "Message": null
      },
      "Timeline": {
        "CreationDateTime": "2017-03-23T18:24:26Z",
        "EndDateTime": "2017-03-23T18:24:48Z",
        "StartDateTime": "2017-03-23T18:24:44Z"
      }
    }
  }
}
As a result, we could make do with terminate_success/terminate_failure, but we wouldn't have any feedback to give.
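To check whether a failed step carries any usable feedback, a response like the one above can be probed along these lines. This is a minimal sketch: the field names are taken from the JSON above, the `failure_feedback` helper is hypothetical, and the sample is trimmed to the relevant keys.

```python
import json


def failure_feedback(response):
    """Return (message, log_file) for a step description.

    Looks at StateChangeReason.Message first, then
    FailureDetails.Message; message is None when both are null.
    """
    status = response["Step"]["Status"]
    reason = status.get("StateChangeReason") or {}
    details = status.get("FailureDetails") or {}
    message = reason.get("Message") or details.get("Message")
    return message, details.get("LogFile")


sample = json.loads("""
{"Step": {"Status": {
  "FailureDetails": {
    "LogFile": "s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
    "Message": null,
    "Reason": "Unknown Error."
  },
  "State": "FAILED",
  "StateChangeReason": {"Code": null, "Message": null}
}}}
""")
message, log_file = failure_feedback(sample)
print(message)
print(log_file)
```

For this response the message is empty, so the only remaining lead is the LogFile location.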
Shame!
Pushing back as I actually think no-ops are a bit of a red herring and we are better off with #17...
Pushing back
See discussion in #11