Explore options around specifying ways to react to step failures #15

Open
BenFradet opened this issue Mar 23, 2017 · 11 comments

Comments

@BenFradet
Contributor

See discussion in #11

@BenFradet BenFradet added this to the Version 0.3.0 milestone Mar 23, 2017
@alexanderdean
Member

The idea is that some step "failures" may indicate a no-op rather than a failure per se. For example, if your jobflow step simply checks whether there is new data in HBase or DynamoDB to process, then you probably do want the jobflow to terminate, but you don't want to bubble up an overall jobflow failure to PagerDuty.

It would be particularly interesting if we could somehow get different return codes from jobflow steps, so that we can distinguish between dynamodb_has_new_data.jar genuinely failing and it reporting that there is no new data to process.

@BenFradet
Contributor Author

From my tests, it seems the return code from a step isn't reported back up the EMR chain.

Cf. StepStateChangeReason's Code, which is always None.

@alexanderdean
Member

Is there any way of capturing a message?

@BenFradet
Contributor Author

I haven't dug into the message yet. It might be capturing stderr; I'll have to try that out.

What we do in EER is inspect which step resulted in a failure and, if it's a no-op-detecting step, respond appropriately. Unfortunately, that's not really generic.
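For illustration, the step-inspection approach described above could look roughly like the sketch below. This is not the EER code: the step names, the `StepResult` type, and the `classify_jobflow` function are all hypothetical, and the only assumption about EMR is that each step ends up in a state such as COMPLETED, FAILED, or CANCELLED.

```python
from dataclasses import dataclass

# Hypothetical: names of steps whose failure means "nothing to process".
# Hard-coding this set is exactly what makes the approach non-generic.
NOOP_DETECTING_STEPS = {"check_dir_empty"}


@dataclass
class StepResult:
    name: str
    state: str  # e.g. "COMPLETED", "FAILED", "CANCELLED"


def classify_jobflow(steps):
    """Return "SUCCESS", "NOOP", or "FAILURE" for a finished jobflow."""
    for step in steps:
        if step.state == "FAILED":
            # A failed no-op-detecting step is treated as a no-op,
            # any other failed step as a genuine failure.
            return "NOOP" if step.name in NOOP_DETECTING_STEPS else "FAILURE"
    return "SUCCESS"
```

A caller would then page someone only on "FAILURE" and stay quiet on "NOOP".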

@alexanderdean
Member

alexanderdean commented Mar 23, 2017

The generic version of what we do in EER is to add a Factotum-like behavior property to each jobflow step definition:

{
  "jarfile": "dynamodb_has_new_data.jar",
  "action_on_failure": "TERMINATE_WITH_FAILURE" <<default>> | "TERMINATE_WITH_SUCCESS"
}
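As a rough sketch of how a runner might interpret such a property: the function below maps a step definition and the step's final state to a process exit code. The `action_on_failure` key and its two values come from the proposal above; the function name, the exit codes, and the validation are assumptions made purely for illustration.

```python
DEFAULT_ACTION = "TERMINATE_WITH_FAILURE"
VALID_ACTIONS = {"TERMINATE_WITH_FAILURE", "TERMINATE_WITH_SUCCESS"}


def jobflow_exit_code(step_definition, step_state):
    """Map a step's final state to an exit code (0 = success, 1 = failure)."""
    # Fall back to the proposed default when the property is absent.
    action = step_definition.get("action_on_failure", DEFAULT_ACTION)
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action_on_failure: {action!r}")
    if step_state != "FAILED":
        return 0
    # The jobflow terminates either way; only the reported outcome differs.
    return 0 if action == "TERMINATE_WITH_SUCCESS" else 1
```

With this, a failing dynamodb_has_new_data.jar step marked TERMINATE_WITH_SUCCESS would stop the jobflow without surfacing a failure to monitoring.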

@BenFradet
Contributor Author

That, combined with a way to provide feedback (maybe through StepStateChangeReason's Message), would indeed solve our issue.

@alexanderdean
Member

Yes - fingers crossed for StepStateChangeReason's Message being usable!

@BenFradet
Contributor Author

BenFradet commented Mar 23, 2017

Unfortunately, EMR doesn't pick up anything from a script step: StateChangeReason's Code and Message both come back null.

{  
   "Step":{  
      "ActionOnFailure":"CANCEL_AND_WAIT",
      "Config":{  
         "Args":[  
            "s3://snowplow-hosted-assets-eu-central-1/common/emr/snowplow-check-dir-empty.sh",
            "s3://ben-test-output/processing/raw/"
         ],
         "Jar":"s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar",
         "MainClass":null,
         "Properties":{  

         }
      },
      "Id":"s-14DXZAQ9JXDYD",
      "Name":"Checking that s3://ben-test-output/processing/raw/ is empty",
      "Status":{  
         "FailureDetails":{  
            "LogFile":"s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
            "Message":null,
            "Reason":"Unknown Error."
         },
         "State":"FAILED",
         "StateChangeReason":{  
            "Code":null,
            "Message":null
         },
         "Timeline":{  
            "CreationDateTime":"2017-03-23T18:24:26Z",
            "EndDateTime":"2017-03-23T18:24:48Z",
            "StartDateTime":"2017-03-23T18:24:44Z"
         }
      }
   }
}

As a result, we could make do with TERMINATE_WITH_SUCCESS/TERMINATE_WITH_FAILURE, but we wouldn't have any feedback to give.
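To make the limitation concrete, here is a small sketch that pulls whatever feedback is available out of a describe-step response shaped like the one above. The field names are taken from that response; the `step_feedback` helper itself is hypothetical.

```python
def step_feedback(describe_step_response):
    """Collect the feedback fields EMR exposes for a single step."""
    status = describe_step_response["Step"]["Status"]
    # Either sub-object may be missing or null, so default to empty dicts.
    reason = status.get("StateChangeReason") or {}
    details = status.get("FailureDetails") or {}
    return {
        "state": status.get("State"),
        "code": reason.get("Code"),
        # Prefer the state-change message, fall back to the failure details.
        "message": reason.get("Message") or details.get("Message"),
        "log_file": details.get("LogFile"),
    }
```

Run against the response above, `code` and `message` are both `None`, so the only lead left is the `log_file` location in S3.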

@alexanderdean
Member

Shame!

@BenFradet BenFradet mentioned this issue Mar 24, 2017
12 tasks
@alexanderdean
Member

alexanderdean commented Apr 21, 2017

Pushing back as I actually think no-ops are a bit of a red herring and we are better off with #17...

@BenFradet
Contributor Author

pushing back

@BenFradet BenFradet removed this from the Version 0.4.0 milestone Jan 17, 2018