Explore options around specifying ways to react to step failures #15

BenFradet · 2017-03-23T09:56:52Z

See discussion in #11

alexanderdean · 2017-03-23T10:05:34Z

The idea is that some step "failures" may indicate a no-op rather than a failure per se. For example, if your jobflow step is designed simply to check whether there is new data in HBase or DynamoDB to process, then you probably do want the jobflow to terminate, but you don't want to bubble up an overall jobflow failure to your PagerDuty.

It would be particularly interesting if we could somehow get different return codes from jobflow steps, so that we can distinguish between dynamodb_has_new_data.jar failing and reporting no new data to process.

BenFradet · 2017-03-23T10:22:31Z

From my tests, it seems the return code from a step isn't reported back up the EMR chain.

Cf. StepStateChangeReason's Code, which is always None.

alexanderdean · 2017-03-23T10:24:05Z

Is there any way of capturing a message?

BenFradet · 2017-03-23T10:31:16Z

I haven't dug into the message yet. It might be capturing stderr; I'll have to try that out.

What we do in EER is inspect which step resulted in a failure and, if it's a no-op-detecting step, respond appropriately. Unfortunately, that's not really generic.

alexanderdean · 2017-03-23T10:34:11Z

The generic version of what we do in EER is to add a Factotum-like behavior property to each jobflow step definition:

{
  "jarfile": "dynamodb_has_new_data.jar",
  "action_on_failure": "TERMINATE_WITH_FAILURE" <<default>> | "TERMINATE_WITH_SUCCESS"
}
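To make the proposal concrete, here is a minimal sketch of how a runner might turn that property into an overall jobflow result. Everything in it is an assumption for illustration: `StepDefinition`, `jobflow_outcome`, the "SUCCESS"/"FAILURE" return values, and the idea that step states arrive as a plain mapping are hypothetical and not part of any existing code.

```python
from dataclasses import dataclass

# Hypothetical values mirroring the proposal above; not an existing API.
TERMINATE_WITH_FAILURE = "TERMINATE_WITH_FAILURE"
TERMINATE_WITH_SUCCESS = "TERMINATE_WITH_SUCCESS"


@dataclass
class StepDefinition:
    jarfile: str
    action_on_failure: str = TERMINATE_WITH_FAILURE  # the proposed default


def jobflow_outcome(steps, states):
    """Return "SUCCESS" or "FAILURE" for the whole jobflow.

    `steps` are StepDefinitions in execution order; `states` maps each
    jarfile to the state reported for it (e.g. "COMPLETED", "FAILED").
    """
    for step in steps:
        if states.get(step.jarfile) == "FAILED":
            # The first failed step ends the jobflow either way; the
            # property only decides what gets reported upstream.
            if step.action_on_failure == TERMINATE_WITH_SUCCESS:
                return "SUCCESS"
            return "FAILURE"
    return "SUCCESS"
```

With `dynamodb_has_new_data.jar` marked `TERMINATE_WITH_SUCCESS`, a failure of that step would be reported as an overall success, while a failure of any step left on the default would still surface as a jobflow failure.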

BenFradet · 2017-03-23T10:49:28Z

That combined with a way to provide feedback (maybe through StepStateChangeReason's Message) would solve our issue, indeed.

alexanderdean · 2017-03-23T10:56:35Z

Yes - fingers crossed for StepStateChangeReason's Message being usable!

BenFradet · 2017-03-23T18:27:47Z

Unfortunately, EMR doesn't pick up anything from a script step.

{  
   "Step":{  
      "ActionOnFailure":"CANCEL_AND_WAIT",
      "Config":{  
         "Args":[  
            "s3://snowplow-hosted-assets-eu-central-1/common/emr/snowplow-check-dir-empty.sh",
            "s3://ben-test-output/processing/raw/"
         ],
         "Jar":"s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar",
         "MainClass":null,
         "Properties":{  

         }
      },
      "Id":"s-14DXZAQ9JXDYD",
      "Name":"Checking that s3://ben-test-output/processing/raw/ is empty",
      "Status":{  
         "FailureDetails":{  
            "LogFile":"s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
            "Message":null,
            "Reason":"Unknown Error."
         },
         "State":"FAILED",
         "StateChangeReason":{  
            "Code":null,
            "Message":null
         },
         "Timeline":{  
            "CreationDateTime":"2017-03-23T18:24:26Z",
            "EndDateTime":"2017-03-23T18:24:48Z",
            "StartDateTime":"2017-03-23T18:24:44Z"
         }
      }
   }
}

As a result, we could make do with TERMINATE_WITH_SUCCESS/TERMINATE_WITH_FAILURE, but we wouldn't have any feedback to give.
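For reference, a small sketch of what can still be pulled out of a step description shaped like the one above. The function name `step_feedback` and the output keys are hypothetical; it only reads fields that appear in that response and drops the ones that are null.

```python
def step_feedback(description):
    """Collect whatever non-empty feedback a step description carries.

    `description` is a dict shaped like the response shown above.
    Only fields that are present and non-null are returned.
    """
    status = description.get("Step", {}).get("Status", {})
    reason = status.get("StateChangeReason") or {}
    details = status.get("FailureDetails") or {}
    candidates = {
        "code": reason.get("Code"),
        "message": reason.get("Message") or details.get("Message"),
        "reason": details.get("Reason"),
        "log_file": details.get("LogFile"),
    }
    return {key: value for key, value in candidates.items() if value}
```

For the failed script step above, this would leave only the generic "Unknown Error." reason and the LogFile location, so any real feedback would presumably have to be read from the step logs rather than from the status itself.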

alexanderdean · 2017-03-23T18:58:50Z

Shame!

alexanderdean · 2017-04-21T17:17:22Z

Pushing back as I actually think no-ops are a bit of a red herring and we are better off with #17...

BenFradet · 2018-01-17T09:22:34Z

Pushing back.

BenFradet added this to the Version 0.3.0 milestone Mar 23, 2017

BenFradet mentioned this issue Mar 24, 2017

0.2.0 #12

Merged

12 tasks

alexanderdean modified the milestones: Version 0.4.0, Version 0.3.0 Apr 21, 2017

BenFradet removed this from the Version 0.4.0 milestone Jan 17, 2018
