
[mapreduce] An agent that reports MapReduce metrics #2236

Merged
1 commit merged into DataDog:master on Feb 12, 2016

Conversation

zachradtka
Contributor

An agent check that gathers metrics for MapReduce applications.

The metrics collected are:

MapReduce Job Metrics
---------------------
mapreduce.job.elapsed_time               The elapsed time since the application started (in ms)
mapreduce.job.maps_total                 The total number of maps
mapreduce.job.maps_completed             The number of completed maps
mapreduce.job.reduces_total              The total number of reduces
mapreduce.job.reduces_completed          The number of completed reduces
mapreduce.job.maps_pending               The number of maps still to be run
mapreduce.job.maps_running               The number of running maps
mapreduce.job.reduces_pending            The number of reduces still to be run
mapreduce.job.reduces_running            The number of running reduces
mapreduce.job.new_reduce_attempts        The number of new reduce attempts
mapreduce.job.running_reduce_attempts    The number of running reduce attempts
mapreduce.job.failed_reduce_attempts     The number of failed reduce attempts
mapreduce.job.killed_reduce_attempts     The number of killed reduce attempts
mapreduce.job.successful_reduce_attempts The number of successful reduce attempts
mapreduce.job.new_map_attempts           The number of new map attempts
mapreduce.job.running_map_attempts       The number of running map attempts
mapreduce.job.failed_map_attempts        The number of failed map attempts
mapreduce.job.killed_map_attempts        The number of killed map attempts
mapreduce.job.successful_map_attempts    The number of successful map attempts
MapReduce Job Counter Metrics
-----------------------------
mapreduce.job.counter.reduce_counter_value   The counter value of reduce tasks
mapreduce.job.counter.map_counter_value      The counter value of map tasks
mapreduce.job.counter.total_counter_value    The counter value of all tasks
MapReduce Task Metrics
----------------------
mapreduce.job.task.progress         The progress of the task as a percent
mapreduce.job.task.elapsed_time     The elapsed time since the application started (in ms)
mapreduce.job.task.state.new        The total number of new tasks
mapreduce.job.task.state.scheduled  The total number of tasks with a scheduled state
mapreduce.job.task.state.running    The total number of tasks with a running state
mapreduce.job.task.state.succeeded  The total number of tasks with a succeeded state
mapreduce.job.task.state.failed     The total number of tasks with a failed state
mapreduce.job.task.state.killWait   The total number of tasks with a kill_wait state
mapreduce.job.task.state.killed     The total number of tasks with a killed state
mapreduce.job.task.type.map         The total number of map tasks
mapreduce.job.task.type.reduce      The total number of reduce tasks
MapReduce Task Counter Metrics
------------------------------
mapreduce.job.task.counter.map_input_records        The number of input records of a map task
mapreduce.job.task.counter.map_output_records       The number of output records of a map task
mapreduce.job.task.counter.combine_input_records    The number of input records of a combine task
mapreduce.job.task.counter.combine_output_records   The number of output records of a combine task
mapreduce.job.task.counter.reduce_input_records     The number of input records of a reduce task
mapreduce.job.task.counter.reduce_output_records    The number of output records of a reduce task

Authors:
@zachradtka
@wjsl

'''
# Initialize the state and type counts
state_counts = {state: 0 for state in MAPREDUCE_TASK_STATE_METRICS}
type_counts = {task_type: 0 for task_type in MAPREDUCE_TASK_TYPE_METRICS}
Member

Just a quick comment: the default test fails here on python2.6 because of the dict comprehension that's only in python since 2.7.

You can probably initialize these vars to defaultdicts instead (from the collections module).
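A minimal sketch of that suggestion, using stand-in values for the check's constants (the real lists live in the check file):

  from collections import defaultdict

  # Stand-ins for the check's constants, for illustration only
  MAPREDUCE_TASK_STATE_METRICS = ['new', 'scheduled', 'running', 'succeeded']
  MAPREDUCE_TASK_TYPE_METRICS = ['map', 'reduce']

  # defaultdict(int) returns 0 for any unseen key, so the counts need
  # no up-front initialization and no dict comprehension (2.7+ only)
  state_counts = defaultdict(int)
  type_counts = defaultdict(int)

  # A dict() call over a generator expression is another 2.6-safe option
  state_counts = dict((state, 0) for state in MAPREDUCE_TASK_STATE_METRICS)
  type_counts = dict((t, 0) for t in MAPREDUCE_TASK_TYPE_METRICS)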

@zachradtka
Contributor Author

The mapreduce agent has the following metrics:

MapReduce Job Metrics
---------------------
mapreduce.job.elapsed_time               The elapsed time since the application started (in ms)
mapreduce.job.maps_total                 The total number of maps
mapreduce.job.maps_completed             The number of completed maps
mapreduce.job.reduces_total              The total number of reduces
mapreduce.job.reduces_completed          The number of completed reduces
mapreduce.job.maps_pending               The number of maps still to be run
mapreduce.job.maps_running               The number of running maps
mapreduce.job.reduces_pending            The number of reduces still to be run
mapreduce.job.reduces_running            The number of running reduces
mapreduce.job.new_reduce_attempts        The number of new reduce attempts
mapreduce.job.running_reduce_attempts    The number of running reduce attempts
mapreduce.job.failed_reduce_attempts     The number of failed reduce attempts
mapreduce.job.killed_reduce_attempts     The number of killed reduce attempts
mapreduce.job.successful_reduce_attempts The number of successful reduce attempts
mapreduce.job.new_map_attempts           The number of new map attempts
mapreduce.job.running_map_attempts       The number of running map attempts
mapreduce.job.failed_map_attempts        The number of failed map attempts
mapreduce.job.killed_map_attempts        The number of killed map attempts
mapreduce.job.successful_map_attempts    The number of successful map attempts

MapReduce Job Counter Metrics
-----------------------------
mapreduce.job.counter.reduce_counter_value   The counter value of reduce tasks
mapreduce.job.counter.map_counter_value      The counter value of map tasks
mapreduce.job.counter.total_counter_value    The counter value of all tasks

MapReduce Map Task Metrics
--------------------------
mapreduce.job.map.task.progress     The distribution of progress across all map tasks
mapreduce.job.map.task.elapsed_time The distribution of elapsed time across all map tasks

MapReduce Reduce Task Metrics
-----------------------------
mapreduce.job.reduce.task.progress      The distribution of progress across all reduce tasks
mapreduce.job.reduce.task.elapsed_time  The distribution of elapsed time across all reduce tasks

The task metrics are histograms, sampled from each task running under a job.

The job counter metrics are custom metrics that the user must specify in the init_config section of mapreduce.yaml.

As an example, to get metrics for the counter with name HDFS_BYTES_WRITTEN for the counter group org.apache.hadoop.mapreduce.FileSystemCounter, the init_config section would look like:

  counter_metrics:
    - job_name: 'Foo'
      metrics:
        - counter_group_name: 'org.apache.hadoop.mapreduce.FileSystemCounter'
          counters:
            - counter_name: 'HDFS_BYTES_WRITTEN'

'user_name:' + str(user_name),
'job_name:' + str(job_name)]

self._set_metrics_from_json(tags, job_json, MAPREDUCE_JOB_METRICS)
Member

Actually there's still an issue here: we're sending the MAPREDUCE_JOB_METRICS (all gauges) for each job (i.e. for each job_id), but we only tag by user_name/job_name, which means that these metrics get overwritten when multiple jobs share the same job_name.

This means that once the check has run, for a given job_name, the values of these metrics will be the values from the last job in the list that has this job_name.

Member

Instead of this we should pre-aggregate the metrics for each job_name, probably using histograms (though for some of these metrics we could compute only the average or the max/min per job_name instead of a whole histogram, depending on what's interesting for each metric).
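
A rough sketch of that approach; the {json_field: metric_name} shape of MAPREDUCE_JOB_METRICS is assumed here for illustration, and histogram() is the standard AgentCheck reporting method:

  # Report each job's values as histogram samples so that several jobs
  # sharing a job_name contribute samples instead of overwriting a gauge
  tags = ['user_name:' + str(user_name), 'job_name:' + str(job_name)]
  for json_field, metric_name in MAPREDUCE_JOB_METRICS.iteritems():
      value = job_json.get(json_field)
      if value is not None:
          # histogram() emits avg/median/max/95percentile/count series
          self.histogram(metric_name, value, tags=tags)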

Contributor Author

Is there a way with histograms to only produce max, avg, median, etc.?

Contributor Author

OK, I have updated all of the Job metrics to be Histograms.

Member

There's no built-in way to do that in the agent actually, so let's keep it simple and use normal histograms. Thanks!

@olivielpeau
Member

@zachradtka Just added a few comments on the check (see above); unfortunately there's still a tagging/aggregation issue on the metrics. Feel free to reach out if you need more details.

@zachradtka
Contributor Author

@olivielpeau I updated the Agent. It tracks applications and jobs by app ID and job ID respectively. I also updated the agent to utilize histograms for all of the checks. Please let me know if there is anything else I should change.

@olivielpeau
Member

Thanks @zachradtka! I'll make a few comments on some details, and we may have more comments once we've tested the check on our environment, but overall the check looks good.

# Get the running MR applications from YARN
running_apps = self._get_running_app_ids(rm_address,
                                         states=YARN_APPLICATION_STATES,
                                         applicationTypes=YARN_APPLICATION_TYPES)
Member

Nitpick: could you move the usage of these parameters further down, inside the _get_running_app_ids method, on the _rest_request_to_json function call?
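
A sketch of that refactor; YARN_APPS_PATH is an assumed constant for the cluster apps endpoint:

  # Caller no longer threads the YARN filter parameters through
  running_apps = self._get_running_app_ids(rm_address)

  def _get_running_app_ids(self, rm_address):
      # The filters now live next to the request that actually uses them
      metrics_json = self._rest_request_to_json(
          rm_address,
          YARN_APPS_PATH,  # assumed constant, e.g. 'ws/v1/cluster/apps'
          states=YARN_APPLICATION_STATES,
          applicationTypes=YARN_APPLICATION_TYPES)
      # ... extract and return the app IDs from metrics_json ...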

@olivielpeau
Member

Only had one comment for now, we'll get back to you once we've tested the check, thanks!

@zachradtka
Contributor Author

OK, made that quick change.

#

counter_metrics:
# - job_name: 'Foo'
Member

Could you add a comment here in the yaml file with a link to the related hadoop doc page? (probably https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html#Job_Counters_API)

@olivielpeau
Member

@zachradtka: I have some feedback from the team that's going to use the check:

  • do you think it would make sense for some of the counter metrics to be retrieved by default? Specifically we were thinking about the ones that are useful for judging the work performed (bytes read/written, input/output records).
  • could you add a config option that allows configuring a counter_metrics entry for all the job names instead of just for one name?

The team is going to test the check today, I'll get back to you with their feedback on that. Thanks!
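
For the second bullet, one hypothetical config shape, mirroring the existing counter_metrics example (the general_counters key name is illustrative, not confirmed in this thread):

  init_config:
    # Counters collected for every job, regardless of job_name
    general_counters:
      - counter_group_name: 'org.apache.hadoop.mapreduce.TaskCounter'
        counters:
          - counter_name: 'MAP_INPUT_RECORDS'
          - counter_name: 'MAP_OUTPUT_RECORDS'
          - counter_name: 'REDUCE_INPUT_RECORDS'
          - counter_name: 'REDUCE_OUTPUT_RECORDS'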

@zachradtka
Contributor Author

Ha, I originally had it written to be job agnostic and get all the specified counters from all jobs, but decided that having more granularity was better. I do like the option of having universal counters that are collected for every job and configurable in the counter_metrics section. And yes, I can do that.

Is this feature a blocker for testing, or can you test without it and I can add it later today?

@olivielpeau
Member

Thanks for your answers. It's not a blocker, we're going to test the check as-is today.

@zachradtka
Contributor Author

I updated the agent to include job-agnostic counter metrics configurable in the mapreduce.yaml. I also added some comments in the yaml file explaining how to configure the agent.

@olivielpeau modified the milestones: 5.7.0, Triage on Feb 12, 2016
# Service Check Names
SERVICE_CHECK = 'mapreduce.cluster.can_connect'
YARN_SERVICE_CHECK = 'mapreduce.resource_manager.can_connect'
MAPREDUCE_SERVICE_CHECK = 'mapreduce.application_master.can_connect'
Member

Actually, I hadn't noticed that there were different service checks when I suggested only sending an OK once.

The check should actually send an OK service check for each kind of service when all the requests to that service have been made successfully.

So from what I see in the current state of the check, we should probably:

  • remove the SERVICE_CHECK service check altogether (given that we only ever send OKs on it)
  • send an OK YARN_SERVICE_CHECK service check right after self._get_running_app_ids, because, at this point of the check, all the requests made to the resource manager have been successful
  • send an OK MAPREDUCE_SERVICE_CHECK at the end of the main check method, because it's only at that point of the check that we know that all the requests to the application master have been successful.

Does that sound sensible to you? If so could you make these changes and add a small comment on each OK service check explaining why we're sending the service check at that point of the check?

Thanks!
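
A sketch of where those OK service checks would land in the main check method (am_address and the tag values are illustrative):

  def check(self, instance):
      # ... resolve rm_address from the instance config ...

      running_apps = self._get_running_app_ids(rm_address)
      # Every request to the resource manager has succeeded by this point
      self.service_check(YARN_SERVICE_CHECK,
                         AgentCheck.OK,
                         tags=['url:%s' % rm_address])

      # ... gather job, task and counter metrics via the app masters ...

      # Only here do we know all application master requests succeeded
      self.service_check(MAPREDUCE_SERVICE_CHECK,
                         AgentCheck.OK,
                         tags=['url:%s' % am_address])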

Set a metric
'''
if metric_type == GAUGE:
    self.gauge(metric_name, value, tags=tags, device_name=device_name)
Member

Let's remove these two lines as we're not using them anymore.

We'll add them back later if we need them.

@olivielpeau
Member

@zachradtka Went through the check and made a few comments, tell me if anything needs to be clarified. Thanks!

@zachradtka
Contributor Author

@olivielpeau I made the changes, squashed my changes, rebased with master and pushed. Let me know if there is anything else I can do.

Thanks for the helpful suggestions!

# Report success after gathering all metrics from the ResourceManager
self.service_check(YARN_SERVICE_CHECK,
                   AgentCheck.OK,
                   tags=['resourcemanager_url:%s' % rm_address],
Member

Given that we're using a url tag in the _rest_request_to_json method, we should also use url here to be consistent. (The rationale being that if we send a CRITICAL service check with a given service check name and tags on a check run, we want the OK service check that's sent when things go back to normal on the same service check name to have the same tags, so that scoping service checks with tags works as expected on the web UI).
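
With that change, the OK above would use the same tag key as the failure path (sketch):

  # Same 'url' tag key as the CRITICAL service checks sent on request
  # failures, so OK and CRITICAL series scope identically in the web UI
  self.service_check(YARN_SERVICE_CHECK,
                     AgentCheck.OK,
                     tags=['url:%s' % rm_address])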

@olivielpeau
Member

Added 3 comments. Don't worry about the one on the service check assertions; it's a nice-to-have but not mandatory at all.

Thanks!

@olivielpeau self-assigned this on Feb 12, 2016
@zachradtka
Contributor Author

OK, all requirements have been addressed.

@zachradtka
Contributor Author

Squashed and pushed!

@olivielpeau
Member

Looks good, thanks @zachradtka! I'll merge once the CI passes.

@olivielpeau
Member

Merging! 🎉

olivielpeau added a commit that referenced this pull request on Feb 12, 2016:
[mapreduce] An agent that reports MapReduce metrics
@olivielpeau merged commit 518f3d3 into DataDog:master on Feb 12, 2016