
[mapreduce] An agent that reports MapReduce metrics #2236

Merged
1 commit merged into DataDog:master on Feb 12, 2016

Conversation

zachradtka
Contributor

An agent check that gathers metrics for MapReduce applications.

The metrics collected are:

MapReduce Job Metrics
---------------------
mapreduce.job.elapsed_time               The elapsed time since the application started (in ms)
mapreduce.job.maps_total                 The total number of maps
mapreduce.job.maps_completed             The number of completed maps
mapreduce.job.reduces_total              The total number of reduces
mapreduce.job.reduces_completed          The number of completed reduces
mapreduce.job.maps_pending               The number of maps still to be run
mapreduce.job.maps_running               The number of running maps
mapreduce.job.reduces_pending            The number of reduces still to be run
mapreduce.job.reduces_running            The number of running reduces
mapreduce.job.new_reduce_attempts        The number of new reduce attempts
mapreduce.job.running_reduce_attempts    The number of running reduce attempts
mapreduce.job.failed_reduce_attempts     The number of failed reduce attempts
mapreduce.job.killed_reduce_attempts     The number of killed reduce attempts
mapreduce.job.successful_reduce_attempts The number of successful reduce attempts
mapreduce.job.new_map_attempts           The number of new map attempts
mapreduce.job.running_map_attempts       The number of running map attempts
mapreduce.job.failed_map_attempts        The number of failed map attempts
mapreduce.job.killed_map_attempts        The number of killed map attempts
mapreduce.job.successful_map_attempts    The number of successful map attempts
MapReduce Job Counter Metrics
-----------------------------
mapreduce.job.counter.reduce_counter_value   The counter value of reduce tasks
mapreduce.job.counter.map_counter_value      The counter value of map tasks
mapreduce.job.counter.total_counter_value    The counter value of all tasks
MapReduce Task Metrics
----------------------
mapreduce.job.task.progress         The progress of the task as a percent
mapreduce.job.task.elapsed_time     The elapsed time since the application started (in ms)
mapreduce.job.task.state.new        The total number of new tasks
mapreduce.job.task.state.scheduled  The total number of tasks with a scheduled state
mapreduce.job.task.state.running    The total number of tasks with a running state
mapreduce.job.task.state.succeeded  The total number of tasks with a succeeded state
mapreduce.job.task.state.failed     The total number of tasks with a failed state
mapreduce.job.task.state.killWait   The total number of tasks with a kill_wait state
mapreduce.job.task.state.killed     The total number of tasks with a killed state
mapreduce.job.task.type.map         The total number of map tasks
mapreduce.job.task.type.reduce      The total number of reduce tasks
MapReduce Task Counter Metrics
------------------------------
mapreduce.job.task.counter.map_input_records        The number of input records of a map task
mapreduce.job.task.counter.map_output_records       The number of output records of a map task
mapreduce.job.task.counter.combine_input_records    The number of input records of a combine task
mapreduce.job.task.counter.combine_output_records   The number of output records of a combine task
mapreduce.job.task.counter.reduce_input_records     The number of input records of a reduce task
mapreduce.job.task.counter.reduce_output_records    The number of output records of a reduce task

Authors:
@zachradtka
@wjsl

'''
# Initialize the state and type counts
state_counts = {state: 0 for state in MAPREDUCE_TASK_STATE_METRICS}
type_counts = {task_type: 0 for task_type in MAPREDUCE_TASK_TYPE_METRICS}
Member

Just a quick comment: the default test fails here on python2.6 because of the dict comprehension that's only in python since 2.7.

You can probably initialize these vars to defaultdicts instead (from the collections module).
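A minimal sketch of that suggestion, using stand-in values for the check's constants (the real lists live in the check file):

  from collections import defaultdict

  # Stand-ins for the check's constants, for illustration only
  MAPREDUCE_TASK_STATE_METRICS = ['new', 'scheduled', 'running', 'succeeded']
  MAPREDUCE_TASK_TYPE_METRICS = ['map', 'reduce']

  # defaultdict(int) returns 0 for any unseen key, so the counts need
  # no up-front initialization and no dict comprehension (2.7+ only)
  state_counts = defaultdict(int)
  type_counts = defaultdict(int)

  # A dict() call over a generator expression is another 2.6-safe option
  state_counts = dict((state, 0) for state in MAPREDUCE_TASK_STATE_METRICS)
  type_counts = dict((t, 0) for t in MAPREDUCE_TASK_TYPE_METRICS)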

@zachradtka
Contributor Author

The mapreduce agent has the following metrics:

MapReduce Job Metrics
---------------------
mapreduce.job.elapsed_time               The elapsed time since the application started (in ms)
mapreduce.job.maps_total                 The total number of maps
mapreduce.job.maps_completed             The number of completed maps
mapreduce.job.reduces_total              The total number of reduces
mapreduce.job.reduces_completed          The number of completed reduces
mapreduce.job.maps_pending               The number of maps still to be run
mapreduce.job.maps_running               The number of running maps
mapreduce.job.reduces_pending            The number of reduces still to be run
mapreduce.job.reduces_running            The number of running reduces
mapreduce.job.new_reduce_attempts        The number of new reduce attempts
mapreduce.job.running_reduce_attempts    The number of running reduce attempts
mapreduce.job.failed_reduce_attempts     The number of failed reduce attempts
mapreduce.job.killed_reduce_attempts     The number of killed reduce attempts
mapreduce.job.successful_reduce_attempts The number of successful reduce attempts
mapreduce.job.new_map_attempts           The number of new map attempts
mapreduce.job.running_map_attempts       The number of running map attempts
mapreduce.job.failed_map_attempts        The number of failed map attempts
mapreduce.job.killed_map_attempts        The number of killed map attempts
mapreduce.job.successful_map_attempts    The number of successful map attempts

MapReduce Job Counter Metrics
-----------------------------
mapreduce.job.counter.reduce_counter_value   The counter value of reduce tasks
mapreduce.job.counter.map_counter_value      The counter value of map tasks
mapreduce.job.counter.total_counter_value    The counter value of all tasks

MapReduce Map Task Metrics
--------------------------
mapreduce.job.map.task.progress     The distribution of progress across all map tasks
mapreduce.job.map.task.elapsed_time The distribution of elapsed time across all map tasks

MapReduce Reduce Task Metrics
-----------------------------
mapreduce.job.reduce.task.progress      The distribution of progress across all reduce tasks
mapreduce.job.reduce.task.elapsed_time  The distribution of elapsed time across all reduce tasks

The task metrics are histograms, sampled from each task running under a job.

The job counter metrics are custom metrics that the user must specify in the init_config section of mapreduce.yaml.

As an example, to get metrics for the counter with name HDFS_BYTES_WRITTEN for the counter group org.apache.hadoop.mapreduce.FileSystemCounter, the init_config section would look like:

  counter_metrics:
    - job_name: 'Foo'
      metrics:
        - counter_group_name: 'org.apache.hadoop.mapreduce.FileSystemCounter'
          counters:
            - counter_name: 'HDFS_BYTES_WRITTEN'

'user_name:' + str(user_name),
'job_name:' + str(job_name)]

self._set_metrics_from_json(tags, job_json, MAPREDUCE_JOB_METRICS)
Member

Actually there's still an issue here: we're sending the MAPREDUCE_JOB_METRICS (all gauges) for each job (i.e. for each job_id), but we only tag by user_name/job_name, which means that these metrics get overwritten when multiple jobs share the same job_name.

This means that once the check has run, for a given job_name, the values of these metrics will be the values from the last job in the list that has this job_name.

Member

Instead of this we should pre-aggregate the metrics for each job_name, probably using histograms (though for some of these metrics we could compute only the average or the max/min per job_name instead of a whole histogram, depending on what's interesting for each metric).
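
A rough sketch of that approach; the {json_field: metric_name} shape of MAPREDUCE_JOB_METRICS is assumed here for illustration, and histogram() is the standard AgentCheck reporting method:

  # Report each job's values as histogram samples so that several jobs
  # sharing a job_name contribute samples instead of overwriting a gauge
  tags = ['user_name:' + str(user_name), 'job_name:' + str(job_name)]
  for json_field, metric_name in MAPREDUCE_JOB_METRICS.iteritems():
      value = job_json.get(json_field)
      if value is not None:
          # histogram() emits avg/median/max/95percentile/count series
          self.histogram(metric_name, value, tags=tags)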

Contributor Author

Is there a way with histograms to only produce max, avg, median, etc.?

Contributor Author

OK, I have updated all of the Job metrics to be Histograms.

Member

There's no built-in way to do that in the agent actually, so let's keep it simple and use normal histograms. Thanks!

@olivielpeau
Member

@zachradtka Just added a few comments on the check (see above); unfortunately there's still a tagging/aggregation issue on the metrics. Feel free to reach out if you need more details.

@zachradtka
Contributor Author

@olivielpeau I updated the Agent. It tracks applications and jobs by app ID and job ID respectively. I also updated the agent to utilize histograms for all of the checks. Please let me know if there is anything else I should change.

@olivielpeau
Member

Thanks @zachradtka! I'll make a few comments on some details, and we may have more comments once we've tested the check on our environment, but overall the check looks good.

# Get the running MR applications from YARN
running_apps = self._get_running_app_ids(rm_address,
                                         states=YARN_APPLICATION_STATES,
                                         applicationTypes=YARN_APPLICATION_TYPES)
Member

Nitpick: could you move the usage of these parameters further down, inside the _get_running_app_ids method, on the _rest_request_to_json function call?
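
A sketch of that refactor; YARN_APPS_PATH is an assumed constant for the cluster apps endpoint:

  # Caller no longer threads the YARN filter parameters through
  running_apps = self._get_running_app_ids(rm_address)

  def _get_running_app_ids(self, rm_address):
      # The filters now live next to the request that actually uses them
      metrics_json = self._rest_request_to_json(
          rm_address,
          YARN_APPS_PATH,  # assumed constant, e.g. 'ws/v1/cluster/apps'
          states=YARN_APPLICATION_STATES,
          applicationTypes=YARN_APPLICATION_TYPES)
      # ... extract and return the app IDs from metrics_json ...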

@olivielpeau
Member

Only had one comment for now, we'll get back to you once we've tested the check, thanks!

@zachradtka
Contributor Author

OK, made that quick change.

#

counter_metrics:
# - job_name: 'Foo'
Member

Could you add a comment here in the yaml file with a link to the related hadoop doc page? (probably https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html#Job_Counters_API)

@olivielpeau
Member

@zachradtka: I have some feedback from the team that's going to use the check:

  • do you think it would make sense for some of the counter metrics to be retrieved by default? Specifically we were thinking about the ones that are useful for judging the work performed (bytes read/written, input/output records).
  • could you add a config option that allows configuring a counter_metrics entry for all the job names instead of just for one name?

The team is going to test the check today, I'll get back to you with their feedback on that. Thanks!
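
For the second bullet, one hypothetical config shape, mirroring the existing counter_metrics example (the general_counters key name is illustrative, not confirmed in this thread):

  init_config:
    # Counters collected for every job, regardless of job_name
    general_counters:
      - counter_group_name: 'org.apache.hadoop.mapreduce.TaskCounter'
        counters:
          - counter_name: 'MAP_INPUT_RECORDS'
          - counter_name: 'MAP_OUTPUT_RECORDS'
          - counter_name: 'REDUCE_INPUT_RECORDS'
          - counter_name: 'REDUCE_OUTPUT_RECORDS'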

@zachradtka
Contributor Author

Ha, I originally had it written to be job agnostic and get all the specified counters from all jobs, but decided that having more granularity was better. I do like the option of having universal counters that are collected for every job and configurable in the counter_metrics section. And yes, I can do that.

Is this feature a blocker for testing, or can you test without it and I can add it later today?

@olivielpeau
Member

Thanks for your answers. It's not a blocker, we're going to test the check as-is today.

@zachradtka
Contributor Author

I updated the agent to include job-agnostic counter metrics configurable in the mapreduce.yaml. I also added some comments in the yaml file explaining how to configure the agent.

@olivielpeau modified the milestones: 5.7.0, Triage on Feb 12, 2016
# Service Check Names
SERVICE_CHECK = 'mapreduce.cluster.can_connect'
YARN_SERVICE_CHECK = 'mapreduce.resource_manager.can_connect'
MAPREDUCE_SERVICE_CHECK = 'mapreduce.application_master.can_connect'
Member

Actually, I hadn't noticed that there were different service checks when I suggested only sending an OK once.

The check should actually send an OK service check for each kind of service when all the requests to that service have been made successfully.

So from what I see in the current state of the check, we should probably:

  • remove the SERVICE_CHECK service check altogether (given that we only ever send OKs on it)
  • send an OK YARN_SERVICE_CHECK service check right after self._get_running_app_ids, because, at this point of the check, all the requests made to the resource manager have been successful
  • send an OK MAPREDUCE_SERVICE_CHECK at the end of the main check method, because it's only at that point of the check that we know that all the requests to the application master have been successful.

Does that sound sensible to you? If so could you make these changes and add a small comment on each OK service check explaining why we're sending the service check at that point of the check?

Thanks!
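
A sketch of where those OK service checks would land in the main check method (am_address and the tag values are illustrative):

  def check(self, instance):
      # ... resolve rm_address from the instance config ...

      running_apps = self._get_running_app_ids(rm_address)
      # Every request to the resource manager has succeeded by this point
      self.service_check(YARN_SERVICE_CHECK,
                         AgentCheck.OK,
                         tags=['url:%s' % rm_address])

      # ... gather job, task and counter metrics via the app masters ...

      # Only here do we know all application master requests succeeded
      self.service_check(MAPREDUCE_SERVICE_CHECK,
                         AgentCheck.OK,
                         tags=['url:%s' % am_address])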

Set a metric
'''
if metric_type == GAUGE:
    self.gauge(metric_name, value, tags=tags, device_name=device_name)
Member

Let's remove these two lines as we're not using them anymore.

We'll add them back later if we need them.

@olivielpeau
Member

@zachradtka Went through the check and made a few comments, tell me if anything needs to be clarified. Thanks!

@zachradtka
Contributor Author

@olivielpeau I made the changes, squashed my changes, rebased with master and pushed. Let me know if there is anything else I can do.

Thanks for the helpful suggestions!

# Report success after gathering all metrics from the ResourceManager
self.service_check(YARN_SERVICE_CHECK,
                   AgentCheck.OK,
                   tags=['resourcemanager_url:%s' % rm_address],
Member

Given that we're using a url tag in the _rest_request_to_json method, we should also use url here to be consistent. (The rationale being that if we send a CRITICAL service check with a given service check name and tags on a check run, we want the OK service check that's sent when things go back to normal on the same service check name to have the same tags, so that scoping service checks with tags works as expected on the web UI).
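
With that change, the OK above would use the same tag key as the failure path (sketch):

  # Same 'url' tag key as the CRITICAL service checks sent on request
  # failures, so OK and CRITICAL series scope identically in the web UI
  self.service_check(YARN_SERVICE_CHECK,
                     AgentCheck.OK,
                     tags=['url:%s' % rm_address])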

@olivielpeau
Member

Added 3 comments. Don't worry about the one on the service check assertions; it's a nice-to-have but not mandatory at all.

Thanks!

@olivielpeau self-assigned this on Feb 12, 2016
@zachradtka
Contributor Author

OK, all requirements have been addressed.

@zachradtka
Contributor Author

Squashed and pushed!

@olivielpeau
Member

Looks good, thanks @zachradtka! I'll merge once the CI passes.

@olivielpeau
Member

Merging! 🎉

olivielpeau added a commit that referenced this pull request on Feb 12, 2016:
[mapreduce] An agent that reports MapReduce metrics
@olivielpeau merged commit 518f3d3 into DataDog:master on Feb 12, 2016