[mapreduce] An agent that reports MapReduce metrics #2236
Conversation
```python
'''
# Initialize the state and type counts
state_counts = {state: 0 for state in MAPREDUCE_TASK_STATE_METRICS}
type_counts = {task_type: 0 for task_type in MAPREDUCE_TASK_TYPE_METRICS}
```
Just a quick comment: the default test fails here on Python 2.6 because dict comprehensions were only added to Python in 2.7. You can probably initialize these vars to `defaultdict`s instead (from the `collections` module).
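A minimal sketch of the suggested fix: `collections.defaultdict` (available since Python 2.5) returns 0 for missing keys, so no pre-seeding with a dict comprehension is needed. The task list here is hypothetical, standing in for data parsed from the REST API.

```python
from collections import defaultdict

# Hypothetical task list standing in for data parsed from the REST API.
tasks = [{'state': 'RUNNING', 'type': 'MAP'},
         {'state': 'RUNNING', 'type': 'REDUCE'}]

# defaultdict(int) yields 0 for missing keys, so no dict comprehension
# (Python >= 2.7 only) is needed to pre-seed the counters.
state_counts = defaultdict(int)
type_counts = defaultdict(int)

for task in tasks:
    state_counts[task['state']] += 1
    type_counts[task['type']] += 1
```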
The mapreduce agent has the following metrics. As an example, to get metrics for the counter with name:
```python
'user_name:' + str(user_name),
'job_name:' + str(job_name)]

self._set_metrics_from_json(tags, job_json, MAPREDUCE_JOB_METRICS)
```
Actually there's still an issue here: we're sending the `MAPREDUCE_JOB_METRICS` (all gauges) for each job (i.e. for each `job_id`), but we only tag by `user_name`/`job_name`, which means that these metrics get overwritten when multiple jobs share the same `job_name`. This means that once the check has run, for a given `job_name`, the values of these metrics will be the values from the last job in the list that has this `job_name`.
Instead of this we should pre-aggregate the metrics for each `job_name`, probably using histograms (but we could probably compute only the average or the max/min per `job_name` instead of a whole histogram for some of these metrics, depending on what's interesting for each metric).
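A sketch of the pre-aggregation idea: group per-job values by `job_name` so each name keeps a distribution rather than a gauge that the last job with that name silently overwrites. The job list and the `elapsed_time` field are hypothetical; in the real check the per-job data comes from the application master's REST API.

```python
from collections import defaultdict

# Hypothetical per-job metrics; in the real check these come from the
# application master's REST API, one entry per job_id.
jobs = [
    {'job_name': 'etl', 'elapsed_time': 120},
    {'job_name': 'etl', 'elapsed_time': 300},
    {'job_name': 'report', 'elapsed_time': 45},
]

# Group values by job_name so each name keeps a distribution instead of
# a gauge that the last job with that name silently overwrites.
elapsed_by_name = defaultdict(list)
for job in jobs:
    elapsed_by_name[job['job_name']].append(job['elapsed_time'])

# Per-name summaries; a histogram submission would instead send each
# raw value and let the agent compute the percentiles.
summaries = {name: {'avg': sum(v) / float(len(v)), 'max': max(v)}
             for name, v in elapsed_by_name.items()}
```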
Is there a way with histograms to only produce `max`, `avg`, `median`, etc.?
OK, I have updated all of the Job metrics to be Histograms.
There's no built-in way to do that in the agent actually, so let's keep it simple and use normal histograms. Thanks!
@zachradtka Just added a few comments on the check (see above). Unfortunately there's still a tagging/aggregation issue on the metrics; feel free to reach out if you need more details.
@olivielpeau I updated the Agent. It tracks applications and jobs by app ID and job ID respectively. I also updated the agent to utilize histograms for all of the checks. Please let me know if there is anything else I should change.
Thanks @zachradtka! I'll make a few comments on some details, and we may have more comments once we've tested the check on our environment, but overall the check looks good.
```python
# Get the running MR applications from YARN
running_apps = self._get_running_app_ids(rm_address,
                                         states=YARN_APPLICATION_STATES,
                                         applicationTypes=YARN_APPLICATION_TYPES)
```
Nitpick: could you move the usage of these parameters further down, inside the `_get_running_app_ids` method, on the `_rest_request_to_json` function call?
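A sketch of the suggested refactor: the YARN filter parameters are applied inside `_get_running_app_ids`, on the `_rest_request_to_json` call, instead of at the call site. The method and constant names follow the diff; the HTTP helper is stubbed out with a hypothetical canned response, and the REST path is illustrative.

```python
YARN_APPLICATION_STATES = 'RUNNING'     # constants as in the check
YARN_APPLICATION_TYPES = 'MAPREDUCE'

class MapReduceCheck(object):
    def _rest_request_to_json(self, address, path, **kwargs):
        # Hypothetical stand-in for the agent's HTTP helper; kwargs
        # would become query-string parameters on the real request.
        return {'apps': {'app': [{'id': 'application_1'}]}}

    def _get_running_app_ids(self, rm_address):
        # The YARN filter parameters now live here, on the
        # _rest_request_to_json call, instead of at the call site.
        metrics_json = self._rest_request_to_json(
            rm_address, 'ws/v1/cluster/apps',
            states=YARN_APPLICATION_STATES,
            applicationTypes=YARN_APPLICATION_TYPES)
        return [app['id'] for app in metrics_json['apps']['app']]

app_ids = MapReduceCheck()._get_running_app_ids('http://rm:8088')
```

The caller now only passes the resource manager address, keeping knowledge of the filtering internal to the method.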
Only had one comment for now, we'll get back to you once we've tested the check, thanks!
OK, made that quick change.
```yaml
counter_metrics:
  # - job_name: 'Foo'
```
Could you add a comment here in the yaml file with a link to the related hadoop doc page? (probably https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html#Job_Counters_API)
@zachradtka: I have some feedback from the team that's going to use the check:
The team is going to test the check today, I'll get back to you with their feedback on that. Thanks!
Ha, I originally had it written to be job-agnostic and get all the specified counters from all jobs, but decided that having more granularity was better. I do like the option of having universal counters that are collected for every job and configurable in the yaml file. Is this feature a blocker for testing, or can you test without it and I can add it later today?
Thanks for your answers. It's not a blocker, we're going to test the check as-is today.
I updated the agent to include job-agnostic counter metrics configurable in the yaml file.
```python
# Service Check Names
SERVICE_CHECK = 'mapreduce.cluster.can_connect'
YARN_SERVICE_CHECK = 'mapreduce.resource_manager.can_connect'
MAPREDUCE_SERVICE_CHECK = 'mapreduce.application_master.can_connect'
```
Actually hadn't noticed that there were different service checks when I suggested to only send an `OK` once. The check should actually send an `OK` service check for each kind of service when all the requests to that service have been made successfully. So from what I see in the current state of the check, we should probably:

- remove the `SERVICE_CHECK` service check altogether (given that we only ever send `OK`s on it)
- send an `OK` `YARN_SERVICE_CHECK` service check right after `self._get_running_app_ids`, because, at this point of the check, all the requests made to the resource manager have been successful
- send an `OK` `MAPREDUCE_SERVICE_CHECK` at the end of the main `check` method, because it's only at that point of the check that we know that all the requests to the application master have been successful.

Does that sound sensible to you? If so could you make these changes and add a small comment on each `OK` service check explaining why we're sending the service check at that point of the check?

Thanks!
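The placement described above can be sketched as follows. The `AgentCheck` stub here only records service-check submissions for illustration (the real base class lives in the agent); `_get_running_app_ids` returns a hypothetical canned result, and the metric-gathering step is elided.

```python
YARN_SERVICE_CHECK = 'mapreduce.resource_manager.can_connect'
MAPREDUCE_SERVICE_CHECK = 'mapreduce.application_master.can_connect'

class AgentCheck(object):
    """Stub for the agent base class; records service checks."""
    OK = 0

    def __init__(self):
        self.sent = []

    def service_check(self, name, status, tags=None):
        self.sent.append((name, status, tags))

class MapReduceCheck(AgentCheck):
    def _get_running_app_ids(self, rm_address):
        return ['application_1']  # hypothetical REST result

    def check(self, instance):
        rm_address = instance['resourcemanager_uri']
        self._get_running_app_ids(rm_address)
        # All resource manager requests succeeded at this point.
        self.service_check(YARN_SERVICE_CHECK, AgentCheck.OK,
                           tags=['url:%s' % rm_address])
        # ... gather job/task metrics from each application master ...
        # Only now do we know every application master request succeeded.
        self.service_check(MAPREDUCE_SERVICE_CHECK, AgentCheck.OK,
                           tags=['url:%s' % rm_address])

check = MapReduceCheck()
check.check({'resourcemanager_uri': 'http://rm:8088'})
```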
```python
Set a metric
'''
if metric_type == GAUGE:
    self.gauge(metric_name, value, tags=tags, device_name=device_name)
```
Let's remove these two lines as we're not using them anymore.
We'll add them back later if we need them.
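With the gauge branch removed (all job metrics are histograms now), the dispatch helper reduces to the histogram case. A minimal sketch, using a stub in place of `AgentCheck` that records submitted metrics; the method and constant names follow the diff, the stub itself is hypothetical.

```python
HISTOGRAM = 'histogram'  # metric-type constant, as in the check

class FakeCheck(object):
    """Stub standing in for AgentCheck; records submitted metrics."""
    def __init__(self):
        self.metrics = []

    def histogram(self, metric_name, value, tags=None, device_name=None):
        self.metrics.append((metric_name, value, tags))

    def _set_metric(self, metric_name, metric_type, value, tags=None):
        # With the gauge branch removed, only histograms are handled.
        if metric_type == HISTOGRAM:
            self.histogram(metric_name, value, tags=tags)

fake = FakeCheck()
fake._set_metric('mapreduce.job.elapsed_time', HISTOGRAM, 120,
                 tags=['job_name:wordcount'])
```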
@zachradtka Went through the check and made a few comments, tell me if anything needs to be clarified. Thanks!
Force-pushed from 32fa5d4 to 60cebf5.
@olivielpeau I made the changes, squashed my changes, rebased with master and pushed. Let me know if there is anything else I can do. Thanks for the helpful suggestions!
```python
# Report success after gathering all metrics from the ResourceManager
self.service_check(YARN_SERVICE_CHECK,
                   AgentCheck.OK,
                   tags=['resourcemanager_url:%s' % rm_address],
```
Given that we're using a `url` tag in the `_rest_request_to_json` method, we should also use `url` here to be consistent. (The rationale being that if we send a `CRITICAL` service check with a given service check name and tags on a check run, we want the `OK` service check that's sent when things go back to normal on the same service check name to have the same tags, so that scoping service checks with tags works as expected on the web UI).
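The tag-consistency point can be shown with a small sketch: the failure and recovery submissions use the same service-check name and the same `url` tag, so scoping by tag pairs them up. The `AgentCheck` stub is hypothetical and only records calls.

```python
class AgentCheck(object):
    """Stub recording service checks, standing in for the agent base."""
    OK, CRITICAL = 0, 2

    def __init__(self):
        self.sent = []

    def service_check(self, name, status, tags=None):
        self.sent.append((name, status, tuple(tags)))

YARN_SERVICE_CHECK = 'mapreduce.resource_manager.can_connect'

check = AgentCheck()
rm_address = 'http://rm:8088'
tags = ['url:%s' % rm_address]  # `url`, matching the failure path's tag

# Failure path (e.g. inside _rest_request_to_json) and recovery path
# use the same name and tags, so tag scoping pairs them correctly.
check.service_check(YARN_SERVICE_CHECK, AgentCheck.CRITICAL, tags=tags)
check.service_check(YARN_SERVICE_CHECK, AgentCheck.OK, tags=tags)
```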
Added 3 comments. Don't worry about the one on the service check assertions; it's a nice-to-have but not mandatory at all. Thanks!
OK, all requirements have been addressed.
Force-pushed from 37df455 to 9cffbfa.
Squashed and pushed!
Looks good, thanks @zachradtka! I'll merge once the CI passes.
Merging! 🎉
[mapreduce] An agent that reports MapReduce metrics
An agent check that gathers metrics for MapReduce applications.
The metrics collected are:
Authors:
@zachradtka
@wjsl