
[hdfs_jmx] An HDFS agent that uses JMX #2235

Merged
merged 1 commit into from
Feb 8, 2016

Conversation

@zachradtka (Contributor)

An agent check for HDFS that uses the JMX interface to gather HDFS metrics from individual HDFS data nodes.

The metrics collected are

hdfs.dfs_remaining                  The remaining disk space left in bytes
hdfs.storage_info                   Path to HDFS storage location
hdfs.dfs_capacity                   Disk capacity in bytes
hdfs.dfs_used                       Disk usage in bytes
hdfs.cache_capacity                 Cache capacity in bytes
hdfs.num_failed_volumes             Number of failed volumes
hdfs.last_volume_failure_date       Date the last volume failed
hdfs.estimated_capacity_lost_total  The estimated capacity lost in bytes
hdfs.num_blocks_cached              The number of blocks cached
hdfs.num_blocks_failed_to_cache     The number of blocks that failed to cache
hdfs.num_blocks_failed_to_uncache   The number of failed blocks to remove from cache
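
Hadoop daemons expose these values over HTTP through the built-in /jmx servlet, which returns JSON. A minimal sketch of how a check like this could map JMX attributes onto the metric names above; the sample payload, values, and the mapping subset are illustrative, not taken from the PR's actual code (FSDatasetState is a real DataNode MBean, though):

```python
import json

# Illustrative /jmx response body in the standard Hadoop JSON shape;
# the numeric values here are made up for the example.
SAMPLE_JMX = json.dumps({
    "beans": [{
        "name": "Hadoop:service=DataNode,name=FSDatasetState",
        "Remaining": 27914511007744,
        "Capacity": 41717338406912,
        "DfsUsed": 501932032,
        "NumFailedVolumes": 0,
    }]
})

# Subset of the JMX-attribute-to-metric mapping (illustrative)
ATTRIBUTE_TO_METRIC = {
    "Remaining": "hdfs.dfs_remaining",
    "Capacity": "hdfs.dfs_capacity",
    "DfsUsed": "hdfs.dfs_used",
    "NumFailedVolumes": "hdfs.num_failed_volumes",
}

def extract_metrics(jmx_body):
    """Pull the attributes we report out of a /jmx response body."""
    metrics = {}
    for bean in json.loads(jmx_body).get("beans", []):
        if "FSDatasetState" in bean.get("name", ""):
            for attr, metric in ATTRIBUTE_TO_METRIC.items():
                if attr in bean:
                    metrics[metric] = bean[attr]
    return metrics

print(extract_metrics(SAMPLE_JMX))
```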

Authors:
@zachradtka
@wjsl

@olivielpeau (Member)

Thanks @zachradtka and @wjsl for your contributions!

We'll review your PRs in depth soon.

@olivielpeau olivielpeau added this to the Triage milestone Feb 1, 2016
@zachradtka (Contributor, Author)

I just pushed a quick change that adds a few metrics for the HDFS namenode and splits the agent check into separate checks for the datanodes and the namenode.

The metrics for the namenode are as follows.

hdfs.namenode.capacity_total                    Total disk capacity in bytes
hdfs.namenode.capacity_used                     Disk usage in bytes
hdfs.namenode.capacity_remaining                Remaining disk space left in bytes
hdfs.namenode.total_load                        Total load on the file system
hdfs.namenode.fs_lock_queue_length              Lock queue length
hdfs.namenode.blocks_total                      Total number of blocks
hdfs.namenode.max_objects                       Maximum number of files HDFS supports
hdfs.namenode.files_total                       Total number of files
hdfs.namenode.pending_replication_blocks        Number of blocks pending replication
hdfs.namenode.under_replicated_blocks           Number of under replicated blocks
hdfs.namenode.scheduled_replication_blocks      Number of blocks scheduled for replication
hdfs.namenode.pending_deletion_blocks           Number of pending deletion blocks
hdfs.namenode.num_live_data_nodes               Total number of live data nodes
hdfs.namenode.num_dead_data_nodes               Total number of dead data nodes
hdfs.namenode.num_decom_live_data_nodes         Number of decommissioning live data nodes
hdfs.namenode.num_decom_dead_data_nodes         Number of decommissioning dead data nodes
hdfs.namenode.volume_failures_total             Total volume failures
hdfs.namenode.estimated_capacity_lost_total     Estimated capacity lost in bytes
hdfs.namenode.num_decommissioning_data_nodes    Number of decommissioning data nodes
hdfs.namenode.num_stale_data_nodes              Number of stale data nodes
hdfs.namenode.num_stale_storages                Number of stale storages
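
The namenode serves these counters from its own /jmx servlet. A sketch of how the request URL could be built; the hostname is hypothetical, 50070 is the Hadoop 2.x default namenode web port, and `?qry=` is the servlet's real bean filter:

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

def jmx_url(host, port, bean):
    # The /jmx servlet accepts a ?qry= filter that narrows the response
    # to a single MBean instead of dumping every bean the daemon exposes.
    return "http://{0}:{1}/jmx?{2}".format(host, port, urlencode({"qry": bean}))

# Hypothetical host; FSNamesystemState is a real NameNode MBean.
print(jmx_url("namenode.example.com", 50070,
              "Hadoop:service=NameNode,name=FSNamesystemState"))
```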

Sorry for the late add, but I really felt these metrics would be helpful.

return MockResponse(body, 200)

class HDFSDataNode(AgentCheckTest):
CHECK_NAME = 'hdfs_jmx'
Review comment (Member):
Don't forget to update CHECK_NAME to hdfs_datanode ;)

@olivielpeau (Member)

Added a bunch of comments; most of them are nitpicks. The checks look good overall, thanks!

I've only added comments on the namenode check, but since the datanode check is pretty similar, could you take them into account for the datanode check as well?

One last thing: we try to keep our metrics' and service checks' prefixes similar to the checks' names, so could you rename all the metrics and service checks to hdfs_namenode.[...] and hdfs_datanode respectively?

Thanks again!

@zachradtka (Contributor, Author)

Thanks for the comments; I addressed all of them on both the datanode and namenode agent checks.


# Add query_params as arguments
if query_params:
query = '&'.join(['{}={}'.format(key, value) for key, value in query_params.iteritems()])
Review comment (Member):
We support python 2.6 so you have to use '{0}={1}'.format(key, value) here (i.e. number the fields)
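A quick sketch of the compatible form (the dict literal is hypothetical): explicit `{0}`/`{1}` indices work on Python 2.6, and `items()`, unlike `iteritems()`, also exists on Python 3:

```python
# Hypothetical query parameters; Python 2.6's str.format requires explicitly
# numbered fields, and items() works on both Python 2 and Python 3.
query_params = {"qry": "Hadoop:service=NameNode,name=FSNamesystemState"}

query = '&'.join('{0}={1}'.format(key, value)
                 for key, value in sorted(query_params.items()))
print(query)  # qry=Hadoop:service=NameNode,name=FSNamesystemState
```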

@olivielpeau (Member)

Thanks for addressing my comments! I've added a few more comments and a question; once they're addressed, the check should be good to go.

@zachradtka (Contributor, Author)

All comments are always welcome!

I have addressed all of the concerns for both the datanode and namenode checks. Let me know what to do next.

@olivielpeau (Member)

Added one comment; once it's addressed, could you squash your commits into one?

Thanks!

@zachradtka (Contributor, Author)

OK, all commits are squashed and rebased onto the latest master.

@olivielpeau olivielpeau modified the milestones: 5.7.0, Triage Feb 5, 2016
@olivielpeau (Member)

Thanks again!

Looks good, I'll merge once the CI passes.

olivielpeau added a commit that referenced this pull request Feb 8, 2016
[hdfs_jmx] An HDFS agent that uses JMX
@olivielpeau olivielpeau merged commit 7f95b83 into DataDog:master Feb 8, 2016