This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Begin capturing metrics to better understand Pantheon's behaviour #326

Merged: 16 commits merged on Nov 29, 2018

Conversation

ajsutton
Contributor

PR description

Introduces a set of simple interfaces that define metric-capturing devices such as counters and timers. Two initial implementations are provided: one using the Prometheus client and a no-op implementation, which is currently used only in tests.
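For readers without the diff to hand, the shape of these interfaces can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `MetricCategory` enum, the exact method signatures, and the `NoOpMetricsSystem` class name are assumptions, chosen to be consistent with the `createCounter` and `createTimer(RPC, "request", "Some help")` calls visible in the review snippets.

```java
public class MetricsSketch {

  /** Hypothetical categories, mirroring the top-level keys of the sample output. */
  enum MetricCategory {
    PEERS,
    RPC,
    JVM,
    PROCESS
  }

  /** A value that only ever goes up. */
  interface Counter {
    void inc();

    void inc(long amount);
  }

  /** Measures how long an operation takes. */
  interface OperationTimer {
    TimingContext startTimer();

    /** Handle for one in-flight measurement. */
    interface TimingContext {
      /** Stops the measurement and returns the elapsed time in seconds. */
      double stopTimer();
    }
  }

  /** Factory for metrics; implementations decide where the numbers go. */
  interface MetricsSystem {
    Counter createCounter(MetricCategory category, String name, String help);

    OperationTimer createTimer(MetricCategory category, String name, String help);
  }

  /** Discards everything, which is convenient in tests that do not care about metrics. */
  static final class NoOpMetricsSystem implements MetricsSystem {
    private static final Counter NO_OP_COUNTER =
        new Counter() {
          @Override
          public void inc() {}

          @Override
          public void inc(final long amount) {}
        };

    // A timer whose contexts always report zero elapsed seconds.
    private static final OperationTimer NO_OP_TIMER = () -> () -> 0.0;

    @Override
    public Counter createCounter(
        final MetricCategory category, final String name, final String help) {
      return NO_OP_COUNTER;
    }

    @Override
    public OperationTimer createTimer(
        final MetricCategory category, final String name, final String help) {
      return NO_OP_TIMER;
    }
  }

  public static void main(final String[] args) {
    final MetricsSystem metrics = new NoOpMetricsSystem();
    metrics.createCounter(MetricCategory.PEERS, "connected_total", "Peers connected").inc();
    final double elapsed =
        metrics.createTimer(MetricCategory.RPC, "request_time", "RPC timing")
            .startTimer()
            .stopTimer();
    System.out.println(elapsed);
  }
}
```

Calling code depends only on `MetricsSystem`, so a Prometheus-backed implementation can be swapped in without touching the call sites.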

Metrics are exposed via the debug_metrics JSON-RPC method.
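As a rough illustration of querying the method, the sketch below posts a JSON-RPC 2.0 request over HTTP using only the JDK. The class name, the empty `params` array, and the endpoint URL in the usage note are assumptions, not details taken from this PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DebugMetricsClient {

  /** Builds the JSON-RPC 2.0 request body; an empty params array is assumed. */
  static String requestBody(final int id) {
    return "{\"jsonrpc\":\"2.0\",\"method\":\"debug_metrics\",\"params\":[],\"id\":" + id + "}";
  }

  /** Posts the request to the given endpoint and returns the raw JSON response. */
  static String fetch(final String endpoint) throws IOException {
    final HttpURLConnection connection = (HttpURLConnection) new URL(endpoint).openConnection();
    connection.setRequestMethod("POST");
    connection.setRequestProperty("Content-Type", "application/json");
    connection.setDoOutput(true);

    try (OutputStream out = connection.getOutputStream()) {
      out.write(requestBody(1).getBytes(StandardCharsets.UTF_8));
    }

    // Read the whole response body into memory; it is a single JSON document.
    try (InputStream in = connection.getInputStream()) {
      final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      final byte[] chunk = new byte[4096];
      int read;
      while ((read = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
  }
}
```

With a node listening locally, something like `DebugMetricsClient.fetch("http://localhost:8545")` would return the JSON document; the host and port here are placeholders for wherever the node's JSON-RPC endpoint is configured.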

Metrics being captured initially:

  • Total number of peers ever connected to
  • Total number of peers disconnected, by disconnect reason and whether the disconnect was initiated locally or remotely.
  • Current number of peers
  • Timing for processing JSON-RPC requests, broken down by method name.
  • Generic JVM and process metrics (memory used, heap size, thread count, time spent in GC, file descriptors opened, CPU time, etc.).

@ajsutton
Contributor Author

The format of the output is not guaranteed, as the particular metrics we track will change, but an example of the current output is:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "jvm": {
      "memory_bytes_init": {
        "heap": 268435456,
        "nonheap": 2555904
      },
      "threads_current": 39,
      "memory_bytes_used": {
        "heap": 837431408,
        "nonheap": 60211576
      },
      "memory_pool_bytes_used": {
        "PS Eden Space": 802370296,
        "Code Cache": 16560960,
        "Compressed Class Space": 4810848,
        "PS Survivor Space": 2490416,
        "PS Old Gen": 32570696,
        "Metaspace": 38839768
      },
      "threads_deadlocked": 0,
      "classes_loaded_total": 6890,
      "buffer_pool_used_bytes": {
        "direct": 8193,
        "mapped": 0
      },
      "memory_pool_bytes_committed": {
        "PS Eden Space": 908591104,
        "Code Cache": 16711680,
        "Compressed Class Space": 5292032,
        "PS Survivor Space": 2621440,
        "PS Old Gen": 169869312,
        "Metaspace": 40943616
      },
      "threads_deadlocked_monitor": 0,
      "memory_pool_bytes_init": {
        "PS Eden Space": 67108864,
        "Code Cache": 2555904,
        "Compressed Class Space": 0,
        "PS Survivor Space": 11010048,
        "PS Old Gen": 179306496,
        "Metaspace": 0
      },
      "gc_collection_seconds_sum": {
        "PS MarkSweep": 0.089,
        "PS Scavenge": 0.12
      },
      "memory_bytes_committed": {
        "heap": 1081081856,
        "nonheap": 62947328
      },
      "buffer_pool_used_buffers": {
        "direct": 2,
        "mapped": 0
      },
      "threads_started_total": 40,
      "classes_unloaded_total": 0,
      "gc_collection_seconds_count": {
        "PS MarkSweep": 2,
        "PS Scavenge": 18
      },
      "memory_pool_bytes_max": {
        "PS Eden Space": 1413480448,
        "Code Cache": 251658240,
        "Compressed Class Space": 1073741824,
        "PS Survivor Space": 2621440,
        "PS Old Gen": 2863661056,
        "Metaspace": -1
      },
      "threads_daemon": 7,
      "memory_bytes_max": {
        "heap": 3817865216,
        "nonheap": -1
      },
      "threads_peak": 40,
      "buffer_pool_capacity_bytes": {
        "direct": 8192,
        "mapped": 0
      },
      "classes_loaded": 6890
    },
    "process": {
      "open_fds": 166,
      "cpu_seconds_total": 34.28068,
      "start_time_seconds": 1543452099.972,
      "max_fds": 10240
    },
    "rpc": {
      "request_time": {
        "eth_syncing": {
          "bucket": {
            "+Inf": 1,
            "0.01": 1,
            "0.075": 1,
            "0.75": 1,
            "0.005": 1,
            "0.025": 1,
            "0.1": 1,
            "1.0": 1,
            "0.05": 1,
            "10.0": 1,
            "0.25": 1,
            "0.5": 1,
            "5.0": 1,
            "2.5": 1,
            "7.5": 1
          },
          "count": 1,
          "sum": 0.002367387
        },
        "debug_metrics": {
          "bucket": {
            "+Inf": 1,
            "0.01": 0,
            "0.075": 1,
            "0.75": 1,
            "0.005": 0,
            "0.025": 1,
            "0.1": 1,
            "1.0": 1,
            "0.05": 1,
            "10.0": 1,
            "0.25": 1,
            "0.5": 1,
            "5.0": 1,
            "2.5": 1,
            "7.5": 1
          },
          "count": 1,
          "sum": 0.024784594
        },
        "eth_blockNumber": {
          "bucket": {
            "+Inf": 1,
            "0.01": 1,
            "0.075": 1,
            "0.75": 1,
            "0.005": 1,
            "0.025": 1,
            "0.1": 1,
            "1.0": 1,
            "0.05": 1,
            "10.0": 1,
            "0.25": 1,
            "0.5": 1,
            "5.0": 1,
            "2.5": 1,
            "7.5": 1
          },
          "count": 1,
          "sum": 0.00109602
        }
      }
    },
    "peers": {
      "disconnected_total": {
        "remote": {
          "SUBPROTOCOL_TRIGGERED": 2,
          "USELESS_PEER": 1,
          "TOO_MANY_PEERS": 8
        },
        "local": {
          "SUBPROTOCOL_TRIGGERED": 13,
          "USELESS_PEER": 2,
          "BREACH_OF_PROTOCOL": 1
        }
      },
      "peer_count_current": 4,
      "connected_total": 30
    }
  }
}
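A note on reading the `request_time` entries: Prometheus histogram buckets are cumulative, so each bucket counts every observation less than or equal to its upper bound, and the keys arrive in no particular order. The sketch below is an illustration with a hypothetical `HistogramBuckets` helper, not part of this PR. It sorts the bounds and converts the cumulative counts into per-bucket counts, using the `debug_metrics` entry from the sample above as input. The bounds are presumably in seconds, following the Prometheus convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class HistogramBuckets {

  /** Parses a bucket key; "+Inf" is the catch-all upper bound. */
  static double parseBound(final String key) {
    return "+Inf".equals(key) ? Double.POSITIVE_INFINITY : Double.parseDouble(key);
  }

  /** Converts cumulative counts (keyed by upper bound) into counts per individual bucket. */
  static Map<Double, Long> perBucketCounts(final Map<String, Long> cumulative) {
    // Sort by numeric upper bound, since the JSON object is unordered.
    final TreeMap<Double, Long> sorted = new TreeMap<>();
    cumulative.forEach((key, count) -> sorted.put(parseBound(key), count));

    final Map<Double, Long> perBucket = new LinkedHashMap<>();
    long previous = 0;
    for (final Map.Entry<Double, Long> entry : sorted.entrySet()) {
      perBucket.put(entry.getKey(), entry.getValue() - previous);
      previous = entry.getValue();
    }
    return perBucket;
  }

  /** Mean observation, derived from the histogram's sum and count fields. */
  static double mean(final double sum, final long count) {
    return count == 0 ? 0.0 : sum / count;
  }

  public static void main(final String[] args) {
    // A subset of the debug_metrics buckets from the sample output, in arbitrary order.
    final Map<String, Long> cumulative = new LinkedHashMap<>();
    cumulative.put("+Inf", 1L);
    cumulative.put("0.01", 0L);
    cumulative.put("0.005", 0L);
    cumulative.put("0.025", 1L);
    cumulative.put("0.05", 1L);
    cumulative.put("10.0", 1L);

    perBucketCounts(cumulative)
        .forEach((bound, count) -> System.out.println("<= " + bound + ": " + count));
    System.out.println("mean: " + mean(0.024784594, 1));
  }
}
```

For the sample data, the single `debug_metrics` observation (sum 0.024784594) lands in the bucket with upper bound 0.025, which is why the 0.005 and 0.01 buckets read 0 and every larger bucket reads 1.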


public interface MetricsSystem {

default Counter createCounter(
Contributor

Why a default method? Doesn't the var-args already handle the zero arg case?

Contributor Author

This returns a raw Counter; the one with varargs returns LabelledMetric<Counter>, so you have to supply label values before you can call inc().

Contributor

Got it. However, createLabeledCounter for the varargs version would read better in calling code, where the return type isn't always obvious.

Contributor Author

True, changed.
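To make the distinction in this thread concrete, here is a minimal sketch of a labelled counter. It is an illustration under assumptions rather than the PR's code: the `labels(String...)` method name, the `SimpleLabelledCounter` class, and the label names are all hypothetical. The point is that the labelled form hands back a `Counter` only once label values are supplied.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LabelledCounterSketch {

  interface Counter {
    void inc();
  }

  /** A family of metrics; one child metric exists per combination of label values. */
  interface LabelledMetric<T> {
    T labels(String... labelValues);
  }

  static final class SimpleCounter implements Counter {
    private final AtomicLong value = new AtomicLong();

    @Override
    public void inc() {
      value.incrementAndGet();
    }

    long get() {
      return value.get();
    }
  }

  static final class SimpleLabelledCounter implements LabelledMetric<Counter> {
    private final int labelCount;
    private final Map<List<String>, SimpleCounter> children = new ConcurrentHashMap<>();

    SimpleLabelledCounter(final String... labelNames) {
      this.labelCount = labelNames.length;
    }

    @Override
    public Counter labels(final String... labelValues) {
      return child(labelValues);
    }

    /** Reads the current value for one combination of label values. */
    long valueOf(final String... labelValues) {
      return child(labelValues).get();
    }

    private SimpleCounter child(final String... labelValues) {
      if (labelValues.length != labelCount) {
        throw new IllegalArgumentException("Expected " + labelCount + " label values");
      }
      // Copy the array so later changes by the caller cannot alter the map key.
      return children.computeIfAbsent(
          Arrays.asList(labelValues.clone()), key -> new SimpleCounter());
    }
  }

  public static void main(final String[] args) {
    final SimpleLabelledCounter disconnects = new SimpleLabelledCounter("initiator", "reason");
    disconnects.labels("local", "USELESS_PEER").inc();
    disconnects.labels("remote", "TOO_MANY_PEERS").inc();
    System.out.println(disconnects.valueOf("local", "USELESS_PEER"));
  }
}
```

This mirrors the nesting of `disconnected_total` in the sample output, where counts are grouped first by initiator (`local` or `remote`) and then by disconnect reason.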

public void shouldCreateObservationsFromTimer() {
final OperationTimer timer = metricsSystem.createTimer(RPC, "request", "Some help");

final TimingContext context = timer.startTimer();
Contributor

Should we make one of these timer tests use try-with-resources? That would validate the Closeable interface on them.

Contributor Author

Done.
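For context, the try-with-resources pattern discussed in this thread could look roughly like the sketch below. It assumes `TimingContext` extends `Closeable` with a `close()` that stops the timer; `RecordingTimer` is a hypothetical stand-in for whichever implementation is under test.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class TimerSketch {

  interface OperationTimer {
    TimingContext startTimer();

    interface TimingContext extends Closeable {
      /** Stops the measurement and returns the elapsed time in seconds. */
      double stopTimer();

      /** Closing records the observation, so try-with-resources stops the timer. */
      @Override
      default void close() {
        stopTimer();
      }
    }
  }

  /** Keeps every observation in memory so a test can inspect it. */
  static final class RecordingTimer implements OperationTimer {
    final List<Double> observations = new ArrayList<>();

    @Override
    public TimingContext startTimer() {
      final long start = System.nanoTime();
      return () -> {
        final double elapsedSeconds = (System.nanoTime() - start) / 1e9;
        observations.add(elapsedSeconds);
        return elapsedSeconds;
      };
    }
  }

  public static void main(final String[] args) {
    final RecordingTimer timer = new RecordingTimer();
    // The context is closed automatically, even if the timed work throws.
    try (OperationTimer.TimingContext ignored = timer.startTimer()) {
      Math.sqrt(2.0); // stand-in for the work being timed
    }
    System.out.println(timer.observations.size());
  }
}
```

In this sketch, calling `stopTimer()` explicitly inside the block and then letting `close()` run would record two observations, so a caller should pick one style per measurement.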

ajsutton merged commit 081c2f9 into PegaSysEng:master on Nov 29, 2018
ajsutton deleted the pluggable-metrics branch on November 29, 2018 at 23:57