Skip to content
This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Plumb in three more metrics #344

Merged
merged 8 commits into from
Dec 3, 2018
Merged

Plumb in three more metrics #344

merged 8 commits into from
Dec 3, 2018

Conversation

shemnon
Copy link
Contributor

@shemnon shemnon commented Dec 1, 2018

PR description

  • add blockchain_height gauge
  • add blockchain_difficulty_total gauge
  • add blockchain_announcedBlock_ingest histogram

This involved some deep pluming such that the metrics system needs to be
created in the PantheonCommand, along with trickle down effects into other
consensus engines. This is likely where it should live anyway.

Testing necessitated turning the PrometheusMetricsSystem::init method
into a singleton accessed via PrometheusMetricsSystem::instance.

* add blockchain_height gauge
* add blockchain_difficulty_total gauge
* add blockchain_announcedBlock_ingest histogram

This involved some deep pluming such that the metrics system needs to be
created in the PantheonCommand, along with trickle down effects into other
consensus engines.  This is likely where it should live anyway.

Testing necessitated turning the PrometheusMetricsSystem::init method
into a singleton accessed via PrometheusMetricsSystem::instance.
@shemnon shemnon requested a review from ajsutton December 1, 2018 22:02
@shemnon
Copy link
Contributor Author

shemnon commented Dec 1, 2018

I'm all ears for different names for the metrics. We may want to see how they evolve and go on a renaming binge at some point.

Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only real concern is making PrometheusMetricsSystem a singleton as that breaks isolation of Pantheon instances running in the same JVM. The description indicated that was required for testing, but I don't see any usages of it outside of production code so not sure why it was needed. What did I miss?

MetricCategory.BLOCKCHAIN,
"difficulty_total",
"Total difficulty of the chainhead",
() -> (double) this.getChainHead().getTotalDifficulty().toLong());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast to double is proving to be an annoyingly common pattern. Should we introduce a LongSupplier version of gauges? Even if it just did the cast internally it would tidy code up. Also, in hind-sight I should have used DoubleSupplier. Might be a separate PR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because I also see the potential of a UInt256Supplier.

metricsSystem.createTimer(
MetricCategory.BLOCKCHAIN,
"pantheon_blockchain_announcedBlock_ingest",
"Time to ingest a single announced block");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it meaningful to track importing an announced block separately to just importing a block? ie should we just time how long BlockImporter takes as that will capture all block imports? I could see it being useful to track how long it takes from first hearing about a block to it being on-chain, but this metric doesn't quite capture that since it only starts the timer after the new block has been retrieved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was timing the amount of time to process the block. Its separate code paths IIRc between download and announced blocks, and different tasks (download I see being split into at least three non-adjacent tasks : headers, bodies, receipts).

I could push this down into the persistBlock task, but I was hoping to have the number for the logging statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the timing in the log message is probably justification enough for this. I think at some point having a timer inside BlockImporter itself which will capture every block imported makes sense, but there are going to be a number of different timings related to block import (e.g. timing import of each batch of blocks when synchronizing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pluming it into tasks is a much larger PR. I'll leave it here for now, but it may move.

LOG.info(
"Successfully imported announced block {} ({}).",
"Successfully imported announced block {} ({}) in {}ms.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include the time in the log message is a nice touch.

@@ -48,9 +49,16 @@
private static final String PANTHEON_PREFIX = "pantheon_";
private final Map<MetricCategory, Collection<Collector>> collectors = new ConcurrentHashMap<>();

private static final Supplier<MetricsSystem> INSTANCE =
Suppliers.memoize(PrometheusMetricsSystem::init);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong as it makes the metrics system JVM-wide whereas we want to be able to run two separate Pantheon instances in the same JVM and have them entirely isolated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My Prometheus Server code introduced the test breakage. So this is reverted.

@@ -422,6 +426,10 @@ public void run() {
Configurator.setAllLevels("", logLevel);
}

if (metricsSystem == null) {
metricsSystem = PrometheusMetricsSystem.instance();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would metricsSystem not be null here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an early attempt to address my server test breakage.

# Conflicts:
#	acceptance-tests/src/test/java/tech/pegasys/pantheon/tests/acceptance/dsl/node/ThreadPantheonNodeRunner.java
#	pantheon/src/main/java/tech/pegasys/pantheon/RunnerBuilder.java
#	pantheon/src/main/java/tech/pegasys/pantheon/cli/PantheonCommand.java
#	pantheon/src/main/java/tech/pegasys/pantheon/cli/PantheonControllerBuilder.java
#	pantheon/src/test/java/tech/pegasys/pantheon/RunnerTest.java
#	pantheon/src/test/java/tech/pegasys/pantheon/cli/CommandTestAbstract.java
#	pantheon/src/test/java/tech/pegasys/pantheon/cli/PantheonCommandTest.java
@@ -95,6 +95,7 @@ public void initMocks() throws Exception {
when(mockRunnerBuilder.webSocketConfiguration(any())).thenReturn(mockRunnerBuilder);
when(mockRunnerBuilder.dataDir(any())).thenReturn(mockRunnerBuilder);
when(mockRunnerBuilder.bannedNodeIds(any())).thenReturn(mockRunnerBuilder);
when(mockRunnerBuilder.metricsSystem(any())).thenReturn(mockRunnerBuilder);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a whole lot more lines to add a new object to the builder before the builder remodel.

@ajsutton ajsutton added the enhancement New feature or request label Dec 2, 2018
# Conflicts:
#	pantheon/src/main/java/tech/pegasys/pantheon/RunnerBuilder.java
#	pantheon/src/main/java/tech/pegasys/pantheon/cli/PantheonCommand.java
@shemnon shemnon merged commit b74f88a into PegaSysEng:master Dec 3, 2018
@shemnon shemnon deleted the metrics branch December 4, 2018 03:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants