Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic implementation of OpenCensus #1414

Closed
wants to merge 15 commits into from
Closed

Basic implementation of OpenCensus #1414

wants to merge 15 commits into from

Conversation

DazWilkin
Copy link
Contributor

Implementation of OpenCensus monitoring (alternative to Prometheus-specific monitoring and a complement to the existing OpenCensus trace).

The code needs work.

I'm interested in feedback on the implementation and its usefulness.

Checklist

Copy link
Member

@AlCutter AlCutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Daz!
I think this is a great idea, could potentially kill off a bunch of complexity eventually by pulling everything under opencensus.

I popped a few light comments in-line, though I know you said the code's not really ready yet.

log.Fatal(err)
}
// Important to invoke Flush before exiting
defer sd.Flush()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to have this func not be init but an exported Initialise func which itself returns a func() to be called at shutdown time, e.g.

func Initialize() (func(), error) {
 ...
 return func() { 
     if using SD {
        sd.Flush()
     }
     if using DD {
        dd.Stop()
    }
 }, nil
}

then in the various server mains:

  ...
 var mf monitoring.MetricFactory
  if *useOpencensus {
     flush, err := opencensus.Initialise()
     // if err etc. 
     defer flush()
     mf = &opencensus.MetricFactory{..} 
  }
  ...

Otherwise, as it stands, I think it'll just flush anything that's been logged during the course of the init method when the init returns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed thanks to your guidance: 2706a6e

This requires changes to each server main which would break the Prometheus implementation. So, leaving unchanged for now. Must also understand how best to reflect the configuration through Trillian's configuration.

//TODO(dazwilkin) Datadog, Prometheus & Stackdriver should be configurable
if true {
	flush, err := opencensus.Initialize()
	if err != nil {
		glog.Exitf("Failed to initialize OpenCensus Exporter(s): %v", err)
	}
	defer flush()
}

monitoring/opencensus/metrics.go Show resolved Hide resolved
MetricPrefix: namespace,
})
if err != nil {
log.Fatal(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a mix of log and glog through here, is that because you don't want to rely on glog being usable in init?

}
}

// createKeyMethods creates OpenCensus Tag Keys for each label name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createTagKeys :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: c44fd34

// -- contains non-printable ASCII
// -- or len is 0 or >256
// Printable ASCII 32-126 inclusive
func checkLabelNames(labelNames []string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: labelNames -> names or ln to reduce stutter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: c6436c0

return tagKeys
}

// createMeasureAndView creates the OpenCensus Measure used to record stats and a View for reporting them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit missing full stops in comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: d5ab0a5


// forAllLabelsAValue trivially ensures the numbers of labels matches the number of values
func forAllLabelsAValue(labels, values []string) error {
if len(labels) != len(values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can reduce clutter and re-calculation a bit like this:

if ll, lv := len(labels), len(values); ll != lv {
   return fmt.Errorf(..., ll, lv)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: e487093

return r
}

// NewHistogram creates a new Histogram object backed by Prometheus.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/Prometheus/OpenCensus/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: acc264c

}

// Inc adds 1 to the gauge
func (g *Gauge) Inc(labelVals ...string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible these are unused, or calling code could be refactored to call Set (github search makes it hard to figure that out, vi and go build would do a better job :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment but I think perhaps you allude to this issue:

One challenge implementing the Trillian metric interface with OpenCensus is that the interface assumes the possibility of tracking state of the measure. Unlike with Prometheus (where this is possible), OpenCensus is a proxy (for one or more monitoring services) and the proxy simply dispatches the current value to the configured backends without recording the current measure's value.

As a result, it's possible to increment Counters, and set the value of Gauges. But it's not possible to get the value of either, and thus it's not possible to increment, decrement of add values to e.g. Gauges because this requires retaining knowledge of the current value.

If the interface is changed to reflect this, then the Prometheus tests that assume this, will break. As it is, the OpenCensus tests break because it's not possible to get the measure values.

@DazWilkin
Copy link
Contributor Author

@AlCutter --

Yes, was seeking feedback and guidance, thank you.

I'm headed out on vacation but will incorporate your feedback when I'm back.

@DazWilkin
Copy link
Contributor Author

@AlCutter --

This isn't forgotten.

I'm now back Stateside and have been working with @gdbelvin on Key Transparency.

I'll make the changes you propose and resubmit before the end of next week.

@DazWilkin
Copy link
Contributor Author

Apologies... a further week's delay.

We lost power this week during Seattle's Snowmaggedon and it's set me back

@DazWilkin
Copy link
Contributor Author

OpenCensus now provides an Agent.

Among other features (including accepting incoming Prometheus metric data), this proxies OpenCensus metric data to OpenCensus Exporters.

This means that, instead of -- as currently -- having to supported multiple Exporters (Stackdriver, Prometheus, Datadog) in the Trillian code, the code may be simplified to write to a generic OpenCensus exporter that itself writes to the Agent. At run|deploy time, the operator can decide which monitoring|trace services to configure.

I'll simplify the code to use the generic Exporter and re-commit.

)

const (
namespace = "trillian"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namespace is unused (from deadcode)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 30c7a29

if strings.IndexFunc(labelName, nonPrintableASCII) != -1 {
glog.Fatalf("OpenCensus label names must be printable ASCII; '%s' is not", labelName)
if strings.IndexFunc(name, nonPrintableASCII) != -1 {
glog.Fatalf("OpenCensus label names must be printable ASCII; '%s' is not", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitty nit: can use %q rather than '%s' for the same output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I did not know that... my printf foo is weak.

Fixed: 8986d5a

@AlCutter
Copy link
Member

AlCutter commented Apr 4, 2019

Looks like it's missing an interface method:

monitoring/opencensus/metrics_test.go:12:39: cannot use MetricFactory literal (type MetricFactory) as type "github.com/google/trillian/monitoring".MetricFactory in argument to testonly.TestCounter:
	MetricFactory does not implement "github.com/google/trillian/monitoring".MetricFactory (missing NewHistogramWithBuckets method)

@DazWilkin
Copy link
Contributor Author

DazWilkin commented Apr 4, 2019

There's an issue in the OpenCensus Stackdriver Exporter. If buckets are defined to include a 0-valued bucket i.e. [0.0, ....], the Exporter fails and no metric data is sent to Stackdriver: census-ecosystem/opencensus-go-exporter-stackdriver#98

This Trillian OpenCensus code now ships metrics to an OpenCensus Agent. The Trillian code is unable to determine whether the Agent is configured to export to Stackdriver.

To avoid this error (and assuming -- !! -- that this doesn't impact the other Exporters), the code currently logs the fact and deletes the 0-valued bucket from the list of buckets.

NB The Interface doesn't support and the existing Prometheus code doesn't check for a histogram created with zero buckets. Neither does this OpenCensus code (though it does log the fact). This appears to be an error case.

@AlCutter
Copy link
Member

Probably need a rebase on this - Travis is failing due to an obsolete workaround still being attempted.

@RJPercival
Copy link
Contributor

Hi @DazWilkin, is this stalled on the linked exporter issue or does it just need some work to rebase it and get it merged?

@DazWilkin
Copy link
Contributor Author

I'm just back from vacation.

I am unclear what failed.

I'll try to recreate this, review and update this thread by the end of this week, OK?

@DazWilkin
Copy link
Contributor Author

@RJPercival tested and confirmed that it continues to work.

Trillian will need an additional go.mod entry:

contrib.go.opencensus.io/exporter/ocagent v0.6.0

I am unclear why this was failing CI but I've removed the tests.

OpenCensus ships metrics asynchronously to exporters (in this case, the OpenCensus Agent). Thus it has no (easy) way to retrieve measurement values and thus it's unable to succeed with the monitoring tests as defined.

I've revised the README.md to explain configuration, but in summary:

  1. Add a flag to the server main.go to define the OpenCensus Agent endpoint
  2. Instantiate it
  3. Initialize it (!) providing the Agent's endpoint and defer shutdown.
var (
	ocatEndpoint    = flag.String("ocat_endpoint", "localhost:55678", "Endpoint for OpenCensus Agent receiver (host:port)")
)

...

mf := opencensus.MetricFactory{}

flush, err := opencensus.Initialize(*ocatEndpoint)
if err != nil {
	glog.Exitf("Failed to initialize OpenCensus Agent: %v", err)
}
defer flush()

NB Because it no longer defaults to (only use) Prometheus, the server's HTTP endpoint no longer includes the /metrics endpoint. If the Agent is configured to export to Prometheus, these metrics will be merged into the Agent's :55678/metrics

@AlCutter
Copy link
Member

Hey Daz, we're just cleaning out old PR/issues - there's clearly a bunch of work which went into this, is it worth trying to bump this up to date and land it or is it less useful now?

@DazWilkin
Copy link
Contributor Author

As you know, OpenCensus is superseded by OpenTelemetry which, IIRC, implements a metric format that's very close to Prometheus' exposition format (maybe even the same?).

I think this PR should be dropped and Trillian may either export directly as an OpenTelemetry exporter or worst-case use the OpenTelemetry collector.

@AlCutter
Copy link
Member

Cool, thanks Daz!

@AlCutter AlCutter closed this Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants