[2021 Theme Proposal] Better observability #74

MichaelMure · 2020-11-30T13:02:42Z

Theme description

When running a service in production, observability is a cornerstone of reliability and performance. go-ipfs has become quite a complex piece of software, yet it is too often a black box for a node operator. It's often difficult to identify problems, let alone solve them. It's time to improve the situation, notably with better tracing.

Hypothesis

administering a black box is hard and leads to breakage of various shapes and forms
observability helps everyone, with barely any downside

Vision statement

If executed properly, node operators will be able to:

better diagnose and resolve a wide range of issues
provide better feedback to the development team
rely less on PL to diagnose those issues and reduce the burden on the development team
develop solutions to address those issues

Additionally, better observability greatly helps during development, both to verify correctness and to have actual numbers to base decisions on.

Why focus this year

There is no major roadblock preventing this, just work that needs to be done or completed.

Example workstreams

Observability consists of three pillars: logs, metrics, and traces. They are in different shape at the moment in go-ipfs and will require different amounts of work.

Logs

Logs are in decent shape in go-ipfs. Most of the subsystems are instrumented, although not equally. However, they are a bit difficult to exploit, as there is only a single possible sink (stdout) and a single global filter.

For reference, Infura uses a custom plugin to get those logs out of go-ipfs.

Possible work:

develop an API to register a log sink, with dedicated filtering
tag the logs with a request ID if available, which later allows matching logs with traces
review the log instrumentation across subsystems to harmonize it
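
To make the first two items more concrete, here is a minimal sketch of what a sink registry with per-sink filtering could look like. Everything in it (Registry, Sink, Filter, Entry, MemorySink, MinLevel, SubsystemPrefix) is a hypothetical name made up for illustration, not an existing go-ipfs or go-log API. It only shows the general shape: each sink registers with its own filter, and each entry carries an optional request ID.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Level is a log severity; higher values are more severe.
type Level int

const (
	Debug Level = iota
	Info
	Warn
	Error
)

// Entry is one log record. RequestID is empty when no request is in scope.
type Entry struct {
	Level     Level
	Subsystem string
	RequestID string
	Message   string
}

// Sink receives the entries accepted by the filter it was registered with.
type Sink interface {
	Write(e Entry)
}

// Filter reports whether a sink wants a given entry.
type Filter func(e Entry) bool

type registration struct {
	sink   Sink
	filter Filter
}

// Registry fans each entry out to every sink whose filter accepts it.
// (Hypothetical type, assumed for this sketch.)
type Registry struct {
	mu    sync.RWMutex
	sinks []registration
}

// Register adds a sink with a dedicated filter. A nil filter accepts everything.
func (r *Registry) Register(s Sink, f Filter) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.sinks = append(r.sinks, registration{sink: s, filter: f})
}

// Log delivers the entry to each registered sink that wants it.
func (r *Registry) Log(e Entry) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for _, reg := range r.sinks {
		if reg.filter == nil || reg.filter(e) {
			reg.sink.Write(e)
		}
	}
}

// MemorySink keeps formatted lines in memory; a real sink could ship them elsewhere.
type MemorySink struct {
	mu    sync.Mutex
	Lines []string
}

func (m *MemorySink) Write(e Entry) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.Lines = append(m.Lines, fmt.Sprintf("[%s] req=%s %s", e.Subsystem, e.RequestID, e.Message))
}

// MinLevel accepts entries at or above the given severity.
func MinLevel(min Level) Filter {
	return func(e Entry) bool { return e.Level >= min }
}

// SubsystemPrefix accepts entries whose subsystem name starts with prefix.
func SubsystemPrefix(prefix string) Filter {
	return func(e Entry) bool { return strings.HasPrefix(e.Subsystem, prefix) }
}

func main() {
	reg := &Registry{}
	errorsOnly := &MemorySink{}
	dhtOnly := &MemorySink{}
	reg.Register(errorsOnly, MinLevel(Error))
	reg.Register(dhtOnly, SubsystemPrefix("dht"))

	reg.Log(Entry{Level: Debug, Subsystem: "dht", RequestID: "req-1", Message: "querying peers"})
	reg.Log(Entry{Level: Error, Subsystem: "bitswap", RequestID: "req-1", Message: "block fetch failed"})

	fmt.Println("errors sink:", errorsOnly.Lines)
	fmt.Println("dht sink:", dhtOnly.Lines)
}
```

With something along these lines, an operator could send errors to one system and verbose DHT logs to another without touching a global filter, and the shared request ID would let both be joined with traces later.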

Metrics

go-ipfs exposes metrics in the Prometheus format. I don't have many complaints about it.

Possible work:

review the metric instrumentation across subsystems to harmonize it
identify missing metrics and implement them

Tracing

This is the real meat of this proposal. Here go-ipfs is a black box. The best you can achieve is to know how long go-ipfs took to handle a request, with no details about the internals. AFAIK, only the DHT is decently instrumented.

For reference, Infura uses a PluginTracer to export traces to an external system for analysis. However, this requires not only this plugin but also some custom code in our fork to get something meaningful. This is obviously not great.

Possible work:

add a Go context to the data pipeline
add tracing to the data pipeline
add tracing in other important subsystems (pinner, pubsub, connect to the DHT ...)
support distributed tracing (match traces coming from another system and reaching go-ipfs)


atopal · 2020-11-30T13:33:16Z

Thank you Michael! Can you say how this compares to theme #63 (Developer Tooling)? Should this be a part of that? Or do you think it's standalone?

MichaelMure · 2020-11-30T13:40:05Z

There is some overlap, but I think these proposals focus on two different aspects. #63 focuses on developer tooling and how to make it easier to build applications. This proposal focuses much more on the internals and how to monitor and debug internal issues.

Better observability would tangentially help #63, but merging them would make things complicated for everyone involved imho.

obo20 · 2020-12-01T00:03:03Z

Thanks @MichaelMure for this theme suggestion.

From Pinata's side of things I wholeheartedly agree with everything Michael talked about. There have been way too many times where I've had to sit through a debugging session with one of PL's IPFS developers because I couldn't provide any meaningful logs and needed to debug in real time. While the things mentioned here would certainly help out on our end, I think Michael's point, "rely less on PL to diagnose those issues and reduce the burden on the development team", really stands out to me here as a huge win.

momack2 · 2020-12-03T04:34:37Z

@raulk @aarshkshah1992 - I think this is somewhat related to the PhantomDrift project (https://github.com/libp2p/observer-toolkit and https://github.com/libp2p/observation-deck) that was worked on with Nearform in Q1/2. If there are any learnings or next steps from that project, it would be good to share them!

github-actions · 2023-09-25T00:06:22Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

MichaelMure added the 2021 Theme Proposal label Nov 30, 2020

MichaelMure assigned atopal and dchoi27 Nov 30, 2020

ipfs deleted a comment from welcome bot Dec 3, 2020

bertrandfalguiere mentioned this issue Dec 9, 2020

[2021 Theme Proposal] Push for mainstream browser integration #81

Closed

MichaelMure mentioned this issue Sep 29, 2021

context/tracing in the blockstore/datastore pipeline ipfs/kubo#6803

Closed

github-actions bot added the Stale label Sep 25, 2023

github-actions bot closed this as not planned Oct 1, 2023
