Add PeerDAS metrics #13
base: master
Conversation
| Name | Type | Description | Collection point |
| --- | --- | --- | --- |
| `libp2p_rpc_requests_sent_total` | Counter | Number of requests sent | On propagating an RPC request |
| `libp2p_rpc_requests_bytes_sent_total` | Counter | Number of requests bytes sent | On propagating an RPC request |
| `libp2p_rpc_requests_received_total` | Counter | Number of requests received | On receiving an RPC request |
| `libp2p_rpc_requests_bytes_received_total` | Counter | Number of requests bytes received | On receiving an RPC request |
| `libp2p_rpc_responses_sent_total` | Counter | Number of responses sent | On propagating an RPC response |
| `libp2p_rpc_responses_bytes_sent_total` | Counter | Number of responses bytes sent | On propagating an RPC response |
| `libp2p_rpc_responses_received_total` | Counter | Number of responses received | On receiving an RPC response |
| `libp2p_rpc_responses_bytes_received_total` | Counter | Number of responses bytes received | On receiving an RPC response |
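For concreteness, here is a minimal sketch (assuming Node's `prom-client`; the helper and hook names are illustrative, not spec'd) of registering and incrementing two of these counters, with the protocol ID attached as a label:

```ts
import { Counter, Registry } from "prom-client";

const registry = new Registry();

// Hypothetical helper (not part of the spec): registers one of the req/resp
// counters from the table above with the protocol ID as a snake_case label.
function rpcCounter(name: string, help: string) {
  return new Counter({ name, help, labelNames: ["protocol_id"], registers: [registry] });
}

const requestsSent = rpcCounter("libp2p_rpc_requests_sent_total", "Number of requests sent");
const requestBytesSent = rpcCounter("libp2p_rpc_requests_bytes_sent_total", "Number of requests bytes sent");

// Hypothetical hook: called when an RPC request is propagated.
function onRequestSent(protocolId: string, encodedLength: number): void {
  requestsSent.inc({ protocol_id: protocolId }, 1);
  requestBytesSent.inc({ protocol_id: protocolId }, encodedLength);
}

onRequestSent("/eth2/beacon_chain/req/data_column_sidecars_by_root/1/", 1024);
```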
How do you feel about prefixing these with `req_resp` instead of `libp2p_rpc`?
Is there a specific reason @matthewkeil?
I suppose in case `libp2p` is swapped out? Sounds ok to me, but we do already have some `libp2p_` metrics in here. What do you think of `p2p_rpc`, @matthewkeil?
Mostly because the domains prescribed in the spec are "gossipsub" and "req/resp", so I'm thinking that domain-specific metrics should follow that convention (similar to how the `gossipsub` ones are labeled in this PR).

Both transmission methodologies are p2p, so I'm not sure that is a necessary differentiator. Also, at least within our tech stack, both "gossipsub" and "req/resp" use @libp2p, so qualifying one with that prefix and not the other doesn't seem correct (it also seems redundant and implementation-specific).
As a newcomer, I found it confusing that specs refer to req/resp, while metrics use a libp2p prefix. Needed extra explanation.
The terms `rpc` and `req/resp` have been used interchangeably throughout the implementation and discussions. I feel that `req/resp` is a bit more precise, though `rpc` is more commonly used outside of the spec.

In Lighthouse, most of our metrics and code reference `rpc`, and I'd personally prefer to keep it consistent there and also avoid breaking changes to existing metrics. However, I think we can probably find ways around this, e.g. metric relabelling on export (which will use more storage) or a configured mapping in Grafana, if standardising these metrics provides benefits.
#### Req/Resp metrics
Req/Resp domain metrics should contain the protocol IDs `/eth2/beacon_chain/req/data_column_sidecars_by_root/1/` and `/eth2/beacon_chain/req/data_column_sidecars_by_range/1/` as labels.
Should we standardize the label name to `protocolId` or something similar? The only issue I see with that would be naming conventions across languages (ie JS is `protocolId` and Rust would be `protocol_id`). But it would allow a common Grafana dashboard to work for all clients.

Same question for `topic` above in the gossipsub section.
@matthewkeil Yup, standardizing the label names makes sense - they should be in snake case:

> Prometheus metrics and label names are written in snake_case
Question @parithosh regarding implementation: should chunks be considered a "response" (ie each column/block sent over a stream)? Should this be specified here for other implementers? As a note, I tend to lean towards stream/req/resp being 1:1 (ie any number of columns/blocks sent over a single stream counts as a single response).
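A rough sketch of that 1:1 interpretation, again assuming `prom-client`; the stream-completion hook is hypothetical, and this is just one of the options under discussion:

```ts
import { Counter } from "prom-client";

const responsesReceived = new Counter({
  name: "libp2p_rpc_responses_received_total",
  help: "Number of responses received",
  labelNames: ["protocol_id"],
});
const responseBytesReceived = new Counter({
  name: "libp2p_rpc_responses_bytes_received_total",
  help: "Number of responses bytes received",
  labelNames: ["protocol_id"],
});

// Hypothetical hook, called once when a response stream completes:
// every chunk contributes bytes, but the response counter increments
// once per stream, regardless of how many columns/blocks it carried.
function onResponseStreamComplete(protocolId: string, chunkSizes: number[]): void {
  for (const size of chunkSizes) {
    responseBytesReceived.inc({ protocol_id: protocolId }, size);
  }
  responsesReceived.inc({ protocol_id: protocolId }, 1);
}
```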
I think we're okay with either approach as long as all clients do it the same way. In general I'd agree that a single req/resp should count as "1", but thinking of a future where the number of columns/blocks is different, the metric would then also mean something else (especially if we're trying to draw some analysis on top of it)? Do other implementers have an opinion here?
Question regarding the "deduplicated" messages. We segregate incoming messages for duplicates and then check for valid/invalid message format at the gossip layer. Should deduplicated messages include invalid ones (ie all non-duplicates) in the count, or only messages that go through the gossip validation process correctly? BTW this does not include full app-level verification, just gossip protocol validation. After speaking with @wemeetagain, we lean towards not including app-level validation, as it greatly simplifies the implementation to a single collection point instead of collection after beacon-node validation sprinkled across several handlers.
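To illustrate the single-collection-point idea, a sketch with a hypothetical gossip-layer hook; it takes the "count only non-duplicates that pass gossip-level format checks" reading, the alternative being to count all non-duplicates:

```ts
import { Counter } from "prom-client";

const recvUnfiltered = new Counter({
  name: "gossipsub_topic_msg_recv_counts_unfiltered_total",
  help: "Number of gossip messages received from each topic (including duplicates)",
  labelNames: ["topic"],
});
const recvDeduplicated = new Counter({
  name: "gossipsub_topic_msg_recv_counts_total",
  help: "Number of gossip messages received from each topic (deduplicated)",
  labelNames: ["topic"],
});

// Hypothetical hook in the gossip layer: every raw message counts as unfiltered;
// only first-seen messages that pass gossip-level format checks (no app-level
// verification) count towards the deduplicated metric.
function onGossipMessage(topic: string, seenBefore: boolean, formatOk: boolean): void {
  recvUnfiltered.inc({ topic }, 1);
  if (!seenBefore && formatOk) {
    recvDeduplicated.inc({ topic }, 1);
  }
}
```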
Would love to get the PR that implements these merged. Any thoughts on the changes, or on perhaps not making them? Either way is fine, but I wanted to pop this back up.
Some general points:
For gossipsub metrics we apply filters by topic and specifically choose "data_column_sidecar"-labeled topics. Is this possible in the Nimbus implementation?
Which topics one wants duplicate-message statistics for is a separate question, and it doesn't affect the basic unavailability of this data to Nimbus. To be more precise, when Nimbus initializes libp2p, it sets:

```nim
historyLength = 6,
historyGossip = 3,
fanoutTTL = chronos.seconds(60),
# 2 epochs matching maximum valid attestation lifetime
seenTTL = chronos.seconds(int(SECONDS_PER_SLOT * SLOTS_PER_EPOCH * 2)),
```

These parameters get translated into the libp2p message seen cache, so duplicates are dropped at the libp2p layer. It doesn't matter what topic: if things are operating properly, Nimbus simply never sees them.
This PeerDAS metrics dashboard is implemented according to the current PeerDAS metrics specs. It can be used by all CL clients that implement the set of [these metrics](ethereum/beacon-metrics#13). The metrics are still under discussion, so the dashboard is currently in development. Filters: by client, job, supernode.
Agreed. This is similar to what drove my question above about "fully validated" messages. Those filtered messages, and invalid ones, never make it to Lodestar. To simplify metric collection we did it at the libp2p layer, without domain-specific validation like BLS, etc. https://github.com/ChainSafe/js-libp2p-gossipsub/pull/508/files

As a note @tersec, we are just collecting these in our libp2p-gossipsub layer and adding the topic as a label, so realistically we are collecting for every topic and it's not "ethereum specific", because you are correct: the libp2p layer should not have any knowledge of the application layer. It just reports metrics on what goes across the wire, agnostic to what it "actually is" other than a message. This also guided my review of the gossipsub metric names, so they are gossip-only and do not refer to application specifics.
You are correct @tersec. As additional color to the comment above, there is a goal to help standardize the way metrics are collected across clients (at several layers of the stack) to help operators and promote client diversity. Building muscle memory around metric naming paradigms, and even better, allowing for unified dashboards across clients, is an overarching dream/wish/goal. While the metrics are in the "ethereum/beacon-metrics" repo, there are some places where collection points will need to be agnostic, the libp2p-gossipsub layer being an excellent example. These libs across the languages should be application agnostic but still usefully and methodically defined. Perhaps there will be a push outside of eth to standardize the metrics further, but for now this is a process in its infancy, and realistically most of these libs are maintained by the client teams from what I gather.
This is a great point and we do the same thing. We kept our collection at the libp2p-gossipsub layer for this exact reason, and we are not counting "full verifications" because of what you mentioned. We are only counting correctly formatted messages that got through deserialization, because collecting after actual gossip handling was too burdensome. There would need to be collection points in each gossip handler, which is infeasible for this scope. See the note about "full verification" in the comment above.
We added this timing in the batch verification we run. It was a pretty easy change. We also do zero single verifications, so everything is run as a batch. As for runtime vs bench performance testing, I see what you are saying, but I think the actual runtime collection is pertinent. The performance of our node is not linear across the full slot/epoch, and understanding how and where the bottlenecks occur can inform protocol development. For instance (while not relevant for this specific topic, but still pertinent to the discussion), adding additional work at epoch transition that is heavily memory bound and triggers extensive garbage collection will affect slot-zero proposals and attestations for us. As we add more work to the gossip validation, will it potentially affect single-slot finality and, more importantly, how?
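For reference, a sketch of collecting that timing around a batch verification call; the metric name, buckets, and `verifyKzgProofBatch` signature are placeholders rather than spec'd names:

```ts
import { Histogram } from "prom-client";

// Placeholder metric name and buckets; not part of the proposed metric list.
const kzgBatchVerifyTime = new Histogram({
  name: "beacon_kzg_batch_verification_seconds",
  help: "Time spent verifying a batch of KZG proofs",
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2],
});

// `verifyKzgProofBatch` stands in for the client's actual batch verifier.
async function timedBatchVerify(
  batch: unknown[],
  verifyKzgProofBatch: (b: unknown[]) => Promise<boolean>
): Promise<boolean> {
  const endTimer = kzgBatchVerifyTime.startTimer();
  try {
    return await verifyKzgProofBatch(batch);
  } finally {
    endTimer(); // observes elapsed time in seconds
  }
}
```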
Good point. We added it, but this may warrant further investigation of its usefulness.
@KatyaRyazantseva was there a discussion or decision on the naming of the metric? If not, perhaps we go with our impressions and swap the name so it's less confusing? Would love to get this one landed so we can update our metrics in ChainSafe/lodestar#7095.
If the only practical way to implement this is to add things to the
It's a way of asking for libp2p modifications for all messages without ever stating such. It should be explicit on this point if so, and that can be discussed in those terms, not specifying this as if it's all about PeerDAS. Because, as a counterexample to
One of the main libp2p libraries is very definitely not an "Ethereum" library in disguise; neither are In general, it turns this "realistically, most of these libs ..." into an effective requirement, which is, again, beyond what should be the scope of a PR which never says anything like this explicitly in a part of one of the many Ethereum-specific spec processes. I'd prefer to see some feedback from more libp2p people here. Maybe these metrics are reasonable, maybe not. But there's a similar tension between
and
People can update metrics and create dashboards however they want, without this being standardized. Proposal:
| Name | Type | Description | Collection point |
| --- | --- | --- | --- |
| `gossipsub_topic_msg_sent_counts_total` | Counter | Number of gossip messages sent to each topic | On sending a message over a topic |
| `gossipsub_topic_msg_sent_bytes_total` | Counter | Number of bytes sent to each topic | On sending a message over a topic |
| `gossipsub_topic_msg_recv_counts_unfiltered_total` | Counter | Number of gossip messages received from each topic (including duplicates) | On receiving a message from a topic, including duplicates |
| `gossipsub_topic_msg_recv_bytes_unfiltered_total` | Counter | Number of bytes received from each topic (including duplicates) | On receiving a message from a topic, including duplicates |
| `gossipsub_topic_msg_recv_counts_total` | Counter | Number of gossip messages received from each topic (deduplicated) | On receiving a message from a topic, deduplicated |
| `gossipsub_topic_msg_recv_bytes_total` | Counter | Number of bytes received from each topic (deduplicated) | On receiving a message from a topic, deduplicated |
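As an illustration of the sent-side counters with `topic` as a snake_case label (a sketch assuming `prom-client`; the publish hook is hypothetical):

```ts
import { Counter } from "prom-client";

// Label names stay snake_case even in a JS/TS codebase, per the label-naming
// discussion earlier in this thread.
const sentCount = new Counter({
  name: "gossipsub_topic_msg_sent_counts_total",
  help: "Number of gossip messages sent to each topic",
  labelNames: ["topic"],
});
const sentBytes = new Counter({
  name: "gossipsub_topic_msg_sent_bytes_total",
  help: "Number of bytes sent to each topic",
  labelNames: ["topic"],
});

// Hypothetical publish hook: one increment per message, plus the encoded size.
function onPublish(topic: string, data: Uint8Array): void {
  sentCount.inc({ topic }, 1);
  sentBytes.inc({ topic }, data.length);
}

onPublish("data_column_sidecar_3", new Uint8Array(2048));
```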
I think these generic metrics would benefit multi-client monitoring - although it would be difficult for clients to change them if they already exist.

Most of these metrics already exist in `rust-libp2p`; I wonder if other clients have equivalent metrics for these? If it's hard to standardise these, perhaps it would still be useful to find out the equivalent for each client, so we could make it possible to have a single multi-client dashboard without standardising the metric names?
Btw, the duplicate-tracking metrics would be useful to analyse the effectiveness of `IDONTWANT`.
We (Lodestar) are the owners of the `js-libp2p-gossipsub` repo and have updated those metric names to match this PR. That PR is still in draft, though, because this one has not landed.

I completely agree that unifying metric names across clients will be huge for both ethPandaOps and operators as a whole. By making dashboards universal, it will be trivial to switch clients to promote client diversity, or to recover if a big bug is found in one.
This PR introduces a recommended list of metrics for monitoring PeerDAS. The proposed changes attempt to standardize the naming convention for these metrics across client implementations. By adopting these metrics, clients can provide better PeerDAS monitoring and compatibility, and simplify the process of building summary dashboards for PeerDAS performance tracking and analysis.
The list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.
#### Data column, kzg, custody metrics

#### Gossipsub metrics

Gossipsub domain metrics should contain topic `data_column_sidecar_{subnet_id}` as labels.

#### Req/Resp metrics

Req/Resp domain metrics should contain the protocol IDs `/eth2/beacon_chain/req/data_column_sidecars_by_root/1/` and `/eth2/beacon_chain/req/data_column_sidecars_by_range/1/` as labels.