Something suss with metrics on subnets or bug? #6698

Closed
AgeManning opened this issue Dec 16, 2024 · 3 comments

Comments

@AgeManning
Member

Description

We have seen a few cases where we are unable to publish attestations due to a lack of peers on subnets. The failed attestations in the following graph indicate this.

[Screenshot: Screenshot_select-area_20241216120018]

However, the metrics indicate that we have peers on the subnet, so these failures shouldn't be occurring.

If only our reporting of the metrics is wrong, that's not too bad. However, if our metrics are accurate, then there is a bug. And if our metrics are wrong and we are using these numbers to balance our peers per subnet, then this also isn't great.

I suspect the metrics, but it needs investigation (which I'll attempt to do; I'm just making this issue for visibility).

@AgeManning
Member Author

I've had more of a look into this.
I wrote a modified binary to run on this node that was exhibiting this behaviour. The modified code checked the metrics against the connected_peers mapping inside gossipsub. It has not reported any kind of mismatch.
I think this means that the metric is accurate, which is concerning, because it indicates that we are not sending messages to peers that we are connected to that are subscribed to a subnet.
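
For readers following along, the kind of consistency check described above can be sketched roughly as follows. This is not the modified binary's code. It is a minimal illustration that assumes the metric is a per-topic peer count and that the connection map records each peer's subscribed topics. The names `find_mismatches`, `metric_counts`, `PeerId` and `Topic` are hypothetical stand-ins, not Lighthouse or gossipsub types.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins: peer IDs and topic names are plain strings here.
type PeerId = String;
type Topic = String;

/// Recounts subscribers per topic from the connection map and returns every
/// topic whose metric disagrees, as (topic, metric_count, actual_count).
fn find_mismatches(
    metric_counts: &HashMap<Topic, usize>,
    connected_peers: &HashMap<PeerId, BTreeSet<Topic>>,
) -> Vec<(Topic, usize, usize)> {
    // Count how many connected peers are subscribed to each topic.
    let mut actual: HashMap<Topic, usize> = HashMap::new();
    for topics in connected_peers.values() {
        for topic in topics {
            *actual.entry(topic.clone()).or_insert(0) += 1;
        }
    }

    // Compare over the union of topics seen by either side, in sorted order.
    let mut all_topics: BTreeSet<&Topic> = metric_counts.keys().collect();
    all_topics.extend(actual.keys());

    all_topics
        .into_iter()
        .filter_map(|topic| {
            let reported = metric_counts.get(topic).copied().unwrap_or(0);
            let counted = actual.get(topic).copied().unwrap_or(0);
            (reported != counted).then(|| (topic.clone(), reported, counted))
        })
        .collect()
}
```

An empty result means the metric and the connection map agree for every topic, which corresponds to the "no mismatch" outcome reported above.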

I couldn't find why; it requires more investigation. One legitimate reason would be that the peers we are connected to are scored so poorly that we don't publish to them. However, I find this highly unlikely: I checked the peer scoring metric and it didn't indicate this, so I think it's a safe assumption that this is not the case.

The relevant code is here: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/gossipsub/src/behaviour.rs#L644

The question is: why is recipient_peers empty (we receive an InsufficientPeers error) when we are relatively confident that connected_peers contains peers that are subscribed to the topic we want to publish on?
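
To make the question concrete, here is a simplified model of how a publish path could end up with an empty recipient set even though subscribed peers are connected. It is a sketch under assumptions, not the code at the link above: the `Peer` struct, the `select_recipients` function and the single `publish_threshold` parameter are hypothetical, and the real implementation may apply further conditions this model leaves out.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Debug, PartialEq)]
enum PublishError {
    InsufficientPeers,
}

// Hypothetical per-peer state: subscribed topics plus a gossipsub-style score.
struct Peer {
    topics: BTreeSet<String>,
    score: f64,
}

/// Selects the peers to publish to on `topic`. A peer qualifies only if it is
/// subscribed to the topic AND its score is at or above `publish_threshold`.
fn select_recipients(
    connected_peers: &HashMap<String, Peer>,
    topic: &str,
    publish_threshold: f64,
) -> Result<BTreeSet<String>, PublishError> {
    let recipient_peers: BTreeSet<String> = connected_peers
        .iter()
        .filter(|(_, peer)| peer.topics.contains(topic))
        .filter(|(_, peer)| peer.score >= publish_threshold)
        .map(|(id, _)| id.clone())
        .collect();

    // Every filter above can empty the set, so a non-empty subscription count
    // does not by itself guarantee a non-empty recipient set.
    if recipient_peers.is_empty() {
        Err(PublishError::InsufficientPeers)
    } else {
        Ok(recipient_peers)
    }
}
```

In this model a subscribed peer with a score below the threshold still counts towards a per-subnet peer metric but is dropped from `recipient_peers`. That is the scoring explanation considered and judged unlikely above, so any other filter on the real publish path would be the next place to look.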

I didn't see any obvious bug on my first pass of this.

@AgeManning
Member Author

cc @ackintosh @jxs @elenaf9

@AgeManning
Member Author

I think @ackintosh has solved this one. Will re-open if the issue persists
