Question regarding gossipsub #256
@iulianpascalau thanks for the report! Would be useful if you could point us to the code where you set up gossipsub and the validators. What's the message throughput you're subjecting the system to?
Ok, we are working on a lightweight wrapper over the libp2p libs. The above observation was made on a system in which only one of the 7 peers broadcast 2 messages (one of about 2KB and one under 1KB), spaced roughly 100µs apart, and the other peers, upon receiving those 2 messages, each broadcast a message under 1KB in size (in other words, the first peer broadcast a block header + block body and the others sent their signature shares after processing the header and body).
It happened that some peers did not get the message: as few as around 200 peers out of 384 received the broadcast. Re-running the same app, with the same network topology but with the older versions, yielded a receive rate of 100% (all peers got the message).
Further investigating the older versions, we still found the same problem described in the first comment: some peers got the message a long time (on the order of seconds) after the broadcast event.
This is how we used the old version:

```go
const pubsubTimeCacheDuration = 10 * time.Minute

// ......
optsPS := []pubsub.Option{
	pubsub.WithMessageSigning(withSigning),
}
pubsub.TimeCacheDuration = pubsubTimeCacheDuration
ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)
```

Set up validators:

```go
// ......
// topic creation
subscrRequest, err := netMes.pb.Subscribe(name)
netMes.mutTopics.Unlock()
if err != nil {
	return err
}
if createChannelForTopic {
	err = netMes.outgoingPLB.AddChannel(name)
}
// just a dummy func to consume messages received by the newly created topic
go func() {
	for {
		_, _ = subscrRequest.Next(ctx)
	}
}()
// ...........
// assigning validators
err := netMes.pb.RegisterTopicValidator(topic, func(ctx context.Context, pid peer.ID, message *pubsub.Message) bool {
	err := handler.ProcessReceivedMessage(NewMessage(message), broadcastHandler)
	if err != nil {
		log.Trace("p2p validator", "error", err.Error(), "topics", message.TopicIDs)
	}
	return err == nil
})
```
And this is how we used the new version:

```go
const pubsubTimeCacheDuration = 10 * time.Minute

// ......
optsPS := []pubsub.Option{
	pubsub.WithMessageSigning(withSigning),
}
pubsub.TimeCacheDuration = pubsubTimeCacheDuration
ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)
```

Set up validators:

```go
// ......
// topic creation
netMes.topics[name] = nil
topic, err := netMes.pb.Join(name)
if err != nil {
	netMes.mutTopics.Unlock()
	return err
}
subscrRequest, err := topic.Subscribe()
if err != nil {
	netMes.mutTopics.Unlock()
	return err
}
netMes.topics[name] = topic
netMes.mutTopics.Unlock()
if createChannelForTopic {
	err = netMes.outgoingPLB.AddChannel(name)
}
// just a dummy func to consume messages received by the newly created topic
go func() {
	for {
		_, _ = subscrRequest.Next(ctx)
	}
}()
// ...........
// assigning validators
err := netMes.pb.RegisterTopicValidator(topic, func(ctx context.Context, pid peer.ID, message *pubsub.Message) bool {
	err := handler.ProcessReceivedMessage(NewMessage(message), broadcastHandler)
	if err != nil {
		log.Trace("p2p validator", "error", err.Error(), "topics", message.TopicIDs)
	}
	return err == nil
})
```
Sorry for the long messages.
Can you try with the PX PR? There might be topology issues causing messages to propagate only via gossip; see #234.
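For readers following along, peer exchange is exposed as a gossipsub constructor option once that work is in. A minimal sketch of enabling it, assuming the pubsub build in use includes the PX feature from #234 and exposes `WithPeerExchange` (the `ctxProvider`/`withSigning` names are the same ones from the snippets above):

```go
// Sketch only: enable gossipsub peer exchange, assuming the pubsub build in
// use includes the PX feature (#234) and exposes WithPeerExchange.
ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(),
	pubsub.WithMessageSigning(withSigning),
	pubsub.WithPeerExchange(true), // allow learning new peers from PRUNE messages
)
if err != nil {
	return err
}
```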
Will try. Thanks! 👍
We have tested the PX PR and found that it actually performed worse than the release tag v0.2.5. Worse here means higher latencies when sending payloads of 300,000-650,000 bytes across the 384 nodes. The messages were sent from the same peer, about 1 second apart (100 sends in total). Average latencies were between 462ms and 2.49s, with highs between 934ms and 4s. v0.2.5 on the same setup, with the same test, yielded 295ms - 1.65s averages and 596ms - 2.84s highs.
We have conducted some more tests using the aforementioned wrapper. One of them chose 100 (random) peers out of the 384, each of which was supposed to broadcast a small message (around 300-600 bytes). We expected that, after a while, the messages would reach all other peers, but we soon found that the test failed. Some messages reached as few as 3 peers, averaging around half of the peers. This happened with the "old libs", the "new libs" and the PX PR version, with both our "lightweight" wrapper and the production wrapper. None seemed to cope with such a large flood of messages. We even tested the "new libs" with the option
And now comes the not-so-funny part. We tested a setup in an integration test (run on a single machine, a plain old Go test) that created 200 wrapper instances (production wrappers) and connected them using the kad-dht discovery mechanism. After that, the first 100 nodes each broadcast a single message. After a couple of seconds, each peer (out of the 200) had received the 100 messages. We somewhat expected that, since all of the nodes were connected through the localhost interface, so we started adding constraints to the pubsub implementation: making the outbound peer channels buffered channels of size 1 instead of 32, and adding 150ms sleeps in comms.go's handleSendingMessages function just before the writeMsg call. The test continued to show the messages reaching all peers with more or less latency, but in the end all messages reached all peers. The same happened when testing with the mocknet testing infrastructure and playing with the link latency. We were simply unable to reproduce, using locally created hosts, the findings from our medium-sized network of 384 nodes.
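For reference, the per-peer outbound queue size is also exposed as a constructor option in go-libp2p-pubsub, so a similar constraint could be applied without patching comms.go; a minimal sketch, assuming the version in use provides `WithPeerOutboundQueueSize` and reusing the variable names from the snippets above:

```go
// Sketch only: shrink the per-peer outbound RPC queue (default 32) to 1 to
// mimic the patched-library experiment, assuming the pubsub version in use
// exposes the WithPeerOutboundQueueSize option.
ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(),
	pubsub.WithMessageSigning(withSigning),
	pubsub.WithPeerOutboundQueueSize(1),
)
```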
I am a little confused by all this.
Also, a single message is not necessarily representative, as there may be convergence delays.
As a first step towards diagnosing this, can you turn off the connection manager? Let the nodes have as many connections as they want, just to make sure there is no interference.
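A minimal sketch of what that test could look like on the host-construction side, assuming the wrapper normally passes a `libp2p.ConnectionManager` option when building the host and that the go-libp2p release in use still takes a context in `libp2p.New` (the `connMgr` name is hypothetical):

```go
// Sketch only: build the host without a connection manager so peers can keep
// as many connections as they want; the ConnectionManager option the wrapper
// would normally pass is simply omitted for this diagnostic run.
h, err := libp2p.New(ctx,
	libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"),
	// libp2p.ConnectionManager(connMgr), // intentionally left out
)
```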
@vyzo you started working on a testground test for peer exchange. Not sure what the status of that is. Do you have any reference numbers you can share from actual experiments?
Ok, a short update: it turned out that the problem we reported, where messages seemed to be dropped randomly, is not an issue (the application that used our wrappers caused the false reports). The messages are broadcast to all peers even in the presence of high latencies or a misconfigured pubsub implementation (small outbound message queue size).
Yeah, that 1s delay is indicative of gossip transmission.
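For context, gossip (IHAVE/IWANT) recovery is driven by the gossipsub heartbeat, which defaults to one second, so a peer that misses the initial eager push typically receives the message roughly one heartbeat later. A rough, experiment-only sketch of shortening it, assuming the pubsub version in use still exposes the package-level gossipsub variables rather than a parameters struct:

```go
// Experiment-only sketch: shorten the gossip heartbeat so IHAVE/IWANT
// recovery kicks in sooner. Assumes the pubsub version in use exposes the
// package-level GossipSubHeartbeatInterval variable (default 1s); it must be
// set before constructing the router, and all peers should use the same value.
pubsub.GossipSubHeartbeatInterval = 200 * time.Millisecond
ps, err := pubsub.NewGossipSub(ctxProvider.Context(), ctxProvider.Host(), optsPS...)
```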
Ok great, will focus on our protocol then. :)
Hello

I am wondering why, in a setup of 7 nodes (connected in a complete-graph fashion), it sometimes happens that a peer gets the message delivered about 1 second (sometimes even more) after the other peers. The setup we are using has multiple topics, uses message signing, and uses gossipsub as the pubsub router. The message payload is around 2KB. We are also using topic validators, but the functions return in under 65ms (after our latest measurements; initially measured as under 20ms). The setup has been deployed on Digital Ocean and after that on AWS VPS, both runs yielding the same results. I'm asking this because I have tried using the floodsub router and a message is broadcast to all peers in under 150ms.

I'm trying to figure out a way to reproduce this but have failed, even when using 100 hosts connected through the localhost interface.
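One way to double-check that a slow or blocked validator is not the source of the delay is to bound the inline validators explicitly when registering them; a minimal sketch, assuming the pubsub version in use supports the `WithValidatorTimeout` and `WithValidatorConcurrency` options and reusing the names from the snippets above:

```go
// Sketch only: bound the topic validator so a slow ProcessReceivedMessage
// call cannot stall delivery. Assumes the pubsub version in use supports the
// WithValidatorTimeout / WithValidatorConcurrency validator options.
err := netMes.pb.RegisterTopicValidator(topic,
	func(ctx context.Context, pid peer.ID, message *pubsub.Message) bool {
		return handler.ProcessReceivedMessage(NewMessage(message), broadcastHandler) == nil
	},
	pubsub.WithValidatorTimeout(200*time.Millisecond),
	pubsub.WithValidatorConcurrency(32),
)
```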