-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
protocols/kad: Improve options to efficiently retrieve #2712
Conversation
e21e3c4
to
013eeaf
Compare
Integrates libp2p/rust-libp2p#2712 to fetch providers more quickly. For widely available providers this reduces the fetch time considerably.
792ea96
to
9ed7e70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am guessing that this is already significantly speeding up your query times @dignifiedquire? Thanks for providing a patch with the discussion right away.
decide which queries should be transitioned to the new progress based api
Without having looked into every one, I would appreciate the consistency across the query types, thus in favor of this.
potentially remove Quorum based on the new functionality, as this now could be done by the caller if using the progression based api
I would be in favor of that for the sake of simplicity. I don't think anyone depends on the Quorum
concept specifically. Not offering Quorum
is at the expense of potentially doing extra unneeded work, e.g. when contacting another peer immediately followed by a finish
call through a user. I doubt this is an issue though.
protocols/kad/src/behaviour.rs
Outdated
pub key: record::Key, | ||
pub providers: HashSet<PeerId>, | ||
pub provider: PeerId, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return them one by one? Why not as a batch? One always discovers them in batches, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I went back and for on this one. The nice thing about this, is that I can reuse the sequence number as the count of results, and the processing is a little more streamlined. But batching is definitely more efficient given they come in as batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any idea about the difference in the performance profile?
From an API PoV, it seems nicer to return them one-by-one given that we are already returning a list of events now. For example, if I want to write a log line for each progress entry, being given a set makes this slightly more cumbersome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any idea about the difference in the performance profile?
I don't know, the cost is 1
vs n
events per discovery. From my understanding that cost should be quite low, but I might be mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intuitively I am also assuming that neither has a performance benefit.
Returning them one by one does loose information. Returning them batched gives the user the information that all records within the same batch came from the same source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxinden updated, please check out
protocols/kad/src/behaviour.rs
Outdated
result: QueryResult, | ||
/// Execution statistics from the query. | ||
stats: QueryStats, | ||
}, | ||
|
||
/// An outbound query has produced a result. | ||
OutboundQueryProgressed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a slight preference for now. Should probably put more thoughts into this.
How about the following:
- Remove
OutboundQueryProgressed
. - Rename
OutboundQueryCompleted
toOutboundQueryProgressed
. - Add field
index: Index
toOutboundQueryProgressed
(formerlyOutboundQueryCompleted
) whereenum Index { Continuing(usize), Final(usize) }
indicating the sequence number and whether it is the final update for the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference of my solution over this, happy to change it, but lets get decide on a version before I go change it again 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless anyone of the recently tagged people (or anyone else) raises any objections, I suggest going with the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unfamiliar with the internals of kademlia but do we know ahead of time, how many we will get? If yes, then always returning both numbers could be another design.
What does the number within Final
mean? Would it work to include a boolean completed
in the event and only have count
otherwise?
The one thing I am worried about with your design suggestion @mxinden is that it introduces more nesting which can be cumbersome to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unfamiliar with the internals of kademlia but do we know ahead of time, how many we will get?
At least not in the providers case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you were right, I was being stupid, please check out the new structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxinden I started expanding this to GetRecord
and am running into issues with this construction. There is a need to differentiate the type and information between the progress events and the last event, e.g. for GetRecord
the last event will not contain an actual record, only the optional cache.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about instead returning the cache_candidates
on each progression? I would expect this BTreeMap
to be small and thus cloning to be cheap. Would you agree @dignifiedquire? Would that solve the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not entirely, as I would still need to make the record be optional, which is pretty confusing as a consumer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we return an event for each contacted peer directly with an optional record.
yes very much |
Thanks for the ping @mxinden but I've been out of the loop for a while again so I cannot really comment anything. Will try to keep up to date with this and any follow-up release. |
I am not sure how many changes this will require. I looked through the code and found 0 usages of Providers. |
@kpp the API changes would as well apply to |
345a530
to
1fc7b8b
Compare
1fc7b8b
to
2b89ae9
Compare
@dignifiedquire let me know once this is ready for another review. Excited for this to eventually land. |
2b89ae9
to
dc5db66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making the relevant changes and thanks for propagating them across the code-base.
I left a couple of comments. Overall the direction looks great to me.
I am of the opinion that we should apply this pattern to all query types (e.g. as well to get_closest_peers
). As far as I understand no one is objecting.
I would like to keep the master
branch consistent at all times. Thus I would prefer the changes to the other query types to happen in this pull request as well. Would you be willing to do that work as well @dignifiedquire?
protocols/kad/src/behaviour.rs
Outdated
pub providers: HashSet<PeerId>, | ||
/// How many providers have been discovered so var. | ||
pub providers_so_far: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? In case a user is interested in all of the providers found, would they not need to keep track of all of them anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My use case was that I need it, and it is a lot easier to track in here, than to track externally. We might also need it internally if we want to do any early cancellations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I am fine with leaving it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually realized it doesn't help me much after all, so I think we could savely remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove it then. Thanks.
protocols/kad/src/behaviour.rs
Outdated
@@ -2472,6 +2557,25 @@ pub enum KademliaEvent { | |||
PendingRoutablePeer { peer: PeerId, address: Multiaddr }, | |||
} | |||
|
|||
/// Information about progress events. | |||
#[derive(Debug, Clone)] | |||
pub struct Index { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I proposed the name Index
, I don't think it is intuitive.
In case anyone has more intuitive naming suggestions, that would be appreciated. //CC @thomaseizinger who usually has good ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also dislike the name, but haven't had any better ideas 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about ProgressStep
?
pub struct ProgressStep {
index: usize,
last: bool
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I find ProgressStep
a lot more intuitive than Index
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
protocols/kad/src/behaviour.rs
Outdated
/// An outbound query has produced a result. | ||
OutboundQueryCompleted { | ||
/// An outbound query has finished. | ||
OutboundQueryProgressed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the doc comment still up to date? Should it not be "An outbound query progressed"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
protocols/kad/src/behaviour.rs
Outdated
/// The ID of the query that finished. | ||
id: QueryId, | ||
/// The result of the query. | ||
/// The optional final result of the query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would "The interim result of the query" not be more descriptive? I find this doc comment confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
protocols/kad/src/behaviour.rs
Outdated
result: QueryResult, | ||
/// Execution statistics from the query. | ||
stats: QueryStats, | ||
/// Indicates which event this is, if therer are multiple responses for a single request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Indicates which event this is, if therer are multiple responses for a single request. | |
/// Indicates which event this is, if therer are multiple responses for a single query. |
I think we refer to it as query everywhere else, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} | ||
}); | ||
} | ||
QuickCheck::new().tests(10).quickcheck(prop as fn(_)) | ||
} | ||
|
||
fn get_providers_limit<const N: usize>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our first const generic 🎉
Yeah, I'll work on it, just wanted to get the structure right first. |
e646c48
to
730de19
Compare
@mxinden if we remove the quorum from |
I would expect the Do you think this is too implicit / not intuitive for users? In other words should we consider the potentially unnecessary work significant and thus the whole API a footgun? |
Friendly ping @dignifiedquire. Would be unfortunate for this to go stale. Anything you would need from my end? Do I understand correctly, that this is the only reason why iroh depends on a fork of rust-libp2p? |
yes, that is correct |
730de19
to
2fb1ee8
Compare
I am a little afraid that could happen to be honest. |
2fb1ee8
to
7dfafd9
Compare
Honestly I am not sure how much sense it really makes to change the others behaviour atm, other than the changes I made so far. I will likely have a better understanding down the line when starting to use the rest of the API similarly intensely as I do provider finding atm. So I would love if we can find something close to what is there that we can merge, and I (or others) can work on improving the other pieces of the API as needed. |
Do you see any issues with applying the new mechanism to I understand your reasoning on only doing this change to the |
7dfafd9
to
7b05de1
Compare
@mxinden fair enough, I updated the |
Co-authored-by: Max Inden <[email protected]>
Co-authored-by: Max Inden <[email protected]>
thanks a lot, merged them all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for bearing with me @dignifiedquire!
Hmm, this is quite a lot to read: now that I’m facing a somewhat non-obvious upgrade for ipfs-embed (as I’ve never used Kademlia), what is the upgrade path for the quorum removal from |
@rkuhn can you ping me on your upgrade pull request? Happy to guide or provide patches. I am assuming that you already read the changelog entry. Unfortunately that is not a step-by-step instruction manual. |
Here’s what I came up with, hope it isn’t too far off :-) |
With #2712 merged, this can be marked as done.
With #2712 merged, this can be marked as done.
With libp2p#2712 merged, this can be marked as done.
* upgrade libp2p to 0.50.0 * on_swarm_event and on_connection_handler_event * replace `Swarm::new` with `Swarm::with_threadpool_executor` * on_swarm_event and on_connection_handler_event part 2 * on_swarm_event and on_connection_handler_event part 3 * on_swarm_event and on_connection_handler_event part 4 * update libp2p * libp2p 0.50.0 * rename OutboundQueryCompleted to OutboundQueryProgressed refs libp2p/rust-libp2p#2712 * remove unused var * accumulate outbound_query_records until query is finished * format code * use p_handler instead of new_handler #12734 (comment) * pass ListenFailure to kademlia #12734 (comment) * use tokio executor in tests #12734 (comment) * use chrono Local::now instead of deprecated Local::today * remove unused vars from request_responses tests * attempt to fix pallet UI tests * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish the query. There was also another behavior change in libp2p, they stopped automatic caching on remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from our point of view of the network. The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish the query. There was also another behavior change in libp2p, they stopped automatic caching on remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from our point of view of the network. The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish the query. There was also another behavior change in libp2p, they stopped automatic caching on remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from our point of view of the network. The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
* upgrade libp2p to 0.50.0 * on_swarm_event and on_connection_handler_event * replace `Swarm::new` with `Swarm::with_threadpool_executor` * on_swarm_event and on_connection_handler_event part 2 * on_swarm_event and on_connection_handler_event part 3 * on_swarm_event and on_connection_handler_event part 4 * update libp2p * libp2p 0.50.0 * rename OutboundQueryCompleted to OutboundQueryProgressed refs libp2p/rust-libp2p#2712 * remove unused var * accumulate outbound_query_records until query is finished * format code * use p_handler instead of new_handler paritytech#12734 (comment) * pass ListenFailure to kademlia paritytech#12734 (comment) * use tokio executor in tests paritytech#12734 (comment) * use chrono Local::now instead of deprecated Local::today * remove unused vars from request_responses tests * attempt to fix pallet UI tests * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish the query. There was also another behavior change in libp2p, they stopped automatic caching on remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from our point of view of the network. The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
* upgrade libp2p to 0.50.0 * on_swarm_event and on_connection_handler_event * replace `Swarm::new` with `Swarm::with_threadpool_executor` * on_swarm_event and on_connection_handler_event part 2 * on_swarm_event and on_connection_handler_event part 3 * on_swarm_event and on_connection_handler_event part 4 * update libp2p * libp2p 0.50.0 * rename OutboundQueryCompleted to OutboundQueryProgressed refs libp2p/rust-libp2p#2712 * remove unused var * accumulate outbound_query_records until query is finished * format code * use p_handler instead of new_handler paritytech#12734 (comment) * pass ListenFailure to kademlia paritytech#12734 (comment) * use tokio executor in tests paritytech#12734 (comment) * use chrono Local::now instead of deprecated Local::today * remove unused vars from request_responses tests * attempt to fix pallet UI tests * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish the query. There was also another behavior change in libp2p, they stopped automatic caching on remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from our point of view of the network. The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
* upgrade libp2p to 0.50.0 * on_swarm_event and on_connection_handler_event * replace `Swarm::new` with `Swarm::with_threadpool_executor` * on_swarm_event and on_connection_handler_event part 2 * on_swarm_event and on_connection_handler_event part 3 * on_swarm_event and on_connection_handler_event part 4 * update libp2p * libp2p 0.50.0 * rename OutboundQueryCompleted to OutboundQueryProgressed refs libp2p/rust-libp2p#2712 * remove unused var * accumulate outbound_query_records until query is finished * format code * use p_handler instead of new_handler paritytech/substrate#12734 (comment) * pass ListenFailure to kademlia paritytech/substrate#12734 (comment) * use tokio executor in tests paritytech/substrate#12734 (comment) * use chrono Local::now instead of deprecated Local::today * remove unused vars from request_responses tests * attempt to fix pallet UI tests * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI * restart CI
Description
The end goal is to allow the following
get_providers
give the caller results back in real time, not collect all provider recordsget_providers
allow the total number of records to be of a fixed limit, such that not necessarily 20 peers need to be contacted (as is the case in the current impl)Implementation
There are currently two different commits, the first one implements the basic limit functionality, but this gets replaced with a more general solution in 2868e7b, which streams the results and allows the caller to
.finish()
the query when they have enough.Things Left To Do
Quorum
based on the new functionality, as this now could be done by the caller if using the progression based api