Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protocols/kad: Improve options to efficiently retrieve #2712

Merged
merged 33 commits into from
Nov 25, 2022

Conversation

dignifiedquire
Copy link
Member

@dignifiedquire dignifiedquire commented Jun 16, 2022

Still a draft as I am opening this to discuss the details and direction of this.

Description

The end goal is to allow the following

  • when calling get_providers give the caller results back in real time, not collect all provider records
  • when calling get_providers allow the total number of records to be of a fixed limit, such that not necessarily 20 peers need to be contacted (as is the case in the current impl)

Implementation

There are currently two different commits, the first one implements the basic limit functionality, but this gets replaced with a more general solution in 2868e7b, which streams the results and allows the caller to .finish() the query when they have enough.

Things Left To Do

  • decide which queries should be transitioned to the new progress based api
  • potentially remove Quorum based on the new functionality, as this now could be done by the caller if using the progression based api
  • write changelog

dignifiedquire added a commit to n0-computer/iroh that referenced this pull request Jun 16, 2022
Integrates libp2p/rust-libp2p#2712 to fetch providers more quickly. For widely available providers this reduces the fetch time considerably.
@dignifiedquire dignifiedquire force-pushed the feat-kad-count branch 2 times, most recently from 792ea96 to 9ed7e70 Compare June 19, 2022 17:40
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing that this is already significantly speeding up your query times @dignifiedquire? Thanks for providing a patch with the discussion right away.

decide which queries should be transitioned to the new progress based api

Without having looked into every one, I would appreciate the consistency across the query types, thus in favor of this.

potentially remove Quorum based on the new functionality, as this now could be done by the caller if using the progression based api

I would be in favor of that for the sake of simplicity. I don't think anyone depends on the Quorum concept specifically. Not offering Quorum is at the expense of potentially doing extra unneeded work, e.g. when contacting another peer immediately followed by a finish call through a user. I doubt this is an issue though.

pub key: record::Key,
pub providers: HashSet<PeerId>,
pub provider: PeerId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why return them one by one? Why not as a batch? One always discovers them in batches, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I went back and for on this one. The nice thing about this, is that I can reuse the sequence number as the count of results, and the processing is a little more streamlined. But batching is definitely more efficient given they come in as batches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any idea about the difference in the performance profile?

From an API PoV, it seems nicer to return them one-by-one given that we are already returning a list of events now. For example, if I want to write a log line for each progress entry, being given a set makes this slightly more cumbersome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any idea about the difference in the performance profile?

I don't know, the cost is 1 vs n events per discovery. From my understanding that cost should be quite low, but I might be mistaken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively I am also assuming that neither has a performance benefit.

Returning them one by one does loose information. Returning them batched gives the user the information that all records within the same batch came from the same source.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden updated, please check out

result: QueryResult,
/// Execution statistics from the query.
stats: QueryStats,
},

/// An outbound query has produced a result.
OutboundQueryProgressed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a slight preference for now. Should probably put more thoughts into this.

How about the following:

  • Remove OutboundQueryProgressed.
  • Rename OutboundQueryCompleted to OutboundQueryProgressed.
  • Add field index: Index to OutboundQueryProgressed (formerly OutboundQueryCompleted) where enum Index { Continuing(usize), Final(usize) } indicating the sequence number and whether it is the final update for the query.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference of my solution over this, happy to change it, but lets get decide on a version before I go change it again 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless anyone of the recently tagged people (or anyone else) raises any objections, I suggest going with the above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unfamiliar with the internals of kademlia but do we know ahead of time, how many we will get? If yes, then always returning both numbers could be another design.

What does the number within Final mean? Would it work to include a boolean completed in the event and only have count otherwise?

The one thing I am worried about with your design suggestion @mxinden is that it introduces more nesting which can be cumbersome to deal with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unfamiliar with the internals of kademlia but do we know ahead of time, how many we will get?

At least not in the providers case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you were right, I was being stupid, please check out the new structure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden I started expanding this to GetRecord and am running into issues with this construction. There is a need to differentiate the type and information between the progress events and the last event, e.g. for GetRecord the last event will not contain an actual record, only the optional cache.
Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about instead returning the cache_candidates on each progression? I would expect this BTreeMap to be small and thus cloning to be cheap. Would you agree @dignifiedquire? Would that solve the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not entirely, as I would still need to make the record be optional, which is pretty confusing as a consumer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we return an event for each contacted peer directly with an optional record.

@dignifiedquire
Copy link
Member Author

I am guessing that this is already significantly speeding up your query times @dignifiedquire?

yes very much

@mxinden
Copy link
Member

mxinden commented Jun 22, 2022

@kpp @tomaka do you have thoughts on this? This will require a small change in the Authority Discovery module.

@koivunej what do you think of this proposal? I would guess rust-ipfs could benefit from the performance gains as well.

@koivunej
Copy link
Contributor

Thanks for the ping @mxinden but I've been out of the loop for a while again so I cannot really comment anything. Will try to keep up to date with this and any follow-up release.

@kpp
Copy link
Contributor

kpp commented Jun 22, 2022

This will require a small change in the Authority Discovery module.

I am not sure how many changes this will require. I looked through the code and found 0 usages of Providers.

@mxinden
Copy link
Member

mxinden commented Jun 23, 2022

This will require a small change in the Authority Discovery module.

I am not sure how many changes this will require. I looked through the code and found 0 usages of Providers.

@kpp the API changes would as well apply to get_record, which Substrate is using last time I worked on it.

@dignifiedquire dignifiedquire force-pushed the feat-kad-count branch 2 times, most recently from 345a530 to 1fc7b8b Compare July 4, 2022 11:09
@mxinden
Copy link
Member

mxinden commented Jul 14, 2022

@dignifiedquire let me know once this is ready for another review. Excited for this to eventually land.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making the relevant changes and thanks for propagating them across the code-base.

I left a couple of comments. Overall the direction looks great to me.

I am of the opinion that we should apply this pattern to all query types (e.g. as well to get_closest_peers). As far as I understand no one is objecting.

I would like to keep the master branch consistent at all times. Thus I would prefer the changes to the other query types to happen in this pull request as well. Would you be willing to do that work as well @dignifiedquire?

pub providers: HashSet<PeerId>,
/// How many providers have been discovered so var.
pub providers_so_far: usize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? In case a user is interested in all of the providers found, would they not need to keep track of all of them anyways?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My use case was that I need it, and it is a lot easier to track in here, than to track externally. We might also need it internally if we want to do any early cancellations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I am fine with leaving it as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually realized it doesn't help me much after all, so I think we could savely remove this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove it then. Thanks.

@@ -2472,6 +2557,25 @@ pub enum KademliaEvent {
PendingRoutablePeer { peer: PeerId, address: Multiaddr },
}

/// Information about progress events.
#[derive(Debug, Clone)]
pub struct Index {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I proposed the name Index, I don't think it is intuitive.

In case anyone has more intuitive naming suggestions, that would be appreciated. //CC @thomaseizinger who usually has good ideas.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also dislike the name, but haven't had any better ideas 😓

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ProgressStep?

pub struct ProgressStep {
    index: usize,
    last: bool
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I find ProgressStep a lot more intuitive than Index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 2414 to 2488
/// An outbound query has produced a result.
OutboundQueryCompleted {
/// An outbound query has finished.
OutboundQueryProgressed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the doc comment still up to date? Should it not be "An outbound query progressed"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/// The ID of the query that finished.
id: QueryId,
/// The result of the query.
/// The optional final result of the query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would "The interim result of the query" not be more descriptive? I find this doc comment confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

result: QueryResult,
/// Execution statistics from the query.
stats: QueryStats,
/// Indicates which event this is, if therer are multiple responses for a single request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Indicates which event this is, if therer are multiple responses for a single request.
/// Indicates which event this is, if therer are multiple responses for a single query.

I think we refer to it as query everywhere else, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
});
}
QuickCheck::new().tests(10).quickcheck(prop as fn(_))
}

fn get_providers_limit<const N: usize>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our first const generic 🎉

@dignifiedquire
Copy link
Member Author

I would like to keep the master branch consistent at all times. Thus I would prefer the changes to the other query types to happen in this pull request as well. Would you be willing to do that work as well @dignifiedquire?

Yeah, I'll work on it, just wanted to get the structure right first.

@dignifiedquire
Copy link
Member Author

@mxinden if we remove the quorum from get_record, is your expectation that there is no termination condition anymore, and the caller always has to manually terminate, or should there be a default termination?

@mxinden
Copy link
Member

mxinden commented Aug 10, 2022

@mxinden if we remove the quorum from get_record, is your expectation that there is no termination condition anymore, and the caller always has to manually terminate, or should there be a default termination?

I would expect the QueryPeerIter to eventually run out of peers and thus I would expect the query to eventually terminate.

Do you think this is too implicit / not intuitive for users? In other words should we consider the potentially unnecessary work significant and thus the whole API a footgun?

@mxinden
Copy link
Member

mxinden commented Sep 23, 2022

Friendly ping @dignifiedquire. Would be unfortunate for this to go stale. Anything you would need from my end?

Do I understand correctly, that this is the only reason why iroh depends on a fork of rust-libp2p?

@dignifiedquire
Copy link
Member Author

Do I understand correctly, that this is the only reason why iroh depends on a fork of rust-libp2p?

yes, that is correct

@dignifiedquire
Copy link
Member Author

Do you think this is too implicit / not intuitive for users? In other words should we consider the potentially unnecessary work significant and thus the whole API a footgun?

I am a little afraid that could happen to be honest.

@dignifiedquire
Copy link
Member Author

Friendly ping @dignifiedquire. Would be unfortunate for this to go stale. Anything you would need from my end?

Honestly I am not sure how much sense it really makes to change the others behaviour atm, other than the changes I made so far. I will likely have a better understanding down the line when starting to use the rest of the API similarly intensely as I do provider finding atm.

So I would love if we can find something close to what is there that we can merge, and I (or others) can work on improving the other pieces of the API as needed.

@mxinden
Copy link
Member

mxinden commented Sep 27, 2022

Honestly I am not sure how much sense it really makes to change the others behaviour atm, other than the changes I made so far.

Do you see any issues with applying the new mechanism to GetRecord?

I understand your reasoning on only doing this change to the GetRecord system once we have a deeper understanding of it. That said, I do want to keep the libp2p-kad API surface consistent, i.e. not offer two different mechanisms to execute operations on the DHT. I argue that this consistency helps both user experience and maintainability. Thus I suggest to either include the changes across the entire API surface, or delaying this pull request until we have more information, but not merge as is.

@dignifiedquire
Copy link
Member Author

@mxinden fair enough, I updated the GetRecord api, looks pretty alright to me in tests. Do you think that is enough, in terms of updates or do you want any other api calls to be changed to return multiple results?

@dignifiedquire
Copy link
Member Author

Some small suggestions

thanks a lot, merged them all

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for bearing with me @dignifiedquire!

@rkuhn
Copy link
Contributor

rkuhn commented Nov 30, 2022

Hmm, this is quite a lot to read: now that I’m facing a somewhat non-obvious upgrade for ipfs-embed (as I’ve never used Kademlia), what is the upgrade path for the quorum removal from get_record? @mxinden @dignifiedquire

@mxinden
Copy link
Member

mxinden commented Dec 2, 2022

@rkuhn can you ping me on your upgrade pull request? Happy to guide or provide patches.

I am assuming that you already read the changelog entry. Unfortunately that is not a step-by-step instruction manual.

@rkuhn
Copy link
Contributor

rkuhn commented Dec 2, 2022

Here’s what I came up with, hope it isn’t too far off :-)

mxinden added a commit that referenced this pull request Dec 13, 2022
With #2712 merged, this can be marked as done.
mergify bot pushed a commit that referenced this pull request Dec 13, 2022
jxs pushed a commit to jxs/rust-libp2p that referenced this pull request Dec 14, 2022
paritytech-processbot bot pushed a commit to paritytech/substrate that referenced this pull request Jan 5, 2023
* upgrade libp2p to 0.50.0

* on_swarm_event and on_connection_handler_event

* replace `Swarm::new` with `Swarm::with_threadpool_executor`

* on_swarm_event and on_connection_handler_event part 2

* on_swarm_event and on_connection_handler_event part 3

* on_swarm_event and on_connection_handler_event part 4

* update libp2p

* libp2p 0.50.0

* rename OutboundQueryCompleted to OutboundQueryProgressed

refs libp2p/rust-libp2p#2712

* remove unused var

* accumulate outbound_query_records until query is finished

* format code

* use p_handler instead of new_handler

#12734 (comment)

* pass ListenFailure to kademlia

#12734 (comment)

* use tokio executor in tests

#12734 (comment)

* use chrono Local::now

instead of deprecated Local::today

* remove unused vars from request_responses tests

* attempt to fix pallet UI tests

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI
bkchr added a commit to paritytech/substrate that referenced this pull request Jan 5, 2023
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded
to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of
using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish
the query. There was also another behavior change in libp2p, they stopped automatic caching on
remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from
our point of view of the network.

The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
bkchr added a commit to paritytech/substrate that referenced this pull request Jan 5, 2023
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded
to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of
using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish
the query. There was also another behavior change in libp2p, they stopped automatic caching on
remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from
our point of view of the network.

The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
bkchr added a commit to paritytech/substrate that referenced this pull request Jan 5, 2023
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded
to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of
using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish
the query. There was also another behavior change in libp2p, they stopped automatic caching on
remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from
our point of view of the network.

The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
@dariusc93 dariusc93 mentioned this pull request Feb 12, 2023
5 tasks
ltfschoen pushed a commit to ltfschoen/substrate that referenced this pull request Feb 22, 2023
* upgrade libp2p to 0.50.0

* on_swarm_event and on_connection_handler_event

* replace `Swarm::new` with `Swarm::with_threadpool_executor`

* on_swarm_event and on_connection_handler_event part 2

* on_swarm_event and on_connection_handler_event part 3

* on_swarm_event and on_connection_handler_event part 4

* update libp2p

* libp2p 0.50.0

* rename OutboundQueryCompleted to OutboundQueryProgressed

refs libp2p/rust-libp2p#2712

* remove unused var

* accumulate outbound_query_records until query is finished

* format code

* use p_handler instead of new_handler

paritytech#12734 (comment)

* pass ListenFailure to kademlia

paritytech#12734 (comment)

* use tokio executor in tests

paritytech#12734 (comment)

* use chrono Local::now

instead of deprecated Local::today

* remove unused vars from request_responses tests

* attempt to fix pallet UI tests

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI
ltfschoen pushed a commit to ltfschoen/substrate that referenced this pull request Feb 22, 2023
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded
to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of
using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish
the query. There was also another behavior change in libp2p, they stopped automatic caching on
remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from
our point of view of the network.

The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* upgrade libp2p to 0.50.0

* on_swarm_event and on_connection_handler_event

* replace `Swarm::new` with `Swarm::with_threadpool_executor`

* on_swarm_event and on_connection_handler_event part 2

* on_swarm_event and on_connection_handler_event part 3

* on_swarm_event and on_connection_handler_event part 4

* update libp2p

* libp2p 0.50.0

* rename OutboundQueryCompleted to OutboundQueryProgressed

refs libp2p/rust-libp2p#2712

* remove unused var

* accumulate outbound_query_records until query is finished

* format code

* use p_handler instead of new_handler

paritytech#12734 (comment)

* pass ListenFailure to kademlia

paritytech#12734 (comment)

* use tokio executor in tests

paritytech#12734 (comment)

* use chrono Local::now

instead of deprecated Local::today

* remove unused vars from request_responses tests

* attempt to fix pallet UI tests

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the pr that upgraded
to libp2p 0.50.0 we accidentally changed this behavior. This pr brings back the old behavior of
using a qorum of one and thus, a faster discovery. After finding the first value, we directly finish
the query. There was also another behavior change in libp2p, they stopped automatic caching on
remote nodes. This pr also brings back the remote caching on nodes that are nearest to the key from
our point of view of the network.

The pr that changed the behavior in libp2p: libp2p/rust-libp2p#2712
taqtiqa-mark pushed a commit to subversive-upstream/polkadot-sdk-substrate-frame-support that referenced this pull request Sep 14, 2023
* upgrade libp2p to 0.50.0

* on_swarm_event and on_connection_handler_event

* replace `Swarm::new` with `Swarm::with_threadpool_executor`

* on_swarm_event and on_connection_handler_event part 2

* on_swarm_event and on_connection_handler_event part 3

* on_swarm_event and on_connection_handler_event part 4

* update libp2p

* libp2p 0.50.0

* rename OutboundQueryCompleted to OutboundQueryProgressed

refs libp2p/rust-libp2p#2712

* remove unused var

* accumulate outbound_query_records until query is finished

* format code

* use p_handler instead of new_handler

paritytech/substrate#12734 (comment)

* pass ListenFailure to kademlia

paritytech/substrate#12734 (comment)

* use tokio executor in tests

paritytech/substrate#12734 (comment)

* use chrono Local::now

instead of deprecated Local::today

* remove unused vars from request_responses tests

* attempt to fix pallet UI tests

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI

* restart CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants