-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
discovery+graph: track job set dependencies in vb #9241
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
7acf321
to
fc00572
Compare
ValidationBarrier
ValidationBarrier
fc72083
to
7d95cd2
Compare
cc: @gijswijs for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding the distinction between parent and child jobs here both confusing and unnecessary. What we have here is a dependency graph of undifferentiated jobIDs. once all of the dependencies have run we can run. once we run we want to signal all of our dependents. We should be able to accomplish this with a single removeJob
that does this index cleanup and dependent signaling.
The main difficulty I'm noticing in this PR is that we have multiple IDs that we want to be able to map to JobID
s from disjoint domains. My recommendation here is to make the core algebra of this component undifferentiated and then have auxilliary mappings that help recover the relevant JobID
from the other unique protocol identifiers.
graph/validation_barrier.go
Outdated
annID = msg.ShortChannelID | ||
|
||
// TODO: If ok is false, we have serious issues. | ||
parentJobIDs, ok = v.jobDependencies[childJobID] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this read not need a mutex lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's locked above no?
graph/validation_barrier.go
Outdated
// We don't want to block when sending out the signal. | ||
select { | ||
case notifyChan <- struct{}{}: | ||
default: | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok with swallowing the signal instead? Seems like this could case jobs to never be run, particularly if lastJob
is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean here. 1. How would jobs not be run in the current scenario? 2. What does swallowing the signal look like here?
cc: @gijswijs for review |
Outstanding things to do:
|
@Crypt-iQ - is this ready for re-review given the itest failures? |
Sorry no, it seems like I broke something |
now it is, the fn change broke it |
I could not figure out a decent way to use generics to accomplish this, so
will update this code with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this change. 👍 Somehow, for me, the old solution with the depency maps is harder to reason about. I like the concept of jobInfoMap
better.
One thing I found (somewhat in line with @ProofOfKeags earlier comment) is that the usage of the words dependency, dependent and parent, child respectively is inconsistent throughout. The code would be more readable if we would stick to one pair. If we choose parent/child we don't need to fight anymore about british vs us spelling, which is an added benefit.
Apart from that I made some small comments in the code.
graph/validation_barrier.go
Outdated
jobDesc = fmt.Sprintf("job=lnwire.NodeAnnouncement, pub=%s", | ||
vertex) | ||
route.Vertex(msg.NodeID)) | ||
|
||
// Other types of jobs can be executed immediately, so we'll just | ||
// return directly. | ||
case *lnwire.AnnounceSignatures1: | ||
// TODO(roasbeef): need to wait on chan ann? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe return nil
here and remove the following block below:
// If it's not ok, it means either the job is not a dependent type, or
// it doesn't have a dependency signal. Either way, we can return
// early.
if !ok {
return nil
}
Same for the case
after this one.
Less loc and to my mind clearer intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
graph/validation_barrier_test.go
Outdated
@@ -120,11 +131,19 @@ func TestValidationBarrierQuit(t *testing.T) { | |||
// Signal completion for the first half of tasks, but only allow | |||
// dependents to be processed as well for the second quarter. | |||
case i < numTasks/4: | |||
barrier.SignalDependants(anns[i], false) | |||
err := barrier.SignalDependents( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the deny
and allow
logic the first case can be removed. There is no difference with the second case anymore, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, good catch
Addressed comments, but may need to hold off on upgrading the |
@Crypt-iQ is this ready for another round or do you want to address CI first? |
I think for this to be ready for review, the changes made to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think those fn
updates are all in now so can rebase here 👍
@ellemouton: review reminder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, but there is go-routine leak that should be addressed 🙏
it is fixed if the validation barrier is completely removed from the graphdb.Builder
// If this message has an existing dependency, then we'll wait until | ||
// that has been fully validated before we proceed. | ||
err := vb.WaitForDependants(update.msg) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move this to the discovery
package since it is only used there now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah - i see there is still an instantiation of it here - but do we need it at all?
afaict, it has no effect anymore since the only call we do on it is CompleteJob
which only serves a purpose if other processes are taking tokens out of the semaphore as well.
In fact, I actually think because of how it is used now: All calls to CompleteJob
will actually hang until the builder shuts down. The only reason it shuts down successfully now is cause of the quit channel of the builder (and hence the validation barrier's quit channel) gets closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch
graph/builder.go
Outdated
// concerned about. | ||
if notError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can just
if err != nil {
// Log as a debug message if this is not an error we need to be
// concerned about.
if IsError(err, ErrIgnored, ErrOutdated) {
log.Debugf("process network updates got: %v", err)
} else {
log.Errorf("process network updates got: %v", err)
}
return
}
no need for the noError
variable. It will also only ever be checked if err != nil
so no need for the err == nil
clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
} | ||
|
||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just want to note that in future, we will likely have the Builder's AddNode/AddChannel/UpdateChannel calls be exposed via RPC for clients who dont want to sync the graph themselves. So initially i was thinking we may want to keep the validation barrier for that use case but after having thought about it more, i think it is fine to remove since those clients will always be trusted AND those calls will be done in sync with the gossiper calls performed by those clients.
So, ACK from my side for removal here 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as they're done in sync with the gossiper calls and the semaphore logic of the VB is used, that all sounds good and should work fine.
This omits calls to InitJobDependencies, SignalDependants, and WaitForDependants. These changes have been made here because the router / builder code does not actually need job dependency management. Calls to the builder code (i.e. AddNode, AddEdge, UpdateEdge) are all blocking in the gossiper. This, combined with the fact that child jobs are run after parent jobs in the gossiper, means that the calls to the router will happen in the proper dependency order.
This commit does two things: - removes the concept of allow / deny. Having this in place was a minor optimization and removing it makes the solution simpler. - changes the job dependency tracking to track sets of abstact parent jobs rather than individual parent jobs. As a note, the purpose of the ValidationBarrier is that it allows us to launch gossip validation jobs in goroutines while still ensuring that the validation order of these goroutines is adhered to when it comes to validating ChannelAnnouncement _before_ ChannelUpdate and _before_ NodeAnnouncement.
Addressed comments @ellemouton , whenever you're ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a thorough review and I have no further remarks, so 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the updates! 🙏
This PR changes the
ValidationBarrier
to track abstract job dependencies. This just means that every time a child job comes in (i.e. channel update or node announcement), we track the set of possible parent jobs that are related to it (channel announcement(s)) that have registered viaInitJobDependencies
. The goroutines containing the child jobs will then wait to be notified every time one of their parent jobs completes. From the child job's POV, this just works as ref-counting except that you're only counting the parent jobs you're interested in.With this, we can now extend the
ValidationBarrier
to track any sort of abstract job that requires both concurrency and waiting for another job to finish. It also makes it possible in a future PR to very easily make node announcements depend on channel announcements. See the commit messages for more details.