[WIP] Feature/neo retry logic #398
So - this is the fun bit :-) There are a variety of things to cope with here. Various exceptions can come back depending on where in its lifecycle the cluster currently is:
(I think the `AttributeError`s are coming from a `TransactionProxy` getting a `SessionExpired` exception, retrying successfully but setting `db.transaction` to `None`.)
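For context, a minimal sketch of the kind of caller code that runs into this (assuming neomodel's `db.transaction` context manager and `db.cypher_query`; the retry/reset behaviour is paraphrased from the comment above, not taken from this PR's diff):

```python
from neomodel import db

def rename_person(uid, name):
    # Ordinary caller code wrapped in an explicit transaction.
    with db.transaction:
        db.cypher_query(
            "MATCH (p:Person {uid: $uid}) SET p.name = $name",
            {"uid": uid, "name": name},
        )
    # If a SessionExpired is raised mid-transaction and the proxy retries but
    # leaves the active transaction as None, the commit on exit (or the next
    # operation on the proxy) surfaces as an AttributeError on None instead of
    # the original SessionExpired.
```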
This is tricky: none of these (with perhaps the only exception being the first) are very clearly 'transient', and they could just as well indicate issues that will not be resolved by simply waiting and retrying.
I think there are generally two distinct cases:
(1) There is an open/healthy connection to the neo4j cluster. However, the cluster is not in a valid state, e.g. it is currently electing a new leader. Any queries issued against the cluster whilst it is in this invalid state will cause 'transient' errors that can be mitigated by the retry logic as implemented above. It is, however, not clear which concrete `Exception` the driver raises, but the neobolt codebase has a few good candidates, of which `Neo.ClientError.Cluster.NotALeader` seems very likely to indicate the above-mentioned scenario, next to a few others that are clearly marked as transient.
(2) There are (potentially transient) network issues that prevent the driver from issuing queries to the cluster over a previously healthy and open connection. In my experience, the above exceptions tend to occur most in this scenario, and I think the solution here is different: not only should we retry issuing the query, but we should also attempt to reinitiate the connection.
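To make (1) concrete, here is a minimal sketch of what such a retry wrapper could look like (the import path `neobolt.exceptions`, the exact exception classes, and the `code` attribute are assumptions based on reading neobolt, not something taken from this PR's diff):

```python
import time

from neobolt.exceptions import ClientError, TransientError  # assumed import path

# Error codes we treat as "cluster temporarily not in a valid state".
RETRYABLE_CODES = {"Neo.ClientError.Cluster.NotALeader"}

def run_with_retry(run_query, max_attempts=3, delay=1.0):
    """Retry a callable when the cluster reports a transient / leader-election error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except (TransientError, ClientError) as exc:
            code = getattr(exc, "code", "")
            retryable = isinstance(exc, TransientError) or code in RETRYABLE_CODES
            if not retryable or attempt == max_attempts:
                raise  # not a cluster-state problem, or out of attempts
            time.sleep(delay)
```

Usage would then be something like `run_with_retry(lambda: db.cypher_query(query, params))`.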
@robertlagrant Does this description match your experience?
Yes, that's right!
I think the neobolt NotALeader exception is a good pick for when we're now talking to a follower. As a bonus, given how it's named, I assume it will also be thrown when the node is in CANDIDATE status, before the cluster forms.
A good example of (2) is that the app has been directed (say) to node 1, which is the leader, and node 1 dies for some reason. Nodes 2 and 3 re-establish themselves as follower and leader, but while node 1 is physically coming back up it may be at a point where it's not even serving network requests, let alone holding cluster info. So if the app tries to connect in that time, it will almost certainly get a connection timeout. When it tries to connect to the old node before it's got the cluster info, that's probably the service unavailable exception. I imagine that neobolt's ConnectionExpired and ServiceUnavailable exceptions will cover those cases.
In the case of ConnectionExpired, it could probably stop trying to connect to that server and connect afresh to the globally configured connection string instead, as that'll probably be a load balancer that will route the traffic to a live node; the driver re-gets routing info and all will be well. What do you think?
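A rough sketch of that "connect afresh" idea, assuming neomodel's `db.set_connection` and the globally configured `config.DATABASE_URL` (the `neobolt.exceptions` import path is again an assumption):

```python
from neobolt.exceptions import ConnectionExpired, ServiceUnavailable  # assumed path
from neomodel import config, db

def run_with_reconnect(run_query):
    """On a dead or expired connection, rebuild it from the global URL and retry once."""
    try:
        return run_query()
    except (ConnectionExpired, ServiceUnavailable):
        # Drop the connection state tied to the dead node and go back through the
        # configured endpoint (typically a load balancer), re-fetching routing info.
        db.set_connection(config.DATABASE_URL)
        return run_query()
```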
Ah, I hadn't thought about the CANDIDATE stage, but I think you're right!
Your example of scenario 2 seems entirely on point; it's consistent with what we see happen every once in a while.
I agree: connect afresh to the globally configured connection string and it should resolve. I see one possible risk: there might be a small window where the load balancer has not yet picked up the unavailability of node 1, so there is a (very) small chance that connecting afresh ends up at node 1 again. I'm not sure of the likelihood of this happening (it also depends on health check interval settings and cluster size), but it might be prudent to ensure some retry logic for the reconnect with a configurable `MAX_RETRY_SECONDS` (which should then be set by the user to a value greater than the health check interval, with a reasonable default - 10 seconds?).
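A sketch of how that bounded reconnect could look (`MAX_RETRY_SECONDS` is the hypothetical setting proposed here; the neomodel/neobolt names are the same assumptions as above, and this also assumes `set_connection` raises when the target is unreachable):

```python
import time

from neobolt.exceptions import ConnectionExpired, ServiceUnavailable  # assumed path
from neomodel import config, db

MAX_RETRY_SECONDS = 10  # proposed default; should exceed the LB health check interval

def reconnect_with_retry(max_retry_seconds=MAX_RETRY_SECONDS):
    """Keep re-pointing at the configured URL until it works or the time budget runs out."""
    deadline = time.monotonic() + max_retry_seconds
    while True:
        try:
            db.set_connection(config.DATABASE_URL)
            return
        except (ConnectionExpired, ServiceUnavailable):
            if time.monotonic() >= deadline:
                raise  # budget exhausted; surface the underlying error
            time.sleep(1)  # brief pause so we don't hammer a recovering node
```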
Yes, I think that's a good move. I'd probably suggest that if they choose to set it to `0`, the retry is disabled, in case some people want to implement such things in infrastructure instead. I can't think of an obvious example of this, but I imagine if people have a cluster-aware load balancer then they'd want to disable it to eliminate unnecessary moving parts.
But yeah, nice catch. Hadn't thought of that.
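The opt-out could then be a simple guard before the reconnect attempt; a sketch reusing the hypothetical names from the earlier snippets (`MAX_RETRY_SECONDS`, `reconnect_with_retry`, and the assumed neobolt exception classes):

```python
def run_with_reconnect(run_query, max_retry_seconds=MAX_RETRY_SECONDS):
    try:
        return run_query()
    except (ConnectionExpired, ServiceUnavailable):
        if max_retry_seconds <= 0:
            raise  # retry disabled: let external infrastructure handle recovery
        reconnect_with_retry(max_retry_seconds)
        return run_query()
```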
@robertlagrant is there any chance you can try this to see if it resolves the issues?
@robertlagrant ping?