[WIP] Feature/neo retry logic #398
Conversation
@mvanderkroon This is great, thank you.
neomodel/util.py
Outdated
        response.keys(),
    )
    break
except Exception as e:
So - this is the fun bit :-) There are a variety of things to cope with here. Various exceptions can come back depending on where in its lifecycle the cluster currently is:
ServiceUnavailable("Unable to retrieve routing information")
SessionExpired("Failed to obtain connection towards '%s' server." % access_mode)
self.Error("Failed to read from defunct connection {!r}".format(self.server.address))
AttributeError: 'NoneType' object has no attribute 'commit'
AttributeError: 'NoneType' object has no attribute 'rollback'
AttributeError: 'NoneType' object has no attribute 'protocol_version'
self.Error("Failed to read from closed connection {!r}".format(self.server.address))
self.Error("Failed to write to closed connection {!r}".format(self.server.address))
ConnectionRefusedError: [Errno 111] Connection refused
requests.exceptions.ConnectionError: HTTPConnectionPool(host='neo4j-0', port=7474): Max retries exceeded with url: /db/data/transaction/commit
(I think the AttributeErrors are coming from a TransactionProxy getting a SessionExpired exception, retrying successfully but setting db.transaction to None.)
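For concreteness, a minimal sketch of how these failure modes might be grouped into one tuple for the retry logic to catch. The import path is an assumption based on the Python driver (exact location varies by driver/neobolt version) and is not something this PR pins down:

```python
# Sketch only: the exact import path depends on the driver version in use.
from neo4j.exceptions import ServiceUnavailable, SessionExpired

# AttributeError is included to cover the "'NoneType' object has no attribute
# 'commit'/'rollback'/'protocol_version'" failures listed above, which look like
# a side effect of the transaction object being dropped after a SessionExpired.
RETRYABLE_EXCEPTIONS = (
    ServiceUnavailable,
    SessionExpired,
    ConnectionRefusedError,
    AttributeError,
)
```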
This is tricky: none of these (with perhaps the exception of the first) is very clearly 'transient', and they could just as well indicate issues that will not be resolved by simply waiting and retrying.
I think there are generally two distinct cases:
(1) There is an open, healthy connection to the neo4j cluster, but the cluster is not in a valid state, e.g. it is currently electing a new leader. Any queries issued against the cluster whilst it is in this state will cause 'transient' errors that can be mitigated by the retry logic as implemented above.
It is, however, not clear which concrete Exception the driver raises, but the neobolt codebase has a few good candidates, of which Neo.ClientError.Cluster.NotALeader seems very likely to indicate the scenario mentioned above, next to a few others that are clearly marked as transient.
(2) There are (potentially transient) network issues that prevent the driver from issuing queries to the cluster over a previously healthy, open connection. In my experience, the above exceptions tend to occur most in this scenario, and I think the solution here is different: not only should we retry issuing the query, we should also attempt to re-establish the connection (see the sketch below).
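A rough sketch of what (2) could look like, not the PR's actual implementation: db.cypher_query, db.set_connection and config.DATABASE_URL are existing neomodel APIs, while the retry parameters are arbitrary and the exception tuple is the illustrative one sketched further up the thread.

```python
import time

from neomodel import config, db

# In a real implementation this would be the tuple of transient exceptions
# discussed above; Exception is used here only to keep the sketch self-contained.
RETRYABLE_EXCEPTIONS = (Exception,)

def cypher_query_with_reconnect(query, params=None, retries=3, wait=1.0):
    """Sketch only: retry a query, re-establishing the connection between attempts."""
    for attempt in range(retries + 1):
        try:
            return db.cypher_query(query, params)
        except RETRYABLE_EXCEPTIONS:
            if attempt == retries:
                raise
            # Case (2): the previous connection may be dead, so reconnect to the
            # globally configured URL before retrying the query.
            db.set_connection(config.DATABASE_URL)
            time.sleep(wait)
```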
@robertlagrant Does this description match your experience?
Yes, that's right!
I think the neobolt NotALeader exception is a good pick for when we're now talking to a follower. As a bonus, given how it's named, I assume it will also be thrown when the node is in CANDIDATE status, before the cluster forms.
A good example of (2) is that the app has been directed (say) to node 1, which is the leader, and node 1 dies for some reason. Nodes 2 and 3 re-establish themselves as follower and leader, but while node 1 is physically coming back up it may be at a point where it's not even serving network requests, let alone have cluster info. So if the app tries to connect in that time, it almost certainly will get a connection timeout. When it tries to connect to the old node before it's got the cluster info, that's probably the service unavailable exception. I imagine that neobolt's ConnectionExpired and ServiceUnavailable exceptions will cover those cases.
In the case of ConnectionExpired, it could probably stop trying to connect to that server and connect afresh to the globally configured connection string instead, as that'll probably be a load balancer that will route the traffic to an alive node, re-get routing info and all will be well. What do you think?
Ah, I hadn't thought about the CANDIDATE stage, but I think you're right!
Your example of scenario 2 seems entirely on point; it's consistent with what we see happen every once in a while.
I agree: connect afresh to the globally configured connection string and it should resolve. I see one possible risk: there might be a small window where the load balancer has not yet picked up the unavailability of node 1, so there is a (very) small chance that connecting afresh ends up at node 1 again. I'm not sure of the likelihood of this happening (it also depends on health check interval settings and cluster size), but it might be prudent to add some retry logic for the reconnect with a configurable MAX_RETRY_SECONDS, which the user should set to a value greater than the health check interval, with a reasonable default (10 seconds?).
Yes, I think that's a good move. I'd probably suggest that setting it to 0 disables the retry entirely (a rough sketch of this follows below), in case some people want to implement such things in infrastructure instead. I can't think of an obvious example of this, but I imagine that people with a cluster-aware load balancer would want to disable it to eliminate unnecessary moving parts.
But yeah, nice catch. Hadn't thought of that.
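A tiny sketch of how that knob might behave; MAX_RETRY_SECONDS and the helper name are hypothetical, not part of this PR. A value greater than the health-check interval keeps retrying within that window, and 0 means the first failure propagates so infrastructure can take over.

```python
import time

MAX_RETRY_SECONDS = 10.0  # illustrative default; users would override this

def should_retry(started_at):
    # With MAX_RETRY_SECONDS == 0 this is always False, so the first failure
    # is raised immediately and any retrying is left to the infrastructure.
    return time.monotonic() - started_at < MAX_RETRY_SECONDS
```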
@robertlagrant is there any chance you can try this to see if it resolves the issues?
@robertlagrant ping?
…usterstate related (or otherwise indicated as transient)
def __str__(self):
    node_labels = ",".join(self.db_node.labels)
    ncr_items = list(map(lambda x: "{} --> {}".format(",".join(x[0]), x[1]),
                         self.current_node_class_registry.items()))
ncr_items = list(
This is just a style/preference thing, but it might be more readable to do the following:
ncr_items_multiline = "\n".join(
    "{} --> {}".format(",".join(node_labels), model_class)
    for node_labels, model_class in self.current_node_class_registry.items()
)
That definitely looks better. I use an autoformatter (black) that wreaked havoc on the existing formatting, hence the large number of changes.
The question is whether the implemented logic solves your issues; can you check this?
@mvanderkroon I will! I just came here to ask if it's ready for a test. I'll give it a go.
I'm still getting a NotALeaderError. I can't see why, as I'm a bear of very little brain. I double-checked that the code I'm using is yours, and I can see the except (ServiceUnavailableException, NotALeaderError, DatabaseUnavailableError clause in your code, so that's very odd!
Here's the traceback:
2019-01-20 21:02:54.855 | INFO | commands:do_second:80 - Attempting to write data to neo4j-2
2019-01-20 21:02:54.871 | ERROR | commands:do_second:84 - Traceback (most recent call last):
File "/commands.py", line 81, in do_second
write_data()
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 317, in wrapper
return func(*args, **kwargs)
File "/commands.py", line 56, in write_data
return db.cypher_query(INSERT_QUERY)
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 39, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 261, in cypher_query
raise exc_info[1].with_traceback(exc_info[2])
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in cypher_query
results, meta = ([list(r.values()) for r in response], response.keys())
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in <listcomp>
results, meta = ([list(r.values()) for r in response], response.keys())
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 736, in records
self._session.fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 347, in fetch
detail_count, _ = self._connection.fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 290, in fetch
raise error
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 287, in fetch
return self._fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 327, in _fetch
response.on_failure(summary_metadata or {})
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/result.py", line 70, in on_failure
raise CypherError.hydrate(**metadata)
neo4j.exceptions.NotALeaderError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER
@robertlagrant, @mvanderkroon There seem to be certain events that do not trigger a sync from the database server towards the driver. In the past, I have experienced this in relation to transactions, where an operation caused an error server-side that only became known client-side during a different transaction. This was confusing because it flagged errors at points that were not directly related to the cause.
Therefore, do you think you could try a sync on the session at the point that is likely to be causing the problem? Perhaps this is one of those cases where the server has updated its state but this has not been propagated back to the client side. So, upon the first "fail", do a sync first and THEN retry the transaction. Hopefully this will update the client about which server has become the leader. Just an idea.
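A hedged illustration of that suggestion, assuming access to the driver session that ran the failing query and a 1.x-style Session.sync() (a full send/receive); the helper name is made up for the example.

```python
def sync_then_retry(session, run_transaction):
    """Sketch only: flush outstanding messages so the client picks up server-side
    state (e.g. a newly elected leader) before re-running the transaction."""
    try:
        session.sync()
    except Exception:
        # If the sync itself fails, the connection is probably gone; let the
        # reconnect/retry path discussed earlier in the thread handle it.
        pass
    return run_transaction()
```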
@robertlagrant, @mvanderkroon Shall we include this change in an upcoming release? Did you guys try the sync before re-trying?
@aanastasiou I am busy these weeks with other activities, so I was not able to test it just yet. At the earliest I'd be able to look into it at the beginning of next week.
@mvanderkroon Of course, I understand, there is no rush. If I have some time on my side I will give it a try too.
All the best
Update from my end: we have been running into exactly the same issue as @robertlagrant (countless NOTALEADER errors). However, in this case it seems unrelated to neomodel (or even the Python neo4j driver): for us, causal clustering is just incredibly unstable. The causal cluster in fact gets stuck in a continuous LEADER election loop. I'm therefore quite certain that adding more retry logic makes little sense on the application side.
To me, 'causal clustering' is either fundamentally broken, or we are unlucky enough to have hit some sort of edge-case bug. So far our experience with causal clustering has been far from ideal: unstable all the way through, and write performance is ~6x worse than against an unclustered instance. Our solution so far is to reduce the number of core nodes to 1 (essentially losing cluster capabilities) and keep our fingers crossed until we can migrate away from neo4j.
@mvanderkroon Thanks for letting me know. @robertlagrant, is there anything to add from your end on this?
Possibly related: there are certain conditions under which the client does not get updated about error conditions immediately, unless it performs a sync.
Can this be merged? What would need to be done to merge it? Or should it be closed?
Adds retry logic to the query execution code, which should mitigate high-level, transient neo4j errors such as the ones mentioned in #335.