[WIP] Feature/neo retry logic #398
Conversation
@mvanderkroon This is great, thank you.
neomodel/util.py
Outdated
        response.keys(),
    )
    break
except Exception as e:
So - this is the fun bit :-) There are a variety of things to cope with here. Various exceptions can come back depending on where in its lifecycle the cluster currently is:
ServiceUnavailable("Unable to retrieve routing information")
SessionExpired("Failed to obtain connection towards '%s' server." % access_mode)
self.Error("Failed to read from defunct connection {!r}".format(self.server.address))
AttributeError: 'NoneType' object has no attribute 'commit'
AttributeError: 'NoneType' object has no attribute 'rollback'
AttributeError: 'NoneType' object has no attribute 'protocol_version'
self.Error("Failed to read from closed connection {!r}".format(self.server.address))
self.Error("Failed to write to closed connection {!r}".format(self.server.address))
ConnectionRefusedError: [Errno 111] Connection refused
requests.exceptions.ConnectionError: HTTPConnectionPool(host='neo4j-0', port=7474): Max retries exceeded with url: /db/data/transaction/commit
(I think the AttributeErrors are coming from a TransactionProxy getting a SessionExpired exception, retrying successfully but setting db.transaction to None.)
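For concreteness, a minimal sketch of how these failure modes might be grouped into one tuple for the retry logic to catch. The import path is an assumption based on the Python driver (exact location varies by driver/neobolt version) and is not something this PR pins down:

```python
# Sketch only: the exact import path depends on the driver version in use.
from neo4j.exceptions import ServiceUnavailable, SessionExpired

# AttributeError is included to cover the "'NoneType' object has no attribute
# 'commit'/'rollback'/'protocol_version'" failures listed above, which look like
# a side effect of the transaction object being dropped after a SessionExpired.
RETRYABLE_EXCEPTIONS = (
    ServiceUnavailable,
    SessionExpired,
    ConnectionRefusedError,
    AttributeError,
)
```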
This is tricky: none of these (with perhaps the exception of the first) is very clearly 'transient', and they could just as well indicate issues that will not be resolved by simply waiting and retrying.
I think there are generally two distinct cases:
(1) There is an open, healthy connection to the neo4j cluster, but the cluster is not in a valid state, e.g. it is currently electing a new leader. Any queries issued against the cluster whilst it is in this state will cause 'transient' errors that can be mitigated by the retry logic as implemented above.
It is, however, not clear which concrete Exception the driver raises, but the neobolt codebase has a few good candidates, of which Neo.ClientError.Cluster.NotALeader seems very likely to indicate the scenario mentioned above, next to a few others that are clearly marked as transient.
(2) There are (potentially transient) network issues that prevent the driver from issuing queries to the cluster over a previously healthy, open connection. In my experience, the above exceptions tend to occur most in this scenario, and I think the solution here is different: not only should we retry issuing the query, we should also attempt to re-establish the connection (see the sketch below).
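A rough sketch of what (2) could look like, not the PR's actual implementation: db.cypher_query, db.set_connection and config.DATABASE_URL are existing neomodel APIs, while the retry parameters are arbitrary and the exception tuple is the illustrative one sketched further up the thread.

```python
import time

from neomodel import config, db

# In a real implementation this would be the tuple of transient exceptions
# discussed above; Exception is used here only to keep the sketch self-contained.
RETRYABLE_EXCEPTIONS = (Exception,)

def cypher_query_with_reconnect(query, params=None, retries=3, wait=1.0):
    """Sketch only: retry a query, re-establishing the connection between attempts."""
    for attempt in range(retries + 1):
        try:
            return db.cypher_query(query, params)
        except RETRYABLE_EXCEPTIONS:
            if attempt == retries:
                raise
            # Case (2): the previous connection may be dead, so reconnect to the
            # globally configured URL before retrying the query.
            db.set_connection(config.DATABASE_URL)
            time.sleep(wait)
```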
@robertlagrant Does this description match your experience?
Yes, that's right!
I think the neobolt NotALeader exception is a good pick for when we're now talking to a follower. As a bonus, given how it's named, I assume it will also be thrown when the node is in CANDIDATE status, before the cluster forms.
A good example of (2) is that the app has been directed (say) to node 1, which is the leader, and node 1 dies for some reason. Nodes 2 and 3 re-establish themselves as follower and leader, but while node 1 is physically coming back up it may be at a point where it's not even serving network requests, let alone have cluster info. So if the app tries to connect in that time, it almost certainly will get a connection timeout. When it tries to connect to the old node before it's got the cluster info, that's probably the service unavailable exception. I imagine that neobolt's ConnectionExpired and ServiceUnavailable exceptions will cover those cases.
In the case of ConnectionExpired, it could probably stop trying to connect to that server and connect afresh to the globally configured connection string instead, as that'll probably be a load balancer that will route the traffic to an alive node, re-get routing info and all will be well. What do you think?
Ah, I hadn't thought about the CANDIDATE stage, but I think you're right!
Your example of scenario 2 seems entirely on point; it's consistent with what we see happen every once in a while.
I agree: connect afresh to the globally configured connection string and it should resolve. I see one possible risk: there might be a small window where the load balancer has not yet picked up the unavailability of node 1, so there is a (very) small chance that connecting afresh ends up at node 1 again. I'm not sure of the likelihood of this happening (it also depends on health check interval settings and cluster size), but it might be prudent to add some retry logic for the reconnect with a configurable MAX_RETRY_SECONDS, which the user should set to a value greater than the health check interval, with a reasonable default (10 seconds?).
Yes, I think that's a good move. I'd probably suggest that setting it to 0 disables the retry entirely (a rough sketch of this follows below), in case some people want to implement such things in infrastructure instead. I can't think of an obvious example of this, but I imagine that people with a cluster-aware load balancer would want to disable it to eliminate unnecessary moving parts.
But yeah, nice catch. Hadn't thought of that.
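A tiny sketch of how that knob might behave; MAX_RETRY_SECONDS and the helper name are hypothetical, not part of this PR. A value greater than the health-check interval keeps retrying within that window, and 0 means the first failure propagates so infrastructure can take over.

```python
import time

MAX_RETRY_SECONDS = 10.0  # illustrative default; users would override this

def should_retry(started_at):
    # With MAX_RETRY_SECONDS == 0 this is always False, so the first failure
    # is raised immediately and any retrying is left to the infrastructure.
    return time.monotonic() - started_at < MAX_RETRY_SECONDS
```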
@robertlagrant is there any chance you can try this to see if it resolves the issues?
@robertlagrant ping?
…usterstate related (or otherwise indicated as transient)
def __str__(self):
    node_labels = ",".join(self.db_node.labels)
    ncr_items = list(map(lambda x: "{} --> {}".format(",".join(x[0]), x[1]),
                         self.current_node_class_registry.items()))
ncr_items = list(
This is just a style/preference thing, but it might be more readable to do the following:
ncr_items_multiline = "\n".join(
    "{} --> {}".format(",".join(node_labels), model_class)
    for node_labels, model_class in self.current_node_class_registry.items()
)
That definitely looks better. I use an autoformatter (black) that wreaked havoc on the existing formatting, hence the large number of changes.
The question is whether the implemented logic solves your issues; can you check this?
@mvanderkroon I will! I just came here to ask if it's ready for a test. I'll give it a go.
I'm still getting a NotALeaderError. I can't see why, as I'm a bear of very little brain. I double-checked that the code I'm using is yours, and I can see the except (ServiceUnavailableException, NotALeaderError, DatabaseUnavailableError clause in your code, so that's very odd!
Here's the traceback:
2019-01-20 21:02:54.855 | INFO | commands:do_second:80 - Attempting to write data to neo4j-2
2019-01-20 21:02:54.871 | ERROR | commands:do_second:84 - Traceback (most recent call last):
File "/commands.py", line 81, in do_second
write_data()
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 317, in wrapper
return func(*args, **kwargs)
File "/commands.py", line 56, in write_data
return db.cypher_query(INSERT_QUERY)
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 39, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 261, in cypher_query
raise exc_info[1].with_traceback(exc_info[2])
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in cypher_query
results, meta = ([list(r.values()) for r in response], response.keys())
File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in <listcomp>
results, meta = ([list(r.values()) for r in response], response.keys())
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 736, in records
self._session.fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 347, in fetch
detail_count, _ = self._connection.fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 290, in fetch
raise error
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 287, in fetch
return self._fetch()
File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 327, in _fetch
response.on_failure(summary_metadata or {})
File "/usr/local/lib/python3.7/site-packages/neo4j/v1/result.py", line 70, in on_failure
raise CypherError.hydrate(**metadata)
neo4j.exceptions.NotALeaderError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER
@robertlagrant, @mvanderkroon There seem to be certain events that do not trigger a sync from the database server towards the driver. In the past, I have experienced this in relation to transactions, where an operation caused an error server-side that only became known client-side during a different transaction. This was confusing because it flagged errors at points that were not directly related to the cause.
Therefore, do you think you could try a sync on the session at the point that is likely to be causing the problem? Perhaps this is one of those cases where the server has updated its state but this has not been propagated back to the client side. So, upon the first "fail", do a sync first and THEN retry the transaction. Hopefully this will update the client about which server has become the leader. Just an idea.
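A hedged illustration of that suggestion, assuming access to the driver session that ran the failing query and a 1.x-style Session.sync() (a full send/receive); the helper name is made up for the example.

```python
def sync_then_retry(session, run_transaction):
    """Sketch only: flush outstanding messages so the client picks up server-side
    state (e.g. a newly elected leader) before re-running the transaction."""
    try:
        session.sync()
    except Exception:
        # If the sync itself fails, the connection is probably gone; let the
        # reconnect/retry path discussed earlier in the thread handle it.
        pass
    return run_transaction()
```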
@robertlagrant, @mvanderkroon Shall we include this change in an upcoming release? Did you guys try the sync before re-trying?
@aanastasiou I am busy these weeks with other activities, so I was not able to test it just yet. At the earliest I'd be able to look into it at the beginning of next week.
@mvanderkroon Of course, I understand, there is no rush. If I have some time on my side I will give it a try too.
All the best
Update from my end: we have been running into exactly the same issue as @robertlagrant (countless NOTALEADER errors). However, in this case it seems unrelated to neomodel (or even the Python neo4j driver): for us, causal clustering is just incredibly unstable. The causal cluster in fact gets stuck in a continuous LEADER election loop. I'm therefore quite certain that adding more retry logic makes little sense on the application side.
To me, 'causal clustering' is either fundamentally broken, or we are unlucky enough to have hit some sort of edge-case bug. So far our experience with causal clustering has been far from ideal: unstable all the way through, and write performance is ~6x worse than against an unclustered instance. Our solution so far is to reduce the number of core nodes to 1 (essentially losing cluster capabilities) and keep our fingers crossed until we can migrate away from neo4j.
@mvanderkroon Thanks for letting me know. @robertlagrant, is there anything to add from your end on this?
Possibly related: there are certain conditions under which the client does not get updated about error conditions immediately, unless it performs a sync.
Can this be merged? What would need to be done to merge it? Or should it be closed?
Adds retry logic to the query execution code, which should mitigate high-level, transient neo4j errors such as the ones mentioned in #335.