
[WIP] Feature/neo retry logic #398

Closed

Conversation

mvanderkroon

Adds retry logic to the query execution code, which should mitigate high-level, transient neo4j errors such as the ones mentioned in #335.
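
A minimal sketch of the kind of wrapper this describes, assuming a simple count-based loop around neomodel's db.cypher_query; RETRY_LIMIT, RETRY_DELAY and the caught exception types are illustrative assumptions, not the PR's actual code:

import time

from neo4j.exceptions import ServiceUnavailable, TransientError  # import paths vary by driver version
from neomodel import db

RETRY_LIMIT = 3    # illustrative: how many attempts before giving up
RETRY_DELAY = 1.0  # illustrative: seconds to wait between attempts


def run_with_retry(query, params=None):
    """Retry a Cypher query a few times when a (likely) transient error occurs."""
    for attempt in range(1, RETRY_LIMIT + 1):
        try:
            return db.cypher_query(query, params)
        except (ServiceUnavailable, TransientError):
            if attempt == RETRY_LIMIT:
                raise  # give up and surface the original error
            time.sleep(RETRY_DELAY)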

@aanastasiou
Collaborator

@mvanderkroon This is great, thank you.

neomodel/util.py
response.keys(),
)
break
except Exception as e:
Contributor

So - this is the fun bit :-) There are a variety of things to cope with here. Various exceptions can come back depending on where in its lifecycle the cluster currently is:

ServiceUnavailable("Unable to retrieve routing information") 
SessionExpired("Failed to obtain connection towards '%s' server." % access_mode) 
self.Error("Failed to read from defunct connection {!r}".format(self.server.address)) 
AttributeError: 'NoneType' object has no attribute 'commit' 
AttributeError: 'NoneType' object has no attribute 'rollback' 
AttributeError: 'NoneType' object has no attribute 'protocol_version' 
self.Error("Failed to read from closed connection {!r}".format(self.server.address)) 
self.Error("Failed to write to closed connection {!r}".format(self.server.address)) 
ConnectionRefusedError: [Errno 111] Connection refused 
requests.exceptions.ConnectionError: HTTPConnectionPool(host='neo4j-0', port=7474): Max retries exceeded with url: /db/data/transaction/commit

(I think the AttributeErrors are coming from a TransactionProxy getting a SessionExpired exception, retrying successfully but setting db.transaction to None.)
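
One way to think about the list above is to group the errors into candidates worth retrying. A sketch under the assumption that the bolt driver is in use (class names and import paths differ between driver versions, and the HTTP case, requests.exceptions.ConnectionError, would need its own entry):

from neo4j.exceptions import ServiceUnavailable, SessionExpired  # older drivers keep these in neobolt.exceptions

# Errors from the list above that plausibly warrant a retry.
RETRYABLE_EXCEPTIONS = (
    ServiceUnavailable,      # "Unable to retrieve routing information"
    SessionExpired,          # "Failed to obtain connection towards '...' server."
    ConnectionRefusedError,  # [Errno 111] Connection refused
)

# The defunct/closed-connection errors are raised through the driver's own
# connection error classes, so which of the above they map onto depends on the
# driver version. The AttributeError cases ('NoneType' has no attribute
# 'commit'/'rollback'/'protocol_version') look like a follow-on symptom of
# db.transaction being reset to None, so they are probably better fixed at the
# source than blanket-retried.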

Author

This is tricky: none of these (with perhaps the only exception being the first) are very clearly 'transient', and they could just as well indicate issues that will not be resolved by simply waiting and retrying.

I think there are generally two distinct cases:

(1) There is an open/healthy connection to the neo4j cluster. However, the cluster is not in a valid state, e.g. it is currently electing a new leader. Any queries issued against the cluster while it is in this invalid state will cause 'transient' errors that can be mitigated by the retry logic as implemented above.

It is, however, not clear which concrete exception the driver raises, but the neobolt codebase has a few good candidates, of which Neo.ClientError.Cluster.NotALeader seems very likely to indicate the above-mentioned scenario, alongside a few others that are clearly marked as transient.

(2) There are (potentially transient) network issues that prevent the driver from issuing queries to the cluster over a previously healthy and open connection. In my experience, the above exceptions tend to occur most in this scenario, and I think the solution here is different: not only should we retry issuing the query, but we should also attempt to reinitiate the connection.
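
A rough sketch of that split, purely to illustrate the two cases; the exception names are assumptions and their import paths differ between driver versions:

from neo4j.exceptions import ServiceUnavailable, SessionExpired, TransientError


def classify_failure(exc):
    """Decide whether a failed query should be retried as-is or only after reconnecting."""
    if isinstance(exc, TransientError):
        # Case (1): the connection is healthy but the cluster is momentarily in
        # an invalid state (e.g. electing a new leader): wait and retry the query.
        return "retry"
    if isinstance(exc, (ServiceUnavailable, SessionExpired, ConnectionRefusedError)):
        # Case (2): the connection itself is suspect: re-establish the
        # connection first, then retry the query.
        return "reconnect"
    # Anything else is not obviously transient and should be re-raised.
    return "raise"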

@robertlagrant Does this description match your experience?

Contributor

@robertlagrant robertlagrant Jan 10, 2019

Yes, that's right!

I think the neobolt NotALeader exception is a good pick for when we're now talking to a follower. As a bonus, given how it's named, I assume it will also be thrown when the node is in CANDIDATE status, before the cluster forms.

A good example of (2) is that the app has been directed (say) to node 1, which is the leader, and node 1 dies for some reason. Nodes 2 and 3 re-establish themselves as follower and leader, but while node 1 is physically coming back up it may be at a point where it's not even serving network requests, let alone have cluster info. So if the app tries to connect in that time, it almost certainly will get a connection timeout. When it tries to connect to the old node before it's got the cluster info, that's probably the service unavailable exception. I imagine that neobolt's ConnectionExpired and ServiceUnavailable exceptions will cover those cases.

In the case of ConnectionExpired, it could probably stop trying to connect to that server and connect afresh to the globally configured connection string instead, as that'll probably be a load balancer that will route the traffic to an alive node, re-get routing info and all will be well. What do you think?
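
A sketch of that "connect afresh" idea using neomodel's db.set_connection and the globally configured URL; the exception classes caught here stand in for neobolt's ConnectionExpired/ServiceUnavailable and are assumptions rather than the PR's actual code:

from neo4j.exceptions import ServiceUnavailable, SessionExpired  # stand-ins; exact classes vary by driver version
from neomodel import config, db


def reconnect_and_retry(query, params=None):
    """Retry once against a fresh connection when the current one looks dead."""
    try:
        return db.cypher_query(query, params)
    except (ServiceUnavailable, SessionExpired):
        # Go back through config.DATABASE_URL (typically the load balancer),
        # which should route to a live node and re-fetch routing information.
        db.set_connection(config.DATABASE_URL)
        return db.cypher_query(query, params)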

Author

ah I hadn't thought about the CANDIDATE stage, but I think you're right!

Your example of scenario 2 seems entirely on point, it's consistent with what we see happen every once in a while.

I agree: connect afresh to the globally configured connection string and it should resolve. I see one possible risk: there might be a small window where the load balancer has not yet picked up the unavailability of node 1, so there is a (very) small chance that connecting afresh ends up at node 1 again. I'm not sure of the likelihood of this happening (it also depends on health check interval settings and cluster size), but it might be prudent to add some retry logic to the reconnect with a configurable MAX_RETRY_SECONDS (which the user should then set to a value greater than the health check interval, with a reasonable default - 10 seconds?).

Contributor

@robertlagrant robertlagrant Jan 10, 2019

Yes, I think that's a good move. I'd probably suggest that if they choose to set it to 0, the retry is disabled, in case some people want to implement such things in infrastructure instead. I can't think of an obvious example of this, but I imagine that if people have a cluster-aware load balancer, they'd want to disable it to eliminate unnecessary moving parts.

But yeah, nice catch. Hadn't thought of that.
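
Putting the two suggestions together, a sketch of a time-bounded retry window where MAX_RETRY_SECONDS caps how long to keep retrying and 0 disables retries entirely; all names here are illustrative, not the PR's actual implementation:

import time

from neo4j.exceptions import ServiceUnavailable, SessionExpired  # import paths vary by driver version
from neomodel import config, db

MAX_RETRY_SECONDS = 10  # default; should exceed the load balancer's health-check interval, 0 disables retries
RETRY_INTERVAL = 1.0    # seconds between attempts


def cypher_query_with_retry(query, params=None):
    deadline = time.monotonic() + MAX_RETRY_SECONDS
    while True:
        try:
            return db.cypher_query(query, params)
        except (ServiceUnavailable, SessionExpired):
            if MAX_RETRY_SECONDS <= 0 or time.monotonic() >= deadline:
                raise  # retries disabled, or the retry window is exhausted
            db.set_connection(config.DATABASE_URL)  # connect afresh before the next attempt
            time.sleep(RETRY_INTERVAL)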

Author

@robertlagrant is there any chance you can try this to see if it resolves the issues?


@mvanderkroon mvanderkroon changed the title Feature/neo retry logic [WIP] Feature/neo retry logic Jan 10, 2019
…usterstate related (or otherwise indicated as transient)
def __str__(self):
    node_labels = ",".join(self.db_node.labels)
    ncr_items = list(map(lambda x: "{} --> {}".format(",".join(x[0]), x[1]),
                         self.current_node_class_registry.items()))
    ncr_items = list(
Contributor

This is just a style/preference thing, but it might be more readable to do the following:

ncr_items_multiline = "\n".join(
    "{} --> {}".format(",".join(node_labels), model_class)
    for node_labels, model_class in self.current_node_class_registry.items()
)

Author

That definitely looks better. I use an autoformatter (black) that wreaked havoc on the existing formatting, hence the large number of changes.

The question is whether the implemented logic solves your issues; can you check this?

Contributor

@robertlagrant robertlagrant Jan 20, 2019

@mvanderkroon I will! I just came here to ask if it's ready for a test. I'll give it a go.

Contributor

I'm still getting a NotALeaderError. I can't see why, as I'm a bear of very little brain. I double-checked that the code I'm using is yours, and I can see the except (ServiceUnavailableException, NotALeaderError, DatabaseUnavailableError in your code, so that's very odd!

Here's the traceback:

2019-01-20 21:02:54.855 | INFO     | commands:do_second:80 - Attempting to write data to neo4j-2
2019-01-20 21:02:54.871 | ERROR    | commands:do_second:84 - Traceback (most recent call last):
  File "/commands.py", line 81, in do_second
    write_data()
  File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/commands.py", line 56, in write_data
    return db.cypher_query(INSERT_QUERY)
  File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 39, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 261, in cypher_query
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in cypher_query
    results, meta = ([list(r.values()) for r in response], response.keys())
  File "/usr/local/lib/python3.7/site-packages/neomodel/util.py", line 233, in <listcomp>
    results, meta = ([list(r.values()) for r in response], response.keys())
  File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 736, in records
    self._session.fetch()
  File "/usr/local/lib/python3.7/site-packages/neo4j/v1/api.py", line 347, in fetch
    detail_count, _ = self._connection.fetch()
  File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 290, in fetch
    raise error
  File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 287, in fetch
    return self._fetch()
  File "/usr/local/lib/python3.7/site-packages/neo4j/bolt/connection.py", line 327, in _fetch
    response.on_failure(summary_metadata or {})
  File "/usr/local/lib/python3.7/site-packages/neo4j/v1/result.py", line 70, in on_failure
    raise CypherError.hydrate(**metadata)
neo4j.exceptions.NotALeaderError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER

Collaborator

@robertlagrant, @mvanderkroon There seem to be certain events that do not trigger a sync from the database server towards the driver. In the past, I have experienced this in relation to transactions where an operation was causing an error server-side which, however, only became known client-side during a different transaction. This was confusing because it flagged errors at points that were not directly related to the cause.

Therefore, do you think that you could try sync on the session, at the point that is likely to be causing the problem? Perhaps this is one of those cases where the server has updated its state but this has not been propagated back to the client side (?). So, upon the first "fail", do a sync first and THEN retry the transaction. Hopefully this will update the client about which server has become the leader (?). Just an idea.
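
A sketch of that "sync first, THEN retry" idea; session here is a hypothetical handle to the active driver session (the 1.x driver exposes Session.sync()), and the caught exception is an assumption to be adjusted to the error actually observed:

from neo4j.exceptions import SessionExpired  # assumption; not necessarily the error seen in practice


def retry_with_sync(session, run_transaction):
    """On the first failure, flush the session so the client catches up with the server, then retry."""
    try:
        return run_transaction()
    except SessionExpired:
        session.sync()  # full send/receive so the client learns e.g. which server is now the leader
        return run_transaction()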

Collaborator

@robertlagrant, @mvanderkroon Shall we include this change in an upcoming release? Did you guys try the sync before re-trying?

Author

@aanastasiou I am busy these weeks with other activities, so I was not able to test it just yet. At the earliest I'd be able to look into it at the beginning of next week.

Collaborator

@mvanderkroon Of course, I understand, there is no rush. If I get some time on my side I will give it a try too.

All the best

Author

Update from my end: we have been running into exactly the same issue as @robertlagrant (countless NOTALEADER errors). However, in this case it seems unrelated to neomodel (or even the python-neo4j-driver): for us, causal clustering is just incredibly unstable. The causal cluster in fact gets stuck in a continuous LEADER election loop. I'm therefore quite certain that adding more retry logic makes little sense on the application side.

To me, 'causal clustering' is either fundamentally broken, or we are unlucky enough to be hitting some sort of edge-case bug. So far our experience with causal clustering has been far from ideal: unstable all the way through, and write performance is ~6x worse than against an unclustered setup. Our solution so far is to reduce the number of core nodes to 1 (essentially losing cluster capabilities) and keep our fingers crossed until we can migrate away from neo4j.

fyi, seems related

Collaborator

@mvanderkroon Thanks for letting me know. @robertlagrant, is there anything to add from your end on this?

@aanastasiou
Collaborator

Possibly related. There are certain conditions under which the client does not get notified of error conditions immediately, unless it does a sync(). This is done for performance reasons.

@mariusconjeaud added the marked_for_imminent_closure label (Fully or partially obsolete issue. Request clarification, close if none provided.) on Feb 24, 2023
@mariusconjeaud
Collaborator

Can this be merged? What would need to be done to merge it? Or should it be closed?
