neo4j.exceptions.ClientError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER #335

Open
robertlagrant opened this issue May 15, 2018 · 12 comments

Comments

@robertlagrant
Contributor

A funny one:

  • We have a 3-node causal-clustered neo4j setup
  • I've changed the routing protocol to be bolt+routing
  • We're using Neomodel with @db.transaction

We're getting intermittent errors as per the issue title - i.e. it's trying to write to a follower node, and presumably bolt+routing isn't sending the transaction to the leader.
Am I missing something? If the first interaction with the database is a read, does it open the transaction on a follower node? Can I force every transaction to go to the leader?

@robertlagrant
Contributor Author

We are still getting this issue, even when forcing a write transaction.

I've created a repro case: https://github.com/robertlagrant/neo4j-cluster-failure. Please test.

@aanastasiou
Collaborator

@robertlagrant Would it be possible to share a little more information on your cluster configuration? Is it supposed to be 3 CORE servers? Under some conditions, what you describe might be the intended behaviour, at least as far as Raft is concerned (i.e. see this). I am trying to see how much of this can be dealt with at the level of neomodel and how much of it is external to it.

@mvanderkroon

mvanderkroon commented Jan 9, 2019

Please see https://neo4j.com/docs/ogm-manual/current/reference/ (section 3.14.1.6. Retry mechanisms).

For critical applications, these failures have to be anticipated, and also managed at the architecture or application level. Even if the driver handles some low level retries, it is not always enough in case of instability, as an application may involve complex business logic, and require coarse grained units of work.

In other words, the driver does not deal with higher level failures (such as cluster disconnects). In our use cases we have worked around this by adding custom retry logic to our business logic. See the very basic example below (adding jitter and exponential backoff is obviously highly recommended).

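import time

start = time.time()
last_exception = None  # kept outside the loop so it survives between attempts
while True:
    # Give up once the retry budget is spent and surface the last failure.
    if last_exception is not None and time.time() - start > _MAX_RETRY_SECONDS:
        raise last_exception

    try:
        # Pass the unit-of-work function itself; the driver calls it with a transaction.
        session.write_transaction(do_write)
        break
    except Exception as e:
        last_exception = e
        time.sleep(1)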
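
The jitter and exponential backoff recommended above could look roughly like the following. This is a minimal sketch, not neomodel or driver API: `retry_with_backoff`, its parameters, and the default delays are hypothetical names and values chosen for illustration, and `work` stands in for something like `lambda: session.write_transaction(do_write)`.

```python
import random
import time


def retry_with_backoff(work, max_seconds=30.0, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call work() until it succeeds or the retry budget (max_seconds) is spent.

    All names and defaults here are illustrative assumptions.
    """
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return work()
        except retry_on:
            attempt += 1
            # Exponential backoff: base_delay * 2^(attempt - 1), capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # Full jitter: sleep a random amount between 0 and the ceiling.
            delay = random.uniform(0, ceiling)
            if time.monotonic() - start + delay > max_seconds:
                raise  # budget exhausted; re-raise the failure we just caught
            sleep(delay)
```

In practice it is probably better to narrow `retry_on` to the errors that are actually transient (for example the `ClientError` from the issue title) so that genuine bugs are not retried.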

@aanastasiou
Collaborator

@mvanderkroon Thank you very much, sounds like a modification is required at this point (?).

@mvanderkroon

@aanastasiou I believe so. I have forked the repo, made the necessary changes and would be quite happy to issue a pull request. Should I point it to your master branch?

@aanastasiou
Collaborator

@mvanderkroon Thank you very much and I do not see why not. It should be sent as a pull request to the main neomodel repo. All the best.

@robertlagrant
Contributor Author

@aanastasiou sure - it's a 3 core server cluster. There are also 2 read replicas, but they don't really feature in this situation as far as I'm aware.

@aanastasiou
Collaborator

@robertlagrant Thank you for your response, I think that the discussion with @mvanderkroon on the pull request was very informative about the specifics.

@kant111

kant111 commented Aug 2, 2019

Why can't a follower accept writes?

@robertlagrant
Contributor Author

@kant111 because that's not how Neo4j works. In a causal cluster, writes have to go through the leader, as the error message says.

@ayoubelmimouni

A connection URL of bolt+routing:// indicates that the session is cluster aware, whereas bolt:// knows nothing about the other members of a cluster.
However, the bolt+routing:// connection URL is only half the story. The other half is the use of session.readTransaction() and session.writeTransaction(), each of which lets you pass the Cypher to be executed. If you connect via bolt+routing:// and send a Cypher statement through session.writeTransaction(), then regardless of which member you connected to, the statement is routed to the LEADER, because the transaction is declared as a write.
It is important to note that Neo4j does not parse the Cypher statement to detect automatically whether it is a read or a write.
So one could actually issue session.readTransaction("create (n:Person {id:1})"), and because it is declared as a 'readTransaction' it would be routed to a FOLLOWER, where it would fail, since only the LEADER can perform writes.
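
For the Python driver that neomodel builds on, the same distinction might be sketched as follows. This assumes a 1.x/4.x driver where `session.write_transaction` and `session.read_transaction` take a unit-of-work function; the URI, credentials, and the function names `create_person`, `count_people`, and `main` are placeholders invented for this example.

```python
def create_person(tx, person_id):
    # Unit of work for a write; the driver passes in the transaction `tx`.
    tx.run("CREATE (n:Person {id: $id})", id=person_id)


def count_people(tx):
    # Unit of work for a read.
    record = tx.run("MATCH (n:Person) RETURN count(n) AS c").single()
    return record["c"]


def main(uri="bolt+routing://localhost:7687", user="neo4j", password="secret"):
    # Placeholder connection details; adjust to your own cluster.
    # Imported lazily so the functions above can be exercised without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            # Declared as a write, so a routing driver sends it to the LEADER.
            session.write_transaction(create_person, 1)
            # Declared as a read, so it may be served by a FOLLOWER or read replica.
            print(session.read_transaction(count_people))
    finally:
        driver.close()
```

Note that 4.x drivers use the neo4j:// scheme in place of bolt+routing://, and 5.x drivers rename these methods to `execute_write` and `execute_read`, so check which driver version you have installed.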

@gwvandesteeg

Fun fact (tested on Neo4j 4.0.7)

Adding a trigger can only be done on the node in the cluster that is the LEADER of both the DB you are adding the trigger to AND the system database (might need the neo4j DB as well, wasn't sure, but we don't use it).

The example below is me trying to add a trigger whilst connected to the node neo4j-core-2 via the bolt connector

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "FOLLOWER", system: "FOLLOWER"}   | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "LEADER", system: "LEADER"}     | []     |
| "775b45fe-3ae3-466d-9ad2-7b8e5ae82e0b" | ["bolt://neo4j-core-1.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-1.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

3 rows available after 6 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER

After killing a bunch of nodes and waiting for them to come back in the desired state, this time connected to neo4j-core-0 via the bolt connector

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "LEADER", system: "LEADER"}       | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

2 rows available after 0 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                        | query                                                                                                                                                                              | selector          | params | installed | paused |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "assertExtensionNumberValidNumericalString" | "WITH '^([0-9]{2,5})$' AS extNumStrRegex
MATCH (e:Extension)
CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
RETURN NULL" | {phase: "before"} | {}     | TRUE      | FALSE  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 10 ms, consumed after another 30 ms
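
The condition described in this comment can be checked programmatically before attempting to add a trigger. The following is only a sketch: `members_leading_all` is a hypothetical helper, and the two lists are hand-copied from the `dbms.cluster.overview()` outputs above (bolt addresses only) rather than fetched from a live cluster.

```python
def members_leading_all(overview, required):
    """Return bolt addresses of members that are LEADER for every database in `required`."""
    result = []
    for member in overview:
        roles = member["databases"]
        if all(roles.get(db) == "LEADER" for db in required):
            result.extend(a for a in member["addresses"] if a.startswith("bolt://"))
    return result


# First overview: leadership of nextvoice and system is split across members.
before = [
    {"addresses": ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687"],
     "databases": {"nextvoice": "LEADER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"addresses": ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687"],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "LEADER", "system": "LEADER"}},
    {"addresses": ["bolt://neo4j-core-1.neo4j.default.svc.cluster.local:7687"],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
]

# Second overview: neo4j-core-0 leads every database.
after = [
    {"addresses": ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687"],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"addresses": ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687"],
     "databases": {"nextvoice": "LEADER", "neo4j": "LEADER", "system": "LEADER"}},
]

print(members_leading_all(before, ["nextvoice", "system"]))
print(members_leading_all(after, ["nextvoice", "system"]))
```

With the first overview no member qualifies, which matches the failed `apoc.trigger.add` call; with the second, only neo4j-core-0 does.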
