Unable to retrieve events for certain block heights #5810
Comments
PR #5764 fixes the issue of EN1 missing events. Once that fix is rolled out, the client should no longer receive an error, since EN1 will have all the events.
The next step is to reproduce this against a single AN, then inspect the receipts for the block to see how many there are and which nodes they came from. ANs index execution receipts in the ingestion engine here: flow-go/engine/access/ingestion/engine.go Line 435 in 512eb32
They then choose an execution node based on the receipts in storage here: flow-go/engine/access/rpc/backend/backend.go Line 461 in 512eb32
It's possible for an AN to have received a receipt for a block from only a single EN, or from none at all. In this case, the AN should just try any EN. I think we're running into a special case of this situation. If an AN is configured with a list of "preferred execution nodes", it will select one or more nodes from that list that it has receipts from. However, if it selects only a single node and the request to that node fails, it will not retry on another node.
There are 2 flags an AN can use to control which EN to use:
Otherwise, the node will try with any execution node. Here's the logic: flow-go/engine/access/rpc/backend/backend.go Lines 530 to 563 in 6f0e33a
This issue comes up when an access node only has receipts from a single EN. In this case, if that EN is offline or returns an error, the AN will not retry on any other node. This can create a situation where data for some blocks effectively becomes unavailable on that AN. ANs receive receipts from ENs as they execute blocks, and from the blocks themselves as they are received from consensus nodes. It's possible in some situations for an AN to have only a single receipt for a block in its store, so that situation should be handled.
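To make the missing retry concrete, here is a minimal sketch of the fallback behaviour described above. It is not the flow-go implementation: `queryFunc`, `tryNodes`, `errUnavailable` and the node IDs are hypothetical names used only for illustration, and the real backend talks to ENs over gRPC rather than through a plain function.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error standing in for any failed EN request.
var errUnavailable = errors.New("execution node unavailable")

// queryFunc is a hypothetical stand-in for a request to one execution node.
type queryFunc func(nodeID string) (string, error)

// tryNodes queries each candidate node in order and returns the first
// successful response. If every node fails, it returns all errors joined.
func tryNodes(nodes []string, query queryFunc) (string, error) {
	if len(nodes) == 0 {
		return "", errors.New("no execution nodes to query")
	}
	var errs []error
	for _, id := range nodes {
		res, err := query(id)
		if err == nil {
			return res, nil
		}
		// Record the failure and fall through to the next candidate.
		errs = append(errs, fmt.Errorf("node %s: %w", id, err))
	}
	return "", errors.Join(errs...)
}

func main() {
	// EN1 always fails, as in the report; any other node answers.
	query := func(id string) (string, error) {
		if id == "EN1" {
			return "", errUnavailable
		}
		return "events from " + id, nil
	}
	res, err := tryNodes([]string{"EN1", "EN2"}, query)
	fmt.Println(res, err)
}
```

With only `EN1` in the list the call returns an error, which is the situation reported here. The fallback only helps if the candidate list holds more than one node.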
I think we should update the behavior when
nodes selected, that the list is padded up to 3 nodes using the following methods (in order):
This would ensure there are enough fallbacks to handle cases where ENs are unavailable.
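A rough sketch of the padding idea, under stated assumptions: `padNodes` and the node IDs are hypothetical, the fallback sources are passed in by the caller in whatever priority order is agreed on, and nothing here reflects the actual selection code in flow-go.

```go
package main

import "fmt"

// padNodes returns the selected node IDs, padded with IDs taken from the
// fallback sources (in the order given) until the list holds at least
// minNodes entries or the sources run out. Duplicates are skipped.
func padNodes(selected []string, minNodes int, sources ...[]string) []string {
	seen := make(map[string]struct{}, len(selected))
	out := make([]string, 0, len(selected))
	for _, id := range selected {
		if _, dup := seen[id]; dup {
			continue
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	for _, src := range sources {
		for _, id := range src {
			if len(out) >= minNodes {
				return out
			}
			if _, dup := seen[id]; dup {
				continue
			}
			seen[id] = struct{}{}
			out = append(out, id)
		}
	}
	return out
}

func main() {
	selected := []string{"EN1"}
	first := []string{"EN1", "EN2"}  // hypothetical higher-priority source
	second := []string{"EN3", "EN4"} // hypothetical lower-priority source
	fmt.Println(padNodes(selected, 3, first, second))
}
```

Because the sources are just an ordered argument list, swapping the priority of two fallback methods is a change at the call site only.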
Shouldn't the order be,
since the operator wants the preferred nodes to be given more weight?
My thinking is that "preferred" implies the node will try to use these if one of them has executed the block, and otherwise it will use another node. If we failed over to any preferred EN, I think we'd be more likely to see delays responding to queries when there are other ENs that have reported executing the block. I'm OK with either approach.
You are right - I mistakenly assumed One question though - is the AN capable of differentiating between an EN responding with a not-found error and an EN responding with any other error?
In some cases it is, but we can certainly add it where needed. Did you have a case in mind that should be checked?
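For illustration, one way such a distinction could be made in Go is to wrap a sentinel error and test for it with `errors.Is`. This is only a sketch: `ErrNotFound`, `errBusy`, `failureKind` and `classify` are hypothetical names. In a gRPC client the check would more likely be made on the status code (for example via `status.Code(err)` from `google.golang.org/grpc/status`) rather than on a sentinel.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel meaning the EN has no data for the block.
var ErrNotFound = errors.New("not found")

// errBusy is a hypothetical stand-in for any other failure, such as an overloaded node.
var errBusy = errors.New("resource exhausted")

// failureKind labels the outcome of a request to an execution node.
type failureKind int

const (
	kindNone     failureKind = iota // request succeeded
	kindNotFound                    // node answered but does not have the data
	kindOther                       // any other failure
)

// classify maps an error to a failureKind, looking through wrapped errors.
func classify(err error) failureKind {
	switch {
	case err == nil:
		return kindNone
	case errors.Is(err, ErrNotFound):
		return kindNotFound
	default:
		return kindOther
	}
}

func main() {
	wrapped := fmt.Errorf("EN1: get events: %w", ErrNotFound)
	fmt.Println(classify(wrapped) == kindNotFound)
	fmt.Println(classify(errBusy) == kindOther)
}
```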
🐞 Bug Report
Requests to retrieve events for certain block heights fail.
While this is related to the bug https://github.com/dapperlabs/flow-go/issues/6959, it points to a different issue.
Currently, EN1 is set to return a ResourceExhausted error when querying for events. However, the fact that the GetEvents call consistently fails indicates that the public access nodes always query only EN1. This would happen if the access node only got one execution receipt for the block and it was from EN1. Hence the core issue here is that the access node is most likely missing execution receipts from the other execution nodes.
What is the severity of this bug?
important
Reproduction steps
Steps to reproduce the behaviour:
Expected behaviour
Events should be returned.
Workaround
Access nodes 7 and 8, run by the foundation, serve events locally and respond without an error for those block heights.