Unable to retrieve events for certain block heights #5810

vishalchangrani · 2024-04-29T22:43:57Z

🐞 Bug Report

Requests to retrieve events for certain block heights fail.

$ flow events get --start=68225795 --end=68225795 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc =

While this is related to the bug https://github.com/dapperlabs/flow-go/issues/6959, it points to a different issue.
Currently, EN1 is set to return a ResourceExhausted error when queried for events. However, the fact that the GetEvents call fails consistently indicates that the public access nodes always query only EN1. This would happen if the access node received only one execution receipt for the block and it was from EN1. Hence, the core issue is that the access node is most likely missing execution receipts from the other execution nodes.

What is the severity of this bug?

important

Critical - Urgent: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). Whole team should drop what they're doing and work on this.

Critical: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). One person should look at this right now.

Important: We have to do this before we ship, but it can wait until the next sprint (product or feature won't function without it, but it's not blocking us or users right now). Team should do this in the next sprint.

Should have: It would be better if we do this before we ship, but it's OK if we don't (product functions without this, but it's a better user experience). Consider adding to a future sprint.

Could have: It really doesn't matter if we do this (product functions without this, impact to user is minimal).

Reproduction steps

Steps to reproduce the behaviour:

$ flow events get --start=68225796 --end=68225796 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc =

Expected behaviour

Events should be returned.

Workaround

Access nodes 7 and 8, run by the foundation, serve events locally and respond without an error for those block heights.

$ flow events get --start=68225795 --end=68225795 --host access-008.mainnet24.nodes.onflow.org:9000 A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

vishalchangrani · 2024-04-29T22:46:24Z

PR #5764 fixes the issue of EN1 missing events. Once that fix is rolled out, the client should not receive an error, since EN1 will have all the events.
However, the root cause of this issue would still persist and needs to be fixed.

peterargue · 2024-04-29T22:59:08Z

The next step is to reproduce this against a single AN, then inspect the receipts for the block to see how many there are and which nodes they came from.

ANs index execution receipts in the ingestion engine here:

flow-go/engine/access/ingestion/engine.go

Line 435 in 512eb32

    
func (e *Engine) handleExecutionReceipt(_ flow.Identifier, r *flow.ExecutionReceipt) error {

They then choose an execution node based on the receipts in storage here:

flow-go/engine/access/rpc/backend/backend.go

Line 461 in 512eb32

func findAllExecutionNodes(

It's possible for an AN to have received a receipt for a block from only a single EN, or even from none. In this case, the AN should just try any EN.

I think we're running into a special case in this situation. If an AN is configured with a list of "preferred execution nodes", it will select one or more nodes from that list that it has receipts from. However, if that selection contains only a single node and the request to that node fails, it will not retry on another node.

peterargue · 2024-05-01T00:38:10Z

There are 2 flags an AN can use to control which EN to use:

--preferred-execution-node-ids: if this is set, the AN will prefer to use a node from this list if it has a receipt from any of them. Otherwise, it will fall back to using any EN.
--fixed-execution-node-ids: if this is set, the AN will only use nodes from this list.

If neither flag is set, the node will try any execution node.

Here's the logic:

flow-go/engine/access/rpc/backend/backend.go

Lines 530 to 563 in 6f0e33a

    
func chooseExecutionNodes(state protocol.State, executorIDs flow.IdentifierList) (flow.IdentitySkeletonList, error) {
	allENs, err := state.Final().Identities(filter.HasRole[flow.Identity](flow.RoleExecution))
	if err != nil {
		return nil, fmt.Errorf("failed to retreive all execution IDs: %w", err)
	}
	// first try and choose from the preferred EN IDs
	var chosenIDs flow.IdentityList
	if len(preferredENIdentifiers) > 0 {
		// find the preferred execution node IDs which have executed the transaction
		chosenIDs = allENs.Filter(filter.And(filter.HasNodeID[flow.Identity](preferredENIdentifiers...),
			filter.HasNodeID[flow.Identity](executorIDs...)))
		if len(chosenIDs) > 0 {
			return chosenIDs.ToSkeleton(), nil
		}
	}
	// if no preferred EN ID is found, then choose from the fixed EN IDs
	if len(fixedENIdentifiers) > 0 {
		// choose fixed ENs which have executed the transaction
		chosenIDs = allENs.Filter(filter.And(
			filter.HasNodeID[flow.Identity](fixedENIdentifiers...),
			filter.HasNodeID[flow.Identity](executorIDs...)))
		if len(chosenIDs) > 0 {
			return chosenIDs.ToSkeleton(), nil
		}
		// if no such ENs are found then just choose all fixed ENs
		chosenIDs = allENs.Filter(filter.HasNodeID[flow.Identity](fixedENIdentifiers...))
		return chosenIDs.ToSkeleton(), nil
	}
	// If no preferred or fixed ENs have been specified, then return all executor IDs i.e. no preference at all
	return allENs.Filter(filter.HasNodeID[flow.Identity](executorIDs...)).ToSkeleton(), nil
}

This issue comes up when an access node only has receipts from a single EN. In this case, if that EN is offline or returns an error, the AN will not retry on any other node. This can create a situation where data for some blocks effectively becomes unavailable on that AN.

ANs receive receipts directly from ENs as they execute blocks, and from the blocks themselves as they are received from consensus nodes. In some situations an AN can end up with only a single receipt for a block in its store, so that case should be handled.

peterargue · 2024-05-01T00:57:13Z

I think we should update the behavior when --preferred-execution-node-ids is set and fewer than maxFailedRequestCount nodes are selected:

flow-go/engine/access/rpc/backend/node_communicator.go

Line 13 in 6f0e33a

const maxFailedRequestCount = 3

In that case, the list should be padded up to 3 nodes using the following methods (in order):

Use any EN with a receipt
Use any preferred node not already selected
Use any EN not already selected

This would ensure there are enough fallbacks to handle cases where ENs are unavailable.
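The padding order proposed above could be sketched as follows. This is only an illustration under stated assumptions: `padExecutionNodes` is a hypothetical helper, node IDs are plain strings rather than flow-go identity types, and the only name taken from the source is `maxFailedRequestCount`.

```go
package main

import "fmt"

// maxFailedRequestCount mirrors the constant quoted from node_communicator.go.
const maxFailedRequestCount = 3

// padExecutionNodes is a hypothetical helper. It keeps the already selected
// nodes and, while fewer than maxFailedRequestCount nodes are present, appends
// unseen nodes from these sources, in order:
//  1. any EN with a receipt
//  2. any preferred node
//  3. any EN
func padExecutionNodes(selected, withReceipt, preferred, all []string) []string {
	out := append([]string(nil), selected...)
	seen := make(map[string]bool, len(selected))
	for _, id := range selected {
		seen[id] = true
	}
	for _, candidates := range [][]string{withReceipt, preferred, all} {
		for _, id := range candidates {
			if len(out) >= maxFailedRequestCount {
				return out
			}
			if !seen[id] {
				seen[id] = true
				out = append(out, id)
			}
		}
	}
	return out
}

func main() {
	selected := []string{"EN1"}           // preferred node that has a receipt
	withReceipt := []string{"EN1", "EN4"} // all nodes with a receipt
	preferred := []string{"EN1", "EN2"}   // operator's preferred list
	all := []string{"EN1", "EN2", "EN3", "EN4"}
	fmt.Println(padExecutionNodes(selected, withReceipt, preferred, all))
}
```

If fewer than three distinct ENs exist in total, the sketch simply returns all of them; a list that already has three or more selected nodes is returned unchanged.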

vishalchangrani · 2024-05-03T16:58:48Z

Use any EN with a receipt

Use any preferred node not already selected

Use any EN not already selected

Shouldn't the order be:

Use any preferred node not already selected
Use any EN with a receipt
Use any EN not already selected

Since the operator wants the preferred nodes to be given more weightage.

peterargue · 2024-05-06T21:49:02Z

My thinking is that "preferred" implies that the node will try to use these if one of them has executed the block; otherwise it will use another node.

If we failed over to any preferred EN, I think we'd be more likely to see delays responding to queries when there are other ENs that have reported executing the block. I'm OK with either approach.

vishalchangrani · 2024-05-10T21:55:44Z

Use any EN with a receipt

Use any preferred node not already selected

Use any EN not already selected

You are right. I mistakenly assumed preferred nodes would always be in the EN receipt.
I'm good with the order you suggested.

One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?

peterargue · 2024-05-10T22:00:16Z

One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?

In some cases it does, but we can certainly add it where needed. Did you have a case in mind that should be checked?

vishalchangrani added the Bug Something isn't working label Apr 29, 2024

peterargue added the S-Access label Apr 29, 2024

peterargue closed this as completed May 6, 2024

peterargue reopened this May 6, 2024

peterargue assigned AndriiDiachuk May 9, 2024

AndriiDiachuk mentioned this issue May 22, 2024

[Access] chooseExecutionNodes fix #5969

Merged

peterargue closed this as completed in #5969 Jun 10, 2024
