-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BCF-3250]: Fix FilterLog error handling in LogPoller #11654
[BCF-3250]: Fix FilterLog error handling in LogPoller #11654
Conversation
I see that you haven't updated any README files. Would it make sense to do so? |
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded { | ||
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !errors.As(err, &rpcErr) || ...
is true when err == nil
, and false when actually matching.
Should it be this?
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded { | |
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) | |
if errors.As(err, &rpcErr) && rpcErr.Code != jsonRpcLimitExceeded { | |
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's actually how it is currently on develop branch, and that's the bug that's being fixed here:
Presently, it checks to see if it's an rpc error AND if it's not jsonRpcLimitExceeded. But it should be checking that it's either not an rpc error OR not the specific rpc error jsonRpcLimitExceeded. In either of those cases, it should just return the error. The only case where it needs to handle it specially (by reducing the batch size until it fits) is if it's specifically this type of rpc error (limit exceeded).
The nil case is already handled, because this is in an if err != nil
block
I guess it might be more readable if instead of !errors.As() || rpc.Err.Code != jsonRpcLimitExceeded
we change it to !(errors.As() && rpc.Err.Code == jsonRpcLimitExceeded)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the As
check fails, then you cannot access the rpcErr.Code
field. This does not make sense as written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think I see what this is doing but it seems like a messy way to express it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this?
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded { | |
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) | |
if !(errors.As(err, &rpcErr) && rpcErr.Code == jsonRpcLimitExceeded) { | |
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I suggested above... I think that's a better way of expressing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic got slightly more complex once I added a second error code it can match against, so I moved it into a separate helper function where I think it's a lot easier to read.
@@ -1416,7 +1416,7 @@ func TestTooManyLogResults(t *testing.T) { | |||
from := fq.FromBlock.Uint64() | |||
to := fq.ToBlock.Uint64() | |||
if to-from >= 4 { | |||
return []types.Log{}, &clientErr // return "too many results" error if block range spans 4 or more blocks | |||
return []types.Log{}, clientErr // return "too many results" error if block range spans 4 or more blocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened here? Is this declared in a way that makes it easy to make that mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm glad you asked, there might still be something wrong here. The Error field returned by geth directly in a jsonMessage is set to their own internal type of &jsonError
:
https://github.com/smartcontractkit/go-ethereum/blob/master/rpc/json.go#L109-L123
Then our client wrapper converts it at some point into our own type of JsonError
, but confusingly... we redefine geth's jsonError
in 3 different places, twice as a struct and once as an interface:
core/chains/evm/client/errors.go:type JsonError struct {
core/chains/evm/client/simulated_backend_client.go: // type JsonError struct {
core/scripts/chaincli/handler/report.go:type JsonError interface {
Based on these, it looks like returning &clientErr
was the right thing to do after all. But if that's true, then the check for Error.As()
will fail unless we pass it a **JsonError
which doesn't seem right either. Maybe I shouldn't be checking against a concrete type anyway, it's too brittle? I think I need to investigate this more and figure out what the different ways are in which this error might be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally had a chance to go back and look through geth's error handling more carefully to sort this out. The number of different ways in which these json rpc errors can get transformed and wrapped as they are propagating up the stack is enormous. The type changes several times, sometimes involving a pointer and sometimes not.
There is an interface defined for this purpose in geth, rpc.Error
, which is designed to test whether something has an error code attached to it, which is all we need. The only problem with that was that the JsonError
type we define that it can get transformed into under some conditions doesn't actually satisfy this. Since it's clear from the comments that the main purpose of us defining this type was to mimick the jsonError
type in geth which isn't exported, I fixed this by adding the ErrorCode()
method so that they both behave the same. (Also noticed a pointer receiver on the String method, which I think was not supposed to be there.)
While looking through the geth code, I noticed another error code it can return which it would make sense for us to handle in the same way, so I've added that as well. And added a regression test for this bug of logging a critical error when there's nothing critical going on.
SonarQube Quality Gate |
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This got closed due to being stale, but someone noticed it again recently so I'll pick it back up again soon. |
8c4fdff
to
8c6764b
Compare
8c6764b
to
f4f39b6
Compare
1cb7f0f
to
a3c4524
Compare
core/chains/evm/client/errors.go
Outdated
@@ -452,7 +452,16 @@ func (err JsonError) Error() string { | |||
return err.Message | |||
} | |||
|
|||
func (err *JsonError) String() string { | |||
// Satisfies rpc.Error interface, which any actual jsonError returned from geth is expected to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to satisfy rpc.DataError
? ErrorData()
seems redundant.
Should we add interface guard to ensure JsonError always implements rpc.Error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no present need for it, but I guess it probably makes sense to add it so that it's fully parallel with the jsonError
type defined in go-ethereum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm glad you asked this--I was thinking I hadn't added ErrorData()
, but looks like I'd started to add it but hadn't actually finished it. Just completed that, and added a type guard for both.
The jsonError's returned from geth satisfy both rpc.Error and rpc.DataError--presently we only use rpc.Error but having both could avoid a future mistake where we're expecting it to satisfy that interface and it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2a9fa86
to
dcbd436
Compare
case jsonRpcLimitExceeded: | ||
fallthrough | ||
case jsonRpcResultsTooLarge: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case jsonRpcLimitExceeded: | |
fallthrough | |
case jsonRpcResultsTooLarge: | |
case jsonRpcLimitExceeded, jsonRpcResultsTooLarge: |
const jsonRpcResultsTooLarge = -32003 | ||
|
||
// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other | ||
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5 | ||
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically with Go naming these should be RPC
:
const jsonRpcResultsTooLarge = -32003 | |
// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other | |
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5 | |
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md | |
const jsonRPCResultsTooLarge = -32003 | |
// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other | |
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5 | |
const jsonRPCLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md |
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5 | ||
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md | ||
|
||
func isRequestTooLargeError(err error) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better name for this, since it also matches LimitExeceeded
?
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the logic below makes sense in the case of LimitExceeded
. Why group them together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was designed specifically for the LimitExceeded
error, which is returned whenever the number of results that would have come back from an eth_getLogs
request exceeds a limit set by the rpc server. (In my testing, this was found to be 10,000 for most 3rd party rpc servers, such was alchemy and infura.)
That one has been in the PR since Dec, I just added this other one (ResultsTooLarge
) because the name of it sounds like it might get returned by some rpc servers under similar circumstances--we have no actual evidence that this is the case, I'm just guessing on that because the name sounds like what LimitExceeded
does in practice. We can leave this one out, but we can't leave out LimitExceeded
since that's what this error handling was added for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the current comment for LimitExceeded
accurate? Rate limiting is significantly different from result size limiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It had been a while since I tested this, so I just went through and tested the responses using curl on a dozen or more different rpc servers.
Sadly, they are much more varied than I'd remembered (although in Dec I think I'd only tried it on 3 providers)--we may have to rely primarily on string parsing 😩, or at least use that in conjunction with error codes.
Here is the response infura gives, which is the LimitExceeded error:
{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"data":{"from":"0xCB3D","limit":10000,"to":"0x7B737"},"message":"query returned more than 10000 results. Try with this block range [0xCB3D, 0x7B737]."}}
Looking at their docs, they also return the same error code if you've exceeded your quota for the day, or are sending requests too fast. So the comment is accurate, but you're right that they are lumping very different types of errors together under the same error code--I guess we would need to check either "data" or "message" to differentiate it from other types of rate limiting.
Some other examples of responses (I was either mistaken about Alchemy, or they changed it since then).
Alchemy:
{"jsonrpc":"2.0","id":1,"error":{"code":-32602,"message":"Log response size exceeded. You can make eth_getLogs requests with up to a 2K block range and no limit on the response size, or you can request any block range with a cap of 10K logs in the response. Based on your parameters and the response size limit, this block range should work: [0x0, 0x133e71]"}}
(from their docs, -32602 means "invalid method parameters")
LinkPool, Blockdaemon, Chainstack:
{"jsonrpc":"2.0","id":1,"error":{"code":-32002,"message":"request timed out"}}
Quicknode BSC, Fantom:
{"jsonrpc":"2.0","id":1,"error":{"code":-32614,"message":"eth_getLogs is limited to a 10,000 range"}}
(their docs list a number of custom error codes, up to and including -32613 but don't mention -32614 !)
SimplyVC Fanton:
{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"too wide blocks range, the limit is 100"}}
Drpc Avalanche:
{"id":1,"jsonrpc":"2.0","error":{"message":"requested too many blocks from 0 to 16777216, maximum is set to 2048","code":-32000}}
Nirvana Labs:
(after pause for 60+ seconds... )
<!DOCTYPE html>
<html>
<head>
<title>503 Backend fetch failed</title>
</head>
<body>
<h1>Error 503 Backend fetch failed</h1>
<p>Backend fetch failed</p>
<h3>Guru Meditation:</h3>
<p>XID: 343710611</p>
<hr>
<p>Varnish cache server</p>
</body>
</html>
(I double checked that the Nirvana Labs endpoint is valid, and responds to an eth_getLogs request with a smaller block range properly--very bad behavior on their part!)
d879dd7
to
0f2facb
Compare
This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1. Should have only been logging at that level for rpc error "Limit Exceeded" More specifically, there are 4 interrelated issues addressed in this PR: 1. The logic for whether to retry or reduce batch size was wrong--so it was retrying with reduced batch size unnecessariy for transient errors 2. The error was being matched against a concrete JsonError type which is very fragile due to the number of types the type gets wrapped and re-formatted while it's propagated up the stack from geth through different layers. (It may have matched only in simulated geth but not with a live rpc server). Now it's matched against rpc.Error interface defined in geth for this purpose, which should work more generally. 3. There was a bug in the test for this feature (related to pointer indirection) which caused a false positive PASS 4. In addition to the rate limiting error returned by infura, alchemy, etc. I've added a similar error code geth can return when the request size is too large Also: A new subtest has been added to make sure that unrelated errors do not log this error message
This will make our JsonError fully parallel with jsonError, so they can both be treated in the same way.
There are a lot of different error codes and messages which can be returned depending on what type of rpc server it is. Classification has been expanded to include both the codes and regex matching of the messages, and moved into client/errors.go where a similar process happens to disambiguate SendTx error
f5c6583
to
db83ff2
Compare
db83ff2
to
7d12d19
Compare
Quality Gate passedIssues Measures |
BCF-3250
This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1.
It was intended to only log at that level for the rpc error "Limit Exceeded" (-32005). Same issue for batch_size > 1, we should only be reducing the batch size if the error was specifically "Limit Exceeded", otherwise just retry again next time with the same size. We can also add handling of other errors returned by rpc servers that can mean the same thing.
More specifically, these are the main issues addressed in this PR:
wrapped and re-formatted while it's propagating up the stack from geth through different intermediate layers. (I think it may have matched only in
simulated geth but not with a live rpc server.) This has been improved by matching against rpc.Error interface defined in geth for this purpose
New tests: