Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

Merged
merged 7 commits into from
Aug 27, 2024

Conversation

reductionista
Copy link
Contributor

@reductionista reductionista commented Dec 21, 2023

BCF-3250
This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1.
It was intended to only log at that level for the rpc error "Limit Exceeded" (-32005). Same issue for batch_size > 1, we should only be reducing the batch size if the error was specifically "Limit Exceeded", otherwise just retry again next time with the same size. We can also add handling of other errors returned by rpc servers that can mean the same thing.

More specifically, these are the main issues addressed in this PR:

  1. The logic for whether to retry or reduce batch size was wrong--so it was retrying with reduced batch size unnecessariy for transient errors
  2. The error was being matched against a concrete JsonError type, which is fragile due to the number of different ways in which the error type can gets
    wrapped and re-formatted while it's propagating up the stack from geth through different intermediate layers. (I think it may have matched only in
    simulated geth but not with a live rpc server.) This has been improved by matching against rpc.Error interface defined in geth for this purpose
  3. There was a bug in the test for this feature (related to pointer indirection) which caused a false positive PASS
  4. The error code -32005 is used by infura and at least one other I remember, but there are a handful of other possible error codes returned from 3rd party providers under these conditions, and can mean the same thing. To disambiguate further, I've added regular expression matching similar to what's done when tx mgr gets back errors from SendTx and needs to take different actions based on categorizing them into equivalence classes. (And ability to add a custom regex in toml config to match against.)

New tests:

  • A new subtest has been added to make sure that unrelated errors do not log this error message any more
  • A new test for IsTooManyResults, feeding it all the error strings I've seen come back from actual rpc servers to make sure the regular expressions match and are consistent with the error codes.

Copy link
Contributor

I see that you haven't updated any README files. Would it make sense to do so?

nolag
nolag previously approved these changes Dec 21, 2023
Comment on lines 662 to 813
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)
Copy link
Contributor

@jmank88 jmank88 Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if !errors.As(err, &rpcErr) || ... is true when err == nil, and false when actually matching.

Should it be this?

Suggested change
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)
if errors.As(err, &rpcErr) && rpcErr.Code != jsonRpcLimitExceeded {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

Copy link
Contributor Author

@reductionista reductionista Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's actually how it is currently on develop branch, and that's the bug that's being fixed here:

Presently, it checks to see if it's an rpc error AND if it's not jsonRpcLimitExceeded. But it should be checking that it's either not an rpc error OR not the specific rpc error jsonRpcLimitExceeded. In either of those cases, it should just return the error. The only case where it needs to handle it specially (by reducing the batch size until it fits) is if it's specifically this type of rpc error (limit exceeded).

The nil case is already handled, because this is in an if err != nil block

I guess it might be more readable if instead of !errors.As() || rpc.Err.Code != jsonRpcLimitExceeded we change it to !(errors.As() && rpc.Err.Code == jsonRpcLimitExceeded)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the As check fails, then you cannot access the rpcErr.Code field. This does not make sense as written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think I see what this is doing but it seems like a messy way to express it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

Suggested change
if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)
if !(errors.As(err, &rpcErr) && rpcErr.Code == jsonRpcLimitExceeded) {
lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what I suggested above... I think that's a better way of expressing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic got slightly more complex once I added a second error code it can match against, so I moved it into a separate helper function where I think it's a lot easier to read.

@@ -1416,7 +1416,7 @@ func TestTooManyLogResults(t *testing.T) {
from := fq.FromBlock.Uint64()
to := fq.ToBlock.Uint64()
if to-from >= 4 {
return []types.Log{}, &clientErr // return "too many results" error if block range spans 4 or more blocks
return []types.Log{}, clientErr // return "too many results" error if block range spans 4 or more blocks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here? Is this declared in a way that makes it easy to make that mistake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm glad you asked, there might still be something wrong here. The Error field returned by geth directly in a jsonMessage is set to their own internal type of &jsonError:
https://github.com/smartcontractkit/go-ethereum/blob/master/rpc/json.go#L109-L123

Then our client wrapper converts it at some point into our own type of JsonError, but confusingly... we redefine geth's jsonError in 3 different places, twice as a struct and once as an interface:

core/chains/evm/client/errors.go:type JsonError struct {
core/chains/evm/client/simulated_backend_client.go:     // type JsonError struct {
core/scripts/chaincli/handler/report.go:type JsonError interface {

Based on these, it looks like returning &clientErr was the right thing to do after all. But if that's true, then the check for Error.As() will fail unless we pass it a **JsonError which doesn't seem right either. Maybe I shouldn't be checking against a concrete type anyway, it's too brittle? I think I need to investigate this more and figure out what the different ways are in which this error might be returned.

Copy link
Contributor Author

@reductionista reductionista Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally had a chance to go back and look through geth's error handling more carefully to sort this out. The number of different ways in which these json rpc errors can get transformed and wrapped as they are propagating up the stack is enormous. The type changes several times, sometimes involving a pointer and sometimes not.

There is an interface defined for this purpose in geth, rpc.Error, which is designed to test whether something has an error code attached to it, which is all we need. The only problem with that was that the JsonError type we define that it can get transformed into under some conditions doesn't actually satisfy this. Since it's clear from the comments that the main purpose of us defining this type was to mimick the jsonError type in geth which isn't exported, I fixed this by adding the ErrorCode() method so that they both behave the same. (Also noticed a pointer receiver on the String method, which I think was not supposed to be there.)

While looking through the geth code, I noticed another error code it can return which it would make sense for us to handle in the same way, so I've added that as well. And added a regression test for this bug of logging a critical error when there's nothing critical going on.

@cl-sonarqube-production
Copy link

Copy link
Contributor

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 21, 2024
@github-actions github-actions bot closed this Feb 29, 2024
@reductionista reductionista reopened this Jun 6, 2024
@reductionista reductionista requested a review from a team as a code owner June 6, 2024 20:54
@reductionista
Copy link
Contributor Author

This got closed due to being stale, but someone noticed it again recently so I'll pick it back up again soon.

@github-actions github-actions bot removed the Stale label Jun 7, 2024
@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch from 8c4fdff to 8c6764b Compare August 2, 2024 18:08
@reductionista reductionista changed the title Fix misleading LogPoller logging Fix FilterLog error handling in LogPoller Aug 2, 2024
@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch from 8c6764b to f4f39b6 Compare August 2, 2024 18:24
@reductionista reductionista requested a review from a team as a code owner August 2, 2024 18:29
@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch from 1cb7f0f to a3c4524 Compare August 2, 2024 18:47
@reductionista reductionista changed the title Fix FilterLog error handling in LogPoller BCF-3250: Fix FilterLog error handling in LogPoller Aug 2, 2024
@reductionista reductionista changed the title BCF-3250: Fix FilterLog error handling in LogPoller [BCF-3250]: Fix FilterLog error handling in LogPoller Aug 2, 2024
@@ -452,7 +452,16 @@ func (err JsonError) Error() string {
return err.Message
}

func (err *JsonError) String() string {
// Satisfies rpc.Error interface, which any actual jsonError returned from geth is expected to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to satisfy rpc.DataError? ErrorData() seems redundant.
Should we add interface guard to ensure JsonError always implements rpc.Error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no present need for it, but I guess it probably makes sense to add it so that it's fully parallel with the jsonError type defined in go-ethereum

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm glad you asked this--I was thinking I hadn't added ErrorData(), but looks like I'd started to add it but hadn't actually finished it. Just completed that, and added a type guard for both.

The jsonError's returned from geth satisfy both rpc.Error and rpc.DataError--presently we only use rpc.Error but having both could avoid a future mistake where we're expecting it to satisfy that interface and it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 811 to 813
case jsonRpcLimitExceeded:
fallthrough
case jsonRpcResultsTooLarge:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case jsonRpcLimitExceeded:
fallthrough
case jsonRpcResultsTooLarge:
case jsonRpcLimitExceeded, jsonRpcResultsTooLarge:

Comment on lines 798 to 802
const jsonRpcResultsTooLarge = -32003

// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically with Go naming these should be RPC:

Suggested change
const jsonRpcResultsTooLarge = -32003
// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md
const jsonRPCResultsTooLarge = -32003
// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other
// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
const jsonRPCLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md

// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md

func isRequestTooLargeError(err error) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better name for this, since it also matches LimitExeceeded?

}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the logic below makes sense in the case of LimitExceeded. Why group them together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was designed specifically for the LimitExceeded error, which is returned whenever the number of results that would have come back from an eth_getLogs request exceeds a limit set by the rpc server. (In my testing, this was found to be 10,000 for most 3rd party rpc servers, such was alchemy and infura.)

That one has been in the PR since Dec, I just added this other one (ResultsTooLarge) because the name of it sounds like it might get returned by some rpc servers under similar circumstances--we have no actual evidence that this is the case, I'm just guessing on that because the name sounds like what LimitExceeded does in practice. We can leave this one out, but we can't leave out LimitExceeded since that's what this error handling was added for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the current comment for LimitExceeded accurate? Rate limiting is significantly different from result size limiting.

Copy link
Contributor Author

@reductionista reductionista Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It had been a while since I tested this, so I just went through and tested the responses using curl on a dozen or more different rpc servers.
Sadly, they are much more varied than I'd remembered (although in Dec I think I'd only tried it on 3 providers)--we may have to rely primarily on string parsing 😩, or at least use that in conjunction with error codes.

Here is the response infura gives, which is the LimitExceeded error:

{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"data":{"from":"0xCB3D","limit":10000,"to":"0x7B737"},"message":"query returned more than 10000 results. Try with this block range [0xCB3D, 0x7B737]."}}

Looking at their docs, they also return the same error code if you've exceeded your quota for the day, or are sending requests too fast. So the comment is accurate, but you're right that they are lumping very different types of errors together under the same error code--I guess we would need to check either "data" or "message" to differentiate it from other types of rate limiting.

Some other examples of responses (I was either mistaken about Alchemy, or they changed it since then).

Alchemy:

{"jsonrpc":"2.0","id":1,"error":{"code":-32602,"message":"Log response size exceeded. You can make eth_getLogs requests with up to a 2K block range and no limit on the response size, or you can request any block range with a cap of 10K logs in the response. Based on your parameters and the response size limit, this block range should work: [0x0, 0x133e71]"}}

(from their docs, -32602 means "invalid method parameters")

LinkPool, Blockdaemon, Chainstack:

{"jsonrpc":"2.0","id":1,"error":{"code":-32002,"message":"request timed out"}}

Quicknode BSC, Fantom:

{"jsonrpc":"2.0","id":1,"error":{"code":-32614,"message":"eth_getLogs is limited to a 10,000 range"}}

(their docs list a number of custom error codes, up to and including -32613 but don't mention -32614 !)

SimplyVC Fanton:

{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"too wide blocks range, the limit is 100"}}

Drpc Avalanche:

{"id":1,"jsonrpc":"2.0","error":{"message":"requested too many blocks from 0 to 16777216, maximum is set to 2048","code":-32000}}

Nirvana Labs:
(after pause for 60+ seconds... )

<!DOCTYPE html>
<html>
  <head>
    <title>503 Backend fetch failed</title>
  </head>
  <body>
    <h1>Error 503 Backend fetch failed</h1>
    <p>Backend fetch failed</p>
    <h3>Guru Meditation:</h3>
    <p>XID: 343710611</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

(I double checked that the Nirvana Labs endpoint is valid, and responds to an eth_getLogs request with a smaller block range properly--very bad behavior on their part!)

@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch 4 times, most recently from d879dd7 to 0f2facb Compare August 16, 2024 02:21
This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1.
Should have only been logging at that level for rpc error "Limit Exceeded"

More specifically, there are 4 interrelated issues addressed in this PR:

1. The logic for whether to retry or reduce batch size was wrong--so it was retrying with reduced batch size unnecessariy for transient errors
2. The error was being matched against a concrete JsonError type which is very fragile due to the number of types the type gets wrapped and re-formatted
   while it's propagated up the stack from geth through different layers.  (It may have matched only in simulated geth but not with a live rpc server).
   Now it's matched against rpc.Error interface defined in geth for this purpose, which should work more generally.
3. There was a bug in the test for this feature (related to pointer indirection) which caused a false positive PASS
4. In addition to the rate limiting error returned by infura, alchemy, etc. I've added a similar error code geth can return when the request size is too large

Also:

A new subtest has been added to make sure that unrelated errors do not log this error message
This will make our JsonError fully parallel with jsonError, so they can
both be treated in the same way.
There are a lot of different error codes and messages which can be
returned depending on what type of rpc server it is. Classification has been
expanded to include both the codes and regex matching of the messages,
and moved into client/errors.go where a similar process happens to disambiguate
SendTx error
@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch from f5c6583 to db83ff2 Compare August 22, 2024 23:24
@reductionista reductionista force-pushed the fix/logpoller-too-many-results-message branch from db83ff2 to 7d12d19 Compare August 22, 2024 23:27
@reductionista reductionista added this pull request to the merge queue Aug 27, 2024
Merged via the queue into develop with commit bf2b72d Aug 27, 2024
134 of 135 checks passed
@reductionista reductionista deleted the fix/logpoller-too-many-results-message branch August 27, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants