-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zero retryLimit Support in ReplayClient #442
Conversation
077fe70
to
7fd7a33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Just want you to add a TODO then good to go!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second look at
poktroll/pkg/client/events/replay_client.go
Line 198 in e0f7978
if publishError != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryanchriswhite Added you as as a reviewer instead of @red-0ne since you have more context into the reply client and wanted to get your 👀 on this.
3ec9494
to
126353d
Compare
Some questions before more changes are made:
|
@bryanchriswhite Can you PTAL at @ezeike questions before you go OOO? |
The
These may be due to network issues or some unexpected error or whatever but these are naturally possibilities. If either of these reasons do happen we are already treating them as errors but in order for the retry logic to work using a channel makes sense. I am a little unsure what you mean by retry in the same way but treat them as errors - it sounds like that would just be the same thing to me?
The By setting the @ezeike Are you finding that your connections are failing a lot and you are reaching the 10 retry limit often? If so I think investigating what is causing that could help us figure out a better solution than an infinite loop. [1] poktroll/pkg/client/events/replay_client.go Line 207 in e0f7978
|
Errors are received through the error channel and those trigger a Sleep and a retry. However when the error channel is closed OnError returns and the SDK panics. A first thought was just to eliminate that return so the execution continues on to the same Sleep and retry logic. workFn() is called every loop and errCh created anew anyway. The current design is the SDK is to take care of connection details and do its best to ensure there's always a connection to a node. We should keep this as much as possible to keep using the SDK as simple as we can. Either:
At the moment we should pick the quickest one that solves the issue so we can get our gateway on testnet as soon as possible. The second option was quicker so I went with that. There are other discussions to have around who reconnects, how to support using multiple nodes and how configurable the SDK should be but for now we want an MVP ASAP.
From a gateway perspective we want an infinite loop. SDK only allows for specifying a single node and it should do its best to always be connected to that node. The infinite loop doesn't have to exist in OnError though, we could do it in goPublishEvents in this case. If we don't have an infinite loop the gateway will have to handle connection failures and re-create the SDK. That's reasonable too, but shifts some work to gateway devs.
If we do this
I saw the error when I didn't have localnet running. Moving forward though we can't have the SDK panic when a node connection is lost or not available. |
I'll try keep this one shorter @ezeike.
Thoughts about adding this if err := recover(); err != nil {
errMsg := err.Error()
// You could be extra verbose by checking for more parts of the panic's error
// message by including the retry limit and error from the retry.OnError but
// this is the error message given to the panic we are talking about
// see: ./pkg/client/events/replay_client.go#L199
if strings.Contains(errMsg, ".goPublishEvents should never reach this spot:") {
restartSDK() // or however you do this I am unfamiliar but rest is correct
} else {
panic(err)
}
} This means we can still have the limit - fixes your problem of needing and infinite loop as is very minimal code change quite simple in fact. Placing this in the SDK somewhere it is appropriate will be pretty easy IMO @red-0ne and you @ezeike have more context here than I do.
If we don't go with the above solution we will count down instead of count up if that makes sense as we will use the max uint64 value counting down is the only real option - again minimal changes. |
f57bd58
to
99cfe9a
Compare
IMO the proper approach here is to refactor replay client to remove the panics that are present in the package. As a consequence there is now no way to limit the number of connection attempts the replay client makes. We need to have some feedback before we allow limiting reconnection attempts. Ideally we give an error channel to replay client to signal to the caller that it is giving up but that is a larger PR and would require more complexity in our gateway. We (the backend team) have decided we'd like to the panics and move forward with testing as-is. We can re-evaluate the error channel at a later date if the need becomes apparent. If we were to implement an error channel (and a config option for retryLimit) we'd need to modify these three clients: BlockClient, DelegationClient, TxClient their associated supplier functions, and both gateways (appgate and grove gateway) to handle the config and extra state management. |
99cfe9a
to
396f4c3
Compare
5b99b80
to
0d25513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please merge with main
- See bryan's comment
- Make sure E2E tests pass on DevNet
Will approve after ☝️
Co-authored-by: Daniel Olshansky <[email protected]>
Co-authored-by: Daniel Olshansky <[email protected]> Co-authored-by: ezeike <[email protected]>
8b008c7
to
6680149
Compare
…lient * pokt/main: [LocalNet][Hotfix] Turn off pproff by default (#500) [Tooling] Add pprof endpoints and documentation (#484) [Documentation] Basic blockchain operations (#454) [BlockClient] Refactor BlockClient to fetch latest block at init (#444) [Tooling] Add claim/proof/settlement dashboard & link (#479) [Config] feat: Simplify relay miner config (#477)
This comment was marked as outdated.
This comment was marked as outdated.
@ezeike @bryanchriswhite Could one of you please tend to the remaining |
e375bf8
to
4bf968a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving now that I see E2E tests passing in 0693ca1
* pokt/main: [Code Health] chore: cleanup localnet testutils (#515) Zero retryLimit Support in ReplayClient (#442) [LocalNet] Add infrastructure to run LLM inference (#508) [LocalNet] Documentation for MVT/LocalNet (#488) [GATEWAY] Makefile target added to send relays to grove gateway (#487) Update README [CI] Add GATEWAY_URL envar for e2e tests (#506) [Tooling] Add gateway stake/unstake/ logs (#503)
…cept * pokt/main: Reply to red0ne's comments's in https://github.com/pokt-network/poktroll/pull/501\#pullrequestreview-2047120225 [Docs] Added QuickStart video (#522) [Genesis Config] Set MaxValidators to 1 (#520) [Docs] Update `Quickstart` guide to onboard new devs (#501) [Code Health] chore: cleanup localnet testutils (#515) Zero retryLimit Support in ReplayClient (#442)
Co-authored-by: Bryan White <[email protected]> Co-authored-by: Daniel Olshansky <[email protected]> Co-authored-by: Redouane Lakrache <[email protected]>
Summary
Enables the replay client to continuously attempt to reconnect to the pocket node.
Issue
Current implementation of the replay client will panic after 10 failed connection attempts.
Type of change
Select one or more:
Testing
make go_develop_and_test
make test_e2e
devnet-test-e2e
label to the PR. THIS IS VERY EXPENSIVE, so only do it after all the reviews are complete.make docusaurus_start
Sanity Checklist