Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nimbus_validator_client does not register validators with mev-relay when used with web3signer. #5730

Closed
catwith1hat opened this issue Jan 12, 2024 · 17 comments

Comments

@catwith1hat
Copy link
Contributor

catwith1hat commented Jan 12, 2024

The good case

The mev-boost process must receive validator registrations periodically in order for mev-boost to work. In the logs this looks like this:

Jan 12 11:05:22 node3 mev-boost[1638]: time="2024-01-12T11:05:22.539+01:00" level=info msg="http: POST /eth/v1/builder/validators 200" duration=0.333286 method=POST path=/eth/v1/builder/validators status=200 version=v1.6
Jan 12 11:11:46 node3 mev-boost[1638]: time="2024-01-12T11:11:46.363+01:00" level=info msg="http: POST /eth/v1/builder/validators 200" duration=0.291441 method=POST path=/eth/v1/builder/validators status=200 version=v1.6
Jan 12 11:18:13 node3 mev-boost[1638]: time="2024-01-12T11:18:13.806+01:00" level=info msg="http: POST /eth/v1/builder/validators 200" duration=0.540833 method=POST path=/eth/v1/builder/validators status=200 version=v1.6

You can see a period refresh via POST request. When I run the in-process validator inside nimbus_beacon_node this works fine. I run nimbus_beacon_node with a remote web3signer. Here are the essential command line flags are (stripped for brevity):

/nix/store/ij1gdm2rny1wn4mck3rg9pp24b2zk8z2-nimbus-eth2-24.1.1/bin/nimbus_beacon_node \
  '--suggested-fee-recipient=0xd8da6bf26964af9d7eed9e03e53415d37aa96045' \
  '--in-process-validators=true' \
  '--web3-signer-url=http://localhost:9330' \
  '--payload-builder=true' \
  '--payload-builder-url=http://localhost:18910'

The bad case

Switching to an external validator seems to work fine at first glance (attestion, block building works). However, MEV boost does not work. There are no POST requests in the logs, so I suspect that the mev-boost process stopped receiving registrations. The essentials flags are (stripped for brevity):

/nix/store/ij1gdm2rny1wn4mck3rg9pp24b2zk8z2-nimbus-eth2-24.1.1/bin/nimbus_beacon_node \
  '--suggested-fee-recipient=0xd8da6bf26964af9d7eed9e03e53415d37aa96045' \
  '--in-process-validators=false' \
  '--payload-builder=true' \
  '--payload-builder-url=http://localhost:18910'
/nix/store/ij1gdm2rny1wn4mck3rg9pp24b2zk8z2-nimbus-eth2-24.1.1/bin/nimbus_validator_client \
  '--suggested-fee-recipient=0xd8da6bf26964af9d7eed9e03e53415d37aa96045'  \
  '--web3-signer-url=http://localhost:9330'
  '--payload-builder=true'

Increasing the log level on nimbus_validator_client shows:

Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.004+01:00 Could not get fee recipient for registration data validator=XXX
Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.007+01:00 Validator registration missing validator index validator=XXXX
Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:57.323+01:00 Validator registrations prepared           total=1 succeed=0 cached=0 bad=0 errors=0 index_missing=0 fee_missing=1 incorrect_time=0

All validator registrations fail.

Broken code flow

prepareRegistrationList prepares all registrations and keeps the statistics printed above. getValidatorRegistration prepares a single registration and fails at this block:

    let feeRecipient = vc.getFeeRecipient(validator.pubkey, vindex,
                                          currentSlot.epoch())
    if feeRecipient.isNone():
      debug "Could not get fee recipient for registration data",
            validator = shortLog(validator)
      return err(RegistrationKind.MissingFee)

Looking at getFeeRecipient shows that dynamic recipients are treated differently as there is a special block handling them at the top. L956 retrieves information from the dynamicFeeRecipientStore. I don't think that's relevant here though, because I think that the whole function falls through to Opt.None(Eth1Address).

Why does it fall through? Probably because vc.config.defaultFeeRecipient is not set to the command line value provided. Maybe some glue code is missing somewhere to propagate the flag value?

This bug is very similar to #5599 . Maybe @stephanep or @tersec has some insights. Thank you!

@stephanep
Copy link
Contributor

If i get it right, the main relevant difference between #5599 and this issue is that you are using a web3-signer instead of having the validator keys stored with the validator-client. And contrary to #5599, here the validator client fails to register its validators with the beacon node because it cannot determine the fee address to use.

There is no mention of a keymanger API being used to configure per-validator settings, so the correct behaviour in that situation would be that the VC client uses the value passed to the --suggested-fee-recipient instead of failing with the Could not get fee recipient for registration data validator error, right?

@stephanep
Copy link
Contributor

stephanep commented Jan 12, 2024

Would be interesting to add a debug after

vc.config.validatorsDir, pubkey, perValidatorDefaultFeeRecipient)
that prints the value of the withdrawalAddress, perValidatorDefaultFeeRecipient and staticRecipient, and reproduce the situation. Would help understand what is happening here.

The assumption would be that getSuggestedFeeRecipient returns a failed result. I'm not sure it is correct to call getSuggestedFeeRecipient proc with vc.config.validatorsDir as the first argument, as it expects a KeymanagerHost instead (for instance vc.keymanagerHost).

@catwith1hat
Copy link
Contributor Author

I added some debug prints in this commit and found that:

Jan 12 22:33:59 node1 nimbus_validator_client[25980]: DBG 2024-01-12 22:33:59.009+01:00 Registrations                              vc_default_fee_recipient=ok(0xd8da6bf26964af9d7eed9e03e53415d37aa96045) withdrawal_address=none() per_validator_default_fee_recipient=0xd8da6bf26964af9d7eed9e03e53415d37aa96045
Jan 12 22:33:59 node1 nimbus_validator_client[25980]: DBG 2024-01-12 22:33:59.009+01:00 Validator directory for pubkey does not exist dir_for_pub_key=/eth/ssd/N1-I0/nimbus-validator/validators/0xkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk

So your assumption

The assumption would be that getSuggestedFeeRecipient returns a failed result.

is correct. The failing code is here:

proc getSuggestedFeeRecipient*(
validatorsDir: string, pubkey: ValidatorPubKey,
defaultFeeRecipient: Eth1Address):
Result[Eth1Address, ValidatorConfigFileStatus] =
# In this particular case, an error might be by design. If the file exists,
# but doesn't load or parse that is more urgent. People might prefer not to
# override default suggested fee recipients per validator, so don't warn.
if not dirExists(validatorsDir.validatorKeystoreDir(pubkey)):
return err noSuchValidator

The comment even explains that the function deliberately does not return defaultFeeRecipient when the validator directory is not found, which is the case for web3signer validators. defaultFeeRecipient would contain my intended fee recipient (specified via the cmdline).

I wonder what's the best way forward. getSuggestedFeeRecipient could just return defaultFeeRecipient when it can't find the validatorDir, or we could rewrite this section:

staticRecipient = getSuggestedFeeRecipient(
vc.config.validatorsDir, pubkey, perValidatorDefaultFeeRecipient)

to handle getSuggestedFeeRecipient returning an error.

@stephanep
Copy link
Contributor

That completely makes sense, that's why it does not work when an external web3 signer is used. Out of curiosity, if you create an empty directory with your validator pubkey under the validators folder, would that make everything work for you as expected?

As mentioned in https://nimbus.guide/web3signer.html when using a web3 signer there should be nothing else to do than passing the --web3-signer-url command line option. So it is on purpose that you do not have a folder for your validator in that case.

My initial thought would be that the current implementation of getSuggestedFeeRecipient does not consider that case, but i don't fully understand the an error might be by design statement in the comment. I can see it was introduced in this commit which is 2 years old, whereas web3 signer was introduced more recently in v23.9.0.

So i would agree that getSuggestedFeeRecipient should be amended not to fail in that case (and instead return the defaultFeeRecipient), but based on the comment i would be wary of breaking other use case. Probably safer to add a check that web3 signer (or verifying signer) is enabled in the configuration before returning the default.

Tagging @arnetheduck who wrote the current version of getSuggestedFeeRecipient for further advice on this.

@cheatfate
Copy link
Contributor

There is one moment which was missed.

Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.007+01:00 Validator registration missing validator index validator=XXXX

This is actual problem of why registrations not happened, because getFeeRecipient requires validator index to be present, but as we can see validator index was missing, so it fails earlier.

To get more information i need more logs, for at least 24 seconds of running right after Validator registration missing validator index seen, because validators indices updated once per slot so it is possible that indices are not yet updated while preparation process is already happened, or it could be that remote web3signer holds public keys which are (not activated yet, members of different testnet/mainnet).

@catwith1hat
Copy link
Contributor Author

Sorry, I can't share logs verbatim, and what you are seeing is a redacted version. Also let me assure you that my shared logs are pretty much a steady state. Nimbus never succeeds in registering web3signer-provided pubkeys and always claims fee_missing=${ALL_MY_VALIDATORS}.

Also I believe that we have diagnosed the problem sufficiently and the absense of validator directories is to blame. I have patched the problem in my private repo. stable...catwith1hat:nimbus-eth2:stable
I am happy to open a PR.

@cheatfate
Copy link
Contributor

cheatfate commented Jan 16, 2024

IF this log lines

Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.004+01:00 Could not get fee recipient for registration data validator=XXX
Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.007+01:00 Validator registration missing validator index validator=XXXX
Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:57.323+01:00 Validator registrations prepared           total=1 succeed=0 cached=0 bad=0 errors=0 index_missing=0 fee_missing=1 incorrect_time=0

are real one, then your investigation is followed wrong path, because code flow never reaches getFeeRecipient().
There is present a loop https://github.com/status-im/nimbus-eth2/blob/stable/beacon_chain/validator_client/duties_service.nim#L673-L685 which performs validator registrations every slot, it calls https://github.com/status-im/nimbus-eth2/blob/stable/beacon_chain/validator_client/duties_service.nim#L543 procedure which first forms list of registrations which should be sent to BN using this call https://github.com/status-im/nimbus-eth2/blob/stable/beacon_chain/validator_client/duties_service.nim#L550
prepareRegistrationList() iterates over all the validators VC knows (both dynamic and static) and calls getValidatorRegistration() which first check would be check for validator index in https://github.com/status-im/nimbus-eth2/blob/stable/beacon_chain/validator_client/common.nim#L1025-L1028

So you see this error message in the logs. As you can see code flow never reaches getFeeRecipient() call, because for this call validator index is required https://github.com/status-im/nimbus-eth2/blob/stable/beacon_chain/validator_client/common.nim#L1045-L1046

Remote web3signer did not provide to VC validator indicies, it only provides validator public keys. To get validator index VC should query it from BN. This also happens once per slot in other loop, so if situation is not recovers in one slot time it means that validator public key which was provided by web3signer is not activated yet or slashed or not present in blockchain. Or there is one more bug/issue in VC/web3signer/BN used in configuration.

@catwith1hat
Copy link
Contributor Author

catwith1hat commented Jan 16, 2024

My apologies and sorry for wasting your time here a bit. I do have unactivated validators in my web3signer instance. So I guess that's where the log lines:

Jan 12 10:58:57 node1 nimbus_validator_client[22860]: DBG 2024-01-12 10:58:27.007+01:00 Validator registration missing validator index validator=XXXX

might come from.

However, that problem above is unrelated to getFeeRecipient by my judgement. getFeeRecipient is broken for activated web3signer pubkeys. In conclude that from my additional debug lines added in #5730 (comment) , which clearly show that getFeeRecipient fails.

To me the only remaining question is how to fix this upstream.

@stephanep
Copy link
Contributor

I agree with @catwith1hat diagnosis and i think the proposed fix would work.
The only thing i would do differently, is to use the perValidatorDefaultFeeRecipient value only if we have specified a web3signer. So we don't risk altering the current behavior in other cases.

Did you have a chance to confirm that your commit, or something like this, fixed the issue?

diff --git a/beacon_chain/validator_client/common.nim b/beacon_chain/validator_client/common.nim
index ac713102..eed6c72f 100644
--- a/beacon_chain/validator_client/common.nim
+++ b/beacon_chain/validator_client/common.nim
@@ -967,6 +967,10 @@ proc getFeeRecipient*(vc: ValidatorClientRef, pubkey: ValidatorPubKey,
         vc.config.validatorsDir, pubkey, perValidatorDefaultFeeRecipient)
     if staticRecipient.isOk():
       Opt.some(staticRecipient.get())
+    elif len(vc.config.web3SignerUrls):
+      # See issue 5730: getSuggestedFeeRecipient returns err if the validator directory
+      # does not exist, which is the case when a web3signer is used
+      Opt.some(perValidatorDefaultFeeRecipient)
     else:
       Opt.none(Eth1Address)

@stephanep
Copy link
Contributor

stephanep commented Jan 17, 2024

@catwith1hat could you please check if the following branch no longer has the issue:
https://github.com/stephanep/nimbus-eth2-fix-5730/tree/unstable

stephanep added a commit to stephanep/nimbus-eth2-fix-5730 that referenced this issue Jan 17, 2024
…-im#5730)

This change ensures that suggested-fee-recipient will be used as default even if the validator directory does not exist when a web3-signer-url is specified.
stephanep added a commit to stephanep/nimbus-eth2-fix-5730 that referenced this issue Jan 17, 2024
…-im#5730)

This change ensures that suggested-fee-recipient will be used as default even if the validator directory does not exist when a web3-signer-url is specified.
@tersec
Copy link
Contributor

tersec commented Jan 19, 2024

#5781 should address this.

@catwith1hat
Copy link
Contributor Author

Thanks everyone for the fixes. Please give me two or three days to deploy that change.

@stephanep
Copy link
Contributor

I had a look at #5781 that was merged into unstable, and also came to the conclusion that it (and more precisely this change) should have fixed this issue.

@catwith1hat
Copy link
Contributor Author

unstable does not compile for me. Do I need a newer version of Nim?

[using system Nim: /nix/store/wv2rkg1k6y6s13c1dchj5fg4ynywa4hw-nim-unwrapped-1.6.16/bin/nim]
[using system Nim: /nix/store/wv2rkg1k6y6s13c1dchj5fg4ynywa4hw-nim-unwrapped-1.6.16/bin/nim]
/build/source/beacon_chain/spec/engine_authentication.nim:52:20 template/generic instantiation of `base64urlEncode` fr>
/build/source/beacon_chain/spec/engine_authentication.nim:27:6 Error: 'base64urlEncode' can have side effects
> /build/source/beacon_chain/spec/engine_authentication.nim:30:16 Hint: 'base64urlEncode' calls `.sideEffect` 'encode'
>> /nix/store/wv2rkg1k6y6s13c1dchj5fg4ynywa4hw-nim-unwrapped-1.6.16/nim/lib/pure/base64.nim:165:6 Hint: 'encode' calle>

make: *** [Makefile:448: nimbus_validator_client] Error 1
make: *** Waiting for unfinished jobs....
/build/source/vendor/nim-websock/websock/http/client.nim:173:5 template/generic instantiation of `async` from here
/build/source/vendor/nim-websock/websock/http/client.nim:165:1 Warning: The raises pragma doesn't work on async proced>
/build/source/vendor/nim-websock/websock/websock.nim:257:5 template/generic instantiation of `async` from here
/build/source/vendor/nim-websock/websock/websock.nim:251:1 Warning: The raises pragma doesn't work on async procedures>
/build/source/beacon_chain/spec/engine_authentication.nim:52:20 template/generic instantiation of `base64urlEncode` fr>
/build/source/beacon_chain/spec/engine_authentication.nim:27:6 Error: 'base64urlEncode' can have side effects
> /build/source/beacon_chain/spec/engine_authentication.nim:30:16 Hint: 'base64urlEncode' calls `.sideEffect` 'encode'
>> /nix/store/wv2rkg1k6y6s13c1dchj5fg4ynywa4hw-nim-unwrapped-1.6.16/nim/lib/pure/base64.nim:165:6 Hint: 'encode' calle>

@arnetheduck
Copy link
Member

@catwith1hat you can check the required version by looking at the vendor/nimbus-build-system/vendor/Nim subfolder - nimbus only builds with exactly that version - we do generally not support or endorse using OS-provided compilers (as we might have patched Nim itself)

@etan-status
Copy link
Contributor

Clean build (resetting any pending changes)

ulimit -n 1024 && rm -rf ~/.cache ~/.nimble build nimcache vendor .git/modules && \
    git submodule sync --recursive && git clean -dfx
make update && make -j deps
make -j nimbus_beacon_node

@zah
Copy link
Contributor

zah commented Jan 25, 2024

The fix for this issue was shipped in Nimbus 24.1.2

@zah zah closed this as completed Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants