Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node should panic in PoST supervised mode if service doesn't connect #5267

Closed
1 task done
fasmat opened this issue Nov 15, 2023 · 3 comments
Closed
1 task done

Node should panic in PoST supervised mode if service doesn't connect #5267

fasmat opened this issue Nov 15, 2023 · 3 comments
Assignees

Comments

@fasmat
Copy link
Member

fasmat commented Nov 15, 2023

Description

There are a few ways in which a node can be configured to start in PoST supervised mode but fail to start smeshing. The most common ones are:

  • grpc-private-services/grpc-tls-services are missing post handler and therefor do not accept connections from the post service
  • node-address is not the correct address, i.e. it doesn't match the host + port defined by grpc-private-listener/grpc-tls-listener

This will result in the post service starting up and trying to connect every 5 seconds, but failing to do so.

With #5259 a possibly incorrect config is detected and logged by the node during startup. Additionally if the configuration causes the PoST service to be unable to connect to the node, the node should panic and shut down so a misconfigured setup doesn't go unnoticed by the user.

Acceptance criteria

  • Configurable retries in post-service post-rs#151
  • update the node supervisor to not try to restart the post service when it exits unexpectedly but instead panic with a message that the node is shutting down due to the post service exiting unexpectedly:
    • since the output of the post service is captured and added to the node logs it will show that connection/configuration problems were the cause
@poszu
Copy link
Contributor

poszu commented Nov 15, 2023

I have two ideas for improving the UX.

  1. How about automatically enabling post GRPC service on the private services port when --start-smeshing=true and the post service is not already enabled?
  2. Is there any reason to keep node-address field in the config? We already know in the code what value it should have. Perhaps we could just remove it and calculate it in the code to point to the address on which the post service is enabled on. Or leave it as an optional parameter with such a default - in case the user really needs to change its value to something else.

@fasmat
Copy link
Member Author

fasmat commented Nov 15, 2023

How about automatically enabling post GRPC service on the private services port when --start-smeshing=true and the post service is not already enabled?

A PR with this change was already merged: #5259 the problem is a user might not want to use the private service but instead the TLS service (hybrid PoST setup: supervised + remote). We can say we do not want to support hybrid setup which would make things easier on our side but may interfere with plans by users...

Is there any reason to keep node-address field in the config? We already know in the code what value it should have. Perhaps we could just remove it and calculate it in the code to point to the address on which the post service is enabled on. Or leave it as an optional parameter with such a default - in case the user really needs to change its value to something else.

Theoretically node-address can be automatically set to the listener that exposes post adding https:// if it's TLS listener or http:// otherwise. We should however still add my suggestion to exit the post service if it fails to connect after a certain number of tries and have the node panic when the supervised service exits. Just to ensure that an invalid configuration is not ignored because it only complains in the logs.

@poszu
Copy link
Contributor

poszu commented Nov 20, 2023

Opened spacemeshos/post-rs#151 for the --max-retries.

spacemesh-bors bot pushed a commit that referenced this issue Nov 20, 2023
## Motivation

Part of #5267
Closes #5260

This changes the node configuration to not allow an operator to decide any more which GRPC handlers are exposed on which endpoint. Instead it hard codes the following:

- public: GlobalState, Mesh, Transaction, Node, Activation
- private: Admin, Debug, Smesher, Post
- mTLS: Post

Additionally it removes all configuration options for the PoST service and sets them automatically if the node is started with `smeshing-start=true` via the command line or via the config.

## Changes
- remove ability to set which handlers are available on which endpoint and hardcode them instead
- remove workaround code added in #5259 that was necessary when the operator could still configure the supervised post service
  - supervised post service binary is now always expected to be in the same directory as the node binary
  - the supervised post service is now instructed to always connect to the node via the private listener
  - remote post services can connect via the TLS listener

## Test Plan
updated existing tests

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed
- [x] Update [changelog](../CHANGELOG.md) as needed
@github-project-automation github-project-automation bot moved this from 🏗 Doing to ✅ Done in Dev team kanban Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants