Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Traffic Control] Add dynamic config via admin api #20887

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

williampsmith
Copy link
Contributor

@williampsmith williampsmith commented Jan 15, 2025

Description

Implement an admin API endpoint to do a hot reconfiguration of the traffic control policy without clearing blocklists or existing count min sketch.

To facilitate this, the PR introduces a refactor that allows AuthorityState to hold a reference to TrafficController, and for all sui callsites to reference this rather than instantiating separately. This is ok for our purposes (and indeed is much cleaner) because sui-node enforces that validators do not run json rpc server. Additionally, we need to make the policy a member of the traffic controller struct in order to reference it from the main thread.

Things to note:
For now, this only works when the node is previously running traffic controller with a blocklist policy. i.e. it will fail if used to enable traffic control on a node that did not already have it enabled. It will also fail if called when the node is running an allowlist policy.

Given the above, there are two intended use patterns:

  1. Before being called, the node is running traffic controller in dry-run mode with some non-allowlist policy config so that tallying and metrics collection occurs. In such a case, it can be called with the appropriate flag to disable dry-run mode. Note that it can also be called to enable dry-run mode where the node previously had it running in active mode.
  2. Before being called, the node is running traffic controller in active mode, but the node operator wants to elevate the policy restriction by adjusting the threshold values.

Note also that this currently is supported for rpc node and validator server traffic control use cases. Bridge node is not yet supported.

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

Copy link

vercel bot commented Jan 15, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 3, 2025 11:59pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Feb 3, 2025 11:59pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Feb 3, 2025 11:59pm

@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 15, 2025 03:19 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 8e20a05 to 9f63abd Compare January 16, 2025 05:05
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:06 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 9f63abd to 813327f Compare January 16, 2025 05:28
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:28 — with GitHub Actions Inactive
@williampsmith williampsmith marked this pull request as ready for review January 16, 2025 05:32
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:32 — with GitHub Actions Inactive
Copy link
Contributor

@johnjmartin johnjmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the most familiar with this code, but LGTM

crates/sui-core/src/authority.rs Show resolved Hide resolved
crates/sui-core/src/traffic_controller/mod.rs Outdated Show resolved Hide resolved
crates/sui-node/src/admin.rs Outdated Show resolved Hide resolved
…erences with a new instance containing old data, then shutdown old instance
…; during downtime we ignore tally requests with log line
@@ -461,19 +462,27 @@ impl TestNConnIPPolicy {

fn handle_tally(&mut self, tally: TrafficTally) -> PolicyResponse {
let client = if let Some(client) = tally.direct {
warn!("TESTING -- Handling tally for client: {:?}", client);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be removed

Comment on lines +209 to +251
if let Some(error_threshold) = error_threshold {
match *self.error_policy.as_ref().unwrap().lock().await {
TrafficControlPolicy::FreqThreshold(ref mut policy) => {
policy.client_threshold = error_threshold;
if let Some(dry_run) = dry_run {
policy.config.dry_run = dry_run;
}
}
TrafficControlPolicy::TestNConnIP(ref mut policy) => {
policy.threshold = error_threshold;
if let Some(dry_run) = dry_run {
policy.config.dry_run = dry_run;
}
}
_ => {
return Err(SuiError::InvalidAdminRequest(
"Unsupported prior error policy type during traffic control reconfiguration"
.to_string(),
));
}
}
}
if let Some(spam_threshold) = spam_threshold {
match *self.spam_policy.as_ref().unwrap().lock().await {
TrafficControlPolicy::FreqThreshold(ref mut policy) => {
policy.client_threshold = spam_threshold;
if let Some(dry_run) = dry_run {
policy.config.dry_run = dry_run;
}
}
TrafficControlPolicy::TestNConnIP(ref mut policy) => {
policy.threshold = spam_threshold;
if let Some(dry_run) = dry_run {
policy.config.dry_run = dry_run;
}
}
_ => {
return Err(SuiError::InvalidAdminRequest(
"Unsupported prior spam policy type during traffic control reconfiguration"
.to_string(),
));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you avoid repetition here using a helper func, eg:

async fn update_policy_threshold(
    policy: &Arc<Mutex<TrafficControlPolicy>>,
    threshold: u64,
    dry_run: Option<bool>,
) -> Result<(), SuiError> {
match *policy.lock().await {
        TrafficControlPolicy::FreqThreshold(ref mut policy) => {
        [...]

@@ -738,6 +824,7 @@ async fn test_traffic_sketch_allowlist_mode() {
}

async fn assert_traffic_control_ok(mut test_cluster: TestCluster) -> Result<(), anyhow::Error> {
telemetry_subscribers::init_for_testing();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason for adding this to the different tests?

@johnjmartin
Copy link
Contributor

To facilitate this, the PR introduces a refactor that allows AuthorityState to hold a reference to TrafficController, and for all sui callsites to reference this rather than instantiating separately. This is ok for our purposes (and indeed is much cleaner) because sui-node enforces that validators do not run json rpc server.

can you expand on this? without the enforcement is there a risk that trafficController blocks JSONRPC requests? We do run this on fullnodes too

it will fail if used to enable traffic control on a node that did not already have it enabled. It will also fail if called when the node is running an allowlist policy.

is this by design? I think it's desirable for validators to enable traffic control only via the api, but I understand if that's adding too much complexity to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants