Set an arbitrary number of sled agents for tests #7353

jmpesp · 2025-01-15T22:09:45Z

As part of testing for oxidecomputer/crucible#1594, I needed the ability to specify multiple extra sled agents. I pulled some work out of the fix for that issue into this commit, which adds the extra_sled_agents parameter to the nexus_test macro (also refactoring that code so that other parameters can be added easier in the future).

Previously, ControlPlaneTestContext would configure two sled agents, and later the impl of nexus_test_interface::NexusServer for Nexus would mark the second sled agent as non-provisionable. This commit changes the default number of sled agents created as part of
ControlPlaneTestContext to 1, and if more are created they are not marked non-provisionable.

Many tests do not require two sled agents, and those that do can specify how many they need with the new the extra_sled_agents parameter.

A lot of other changes in this diff are mechanical: for example changing from reaching into the internals of ControlPlaneTestContext to using functions.

As part of testing for oxidecomputer/crucible#1594, I needed the ability to specify multiple extra sled agents. I pulled some work out of the fix for that issue into this commit, which adds the `extra_sled_agents` parameter to the `nexus_test` macro (also refactoring that code so that other parameters can be added easier in the future). Previously, `ControlPlaneTestContext` would configure two sled agents, and later the impl of nexus_test_interface::NexusServer for Nexus would mark the second sled agent as non-provisionable. This commit changes the default number of sled agents created as part of `ControlPlaneTestContext` to 1, and if more are created they are not marked non-provisionable. Many tests do not require two sled agents, and those that do can specify how many they need with the new the `extra_sled_agents` parameter. A lot of other changes in this diff are mechanical: for example changing from reaching into the internals of `ControlPlaneTestContext` to using functions.

andrewjstone

LGTM! Nice cleanup @jmpesp

andrewjstone · 2025-01-15T22:41:34Z

nexus/test-utils-macros/src/lib.rs

-    name: syn::Path,
-    _eq_token: syn::token::Eq,
-    value: syn::Path,
+pub(crate) enum NexusTestArg {


Nice: I like this structure much better.

andrewjstone · 2025-01-15T22:44:09Z

nexus/test-utils/src/lib.rs

@@ -120,6 +122,39 @@ pub const RACK_SUBNET: &str = "fd00:1122:3344:0100::/56";
 /// both transient deployments with no sensitive data.
 pub const TEST_SUITE_PASSWORD: &str = "oxide";

+pub struct ControlPlaneTestContextSledAgent {
+    #[allow(unused)]


Is this field actually used anywhere, or can we delete it?

It's storing the tempdir, so we can't delete it - I changed the field to _storage and removed the #[allow(unused)]

andrewjstone · 2025-01-15T22:46:24Z

nexus/test-utils/src/lib.rs

+                    }
+                }
+
+                blueprint_disks.insert(


Just an FYI, this just changed to use IdMaps in #7352, so you'll need a bit more cleanup here. Sorry!

No problem :)

When performing region allocation _again_ for a particular volume and increasing the redundancy, the region allocation query was not excluding the sleds that have already been used by the existing allocation as candidates for the new region (note that this only matters when the allocation strategy mandates distinct sleds!). This resulted in region replacement and region snapshot replacement having the potential to allocate the replacement region onto the same sled as an existing region. This commit fixes the region allocation query, and then fixes the fallout: many tests of region replacement or region snapshot replacement were not creating many simulated sled agents, and therefore could not properly satisfy the fixed region allocation query. The `extra_sled_agents` parameter added in oxidecomputer#7353 was used, along with changing the `DiskTestBuilder` to simulated one zpool on many sleds. Changing these replacement tests revealed the next problem: because all the simulated sleds share the same IP (::1) and range of ports for the fake Crucible agents, each region allocation would have the same target, and the volume construction request would look like: "target": [ "[::1]:1100", "[::1]:1100", "[::1]:1100" ] This is very confusing for the replacement routines! We can't change the simulated sled's IPs: they need to bind to localhost to listen for requests. Therefore this commit also adds the idea of "sled index", and this sled index is used to make each range of Crucible agent ports unique. This lead straight to the next problem: the simulated Pantry assumes that there is really only one simulated sled-agent, and this has all the storage! This is no longer true in a post-oxidecomputer#7353 world. Snapshots would not be able to complete because the simulated Pantry would only find one of the three regions when searching the first simulated sled-agent's storage. The real Pantry would construct a Crucible Upstairs, and delegate snapshot requests to that, so this commit also adds a simulated Crucible Upstairs that's aware of each of the simulated sled agent's `Storage`, and can therefore fulfill snapshot requests. Centralizing the place where all simulated sled agents (and Pantry servers!) are created makes this easy: the simulated Upstairs can be aware of all of the `Storage` objects, and the simulated Pantry can use that simulated Upstairs to take fake snapshots. Fixes oxidecomputer/crucible#1594.

When performing region allocation _again_ for a particular volume and increasing the redundancy, the region allocation query was not excluding the sleds that have already been used by the existing allocation as candidates for the new region (note that this only matters when the allocation strategy mandates distinct sleds!). This resulted in region replacement and region snapshot replacement having the potential to allocate the replacement region onto the same sled as an existing region. This commit fixes the region allocation query, and then fixes the fallout: many tests of region replacement or region snapshot replacement were not creating many simulated sled agents, and therefore could not properly satisfy the fixed region allocation query. The `extra_sled_agents` parameter added in #7353 was used, along with changing the `DiskTestBuilder` to simulated one zpool on many sleds. Changing these replacement tests revealed the next problem: because all the simulated sleds share the same IP (::1) and range of ports for the fake Crucible agents, each region allocation would have the same target, and the volume construction request would look like: "target": [ "[::1]:1100", "[::1]:1100", "[::1]:1100" ] This is very confusing for the replacement routines! We can't change the simulated sled's IPs: they need to bind to localhost to listen for requests. Therefore this commit also adds the idea of "sled index", and this sled index is used to make each range of Crucible agent ports unique. This lead straight to the next problem: the simulated Pantry assumes that there is really only one simulated sled-agent, and this has all the storage! This is no longer true in a post-#7353 world. Snapshots would not be able to complete because the simulated Pantry would only find one of the three regions when searching the first simulated sled-agent's storage. The real Pantry would construct a Crucible Upstairs, and delegate snapshot requests to that, so this commit also adds a simulated Crucible Upstairs that's aware of each of the simulated sled agent's `Storage`, and can therefore fulfill snapshot requests. Centralizing the place where all simulated sled agents (and Pantry servers!) are created makes this easy: the simulated Upstairs can be aware of all of the `Storage` objects, and the simulated Pantry can use that simulated Upstairs to take fake snapshots. Fixes oxidecomputer/crucible#1594.

jmpesp added 2 commits January 15, 2025 21:56

Merge branch 'main' into multiple_test_sled_agents

5408438

andrewjstone approved these changes Jan 15, 2025

View reviewed changes

Merge branch 'main' into multiple_test_sled_agents

f8c3804

jmpesp enabled auto-merge (squash) January 16, 2025 15:32

jmpesp merged commit cf83d57 into oxidecomputer:main Jan 16, 2025
16 checks passed

jmpesp deleted the multiple_test_sled_agents branch January 16, 2025 18:58

smklein mentioned this pull request Jan 16, 2025

test failed in CI: test_sled_instance_list #7365

Closed

jmpesp mentioned this pull request Jan 22, 2025

Always allocate regions to different sleds #7382

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set an arbitrary number of sled agents for tests #7353

Set an arbitrary number of sled agents for tests #7353

jmpesp commented Jan 15, 2025

andrewjstone left a comment

andrewjstone Jan 15, 2025

jmpesp Jan 16, 2025

andrewjstone Jan 15, 2025

jmpesp Jan 16, 2025

andrewjstone Jan 15, 2025

jmpesp Jan 16, 2025

Set an arbitrary number of sled agents for tests #7353

Set an arbitrary number of sled agents for tests #7353

Conversation

jmpesp commented Jan 15, 2025

andrewjstone left a comment

Choose a reason for hiding this comment

andrewjstone Jan 15, 2025

Choose a reason for hiding this comment

jmpesp Jan 16, 2025

Choose a reason for hiding this comment

andrewjstone Jan 15, 2025

Choose a reason for hiding this comment

jmpesp Jan 16, 2025

Choose a reason for hiding this comment

andrewjstone Jan 15, 2025

Choose a reason for hiding this comment

jmpesp Jan 16, 2025

Choose a reason for hiding this comment