Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Role Mapping Cleanup Security Migration #114830

Closed
wants to merge 44 commits into from

Conversation

jfreden
Copy link
Contributor

@jfreden jfreden commented Oct 15, 2024

This PR adds a cleanup security migration for role mapping duplicates that exist both in cluster state and the .security index. The cleanup task is implemented as a security migration.

The task compares the names of the role mappings in the operator-defined settings.json and the names of the role mappings in the .security index and removes the ones in the .security index.

Issue

This cleanup is done because there was a bug (fixed here) that caused ECK customers to have the same role mappings with the same names present both in cluster state and the .security index. The effective role mapping is the combination of the .security index mapping and the cluster state one. After the bug fix, if the mapping in cluster state is modified (through setting.json), the .security index will still contain the old mapping and this could lead to unintended behaviour.

Dependencies

@@ -117,6 +117,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
public static final IndexVersion ENABLE_IGNORE_MALFORMED_LOGSDB = def(8_514_00_0, Version.LUCENE_9_11_1);
public static final IndexVersion MERGE_ON_RECOVERY_VERSION = def(8_515_00_0, Version.LUCENE_9_11_1);
public static final IndexVersion UPGRADE_TO_LUCENE_9_12 = def(8_516_00_0, Version.LUCENE_9_12_0);
public static final IndexVersion ADD_ROLE_MAPPING_MIGRATION = def(8_517_00_0, Version.LUCENE_9_12_0);
Copy link
Contributor Author

@jfreden jfreden Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to allow us to check if an index was created on the same version as the migration. This means that the index was created with all the settings/mappings/docs needed, so we don't need to run the migration (it can be short-circuit and skipped since there is nothing to migrate).

@jfreden jfreden force-pushed the add_eck_migration_8_16 branch from 84c38fc to d36b152 Compare October 15, 2024 13:53
@@ -267,6 +272,11 @@ private static boolean isCreatedOnLatestVersion(IndexMetadata indexMetadata) {
return indexVersionCreated != null && indexVersionCreated.onOrAfter(IndexVersion.current());
}

private static Set<String> getFileSettingsMetadataHandlerRoleMappingKeys(ClusterState clusterState) {
Copy link
Contributor Author

@jfreden jfreden Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly part of the security index, so this doesn't belong here, but I added it here anyway for simplicity. Ideally the security migration state should be tracked in a separate cluster state handler. I think that's an improvement/refactor that should be done in a separate PR if we keep adding migrations on state outside the security index.

The new state handler would still need to handle the main security index manager state, since we rely heavily on state from it in the framework (is the index available for example).

@jfreden jfreden added the v9.0.0 label Oct 16, 2024
@jfreden jfreden force-pushed the add_eck_migration_8_16 branch from ddfe166 to c23b783 Compare October 16, 2024 12:37
createNativeRoleMapping("no_name_conflict", Map.of("meta", "test"));
assertAllRoleMappings(client(), "no_name_conflict", "operator_role_mapping_1");
} else if (CLUSTER_TYPE == ClusterType.UPGRADED) {
waitForSecurityMigrationCompletion(adminClient(), 2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests will fail here until the file service change is merged, since the preconditions for the migration are not met.

@jfreden jfreden marked this pull request as ready for review October 16, 2024 12:42
@jfreden jfreden added the :Security/Security Security issues without another label label Oct 16, 2024
@jfreden jfreden requested a review from n1v0lg October 16, 2024 12:43
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Oct 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@jfreden jfreden added the >bug label Oct 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @jfreden, I've created a changelog YAML for you.

@@ -189,7 +189,7 @@ public List<TemplateRoleName> getRoleTemplates() {
* This is not used within the mapping process, and does not affect whether the expression matches, nor which roles are assigned.
*/
public Map<String, Object> getMetadata() {
return Collections.unmodifiableMap(metadata);
return metadata != null ? Collections.unmodifiableMap(metadata) : Collections.emptyMap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not set to null anywhere in production code, but could be in the future (since constructor is public) and can in tests (as I found out).

}
}

public static class CleanupRoleMappingDuplicatesMigration implements SecurityMigration {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new migration

@jfreden jfreden requested a review from n1v0lg October 24, 2024 12:09
@jfreden jfreden added the test-full-bwc Trigger full BWC version matrix tests label Oct 24, 2024
Copy link
Contributor

@n1v0lg n1v0lg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prod code looks solid 💯 I'll take another pass over it Monday and go through tests as well.

boolean hasFileSettingsMetadata = fileSettingsMetadata != null;
// If there is no fileSettingsMetadata, there should be no reserved state (this is to catch bugs related to
// name changes to FILE_SETTINGS_METADATA_NAMESPACE)
assert hasFileSettingsMetadata || clusterState.metadata().reservedStateMetadata().isEmpty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd include a message in the assert with a bit of the failure context like: "encountered unknown namespace in reserved state metadata"

Map<Boolean, List<ExpressionRoleMapping>> partitionedRoleMappings = Arrays.stream(roleMappings)
.collect(
Collectors.partitioningBy(
roleMapping -> roleMapping.getMetadata().get(ExpressionRoleMapping.READ_ONLY_ROLE_MAPPING_METADATA_FLAG) != null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it's fair enough to move this into an isReadOnly() method (or similarly-named) on ExpressionRoleMapping

ActionListener<DeleteRoleMappingResponse> groupListener = new GroupedActionListener<>(
names.size(),
ActionListener.wrap(responses -> {
Map<Boolean, List<DeleteRoleMappingResponse>> deletesPartitionedByFound = responses.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we're only after the count here so instead of a partition-by I'd just go with a for loop that counts number not found and number deleted

ActionListener.wrap(responses -> {
Map<Boolean, List<DeleteRoleMappingResponse>> deletesPartitionedByFound = responses.stream()
.collect(Collectors.partitioningBy(DeleteRoleMappingResponse::isFound));
if (deletesPartitionedByFound.containsKey(false)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitioningBy will always include entries for both partitions (an empty list if nothing was found) so this would print a WARN even if no role mappings were missed.

this doesn't matter if we use a regular counter var as per my comment above, but if we stick with partitioningBy gotta fix

);
@Override
public boolean checkPreConditions(SecurityIndexManager.State securityIndexManagerState) {
// If there are operator defined role mappings, make sure they've been loaded in to cluster state before launching migration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: need to update the comment to reflect the SKIP condition

.collect(
Collectors.partitioningBy(
roleMapping -> roleMapping.getMetadata().get(ExpressionRoleMapping.READ_ONLY_ROLE_MAPPING_METADATA_FLAG) != null
&& (boolean) roleMapping.getMetadata().get(ExpressionRoleMapping.READ_ONLY_ROLE_MAPPING_METADATA_FLAG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: lets assert the flag is actually boolean to avoid failed cast surprises

import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.Matchers.containsInAnyOrder;

public class SecurityIndexRoleMappingCleanupIT extends AbstractUpgradeTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@jfreden jfreden Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up changing this to contains and I found a bug in the old file handling. I think we might want to create a patch and backport to versions before 8.15.

The migration never executed because there wasn't a security index (expected behaviour from the migration), but since the role mappings should have been created in the security index it was strange that the index didn't exist. Turns out the call to the native role mapping store here can happen while the SecurityIndexManager is in unrecovered state. This leads to the native role mapping store thinking that the index is not up to date here and nothing is inserted in to the store.

This won't impact the current state of things, so we can proceed, but we should discuss how to address this after this fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may have actually gotten fixed here: #92173

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

@jfreden jfreden force-pushed the add_eck_migration_8_16 branch from 6f549ea to 9533f6f Compare October 28, 2024 10:21
Copy link
Contributor

@n1v0lg n1v0lg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look great, too. A few coverage suggestions -- I'll make one more pass when all is ready. Thanks for iterating on this!

awaitFileSettingsWatcher();
// Setup listener to wait for role mapping
var fileBasedRoleMappingsWrittenListener = setupClusterStateListener(masterNode, "everyone_kibana_alone");
// Write role mappings with fallback name, this should block any security migration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is off here: TEST_JSON_WITH_ROLE_MAPPINGS does not have a mapping with a fallback name

waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION);
}

public void testSkipMigrationNoFileBasedMappings() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these tests!

Might also add the other skip case where there are file settings but no role mapping section

true
);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a test for a native role mapping with the suffix but without the read-only flag

}

private static ExpressionRoleMapping nativeRoleMapping(String name) {
return new ExpressionRoleMapping(name, null, null, null, null, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we randomize (or otherwise handle) randomBoolean() ? null : Map.of()? Just to cover both cases where metadata is null or explicitly doesn't have the flag set

*
* @return RoleMappingsCleanupMigrationStatus
*/
static RoleMappingsCleanupMigrationStatus getRoleMappingsCleanupMigrationStatus(ClusterState clusterState, int migrationsVersion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It be great to cover this method with some units tests -- the ITs are great but unit tests we can much better cover some of the NOT_READY cases, e.g., size mismatch & some (but not all) mappings with fallback names.

}
assert indexManager.getRoleMappingsCleanupMigrationStatus() == READY;

getRoleMappings(client, ActionListener.wrap(roleMappings -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth covering some of the failure paths with unit tests: in particular, a test where some deletions are 404 or failures showing that we still run through the whole set of names feels important

}

private void deleteNativeRoleMappings(Client client, List<String> names, ActionListener<Void> listener) {
ActionListener<DeleteRoleMappingResponse> groupListener = new GroupedActionListener<>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could we add an assertion that names is not empty?

assertAllRoleMappings(client(), "operator_role_mapping_1", "operator_role_mapping_2");
} else if (CLUSTER_TYPE == ClusterType.MIXED) {
// Create a native role mapping that doesn't conflict with anything before the migration run
createNativeRoleMapping("no_name_conflict", Map.of("meta", "test"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a native mapping with a reserved suffix here (it should not get deleted)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah whoops, my bad -- in a mixed cluster createNativeRoleMapping("native_role_mapping" + ExpressionRoleMapping.READ_ONLY_ROLE_MAPPING_SUFFIX, Map.of("meta", "test")); can result in a rejected request -- I think we want to move this to the (CLUSTER_TYPE == ClusterType.OLD) branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh that's right!

We can't create it there since we won't be bale to validate it. Not all bwc tests will hit the old cluster code because of the assumeTrue statements. I think we should skip it in this case for that reason.

@@ -261,10 +271,63 @@ private SystemIndexDescriptor.MappingsVersion getMinSecurityIndexMappingVersion(
* Check if the index was created on the latest index version available in the cluster
*/
private static boolean isCreatedOnLatestVersion(IndexMetadata indexMetadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to rename this to isCreatedOnLatestMigrationVersion or something similar, since this method no longer checks latest index version per se, but latest migration version

@jfreden
Copy link
Contributor Author

jfreden commented Oct 28, 2024

@n1v0lg Thanks for the review! I've addressed all the comments and reverted the index settings like we discussed offline.

@jfreden jfreden requested a review from n1v0lg October 28, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Security/Security Security issues without another label Team:Security Meta label for security team test-full-bwc Trigger full BWC version matrix tests v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants