From ace6bb3fb478ee62fa7bb747b9f2c530ab6e683c Mon Sep 17 00:00:00 2001 From: Rudolf Meijering Date: Fri, 13 Nov 2020 12:37:24 +0100 Subject: [PATCH] Use .kibana instead of .kibana_current to mark migration completion --- rfcs/text/0013_saved_object_migrations.md | 92 ++++++++++++++++------- 1 file changed, 63 insertions(+), 29 deletions(-) diff --git a/rfcs/text/0013_saved_object_migrations.md b/rfcs/text/0013_saved_object_migrations.md index 1a0967d110d06..6e125c28c04c0 100644 --- a/rfcs/text/0013_saved_object_migrations.md +++ b/rfcs/text/0013_saved_object_migrations.md @@ -13,6 +13,7 @@ - [4.2.1 Idempotent migrations performed without coordination](#421-idempotent-migrations-performed-without-coordination) - [4.2.1.1 Restrictions](#4211-restrictions) - [4.2.1.2 Migration algorithm: Cloned index per version](#4212-migration-algorithm-cloned-index-per-version) + - [Known weaknesses:](#known-weaknesses) - [4.2.1.3 Upgrade and rollback procedure](#4213-upgrade-and-rollback-procedure) - [4.2.1.4 Handling documents that belong to a disabled plugin](#4214-handling-documents-that-belong-to-a-disabled-plugin) - [5. Alternatives](#5-alternatives) @@ -192,26 +193,24 @@ id's deterministically with e.g. UUIDv5. ### 4.2.1.2 Migration algorithm: Cloned index per version Note: - The description below assumes the migration algorithm is released in 7.10.0. - So < 7.10.0 will use `.kibana` and >= 7.10.0 will use `.kibana_current`. + So >= 7.10.0 will use the new algorithm. - We refer to the alias and index that outdated nodes use as the source alias and source index. - Every version performs a migration even if mappings or documents aren't outdated. -1. Locate the source index by fetching aliases (including `.kibana` for - versions prior to v7.10.0) +1. Locate the source index by fetching kibana indices: ``` - GET '/_alias/.kibana_current,.kibana_7.10.0,.kibana' + GET '/_indices/.kibana,.kibana_7.10.0' ``` The source index is: - 1. the index the `.kibana_current` alias points to, or if it doesn’t exist, - 2. the index the `.kibana` alias points to, or if it doesn't exist, - 3. the v6.x `.kibana` index + 1. the index the `.kibana` alias points to, or if it doesn't exist, + 2. the v6.x `.kibana` index If none of the aliases exists, this is a new Elasticsearch cluster and no migrations are necessary. Create the `.kibana_7.10.0_001` index with the - following aliases: `.kibana_current` and `.kibana_7.10.0`. + following aliases: `.kibana` and `.kibana_7.10.0`. 2. If the source is a < v6.5 `.kibana` index or < 7.4 `.kibana_task_manager` index prepare the legacy index for a migration: 1. Mark the legacy index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `//_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don’t acknowledge within the timeout, retry the operation until it succeeds. @@ -235,13 +234,13 @@ Note: atomically so that other Kibana instances will always see either a `.kibana` index or an alias, but never neither. 6. Use the cloned `.kibana_pre6.5.0_001` as the source for the rest of the migration algorithm. -3. If `.kibana_current` and `.kibana_7.10.0` both exists and are pointing to the same index this version's migration has already been completed. +3. If `.kibana` and `.kibana_7.10.0` both exists and are pointing to the same index this version's migration has already been completed. 1. Because the same version can have plugins enabled at any point in time, perform the mappings update in step (6) and migrate outdated documents with step (7). 2. Skip to step (9) to start serving traffic. 4. Fail the migration if: - 1. `.kibana_current` is pointing to an index that belongs to a later version of Kibana .e.g. `.kibana_7.12.0_001` + 1. `.kibana` is pointing to an index that belongs to a later version of Kibana .e.g. `.kibana_7.12.0_001` 2. (Only in 8.x) The source index contains documents that belong to an unknown Saved Object type (from a disabled plugin). Log an error explaining that the plugin that created these documents needs to be enabled again or that these objects should be deleted. See section (4.2.1.4). 5. Mark the source index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `//_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don’t acknowledge within the timeout, retry the operation until it succeeds. 6. Clone the source index into a new target index which has writes enabled. All nodes on the same version will use the same fixed index name e.g. `.kibana_7.10.0_001`. The `001` postfix isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`. @@ -257,24 +256,62 @@ Note: 8. Transform documents by reading batches of outdated documents from the target index then transforming and updating them with optimistic concurrency control. 1. Ignore any version conflict errors. 2. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents. If any failures occured, log the complete list of documents that failed to transform. Fail the migration. -9. Mark the migration as complete by doing a single atomic operation (requires https://github.com/elastic/elasticsearch/pull/58100) that: - 3. Checks that `.kibana_current` alias is still pointing to the source index - 4. Points the `.kibana_7.10.0` and `.kibana_current` aliases to the target index. - 5. If this fails with a "required alias [.kibana_current] does not exist" error fetch `.kibana_current` again: - 1. If `.kibana_current` is _not_ pointing to our target index fail the migration. - 2. If `.kibana_current` is pointing to our target index the migration has succeeded and we can proceed to step (9). -10. Start serving traffic. - -This algorithm shares a weakness with our existing migration algorithm -(since v7.4). When the task manager index gets reindexed a reindex script is -applied. Because we delete the original task manager index there is no way to -rollback a failed task manager migration without a snapshot. +9. Mark the migration as complete. This is done as a single atomic + operation (requires https://github.com/elastic/elasticsearch/pull/58100) + to guarantees when multiple versions of Kibana are performing the + migration in parallel, only one version will win. E.g. if 7.11 and 7.12 + are started in parallel and migrate from a 7.9 index, either 7.11 or 7.12 + should succeed and accept writes, but not both. + 3. Checks that `.kibana` alias is still pointing to the source index + 4. Points the `.kibana_7.10.0` and `.kibana` aliases to the target index. + 5. If this fails with a "required alias [.kibana] does not exist" error fetch `.kibana` again: + 1. If `.kibana` is _not_ pointing to our target index fail the migration. + 2. If `.kibana` is pointing to our target index the migration has succeeded and we can proceed to step (10). +10. Start serving traffic. All saved object reads/writes happen through the + version-specific alias `.kibana_7.10.0`. Together with the limitations, this algorithm ensures that migrations are idempotent. If two nodes are started simultaneously, both of them will start transforming documents in that version's target index, but because migrations are idempotent, it doesn’t matter which node’s writes win. +#### Known weaknesses: +(Also present in our existing migration algorithm since v7.4) +When the task manager index gets reindexed a reindex script is applied. +Because we delete the original task manager index there is no way to rollback +a failed task manager migration without a snapshot. Although losing the task +manager data has a fairly low impact. + +(Also present in our existing migration algorithm since v6.5) +If the outdated instance isn't shutdown before starting the migration, the +following data-loss scenario is possible: +1. Upgrade a 7.9 index without shutting down the 7.9 nodes +2. Kibana v7.10 performs a migration and after completing points `.kibana` + alias to `.kibana_7.11.0_001` +3. Kibana v7.9 writes unmigrated documents into `.kibana`. +4. Kibana v7.10 performs a query based on the updated mappings of documents so + results potentially don't match the acknowledged write from step (3). + +Note: + - Data loss won't occur if both nodes have the updated migration algorithm + proposed in this RFC. It is only when one of the nodes use the existing + algorithm that data loss is possible. + - Once v7.10 is restarted it will transform any outdated documents making + these visible to queries again. + +It is possible to work around this weakness by introducing a new alias such as +`.kibana_current` so that after a migration the `.kibana` alias will continue +to point to the outdated index. However, we decided to keep using the +`.kibana` alias despite this weakness for the following reasons: + - Users might rely on `.kibana` alias for snapshots, so if this alias no + longer points to the latest index their snapshots would no longer backup + kibana's latest data. + - Introducing another alias introduces complexity for users and support. + The steps to diagnose, fix or rollback a failed migration will deviate + depending on the 7.x version of Kibana you are using. + - The existing Kibana documentation clearly states that outdated nodes should + be shutdown, this scenario has never been supported by Kibana. +
In the future, this algorithm could enable (2.6) "read-only functionality during the downtime window" but this is outside of the scope of this RFC. @@ -303,12 +340,9 @@ To rollback to a previous version of Kibana without a snapshot: (Assumes the migration to 7.11.0 failed) 1. Shutdown all Kibana nodes. 2. Remove the index created by the failed Kibana migration by using the version-specific alias e.g. `DELETE /.kibana_7.11.0` -3. Identify the rollback index: - 1. If rolling back to a Kibana version < 7.10.0 use `.kibana` - 2. If rolling back to a Kibana version >= 7.10.0 use the version alias of the Kibana version you wish to rollback to e.g. `.kibana_7.10.0` -4. Point the `.kibana_current` alias to the rollback index. -5. Remove the write block from the rollback index. -6. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration. +3. Remove the write block from the rollback index using the `.kibana` alias + `PUT /.kibana/_settings {"index.blocks.write": false}` +4. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration. ### 4.2.1.4 Handling documents that belong to a disabled plugin It is possible for a plugin to create documents in one version of Kibana, but then when upgrading Kibana to a newer version, that plugin is disabled. Because the plugin is disabled it cannot register it's Saved Objects type including the mappings or any migration transformation functions. These "orphan" documents could cause future problems: @@ -378,7 +412,7 @@ There are several approaches we could take to dealing with these orphan document deterministically perform the delete and re-clone operation without coordination. -5. Transform outdated documents (step 7) on every startup +5. Transform outdated documents (step 8) on every startup Advantages: - Outdated documents belonging to disabled plugins will be upgraded as soon as the plugin is enabled again.