Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Improve saved object migrations: Use .kibana instead of .kibana_current to mark migration completion #83373

Merged
merged 1 commit into from
Nov 25, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 63 additions & 29 deletions rfcs/text/0013_saved_object_migrations.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
- [4.2.1 Idempotent migrations performed without coordination](#421-idempotent-migrations-performed-without-coordination)
- [4.2.1.1 Restrictions](#4211-restrictions)
- [4.2.1.2 Migration algorithm: Cloned index per version](#4212-migration-algorithm-cloned-index-per-version)
- [Known weaknesses:](#known-weaknesses)
- [4.2.1.3 Upgrade and rollback procedure](#4213-upgrade-and-rollback-procedure)
- [4.2.1.4 Handling documents that belong to a disabled plugin](#4214-handling-documents-that-belong-to-a-disabled-plugin)
- [5. Alternatives](#5-alternatives)
Expand Down Expand Up @@ -192,26 +193,24 @@ id's deterministically with e.g. UUIDv5.
### 4.2.1.2 Migration algorithm: Cloned index per version
Note:
- The description below assumes the migration algorithm is released in 7.10.0.
So < 7.10.0 will use `.kibana` and >= 7.10.0 will use `.kibana_current`.
So >= 7.10.0 will use the new algorithm.
- We refer to the alias and index that outdated nodes use as the source alias
and source index.
- Every version performs a migration even if mappings or documents aren't outdated.

1. Locate the source index by fetching aliases (including `.kibana` for
versions prior to v7.10.0)
1. Locate the source index by fetching kibana indices:

```
GET '/_alias/.kibana_current,.kibana_7.10.0,.kibana'
GET '/_indices/.kibana,.kibana_7.10.0'
```

The source index is:
1. the index the `.kibana_current` alias points to, or if it doesn’t exist,
2. the index the `.kibana` alias points to, or if it doesn't exist,
3. the v6.x `.kibana` index
1. the index the `.kibana` alias points to, or if it doesn't exist,
2. the v6.x `.kibana` index

If none of the aliases exists, this is a new Elasticsearch cluster and no
migrations are necessary. Create the `.kibana_7.10.0_001` index with the
following aliases: `.kibana_current` and `.kibana_7.10.0`.
following aliases: `.kibana` and `.kibana_7.10.0`.
2. If the source is a < v6.5 `.kibana` index or < 7.4 `.kibana_task_manager`
index prepare the legacy index for a migration:
1. Mark the legacy index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `/<index>/_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don’t acknowledge within the timeout, retry the operation until it succeeds.
Expand All @@ -235,13 +234,13 @@ Note:
atomically so that other Kibana instances will always see either a
`.kibana` index or an alias, but never neither.
6. Use the cloned `.kibana_pre6.5.0_001` as the source for the rest of the migration algorithm.
3. If `.kibana_current` and `.kibana_7.10.0` both exists and are pointing to the same index this version's migration has already been completed.
3. If `.kibana` and `.kibana_7.10.0` both exists and are pointing to the same index this version's migration has already been completed.
1. Because the same version can have plugins enabled at any point in time,
perform the mappings update in step (6) and migrate outdated documents
with step (7).
2. Skip to step (9) to start serving traffic.
4. Fail the migration if:
1. `.kibana_current` is pointing to an index that belongs to a later version of Kibana .e.g. `.kibana_7.12.0_001`
1. `.kibana` is pointing to an index that belongs to a later version of Kibana .e.g. `.kibana_7.12.0_001`
2. (Only in 8.x) The source index contains documents that belong to an unknown Saved Object type (from a disabled plugin). Log an error explaining that the plugin that created these documents needs to be enabled again or that these objects should be deleted. See section (4.2.1.4).
5. Mark the source index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `/<index>/_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don’t acknowledge within the timeout, retry the operation until it succeeds.
6. Clone the source index into a new target index which has writes enabled. All nodes on the same version will use the same fixed index name e.g. `.kibana_7.10.0_001`. The `001` postfix isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`.
Expand All @@ -257,24 +256,62 @@ Note:
8. Transform documents by reading batches of outdated documents from the target index then transforming and updating them with optimistic concurrency control.
1. Ignore any version conflict errors.
2. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents. If any failures occured, log the complete list of documents that failed to transform. Fail the migration.
9. Mark the migration as complete by doing a single atomic operation (requires https://github.com/elastic/elasticsearch/pull/58100) that:
3. Checks that `.kibana_current` alias is still pointing to the source index
4. Points the `.kibana_7.10.0` and `.kibana_current` aliases to the target index.
5. If this fails with a "required alias [.kibana_current] does not exist" error fetch `.kibana_current` again:
1. If `.kibana_current` is _not_ pointing to our target index fail the migration.
2. If `.kibana_current` is pointing to our target index the migration has succeeded and we can proceed to step (9).
10. Start serving traffic.

This algorithm shares a weakness with our existing migration algorithm
(since v7.4). When the task manager index gets reindexed a reindex script is
applied. Because we delete the original task manager index there is no way to
rollback a failed task manager migration without a snapshot.
9. Mark the migration as complete. This is done as a single atomic
operation (requires https://github.com/elastic/elasticsearch/pull/58100)
to guarantees when multiple versions of Kibana are performing the
migration in parallel, only one version will win. E.g. if 7.11 and 7.12
are started in parallel and migrate from a 7.9 index, either 7.11 or 7.12
should succeed and accept writes, but not both.
3. Checks that `.kibana` alias is still pointing to the source index
4. Points the `.kibana_7.10.0` and `.kibana` aliases to the target index.
5. If this fails with a "required alias [.kibana] does not exist" error fetch `.kibana` again:
1. If `.kibana` is _not_ pointing to our target index fail the migration.
2. If `.kibana` is pointing to our target index the migration has succeeded and we can proceed to step (10).
10. Start serving traffic. All saved object reads/writes happen through the
version-specific alias `.kibana_7.10.0`.

Together with the limitations, this algorithm ensures that migrations are
idempotent. If two nodes are started simultaneously, both of them will start
transforming documents in that version's target index, but because migrations
are idempotent, it doesn’t matter which node’s writes win.

#### Known weaknesses:
(Also present in our existing migration algorithm since v7.4)
When the task manager index gets reindexed a reindex script is applied.
Because we delete the original task manager index there is no way to rollback
a failed task manager migration without a snapshot. Although losing the task
manager data has a fairly low impact.

(Also present in our existing migration algorithm since v6.5)
If the outdated instance isn't shutdown before starting the migration, the
following data-loss scenario is possible:
1. Upgrade a 7.9 index without shutting down the 7.9 nodes
2. Kibana v7.10 performs a migration and after completing points `.kibana`
alias to `.kibana_7.11.0_001`
3. Kibana v7.9 writes unmigrated documents into `.kibana`.
4. Kibana v7.10 performs a query based on the updated mappings of documents so
results potentially don't match the acknowledged write from step (3).

Note:
- Data loss won't occur if both nodes have the updated migration algorithm
proposed in this RFC. It is only when one of the nodes use the existing
algorithm that data loss is possible.
- Once v7.10 is restarted it will transform any outdated documents making
these visible to queries again.

It is possible to work around this weakness by introducing a new alias such as
`.kibana_current` so that after a migration the `.kibana` alias will continue
to point to the outdated index. However, we decided to keep using the
`.kibana` alias despite this weakness for the following reasons:
- Users might rely on `.kibana` alias for snapshots, so if this alias no
longer points to the latest index their snapshots would no longer backup
kibana's latest data.
- Introducing another alias introduces complexity for users and support.
The steps to diagnose, fix or rollback a failed migration will deviate
depending on the 7.x version of Kibana you are using.
- The existing Kibana documentation clearly states that outdated nodes should
be shutdown, this scenario has never been supported by Kibana.

<details>
<summary>In the future, this algorithm could enable (2.6) "read-only functionality during the downtime window" but this is outside of the scope of this RFC.</summary>

Expand Down Expand Up @@ -303,12 +340,9 @@ To rollback to a previous version of Kibana without a snapshot:
(Assumes the migration to 7.11.0 failed)
1. Shutdown all Kibana nodes.
2. Remove the index created by the failed Kibana migration by using the version-specific alias e.g. `DELETE /.kibana_7.11.0`
3. Identify the rollback index:
1. If rolling back to a Kibana version < 7.10.0 use `.kibana`
2. If rolling back to a Kibana version >= 7.10.0 use the version alias of the Kibana version you wish to rollback to e.g. `.kibana_7.10.0`
4. Point the `.kibana_current` alias to the rollback index.
5. Remove the write block from the rollback index.
6. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.
3. Remove the write block from the rollback index using the `.kibana` alias
`PUT /.kibana/_settings {"index.blocks.write": false}`
4. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.

### 4.2.1.4 Handling documents that belong to a disabled plugin
It is possible for a plugin to create documents in one version of Kibana, but then when upgrading Kibana to a newer version, that plugin is disabled. Because the plugin is disabled it cannot register it's Saved Objects type including the mappings or any migration transformation functions. These "orphan" documents could cause future problems:
Expand Down Expand Up @@ -378,7 +412,7 @@ There are several approaches we could take to dealing with these orphan document
deterministically perform the delete and re-clone operation without
coordination.

5. Transform outdated documents (step 7) on every startup
5. Transform outdated documents (step 8) on every startup
Advantages:
- Outdated documents belonging to disabled plugins will be upgraded as soon
as the plugin is enabled again.
Expand Down