-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New setting to prevent automatically importing dangling indices #49174
New setting to prevent automatically importing dangling indices #49174
Conversation
The new settings didn't work, but now it does, and I've written a couple of ITs that create dangling indices and check what affect the setting has. I'm still fixing the unit tests though.
New setting `gateway.auto_import_dangling_indices` doesn't need to be dynamic. Unfortunatlely, this has broken one of the ITs.
Pinging @elastic/es-docs (:Docs) |
Pinging @elastic/es-distributed (:Distributed/Recovery) |
Note: there's more to do to satisfy the issue, e.g. an API to list the danglies, but I plan on doing that separately. |
@DaveCTurner @original-brownbear ready for a review, if you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked to @DaveCTurner who will add some comments as well. No need to revise until then
indices to be imported, instead of being deleted. | ||
When a node joins the cluster, it will search for any shards that are | ||
stored in its local data directory and do not already exist in the | ||
cluster. If the static setting `gateway.auto_import_dangling_indices` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps call this node setting
cluster. If the static setting `gateway.auto_import_dangling_indices` is | ||
`true` (the default is `false`), then those shards will be imported into | ||
the cluster. This functionality is intended as a best effort to help users | ||
who lose all master nodes. If a new master node is started which is unaware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should explain this in terms of unsafe cluster bootstrapping. As far as I'm aware the only remaining use of dangling indices in 7.x is for the situation where a cluster has lost all master-eligible nodes and where the unsafe-bootstrap
and/or detach-cluster
commands have been used.
old indices to be imported, instead of being deleted. | ||
|
||
Enabling `gateway.auto_import_dangling_indices` should only be done if | ||
absolutely necessary, after understanding the possible consequences (this is not an exhaustive list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should strongly recommend against using auto import of dangling indices as it can cause inconsistencies in a cluster (e.g. automatically necroing deleted indices) and instead refer to our tooling (to be created in follow-ups).
Given that we don't provide the tooling at this time, I wonder if we should leave this setting undocumented for the time being and only write the docs once we can present the full story on this.
private final NodeEnvironment nodeEnv; | ||
private final MetaStateService metaStateService; | ||
private final LocalAllocateDangledIndices allocateDangledIndices; | ||
|
||
private final Map<Index, IndexMetaData> danglingIndices = ConcurrentCollections.newConcurrentMap(); | ||
|
||
private boolean allocateDanglingIndices; | ||
|
||
@Inject | ||
public DanglingIndicesState(NodeEnvironment nodeEnv, MetaStateService metaStateService, | ||
LocalAllocateDangledIndices allocateDangledIndices, ClusterService clusterService) { | ||
this.nodeEnv = nodeEnv; | ||
this.metaStateService = metaStateService; | ||
this.allocateDangledIndices = allocateDangledIndices; | ||
clusterService.addListener(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler might be to only conditionally add the listener based on AUTO_IMPORT_DANGLING_INDICES_SETTING
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pugnascotia, I left a few suggestions.
@@ -55,19 +56,33 @@ | |||
|
|||
private static final Logger logger = LogManager.getLogger(DanglingIndicesState.class); | |||
|
|||
public static final Setting<Boolean> AUTO_IMPORT_DANGLING_INDICES_SETTING = Setting.boolSetting( | |||
"gateway.auto_import_dangling_indices", | |||
false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this PR we should preserve the existing behaviour and default this to true
, since we do not have any other alternatives for finding or cleaning up dangling indices. This means we can use this setting as a feature flag for testing the new behaviour while it's still in flight. Once we've added the APIs we have planned we can adjust the default to false
.
lose all master nodes. If a new master node is started which is unaware of | ||
the other indices in the cluster, adding the old nodes will cause the old | ||
indices to be imported, instead of being deleted. | ||
When a node joins the cluster, it will search for any shards that are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given my previous comment about preserving existing behaviour, I think we should leave the docs alone for now. I don't think we should put much effort into improving the docs for today's behaviour, and we'll need to make more changes here when we add these APIs.
|
||
public void testCleanupWhenEmpty() throws Exception { | ||
try (NodeEnvironment env = newNodeEnvironment()) { | ||
MetaStateService metaStateService = new MetaStateService(env, xContentRegistry()); | ||
DanglingIndicesState danglingState = createDanglingIndicesState(env, metaStateService); | ||
|
||
// Given an empty state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am, personally, not a big fan of having these comments everywhere. The given/when/then phases of these tests seem pretty clear to me as written, and far too often do comments like these fall out of sync with the code and end up being misleading to readers.
There's a certain amount of judgement needed here to identify cases where there's something surprising that needs the reader's attention, because all readers are different. But in this specific test case for me they add, rather than remove, confusion.
assertTrue(danglingState.getDanglingIndices().isEmpty()); | ||
|
||
// When passed a metdata with an unknown index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata
we pass in is empty, but we've created a dangling index on disk.
|
||
assertTrue(danglingState.getDanglingIndices().isEmpty()); | ||
|
||
// Given a metadata that does not enable allocation of dangling indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether dangling indices importing is enabled is not a property of the metadata as suggested by this comment. Could we make the enabled/disabled property an explicit argument to createDanglingIndicesState()
instead?
assertTrue(danglingState.getDanglingIndices().isEmpty()); | ||
|
||
// Given a state where automatic allocation is enabled | ||
danglingState.setAllocateDanglingIndicesSetting(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make the enabled/disabled property an explicit argument to createDanglingIndicesState() instead?
9b8c880
to
71bcbde
Compare
Thanks for the reviews, I've pushed 71bcbde to address them, specifically:
|
@ywelsch @DaveCTurner any chance of a re-review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. I've left some comments.
} | ||
} | ||
|
||
public void setAllocateDanglingIndicesSetting(boolean allocateDanglingIndices) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is superfluous now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes, good catch.
@@ -171,10 +188,15 @@ private IndexMetaData stripAliases(IndexMetaData indexMetaData) { | |||
* Allocates the provided list of the dangled indices by sending them to the master node | |||
* for allocation. | |||
*/ | |||
private void allocateDanglingIndices() { | |||
void allocateDanglingIndices() { | |||
if (this.allocateDanglingIndices == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this condition is superfluous now?
private final NodeEnvironment nodeEnv; | ||
private final MetaStateService metaStateService; | ||
private final LocalAllocateDangledIndices allocateDangledIndices; | ||
|
||
private final Map<Index, IndexMetaData> danglingIndices = ConcurrentCollections.newConcurrentMap(); | ||
|
||
private boolean allocateDanglingIndices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this field is superfluous now?
.put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0) | ||
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT) | ||
.build(); | ||
.put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid reformatting code unrelated to the PR.
Settings.builder() | ||
.put("number_of_shards", shardCount) | ||
.put("number_of_replicas", replicaCount) | ||
.put(Store.INDEX_STORE_STATS_REFRESH_INTERVAL_SETTING.getKey(), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahem...that's copy-and-paste. Can you tell it's the first time I've written an ES IT? 😅
client().admin().indices().prepareStats(name).execute().actionGet(); | ||
} | ||
|
||
private void deleteIndex(String indexName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why define an extra method here? just so that one logging line can be reused? Feels a bit over the top.
*/ | ||
public synchronized boolean stopRandomDataNode() throws IOException { | ||
public synchronized Optional<String> stopRandomDataNode() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this change used?
@@ -1902,6 +1905,12 @@ public String startNode(Settings settings) { | |||
return startNodes(settings).get(0); | |||
} | |||
|
|||
public void startNode(String nodeName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this used? Revert
@@ -55,19 +56,35 @@ | |||
|
|||
private static final Logger logger = LogManager.getLogger(DanglingIndicesState.class); | |||
|
|||
public static final Setting<Boolean> AUTO_IMPORT_DANGLING_INDICES_SETTING = Setting.boolSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's deprecate this setting right away, so that we can remove it in 8.0 if we want
this.allocateDanglingIndices = AUTO_IMPORT_DANGLING_INDICES_SETTING.get(clusterService.getSettings()); | ||
|
||
if (this.allocateDanglingIndices) { | ||
clusterService.addListener(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this change, we are now not scanning for new dangling indices anymore and therefore also not warning about the existence of dangling indices on disk. We should keep this in mind when working out a new story for managing dangling indices via an API. Perhaps we should think about still emitting a warning in the logs. Let's add a TODO for this on the meta issue.
@ywelsch thank you for the review - seems there was quite a bit of cruft to clear up, sorry about that. Notes on the latest changes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @pugnascotia. I've left more comments.
@@ -46,6 +51,12 @@ | |||
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT) | |||
.build(); | |||
|
|||
// The setting AUTO_IMPORT_DANGLING_INDICES_SETTING is deprecated, so we must disable | |||
// warning checks or all the tests will fail. | |||
protected boolean enableWarningsCheck() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createDanglingIndicesState(env, metaStateService, localAllocateDangledIndices, mockClusterService); | ||
|
||
// Check that no listener was registered, because auto-import is disabled | ||
verify(mockClusterService, never()).addListener(any()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of these whitebox tests, given how fragile they are. There is also no test here that checks that addListener
is ever called when auto-import is enabled? This means that if the dangling indices implementation changes to use another method on clusterservice to register a callback, this test will be pointless.
LocalAllocateDangledIndices indexAllocator, | ||
ClusterService clusterService | ||
) { | ||
return new DanglingIndicesState(env, metaStateService, indexAllocator, clusterService); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the point of this method? Let's avoid these unnecessary abstractions.
MetaStateService metaStateService = new MetaStateService(env, xContentRegistry()); | ||
LocalAllocateDangledIndices localAllocateDangledIndices = mock(LocalAllocateDangledIndices.class); | ||
|
||
final ClusterService mockClusterService = buildMockClusterService(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would find the test easier to understand if we were to explicitly pass the setting Settings.builder().put(AUTO_IMPORT_DANGLING_INDICES_SETTING.getKey(), false).build()
here instead of the magic boolean.
|
||
// Create an index and distribute it across the 3 nodes | ||
createAndPopulateIndex(INDEX_NAME, 1, 2); | ||
ensureGreen(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createAndPopulateIndex
already has a call to ensureGreen
.
|
||
@Override | ||
public Settings onNodeStopped(String nodeName) throws Exception { | ||
logger.info("--> deleting test index: {}", INDEX_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deletions are already logged by ES at info level, no need to log here.
|
||
ensureGreen(); | ||
|
||
assertTrue("Expected dangling index " + INDEX_NAME + " to be recovered", indexExists(INDEX_NAME)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that the dangling index has been recovered at this point, as that has an asynchronous notification / import process. This line needs to be put into an assertBusy
@@ -1473,7 +1473,8 @@ public int size() { | |||
} | |||
|
|||
/** | |||
* Stops a random data node in the cluster. Returns true if a node was found to stop, false otherwise. | |||
* Stops a random data node in the cluster and removes it. | |||
* @return the name of the stopped node, if a node was found to stop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert this.
* the cluster when the recovery setting is disabled. | ||
*/ | ||
public void testDanglingIndicesAreNotRecoveredWhenSettingIsDisabled() throws Exception { | ||
logger.info("--> starting cluster"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this logging message add? It's the first line in the test, which is already logged.
|
||
// This is so that when then node comes back up, we have a dangling index that could | ||
// be recovered, but shouldn't be. | ||
logger.info("--> restarted a random node and deleting the index while it's down"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The restart is already logged at info level. Maybe just change the comment to say // Restart node, deleting the index in its absence, so that there is a dangling index to recover
Thanks again @ywelsch for bearing with me.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
final Settings settings = buildSettings(true); | ||
internalCluster().startNodes(3, settings); | ||
|
||
createIndex(INDEX_NAME, Settings.builder().put("number_of_shards", 1).put("number_of_replicas", 2).build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the number of (primary) shards matter? I understand that you want 2 replicas so that every node has a shard copy.
// amount of time | ||
assertFalse( | ||
"Did not expect dangling index " + INDEX_NAME + " to be recovered", | ||
waitUntil(() -> indexExists(INDEX_NAME), 5, TimeUnit.SECONDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's reduce this to 1 second
} | ||
} | ||
|
||
public boolean isAutoImportDanglingIndicesEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not need to be public, only accessed by tests
Introduce a new static setting, `gateway.auto_import_dangling_indices`, which prevents dangling indices from being automatically imported. Part of #48366.
…tic#49174) Introduce a new static setting, `gateway.auto_import_dangling_indices`, which prevents dangling indices from being automatically imported. Part of elastic#48366.
Part of #48366.
Introduce a new static setting,
gateway.auto_import_dangling_indices
, which prevents dangling indices from being automatically imported.I've also updated the dangling index docs a little to cover the new setting, and add some scary caveats.