Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New setting to prevent automatically importing dangling indices #49174

Merged
merged 22 commits into from
Nov 29, 2019

Conversation

pugnascotia
Copy link
Contributor

@pugnascotia pugnascotia commented Nov 15, 2019

Part of #48366.

Introduce a new static setting, gateway.auto_import_dangling_indices, which prevents dangling indices from being automatically imported.

I've also updated the dangling index docs a little to cover the new setting, and add some scary caveats.

The new settings didn't work, but now it does, and I've written a couple
of ITs that create dangling indices and check what affect the setting
has.

I'm still fixing the unit tests though.
New setting `gateway.auto_import_dangling_indices` doesn't need to be
dynamic. Unfortunatlely, this has broken one of the ITs.
@pugnascotia pugnascotia added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. :Docs v8.0.0 labels Nov 15, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-docs (:Docs)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

@pugnascotia pugnascotia marked this pull request as ready for review November 15, 2019 15:44
@pugnascotia
Copy link
Contributor Author

Note: there's more to do to satisfy the issue, e.g. an API to list the danglies, but I plan on doing that separately.

@pugnascotia
Copy link
Contributor Author

@DaveCTurner @original-brownbear ready for a review, if you have time.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to @DaveCTurner who will add some comments as well. No need to revise until then

indices to be imported, instead of being deleted.
When a node joins the cluster, it will search for any shards that are
stored in its local data directory and do not already exist in the
cluster. If the static setting `gateway.auto_import_dangling_indices` is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps call this node setting

cluster. If the static setting `gateway.auto_import_dangling_indices` is
`true` (the default is `false`), then those shards will be imported into
the cluster. This functionality is intended as a best effort to help users
who lose all master nodes. If a new master node is started which is unaware
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should explain this in terms of unsafe cluster bootstrapping. As far as I'm aware the only remaining use of dangling indices in 7.x is for the situation where a cluster has lost all master-eligible nodes and where the unsafe-bootstrap and/or detach-cluster commands have been used.

old indices to be imported, instead of being deleted.

Enabling `gateway.auto_import_dangling_indices` should only be done if
absolutely necessary, after understanding the possible consequences (this is not an exhaustive list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should strongly recommend against using auto import of dangling indices as it can cause inconsistencies in a cluster (e.g. automatically necroing deleted indices) and instead refer to our tooling (to be created in follow-ups).

Given that we don't provide the tooling at this time, I wonder if we should leave this setting undocumented for the time being and only write the docs once we can present the full story on this.

private final NodeEnvironment nodeEnv;
private final MetaStateService metaStateService;
private final LocalAllocateDangledIndices allocateDangledIndices;

private final Map<Index, IndexMetaData> danglingIndices = ConcurrentCollections.newConcurrentMap();

private boolean allocateDanglingIndices;

@Inject
public DanglingIndicesState(NodeEnvironment nodeEnv, MetaStateService metaStateService,
LocalAllocateDangledIndices allocateDangledIndices, ClusterService clusterService) {
this.nodeEnv = nodeEnv;
this.metaStateService = metaStateService;
this.allocateDangledIndices = allocateDangledIndices;
clusterService.addListener(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpler might be to only conditionally add the listener based on AUTO_IMPORT_DANGLING_INDICES_SETTING

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pugnascotia, I left a few suggestions.

@@ -55,19 +56,33 @@

private static final Logger logger = LogManager.getLogger(DanglingIndicesState.class);

public static final Setting<Boolean> AUTO_IMPORT_DANGLING_INDICES_SETTING = Setting.boolSetting(
"gateway.auto_import_dangling_indices",
false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this PR we should preserve the existing behaviour and default this to true, since we do not have any other alternatives for finding or cleaning up dangling indices. This means we can use this setting as a feature flag for testing the new behaviour while it's still in flight. Once we've added the APIs we have planned we can adjust the default to false.

lose all master nodes. If a new master node is started which is unaware of
the other indices in the cluster, adding the old nodes will cause the old
indices to be imported, instead of being deleted.
When a node joins the cluster, it will search for any shards that are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given my previous comment about preserving existing behaviour, I think we should leave the docs alone for now. I don't think we should put much effort into improving the docs for today's behaviour, and we'll need to make more changes here when we add these APIs.


public void testCleanupWhenEmpty() throws Exception {
try (NodeEnvironment env = newNodeEnvironment()) {
MetaStateService metaStateService = new MetaStateService(env, xContentRegistry());
DanglingIndicesState danglingState = createDanglingIndicesState(env, metaStateService);

// Given an empty state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am, personally, not a big fan of having these comments everywhere. The given/when/then phases of these tests seem pretty clear to me as written, and far too often do comments like these fall out of sync with the code and end up being misleading to readers.

There's a certain amount of judgement needed here to identify cases where there's something surprising that needs the reader's attention, because all readers are different. But in this specific test case for me they add, rather than remove, confusion.

assertTrue(danglingState.getDanglingIndices().isEmpty());

// When passed a metdata with an unknown index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata we pass in is empty, but we've created a dangling index on disk.


assertTrue(danglingState.getDanglingIndices().isEmpty());

// Given a metadata that does not enable allocation of dangling indices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether dangling indices importing is enabled is not a property of the metadata as suggested by this comment. Could we make the enabled/disabled property an explicit argument to createDanglingIndicesState() instead?

assertTrue(danglingState.getDanglingIndices().isEmpty());

// Given a state where automatic allocation is enabled
danglingState.setAllocateDanglingIndicesSetting(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the enabled/disabled property an explicit argument to createDanglingIndicesState() instead?

@pugnascotia pugnascotia force-pushed the allocate-dangling-indices-flag branch from 9b8c880 to 71bcbde Compare November 21, 2019 14:56
@pugnascotia
Copy link
Contributor Author

Thanks for the reviews, I've pushed 71bcbde to address them, specifically:

  • Reverted the docs changes
  • Default the new setting to false
  • Only subscribe DanglingIndicesState to cluster state updates if the new setting is true
  • Revert changes to UnsafeBootstrapAndDetachCommandIT as they're no longer needed
  • Strip confusing comments from DanglingIndicesStateTests and be specify the new setting's value when calling createDanglingIndicesState().

@pugnascotia
Copy link
Contributor Author

@ywelsch @DaveCTurner any chance of a re-review?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I've left some comments.

}
}

public void setAllocateDanglingIndicesSetting(boolean allocateDanglingIndices) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is superfluous now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes, good catch.

@@ -171,10 +188,15 @@ private IndexMetaData stripAliases(IndexMetaData indexMetaData) {
* Allocates the provided list of the dangled indices by sending them to the master node
* for allocation.
*/
private void allocateDanglingIndices() {
void allocateDanglingIndices() {
if (this.allocateDanglingIndices == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this condition is superfluous now?

private final NodeEnvironment nodeEnv;
private final MetaStateService metaStateService;
private final LocalAllocateDangledIndices allocateDangledIndices;

private final Map<Index, IndexMetaData> danglingIndices = ConcurrentCollections.newConcurrentMap();

private boolean allocateDanglingIndices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this field is superfluous now?

.put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0)
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
.build();
.put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid reformatting code unrelated to the PR.

Settings.builder()
.put("number_of_shards", shardCount)
.put("number_of_replicas", replicaCount)
.put(Store.INDEX_STORE_STATS_REFRESH_INTERVAL_SETTING.getKey(), 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahem...that's copy-and-paste. Can you tell it's the first time I've written an ES IT? 😅

client().admin().indices().prepareStats(name).execute().actionGet();
}

private void deleteIndex(String indexName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why define an extra method here? just so that one logging line can be reused? Feels a bit over the top.

*/
public synchronized boolean stopRandomDataNode() throws IOException {
public synchronized Optional<String> stopRandomDataNode() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this change used?

@@ -1902,6 +1905,12 @@ public String startNode(Settings settings) {
return startNodes(settings).get(0);
}

public void startNode(String nodeName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used? Revert

@@ -55,19 +56,35 @@

private static final Logger logger = LogManager.getLogger(DanglingIndicesState.class);

public static final Setting<Boolean> AUTO_IMPORT_DANGLING_INDICES_SETTING = Setting.boolSetting(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's deprecate this setting right away, so that we can remove it in 8.0 if we want

this.allocateDanglingIndices = AUTO_IMPORT_DANGLING_INDICES_SETTING.get(clusterService.getSettings());

if (this.allocateDanglingIndices) {
clusterService.addListener(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, we are now not scanning for new dangling indices anymore and therefore also not warning about the existence of dangling indices on disk. We should keep this in mind when working out a new story for managing dangling indices via an API. Perhaps we should think about still emitting a warning in the logs. Let's add a TODO for this on the meta issue.

@pugnascotia
Copy link
Contributor Author

@ywelsch thank you for the review - seems there was quite a bit of cruft to clear up, sorry about that. Notes on the latest changes:

  • The new setting is now deprecated
  • The state class logs a warning if the setting is false, since dangling indices won't be detected or imported
  • Change the disabled case unit test to verify that no cluster state listener is registered, since the danglingState.allocateDanglingIndices() method no longer checks the setting value
  • Tidied up the IT

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @pugnascotia. I've left more comments.

@@ -46,6 +51,12 @@
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
.build();

// The setting AUTO_IMPORT_DANGLING_INDICES_SETTING is deprecated, so we must disable
// warning checks or all the tests will fail.
protected boolean enableWarningsCheck() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createDanglingIndicesState(env, metaStateService, localAllocateDangledIndices, mockClusterService);

// Check that no listener was registered, because auto-import is disabled
verify(mockClusterService, never()).addListener(any());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of these whitebox tests, given how fragile they are. There is also no test here that checks that addListener is ever called when auto-import is enabled? This means that if the dangling indices implementation changes to use another method on clusterservice to register a callback, this test will be pointless.

LocalAllocateDangledIndices indexAllocator,
ClusterService clusterService
) {
return new DanglingIndicesState(env, metaStateService, indexAllocator, clusterService);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of this method? Let's avoid these unnecessary abstractions.

MetaStateService metaStateService = new MetaStateService(env, xContentRegistry());
LocalAllocateDangledIndices localAllocateDangledIndices = mock(LocalAllocateDangledIndices.class);

final ClusterService mockClusterService = buildMockClusterService(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find the test easier to understand if we were to explicitly pass the setting Settings.builder().put(AUTO_IMPORT_DANGLING_INDICES_SETTING.getKey(), false).build() here instead of the magic boolean.


// Create an index and distribute it across the 3 nodes
createAndPopulateIndex(INDEX_NAME, 1, 2);
ensureGreen();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createAndPopulateIndex already has a call to ensureGreen.


@Override
public Settings onNodeStopped(String nodeName) throws Exception {
logger.info("--> deleting test index: {}", INDEX_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deletions are already logged by ES at info level, no need to log here.


ensureGreen();

assertTrue("Expected dangling index " + INDEX_NAME + " to be recovered", indexExists(INDEX_NAME));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that the dangling index has been recovered at this point, as that has an asynchronous notification / import process. This line needs to be put into an assertBusy

@@ -1473,7 +1473,8 @@ public int size() {
}

/**
* Stops a random data node in the cluster. Returns true if a node was found to stop, false otherwise.
* Stops a random data node in the cluster and removes it.
* @return the name of the stopped node, if a node was found to stop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this.

* the cluster when the recovery setting is disabled.
*/
public void testDanglingIndicesAreNotRecoveredWhenSettingIsDisabled() throws Exception {
logger.info("--> starting cluster");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this logging message add? It's the first line in the test, which is already logged.


// This is so that when then node comes back up, we have a dangling index that could
// be recovered, but shouldn't be.
logger.info("--> restarted a random node and deleting the index while it's down");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restart is already logged at info level. Maybe just change the comment to say // Restart node, deleting the index in its absence, so that there is a dangling index to recover

@pugnascotia
Copy link
Contributor Author

Thanks again @ywelsch for bearing with me.

  • I've replaced the white box tests with a getter for the new setting. I had avoided that before because the only code calling the getter would be test code, but it'll be less fragile.
  • I've simplified the unit tests and ITs.

@pugnascotia pugnascotia requested a review from ywelsch November 28, 2019 12:09
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

final Settings settings = buildSettings(true);
internalCluster().startNodes(3, settings);

createIndex(INDEX_NAME, Settings.builder().put("number_of_shards", 1).put("number_of_replicas", 2).build());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the number of (primary) shards matter? I understand that you want 2 replicas so that every node has a shard copy.

// amount of time
assertFalse(
"Did not expect dangling index " + INDEX_NAME + " to be recovered",
waitUntil(() -> indexExists(INDEX_NAME), 5, TimeUnit.SECONDS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reduce this to 1 second

}
}

public boolean isAutoImportDanglingIndicesEnabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not need to be public, only accessed by tests

@pugnascotia pugnascotia merged commit f8e39d2 into elastic:master Nov 29, 2019
@pugnascotia pugnascotia deleted the allocate-dangling-indices-flag branch November 29, 2019 11:39
ywelsch pushed a commit that referenced this pull request Jan 8, 2020
Introduce a new static setting, `gateway.auto_import_dangling_indices`, which prevents dangling indices from being automatically imported. Part of #48366.
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…tic#49174)

Introduce a new static setting, `gateway.auto_import_dangling_indices`, which prevents dangling indices from being automatically imported. Part of elastic#48366.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >non-issue v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants