Troubleshoot common error "'apm-$version-$type' exists, but it is not an alias" #3698

Closed
graphaelli opened this issue Apr 27, 2020 · 6 comments

Comments

@graphaelli
Member

Connection marked as failed because the onConnect callback failed: resource 'apm-7.6.2-metric' exists, but it is not an alias

This appears to be related to removing the ILM write alias. Perhaps APM Server can detect this; at the least, there should be documentation for how to recover manually for now.

@graphaelli graphaelli added this to the 7.8 milestone Apr 27, 2020
@graphaelli
Member Author

Confirmed this is reproducible with DELETE apm-*-metric* (or any type) while apm-server is running. apm-server then sends bulk writes to what it thinks is a write alias, but Elasticsearch auto-creates (auto_create_index) a real index where the write alias used to be. Writes continue to succeed until apm-server reestablishes a connection with Elasticsearch, for example after a restart. At that point the connection callback fails with "exists, but it is not an alias".
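To spot this state before a restart surfaces it, one option is to look at how the get index API (GET <name>) reports the name. The response is keyed by concrete index names, so a healthy write alias shows up under its backing index's "aliases", while a broken one shows up as a top-level key. A minimal sketch, assuming the response has already been fetched and parsed into a dict; classify_resource is a hypothetical helper, not part of APM Server:

```python
def classify_resource(name, get_index_response):
    """Classify `name` from a parsed `GET /<name>` (get index API) response.

    Returns "index" if a concrete index occupies the name, "alias" if the
    name only appears as an alias of another index, and "missing" otherwise.
    """
    if name in get_index_response:
        return "index"
    for body in get_index_response.values():
        # Guard against non-dict values such as the "status" field of an error body.
        if isinstance(body, dict) and name in body.get("aliases", {}):
            return "alias"
    return "missing"


# Hand-written fixtures shaped like get index API responses (illustrative only).
healthy = {
    "apm-7.6.2-metric-000001": {
        "aliases": {"apm-7.6.2-metric": {"is_write_index": True}}
    }
}
broken = {"apm-7.6.2-metric": {"aliases": {}}}

print(classify_resource("apm-7.6.2-metric", healthy))  # alias
print(classify_resource("apm-7.6.2-metric", broken))   # index
```

A result of "index" for a name that should be a write alias is the condition that later produces the "exists, but it is not an alias" error.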

That's it for troubleshooting from me. Happy to discuss solution considerations with anyone that picks this up.

@graphaelli
Member Author

For those coming across this issue, the recovery procedure described by @simitt works great and can be used on Cloud because it does not require stopping ingestion. For example, assume someone deleted the apm-7.7.0-transaction write alias with DELETE apm*transaction*:

  1. Block writes to the index:

PUT apm-7.7.0-transaction/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

This should also cause any APM Servers still writing to the underlying index to stop ingestion, allowing the rest of the process to complete.

  2. [optional] Clone the index to retain data:

POST apm-7.7.0-transaction/_clone/apm-7.7.0-transaction-original

Then wait for the clone to complete (stage: done):

GET _cat/recovery/apm*transaction*?s=index&v=true&h=index,stage

  3. Delete the index that should be a write alias:

DELETE apm-7.7.0-transaction

  4. Confirm an APM Server recreated the write alias on the next connection attempt:

GET _cat/aliases/apm*transaction*?s=index&v=true&h=alias,index,is_write_index

should return:

alias                 index                        is_write_index
apm-7.7.0-transaction apm-7.7.0-transaction-000001 true
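The steps above can be sketched as a small script. This is only an illustration under stated assumptions: recovery_plan and write_alias_restored are hypothetical helpers, nothing is sent to Elasticsearch, and the caller is expected to poll _cat/recovery until the clone reaches stage "done" before issuing the DELETE.

```python
def recovery_plan(alias, keep_data=True):
    """Return the ordered (method, path, body) requests for the recovery steps.

    `alias` is the name that should be a write alias but is currently a
    concrete index, e.g. "apm-7.7.0-transaction".
    """
    plan = [
        # Step 1: block writes so APM Servers stop ingesting into the index.
        ("PUT", f"{alias}/_settings", {"settings": {"index.blocks.write": True}}),
    ]
    if keep_data:
        # Step 2 (optional): clone the index to retain its data.
        plan.append(("POST", f"{alias}/_clone/{alias}-original", None))
    # Step 3: delete the index so the write alias can be recreated.
    plan.append(("DELETE", alias, None))
    return plan


def write_alias_restored(alias, cat_aliases_text):
    """Step 4: check `_cat/aliases` output with columns alias,index,is_write_index."""
    for line in cat_aliases_text.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) == 3 and cols[0] == alias and cols[2] == "true":
            return True
    return False


sample = """alias                 index                        is_write_index
apm-7.7.0-transaction apm-7.7.0-transaction-000001 true
"""

for method, path, body in recovery_plan("apm-7.7.0-transaction"):
    print(method, path, body if body is not None else "")
print(write_alias_restored("apm-7.7.0-transaction", sample))
```

Keeping the plan as data rather than sending requests directly makes it easy to review the requests, or paste them into Kibana Dev Tools, before touching a production cluster.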

@graphaelli graphaelli removed this from the 7.8 milestone Apr 29, 2020
@graphaelli
Member Author

Data streams should solve this problem by removing the aliasing issues altogether. Let's resolve this issue by documenting these steps under common problems.

@axw
Member

axw commented Jun 8, 2020

I wonder if we should disable auto_create_index for APM indices when ILM is enabled, as a band-aid solution? It probably depends on when we expect to have data stream support implemented.

@simitt
Contributor

simitt commented Jun 10, 2020

That's definitely worth looking into. We would need to set auto_create_index accordingly when users switch between unmanaged and managed indices.
Additionally, even when using ILM we do have indices that are not managed (sourcemap, onboarding, apm fallback index). We could either ensure that all of the created indices are managed and then simply apply the setting for apm*, or take the concrete index prefixes into account and ensure auto creation is still respected for unmanaged indices. I can take a stab at that.
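To make the second option concrete, here is a rough sketch of what the cluster setting could look like. It assumes action.auto_create_index takes a comma-separated pattern list with + and - prefixes evaluated in order, which should be checked against the Elasticsearch documentation for the version in use. The helper names and the list of managed alias names are hypothetical; nothing here reflects what APM Server actually does.

```python
import json


def auto_create_index_setting(managed_aliases):
    """Build an `action.auto_create_index` value that denies auto-creation of
    the exact write alias names and allows everything else.

    Only the exact names are denied, so unmanaged indices (sourcemap,
    onboarding, fallback) still match the trailing "+*".
    """
    deny = [f"-{name}" for name in managed_aliases]
    return ",".join(deny + ["+*"])


def cluster_settings_body(managed_aliases):
    """Body for `PUT _cluster/settings` applying the value persistently."""
    return {
        "persistent": {
            "action.auto_create_index": auto_create_index_setting(managed_aliases)
        }
    }


# Assumed alias names for illustration; the real set depends on version and event types.
body = cluster_settings_body(["apm-7.7.0-transaction", "apm-7.7.0-metric"])
print(json.dumps(body, indent=2))
```

With a setting like this, a bulk write to a deleted write alias would be rejected instead of silently creating a concrete index, and the value would need to be regenerated whenever users switch between managed and unmanaged indices.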

@graphaelli
Member Author

Closing this out as docs are up now.
