Issue on deleting ES Snapshot with Azure plugin #25424
Comments
You cannot delete a snapshot while another one is being taken or deleted. In your case, the first operation 1) hangs, which prevents any other deletion such as 3) from happening. Before deleting a snapshot, the master node retrieves the list of snapshots from the remote repository (this can take time; in the meantime, if you execute a get snapshot request like 2), it will answer that it doesn't know the snapshot). After that, it updates the cluster state to announce that a deletion is going to be executed. Once the cluster state is updated, it deletes the snapshot from the repository. It would be interesting to know why the first operation takes so long. Maybe you have a lot of snapshots in the Azure repository?
The key will be finding out why the first operation takes so long. Given the scenario described above, it looks like all of the snapshot files are deleted appropriately from Azure, but the snapshot deletion in progress is not removed from the cluster state, which is the last step in the process of deleting a snapshot. We will likely need to see the logs from your master node to help diagnose the problem. It seems like the master node is hanging on something.
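The behaviour described in the two comments above can be sketched as a toy model. This is only an illustration, not Elasticsearch's code: `ToyCluster`, `start_delete`, `finish_delete` and `outcome` are hypothetical names, and the model assumes a single deletion marker that is cleared as the very last step.

```python
class SnapshotMissing(Exception):
    """Stands in for snapshot_missing_exception (HTTP 404)."""


class ConcurrentSnapshotExecution(Exception):
    """Stands in for concurrent_snapshot_execution_exception (HTTP 503)."""


class ToyCluster:
    """Hypothetical model: one repository plus a single deletion marker."""

    def __init__(self, snapshots):
        self.repository = set(snapshots)   # snapshots present in the repository
        self.deletion_in_progress = None   # marker kept in the "cluster state"

    def get(self, name):
        if name not in self.repository:
            raise SnapshotMissing(f"[{name}] is missing")
        return name

    def start_delete(self, name):
        # Only one deletion may run at a time.
        if self.deletion_in_progress is not None:
            raise ConcurrentSnapshotExecution(
                f"[{name}] cannot delete - another snapshot is currently being deleted"
            )
        self.get(name)                     # the snapshot must exist
        self.deletion_in_progress = name   # cluster state records the deletion
        self.repository.discard(name)      # files are removed from the repository

    def finish_delete(self):
        # Last step: clear the marker. If this never runs, deletes stay blocked.
        self.deletion_in_progress = None


def outcome(action, *args):
    """Run an action and return the name of the exception it raises, or 'ok'."""
    try:
        action(*args)
        return "ok"
    except (SnapshotMissing, ConcurrentSnapshotExecution) as exc:
        return type(exc).__name__


cluster = ToyCluster({"snapshot_XXX", "snapshot_YYY"})
results = [
    outcome(cluster.start_delete, "snapshot_XXX"),  # 1) delete starts
    outcome(cluster.get, "snapshot_XXX"),           # 2) get reports it missing
    outcome(cluster.start_delete, "snapshot_YYY"),  # 3) second delete is rejected
]
cluster.finish_delete()                             # marker finally cleared
results.append(outcome(cluster.start_delete, "snapshot_YYY"))
print(results)
```

In this model, the symptoms reported in the issue correspond to the period between `start_delete` and `finish_delete`: the first snapshot is already gone, yet every other deletion is refused until the marker is cleared.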
Hi, thank you for your help. Here is some new information:
Is there a request (HTTP or transport API) that can show me the progress of the snapshot deletion? I see nothing in the pending_tasks or snapshot API endpoints. Or is there some logging I could enable?
No, deletions should be quick, so we don't provide a status endpoint for those. Is there nothing in the logs whatsoever? Can you increase the logging for snapshotting? You can use:
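One way to raise snapshot logging to DEBUG is the cluster settings API. The sketch below is an illustration and not necessarily the exact command suggested in this comment: the logger name `logger.org.elasticsearch.snapshots`, the node address `http://localhost:9200`, and the helper name `snapshot_debug_logging_request` are all assumptions. It only builds the request; nothing is sent.

```python
import json
from urllib.request import Request


def snapshot_debug_logging_request(base_url="http://localhost:9200", level="DEBUG"):
    """Build (but do not send) a PUT _cluster/settings request.

    Assumption: snapshot code logs under org.elasticsearch.snapshots.
    """
    body = {"transient": {"logger.org.elasticsearch.snapshots": level}}
    return Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = snapshot_debug_logging_request()
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the setting: urllib.request.urlopen(req)
```

A transient setting is used here so that the extra logging does not survive a full cluster restart.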
I suspect the large number of snapshots is the issue here, but I can't be sure, so I 👍 @abeyad's suggestion to gather more information by logging at the debug level.
I set the logging to debug and the change was acknowledged correctly.
I tried again to remove a snapshot on my test platform. ES master node: It confirms that the delete part is fast, but then it hangs. Regards, Etienne
Additional log output covering the end of the snapshot removal:
This is problematic:
The deletion started at 11:27 and we only begin to remove the snapshot deletion from the cluster state at 12:35, a full hour after the deletion started. It seems to me the Azure access is very slow. Apologies, but can you kindly re-run the test with DEBUG logging enabled as follows:
and send the logs, as complete as possible?
Hi, in the meantime we upgraded our test database to 5.4.3, so the new test was run on 5.4.3. The snapshot log is in this gist: https://gist.github.com/etiennecarriere/cc21d7e079fb24d8b7d0c65449065f65 Regards, Etienne
Hi, I can confirm what seems to be a very inefficient implementation of the list operation in the Azure plugin. Over a poor Internet connection (6 Mbit/s download, 1 Mbit/s upload, 50 ms latency to the Azure DC), removing one snapshot from a repository with 45 snapshots of 90 shards each went from 1162 seconds to 185 seconds with my patch. I will check with my employer on Monday how I can submit it as a pull request. Regards, Etienne
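As a quick check of what those two timings mean, the arithmetic can be written out as below. Only the 1162 and 185 second figures come from the comment; the variable names are illustrative.

```python
before_s = 1162  # seconds to remove one snapshot without the patch
after_s = 185    # seconds to remove one snapshot with the patch

speedup = before_s / after_s          # how many times faster
reduction = 1 - after_s / before_s    # fraction of the time saved

print(f"speedup: {speedup:.2f}x, time saved: {reduction:.0%}")
```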
@abeyad, I am following up to see whether you have had time to form an opinion on the pull request I proposed.
Hi,
I have issues deleting snapshots from a repository managed with the Azure plugin:
= ElasticSearch version : 5.4.1 (via https://artifacts.elastic.co/packages/5.x/yum)
= Plugins installed : x-pack, repository-azure
= JVM version : OpenJDK 1.8.0.121 (RPM : java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64)
= OS Version : CentOS 7.3 / Linux xxx 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
= Point to note : We use the Azure plugin but we are not in the Azure infrastructure
Description of the problem including expected versus actual behavior:
Expected behaviour :
Actual Behaviour:
Steps to reproduce:
I can always reproduce it on my test environment (it first happened in production):
{"error":{"root_cause":[{"type":"snapshot_missing_exception","reason":"[testbackups-azure:snapshot_XXX] is missing"}],"type":"snapshot_missing_exception","reason":"[testbackups-azure:snapshot_XXX] is missing"},"status":404}
{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another snapshot is currently being deleted"}],"type":"concurrent_snapshot_execution_exception","reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another snapshot is currently being deleted"},"status":503}
After a rolling restart of the ES cluster, everything returns to normal: the first snapshot (XXX in our case) is no longer present and it is possible to delete snapshots again.
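The two error bodies above can be told apart programmatically, for example in a cleanup script that needs to know whether a deletion is merely blocked. This is a minimal sketch: the helper names `classify` and `deletion_blocked` are hypothetical, and the JSON strings are the two responses quoted in this report.

```python
import json

MISSING = (
    '{"error":{"root_cause":[{"type":"snapshot_missing_exception",'
    '"reason":"[testbackups-azure:snapshot_XXX] is missing"}],'
    '"type":"snapshot_missing_exception",'
    '"reason":"[testbackups-azure:snapshot_XXX] is missing"},"status":404}'
)
CONCURRENT = (
    '{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception",'
    '"reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another '
    'snapshot is currently being deleted"}],'
    '"type":"concurrent_snapshot_execution_exception",'
    '"reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another '
    'snapshot is currently being deleted"},"status":503}'
)


def classify(body):
    """Return (error type, HTTP status) from an Elasticsearch error body."""
    doc = json.loads(body)
    return doc["error"]["type"], doc["status"]


def deletion_blocked(body):
    """True when the response says another deletion is still in progress."""
    return classify(body) == ("concurrent_snapshot_execution_exception", 503)


print(classify(MISSING))
print(deletion_blocked(MISSING), deletion_blocked(CONCURRENT))
```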