Commit

Rename and clarify TempoFlushesFailing -> TempoIngesterFlushesFailing (
yvrhdn authored Aug 18, 2021
1 parent e2c6872 commit ca82cb8
Showing 2 changed files with 15 additions and 18 deletions.
4 changes: 2 additions & 2 deletions operations/tempo-mixin/alerts.libsonnet
@@ -82,7 +82,7 @@
},
},
{
-        alert: 'TempoFlushesFailing',
+        alert: 'TempoIngesterFlushesFailing',
expr: |||
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
@@ -92,7 +92,7 @@
},
annotations: {
message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
-          runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoFlushesFailing'
+          runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing'
},
},
{
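For reference, the renamed alert's expression fires only when both clauses hold: the hour-long increase of `tempo_ingester_failed_flushes_total` exceeds the configured `$._config.alerts.flushes_per_hour_failed` threshold, and at least one flush failed within the last five minutes, so the alert stops firing once flushes recover. A minimal sketch of the rendered PromQL, assuming a hypothetical threshold of 2 failed flushes per hour:

```promql
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 2 and
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
```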
29 changes: 13 additions & 16 deletions operations/tempo-mixin/runbook.md
@@ -82,22 +82,19 @@ There are several settings which can be tuned to reduce the amount of work done
There are platform-specific limits on how low this can go. AWS S3 cannot be set lower than 5MB, or cause more than 10K flushes
per block.
-## TempoFlushesFailing
-Check ingester logs for flushes and compactor logs for compactions. Failed flushes or compactions could be caused by any number of
-different things. Permissions issues, rate limiting, failing backend, ... So check the logs and use your best judgement on how to
-resolve.
-In the case of failed compactions your blocklist is now growing and you may be creating a bunch of partially written "orphaned"
-blocks. An orphaned block is a block without a `meta.json` that is not currently being created. These will be invisible to
-Tempo and will just hang out forever (or until a bucket lifecycle policy deletes them). First, resolve the issue so that your
-compactors can get the blocklist under control to prevent high query latencies. Next try to identify any "orphaned" blocks and
-remove them.
-In the case of failed flushes your local WAL disk is now filling up. Tempo will continue to retry sending the blocks
-until it succeeds, but at some point your WAL files will start failing to write due to out of disk issues. If the problem
-persists consider removing files from `/var/tempo/wal/blocks` and restarting the ingester or increasing the amount of disk space
-available to the ingester.
+## TempoIngesterFlushesFailing
+Check ingester logs for flushes. Failed flushes could be caused by any number of different things: bad block, permissions issues,
+rate limiting, failing backend, ... So check the logs and use your best judgement on how to resolve. Tempo will continue to retry
+sending the blocks until it succeeds, but at some point your WAL files will start failing to write due to out of disk issues.
+If a single block cannot be flushed, this block might be corrupted. Inspect the block manually and consider moving this file out
+of the WAL or outright deleting it. Restart the ingester to stop the retry attempts. Removing blocks from a single ingester will
+not cause data loss if replication is used and the other ingesters are flushing their blocks successfully.
+By default, the WAL is at `/var/tempo/wal/blocks`.
+If multiple blocks cannot be flushed, the local WAL disk of the ingester will be filling up. Consider increasing the amount of disk
+space available to the ingester.
## TempoPollsFailing
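When following the updated runbook section, it can help to see which ingesters are still failing before digging into logs or the WAL directory. A minimal sketch, assuming your scrape config attaches a `pod` label to ingester metrics (the label name is an assumption and may be `instance` or similar in your setup):

```promql
sum by (cluster, namespace, pod) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 0
```

Any pod that shows up here is a candidate for inspecting logs and the WAL directory (`/var/tempo/wal/blocks` by default) before its disk fills up.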
