Commit

Rename and clarify TempoFlushesFailing -> TempoIngesterFlushesFailing (
yvrhdn authored Aug 18, 2021
1 parent e2c6872 commit ca82cb8
Showing 2 changed files with 15 additions and 18 deletions.
4 changes: 2 additions & 2 deletions operations/tempo-mixin/alerts.libsonnet
@@ -82,7 +82,7 @@
},
},
{
-        alert: 'TempoFlushesFailing',
+        alert: 'TempoIngesterFlushesFailing',
expr: |||
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
@@ -92,7 +92,7 @@
},
annotations: {
message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
-          runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoFlushesFailing'
+          runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing'
},
},
{
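For reference, the renamed alert's expression fires only when both clauses hold: the hour-long increase of `tempo_ingester_failed_flushes_total` exceeds the configured `$._config.alerts.flushes_per_hour_failed` threshold, and at least one flush failed within the last five minutes, so the alert stops firing once flushes recover. A minimal sketch of the rendered PromQL, assuming a hypothetical threshold of 2 failed flushes per hour:

```promql
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 2 and
sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
```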
29 changes: 13 additions & 16 deletions operations/tempo-mixin/runbook.md
@@ -82,22 +82,19 @@ There are several settings which can be tuned to reduce the amount of work done
There are platform-specific limits on how low this can go. AWS S3 cannot be set lower than 5MB, or cause more than 10K flushes
per block.
-## TempoFlushesFailing
-Check ingester logs for flushes and compactor logs for compactions. Failed flushes or compactions could be caused by any number of
-different things. Permissions issues, rate limiting, failing backend, ... So check the logs and use your best judgement on how to
-resolve.
-In the case of failed compactions your blocklist is now growing and you may be creating a bunch of partially written "orphaned"
-blocks. An orphaned block is a block without a `meta.json` that is not currently being created. These will be invisible to
-Tempo and will just hang out forever (or until a bucket lifecycle policy deletes them). First, resolve the issue so that your
-compactors can get the blocklist under control to prevent high query latencies. Next try to identify any "orphaned" blocks and
-remove them.
-In the case of failed flushes your local WAL disk is now filling up. Tempo will continue to retry sending the blocks
-until it succeeds, but at some point your WAL files will start failing to write due to out of disk issues. If the problem
-persists consider removing files from `/var/tempo/wal/blocks` and restarting the ingester or increasing the amount of disk space
-available to the ingester.
+## TempoIngesterFlushesFailing
+Check ingester logs for flushes. Failed flushes could be caused by any number of different things: bad block, permissions issues,
+rate limiting, failing backend, ... So check the logs and use your best judgement on how to resolve. Tempo will continue to retry
+sending the blocks until it succeeds, but at some point your WAL files will start failing to write due to out of disk issues.
+If a single block cannot be flushed, this block might be corrupted. Inspect the block manually and consider moving this file out
+of the WAL or outright deleting it. Restart the ingester to stop the retry attempts. Removing blocks from a single ingester will
+not cause data loss if replication is used and the other ingesters are flushing their blocks successfully.
+By default, the WAL is at `/var/tempo/wal/blocks`.
+If multiple blocks cannot be flushed, the local WAL disk of the ingester will be filling up. Consider increasing the amount of disk
+space available to the ingester.
## TempoPollsFailing
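When following the updated runbook section, it can help to see which ingesters are still failing before digging into logs or the WAL directory. A minimal sketch, assuming your scrape config attaches a `pod` label to ingester metrics (the label name is an assumption and may be `instance` or similar in your setup):

```promql
sum by (cluster, namespace, pod) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 0
```

Any pod that shows up here is a candidate for inspecting logs and the WAL directory (`/var/tempo/wal/blocks` by default) before its disk fills up.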
