"Frequently" is probably an overstatement, but here's a collection of questions and answers that pop up on the mailing list and issues.
All the labels are under your own control, except for the prog label, which is used for namespace deconfliction: multiple programs can be running in mtail and they should not be able to affect each other.
It is best to do some post-processing in your collection system and configure it to filter out the prog label, so that strange aggregations don't occur.
In Prometheus, this could be achieved like so:
    metric_relabel_configs:
      - target_label: prog
        replacement: ''
(See this comment).
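As a rough illustration of what that relabel rule does to a scraped series, the Python sketch below mimics it on a plain dictionary of labels. It assumes the rule uses the default replace action with no source labels, and that a label set to the empty string is treated the same as an absent label. The function name drop_label and the sample label values are hypothetical, not part of mtail or Prometheus.

```python
from typing import Dict


def drop_label(labels: Dict[str, str], target_label: str, replacement: str = "") -> Dict[str, str]:
    """Mimic a 'replace' relabel rule that overwrites one label with a fixed value."""
    updated = dict(labels)  # work on a copy so the caller's labels are untouched
    updated[target_label] = replacement
    # Assumption: an empty label value is equivalent to the label not existing,
    # so labels that end up empty are removed from the series.
    return {name: value for name, value in updated.items() if value != ""}


# Hypothetical series as mtail might export it, with the prog label attached.
scraped = {"__name__": "lines_total", "prog": "apache.mtail", "filename": "/var/log/access.log"}
relabelled = drop_label(scraped, "prog")
```

Note that once prog is removed, two mtail programs exporting the same metric name with the same remaining labels would become indistinguishable, so this is only safe when metric names are unique across programs.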
mtail lets you use the settimestamp() function to extract a timestamp from a log line and pass it on to the monitoring system, so that a metric carries the closest thing mtail knows to the actual time of the event rather than the time at which mtail read the log.
However, Prometheus needs to track the existence of a metric in the time series database in order to avoid showing very old data when querying the same metric for multiple instances at a specific timestamp. Exposing the timestamp can lead to triggering this staleness handling.
mtail, being a metric creator, falls under bbrazil's comment on the prometheus-users list, in which he says "It doesn't make sense to have timestamps for direct instrumentation, only for proxying metrics from another monitoring system with a custom collector."
The mtail timestamp handling is also broken for counters. The timestamp is set to 0 (UNIX epoch) at startup. If no matches are made, the initial zero count will never be ingested and the metric will only appear when first incremented. To avoid this, mtail disables exporting timestamps to Prometheus by default.
You can turn this behaviour back on with the --emit_metric_timestamp command-line flag. If you have slow-moving counters, you should also tune Prometheus's query.lookback-delta parameter. See also Staleness under Querying Basics in the Prometheus docs.
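To see why slow-moving counters interact badly with exported timestamps, here is a minimal Python sketch of a lookback window. It is a simplification, not Prometheus's actual query engine: it assumes an instant query returns the most recent sample whose timestamp falls within the lookback delta before the query time (5 minutes by default) and nothing otherwise. The function name instant_value and the sample data are hypothetical, and the exact handling of the window boundaries is an assumption.

```python
from typing import Optional, Sequence, Tuple

# Prometheus's default lookback delta is 5 minutes.
DEFAULT_LOOKBACK_SECONDS = 300.0


def instant_value(
    samples: Sequence[Tuple[float, float]],
    query_time: float,
    lookback: float = DEFAULT_LOOKBACK_SECONDS,
) -> Optional[float]:
    """Return the newest sample value inside the lookback window, or None."""
    # Keep only (timestamp, value) pairs inside (query_time - lookback, query_time].
    in_window = [(ts, value) for ts, value in samples if query_time - lookback < ts <= query_time]
    if not in_window:
        return None  # the series looks absent at this query time
    return max(in_window, key=lambda sample: sample[0])[1]


# A counter last incremented at t=1000, exported with that event timestamp.
samples = [(1000.0, 7.0)]
visible = instant_value(samples, query_time=1200.0)                # 200s old, inside the window
missing = instant_value(samples, query_time=1400.0)                # 400s old, outside the window
widened = instant_value(samples, query_time=1400.0, lookback=600.0)
```

In this model a counter that has not moved for longer than the lookback delta vanishes from query results even though mtail is still exporting it, which is why widening the window can help when timestamps are emitted.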
If you are looking to expose the timestamp of an event, for example the start time of a process, you can create a timestamp metric. This is a metric that contains the timestamp as the value:
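    counter mtail_lines_read_count by filename
    gauge mtail_file_lastread_timestamp by filename

    # Match every line read from any log file.
    /.*/ {
      mtail_lines_read_count[getfilename()]++
      # Store the time of the read as the metric's value, not as its timestamp.
      mtail_file_lastread_timestamp[getfilename()] = timestamp()
    }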
mtail is intended to be stateless, deferring the problem of long term metric storage to a timeseries database and collector like Prometheus.
Part of the reason is technical: not having to save and restore checkpoints makes the program much simpler.
This means that mtail programs should prefer metrics that perform better in stateless systems, like counters rather than gauges. Prometheus, for example, handles counter resets in counter functions such as rate() and increase(), so mtail not remembering the value of a metric should not be cause for concern.
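The reset handling mentioned above can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: it assumes that any decrease between consecutive samples means the counter restarted from zero, and it ignores the extrapolation Prometheus applies at the edges of a range. The function name increase_with_resets and the sample values are hypothetical.

```python
from typing import Sequence


def increase_with_resets(values: Sequence[float]) -> float:
    """Total increase of a counter over consecutive samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            # Normal case: the counter only moved forward.
            total += current - previous
        else:
            # The counter went backwards, so assume a restart from zero:
            # everything counted since the restart is new increase.
            total += current
    return total


# mtail restarts between the third and fourth scrape, so the counter drops from 20 to 3.
scraped_values = [10.0, 15.0, 20.0, 3.0, 8.0]
total_increase = increase_with_resets(scraped_values)
```

Here the increase is 5 + 5 + 3 + 5 = 18, so the restart loses at most the events that occurred between the last scrape and the restart. A gauge offers no such recovery, which is the reason counters are preferred in a stateless exporter.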
Another reason is that failure is normal, and Prometheus handles counter resets precisely because they are a normal occurrence. Even if mtail checkpointed its state, filesystem and state file corruption would still occur, and in those edge cases a counter reset would still be observed and would need to be handled regardless.
So, given that the monitoring system needs to handle missing and resetting data already in a distributed system, there is no compelling reason to implement metric checkpointing in mtail as well. It just adds complexity for little overall gain.