Docs: Added labels content #1999

Merged
merged 4 commits into from
Apr 30, 2020
1 change: 1 addition & 0 deletions README.md
@@ -52,6 +52,7 @@ Started Guide](https://github.com/grafana/loki/tree/v1.4.1/docs/getting-started/
Commonly used sections (from the latest release v1.4.1):

- [API documentation](https://github.com/grafana/loki/tree/v1.4.1/docs/api.md) for alternative ways of getting logs into Loki.
+ - [Labels](https://github.com/grafana/loki/blob/master/docs/getting-started/labels.md)
- [Operations](https://github.com/grafana/loki/tree/v1.4.1/docs/operations) for important aspects of running Loki.
- [Promtail](https://github.com/grafana/loki/tree/v1.4.1/docs/clients/promtail) is an agent which can tail your log files and push them to Loki.
- [Pipelines](https://github.com/grafana/loki/tree/v1.4.1/docs/clients/promtail/pipelines.md) for detailed log processing pipeline documentation
19 changes: 11 additions & 8 deletions docs/README.md
@@ -23,10 +23,13 @@ simplifies the operation and significantly lowers the cost of Loki.
3. [Getting Started](getting-started/README.md)
1. [Grafana](getting-started/grafana.md)
2. [LogCLI](getting-started/logcli.md)
+ 3. [Labels](getting-started/labels.md)
4. [Troubleshooting](getting-started/troubleshooting.md)
- 4. [Configuration](configuration/README.md)
+ 4. [Best Practices](best-practices/README.md)
+ 1. [Current Best Practices](best-practices/current-best-practices.md)
+ 5. [Configuration](configuration/README.md)
1. [Examples](configuration/examples.md)
- 5. [Clients](clients/README.md)
+ 6. [Clients](clients/README.md)
1. [Promtail](clients/promtail/README.md)
1. [Installation](clients/promtail/installation.md)
2. [Configuration](clients/promtail/configuration.md)
@@ -37,8 +40,8 @@ simplifies the operation and significantly lowers the cost of Loki.
1. [Configuration](clients/docker-driver/configuration.md)
4. [Fluent Bit](../cmd/fluent-bit/README.md)
3. [Fluentd](clients/fluentd/README.md)
- 6. [LogQL](logql.md)
- 7. [Operations](operations/README.md)
+ 7. [LogQL](logql.md)
+ 8. [Operations](operations/README.md)
1. [Authentication](operations/authentication.md)
2. [Observability](operations/observability.md)
3. [Scalability](operations/scalability.md)
@@ -48,11 +51,11 @@ simplifies the operation and significantly lowers the cost of Loki.
3. [BoltDB Shipper](operations/storage/boltdb-shipper.md)
5. [Multi-tenancy](operations/multi-tenancy.md)
6. [Loki Canary](operations/loki-canary.md)
- 8. [HTTP API](api.md)
- 9. [Architecture](architecture.md)
- 10. [Community](community/README.md)
+ 9. [HTTP API](api.md)
+ 10. [Architecture](architecture.md)
+ 11. [Community](community/README.md)
1. [Governance](community/governance.md)
2. [Getting in Touch](community/getting-in-touch.md)
3. [Contributing to Loki](community/contributing.md)
- 11. [Loki Maintainers Guide](./maintaining/README.md)
+ 12. [Loki Maintainers Guide](./maintaining/README.md)
1. [Releasing Loki](./maintaining/release.md)
3 changes: 3 additions & 0 deletions docs/best-practices/README.md
@@ -0,0 +1,3 @@
# Best Practices

1. [Current Best Practices](current-best-practices.md) includes a (hopefully) current guide for some best practices regarding Label usage and configuration in Loki.
88 changes: 88 additions & 0 deletions docs/best-practices/current-best-practices.md
@@ -0,0 +1,88 @@
# Loki label best practices

Loki is under active development, and we are constantly working to improve performance. But here are some of the most current best practices for labels that will give you the best experience with Loki.

## 1. Static labels are good

Things like host, application, and environment are great labels. They will be fixed for a given system/app and have bounded values. Use static labels to make it easier to query your logs in a logical sense (e.g. show me all the logs for a given application and specific environment, or show me all the logs for all the apps on a specific host).

## 2. Use dynamic labels sparingly

Too many label value combinations lead to too many streams. The penalties for that in Loki are a large index and small chunks in the store, which in turn can actually reduce performance.

To avoid those issues, don't add a label for something until you know you need it! Use filter expressions (`|= "text"`, `|~ "regex"`, …) and brute force those logs. It works -- and it's fast.

From early on, we have set a label dynamically using Promtail pipelines for `level`. This seemed intuitive for us as we often wanted to only show logs for `level="error"`; however, we are re-evaluating this now, as writing a query like `{app="loki"} |= "level=error"` is proving to be just as fast for many of our applications as `{app="loki",level="error"}`.

This may seem surprising, but if applications have medium to low volume, that label causes one application's logs to be split into up to five streams, which means 5x chunks being stored. And loading chunks has an overhead associated with it. Imagine now if that query were `{app="loki",level!="debug"}`. That would have to load **way** more chunks than `{app="loki"} != "level=debug"`.

Above, we mentioned not to add labels until you _need_ them, so when would you _need_ labels? A little farther down is a section on `chunk_target_size`. If you set this to 1MB (which is reasonable), this will try to cut chunks at 1MB compressed size, which is about 5MB-ish of uncompressed logs (might be as much as 10MB depending on compression). If your logs have sufficient volume to write 5MB in less time than `max_chunk_age`, or **many** chunks in that timeframe, you might want to consider splitting them into separate streams with a dynamic label.

What you want to avoid is splitting a log file into streams in a way that results in chunks getting flushed because the stream is idle or hits the max age before being full. As of [Loki 1.4.0](https://grafana.com/blog/2020/04/01/loki-v1.4.0-released-with-query-statistics-and-up-to-300x-regex-optimization/), there is a metric which can help you understand why chunks are flushed: `sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="dev"}[1m]))`.

It’s not critical that every chunk be full when flushed, but it will improve many aspects of operation. As such, our current guidance here is to avoid dynamic labels as much as possible and instead favor filter expressions. For example, don’t add a `level` dynamic label, just filter with `|= "level=debug"` instead.

## 3. Label values must always be bounded

If you are dynamically setting labels, never use a label which can have unbounded or infinite values. This will always result in big problems for Loki.

Try to keep values bounded to as small a set as possible. We don't have perfect guidance as to what Loki can handle, but think single digits, or maybe tens of values for a dynamic label. This is less critical for static labels. For example, if you have 1,000 hosts in your environment it's going to be just fine to have a host label with 1,000 values.
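
To make the cardinality concern concrete, here is a minimal Python sketch (not part of Loki) that computes an upper bound on the number of streams as the product of distinct values per label. The label names and counts are illustrative assumptions, and real deployments usually see fewer streams than this bound because not every combination occurs.

```python
from math import prod


def max_streams(label_cardinalities):
    """Upper bound on streams: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets: the names and counts are made up for illustration.
static_only = {"host": 1000, "app": 1, "environment": 1}
with_level = {**static_only, "level": 5}
with_request_id = {**static_only, "request_id": 1_000_000}

for name, labels in [
    ("static only", static_only),
    ("plus level", with_level),
    ("plus request_id", with_request_id),
]:
    print(f"{name}: up to {max_streams(labels):,} streams")
```

A bounded dynamic label multiplies the stream count by a small factor, while an effectively unbounded one (such as a request ID) multiplies it by however many values ever appear.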

## 4. Be aware of dynamic labels applied by clients

Loki has several client options: [Promtail](https://github.com/grafana/loki/tree/master/docs/clients/promtail) (which also supports systemd journal ingestion and TCP-based syslog ingestion), [FluentD](https://github.com/grafana/loki/tree/master/fluentd/fluent-plugin-grafana-loki), [Fluent Bit](https://github.com/grafana/loki/tree/master/cmd/fluent-bit), a [Docker plugin](https://grafana.com/blog/2019/07/15/lokis-path-to-ga-docker-logging-driver-plugin-support-for-systemd/), and more!

Each of these comes with ways to configure what labels are applied to create log streams. But be aware of what dynamic labels might be applied. Use the Loki series API to get an idea of what your log streams look like and see if there might be ways to reduce streams and cardinality. Details of the Series API can be found [here](https://github.com/grafana/loki/blob/master/docs/api.md#series), or you can use [logcli](https://github.com/grafana/loki/blob/master/docs/getting-started/logcli.md) to query Loki for series information.
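
As a rough sketch of how series information could be summarized, the Python below counts distinct values per label across a list of stream label sets. The `fetch_series` helper is an assumption: the endpoint path, the `match[]` parameter name, and the `data` field of the response should be checked against the API documentation for your Loki version. The sample data is invented.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict


def label_cardinality(series):
    """Map each label name to the number of distinct values seen across streams."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def fetch_series(base_url, selector):
    """Fetch stream label sets from the series endpoint.

    Assumption: the path, the 'match[]' parameter, and the 'data' response
    field follow the Series API docs for your Loki version.
    """
    query = urllib.parse.urlencode({"match[]": selector})
    with urllib.request.urlopen(f"{base_url}/loki/api/v1/series?{query}") as resp:
        return json.load(resp)["data"]


# Invented sample of what a list of stream label sets might look like.
sample = [
    {"job": "syslog", "instance": "host1", "level": "info"},
    {"job": "syslog", "instance": "host1", "level": "error"},
    {"job": "syslog", "instance": "host2", "level": "info"},
]

print(len(sample), "streams")
print(label_cardinality(sample))
```

Labels with unexpectedly high counts are the first candidates to drop or replace with filter expressions.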

## 5. Configure caching

Loki can cache data at many levels, which can drastically improve performance. Details of this will be in a future post.

## 6. Logs must be in increasing time order per stream

One issue many people have with Loki is their client receiving errors for out of order log entries. This happens because of this hard and fast rule within Loki:

- For any single log stream, logs must always be sent in increasing time order. If a log is received with a timestamp older than the most recent log received for that stream, that log will be dropped.

There are a few things to dissect from that statement. The first is that this restriction is per stream. Let’s look at an example:

```
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog!
```

If Loki received these two lines which are for the same stream, everything would be fine. But what about this case:

```
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:02 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog! <- Rejected out of order!
```

What can we do about this? What if this was because the sources of these logs were different systems? We can solve this with an additional label which is unique per system:

```
{job="syslog", instance="host1"} 00:00:00 i'm a syslog!
{job="syslog", instance="host1"} 00:00:02 i'm a syslog!
{job="syslog", instance="host2"} 00:00:01 i'm a syslog! <- Accepted, this is a new stream!
{job="syslog", instance="host1"} 00:00:03 i'm a syslog! <- Accepted, still in order for stream 1
{job="syslog", instance="host2"} 00:00:02 i'm a syslog! <- Accepted, still in order for stream 2
```
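
The per-stream rule in the examples above can be modeled with a small Python sketch. This is a simplified illustration of the rule as stated here, not Loki's actual ingester code: timestamps are plain integers (seconds), a stream is identified by its full label set, and an entry is rejected only when it is strictly older than the latest accepted entry for its stream.

```python
def ingest(entries):
    """Apply the per-stream ordering rule to (labels, timestamp) pairs.

    Returns a list of "accepted"/"rejected" verdicts, one per entry.
    """
    latest = {}  # stream key -> most recent accepted timestamp
    verdicts = []
    for labels, ts in entries:
        key = tuple(sorted(labels.items()))  # the full label set identifies a stream
        if key in latest and ts < latest[key]:
            verdicts.append("rejected")
            continue
        latest[key] = ts
        verdicts.append("accepted")
    return verdicts


# One stream: the third entry is older than the second, so it is rejected.
single = [({"job": "syslog"}, 0), ({"job": "syslog"}, 2), ({"job": "syslog"}, 1)]
print(ingest(single))

# Two streams distinguished by an instance label: every entry is in order
# within its own stream, so all are accepted.
split = [
    ({"job": "syslog", "instance": "host1"}, 0),
    ({"job": "syslog", "instance": "host1"}, 2),
    ({"job": "syslog", "instance": "host2"}, 1),
    ({"job": "syslog", "instance": "host1"}, 3),
    ({"job": "syslog", "instance": "host2"}, 2),
]
print(ingest(split))
```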

But what if the application itself generated logs that were out of order? Well, I'm afraid this is a problem. If you are extracting the timestamp from the log line with something like [the Promtail timestamp pipeline stage](https://github.com/grafana/loki/blob/master/docs/clients/promtail/stages/timestamp.md), you could instead _not_ do this and let Promtail assign a timestamp to the log lines. Or you can hopefully fix it in the application itself.

But I want Loki to fix this! Why can’t you buffer streams and re-order them for me?! To be honest, because this would add a lot of memory overhead and complication to Loki, and as has been a common thread in this post, we want Loki to be simple and cost-effective. Ideally we would want to improve our clients to do some basic buffering and sorting as this seems a better place to solve this problem.

It's also worth noting that the batching nature of the Loki push API can lead to out of order errors that are really false positives. For example, if a batch partially succeeded and is then sent again, the entries that were already accepted would return an out of order error, while any new entries would be accepted.

## 7. Use `chunk_target_size`

This was added earlier this year when we [released v1.3.0 of Loki](https://grafana.com/blog/2020/01/22/loki-1.3.0-released/), and we've been experimenting with it for several months. We have `chunk_target_size: 1536000` in all our environments now. This instructs Loki to try to fill all chunks to a target _compressed_ size of 1.5MB. These larger chunks are more efficient for Loki to process.

A couple other config variables affect how full a chunk can get. Loki has a default `max_chunk_age` of 1h and `chunk_idle_period` of 30m to limit the amount of memory used as well as the exposure of lost logs if the process crashes.

Depending on the compression used (we have been using snappy, which has less compressibility but faster performance), you need 5-10x the chunk size, or 7.5-15MB of raw log data, to fill a 1.5MB chunk. Remembering that a chunk is per stream, the more streams you break up your log files into, the more chunks sit in memory, and the higher the likelihood they get flushed by hitting one of the timeouts mentioned above before they are filled.
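
The arithmetic behind this can be sketched in Python. The function below is a back-of-the-envelope estimate, not Loki's flush logic: it assumes a steady log rate per stream (so the idle timeout never triggers), a fixed compression ratio of 5x, and the `chunk_target_size` and `max_chunk_age` values mentioned above. The byte rates and returned strings are made up for illustration.

```python
def flush_outcome(bytes_per_second, target_size=1_536_000,
                  compression_ratio=5.0, max_chunk_age_s=3600):
    """Estimate whether a steadily written stream fills a chunk before max_chunk_age.

    Returns (seconds_to_fill, outcome), where outcome is "filled" or
    "hit max_chunk_age". The outcome strings are illustrative labels only.
    """
    raw_bytes_needed = target_size * compression_ratio
    seconds_to_fill = raw_bytes_needed / bytes_per_second
    outcome = "filled" if seconds_to_fill <= max_chunk_age_s else "hit max_chunk_age"
    return seconds_to_fill, outcome


# Hypothetical application writing 10 kB/s of raw logs as a single stream.
print(flush_outcome(10_000))

# The same volume split evenly across five streams (for example by a level label).
print(flush_outcome(10_000 / 5))
```

Under these assumptions the single stream fills a chunk in under 13 minutes, while each of the five split streams would need 64 minutes and is flushed by age before it is full.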

Lots of small, unfilled chunks are currently kryptonite for Loki. We are always working to improve this and may consider a compactor to improve this in some situations. But, in general, the guidance should stay about the same: Try your best to fill chunks!

If you have an application that can log fast enough to fill these chunks quickly (much less than `max_chunk_age`), then it becomes more reasonable to use dynamic labels to break that up into separate streams.
1 change: 1 addition & 0 deletions docs/getting-started/README.md
@@ -2,5 +2,6 @@

1. [Grafana](grafana.md)
2. [LogCLI](logcli.md)
+ 3. [Labels](labels.md)
4. [Troubleshooting](troubleshooting.md)
