
[JUJU-1889] Log forwarding, take two #165

Closed · wants to merge 14 commits

Conversation

@barrettj12 (Contributor) commented on Nov 30, 2022:

This PR enables forwarding of Pebble service logs to remote Loki servers and UDP/TCP syslog receivers, following the spec here.

Design

Each (service, log target) pair has an associated "forwarder", which pulls logs from the service ringbuffer (using the iterators introduced in #29), and sends them to the log target. Each forwarder runs two goroutines:

  • an iterate goroutine, which pulls logs from the iterator/parser and sends them on a messages channel
  • a forward goroutine, which reads logs from the messages channel, buffers them, and flushes periodically to the remote log target (after 100 logs have been buffered, or 1 second after the first write)

The forwarders are controlled by a "log manager", which is notified when a new service starts, or when the plan changes. On a plan change, the log manager tears down all forwarders, and starts new ones as determined by the plan.
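
For illustration only, the forward loop described above might look roughly like the following sketch. The `logEntry` and `logTarget` types and all names here are hypothetical placeholders, not Pebble's actual code:

```go
// Hypothetical sketch of the forward goroutine described above; logEntry and
// logTarget are placeholder types, not Pebble's real servicelog/logstate types.
package logsketch

import "time"

type logEntry struct {
	Time    time.Time
	Service string
	Message string
}

type logTarget interface {
	Send(entries []logEntry) error
}

// forward buffers entries from the messages channel and flushes them to the
// target once 100 entries are buffered, or 1 second after the first write.
func forward(messages <-chan logEntry, target logTarget, done <-chan struct{}) {
	var buf []logEntry
	var timeout <-chan time.Time

	flush := func() {
		if len(buf) == 0 {
			return
		}
		_ = target.Send(buf) // the real code would log send errors
		buf = nil
		timeout = nil
	}

	for {
		select {
		case entry := <-messages:
			if len(buf) == 0 {
				// Start the 1-second countdown on the first buffered entry.
				timeout = time.After(time.Second)
			}
			buf = append(buf, entry)
			if len(buf) >= 100 {
				flush()
			}
		case <-timeout:
			flush()
		case <-done:
			flush()
			return
		}
	}
}
```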

Plan changes

The Pebble layer specification has changed as follows:

  • A new top-level section log-targets, where users specify the log targets they want Pebble to interact with.
  • A new (optional) field log-targets in the service spec. This takes a list of names, which are matched against the targets defined in the log-targets top-level section.

The default behaviour is to forward all services' logs to all targets. This makes it easy to get started with log forwarding by just defining a new layer with the log targets - no need to change the service definitions.

Users can have more granular control over log forwarding by:

  • Setting the optional selection field for a log target;
  • Explicitly listing, in a service's definition, the log targets to forward that service's logs to.

More details are in the README.
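
For illustration, a layer exercising these options might look roughly like the following. The field names mirror the examples later in this thread, and the exact syntax (for example the target's location format, or the selection field mentioned above) may differ from what finally merged:

```yaml
# Illustrative sketch only, not taken verbatim from the final spec.
services:
  svc1:
    override: merge
    command: /usr/local/bin/my-service
    startup: enabled
    # Optional: forward this service's logs only to the named targets.
    log-targets: [staging-logs]

log-targets:
  staging-logs:
    override: merge
    type: loki
    location: http://localhost:3100   # exact location format not confirmed here
```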

Other changes

The plan.Service.Equal method was being used by the service manager to determine whether or not to restart a service on a replan. However, not all changes to the service config require a service restart. In particular, a change to the new log-targets field should not invoke a service restart. Hence, we've defined a servstate.NeedsRestart(old, new *plan.Service) function, which compares the old and new config structs for a service and determines if the service needs to be restarted. At the moment, NeedsRestart only ignores changes to log-targets - future work should also ignore changes to e.g. metadata, so we can be more conservative with restarting services.

This was the only use of plan.Service.Equal, hence we've also removed that method.
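
A minimal sketch of the idea, with a stand-in Service type (the real plan.Service has more fields):

```go
package servstate // placement is illustrative

import "reflect"

// Service is a stand-in for plan.Service, with illustrative fields only.
type Service struct {
	Command    string
	Startup    string
	LogTargets []string
}

// NeedsRestart reports whether changing a service's config from old to new
// requires restarting the service. A change to LogTargets alone does not.
func NeedsRestart(old, new *Service) bool {
	oldCopy, newCopy := *old, *new
	oldCopy.LogTargets = nil
	newCopy.LogTargets = nil
	return !reflect.DeepEqual(oldCopy, newCopy)
}
```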

Future work

  • Add support for custom Loki labels (spec).
  • Allow configuring the 100 log limit / 1 second timeout in the plan (?).

@hpidcock (Member) left a comment:

Great work, just some feedback so far

@barrettj12 barrettj12 requested a review from benhoyt February 17, 2023 04:29
@benhoyt (Contributor) left a comment:

This is a big body of work and looking close now. The structure is really nice. I've added a bunch of comments. Many are minor/nits, but a few requesting small changes or adding tests -- nothing particularly structural.

Also, have you tested / will you test it against a real Loki and real syslog server and hammer it a bit?

@barrettj12 (Contributor, Author) replied:

> Also, have you tested / will you test it against a real Loki and real syslog server and hammer it a bit?

Did some load testing today. I deployed Loki and Grafana on local microk8s using Juju. I used Pebble to run 2 services, each spitting out 1000 logs/sec. I tested the following forwarding scenarios:

  • Forward logs directly to Loki.
  • Run Promtail locally, receive logs from Pebble via syslog, and forward to Loki.

Both scenarios gave similar results. Pebble used about 4-10% of my CPU doing this (on my pretty-average laptop). All the logs seemed to get through to Loki.

Subsequent commits:

  • add some TODOs for later
  • ... so we get deterministic output. Update the test to match.
  • Fixes a rare data race on the logger where:
    - One test uses the real logger (read)
    - The subsequent test replaces it with a mock logger (write)
  • managerSuite.TestNoLogDuplication, forwarderSuite.TestForwarder: increasing the flush delay to 10ms, and the request timeout to 1s, seems to have fixed things.
@barrettj12 barrettj12 requested a review from benhoyt March 7, 2023 23:50
@benhoyt (Contributor) left a comment:

Re-reviewed. A few nits (plus the log forwarder simplification we discussed).

Subsequent commits:

  • README improvements
  • syslogClient.Send: exit early if len(entries) == 0
  • separate iterate goroutine which feeds logs to the main loop
@benhoyt (Contributor) left a comment:

This is looking really good now, thanks @barrettj12! Conversation: 244. Wonder what the GitHub record is? Way more than this one. :-)

@barrettj12 barrettj12 requested a review from niemeyer March 10, 2023 04:11
@jnsgruk jnsgruk added the "High Priority" ("Look at me first") label Mar 11, 2023
@niemeyer (Contributor) left a comment:

Thanks for the great work here, everyone. I have a few recommendations below to get started, but as a general overview this is looking very nice. In particular, the whole plan side of things looks pretty much ready to land, while on the forwarding implementation there might be a couple of details for us to figure out.

Also on that basis, another recommendation here is for us to break down this PR into smaller parts that can more easily evolve and get merged independently. Considering the design in place, I suggest the following parts:

  1. The entire parsing and configuration aspects, also including a stub manager that does almost nothing other than reading and validating targets.
  2. The manager implementation, which includes all the glue, but no real clients.
  3. The syslog client.
  4. The loki client.

Step 1 is probably most of the code, and it's also mostly ready to merge, so we have some quick wins ahead of us. Step 2 is likely where most of our conversations will take place. Steps 3 and 4 should be easy when we get there.

How does that sound?


```go
// stop interrupts the main forward loop, telling the forwarder to shut down.
// Once this method returns, it is safe to re-use the iterator.
func (f *logForwarder) stop() {
```
@niemeyer (Contributor) commented on Mar 20, 2023:

We can use gopkg.in/tomb.v2 to control the goroutine lifecycle. Among other benefits, we'll get proper error reporting from those methods; right now errors are being buried in logs only. It will also considerably simplify the logic of these signal channels and the use of timeouts, which currently seems lacking.
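
For reference, the basic tomb.v2 lifecycle pattern looks roughly like this (a standalone sketch, not the proposed Pebble code):

```go
package main

import (
	"fmt"

	"gopkg.in/tomb.v2"
)

func main() {
	entries := make(chan string)

	var t tomb.Tomb
	t.Go(func() error {
		for {
			select {
			case <-t.Dying():
				return nil // clean shutdown requested via Kill
			case e := <-entries:
				fmt.Println("forwarding:", e) // stand-in for sending to a target
			}
		}
	})

	entries <- "hello"
	t.Kill(nil) // ask the goroutine to stop
	if err := t.Wait(); err != nil {
		fmt.Println("forwarder failed:", err) // fatal errors surface here
	}
}
```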

A contributor replied:

Sounds reasonable to use tombs. However, we should be careful with error handling; I don't think we want a single error to be bubbled up and end up restarting the forwarder. We did have errors "buried in logs only", but that was by design -- without retry logic that's about all you can do, and when new logs come in, we'll attempt to send those. We wanted to do the initial version without retry logic, as that adds complexity and raises a number of other questions (what if logs get backed up when the target is erroring, how long / how many do we keep, and so on). In any case, we'll consider this when we switch to tombs.

A further reply:

Yes, there's still a distinction between a fatal error and a non-fatal error. Right now neither is surfacing.

```go
for parser.Next() {
	// The forward loop is either selecting on this channel or draining it
	// in the cancel case. So no need for a select here.
	messages <- parser.Entry()
```
@niemeyer (Contributor) commented on Mar 20, 2023:

The machinery that decouples the iterator, the parser, the buffer inside the forward loop, and the client seems to add significant complexity that we don't really benefit from. Let's just take a step back and consider the overall flow.

At the moment we have data buffered at these layers:

  • The ring buffer
  • (NOT the iterator, it's just a pointer to the ring)
  • The servicelog.Parser
  • The buffer inside forward()
  • The logClient (whether persistent or not, there's a data buffer there while sending)

All of those intermediate layers are just processing data from one side to the other. It might be tempting to buffer data in between to make sure the system does not run out of memory, well, except both sides are already in memory! So any attempt to buffer data in the middle in an intermediate state is actually increasing the amount of memory used in the overall system. If we want less memory consumption, all we need to do is to consume more slowly from the left side, and when we do consume, send it all the way to the right side.

Assuming such a design, we'd need a single goroutine that parses out of the iterator, and any time it manages to parse anything, it pushes into the client which can then serialize to its buffer and keep it pretty much ready for dispatch. Straight all the way from ring buffer to the output buffer.

Then, there's apparently another issue in the current system: there is an independent client for each forwarder, which means if we have 10 services and 2 targets, we have 20 different connections, each with their own series of buffers. Is that the case? If so, that really doesn't feel so great.

So, the recommended simplified design would be something like this: one single client per log target for all of Pebble, and no intermediate buffers. All the data that leaves the ring buffer is processed on one single control goroutine, and lands in the client buffer as ready to dispatch as possible, so we know the actual number of bytes we have to send and are not holding references to Go structures. The client would expose one blocking Push method to add entries, which also flushes when the buffer goes above a given size limit, and a second method Flush to force it empty. These methods make it easy for the client to tolerate concurrency properly (any number of goroutines pushing their messages will naturally block and wait when the size limit is exceeded). To handle the time-based delay, we can have an additional method like FlushAfter(duration), where the timing of the last flush is held internally by the client, which means concurrency also works fine if multiple goroutines recommend a flush at once (Flush is basically FlushAfter(0)).

And that would be all, I think. Of course, it's a bit hard to keep the whole picture at once in mind, but hopefully I'm not missing major points here.
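
To make that shape concrete, a rough sketch of such a client might look like the following; every name, type, and threshold here is a placeholder rather than a proposed final API:

```go
// Sketch of a single per-target client with Push/Flush/FlushAfter semantics,
// as described above. Entry, serialize, send, and the size limit are placeholders.
package logsketch

import (
	"sync"
	"time"
)

type Entry struct {
	Time    time.Time
	Service string
	Message string
}

const maxBufferBytes = 64 * 1024 // placeholder size limit

func serialize(e Entry) []byte {
	return []byte(e.Time.Format(time.RFC3339) + " [" + e.Service + "] " + e.Message + "\n")
}

func send(data []byte) error { return nil } // stand-in for the real network dispatch

type targetClient struct {
	mu        sync.Mutex
	buf       []byte // serialized entries, ready to dispatch
	lastFlush time.Time
}

// Push serializes the entry straight into the output buffer and flushes when
// the buffer exceeds the size limit. Concurrent pushers block while a flush
// is in progress, which gives the natural backpressure described above.
func (c *targetClient) Push(e Entry) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.buf = append(c.buf, serialize(e)...)
	if len(c.buf) >= maxBufferBytes {
		return c.flushLocked()
	}
	return nil
}

// FlushAfter flushes only if at least d has passed since the last flush;
// Flush is simply FlushAfter(0).
func (c *targetClient) FlushAfter(d time.Duration) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastFlush) < d {
		return nil
	}
	return c.flushLocked()
}

func (c *targetClient) Flush() error { return c.FlushAfter(0) }

func (c *targetClient) flushLocked() error {
	c.lastFlush = time.Now()
	if len(c.buf) == 0 {
		return nil
	}
	err := send(c.buf)
	c.buf = c.buf[:0]
	return err
}
```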

Subsequent commits:

  • deduplicate log targets in Service.Merge
  • log message improvements
  • rename targets in README example
@barrettj12 (Contributor, Author) commented:

Hi @niemeyer, thanks a lot for the review. I've addressed most of your review comments and, as you suggested, opened the first PR #209, focusing just on the changes to the plan. Once this merges, I can start working on the simplified log forwarding design.

@jnsgruk jnsgruk removed the Reviewed label Apr 17, 2023
@jnsgruk (Member) commented on Apr 17, 2023:

@barrettj12 @hpidcock what's the status here - we could do with progressing this. Perhaps we should open a new PR to address the remaining features, and close this one to avoid any further confusion?

@niemeyer (Contributor) commented:

PR #209 has been reviewed and merged with the first phase. I'm going to close this PR to make the status more clear, as it needs to be broken down into the remaining parts as outlined in the review above.

@niemeyer niemeyer closed this Apr 17, 2023
benhoyt pushed a commit that referenced this pull request Aug 22, 2023
This PR contains the second part of #165, including the actual mechanics of
log forwarding in Pebble. It builds upon #209 and #252. This includes an
abstract `logClient` interface but doesn't include the specific Loki/syslog
implementations - this will come in later PRs.

*This is a modification of #218, designed to fix fundamental issues identified
 with that PR.*

### Current design

For each log target, there is a single "gatherer", which receives logs from a
bunch of services and writes them to a "client". The gatherer runs a control
loop in a separate goroutine, which waits to receive logs on a channel, and
writes these to the client.

For each service, the gatherer spawns a "log puller". Each puller runs in a
separate goroutine, pulling logs from an iterator on the service's
ringbuffer. Pulled logs are sent to the gatherer's control loop on the shared
channel.

The `logClient` interface represents a client to a specific type of backend.
In future PRs, we will add `lokiClient` and `syslogClient` types which
implement `logClient`.

`logClient` includes two methods:

```go
type logClient interface {
	Write(context.Context, servicelog.Entry) error
	Flush(context.Context) error
}
```

Client implementations have some freedom about the semantics of these
methods.

- For a buffering client (e.g. HTTP), `Write` could add the log to the
  client's internal buffer, while `Flush` would prepare and send an HTTP
  request with the buffered logs.
- For a non-buffering client (TCP?), `Write` could serialise the log directly
  to the open connection, while `Flush` would be a no-op.
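
As an illustration of the buffering case (this is not the fake.go linked below; `Entry` and `send` are placeholders standing in for `servicelog.Entry` and a real HTTP request):

```go
// Illustrative buffering client with the same Write/Flush shape as logClient
// above; Entry stands in for servicelog.Entry.
package logsketch

import "context"

type Entry struct {
	Service string
	Message string
}

// send stands in for building and sending one HTTP request with all buffered logs.
func send(ctx context.Context, entries []Entry) error { return nil }

type bufferingClient struct {
	buf       []Entry
	threshold int // flush automatically once this many entries are buffered (0 = never)
}

func (c *bufferingClient) Write(ctx context.Context, e Entry) error {
	c.buf = append(c.buf, e)
	if c.threshold > 0 && len(c.buf) >= c.threshold {
		return c.Flush(ctx)
	}
	return nil
}

// Flush sends everything buffered so far and empties the buffer.
func (c *bufferingClient) Flush(ctx context.Context) error {
	if len(c.buf) == 0 {
		return nil
	}
	err := send(ctx, c.buf)
	c.buf = c.buf[:0]
	return err
}
```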

### Teardown

Gracefully stopping a log gatherer is a complex process with multiple steps.

- At the instant we attempt to stop the gatherer, it may be in the middle of
  flushing its client. So, wait a small amount of time for the client to
  finish flushing, but cancel the flush if this takes too long.
- The service may have emitted some final logs on shutdown. Give each puller
  some time to pull the final logs from its iterator - but again, force kill
  it if this is taking too long.
- Once the pullers are all finished, we must have received all the logs we're
  gonna get, so we can safely shut down the main loop.
- Do a final flush of the client to send off any remaining buffered logs.

All this logic is encapsulated in the `gatherer.stop()` method.
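
A hypothetical sketch of that ordering is below; the types, fields, and one-second grace periods are placeholders, not the actual gatherer implementation:

```go
package logsketch

import (
	"context"
	"time"
)

type puller struct {
	done chan struct{} // closed when the puller goroutine exits
	kill func()        // forcibly stops the puller
}

type flusher interface {
	Flush(context.Context) error
}

type gatherer struct {
	flushDone   chan struct{} // closed when any in-flight flush finishes
	cancelFlush func()
	pullers     []*puller
	stopMain    func() // shuts down the main control loop
	client      flusher
}

// waitOrTimeout waits for done to be closed, giving up after d.
func waitOrTimeout(done <-chan struct{}, d time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// stop mirrors the teardown steps described above.
func (g *gatherer) stop() {
	// 1. Give an in-flight flush a short grace period, then cancel it.
	if !waitOrTimeout(g.flushDone, time.Second) {
		g.cancelFlush()
	}
	// 2. Let each puller drain its iterator, force-killing slow ones.
	for _, p := range g.pullers {
		if !waitOrTimeout(p.done, time.Second) {
			p.kill()
		}
	}
	// 3. With all pullers finished, the main loop can shut down safely.
	g.stopMain()
	// 4. Final flush to send off any remaining buffered logs.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = g.client.Flush(ctx)
}
```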

## QA

I've included some sample implementations of `logClient` [here](https://github.com/barrettj12/pebble/blob/logfwd-fake/internals/overlord/logstate/fake.go).
They just print the logs to stdout. These can be used to verify that the log
forwarding mechanics work properly.

Create a simple logging service, e.g.

```bash
#!/bin/bash
while true; do
  echo "Hello"
  sleep 1
done
```

and a simple plan using this service

```yaml
services:
  svc1: &logsvc
    command: /home/jb/git/canonical/pebble/logfwd-impl2/pebble/logsvc
    startup: enabled
    override: merge
  svc2: *logsvc

log-targets:
  tgt1:
    override: merge
    services: [all]
    type: loki
    location: unnecessary
```

Add the [`fake.go`](https://github.com/barrettj12/pebble/blob/logfwd-fake/internals/overlord/logstate/fake.go) file to the repo.

Comment out the following line
https://github.com/canonical/pebble/blob/3e904f9d22f297b68cba2dc33c9cf8e1bbbadd90/internals/overlord/logstate/gatherer.go#L356
and replace it with e.g.

```go
return &nonBufferingClient{}, nil          // unbuffered
return &bufferingClient{}, nil             // unlimited-size buffer, will flush on timeout only
return &bufferingClient{threshold: 3}, nil // buffer with max size: 3 logs
```

You might also want to change the gatherer's tick period:
https://github.com/canonical/pebble/blob/3e904f9d22f297b68cba2dc33c9cf8e1bbbadd90/internals/overlord/logstate/gatherer.go#L32

Run Pebble with

```
go run ./cmd/pebble run
```

and verify the logs are printing to stdout.

---

JUJU-3776