[Feature] Fluentd Logging Sidecar #353

Closed
2 tasks done
simon-mo opened this issue Jul 7, 2022 · 12 comments
Labels
enhancement New feature or request

Comments


simon-mo commented Jul 7, 2022

Search before asking

  • I searched the issues and found no similar feature request.

Description

Ray head and worker pod logs are not very useful. The application-specific logs are either streamed to drivers (which are ephemeral) or buried in the /tmp/ray/session_latest/logs directory in each pod.

In production deployments, the contents of the logs directory are typically scraped by a log agent and sent to a log aggregation service. For example, we have recommended Loki for Ray Serve users. However, this approach requires specific setup for logging stacks.

Typically, production K8s clusters already have pod log aggregation set up for all pods. This means that if there's a way to print the logs to a pod's stdout, the logs will be automatically picked up and shipped to whatever logging solution the K8s cluster has adopted.

One approach is to just tail -f the /tmp/ray/session_latest/logs directory as the pod cmd instead of sleep. This might work but it loses all the metadata.

I want to propose an alternative: add a new sidecar container to each head and worker pod. The container will

  • run a fluentd agent
  • read the logs directory (via a volume mount)
  • read its configuration from a ConfigMap
  • print each log line annotated with its source file path, as JSONL by default

In this way, the fluentd container will be able to read and annotate the logs properly, and the configured logging agent (Logstash, Loki, Splunk) can pick them up from the pod's stdout.
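As a rough sketch of what I mean (all names, the image tag, and paths below are illustrative assumptions, not a final design):

```yaml
# Sketch only: names, image tag, and paths are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /tmp/ray/session_latest/logs/*
      tag ray
      path_key file             # annotate each record with its source file path
      read_from_head true
      <parse>
        @type none
      </parse>
    </source>
    <match ray.**>
      @type stdout              # emit JSON lines on the sidecar's stdout
      <format>
        @type json
      </format>
    </match>
---
# Relevant part of the head/worker pod template: a shared log volume
# plus the fluentd sidecar reading it.
apiVersion: v1
kind: Pod
metadata:
  name: ray-head-with-fluentd
spec:
  containers:
    - name: ray-head
      image: rayproject/ray:latest
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
    - name: fluentd
      image: fluent/fluentd:v1.16-1      # image tag is illustrative
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc
  volumes:
    - name: ray-logs
      emptyDir: {}
    - name: fluentd-config
      configMap:
        name: fluentd-config
```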

Please let me know what you think! @DmitriGekhtman @Jeffwan

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
simon-mo added the enhancement label Jul 7, 2022
@DmitriGekhtman

Thanks for opening this issue! (Logging is one of many todo items swimming around in the back of my head :) )

Ray supports redirecting logs to stderr with a component prefix (at least in recent Ray versions).

cc @clarkzinzow @rkooo567 who are most familiar with this functionality

Could that remove the need for fluentd to scrape from the log files?
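(For concreteness, the documented way to enable this in recent Ray versions is the RAY_LOG_TO_STDERR=1 environment variable; a minimal sketch of setting it on the Ray container, with illustrative container name and image:)

```yaml
# Minimal sketch: set Ray's documented RAY_LOG_TO_STDERR env var on the Ray
# container so Ray components write to stderr with a component prefix.
containers:
  - name: ray-head                 # illustrative container name
    image: rayproject/ray:latest
    env:
      - name: RAY_LOG_TO_STDERR
        value: "1"
```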


simon-mo commented Jul 7, 2022

No, I don't think the RAY_log_to_stderr flag is enough, unfortunately. It doesn't work with ray start --head, for example.


DmitriGekhtman commented Jul 7, 2022

(Is it a fixable issue that it doesn't work with "ray start --head"?)

Ok, then using fluentd sounds reasonable.

My question then is if the operator should automatically insert the fluentd sidecar and log volume.

Or do we document that setup and leave it to users? (We could update all of the example configs to demonstrate this setup.)

Or (intermediate option) do we add a toggle that allows users to enable/disable automatic configuration of logging sidecars? (This would effectively be a feature flag.)

If we go with automatic configuration, I've seen that some users cannot pull directly from dockerhub, so making the sidecar image configurable would be important.


rkooo567 commented Jul 7, 2022

Based on the description, I feel like this is something we should support in KubeRay. However, I also wonder how other systems handle this (it seems to be a pretty common problem in all systems, iiuc)?


simon-mo commented Jul 7, 2022

Ray is special here because we run multiple processes in the same container and don't output logs to pod stdout. Other Kubernetes-native systems run one process per pod and log to pod stdout, which is automatically aggregated.


Jeffwan commented Jul 8, 2022

At Bytedance, we mount the Ray logs folder /tmp/ray to a fixed hostPath /opt/tiger/containers and configure filebeat rules to collect all Ray logs. In this case, our log daemon runs at the host level.

Using a streaming sidecar container also works. I think we need to make it optional and pluggable, and not raise a failure if the external logging system is not there.
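For illustration, the host-level variant described above might look roughly like the following in the Ray pod spec (the host directory is the one mentioned above; the filebeat daemonset configuration itself is omitted):

```yaml
# Sketch of the host-level collection approach: mount /tmp/ray onto a hostPath
# so a node-level agent (filebeat, promtail, etc.) can scrape the log files.
containers:
  - name: ray-worker
    image: rayproject/ray:latest
    volumeMounts:
      - name: ray-logs
        mountPath: /tmp/ray
volumes:
  - name: ray-logs
    hostPath:
      path: /opt/tiger/containers    # fixed host directory mentioned above
      type: DirectoryOrCreate
```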


DmitriGekhtman commented Jul 8, 2022

Sounds good. We've definitely narrowed the options down to

  • document the setup and leave it to users

or

  • add a toggle (and additional config options) that allows users to enable/disable automatic configuration of logging sidecars


akanso commented Jul 12, 2022

At Microsoft we use Azure Kubernetes Service, with AKS container log collection to Log Analytics enabled through Azure Monitor for telemetry and logs.

Using Fluentd is a great idea for users who are not already leveraging a cloud solution to handle logs. However, it seems to be outside the scope of KubeRay, and more of a best-practice design pattern when using KubeRay.

@DmitriGekhtman it seems we have a few design patterns among KubeRay users; does it make sense to reserve a directory in the repo for documenting best practices?

@DmitriGekhtman

@akanso that is a great idea.

In general, detailed documentation with lots of examples is what we want!
I'll spend some time in the coming weeks fleshing out the docs.

Jeffwan added this to the v0.4.0 release milestone Jul 27, 2022

mihajenko commented Aug 8, 2022

👍 for this.

@simon-mo

One approach is to just tail -f the /tmp/ray/session_latest/logs directory as the pod cmd instead of sleep. This might work but it loses all the metadata.

This doesn't work. tail and similar off-the-shelf programs have limitations: they don't detect new files appearing in the path, the way promtail does.

❓ Can this sidecar container already be added using the RayCluster spec? At the very least, it seems possible to create a new worker type and hack the Helm spec to simulate the sidecar pattern with Pod affinity rules (to the head).

Since this exact effort is on my team's plate as well, what is the timeframe for completing this feature?


DmitriGekhtman commented Aug 8, 2022

It can already be added using the RayCluster spec.

I'm working on a PR which documents how to do this
ray-project/ray#27607

To make this possible to do with Helm, one option is to expose sidecar container specs in values.yaml.

As discussed further up in this thread, it's also possible to extract logs by running a log-processing daemonset and mounting the Ray containers' /tmp/ray directory to the relevant hostPath. (This kind of approach has the advantage that you don't need to modify the RayCluster CR, and the disadvantage that you have to set up some node-level logging infra.)

To avoid imposing any particular logging solution, we are documenting strategies rather than modifying the operator to set up logging automatically.
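For anyone who needs this before the docs land, a rough sketch of where the extra sidecar container slots into the RayCluster CR (the same fluentd idea as earlier in the thread; the apiVersion is the current alpha CRD and all other values are illustrative):

```yaml
# Rough sketch: the sidecar is just an extra container in the head group's
# pod template (the same pattern applies to workerGroupSpecs).
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-with-log-sidecar
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
          - name: fluentd                    # logging sidecar, as discussed above
            image: fluent/fluentd:v1.16-1    # image tag is illustrative
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
                readOnly: true
        volumes:
          - name: ray-logs
            emptyDir: {}
```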
