[Feature] Fluentd Logging Sidecar #353

Closed
2 tasks done
simon-mo opened this issue Jul 7, 2022 · 12 comments
Labels
enhancement New feature or request

Comments


simon-mo commented Jul 7, 2022

Search before asking

  • I searched the issues and found no similar feature request.

Description

Ray head and worker pod logs are not very useful. The application-specific logs are either streamed to drivers (which are ephemeral) or buried in the /tmp/ray/session_latest/logs directory in each pod.

In production deployments, the contents of the logs directory are typically scraped by a log agent and sent to a log aggregation service. For example, we have recommended Loki for Ray Serve users. However, this approach requires specific setup for logging stacks.

Typically, production K8s clusters already have pod log aggregation set up for all pods. This means that if there's a way to print the logs to a pod's stdout, the logs will be automatically picked up and shipped to whatever logging solution the K8s cluster has adopted.

One approach is to just tail -f the /tmp/ray/session_latest/logs directory as the pod cmd instead of sleep. This might work but it loses all the metadata.

I want to propose an alternative: add a new sidecar container to each head and worker pod. The container will

  • run a fluentd agent
  • read the logs directory (via a volume mount)
  • read its configuration from a ConfigMap
  • print each log line annotated with its source file path, as JSONL by default

In this way, the fluentd container will be able to read and annotate the logs properly, and the configured logging agent (Logstash, Loki, Splunk) can pick them up from the pod's stdout.
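As a rough sketch of what I mean (all names, the image tag, and paths below are illustrative assumptions, not a final design):

```yaml
# Sketch only: names, image tag, and paths are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /tmp/ray/session_latest/logs/*
      tag ray
      path_key file             # annotate each record with its source file path
      read_from_head true
      <parse>
        @type none
      </parse>
    </source>
    <match ray.**>
      @type stdout              # emit JSON lines on the sidecar's stdout
      <format>
        @type json
      </format>
    </match>
---
# Relevant part of the head/worker pod template: a shared log volume
# plus the fluentd sidecar reading it.
apiVersion: v1
kind: Pod
metadata:
  name: ray-head-with-fluentd
spec:
  containers:
    - name: ray-head
      image: rayproject/ray:latest
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
    - name: fluentd
      image: fluent/fluentd:v1.16-1      # image tag is illustrative
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc
  volumes:
    - name: ray-logs
      emptyDir: {}
    - name: fluentd-config
      configMap:
        name: fluentd-config
```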

Please let me know what you think! @DmitriGekhtman @Jeffwan

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
simon-mo added the enhancement label Jul 7, 2022
@DmitriGekhtman

Thanks for opening this issue! (Logging is one of many todo items swimming around in the back of my head :) )

Ray supports redirecting logs to stderr with a component prefix (at least in recent Ray versions).

cc @clarkzinzow @rkooo567 who are most familiar with this functionality

Could that remove the need for fluentd to scrape from the log files?
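(For concreteness, the documented way to enable this in recent Ray versions is the RAY_LOG_TO_STDERR=1 environment variable; a minimal sketch of setting it on the Ray container, with illustrative container name and image:)

```yaml
# Minimal sketch: set Ray's documented RAY_LOG_TO_STDERR env var on the Ray
# container so Ray components write to stderr with a component prefix.
containers:
  - name: ray-head                 # illustrative container name
    image: rayproject/ray:latest
    env:
      - name: RAY_LOG_TO_STDERR
        value: "1"
```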


simon-mo commented Jul 7, 2022

No, I don't think the RAY_log_to_stderr flag is enough, unfortunately. It doesn't work with ray start --head, for example.


DmitriGekhtman commented Jul 7, 2022

(Is it a fixable issue that it doesn't work with "ray start --head"?)

Ok, then using fluentd sounds reasonable.

My question then is if the operator should automatically insert the fluentd sidecar and log volume.

Or do we document that setup and leave it to users? (We could update all of the example configs to demonstrate this setup.)

Or (intermediate option) do we add a toggle that allows users to enable/disable automatic configuration of logging sidecars? (This would effectively be a feature flag.)

If we go with automatic configuration, I've seen that some users cannot pull directly from dockerhub, so making the sidecar image configurable would be important.


rkooo567 commented Jul 7, 2022

Based on the description, I feel like this is something we should support in KubeRay. However, I also wonder how other systems handle this (it seems to be a pretty common problem in all systems, iiuc)?


simon-mo commented Jul 7, 2022

Ray is special here because we run multiple processes in the same container and don't output logs to pod stdout. Other Kubernetes-native systems run one process per pod and log to pod stdout, which is automatically aggregated.


Jeffwan commented Jul 8, 2022

At Bytedance, we mount the Ray logs folder /tmp/ray to a fixed hostPath /opt/tiger/containers and configure filebeat rules to collect all Ray logs. In this case, our log daemon runs at the host level.

Using a streaming sidecar container also works. I think we need to make it optional and pluggable, and not raise a failure if the external logging system is not there.
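For illustration, the host-level variant described above might look roughly like the following in the Ray pod spec (the host directory is the one mentioned above; the filebeat daemonset configuration itself is omitted):

```yaml
# Sketch of the host-level collection approach: mount /tmp/ray onto a hostPath
# so a node-level agent (filebeat, promtail, etc.) can scrape the log files.
containers:
  - name: ray-worker
    image: rayproject/ray:latest
    volumeMounts:
      - name: ray-logs
        mountPath: /tmp/ray
volumes:
  - name: ray-logs
    hostPath:
      path: /opt/tiger/containers    # fixed host directory mentioned above
      type: DirectoryOrCreate
```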


DmitriGekhtman commented Jul 8, 2022

Sounds good. We've definitely narrowed the options down to

  • document the setup and leave it to users

or

  • add a toggle (and additional config options) that allows users to enable/disable automatic configuration of logging sidecars


akanso commented Jul 12, 2022

At Microsoft we use Azure Kubernetes Service, with AKS container log collection to Log Analytics enabled through Azure Monitor for telemetry and logs.

Using Fluentd is a great idea for users who are not already leveraging a cloud solution to handle logs. However, it seems to be outside the scope of KubeRay, and more of a best-practice design pattern when using KubeRay.

@DmitriGekhtman it seems we have a few design patterns among KubeRay users; does it make sense to reserve a directory in the repo for documenting best practices?

@DmitriGekhtman

@akanso that is a great idea.

In general, detailed documentation with lots of examples is what we want!
I'll spend some time in the coming weeks fleshing out the docs.

Jeffwan added this to the v0.4.0 release milestone Jul 27, 2022

mihajenko commented Aug 8, 2022

👍 for this.

@simon-mo

One approach is to just tail -f the /tmp/ray/session_latest/logs directory as the pod cmd instead of sleep. This might work but it loses all the metadata.

This doesn't work. tail and similar off-the-shelf programs have limitations: they don't detect new files appearing in the path, the way promtail does.

❓ Can this sidecar container already be added using the RayCluster spec? At the very least, it seems possible to create a new worker type and hack the Helm spec to simulate the sidecar pattern with Pod affinity rules (to the head).

Since this exact effort is on my team's plate as well, what is the timeframe for completing this feature?


DmitriGekhtman commented Aug 8, 2022

It can already be added using the RayCluster spec.

I'm working on a PR which documents how to do this
ray-project/ray#27607

To make this possible to do with Helm, one option is to expose sidecar container specs in values.yaml.

As discussed further up in this thread, it's also possible to extract logs by running a log-processing daemonset and mounting the Ray containers' /tmp/ray directory to the relevant hostPath. (This kind of approach has the advantage that you don't need to modify the RayCluster CR, and the disadvantage that you have to set up some node-level logging infra.)

To avoid imposing any particular logging solution, we are documenting strategies rather than modifying the operator to set up logging automatically.
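For anyone who needs this before the docs land, a rough sketch of where the extra sidecar container slots into the RayCluster CR (the same fluentd idea as earlier in the thread; the apiVersion is the current alpha CRD and all other values are illustrative):

```yaml
# Rough sketch: the sidecar is just an extra container in the head group's
# pod template (the same pattern applies to workerGroupSpecs).
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-with-log-sidecar
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
          - name: fluentd                    # logging sidecar, as discussed above
            image: fluent/fluentd:v1.16-1    # image tag is illustrative
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
                readOnly: true
        volumes:
          - name: ray-logs
            emptyDir: {}
```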
