Log collection and aggregation #32

Closed
felipemontoya opened this issue Apr 20, 2023 · 8 comments · Fixed by #93

@felipemontoya
Member

During the latest meeting we reviewed @gabor-boros's answer at #26. Most of the missing features already had a ticket covering them, but log collection did not.

The situation is:

  • There are some tools to collect and aggregate logs inside a namespace where Tutor is already installed. Logstash and Vector are common alternatives.
  • For the umbrella portions of the cluster, that is, the charts and pods that run in the global namespace, we do not yet have anything for log collection.

It remains an open question whether we want or need a specific tool for that, and whether the participants of this repo are interested in building one.

On the plus side, we could have a tool that makes handling many instances simpler.
The downside is that we would be splitting the effort that could otherwise go into improving the tools for log collection in an individual namespace.

I personally have not taken a side for any of the options, but we need a place where it can be discussed.

@felipemontoya
Member Author

@Ian2012 I know we are storing logs for some installations that want to start Aspects with some data from before Redwood. Could you please share, in this context, how we are doing that?

@Ian2012
Contributor

Ian2012 commented May 28, 2024

In production, we are using Vector, deployed with a Helm chart, with a sink configuration that saves all the logs to an S3 bucket, splitting them per namespace/kind/application. Would that be a suitable solution for this problem?

Eventually, once Aspects is configured, we can trigger a job that reads the tracking logs from <namespace>/tracking/lms|lms-worker and performs the proper backfill.
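
As a rough illustration of the layout described above, here is a minimal sketch of how a backfill job might pick its input objects. It assumes object keys follow `<namespace>/<kind>/<application>/...` and that only the `tracking` kind for `lms` and `lms-worker` is needed. The function names and sample keys are hypothetical, and listing the bucket itself is deliberately left out.

```python
# Hypothetical helpers: the key layout is an assumption based on the
# namespace/kind/application split described in this comment.
TRACKING_APPS = ("lms", "lms-worker")


def tracking_prefixes(namespace: str) -> list[str]:
    """Return the key prefixes a backfill job would read for one namespace."""
    # The trailing slash keeps "lms/" from also matching "lms-worker/".
    return [f"{namespace}/tracking/{app}/" for app in TRACKING_APPS]


def select_backfill_keys(keys: list[str], namespace: str) -> list[str]:
    """Keep only the object keys that hold tracking logs for the namespace."""
    prefixes = tuple(tracking_prefixes(namespace))
    return sorted(key for key in keys if key.startswith(prefixes))
```

Keeping the prefix logic in one place means the backfill job and the Vector sink configuration only need to agree on a single naming convention.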

@Ian2012
Contributor

Ian2012 commented May 31, 2024

Another solution that I see as feasible is to store the tracking log data in ClickHouse using Vector, which would give us quicker backfills in Aspects and an out-of-the-box backup solution for tracking logs. This is nothing new, as Cairn performs a similar operation by storing all tracking logs in ClickHouse via Vector.

@gabor-boros
Contributor

@bradenmacdonald and @Agrendalath, inviting you to this conversation. I think both solutions could be feasible, though you may have better insights here, especially @Agrendalath, as I know one of your clients is using tracking logs.

@bradenmacdonald
Contributor

@pomegranited might be a better person to ask :) I don't have much insight on this topic.

@pomegranited

Hi @felipemontoya, thank you for starting the discussion! I think we need to define some scope and goals before making technology decisions.

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

What solutions are people currently using? What are their pain points?

There's a lot to consider. But we can totally take cues from @bmtcril's Aspects architecture and integrate with suitable open source third-party tools, rather than writing our own.

@bmtcril

bmtcril commented Jun 18, 2024

FWIW Aspects can store tracking logs in ClickHouse via Vector now, though I'm not sure when the last time was that we tested it.

I definitely agree that having long term, flexible, rotated log storage for both operational and tracking logs (and potentially xAPI logs) is hugely important. I personally wouldn't mind seeing Vector used for that, but I'm sure site operators have much more valuable insight on any pain points with it.

@MoisesGSalas
Contributor

I've seen two common patterns when collecting logs in k8s: a sidecar container that runs alongside the application, and a DaemonSet that runs on every node and mounts /var/log/ from the host.

IIRC Adam Blackwell mentioned that they were using the sidecar approach at 2U.

With @Ian2012, we have tested the DaemonSet approach in a couple of clusters. We installed a global Helm chart for Vector and configured the sinks, sources, and transforms.

We retrieve all the logs from certain pods (i.e., those with the annotation app.kubernetes.io/managed-by=tutor), and Cristhian wrote the transform to extract the tracking logs. We push all the logs to S3.
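
To make the two steps above concrete, here is a minimal Python sketch of the logic, not the actual Vector transform. It assumes tracking events show up in container output as a line containing the `[tracking]` logger name followed by a JSON object; that line format, the shape of `pod_metadata`, and both function names are assumptions made for illustration.

```python
import json
import re

# Assumption: a tracking line contains "[tracking]" followed, later on the
# same line, by a single JSON object that runs to the end of the line.
TRACKING_LINE = re.compile(r"\[tracking\][^{]*(\{.*\})\s*$")
MANAGED_BY_KEY = "app.kubernetes.io/managed-by"


def is_tutor_managed(pod_metadata: dict) -> bool:
    """Check for the managed-by=tutor marker in labels or annotations."""
    for field in ("labels", "annotations"):
        if pod_metadata.get(field, {}).get(MANAGED_BY_KEY) == "tutor":
            return True
    return False


def extract_tracking_event(line: str):
    """Return the parsed tracking event, or None for ordinary log lines."""
    match = TRACKING_LINE.search(line)
    if match is None:
        return None
    try:
        event = json.loads(match.group(1))
    except json.JSONDecodeError:
        # The line looked like a tracking line but the payload was not JSON.
        return None
    return event if isinstance(event, dict) else None
```

Lines that do not parse are dropped from the tracking stream here, but they could just as well be kept as ordinary application logs.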

We also found that this Vector instance can serve multiple purposes: we can extract and push the tracking logs to S3, but we can also push the standard application logs of the Open edX services to CloudWatch, or even push the logs of other services (ingress-nginx, etc.).
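
The fan-out described above can be sketched as a small routing function. This is only an illustration of the decision logic, assuming each record already carries hypothetical `managed_by` and `kind` fields; the destination names are placeholders, not real sink identifiers from our configuration.

```python
def route_log(record: dict) -> str:
    """Pick a destination name for one log record.

    The record fields and destination names are hypothetical.
    """
    if record.get("managed_by") == "tutor":
        if record.get("kind") == "tracking":
            # Tracking logs extracted from the Open edX services.
            return "s3_tracking"
        # Standard application logs of the Open edX services.
        return "cloudwatch_application"
    # Anything else running in the cluster, e.g. ingress-nginx.
    return "other_services"
```

For example, `route_log({"managed_by": "tutor", "kind": "tracking"})` selects the S3 destination, while a record from ingress-nginx falls through to the last branch.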

I think with a similar approach we can eventually cover most of the questions @pomegranited raised:

Is this about general Open edX log collection/aggregation, like for monitoring instance health and investigating incidents? Or is it just about storing tracking logs?

How much of a solution should we provide? If we're providing log collection, do we need parsing, monitoring, dashboards, and alerting too?

@MoisesGSalas MoisesGSalas self-assigned this Jul 23, 2024
@Ian2012 Ian2012 self-assigned this Nov 6, 2024