
Traces not showing in GUI when viewing sessions with a large number of traces #82

Open
andyozj opened this issue Feb 27, 2025 · 9 comments
Labels: bug Something isn't working

Comments

@andyozj

andyozj commented Feb 27, 2025

Good Morning Developers!

I'm self-hosting Langfuse v3 through Helm chart v0.10.2.

I used the default Helm chart dependencies for ClickHouse, MinIO, and Valkey, but an existing Postgres setup.

I've also increased all Deployments/StatefulSets to reasonable resources, e.g. 2 CPU / 8 GB memory.

My developer team has submitted jobs producing sessions with trace counts >20k, and we have difficulty viewing them.

I've set NODE_OPTIONS to --max-old-space-size=12288 for both the Langfuse web and worker pods, each with 4 CPU and 16 GB memory.

The website gets "stuck" at the following status for a while before entering a "loading" state (screenshots attached).

Upon monitoring the logs of the langfuse-web pod, I noticed the following error:

2025-02-27T03:43:09.055Z error  Unable to get sessions.byId Timeout error.
Error: Timeout error.
    at Socket.b (/app/web/.next/server/chunks/6220.js:61:58104)
    at Object.onceWrapper (node:events:632:28)
    at Socket.emit (node:events:530:35)
    at Socket._onTimeout (node:net:595:8)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)
2025-02-27T03:43:09.057Z error  middleware intercepted error with code INTERNAL_SERVER_ERROR unable to get session

Any suggestions or potential resolutions would be appreciated!


dosubot bot commented Feb 27, 2025

Good morning, @andyozj! I'm here to help you with the issue you're facing. Let's work together to find a solution.

To address the issue of traces not showing in the GUI for sessions with a large number of traces, you can consider the following steps:

  1. Optimize ClickHouse Configuration: Ensure that your ClickHouse setup is optimized for handling large datasets. Use a single shard, set a minimum of 3 replicas, and use at least the large resourcesPreset for CPU and memory. Make sure your storage class allows volume expansion to handle the disk-heavy nature of the workload [1].

  2. Redis Configuration: Make sure Redis is configured with maxmemory-policy set to noeviction to prevent queue jobs from being evicted. Allocate sufficient memory for Redis, approximately 1GB for every ~100,000 events per minute [2].

  3. Increase Timeout Settings: The timeout error suggests that the current timeout settings may be too low for handling large sessions. Check and increase the timeout settings in your application and database configuration to allow more time for processing large result sets (see the sketch after this list).

  4. Monitor Resource Usage: Ensure that the resources allocated (CPU and memory) are sufficient for the workload. You have already increased resources, but continue to monitor the usage to see if further adjustments are needed.

  5. Check Network Latency: High network latency can also contribute to timeout errors. Ensure that the network connection between your application and the database is stable and has low latency.

By implementing these optimizations and configurations, you should be able to improve the performance and resolve the timeout issues when viewing sessions with a large number of traces.
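
For point 3, the relevant ClickHouse limits are session settings you can experiment with from clickhouse-client before changing anything server-side. A minimal sketch, with illustrative values rather than tuned recommendations:

```sql
-- Illustrative values; to apply them to Langfuse's HTTP connections
-- permanently, set them in a settings profile for the Langfuse user.
SET max_execution_time = 300;    -- max query runtime in seconds (0 = unlimited)
SET http_send_timeout = 300;     -- socket write timeout for HTTP responses (seconds)
SET http_receive_timeout = 300;  -- socket read timeout for HTTP requests (seconds)
```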

To continue talking to Dosu, mention @dosu.



@dosubot dosubot bot added the bug Something isn't working label Feb 27, 2025
@Steffen911
Contributor

@andyozj Do you see whether any of the Langfuse pods fail with an out-of-memory error or are being killed?
It may be a slow query on ClickHouse; those usually happen when ClickHouse is memory-constrained. Could you increase the container size there and check whether that mitigates the problem?

@andyozj
Author

andyozj commented Feb 27, 2025

@Steffen911, I noticed the Langfuse pods are being restarted, but not due to OOM. I've also tried increasing ClickHouse's memory to 16 GB, to no avail; I'm out of ideas as to what could be causing it.

@Steffen911
Contributor

@andyozj Can you connect to the ClickHouse database and check the error log at /var/log/clickhouse-server/clickhouse-server.err.log within the ClickHouse servers?

@andyozj
Author

andyozj commented Feb 27, 2025

@Steffen911 there is no clickhouse-server.err.log when I run kubectl exec -it services/langfuse-clickhouse -- /bin/bash to access the database pod

@Steffen911
Contributor

@andyozj Hm, can you see anything in the ClickHouse container logs?
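
If the log file is missing, the server may only be logging to stdout. Assuming the text_log system table is enabled in your server config (it is in many default images), recent errors can also be queried from clickhouse-client:

```sql
-- Requires text_log to be enabled; if it isn't, this fails with UNKNOWN_TABLE.
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE level IN ('Fatal', 'Critical', 'Error')
ORDER BY event_time DESC
LIMIT 20;
```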

@andyozj
Author

andyozj commented Feb 28, 2025

@Steffen911 there are errors in the ClickHouse container logs when I retry opening the session:

2025.02.28 01:59:18.317707 [ 890 ] {0c39e8d2-9184-4c0d-9dc4-7e1266376e6f} <Error> DynamicQueryHandler: Cannot send exception to client: Code: 209. DB::NetException: Timeout exceeded while writing to socket (10.112.1.22:36976, 30000 ms). (SOCKET_TIMEOUT), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cf7c73b
1. DB::NetException::NetException<String, long>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<long>::type>, String&&, long&&) @ 0x000000000d0ebe0c
2. DB::WriteBufferFromPocoSocket::socketSendBytes(char const*, unsigned long) @ 0x000000000d0eb928
3. DB::WriteBufferFromHTTPServerResponse::nextImpl() @ 0x000000001295133b
4. DB::HTTPHandler::trySendExceptionToClient(String const&, int, DB::HTTPServerRequest&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&) @ 0x00000000128a972f
5. DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&, StrongTypedef<unsigned long, ProfileEvents::EventTag> const&) @ 0x00000000128ac5c6
6. DB::HTTPServerConnection::run() @ 0x000000001294aa1d
7. Poco::Net::TCPServerConnection::start() @ 0x000000001580b827
8. Poco::Net::TCPServerDispatcher::run() @ 0x000000001580bcb9
9. Poco::PooledThread::run() @ 0x00000000157d8821
10. Poco::ThreadImpl::runnableEntry(void*) @ 0x00000000157d6ddd
11. ? @ 0x00007f606e3ed1c4
12. ? @ 0x00007f606e46cac0
 (version 24.10.2.80 (official build))

@Steffen911
Contributor

@andyozj Can you estimate how many observations there are that relate to those sessions with 20,000+ traces? And what's the total number of scores, traces, and observations within the system?
You should be able to get all four numbers by executing queries against the ClickHouse warehouse.
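
Something along these lines should work; the table and column names assume the default Langfuse v3 ClickHouse schema, so adjust them if your deployment differs:

```sql
-- Note: counts on ReplacingMergeTree tables can be slightly inflated
-- until background merges deduplicate rows.
SELECT count() FROM traces;
SELECT count() FROM observations;
SELECT count() FROM scores;
SELECT uniqExact(session_id) FROM traces WHERE session_id IS NOT NULL;

-- Observations tied to one affected session ('<session-id>' is a placeholder):
SELECT count()
FROM observations
WHERE trace_id IN (SELECT id FROM traces WHERE session_id = '<session-id>');
```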

We're not aware of major performance problems around that endpoint, but I see how your load pattern could cause them. Given the information above, I can try to reproduce them and see if there are further optimizations we can do in the queries. I can say ahead of time that this will be a bit more involved to fix.
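
In the meantime, system.query_log (populated by default in most setups) should show exactly which statement times out:

```sql
-- Slowest recent SELECTs; queries that failed (e.g. with SOCKET_TIMEOUT)
-- appear with type = 'ExceptionWhileProcessing' rather than 'QueryFinish'.
SELECT event_time, type, query_duration_ms, read_rows,
       formatReadableSize(memory_usage) AS peak_memory, query
FROM system.query_log
WHERE query_kind = 'Select' AND type != 'QueryStart'
ORDER BY query_duration_ms DESC
LIMIT 10;
```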

@andyozj
Author

andyozj commented Feb 28, 2025

We have difficulty viewing 4 sessions primarily. These sessions have 67,300, 25,674, 25,674, and 16,000 traces respectively.

I've run the queries and got the following results from the database:

  • Total observations: 3.31 million
  • Total traces: 142,631
  • Total scores: 1.4 million
  • Total sessions: 46

Understood on the expected effort required for this potential fix. Thank you very much!

edit: For now, I've instructed my users to cap each session at a total of 1-2k traces to minimize stress during loading.
Out of curiosity, it's interesting that loading these sessions could cause performance issues around the endpoint. Does that mean the view loads all traces within a session when it is opened, rather than loading N traces initially and fetching more as the session is browsed?
