
Traces not showing in GUI when viewing sessions with a large number of traces #82

Open
andyozj opened this issue Feb 27, 2025 · 9 comments
Labels: bug Something isn't working

Comments

@andyozj

andyozj commented Feb 27, 2025

Good Morning Developers!

I'm self-hosting Langfuse v3 through Helm chart v0.10.2.

I used the default Helm chart dependencies for ClickHouse, MinIO, and Valkey, but an existing Postgres setup.

I've also increased all Deployments/StatefulSets to reasonable resources, e.g. 2 CPU / 8 GB memory.

My developer team has submitted jobs producing sessions with trace counts >20k, and we have difficulty viewing them.

I've set NODE_OPTIONS to --max-old-space-size=12288 for both the Langfuse web and worker pods, each with 4 CPU and 16 GB memory.

The website gets "stuck" at the following status for a while before entering a "loading" state (screenshots attached).

Upon monitoring the logs of the langfuse-web pod, I noticed the following error:

2025-02-27T03:43:09.055Z error  Unable to get sessions.byId Timeout error.
Error: Timeout error.
    at Socket.b (/app/web/.next/server/chunks/6220.js:61:58104)
    at Object.onceWrapper (node:events:632:28)
    at Socket.emit (node:events:530:35)
    at Socket._onTimeout (node:net:595:8)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)
2025-02-27T03:43:09.057Z error  middleware intercepted error with code INTERNAL_SERVER_ERROR unable to get session

Any suggestions or potential resolutions would be appreciated!


dosubot bot commented Feb 27, 2025

Good morning, @andyozj! I'm here to help you with the issue you're facing. Let's work together to find a solution.

To address the issue of traces not showing in the GUI for sessions with a large number of traces, you can consider the following steps:

  1. Optimize ClickHouse Configuration: Ensure that your ClickHouse setup is optimized for handling large datasets. Use a single shard, set a minimum of 3 replicas, and use at least the large resourcesPreset for CPU and memory. Make sure your storage class allows volume expansion to handle the disk-heavy nature of the workload [1].

  2. Redis Configuration: Make sure Redis is configured with maxmemory-policy set to noeviction to prevent queue jobs from being evicted. Allocate sufficient memory for Redis, approximately 1GB for every ~100,000 events per minute [2].

  3. Increase Timeout Settings: The timeout error suggests that the current timeout settings may be too low for handling large sessions. Check and increase the timeout settings in your application and database configuration to allow more time for processing large result sets (see the sketch after this list).

  4. Monitor Resource Usage: Ensure that the resources allocated (CPU and memory) are sufficient for the workload. You have already increased resources, but continue to monitor the usage to see if further adjustments are needed.

  5. Check Network Latency: High network latency can also contribute to timeout errors. Ensure that the network connection between your application and the database is stable and has low latency.

By implementing these optimizations and configurations, you should be able to improve the performance and resolve the timeout issues when viewing sessions with a large number of traces.
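
For point 3, the relevant ClickHouse limits are session settings you can experiment with from clickhouse-client before changing anything server-side. A minimal sketch, with illustrative values rather than tuned recommendations:

```sql
-- Illustrative values; to apply them to Langfuse's HTTP connections
-- permanently, set them in a settings profile for the Langfuse user.
SET max_execution_time = 300;    -- max query runtime in seconds (0 = unlimited)
SET http_send_timeout = 300;     -- socket write timeout for HTTP responses (seconds)
SET http_receive_timeout = 300;  -- socket read timeout for HTTP requests (seconds)
```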

To continue talking to Dosu, mention @dosu.



@dosubot dosubot bot added the bug Something isn't working label Feb 27, 2025
@Steffen911
Contributor

@andyozj Do you see whether any of the Langfuse pods fail with an out-of-memory error or are being killed?
It may be a slow query on ClickHouse; those usually happen when ClickHouse is memory-constrained. Could you increase the container size there and check whether that mitigates the problem?

@andyozj
Author

andyozj commented Feb 27, 2025

@Steffen911, I noticed the Langfuse pods are being restarted, but not due to OOM. I've also tried increasing ClickHouse's memory to 16 GB, to no avail; I'm out of ideas as to what could be causing it.

@Steffen911
Contributor

@andyozj Can you connect to the ClickHouse database and check the error log at /var/log/clickhouse-server/clickhouse-server.err.log within the ClickHouse servers?

@andyozj
Author

andyozj commented Feb 27, 2025

@Steffen911 there is no clickhouse-server.err.log when I run kubectl exec -it services/langfuse-clickhouse -- /bin/bash to access the database pod

@Steffen911
Contributor

@andyozj Hm, can you see anything in the ClickHouse container logs?
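
If the log file is missing, the server may only be logging to stdout. Assuming the text_log system table is enabled in your server config (it is in many default images), recent errors can also be queried from clickhouse-client:

```sql
-- Requires text_log to be enabled; if it isn't, this fails with UNKNOWN_TABLE.
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE level IN ('Fatal', 'Critical', 'Error')
ORDER BY event_time DESC
LIMIT 20;
```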

@andyozj
Author

andyozj commented Feb 28, 2025

@Steffen911 there are errors in the ClickHouse container logs when I retry opening the session:

2025.02.28 01:59:18.317707 [ 890 ] {0c39e8d2-9184-4c0d-9dc4-7e1266376e6f} <Error> DynamicQueryHandler: Cannot send exception to client: Code: 209. DB::NetException: Timeout exceeded while writing to socket (10.112.1.22:36976, 30000 ms). (SOCKET_TIMEOUT), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cf7c73b
1. DB::NetException::NetException<String, long>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<long>::type>, String&&, long&&) @ 0x000000000d0ebe0c
2. DB::WriteBufferFromPocoSocket::socketSendBytes(char const*, unsigned long) @ 0x000000000d0eb928
3. DB::WriteBufferFromHTTPServerResponse::nextImpl() @ 0x000000001295133b
4. DB::HTTPHandler::trySendExceptionToClient(String const&, int, DB::HTTPServerRequest&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&) @ 0x00000000128a972f
5. DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&, StrongTypedef<unsigned long, ProfileEvents::EventTag> const&) @ 0x00000000128ac5c6
6. DB::HTTPServerConnection::run() @ 0x000000001294aa1d
7. Poco::Net::TCPServerConnection::start() @ 0x000000001580b827
8. Poco::Net::TCPServerDispatcher::run() @ 0x000000001580bcb9
9. Poco::PooledThread::run() @ 0x00000000157d8821
10. Poco::ThreadImpl::runnableEntry(void*) @ 0x00000000157d6ddd
11. ? @ 0x00007f606e3ed1c4
12. ? @ 0x00007f606e46cac0
 (version 24.10.2.80 (official build))

@Steffen911
Contributor

@andyozj Can you estimate how many observations there are that relate to those sessions with 20,000+ traces? And what's the total number of scores, traces, and observations within the system?
You should be able to get all four numbers by executing queries against the ClickHouse warehouse.
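
Something along these lines should work; the table and column names assume the default Langfuse v3 ClickHouse schema, so adjust them if your deployment differs:

```sql
-- Note: counts on ReplacingMergeTree tables can be slightly inflated
-- until background merges deduplicate rows.
SELECT count() FROM traces;
SELECT count() FROM observations;
SELECT count() FROM scores;
SELECT uniqExact(session_id) FROM traces WHERE session_id IS NOT NULL;

-- Observations tied to one affected session ('<session-id>' is a placeholder):
SELECT count()
FROM observations
WHERE trace_id IN (SELECT id FROM traces WHERE session_id = '<session-id>');
```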

We're not aware of major performance problems around that endpoint, but I see how your load pattern could cause them. Given the information above, I can try to reproduce them and see if there are further optimizations we can do in the queries. I can say ahead of time that this will be a bit more involved to fix.
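
In the meantime, system.query_log (populated by default in most setups) should show exactly which statement times out:

```sql
-- Slowest recent SELECTs; queries that failed (e.g. with SOCKET_TIMEOUT)
-- appear with type = 'ExceptionWhileProcessing' rather than 'QueryFinish'.
SELECT event_time, type, query_duration_ms, read_rows,
       formatReadableSize(memory_usage) AS peak_memory, query
FROM system.query_log
WHERE query_kind = 'Select' AND type != 'QueryStart'
ORDER BY query_duration_ms DESC
LIMIT 10;
```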

@andyozj
Author

andyozj commented Feb 28, 2025

We have difficulty viewing 4 sessions primarily. These sessions have 67,300, 25,674, 25,674, and 16,000 traces respectively.

I've run the queries and got the following results from the database:

  • Total observations: 3.31 million
  • Total traces: 142,631
  • Total scores: 1.4 million
  • Total sessions: 46

Understood on the expected effort required for this potential fix. Thank you very much!

edit: For now, I've instructed my users to cap each session at a total of 1-2k traces to minimize stress during loading.
Out of curiosity, it's interesting that loading these sessions could cause performance issues around the endpoint. Does that mean the view loads all traces within a session when it is opened, rather than loading N traces initially and fetching more as the session is browsed?
