
High CPU use due to new thread created on every websocket message in DispatchedMessageSink #6566

nezoat opened this issue Aug 1, 2021 · 28 comments · Fixed by #6585
Status: Closed
Labels: Bug (For general bugs on Jetty side)


nezoat commented Aug 1, 2021

In websocket-jetty-server 11.0.6, the accept method of the class "org.eclipse.jetty.websocket.core.internal.messages.DispatchedMessageSink" starts a new Thread (NOT using a thread pool). In my app this causes a new thread to be created for every WebSocket message (not per connection). As a result, I am seeing about 4x total app CPU usage. As a workaround, I updated the code to use a thread pool and now see low CPU usage again.

nezoat added the Bug (For general bugs on Jetty side) label on Aug 1, 2021
nezoat (Author) commented Aug 1, 2021

Well, the high CPU issue is back for me even with the thread pool modification, so this may not be a Jetty issue, although I still question why the class above creates a new thread instead of using a thread pool.

joakime (Contributor) commented Aug 1, 2021

If you are using java.io stream-based reads (e.g. InputStream, Reader), then per spec we are required to spawn a new thread for each incoming message.

That style of read is for VERY long-lived messages of undetermined length: think gaming or video chat.

It is inappropriate for short messages. (How short is short? Well, that depends on a huge number of factors: your hardware, memory, OS, network, concurrent connections, level of load, etc. But anything that can be sent in under 5 minutes is a metric I personally use.)
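To make the distinction concrete, here is a minimal sketch of the two annotated-endpoint styles being discussed, using the Jetty native WebSocket API that appears later in this thread; the class names are hypothetical, and the exact accepted signatures should be checked against the Jetty 11 javadoc:

    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
    import org.eclipse.jetty.websocket.api.annotations.WebSocket;

    public class MessageStyles {

        // Streaming style: suited to very long-lived messages of unknown length,
        // but each incoming message is handled on its own dispatched thread.
        @WebSocket
        public static class StreamingSocket {
            @OnWebSocketMessage
            public void onMessage(InputStream in) {
                // consume the (potentially long-lived) stream here
            }
        }

        // Whole-message style: the payload arrives in a single callback,
        // so no per-message thread dispatch is required.
        @WebSocket
        public static class WholeMessageSocket {
            @OnWebSocketMessage
            public void onMessage(ByteBuffer payload) {
                // handle the complete message here
            }
        }
    }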

nezoat (Author) commented Aug 1, 2021

Thank you so very much for taking the time to respond and explain, really. I had no idea WebSocket.onMessage(InputStream) vs non-stream methods made such a difference in implementation. I am now using onMessage(ByteBuffer) and this has solved the problem of creating a new thread for every message. I am still, unfortunately, having the issue of very high CPU since switching from Jetty 9 (15% cpu) to Jetty 11 (60% cpu), so the hunt continues, but presumably not related to this topic. Thanks again.

nezoat closed this as completed Aug 1, 2021
gregw (Contributor) commented Aug 2, 2021

Was the only change Jetty 9 -> 11?
Taking multiple thread dumps at short intervals whilst you are in high CPU can be a rudimentary way to find busy threads.
Better still, if you inspect the contents of /proc/$PID/task/*/stat you can find the actual thread that is burning CPU and match it to one in the thread dump.
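As a small aid for that matching step: on Linux HotSpot JVMs, the decimal thread id listed under /proc/<pid>/task/ typically corresponds to the hex nid=0x... value printed in a jstack thread dump, so a quick conversion identifies the busy thread. A minimal sketch (class name is hypothetical):

    // Hypothetical helper: convert a decimal tid (e.g. from /proc/<pid>/task/<tid>/stat)
    // into the hex form that a jstack thread dump prints as nid=0x...
    public class TidToNid {
        public static void main(String[] args) {
            long tid = Long.parseLong(args[0]); // the busy task id observed under /proc
            System.out.println("nid=0x" + Long.toHexString(tid));
        }
    }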

nezoat (Author) commented Aug 2, 2021

The change was Jetty 9 -> 11, but also from non-embedded to embedded. I need to have embedded, so I'm now in the process of downgrading from 11 to 9 embedded. At this point I'm trying to maintain the performance of 9.4 while changing only to embedded.

nezoat (Author) commented Aug 2, 2021

Also, thank you for the tips on dumping thread info! That's gold.

nezoat (Author) commented Aug 2, 2021

The only change made now is from Jetty 11 embedded to 9.4 embedded, and CPU went from 60% to 15%. Something is going on, I have no idea what. Here is the server startup code that is used:

            Server server = new Server(new QueuedThreadPool(500,30,99999));
            HttpConfiguration httpConfig = new HttpConfiguration();
            httpConfig.setIdleTimeout((Command.clientDataSyncIntervalSeconds + 30) * 1000);
            httpConfig.setSendServerVersion(false);
            httpConfig.setSendServerVersion(false);
            httpConfig.setPersistentConnectionsEnabled(true);
            HttpConnectionFactory http11 = new HttpConnectionFactory(httpConfig);
            //supports translating client IP address when behind load balancers
            ProxyConnectionFactory proxy = new ProxyConnectionFactory(http11.getProtocol());

            ServerConnector connector = new ServerConnector(server, proxy, http11);
            connector.setPort(8080);
            server.addConnector(connector);

            var servletHandler = new ServletContextHandler(server,"/");
            servletHandler.setBaseResource(Resource.newClassPathResource("/web"));
            //JettyWebSocketServletContainerInitializer.configure(servletHandler, null);
            servletHandler.addServlet(MainServlet.class, "").setAsyncSupported(true);
            servletHandler.addServlet(LoginServlet.class, "/login");
            servletHandler.addServlet(StatisticsServlet.class, "/stats");
            servletHandler.addServlet(BlockedServlet.class, "/blocked");
            servletHandler.addServlet(AdminServlet.class, "/admin");
            servletHandler.addServlet(PingServlet.class, "/ping");
            servletHandler.addServlet(MapServlet.class, "/map");
            servletHandler.addServlet(ConsoleLogServlet.class, "/log");
            servletHandler.addServlet(DefaultServlet.class,"/");

            servletHandler.addFilter(RedirectFilter.class, "/*", EnumSet.of(DispatcherType.REQUEST)).setAsyncSupported(true);

            FilterHolder filterJavaMelody = servletHandler.addFilter(MonitoringFilter.class, "/*", EnumSet.of(DispatcherType.REQUEST));
            filterJavaMelody.setAsyncSupported(true);
            filterJavaMelody.setInitParameter("system-actions-enabled","false");
            filterJavaMelody.setInitParameter("disable","false");
            filterJavaMelody.setInitParameter("gzip-compression-disabled","true");
            filterJavaMelody.setInitParameter("no-database","true");
            filterJavaMelody.setInitParameter("displayed-counters","http,error,log");

            // Link the context to the server.
            GzipHandler gzipHandler = new GzipHandler();
            gzipHandler.setMinGzipSize(1024);
            gzipHandler.addIncludedMethods("GET");
            gzipHandler.addIncludedMethods("POST");
            gzipHandler.setHandler(servletHandler);
            server.setHandler(gzipHandler);

            initServices();

            server.start();

nezoat (Author) commented Aug 2, 2021

[Screenshots: CPU and GC charts]

This is the before/after. There is a dramatic CPU reduction after dropping back to Jetty 9.4. Nearly all of the consumed CPU was going to the garbage collector.

nezoat (Author) commented Aug 3, 2021

OS: Linux, 4.14.238-182.422.amzn2.aarch64 (2 cores)
Java: OpenJDK Runtime Environment, 15.0.1+9-18
VM args:

-XX:-OmitStackTraceInFastThrow
-XX:+UseZGC
-Xmx9G
-XX:ThreadStackSize=256k

sbordet (Contributor) commented Aug 3, 2021

@nezoat can you also report the heap size?

In the graphs above, at what time did you switch to Jetty 9.4.x?

joakime (Contributor) commented Aug 3, 2021

Just some minor things noticed in your code (which have nothing to do with your issue) ...

            httpConfig.setSendServerVersion(false);
            httpConfig.setSendServerVersion(false);

Did you mean to set that twice?

            httpConfig.setPersistentConnectionsEnabled(true);

This is true by default.
What it means to be true is ...

  • HTTP/1.0 connections will have the response header Connection: keep-alive automatically added (but only if the client doesn't send the Connection request header and the application doesn't set the Connection response header).
  • HTTP/1.1 connections will not have the response header Connection: close automatically added (again, only if the client doesn't send the Connection request header and the application doesn't set the Connection response header).

I mention this detail because it has confused people on Stack Overflow for years into thinking that it forces persistent connections. It merely adds a header on the response to indicate the server's desire for persistence, which the client and/or the application can easily override.

            HttpConnectionFactory http11 = new HttpConnectionFactory(httpConfig);
            //supports translating client IP address when behind load balancers

Sounds like you really want the ForwardedRequestCustomizer here.
Use httpConfig.addCustomizer(new ForwardedRequestCustomizer()); if you think so.
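For context, here is a minimal sketch of how that customizer would slot into the startup code quoted earlier; the helper class name is hypothetical, and only the addCustomizer call is the suggested change:

    import org.eclipse.jetty.server.ForwardedRequestCustomizer;
    import org.eclipse.jetty.server.HttpConfiguration;

    class HttpConfigFactory {
        // Builds the HttpConfiguration used above, with ForwardedRequestCustomizer
        // added so X-Forwarded-For / Forwarded headers from the load balancer are
        // translated into the reported client address.
        static HttpConfiguration create(long idleTimeoutMillis) {
            HttpConfiguration httpConfig = new HttpConfiguration();
            httpConfig.setIdleTimeout(idleTimeoutMillis);
            httpConfig.setSendServerVersion(false);
            httpConfig.addCustomizer(new ForwardedRequestCustomizer());
            return httpConfig;
        }
    }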

nezoat (Author) commented Aug 3, 2021

@nezoat can you also report the heap size?

In the graphs above, at what time did you switch to Jetty 9.4.x?

Heap is 9G. The switch from 11 --> 9.4.x on the CPU graph is around 18:20, and on the GC graph it's just past 22:00 hrs.

sbordet (Contributor) commented Aug 3, 2021

Heap is 9G. The switch from 11-->9.4.x on the CPU graph is around 18:20, and on the GC graph its just past 22:00 hrs.

Not sure I understand.

Are you saying that the switch happened at 18:20 on the CPU graph, but at 22:00 on the GC graph?
So the two graphs are unrelated, meaning not from the same run?

What load are you running?
Is it a load test? Is it a real application?
Are both 9.4.x and 11 run with the same JVM vendor/version/configuration (especially with the same GC)?

We don't see this high CPU in our load tests.

What's your application doing?
I ask because, from the GC profile, it seems that it incrementally puts the GC in trouble, as the pauses get longer and longer.

nezoat (Author) commented Aug 3, 2021

@joakime Thank you very much, yes I think ForwardedRequestCustomizer() is what I was looking for. My server code is still checking X-FORWARDED-FOR headers if present so I had not yet followed up to see if the Proxy code was handling this for me, but it appears ForwardedRequestCustomizer is what I actually need. I've made the other code adjustments based on your comments as well.

nezoat (Author) commented Aug 3, 2021

@sbordet Sorry for the confusion, I should have clarified that the graphs are from different apps, but they are of the same run; the correlation of the CPU dropping and the GC going to nearly zero is still correct. The CPU graph is from an Amazon EC2 CloudWatch dashboard, shows my local time, and captures aggregate data for all three servers in a cluster. The green GC chart is from JavaMelody, is in UTC-0 time, and is from only one server directly. 1800 on the CPU chart is the same as 2200 on the GC chart. I made the downgrade from 11 to 9 on each server, which takes about 10 minutes for the reboot on each, so the CPU graph has a couple of bumps from reboots before it settles down to a smooth 15% average across all servers.

I should also note that I tried changing the GC from ZGC to Shenandoah and still have high CPU issues with Jetty 11.

joakime (Contributor) commented Aug 3, 2021

@nezoat did you actually do the required code change when going from Jetty 9.4.x to Jetty 11.x? (due to the Jakarta "Big Bang")

nezoat (Author) commented Aug 3, 2021

The app handles the server side of a chess app: syncing moves between players, profile data, lists of games, etc. Messages are all simple JSON requests (HTTP POST for non-WebSocket clients, WebSocket for all others). The load in the graphs is from the live production servers. Each of the three servers handles around 2500 requests per minute.

Yes, the GC load increases over time, but that's partly due to the daily cycle of load increasing toward the peak time of day. The CPU load was fine with Jetty 11 during low traffic, but it seemed to get exponentially worse as traffic picked up and then leveled off to a steady state around 60%. With Jetty 9, GC stays below 2% all the time.

Just to let you know, I am OK with staying on Jetty 9 for a couple more years; I have no real reason to go to Jetty 11 right now, so there's no need to solve this problem for me, but it does seem to be some kind of issue between GC and Jetty 11. Maybe it's related to the reduced thread stack I'm using (256K), or the ARM64-based Linux on the Amazon servers. Maybe it's an edge case that isn't showing up in your normal load tests.

nezoat (Author) commented Aug 3, 2021

[Screenshots: updated GC and CPU charts]

Just to help clarify, here are updated charts from just now, from the same server (in production), showing the drop in GC and CPU.

nezoat (Author) commented Aug 3, 2021

And this is the basics of the MainServlet where the WebSocket upgrade occurs:

    public class MainServlet extends WebSocketServlet {

        @Override
        public void configure(WebSocketServletFactory factory) {
            factory.register(InnerWebSocket.class);
        }

        @WebSocket
        public static class InnerWebSocket {

            @OnWebSocketMessage
            public void onMessage(byte[] b, int offset, int length) {
                // ...request is read here
            }
        } // end inner class

    } // end servlet

nezoat (Author) commented Aug 3, 2021

@nezoat did you actually do the required code change when going from Jetty 9.4.x to Jetty 11.x? (due to the Jakarta "Big Bang")

Yes, all my servlets were updated with base class and method parameters due to the Jakarta namespace change for Jetty 11, and then switched back to javax for Jetty 9.4.

sbordet (Contributor) commented Aug 3, 2021

@nezoat we need more information.
If you can reproduce this, it would be great if you could profile the CPU/allocation with AsyncProfiler and attach the flame graphs to this issue.
Or send us a minimal reproducer so that we can do it.

nezoat (Author) commented Aug 3, 2021

I'll see if I can get a simple client/server example set up with load to reproduce it.

nezoat (Author) commented Aug 4, 2021

jetty-test.zip

Not a perfect test, but attached are three projects: a client, a Jetty 9 server, and a Jetty 11 server. The servers just echo the data back to the client. The client starts a couple of threads and loops sending 50k random bytes. I ran it long enough to see that the issue does still show up in this test, most noticeably in the GC. The CPU shows it to a small degree, but these charts had the client and server running on the same machine, so much of the CPU is due to the client.

If you haven't seen any problems in your load tests, I'm guessing the issue is related to ZGC and/or Amazon's ARM Linux. This was tested on an Amazon Linux arm64 r6g.large. There is nothing else on the AMI and no changes other than increased open-file limits, which shouldn't affect anything here.

java -jar -XX:+UseZGC -XX:ThreadStackSize=256k jetty-test-server9.jar

[Screenshots: GC and CPU charts from the test run]

The Jetty 11 test was run from about 1800 to 1900, and Jetty 9 immediately after for the rest of the run (about 1900 to 2000).

Also, this is OpenJDK 15.0.1+9-18, 64-bit.

lachlan-roberts added a commit that referenced this issue on Aug 5, 2021:
Issue #6566 - add executor to WebSocketComponents & use for Dispatched Messages
joakime (Contributor) commented Aug 5, 2021

Reopening, as GitHub closed this issue due to the merge of the speculative PR #6585.

joakime reopened this Aug 5, 2021
sbordet (Contributor) commented Aug 6, 2021

@lachlan-roberts for "large messages", say 50000 bytes as in this issue, in both Jetty 9.4 and 10 a 4k network buffer is used.

In Jetty 10 with autoFragment=true (the default), the implementation creates a synthetic frame with a sliced direct ByteBuffer as payload, which is then passed to JettyWebSocketFrameHandler.accept() and then to ByteArrayMessageSink where it is copied multiple times:

  1. from the direct ByteBuffer to a temp byte[] in BufferUtil.writeTo() (first copy)
  2. from the temp byte[] to ByteArrayOutputStream (second copy)
  3. when the last frame arrives, from the ByteArrayOutputStream back to a byte[] that is then passed to the application (third copy) -- note that in Jetty 9.4 this copy is avoided by using ByteArrayOutputStream2.

With autoFragment=false the implementation:

  1. aggregates the chunks of the payload into the Parser.aggregate buffer (first copy)
  2. when the last frame arrives a new synthetic frame with the whole payload is created with the Parser.aggregate buffer as payload (no copy)
  3. the synthetic frame is then passed to ByteArrayMessageSink.accept() where it is copied back into a byte[] (second copy)

So there is one more copy when autoFragment=true.

I think the only necessary copy is the one the implementation must do before calling the application; this is needed to prevent the application from retaining a byte[] that the implementation may pool/recycle (although for advanced usages we may skip this copy too -- this would require a new configuration parameter).

In the autoFragment=true case, we should strive for only 1 copy, either by setting the synthetic frames aside (and demanding more explicitly) and copying them all at the end, or by allocating an aggregate buffer of the exact length, like the Parser does.

In the best case we should do zero copies: read from the network, get the frame with a slice of the payload (or a synthetic frame), and call the application when the callback has the form onMessage(ByteBuffer, boolean).
For signatures like onMessage(byte[], boolean) we have to do 1 copy from ByteBuffer to byte[].

We need 2 copies if we want to guarantee that the application does not mess with our buffers that we may pool/recycle, so we do a second, defensive copy, passing to the application a throw-away byte[].
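As an illustration of the fragment-by-fragment callback shape referenced above, here is a minimal sketch using the standard Jakarta WebSocket API; this is an assumption for illustration, not how the Jetty-native MessageSinks expose it, and the endpoint path and class name are hypothetical:

    import java.nio.ByteBuffer;
    import jakarta.websocket.OnMessage;
    import jakarta.websocket.server.ServerEndpoint;

    // Receiving a binary message in parts lets the implementation hand over
    // payload slices as they arrive, instead of aggregating (and copying)
    // the whole message before the application sees it.
    @ServerEndpoint("/echo")
    public class PartialBinaryEndpoint {

        @OnMessage
        public void onMessage(ByteBuffer part, boolean last) {
            // consume this fragment; 'last' is true on the final part of the message
        }
    }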

gregw (Contributor) commented Aug 6, 2021

Definitely, let's look at reducing copying... but also, 4K is either a massively too-large buffer for many WebSocket messages, or an entirely too-small one for many others. Can we be a bit more dynamic in choosing the size? Or at least make it configurable?
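For anyone who wants to experiment with this today, the existing knob is the WebSocket input buffer size (default 4 KiB). A hedged sketch against the Jetty 9.4-style servlet API used in the MainServlet above; the 64 KiB value is only an example, and Jetty 10/11 expose an equivalent setting on their websocket factory:

    import org.eclipse.jetty.websocket.servlet.WebSocketServlet;
    import org.eclipse.jetty.websocket.servlet.WebSocketServletFactory;

    // Hedged sketch: raise the input buffer size from the 4 KiB default so that
    // larger messages (such as the 50 KB payloads discussed above) need fewer
    // read/aggregation passes. The value chosen here is illustrative only.
    public class TunedMainServlet extends WebSocketServlet {
        @Override
        public void configure(WebSocketServletFactory factory) {
            factory.getPolicy().setInputBufferSize(64 * 1024);
            factory.register(MainServlet.InnerWebSocket.class); // endpoint from the earlier comment
        }
    }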

lachlan-roberts added a commit that referenced this issue on Aug 18, 2021 ("…with their callbacks")
lachlan-roberts added a commit that referenced this issue on Aug 18, 2021
lachlan-roberts added a commit that referenced this issue on Aug 19, 2021 ("…amMessageSinkTest failures")
lachlan-roberts added a commit that referenced this issue on Aug 24, 2021
lachlan-roberts added a commit that referenced this issue on Aug 24, 2021
lachlan-roberts added a commit that referenced this issue on Aug 27, 2021
lachlan-roberts added a commit that referenced this issue on Aug 31, 2021:
Issue #6566 - use the demand API for the websocket MessageSinks
lachlan-roberts (Contributor) commented

By using the websocket-core demand API in the MessageSinks we have reduced the copying for WebSocket whole binary messages. See PR #6635.

@nezoat if you can try out the current 11.0.x branch to see if that resolves your issue then that would be great. Otherwise you could wait to try it out when 10.0.7/11.0.7 are released.

joakime changed the title from "High CPU use due to NEW THREAD created on every websocket message in DispatchedMessageSink" to "High CPU use due to new thread created on every websocket message in DispatchedMessageSink" on Oct 6, 2021
joakime (Contributor) commented Nov 2, 2021

FYI, Jetty 10.0.7 and 11.0.7 were released with this fix.
