
Places to potentially find startup performance improvements #45436

Open
geoand opened this issue Jan 8, 2025 · 35 comments
Labels
kind/enhancement New feature or request

Comments

@geoand
Contributor

geoand commented Jan 8, 2025

Description

These ideas come from looking at this flamegraph

@geoand geoand added the kind/enhancement New feature or request label Jan 8, 2025
@franz1981
Contributor

@geoand I would add #43022 as well
since, with 2 or more event loops, the "new" concurrency allowed by the class loader makes it load more byte[] arrays and then fail on definition later during class loading. This impacts both memory and time.

@geoand
Contributor Author

geoand commented Jan 8, 2025

ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

@dmlloyd I wanted to move the relevant fields into a holder, but the need to support serialization prevents this approach.

@franz1981
Contributor

franz1981 commented Jan 8, 2025

Another (relatively) "big" one, @geoand, is this -> smallrye/smallrye-common#370

This one is very easy: just use a 64-byte cache line as other well-known projects do, e.g. https://github.com/ziglang/zig/blob/master/lib/std/atomic.zig#L429 (which for both ARM and x86 uses 2 * 64 bytes due to jbossas/jboss-threads#191).

On Linux we are currently reading from a proc file, performing a syscall, parsing the result, and creating garbage along the way. It looks like an easy, low-risk win.
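For illustration, hardcoding the value could look roughly like this (a hypothetical sketch, not the actual smallrye-common API):

```java
// Hypothetical sketch: replace the /proc parsing with a hardcoded constant,
// the approach the linked Zig code takes. 64 bytes is the data cache line
// size on essentially all current x86-64 and aarch64 hardware; doubling it
// guards against adjacent-cache-line prefetch (see jbossas/jboss-threads#191).
public final class CacheLineSize {
    public static final int CACHE_LINE = 64;
    public static final int PADDING = 2 * CACHE_LINE; // 128 bytes

    private CacheLineSize() {
    }

    public static void main(String[] args) {
        System.out.println(PADDING); // prints 128
    }
}
```

No syscall, no parsing, no garbage; the trade-off is that exotic hardware with a different cache-line size would get suboptimal (but still correct) padding.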

@geoand
Contributor Author

geoand commented Jan 8, 2025

Just for posterity, the comment above should be handled by jbossas/jboss-threads#216

@radcortez
Member

It also has to be handled here:
https://github.com/smallrye/smallrye-common/blob/e43dfdf9b512185fd8c96ad720fcd4070ff12030/cpu/src/main/java/io/smallrye/common/cpu/CacheInfo.java#L80-L155

I did a quick return with the expected bytes, and on my Mac I noticed around 20 ms improvement in the JVM startup time.

@radcortez
Member

  • ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

What's the plan with HostName? After #41664, I'm getting a 2-3 ms increase in the native image startup time.

@geoand
Contributor Author

geoand commented Jan 10, 2025

It also has to be handled here: https://github.com/smallrye/smallrye-common/blob/e43dfdf9b512185fd8c96ad720fcd4070ff12030/cpu/src/main/java/io/smallrye/common/cpu/CacheInfo.java#L80-L155

I did a quick return with the expected bytes, and on my Mac I noticed around 20 ms improvement in the JVM startup time.

With the change from @franz1981, I think that CacheInfo won't be used/loaded at all in Quarkus. Did you see otherwise @radcortez ?

@geoand
Contributor Author

geoand commented Jan 10, 2025

  • ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

What's the plan with HostName? After #41664, I'm getting a 2-3 ms increase in the native image startup time.

I was hoping @dmlloyd has an idea as my original idea won't work.

@Sanne
Member

Sanne commented Jan 10, 2025

Offload Vertx / Netty class loading to another thread (like we do for DefaultChannelId)

Regarding this, we also do something to asynchronously initialize the JPA components. Should we have a unified/consistent approach? Just flagging it for awareness.

@geoand
Contributor Author

geoand commented Jan 10, 2025

Regarding this, we also do something to asynchronously initialize the JPA components. Should we have a unified/consistent approach? Just flagging it for awareness.

Right, that thought did cross my mind

@dmlloyd
Member

dmlloyd commented Jan 10, 2025

  • ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

What's the plan with HostName? After #41664, I'm getting a 2-3 ms increase in the native image startup time.

2-3 ms for host name or for process name, or both? That seems like a pretty crazy number. Perhaps for host name we should track the time and print a warning if it seems like DNS may be misconfigured (check your /etc/hosts). The user also has the option of setting the HOSTNAME environment variable, which bypasses some of the work.

@geoand
Contributor Author

geoand commented Jan 14, 2025

application.properties / META-INF/microprofile-config.properties reading from classpath

@radcortez do you plan on taking this one on soon? Asking because it definitely has a measurable impact.

@geoand
Contributor Author

geoand commented Jan 14, 2025

ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

@dmlloyd any ideas about tackling this one?

@radcortez
Member

application.properties / META-INF/microprofile-config.properties reading from classpath

@radcortez do you plan on taking this one on soon? Asking because it definitely has a measurable impact.

Yes, but only next week.

@geoand
Contributor Author

geoand commented Jan 14, 2025

🙏

@dmlloyd
Member

dmlloyd commented Jan 15, 2025

ExtLogRecord looking up HostName and Process related data that aren't needed by the default logging pattern

@dmlloyd any ideas about tackling this one?

I did mention this:

What's the plan with HostName? After #41664, I'm getting a 2-3 ms increase in the native image startup time.

2-3 ms for host name or for process name, or both? That seems like a pretty crazy number. Perhaps for host name we should track the time and print a warning if it seems like DNS may be misconfigured (check your /etc/hosts). The user also has the option of setting the HOSTNAME environment variable, which bypasses some of the work.

Another idea is to add access to the gethostname system call in Java 22/23/24+ using FFM which might be fast (it does rely on method handle setup though, and a couple of other things).
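A rough sketch of what that FFM route could look like (Java 22+, a libc that exports gethostname, i.e. Linux/macOS with 64-bit size_t; this is an assumption-laden illustration, not a proposed patch):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Sketch of calling gethostname(2) directly via the FFM API (final in Java 22),
// skipping the reverse-DNS work InetAddress.getLocalHost() can trigger.
public final class NativeHostName {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // C signature: int gethostname(char *name, size_t len)
        MethodHandle gethostname = linker.downcallHandle(
                linker.defaultLookup().find("gethostname").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buf = arena.allocate(256);
            int rc = (int) gethostname.invokeExact(buf, 256L);
            if (rc != 0) {
                throw new RuntimeException("gethostname failed");
            }
            System.out.println(buf.getString(0)); // reads the NUL-terminated name
        }
    }
}
```

As noted above, the downcall handle setup itself has a one-time cost (and, being a restricted method, it warns unless run with --enable-native-access), so whether this is a net win would need measuring.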

@geoand
Contributor Author

geoand commented Jan 16, 2025

But if I'm not mistaken, the default Quarkus logging pattern does not use the hostname (or any process info), so wouldn't it be better if we didn't try to load it eagerly?
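To illustrate the lazy direction (purely a hypothetical sketch, not ExtLogRecord's actual code), the class-holder idiom defers the lookup until the first format that actually needs the value triggers class initialization:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical lazy-holder sketch: the hostname lookup only runs the first
// time hostName() is called, because the JVM does not initialize the nested
// Holder class until Holder.HOST_NAME is first read.
public final class LazyHostName {
    private static final class Holder {
        static final String HOST_NAME = lookup();

        private static String lookup() {
            String env = System.getenv("HOSTNAME"); // honor the env-var override
            if (env != null && !env.isEmpty()) {
                return env;
            }
            try {
                return InetAddress.getLocalHost().getHostName();
            } catch (UnknownHostException e) {
                return "localhost";
            }
        }
    }

    public static String hostName() {
        return Holder.HOST_NAME; // first call pays the cost; later calls are free
    }

    public static void main(String[] args) {
        System.out.println(hostName());
    }
}
```

The serialization constraint mentioned earlier in the thread is exactly what makes moving the ExtLogRecord fields into such a holder non-trivial, so this only shows the general shape of the idea.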

@dmlloyd
Member

dmlloyd commented Jan 16, 2025

That is an option. But I think it would be best to understand why it is slow and see if it can be fixed before jumping to that.

@geoand
Contributor Author

geoand commented Jan 16, 2025

On my machine (Linux), it definitely does not take nearly as long as it does for @radcortez, but the calls to both HostName and Process are noticeable:

[flamegraph screenshot]

Looking at the flamegraph and the code, there don't seem to be any obvious wins for my case.
Sure, setting the system properties would avoid the hostname cost, but I don't see why one should have to do that. Getting rid of the method references might also help a bit, but we are still paying the (admittedly small) cost for something that in most cases will not be used.

@radcortez
Member

I don't see anything unusual in my /etc/hosts. The interesting thing is that the change has been available since Quarkus 3.8.6, and I've measured the time with 3.9, 3.10, 3.11, and 3.12 and didn't notice any specific degradation coming from the Hostname. Setting the HOSTNAME env var didn't have any impact either.

Starting with 3.13, including the current main, I notice the startup degradation, and setting HOSTNAME does make it go away. I used the same terminal, same settings, same JVM.

Looking into the flamegraphs, I didn't notice anything unusual. Maybe it is something that only applies to native mode?

@gsmet
Member

gsmet commented Jan 21, 2025

This is odd; maybe if you provided two binaries, someone from the Mandrel team could help figure out what's going on?

If you can pinpoint the exact version, I can also check if something rings a bell.

@geoand
Contributor Author

geoand commented Jan 21, 2025

Offload Vertx / Netty class loading to another thread (like we do for DefaultChannelId)

Regarding this, we also do something to asynchronously initialize the JPA components. Should we have a unified/consistent approach? Just flagging it for awareness.

I didn't do anything unified, but this test does yield something like a 5 ms improvement in JVM startup time for me; however, I am not sure it's worth the memory cost of starting new threads.
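For reference, a generic version of the offloading trick could look roughly like this (a hypothetical helper, not the branch's actual code):

```java
// Hypothetical sketch of offloading expensive static initializers to a
// background thread, in the spirit of what is done for Netty's DefaultChannelId:
// the main thread keeps booting while class initialization runs in parallel.
public final class BackgroundPreloader {
    public static Thread preload(ClassLoader cl, String... classNames) {
        Thread t = new Thread(() -> {
            for (String name : classNames) {
                try {
                    // initialize = true runs the static initializer off the main thread
                    Class.forName(name, true, cl);
                } catch (ClassNotFoundException | LinkageError ignored) {
                    // best effort: skip classes that are absent or fail to link
                }
            }
        }, "class-preloader");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = preload(BackgroundPreloader.class.getClassLoader(), "java.util.UUID");
        t.join();
        System.out.println("preloaded");
    }
}
```

The thread (and its stack) is exactly the memory cost being weighed above against the ~5 ms saved.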

@radcortez
Member

This is odd; maybe if you provided two binaries, someone from the Mandrel team could help figure out what's going on?

If you can pinpoint the exact version, I can also check if something rings a bell.

This is just using config-quickstart and updating the version. I notice the increased degradation moving from 3.12.0 to 3.13.0. I didn't check specific fix versions.

@radcortez
Member

radcortez commented Jan 21, 2025

Another thing that I've noticed is that previous versions (<= 3.12.0) were more stable in startup time. Startup time would measure between 18-20 ms, with an occasional 16 ms.

After that, the range got wider, 18-24 ms, and startup time was less consistent, tending toward the higher side of the bracket.

@radcortez
Member

After some digging, I guess the reason it didn't happen earlier was that the Wildfly Common class was not used anywhere, and no reinitialization was required. Once it was added by #41664, it requires that extra work which adds 3-4 ms startup time. Also, notice that there is a substitution for the native image: https://github.com/smallrye/smallrye-common/blob/062dec69c655fc49c60ecc20283fed3f8fc1c312/net/src/main/java/io/smallrye/common/net/Substitutions.java#L20-L104

I did measure that code, and it accounts for the 3-4 ms that I'm observing in the startup time. I chatted with @geoand, and apparently he doesn't notice such a delay, so maybe it is something specific to Mac? Strange, because the code should be the same for both.

I also tried to measure the cost without substitution, and it is mostly the same.

@geoand
Contributor Author

geoand commented Jan 28, 2025

On a slightly different note, I tried a little experiment where I turned all the jars used by the RunnerClassLoader into normal directories and read those instead of the jars when the application starts.
The change has no effect whatsoever on startup time and only shed a few MBs off of startup RSS.

I just wanted to write that down, since it's something I've always wanted to check to see whether it's worth pursuing or not - clearly it isn't.

@Postremus
Member

@geoand Out of curiosity, do you have a branch I could try out on my Windows machine? I believe the zip filesystem showed up in my startup profiling.

@geoand
Contributor Author

geoand commented Jan 28, 2025

@Postremus https://github.com/geoand/quarkus/tree/fastjar-dirs. That only contains the read part. I didn't make any changes to the write part, so you need to manually unzip the jars in lib/main, quarkus/generated-bytecode, quarkus/transformed-bytecode and app

@Postremus
Member

@geoand I tested the branch. Also found no measurable difference in the startup time compared to 3.18.0.

@geoand
Contributor Author

geoand commented Jan 28, 2025

Thanks a lot for checking @Postremus !

@gsmet
Member

gsmet commented Jan 28, 2025

If it’s not slower but has lower RSS requirements, isn’t it a win?

My experience is that the CEN (central directory) fields of zip files consume quite a lot of memory.

@franz1981
Contributor

franz1981 commented Jan 28, 2025

Let's see the flame graphs @geoand made and I can suggest something ;)

If it’s not slower but has lower RSS requirements, isn’t it a win?

Here we have a mix of factors in place:

  1. No zipping means fewer CPU cycles spent on that.
  2. It means a worse compression ratio (N vs 0), so we have to read more, and depending on "disk speed" we could be limited there instead.
  3. It "could" be an RSS win, since gzip is awful in terms of memory usage: it keeps calling malloc to allocate small buffers and/or bigger ones for reading (you can run async-profiler with -e malloc and see how many allocations you get there - not fun).

The problem here is how to measure startup; my suggestion: use a container.
Containers usually force reading the full disk data for real; this is necessary to observe the real cost (of reading more, too).
There's a trick on Fedora to drop the OS page caches and sync before starting (via sync; echo 1 > /proc/sys/vm/drop_caches), which works similarly. Don't trust restarting the Quarkus process over and over, since that won't measure what you believe it measures: the disk is barely touched, because Linux buffers and caches the previously read data, saving the I/O.
This last point can easily be verified with tools like sar, iotop, or plain vmstat, which show the amount of cached data in the OS page cache.

@geoand
Contributor Author

geoand commented Jan 29, 2025

If it’s not slower but has lower RSS requirements, isn’t it a win?

The win was only a few MB, so I wouldn't risk changing the format for such a small win.

Let's see the flame graphs @geoand made and I can suggest something ;)

You can find one here

@Sanne
Member

Sanne commented Jan 30, 2025

The win was only a few MB, so I wouldn't risk changing the format for such a small win.

A "few MB" is quite a lot, relatively speaking, for small applications. Especially if it comes with no functional drawbacks, I'd take that.

@geoand
Contributor Author

geoand commented Jan 30, 2025

especially if it comes with no functional drawbacks I'd take that.

I don't think there would be any, but unless it was put through the gauntlet of tests, we wouldn't know. I could complete the draft when I have time and see what happens.

7 participants