Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update TROUBLESHOOTING.md with JFR recommendations #43092

Merged
merged 1 commit into from
Sep 30, 2024

Conversation

gsmet
Copy link
Member

@gsmet gsmet commented Sep 6, 2024

@franz1981 I updated this doc as it's probably better to have it properly documented rather than "ask Franz or Roberto" :)

Note that I haven't updated the first part with the profiler.sh shell script as it's not something I use. But if you have the info, maybe it's worth updating it too?
If you can provide me with a recommended command line, I can update it too.

@geoand
Copy link
Contributor

geoand commented Sep 10, 2024

cc @radcortez

@franz1981
Copy link
Contributor

Thanks @geoand dealing with the regression and I will come back to this


# profile allocation startup
java -agentpath:/path/to/async-profiler/build/libasyncProfiler.so=start,event=alloc,file=/tmp/startup-alloc-profile.html,interval=1000000,simple\
-jar my-application.jar
java -agentpath:${PATH_TO_ASYNC_PROFILER}/lib/libasyncProfiler.so=start,alloc=1,total,event=alloc,file=startup-alloc-profile.jfr -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx1G -Xms1G -XX:+AlwaysPreTouch -jar target/quarkus-app/quarkus-run.jar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need for -XX:+AlwaysPreTouch and both -XX:+UseEpsilonGC or -XX:+UseG1GC - with the latter which improve tremendously the life of users, because doesn't risk to OOM because cannot recycle memory.

I suggest to use a duration, which is realistic enough considered the actual startup time with the tracing enabled (which can be of many seconds i.e. +12 seconds too!)

i.e.

-agentpath:${PATH_TO_ASYNC_PROFILER}/lib/libasyncProfiler.so=start,alloc=1,total,event=alloc,file=startup-alloc-profile.jfr,timeout=20s

This is required to not mix the stop of application allocation events with the startup only: it means too that users shouldn't stop the application before the expected timeout setup.

There's another option instead, using the async profiler executable and using the application's pid, to forcibly stop the profiler, without specifying any timeout configuration.
I can explain that if you prefer this option, which is what we do in the QE startstop test.

Copy link
Member Author

@gsmet gsmet Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment but I have some further comments:

Don't need for -XX:+AlwaysPreTouch and both -XX:+UseEpsilonGC or -XX:+UseG1GC - with the latter which improve tremendously the life of users, because doesn't risk to OOM because cannot recycle memory.

I don't understand what you're saying tbh. This command line was apparently given by you to Roberto. I'm just using it there.

Can you please provide an alternative command if you think it's not good?

Keep in mind that this command is used solely to profile the startup. So I'm not sure what improve tremendously the life of users have to do with it? Here we want to get the best picture of the allocations. That's all we want. So if you have an alternative command that does exactly that, it's very welcome.

I suggest to use a duration, which is realistic enough considered the actual startup time with the tracing enabled (which can be of many seconds i.e. +12 seconds too!)

I don't think that's a good option for this particular purpose. I think we want the control of when we stop the app.
The shutdown of a Quarkus application can't be mixed really, it's very well isolated in the flamegraph so I would rather use something that is deterministic and you can stop when you want, rather than something with a timeout that might not be very practical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not let perfection be the enemy of progress on this. I think that we all agree that what @gsmet has added here is an improvement over the what exists in the guide, so best get it in and @franz1981 or anyone else can further improve when they have the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on this @geoand but I'm just brain dumping before I forgot. If the changes here are not possible, I'll send myself a PR

The point about -XX:+AlwaysPreTouch is that was used for a different purpose (i.e. paying upfront on the RSS side the full cost of a used heap) - and the -XX:+UseEpsilonGC requires users to carefully size the heap or they will get OOM while profiling, which is a PITA.

I hope that explain why I've suggested to remove the first and and switch to G1 (no need for experimental options too, at this poin!) - while keeping -XX:-UseTLAB.

So, TLDR @gsmet

Can you please provide an alternative command if you think it's not good?
Yep, is:

java -agentpath:${PATH_TO_ASYNC_PROFILER}/lib/libasyncProfiler.so=start,alloc=1,total,event=alloc=1,file=startup-alloc-profile.jfr -XX:+UseG1GC -XX:-UseTLAB -Xmx1G -Xms1G -jar target/quarkus-app/quarkus-run.jar

I don't think that's a good option for this particular purpose. I think we want the control of when we stop the app.

In that case, your choice: I personally prefer to stop the profiler manually i.e.
${PATH_TO_ASYNC_PROFILER}/bin/asprof stop -f startup-alloc-profile.jfr <pid of the java app>

But if you prefer instead to have the stop, as you rightly said, it should be clearly visible in the flamegraph, and could just be ignored.


# profile allocation startup
java -agentpath:/path/to/async-profiler/build/libasyncProfiler.so=start,event=alloc,file=/tmp/startup-alloc-profile.html,interval=1000000,simple\
-jar my-application.jar
java -agentpath:${PATH_TO_ASYNC_PROFILER}/lib/libasyncProfiler.so=start,alloc=1,total,event=alloc,file=startup-alloc-profile.jfr -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx1G -Xms1G -XX:+AlwaysPreTouch -jar target/quarkus-app/quarkus-run.jar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add the reference to this comment from Andrei P. at async-profiler/async-profiler#583 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather keep it simple and not point to comments that will make things more complicated for people not familiar with this.
We just want a command that allows to troubleshoot for people who don't know how to do it.


# profile allocation startup
mvn quarkus:dev -Djvm.args="-agentpath:/path/to/async-profiler/build/libasyncProfiler.so=start,event=alloc,file=/tmp/startup-alloc-profile.html,interval=1000000,simple"
mvn quarkus:dev -Djvm.args="-agentpath:${PATH_TO_ASYNC_PROFILER}/lib/libasyncProfiler.so=start,alloc=1,total,event=alloc,file=startup-alloc-profile.jfr -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx1G -Xms1G -XX:+AlwaysPreTouch"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto as before

Copy link
Contributor

@franz1981 franz1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM is up to @gsmet address the changes; I'm happy regardless, cause is an improvement nonetheless 👍

well done!

@geoand geoand merged commit 4bfb981 into quarkusio:main Sep 30, 2024
3 checks passed
@quarkus-bot quarkus-bot bot added this to the 3.16 - main milestone Sep 30, 2024
@gsmet
Copy link
Member Author

gsmet commented Sep 30, 2024

FWIW, I was open to suggestions but somehow forgot about this PR :).

So feel free to create a follow-up PR if needed. Let's just make sure things are simple enough for people not accustomed to performance work to be able to collect data as it's the primary purpose of this document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants