-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add functionality for an automatic thread dump upon thread pool exhaustion #137
base: 2.4
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in a comment, the pool filling up is not an error condition. But I wonder if maybe we should instead have a periodic deadlock detection task, or something like that, if deadlocks are the problem being detected here?
@@ -1514,6 +1520,9 @@ int tryAllocateThread(final float growthResistance) { | |||
oldSize = currentSizeOf(oldStat); | |||
if (oldSize >= maxSizeOf(oldStat)) { | |||
// max threads already reached | |||
if (reachedMax.compareAndSet(false, true)) { | |||
ThreadDumpUtil.threadDump(maxSizeOf(oldStat)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be unconditional. Thread pools reaching capacity is not an error condition, it just means that the behavior changes from submitting tasks to new or waiting threads, to instead adding the task to a queue that the worker threads read from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should be the condition? This has a boolean to only log on the first occurrence of exhaustion. But if we'd accept such a thread dump utility, I think there should be some further configuration required to enable it for the pool if desired. So this would then prompt another change primarily in XNIO to enable this thread dumping for its task pool where such dumping would be desired. Such a dump would likely not be desired from the MSC thread pool upon its exhaustion for instance.
While the behavior of queuing upon is exhaustion is the expected behavior, the reality for customers using JBoss and support engineers assisting them is that this is universally a critical issue state for the XNIO task thread pool used by undertow as that begins to immediately degrade response times if not lead to full request unresponsiveness as well if already processing requests are not quickly completing. We often see multiple cases a day requiring diagnosis of such thread exhaustion induced by many varied issues and most often not from a conventional deadlock. Further details about this is recorded at https://issues.redhat.com/browse/EAPSUP-1325.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition as you suggest should be a configuration switch, and I'd also propose that there should be either two switches or a single three-state switch to separately configure whether a warning is logged when the thread pool queues, and separately whether the warning includes the thread dump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved this to make it configurable. Let me know any suggestions.
@@ -97,6 +97,25 @@ interface Messages extends BasicLogger { | |||
@LogMessage(level = Logger.Level.ERROR) | |||
void taskSubmitFailed(@Cause RejectedExecutionException e, Runnable task); | |||
|
|||
@Message(id = 15, value = "Thread pool has reached %d max threads in use. Performance may be impacted. Thread dump:\n"+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages probably shouldn't have IDs like this; a thread dump should probably just go directly to the output, or be formatted otherwise like a normal JDK thread dump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, IDs have previously been assigned to auto generated thread dump to logging functions like in eap/wildfly-core itself:
Or in activemq-artemis:
Message ids can be quite helpful to support and I and others use them in simple tooling (https://github.com/aogburn/yala) to quickly search and handle critical known message ids or sum up key information in logging. So for a usability perspective, I'd request that it have an ID.
I think to be truly formatted exactly like a normal JDK thread dump it would probably require being generated through normal JDK utilities, like forking an external process that's calling the JDK jstack command if available (then a dump can be missed if someone has not installed the jdk-devel package or is running a JRE).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then let's make it one big message with one ID instead of multiple. Loggers can accept a StringBuilder
as a string argument, if that is helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to just one message id for the thread dump messages. I did leave one additional id to handle any exception from a dump attempt. Is that suitable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmlloyd Just wanted to check if you'd have any further feedback here? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been ruminating on this. For too long I guess. :) I have a couple more comments.
I'm trying to reconcile the design issues around what it means to exhaust a pool, with the practical correlation between a pool with queued tasks and performance problems. There are many many ways that a thread pool can be expected to work in the presence of a variety of workloads. I'm just not ready to declare that having queued tasks means that we've hit a "knee" in the performance graph in all or even most cases. And yet there is a definite observation that this correlation does exist in the real world, at least in some cases. So, how to resolve the dilemma? I have a couple of ideas.
WDYT? |
I've been following this conversation from the sidelines, figured I'd share my thoughts: When a webserver worker pool becomes fully saturated and queues work, a thread dump is often helpful for debugging. However, there are many scenarios where this executor is used which are designed to expect queueing in order to limit concurrent task execution outside of the webserver use case. I think part of the issue is that we have a specific failure mode in which the queueing predicate is met, where it's helpful to have a thread dump, however we don't the data to understand how often queueing occurs when we're not in a degenerate state that requires debugging. As an aside, if queueing suggests a serious problem, would it be better not to allow any queueing? The queue allows us to linearize spiky workloads at the cost of predictable response times -- in some cases we can better distribute that latency cost by increasing the maximum thread count instead (at the limit, to the previous max-threads + queue-size values, and removing the queue) such that the scheduler may allow all work to progress (though thread resource costs must be considered). |
Thanks @dmlloyd.
Which is why it can be disabled by default and enabled in an appropriate context where needed (like being enabled by default by Undertow/XNIO for easier web request slowness RCAs). It could also have other configurable capture points if we wish (a certain queue amount or duration at a queued amount, a time gap since the last task was fetched from the queue...). But my request on behalf of support and customers is really just limited to the context of an automatic thread exhaustion dump for the vast majority of web request slowness RCA cases we see. I certainly understand from a general development standpoint a thread pool could have varied use and behaviors that could impact desires for any automatic thread dump if ever. But I'm not aware of anyone else in any other context requesting automatic dumps (at least not that I come across in the day to day working with lots of JBoss customers). So I think trying to expound too much on the automatic dump capabilities and trigger points for other potentially varied workloads outside of my web request context may ultimately lead to overthinking solutions to problems that don't exist or aren't significant in comparison.
We could, but this would only have the desired impact if it is ultimately hooked into something that will automatically invoke the dump with default EAP configurations not requiring any other manual commands or config changes (like Undertow/XNIO flipping on the dump for its task pool by default). And thread exhaustion I think would be the most universally beneficial option for a default capture point for web requests. If this enhancement gets back to a place where it requires manual configuration or manual intervention for commands at an issue time, then data would still be missed in issue states and it defeats the purpose of implementing it.
I suppose, but WildFly/EAP/XNIO/Undertow teams are already happy to absorb such an auto dumping feature if in the EQE itself. So if we really want to push this back to some other additional WildFly level feature to invoke, then this would have to be pushed back through the WildFly/EAP teams to review and would likely take more overall Eng/QE resources to get that out the door for likely the same end result (capturing an automatic dump for typical web request thread exhaustion RCAs).
The problem is not specifically the queuing, it's the unsuccessful web requests with an exhausted thread pool. We can disable the queuing already, but that just trades the problem symptom of indeterminably growing response times in queue (EAP 7's default) for immediately rejected requests instead (EAP 6's default). Neither problem symptoms are really tolerable for the majority of typical web request work loads so it leads to an outage type RCA either way. For a possibly temporary thread exhaustion issue, queuing can lead to impacts that are more minimal or less visible with possibly small response time impacts compared to easily visible immediate rejections. But a lot of the nominal state web request workloads we see do not oscillate often between exhaustion/queuing and no queuing. They're generally tuned to avoid thread exhaustion points and any queuing so that becomes an obvious degraded response time point once it occurs compared to the nominal no queuing states. |
No description provided.