Changes to support command mode #2866
As per my comments above, I have a lot of issues with this PR. It fundamentally changes the core architecture in a way that I think in many cases is not an improvement.
I also don't see how this is a requirement for command mode apps. With the exception of SIGHUP-based restart (which I also think is really problematic), I don't really see how this makes a significant difference in terms of what we can do.
We already run what is basically a command mode app for our AWS Lambda support. I think this approach has the potential to provide a worse user experience than the AWS-style approach: if applications are run on the main thread, then System.exit() will not work as expected, as it will block the main thread and prevent application shutdown.
 *
 * @return the virtual build item
 */
Class<? extends VirtualBuildItem> value();
It does not really matter, but it would probably be cleaner to just make this a class array, rather than a repeatable annotation.
Cleaner in what way?
I don't mean "what's the difference", I mean why is it cleaner to have an array than repeating annotations? It seems pretty subjective IMO.
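For illustration, here is a rough sketch of the two shapes being compared; only the Class<? extends VirtualBuildItem> value() member comes from the diff below, while the annotation names, the usage, and the VirtualBuildItem stand-in are hypothetical:

import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Option A: a repeatable annotation, one build item class per use.
@Repeatable(ProducesVirtual.List.class)
@Retention(RetentionPolicy.RUNTIME)
@interface ProducesVirtual {
    Class<? extends VirtualBuildItem> value();

    // Container annotation required by @Repeatable.
    @Retention(RetentionPolicy.RUNTIME)
    @interface List {
        ProducesVirtual[] value();
    }
}

// Option B: a single annotation taking a class array.
@Retention(RetentionPolicy.RUNTIME)
@interface ProducesVirtuals {
    Class<? extends VirtualBuildItem>[] value();
}

// Stand-in for the real Quarkus type referenced by the diff.
class VirtualBuildItem {
}

At a use site the difference is @ProducesVirtual(A.class) @ProducesVirtual(B.class) versus @ProducesVirtuals({ A.class, B.class }); both carry the same information, which is why the thread treats the choice as largely a matter of taste.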
public GeneratedResourceBuildItem serializeExecutionChain(ExecutionChainBuildItem chainBuildItem) throws IOException {
    ExecutionChain chain = chainBuildItem.getChain();

    // Serialize the chain
I am 100% opposed to this approach. I think it would be a huge step backwards, and undermines what we are trying to do with Quarkus.
The whole point of the current architecture is to do as little as possible at startup time. Adding serialisation to startup is the opposite of 'doing as little as possible at startup'. This might not add noticeable overhead at the moment, while the number of classes is small, but as time goes on and this is used more often the cost will increase.
It also opens the door to other similar changes, all of which are 'fast' when taken in isolation; however, over time lots of 'fast' changes add up to slow startup. We can't both have a 'do as little as possible at startup' philosophy and allow changes like this in.
I also do not believe there is any technical reason for this change. It is quite possible to implement command mode applications with the current architecture (e.g. the AWS lambda stuff basically works the same as a command mode application).
I am 100% opposed to this approach. I think it would be a huge step backwards, and undermines what we are trying to do with Quarkus.
Do you have evidence to support this opinion?
The whole point of the current architecture is to do as little as possible at startup time. Adding serialisation to startup is the opposite of 'doing as little as possible at startup'. This might not add noticeable overhead at the moment, while the number of classes is small, but as time goes on and this is used more often the cost will increase.
Sure, this can be easily addressed at that time. Or sooner.
It also opens the door to other similar changes, all of which are 'fast' when taken in isolation; however, over time lots of 'fast' changes add up to slow startup. We can't both have a 'do as little as possible at startup' philosophy and allow changes like this in.
Again, evidence?
I also do not believe there is any technical reason for this change. It is quite possible to implement command mode applications with the current architecture (e.g. the AWS lambda stuff basically works the same as a command mode application).
The technical reason is simply to materialize the execution chain during static init. Of course we could replace this mechanism with (for example) bytecode-generated deserialization. But this takes more time and I wanted to get this out the door, viewing it as a premature optimization for now.
I am 100% opposed to this approach. I think it would be a huge step backwards, and undermines what we are trying to do with Quarkus.
Do you have evidence to support this opinion?
Because the whole architecture is based around doing as little as possible on startup to reduce startup time as much as possible. Adding extra work on startup that is not technically necessary is a step backwards.
The whole point of the current architecture is to do as little as possible at startup time. Adding serialisation to startup is the opposite of 'doing as little as possible at startup'. This might not add noticeable overhead at the moment, while the number of classes is small, but as time goes on and this is used more often the cost will increase.
Sure, this can be easily addressed at that time. Or sooner.
It also opens the door to other similar changes, all of which are 'fast' when taken in isolation; however, over time lots of 'fast' changes add up to slow startup. We can't both have a 'do as little as possible at startup' philosophy and allow changes like this in.
Again, evidence?
It's human nature. If we say yes to this then it's harder to say no to similar changes that add extra work to startup.
I also do not believe there is any technical reason for this change. It is quite possible to implement command mode applications with the current architecture (e.g. the AWS lambda stuff basically works the same as a command mode application).
The technical reason is simply to materialize the execution chain during static init. Of course we could replace this mechanism with (for example) bytecode-generated deserialization. But this takes more time and I wanted to get this out the door, viewing it as a premature optimization for now.
I am 100% opposed to this approach. I think it would be a huge step backwards, and undermines what we are trying to do with Quarkus.
Do you have evidence to support this opinion?
Because the whole architecture is based around doing as little as possible on startup to reduce startup time as much as possible. Adding extra work on startup that is not technically necessary is a step backwards.
Only data will show if it's a step backwards.
Again, evidence?
It's human nature. If we say yes to this then it's harder to say no to similar changes that add extra work to startup.
This doesn't really prove anything. I have a full workload, so I'm not really interested in having this kind of debate. Point to a part of the code that means the change cannot be merged, with a technical justification for that assertion, and we'll discuss.
} catch (NoSuchMethodException e) {
    throw new NoSuchMethodError(e.getMessage());
}
STARTUP_CHAIN = deserializeResource(ExecutionChain.class, "META-INF/quarkus.init");
As I said above, I am fundamentally opposed to this approach. Even though this will not increase boot time on Substrate, the JVM is still very much a first-class citizen. JVM bootstrap is just as important as Substrate bootstrap.
Even though this will not increase boot time on Substrate, the JVM is still very much a first-class citizen. JVM bootstrap is just as important as Substrate bootstrap.
This approach has one key advantage: it is expedient to implement. Of course other materialization approaches are possible as well; we can probably follow this up with an enhancement to bytecode-generate the chain materialization.
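For context, this is roughly what that follow-up amounts to: instead of reading a serialized chain at static init, the build would generate a class whose static initializer wires the chain with plain constructor calls. Everything below is a simplified, hypothetical stand-in, not the PR's actual types:

// Illustrative stand-ins only.
interface StartupStep {
    int run(String[] args) throws Exception;
}

final class GeneratedStartup {
    // A bytecode-generated materializer would effectively emit a static
    // initializer like this at build time, so nothing is parsed or
    // deserialized when the application boots.
    static final StartupStep CHAIN;

    static {
        StartupStep terminal = args -> 0;      // hypothetical final step
        CHAIN = new LoggingStep(terminal);     // hypothetical generated step
    }

    private static final class LoggingStep implements StartupStep {
        private final StartupStep next;

        LoggingStep(StartupStep next) {
            this.next = next;
        }

        @Override
        public int run(String[] args) throws Exception {
            System.out.println("starting");    // placeholder work
            return next.run(args);
        }
    }
}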
Just because it is 'expedient to implement' I still don't understand how it is an improvement on what we currently have or why any of this is necessary for command mode. Basically these changes do lots of things in a different way, but different is not the same as better.
It looks like most of the changes come from the idea that arbitrary stages of the application might need to be restarted, but I don't really think this is relevant to command mode, and I don't think that it is a good idea in general.
Just because it is 'expedient to implement' I still don't understand how it is an improvement on what we currently have or why any of this is necessary for command mode. Basically these changes do lots of things in a different way, but different is not the same as better.
Different is not the same as worse either. It cleanly solves problems that the original approach cannot provide clean solutions for, like having a single exit point and correctly handling System.exit().
It looks like most of the changes come from the idea that arbitrary stages of the application might need to be restarted, but I don't really think this is relevant to command mode, and I don't think that it is a good idea in general.
Command mode and other things need to intercept startup at various points. They may need to exit early. Being able to restart is a natural consequence of this. If the program does not abide by the contracts set out in the API then it's faulty, and it's not only restart that will likely show this to be the case; if it does, then it's restartable and there's nothing to worry about. I'm sorry you don't think it's a good idea, but frankly I'm not responsible for your feelings. I'm happy to answer your questions about the code though.
public int proceed(ExecutionContext context) throws Exception {
    Assert.checkNotNullParam("context", context);
    if (next != null) {
        return next.run(context);
I am really not a huge fan of this interceptor based model of startup. It brings back memories of how EAP used to run its step handlers before we had to change it. In particular:
- This model makes adding parallel startup much more difficult, as it requires all startup tasks to be run in the same stack
- If there are lots of steps this can overflow the stack (i.e. the same issue we had when EAP used this approach)
- It can be confusing to the users when there is a large stack trace present in a thread dump
I don't really see any benefits it provides over our current model.
- This model makes adding parallel startup much more difficult, as it requires all startup tasks to be run in the same stack
There is nothing stopping startup tasks from distributing their work across multiple threads, as long as they occur after the thread pool is set up.
- If there are lots of steps this can overflow the stack (i.e. the same issue we had when EAP used this approach)
Even if every extension had 10 steps, and each step used 200 bytes of stack, you'd need thousands of active extensions in one application to even come close to hitting this problem. Do you really believe this is a risk?
- It can be confusing to the users when there is a large stack trace present in a thread dump
How so?
- This model makes adding parallel startup much more difficult, as it requires all startup tasks to be run in the same stack
There is nothing stopping startup tasks from distributing their work across multiple threads, as long as they occur after the thread pool is set up.
- If there are lots of steps this can overflow the stack (i.e. the same issue we had when EAP used this approach)
Even if every extension had 10 steps, and each step used 200 bytes of stack, you'd need thousands of active extensions in one application to even come close to hitting this problem. Do you really believe this is a risk?
Probably not, but I also just don't really see what the advantage is?
- It can be confusing to the users when there is a large stack trace present in a thread dump
How so?
Because if you don't know anything about Quarkus and you see a large stack trace, you will likely assume it is related to your application and start looking there. IDEs also generally sort by stack depth in their snapshots, so this thread will often be the first thing a user sees.
- This model makes adding parallel startup much more difficult, as it requires all startup tasks to be run in the same stack
There is nothing stopping startup tasks from distributing their work across multiple threads, as long as they occur after the thread pool is set up.
- If there are lots of steps this can overflow the stack (i.e. the same issue we had when EAP used this approach)
Even if every extension had 10 steps, and each step used 200 bytes of stack, you'd need thousands of active extensions in one application to even come close to hitting this problem. Do you really believe this is a risk?
Probably not, but I also just don't really see what the advantage is?
Well, if you don't see it by now, then perhaps you never will, but I think it might be more apparent as the next phases of this come in.
- It can be confusing to the users when there is a large stack trace present in a thread dump
How so?
Because if you don't know anything about Quarkus and you see a large stack trace, you will likely assume it is related to your application and start looking there. IDEs also generally sort by stack depth in their snapshots, so this thread will often be the first thing a user sees.
It is related to the application though. I don't really see what the problem is.
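For readers following along, here is a minimal sketch of the chain model being debated, echoing the proceed()/run() fragments quoted in this thread; the class shapes and the handler bodies are simplified assumptions, not the PR's exact code:

final class ExecutionContext {
    final String[] args;

    ExecutionContext(String[] args) {
        this.args = args;
    }
}

interface ExecutionHandler {
    // A handler does its part of startup and then calls chain.proceed() to
    // hand off to the rest of the chain; returning without proceeding ends
    // startup early with that exit code.
    int run(ExecutionChain chain, ExecutionContext context) throws Exception;
}

final class ExecutionChain {
    private final ExecutionHandler handler;
    private final ExecutionChain next;

    ExecutionChain(ExecutionHandler handler, ExecutionChain next) {
        this.handler = handler;
        this.next = next;
    }

    int proceed(ExecutionContext context) throws Exception {
        // Each proceed() keeps the caller's frame on the stack for the rest
        // of startup, which is the stack-depth point argued over above.
        return next != null ? next.handler.run(next, context) : 0;
    }

    int start(ExecutionContext context) throws Exception {
        return handler.run(this, context);
    }
}

class ChainDemo {
    public static void main(String[] args) throws Exception {
        ExecutionChain command = new ExecutionChain((chain, ctx) -> {
            System.out.println("command runs with " + ctx.args.length + " args");
            return 0;                          // the command's exit code
        }, null);
        ExecutionChain init = new ExecutionChain((chain, ctx) -> {
            System.out.println("init step");   // e.g. config, logging, thread pool
            return chain.proceed(ctx);         // hand off down the chain
        }, command);
        System.out.println("exit code: " + init.start(new ExecutionContext(args)));
    }
}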
public int run(final ExecutionChain chain, final ExecutionContext context) throws Exception {
    int result;
    result = chain.proceed(context);
I am very much against this. Fine grained restarts were hugely problematic in EAP, and caused a large number of support cases until it was eventually fixed by disabling it entirely.
The issue is that fine grained restart affects every feature, however none of these features test for it. This means it always ends up as an untested code path, which is just not acceptable. If we want fine grained restartability we need to commit to extensive testing with every feature. Note that dev mode tests do not cover this type of restart, as dev mode is a fundamentally different environment with a new ClassLoader per restart.
I also don't think that designing an architecture around this fits with our 'cloud first' mandate. Cloud based apps are restarted by spinning up new pods, not via signals.
I am very much against this. Fine grained restarts were hugely problematic in EAP, and caused a large number of support cases until it was eventually fixed by disabling it entirely.
Do you have evidence that restarts would not work correctly in Quarkus? Such evidence of course means that the existing shutdown code is buggy and needs fixing, so it should be accompanied by a PR & test.
We have already had bugs related to hot deployment, so putting together a proper hot deployment test suite is on the priority list. Hot deployment is fundamentally different from this kind of restart, as it uses a new class loader each time.
The issue is that fine grained restart affects every feature, however none of these features test for it.
This is patently false in our context. We start up and shut down in every test. If shutdown doesn't work, our tests will have discovered it. And even if there's some reason that this isn't the case, it's trivial to test for it, unlike in EAP.
We start up and shut down in a new class loader, with new generated bytecode each time. We also shut down the whole thing each time; we don't do partial restarts.
I also don't think that designing an architecture around this fits with our 'cloud first' mandate. Cloud based apps are restarted by spinning up new pods, not via signals.
As I always say, "cloud first" doesn't mean "cloud only".
In what cases specifically?
It's meant to only make a difference in certain areas. The scope isn't a complete redesign so the differences will be targeted, not general.
Users shouldn't use System.exit(). This is the very reason that it has been requested to have a request-exit method, which this patch includes. I believe that even with the thread pool shutdown checks, it is still possible (today) to hang Quarkus with System.exit() by calling it during startup. I don't view this as a problem to worry about.
Guys, since there seems to be significant disagreement, we should discuss this issue in the next standup, right @emmanuelbernard @n1hility?
System.exit() is the standard API for exiting a command mode Java application. IMHO it is much more user-friendly to risk a potential hang in a command mode application if they are using our thread pool, than to definitely cause a hang if they use the standard Java API (is there even a way to use it in command mode? Command mode apps would generally manage their own threads. Maybe CDI async events is the only one I can think of). Also, is there any reason we actually even need to wait for the thread pool to shut down? I can't really see any reason why our shutdown code should wait for this. It's the correct thing to do in MSC, but I can't really think of any reason why this should be a blocking operation. If there are any threads still running tasks they can keep the JVM running long enough to finish their shutdown tasks.
I think it will be obvious enough that the command methods can return their exit code, so we don't need to worry about this. If you really disagree though, we can log an error if this happens.
I think it is generally just a good idea. It doesn't cost much (if anything) in shutdown time in most cases, and it allows any user async tasks to wrap up cleanly, even in cases where they might not realize they have async tasks which are outstanding. But we can (and should) make that a completely separate discussion.
Pushed changes to solve a Windows problem.
Blocking and waiting for the thread pool to stop is not the same as not allowing user tasks to finish. Basically, why do we call awaitTermination() in the shutdown thread? If we just don't call awaitTermination() and let the shutdown thread exit, then we will avoid any hanging problems, as we are not blocked on waiting for a thread that has called System.exit() to complete.
In any situation other than full process exit, you could end up consuming more resources than necessary by not waiting until earlier resources were freed before committing to new resources; pathologically speaking this could result in file descriptor or process exhaustion, especially on small devices. This might include testing, and yes, partial or full process reload situations, but also perhaps command-driven or interactive CLI applications as well. In general the way I'd look at it is that skipping resource release (be it threads, files, sockets, any large swaths of memory, or whatever) is an optimization, and should be done only under controlled/known circumstances (like when the process is going to exit). It's better to make an exception for one case than it is to make an exception for all-but-one cases.
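To make the hang scenario in this exchange concrete, here is a minimal, generic JVM sketch (not Quarkus code, and the timeout value is arbitrary) of why a shutdown hook that blocks on awaitTermination() interacts badly with System.exit() called from a pool thread:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExitHangDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            pool.shutdown();
            try {
                // Wait for outstanding tasks before letting the JVM exit.
                pool.awaitTermination(10, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        pool.submit(() -> {
            // System.exit() waits for shutdown hooks to finish; the hook above
            // waits for this task to finish; neither can make progress, so the
            // JVM hangs until the awaitTermination timeout elapses.
            System.exit(1);
        });
    }
}

Skipping the awaitTermination() call, as suggested above, avoids the hang but gives up the orderly wind-down argued for in the comment just above this sketch.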
Updated the virtual produce/consume annotations; "before/after" was causing some confusion.
I've split out virtual build items to #2952.
But the problematic case where the hang occurs is full process exit. If we know the JVM is going away and that blocking on shutdown has the potential to cause hangs, then why wait?
The problematic case isn't possible unless something calls System.exit(). Waiting for thread pool termination on shutdown is something we're already doing today. It has nothing to do with this change in particular, so I don't see why it's being discussed in the context of this change and tied to it.
Eliminating serialization turned out to be easier than expected. As a bonus I removed reflection in favor of compiling against stubs.
Include information on weak and virtual build items, as well as other things that have changed since the doc was written
👍
try (InputStream stream = url.openStream()) {
    try (BufferedInputStream bis = new BufferedInputStream(stream)) {
        try (ObjectInputStream ois = new ObjectInputStream(bis) {
            protected Class<?> resolveClass(final ObjectStreamClass desc) throws ClassNotFoundException {
I don't have an opinion about the ExecutionHandler approach, but about this specific class, please be aware that any deserialization from an untrusted source is a security risk.
- https://christian-schneider.net/JavaDeserializationSecurityFAQ.html
- https://docs.oracle.com/en/java/javase/12/core/serialization-filtering1.html#GUID-3ECB288D-E5BD-4412-892F-E9BB11D4C98A
- https://support.symantec.com/us/en/article.symsa1344.html
Over at the Log4j2 project we deprecated the SerializationLayout that used to be the default in the SocketAppender and replaced it with JsonLayout because a security vulnerability was reported against the library that referred to this layout. Even whitelisting and blacklisting may leave security holes.
Generally, avoid deserialization to save yourself many headaches.
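For reference (the thread below notes that serialization was ultimately removed, so this is illustrative only): had deserialization been kept, the JDK serialization filtering mentioned in the second link above provides an allow-list mechanism on JDK 9+. The payload type and filter pattern here are hypothetical:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class FilteredDeserialization {

    // Hypothetical payload standing in for whatever would be serialized.
    static class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final int steps;

        Payload(int steps) {
            this.steps = steps;
        }
    }

    static Object readTrusted(byte[] bytes) throws Exception {
        // Allow only the expected classes, cap graph depth, reject everything else.
        ObjectInputFilter filter = ObjectInputFilter.Config.createFilter(
                "FilteredDeserialization$Payload;java.lang.*;maxdepth=10;!*");
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            ois.setObjectInputFilter(filter);
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Payload(3));
        }
        System.out.println(((Payload) readTrusted(bos.toByteArray())).steps);
    }
}

As the comment above says, filtering reduces but does not eliminate the risk, so dropping deserialization entirely (as this PR eventually did) is the simpler fix.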
Generally, avoid deserialization to save yourself many headaches.
Thanks for the basic education on serialization. :)
But you should know that the latest version does not in fact use serialization anymore.
Oh, sorry, I missed that. I was looking at the Files changed link and I saw the Deserializer class. So this is no longer called and could be removed?
Correct.
@dmlloyd @stuartwdouglas @FroMage I am interested in better support for Quarkus-based command line apps (#284). What is blocking progress on this PR?
We are too :) It's really a matter of priority but expect @maxandersen to restart that conversation rather soon. He is also super interested.
I have significant updates pending on this one, but I have to finish a big config fix first.
Closing due to inactivity. |
Does this mean that Quarkus support for CLI apps is dead? |
There is an ongoing discussion: https://groups.google.com/d/msg/quarkus-dev/KiqCbqQZe_o/DEVowWmyAQAJ |
Lays the groundwork for #284.