
ArC: use a separate lock for each bean in generated ContextInstances #37982

Merged: 1 commit merged into quarkusio:main on Jan 5, 2024

Conversation

mkouba (Contributor) commented Jan 2, 2024

I'd like to stress that I would still recommend that users avoid things like calling external services in @PostConstruct/@PreDestroy callbacks. Very often it's not clear exactly when a bean will be instantiated, and since the API is blocking, undesired behavior may occur.
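For illustration, a hypothetical bean showing the kind of callback this warns about (all names here are made up):

import jakarta.annotation.PostConstruct;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class GreetingService {

    @Inject
    RemoteConfigClient client; // hypothetical blocking client for an external service

    private volatile String greeting;

    @PostConstruct
    void init() {
        // Runs during bean instantiation, whose exact timing is hard to predict;
        // a slow or failing remote call blocks whichever thread happens to
        // trigger the first access to this bean.
        this.greeting = client.fetchGreeting();
    }
}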

CC @franz1981 - I've replaced the forEach() method with removeEach(), which is more useful for request context destruction. It would be great if you could run some perf benchmarks so that we can be sure this PR does not cause any significant perf regression. That said, I do expect a perf drop because we must use a separate lock for each bean...

quarkus-bot added the area/arc label (Issue related to ARC, dependency injection) on Jan 2, 2024
mkouba marked this pull request as ready for review on January 3, 2024, 11:26
mkouba requested a review from Ladicek on January 3, 2024, 11:27
Ladicek (Contributor) commented Jan 3, 2024

LGTM, just wondering how [un]likely it is that we reach the method bytecode size limit. Have you checked?

mkouba (Contributor, Author) commented Jan 3, 2024

LGTM, just wondering how [un]likely it is that we reach the method bytecode size limit. Have you checked?

Nope. I only used your https://github.com/Ladicek/arc-crazybeans with 500 @ApplicationScoped beans to test the generated ContextInstances...

Ladicek (Contributor) commented Jan 3, 2024

I would expect 500 beans to be fine, but wouldn't be sure about 5000 beans :-)

franz1981 (Contributor) commented Jan 3, 2024

I'm still on vacation, but... do we really need any lock?
Couldn't we use an atomic reference with setRelease/getAcquire?
In theory we only care about visibility, because we haven't defined any expected behaviour for concurrent updates... this would solve any potential problem, improve performance, and keep the visibility guarantees working (at the expense of atomicity, which I still don't see why we should enforce).

mkouba (Contributor, Author) commented Jan 3, 2024

I would expect 500 beans to be fine, but wouldn't be sure about 5000 beans :-)

Hm, so 1500 beans is OK, but somewhere above that number I get an org.objectweb.asm.MethodTooLargeException. I believe we would have hit the same exception even before this PR, though.

I can try to reduce the size of the relevant methods, but honestly, I'd rather document that users should use quarkus.arc.optimize-contexts=false for this kind of "microservice" ;-).

Ladicek (Contributor) commented Jan 3, 2024

IIUC, we need to guarantee (at least for normal scopes) that if a contextual instance is needed, exactly one is created, even if multiple threads ask for it at the "same" time.

mkouba (Contributor, Author) commented Jan 3, 2024

I'm still on vacation, but... do we really need any lock? Couldn't we use an atomic reference with setRelease/getAcquire? In theory we only care about visibility, because we haven't defined any expected behaviour for concurrent updates... this would solve any potential problem, improve performance, and keep the visibility guarantees working (at the expense of atomicity, which I still don't see why we should enforce).

It's not only about visibility. We need to make sure that exactly one bean instance is created.

mkouba added this to the 3.7 - main milestone on Jan 3, 2024
franz1981 (Contributor) commented Jan 3, 2024

@mkouba which translates into N getAndSet/compareAndSet/getVolatile operations depending on the case, saving a MonitorExit volatile store op per instance (I am checking now whether it is a volatile store, as I fear).

It also saves memory, because you can have N static final atomic reference field updaters, avoiding any additional lock instance, plus a generic util method which just needs to take the atomic reference as a parameter and perform the expected pattern. And by doing it right, we would preserve the same behaviour as CHM, whereas right now we perform a lock/unlock while reading instead of being lock-free.
That also means forEach can stay lock-free; no need to lock anything.
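For illustration, a rough sketch of the shape that seems to be described here (the generated class and field names are made up, not the actual generated code):

import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// One volatile field per bean; the static final updaters are shared by all
// instances of the generated class, so no per-instance lock objects are needed.
class GeneratedContextInstances {

    volatile Object instance1; // contextual instance of bean 1
    volatile Object instance2; // contextual instance of bean 2

    static final AtomicReferenceFieldUpdater<GeneratedContextInstances, Object> FIELD1 =
            AtomicReferenceFieldUpdater.newUpdater(GeneratedContextInstances.class, Object.class, "instance1");
    static final AtomicReferenceFieldUpdater<GeneratedContextInstances, Object> FIELD2 =
            AtomicReferenceFieldUpdater.newUpdater(GeneratedContextInstances.class, Object.class, "instance2");
}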

franz1981 (Contributor) commented Jan 3, 2024

IIUC, we need to guarantee (at least for normal scopes) that if a contextual instance is needed, exactly one is created, even if multiple threads ask for it at the "same" time.

Does the allocation happen in the critical protected region, or do we assume we can produce more than one instance but let just one win and be set?
If the former, we just need two states and a loop/fail-fast, depending on the requirements.

e.g.

  1. getVolatile and, if unset, compareAndSet(none, alloc)
  2. Lose: exit
  3. Win: allocate the bean and putRelease(bean)

This way you still have a single volatile op

mkouba (Contributor, Author) commented Jan 3, 2024

@mkouba which translates into N getAndSet/compareAndSet/getVolatile operations depending on the case, saving a MonitorExit volatile store op per instance (I am checking now whether it is a volatile store, as I fear).

It also saves memory, because you can have N static final atomic reference field updaters, avoiding any additional lock instance, plus a generic util method which just needs to take the atomic reference as a parameter and perform the expected pattern. And by doing it right, we would preserve the same behaviour as CHM, whereas right now we perform a lock/unlock while reading instead of being lock-free.

I cannot say I completely understand that "perf gibberish" above :D but compareAndSet() from AtomicReference and AtomicReferenceFieldUpdater is not usable here because we need to compute the bean instance lazily and only once...

franz1981 (Contributor) commented Jan 3, 2024

In #37982 (comment) I have written a potential two-phase protocol (claim + commit value) which allows lazy allocation of beans.

Ladicek (Contributor) commented Jan 3, 2024

IIUC, we need to guarantee (at least for normal scopes) that if a contextual instance is needed, exactly one is created, even if multiple threads ask for it at the "same" time.

Does the allocation happen in the critical protected region, or do we assume we can produce more than one instance but let just one win and be set? If the former, we just need two states and a loop.

e.g.

1. getVolatile and, if unset, compareAndSet(none, alloc)

2. Lose: exit

3. Win: allocate the bean and putRelease(bean)

If I understand you correctly, the alloc thing is a sentinel that marks "some thread won a CAS and is in the process of creating an instance". This doesn't work, because CAS losers cannot simply exit (returning null I suppose). They must wait in one way or another. Maybe spinning wouldn't be so bad in this situation, but we're calling an external callback to create the instance, so I would be cautious about that.

franz1981 (Contributor) commented Jan 3, 2024

@Ladicek if the guarantee is to return whatever is set, or to set it ourselves, then yep, the loop is the appropriate behaviour. Consider that when the caller thread is virtual we can still accept using Thread.yield while waiting, which would work fine. Given that the allocation is not supposed to perform I/O (I hope!! Please correct me if I am wrong) it shouldn't be a big deal. In case of I/O we can still wait 50µs on each attempt (via LockSupport::parkNanos, which doesn't care about interruption); that's the default Linux timer slack (the minimum granularity of sleep).

Ladicek (Contributor) commented Jan 3, 2024

So if I understand you correctly, computeIfAbsent would look roughly like this:

Object value = this.1; // volatile read
if (value != null && value != SENTINEL) {
    return value;
}
if (CAS(this.1, null, SENTINEL)) {
    value = instanceSupplier.get();
    this.1 = value; // volatile write
    return value;
} else {
    while (value == null || value == SENTINEL) {
        Thread.onSpinWait();
        value = this.1; // volatile read
    }
    return value;
}

This could even be written as a static method in some util class, accepting a VarHandle or AtomicReferenceFieldUpdater; it wouldn't need to be generated with Gizmo.
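For illustration, a minimal sketch of such a util method based on AtomicReferenceFieldUpdater (all names here are made up, and it inherits the pseudocode's assumption that the supplier never throws; see the failure discussion below):

import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;
import java.util.function.Supplier;

final class ContextInstanceUtil {

    // Placeholder stored while the winning thread runs the supplier.
    static final Object SENTINEL = new Object();

    static <T> Object computeIfAbsent(AtomicReferenceFieldUpdater<T, Object> field,
            T instances, Supplier<Object> instanceSupplier) {
        Object value = field.get(instances); // volatile read
        if (value != null && value != SENTINEL) {
            return value;
        }
        if (field.compareAndSet(instances, null, SENTINEL)) {
            // We won the claim: only this thread invokes the supplier.
            value = instanceSupplier.get();
            field.set(instances, value); // volatile write publishes the instance
            return value;
        }
        // We lost the claim: spin until the winner publishes the instance.
        while (value == null || value == SENTINEL) {
            Thread.onSpinWait();
            value = field.get(instances); // volatile read
        }
        return value;
    }
}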

I guess that's doable, but we'd have to make sure that getIfPresent and remove handle the presence of SENTINEL correctly.

I'm not sure how the bulk-processing methods (getAllPresent, clear / forEach or removeEach) would work. Clearing probably isn't too big of a deal; if someone accesses a context that is being destroyed, they are doing something wrong. Now, getAllPresent is used to do 2 things: create a Map<Bean<?>, Object> and destroy an instance. The Map cannot possibly be "accurate at the end" in case of concurrent access, but we should easily be able to guarantee "accurate at the beginning", which should be enough. The usage of getAllPresent to destroy an instance is IMHO an oversight and should be reimplemented using remove.

Which leaves me wondering about spinning. I didn't look at what the suppliers do. I'll check.

franz1981 (Contributor) commented Jan 3, 2024

Yep, the only suggestion is to replace the volatile store with a lazySet/setRelease, because it doesn't need any StoreLoad barrier and still works fine; it should be translated as

X = supplier.get();
LoadStore
StoreStore
Store X

The two barriers are a no-op on x86 and are mostly compiler barriers, which cost nothing.
They still guarantee safe publication of the value X even if it is not immutable.

If we don't do it, we risk having the exact same performance hit as lock/unlock.
The beauty of this approach is that we can now perform plain volatile reads (assuming no concurrent allocation; otherwise we have to spin till the value is available).

Regarding OOM or thrown exceptions: what we do while spin-waiting is up to us, but it's good practice to assume the supplier could fail, unless we can guarantee that the supplier's get() never fails, which would make the provided code OK (apart from the volatile store point explained above).
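For illustration, the setRelease variant would only change the publishing store in the sketch above; with AtomicReferenceFieldUpdater the release store is spelled lazySet:

// instead of the full volatile store:
//     field.set(instances, value);
// use a release store, which omits the StoreLoad barrier:
field.lazySet(instances, value); // the pre-VarHandle spelling of setRelease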

Ladicek (Contributor) commented Jan 3, 2024

OK so the supplier does 2 things:

  1. Call Contextual.create()
  2. Create a ContextInstanceHandleImpl

The second step is very cheap, and the first one is usually also fairly cheap, but in case of synthetic beans, Contextual.create() can do anything. So that's a bit worrying.

franz1981 (Contributor) commented Jan 3, 2024

@Ladicek given that the provided util method is written just once, I think adding a local counter for the number of spins, and then forcing a LockSupport::parkNanos(50000L) once it is exhausted, is the safer choice. It would avoid burning too much CPU while waiting for the value to become available.
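For illustration, a minimal sketch of that wait policy (the spin budget is an arbitrary illustrative value):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

final class SpinThenPark {

    static final int SPIN_BUDGET = 100; // illustrative, would need tuning

    // Spin briefly, then back off with parkNanos so that waiters do not
    // burn CPU while a slow supplier creates the instance.
    static Object awaitValue(AtomicReference<Object> ref, Object sentinel) {
        int spins = 0;
        Object value = ref.get();
        while (value == null || value == sentinel) {
            if (spins++ < SPIN_BUDGET) {
                Thread.onSpinWait();
            } else {
                LockSupport.parkNanos(50_000L); // ~50us, the Linux timer-slack granularity
            }
            value = ref.get();
        }
        return value;
    }
}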

Let's say that we are optimizing for 2 possible but still degenerate use cases:

  • concurrent allocation
  • long allocation

They are somewhat related, since being slower increases the chance of concurrent attempts, but I would be surprised to see this as the common scenario.

mkouba (Contributor, Author) commented Jan 3, 2024

OK so the supplier does 2 things:

1. Call `Contextual.create()`

2. Create a `ContextInstanceHandleImpl`

The second step is very cheap, and the first one is usually also fairly cheap, but in case of synthetic beans, Contextual.create() can do anything. So that's a bit worrying.

Also in case of #37958 it calls an external service...

mkouba (Contributor, Author) commented Jan 3, 2024

So I'm not against experimenting with the lock-less approach described above. But:

  1. It should be discussed in a separate PR.
  2. It's even less readable/maintainable, which is OK since it's an implementation detail of a perf optimization; OTOH it means fewer people will be able to maintain this code.
  3. In any case, I'd like to see some numbers before we make a sacrifice.

franz1981 (Contributor) commented Jan 3, 2024

@mkouba regarding:

In any case, I'd like to see some numbers before we make a sacrifice.

Do you already have a microbenchmark with a non-trivial but still realistic number of beans?
If yes, we can compare this PR as it is vs the existing approach and evaluate the impact first.
What we can extract from this is that we cut the number of volatile operations in half, meaning it should improve over the existing approach or have no impact at all; it is unlikely to make things worse (barring some programming error).

Just a hint: to make such a microbenchmark more realistic, add a configurable parameter work (with values 0, 10, 100) which can be used to call Blackhole.consumeCPU(work) in the benchmark method, to simulate user work between atomic operations (or it could be used inside the supplier too, to emulate longer bean creation, even better).
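For illustration, a hypothetical JMH skeleton along those lines (all names are made up, and a ConcurrentHashMap stands in for the real ContextInstances implementations being compared):

import java.util.concurrent.ConcurrentHashMap;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class ContextInstancesBenchmark {

    // Simulated user work between atomic operations.
    @Param({"0", "10", "100"})
    long work;

    // Stand-in for the generated ContextInstances / the CHM-based fallback.
    final ConcurrentHashMap<Integer, Object> instances = new ConcurrentHashMap<>();

    @Benchmark
    public Object getOrCreate() {
        Blackhole.consumeCPU(work);
        return instances.computeIfAbsent(1, k -> new Object());
    }
}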

Ladicek (Contributor) commented Jan 3, 2024

OK so the supplier does 2 things:

1. Call `Contextual.create()`

2. Create a `ContextInstanceHandleImpl`

The second step is very cheap, and the first one is usually also fairly cheap, but in case of synthetic beans, Contextual.create() can do anything. So that's a bit worrying.

Also in case of #37958 it calls an external service...

That's in @PostConstruct, that should be outside of the code path we're interested in here, right?

mkouba (Contributor, Author) commented Jan 3, 2024

OK so the supplier does 2 things:

1. Call `Contextual.create()`

2. Create a `ContextInstanceHandleImpl`

The second step is very cheap, and the first one is usually also fairly cheap, but in case of synthetic beans, Contextual.create() can do anything. So that's a bit worrying.

Also in case of #37958 it calls an external service...

That's in @PostConstruct, that should be outside of the code path we're interested in here, right?

I don't think so. A @PostConstruct callback is executed within Contextual#create(). @PostConstruct and @AroundConstruct interceptors will be called too.

Ladicek (Contributor) commented Jan 3, 2024

Ouch, I didn't realize that. In that case, I'm against spinning, as we basically allow running arbitrary user code there.

gsmet (Member) commented Jan 4, 2024

@mkouba we just got a second report of this causing issues; should we make the default false for the next 3.6 micro and make further progress in 3.7?

manovotn (Contributor) commented Jan 4, 2024

@mkouba we just got a second report of this causing issues; should we make the default false for the next 3.6 micro and make further progress in 3.7?

I haven't had the time to fully delve into the details here, but it seems it might be better to apply the fix suggested here and then continue iterating on it based on what Francesco and Ladislav wrote. My understanding was that the current fix is not inherently wrong, just perhaps not optimal. Or did I miss something?

mkouba (Contributor, Author) commented Jan 5, 2024

My understanding was that the current fix is not inherently wrong, just perhaps not optimal. Or did I miss something?

@manovotn That is my understanding as well ;-). At least we know that it fixes #37958 and #38040. My simple benchmarks do not show a significant regression either.

@mkouba we just got a second report of this causing issues; should we make the default false for the next 3.6 micro and make further progress in 3.7?

@gsmet I'd rather get this fix in and backport to 3.6 if needed.

Of course, I'd like to continue with investigation in this area. First, we need some more benchmarks...

gsmet (Member) commented Jan 5, 2024

@mkouba oh, I think I wasn't very clear: I completely agree we should pursue the fix. My only worry is that it might be too complex to backport, which is why I proposed switching the default in 3.6.

But maybe it's better to push the fix to 3.6 to have some more bake time before the LTS.

Let me know what you prefer.

- fixes quarkusio#37958 and quarkusio#38040
- use a separate lock for each bean in the generated ContextInstances
- replace ContextInstances#forEach() and ContextInstances#clear() with
  ContextInstances#removeEach()
- optimize the generated ContextInstances to significantly reduce the
  size of the generated bytecode
mkouba (Contributor, Author) commented Jan 5, 2024

My only worry is that it might be too complex to backport, which is why I proposed switching the default in 3.6.

@gsmet Would it help if we prepared a separate branch for 3.6?

Ladicek (Contributor) left a comment

LGTM, just wondering if this thing fell through the cracks:

I can try to reduce the size of the relevant methods, but honestly, I'd rather document that users should use quarkus.arc.optimize-contexts=false for this kind of "microservice" ;-).

mkouba (Contributor, Author) commented Jan 5, 2024

LGTM, just wondering if this thing fell through the cracks:

I can try to reduce the size of the relevant methods, but honestly, I'd rather document that users should use quarkus.arc.optimize-contexts=false for this kind of "microservice" ;-).

It didn't, but I cannot find a good place in the docs. I mean, currently we don't generate the docs for the optimize-contexts config property (#36626 (comment)) 🤷

manovotn (Contributor) left a comment

I tried to go through this mainly by comparing the generated ContextInstances class before and after this code change and I think it looks good.
Given that it fixes the two reported scenarios, I'd get this merged and discuss further optimizations in separate PRs/issues.

mkouba added the triage/waiting-for-ci label (Ready to merge when CI successfully finishes) on Jan 5, 2024
gsmet (Member) commented Jan 5, 2024

@mkouba I will try to backport it once it's merged and will let you know if we need a specific branch. No need to do the work if it's not required.

mkouba (Contributor, Author) commented Jan 5, 2024

@mkouba I will try to backport it once it's merged and will let you know if we need a specific branch. No need to do the work if it's not required.

There is a record in this PR, so I'm pretty sure it won't be that easy :-(

gsmet (Member) commented Jan 5, 2024

Ah, yes, so we will need a separate branch. FWIW, I would be a bit cautious about using Java 17 features in fixes for now.

mkouba (Contributor, Author) commented Jan 5, 2024

@gsmet FYI https://github.com/mkouba/quarkus/tree/issue-37982-36. It contains 2 commits (one from this PR and one from #37529).

quarkus-bot commented Jan 5, 2024

✔️ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

mkouba merged commit 6406376 into quarkusio:main on Jan 5, 2024
quarkus-bot added the kind/bugfix label and removed the triage/waiting-for-ci label on Jan 5, 2024
gsmet (Member) commented Jan 5, 2024

@mkouba can you create a PR so that I get it on my radar and we can get CI to run?

Thanks!

gsmet (Member) commented Jan 9, 2024

This has been backported as part of #38069.

Ladicek (Contributor) commented Jan 10, 2024

LGTM, just wondering if this thing fell through the cracks:

I can try to reduce the size of the relevant methods, but honestly, I'd rather document that users should use quarkus.arc.optimize-contexts=false for this kind of "microservice" ;-).

It didn't, but I cannot find a good place in the docs. I mean, currently we don't generate the docs for the optimize-contexts config property (#36626 (comment)) 🤷

Just got this silly idea this morning: how about we disable this optimization automatically in case there are more than, say, 1000 beans in the application? It could be per context, but I think globally would be fine too.

mkouba (Contributor, Author) commented Jan 10, 2024

LGTM, just wondering if this thing fell through the cracks:

I can try to reduce the size of the relevant methods, but honestly, I'd rather document that users should use quarkus.arc.optimize-contexts=false for this kind of "microservice" ;-).

It didn't, but I cannot find a good place in the docs. I mean, currently we don't generate the docs for the optimize-contexts config property (#36626 (comment)) 🤷

Just got this silly idea this morning: how about we disable this optimization automatically in case there are more than, say, 1000 beans in the application? It could be per context, but I think globally would be fine too.

That is actually a good idea. And 1000 is a nice number. Hm, maybe we could change the type of the property to a string and do something similar to ArcConfig.removeUnusedBeans. The set of supported values would be true, false, and auto, where auto disables the optimization if more than 1000 beans are found. WDYT?
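For illustration, the proposed configuration might then look like this in application.properties (auto is the value being proposed here; it did not exist at the time of this comment):

# true = always generate optimized ContextInstances, false = never,
# auto = disable the optimization when more than ~1000 beans are found
quarkus.arc.optimize-contexts=auto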

Ladicek (Contributor) commented Jan 10, 2024

That makes sense to me.

manovotn (Contributor) commented Jan 10, 2024

Just got this silly idea this morning: how about we disable this optimization automatically in case there are more than, say, 1000 beans in the application? It could be per context, but I think globally would be fine too.

That's a great idea! +1

mkouba (Contributor, Author) commented Jan 10, 2024

For the record: #38121

Labels: area/arc (Issue related to ARC, dependency injection), kind/bugfix