Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotate Julia reference accesses for atomic release/consume #35535

Closed
wants to merge 3 commits into from

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Apr 20, 2020

This lets us say that anytime you observe the store of an object reference (for example, by reading a global, or a pointer field of an object, or a pointer array), that it is valid to assume consume ordering on that read—meaning that you can safely read through that pointer (on all processors except the Alpha, where these would need to be strengthened to an acquire load). It's also already the default behavior on the x86 hardware, this just enforces the ordering on the compiler too and should add the required store barriers on ARM and PPC to make it viable there also.

@yuyichao
Copy link
Contributor

I don't think we should do this. This deviate from the model that LLVM is optimized for. Also, I think even though this ordering is OK for normal load/store, I believe it is not very well defined/specified for new instructioins (even SSE) The added compiler barrier is almost certainly going to kill vectorization.

@vtjnash
Copy link
Member Author

vtjnash commented Apr 20, 2020

This is a (slightly) weakened form of the Java memory model (since it's only applied to pointers, not values), which LLVM has been expected to handle. And the new barrier on the stores would never have vectorized anyways (because of the required gc barrier).

This code shows a mildly unfortunate consequence:

julia> f(x) = @inbounds begin; a = 0; for i = 1:length(x); a += isassigned(x, i); end; a; end
f (generic function with 1 method)

julia> code_llvm(f, (Vector{Any},))

Which reveals a missed vectorization opportunity inside LLVM this causes. But this specific code also feels extremely rare and rather useless.

I believe it is not very well defined/specified

That is true of our current naive emission. This should fix it based on the x86-TSO (total store order) model. See research at https://www.cl.cam.ac.uk/~pes20/weakmemory/index.html for x86, ARM, and PPC that discuss why this is valid.

@vtjnash
Copy link
Member Author

vtjnash commented Apr 20, 2020

Also the relevant page in the intel manual is available at
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-262.html and lists out the instructions (and the circumstances) that would violate the TSO memory model. Note that these instructions are explicitly for the purpose of bypassing the memory model, and are thus very slow, since they can bypass the primary memory model.

@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code multithreading Base.Threads and related functionality labels Apr 20, 2020
Copy link
Contributor

@yuyichao yuyichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a (slightly) weakened form of the Java memory model

Does Java have automatically memory barrier? I thought they only have that for volatile (atomic).

And the new barrier on the stores would never have vectorized anyways (because of the required gc barrier).

OK, this is fair.

That is true of our current naive emission. This should fix it based on the x86-TSO (total store order) model. See research at https://www.cl.cam.ac.uk/~pes20/weakmemory/index.html for x86, ARM, and PPC that discuss why this is valid.

Actually I looked again, it is the atomicity that is not specified, not the ordering. I assume this means that read of atomic_order_relaxed can't be vectorized. Also I don't think memcpy, memmove have that kind of guarantee either. (And I'm pretty sure some implementations use string operations, which isn't even ordered according to intel).

My main point is that almost all the code from llvm to libc are written to use all single thread optimizations that doesn't introduce race. Requiring a stronger memory model than that will need to handle/disable all of the related optimizations/assumptions.

@@ -459,7 +459,8 @@ static Value *literal_pointer_val(jl_codectx_t &ctx, jl_value_t *p)
return literal_static_pointer_val(p);
Value *pgv = literal_pointer_val_slot(ctx, p);
return tbaa_decorate(tbaa_const, maybe_mark_load_dereferenceable(
ctx.builder.CreateLoad(T_pjlvalue, pgv), false, jl_typeof(p)));
ctx.builder.CreateAlignedLoad(T_pjlvalue, pgv, sizeof(void*)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these addition of alignment related? I thought this is the default alignment anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Atomic instructions are required to always specify alignment, so I was just uniformly examining everything already. LLVM also just now mostly prefers that we specify it specifically always, rather than depending on TargetData ("code will be correct, but may be inefficient" https://llvm.org/docs/Frontend/PerformanceTips.html#when-to-specify-alignment).

Type *ptrty = PointerType::get(elty, ptr->getType()->getPointerAddressSpace());
if (ptr->getType() != ptrty)
ptr = ctx.builder.CreateBitCast(ptr, ptrty);
if (idx_0based)
ptr = ctx.builder.CreateInBoundsGEP(r->getType(), ptr, idx_0based);
Instruction *store = ctx.builder.CreateAlignedStore(r, ptr, isboxed || alignment ? alignment : julia_alignment(jltype));
if (!isboxed && CountTrackedPointers(elty).count)
ctx.builder.CreateFence(AtomicOrdering::Release);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, I don't think the fence can replace atomic here. (also more places in C code). It "improves" the ordering but not atomicity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're confusing it with sequential consistency fences—that cannot be adequately defined as a fence, since it needs to be a property of the load/store operation itself. The release fence is documented to generally be okay here, and in practice is always okay on hardware.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, what I said is that as long as the store is not atomic LLVM is allowed to do whatever it want including making the store non-atomic.

src/array.c Outdated
@@ -1219,6 +1221,7 @@ JL_DLLEXPORT void jl_array_ptr_copy(jl_array_t *dest, void **dest_p,
}
}
memmove(dest_p, src_p, n * sizeof(void*));
jl_fence_release(); // to finish up, ensure contents of dest is now visible on other processors too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the missing atomicity aside, I think the fence needs to go before all stores to dest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I should stop using memmove also, and switch to llvm.memmove.element.unordered.atomic to be correct. This fence just means you can use a different (relaxed) variable as a flag to indicate the operation is done.

There is also a fence added at the start, to ensure that the pointers consume-loaded in this loop will also be valid to dereference (consume-load) on another thread.

@vtjnash
Copy link
Member Author

vtjnash commented Apr 21, 2020

Does Java have automatically memory barrier? I thought they only have that for volatile (atomic).

yes, llvm calls this unordered, which ensures it won't break up or invent from thin air the read/write

Requiring a stronger memory model than that will need to handle/disable all of the related optimizations/assumptions.

Yep, that's what this PR adds/changes 650 lines to deal with, as a start.

@yuyichao
Copy link
Contributor

yes, llvm calls this unordered, which ensures it won't break up or invent from thin air the read/write

Unordered is the issue I have elsewhere but that's not what I was talking about here. I mean the store must have a release order.

Yep, that's what this PR adds/changes 650 lines to deal with, as a start.

Right, I know, the point is that those have to add unacceptable amount of performance penalty. Of course if you just look at the instructions it will seem fine and I agree if the compiler can do no optimization that'l be the case. The point is that these are meant to disable a lot of single thread optimizations (due to the write barrier this will mostly affect read of course, though given the direction vector instructions are heading towards, I won't be surprised at all if the write barrier can be vectorized with some tweak). Some of the changes here already give a clear example of such penalty and I believe there are more cases for architectures that are not considered important/tested enough like AVX512 and SVE, both of which have scatter/gather. It seem to me that people working on wide instructions cares little about using those for synchronization and rightly so and it'll be a similar story for the compiler as well. Unless there's a guarantee that all future instruction extensions and compiler optimizations can support this well, I don't think it is a good move to kill all of them in a user visible way now.

@vtjnash vtjnash force-pushed the jn/globals-atomics branch from 3e9d9af to 61d8d44 Compare April 25, 2020 22:11
@vtjnash
Copy link
Member Author

vtjnash commented Apr 25, 2020

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@vtjnash
Copy link
Member Author

vtjnash commented Apr 26, 2020

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Additionally, Julia automatically will ensure memory safety in the presence of
some types of data-races.

When publishing a boxed value to a global global, that value can be read on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When publishing a boxed value to a global global, that value can be read on
When publishing a value to a global, that value can be read on


When publishing a boxed value to a global global, that value can be read on
another thread without additional synchronization. However, do not assume that
observing one value is set transitively implies anything about other values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
observing one value is set transitively implies anything about other values,
observing one value implies anything about other values,

src/toplevel.c Outdated
@@ -32,6 +32,7 @@ JL_DLLEXPORT int jl_lineno = 0; // need to update jl_critical_error if this is T
JL_DLLEXPORT const char *jl_filename = "none"; // need to update jl_critical_error if this is TLS

htable_t jl_current_modules;
jl_mutex_t jl_modules_mutex;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split this addition into a separate commit

@vtjnash
Copy link
Member Author

vtjnash commented Apr 27, 2020

I don't think it is a good move to kill all of them in a user visible way now.

This will only change the default behavior. The next step is still to add instructions that let the user specify the atomic model explicitly, allowing them to request the stronger seq_cst or weaker non-atomic models where required.

@yuyichao
Copy link
Contributor

This will only change the default behavior. The next step is still to add instructions that let the user specify the atomic model explicitly, allowing them to request the stronger seq_cst or weaker non-atomic models where required.

But then why should we change the default? I can see argument for either having non-atomic (what everyone else does) or full global seq_cst as default since one is what everything is built on top and has no effect on the majority of the code whereas seq_cst is could be the default when you want to use thread synchronization since that's the only one that an average person can intuitively reason about.

This PR is in some wierd place in between. You get neither fast nor safe code by default. The behavior cannot be reasoned easily and you get performance hit everywhere.


Put it in another way, there are two different types of code,

  1. Code that do no synchrnoization.

    This is the kind of code that are basically the same as (or is) single threaded code. You don't care at all about any memory orders and only the performance matters. This kind of code needs no memory barriers or atomics.

  2. Code that are doing synchronization.

    This is the kind of code where the thread-safety matters more than the performance. The preferred behavior for this kind of code is seq_cst (or acq_rel at least) since it is easy to reason about. Non-default more relaxed order should be allowed for better performance for advanced users.

All well written programs, multiple threaded or not, will have the majority of the code excuting non-synchrnoization (i.e. first kind). If not, you should get rid of the thread and remove all the synchronization since it's just wasting time. Because of that, the code most people need most of the time is going to have no atomics. If this was not the default, almost every single function everyone ever written will need a default version and a single thread version and I don't think that's practical.

Now, ideally, whatever the default is, one would want every function to support all possible memory orderings. Since that's clearly impractical (exccess code generation if nothing else) we should put the burden of missing function on the use cases that doesn't need as much. Even without all the functions having a strong memory ordered version, one can still usually write slower but correct code of the second kind using simpler synchrinization preemptives, which is the main thing that matters for that kind of code. OTOH, without all the functions having a non-atomic version, it is impossible to write performant code of the first kind which is the main goal of that kind.

@vtjnash
Copy link
Member Author

vtjnash commented Apr 27, 2020

what everyone else does

Because it's not. In particular, the second most popular language in the world uses this model (https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/?utm_source=rss&utm_medium=rss&utm_campaign=language-rankings-1-20).

full global seq_cst

Because it's massively expensive everywhere, so mostly only python does this.

The behavior cannot be reasoned easily and you get performance hit everywhere

It's the x86 model. It's better than we do now, and not as understandable as seq_cst. But it also has no performance cost on x86. That's why I ran the benchmarks—to show this doesn't affect them.

@yuyichao
Copy link
Contributor

yuyichao commented Apr 27, 2020

Because it's not. In particular, the second most popular language in the world uses this model (https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/?utm_source=rss&utm_medium=rss&utm_campaign=language-rankings-1-20).

Everyone as in everyone that can decide after things have converged and also where performance matters.

It's the x86 model. It's better than we do now, and not as understandable as seq_cst.

No it's not because no one is coding against the hardware, you are coding against the compiler model. It is not the model you see when you write code for a x86 machine.
It will be the "x86" model if you are generating naive code, using no libc function and no compiler optimizations, but that's far from the case.

But it also has no performance cost on x86. That's why I ran the benchmarks—to show this doesn't affect them.

And that's exactly why I mentioned before you even run the benchmarks that they won't make sense here. In particular, are you running the benchmarks on a computer that has high performance scatter gather (with predicate) and are you running benchmarks where those matters? You already showed that isassigned is affected and it is the only case that you find only because you don't have gatter load.
Also, again, how do you garantee that all future optimizations are not affected? Given that few if any new vector extensions make strong guarantee of atomicity or ordering.

Also, I don't think it's OK to be this x86 centric. If something already work on x86 and could exist elsewhere it is certainly OK to have it. However, it is wrong to have everyone emulate x86.

@JeffBezanson
Copy link
Member

The behavior cannot be reasoned easily and you get performance hit everywhere.

Post a benchmark?

I would also like to see a test case that crashes or otherwise misbehaves (presumably on non-x86) without this change. Then we'll be better able to assess the trade-off here.

@yuyichao
Copy link
Contributor

f(x) = @inbounds begin; a = 0; for i = 1:length(x); a += isassigned(x, i) ? x[i][] : 0; end; a; end for Vector{RefValue{Int}}. This should vectorize on avx512 and shouldn't vectorize if the load is atomic. (testing in C shows the difference in th output code).

In general, this disables vectorizations on load of references so it'll mainly show up on x86 when optimiztions on such operation has significant effect. Other than NULL check, a load of the reference can only vectorize with scatter/gatter and unless the field can't be NULL (Tuple for example) a predicate on the load is also required hence the requirement of avx512 or sve to see significant difference.

As a manual version of that, as I already mentioned, no normal memcpy or memmove call can be used on reference arrays since they provide no atomicity or ordering guarantee. (i.e. the PR isn't even implementing the ordering guarantee yet)

IIRC, it may also affect forwarding (two loads on the same field can't be be merged into one) but I'm less sure about that and it'll certainly harder to come up with a standalone test case where it affect performance significantly (It may still affect other performance but a case where multiple optimizations interacts with each other is much harder to come up with).

When not limited to x86 or current optimization limit. I have done benchmark before that the release store is much slower on arm/aarch64 than a normal store, easily slower than our write barrier which is just a (hopefully predictable) branch... As mentioned above, I also don't see why our write barrier can't be vectorized using scatter/gatter. It requires much better LLVM vectorizer integration and a vectorized root queueing function for sure but it should be doable but won't be anymore with this change. Generally speaking, again as already mentioned above, almost none of the new vectorize feature are for synchronization and adding this will disable all of those. AFAICT even avx512 usage seems to be just getting started.....


As for a counter example, see https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering for the C++ example. It'll be very hard to produce a wrong result, especially on x86 since the naive codegen will easily give stronger guaruantee. You need to convince the optimizer to reorder instructions to produce a program that misbehave on x86 which can easily give the illusion that this is a small change from what we already have. It should be much easier to do on arm/ppc (as well as the performance hit).

@JeffBezanson
Copy link
Member

Can you provide numbers? I'd like to know what the actual slowdown is. Non-x86 is fine.

@yuyichao
Copy link
Contributor

Tested on a cortex-a9, which is the only hardware mentioned above that I currently have access to.

Normal store takes 5 ns, release store takes 16 ns.

I don't have a avx512 machine to test on but you can assume whatever difference from vectorzation.

@tkf
Copy link
Member

tkf commented Apr 29, 2020

(I'm still a noob when it comes to the memory model, but I think there might be other relevant points, on top of what yuyichao has been mentioning.)

As a user, I'd appreciate it if the subtle examples whose behavior would be defined by this PR are actually left undefined, so that the compiler (or tools like TSAN?) can actually throw an error or warning when finding patterns like this. If the behavior is well-defined, it cannot be turned to an error because that would break existing code.

Furthermore, I'd prefer to read code that is explicit about memory ordering especially when it's doing something tricky. Even after this PR, I suspect the best thread programming practice in Julia would be to avoid the implicit default ordering (even when that's exactly what you need) as that would be hard to notice when it's embedded in an innocent-looking piece of code. If that's the case, the only case that the behavior defined by this PR would be used would likely be by accidents. However, the compiler can't tell the user that there is a "code smell" because the behavior is actually well-defined.

In other words, defining this behavior is not free, even in x86. The cost is the ability to tell the programming errors to the users.


Here are some examples by @vtjnash who helped me understanding this PR in slack:

function f1()
    local x
    @sync begin
        @spawn begin
            while !isdefined(x)
            end
            println(typeof(x))
        end
        @spawn x = inferencebarrier(12345678) + 1
    end
end

function f2()
    local x = 1
    @sync begin
        @spawn while true
            println(typeof(x))
        end
        @spawn while true
            x += 1
        end
    end
end

This one is mine but @vtjnash confirmed that this would also be well-defined:

struct MutableStruct
    a::Any
end

function f3()
    local x
    @sync begin
        @spawn begin
            while !isdefined(x)
            end
            println(x.a)
        end
        @spawn x = MutableStruct(1)
    end
end

(Just sharing mostly for myself so that it'd be easy to look up later.)

(Edit: fixed typos s/@async/@sync/)

@vtjnash
Copy link
Member Author

vtjnash commented Apr 30, 2020

Everyone as in everyone that can decide after things have converged and also where performance matters.

So essentially nobody?

I didn't just make this stuff up. Of the top languages e.g. (https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/?utm_source=rss&utm_medium=rss&utm_campaign=language-rankings-1-20), many use this model or stronger (single threaded / seq_cst):

Java does https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4
C# does https://docs.microsoft.com/en-us/archive/msdn-magazine/2013/january/csharp-the-csharp-memory-model-in-theory-and-practice-part-2#c-memory-model-implementation-on-arm
Swift is also trying to figure this out this month it appears (https://github.com/apple/swift-evolution/blob/master/proposals/0282-atomics.md and https://forums.swift.org/t/low-level-atomic-operations/34683). But I wonder if their required reference-counting updates already acted as this memory barrier, so they're perhaps already paying this performance penalty, just more often.

I'd posit that Rust, C, and C++ are actually the outliers (and I give rust a bit of a pass due to its linear type system, since you can't write the operation if it'd be an implied data race).

Tested on a cortex-a9

Would be interesting to know about a newer machine too, since ARMv8 added the required instruction (STLR) in 2011 to the ISA. While it may still be micro-coded the same way in some CPUs, it has the ability now to be implemented faster.

@tkf
Copy link
Member

tkf commented Apr 30, 2020

many use this model or stronger

Do they define such ordering semantics even without qualifiers/annotations like volatile and atomic?

@yuyichao
Copy link
Contributor

many use this model or stronger (single threaded / seq_cst):
I'd posit that Rust, C, and C++ are actually the outliers

Single threaded is exactly what the default model should be. The point is exactly about not paying the panenty by default. And I'll also say that many of the languages in that list aren't something to compare the performance with. It is a no brainer that having a overly strong memory model will make multithread behavior easier to understand which will certainly make the language more popular if performance isn't even a major concern.

Would be interesting to know about a newer machine too, since ARMv8 added the required instruction (STLR) in 2011 to the ISA. While it may still be micro-coded the same way in some CPUs, it has the ability now to be implemented faster.

That was what I have tested, on cortex-a57. I don't have the access now but I'm pretty sure it is also at least 2x slower if not more.

@vtjnash
Copy link
Member Author

vtjnash commented Apr 30, 2020

Do they define such ordering semantics even without qualifiers/annotations like volatile and atomic?

Yes. While not explicitly required by the ECMA-334 C# standard, it is likely the cheapest way to implement the requirements of section 15.5.5, and is known to be the current implementation.

@vtjnash vtjnash added the triage This should be discussed on a triage call label Apr 30, 2020
@yuyichao
Copy link
Contributor

FWIW, the cost here is about 33% (0.756367s vs. 0.566988s). And it only really applies to this precise example function, there isn't another one possible, regardless of instruction set, since anything else would contain a UndefRefError check.

Which is exactly why I was talking about scatter gather. That's how the vectorize can have a bigger effect and also would be more useful. It's also the area that the optimization or even the hardware isn't as mature as it would be so any current benchmark would only be a lower bound on how bad it would be in the future.

That's an old version. Read the current one. It says that it must be impossible to see a value before it's been initialized.

Can you give a link? Not that it's very important here though since it's not part of the stack we use nor does it have very similar performance goal afaict.
Does it have that guarantee despite undefined behavior?

Not true, check code_llvm here: LLVM still applies the same optimization after we define the behavior.

I don't have your version to test but the c version I've tested shows that a relaxed atomic load correctly disables the hoist.

@tkf
Copy link
Member

tkf commented May 1, 2020

Is this the latest ECMA-334 C# standard? https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-334.pdf#page=308

Quoting:

The initial value of a field, whether it be a static field or an instance field, is the default value (§10.3) of the field’s type. It is not possible to observe the value of a field before this default initialization has occurred, and a field is thus never “uninitialized”

But we don't have this strict semantics (always initialized to default value) even in the current Julia used with a single thread, right?

@vtjnash
Copy link
Member Author

vtjnash commented May 5, 2020

But we don't have this strict semantics (always initialized to default value) even in the current Julia used with a single thread, right?

We have stricter semantics than this actually, though you can opt out of it partially. But in single-threaded code, we ensure you can't even see the object not only before the default initialization (which we do have), but also before construction is finished.

@vtjnash
Copy link
Member Author

vtjnash commented May 5, 2020

the c version I've tested shows that a relaxed atomic load correctly disables the hoist

Okay, but let's talk about the actual PR proposal here. C does not include a specification equivalent to the java/C# memory model, so it's not useful as a test platform of this PR. Though with TBAA, clang theoretically could still prove that hoisting was valid, I don't expect that as being particularly productive to write such an optimization analysis into that pass.

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2020

Okay, but let's talk about the actual PR proposal here. C does not include a specification equivalent to the java/C# memory model, so it's not useful as a test platform of this PR. Though with TBAA, clang theoretically could still prove that hoisting was valid, I don't expect that as being particularly productive to write such an optimization analysis into that pass.

If C doesn't have that then neither is LLVM, and this is not even talking about the other cases that is breaking the atomicity in the runtime C code. Without having at least as strong as relaxed load I do not believe LLVM can actually provide such a guarantee. Sure the generated code might happen to work on x86 or even arm since the compiler isn't able to find a way to break it but that's definitely not how one should write multithreaded code.

Also, TBAA has nothing to do with the optimization here. An atomic load within a loop, with TBAA or not is a synchronization and it is well defined to write to that location from another thread. OTOH, hoisting that load may cause the program to never termiate which is undefined so the compiler is not allowed to do that.

@tkf
Copy link
Member

tkf commented May 6, 2020

you can opt out of it partially

OK, I'm thinking this (new with fewer arguments) but I guess bringing up here was not very meaningful as I explicitly opt-out the initialization.


By the way, I find an interesting example in CppCon 2016: Hans Boehm “Using weakly ordered C++ atomics correctly" - YouTube (around 31:35) which is close to the MutableStruct example:

image

So, from the same reasoning, does this mean

struct MutableStruct
    a::Any
end

function f4()
    local x
    y = MutableStruct(1)
    @sync begin
        @spawn begin
            while !isdefined(x)
            end
            if x === y
                println(x.a)
            end
        end
        @spawn x = y
    end
end

can print something other than 1?

@vtjnash
Copy link
Member Author

vtjnash commented May 6, 2020

An atomic load within a loop is a synchronization

Quoth the standard: "[this ordering has] no synchronization or ordering constraints"

If C doesn't have that then neither is LLVM

Quoth the langref: "[this ordering] is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables". It has no equivalent expression in the C memory model.

Sure the generated code might happen to work

Since using that part of Java's memory model is what I'm proposing here, I'm relatively confident that it's not just happening to work, but is working exactly as per design.

@vtjnash
Copy link
Member Author

vtjnash commented May 6, 2020

explicitly opt-out the initialization

FWIW, that's only partially opting out. It's impossible to fully opt out (which would imply not even setting the type tag, and would definitely wreak havoc with single-threaded use)

can print something other than 1?

Not with this PR (since that's always the value it'd get regardless of which pointer it followed), though it's dangerously close to being true if you make some small changes. Though it is also permitted (expected?) never to return at all (that the while loop shouldn't synchronize, so post-optimization it shouldn't stop).

Here's the small change I think you need to make this example interesting:

struct MutableStruct
    a::Any
end

function f5()
    local x
    y = MutableStruct(1)
    @sync begin
        @spawn begin
            while !isdefined(x); end
            if x === y
                println(x.a)
            end
        end
        @spawn (y.a = 2; x = y)
    end
end

So even though y.a = 2 seems to appear before x = y in source code order, and even though x = y is a release store (to the content field of the Box), there's no constraint anymore (relaxed) that the load of x.a happens after the load of x

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2020

I didn't realize unorder is defined this way in LLVM. So the fact that it still allow certain optimization make sense, (though TBAA won't ever help). I would still want to see evidence that vectorization isn't disabled (especially scatter/gatter) though and that does not seem to be the case currently.

However, what's the documented behavior of the "java memory model"? From the description of this ordering here it does not mention anything that'll make it a consume ordering. In another word, reading through the pointer requires a stronger ordering than relaxed (llvm monotonic) and this one seems to be weaker.

Though it is also permitted (expected?) never to return at all (that the while loop shouldn't synchronize, so post-optimization it shouldn't stop).

I believe that's why most of the examples usually don't use a loop....

So even though y.a = 2 seems to appear before x = y in source code order, and even though x = y is a release store (to the content field of the Box), there's no constraint anymore (relaxed) that the load of x.a happens after the load of x

Which means that even though the boxed value was "published to a global variable" (btw the document says "global global") the value it puslished cannot be read. You said in the top comment that

that it is valid to assume consume ordering on that read

and I believe consume ordering WILL have that guarantee (and will be stronger than relaxd and will disable vectorization and a bunch of other single thread optimization). It is the consume ordering that I don't think we should have by default so if that's not the case (in particular if the new ordering allows almost all of the single thread optimizations) then it'll certainly be more likely to be OK. If that's the case, I think the document and that comment should be updated and there should be a link or copy of the precise model you want. (FWIW, even if it is consume ordering such a link or copy is still necessary but it just isn't as necessary during the review for me since I know where to find the C++ version)

@vtjnash
Copy link
Member Author

vtjnash commented May 6, 2020

From the description of this ordering here it does not mention anything that'll make it a consume ordering

Correct—it's not consume ordering on all processors, just all of the ones we'd ever care about (the exception is the Alpha).

and I believe consume ordering WILL have that guarantee. If that's the case, I think the document and that comment should be updated and there should be a link or copy of the precise model you want. (FWIW, even if it is consume ordering such a link or copy is still necessary but it just isn't as necessary during the review for me since I know where to find the C++ version)

Consume ordering is not well-defined presently, due partly to the counterexample you gave. I don't know what their thinking is yet for correcting it. I'd perhaps suggest that we treat these atomics like their namesakes, electrons, and claim that interacting them results in an entanglement and a superposition of their properties, haha.

what's the documented behavior of the "java memory model"

It's documented in JSR-133 here https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html. I think the mc in Java 17.5.1. (Semantics of final Fields) is a specification of a consume load if the data is not later changed (which introduces a second data race, and causes the problems discussed above). I'll update my documentation to clarify that this guarantee only applies to immutable objects and their type tags. That also will at least make it so that the === test you used there has defined behavior for all objects.

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2020

Correct—it's not consume ordering on all processors, just all of the ones we'd ever care about (the exception is the Alpha).

Well, if it is not consume ordering than it is not anywhere. Relying on the hardware behavior assuming some compiler limitation is not a ordering. Counterexample above

Consume ordering is not well-defined presently, due partly to the counterexample you gave. I don't know what their thinking is yet for correcting it. I'd perhaps suggest that we treat these atomics like their namesakes, electrons, and claim that interacting them results in an entanglement and a superposition of their properties, haha.

I didn't think I gave an counter example. If you meant the one you and @tkf gave above e.g. #35535 (comment) and #35535 (comment) than it is very well defined for consume ordering I believe. In fact, I believe both the definition of the ordering and the hardware behavior for naive codegen guarantees that everything in the branch can observe everything released by the synchronizing store. The branch itself, not just reading through the pointer counts as a dependency on the consume load. The only orderings where the result is not guaranteed to see the change is relaxed or weaker.

It's documented in JSR-133 here https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html. I think the mc in Java 17.5.1. (Semantics of final Fields) is a specification of a consume load if the data is not later changed (which introduces a second data race, and causes the problems discussed above).

mc seems to be describing an order (similar to the one) needed for synchronization for a consume load but it doesn't seem to guarantee such order is "happens before" unless an finalization of a field is involved. I can see how that can map to our immutable object (or even initialization of mutable ones). However, I don't see how that maps to the unordered load or any particular LLVM construct. The documentation for unordered seems to be pretty clear that it is used to

guarantee that the generated code never exhibits undefined behavior

and even explicitly mention that (as you've quoted)

This cannot be used for synchronization

AFAICT, a java frontend might as well lower load from a final field as an acquired load (with everything else being unordered) and that'll satisfy this requirement just fine. This can for sure also be implemented using much more expensive synchroization during finalization of the object.

In another word, just from the document presented, it's clear that for a final field in java the read through pointer is well defined. I can see how that could be implemented efficiently with a custom compiler but I don't see a clear and well defined mapping to LLVM. AFAICT without an consume or acquire load LLVM will not guarantee on the dependency being honored. There should be some document on how this mapping can be done, what if any assumption (outside of standard C memory model at least) it is making and how are those assumptions guaranteed.

@vtjnash
Copy link
Member Author

vtjnash commented May 6, 2020

it is very well defined for consume ordering

Consume ordering was removed from the standard in C++17 due to problems with the original attempt at defining it and the pitfalls it entailed, discussion at https://lwn.net/Articles/586838/ and presumably in the YouTube video.

that'll satisfy this requirement just fine

Only if you ignore the rest of the document

I don't see a clear and well defined mapping [from Java] to LLVM

From the discussion here, seems like the issue is on you not reading the LLVM LangRef, and not on the lack of information therein. I know I initially related this to the C11 consume ordering, though I've since intended to clarify that to claim instead the Java memory model for this PR. That is also literally exactly what the LangRef says this does ("This is intended to provide a guarantee strong enough to model Java's [requirements]"). Why are you claiming that this won't do what Java intends and that Java needs to use something else than this? ("[The LangRef needs to] document on how this mapping [from Java to LLVM] can be done")

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2020

Consume ordering was removed from the standard in C++17 due to problems with the original attempt at defining it and the pitfalls it entailed, discussion at https://lwn.net/Articles/586838/ and presumably in the YouTube video.

It seems to be still inn the sandard. It is being revised and that's why I mentioned the hardware behavior on the branch being a dependency which should be the behavior that the standard try to match.

Only if you ignore the rest of the document

Hhmmm, what I said is that

lower load from a final field as an acquired load

which is definately stronger than everything else mentioned here. If this cannot satisfy the requirement I don't see what can. Which other part of the document that I'm ignoring? (I mean, sure there are other types that I'm not talking about and I'm not implying that a volaltile/atomic load can be implemented as unordered. I just don't see why lowering this one as acquired will violate anything)

From the discussion here, seems like the issue is on you not reading the LLVM LangRef, and not on the lack of information therein. I know I initially related this to the C11 consume ordering, though I've since intended to clarify that to claim instead the Java memory model for this PR. That is also literally exactly what the LangRef says this does ("This is intended to provide a guarantee strong enough to model Java's [requirements]"). Why are you claiming that this won't do what Java intends and that Java needs to use something else than this? ("[The LangRef needs to] document on how this mapping [from Java to LLVM] can be done")

It's because I don't want to read from the document more than what it says explicitly. I fully agree that the unordered ordering could/should be used to match part of the java memory model that doesn't exist in C. I don't believe the LLVM document implies that this ordering is all what you need to implement all different types of loads in java (it's missing atomics at the least). And that's why I'm claiming that if it is intended to provide more than what what the document says those should also be explicitly stated. For the record, other than

This is intended to match the Java memory model for shared variables.

which AFAICT does not talk about finals, everything else that LLVM documented are just

This cannot be used for synchronization

and

a load cannot see a value which was never stored

Another reason this cannot possibly be the whole story is that a load ordering has to match with a store ordering. That's why the C++ document metions all ordering between relaxed and seq-cst in pairs.

@vtjnash
Copy link
Member Author

vtjnash commented May 6, 2020

read from the document more than what it says explicitly.
could/should be used to match part of the java memory model

The LangRef says that this explicitly provides the java memory model. I realize that is nearly the only thing it says. But claiming it might not be useful for java is utter nonsense when it's explicitly defined for that exact purpose. The LangRef also explains it in slightly more detail in the linked document: https://llvm.org/docs/Atomics.html

missing atomics at the least

Are you ever going to read the document you keep quoting? They are in fact also explicitly covered a few paragraphs later in the LangRef: "This corresponds to...Java volatile" (In Java's vocabulary, volatile is defined to be the same as what C calls _Atomic.)

cannot possibly be the whole story

Yeah, there's recent literature that talks about how this is a deficiency in the standard, since it doesn't accurately reflect the behavior of hardware. I read a lot of it last week, and can pass along links, but basically it boils down to saying that the standard is not the whole story, since reasonable possible physical implementations ends up being different from it.

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2020

But claiming it might not be useful for java is utter nonsense when it's explicitly defined for that exact purpose.

When did I say this? I said I believe it can be used for java, but it won't be the only thing java needs.

Are you ever going to read the document you keep quoting? They are in fact also explicitly covered a few paragraphs later in the LangRef: "This corresponds to...Java volatile" (In Java's vocabulary, volatile is defined to be the same as what C calls _Atomic.)

Yes I did and yes I did later realized that atomic isn't a good example since the document did specify "shared variable" which can probably at least rule out volatiles. It was just the few special memory ordering tools from java that I know of. OTOH, I am correct to say that this (unordered) is everything which is exactly the argument. Even for non-volatile I don't believe this is the only thing java need for or at least I don't see how that can be inferred from the document. The java doc you've linked about mc and final seems to be something special, much like volatiles are special and unless that part can be derived from other part I don't see how adding additional constraints can still allow the same implementation.

Yeah, there's recent literature that talks about how this is a deficiency in the standard, since it doesn't accurately reflect the behavior of hardware. I read a lot of it last week, and can pass along links, but basically it boils down to saying that the standard is not the whole story, since reasonable possible physical implementations ends up being different from it.

That's not at all what I mean. What I meant is

  1. The LLVM document mentioned shared variables but it seems that final that gives the semantic you want should require additional synchronization. It does not seem to be mentioned anywhere.
  2. The LLVM guide for optimization for unordered only prevent spliting the load/store and encourages reordering (i.e. speculation from compiler). This is exactly what give rise to the case where relaxed isn't enough to observe a store from another thread so I don't see how unordered can prevent that case.
  3. Ordering always require pairing between two barriers/atomics and LLVM/C++ doc only describe them for atomics stronger than relaxed/monotonic. It's unclear from the document what ordering one can get when mixing unordered and other orders.

@vtjnash vtjnash force-pushed the jn/globals-atomics branch 2 times, most recently from 22bab6e to 89462f8 Compare May 7, 2020 00:53
@vtjnash
Copy link
Member Author

vtjnash commented May 12, 2020

Update: from various external discussions, I think the current proposal for this PR is to change most of the release stores to unordered (or relaxed) and change the loads as herein. This makes it sufficient to use explicit fences to implement some of the the stronger models (acquire and release, but probably not sequentially-consistent). And then possibly resume the discussion later of whether to upgrade these defaults to release/consume.

since the document did specify "shared variable" which can probably at least rule out volatiles

In Java, "shared" means anything non-local. It's orthogonal to volatile.

final that gives the semantic you want should require additional synchronization

In Java, every object effectively has at least one final field (the typeof tag), thus it's required for all objects.

 It's unclear from the document what ordering one can get when mixing unordered and other orders

True, it's then something implementation defined. For all except Alpha, this gives us the proposed release/consume for all operations that occurred in the constructor.

@vtjnash
Copy link
Member Author

vtjnash commented Jun 24, 2020

Triage call's conclusion still is to update this PR to change most of the release stores to unordered (or relaxed) and keep the changes to the loads as herein.

vtjnash added 3 commits July 1, 2020 17:48
The read may still be torn, as that still requires fixes elsewhere, so
this documentation is not yet actually correct.
This lets us say that anytime you observe the store of a pointer, it's
valid to assume consume ordering on that read (on all processors except
the Alpha). It's also already the default behavior on the x86 hardware,
this just enforces the ordering on the compiler too.
And add load/store alignment annotations, because LLVM now prefers that
we try to specify those explicitly, even though it's not required.

This does not yet include correct load/store behaviors for objects with
inlined references (the recent #34126 PR).
@vtjnash vtjnash force-pushed the jn/globals-atomics branch from 89462f8 to 4a50693 Compare July 1, 2020 21:52
@vtjnash vtjnash closed this Jul 1, 2020
@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Jul 16, 2020
@DilumAluthge DilumAluthge deleted the jn/globals-atomics branch March 25, 2021 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants