[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528

jmid · 2025-01-19T20:06:06Z

On #521 the musl trunk workflow triggered a dummy found! error exiting the Lin Dynarray stress test:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12846368030/job/35821476446?pr=521

random seed: 225104823
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain (generating)
[✓]    2    0    1    1 / 1000     2.1s Lin Dynarray test with Domain

dummy found!
File "_build/.dune/default/src/dynarray/dune", line 3, characters 7-16:
3 |  (name lin_tests)
           ^^^^^^^^^
(cd _build/default/src/dynarray && ./lin_tests.exe --verbose)
Command exited with code 1.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray stress test with Domain

This is caused by these lines implementing a get instrumented with a tag check:

multicoretests/src/dynarray/lin_tests.ml

Lines 11 to 13 in a166b24

    
           let get_check a i = 
        
             let v = Dynarray.get a i in 
        
             if not (Obj.is_int (Obj.repr v)) then (Printf.eprintf "dummy found!\n%!"; exit 1) else v

IIUC, this means the test is observing a non-int dummy value from the unboxed implementation from ocaml/ocaml#12885.
A code comment in stdlib/dynarray.ml reads:

   [...] This dummy is *not* type-safe, it is not a valid value of type ['a], so
   we must be very careful never to return it to the user. After
   accessing any element of the array, we must check that it is not
   the dummy. In particular, this dummy must be distinct from any
   other value the user could provide -- we ensure this by using
   a dynamically-allocated mutable reference as our dummy.

At the same time, the module is documented to be unsafe for parallel usage:

Unsynchronized accesses

    Unsynchronized accesses to dynamic arrays are a programming error.

   Concurrent accesses to dynamic arrays must be synchronized
   (for instance with a {!Mutex.t}). Unsynchronized accesses to
   a dynamic array are a programming error that may lead to an invalid
   dynamic array state, on which some operations would fail with an
   [Invalid_argument] exception.

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

The text was updated successfully, but these errors were encountered:

OlivierNicole · 2025-01-20T10:22:26Z

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

It shouldn’t be, and yet it seems that it is.

I guess my next move, if I wanted to track this, would be to modify the test to return a result or something like that, so that qcheck-stm will show me the sequence of commands that led to the error rather than merely exiting; and try to work it out from there.

Cc-ing @gasche who might be interested to know.

gasche · 2025-01-20T12:10:05Z

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

Slightly more information: there is a dynamic check in the Dynarray implementation that should ensures that we fail with an exception instead of ever returning a dummy value. Here this check does not suffice to guarantee this property. Different dynarrays may have different dummy values (but only marshalling creates distinct dummies, and I suppose you don't exercise it?), and it may be the case that an implementation bug causes two different dummies to be mixed up in the same backing array, which would result in the dynamic check being incomplete. For example, the implementation of blit looks suspiciously complex and could hide such a bug.

gasche · 2025-01-20T12:30:58Z

I suppose that the operations being exercised are the ones listed in

multicoretests/src/dynarray/lin_tests.ml

Lines 17 to 32 in 05a5ac9

    
           [ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem); 
        
             val_ "set" Dynarray.set (t @-> int @-> elem @-> returning_or_exc unit); 
        
             val_ "length" Dynarray.length (t @-> returning int); 
        
             val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit); 
        
             val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit); 
        
             val_ "get_last" Dynarray.get_last (t @-> returning_or_exc elem); 
        
             val_ "pop_last" Dynarray.pop_last (t @-> returning_or_exc elem); 
        
             val_freq 2 "remove_last" Dynarray.remove_last (t @-> returning_or_exc unit); 
        
             val_ "clear" Dynarray.clear (t @-> returning_or_exc unit); 
        
             val_ "truncate" Dynarray.truncate (t @-> int @-> returning_or_exc unit); 
        
             val_ "ensure_capacity" Dynarray.ensure_capacity (t @-> int @-> returning_or_exc unit); 
        
             val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit); 
        
             (*val_ "blit" Dynarray.blit (t @-> int_not_too_big @-> t @-> int_not_too_big @-> int_not_too_big @-> returning_or_exc unit);*) 
        
             val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit); 
        
             val_ "reset" Dynarray.reset (t @-> returning_or_exc unit); 
        
           ]

I see in particular that blit is commented out, so it should not be the one causing this issue. (It's not obvious whether there is marshalling involved in the machinery around it; could a dynarray used in the test be produced by marshalling then unmarshalling?)

jmid · 2025-01-20T12:45:15Z

Thanks for the feedback both! 🙏

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

The test just exercises the operations you list above. Indeed neither blit nor marshalling is involved.
Since the Lin test exits with the above message when encountering a dummy value, we regrettably don't know which sequence of operations caused it... 😬
As Olivier suggests, adjusting the STM test to express "unexpected value", e.g., as a result type, should allow us to print and understand what caused it.

I'll see if I can create a reproducer. A few quick local experiments however indicate that this is a rare, hard-to reproduce one, which also explain why it hasn't shown up before. Further note, that it showed up on a musl workflow, which tend to exhibit sufficiently different behaviour from glibc to surface things like these...

OlivierNicole · 2025-01-27T17:17:08Z

I am also trying at the moment; working with a modified version of src/dynarray/stm_tests that ignores every command result except if get returns a non-int value (on a int Dynarray.t). So far, working on a musl switch, I have been unable to reproduce it in many thousand runs.

jmid · 2025-01-27T21:55:34Z

We can leave out the "multiple-t"-accepting functions from the Dynarray API, as only the ones tested by Lin were enough to trigger the bug. Perhaps also double check that the distributions (say for int, etc.) on the Lin and STM tests agree, to avoid one generating, e.g., 0-100 and the other ints uniformly.

OlivierNicole · 2025-01-29T09:41:32Z

I am now trying to reproduce by running https://github.com/OlivierNicole/multicoretests/blob/8bd670363da181d8c81bed0a0dff93e3622f9d95/src/dynarray/stm_tests.ml, but no luck so far.

OlivierNicole · 2025-01-30T14:21:25Z

Update: I have been running it in a loop for about 3 days now during the day, without being able to reproduce. The trunk commit being exercised is ocaml/ocaml@b5420186c7.

Either the bug is extremely unlikely to happen, or there is an error in my test, or the problem is specific to an earlier version of the compiler or to the CI machine.

jmid · 2025-01-31T11:40:25Z

Thanks for the update!

Some thoughts:

One option is to check the versions of the problematic run of gcc, musl, ... (it could potentially be version specific)
If the problem is not reproducible locally, perhaps we should go back to CI - and the Lin test even?
Previously, we've used "focused tests" to help with rerunning a single test (mostly in CI). Here's an example I just kicked off: https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl
The seed may be important to trigger it - or it may not 🤷
As this may be GC-specific you may have luck reproducing more often when playing with GC parameters. I've used OCAMLRUNPARAM's s and o in the past.
If this is timing-specific, I've previously had luck with inserting a strategic sleep to trigger something more often, after having eyeballed the relevant code and built a suspicion
Perhaps this triggers more often with more spawned domains? A custom Lin stress property with 3 (or 4) domains is a possibility

Finally, we can't rule out that this is not due to a Dynarray error but to something else.
This remark on another issue indicates that there may be a rare GC bug lurking still.

jmid · 2025-01-31T11:55:56Z

If the problem is not reproducible locally, perhaps we should go back to CI - and the Lin test even?

Previously, we've used "focused tests" to help with rerunning a single test (mostly in CI). Here's an example I just kicked off: https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl

Oh wow - it triggered 5/100 on 5.3 and 6/100 on trunk apparently! 😮

jmid mentioned this issue Jan 19, 2025

Expand on Lin Bytes test and add stress test #521

Merged

jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jan 20, 2025

jmid changed the title ~~dummy found! in Lin Dynarray stress test with Domain on musl trunk~~ [ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk Jan 20, 2025

OlivierNicole mentioned this issue Jan 27, 2025

Improve reporting in Dynarray lin test #529

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528

[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528

jmid commented Jan 19, 2025

OlivierNicole commented Jan 20, 2025

gasche commented Jan 20, 2025

gasche commented Jan 20, 2025

jmid commented Jan 20, 2025

OlivierNicole commented Jan 27, 2025

jmid commented Jan 27, 2025

OlivierNicole commented Jan 29, 2025

OlivierNicole commented Jan 30, 2025

jmid commented Jan 31, 2025

jmid commented Jan 31, 2025

[ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk #528

[ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk #528

Comments

jmid commented Jan 19, 2025

OlivierNicole commented Jan 20, 2025

gasche commented Jan 20, 2025

gasche commented Jan 20, 2025

jmid commented Jan 20, 2025

OlivierNicole commented Jan 27, 2025

jmid commented Jan 27, 2025

OlivierNicole commented Jan 29, 2025

OlivierNicole commented Jan 30, 2025

jmid commented Jan 31, 2025

jmid commented Jan 31, 2025

[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528

[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528