Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk #528

Open
jmid opened this issue Jan 19, 2025 · 10 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Copy link
Collaborator

jmid commented Jan 19, 2025

On #521 the musl trunk workflow triggered a dummy found! error exiting the Lin Dynarray stress test:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12846368030/job/35821476446?pr=521

random seed: 225104823
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain (generating)
[✓]    2    0    1    1 / 1000     2.1s Lin Dynarray test with Domain

dummy found!
File "_build/.dune/default/src/dynarray/dune", line 3, characters 7-16:
3 |  (name lin_tests)
           ^^^^^^^^^
(cd _build/default/src/dynarray && ./lin_tests.exe --verbose)
Command exited with code 1.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray stress test with Domain

This is caused by these lines implementing a get instrumented with a tag check:

let get_check a i =
let v = Dynarray.get a i in
if not (Obj.is_int (Obj.repr v)) then (Printf.eprintf "dummy found!\n%!"; exit 1) else v

IIUC, this means the test is observing a non-int dummy value from the unboxed implementation from ocaml/ocaml#12885.
A code comment in stdlib/dynarray.ml reads:

   [...] This dummy is *not* type-safe, it is not a valid value of type ['a], so
   we must be very careful never to return it to the user. After
   accessing any element of the array, we must check that it is not
   the dummy. In particular, this dummy must be distinct from any
   other value the user could provide -- we ensure this by using
   a dynamically-allocated mutable reference as our dummy.

At the same time, the module is documented to be unsafe for parallel usage:

Unsynchronized accesses

    Unsynchronized accesses to dynamic arrays are a programming error.

   Concurrent accesses to dynamic arrays must be synchronized
   (for instance with a {!Mutex.t}). Unsynchronized accesses to
   a dynamic array are a programming error that may lead to an invalid
   dynamic array state, on which some operations would fail with an
   [Invalid_argument] exception.

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

@OlivierNicole
Copy link
Contributor

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

It shouldn’t be, and yet it seems that it is.

I guess my next move, if I wanted to track this, would be to modify the test to return a result or something like that, so that qcheck-stm will show me the sequence of commands that led to the error rather than merely exiting; and try to work it out from there.

Cc-ing @gasche who might be interested to know.

@gasche
Copy link

gasche commented Jan 20, 2025

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

Slightly more information: there is a dynamic check in the Dynarray implementation that should ensures that we fail with an exception instead of ever returning a dummy value. Here this check does not suffice to guarantee this property. Different dynarrays may have different dummy values (but only marshalling creates distinct dummies, and I suppose you don't exercise it?), and it may be the case that an implementation bug causes two different dummies to be mixed up in the same backing array, which would result in the dynamic check being incomplete. For example, the implementation of blit looks suspiciously complex and could hide such a bug.

@gasche
Copy link

gasche commented Jan 20, 2025

I suppose that the operations being exercised are the ones listed in

[ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem);
val_ "set" Dynarray.set (t @-> int @-> elem @-> returning_or_exc unit);
val_ "length" Dynarray.length (t @-> returning int);
val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit);
val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit);
val_ "get_last" Dynarray.get_last (t @-> returning_or_exc elem);
val_ "pop_last" Dynarray.pop_last (t @-> returning_or_exc elem);
val_freq 2 "remove_last" Dynarray.remove_last (t @-> returning_or_exc unit);
val_ "clear" Dynarray.clear (t @-> returning_or_exc unit);
val_ "truncate" Dynarray.truncate (t @-> int @-> returning_or_exc unit);
val_ "ensure_capacity" Dynarray.ensure_capacity (t @-> int @-> returning_or_exc unit);
val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit);
(*val_ "blit" Dynarray.blit (t @-> int_not_too_big @-> t @-> int_not_too_big @-> int_not_too_big @-> returning_or_exc unit);*)
val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit);
val_ "reset" Dynarray.reset (t @-> returning_or_exc unit);
]

I see in particular that blit is commented out, so it should not be the one causing this issue. (It's not obvious whether there is marshalling involved in the machinery around it; could a dynarray used in the test be produced by marshalling then unmarshalling?)

@jmid
Copy link
Collaborator Author

jmid commented Jan 20, 2025

Thanks for the feedback both! 🙏

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

The test just exercises the operations you list above. Indeed neither blit nor marshalling is involved.
Since the Lin test exits with the above message when encountering a dummy value, we regrettably don't know which sequence of operations caused it... 😬
As Olivier suggests, adjusting the STM test to express "unexpected value", e.g., as a result type, should allow us to print and understand what caused it.

I'll see if I can create a reproducer. A few quick local experiments however indicate that this is a rare, hard-to reproduce one, which also explain why it hasn't shown up before. Further note, that it showed up on a musl workflow, which tend to exhibit sufficiently different behaviour from glibc to surface things like these...

@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jan 20, 2025
@jmid jmid changed the title dummy found! in Lin Dynarray stress test with Domain on musl trunk [ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk Jan 20, 2025
@OlivierNicole
Copy link
Contributor

I am also trying at the moment; working with a modified version of src/dynarray/stm_tests that ignores every command result except if get returns a non-int value (on a int Dynarray.t). So far, working on a musl switch, I have been unable to reproduce it in many thousand runs.

@jmid
Copy link
Collaborator Author

jmid commented Jan 27, 2025

We can leave out the "multiple-t"-accepting functions from the Dynarray API, as only the ones tested by Lin were enough to trigger the bug. Perhaps also double check that the distributions (say for int, etc.) on the Lin and STM tests agree, to avoid one generating, e.g., 0-100 and the other ints uniformly.

@OlivierNicole
Copy link
Contributor

@OlivierNicole
Copy link
Contributor

Update: I have been running it in a loop for about 3 days now during the day, without being able to reproduce. The trunk commit being exercised is ocaml/ocaml@b5420186c7.

Either the bug is extremely unlikely to happen, or there is an error in my test, or the problem is specific to an earlier version of the compiler or to the CI machine.

@jmid
Copy link
Collaborator Author

jmid commented Jan 31, 2025

Thanks for the update!

Some thoughts:

  • One option is to check the versions of the problematic run of gcc, musl, ... (it could potentially be version specific)
  • If the problem is not reproducible locally, perhaps we should go back to CI - and the Lin test even?
  • Previously, we've used "focused tests" to help with rerunning a single test (mostly in CI). Here's an example I just kicked off: https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl
  • The seed may be important to trigger it - or it may not 🤷
  • As this may be GC-specific you may have luck reproducing more often when playing with GC parameters. I've used OCAMLRUNPARAM's s and o in the past.
  • If this is timing-specific, I've previously had luck with inserting a strategic sleep to trigger something more often, after having eyeballed the relevant code and built a suspicion
  • Perhaps this triggers more often with more spawned domains? A custom Lin stress property with 3 (or 4) domains is a possibility

Finally, we can't rule out that this is not due to a Dynarray error but to something else.
This remark on another issue indicates that there may be a rare GC bug lurking still.

@jmid
Copy link
Collaborator Author

jmid commented Jan 31, 2025

Oh wow - it triggered 5/100 on 5.3 and 6/100 on trunk apparently! 😮

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime
Projects
None yet
Development

No branches or pull requests

3 participants