Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[monovm] Assertion at object.c:657, condition `lock->done' not met #96872

Closed
jeromelaban opened this issue Jan 11, 2024 · 24 comments · Fixed by #96903
Closed

[monovm] Assertion at object.c:657, condition `lock->done' not met #96872

jeromelaban opened this issue Jan 11, 2024 · 24 comments · Fixed by #96903
Assignees
Milestone

Comments

@jeromelaban
Copy link
Contributor

jeromelaban commented Jan 11, 2024

Description

When running some threaded code in a MAUI app (android in this case), the app may fail with the following error:

[nyname.UnoApp16] * Assertion at /__w/1/s/src/mono/mono/metadata/object.c:657, condition `lock->done' not met

g_assert (lock->done);

This happens when connecting in parallel to multiple websockets, see the StartConnection() in MainPage.xaml.cs.

Reproduction Steps

Expected behavior

The app does not crash

Actual behavior

the app crashes immediately most of the time

Regression?

Yes, was working fine in 8.0.100.

Known Workarounds

Yes, don't execute any threaded code.

Configuration

dotnet --version
8.0.101
dotnet workload update --print-rollback
==workloadRollbackDefinitionJsonOutputStart==
{
  "microsoft.net.sdk.android": "34.0.52/8.0.100",
  "microsoft.net.sdk.ios": "17.0.8490/8.0.100",
  "microsoft.net.sdk.maccatalyst": "17.0.8490/8.0.100",
  "microsoft.net.sdk.macos": "14.0.8490/8.0.100",
  "microsoft.net.sdk.maui": "8.0.3/8.0.100",
  "microsoft.net.sdk.tvos": "17.0.8490/8.0.100",
  "microsoft.net.workload.mono.toolchain.current": "8.0.1/8.0.100",
  "microsoft.net.workload.emscripten.current": "8.0.1/8.0.100",
  "microsoft.net.workload.emscripten.net6": "8.0.1/8.0.100",
  "microsoft.net.workload.emscripten.net7": "8.0.1/8.0.100",
  "microsoft.net.workload.mono.toolchain.net6": "8.0.1/8.0.100",
  "microsoft.net.workload.mono.toolchain.net7": "8.0.1/8.0.100",
  "microsoft.net.sdk.aspire": "8.0.0-preview.2.23619.3/8.0.100"
}
==workloadRollbackDefinitionJsonOutputEnd==

Other information

No response

@lambdageek
Copy link
Member

Probably I messed something up in #93875

@lambdageek lambdageek self-assigned this Jan 11, 2024
@lambdageek lambdageek added this to the 8.0.x milestone Jan 11, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2024
@lambdageek
Copy link
Member

lambdageek commented Jan 11, 2024

Oh... yea...

if (!lock->done) {
int timeout_ms = 500;
int wait_result = mono_coop_cond_timedwait (&lock->cond, &lock->mutex, timeout_ms);
if (wait_result == -1) {
/* timed out - go around again from the beginning. If we got here
* from the "is_blocked = FALSE" case, above (another thread was
* blocked on the current thread, but on a lock that was already
* done but it didn't get to wake up yet), then it might still be
* the case that the current thread cannot proceed even if the other
* thread got to wake up - there might be a new deadlock. We need
* to re-evaluate.
*
* This can happen if two threads A and B need to call the cctors
* for classes X and Y but in opposite orders, and also call a cctor
* for a third class Z. (That is thread A wants to init: X, Z, Y;
* thread B wants to init: Y, Z, X.) In that case, if B is waiting
* for A to finish initializing Z, and A (the current thread )
* already finished Z and wants to init Y. In A, control will come
* here with "lock" being Y's lock. But we will time out because B
* will see that A is responsible for initializing X and will also
* block. So A is waiting for B to finish Y and B is waiting for A
* to finish X. So the fact that A allowed B to wait for Z to
* finish didn't actually let us make progress. Thread A must go
* around to the top once more and try to init Y - and detect that
* there is now a deadlock between X and Y.
*/
mono_type_init_unlock (lock);
// clean up blocked thread hash and lock refcount.
mono_type_initialization_lock ();
g_hash_table_remove (blocked_thread_hash, GUINT_TO_POINTER (tid));
gboolean deleted = unref_type_lock (lock);
if (deleted)
g_hash_table_remove (type_initialization_hash, vtable);
mono_type_initialization_unlock ();
goto retry_top;
} else if (wait_result == 0) {
/* Success: we were signaled that the other thread is done. Proceed */
} else {
g_assert_not_reached ();
}
}
mono_type_init_unlock (lock);
g_assert (lock->done);

on line 615 we know !lock->done
then on 650-651 we find out that wait_result == 0 - ie we were signaled on lock->cond. And then we fall through to 656. But we probably need a barrier so that we actually see lock->done (although I thought mutex locking/unlocking is a barrier.)

alternately a thread that is doing the initialization is dying and we're calling mono_release_type_locks which doesn't set lock->done nope... release_type_locks also sets lock->done.

I'll need to try this in a debugger after all. Trying to guess what is wrong isn't working

@lambdageek
Copy link
Member

Just noticed another thing that is wrong. when we goto retry_top, we repeat HANDLE_FUNCTION_ENTER() without doing a corresponding HANDLE_FUNCTION_RETURN, so we're incidentally messing up the coop handle stack.

@lambdageek
Copy link
Member

lambdageek commented Jan 12, 2024

Oh. It's probably a spurious wakeup. We need to actually re-evaluate lock->done if wait_result == 0

When using condition variables there is always a Boolean
predicate involving shared variables associated with each
condition wait that is true if the thread should proceed.
Spurious wakeups from the pthread_cond_timedwait() or
pthread_cond_wait() functions may occur. Since the return from
pthread_cond_timedwait() or pthread_cond_wait() does not imply
anything about the value of this predicate, the predicate should
be re-evaluated upon such return.

https://man7.org/linux/man-pages/man3/pthread_cond_wait.3p.html

@lambdageek
Copy link
Member

Can repro the crash on a real Pixel 6a, but not in an emulator

lambdageek added a commit to lambdageek/runtime that referenced this issue Jan 12, 2024
the condition variable may be signaled even if the initialization by
the other thread is not done yet.  Handle spurious wakeups the same
way as timeouts: go around once more from the beginning.

Fixes dotnet#96872
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 12, 2024
github-actions bot pushed a commit that referenced this issue Jan 12, 2024
the condition variable may be signaled even if the initialization by
the other thread is not done yet.  Handle spurious wakeups the same
way as timeouts: go around once more from the beginning.

Fixes #96872
lambdageek added a commit that referenced this issue Jan 12, 2024
* mono_runtime_class_init_full: handle spurious wakeups

the condition variable may be signaled even if the initialization by
the other thread is not done yet.  Handle spurious wakeups the same
way as timeouts: go around once more from the beginning.

Fixes #96872 and #96804

* fix unbalanced handle frames

if we goto retry_top, don't set up a new handle frame that lacks a
matching HANDLE_FUNCTION_RETURN_VAL.

Instead setup the handle frame once upfront
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 12, 2024
@lambdageek
Copy link
Member

Reopening until the 8.0.x backport is merged

@lambdageek lambdageek reopened this Jan 12, 2024
lambdageek added a commit that referenced this issue Jan 13, 2024
#96905)

* mono_runtime_class_init_full: handle spurious wakeups

the condition variable may be signaled even if the initialization by
the other thread is not done yet.  Handle spurious wakeups the same
way as timeouts: go around once more from the beginning.

Fixes #96872 and #96804

* fix unbalanced handle frames

if we goto retry_top, don't set up a new handle frame that lacks a
matching HANDLE_FUNCTION_RETURN_VAL.

Instead setup the handle frame once upfront

---------

Co-authored-by: Aleksey Kliger <[email protected]>
@lambdageek
Copy link
Member

Backport has been merged, too.

@ChristopherStephan
Copy link

We were also able to reproduce this error with Android SDK 31, 33 and 34.

@liejuntao001
Copy link

When will the fix be released? We are all stuck by this issue.
Both 8.0.101 and 8.0.100 not working.

@lambdageek
Copy link
Member

When will the fix be released? We are all stuck by this issue.

@liejuntao001 see #96804 (comment)
there is a suggestion in there for how to enable additional logging that may help you to work around the issue.

servicing releases for .NET come out approximately monthly. The fix for this issue is queued up for the February servicing release.

Both 8.0.101 and 8.0.100 not working.

This assertion is new in 8.0.101. What is the issue in 8.0.100? Could you open a new GitHub issue for it.

@jeromelaban
Copy link
Contributor Author

jeromelaban commented Jan 17, 2024

This assertion is new in 8.0.101. What is the issue in 8.0.100? Could you open a new GitHub issue for it.

@lambdageek as rolling back workloads is not easy (it requires rollback files), changing the active version of the SDK in global.json does not have any effect on the active workload, which can cause confusion. I hope that dotnet/designs#294 will clear a lot of this confusion...

The fix for this issue is queued up for the February servicing release

As for the monthly servicing, February is quite far away for an issue that arises when updating visual studio, and for which rolling back is difficult.

@lambdageek
Copy link
Member

This assertion is new in 8.0.101. What is the issue in 8.0.100? Could you open a new GitHub issue for it.

@lambdageek as rolling back workloads is not easy (it requires rollback files), changing the active version of the SDK in global.json does not have any effect on the active workload, which can cause confusion. I hope that dotnet/designs#294 will clear a lot of this confusion...

You're right. that would explain it.

The fix for this issue is queued up for the February servicing release

As for the monthly servicing, February is quite far away for an issue that arises when updating visual studio, and for which rolling back is difficult.

As a temporary workaround you might be able to force the 8.0.101 SDK to use the 8.0.0 runtime pack with something like this:

<Target Name="UpdateRuntimePackVersion" BeforeTargets="ProcessFrameworkReferences">
  <ItemGroup>
    <KnownRuntimePack Condition="%(RuntimePackLabels) == 'Mono'" LatestRuntimeFrameworkVersion="8.0.0" />
  </ItemGroup>
</Target>

@liejuntao001
Copy link

This workaround works for me.

tmds pushed a commit to tmds/runtime that referenced this issue Jan 23, 2024
* mono_runtime_class_init_full: handle spurious wakeups

the condition variable may be signaled even if the initialization by
the other thread is not done yet.  Handle spurious wakeups the same
way as timeouts: go around once more from the beginning.

Fixes dotnet#96872 and dotnet#96804

* fix unbalanced handle frames

if we goto retry_top, don't set up a new handle frame that lacks a
matching HANDLE_FUNCTION_RETURN_VAL.

Instead setup the handle frame once upfront
@orimslala
Copy link

orimslala commented Jan 27, 2024

Hi, has the fix been released. i am getting crashes as well..
Assertion at /__w/1/s/src/mono/mono/metadata/object.c:657, condition `lock->done' not met

@BeepBeepBopBop
Copy link

Hello, I am still experiencing this issue despite being up to date with MAUI latest version and also including the workaround in the csproj.

@mlyrstad
Copy link

Hello, I am still experiencing this issue despite being up to date with MAUI latest version and also including the workaround in the csproj.

I don't think the fix is released yet. These guys don't believe in hotfixes, it seems, even for critical issues like this. May have something to do with the fact that the error is so far down in the system, I dunno. We have to wait for the Februrary push.

@bitzeal-johan
Copy link

When is the February push out? It is February now. We're waiting. Our customers projects are crashing.

@gdamino
Copy link

gdamino commented Feb 6, 2024

This bug broken our production App.
Our app starts some threads to do some stuff with tcp sockets, it crash on every user devices and even on emulators.
* Assertion at /__w/1/s/src/mono/mono/metadata/object.c:657, condition lock->done not met
It's a critical issue, no workarounds works here when it will fixed?

@bcaceiro
Copy link

bcaceiro commented Feb 9, 2024

So when will this be available in MAUI? I am getting this crashes all the time

@filipnavara
Copy link
Member

When is the February push out?

As always, on patch Tuesday - the second Tuesday of a month - so February 13th.

@paulwroe
Copy link

Any joy with the new update, has anyone had a chance to try it (if it's available)

@blackmesacode
Copy link

Upgrading to VS 17.9.0 fixed the issue for me 👍

@paulwroe
Copy link

I'm on Visual Studio for Mac 2022 they've got a new update out today so I'll try that tomorrow, I also use Rider and there's a Mono update for VS, which I think is where the issue is, fingers crossed!

@gdamino
Copy link

gdamino commented Feb 14, 2024

Sdk 8.0.200 fixed the issue on Debug configuration for me, but still persist in Release for android. No way

@github-actions github-actions bot locked and limited conversation to collaborators Mar 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.