Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

background handler terminated by 0xdead10cc sometimes #1202

Closed
r10s opened this issue May 11, 2021 · 39 comments
Closed

background handler terminated by 0xdead10cc sometimes #1202

r10s opened this issue May 11, 2021 · 39 comments

Comments

@r10s
Copy link
Member

r10s commented May 11, 2021

getting "Namespace RUNNINGBOARD, Code 0xdead10cc" crashes from time to time, i cannot reproduce that, but it was reported by @dignifiedquire; some logs are in App Store Connect, excerpt

Exception Type:  EXC_CRASH (SIGKILL)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: Namespace RUNNINGBOARD, Code 0xdead10cc
Triggered by Thread:  0

full log - so, there seems sth. wrong with background thread handling, searching for "Namespace RUNNINGBOARD, Code 0xdead10cc" brings up some advices, need to investigate further.

EDIT: to sum up, the issue with 0xdead10cc is that a lock is held on some files (probably sql-database) after the app gets signals to be suspended. reason might be dispatched jobs that are still running for a little moment, we tried to mitigate that by recent commits and to exclude this cause. however, might also be that "shared" usage of the database is the cause, that would be very hard to fix, see discussions wrt share-to-delta.

stats of recent versions:

Screen Shot 2021-05-11 at 10 45 13

compared with older stats, we had 700 installations and only 1 crash/7 days (at least after removing sqlx)

ftr, 0xdead10cc was also topic in #1057

@r10s r10s added the bug label May 11, 2021
@cyBerta
Copy link
Contributor

cyBerta commented May 11, 2021

we should check the diff between 1.19.0 and 1.19.1

@r10s
Copy link
Member Author

r10s commented May 11, 2021

this is the diff between 1.19.0 and 1.19.1:
v1.19.0...v1.19.1
not sure, if the can be found in these few diff-lines. might be, the issue was also be there before and became somehow more visible. but the "fix a rare crash in chatlist" comment in changelog is suspicious at least 🧐

@cyBerta
Copy link
Contributor

cyBerta commented May 11, 2021

@r10s does maybeNetwork() v1.19.0...v1.19.1#diff-fc4377fbb4c60bfc9a11f53fd448de91bbf34787141509ddb85242515fea9e23R378 contain any database action?

There are two nice articles describing similar errors to what we see:
https://inessential.com/2020/02/13/how_we_fixed_the_dreaded_0xdead10cc_cras
https://blog.iconfactory.com/2019/08/the-curious-case-of-the-core-data-crash/

Somewhere we're keeping a file handle, probably the database while the app is about to be suspended.

Due to incorrectly placed task markers, the background task would be marked done before it had finished saving the database. When things happened in the wrong sequence, iOS would unceremoniously kill the process — making it look like it had crashed somewhere inside of Core Data.

This quote describes what might go wrong in our app, too:

we call maybe network from a background task and mark the background task some steps afterwards as completed: completionHandler(.newData). The async block is still executed in the meanwhile and iOS raises this error while suspending the app.

@cyBerta cyBerta self-assigned this May 11, 2021
@r10s
Copy link
Member Author

r10s commented May 11, 2021

does maybeNetwork() contain any database action?

not sure, at a first glance, i did not find something, but might be, maybe some config or so. and even if not directly - it may still keep the handle open, if it takes longer for whatever reason.

maybe @link2xt knows better.

@link2xt
Copy link
Contributor

link2xt commented May 11, 2021

dc_maybe_network returns immediately, you can verify it with repl, and I have also tested with kdeltachat. You can also add time measurements to iOS logs and verify that it is true for iOS.

@r10s said on irc:

we fixed a lot of anr by scheduling maybe_network in a bg thread

Not sure why this happens, all dc_maybe_network does is a try_send on the "interrupt" channels. If the channel is full, the operation should fail immediately.

From reading the links above, I get that the problem is unclosed database "connections" (file descriptors in case of SQLite) at the time of suspension. I see that you are trying to stop working by calling dc_stop_io before the deadline. But it does not close SQL connections. Core only closes connections during import/export and when the context is dropped (by calling dc_context_unref). dc_stop_io only prevents future network operations, but you can still use the database: read contact list, chatlist, messages, create new message sending jobs and so on. If you really need to close SQLite file descriptors, you need a separate call for this in the API, let's call it dc_context_suspend, followed by dc_context_unsuspend to reopen the database.

@r10s
Copy link
Member Author

r10s commented May 11, 2021

thanks for the insights, @link2xt !

@cyBerta we are sharing the database between extensions, or? we did not really have a choice iirc, however, is is trally what is discouraged in the links above? i thought about that always more as a "lib" doing core stuff, that is used by main-app and needed extensions.

also, we did not always get the error - or maybe it was just covered by other ones?

EDIT: i checked some older crash logs - it was also present there - but of course much rare as the background fetch was called less often. so, indeed, there are some hints that it is not really related to maybeNetwork() ...
and yes, maybe a dedicated close() or suspend() or so is the way to go.

@r10s
Copy link
Member Author

r10s commented May 12, 2021

thinking it over - even if maybeNetwork() returns immediately, it might be that it is started much later as expected.

we schedule maybeNetwork() call in a dispatch queue with a very low priority, so, might be, it is called even after we call the completionHandler(). may not happen always, but sometimes - but the issue is also sometimes only.

so, maybe instead of #1206 that schedules everything, just call maybeNetwork in the given thread as it was done before 1.19.1. i would not expect that the issue will be gone completely, but maybe it will be much less, on the level as before.

even if it does not help, we would use maybeNetwork() on-purpose, and will notice if that really somehow results in more anr. and, in this case, we can think over a dc_suspend() api as suggested by @link2xt

nb: i am wondering if the crash is actually user-visible when not using testflight or a debugger. so, maybe, some mitigating is good enough here.

@r10s
Copy link
Member Author

r10s commented May 12, 2021

@r10s r10s changed the title background crashes from time to time background handler terminated by 0xdead10cc from time to time May 12, 2021
@r10s r10s changed the title background handler terminated by 0xdead10cc from time to time background handler terminated by 0xdead10cc sometimes May 12, 2021
@r10s
Copy link
Member Author

r10s commented May 13, 2021

the issue is not fixed by #1208, maybe mitigated.

for getting the error popup: you have to install from testflight.

i never got the exception-popup when installing from xcode. instead i am getting the launch screen maybe once a day on starting, that might be how 0xdead10cc gets visible without testflight.

i am wondering, if the issue is user visible when installing from regular app store; if not, maybe it is okay to release with the issue for now, unless we find an easy way to fix it.

@cyBerta
Copy link
Contributor

cyBerta commented May 13, 2021

some more deep information: https://developer.apple.com/forums/thread/126438

also some explanation how GRDB mitigates the issue on the user API level: https://github.com/groue/GRDB.swift/blob/master/Documentation/SharingADatabase.md#how-to-limit-the-0xdead10cc-exception

@cyBerta
Copy link
Contributor

cyBerta commented May 13, 2021

If I understand the issue correctly, our problem is related to the fact that we share the database between the share extension and main app. In this case database locks are created for syncing. Both share extension and main app run as separate processes.

Another way to mitigate the issue is described here: https://inessential.com/2020/02/13/how_we_fixed_the_dreaded_0xdead10cc_cras

  • move the database back to the main app
  • use a (plist) file in the shared container the extension writes which the app later reads if the app resumes

I think this will be very cumbersome or impossible for us to implement though, because the extension needs data from the database to show the chat list, contacts etc. and to send messages.

@r10s
Copy link
Member Author

r10s commented May 13, 2021

our problem is related to the fact that we share the database between the share extension

yip, some reports point in that direction, while others do not. what i do not get: if it is really related to the database moved to shared spaces: (a) i would expect the issue to appear far more often, not only once a day or so (b) what would the shared space be good for then? :)

@r10s
Copy link
Member Author

r10s commented May 13, 2021

with latest testflight, there may be a new issue, where 0xdead10cc happens, on startup, reported by @dignifiedquire - and sth. similar happens also to me when running testflight.

the new issue may be related to the fix-markseen-forground-fix, #1182 - this is the log, that points to that direction, https://gist.github.com/r10s/f93220a734a8a3e26705fed996356283, line 48, startTimer. i do not really get that, the app is coming to forground, why do we get a 0xdead10cc here. in any case, that is new. EDIT: it is the fired timer, that causes the issue, that, indeed makes some sense if it is not stopped for whatever reason. but i still do not get it completely.

@r10s
Copy link
Member Author

r10s commented May 15, 2021

0xdead10cc is still there in 1.20.3, here is a recent log with a different stack trace: https://gist.github.com/r10s/ffd99677f5b6e8f1d62ae4d82238311b, we'll see if at least the new coming-to-foreground-issue is fixed and if the number of exceptions are lowered by 1.20.3 in general (i personally had none so far)

@r10s
Copy link
Member Author

r10s commented May 17, 2021

another idea: may it be that we have to do beginBackgroundTask/endBackgroundTask also in performFetchWithCompletionHandler and didReceiveRemoteNotification? as we return from the function and call the completionHandler a bit later?

https://stackoverflow.com/questions/37321498/background-task-on-remote-notification-suspended-after-a-short-while points in this direction, but i did not find anything "official" yet. maybe check some other open source apps.

@cyBerta
Copy link
Contributor

cyBerta commented May 17, 2021

another idea: may it be that we have to do beginBackgroundTask/endBackgroundTask also in performFetchWithCompletionHandler and didReceiveRemoteNotification? as we return from the function and call the completionHandler a bit later?

https://github.com/deltachat/deltachat-ios/issues?utf8=✓&q=is%3Aissue+sort%3Aupdated-desc points in this direction, but i did not find anything "official" yet. maybe check some other open source apps.

the link leads to the issue overview. Is that where you wanted to point to?

@r10s
Copy link
Member Author

r10s commented May 17, 2021

i updated the link :)

@cyBerta
Copy link
Contributor

cyBerta commented May 17, 2021

Regarding our questions on the last meeting:

  • you can call UIApplication.shared.beginBackgroundTask multiple times, but you need to keep track of them and correctly call UIApplication.shared.endBackgroundTask for each returned UIBackgroundTaskIdentifier instance.

from https://developer.apple.com/documentation/uikit/uiapplication/1623031-beginbackgroundtask:

You can call this method at any point in your app’s execution. You may also call this method multiple times to mark the beginning of several background tasks that run in parallel. However, each task must be ended separately. You identify a given task using the value returned by this method.

To assist with debugging, this method generates a name for the task, based on the name of the calling method or function. If you want to specify a custom name, use the beginBackgroundTask(withName:expirationHandler:) method instead.

This method can be safely called on a non-main thread.

@cyBerta
Copy link
Contributor

cyBerta commented May 17, 2021

How many background tasks can be called:

from https://developer.apple.com/forums/thread/85066

Do not create too many background tasks. How many is too many? It’s absolutely fine to create tens of background tasks but creating thousands is not a good idea.

IMPORTANT iOS 11 introduced a hard limit on the number of background task assertions a process can have (currently about 1000, but the specific value may change in the future). If you see a crash report with the exception code 0xbada5e47, you’ve hit that limit.

Note The practical limit that you’re most likely to see here is the time taken to call your expiry handlers. The watchdog has a strict limit (a few seconds) on the total amount of time taken to run background task expiry handlers. If you have thousands of handlers, you may well run into this limit.

@r10s r10s reopened this Mar 27, 2022
@r10s r10s removed the notify label Mar 27, 2022
@r10s
Copy link
Member Author

r10s commented Apr 14, 2022

0xdead10cc crashes in 1.28.1 are at typically at the following line: https://github.com/deltachat/deltachat-ios/blob/v1.28.1/deltachat-ios/AppDelegate.swift#L460

meanwhile, however, the code is different, also, these crashes seem to be replaced by a "force suspend" when being not in testflight, see above.

but we are still in the same code are and i still do not full get the too long times from #1461 (we meanwhile just stop the last 16 wakeups - all of them should be around 11 seconds, however, some are 10 minutes or longer, so there is a wakeup in between):

also @link2xt : your suggestion from above, a dc_accounts_suspend() - how much effort would that be? not sure if that would totally fix the issue, however, it still seems the right way to go.

@r10s
Copy link
Member Author

r10s commented Apr 16, 2022

ftr: a maybe somehow related crosslink: deltachat/deltachat-core-rust#2955 (comment)

@link2xt
Copy link
Contributor

link2xt commented Apr 16, 2022

With current API it is possible to introduce dc_context_close with minimal changes. This is making things worse wrt reducing statefullness of context though, see deltachat/deltachat-core-rust#3063).

@r10s
Copy link
Member Author

r10s commented Apr 17, 2022

thanks for the information, @link2xt! we are currently on checking/fixing some other parts and bugs (#1541 #1542), let's see after that if dc_context_close is needed at the end. maybe things shape nicely even without :)

@r10s r10s unassigned cyBerta May 6, 2022
@r10s
Copy link
Member Author

r10s commented May 7, 2022

reopening, the issue is still there on testflight (not on debug-build, not on regular release).

i am running 1.30.1 from testflight on an iphone7 that got these errors before. well, up to no, 30+ hours later, i did not encountered any crashes. however, after all, i stay skeptical 🧐

@r10s r10s closed this as completed Jun 20, 2022
@link2xt
Copy link
Contributor

link2xt commented Feb 17, 2023

The database is going to be closed more reliably with deltachat/deltachat-core-rust#4053

@r10s
Copy link
Member Author

r10s commented Feb 17, 2023

nice, looking forward to try that out and if it helps on testflight builds

@dignifiedquire
Copy link
Member

lets hope for the best, currently getting about 10 crashes per day..

@r10s
Copy link
Member Author

r10s commented Feb 18, 2023

on testflight? or otherwise debug?

@dignifiedquire
Copy link
Member

on testflight

@r10s
Copy link
Member Author

r10s commented Aug 9, 2023

the issue is not completely fixed, however, still seems to appear on testflight only.

we probably need a way to really suspend the core, the theory is still that core accesses some files a moment after we say "okay, apple, ready to suspend".

@r10s r10s reopened this Aug 9, 2023
@link2xt
Copy link
Contributor

link2xt commented Aug 11, 2023

Currently the core maintains an SQL connection pool for each Context that is never closed. Context contains Sql and Sql has a .close() method, but it is private and is never called except when .open() fails and in the tests.

If we want to close Sql and all the connections in the connection pool, we need an API to close the Context that will in turn close the underlying database. Alternatively, to avoid making the Context stateful, we can drop the Context and reopen it from scratch in an account manager.

It is also technically possible to destroy async tokio Runtime which is currently lazily created by the deltachat-ffi, but it is probably not needed and would take additional effort, closing the database is the first thing to do.

@Simon-Laux
Copy link
Member

I think this is related to encrypted accounts and to the new lock on the account manager.

@r10s
Copy link
Member Author

r10s commented Dec 6, 2023

I think this is related to encrypted accounts and to the new lock on the account manager.

+1, this is tracked by #2010 and #2009 now

if we find out more reasons for 0xdead10cc, we should file more on-point issues

@r10s r10s closed this as completed Dec 6, 2023
@r10s r10s closed this as not planned Won't fix, can't repro, duplicate, stale Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants