Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash: Tor together with testing server responses #1587

Closed
HonzaR opened this issue Sep 11, 2024 · 8 comments · Fixed by #1592
Closed

Crash: Tor together with testing server responses #1587

HonzaR opened this issue Sep 11, 2024 · 8 comments · Fixed by #1592
Assignees
Labels
bug Something isn't working

Comments

@HonzaR
Copy link
Contributor

HonzaR commented Sep 11, 2024

The current SDK state causes the Zashi and the Demo app to crash in the scenario described below.

NOTE: It seems that only the Android API level 27 is impacted.

NOTE 2: The crash happens only when Tor and server switching are called simultaneously.

Can you reliably reproduce the issue?

If so, please list the steps to reproduce below:

  1. Create Emulator API level 27
  2. Run Zashi app (Demo app impacted, too)
  3. Enable Currency conversion feature
  4. Go to Settings -> Advanced Settings -> Choose server
  5. Let the automated server responses test run and try to change server manually by clicking any available in the meantime
  6. App crashes with stacktrace attached in this issue

Actual behavior + errors

tor_server_switch_api_27_crash.mp4

crash_tor_server_switch_api_27.txt

  • App Version: any
  • Android Version: 27
  • Device: (if applies): any

Any extra information that might be useful in the debugging process.

Reproducible on both emulators and physical devices.

@HonzaR HonzaR added the bug Something isn't working label Sep 11, 2024
@HonzaR HonzaR changed the title Crash when Tor together with testing server responses Crash: Tor together with testing server responses Sep 11, 2024
@Milan-Cerovsky
Copy link
Contributor

If we do not find the root cause of this issue there are two hacks coming to my mind.

  1. Each Synchronizer has its own TorClient which is killed with it. Instead we can hold a static TorClient in cache to avoid its recreation
  2. We can disable fetching exchange rates while fastest server calculation is ongoing

@str4d
Copy link
Contributor

str4d commented Sep 11, 2024

This part of the crash log shows that the crash happened while trying to free the Tor runtime:

 A  runtime.cc:523] "DefaultDispatcher-worker-1" prio=5 tid=15 Runnable\
 A  runtime.cc:523]   | group="" sCount=0 dsCount=0 flags=0 obj=0x13080bf8 self=0x779922d400\
 A  runtime.cc:523]   | sysTid=7967 nice=0 cgrp=default sched=0/0 handle=0x779957b4f0\
 A  runtime.cc:523]   | state=R schedstat=( 11030712 5135458 80 ) utm=1 stm=0 core=1 HZ=100\
 A  runtime.cc:523]   | stack=0x7799479000-0x779947b000 stackSize=1037KB\
 A  runtime.cc:523]   | held mutexes= "abort lock" "mutator lock"(shared held)\
 A  runtime.cc:523]   native: #00 pc 00000000003c951c  /system/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*)+208)\
 A  runtime.cc:523]   native: #01 pc 000000000049970c  /system/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+348)\
 A  runtime.cc:523]   native: #02 pc 00000000004b1298  /system/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+884)\
 A  runtime.cc:523]   native: #03 pc 00000000004a9b44  /system/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+480)\
 A  runtime.cc:523]   native: #04 pc 00000000004a92e8  /system/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool)+288)\
 A  runtime.cc:523]   native: #05 pc 0000000000480d84  /system/lib64/libart.so (art::AbortState::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&) const+220)\
 A  runtime.cc:523]   native: #06 pc 00000000004724f0  /system/lib64/libart.so (art::Runtime::Abort(char const*)+196)\
 A  runtime.cc:523]   native: #07 pc 00000000005677a8  /system/lib64/libart.so (android::base::LogMessage::~LogMessage()+996)\
 A  runtime.cc:523]   native: #08 pc 000000000030016c  /system/lib64/libart.so (art::JavaVMExt::FindCodeForNativeMethod(art::ArtMethod*)+848)\
 A  runtime.cc:523]   native: #09 pc 000000000050486c  /system/lib64/libart.so (artFindNativeMethod+628)\
 A  runtime.cc:523]   native: #10 pc 0000000000546248  /system/lib64/libart.so (art_jni_dlsym_lookup_stub+40)\
 A  runtime.cc:523]   at cash.z.ecc.android.sdk.internal.model.TorClient.access$freeTorRuntime(TorClient.kt:12)\
 A  runtime.cc:523]   at cash.z.ecc.android.sdk.internal.model.TorClient$Companion.freeTorRuntime(TorClient.kt:-1)\
 A  runtime.cc:523]   at cash.z.ecc.android.sdk.internal.model.TorClient$Companion.access$freeTorRuntime(TorClient.kt:25)\
 A  runtime.cc:523]   at cash.z.ecc.android.sdk.internal.model.TorClient$dispose$2.invokeSuspend(TorClient.kt:17)\
 A  runtime.cc:523]   at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)\
 A  runtime.cc:523]   at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)\
 A  runtime.cc:523]   at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:585)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:802)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:706)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693)\

In particular, it is something around where the native method is being resolved, as we are still inside artFindNativeMethod (and therefore I assume have not run the freeing logic):

 A  runtime.cc:523]   native: #06 pc 00000000004724f0  /system/lib64/libart.so (art::Runtime::Abort(char const*)+196)\
 A  runtime.cc:523]   native: #07 pc 00000000005677a8  /system/lib64/libart.so (android::base::LogMessage::~LogMessage()+996)\
 A  runtime.cc:523]   native: #08 pc 000000000030016c  /system/lib64/libart.so (art::JavaVMExt::FindCodeForNativeMethod(art::ArtMethod*)+848)\
 A  runtime.cc:523]   native: #09 pc 000000000050486c  /system/lib64/libart.so (artFindNativeMethod+628)\
 A  runtime.cc:523]   native: #10 pc 0000000000546248  /system/lib64/libart.so (art_jni_dlsym_lookup_stub+40)\
 A  runtime.cc:523]   at cash.z.ecc.android.sdk.internal.model.TorClient.access$freeTorRuntime(TorClient.kt:12)\

Note also that it is being run on the IO dispatcher. The dispose method was made suspend in 7fe7499 (part of #1537, squash-merged into #1512, squash-merged into main as 618e8eb).

suspend fun dispose() =
withContext(Dispatchers.IO) {
nativeHandle?.let { freeTorRuntime(it) }
}

The only other thread doing something interesting in the log is this one:

 A  runtime.cc:523] "pool-2-thread-1" prio=5 tid=26 Runnable\
 A  runtime.cc:523]   | group="" sCount=0 dsCount=0 flags=0 obj=0x130810a0 self=0x7799230600\
 A  runtime.cc:523]   | sysTid=7978 nice=0 cgrp=default sched=0/0 handle=0x77987cd4f0\
 A  runtime.cc:523]   | state=R schedstat=( 50239872 7237959 125 ) utm=5 stm=0 core=3 HZ=100\
 A  runtime.cc:523]   | stack=0x77986cb000-0x77986cd000 stackSize=1037KB\
 A  runtime.cc:523]   | held mutexes= "mutator lock"(shared held)\
 A  runtime.cc:523]   native: #00 pc 00000000003c951c  /system/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*)+208)\
 A  runtime.cc:523]   native: #01 pc 000000000049970c  /system/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+348)\
 A  runtime.cc:523]   native: #02 pc 00000000004b1298  /system/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+884)\
 A  runtime.cc:523]   native: #03 pc 000000000049a5b4  /system/lib64/libart.so (art::Thread::RunCheckpointFunction()+412)\
 A  runtime.cc:523]   native: #04 pc 000000000052127c  /system/lib64/libart.so (artTestSuspendFromCode+28)\
 A  runtime.cc:523]   native: #05 pc 000000000054fa48  /system/lib64/libart.so (art_quick_test_suspend+168)\
 A  runtime.cc:523]   native: #06 pc 00000000000d312c  /dev/ashmem/dalvik-jit-code-cache (deleted) (Java_java_util_ArrayList_00024Itr__0003cinit_0003e__Ljava_util_ArrayList_2Ljava_util_ArrayList_00024Itr_2+76)\
 A  runtime.cc:523]   at java.util.ArrayList$Itr.<init>(ArrayList.java:-1)\
 A  runtime.cc:523]   at java.util.ArrayList.iterator(ArrayList.java:834)\
 A  runtime.cc:523]   at javax.crypto.Cipher.tryCombinations(Cipher.java:2556)\
 A  runtime.cc:523]   at javax.crypto.Cipher$SpiAndProviderUpdater.updateAndGetSpiAndProvider(Cipher.java:2480)\
 A  runtime.cc:523]   - locked <0x09d777c0> (a java.lang.Object)\
 A  runtime.cc:523]   at javax.crypto.Cipher.chooseProvider(Cipher.java:567)\
 A  runtime.cc:523]   at javax.crypto.Cipher.init(Cipher.java:975)\
 A  runtime.cc:523]   at javax.crypto.Cipher.init(Cipher.java:910)\
 A  runtime.cc:523]   at com.google.crypto.tink.aead.internal.InsecureNonceAesGcmJce.decrypt(InsecureNonceAesGcmJce.java:136)\
 A  runtime.cc:523]   at com.google.crypto.tink.subtle.AesGcmJce.decrypt(AesGcmJce.java:63)\
 A  runtime.cc:523]   at com.google.crypto.tink.aead.AeadWrapper$WrappedAead.decrypt(AeadWrapper.java:91)\
 A  runtime.cc:523]   at androidx.security.crypto.EncryptedSharedPreferences.getDecryptedObject(EncryptedSharedPreferences.java:553)\
 A  runtime.cc:523]   at androidx.security.crypto.EncryptedSharedPreferences.getString(EncryptedSharedPreferences.java:427)\
 A  runtime.cc:523]   at co.electriccoin.zcash.preference.AndroidPreferenceProvider$getString$2.invokeSuspend(AndroidPreferenceProvider.kt:64)\
 A  runtime.cc:523]   at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)\
 A  runtime.cc:523]   at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)\
 A  runtime.cc:523]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)\
 A  runtime.cc:523]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)\
 A  runtime.cc:523]   at java.lang.Thread.run(Thread.java:764)\

That is fetching a preference inside a dispatcher. I don't know which one.
https://github.com/Electric-Coin-Company/zashi-android/blob/b75836f941134a83e98519bf437ce46c6cd225b4/preference-impl-android-lib/src/main/java/co/electriccoin/zcash/preference/AndroidPreferenceProvider.kt#L62-L65

However, it also looks like the crash log is incomplete:

 A  runtime.cc:523]   - locked <0x0675f9bb> (a java.lang.Object)\
 A  runtime.cc:523]   at sun.misc.Unsafe.park(Unsafe.java:358)\
 A  runtime.cc:523]   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:353)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.kt:847)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryPark(CoroutineScheduler.kt:792)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:740)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693)\
 A  runtime.cc:523] \
 A  runtime.cc:523] "DefaultDispatcher-worker-5" prio=5 tid=18 TimedWaiting\
 A  runtime.cc:523]   | group="" sCount=1 dsCount=0 flags=1 obj=0x13081128 self=0x7799232400\
 A  runtime.cc:523]   | sysTid=7971 nice=0 cgrp=default sched=0/0 handle=0x7798ef04f0\
 A  runtime.cc:523]   | state=S schedstat=( 1002916 3173834 11 ) utm=0 stm=0 core=1 HZ=100\
 A  runtime.cc:523]   | stack=0x7798dee000-0x7798df0000 stackSize=1037KB\
 A  runtime.cc:523]   | held mutexes=\
 A  runtime.cc:523]   kernel: __switch_to+0x90/0xc4\
 A  runtime.cc:523]   kernel: futex_wait_queue_me+0xd0/0x160\
 A  runtime.cc:523]   kernel: futex_wait+0xf0/0x2a0\
 A  runtime.cc:523]   kernel: do_futex+0x168/0xb1c\
 A  runtime.cc:523]   kernel: SyS_futex+0x104/0x16c\
 A  runtime.cc:523]   kernel: __sys_trace_return+0x0/0x4\
 A  runtime.cc:523]   native: #00 pc 000000000001d7f0  /system/lib64/libc.so (syscall+32)\
 A  runtime.cc:523]   native: #01 pc 00000000000e6a3c  /system/lib64/libart.so (art::ConditionVariable::TimedWait(art::Thread*, long, int)+172)\
 A  runtime.cc:523]   native: #02 pc 00000000003c47f0  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, long, int, bool, art::ThreadState)+652)\
 A  runtime.cc:523]   native: #03 pc 00000000003c6250  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::mirror::Object*, long, int, bool, art::ThreadState)+376)\
 A  runtime.cc:523]   native: #04 pc 0000000000000aec  /system/framework/arm64/boot.oat (Java_java_lang_Object_wait__JI+140)\
 A  runtime.cc:523]   at java.lang.Object.wait(Native method)\
 A  runtime.cc:523]   - waiting on <0x03f47ed8> (a java.lang.Object)\
 A  runtime.cc:523]   at java.lang.Thread.parkFor$(Thread.java:2137)\
 A  runtime.cc:523]   - locked <0x03f47ed8> (a java.lang.Object)\
 A  runtime.cc:523]   at sun.misc.Unsafe.park(Unsafe.java:358)\
 A  runtime.cc:523]   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:353)\
 A  runtime.cc:523]   at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.kt:847)\
 A  Fatal signal 6 (SIGABRT), code -6 in tid 7967 (DefaultDispatch), pid 7932 (oin.zcash.debug)}

(Note how at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park appears in two thread traces above; the first with some parent calls, the second interrupted by a Fatal signal 6 (SIGABRT) log line.)

@HonzaR is there any way to get more of the log? In particular, we are not seeing anything in the logs related to server testing (other than maybe that preference fetch), so there's not much to work on here (though I'm going to spend some time looking into the TorClient free logic).

@HonzaR
Copy link
Contributor Author

HonzaR commented Sep 11, 2024

@HonzaR is there any way to get more of the log? In particular, we are not seeing anything in the logs related to server testing (other than maybe that preference fetch), so there's not much to work on here (though I'm going to spend some time looking into the TorClient free logic).

Yes, I'll get more logs. The Kotlin logs and tracing logs from the Rust side.

@HonzaR
Copy link
Contributor Author

HonzaR commented Sep 13, 2024

@str4d I'm attaching a new crash log file. Unfortunately, the Android SDK level 27, on which we're able to reproduce the crash, does not print the Rust trancing logs, as we recently discovered with @nuttycom. We should probably check that, too. But the new log at least adds the Kotlin logs.
crash_api_27_logs_v2.txt

@str4d
Copy link
Contributor

str4d commented Sep 13, 2024

@HonzaR explained how the server switcher works to me, and now I'm almost certain that the bug is a UAF.

In order to satisfy Rust lifetime rules, these two methods MUST NOT be called in parallel:

suspend fun dispose() =
withContext(Dispatchers.IO) {
nativeHandle?.let { freeTorRuntime(it) }
}
suspend fun getExchangeRateUsd(): BigDecimal =
withContext(Dispatchers.IO) {
getExchangeRateUsd(nativeHandle!!)
}

The problem is that:

  • The TorClient is owned by the Synchronizer.
  • The new server switching logic restarts the Synchronizer, which closes the current one, which disposes of its TorClient via SdkSynchornizer.close() -> UsdExchangeRateFetcher.dispose() -> TorClientHolder.dispose() -> TorClient.dispose() -> freeTorRuntime().

Thus, if the app has called Synchronizer.refreshExchangeRateUsd, it has a mutable reference to the Tor client, and then the server switching logic attempts to free it, violating the Rust lifetime rules.

@str4d
Copy link
Contributor

str4d commented Sep 13, 2024

@HonzaR pointed me to this mutex:

private class TorClientHolder(private val torDir: File) {
private val mutex = Mutex()
private var torClient: TorClient? = null
suspend operator fun invoke(): TorClient =
mutex.withLock {
if (torClient == null) {
torClient = TorClient.new(torDir)
}
return torClient!!
}
suspend fun dispose() =
mutex.withLock {
torClient?.dispose()
}
}

The mutex does not enforce the necessary Rust lifetime rules, because it only gates access to the TorClient handle, not to using it. Specifically, invoke() does not invoke the method on TorClient inside the mutex:

  • TorClientHolder.invoke() called.
    • Mutex locked.
      • TorClient returned.
    • Mutex unlocked
    • TorClient.getExchangeRateUsd() called outside mutex.
  • TorClientHolder.invoke() called.
    • Mutex locked.
      • TorClient.dispose() called in parallel -> crash.

@str4d
Copy link
Contributor

str4d commented Sep 13, 2024

The fix here is to either add another Mutex inside TorClient itself, or pass a closure to TorClientHolder.invoke() that is given the TorClient handle while still inside the locked Mutex.

EDIT: an extra Mutex does not help because then you get the following race condition:

  • 1️⃣ locks Holder mutex to get a handle.
  • 2️⃣ locks Holder mutex to dispose.
  • 2️⃣ locks accessMutex to dispose.
  • 1️⃣ locks accessMutex to get exchange rates.

@HonzaR
Copy link
Contributor Author

HonzaR commented Sep 15, 2024

@str4d I've tried several improvements on the Kotlin side to get rid of this error on the impacted API 27. Non of it helped, unfortunately. I've also tried play a bit with threading, logged all possible states, or simplify the entire Kotlin implementation.

It's always the same, any time we try to stop Synchornizer, which holds TorClient reference, and calls freeTorRuntime when disposing itself, the crash happens. The easiest way to reproduce it, is to start the Demo app (it loads currency conversion after the start automatically), and stop the Demo app by tapping the system back button, the crash comes up.

I believe that we need to handle this inside the Rust part. From my point of view as a Rust API consumer, when I use any API that provides a disposal method, then the disposal method (freeTorRuntime in this case) needs to check its internal running stuffs, safely stop them, stop itself, and then report the result. We as consumers should not wait until the API running is done and then dispose it, because it might take a long time, which might be true in case of this new API (loading the currency rate might take up to 20s).

@HonzaR HonzaR mentioned this issue Sep 16, 2024
13 tasks
str4d added a commit that referenced this issue Sep 16, 2024
This was added to the `TorClient.dispose()` JNI pathway because that
native coded did not require access to `JNIEnv` or `JClass`. However,
it turns out that this annotation was only added in Android 8 for
system usage [0], and was not CTS-tested public API until Android 14.
For whatever reason, in API 27 (Android 8.1) the native method symbol
can't be found, but in API 28 and above it can be.

The performance cost of regular JNI is no longer necessary to worry
about given that we now keep a `TorClient` around long-term as part of
`SdkSynchronizer`, so instead of doing anything complex to preserve it
on API 28+ we just remove the `@CriticalNative` annotation and adjust
the native method's signature to take (and ignore) the extra arguments.

Closes #1587.

[0]: https://developer.android.com/reference/dalvik/annotation/optimization/CriticalNative
@str4d str4d mentioned this issue Sep 16, 2024
13 tasks
@HonzaR HonzaR closed this as completed in a6ed7e1 Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants