Native libc crash when trying to get a Realm instance in a multithreaded environment #2207

Closed
fede-marsiglia opened this issue Jan 27, 2021 · 9 comments

fede-marsiglia commented Jan 27, 2021

Goals

Get a Realm instance with Realm.GetInstance().

Expected Results

The resulting Realm instance.

Actual Results

A native libc crash, apparently while locking a mutex.

```
mono-stdout: [Threads 6][Realm 15][Utils.TraceException:149] - W - Managed Exception in RealmProvider.GetRealm:55 -> Realms.Exceptions.RealmException: Cannot access realm that has been closed.
mono-stdout:   at Realms.NativeException.ThrowIfNecessary (System.Func`2[T,TResult] overrider) [0x00011] in <adbb920c027d4e04b3ca6a6920a44a1f>:0
mono-stdout:   at Realms.SharedRealmHandle.Open (Realms.Native.Configuration configuration, Realms.Schema.RealmSchema schema, System.Byte[] encryptionKey) [0x00024] in <adbb920c027d4e04b3ca6a6920a44a1f>:0
mono-stdout:   at Realms.RealmConfiguration.CreateRealm (Realms.Schema.RealmSchema schema) [0x00096] in <adbb920c027d4e04b3ca6a6920a44a1f>:0
mono-stdout:   at Realms.Realm.GetInstance (Realms.RealmConfigurationBase config, Realms.Schema.RealmSchema schema) [0x0003c] in <adbb920c027d4e04b3ca6a6920a44a1f>:0
mono-stdout:   at Realms.Realm.GetInstance (Realms.RealmConfigurationBase config) [0x0000a] in <adbb920c027d4e04b3ca6a6920a44a1f>:0
mono-stdout:   at ByMeLib.Database.ByMeRealmProvider.GetRealm (System.Boolean isFirstInstanceMts) [0x00063] in <38bf14fccd7e41d98e9a214c80bed754>:0
libc    : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x8 in tid 7677 (Thread Pool Wor)
DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
DEBUG   : Build fingerprint: 'Android/a93/a93:5.1/LMY47D/build12111549:user/test-keys'
DEBUG   : Revision: '0'
DEBUG   : ABI: 'arm'
DEBUG   : pid: 4892, tid: 7677, name: Thread Pool Wor  >>> com.mtsbyme <<<
DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
DEBUG   :     r0 00000008  r1 8a0fad9c  r2 00000001  r3 00000000
DEBUG   :     r4 00000008  r5 00000000  r6 401a5dd4  r7 8a0fad70
DEBUG   :     r8 ffffffff  r9 ffffffff  sl 8a0faf88  fp 00000000
DEBUG   :     ip 6e49a170  sp 8a0fad58  lr 6e3321b9  pc 40160f3e  cpsr 600d0030
DEBUG   :
DEBUG   : backtrace:
DEBUG   :     #00 pc 00016f3e  /system/lib/libc.so (pthread_mutex_lock+7)
DEBUG   :     #01 pc 004321b5  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so (std::__ndk1::recursive_mutex::lock()+4)
DEBUG   :     #02 pc 002dc877  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
DEBUG   :     #03 pc 00104c2d  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
DEBUG   :     #04 pc 000f4221  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
DEBUG   :     #05 pc 000cd6b7  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so (shared_realm_open+350)
DEBUG   :     #06 pc 0016b900  <unknown>
NativeCrashListener: Exception dealing with report
NativeCrashListener: android.system.ErrnoException: read failed: EAGAIN (Try again)
NativeCrashListener:    at libcore.io.Posix.readBytes(Native Method)
NativeCrashListener:    at libcore.io.Posix.read(Posix.java:165)
NativeCrashListener:    at libcore.io.BlockGuardOs.read(BlockGuardOs.java:230)
NativeCrashListener:    at android.system.Os.read(Os.java:350)
NativeCrashListener:    at com.android.server.am.NativeCrashListener.consumeNativeCrashData(NativeCrashListener.java:240)
NativeCrashListener:    at com.android.server.am.NativeCrashListener.run(NativeCrashListener.java:138)
```

Steps to Reproduce

Not reproducible on demand; it happens in seemingly random contexts.

Version of Realm and Tooling

Realm 5.1.1 on Android/iOS

papafe (Contributor) commented Jan 28, 2021

Hi, can you show how you are calling Realm.GetInstance() in your app?
Also, are you using Xamarin.Forms or Xamarin Native? Is it happening on both iOS and Android?

fede-marsiglia (Author) commented Jan 28, 2021

Hi @papafe, we are using Xamarin.Forms on Android and iOS. The problem has been reported only on custom devices running Android 5.0/6.0. The relevant piece of code is the following:

```csharp
if (_realmConfiguration == null)
{
    lock (Padlock)
    {
        _realmConfiguration ??= GetRealmConfiguration();
    }
}

Realm realm;

try
{
    realm = Realm.GetInstance(_realmConfiguration);
}
catch (Realms.Exceptions.RealmException ex)
{
    Utils.TraceException(ex,false);

    if (!string.IsNullOrEmpty(ex.Message) && ex.Message.Contains("is less than last set version"))
    {
        lock (Padlock)
        {
            Realm.DeleteRealm(_realmConfiguration);
            _realmConfiguration = GetRealmConfiguration();
        }
    }
    realm = Realm.GetInstance(_realmConfiguration);
}
```

This is called every time our app needs a fresh Realm instance to work on. The Realm configuration is built using the method below:

```csharp
var config = new RealmConfiguration("app_db.db")
{
    SchemaVersion = DataModelSchema.MODEL_SCHEMA_VERSION,
    ShouldDeleteIfMigrationNeeded = false,
    ShouldCompactOnLaunch = (totalBytes, usedBytes) =>
    {
        try
        {
            var totalMb = totalBytes / ConvertToMb;
            var usedMb = usedBytes / ConvertToMb;
            var usedPercentage = (double)usedBytes / totalBytes * 100;
            LogBroker.Instance.TraceDebug($"Realm current Total Size: {totalMb:0.##}MB", false);
            LogBroker.Instance.TraceDebug($"Realm current Used Size: {usedMb:0.##}MB", false);
            LogBroker.Instance.TraceDebug($"Realm current Used Percentage : {usedPercentage:0.##}%", false);

            // Compact if the file is over 10MB in size and less than 50% 'used'
            const int threshold = 10 * 1024 * 1024;
            return totalBytes > threshold && (double)usedBytes / totalBytes < 0.5;
        }
        catch (Exception e)
        {
            Utils.TraceException(e);
            return false;
        }
    },

    MigrationCallback = (migration, oldSchemaVersion) =>
    {
        if (oldSchemaVersion < 24)
        {
            migration.NewRealm.RemoveAll();
            return;
        }

        if (oldSchemaVersion < 26)
        {
            var newSceneryDatas = Queries.GetAllSceneryData(migration.NewRealm);
            for (var i = 0; i < newSceneryDatas.Count(); i++)
            {
                var newSceneryData = newSceneryDatas.ElementAt(i);
                newSceneryData.IsLocked = false;
            }
        }

        if (oldSchemaVersion < 27)
        {
            var newSystemFunctions = Queries.GetAllSfData(migration.NewRealm);

            for (var i = 0; i < newSystemFunctions.Count(); i++)
            {
                var newSf = newSystemFunctions.ElementAt(i);
                newSf.IsActive = 0;
            }
        }

        if (oldSchemaVersion < 28)
        {
            var newFavouriteModels = Queries.GetAllFavouriteModel(migration.NewRealm);

            for (var i = 0; i < newFavouriteModels.Count(); i++)
            {
                var newFavourite = newFavouriteModels.ElementAt(i);
                newFavourite.OrderView = i + 1;
            }
        }

        if (oldSchemaVersion < 30)
        {
            var bridges = migration.NewRealm.All<Bridge>();

            for (var i = 0; i < bridges.Count(); i++)
            {
                var bridge = bridges.ElementAt(i);
                bridge.Connected = false;
            }
        }

        if (oldSchemaVersion < 32)
        {
            var sfDataVideoentryphones = Queries.GetAllVideoentryphone(migration.NewRealm);

            for (var i = 0; i < sfDataVideoentryphones.Count(); i++)
            {
                var bridge = sfDataVideoentryphones.ElementAt(i);
                bridge.IsLocal = false;
            }
        }

        if (oldSchemaVersion < 33)
        {
            var sfDataVideoentryphones = Queries.GetAllVideoentryphone(migration.NewRealm);
            for (var i = 0; i < sfDataVideoentryphones.Count(); i++)
            {
                var sf = sfDataVideoentryphones.ElementAt(i);
                sf.IdAmbientUser = null;
            }
        }

        if (oldSchemaVersion < 35)
        {
            var newSceneryDatas = Queries.GetAllSceneryData(migration.NewRealm);
            var oldSceneryDatas = migration.OldRealm.All("SceneryData");

            for (var i = 0; i < newSceneryDatas.Count(); i++)
            {
                var newSf = newSceneryDatas.ElementAt(i);
                var oldSf = oldSceneryDatas.ElementAt(i);

                newSf.IdAmbient = (int?)oldSf.IdAmbient;
            }
        }

        if (oldSchemaVersion < 36)
        {
            var newFavouriteModels = Queries.GetAllFavouriteModel(migration.NewRealm);

            for (var i = 0; i < newFavouriteModels.Count(); i++)
            {
                var newFavourite = newFavouriteModels.ElementAt(i);
                newFavourite.IsPlateCamera = false;
            }
        }

        if (oldSchemaVersion < 37)
        {
            var notificationLedData = migration.NewRealm.All<NotificationLedData>();

            foreach (var notification in notificationLedData)
            {
                notification.IsEnable = false;
            }
        }

        if (oldSchemaVersion < 38)
        {
            migration.NewRealm.RemoveAll<VideoMessageItemModel>();
        }

        if (oldSchemaVersion < 39)
        {
            var newRingtoneSettings = migration.NewRealm.All<RingtonesSettings>().FirstOrDefault();
            if (newRingtoneSettings != null)
            {
                newRingtoneSettings.SettingKey = 0;
            }
        }

        if (oldSchemaVersion < 47)
        {
            migration.NewRealm.RemoveAll<VideoMessageItemModel>();
            migration.NewRealm.RemoveAll<CallHistoryItemModel>();
            migration.NewRealm.RemoveAll<TextMessageItemModel>();
        }
    }
};
```
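As an illustrative sketch (not part of the original report), the compaction rule in ShouldCompactOnLaunch can be pulled out into a pure function and unit-tested separately from Realm. The sketch assumes `ulong` byte counts and that `ConvertToMb` (whose definition isn't shown) means one MiB. `CompactionPolicy`, `ShouldCompact` and `Describe` are hypothetical names. Computing the MB values as `double` also matters if `ConvertToMb` happens to be an integer constant. In that case `totalBytes / ConvertToMb` would truncate to whole megabytes, and the `0.##` format would never show a fractional part.

```csharp
using System;

// Hypothetical helper mirroring the ShouldCompactOnLaunch rule from the report.
public static class CompactionPolicy
{
    // Compact only files larger than 10 MiB...
    public const ulong ThresholdBytes = 10UL * 1024 * 1024;

    // Assumption: ConvertToMb in the original code stands for 1 MiB.
    public const double BytesPerMb = 1024.0 * 1024.0;

    // ...and only when less than half of the file is in use.
    // The size check short-circuits, so totalBytes is never 0 in the division.
    public static bool ShouldCompact(ulong totalBytes, ulong usedBytes) =>
        totalBytes > ThresholdBytes && (double)usedBytes / totalBytes < 0.5;

    // Double arithmetic keeps the fractional MB that "0.##" is meant to show;
    // invariant culture keeps the decimal separator stable in logs.
    public static string Describe(ulong totalBytes, ulong usedBytes)
    {
        double totalMb = totalBytes / BytesPerMb;
        double usedMb = usedBytes / BytesPerMb;
        double usedPct = totalBytes == 0 ? 0 : (double)usedBytes / totalBytes * 100;
        return FormattableString.Invariant(
            $"total {totalMb:0.##}MB, used {usedMb:0.##}MB ({usedPct:0.##}%)");
    }
}
```

Keeping the predicate free of logging and Realm types means the callback itself can shrink to a logging call plus `CompactionPolicy.ShouldCompact(totalBytes, usedBytes)`.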

Thanks for your time!
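The following is an observation about the GetRealm snippet posted above, not a diagnosis of the crash. `_realmConfiguration` is read outside `Padlock`: in the null check and in both `GetInstance` calls. Meanwhile, the catch block deletes the file and rebuilds the configuration under the lock. If several threads hit the "is less than last set version" error at the same time, each of them will delete and recreate the Realm in turn.

A minimal sketch, assuming a hypothetical `ResettableLazy<T>` helper (not a Realm API), addresses this in two ways:

- Each caller works with a single snapshot of the configuration.
- Only the first caller that fails on a given snapshot performs the reset.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not a Realm API): a lazily created value that can be
// replaced after a failure, with at most one replacement per failed snapshot.
public sealed class ResettableLazy<T> where T : class
{
    private readonly Func<T> _factory;
    private readonly object _gate = new object();
    private T _value; // written only while holding _gate

    public ResettableLazy(Func<T> factory)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    // Double-checked locking: the volatile read makes the lock-free fast path
    // safe once a fully constructed value has been published.
    public T Get()
    {
        var current = Volatile.Read(ref _value);
        if (current != null)
        {
            return current;
        }

        lock (_gate)
        {
            if (_value == null)
            {
                Volatile.Write(ref _value, _factory());
            }
            return _value;
        }
    }

    // Replaces the value only if it is still the snapshot the caller used, so
    // threads that fail on the same snapshot trigger a single recovery.
    // onReplace (e.g. deleting a broken file) runs while the lock is held.
    public T ResetIfCurrent(T seen, Action<T> onReplace = null)
    {
        lock (_gate)
        {
            if (ReferenceEquals(_value, seen))
            {
                onReplace?.Invoke(seen);
                Volatile.Write(ref _value, _factory());
            }
            return _value;
        }
    }
}
```

In GetRealm, a caller would proceed as follows:

- Take a snapshot with `var config = holder.Get();` and pass it to `Realm.GetInstance(config)`.
- On the version error, call `holder.ResetIfCurrent(config, Realm.DeleteRealm)`.
- Retry with the configuration that call returns.

This only serializes the reset; it does not make it safe to delete a file that other threads may still have open.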

nirinchev (Member) commented

Can you clarify a little what "custom devices" means in this case?

fede-marsiglia (Author) commented Jan 28, 2021

By custom devices I mean Android devices whose boards are manufactured by our client, with closed specifications. I don't think this is relevant to the problem outlined here, but I added this detail for completeness.

I should also mention that these devices run a lightly modified AOSP build; I don't know the extent of the changes.

nirinchev (Member) commented

If the issue only manifests itself on devices with custom hardware and modifications to the AOSP, there's very little we can do to diagnose or work around it. Since the vendor may have implemented system functionality, such as mutexes, in a non-standard or non-compliant way, we can't offer support for such devices.

mattiascibien commented Feb 5, 2021

I am a colleague of @fede-marsiglia working on this same issue. Looking at the source code of the wrappers library, I found that the only mutex locks taken when shared_realm_open is called are in the coordinator and in the get_cached_realm function.

I think what we are seeing here is this: we try to open a new Realm again after getting Realms.Exceptions.RealmException: Cannot access realm that has been closed, and for some reason this crashes the library (probably due to problems with the unique_lock or the RAII lock used in the library).

It may be possible to work around this by disabling the Realm cache and getting a new Realm instance every time:

```cpp
// If false, always return a new Realm instance, and don't return
// that Realm instance for other requests for a cached Realm. Useful
// for dynamic Realms and for tests that need multiple instances on
// one thread
bool cache = false;
```

Unfortunately, the Realm C# wrapper does not allow this, because the EnableCache setting is internal. I was able to get around this with reflection, so I can now stress-test the application.

Is it possible that the caching mechanism returns a cached Realm that has already been closed? That would cause the first exception we see. Trying to open a new Realm on the same thread would then cause the second problem, because of a bug in the caching mechanism.
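For anyone trying the same experiment, here is a sketch of the kind of reflection workaround described in this comment. It is an assumption whether EnableCache is declared as a field or as a property on RealmConfigurationBase in a given Realm version, so the helper tries both and walks the base classes. `NonPublicSetter` and the `ConfigStandIn` types are hypothetical; the stand-ins exist only to demonstrate the helper.

Internal members can change or disappear between releases. As the reply below notes, disabling the cache may also affect notification delivery on the main thread.

```csharp
using System;
using System.Reflection;

public static class NonPublicSetter
{
    // Sets a non-public instance field or property by name. DeclaredOnly plus
    // the BaseType walk also finds members declared on base classes (such as
    // an internal member of an abstract configuration base class).
    public static bool TrySet(object target, string memberName, object value)
    {
        if (target == null) throw new ArgumentNullException(nameof(target));

        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.DeclaredOnly;

        for (var type = target.GetType(); type != null; type = type.BaseType)
        {
            var property = type.GetProperty(memberName, flags);
            if (property != null && property.CanWrite)
            {
                property.SetValue(target, value);
                return true;
            }

            var field = type.GetField(memberName, flags);
            if (field != null)
            {
                field.SetValue(target, value);
                return true;
            }
        }

        return false; // no member with that name on the type or its bases
    }
}

// Hypothetical stand-ins for demonstration only; whether the real Realm member
// is a field or a property is not confirmed here.
public abstract class ConfigBaseStandIn
{
    internal bool EnableCache = true;
    internal int Retries { get; set; } = 3;
}

public sealed class ConfigStandIn : ConfigBaseStandIn
{
}
```

Against a real configuration, this would be `NonPublicSetter.TrySet(config, "EnableCache", false)` before calling `Realm.GetInstance(config)`. A `false` return means no member with that name was found.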

nirinchev (Member) commented

There is certainly a possibility for a bug in the caching mechanism, especially if your pattern for opening/disposing Realm instances is unconventional and we didn't do a good job testing for it. Good call going down the reflection approach - note that this may have unintended side effects related to notifications delivery on the main thread, but if it does resolve the original issue, it will definitely narrow down the investigation area.

mattiascibien commented Feb 8, 2021

Thanks for the advice.

The code posted by @fede-marsiglia exposes the problem we are facing, which in my opinion is more related to the fact that Realm.GetInstance returns an already closed Realm. That's why I thought the problem is in the caching mechanism. The code inside the catch block is actually a workaround we had to implement because of problems upgrading the app in the environment requested by the client.

Apart from this, the flow we use is exactly what Realm expects. Unfortunately, requesting a new Realm in this case fails with the mutex crash, so we cannot recover from the situation, and we cannot even check whether the Realm is closed.

We will try the workaround, but receiving a closed Realm is a difficult problem to overcome.

Is there some kind of test we can run to reproduce the situation, or a stress test, so that both we and the Realm team can work toward a solution?

EDIT: is it possible to keep this issue open while we investigate?

nirinchev (Member) commented

I closed this issue because we can't offer support for custom hardware or modified AOSP distributions: we don't have the means to test those or to ensure that they implement basic Unix functionality correctly. It seems that your colleague @lagmac has reported that the issue occurs on a Samsung device and opened #2224 to track it. To avoid splitting the discussion, it's probably best to use that one.

It's hard to suggest a test that will exhibit the problem without knowing how your app works and what its usage patterns are. If you do manage to create a test case that synthetically reproduces the issue somewhat reliably (e.g. in more than 20% of runs), we'd be happy to take it from there.
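A synthetic reproduction of the kind requested here could start from a sketch like the following, built around a hypothetical `StressHarness` type (not a Realm API). It runs a caller-supplied action from thread-pool tasks, mirroring the "Thread Pool Wor" thread name in the crash log, and reports the fraction of runs that failed. It can only count managed exceptions such as RealmException. A native SIGSEGV terminates the whole process, so native crashes have to be counted per process launch instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class StressHarness
{
    // Runs `body` concurrently from `workers` thread-pool tasks, `iterations`
    // times each, and returns every managed exception instead of stopping
    // at the first one.
    public static IReadOnlyCollection<Exception> RunOnce(Action body, int workers, int iterations)
    {
        var errors = new ConcurrentQueue<Exception>();
        var tasks = new Task[workers];

        for (var w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (var i = 0; i < iterations; i++)
                {
                    try { body(); }
                    catch (Exception ex) { errors.Enqueue(ex); }
                }
            });
        }

        Task.WaitAll(tasks);
        return errors.ToArray();
    }

    // Repeats RunOnce and returns the fraction of runs that saw at least one
    // failure, i.e. the "reproduces in N% of runs" figure.
    public static double ReproductionRate(Action body, int runs, int workers, int iterations)
    {
        var failing = 0;
        for (var r = 0; r < runs; r++)
        {
            if (RunOnce(body, workers, iterations).Count > 0)
            {
                failing++;
            }
        }
        return (double)failing / runs;
    }
}
```

The body would typically open and immediately dispose an instance, e.g. `() => { using var realm = Realm.GetInstance(config); }`, possibly mixed with the app's own read and write patterns.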

github-actions bot locked this issue as resolved and limited conversation to collaborators on Mar 15, 2024

4 participants