
Flaky cache tests #993

Closed
designatednerd opened this issue Feb 3, 2020 · 18 comments

@designatednerd
Contributor

We're seeing some really flaky cache tests, particularly since #552 was merged. It may or may not be related to the changes in that PR.

@designatednerd
Contributor Author

@RolandasRazma @AnthonyMDev Would either of you have a second to take a look at this? Obviously flakiness is obnoxious to diagnose but we're getting intermittent and weird failures on a bunch of tests on CI.

@RolandasRazma
Contributor

RolandasRazma commented Feb 4, 2020

Yes, it's probably related; I don't like that tight 0...1000 loop at all.
We can definitely reduce that loop to only a few iterations, since there's no locking at all in the new code base.
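
For context, the general shape of that stress loop is roughly the following. This is a minimal sketch with a stand-in `FakeCache` type, not the actual ApolloCacheDependentTests code:

```swift
import Foundation

// Stand-in cache guarded by a plain lock; the real test exercises the
// Apollo cache implementations instead.
final class FakeCache {
    private let lock = NSLock()
    private var storage: [String: String] = [:]

    func write(_ value: String, for key: String) {
        lock.lock(); defer { lock.unlock() }
        storage[key] = value
    }

    func read(_ key: String) -> String? {
        lock.lock(); defer { lock.unlock() }
        return storage[key]
    }
}

let cache = FakeCache()
let group = DispatchGroup()
let queue = DispatchQueue(label: "stress", attributes: .concurrent)

// The "tight loop": 1001 iterations, each entering the group once per
// piece of async work and leaving from the completion.
for i in 0...1000 {
    group.enter()
    queue.async {
        cache.write("value-\(i)", for: "key-\(i)")
        group.leave()
    }

    group.enter()
    queue.async {
        _ = cache.read("key-\(i)")
        group.leave()
    }
}

group.wait()
```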

@designatednerd
Contributor Author

It's not failing consistently on one particular test, though; that's the bit that makes this so obnoxious to try to figure out.

@RolandasRazma
Contributor

If the tests are running in parallel, maybe slowing them down a bit would be enough to stop them knocking each other over?

@designatednerd
Contributor Author

I don't think that's a great solution - we do want the tests to reflect potentially very high loads, and if we just slow them down in order to make them pass, we're absolutely going to miss something.

@designatednerd
Contributor Author

This is a good example of some of the failures we're seeing: There's no change except to our documentation, but it's randomly having a bunch of tests time out.

@AnthonyMDev
Contributor

AnthonyMDev commented Feb 4, 2020 via email

@designatednerd
Contributor Author

designatednerd commented Feb 4, 2020

Does seem like we're getting some repeated crashes. Going to keep updating links as these repeat:

testThreadedCache: 8 failing builds so far (links 1 through 8)

testLoadingHeroAndFriendsNamesQueryWithIDs: 5 failing builds so far (links 1 through 5)

And also some one-offs (so far):

testReadHeroNameQuery: 1
testLoadingHeroAndFriendsNamesQueryWithNullFriends: 1
testUpdateHeroAndFriendsNamesQueryWithVariable: 1
testHeroNameConditionalInclusion: 1

@designatednerd
Contributor Author

OK running locally, it looks like I can make testThreadedCache crash pretty frequently running the ApolloSQLite tests on macOS.

It's crashing here when it does crash:
[Screenshot: "Screen Shot 2020-02-04 at 2 58 14 PM", showing where the crash occurs]

I'm not quite sure why this would be crashing though - especially if it's not consistent.

@AnthonyMDev
Contributor

I would love to help look into this, but I am slammed at work this week. I'll be able to look into this more next week. But I'm pretty confident that if that test is failing or crashing, it is indicative of a problem in the code, not just in the test... I could be wrong, but I'd have to look at it next week if this isn't resolved by then.

@designatednerd
Contributor Author

@AnthonyMDev No worries, totally understand! I'm just throwing my notes up to try and see if it helps anyone think of something, because I'm kind of mystified by this.

The recursive lock was only added to the in-memory cache, but this test was added to the SQLite tests as well, since it's in the ApolloCacheDependentTests bundle. Both on CI and locally, this test passes consistently in the non-SQLite test cases on macOS, but crashes pretty often in the SQLite test cases on macOS.

It's crashing intermittently in the same place on iOS and tvOS when not using SQLite, both on CI and locally.

@AnthonyMDev
Contributor

Yeah, this should be in ApolloCacheDependentTests. It should be run against every cache type, to ensure that all cache types are A) locking to prevent race conditions, and B) not causing deadlocks with that locking.
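
As a reminder of what point B is guarding against, here is a rough sketch of the re-entrancy case a recursive lock handles and a plain NSLock would deadlock on (`InMemoryStore` is a hypothetical stand-in, not the Apollo cache):

```swift
import Foundation

final class InMemoryStore {
    // A recursive lock lets the same thread re-acquire the lock it
    // already holds; with a plain NSLock, merge() below would deadlock
    // when it calls read() while the lock is still held.
    private let lock = NSRecursiveLock()
    private var records: [String: Int] = [:]

    func read(_ key: String) -> Int? {
        lock.lock(); defer { lock.unlock() }
        return records[key]
    }

    func merge(_ key: String, delta: Int) {
        lock.lock(); defer { lock.unlock() }
        let current = read(key) ?? 0  // re-entrant call on the same thread
        records[key] = current + delta
    }
}

let store = InMemoryStore()
store.merge("hero", delta: 1)
print(store.read("hero") ?? 0)  // 1
```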

This comment from before might give you something to think about to solve this? This could be due to the way the DataLoader is working. Adding the locks as we did in that PR helped, but I think it may not have solved the underlying issue?
#552 (comment)

@designatednerd
Contributor Author

Oof, yeah that's a bit of a mess.

Definitely seems like something weird is going on with the DispatchGroup when it crashes: it's getting a count of 1073741823, which is 2^30 - 1 and looks like an internal counter limit rather than a real count. When it doesn't crash, the count is what I expect it to be (3003, since we have 1001 loops entering the group 3x). I'm betting this is a wraparound error and we're actually at a count of -1.
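
For illustration, the crash being described comes down to leave() outrunning enter() on the group; this is a minimal sketch, not the actual test code:

```swift
import Foundation

let group = DispatchGroup()

// Balanced usage: every enter() is matched by exactly one leave().
group.enter()
DispatchQueue.global().async {
    // ... work ...
    group.leave()
}
group.wait()

// If a completion handler fires more than once and calls leave() each
// time, leave() outnumbers enter(), the group's internal counter drops
// below zero, and libdispatch traps with
// "Unbalanced call to dispatch_group_leave()".
// group.leave()  // uncommenting this extra leave() reproduces the crash
```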

I've tried the solution proffered in this article about using dispatch groups, but it doesn't appear to have made any difference; I'm still seeing that same wraparound error.

Will keep futzing with this.

@AnthonyMDev
Contributor

Yeah, I really think that the DataLoader should not be accessing the cache. That's a bigger infra change than what this last PR did, but I think it's an issue with the design of the system. Nothing outside of the store should be accessing the cache, IMO; the DataLoader should be accessing the store, which internally accesses the cache. I'm not 100% sure that's what's causing this issue, but it's something.
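
A rough sketch of the layering being suggested, using hypothetical stand-in types rather than the apollo-ios API:

```swift
import Foundation

// Illustrative types only; the real project has its own store and
// cache abstractions.
protocol RecordCache {
    func record(forKey key: String) -> [String: Any]?
}

final class RecordStore {
    private let cache: RecordCache
    private let queue = DispatchQueue(label: "store.sync")

    init(cache: RecordCache) { self.cache = cache }

    // Every cache access is funneled through the store's serial queue,
    // so the cache never needs to be reachable from outside the store.
    func withinReadTransaction<T>(_ body: (RecordCache) -> T) -> T {
        return queue.sync { body(cache) }
    }
}

final class Loader {
    private let store: RecordStore
    init(store: RecordStore) { self.store = store }

    func load(key: String) -> [String: Any]? {
        // The loader only ever talks to the store.
        return store.withinReadTransaction { $0.record(forKey: key) }
    }
}
```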

@designatednerd
Contributor Author

It looks like watch is getting over-called pretty significantly in certain instances. I added count vars to each of the three completion closures to see, when there's a crash, how many times each has been called, and it's repeatedly been well over the 1001 calls there should theoretically be. I can't make it crash in a reproducible fashion, but that would certainly crash from the standpoint of "leave was called more than enter on a `DispatchGroup`".

I'm not sure I see where the loader is accessing the store directly, though.
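
A minimal sketch of the counting instrumentation mentioned above (the names are illustrative, not from the actual test):

```swift
import Foundation

// Thread-safe counter that can be bumped from each completion closure,
// so the totals can be inspected after a crash or at the end of the test.
final class CallCounter {
    private let lock = NSLock()
    private var count = 0

    func increment() {
        lock.lock(); defer { lock.unlock() }
        count += 1
    }

    var value: Int {
        lock.lock(); defer { lock.unlock() }
        return count
    }
}

let watchCalls = CallCounter()

// Inside the watch completion closure:
// watchCalls.increment()
// group.leave()  // crashes once leave() has outrun enter()

print("watch completion fired \(watchCalls.value) times")
```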

@RolandasRazma
Contributor

Looking at the master history (https://github.com/apollographql/apollo-ios/commits/master), it looks like this started after the weak delegate change was merged. Are we looking in the correct place?

@designatednerd
Contributor Author

Those are builds that are failing after the merge to master; builds that were failing prior to the merge to master started earlier. But you're correct that it does seem to have increased the velocity. I'll take a look.

@designatednerd
Contributor Author

The flakiness has been resolved. Underlying issues are now being tracked in #1011. Closing this out.
