
Flaky cache tests #993

Closed
designatednerd opened this issue Feb 3, 2020 · 18 comments

@designatednerd
Contributor

We're seeing some really flaky cache tests, particularly since #552 was merged. It may or may not be related to the changes in that PR.

@designatednerd
Contributor Author

@RolandasRazma @AnthonyMDev Would either of you have a second to take a look at this? Obviously flakiness is obnoxious to diagnose but we're getting intermittent and weird failures on a bunch of tests on CI.

@RolandasRazma
Contributor

RolandasRazma commented Feb 4, 2020

Yes, it's probably related; I don't like that tight 0...1000 loop at all.
We can definitely reduce that loop to only a few iterations, since there's no locking at all in the new code base.
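
For context, the general shape of that stress loop is roughly the following. This is a minimal sketch with a stand-in `FakeCache` type, not the actual ApolloCacheDependentTests code:

```swift
import Foundation

// Stand-in cache guarded by a plain lock; the real test exercises the
// Apollo cache implementations instead.
final class FakeCache {
    private let lock = NSLock()
    private var storage: [String: String] = [:]

    func write(_ value: String, for key: String) {
        lock.lock(); defer { lock.unlock() }
        storage[key] = value
    }

    func read(_ key: String) -> String? {
        lock.lock(); defer { lock.unlock() }
        return storage[key]
    }
}

let cache = FakeCache()
let group = DispatchGroup()
let queue = DispatchQueue(label: "stress", attributes: .concurrent)

// The "tight loop": 1001 iterations, each entering the group once per
// piece of async work and leaving from the completion.
for i in 0...1000 {
    group.enter()
    queue.async {
        cache.write("value-\(i)", for: "key-\(i)")
        group.leave()
    }

    group.enter()
    queue.async {
        _ = cache.read("key-\(i)")
        group.leave()
    }
}

group.wait()
```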

@designatednerd
Contributor Author

It's not failing consistently on one particular test, though; that's the bit that makes this so obnoxious to try to figure out.

@RolandasRazma
Contributor

If the tests are running in parallel, maybe slowing them down a bit would be enough to stop them knocking each other over?

@designatednerd
Contributor Author

I don't think that's a great solution - we do want the tests to reflect potentially very high loads, and if we just slow them down in order to make them pass, we're absolutely going to miss something.

@designatednerd
Contributor Author

This is a good example of some of the failures we're seeing: There's no change except to our documentation, but it's randomly having a bunch of tests time out.

@AnthonyMDev
Contributor

AnthonyMDev commented Feb 4, 2020 via email

@designatednerd
Contributor Author

designatednerd commented Feb 4, 2020

Does seem like we're getting some repeated crashes. Going to keep updating links as these repeat:

testThreadedCache: 8 failing builds so far (links 1 through 8)

testLoadingHeroAndFriendsNamesQueryWithIDs: 5 failing builds so far (links 1 through 5)

And also some one-offs (so far):

testReadHeroNameQuery: 1
testLoadingHeroAndFriendsNamesQueryWithNullFriends: 1
testUpdateHeroAndFriendsNamesQueryWithVariable: 1
testHeroNameConditionalInclusion: 1

@designatednerd
Contributor Author

OK running locally, it looks like I can make testThreadedCache crash pretty frequently running the ApolloSQLite tests on macOS.

It's crashing here when it does crash:
[Screenshot: "Screen Shot 2020-02-04 at 2 58 14 PM", showing where the crash occurs]

I'm not quite sure why this would be crashing though - especially if it's not consistent.

@AnthonyMDev
Contributor

I would love to help look into this, but I am slammed at work this week. I'll be able to look into this more next week. But I'm pretty confident that if that test is failing or crashing, it is indicative of a problem in the code, not just in the test... I could be wrong, but I'd have to look at it next week if this isn't resolved by then.

@designatednerd
Contributor Author

@AnthonyMDev No worries, totally understand! I'm just throwing my notes up to try and see if it helps anyone think of something, because I'm kind of mystified by this.

The recursive lock was only added to the in-memory cache, but this test was added to the SQLite tests as well, since it's in the ApolloCacheDependentTests bundle. Both on CI and locally, this test passes consistently in the non-SQLite test cases on macOS, but crashes pretty often in the SQLite test cases on macOS.

It's crashing intermittently in the same place on iOS and tvOS when not using SQLite, both on CI and locally.

@AnthonyMDev
Contributor

Yeah, this should be in ApolloCacheDependentTests. It should be run against every cache type, to ensure that all cache types are A) locking to prevent race conditions, and B) not causing deadlocks with that locking.
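
As a reminder of what point B is guarding against, here is a rough sketch of the re-entrancy case a recursive lock handles and a plain NSLock would deadlock on (`InMemoryStore` is a hypothetical stand-in, not the Apollo cache):

```swift
import Foundation

final class InMemoryStore {
    // A recursive lock lets the same thread re-acquire the lock it
    // already holds; with a plain NSLock, merge() below would deadlock
    // when it calls read() while the lock is still held.
    private let lock = NSRecursiveLock()
    private var records: [String: Int] = [:]

    func read(_ key: String) -> Int? {
        lock.lock(); defer { lock.unlock() }
        return records[key]
    }

    func merge(_ key: String, delta: Int) {
        lock.lock(); defer { lock.unlock() }
        let current = read(key) ?? 0  // re-entrant call on the same thread
        records[key] = current + delta
    }
}

let store = InMemoryStore()
store.merge("hero", delta: 1)
print(store.read("hero") ?? 0)  // 1
```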

This comment from before might give you something to think about to solve this? This could be due to the way the DataLoader is working. Adding the locks as we did in that PR helped, but I think it may not have solved the underlying issue?
#552 (comment)

@designatednerd
Contributor Author

Oof, yeah that's a bit of a mess.

Definitely seems like something weird is going on with the DispatchGroup when it crashes: it's getting a count of 1073741823, which is 2^30 - 1 and looks like an internal counter limit rather than a real count. When it doesn't crash, the count is what I expect it to be (3003, since we have 1001 loops entering the group 3x). I'm betting this is a wraparound error and we're actually at a count of -1.
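
For illustration, the crash being described comes down to leave() outrunning enter() on the group; this is a minimal sketch, not the actual test code:

```swift
import Foundation

let group = DispatchGroup()

// Balanced usage: every enter() is matched by exactly one leave().
group.enter()
DispatchQueue.global().async {
    // ... work ...
    group.leave()
}
group.wait()

// If a completion handler fires more than once and calls leave() each
// time, leave() outnumbers enter(), the group's internal counter drops
// below zero, and libdispatch traps with
// "Unbalanced call to dispatch_group_leave()".
// group.leave()  // uncommenting this extra leave() reproduces the crash
```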

I've tried the solution proffered in this article about using dispatch groups, but it doesn't appear to have made any difference; I'm still seeing that same wraparound error.

Will keep futzing with this.

@AnthonyMDev
Contributor

Yeah, I really think that the DataLoader should not be accessing the cache. That's a bigger infra change than what this last PR did, but I think it's an issue with the design of the system. Nothing outside of the store should be accessing the cache, IMO; the DataLoader should be accessing the store, which internally accesses the cache. I'm not 100% sure that's what's causing this issue, but it's something.
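
A rough sketch of the layering being suggested, using hypothetical stand-in types rather than the apollo-ios API:

```swift
import Foundation

// Illustrative types only; the real project has its own store and
// cache abstractions.
protocol RecordCache {
    func record(forKey key: String) -> [String: Any]?
}

final class RecordStore {
    private let cache: RecordCache
    private let queue = DispatchQueue(label: "store.sync")

    init(cache: RecordCache) { self.cache = cache }

    // Every cache access is funneled through the store's serial queue,
    // so the cache never needs to be reachable from outside the store.
    func withinReadTransaction<T>(_ body: (RecordCache) -> T) -> T {
        return queue.sync { body(cache) }
    }
}

final class Loader {
    private let store: RecordStore
    init(store: RecordStore) { self.store = store }

    func load(key: String) -> [String: Any]? {
        // The loader only ever talks to the store.
        return store.withinReadTransaction { $0.record(forKey: key) }
    }
}
```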

@designatednerd
Contributor Author

It looks like watch is getting over-called pretty significantly in certain instances. I added count vars to each of the three completion closures to see, when there's a crash, how many times each has been called, and it's repeatedly been well over the 1001 calls there should theoretically be. I can't make it crash in a reproducible fashion, but that would certainly crash from the standpoint of "leave was called more than enter on a `DispatchGroup`".

I'm not sure I see where the loader is accessing the store directly, though.
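
A minimal sketch of the counting instrumentation mentioned above (the names are illustrative, not from the actual test):

```swift
import Foundation

// Thread-safe counter that can be bumped from each completion closure,
// so the totals can be inspected after a crash or at the end of the test.
final class CallCounter {
    private let lock = NSLock()
    private var count = 0

    func increment() {
        lock.lock(); defer { lock.unlock() }
        count += 1
    }

    var value: Int {
        lock.lock(); defer { lock.unlock() }
        return count
    }
}

let watchCalls = CallCounter()

// Inside the watch completion closure:
// watchCalls.increment()
// group.leave()  // crashes once leave() has outrun enter()

print("watch completion fired \(watchCalls.value) times")
```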

@RolandasRazma
Contributor

Looking at the master history (https://github.com/apollographql/apollo-ios/commits/master), it looks like this started after the weak delegate change was merged. Are we looking in the correct place?

@designatednerd
Contributor Author

Those are builds that are failing after the merge to master; builds that were failing prior to the merge to master started earlier. But you're correct that it does seem to have increased the velocity. I'll take a look.

@designatednerd
Contributor Author

The flakiness has been resolved. Underlying issues are now being tracked in #1011. Closing this out.
