-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup & Fix data races in ApolloWebSocket #307
Conversation
- lowercase enum cases - switch on actual enum cases, not its raw values - avoid using rawValue for things not RawRepresentable - avoid using abbreviations - Slightly more consistent use of spaces
Thanks! I'm not too familiar with the code, so I think @knutaa is the best person to give detailed feedback on the changes. As for the concurrency issues, would it be possible to have all processing happen on a single serial queue? If we configure Starscream to use that as its |
Maybe, there's a few other critical paths outside the callbacks though, such as (un)subscribing and I gave it a quick shot locally, but I get test failures by just setting the callbackQueue on the starscream websocket object (and no other changes), haven't dug into why yet… |
Regardless if we can get rid of the lock, I think avoiding hitting the main queue when WebSocket messages are received would make me feel a lot better. Thanks for looking into this, I didn't realize that was what Starscream was doing! |
I'm really waiting for this because it's a fix #308 |
The websocket in WebSocketTransport declared as internal let websocket: WebSocketClient is the protocol and not the full Starscream WebSocket. This is done in support of the mocked testing. It should be possible to set the callback processingQueue with something like
Would it then be possible to wrap the content of write() and replace the locks in sendHelper and unsubscribe? Other locks should be redundant. The current use of processingQueue for the delegate methods should (of course?) be redundant. |
Yeah I didn't discover the I quickly tried setting the callbackQueue and wrapping Honestly I feel a bit bad about dumping this PR in your lap. It contains some opinionated stylistic and future proofing changes mixed in with some actual fixes. |
in case it isn't obvious you run the tests with TSAN by going to Edit Scheme (opt click the run icon) for the ApolloWebSockets scheme, then select the Test configuration in the sidebar thing, then the Diagnostics tab > ThreadSanitizer in case any one else want to have a look the data race it finds. |
Based on #310 here is a compare with just the concurrency fixes It still uses a NSLock, because, well, dispatch queues are a pretty heavy hammer for a small locking problem, and it tends to complicate things by turning a synchronous problem into an asynchronous one. TSAN still triggers on the state mutation FetchQueryOperation.fetchFromNetwork() in the main Apollo project though. |
@js ping me when you get back from vacation, let's try to get this good to merge 😃 |
@js You still on vacation or do you think you might have a chance to take a look at this? |
@designatednerd I'll take a look over the weekend |
Any updates on this? I am experiencing race condition issues on disconnecting from multiple subscriptions and am interested in seeing this merged. |
Sorry, I simply haven't found the time to look at this. Took a brief look last night but I've completely lost all context I had a year ago :) and my project still doesn't use graphql subscriptions so haven't found the excuse to weave it in there When running the tests (the main apollo test bundle), TSAN seems triggered by something with the ReadWriteLock inside the barrier blocks but it doesn't give much of the usual backtrace/hints as to what it thinks the problem is… 🤷♂ I'm closing this, but if anyone else wants the interesting bits seems to be in ec42e1d |
This PR commits the ultimate sin of mixing several changes in the same PR, sorry not sorry.
The ApolloWebSocket had a vastly different coding style than the rest of the project, and diverged a bit from standard swift style as well, so since I'm a total snob I had to fix that before I could fix the actual problem:
Due to the fact that Starscream delivers its callbacks on the main queue by default (it can be configured to another DispatchQueue but we'd still have the same problem) and the
NetworkTransport
'ssend(....)
is operated from whatever queue the callingAsynchronousOperation
gets assigned to, this can lead to race conditions on some of the state variables, as detected by the Thread Sanitizer. So this PR adds a lock around some of the state variables (mainly the acking of messages) as well as processing of starscream callbacks on a serial queue.I'm not too happy about the latter there as I feel there should be another design around coordinating thread access with the NetworkTransport implementations, what do you think?