Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(api-graphql): Data messages should maintain the keep alive status #14164

Merged
merged 8 commits into from
Jan 27, 2025

Conversation

stocaaro
Copy link
Member

@stocaaro stocaaro commented Jan 24, 2025

Description of changes

Problem: In low bandwidth or high js thread contention cases, the AppSync websocket can get backed up. If it gets backed up to the point where 1 minute of data can't be processed in 5 minutes, then no keep alive messages arrive which leads the client to think that the connection is dead, prompting disconnect.

Solution: Any receipt of data should count as a keep alive message.

Next Problem: We receive substantial more data than keep alive messages and using the existing keep alive maintenance logic with each fresh piece of data is expensive. In testing for 500,000 messages received the processing time goes from ~2100ms up to ~3000ms by adding this logic.

Next Solution: Revise the keep alive maintenance logic to capture a timestamp, then have a heartbeat interval that reviews the keep alive alert and disconnect conditions every 5 seconds. As a result, the processing time of maintaining the keep alive heartbeat for each receipt of data for 500,000 messages is ~2150ms

Issue #, if available

Description of how you validated changes

A unit test was added to validate the change. Tested failing without the logic update, passing with the data maintenance logic.

I will be performing manual app testing on Monday 1/27.

Checklist

  • PR description included
  • yarn test passes
  • Unit Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)

Checklist for repo maintainers

  • Verify E2E tests for existing workflows are working as expected or add E2E tests for newly added workflows
  • New source file paths included in this PR have been added to CODEOWNERS, if appropriate

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stocaaro stocaaro requested a review from a team as a code owner January 24, 2025 15:48
@@ -147,6 +147,10 @@ describe('AWSAppSyncRealTimeProvider', () => {
Object.defineProperty(constants, 'RECONNECT_DELAY', {
value: 100,
});
// Reduce the keep alive heartbeat to 10ms
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing we always reduce the heartbeat to 10ms, but only reduce the alert and failure times in the tests that are exercising those event conditions.

await fakeWebSocketInterface?.waitUntilConnectionStateIn([
CS.Connected,
]);
await fakeWebSocketInterface?.waitUntilConnectionStateIn([
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation changed to always access the constants directly where before they where saved in private/instance variables, so these events need to be scoped inside of the constant mocking now.

}

// Recognize we are disconnected if we haven't seen messages in the keep alive timeout period
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check because of unfamiliarity but are we expecting that we might record KA/KA missed above and then disconnect here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reconnect change introduced multiple chatty connection state observables, like keep alive. Rather than adding logic to track current state and check the state to avoid sending a duplicate state, the states observable is pipe'ed through a filter that removes duplicates before they are surfaced to the user.

In this instance, we could check is the connection is times out and return. I can't think of a case where the disconnect would occur on the same check as the first keep alive alert. Its possible that there are suspense/recovery patterns that don't break the web socket that I'm not aware of.

Since KA missed and disconnect are both checks against time, I wouldn't expect KA -> Disconnect logically, but if the timeout constants where changed, I accept that this outcome is possible given how the code is structured right now.

zxl629
zxl629 previously approved these changes Jan 24, 2025
Copy link

@zxl629 zxl629 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stocaaro stocaaro merged commit 8b489d1 into main Jan 27, 2025
30 checks passed
@stocaaro stocaaro deleted the stocaaro/graphql-data-ka/main branch January 27, 2025 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants