Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement pull query routing to standbys if active is down #4398

Merged
merged 11 commits into from
Feb 6, 2020

Conversation

vpapavas
Copy link
Member

Description

Previously, pull queries were handled only by the active node of a partition. If the active goes down or rebalance happens, pull queries experience a period of unavailability. This PR allows standbys to handle pull queries until a new active is elected.

For this, it uses the liveness information provided by the heartbeat mechanism. If a node determines that the active is dead, it will forward the query to a standby that is alive. A query fails only if both the active and all standbys are dead.

Additionally, there is no busy loop with timeout when routing queries. Instead queries fail fast. This allows the client to implement the logic of retries.

Testing done

Functional test that ensures queries are routed to standby or active based on whether active is dead or alive.

In addition to this, manual tests will be performed both locally and on EC2 cluster as outlined in #4360

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vpapavas vpapavas requested a review from a team as a code owner January 28, 2020 21:45
@vpapavas vpapavas requested a review from agavra January 28, 2020 21:45
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @vpapavas, this is great work!

There's a lot of comments below, but many/most are just suggestions of improvements to the code.

Happy to review again - just add me back in or ping me on slack.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't had a chance to do a full review yet and I've gotta go, so will complete later. I think this could benefit from being split into 2 PRs (see inline comment)

@agavra agavra self-requested a review January 31, 2020 16:42
Copy link
Contributor

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few refactoring/structural comments.. Will do a final pass once the pending comments are pushed and pr rebased against your other change

progress

Added functional test for routing and endpoint for active standby info

undo changes to log files

fixing tests

fixed tests

addressed vinoth's comments

fixed tests

fixed broken test after kafka merge

address alans comments

add missed file

made hostStatus map atomic reference, changed copy-on-read to copy-on-write

fix dead store bug?

applying andy's comments

successful rebase

rebase and apply comments
@vpapavas vpapavas force-pushed the confluent-ha-routing branch from a0de6e1 to 626b7ef Compare February 5, 2020 04:42
Copy link
Contributor

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work @vpapavas ! Almost ready.. One more round and we should be home..

Could you also please add unit tests for the two RoutingFilter impls and RoutingFilters class.

Copy link
Contributor

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor logging related comments..

LGTM! Please go ahead merge, once you are ready!

@vpapavas vpapavas merged commit ace23b1 into confluentinc:master Feb 6, 2020
@vinothchandar
Copy link
Contributor

@vpapavas seems like the CI build has an issue?

[2020-02-06T06:30:06.538Z] [INFO] -------------------------------------------------------------
[2020-02-06T06:30:06.538Z] [ERROR] COMPILATION ERROR : 
[2020-02-06T06:30:06.538Z] [INFO] -------------------------------------------------------------
[2020-02-06T06:30:06.538Z] [ERROR] /home/jenkins/workspace/entinc_Contributors_ksql_PR-4398/ksql-rest-app/src/test/java/io/confluent/ksql/rest/integration/PullQueryRoutingFunctionalTest.java:[100,52] cannot find symbol
[2020-02-06T06:30:06.538Z]   symbol:   variable JSON
[2020-02-06T06:30:06.538Z]   location: interface io.confluent.ksql.serde.Format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants