Add lifecycle hooks for cloudmap (de)registration operations on vtgate pods #3934
Conversation
Force-pushed from 665c15f to 93adf70
are there any notes about y'all's discussion with CIC about potentially integrating vitess with our mesh? using hooks for service discovery registration/deregistration makes me a little nervous (e.g., what happens in cloudmap if a k8s node goes offline in a way where the prestop hooks for vitess pods don't run?)
def get_aws_region(self) -> str:
    region = self.get_region()
    return f"{region[:2]}-{region[2:6]}-{region[6:7]}"
this seems a tad brittle as this slicing won't work for all AWS regions - it's probably fine for where vitess will be deployed initially, but perhaps we can have the region identification happen inside the script itself?
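For illustration, here is one way the fixed-offset slicing breaks: it happens to work for `uswest1` → `us-west-1`, but not for longer compact names like `apsoutheast1`. A hedged sketch of an offset-free alternative, matching the known direction tokens instead (the helper name and the compact-region format are assumptions, not from this PR):

```python
import re

# Hypothetical helper: expand a compact region token like "uswest1"
# into the dashed AWS region name, without fixed string offsets.
_REGION_RE = re.compile(
    r"^([a-z]{2})(gov)?"
    r"(north|south|east|west|central|"
    r"northeast|northwest|southeast|southwest)"
    r"(\d)$"
)

def parse_compact_region(compact: str) -> str:
    m = _REGION_RE.match(compact)
    if not m:
        raise ValueError(f"unrecognized compact region: {compact!r}")
    # Drop the optional "gov" group when absent, then join with dashes.
    return "-".join(p for p in m.groups() if p)
```

With this, `parse_compact_region("apsoutheast1")` yields `ap-southeast-1`, where the slicing approach would produce `ap-sout-h`.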
> we can have the region identification happen inside the script itself?

Can you explain this? Or do you mean pass it from yelpsoa?
@VinaySagarGonabavi my bad, i meant that the register/deregister script could read /nail/etc/habitat and figure out the region from there - in general, it's probably better to have the script be somewhat independent of how paasta is configuring pods since it's easier/faster to roll out an image change for your service than to do anything else (e.g., you can avoid having to go through a paasta release if the script is called with no or minimal arguments)
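A minimal sketch of what "read /nail/etc/habitat inside the script" could look like. The habitat file layout (a single token such as `uswest1devc`, with a compact region prefix ending at the first digit) is an assumption for illustration, not something confirmed in this thread:

```python
from pathlib import Path

HABITAT_PATH = Path("/nail/etc/habitat")  # hypothetical default location

def region_from_habitat(habitat: str) -> str:
    # Assumed format: the token starts with a compact region like
    # "uswest1"; everything after the trailing digit is the ecosystem.
    for i, ch in enumerate(habitat):
        if ch.isdigit():
            return habitat[: i + 1]
    raise ValueError(f"no region prefix found in habitat {habitat!r}")

def get_region(path: Path = HABITAT_PATH) -> str:
    return region_from_habitat(path.read_text().strip())
```

Reading it in the script keeps the register/deregister hooks callable with no or minimal arguments, so an image rebuild is enough to change the behavior, no paasta release needed.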
Reopening this as we moved back to a single VitessCluster setup in #3947
Notes from a few rounds of discussion with CIC/Service Mesh on this are in https://jira.yelpcorp.com/browse/DREIMP-10901, and the last discussion with CIPX is in https://yelp.slack.com/archives/C060C8L80LE/p1723597193236439?thread_ts=1723511315.755149&cid=C060C8L80LE

For the case where a pod goes away before it can deregister itself from cloudmap, we're planning to handle it with an external monitoring check that queries the exposed /debug/health endpoint of the vtgate service on each pod IP registered in a cloudmap service. If the pod is unresponsive or the health check fails, an auto-remediation step deregisters the IP. It's not ideal, but as Krall pointed out, if the external monitoring job runs frequently enough this case can be handled.
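The remediation check above could be sketched as a pure decision step plus a probe. Everything here is illustrative: the port, the function names, and the (instance_id, ip) pairing are assumptions, and the actual Cloud Map list/deregister wiring (e.g. via boto3 `servicediscovery`) is left to the caller:

```python
import urllib.request
from typing import Callable, Iterable

def vtgate_healthy(ip: str, port: int = 15001, timeout: float = 2.0) -> bool:
    """Probe the pod's /debug/health endpoint; unreachable pods count
    as unhealthy. Port 15001 is an assumed vtgate HTTP port."""
    url = f"http://{ip}:{port}/debug/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def find_stale_instances(
    instances: Iterable[tuple[str, str]],
    probe: Callable[[str], bool] = vtgate_healthy,
) -> list[str]:
    """Given (instance_id, ip) pairs listed from the cloudmap service,
    return the instance ids whose probe fails; the monitoring job then
    deregisters those ids."""
    return [iid for iid, ip in instances if not probe(ip)]
```

Separating the decision from the probe makes the job easy to test with a fake probe, and a frequent-enough schedule bounds how long a dead pod's IP can linger in cloudmap.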
Force-pushed from 93adf70 to 43758bc
Added the scripts to the docker image in https://github.yelpcorp.com/docker-images/vitess_base/pull/21
AWS role and policy added for namespace pods in https://github.yelpcorp.com/misc/terraform-code/pull/24481