Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lifecycle hooks for cloudmap (de)registration operations on vtgate pods #3934

Conversation

VinaySagarGonabavi
Copy link
Contributor

Added the scripts to the docker image in https://github.yelpcorp.com/docker-images/vitess_base/pull/21
AWS role and policy added for namespace pods in https://github.yelpcorp.com/misc/terraform-code/pull/24481

@VinaySagarGonabavi VinaySagarGonabavi force-pushed the u/gonabavi/u/gonabavi/DREIMP-10901_add_cloudmap_scripts_add_lifecycle_hooks_for_vtgate_pods branch from 665c15f to 93adf70 Compare August 22, 2024 17:18
@VinaySagarGonabavi VinaySagarGonabavi requested review from alexanderjulianmartinez and alex-eid and removed request for alex-eid August 22, 2024 17:27
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any notes about y'all's discussion with CIC about potentially integrating vitess with our mesh? using hooks for service discovery registration/deregistration makes me a little nervous (e.g., what happens in cloudmap if a k8s node goes offline in a way where the prestop hooks for vitess pods don't run?)

Comment on lines +737 to +742
def get_aws_region(self) -> str:
region = self.get_region()
return f"{region[:2]}-{region[2:6]}-{region[6:7]}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a tad brittle as this slicing won't work for all AWS regions - it's probably fine for where vitess will be deployed initially, but perhaps we can have the region identification happen inside the script itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can have the region identification happen inside the script itself?
Can you explain this? Or do you mean pass it from yelpsoa?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VinaySagarGonabavi my bad, i meant that the register/deregister script could read /nail/etc/habitat and figure out the region from there - in general, it's probably better to have the script be somewhat independent of how paasta is configuring pods since it's easier/faster to roll out an image change for your service than to do anything else (e.g., you can avoid having to go through a paasta release if the script is called with no or minimal arguments)

paasta_tools/vitesscluster_tools.py Show resolved Hide resolved
@VinaySagarGonabavi
Copy link
Contributor Author

Reopening this as we moved back to a single VitessCluster setup in #3947

@VinaySagarGonabavi
Copy link
Contributor Author

VinaySagarGonabavi commented Sep 6, 2024

are there any notes about y'all's discussion with CIC about potentially integrating vitess with our mesh? using hooks for service discovery registration/deregistration makes me a little nervous (e.g., what happens in cloudmap if a k8s node goes offline in a way where the prestop hooks for vitess pods don't run?)

Notes from a few rounds of discussion with CIC/Service Mesh on this are noted in https://jira.yelpcorp.com/browse/DREIMP-10901 and the last discussion with CIPX is in https://yelp.slack.com/archives/C060C8L80LE/p1723597193236439?thread_ts=1723511315.755149&cid=C060C8L80LE

For the cases of when a pod goes away before it could deregister itself from cloudmap, we're planning to handle with an external monitoring check to query the exposed endpoint /debug/health from vtgate service on each pod ip registered in a cloudmap service. If either the pod is not responsive or the health check fails, it can be handled with an auto remediation step of deregistering the ip. It's not ideal but as Krall pointed out if the external monitoring job runs frequently enough, this case can be handled

@VinaySagarGonabavi VinaySagarGonabavi force-pushed the u/gonabavi/u/gonabavi/DREIMP-10901_add_cloudmap_scripts_add_lifecycle_hooks_for_vtgate_pods branch from 93adf70 to 43758bc Compare September 6, 2024 03:42
@VinaySagarGonabavi VinaySagarGonabavi merged commit bf2f8cf into master Sep 6, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants