-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] add method to kill aws instance to simulate chaos #45546
Conversation
19b10da
to
cc4b6df
Compare
cc4b6df
to
550747a
Compare
Signed-off-by: hongchaodeng <[email protected]>
550747a
to
197df39
Compare
python/ray/_private/test_utils.py
Outdated
aws ec2 terminate-instances --region $region --instance-ids $instanceId | ||
""" # noqa: E501 | ||
|
||
ssh = paramiko.SSHClient() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need paramiko? Maybe I'm wrong but can we just use subprocess with command = "ssh ..."?
If we are to add it, I think we need to add it also somewhere for example python/requirements/test-requirements.txt.
Apart from ssh, you can also curl the IDMSv2 HTTP ports. Here is some code (not tested)
import boto3
import requests
def get_instance_metadata(token, path):
url = f"http://169.254.169.254/latest/meta-data/{path}"
headers = {'X-aws-ec2-metadata-token': token}
response = requests.get(url, headers=headers)
return response.text
def main():
# Get the metadata token
token = requests.put(
'http://169.254.169.254/latest/api/token',
headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'}
).text
# Get instance ID and region
instance_id = get_instance_metadata(token, 'instance-id')
region = get_instance_metadata(token, 'placement/region')
# Create EC2 client
ec2 = boto3.client('ec2', region_name=region)
# Terminate the instance
response = ec2.terminate_instances(InstanceIds=[instance_id])
# Print the response
print(response)
if __name__ == "__main__":
main()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will write it in pure ssh.
We can't write python code because it's not wrapped in ssh command.
ssh = paramiko.SSHClient() | ||
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) | ||
|
||
# This is a feature on Anyscale platform that enables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can detect if we are on anyscale and else, skip the test so the test does not fail on local desktop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's something to add to the test. This PR doesn't change that, but only add the utility method.
That's also why I separate this from #45364. I can't make the call and need to talk to other decision makers to follow up on config change.
Signed-off-by: hongchaodeng <[email protected]>
python/ray/_private/test_utils.py
Outdated
@@ -1533,6 +1533,23 @@ def _kill_resource(self, node_id, node_to_kill_ip, node_to_kill_port): | |||
) | |||
self.killed.add(node_id) | |||
|
|||
def _kill_node(self, ip): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _kill_node(self, ip): | |
def _terminate_ec2_instance(self, ip): |
Signed-off-by: hongchaodeng <[email protected]>
93d68ae
to
db89199
Compare
|
||
result = subprocess.run(ssh_command, shell=True, capture_output=True, text=True) | ||
print(f"STDOUT:\n{result.stdout}\n") | ||
print(f"STDERR:\n{result.stderr}\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for the sake of completeness, can we detect if the aws cli return 0 or not, and raise exceptions if it's not returning 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
It will raise CalledProcessError if failed.
Signed-off-by: hongchaodeng <[email protected]>
Signed-off-by: hongchaodeng <[email protected]>
Why are these changes needed?
This is separated from #45364.
Have tested it for
chaos_test/test_chaos_basic.py
.How to enable this feature is to be discussed? We can enable it for all chaos tests, maybe? Or we can add a flag to migrate tests one by one?
I will test it on more existing tests and discuss with other teams to decide later.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.