[core] add method to kill aws instance to simulate chaos #45546

Merged 5 commits on May 28, 2024
29 changes: 29 additions & 0 deletions python/ray/_private/test_utils.py
@@ -24,6 +24,7 @@
from dataclasses import dataclass

import requests
import paramiko
from ray._raylet import Config

import psutil # We must import psutil after ray because we bundle it with ray.
@@ -1533,6 +1534,34 @@ def _kill_resource(self, node_id, node_to_kill_ip, node_to_kill_port):
)
self.killed.add(node_id)

def _kill_node(self, ip):
Collaborator

Suggested change
- def _kill_node(self, ip):
+ def _terminate_ec2_instance(self, ip):

# This command uses IMDSv2 to get the host instance id and region.
# After that it terminates itself using aws cli.
command = """
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

instanceId=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id/)
region=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region)

aws ec2 terminate-instances --region $region --instance-ids $instanceId
""" # noqa: E501

ssh = paramiko.SSHClient()
Contributor

Why do we need paramiko? Maybe I'm wrong, but can't we just use subprocess with command = "ssh ..."?

If we do add it, I think we also need to declare it somewhere, for example in python/requirements/test-requirements.txt.

Apart from ssh, you can also call the IMDSv2 HTTP endpoint directly. Here is some code (not tested):

import boto3
import requests

def get_instance_metadata(token, path):
    url = f"http://169.254.169.254/latest/meta-data/{path}"
    headers = {'X-aws-ec2-metadata-token': token}
    response = requests.get(url, headers=headers)
    return response.text

def main():
    # Get the metadata token
    token = requests.put(
        'http://169.254.169.254/latest/api/token',
        headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'}
    ).text

    # Get instance ID and region
    instance_id = get_instance_metadata(token, 'instance-id')
    region = get_instance_metadata(token, 'placement/region')

    # Create EC2 client
    ec2 = boto3.client('ec2', region_name=region)

    # Terminate the instance
    response = ec2.terminate_instances(InstanceIds=[instance_id])

    # Print the response
    print(response)

if __name__ == "__main__":
    main()
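
For illustration, the IMDSv2 token flow used in this thread can be isolated from the termination step so that it is easy to exercise without an EC2 instance. This is a minimal sketch, not the code from the PR: it assumes only the Python standard library, and the names `imds_request` and `fetch_identity` are hypothetical. The `request` parameter exists so that a fake can be passed in during testing; on a real instance the default performs the HTTP calls.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service (from the thread).
IMDS = "http://169.254.169.254/latest"


def imds_request(path, method="GET", headers=None, timeout=2.0):
    """Send one request to the metadata service and return the body as text."""
    req = urllib.request.Request(
        f"{IMDS}/{path}", method=method, headers=headers or {}
    )
    # urlopen raises on connection failures and on non-2xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def fetch_identity(request=imds_request):
    """Return (instance_id, region) using the IMDSv2 token flow.

    `request` is injectable (hypothetical design choice) so the flow can be
    checked off-instance with a stub.
    """
    # Step 1: obtain a session token with a PUT request.
    token = request(
        "api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    # Step 2: present the token on every metadata read.
    auth = {"X-aws-ec2-metadata-token": token}
    instance_id = request("meta-data/instance-id", headers=auth)
    region = request("meta-data/placement/region", headers=auth)
    return instance_id, region
```

The returned pair is what either `aws ec2 terminate-instances` or `boto3`'s `terminate_instances` needs. Note that the metadata service describes the instance that makes the request, so this has to run on the node being terminated.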

Member Author

Sure. I will write it in pure ssh.

We can't use the Python code directly, because it isn't wrapped in an ssh command: the metadata lookup has to run on the node being terminated.
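
A minimal sketch of the subprocess-based alternative discussed in this thread, assuming the system `ssh` client is installed and key-based login works for user `ray` on port 2222 (the values used in the diff). The helper names `build_ssh_argv` and `run_over_ssh` are hypothetical, and `StrictHostKeyChecking=no` is only a rough stand-in for paramiko's `AutoAddPolicy`.

```python
import subprocess


def build_ssh_argv(ip, remote_command, user="ray", port=2222):
    """Build the argument list for running one command on a remote node."""
    return [
        "ssh",
        "-p", str(port),
        # Accept unknown host keys without prompting (assumption: test clusters only).
        "-o", "StrictHostKeyChecking=no",
        # Fail instead of prompting for a password.
        "-o", "BatchMode=yes",
        f"{user}@{ip}",
        remote_command,
    ]


def run_over_ssh(ip, remote_command, timeout=60):
    """Run the command remotely and return (returncode, stdout, stderr)."""
    result = subprocess.run(
        build_ssh_argv(ip, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr
```

Passing the command as a list avoids local shell quoting problems; the remote shell still interprets `remote_command`, which is what the multi-line curl and aws script relies on.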

ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# This is a feature on Anyscale platform that enables
Contributor

Maybe we can detect whether we are on Anyscale and otherwise skip the test, so that it does not fail on a local desktop.
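
One way the suggested guard could look, as a sketch only. The environment variable name `ANYSCALE_CLUSTER_ID` is a hypothetical marker, not a confirmed platform variable, and `on_anyscale` and the test class are illustrative names that do not appear in the PR.

```python
import os
import unittest

# Assumption: the platform sets some marker in the environment.
# The variable name below is hypothetical and must be replaced with a real one.
ANYSCALE_MARKER = "ANYSCALE_CLUSTER_ID"


def on_anyscale(environ=os.environ):
    """Return True if the (hypothetical) platform marker is set and non-empty."""
    return bool(environ.get(ANYSCALE_MARKER))


class ChaosInstanceKillTest(unittest.TestCase):
    # Skipped on machines without the marker, so a local run does not fail.
    @unittest.skipUnless(on_anyscale(), "requires an Anyscale cluster")
    def test_terminate_instance(self):
        # Placeholder body: a real test would call the instance-kill utility.
        self.assertTrue(on_anyscale())
```

Taking the environment as a parameter keeps the detection logic checkable without modifying the real process environment.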

Member Author

@hongchaodeng, May 24, 2024

That's something to add to the test. This PR doesn't change that; it only adds the utility method.

That's also why I separated this from #45364. I can't make that call myself and need to talk to other decision makers to follow up on the config change.

# easy ssh access to worker nodes.
ssh.connect(ip, username="ray", port=2222)

stdin, stdout, stderr = ssh.exec_command(command)
output = stdout.read().decode()
error = stderr.read().decode()

stdin.close()

print(f"STDOUT:\n{output}")
print(f"STDERR:\n{error}")

def _kill_raylet(self, ip, port, graceful=False):
import grpc
from grpc._channel import _InactiveRpcError