Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] add method to kill aws instance to simulate chaos #45546

Merged
merged 5 commits into from
May 28, 2024

Conversation

hongchaodeng
Copy link
Member

@hongchaodeng hongchaodeng commented May 24, 2024

Why are these changes needed?

This is separated from #45364.
Have tested it for chaos_test/test_chaos_basic.py.

How to enable this feature is to be discussed? We can enable it for all chaos tests, maybe? Or we can add a flag to migrate tests one by one?

I will test it on more existing tests and discuss with other teams to decide later.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongchaodeng hongchaodeng requested review from jjyao and rynewang May 24, 2024 18:16
@hongchaodeng hongchaodeng added the go add ONLY when ready to merge, run all tests label May 24, 2024
aws ec2 terminate-instances --region $region --instance-ids $instanceId
""" # noqa: E501

ssh = paramiko.SSHClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need paramiko? Maybe I'm wrong but can we just use subprocess with command = "ssh ..."?

If we are to add it, I think we need to add it also somewhere for example python/requirements/test-requirements.txt.

Apart from ssh, you can also curl the IDMSv2 HTTP ports. Here is some code (not tested)

import boto3
import requests

def get_instance_metadata(token, path):
    url = f"http://169.254.169.254/latest/meta-data/{path}"
    headers = {'X-aws-ec2-metadata-token': token}
    response = requests.get(url, headers=headers)
    return response.text

def main():
    # Get the metadata token
    token = requests.put(
        'http://169.254.169.254/latest/api/token',
        headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'}
    ).text

    # Get instance ID and region
    instance_id = get_instance_metadata(token, 'instance-id')
    region = get_instance_metadata(token, 'placement/region')

    # Create EC2 client
    ec2 = boto3.client('ec2', region_name=region)

    # Terminate the instance
    response = ec2.terminate_instances(InstanceIds=[instance_id])

    # Print the response
    print(response)

if __name__ == "__main__":
    main()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will write it in pure ssh.

We can't write python code because it's not wrapped in ssh command.

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# This is a feature on Anyscale platform that enables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can detect if we are on anyscale and else, skip the test so the test does not fail on local desktop.

Copy link
Member Author

@hongchaodeng hongchaodeng May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's something to add to the test. This PR doesn't change that, but only add the utility method.

That's also why I separate this from #45364. I can't make the call and need to talk to other decision makers to follow up on config change.

Signed-off-by: hongchaodeng <[email protected]>
@@ -1533,6 +1533,23 @@ def _kill_resource(self, node_id, node_to_kill_ip, node_to_kill_port):
)
self.killed.add(node_id)

def _kill_node(self, ip):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _kill_node(self, ip):
def _terminate_ec2_instance(self, ip):

Signed-off-by: hongchaodeng <[email protected]>

result = subprocess.run(ssh_command, shell=True, capture_output=True, text=True)
print(f"STDOUT:\n{result.stdout}\n")
print(f"STDERR:\n{result.stderr}\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for the sake of completeness, can we detect if the aws cli return 0 or not, and raise exceptions if it's not returning 0?

Copy link
Member Author

@hongchaodeng hongchaodeng May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
It will raise CalledProcessError if failed.

Signed-off-by: hongchaodeng <[email protected]>
Signed-off-by: hongchaodeng <[email protected]>
@jjyao jjyao merged commit ce0932b into ray-project:master May 28, 2024
6 checks passed
@hongchaodeng hongchaodeng deleted the test-term-ins branch May 28, 2024 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants