Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-9724 IO500 example improvements #15

Merged
merged 5 commits into from
Feb 10, 2022
Merged

DAOS-9724 IO500 example improvements #15

merged 5 commits into from
Feb 10, 2022

Conversation

markaolson
Copy link
Contributor

Remove Intel OneAPI MPI from daos-client images

Build IO500 specific daos-client images

Removed installation steps for Intel OneAPI MPI, mpifileutils, and IO500-SC21 from run_io500-sc21.sh.

Ability to create daos-server and daos-client images with different versions of DAOS.

Add multiple config files in IO500 example.

Signed-off-by: Mark A. Olson [email protected]

Mark A. Olson added 2 commits February 7, 2022 21:58
Remove Intel OneAPI MPI from daos-client images

Build IO500 specific daos-client images

Removed the installation steps for Intel OneAPI MPI, mpifileutils, and IO500-SC21 from run_io500-sc21.sh.

Ability to create daos-server and daos-client images with different versions of DAOS.

Add multiple config files in IO500 example.

Signed-off-by: Mark A. Olson <[email protected]>
},
"variables": {
"daos_version": "{{env `DAOS_VERSION`}}"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson How is the value for DAOS_VERSION set? We don't wan Users messing around with this dev feature. We want our users on Intel/GCP Validated machines images not a choose your own adventure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version is no longer hard coded in each of the packer templates.
The default version is now in a single file images/daos_version.sh.
That file contains the default version used when no version is specified in the -v --version option of images/build_images.sh


# Default DAOS version to be installed in images
export DEFAULT_DAOS_VERSION="2.0.1"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson This makes is very user accessible. Can you put this in a less easy spot to find?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it a problem for the user to see which version of DAOS will be used as the default when they build their images? What other spot do you suggest?

OR

export DAOS_INSTALL_TYPE="server"
export DAOS_VERSION="2.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is very user accessible. We don't want users doing this. "export DAOS_VERSION="2.0.0"" Can you better hid this dev feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding the reasoning behind trying to hide how the version number is set.

Are you asking me to just remove that from the help output or remove the ability to pass the version in an environment variable?

Probably deserves an offline discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson After some thought lets just leave this as an undocumented feature.

The core issue is we do not want DAOS GCP USERs playing around with DAOS Versions. This should be an Experts only type activity. As you can just see there are DAOS 2.0 Lib fabric changes getting landed. If a USE decide DAOS 1.2 or 1.0 would be fun to try bad things would happen.

@@ -0,0 +1,46 @@
#!/bin/bash
# Copyright 2021 Google LLC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson we work for Intel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write this file. I ran it through shellcheck and fixed the issues that shellcheck identified. I also renamed it from setup-server.sh to setup_server.sh so that the word separator was consistent with the other bash scripts. That's why it's showing up as a new file. I will remove the copyright notice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just add a 2nd intel copyright line as I understand.

@@ -0,0 +1,371 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson I am not a big fan of this but ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting me know.

@@ -67,7 +65,7 @@ export TF_VAR_server_template_name="${DAOS_SERVER_BASE_NAME}"
export TF_VAR_server_mig_name="${DAOS_SERVER_BASE_NAME}"
export TF_VAR_server_machine_type="${DAOS_SERVER_MACHINE_TYPE}"
export TF_VAR_server_os_project="${TF_VAR_project_id}"
export TF_VAR_server_os_family="daos-server-centos-7"
export TF_VAR_server_os_family="daos-server-io500-centos-7"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson Why do we need a daos-server-io500-centos-7 image?

IO500 is a client application a special server should not be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there are multiple people working in the same project one person might be testing IO500 on DAOS 2.0.0 and another person may be testing the terraform/examples/full_cluster_setup or testing images with new versions in some other way. This actually happened with Johann and I a couple of times where my images messed up what he was doing. This is the first step in fixing the issue. When you run the IO500 the server and client images need to match. If you run the --version and pass a new version with the --force option you do not want to blow away your main daos-server-centos-7 image. So this just ensures that the images you are using are specifically built for the IO500 example and don't interfere with anything else. We still have a larger issue we need to address in a future PR. We need a way to tag/label the images with the version number so that the scripts can check for existing images with a specific version. Topic for another day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson had a good chat today. I better understand what is going on.

@@ -0,0 +1,79 @@
#!/bin/bash

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see having 1 config per DAOS Server config.

If a user wants more clients or server they can do that at the Terraform layer.

This is is way too many scripts here.

@@ -0,0 +1,78 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many copies of the same basic thing.

sed -i "s/^\(\s*\)targets:.*/\1targets: ${DAOS_SERVER_DISK_COUNT}/g" "${SCRIPT_DIR}/daos_server.yml"
sed -i "s/^\(\s*\)scm_size:.*/\1scm_size: ${DAOS_SERVER_SCM_SIZE}/g" "${SCRIPT_DIR}/daos_server.yml"

# Copy daos_server.yml to all servers
Copy link
Contributor

@Kmannth Kmannth Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson This type of things needs to be apart of the Server code not of any IO500 Example. Please refactor this logic into Server code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent you an e-mail about this. You may have not seen it yet. The TODO comment at the top of the file was added because, yes, everything in this file this needs to move to terraform/startup scripts. However, it's too big of a change to include in this PR. Although this is a new file this code already existed in different files. I didn't add this feature. I just moved it to this file and called clush to deploy the .yml files instead of looping in a for loop and calling SSH for each server. I looked for code that manipulated daos_*.yml files in different scripts in the terraform/examples/io500 directory and then consolidated it into this single file. That way when we actually do move the configuration to terraform/startup scripts it should be easy to yank this out by deleting this one file and then removing the line in start.sh that calls it. I think the refactoring you are requesting should be done in another PR to focus specifically on those changes rather than make this PR larger. @cboneti and @johannlombardi do you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson I agree this patch is huge. Breaking things into smaller pieces. helps.

Can we open an issue to track IO500 work somewhere like we have been talking about for few weeks? If so then I can agree to a fix in future version type way.


log_section "Destroying DAOS Servers & Clients"

pushd "${SCRIPT_DIR}/../full_cluster_setup"
terraform destroy -auto-approve
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markaolson can you help me understand all the ACTIVE_CONFIG logic?

What is the use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config/active_config.sh symlink is created by the start.sh script. It points to the config that was sourced when the start script ran terraform to provision the instances. The symlink is then used by the stop.sh script to know which config to source when it runs terraform to tear down the environment. Without the symlink you you have to manually pass the last config you used to the stop.sh script every time you run it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop and start should just be + ansible start + ansible destroy.

What happens if I spawn 2 different DAOS system for 2 difference clusters from the same project?

After I create the 2nd how do a I get back to the 1st?

Copy link
Contributor

@Kmannth Kmannth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is a huge patch with many things happening in it.

On first pass it look mostly good.
A few things that stood out to me:

  1. Exposing DAOS_VERSION to users does not seem like a good step. I understand is very helpful for DEV but having a specific config file and full docs for it too much for me. We don't want GCP DAOS users mixing and match non validated DAOS release on GCP. I 100% support having this functionally as a Exports only type thing.

  2. There is a big IO500 refactor.
    We need one example per Server configs.
    Users should update the tfvars for number of clients/servers. There is a lot of redundant scripts.
    No no no to more updating the server.yml from IO500 scripts. Maybe drop out the I0500 running stuff for now while it can be fixed.

Mark A. Olson added 3 commits February 8, 2022 11:58
Changed values of os_family vars to match the names in the packer json files.

Signed-off-by: Mark A. Olson <[email protected]>
This file was supposed to show up in the git history as
being renamed to setup_server.sh.

Changed the word separator to an underscore to be consistent
with other multi-word files in the same directory.

Signed-off-by: Mark A. Olson <[email protected]>
Removed version information in help
Removed invalid config_4c_2s_16d.sh config

Signed-off-by: Mark A. Olson <[email protected]>
Copy link
Contributor

@Kmannth Kmannth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to land this in it's current state.

Copy link
Contributor

@lsitkiew lsitkiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me

@johannlombardi johannlombardi merged commit 7616dad into daos-stack:master Feb 10, 2022
@markaolson markaolson deleted the DAOS-9724_2 branch February 10, 2022 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants