Creating a virtual cluster on demand in an OpenStack environment including a UNICORE instance.
The following software and tools are used to set up the virtual UNICORE cluster:
- Terraform (Infrastructure as Code)
- BeeGFS/BeeOND (Shared File System)
- TORQUE (Batch System)
- UNICORE Server (Middleware)
- UNICORE workflow system (Workflow System)
- Zabbix (Monitoring System)
In order to set up VALET you need to fulfill the following prerequisites:
- You need access to an OpenStack-driven cloud (for example the de.NBI cloud)
- Further you need access to the OpenStack API and permissions to upload images
- An openrc file with the correct credentials needs to be available (it can be downloaded from the OpenStack dashboard, Horizon)
- Installed version of Terraform (tested with v0.12.10)
- Access to remote resources (internet)
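Before you start, you can quickly verify that the tooling prerequisites are in place; a minimal check (the OpenStack CLI is only needed if you want to inspect your project from the command line):
terraform version        # should report a 0.12.x release, e.g. v0.12.10
openstack --version      # prints the version of the installed OpenStack client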
This section lists the most up-to-date and tested images for the master and compute nodes. If you want to use older images for some reason, you will need to change the names in the Terraform vars.tf file.
- master image : unicore_master_centos_20190712.qcow2
- compute image : unicore_compute_centos_20190719.qcow2
- master image : unicore_master_centos_20190702.qcow2
- compute image : unicore_compute_centos_20190701.qcow2
The following information will help you to set up and use the virtual UNICORE cluster. This guide has been tested on CentOS 7 Linux with Terraform version 0.12.10.
In order to use the sources you need to download or clone this Git repository to your local machine:
git clone https://github.com/MaximilianHanussek/virtual_cluster_local_ips.git
You can also download it as a ZIP archive from the website of the repository or via wget:
wget https://github.com/MaximilianHanussek/virtual_cluster_local_ips/archive/master.zip
You will find it as master.zip.
Before we modify the required Terraform variables for your OpenStack environment, you need to source your OpenStack credentials as environment variables and initialize Terraform. You can source your OpenStack credentials by downloading a so-called openrc file from the OpenStack dashboard (also known as Horizon) to your local machine. After you have done that, source it with the following command:
source /path/to/rc/file
Normally you will be asked for your password. Enter it and confirm with Enter. You will get no response, but if you have the OpenStack client installed you can check that everything worked by running the following command:
openstack image list
After that you should see a list of images that are available for your project.
Next we need to initialize Terraform. To do so, change into the terraform
directory of the downloaded Git repository and run
terraform init
If everything worked out you should see output similar to the one below:
Initializing provider plugins...

The following providers do not have any version constraints in configuration, so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking changes, it is recommended to add version = "..." constraints to the corresponding provider blocks in configuration, with the constraint strings suggested below.

* provider.openstack: version = "~> 1.19"
* provider.tls: version = "~> 2.0"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see any changes that are required for your infrastructure. All Terraform commands should now work.

If you ever set or change modules or backend configuration for Terraform, re-run this command to reinitialize your working directory. If you forget, other commands will detect it and remind you to do so if necessary.
In order to start the virtual cluster there are a few variables you have to set on your own.
Change into the terraform directory, if not already done, and open the vars.tf file. You will find a number of defined variables; a comprehensive list can be found in the table below. The ones you definitely need to touch are marked with yes (required). The ones you can change but do not have to are marked with yes (not required). The ones marked with yes (poss. required) need to be changed if you are running VALET on a non-de.NBI cloud site, or even on a de.NBI cloud site other than Tübingen, as these values and names only exist in these cloud environments. Variables you are not allowed to change are marked with no. If you change one of the no-tagged variables it could or will break the configuration process.
- beeond_disc_size: Sets the Cinder volume size of the volumes attached to the master node and the two compute nodes. The shared file system will have the chosen size in gigabytes times three, one volume for every participating node, so for 10GB it will be 30GB in total. Set the size according to your needs and available resources.
- beeond_storage_backend: Sets the name of the storage backend for the Cinder volumes; choose the appropriate one for your cloud site.
- flavors: Sets the used compute resources (CPUs, RAM, ...). Recommended for the master node are 8 CPUs and at least 16GB RAM.
- compute_node_count: Sets the number of compute nodes (the current configuration works only with two).
- image_master: Sets the image to be used for the master node. It will be downloaded automatically.
- image_compute: Sets the image to be used for the compute nodes. It will be downloaded automatically.
- openstack_key_name: Sets the SSH key name of your OpenStack environment (the keypair is required to be set up already).
- private_key_path: Sets the path to your private key in order to access the VMs and run the configuration scripts.
- name_prefix: Sets a prefix for the names of the started VMs.
- security_groups: Sets the names of the security groups; the groups themselves are created during setup (they do not need to exist beforehand).
- network: Sets the network to be used.
Variable | Default value | Unit | Change |
---|---|---|---|
beeond_disc_size | 10 | Gigabytes | yes (not required) |
beeond_storage_backend | quobyte_hdd | - | yes (poss. required) |
flavors | de.NBI small disc | 8 CPUs, 16GB RAM | yes (poss. required) |
compute_node_count | 2 | Instances | no |
image_master | unicore_master_centos | - | no |
image_compute | unicore_compute_centos | - | no |
openstack_key_name | test | - | yes (required) |
private_key_path | /path/to/private/key | - | yes (required) |
name_prefix | unicore- | - | no |
security_groups | virtual-unicore-cluster-public | - | no |
network | denbi_uni_tuebingen_external | - | yes (poss. required) |
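As an alternative to editing vars.tf directly, Terraform can also pick up values for declared variables from TF_VAR_-prefixed environment variables; a minimal sketch for the two required variables (the key name and path below are placeholders for your own values):
export TF_VAR_openstack_key_name="my-openstack-keypair"          # name of an existing keypair in your OpenStack project
export TF_VAR_private_key_path="/home/user/.ssh/my-keypair.pem"  # private key matching that keypair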
After the Terraform variables are set up correctly we can go on and start the configuration process.
In order to do this, change into the terraform
directory of the Git repository and first run a dry run with
terraform plan
Terraform will now inform you what it will do and check whether the syntax of the Terraform files (.tf) is correct.
If an error occurs, please follow the hints from Terraform and make sure that you have sourced your openrc credentials file and initialized the Terraform plugins with terraform init.
If everything looks reasonable we can start the real action by executing
terraform apply
This command will first set up the required volumes, then the security group. Afterwards the required images will be downloaded and imported into the OpenStack environment, which can take some time depending on the network connection (compute image: 1.93GB, master image: 4.40GB). The next step will fire up the VMs and also attach the Cinder volumes. A subsequent script will mount the volumes, create one-time SSH keys and distribute them on the different VMs so they can talk to each other without using your general private key, for obvious security reasons. Finally the shared file system based on BeeOND is started, then the TORQUE cluster and at the end the UNICORE components. On top of that the Zabbix monitoring system is set up. All this will take around 5-10 minutes. In the end you will have a fully set up UNICORE cluster that you can access as explained in Chapter 5. Of course you can also use just the usual TORQUE batch system without UNICORE and submit jobs to a queue directly.
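Once the run has finished, you can cross-check the created resources with the OpenStack client; a quick sanity check, assuming the default unicore- name prefix and that your openrc file is still sourced:
openstack server list | grep unicore-   # the master and the two compute nodes should show up as ACTIVE
openstack volume list                   # the three Cinder volumes for BeeOND should be listed as in-use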
The Zabbix web interface that has been set up can be reached under the following URL, replacing the example IP (42.42.42.42) with the public IP of your created master node: http://42.42.42.42/zabbix
The preset login credentials are: Username: admin, Password: zabbix
If you are just using the initial cluster without adding and removing nodes you can also change the password. If you want to use the add and remove procedures, please do not change the credentials, as they are required for the Zabbix API access in order to remove nodes from Zabbix.
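If you want to verify that the Zabbix API is reachable with these credentials (this is what the add and remove node scripts rely on), a quick call against the JSON-RPC endpoint is one way to do it; a sketch, assuming the default endpoint path and a Zabbix version whose login method still uses the user field:
curl -s -X POST -H 'Content-Type: application/json-rpc' -d '{"jsonrpc":"2.0","method":"user.login","params":{"user":"admin","password":"zabbix"},"id":1}' http://42.42.42.42/zabbix/api_jsonrpc.php
A successful login returns a JSON object with an authentication token in the result field.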
There are different ways to access the UNICORE cluster. One possibility is to use the UNICORE Commandline Client (UCC), which can be downloaded here. The second possibility is to use the UNICORE Rich Client (URC), which you can download here. In these instructions we will focus on the second possibility as it is the more convenient one.
In order to use the URC follow the steps below:
- Download the URC to your local computer (the same one from which you started the cluster)
- Unpack it and start the application
- It will ask you for credentials; we will use the demo credentials, as this is also the user who is already present in the UNICORE user database. Please also check the option to save the password (which is 321, in case you forget it).
- Afterwards go to the Workbench and add the new Registry by right-clicking into the window titled Grid Browser and choosing Add Registry. You can freely choose a name; afterwards replace localhost with the IP of your master node. You can find this information in the OpenStack dashboard (Horizon) or in Terraform. The rest of the URL needs to stay the same. Here is an example:
https://42.42.42.42:8080/REGISTRY/services/Registry?res=default_registry
Now you can start a small test run by submitting a script to the UNICORE cluster, for example via the workflow system that has also been configured. For this purpose create a new workflow project, add a script (v2.2) to the workflow, connect it with the green play button and enter, for example, the following commands in the script:
whoami
uname -r
date
Click on the play button, choose the available workflow engine and click on finish. You will see the workflow running in the Grid Browser window if you unfold the name of the Registry you have chosen, the Workflow engine and the Workflows icon. The output is accessible in the folder working directory of ...
For more complex workflows and further explanations of UNICORE we refer to the official documentation, which you can find here.
It might happen that the initial cluster resources are not sufficient for the applied workload and more nodes could solve the problem faster, or you need smaller or larger nodes for different kinds of workloads. For this case we provide a mechanism that will automatically start a new node (via Terraform), add the new node to the already existing BeeOND file system, make it available as a resource for the batch system (TORQUE) and for UNICORE, and also make Zabbix aware of the newly available resources.
In order to add a new node you only have to go into the root directory of the repository, where you find the script start_up_new_node. This wrapper script takes care of all the tasks briefly explained above. The only thing you need to do is to pass the path to your OpenStack rc file and enter the corresponding password if you are asked for it.
sh start_up_new_node /path/to/rc/file
After some minutes you will have a new node added to your existing cluster. The new node is also added to the resources of the initial cluster, meaning Terraform keeps tracking the whole cluster rather than the initial cluster and the added nodes separately. This implementation allows you to destroy the whole cluster without having to keep track of added and removed nodes yourself.
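When you no longer need the cluster, all resources tracked in this Terraform state (including nodes added with start_up_new_node) can be removed in one step with the standard Terraform command; run it from the terraform directory with your openrc file sourced:
terraform destroy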
In some cases it can be necessary to resize the shared file system used by the virtual cluster. That can be the case if the necessary total size was initially estimated too small or the workload has changed. It is possible to do this by following the subsequent steps. This guide is focused on an OpenStack cloud environment; if you are using another environment, have a look at how to resize volumes in that environment.
Starting situation: you have a cluster with one master node and two compute nodes, each with a Cinder volume with a capacity of 100GB (300GB in total). Now you want to expand the storage capacity to 1TB per volume (3TB in total).

1. First stop all your jobs and any other tools using the shared file system.

2. Log in to the master node and stop the whole shared file system without deleting the corresponding data:

beeond stop -i /home/centos/beeond.statusfile -n /home/centos/beeond_nodefile -L -a /home/centos/.ssh/connection_key.pem -z centos

3. After the file system has been stopped correctly, unmount every single volume used for the shared file system (in this case 3 volumes). Start on the master node by running the following command, and afterwards do the same on all other compute nodes (in this case 2 times):

sudo umount /mnt

4. After the volumes are unmounted you can detach them over the OpenStack dashboard.

5. Choose the resize option in the volume section of the OpenStack dashboard and select the desired size.

6. Attach all volumes to the same VMs they were attached to before.

7. Now that the resized volumes are attached again, the VMs need to be made aware of the new size. To do this, mount the volumes on all nodes belonging to the cluster (master and compute, so 3 times here) with the following command. You can check beforehand with lsblk whether the volume has been attached to the same device path; if it is not attached to /dev/vdb, adjust the mount command accordingly.

sudo mount /dev/vdb /mnt

8. Once the volumes are mounted again you can grow the file system on each volume (so 3 times in this example) by running the command below. Depending on the size of the volumes this can take some time, so please be patient.

sudo xfs_growfs -d /mnt

9. In the final step, start the shared file system again by running the following command:

beeond start -i /home/centos/beeond.statusfile -n /home/centos/beeond_nodefile -d /mnt/ -c /beeond/ -a /home/centos/.ssh/connection_key.pem -z centos
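To verify that the shared file system reports the new capacity after the restart, you can check the BeeOND mount point (mounted at /beeond according to the start command above), for example on the master node:
df -h /beeond    # should now show roughly three times the new per-volume size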
In case you want to free some resources and downsize your current cluster, we also provide a removal procedure. Please change into the root directory of the repository and run the following script:
sh stop_node /path/to/rc/file
The node added last will be chosen to be removed from the cluster. First, no new jobs are allowed to be scheduled on the node marked for removal. After all jobs currently running on this node are finished, the node is removed from TORQUE. In the next step the node is removed from the BeeOND shared file system: first, no new data is written to the volume of this node, then all the data distributed on this node is migrated to the other nodes (if possible, i.e. if enough capacity is left). In the next step the removed node is deleted from the Zabbix environment. Afterwards the node is deleted from the host file on the master node and is therefore completely decoupled. As a final step the resources available to UNICORE are updated, and at the end the VM and its attached Cinder volume are destroyed. Please enter the corresponding rc file password if you are asked for it.
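If you want to follow the drain and removal, one simple way is to watch the TORQUE node list on the master node; a sketch, assuming you can SSH in as the centos user with your private key (replace the IP with the public IP of your master node):
ssh -i /path/to/private/key centos@42.42.42.42 pbsnodes -a    # the node being removed typically shows up as offline before it disappears from the list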