Creating a virtual cluster on demand in an OpenStack environment including a UNICORE instance.
The following software and tools are used to set up the virtual UNICORE cluster:
- Terraform (Infrastructure as Code)
- BeeGFS/BeeOND (Shared File System)
- TORQUE (Batch System)
- UNICORE Server (Middleware)
- UNICORE workflow system (Workflow System)
- Zabbix (Monitoring System)
In order to set up VALET you need to fulfill the following prerequisites:
- You need access to an OpenStack-driven cloud (for example the de.NBI cloud)
- Further you need access to the OpenStack API and permissions to upload images
- An openrc file with the correct credentials needs to be available (it can be downloaded from the OpenStack dashboard, Horizon)
- Installed version of Terraform (tested with v0.12.10)
- Access to remote resources (internet)
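Before you start, you can quickly verify that the tooling prerequisites are in place; a minimal check (the OpenStack CLI is only needed if you want to inspect your project from the command line):
terraform version        # should report a 0.12.x release, e.g. v0.12.10
openstack --version      # prints the version of the installed OpenStack client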
This section lists the most up-to-date and tested images for the master and compute nodes. If you want to use older images for some reason, you will need to change the names in the Terraform vars.tf file.
- master image : unicore_master_centos_20190712.qcow2
- compute image : unicore_compute_centos_20190719.qcow2
- master image : unicore_master_centos_20190702.qcow2
- compute image : unicore_compute_centos_20190701.qcow2
The following information will help you to set up and use the virtual UNICORE cluster. This guide has been tested on CentOS 7 Linux with Terraform version 0.12.10.
In order to use the sources you need to download or clone this Git repository to your local machine:
git clone https://github.com/MaximilianHanussek/virtual_cluster_local_ips.git
You can also download it as a ZIP archive from the website of the repository or via wget:
wget https://github.com/MaximilianHanussek/virtual_cluster_local_ips/archive/master.zip
You will find it as master.zip.
Before we modify the required Terraform variables for your OpenStack environment, you need to source your OpenStack credentials as environment variables and initialize Terraform. You can source your OpenStack credentials by downloading a so-called openrc file from the OpenStack dashboard (also known as Horizon) to your local machine. After you have done that, source it with the following command:
source /path/to/rc/file
Normally you will be asked for your password. Enter it and confirm with Enter. You will get no response, but if you have the OpenStack client installed you can check that everything worked by running the following command:
openstack image list
After that you should see a list of images that are available for your project.
Next we need to initialize Terraform. To do so, change into the terraform
directory of the downloaded Git repository and run
terraform init
If everything worked out you should see output similar to the one below:
Initializing provider plugins...

The following providers do not have any version constraints in configuration, so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking changes, it is recommended to add version = "..." constraints to the corresponding provider blocks in configuration, with the constraint strings suggested below.

* provider.openstack: version = "~> 1.19"
* provider.tls: version = "~> 2.0"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see any changes that are required for your infrastructure. All Terraform commands should now work.

If you ever set or change modules or backend configuration for Terraform, re-run this command to reinitialize your working directory. If you forget, other commands will detect it and remind you to do so if necessary.
In order to start the virtual cluster there are a few variables you have to set on your own.
Change into the terraform directory, if not already done, and open the vars.tf file. You will find a number of defined variables; a comprehensive list can be found in the table below. The ones you definitely need to touch are marked with yes (required). The ones you can change but do not have to are marked with yes (not required). The ones marked with yes (poss. required) need to be changed if you are running VALET on a non-de.NBI cloud site, or even on a de.NBI cloud site other than Tübingen, as these values and names only exist in these cloud environments. Variables you are not allowed to change are marked with no. If you change one of the no-tagged variables it could or will break the configuration process.
- beeond_disc_size: Sets the Cinder volume size of the volumes attached to the master node and the two compute nodes. The shared file system will have the chosen size in gigabytes times three, one volume for every participating node, so for 10GB it will be 30GB in total. Set the size according to your needs and available resources.
- beeond_storage_backend: Sets the name of the storage backend for the Cinder volumes; choose the appropriate one for your cloud site.
- flavors: Sets the used compute resources (CPUs, RAM, ...). Recommended for the master node are 8 CPUs and at least 16GB RAM.
- compute_node_count: Sets the number of compute nodes (the current configuration works only with two).
- image_master: Sets the image to be used for the master node. It will be downloaded automatically.
- image_compute: Sets the image to be used for the compute nodes. It will be downloaded automatically.
- openstack_key_name: Sets the SSH key name of your OpenStack environment (the keypair is required to be set up already).
- private_key_path: Sets the path to your private key in order to access the VMs and run the configuration scripts.
- name_prefix: Sets a prefix for the names of the started VMs.
- security_groups: Sets the names of the security groups; the groups themselves are created during setup (they do not need to exist beforehand).
- network: Sets the network to be used.
Variable | Default value | Unit | Change |
---|---|---|---|
beeond_disc_size | 10 | Gigabytes | yes (not required) |
beeond_storage_backend | quobyte_hdd | - | yes (poss. required) |
flavors | de.NBI small disc | 8 CPUs, 16GB RAM | yes (poss. required) |
compute_node_count | 2 | Instances | no |
image_master | unicore_master_centos | - | no |
image_compute | unicore_compute_centos | - | no |
openstack_key_name | test | - | yes (required) |
private_key_path | /path/to/private/key | - | yes (required) |
name_prefix | unicore- | - | no |
security_groups | virtual-unicore-cluster-public | - | no |
network | denbi_uni_tuebingen_external | - | yes (poss. required) |
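As an alternative to editing vars.tf directly, Terraform can also pick up values for declared variables from TF_VAR_-prefixed environment variables; a minimal sketch for the two required variables (the key name and path below are placeholders for your own values):
export TF_VAR_openstack_key_name="my-openstack-keypair"          # name of an existing keypair in your OpenStack project
export TF_VAR_private_key_path="/home/user/.ssh/my-keypair.pem"  # private key matching that keypair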
After the Terraform variables are set up correctly we can go on and start the configuration process.
In order to do this, change into the terraform
directory of the Git repository and first run a dry run with
terraform plan
Terraform will now inform you what it will do and check whether the syntax of the Terraform files (.tf) is correct.
If an error occurs, please follow the hints from Terraform and make sure that you have sourced your openrc credentials file and initialized the Terraform plugins with terraform init.
If everything looks reasonable we can start the real action by executing
terraform apply
This command will first set up the required volumes, then the security group. Afterwards the required images will be downloaded and imported into the OpenStack environment, which can take some time depending on the network connection (compute image: 1.93GB, master image: 4.40GB). The next step will fire up the VMs and also attach the Cinder volumes. A subsequent script will mount the volumes, create one-time SSH keys and distribute them on the different VMs so they can talk to each other without using your general private key, for obvious security reasons. Finally the shared file system based on BeeOND is started, then the TORQUE cluster and at the end the UNICORE components. On top of that the Zabbix monitoring system is set up. All this will take around 5-10 minutes. In the end you will have a fully set up UNICORE cluster that you can access as explained in Chapter 5. Of course you can also use just the usual TORQUE batch system without UNICORE and submit jobs to a queue directly.
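Once the run has finished, you can cross-check the created resources with the OpenStack client; a quick sanity check, assuming the default unicore- name prefix and that your openrc file is still sourced:
openstack server list | grep unicore-   # the master and the two compute nodes should show up as ACTIVE
openstack volume list                   # the three Cinder volumes for BeeOND should be listed as in-use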
The Zabbix web interface that has been set up can be reached under the following URL, replacing the example IP (42.42.42.42) with the public IP of your created master node: http://42.42.42.42/zabbix
The preset login credentials are: Username: admin, Password: zabbix
If you are just using the initial cluster without adding and removing nodes you can also change the password. If you want to use the add and remove procedures, please do not change the credentials, as they are required for the Zabbix API access in order to remove nodes from Zabbix.
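If you want to verify that the Zabbix API is reachable with these credentials (this is what the add and remove node scripts rely on), a quick call against the JSON-RPC endpoint is one way to do it; a sketch, assuming the default endpoint path and a Zabbix version whose login method still uses the user field:
curl -s -X POST -H 'Content-Type: application/json-rpc' -d '{"jsonrpc":"2.0","method":"user.login","params":{"user":"admin","password":"zabbix"},"id":1}' http://42.42.42.42/zabbix/api_jsonrpc.php
A successful login returns a JSON object with an authentication token in the result field.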
There are different ways to access the UNICORE cluster. One possibility is to use the UNICORE Commandline Client (UCC), which can be downloaded here. The second possibility is to use the UNICORE Rich Client (URC), which you can download here. In these instructions we will focus on the second possibility as it is the more convenient one.
In order to use the URC follow the steps below:
- Download the URC to your local computer (the same one from which you started the cluster)
- Unpack it and start the application
- It will ask you for credentials; we will use the demo credentials, as this is also the user who is already present in the UNICORE user database. Please also check the option to save the password (which is 321, in case you forget it).
- Afterwards go to the Workbench and add the new Registry by right-clicking into the window titled Grid Browser and choosing Add Registry. You can freely choose a name; afterwards replace localhost with the IP of your master node. You can find this information in the OpenStack dashboard (Horizon) or in Terraform. The rest of the URL needs to stay the same. Here is an example:
https://42.42.42.42:8080/REGISTRY/services/Registry?res=default_registry
Now you can start a small test run by submitting a script to the UNICORE cluster, for example via the workflow system that has also been configured. For this purpose create a new workflow project, add a script (v2.2) to the workflow, connect it with the green play button and enter, for example, the following commands in the script:
whoami
uname -r
date
Click on the play button, choose the available workflow engine and click on finish. You will see the workflow running in the Grid Browser window if you unfold the name of the Registry you have chosen, the Workflow engine and the Workflows icon. The output is accessible in the folder working directory of ...
For more complex workflows and further explanations of UNICORE we refer to the official documentation, which you can find here.
It might happen that the initial cluster resources are not sufficient for the applied workload and more nodes could solve the problem faster, or you need smaller or larger nodes for different kinds of workloads. For this case we provide a mechanism that will automatically start a new node (via Terraform), add the new node to the already existing BeeOND file system, make it available as a resource for the batch system (TORQUE) and for UNICORE, and also make Zabbix aware of the newly available resources.
In order to add a new node you only have to go into the root directory of the repository, where you find the script start_up_new_node. This wrapper script takes care of all the tasks briefly explained above. The only thing you need to do is to pass the path to your OpenStack rc file and enter the corresponding password if you are asked for it.
sh start_up_new_node /path/to/rc/file
After some minutes you will have a new node added to your existing cluster. The new node is also added to the resources of the initial cluster, meaning Terraform keeps tracking the whole cluster rather than the initial cluster and the added nodes separately. This implementation allows you to destroy the whole cluster without having to keep track of added and removed nodes yourself.
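When you no longer need the cluster, all resources tracked in this Terraform state (including nodes added with start_up_new_node) can be removed in one step with the standard Terraform command; run it from the terraform directory with your openrc file sourced:
terraform destroy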
In some cases it can be necessary to resize the shared file system used by the virtual cluster. That can be the case if the necessary total size was initially estimated too small or the workload has changed. It is possible to do this by following the subsequent steps. This guide is focused on an OpenStack cloud environment; if you are using another environment, have a look at how to resize volumes in that environment.
Starting situation: you have a cluster with one master node and two compute nodes, each with a Cinder volume with a capacity of 100GB (300GB in total). Now you want to expand the storage capacity to 1TB per volume (3TB in total).

1. First stop all your jobs and any other tools using the shared file system.

2. Log in to the master node and stop the whole shared file system without deleting the corresponding data:

beeond stop -i /home/centos/beeond.statusfile -n /home/centos/beeond_nodefile -L -a /home/centos/.ssh/connection_key.pem -z centos

3. After the file system has been stopped correctly, unmount every single volume used for the shared file system (in this case 3 volumes). Start on the master node by running the following command, and afterwards do the same on all other compute nodes (in this case 2 times):

sudo umount /mnt

4. After the volumes are unmounted you can detach them over the OpenStack dashboard.

5. Choose the resize option in the volume section of the OpenStack dashboard and select the desired size.

6. Attach all volumes to the same VMs they were attached to before.

7. Now that the resized volumes are attached again, the VMs need to be made aware of the new size. To do this, mount the volumes on all nodes belonging to the cluster (master and compute, so 3 times here) with the following command. You can check beforehand with lsblk whether the volume has been attached to the same device path; if it is not attached to /dev/vdb, adjust the mount command accordingly.

sudo mount /dev/vdb /mnt

8. Once the volumes are mounted again you can grow the file system on each volume (so 3 times in this example) by running the command below. Depending on the size of the volumes this can take some time, so please be patient.

sudo xfs_growfs -d /mnt

9. In the final step, start the shared file system again by running the following command:

beeond start -i /home/centos/beeond.statusfile -n /home/centos/beeond_nodefile -d /mnt/ -c /beeond/ -a /home/centos/.ssh/connection_key.pem -z centos
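To verify that the shared file system reports the new capacity after the restart, you can check the BeeOND mount point (mounted at /beeond according to the start command above), for example on the master node:
df -h /beeond    # should now show roughly three times the new per-volume size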
In case you want to free some resources and downsize your current cluster, we also provide a removal procedure. Please change into the root directory of the repository and run the following script:
sh stop_node /path/to/rc/file
The node added last will be chosen to be removed from the cluster. First, no new jobs are allowed to be scheduled on the node marked for removal. After all jobs currently running on this node are finished, the node is removed from TORQUE. In the next step the node is removed from the BeeOND shared file system: first, no new data is written to the volume of this node, then all the data distributed on this node is migrated to the other nodes (if possible, i.e. if enough capacity is left). In the next step the removed node is deleted from the Zabbix environment. Afterwards the node is deleted from the host file on the master node and is therefore completely decoupled. As a final step the resources available to UNICORE are updated, and at the end the VM and its attached Cinder volume are destroyed. Please enter the corresponding rc file password if you are asked for it.
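If you want to follow the drain and removal, one simple way is to watch the TORQUE node list on the master node; a sketch, assuming you can SSH in as the centos user with your private key (replace the IP with the public IP of your master node):
ssh -i /path/to/private/key centos@42.42.42.42 pbsnodes -a    # the node being removed typically shows up as offline before it disappears from the list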