dependency on provisioner node #436

Closed
mkudlej opened this issue Sep 27, 2017 · 9 comments

@mkudlej

mkudlej commented Sep 27, 2017

Gluster volume monitoring currently depends on the provisioner node. What should the user do to eliminate the single point of failure on the provisioner node? Should the user set up another provisioner node?

@Tendrl/tendrl-core @mbukatov

@mbukatov
Contributor

Based on Tendrl/tendrl-ansible#27, tendrl-ansible no longer sets the provisioner flag for one of the GlusterFS nodes since Tendrl/tendrl-ansible@dfbf5a5, so this is completely in the hands of Tendrl itself.

I would say this implies that moving the provisioner role to another machine when the original one is down should be handled by Tendrl itself as well.

@mkudlej Btw, does Tendrl tell you in the UI which machine is the provisioner?

@mkudlej
Author

mkudlej commented Sep 27, 2017

@mbukatov There is no info about whether a machine is the provisioner or not, so I expect Tendrl to deal with provisioner failures.

@r0h4n r0h4n self-assigned this Sep 28, 2017
@r0h4n
Contributor

r0h4n commented Oct 10, 2017

This is in progress.

Also, please file an issue against the UI so that the "provisioner" tag is shown in the node list in the UI.

@r0h4n
Contributor

r0h4n commented Oct 12, 2017

This is done; please verify the scenario.

@mbukatov
Contributor

@mkudlej @r0h4n I don't see the "provisioner" tag shown in the node list in the UI.

Screenshot for tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch:

[screenshot: screenshot_20171031_195031]

@mbukatov
Contributor

mbukatov commented Oct 31, 2017

Checking with:

# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-api-httpd-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-commons-1.5.3-20171031T113910.d08af41.noarch
tendrl-grafana-plugins-1.5.3-20171030T165855.897ef42.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171030T165855.897ef42.noarch
tendrl-node-agent-1.5.3-20171031T062609.0ad4284.noarch
tendrl-notifier-1.5.3-20171030T164233.702f1a5.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch

Based on this gist, I have identified the provisioner node:

[root@mbukatov-usm1-server ~]# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/tags
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
[root@mbukatov-usm1-server ~]# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/fqdn
mbukatov-usm1-gl2.example.com

And then shut it down.

While the UI reports it as down:

[screenshot: screenshot_20171031_200635]

the machine still has the provisioner tag:

# ping mbukatov-usm1-gl2.example.com
PING mbukatov-usm1-gl2.example.com (10.37.169.60) 56(84) bytes of data.
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=1 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=2 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=3 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=4 Destination Host Unreachable
^C
--- mbukatov-usm1-gl2.example.com ping statistics ---
7 packets transmitted, 0 received, +4 errors, 100% packet loss, time 6002ms
pipe 4
# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/tags
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]

That said, when I loop over all nodes, I see that one new machine has been assigned the provisioner role, so there are now 2 nodes with this role (one shut down, one running):

# for i in /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b /nodes/ecb55f9c-debe-4a23-b822-1276bd790f58 /nodes/da3f1ea7-08d7-459d-9b76-62fd73fa0776 /nodes/76bd7e60-deca-4448-a74f-ecf0831aaa00 /nodes/c466a391-d607-43ea-aff2-43f3e2629326; do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "tendrl/node_da3f1ea7-08d7-459d-9b76-62fd73fa0776", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]

Now when I start the gl2 machine (the original provisioner) again, I see:

# ping mbukatov-usm1-gl2.example.com
PING mbukatov-usm1-gl2.example.com (10.37.169.60) 56(84) bytes of data.
64 bytes from mbukatov-usm1-gl2.example.com (10.37.169.60): icmp_seq=1 ttl=64 time=0.936 ms
64 bytes from mbukatov-usm1-gl2.example.com (10.37.169.60): icmp_seq=2 ttl=64 time=0.372 ms
^C
--- mbukatov-usm1-gl2.example.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.372/0.654/0.936/0.282 ms
# for i in /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b /nodes/ecb55f9c-debe-4a23-b822-1276bd790f58 /nodes/da3f1ea7-08d7-459d-9b76-62fd73fa0776 /nodes/76bd7e60-deca-4448-a74f-ecf0831aaa00 /nodes/c466a391-d607-43ea-aff2-43f3e2629326; do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_ecb55f9c-debe-4a23-b822-1276bd790f58", "tendrl/node"]

So now we have 2 running provisioner nodes. Wouldn't that be a problem?

r0h4n added 3 commits to Tendrl/node-agent that referenced this issue Nov 1, 2017
@r0h4n
Contributor

r0h4n commented Nov 1, 2017

Please verify now. I have made "tendrl/monitor" take responsibility for reclaiming any old provisioner tags.
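
Roughly, the idea is that the node tagged "tendrl/monitor" checks whether the node currently holding the provisioner/<cluster-id> tag is still up and, if it is not, drops the tag so that a running node can claim it. The sketch below is only an illustration of that check in etcdctl terms, not the actual node-agent/monitor code; it assumes a /nodes/<id>/NodeContext/status key reporting UP/DOWN (an assumption, not confirmed in this thread) and requires jq:

ETCD="etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379"
for node in $($ETCD ls /nodes); do
    tags=$($ETCD get $node/NodeContext/tags)
    # assumption: node liveness is exposed under NodeContext/status as UP/DOWN
    status=$($ETCD get $node/NodeContext/status 2>/dev/null)
    if [ "$status" != "UP" ] && echo "$tags" | grep -q '"provisioner/'; then
        # drop the stale provisioner/<cluster-id> entry from the tag list
        $ETCD set $node/NodeContext/tags \
            "$(echo "$tags" | jq -c 'map(select(startswith("provisioner/") | not))')"
    fi
done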

@mbukatov
Contributor

mbukatov commented Nov 1, 2017

On a new cluster instance, I see that when I import the cluster, there are already 2 provisioners (and I haven't powered down any node yet):

# for i in $(etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 ls /nodes); do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags $i; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node_bb268f4c-6cc1-4e81-9c8a-e0955a8a91ab", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/node_7973f1b9-6471-4947-ba86-93e74198de53", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node"]

I made a mistake and didn't check for this scenario during #436 (comment), so I'm not sure whether this is new behavior or not.

Is this expected behavior?

I'm using:

# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-api-httpd-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-commons-1.5.3-20171101T103313.c987736.noarch
tendrl-grafana-plugins-1.5.3-20171101T130858.f752f23.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171101T130858.f752f23.noarch
tendrl-node-agent-1.5.3-20171101T112542.0d676e6.noarch
tendrl-notifier-1.5.3-20171030T164233.702f1a5.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch

@mbukatov
Contributor

mbukatov commented Nov 2, 2017

After checking the status as described in #436 (comment), I powered down the 2 machines with the provisioner tag. The next morning, I rechecked which nodes are tagged as provisioner (so Tendrl had a few hours to adapt):

# for i in $(etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 ls /nodes); do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags $i; echo $i; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/node_afa47dda-7cef-4d39-a7e3-be11c69bd39f", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node"]

Translating the fqdn of the new provisioner:

# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/afa47dda-7cef-4d39-a7e3-be11c69bd39f/NodeContext/fqdn
mbukatov-usm1-gl4.example.com

To sum it up: the 2 nodes I turned off no longer have the provisioner tag, and another node, which is still running, was labeled as provisioner instead. Now I have only a single provisioner node, which is the expected behavior.

The remaining question is: why did I start with 2 provisioner nodes, as shown in #436 (comment)?

@r0h4n r0h4n closed this as completed Mar 21, 2018