dependency on provisioner node #436

Closed
mkudlej opened this issue Sep 27, 2017 · 9 comments

@mkudlej

mkudlej commented Sep 27, 2017

Gluster volume monitoring currently depends on the provisioner node. What should the user do to eliminate the single point of failure on the provisioner node? Should the user set up another provisioner node?

@Tendrl/tendrl-core @mbukatov

@mbukatov
Contributor

Based on Tendrl/tendrl-ansible#27, tendrl-ansible no longer sets the provisioner flag for one of the GlusterFS nodes since Tendrl/tendrl-ansible@dfbf5a5, so this is completely in the hands of Tendrl itself.

I would say this implies that moving the provisioner role to another machine when the original one is down should be handled by Tendrl itself as well.

@mkudlej Btw, does Tendrl tell you in the UI which machine is the provisioner?

@mkudlej
Author

mkudlej commented Sep 27, 2017

@mbukatov There is no info about whether a machine is the provisioner or not, so I expect Tendrl to deal with provisioner failures.

@r0h4n r0h4n self-assigned this Sep 28, 2017
@r0h4n
Contributor

r0h4n commented Oct 10, 2017

This is in progress.

Also, please file an issue against the UI so that the "provisioner" tag is shown in the node list in the UI.

@r0h4n
Contributor

r0h4n commented Oct 12, 2017

This is done; please verify the scenario.

@mbukatov
Contributor

@mkudlej @r0h4n I don't see the "provisioner" tag shown in the node list in the UI.

Screenshot for tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch:

[screenshot: screenshot_20171031_195031]

@mbukatov
Contributor

mbukatov commented Oct 31, 2017

Checking with:

# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-api-httpd-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-commons-1.5.3-20171031T113910.d08af41.noarch
tendrl-grafana-plugins-1.5.3-20171030T165855.897ef42.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171030T165855.897ef42.noarch
tendrl-node-agent-1.5.3-20171031T062609.0ad4284.noarch
tendrl-notifier-1.5.3-20171030T164233.702f1a5.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch

Based on this gist, I have identified the provisioner node:

[root@mbukatov-usm1-server ~]# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/tags
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
[root@mbukatov-usm1-server ~]# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/fqdn
mbukatov-usm1-gl2.example.com

And then shut it down.

While the UI reports it as down:

[screenshot: screenshot_20171031_200635]

the machine still has the provisioner tag:

# ping mbukatov-usm1-gl2.example.com
PING mbukatov-usm1-gl2.example.com (10.37.169.60) 56(84) bytes of data.
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=1 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=2 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=3 Destination Host Unreachable
From mbukatov-usm1-server.example.com (10.37.169.90) icmp_seq=4 Destination Host Unreachable
^C
--- mbukatov-usm1-gl2.example.com ping statistics ---
7 packets transmitted, 0 received, +4 errors, 100% packet loss, time 6002ms
pipe 4
# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b/NodeContext/tags
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]

That said, when I loop over all nodes, I see that one new machine has been assigned the provisioner role, so there are now 2 nodes with this role (one shut down, one running):

# for i in /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b /nodes/ecb55f9c-debe-4a23-b822-1276bd790f58 /nodes/da3f1ea7-08d7-459d-9b76-62fd73fa0776 /nodes/76bd7e60-deca-4448-a74f-ecf0831aaa00 /nodes/c466a391-d607-43ea-aff2-43f3e2629326; do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "tendrl/node_da3f1ea7-08d7-459d-9b76-62fd73fa0776", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]

Now when I start the gl2 machine (the original provisioner) again, I see:

# ping mbukatov-usm1-gl2.example.com
PING mbukatov-usm1-gl2.example.com (10.37.169.60) 56(84) bytes of data.
64 bytes from mbukatov-usm1-gl2.example.com (10.37.169.60): icmp_seq=1 ttl=64 time=0.936 ms
64 bytes from mbukatov-usm1-gl2.example.com (10.37.169.60): icmp_seq=2 ttl=64 time=0.372 ms
^C
--- mbukatov-usm1-gl2.example.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.372/0.654/0.936/0.282 ms
# for i in /nodes/5d56772c-9c6e-4940-b9ac-369ac9cf403b /nodes/ecb55f9c-debe-4a23-b822-1276bd790f58 /nodes/da3f1ea7-08d7-459d-9b76-62fd73fa0776 /nodes/76bd7e60-deca-4448-a74f-ecf0831aaa00 /nodes/c466a391-d607-43ea-aff2-43f3e2629326; do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_5d56772c-9c6e-4940-b9ac-369ac9cf403b", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/34f86e55-935e-46ac-9780-b60139cd399d", "provisioner/34f86e55-935e-46ac-9780-b60139cd399d", "tendrl/node_ecb55f9c-debe-4a23-b822-1276bd790f58", "tendrl/node"]

So now we have 2 running provisioner nodes. Wouldn't that be a problem?

r0h4n added 3 commits to Tendrl/node-agent that referenced this issue Nov 1, 2017
@r0h4n
Contributor

r0h4n commented Nov 1, 2017

Please verify now. I have made "tendrl/monitor" take responsibility for reclaiming any old provisioner tags.
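
Roughly, the idea is that the node tagged "tendrl/monitor" checks whether the node currently holding the provisioner/<cluster-id> tag is still up and, if it is not, drops the tag so that a running node can claim it. The sketch below is only an illustration of that check in etcdctl terms, not the actual node-agent/monitor code; it assumes a /nodes/<id>/NodeContext/status key reporting UP/DOWN (an assumption, not confirmed in this thread) and requires jq:

ETCD="etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379"
for node in $($ETCD ls /nodes); do
    tags=$($ETCD get $node/NodeContext/tags)
    # assumption: node liveness is exposed under NodeContext/status as UP/DOWN
    status=$($ETCD get $node/NodeContext/status 2>/dev/null)
    if [ "$status" != "UP" ] && echo "$tags" | grep -q '"provisioner/'; then
        # drop the stale provisioner/<cluster-id> entry from the tag list
        $ETCD set $node/NodeContext/tags \
            "$(echo "$tags" | jq -c 'map(select(startswith("provisioner/") | not))')"
    fi
done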

@mbukatov
Contributor

mbukatov commented Nov 1, 2017

On a new cluster instance, I see that when I import the cluster, there are already 2 provisioners (and I haven't powered down any node yet):

# for i in $(etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 ls /nodes); do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags $i; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node_bb268f4c-6cc1-4e81-9c8a-e0955a8a91ab", "tendrl/node"]
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/node_7973f1b9-6471-4947-ba86-93e74198de53", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node"]

I made a mistake and didn't check for this scenario during #436 (comment), so I'm not sure whether this is new behavior or not.

Is this expected behavior?

I'm using:

# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-api-httpd-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-commons-1.5.3-20171101T103313.c987736.noarch
tendrl-grafana-plugins-1.5.3-20171101T130858.f752f23.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171101T130858.f752f23.noarch
tendrl-node-agent-1.5.3-20171101T112542.0d676e6.noarch
tendrl-notifier-1.5.3-20171030T164233.702f1a5.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171031T125959.2e0f6d8.noarch

@mbukatov
Contributor

mbukatov commented Nov 2, 2017

After checking the status as described in #436 (comment), I powered down the 2 machines with the provisioner tag. The next morning, I rechecked which nodes are tagged as provisioner (so Tendrl had a few hours to adapt):

# for i in $(etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 ls /nodes); do etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get $i/NodeContext/tags $i; echo $i; done | grep provisioner
["detected_cluster/f04f1852e450992d62eece2fbd851389d593dfb9b9114279204703a67b37a2ee", "tendrl/integration/gluster", "gluster/server", "tendrl/node_afa47dda-7cef-4d39-a7e3-be11c69bd39f", "tendrl/integration/752af1aa-5257-4d5f-8758-f916234ebc77", "provisioner/752af1aa-5257-4d5f-8758-f916234ebc77", "tendrl/node"]

Translating the fqdn of the new provisioner:

# etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /nodes/afa47dda-7cef-4d39-a7e3-be11c69bd39f/NodeContext/fqdn
mbukatov-usm1-gl4.example.com

To sum it up: the 2 nodes I turned off no longer have the provisioner tag, and another node, which is still running, was labeled as provisioner instead. Now I have only a single provisioner node, which is the expected behavior.

The remaining question is: why did I start with 2 provisioner nodes, as shown in #436 (comment)?

@r0h4n r0h4n closed this as completed Mar 21, 2018