Commit
[PLAT-4772][Platform] Allow stop + remove to wait for data to move off tserver

Summary:
Given a universe where data can be moved off a node, we now wait for tablets to move off a tserver that was first stopped on Platform and then removed. Previously, if a node was stopped on Platform and then removed, we did not wait for the tablets to move off; we simply went ahead and removed the node from the universe. We now try to move tablets off a node whenever possible.

We remove the `isTServer` condition in `UpdatePlacementInfo.java` when blacklisting nodes, because when a node is stopped its `isTServer` value is set to `false`, but we still want to blacklist that node.

Test Plan:
Some things to understand beforehand:
1. When a node's tserver/master is not running, isTserver/isMaster is false.
2. If a node is stopped, the node itself is still alive; only the tserver/master process is not running. Thus `isTserverAliveOnNode` will be false on a stopped node.

Create a GCP universe with 6 nodes and rf3, with the AZs us-west-1, us-west-2, and us-east1. Each test below should start from a clean universe with all nodes live. Perform the following tests:

a) Happy path: stop a node, then immediately remove it from the universe
1. Stop a node in us-west-2.
2. Immediately after the node is stopped, remove the same node from the universe.
3. Go to the master UI at <master-ip>:7000/tablet-servers, on a node that is not currently being removed.
4. Keep refreshing the page. The values under `User Tablet-Peers / Leaders` for the node being removed should slowly decrease until they hit 0 / 0.
5. The node should be removed successfully.

b) Edge case #1: only remove a node from the universe
1. Remove a node in us-west-2 from the universe.
2. Go to the master UI at <master-ip>:7000/tablet-servers, on a node that is not currently being removed.
3. Keep refreshing the page. The values under `User Tablet-Peers / Leaders` for the node being removed should slowly decrease until they hit 0 / 0.
4. The node should be removed successfully.

c) Edge case #2: stop a node, wait until its tablets are moved off, then remove it from the universe
1. Stop a node in us-west-2.
2. Go to the master UI at <master-ip>:7000/tablet-servers.
3. Wait around 10 - 15 mins. Tablets should be moved off the stopped node automatically within this timeframe, i.e. the `User Tablet-Peers / Leaders` column for that node should display 0 / 0.
4. Remove the same node from the universe.
5. Since the tablets have already been moved off the node, the removal should involve little or no waiting.
6. The node should be removed successfully.

d) Edge case #3: remove 2 nodes from the same AZ
1. Remove a node in us-west-2 from the universe.
2. Remove another node from us-west-2.
3. For the second node, there is nowhere for the tablets to go, so we do not wait for them to move. On the master UI we should see x / 0 under the `User Tablet-Peers / Leaders` column, where 'x' is the number of tablet peers. The RemoveNodeFromUniverse task should finish. However, the value of 'x' should still slowly decrease until it hits 0.

e) Edge case #4: remove as many nodes as possible on Platform
1. We should be able to remove at most 1 node with a master server on it, in order to maintain a majority of tablet peers (in our case we have rf3, so 3 master servers, and thus we can only remove one master server).
2. We should be able to remove all nodes that run only tservers.

Every caller of `UpdatePlacementInfo.java` passes 0 as the number of nodes to be blacklisted, except `EditKubernetesUniverse.java`, and that caller is already using tservers. It is therefore safe to remove the `isTserver` check in `UpdatePlacementInfo.java`.

Reviewers: sanketh, nsingh

Reviewed By: nsingh

Subscribers: yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D18596
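
The blacklisting change described in the summary can be illustrated with a minimal sketch. This is not the code from `UpdatePlacementInfo.java`: the `Node` class, the `toBeRemoved` flag, and the methods `blacklistOld`, `blacklistNew`, and `shouldWaitForDataMove` are hypothetical names introduced only for illustration. The wait rule used here (wait only if another running tserver remains in the same AZ) is an assumption chosen to match test d), not a statement of how Platform decides.

```java
import java.util.ArrayList;
import java.util.List;

public class BlacklistSketch {

  // Hypothetical, simplified node entry; not the real Platform node class.
  static final class Node {
    final String name;
    final String az;
    final boolean isTserver;   // false once the tserver process has been stopped
    final boolean toBeRemoved; // true for a node that is being removed

    Node(String name, String az, boolean isTserver, boolean toBeRemoved) {
      this.name = name;
      this.az = az;
      this.isTserver = isTserver;
      this.toBeRemoved = toBeRemoved;
    }
  }

  // Behavior as described before the change: a stopped node has
  // isTserver == false, so it is skipped and never blacklisted.
  static List<String> blacklistOld(List<Node> nodes) {
    List<String> out = new ArrayList<>();
    for (Node n : nodes) {
      if (n.toBeRemoved && n.isTserver) {
        out.add(n.name);
      }
    }
    return out;
  }

  // Behavior as described after the change: the isTserver check is dropped,
  // so a stopped node that is being removed is still blacklisted.
  static List<String> blacklistNew(List<Node> nodes) {
    List<String> out = new ArrayList<>();
    for (Node n : nodes) {
      if (n.toBeRemoved) {
        out.add(n.name);
      }
    }
    return out;
  }

  // Assumed rule (not from the source): wait for tablets to move only if
  // another running tserver in the same AZ is staying in the universe.
  static boolean shouldWaitForDataMove(Node removed, List<Node> nodes) {
    for (Node n : nodes) {
      if (n != removed && !n.toBeRemoved && n.isTserver && n.az.equals(removed.az)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>();
    nodes.add(new Node("n1", "us-west-2", false, true)); // stopped, then removed
    nodes.add(new Node("n2", "us-west-2", true, false)); // healthy peer in the same AZ
    System.out.println("old blacklist: " + blacklistOld(nodes));
    System.out.println("new blacklist: " + blacklistNew(nodes));
    System.out.println("wait for data move: " + shouldWaitForDataMove(nodes.get(0), nodes));
  }
}
```

In this sketch the stop + remove case (test a) differs between the two filters only for the stopped node, which is the case the commit targets. Marking both us-west-2 nodes as `toBeRemoved` makes `shouldWaitForDataMove` return false for the second one, mirroring test d).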