*: support raft learner in etcd - part 3 #10730

Merged
merged 13 commits into from
May 29, 2019
70 changes: 70 additions & 0 deletions Documentation/op-guide/runtime-configuration.md
@@ -123,6 +123,46 @@ The new member will run as a part of the cluster and immediately begin catching

If adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members. If adding a new member to a 1-node cluster, the cluster cannot make progress before the new member starts because it needs two members as majority to agree on the consensus. This behavior only happens between the time `etcdctl member add` informs the cluster about the new member and the new member successfully establishing a connection to the existing one.
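The majority requirement in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch, not etcd code: `quorum` and `canMakeProgress` are hypothetical helper names, and the only assumption is that a cluster needs a strict majority of its voting members.

```go
package main

import "fmt"

// quorum returns how many voting members must agree for a cluster of
// n voting members to make progress (a strict majority).
func quorum(n int) int {
	return n/2 + 1
}

// canMakeProgress reports whether a cluster with `voters` voting members
// can commit entries when only `reachable` of them are up and connected.
func canMakeProgress(voters, reachable int) bool {
	return reachable >= quorum(voters)
}

func main() {
	// 1-node cluster: the single member is its own majority.
	fmt.Println(quorum(1), canMakeProgress(1, 1)) // 1 true
	// After `etcdctl member add`, there are 2 voters but only 1 is running.
	fmt.Println(quorum(2), canMakeProgress(2, 1)) // 2 false
	// Once the new member starts and connects, progress resumes.
	fmt.Println(canMakeProgress(2, 2)) // true
}
```

A learner does not count toward `voters`, which is why adding a member as a learner first avoids this window.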

#### Add a new member as learner

Starting from v3.4, etcd supports adding a new member as a learner (non-voting member).
The motivation and design can be found in the [design doc](https://etcd.readthedocs.io/en/latest/server-learner.html).
To make the process of adding a new member safer
and to reduce cluster downtime when the new member is added, it is recommended to add the new member to the cluster
as a learner until it catches up. This can be described as a three-step process:

* Add the new member as learner via [gRPC members API][member-api-grpc] or the `etcdctl member add --learner` command.

* Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member).
This step is exactly the same as before.

* Promote the newly added learner to a voting member via the [gRPC members API][member-api-grpc] or the `etcdctl member promote` command.
The etcd server validates the promote request to ensure operational safety.
A learner can be promoted to a voting member only after its raft log has caught up to the leader's.
If the learner has not caught up to the leader's raft log, the promote request fails
(see the [error cases when promoting a member] section for more details).
In this case, wait and retry later.

In v3.4, etcd server limits the number of learners that cluster can have to one. The main consideration is to limit the
extra workload on leader due to propagating data from leader to learner.

Review comments on the line above:

Contributor:
I feel we should make this configurable. One seems to be a small number.

Contributor:
Some people want to increase read perf. So they want more learners for example.

Contributor Author (@jingyih, May 16, 2019):
cc @gyuho

I kind of agree that 1 is not enough for some use cases, such as upsizing cluster from 1 to 3, or 3 to 5, and live-migrating a 3-node cluster to a new 3-node cluster. On the other hand, we do not want users to add too many learners at the same time, which might result in too much overhead for the leader. One goal of using learner is to make add members safer - not if it results in too much overhead on leader and then cause the cluster to fail. I think we can hard-code the limit to a small number, and make it configurable as an experimental feature in 3.4.

Contributor Author (@jingyih, May 16, 2019):
Configuring this limit is cluster-wide reconfiguration (very similar to member change), which means we need to add an API, and maybe a new config change type.

Contributor Author:
@xiang90 Do we keep the limit of 1? I do not have strong opinions on this.

Contributor:
let us keep this as it is for now. we can remove the limit later.

Use `etcdctl member add` with the `--learner` flag to add a new member to the cluster as a learner.

```sh
$ etcdctl member add infra3 --peer-urls=http://10.0.1.13:2380 --learner
Member 9bf1b35fc7761a23 added to cluster a7ef944b95711739

ETCD_NAME="infra3"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
```

After the etcd process for the newly added learner member has started, use `etcdctl member promote` to promote the learner to a voting member.
```
$ etcdctl member promote 9bf1b35fc7761a23
Member 9bf1b35fc7761a23 promoted in cluster a7ef944b95711739
```

#### Error cases when adding members

In the following case a new host is not included in the list of enumerated nodes. If this is a new cluster, the node must be added to the list of initial cluster members.
@@ -153,6 +193,35 @@ etcd: this member has been permanently removed from the cluster. Exiting.
exit 1
```

#### Error cases when adding a learner member

Cannot add learner to cluster if the cluster already has 1 learner (v3.4).
```
$ etcdctl member add infra4 --peer-urls=http://10.0.1.14:2380 --learner
Error: etcdserver: too many learner members in cluster
```
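The limit check behind this error can be modeled with a toy in-memory membership list. This is a sketch under stated assumptions, not etcd's implementation: the `cluster` and `member` types, the `add` method, and the `maxLearners` constant are hypothetical names; only the error text and the limit of one learner come from the documentation above.

```go
package main

import (
	"errors"
	"fmt"
)

// maxLearners mirrors the v3.4 limit described in the text (assumed constant name).
const maxLearners = 1

var errTooManyLearners = errors.New("etcdserver: too many learner members in cluster")

// member is a hypothetical, simplified member record.
type member struct {
	name      string
	isLearner bool
}

// cluster is a toy membership list used only for illustration.
type cluster struct {
	members []member
}

// learners counts the members currently marked as learners.
func (c *cluster) learners() int {
	n := 0
	for _, m := range c.members {
		if m.isLearner {
			n++
		}
	}
	return n
}

// add appends a member and rejects a new learner once the limit is reached.
// Voting members are not subject to the learner limit.
func (c *cluster) add(name string, asLearner bool) error {
	if asLearner && c.learners() >= maxLearners {
		return errTooManyLearners
	}
	c.members = append(c.members, member{name: name, isLearner: asLearner})
	return nil
}

func main() {
	c := &cluster{members: []member{{name: "infra0"}, {name: "infra1"}, {name: "infra2"}}}
	fmt.Println(c.add("infra3", true))  // <nil>
	fmt.Println(c.add("infra4", true))  // etcdserver: too many learner members in cluster
	fmt.Println(c.add("infra5", false)) // <nil>
}
```

The last call matches step 4 of `TestMaxLearnerInCluster` in this PR: adding a voting member still succeeds while one learner is present.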

#### Error cases when promoting a learner member

Learner can only be promoted to voting member if it is in sync with leader.
```
$ etcdctl member promote 9bf1b35fc7761a23
Error: etcdserver: can only promote a learner member which is in sync with leader
```

Promoting a member that is not a learner will fail.
```
$ etcdctl member promote 9bf1b35fc7761a23
Error: etcdserver: can only promote a learner member
```

Promoting a member that does not exist in cluster will fail.
```
$ etcdctl member promote 12345abcde
Error: etcdserver: member not found
```
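The three promote failures above can be summarized as a sequence of checks. The sketch below is a simplified model, not etcd's server code: the types, the `promote` method, the order of the checks, and the "in sync" test (`matchIndex >= leaderIndex`) are all assumptions made for illustration; the real readiness criterion in etcd may differ. Only the error strings are taken from the output shown above.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound   = errors.New("etcdserver: member not found")
	errNotLearner = errors.New("etcdserver: can only promote a learner member")
	errNotInSync  = errors.New("etcdserver: can only promote a learner member which is in sync with leader")
)

// member is a hypothetical record; matchIndex is how far the member's
// log has been replicated.
type member struct {
	isLearner  bool
	matchIndex uint64
}

// cluster is a toy model keyed by member ID.
type cluster struct {
	leaderIndex uint64
	members     map[string]*member
}

// promote turns a learner into a voting member if all checks pass.
// Assumption: "in sync" means the learner has replicated the leader's whole log.
func (c *cluster) promote(id string) error {
	m, ok := c.members[id]
	if !ok {
		return errNotFound
	}
	if !m.isLearner {
		return errNotLearner
	}
	if m.matchIndex < c.leaderIndex {
		return errNotInSync
	}
	m.isLearner = false
	return nil
}

func main() {
	c := &cluster{
		leaderIndex: 100,
		members: map[string]*member{
			"voter":   {matchIndex: 100},
			"learner": {isLearner: true, matchIndex: 40},
		},
	}
	fmt.Println(c.promote("missing")) // etcdserver: member not found
	fmt.Println(c.promote("voter"))   // etcdserver: can only promote a learner member
	fmt.Println(c.promote("learner")) // etcdserver: can only promote a learner member which is in sync with leader
	c.members["learner"].matchIndex = 100 // learner catches up
	fmt.Println(c.promote("learner"))     // <nil>
}
```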


### Strict reconfiguration check mode (`-strict-reconfig-check`)

As described above, the best practice for adding new members is to configure a single member at a time and verify it starts correctly before adding more new members. This step-by-step approach is very important because if a newly added member is not configured correctly (for example, the peer URLs are incorrect), the cluster can lose quorum. The quorum loss happens because the newly added member is counted in the quorum even if that member is not reachable from the other existing members. Quorum loss might also happen if there is a connectivity issue or there are operational issues.
@@ -173,3 +242,4 @@ It is enabled by default.
[member migration]: ../v2/admin_guide.md#member-migration
[remove member]: #remove-a-member
[runtime-reconf]: runtime-reconf-design.md
[error cases when promoting a member]: #error-cases-when-promoting-a-learner-member
94 changes: 83 additions & 11 deletions clientv3/integration/cluster_test.go
@@ -19,6 +19,7 @@ import (
"reflect"
"strings"
"testing"
"time"

"go.etcd.io/etcd/integration"
"go.etcd.io/etcd/pkg/testutil"
@@ -214,14 +215,19 @@ func TestMemberAddForLearner(t *testing.T) {
}
}

func TestMemberPromoteForLearner(t *testing.T) {
// TODO test not ready learner promotion.
func TestMemberPromote(t *testing.T) {
defer testutil.AfterTest(t)

clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
defer clus.Terminate(t)
// TODO change the random client to client that talk to leader directly.
capi := clus.RandClient()

// member promote request can be sent to any server in cluster,
// the request will be auto-forwarded to leader on server-side.
// This test explicitly includes the server-side forwarding by
// sending the request to follower.
leaderIdx := clus.WaitLeader(t)
followerIdx := (leaderIdx + 1) % 3
capi := clus.Client(followerIdx)

urls := []string{"http://127.0.0.1:1234"}
memberAddResp, err := capi.MemberAddAsLearner(context.Background(), urls)
@@ -244,18 +250,84 @@ func TestMemberPromoteForLearner(t *testing.T) {
t.Fatalf("Added 1 learner node to cluster, got %d", numberOfLearners)
}

memberPromoteResp, err := capi.MemberPromote(context.Background(), learnerID)
if err != nil {
t.Fatalf("failed to promote member: %v", err)
// learner is not started yet. Expect learner progress check to fail.
// As the result, member promote request will fail.
_, err = capi.MemberPromote(context.Background(), learnerID)
expectedErrKeywords := "can only promote a learner member which is in sync with leader"
if err == nil {
t.Fatalf("expecting promote not ready learner to fail, got no error")
}
if !strings.Contains(err.Error(), expectedErrKeywords) {
t.Fatalf("expecting error to contain %s, got %s", expectedErrKeywords, err.Error())
}

// create and launch learner member based on the response of V3 Member Add API.
// (the response has information on peer urls of the existing members in cluster)
learnerMember := clus.MustNewMember(t, memberAddResp)
clus.Members = append(clus.Members, learnerMember)
if err := learnerMember.Launch(); err != nil {
t.Fatal(err)
}

// retry until promote succeed or timeout
timeout := time.After(5 * time.Second)
for {
select {
case <-time.After(500 * time.Millisecond):
case <-timeout:
// fail the test here: a bare break would only leave the select, not the for loop
t.Fatalf("failed all attempts to promote learner member, last error: %v", err)
}

_, err = capi.MemberPromote(context.Background(), learnerID)
// successfully promoted learner
if err == nil {
break
}
// if member promote fails due to learner not ready, retry.
// otherwise fails the test.
if !strings.Contains(err.Error(), expectedErrKeywords) {
t.Fatalf("unexpected error when promoting learner member: %v", err)
}
}
}

numberOfLearners = 0
for _, m := range memberPromoteResp.Members {
// TestMaxLearnerInCluster verifies that the maximum number of learners allowed in a cluster is 1
func TestMaxLearnerInCluster(t *testing.T) {
defer testutil.AfterTest(t)

// 1. start with a cluster with 3 voting member and 0 learner member
clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
defer clus.Terminate(t)

// 2. adding a learner member should succeed
resp1, err := clus.Client(0).MemberAddAsLearner(context.Background(), []string{"http://127.0.0.1:1234"})
if err != nil {
t.Fatalf("failed to add learner member %v", err)
}
numberOfLearners := 0
for _, m := range resp1.Members {
if m.IsLearner {
numberOfLearners++
}
}
if numberOfLearners != 0 {
t.Errorf("learner promoted, expect 0 learner, got %d", numberOfLearners)
if numberOfLearners != 1 {
t.Fatalf("Added 1 learner node to cluster, got %d", numberOfLearners)
}

// 3. cluster has 3 voting member and 1 learner, adding another learner should fail
_, err = clus.Client(0).MemberAddAsLearner(context.Background(), []string{"http://127.0.0.1:2345"})
if err == nil {
t.Fatalf("expect member add to fail, got no error")
}
expectedErrKeywords := "too many learner members in cluster"
if !strings.Contains(err.Error(), expectedErrKeywords) {
t.Fatalf("expecting error to contain %s, got %s", expectedErrKeywords, err.Error())
}

// 4. cluster has 3 voting member and 1 learner, adding a voting member should succeed
_, err = clus.Client(0).MemberAdd(context.Background(), []string{"http://127.0.0.1:3456"})
if err != nil {
t.Errorf("failed to add member %v", err)
}
}
51 changes: 48 additions & 3 deletions clientv3/integration/kv_test.go
@@ -1011,9 +1011,8 @@ func TestKVForLearner(t *testing.T) {
}
defer cli.Close()

// TODO: expose servers's ReadyNotify() in test and use it instead.
// waiting for learner member to catch up applying the config change entries in raft log.
time.Sleep(3 * time.Second)
// wait until learner member is ready
<-clus.Members[3].ReadyNotify()

tests := []struct {
op clientv3.Op
@@ -1051,3 +1050,49 @@ func TestKVForLearner(t *testing.T) {
}
}
}

// TestBalancerSupportLearner verifies that balancer's retry and failover mechanism supports cluster with learner member
func TestBalancerSupportLearner(t *testing.T) {
defer testutil.AfterTest(t)

clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
defer clus.Terminate(t)

// we have to add and launch learner member after initial cluster was created, because
// bootstrapping a cluster with learner member is not supported.
clus.AddAndLaunchLearnerMember(t)

learners, err := clus.GetLearnerMembers()
if err != nil {
t.Fatalf("failed to get the learner members in cluster: %v", err)
}
if len(learners) != 1 {
t.Fatalf("added 1 learner to cluster, got %d", len(learners))
}

// clus.Members[3] is the newly added learner member, which was appended to clus.Members
learnerEp := clus.Members[3].GRPCAddr()
cfg := clientv3.Config{
Endpoints: []string{learnerEp},
DialTimeout: 5 * time.Second,
DialOptions: []grpc.DialOption{grpc.WithBlock()},
}
cli, err := clientv3.New(cfg)
if err != nil {
t.Fatalf("failed to create clientv3: %v", err)
}
defer cli.Close()

// wait until learner member is ready
<-clus.Members[3].ReadyNotify()

if _, err := cli.Get(context.Background(), "foo"); err == nil {
t.Fatalf("expect Get request to learner to fail, got no error")
}

eps := []string{learnerEp, clus.Members[0].GRPCAddr()}
cli.SetEndpoints(eps...)
if _, err := cli.Get(context.Background(), "foo"); err != nil {
t.Errorf("expect no error (balancer should retry when request to learner fails), got error: %v", err)
}
}