-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wip - etcd 3.5.0-alpha-0 #2452
wip - etcd 3.5.0-alpha-0 #2452
Conversation
Signed-off-by: Matthew Sykes <[email protected]>
The commit etcd-io/etcd@c9491d7 strongly distinguishes the StartNode API call of the etcd/raft Node from the RestartNode. This commit: (1) Aligns our code with the bespoken API changes. (2) Makes the logic in node.start() be more easy to reason about. Change-Id: I48070de9241ebcaae720fb1d7dc7c7d1e507625a Signed-off-by: Yacov Manevich <[email protected]>
Panic is no more but there may be other issues:
Not sure if this failure is a flake or a genuine problem. Probably worth to run the tests locally and see. |
@yacovm, Thanks for looking. The test that failed on your end is failing for me as well. There are also 5 or 6 unit tests (mostly related to reconfiguration) that are failing as well. For grins I tried to pull in your change into a tree with the current versions of etcd and there were failures there as well. I was sort of hoping the cleanup could be applied independently since it was much easier to read. So, none of this is urgent but I think we need to track it. Would be great for @jyellick to weigh in as well. |
But UTs passed in the latest build after my commit. |
No. They're "pended" out in the code.
|
I see, I didn't look at the code and I wondered how come the unit tests passed but the integration tests didn't. |
Definitely seems like something we'd like to pick up -- and at least thus far the changes don't seem too invasive. I've pulled the code and poked around a little, but not solved the UT failures yet. @guoger if you have some spare cycles, you might have some thoughts as well. |
It looks like etcdraft has gotten a little bit smarter about how it evicts nodes from configuration which is causing problems interacting with Fabric splitting the config change across two indexes (one for the config block, the next for the actual raft conf change) or at least that's what things look like to me. Notably, before the conf state simply listed nodes and learners. Now we have voters, learners, voters outgoing, and learners next. There's also a chance that some of those rote compile changes from Nodes to Voters isn't actually what we wanted, so they might deserve a bit more scrutiny. One thing I think that's important to note is that we've historically said that we should really combine the two steps of proposing a config block and proposing the actual raft config change, as splitting them across entries creates yet another point in reconfiguration where nodes can crash and create headaches for us. On the other hand, changing this behavior would create a bit of an upgrade headache (or maybe we could figure out a clever way to handle it). It's probably worth investigating the interoperability between the v1/v3 network ABIs, since if they are not compatible (and we will require downtime to upgrade), now might be an opportune time to change the two-step behavior. I know that @guoger has started to look at this, so perhaps he will have some additional insight. |
Reconfig is broken with new version of etcd/raft because it now performs more strict check on incoming snapshot. If the snapshot sent was taken before adding new node, then it certainly does not contain new node, and would fail the check aforementioned. etcd server it self handles this by injecting the actual It shouldn't be hard for us to do the same, and i think we should upgrade lib after this change. |
Thank you for taking a look. We'll have to revisit this in the future. |
@jyellick @yacovm - An alpha of etcd v3.5 has been released as a module. I've pulled it in and done just enough work to make it compile. I thought you might want to take a look and see if it's something we want to pull in at some point in the near future.
Right now a number of tests are failing with a panic from etcd:
One of the fields on the protobuf messages exchanged by raft also changed names (but appears to have kept the same field number and semantics).