
Added PeerManagement.Type = 'member-list' #492

Closed
wants to merge 12 commits into from

Conversation

thrawn01
Contributor

Which problem is this PR solving?

This PR removes the need to run Redis in the deployment environment. All member-list requires is a DNS entry
pointing to a single active instance of Refinery to discover and monitor the health of the rest of the cluster. See https://github.com/hashicorp/memberlist

This allows customers who are using Nomad, k8s, or Consul to deploy Refinery without the need for Redis.

Short description of the changes

  • This PR builds on the changes in Added PeerManagement.Timeout config option #491
  • Added three new config options: `MemberListListenAddr`, `MemberListKnownMembers`, and `MemberListAdvertiseAddr`
  • Added Close(ctx context.Context) error to the Peers interface to allow for clean shutdown and de-registration of an instance in the member-list cluster.

Race condition in Peers interface design

Since peer.NewPeers() launches async goroutines for both redis and member-list, it is possible that a callback could occur before we call RegisterUpdatedPeersCallback() to register the callback, thereby missing the initial callback.

For example, in memberlist_test.go we should be able to assert that p0 will receive 3 callbacks: one for itself and one for each of the 2 additional members. However, because of the race condition, we might only get 2 callbacks.

We could change the Peers interface to instead register a callback during the peer.NewPeers(conf, cb) call, OR we could add a Start() method to the Peers interface, so that users would call RegisterUpdatedPeersCallback() and then call Start() when ready to participate in receiving updates.

@thrawn01 thrawn01 requested review from a team and robbkidd August 29, 2022 23:07
@thrawn01 thrawn01 force-pushed the thrawn.peer-memberlist branch from fa241b9 to f6b266c Compare August 30, 2022 02:12
@vreynolds vreynolds added the status: oncall Flagged for awareness from Honeycomb Telemetry Oncall label Aug 30, 2022
@thrawn01 thrawn01 force-pushed the thrawn.peer-memberlist branch from add92a1 to 9bd09b2 Compare August 30, 2022 15:37
@thrawn01 thrawn01 force-pushed the thrawn.peer-memberlist branch from 9bd09b2 to de757d5 Compare August 30, 2022 15:41
@kentquirk
Contributor

kentquirk commented Aug 30, 2022

I haven't reviewed this yet, but the preferred fix to the race condition problem would be the Start() method approach.

In fact, to keep this more manageable, any chance you could make that fix and the change to the PeerManagement API as a separate PR from adding the memberlist support?

Remove `computePublicAddr()`.  Use configuration's bind address verbatim.
@Baliedge
Contributor

Baliedge commented Sep 1, 2022

I've found an unexpected behavior in the peering functionality that pertains to both the Redis mode and this new Memberlist mode.

When in Redis mode, Refinery binds a peer listener on the public IP, overriding whatever bind IP was set in the GetPeerListenAddr() config. Derrick brought this logic into memberlist, but it ultimately caused unit test issues because it would bind to the public IP while the tests would connect to 127.0.0.1 and fail.

For memberlist mode, I removed the public IP logic to fix unit tests because they wanted to bind to localhost, not public IP. But this creates a disparity with Redis mode and I don't think it's my place to change the behavior of pre-existing logic.

My recommendation is to remove the configuration override that binds only to the public IP. Alternatively, you could also provide a GetPublicIP() function. That way, devs have a choice when populating the configuration structure.

cc: @kentquirk

@Baliedge
Contributor

Baliedge commented Sep 2, 2022

@thrawn01, I'm having some trouble getting this to work. There are some bugs in the peer list update once memberlist discovers a peer. But the real problem is that in Nomad, we can't effectively point it to other peers to prime the memberlist mesh. The consul service alias doesn't get created until after the service is up and running and its health check passes. But it's failing on startup because it can't find the alias, so the health check fails.

I'm going to give etcd a try in a separate PR.

@MikeGoldsmith MikeGoldsmith added type: enhancement New feature or request status: revision needed Waiting for response to changes requested. version: bump minor A PR that adds behavior, but is backwards-compatible. labels Sep 6, 2022
Contributor

@MikeGoldsmith MikeGoldsmith left a comment


Thanks for creating the PR @thrawn01. It's very thorough 👍🏻

This is interesting for us and we are open to adding support for memberlist. As @Baliedge indicated, some things are not working as expected with Nomad.

Please can you take a look?

@MikeGoldsmith MikeGoldsmith self-assigned this Sep 6, 2022
Requires `PeerListenAddr` to be unique in the cluster, so the "0.0.0.0:8081" default value will not work.
@Baliedge Baliedge force-pushed the thrawn.peer-memberlist branch from d575ea9 to 75f080b Compare September 6, 2022 20:41
127.0.0.1 can be bound, but not any other fourth-octet value.
It would be better if custom ports on a common IP worked.
Peering currently requires a unique IP address for each peer, regardless of port.
@Baliedge
Contributor

Baliedge commented Sep 7, 2022

After discussion with @MikeGoldsmith, we decided that Memberlist probably isn't a good solution for our use case. It seems to work well when you have a dedicated cluster master with a static host that you can point peers to, but that isn't our environment. Under Nomad, all nodes are dynamic and may come and go with every new deployment.

Instead, I feel it would be best to focus on integrating using our existing etcd service as we do with our other services.

Additionally, there are two behaviors that don't impact production, but prevent end-to-end integration testing.

First, the peer listener binds to the detected public IP, overriding the IP set in PeerListenAddr configuration. This is by design with the assumption you would never need multiple peers on the same host. However, this impacts testability because you only have 1 IP to work with.

Second, peering logic identifies peers by IP address only, without the port. As a result, peering would not support multiple peers on the same host with different ports. This design requires each peer to have a different IP and bind on the same port. As before, this also impacts testability.

We agreed to leave these two behaviors unchanged in our peering changes. They may be addressed in later efforts.

@MikeGoldsmith
Contributor

I'm going to close this for now. If someone wants to continue working on this, keeping in mind the points @Baliedge highlighted above, please re-open :)
