[MM-43872] Implement rtcd auto-scaling logic #69
Conversation
if cfg.AuthKey == "" {
	m.ctx.LogDebug("auth key missing from rtcd config, generating a new one")
	cfg.AuthKey, err = random.NewSecureString(32)
	if err != nil {
		return cfg, fmt.Errorf("failed to generate auth key: %w", err)
	}
}
Forgot to mention that this PR also includes the auth-related changes proposed in mattermost/rtcd#28. This simplifies things a bit and removes the corner case of getting locked out by crashing after registration but before saving to the store. We now save the key prior to registration in registerRTCDClient.
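Not the plugin's actual code, but a minimal sketch of the ordering described above; the store and service interfaces and their method names are assumptions for illustration only.

```go
package sketch

import "fmt"

// kvStore and rtcdService are hypothetical stand-ins for the plugin's k/v
// store and the rtcd registration API; the real names differ.
type kvStore interface {
	SetAuthKey(key string) error
}

type rtcdService interface {
	Register(clientID, authKey string) error
}

// registerWithStoredKey persists the auth key before registering, so that a
// crash right after registration can no longer lock the plugin out of rtcd.
func registerWithStoredKey(kv kvStore, svc rtcdService, clientID, authKey string) error {
	if err := kv.SetAuthKey(authKey); err != nil {
		return fmt.Errorf("failed to store auth key: %w", err)
	}
	if err := svc.Register(clientID, authKey); err != nil {
		return fmt.Errorf("failed to register rtcd client: %w", err)
	}
	return nil
}
```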
Awesome work, that's a pretty deep PR. 😅
// Send routes the message to the appropriate host that's handling the given
// call. If this is missing a new client is created and added to the mapping.
func (m *rtcdClientManager) Send(msg rtcd.ClientMessage, callID string) error {
	state, err := m.ctx.kvGetChannelState(callID)
Do you think this will cause a performance issue? Maybe not if it's just ws events... Just thinking out loud.
My first implementation was actually doing some in-memory caching and only fetching on join events. But with that came more complexity, as I had to keep a map of callID -> host and handle a few extra corner cases, so I reverted to the dumbest solution, which I think is going to perform well enough for now. But certainly doing some caching would be a good first optimization.
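For illustration, here's one possible shape of that first optimization: a callID -> host cache that only falls back to the store on a miss. The type and the resolve callback are hypothetical, not names from the PR.

```go
package sketch

import "sync"

// hostCache memoizes the callID -> host mapping so that Send doesn't have to
// hit the k/v store for every websocket event. All names are illustrative.
type hostCache struct {
	mut   sync.RWMutex
	hosts map[string]string
}

func newHostCache() *hostCache {
	return &hostCache{hosts: map[string]string{}}
}

// getHost returns the cached host for callID, falling back to resolve (e.g. a
// k/v store lookup) and caching the result on a miss.
func (c *hostCache) getHost(callID string, resolve func(string) (string, error)) (string, error) {
	c.mut.RLock()
	host, ok := c.hosts[callID]
	c.mut.RUnlock()
	if ok {
		return host, nil
	}
	host, err := resolve(callID)
	if err != nil {
		return "", err
	}
	c.mut.Lock()
	c.hosts[callID] = host
	c.mut.Unlock()
	return host, nil
}

// remove evicts the entry once the call has ended.
func (c *hostCache) remove(callID string) {
	c.mut.Lock()
	delete(c.hosts, callID)
	c.mut.Unlock()
}
```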
server/rtcd.go
for _, host := range hosts {
	go func(client *rtcd.Client) {
		for {
			select {
			case err, ok := <-client.ErrorCh():
				if !ok {
					return
				}
				m.ctx.LogError(err.Error())
			case msg, ok := <-client.ReceiveCh():
				if !ok {
					return
				}
				if err := m.handleClientMsg(msg); err != nil {
					m.ctx.LogError(err.Error())
				}
			}
		}
	}(host.client)
}
This is pretty cool.
// the available hosts (ip addresses) pointed by the rtcd URL that are advertised through DNS.
// When new hosts are found a client for them is created. Hosts that are missing
// from the returned set are flagged and won't be used for new calls.
func (m *rtcdClientManager) hostsChecker() {
also cool!
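As a rough sketch built only from the doc comment above, a routine like this could periodically re-resolve the rtcd host and hand the result to a callback; the interval, stop channel, and onHostsResolved hook are assumptions, not the PR's actual implementation.

```go
package sketch

import (
	"log"
	"net"
	"time"
)

// checkHosts periodically re-resolves the IPs behind rtcdHost (e.g. a headless
// service advertised through DNS) and passes the current set to
// onHostsResolved, which can create clients for new hosts and flag missing
// ones. Only the overall idea matches the hostsChecker described above.
func checkHosts(rtcdHost string, interval time.Duration, stopCh <-chan struct{},
	onHostsResolved func(ips []net.IP)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			ips, err := net.LookupIP(rtcdHost)
			if err != nil {
				log.Printf("failed to resolve rtcd host: %s", err.Error())
				continue
			}
			onHostsResolved(ips)
		}
	}
}
```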
Summary
This is where the magic happens 🪄 It's still a bit rough with a couple of corner cases but it's a start. Looking to improve this in further iterations.
Simply put, the logic goes as follows:

1. We create a rtcdClientManager which resolves the rtcd URL and connects to any returned host.
2. We spawn a hostsChecker goroutine to routinely resolve the hosts in case anything has changed. If new hosts are found we create clients for those as well. If some hosts are gone we flag them so that they won't be used for future calls.
3. When a new call starts we pick a host for it and save it in the callState object in the k/v store (see the sketch below).

As I mentioned above, there are still some rough edges. As an example, we don't yet have a proper way of removing hosts from the mapping other than when clients have failed all their re-connection attempts. Ideally we'd like to remove those as soon as the last call routed to them has ended. To do that we need a fair bit more coordination though, so I opted to defer it for now and think it through a bit more.
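To make point 3 above concrete, here is a hedged sketch of picking a host for a new call while skipping flagged ones and persisting the choice; the hostInfo struct, the flagged field, the selection strategy, and saveCallState are all hypothetical and only mirror the behaviour described in this summary.

```go
package sketch

import "errors"

// hostInfo is an illustrative stand-in for the manager's per-host entry; the
// flagged field marks hosts that disappeared from DNS and must not take new calls.
type hostInfo struct {
	ip      string
	flagged bool
}

// pickHostForCall selects the first non-flagged host and persists the choice
// through saveCallState (e.g. into the callState object in the k/v store).
func pickHostForCall(callID string, hosts []hostInfo,
	saveCallState func(callID, hostIP string) error) (string, error) {
	for _, h := range hosts {
		if h.flagged {
			continue
		}
		if err := saveCallState(callID, h.ip); err != nil {
			return "", err
		}
		return h.ip, nil
	}
	return "", errors.New("no available rtcd host for call")
}
```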
Ticket Link
https://mattermost.atlassian.net/browse/MM-43872