[MM-43872] Implement rtcd auto-scaling logic #69

Merged
merged 10 commits into main from MM-43872 on May 24, 2022

Conversation

streamer45 (Collaborator) commented May 11, 2022

Summary

This is where the magic happens 🪄 It's still a bit rough, with a couple of corner cases to iron out, but it's a start. I'm looking to improve this in further iterations.

Simply put, the logic goes as follows:

  • Upon starting the plugin we create an rtcdClientManager, which resolves the rtcdURL and connects to any returned host.
  • We also start a hostsChecker goroutine to routinely re-resolve the hosts in case anything has changed. If new hosts are found, we create clients for those as well. If some hosts are gone, we flag them so that they won't be used for future calls.
  • When a call is about to start, we pick the host with the least number of calls started on it (from the plugin instance's perspective, that is; to be improved). A minimal sketch of this selection follows the list.
  • Once a call has started, any future message related to that call is routed to the appropriate host. This mapping is saved as part of the callState object in the k/v store.

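To make the selection step concrete, here is a minimal sketch of the least-loaded pick, assuming an in-memory map of resolved hosts guarded by a mutex (the type and field names are illustrative, not necessarily the ones used in this PR; assumes the standard "errors" package and the rtcd client package are imported):

// Illustrative type: one entry per resolved rtcd host, tracked by this plugin instance only.
type rtcdHost struct {
    client  *rtcd.Client
    flagged bool // set by the hosts checker when the host is gone from DNS
    calls   int  // calls this plugin instance has started on the host
}

// getHost returns the IP of the non-flagged host with the fewest calls
// started on it, from this plugin instance's perspective.
// Assumes hypothetical fields m.mut (sync.RWMutex) and m.hosts (map[string]*rtcdHost).
func (m *rtcdClientManager) getHost() (string, error) {
    m.mut.RLock()
    defer m.mut.RUnlock()

    minCalls := -1
    var bestIP string
    for ip, host := range m.hosts {
        if host.flagged {
            continue
        }
        if minCalls < 0 || host.calls < minCalls {
            minCalls = host.calls
            bestIP = ip
        }
    }

    if bestIP == "" {
        return "", errors.New("no rtcd host available")
    }

    return bestIP, nil
}
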
As I mentioned above, there are still some rough edges. As an example, we don't yet have a proper way of removing hosts from the mapping other than when their clients have failed all their re-connection attempts. Ideally we'd like to remove them as soon as the last call routed to them has ended. To do that we'd need a fair bit more coordination though, so I opted to defer it for now and think it through a bit more. One possible shape of that cleanup is sketched below.

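Purely for illustration, one possible shape of that deferred cleanup, ignoring the extra coordination mentioned above and reusing the same hypothetical types and fields as the selection sketch:

// Hypothetical sketch, not part of this PR: decrement the host's call counter
// when a call ends and drop flagged hosts once nothing is routed to them anymore.
func (m *rtcdClientManager) onCallEnded(hostIP string) {
    m.mut.Lock()
    defer m.mut.Unlock()

    host, ok := m.hosts[hostIP]
    if !ok {
        return
    }

    host.calls--

    // Only remove hosts that the hosts checker has already flagged as gone
    // and that no longer have any ongoing calls routed to them.
    if host.flagged && host.calls <= 0 {
        // Close the client (assuming the client exposes a Close method).
        if err := host.client.Close(); err != nil {
            m.ctx.LogError(err.Error())
        }
        delete(m.hosts, hostIP)
    }
}
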
Ticket Link

https://mattermost.atlassian.net/browse/MM-43872

streamer45 added the "2: Dev Review" (Requires review by a core committer) and "Do Not Merge/Awaiting PR" (Awaiting another pull request before merging, e.g. server changes) labels on May 11, 2022
streamer45 requested a review from cpoile on May 11, 2022 17:07
streamer45 self-assigned this on May 11, 2022
Comment on lines +459 to +465
if cfg.AuthKey == "" {
    m.ctx.LogDebug("auth key missing from rtcd config, generating a new one")
    cfg.AuthKey, err = random.NewSecureString(32)
    if err != nil {
        return cfg, fmt.Errorf("failed to generate auth key: %w", err)
    }
}
streamer45 (Collaborator, Author):

Forgot to mention that this PR also includes the auth-related changes proposed in mattermost/rtcd#28. This simplifies things a bit and removes the corner case of getting locked out by crashing after registration but before saving the key to the store. We now save the key prior to registration in registerRTCDClient.

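To illustrate that ordering, a rough sketch follows; kvSetRTCDConfig, registerClient, and rtcdConfig are hypothetical names used only for this example, not the plugin's actual API:

// Hypothetical sketch of the ordering: generate the key if missing, persist it
// to the k/v store first, then register with the rtcd service. A crash between
// the two steps no longer locks us out, since the saved key can be reused on
// the next attempt.
func (m *rtcdClientManager) registerRTCDClientSketch(cfg rtcdConfig) error {
    var err error
    if cfg.AuthKey == "" {
        cfg.AuthKey, err = random.NewSecureString(32)
        if err != nil {
            return fmt.Errorf("failed to generate auth key: %w", err)
        }
    }

    // Persist before registering (hypothetical k/v helper).
    if err := m.ctx.kvSetRTCDConfig(cfg); err != nil {
        return fmt.Errorf("failed to store rtcd config: %w", err)
    }

    // Only then register with the service (hypothetical helper).
    if err := m.registerClient(cfg); err != nil {
        return fmt.Errorf("failed to register rtcd client: %w", err)
    }

    return nil
}
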
cpoile (Member) left a comment:

Awesome work, that's a pretty deep PR. 😅

// Send routes the message to the appropriate host that's handling the given
// call. If a client for that host is missing, a new one is created and added
// to the mapping.
func (m *rtcdClientManager) Send(msg rtcd.ClientMessage, callID string) error {
    state, err := m.ctx.kvGetChannelState(callID)
cpoile (Member):

Do you think this will cause a performance issue? Maybe not if it's just ws events... Just thinking out loud.

streamer45 (Collaborator, Author) commented May 12, 2022:

My first implementation was actually doing some in-memory caching and only fetching on join events. But with that came more complexity, as I had to keep a map of callID -> host and handle a few extra corner cases, so I reverted to the dumbest solution, which I think is going to perform well enough for now. But certainly doing some caching would be a good first optimization.

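For reference, the caching optimization described above could be as small as a guarded callID -> host map that is invalidated when the call ends. This is only an illustration of the idea, not the reverted code (assumes "sync" is imported):

// Illustrative callID -> host cache, so Send could skip the k/v store lookup
// for most messages.
type hostCache struct {
    mut   sync.RWMutex
    hosts map[string]string // callID -> rtcd host
}

func newHostCache() *hostCache {
    return &hostCache{hosts: map[string]string{}}
}

func (c *hostCache) get(callID string) (string, bool) {
    c.mut.RLock()
    defer c.mut.RUnlock()
    host, ok := c.hosts[callID]
    return host, ok
}

func (c *hostCache) set(callID, host string) {
    c.mut.Lock()
    defer c.mut.Unlock()
    c.hosts[callID] = host
}

// delete removes the entry once the call has ended.
func (c *hostCache) delete(callID string) {
    c.mut.Lock()
    defer c.mut.Unlock()
    delete(c.hosts, callID)
}
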
server/rtcd.go Outdated
Comment on lines 90 to 109
for _, host := range hosts {
    go func(client *rtcd.Client) {
        for {
            select {
            case err, ok := <-client.ErrorCh():
                if !ok {
                    return
                }
                m.ctx.LogError(err.Error())
            case msg, ok := <-client.ReceiveCh():
                if !ok {
                    return
                }
                if err := m.handleClientMsg(msg); err != nil {
                    m.ctx.LogError(err.Error())
                }
            }
        }
    }(host.client)
}
cpoile (Member):

This is pretty cool.

// the available hosts (IP addresses) pointed to by the rtcd URL that are advertised through DNS.
// When new hosts are found, a client for them is created. Hosts that are missing
// from the returned set are flagged and won't be used for new calls.
func (m *rtcdClientManager) hostsChecker() {
cpoile (Member):

also cool!

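For readers following along, a much-simplified version of such a checker loop might look like the following; the interval, the rtcdHostname field, and the reconcileHosts helper are assumptions for this sketch, not the code under review (assumes "net" and "time" are imported):

// Simplified sketch of a periodic DNS check: re-resolve the rtcd hostname on a
// ticker, then let a hypothetical reconcile helper create clients for new IPs
// and flag hosts missing from the returned set.
func (m *rtcdClientManager) hostsCheckerSketch(stopCh <-chan struct{}) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            ips, err := net.LookupHost(m.rtcdHostname)
            if err != nil {
                m.ctx.LogError(err.Error())
                continue
            }
            if err := m.reconcileHosts(ips); err != nil {
                m.ctx.LogError(err.Error())
            }
        case <-stopCh:
            return
        }
    }
}
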
streamer45 added the "3: Reviews Complete" (All reviewers have approved the pull request) and "Do Not Merge" (Should not be merged until this label is removed) labels and removed the "2: Dev Review" and "Do Not Merge/Awaiting PR" labels on May 17, 2022
streamer45 removed the "Do Not Merge" label on May 24, 2022
streamer45 merged commit 8b250a6 into main on May 24, 2022
streamer45 deleted the MM-43872 branch on May 24, 2022 18:41
Labels: 3: Reviews Complete (All reviewers have approved the pull request)
2 participants