Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MM-53432] Calls transcriptions #549

Merged
merged 40 commits into from
Nov 29, 2023
Merged

[MM-53432] Calls transcriptions #549

merged 40 commits into from
Nov 29, 2023

Conversation

streamer45
Copy link
Collaborator

@streamer45 streamer45 commented Oct 10, 2023

Summary

PR implements the plugin side for calls transcriptions (post call transcript).

One important detail to keep in mind is that recording and transcription jobs are artificially coupled, meaning that when transcriptions are enabled (globally from the plugin settings), starting and stopping recordings will automatically start and stop transcriptions. Moreover, any failure would cause both jobs to be terminated.

One limitation I found when trying to follow the proposed design is that it's not really possible to update the attachments for a previously created post so we cannot easily add the transcription file to recording post unless we artificially delay its creation. I don't like that idea very much given the transcription could take a while to process. For now we simply have the bot make a new post with the transcription file attached. I created https://mattermost.atlassian.net/browse/MM-54874 to track this improvement as it'll likely require server changes.

Still to do (ideally prior to merging)

  • Fix E2E pipeline. Need to create and bump the builder to use Go 1.21. (https://git.internal.mattermost.com/qa/calls-e2e-testing/-/merge_requests/37)
  • Need to prepare a test server for UX/PM to test this on.
  • At least a couple of e2e tests. Need to update all the pipelines to make it happen though.
  • Attaching a text based format in addition to WebVTT for easy (human readable) preview of the transcription content.
  • Probably need to slightly rework the transcript format since the speaker names are not rendered automatically.
  • Add some basic configuration for transcription jobs, such as quality (e.g. model size to use).
  • Consider having a setting to control the timeout for the transcribing job given we currently rely on MaxRecordingDuration (see comment in code).
  • Update documentation (e.g. the new plugin setting).
  • Merge, release and update all the dependencies (5 sister repositories at least I believe).
  • Performance and scaling tests

Design

https://www.figma.com/file/ZAvwHhdUTaWkby4uekmnDX/Call-transcription

Ticket Link

https://mattermost.atlassian.net/browse/MM-53432

@streamer45 streamer45 added 1: UX Review Requires review by ux 2: Dev Review Requires review by a core committer 3: Security Review 1: Editor Review labels Oct 10, 2023
@streamer45 streamer45 added this to the v0.21.0 / MM 9.3 milestone Oct 10, 2023
@streamer45 streamer45 requested a review from cpoile October 10, 2023 22:41
@streamer45 streamer45 self-assigned this Oct 10, 2023
@codecov-commenter
Copy link

codecov-commenter commented Oct 10, 2023

Codecov Report

Attention: 670 lines in your changes are missing coverage. Please review.

Comparison is base (ff80fe2) 5.70% compared to head (d787eae) 6.04%.

Files Patch % Lines
server/bot_api.go 0.00% 219 Missing ⚠️
server/transcription_api.go 0.00% 143 Missing ⚠️
server/job_service.go 0.00% 72 Missing ⚠️
server/session.go 0.00% 53 Missing ⚠️
server/job_metadata.go 37.80% 47 Missing and 4 partials ⚠️
server/recording_api.go 0.00% 49 Missing ⚠️
server/utils.go 15.90% 37 Missing ⚠️
server/state.go 18.51% 21 Missing and 1 partial ⚠️
server/configuration.go 36.84% 11 Missing and 1 partial ⚠️
server/store.go 0.00% 12 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #549      +/-   ##
========================================
+ Coverage   5.70%   6.04%   +0.34%     
========================================
  Files         24      26       +2     
  Lines       4561    5110     +549     
========================================
+ Hits         260     309      +49     
- Misses      4282    4777     +495     
- Partials      19      24       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@cpoile cpoile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comments, nothing big. Amazing work!

return
}

postID := info["thread_id"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can have a ensureFields(info, "thread_id", "file_id", "transcription_id") and then we can use it elsewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ideal solution would be for the request body to be properly typed and likely provide an IsValid() method. That means more coupling as both sides need to be aware of the type but I think we are going into that direction anyway with the new public stuff so may as well do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(again, just an idea, not necessary): That's why a more generic ensureFields(stringmap map[string]string, fields ...string) type of thing would be better? (less coupling)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, understood the proposal. My main worry is that that would become problematic as soon as we need to handle something other than strings. And then we are re-inventing structs using maps. Maybe some coupling is the way to go given this is essentially a public API from the implementer's point of view.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I added a JobInfo type and updated everything, no more map parsing on the API side.

Comment on lines +349 to +351
// Checking if the transcription has ended due to the bot leaving.
if prevState.Call != nil && prevState.Call.Transcription != nil && currState.Call != nil && currState.Call.Transcription != nil &&
currState.Call.Transcription.EndAt > prevState.Call.Transcription.EndAt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to think the way that the bot behaviour + state is triggering things seems a little brittle and confusing. Maybe it's too late to change it, but it would be nice if jobs were started and stopped by idempotent events. Granted, we need the bots to do any of the actual recording/transcribing, but it seems like maybe the communication triggers and the stopping triggers (at least) could be in response to events, rather than these conditional blocks (which seems like it would be error prone).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand more about that as I am not entirely sure what you mean by "idempotent events". Could you make an example perhaps?

In essence the code above is in response to an event, namely the bot leaving the call which we need to handle somehow. Not sure if it's a syntactic improvement you'd be looking for (e.g. onBotLeftCall callback) or something more structural.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my thinking (and while writing this I think I see why it's not practical): Something like moving all the "bot leaving triggers the recording/transcription ending" logic out of here and putting it in its own function (say: recordingEnded), then moving the conditional logic (lines 350-351) somewhere else (that's the infeasible part I think), then using that conditional logic to trigger an api call which trigger the recordingEnded. Then if that recordingEnded endpoint gets triggered multiple times (for whatever reason) we can ignore it after the first one.
I think it's tough to find a better place for lines 350-351 though...
I guess I'm worried that we've tightly coupled the bot's presence with the recording/transcribing activity. I know we need the bot there for that to actually work, I just feel it's mixing two concepts that logically should be separate...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I feel this deserves a sync discussion, let's chat about it next week.

Comment on lines +127 to +129
if trState.JobID != "" {
return fmt.Errorf("transcription job already in progress")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's funny. Not sure how that would happen :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually can happen because we are not 100% atomic there due to the fact that we need to unlock to make the API call to the job service. So concurrent requests to start transcriptions could theoretically result in multiple jobs starting. Of course the client shouldn't ever do that but it's a possibility we may want to account for in case of issues/bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know, thanks

return nil
}

func (p *Plugin) stopTranscribingJob(state *channelState, callID string) (rerr error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an opportunity to consolidate some code with the checker? Not a big deal though.

@antran22
Copy link

This is awesome, big thanks to Mattermost Team for this feature. Is there any point on the roadmap for this feature that we can wait for?

Base automatically changed from MM-51852 to main October 30, 2023 19:36
@streamer45
Copy link
Collaborator Author

This is awesome, big thanks to Mattermost Team for this feature. Is there any point on the roadmap for this feature that we can wait for?

Hopefully sometime next month. But the functionality (at least initially) will only be available for Enterprise licensed installations.

@streamer45 streamer45 requested a review from cwarnermm November 20, 2023 19:45
@matthewbirtch
Copy link
Contributor

Is there anything else blocking you were looking to see added/changed? I am still working on some small improvements on the text output as discussed above but other than that I think we are ready to move forward. Once approved I can begin the process of getting this in Community, ideally sometime this week.

Nope, nothing else for me other than the text output. I think we're good to proceed as well @streamer45. Great work! I'll approve this knowing we will be doing some tweaking to the text output and we can test things out more robustly on Community.

@matthewbirtch matthewbirtch removed the 1: UX Review Requires review by ux label Nov 20, 2023
@streamer45
Copy link
Collaborator Author

@matthewbirtch I implemented some basic heuristics to compact the produced text file output in order to avoid very short sentences. I tried a couple of approaches and eventually converged on a duration based one (as opposed to length) since I felt that splitting/joining based on time gives better results overall as it accounts for pace and potential pauses in speech.

It's still experimental and highly configurable, right now enforcing the following rules. We join segments if:

  1. The speaker doesn't change. This is required to guarantee order of the segments (e.g. question/answer sequences).
  2. There are less than X seconds of pause between the end of a previous text segment and the start of the next one. (X = 2s)
  3. The overall duration of the current sentence is less than Y seconds. (Y = 10s)

Give it a try when you get a chance and let me know if you see any improvements or hit any issue.

Copy link
Member

@cwarnermm cwarnermm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided string feedback inline.

Copy link
Member

@cwarnermm cwarnermm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided string feedback inline.

@matthewbirtch
Copy link
Contributor

@matthewbirtch I implemented some basic heuristics to compact the produced text file output in order to avoid very short sentences. I tried a couple of approaches and eventually converged on a duration based one (as opposed to length) since I felt that splitting/joining based on time gives better results overall as it accounts for pace and potential pauses in speech.

It's still experimental and highly configurable, right now enforcing the following rules. We join segments if:

  1. The speaker doesn't change. This is required to guarantee order of the segments (e.g. question/answer sequences).
  2. There are less than X seconds of pause between the end of a previous text segment and the start of the next one. (X = 2s)
  3. The overall duration of the current sentence is less than Y seconds. (Y = 10s)

Give it a try when you get a chance and let me know if you see any improvements or hit any issue.

Definitely seeing an improvement here @streamer45 and I like the rules you've established here - this is really smart! I vote we proceed with what you have and we can always tweak these numbers if we find we're not getting expected results in broader testing.

@streamer45 streamer45 requested a review from cwarnermm November 21, 2023 21:55
Copy link
Member

@cwarnermm cwarnermm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making it so easy for me to review the latest strings with direct links!

@cwarnermm cwarnermm added 3: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer 1: Editor Review labels Nov 21, 2023
@streamer45 streamer45 added the Do Not Merge/Awaiting PR Awaiting another pull request before merging (e.g. server changes) label Nov 21, 2023
@streamer45 streamer45 removed the Do Not Merge/Awaiting PR Awaiting another pull request before merging (e.g. server changes) label Nov 24, 2023
@streamer45
Copy link
Collaborator Author

Merging this to unblock other PRs. Still working on docs updates as a separate item (https://mattermost.atlassian.net/browse/MM-55587)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants