
Integrate faster version of whisper (batched faster whisper) to Aana SDK #41

Closed
ashwinnair14 opened this issue Jan 24, 2024 · 3 comments · Fixed by #53
Labels: enhancement (New feature or request), wip (Work In Progress)

ashwinnair14 (Contributor) commented Jan 24, 2024

Feature Summary

  • Concise description of the feature
    Integrate the faster batched version of Whisper into the Aana SDK.

Justification/Rationale

  • Why is the feature beneficial?
    This feature enables a faster version of Whisper that uses VAD (voice activity detection) and batching to improve throughput by approximately 4x.

Proposed Implementation (if any)

  • How do you envision the implementation of this feature?
    There are two options for the implementation:

1. A separate endpoint for the batched Whisper.
2. A flag/parameter on the existing endpoint to enable batched inference, with a trade-off on WER. Some parameters usually familiar to the user are not supported in batched mode, e.g. without_timestamps (no word-level timestamps).

VAD would be introduced as a separate deployment.
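
The second option could be sketched as a single entry point that dispatches on a flag. This is an illustrative sketch only; the function name, parameters, and return shape are assumptions, not the actual Aana SDK API.

```python
# Hypothetical sketch of option 2: one endpoint with a flag that switches
# between sequential and batched inference. Names are illustrative.

def transcribe(audio_path: str, batched: bool = False, batch_size: int = 16) -> dict:
    if batched:
        # Batched path: VAD splits the audio first, then speech chunks are
        # transcribed in batches. Word-level timestamps are unavailable here.
        return {"mode": "batched", "batch_size": batch_size, "word_timestamps": False}
    # Sequential path keeps the full parameter set of the original endpoint.
    return {"mode": "sequential", "batch_size": 1, "word_timestamps": True}
```

A separate endpoint (option 1) avoids overloading one signature with mode-dependent parameters, which is likely why the discussion below settled on it.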

@ashwinnair14 ashwinnair14 added the enhancement New feature or request label Jan 24, 2024
@Jiltseb Jiltseb changed the title Integrate faster version wispher (batched faster wispher) to Aana SDK Integrate faster version of whisper (batched faster whisper) to Aana SDK Jan 25, 2024

Jiltseb (Contributor) commented Jan 25, 2024

As per the discussion, we will create a separate endpoint for batched faster-whisper. We could even consider it as a separate target in the future.

ashwinnair14 (Contributor, Author) commented:

Comments from Jilt

Below is the benchmarking result:
https://docs.google.com/spreadsheets/d/1XMVbwDnVissogqf5MHptal29tUV0VAlvbVDPxrzAX5U/edit?pli=1#gid=2029644071
Observations:
Video-to-audio extraction as an initial step improves speed, especially with threading.
The difference between single and multiple deployments shrinks if we perform audio extraction first.
Best results are obtained with a single deployment for both the ASR and VAD models (only 3% and 2% difference).
With separate deployments, we can decouple stages and reuse the VAD stage for multiple models.
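
The decoupled design described above hinges on the VAD stage handing the ASR model batches of speech segments. A minimal sketch of that grouping step, under the assumption that VAD emits (start, end) timestamps in seconds and that each batch should fit one forward pass of the ASR model:

```python
from typing import Iterator

def batch_vad_segments(
    segments: list[tuple[float, float]],
    max_duration: float = 30.0,
) -> Iterator[list[tuple[float, float]]]:
    """Group VAD speech segments (start, end in seconds) into batches whose
    total speech duration stays under max_duration. Illustrative sketch, not
    the SDK's actual batching logic."""
    batch: list[tuple[float, float]] = []
    total = 0.0
    for start, end in segments:
        duration = end - start
        if batch and total + duration > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append((start, end))
        total += duration
    if batch:
        yield batch
```

Because the batching operates on VAD output alone, the same VAD deployment can feed any downstream ASR model, which is the reuse benefit noted in the last observation.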


Jiltseb (Contributor) commented Feb 7, 2024

Steps:

  1. Add VAD deployment, VAD parameters, and deployments.py updates: Done

  2. Add a batched_inference method to the whisper deployment, wrapping the whisper model in a batched inference pipeline: Done

  3. Define nodes, endpoints, and initial API calls for testing: Done

  4. Implement and compare different pipelines for batched inference: Done
    i. Video input with VAD+whisper as a single deployment.
    ii. Video input with VAD and whisper as separate deployments.
    iii. Video-to-audio conversion with VAD+whisper as a single deployment.
    iv. Video-to-audio conversion with VAD and whisper as separate deployments.
    v. Apply the above to direct audio input as well.
    vi. Add multiple model replicas and tests for speed.

  5. Handle audio conversion, loading, and deletion: Done
    i. Unlike the Image and Video objects, there was no dedicated dataclass for Audio. Created one with basic i/o functionality consistent with the other dataclasses.
    ii. Handle cleanup for Audio objects.

  6. Handle video files without audio (pass through the nodes with empty content and write an empty transcription): Done

  7. Change/add default values of the initial whisper implementation (e.g. vad_filter to True): new issue

  8. Testing: Done
    i. Modify old tests based on the input-type change: Done
    ii. Add a test for vad_deployment: Done
    iii. Add a test for whisper_deployment (the newly added methods), with input taken from the expected vad_deployment output: Done
    iv. Integration test with the new transcribe_batch endpoint: Done

  9. Change all whisper functions to accept the Audio input format instead of Video: Done

  10. Update the pipeline nodes and endpoints for the audio type: Done
