Add api docs for audio-to-text pipeline #594
Merged
7 commits
- 70edf20 Add api docs for speech-to-text (eliteprox)
- 8f9774d Capitalize title (eliteprox)
- 51c8fe4 Update supported file types (eliteprox)
- 42a2c56 Update recommended price per unit (eliteprox)
- a381615 Update docs for audio-to-text (eliteprox)
- e9019b5 update file types and request limit, sort menu items (eliteprox)
- e4667f2 docs(ai): apply small audio-to-text improvements (rickstaa)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
--- | ||
openapi: post /audio-to-text | ||
--- | ||
|
||
<Info> | ||
The public [Livepeer.cloud](https://www.livepeer.cloud/) Gateway used in this | ||
guide is intended for experimentation and is not guaranteed for production | ||
use. It is a free, non-token-gated, but rate-limited service designed for | ||
testing purposes. For production-ready applications, consider setting up your | ||
own Gateway node or partnering with one via the `ai-video` channel on | ||
[Discord](https://discord.gg/livepeer). | ||
</Info> | ||
|
||
<Note> | ||
Please note that the **optimal** parameters for a given model may vary | ||
depending on the specific model and use case. The parameters provided in this | ||
guide are not model-specific and should be used as a starting point. | ||
Additionally, some models may have parameters such as `guidance_scale` and | ||
`num_inference_steps` disabled by default. For more information on | ||
model-specific parameters, please refer to the respective model documentation. | ||
</Note> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
--- | ||
title: Audio-to-Text | ||
--- | ||
|
||
## Overview | ||
|
||
The `audio-to-text` pipeline converts audio from media files into text, | ||
using speech recognition models from HuggingFace's | ||
[automatic-speech-recognition (ASR) pipeline](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition). | ||
|
||
<div align="center"> | ||
|
||
</div> | ||
|
||
## Models | ||
|
||
### Warm Models | ||
|
||
The current warm model requested for the `audio-to-text` pipeline is: | ||
|
||
- [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3): | ||
Whisper is a pre-trained model for automatic speech recognition (ASR) and | ||
speech translation. | ||
|
||
<Tip> | ||
For faster responses with a different | ||
[audio-to-text](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition) | ||
model, ask Orchestrators to load it on their GPU via the `ai-video` | ||
channel on [Discord](https://discord.gg/livepeer). | ||
</Tip> | ||
|
||
### On-Demand Models | ||
|
||
The following models have been tested and verified for the `audio-to-text` | ||
pipeline: | ||
|
||
<Note> | ||
If a specific model you wish to use is not listed, please submit a [feature | ||
request](https://github.com/livepeer/ai-worker/issues/new?assignees=&labels=enhancement%2Cmodel&projects=&template=model_request.yml) | ||
on GitHub to get the model verified and added to the list. | ||
</Note> | ||
|
||
{/* prettier-ignore */} | ||
<Accordion title="Tested and Verified Models"> | ||
- [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3): A high-performance | ||
ASR model by OpenAI. | ||
|
||
</Accordion> | ||
|
||
## Basic Usage Instructions | ||
|
||
<Tip> | ||
For a detailed understanding of the `audio-to-text` endpoint and to experiment | ||
with the API, see the [AI Subnet API | ||
Reference](/ai/api-reference/audio-to-text). | ||
</Tip> | ||
|
||
To create an audio transcript using the `audio-to-text` pipeline, submit a | ||
`POST` request to the Gateway's `audio-to-text` API endpoint: | ||
|
||
```bash | ||
curl -X POST "https://<gateway-ip>/audio-to-text" \ | ||
-F model_id=openai/whisper-large-v3 \ | ||
-F audio=@<PATH_TO_FILE> | ||
``` | ||
|
||
In this command: | ||
|
||
- `<gateway-ip>` should be replaced with your AI Gateway's IP address. | ||
- `model_id` is the speech recognition model used for transcription. | ||
- `audio` is the path to the audio file to be transcribed. | ||
|
||
<Note> | ||
- Supported file types: `mp4`, `webm`, `mp3`, `flac`, `wav`, and `m4a` | ||
- Maximum request size: 50 MB | ||
</Note> | ||
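The limits in the note above can be checked on the client before a file is uploaded. The following is a minimal sketch, not part of the Gateway API: the helper name `check_audio_file` is hypothetical, "50 MB" is assumed to mean 50 * 1024 * 1024 bytes, and the file size is used as an approximation of the request size (the multipart encoding adds a small overhead).

```python
from pathlib import Path

# Values taken from the note above. Reading "50 MB" as 50 * 1024 * 1024 bytes
# is an assumption; the Gateway may count the limit differently.
SUPPORTED_EXTENSIONS = {"mp4", "webm", "mp3", "flac", "wav", "m4a"}
MAX_REQUEST_BYTES = 50 * 1024 * 1024


def check_audio_file(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    file = Path(path)

    # Compare the extension case-insensitively and without the leading dot.
    extension = file.suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")

    if not file.is_file():
        problems.append("file does not exist")
    elif file.stat().st_size > MAX_REQUEST_BYTES:
        problems.append("file exceeds the 50 MB request limit")

    return problems
```

Calling `check_audio_file("<PATH_TO_FILE>")` before running the `curl` command gives an early hint about requests that are likely to be rejected; the Gateway remains the authority on what it accepts.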
|
||
For additional optional parameters, refer to the | ||
[AI Subnet API Reference](/ai/api-reference/audio-to-text). | ||
|
||
After execution, the Orchestrator processes the request and returns the response | ||
to the Gateway: | ||
|
||
```json | ||
{ | ||
"chunks": [ | ||
{ | ||
"text": " Explore the power of automatic speech recognition", | ||
"timestamp": [ | ||
0, | ||
1.35 | ||
] | ||
}, | ||
{ | ||
"text": " By extracting the text from audio", | ||
"timestamp": [ | ||
1.35, | ||
2.07 | ||
] | ||
} | ||
], | ||
"text": " Explore the power of automatic speech recognition By extracting the text from audio" | ||
} | ||
``` | ||
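As a rough illustration of how a client might consume this response, the sketch below turns the `chunks` array into timestamped lines. The function name `format_chunks` is hypothetical, the sample mirrors the example response, and every chunk is assumed to carry a two-element numeric `timestamp`.

```python
import json

# Sample mirroring the example response; a real response comes from the Gateway.
SAMPLE_RESPONSE = """
{
  "chunks": [
    {"text": " Explore the power of automatic speech recognition", "timestamp": [0, 1.35]},
    {"text": " By extracting the text from audio", "timestamp": [1.35, 2.07]}
  ],
  "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
"""


def format_chunks(response):
    """Render each chunk as '[start - end] text', with times in seconds."""
    lines = []
    for chunk in response.get("chunks", []):
        # Assumes both timestamps are present and numeric.
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.2f}s - {end:.2f}s] {chunk['text'].strip()}")
    return lines


if __name__ == "__main__":
    for line in format_chunks(json.loads(SAMPLE_RESPONSE)):
        print(line)
```

The top-level `text` field already holds the full transcript, so the per-chunk pass is only needed when timing information matters, for example when building subtitles.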
|
||
## API Reference | ||
|
||
<Card | ||
title="API Reference" | ||
icon="rectangle-terminal" | ||
href="/ai/api-reference/audio-to-text" | ||
> | ||
Explore the `audio-to-text` endpoint and experiment with the API in the AI | ||
Subnet API Reference. | ||
</Card> |
@rickstaa Curious about your thoughts on pricing. The audio-to-text pipeline uses milliseconds as the unit (second * 1000)
@eliteprox I like the pricing per millisecond 👍🏻. The pricing seems a bit low, but that's fine since we can let the market take effect. With the current pricing, it would cost `12882811*1000*60*10**-18*3205.48 = $0.0025` per minute of audio, whereas OpenAI charges $0.006, making us very competitive.
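The per-minute estimate in the comment above can be reproduced with a short sketch. It assumes that `12882811` is the price per millisecond in wei, that `10**-18` converts wei to ETH, and that `3205.48` is the ETH price in USD at the time of the comment; the function name `usd_per_minute` is hypothetical.

```python
WEI_PER_ETH = 10**18
MS_PER_MINUTE = 1000 * 60


def usd_per_minute(price_per_ms_wei, eth_usd):
    """Convert a per-millisecond price in wei into USD per minute of audio."""
    wei_per_minute = price_per_ms_wei * MS_PER_MINUTE
    return wei_per_minute / WEI_PER_ETH * eth_usd


if __name__ == "__main__":
    # Figures taken from the comment above.
    cost = usd_per_minute(12_882_811, 3205.48)
    print(f"${cost:.4f} per minute of audio")
```

Because the result scales linearly with the ETH price, the comparison with OpenAI's $0.006 per minute only holds at the exchange rate used in the comment.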