fix broken link
jmgirard committed Aug 16, 2024
1 parent bb09d32 commit 15c581c
Showing 5 changed files with 6 additions and 6 deletions.
_freeze/posts/whisper2024/index/execute-results/html.json (4 changes: 2 additions & 2 deletions)
@@ -1,8 +1,8 @@
 {
-  "hash": "7a94a1a961167fec26308c1ccee8ddb3",
+  "hash": "860c64d76897fe444909093181943557",
   "result": {
     "engine": "knitr",
-    "markdown": "... <a href=\"../whisper2024/index.qmd\" class=\"btn btn-primary mt-5\" role=\"button\" >Continue to Part 2&nbsp;&nbsp;&raquo;</a>\n:::\n",
+    "markdown": "... <a href=\"../whisper2024b/index.qmd\" class=\"btn btn-primary mt-5\" role=\"button\" >Continue to Part 2&nbsp;&nbsp;&raquo;</a>\n:::\n",
     "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"

The stored markdown differs only in that Part 2 link. The updated post it contains reads as follows.
"markdown": "---\ntitle: \"AI Transcription from R using Whisper: Part 1\"\ndescription: \"Tutorial on Using AI Transcription\"\nauthor: \"Jeffrey Girard\"\ndate: \"2024-08-14\"\nimage: whisper.webp\ndraft: false\ncategories:\n - teaching\n - audio\n - AI\n---\n\n\n\n## Introduction\n\nIn much of my work, I study how people communicate through verbal and nonverbal behavior. To study verbal behavior, it is often necessary to generate *transcripts*, which are written records of the words that were spoken. Transcriptions can be done manually (i.e., by a person) and assisted through the use of behavioral annotation software like [ELAN](https://archive.mpi.nl/tla/elan) or [ANVIL](https://anvil-software.de) or subtitle generation and editing software like [Aegisub](https://aegisub.org/) or [Subtitld](https://www.subtitld.org/en). However, new tools based on artificial intelligence (AI) can be much more efficient and scalable, albeit at some cost to accuracy.\n\nIn this blog post, I will provide a tutorial on how to set up and use OpenAI's free [Whisper](https://openai.com/index/whisper/) model to generate automatic transcriptions of audio files (either recorded originally as audio or extracted from video files). I will first show you how to quickly install the audio.whisper R package and transcribe an example file. However, the processing will be very slow and we can do much, much better if we offload some of the work to a dedicated graphics card, such as an Nvidia card with [CUDA](https://developer.nvidia.com/about-cuda). Enabling this takes some technical work, especially on Windows, but is worth the investment if you plan to process a lot of files. This technical work will be described in Part 2.\n\n:::{.callout-note}\nAlthough the Whisper model comes from OpenAI, the approach described here will actually run it locally, which means your audio files will not need to be sent to any third parties. This makes it usable for private and sensitive (e.g., patient) data!\n:::\n\n## Quickstart (easy setup, slow processing)\n\n### Install dependencies\n\nI assume you already have R (and probably an IDE like RStudio) installed. Open this up and install the development version of the [audio.whisper](https://github.com/bnosac/audio.whisper) package from github.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Install remotes if you don't have it already\n# install.packages(\"remotes\") \n\n# Install audio.whisper from github\nremotes::install_github(\"bnosac/audio.whisper\")\n```\n:::\n\n\n\n### Download whisper model\n\nLoad this new package and download one of the whisper models: `\"tiny\"`, `\"base\"`, `\"small\"`, `\"medium\"`, or `\"large-v3\"`. Earlier entries on that list are smaller (to download and hold in RAM), faster, and less accurate whereas later entries are larger, slower, and more accurate. There are also English-only versions of all but the large model, which end in `\".en\"` as in `\"base.en\"`, and these may be more efficient if you know that all speech will be in English. You can learn more about these models via `?whisper_download_model`. 
For this tutorial, we will go with the `"base"` model.

::: {.cell}

```{.r .cell-code}
# Load package from library
library(audio.whisper)
```
:::

::: {.cell}

```{.r .cell-code}
# Download or load from file the desired whisper model
model <- whisper("base")
## whisper_init_from_file_with_params_no_state: loading model from 'C:/GitHub/affcomlab/posts/whisper2024/ggml-base.bin'
## whisper_model_load: loading model
## whisper_model_load: n_vocab = 51865
## whisper_model_load: n_audio_ctx = 1500
## whisper_model_load: n_audio_state = 512
## whisper_model_load: n_audio_head = 8
## whisper_model_load: n_audio_layer = 6
## whisper_model_load: n_text_ctx = 448
## whisper_model_load: n_text_state = 512
## whisper_model_load: n_text_head = 8
## whisper_model_load: n_text_layer = 6
## whisper_model_load: n_mels = 80
## whisper_model_load: ftype = 1
## whisper_model_load: qntvr = 0
## whisper_model_load: type = 2 (base)
## whisper_model_load: adding 1608 extra tokens
## whisper_model_load: n_langs = 99
## whisper_model_load: CPU buffer size = 147.46 MB
## whisper_model_load: model size = 147.37 MB
## whisper_init_state: kv self size = 16.52 MB
## whisper_init_state: kv cross size = 18.43 MB
## whisper_init_state: compute buffer (conv) = 14.86 MB
## whisper_init_state: compute buffer (encode) = 85.99 MB
## whisper_init_state: compute buffer (cross) = 4.78 MB
## whisper_init_state: compute buffer (decode) = 96.48 MB
```
:::

:::{.callout-note}
Note that the larger models may take a while to download, so if you get an error that the download took longer than permitted, you can temporarily allow more time via `options(timeout = 300)`.
:::
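For example, here is a minimal sketch of pairing that longer timeout with one of the larger downloads (the 300-second value is just the one suggested above; adjust it to your connection speed):

```r
# Temporarily allow up to 5 minutes for the model download
options(timeout = 300)

# Download or load the largest (slowest but most accurate) model
model_large <- whisper("large-v3")
```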
### Transcribe example file

The package comes with an example audio file in the proper format, which contains 11 seconds of a speech by John F. Kennedy. Let's load it from file using `system.file()` and then transcribe it using `predict()`.

::: {.cell}

```{.r .cell-code}
# Construct file path to example audio file in package data
jfk <- system.file(package = "audio.whisper", "samples", "jfk.wav")

# Run English transcription using the downloaded whisper model
out <- predict(model, newdata = jfk, language = "en")

# Print transcript
out$data
```
:::

| segment | segment_offset | from | to | text |
|--------:|---------------:|:-----|:---|:-----|
| 1 | 0 | 00:00:00.000 | 00:00:11.000 | And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country. |

The results look good! But we can see how long this took by digging into the output object.

::: {.cell}

```{.r .cell-code}
# Examine the time elapsed to process this audio
out$timing
## $transcription_start
## [1] "2024-08-15 12:30:48 CDT"
## 
## $transcription_end
## [1] "2024-08-15 12:51:48 CDT"
## 
## $transcription_duration
## Time difference of 20.98786 mins
```
:::
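To put that in perspective, here is a quick sketch of turning that `difftime` into a real-time factor (the 11-second clip length comes from the example file described above):

```r
# Convert the difftime from the timing output into seconds
elapsed <- as.numeric(out$timing$transcription_duration, units = "secs")

# Real-time factor: seconds of processing per second of audio
elapsed / 11
## Roughly 114x, i.e., about two minutes of work per second of speech
```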
"markdown": "---\ntitle: \"AI Transcription from R using Whisper: Part 1\"\ndescription: \"Tutorial on Using AI Transcription\"\nauthor: \"Jeffrey Girard\"\ndate: \"2024-08-14\"\nimage: whisper.webp\ndraft: false\ncategories:\n - teaching\n - audio\n - AI\n---\n\n\n\n## Introduction\n\nIn much of my work, I study how people communicate through verbal and nonverbal behavior. To study verbal behavior, it is often necessary to generate *transcripts*, which are written records of the words that were spoken. Transcriptions can be done manually (i.e., by a person) and assisted through the use of behavioral annotation software like [ELAN](https://archive.mpi.nl/tla/elan) or [ANVIL](https://anvil-software.de) or subtitle generation and editing software like [Aegisub](https://aegisub.org/) or [Subtitld](https://www.subtitld.org/en). However, new tools based on artificial intelligence (AI) can be much more efficient and scalable, albeit at some cost to accuracy.\n\nIn this blog post, I will provide a tutorial on how to set up and use OpenAI's free [Whisper](https://openai.com/index/whisper/) model to generate automatic transcriptions of audio files (either recorded originally as audio or extracted from video files). I will first show you how to quickly install the audio.whisper R package and transcribe an example file. However, the processing will be very slow and we can do much, much better if we offload some of the work to a dedicated graphics card, such as an Nvidia card with [CUDA](https://developer.nvidia.com/about-cuda). Enabling this takes some technical work, especially on Windows, but is worth the investment if you plan to process a lot of files. This technical work will be described in Part 2.\n\n:::{.callout-note}\nAlthough the Whisper model comes from OpenAI, the approach described here will actually run it locally, which means your audio files will not need to be sent to any third parties. This makes it usable for private and sensitive (e.g., patient) data!\n:::\n\n## Quickstart (easy setup, slow processing)\n\n### Install dependencies\n\nI assume you already have R (and probably an IDE like RStudio) installed. Open this up and install the development version of the [audio.whisper](https://github.com/bnosac/audio.whisper) package from github.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Install remotes if you don't have it already\n# install.packages(\"remotes\") \n\n# Install audio.whisper from github\nremotes::install_github(\"bnosac/audio.whisper\")\n```\n:::\n\n\n\n### Download whisper model\n\nLoad this new package and download one of the whisper models: `\"tiny\"`, `\"base\"`, `\"small\"`, `\"medium\"`, or `\"large-v3\"`. Earlier entries on that list are smaller (to download and hold in RAM), faster, and less accurate whereas later entries are larger, slower, and more accurate. There are also English-only versions of all but the large model, which end in `\".en\"` as in `\"base.en\"`, and these may be more efficient if you know that all speech will be in English. You can learn more about these models via `?whisper_download_model`. 
Now let's transcribe this and verify that our conversion worked.

::: {.cell}

```{.r .cell-code}
# Run English transcription using the downloaded whisper model
out2 <- predict(model, newdata = "mlk.wav", language = "en")

# Print transcript
out2$data
```
:::

| segment | segment_offset | from | to | text |
|--------:|---------------:|:-----|:---|:-----|
| 1 | 0 | 00:00:00.000 | 00:00:02.000 | I have a dream. |
| 2 | 0 | 00:00:02.000 | 00:00:12.000 | But one day, this nation will rise up, live up the true meaning of its creed. |

Not perfect (it swapped "that" for "but" and omitted an "and"), but pretty good. And this is only the base model; it might do better with a larger model, but for time's sake I'll leave that until after we get CUDA working in Part 2.
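If you do plan to process a lot of files, here is a rough sketch of one way to chain the two steps above over a folder of videos (the `recordings/` folder and the file naming are hypothetical):

```r
# Hypothetical batch run: convert then transcribe every .mp4 in a folder
videos <- list.files("recordings", pattern = "\\.mp4$", full.names = TRUE)

for (video in videos) {
  wav <- sub("\\.mp4$", ".wav", video)

  # Extract 16 kHz WAV audio in the format Whisper expects
  av_audio_convert(video, output = wav, format = "wav", sample_rate = 16000)

  # Transcribe and save the segment table next to the video
  out <- predict(model, newdata = wav, language = "en")
  write.csv(out$data, sub("\\.mp4$", ".csv", video), row.names = FALSE)
}
```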
::: {.text-center}
<a href="../whisper2024b/index.qmd" class="btn btn-primary mt-5" role="button">Continue to Part 2&nbsp;&nbsp;&raquo;</a>
:::
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
Expand Down
Loading
