h0lg/SubTubular


SubTubular

A full-text search for YouTube with a command line interface. Searches subtitles and video metadata, returning time-stamped video links.

Overview

Searches

  • video title, description, keywords and captions (a.k.a. subtitles, closed captions/CC or transcript)
  • across multiple captions and description lines
  • in the scope of one or multiple videos, a playlist or channel
  • while ignoring the case of the search terms

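supporting

  • multi-word phrases as well as fuzzy and wild card matching
  • combining terms using AND, OR and brackets
  • restricting the search to the video title, description, keywords and/or captions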

returning

  • a list of search results with highlighted matches
  • including time-stamped video links to the corresponding part of the video for caption matches
  • as a text or HTML file if you need it

caching

  • searchable video metadata and subtitles in all available languages
  • videos in playlists or channels for a configurable time
  • channel aliases like handles, slugs or user names
  • full-text indexes for all searched texts
  • so that subsequent searches on the same scope can be done offline and are way faster than the first one
  • in your local user profile, i.e.
    • %AppData%\Roaming on Windows
    • ~/.config on Linux and macOS
  • until you explicitly clear them

requiring

  • no installation except for the .NET 7 runtime (which you may have installed already)
  • no YouTube login

thanks to

  • YoutubeExplode licensed under LGPL 3 for doing a better job at getting the relevant data off YouTube's public web API than YouTube's own Data API v3 is able to do at the time of writing. And for not requiring a clunky app registration and user authorization for every bit of data on top of that. A real game-changer!
  • LIFTI licensed under MIT for the heavy-lifting on the full-text search with indexing, fuzzy and wild card matching among other powerful query features. And for making them accessible through a well-designed API with awesome documentation.
  • CommandLineParser licensed under MIT for elegantly parsing and validating command line arguments as well as generating nicely formatted help text for them. And for making their TextWrapper accessible for easy reuse in host command line apps - it helps SubTubular with block-formatting full-text matches containing a lot of padding.
  • AngleSharp licensed under MIT for making HTML output generation easy and intuitive
  • Octokit licensed under MIT for wrapping the GitHub API and offering easy access to releases and their assets, enabling the download of different releases and the display of their release notes

not providing

  • subtitle download in any common, reusable format (although that would be an easy addition if required).

Commands

common search parameters

All search commands share the following parameters:

shorthand, name     
-f, --for (Group: query) What to search for. Quote "multi-word phrases". Single words are matched exactly by default, ?fuzzy or with wild cards for s%ngle and multi* letters. Combine multiple & terms | "phrases or queries" using AND '&' and OR '|' and ( use | brackets | for ) & ( complex | expressions ). You can restrict your search to the video Title, Description, Keywords and/or Captions; e.g. title="click bait". Learn more about the query syntax at https://mikegoatly.github.io/lifti/docs/searching/lifti-query-syntax/ .
-k, --keywords (Group: query) Lists the keywords the videos in scope are tagged with including their number of occurrences.
-p, --pad (Default: 23) How much context to pad a match in; i.e. the minimum number of characters of the original description or subtitle track to display before and after it.
-m, --html If set, outputs the highlighted search result in an HTML file including hyperlinks for easy navigation. The output path can be configured in the out parameter. Omitting it will save the file into the default output folder and name it according to your search parameters. Existing files with the same name will be overwritten.
-o, --out Writes the search results to a file, the format of which is either text or HTML depending on the html flag. Supply either a file or folder path. If the path doesn't contain a file name, the file will be named according to your search parameters. Existing files with the same name will be overwritten.
-s, --show The output to open if a file was written. Valid values: file, folder
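
To illustrate what --pad does, here is a minimal Python sketch (not SubTubular's actual code) that cuts a match out of a caption text together with a fixed number of characters before and after it. The function name pad_match and the sample text are assumptions made for this example; the app treats the pad as a minimum and may display more context than this sketch does.

```python
def pad_match(text: str, start: int, end: int, pad: int = 23) -> str:
    """Return text[start:end] plus `pad` characters of context on each side.

    The context is clipped at the beginning and end of the text.
    """
    left = max(0, start - pad)
    right = min(len(text), end + pad)
    return text[left:right]


# hypothetical caption text and match position
caption = "gail mann was the name of a physicist who whenever he saw a story"
start = caption.index("physicist")
snippet = pad_match(caption, start, start + len("physicist"), pad=10)
print(snippet)
```

A larger pad makes it easier to judge a match without opening the video, at the cost of longer output.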

common playlist search parameters

Search commands searching a playlist containing multiple videos (including search-playlist and search-channel) support the following parameters in addition to the common search parameters:

shorthand, name      
-t, --top (Default: 50) The number of videos to search, counted from the top of the playlist; effectively limiting the search scope to the top partition of it. You may want to gradually increase this to include all videos in the list while you're refining your query. Note that the special Uploads playlist of a channel is sorted latest uploaded first, but custom playlists may be sorted differently. Keep that in mind if you don't find what you're looking for and when using order-by (which is only applied to the results) with uploaded on custom playlists.
-r, --order-by Order the video search results by uploaded or score with asc for ascending. The default is by score and descending (i.e. highest score or latest upload first). Note that the order is only applied to the results with the search scope itself being limited by the --top parameter. Note also that for un-cached videos, this option is ignored in favor of outputting matches as soon as they're found - but simply repeating the search will hit the cache and return them in the requested order.
-h, --cache-hours (Default: 24) The maximum age of a playlist cache in hours before it is considered stale and the list of videos in it is refreshed. Note this doesn't apply to the videos themselves because their contents rarely change after upload. Use --clear-cache to clear videos associated with a playlist or channel if that's what you're after.
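
The interplay of --top and --order-by can be sketched as follows. This is a simplified Python model, not the app's implementation: the Video type, its score field and the "score > 0 means match" rule are assumptions standing in for the real full-text search. It only shows that the scope is cut first and that the ordering is applied to the results afterwards.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Video:
    id: str
    uploaded: date
    score: float  # hypothetical relevance score; 0 means "no match"


def search_playlist(playlist, top=50, order_by="score", ascending=False):
    scope = playlist[:top]                       # --top limits the search scope first
    results = [v for v in scope if v.score > 0]  # stand-in for the actual full-text search
    if order_by == "uploaded":
        key = lambda v: v.uploaded
    else:
        key = lambda v: v.score
    # --order-by only sorts what was found inside the scope
    return sorted(results, key=key, reverse=not ascending)
```

In this model, a video below the top partition never shows up in the results, no matter how well it would match or how the results are ordered.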

search-videos, videos, v

Searches the specified videos. Supports the common search parameters.

videos (pos. 0) Required. The space-separated YouTube video IDs and/or URLs. Note that if the video ID starts with a dash, you have to quote it like "-1a2b3c4d5e" or use the entire URL to prevent it from being misinterpreted as a command option.
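
If you generate SubTubular command lines from a script, you can sidestep the dash problem by always passing such IDs as full URLs. A minimal Python sketch, assuming a hypothetical helper named as_cli_safe and the watch URL format used in the examples of this document:

```python
def as_cli_safe(video: str) -> str:
    """Turn a bare video ID starting with a dash into a full watch URL.

    Anything else (other IDs, URLs) is returned unchanged.
    """
    if video.startswith("-"):
        return "https://www.youtube.com/watch?v=" + video
    return video


print(as_cli_safe("-1a2b3c4d5e"))
```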

search-playlist, playlist, p

Searches the videos in a playlist. Supports the common playlist search parameters.

playlist (pos. 0) Required. The playlist ID or URL.

search-channel, channel, c

Searches the videos in a channel's Uploads playlist. This is a glorified search-playlist. Supports the common playlist search parameters.

channel (pos. 0) Required. The channel ID, handle, slug, user name or a URL for either of those.

open, o

Opens app-related folders in a file browser.

folder (pos. 0) Required. The folder to open. Valid values: app, cache, errors, output, storage

with folder being the directory

  • app: the app is running from
  • cache: used for caching channel, playlist and video info
  • errors: error logs are written to
  • output: output files are written to by default unless explicitly specified using the out parameter
  • storage: that hosts the cache, errors and output folders

clear-cache, clear

Deletes cached info as well as the corresponding full-text indexes for channels, playlists and videos.

position / shorthand, name     
scope (pos. 0) Required. The type of caches to delete. For playlists and channels this will include the associated videos. Valid values: all, videos, playlists, channels
ids (pos. 1) The space-separated IDs or URLs of elements in the scope to delete caches for. Can be used with every scope except all. For channels, user names, handles and slugs are supported besides IDs. If not set, all elements in the specified scope are considered for deletion. Note that if the video ID starts with a dash, you have to quote it like "-1a2b3c4d5e" or use the entire URL to prevent it from being misinterpreted as a command option.
-l, --last-access The maximum number of days since the last access of a cache file for it to be excluded from deletion. Effectively only deletes old caches that haven't been accessed for this number of days. Ignored for explicitly set ids.
-m, --mode (Default: summary) The deletion mode; summary only outputs how many of what file type were deleted. verbose outputs the deleted file names as well as the summary. simulate lists all file names that would be deleted by running the command instead of deleting them. You can use this to preview the files that would be deleted. Valid values: summary, verbose, simulate
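
The idea behind --last-access combined with the simulate mode can be sketched like this. It is a rough Python model under the assumption that "last access" is the access time the file system reports; how SubTubular itself determines it is not specified here. The function name not_accessed_for is made up for this example.

```python
import os
import time

SECONDS_PER_DAY = 86400


def not_accessed_for(paths, days, now=None):
    """Return the paths whose last access lies more than `days` days before `now`.

    Nothing is deleted; like the simulate mode, this only lists candidates.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    return [p for p in paths if os.stat(p).st_atime < cutoff]
```

Listing the candidates before deleting anything mirrors the preview that simulate gives you.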

release, r

List, browse and install other SubTubular releases. At least one option is required.

position / shorthand, name
-l, --list Lists available releases from https://github.com/h0lg/SubTubular/releases .
-n, --notes Opens the GitHub release notes for a single release. Supply either the version of the release you're interested in or latest.
-i, --install Downloads a release from GitHub and unzips it to the current installation folder while backing up the running version. Supply either the version of the release to install or latest.

Examples & use cases

Find specific parts of podcasts or other long-running videos

Scott Adams mentioned a psychological phenomenon named after a physicist on his podcast the other day. Or did he say physician? What was its name again?

SubTubular.exe search-videos https://www.youtube.com/watch?v=egeCYaIe21Y
https://www.youtube.com/watch?v=gDrFdxWNk8c --for "physician | physicist" --pad 150

or short

SubTubular.exe videos egeCYaIe21Y gDrFdxWNk8c -f "physician | physicist" -p 150

gives you the result below.

Note how the --for|-f argument is quoted because it contains a | pipe.

14/08/2020 22:00 https://youtu.be/egeCYaIe21Y
  English (auto-generated)
    17:22 this aclu story because it seems they've turned bad now this is an example of a gel man
          amnesia i talk about this all the time gail mann was the name of a physicist who
          whenever he saw a story about physics he knew the story was wrong but then if he saw a
          story about some other topic he would say that's probably right
          https://youtu.be/egeCYaIe21Y?t=1042

(turns out, it was the Gell-Mann Amnesia effect)
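
The time-stamped link in the result above follows a simple pattern: the caption's start time converted to seconds and appended as the t parameter. A minimal Python sketch of that conversion (the function name timestamped_link is an assumption, not part of SubTubular):

```python
def timestamped_link(video_id: str, timestamp: str) -> str:
    """Build a youtu.be link that starts playback at a mm:ss or h:mm:ss timestamp."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return f"https://youtu.be/{video_id}?t={seconds}"


print(timestamped_link("egeCYaIe21Y", "17:22"))
```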

Search a playlist for mentions of a certain topic

The other day Styx mentioned some old book that describes the calcification of the pineal gland while predating the fluoridation of drinking water - apparently disproving the myth that it's caused by the fluoride.

Can we find it in his occult literature playlist? And would there be other mentions of fluoride in his reviews of old books?

SubTubular.exe search-playlist https://www.youtube.com/playlist?list=PLe6Bc4vsmzwLiFQv1eh8oZe4uCkw-yYl7
--for "( pineal ~ gland* & calcifi* ) | fluorid*" --top 500 --pad 90

or shorter

SubTubular.exe playlist PLe6Bc4vsmzwLiFQv1eh8oZe4uCkw-yYl7
-f "( pineal ~ gland* & calcifi* ) | fluorid*" -t 500 -p 90

Both let you find the result below.

But let's have a closer look at the query trailing the --for|-f - it searches

  • either for occurrences of pineal near words starting with gland because we want to match gland or glands but only when occurring together with pineal; both words on their own may mean different things
    • and only if also something starting with calcifi (like calcified or calcification) is found in the same context
  • or simply for anything starting with fluorid (like fluoridation or fluoridated)

Occult Literature 14: Occultism For Beginners (Dower)
10/06/2016 22:00 https://youtu.be/Kf3LXznEka8
  English (auto-generated)
    00:56 it's it's categorizations according to more traditional occultism the use of the
          pituitary and pineal glands it also has one of the earliest mentions of the
          calcification of the pineal gland of any work that I've ever been able to find also
          proves because this predates fluoridation by almost 30 years proves the the
          calcification of the pineal gland was known long before fluoride was interjected into
          the average person's diet in the form of fluoridated water so New Agers beware you may
          not appreciate this work when you look at the date on it and then of course the
          treatise on    https://youtu.be/Kf3LXznEka8?t=56

So apparently he spoke about Dower's Occultism For Beginners and no, there are no other fluoride-related mentions in his reviews.
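
To get a feeling for what the wild cards in such a query match, here is a toy Python model that translates a single wild card term into a regular expression. It is not how LIFTI works internally and covers neither fuzzy matching nor the near operator; it only mimics * for any number of letters and % for exactly one letter. The names wildcard_to_regex and matches are made up for this sketch.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a single wild card term into a case-insensitive regex."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")  # zero or more letters
        elif ch == "%":
            parts.append(r"\w")   # exactly one letter
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(term: str, word: str) -> bool:
    """True if the whole word matches the wild card term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```

With this model, gland* matches both gland and glands, while s%ngle matches single but not sngle.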

Using wild cards and exact matching in multi-word phrases

Since searching the occult playlist above, Little Jimmy listens to Heavy Metal (backwards of course), has been asking strange questions and generally has become very uppity. Talk around town is that he probably also does drugs, speaks in tongues and is into some sort of demon worship. They say he, his unfortunate twin Little Timmy and their friend Little Sally have been getting into all kinds of shenanigans lately.

Windows CMD

> SubTubular.exe search-channel Styxhexenhammer666 --for """little ?jimmy"" | ""little sally""" --top 500 --pad 66

PowerShell

PS > .\SubTubular.exe search-channel Styxhexenhammer666 --for '""little ?jimmy"" | ""little sally""' --top 500 --pad 66

Bash

$ ./SubTubular.exe search-channel Styxhexenhammer666 --for '"little ?jimmy" | "little sally"' --top 500 --pad 66

Note how

  • multi-word phrases are quoted when nested inside a quoted --for|-f argument on different shells
  • fuzzy-matching jimmy using a ? prefix will match "Little Jimmy" as well as "Little Timmy".

To prevent them from burning churches, we may have to restrict their access to harmful online content. Let's give them the old Clockwork Orange treatment and have them watch Bob Ross paint happy little things and beat the devil out on a loop for a few days.

Windows CMD

> SubTubular.exe search-channel https://www.youtube.com/@bobross_thejoyofpainting
--for "captions= ( ""beat the devil out"" | ""happy little *"" )" --top 500 --pad 30

or shorter

> SubTubular.exe channel bobross_thejoyofpainting
-f "captions= ( ""beat the devil out"" | ""happy little *"" )" -t 500 -p 30

PowerShell

PS > .\SubTubular.exe search-channel https://www.youtube.com/@bobross_thejoyofpainting
--for 'captions= ( ""beat the devil out"" | ""happy little *"" )' --top 500 --pad 30

or shorter

PS > .\SubTubular.exe channel bobross_thejoyofpainting
-f 'captions= ( ""beat the devil out"" | ""happy little *"" )' -t 500 -p 30

Bash

$ ./SubTubular.exe search-channel https://www.youtube.com/@bobross_thejoyofpainting
--for 'captions= ( "beat the devil out" | "happy little *" )' --top 500 --pad 30

or shorter

$ ./SubTubular.exe channel bobross_thejoyofpainting
-f 'captions= ( "beat the devil out" | "happy little *" )' -t 500 -p 30

will fill their prescription with results like the ones below.

Note how the captions=(...) expression excludes matches in title, description or keywords - since those wouldn't help our troubled kids.

"Beat the devil out of it, and we're ready."
10/10/2022 22:00 https://youtu.be/D_xamByJsYs
  English (auto-generated)
    00:13 put the dark on clean the brush and beat the devil out of it
          and we're ready    https://youtu.be/D_xamByJsYs?t=13

Best of Clouds (Part 1) | The Joy of Painting with Bob Ross
12/05/2022 22:00 https://youtu.be/y5OXoEtcen8
  English
    01:38 Right there we have just another happy little cloud. They just float
          around here and have a good time all day.    https://youtu.be/y5OXoEtcen8?t=98
    04:16 Then. (brush rattles) (chuckles) Just beat the devil out of it. There. And sometimes I'll take
          the brush and go across    https://youtu.be/y5OXoEtcen8?t=256
    13:40 Now maybe, maybe in our world, there's just a happy little cloud that lives up here.
          This is pure midnight black, pure black.    https://youtu.be/y5OXoEtcen8?t=820
    17:28 Okay, maybe in our world there's a happy little cloud. Just sort of floats
          around in the sky up here    https://youtu.be/y5OXoEtcen8?t=1048
    18:19 So we'll give him one, lives right there. Just a happy little guy.
          In my world, everything is happy. So we have happy little clouds and happy trees.
          All right, there we go.    https://youtu.be/y5OXoEtcen8?t=1099

Search a channel for specific content

I might have gazed into the abyss for a little too long and now I need a deep breath, some unclenching and a refresher on the importance of free speech. Russell Brand may be able to help me with that - he seems to enjoy making use of it. Let's see if we can pick his thoughts on the topic out of the whirlwind of praise for our benevolent elites and trusted institutions.

Windows CMD

> SubTubular.exe search-channel https://www.youtube.com/@RussellBrand
--for """freedom of speech"" | ""free speech"" | censorship | ""cancel culture"""
--top 500 --pad 40

or short

> SubTubular.exe channel RussellBrand
-f """freedom of speech"" | ""free speech"" | censorship | ""cancel culture"""
-t 500 -p 40

PowerShell

PS > .\SubTubular.exe search-channel https://www.youtube.com/@RussellBrand
--for '""freedom of speech"" | ""free speech"" | censorship | ""cancel culture""'
--top 500 --pad 40

or short

PS > .\SubTubular.exe channel RussellBrand
-f '""freedom of speech"" | ""free speech"" | censorship | ""cancel culture""'
-t 500 -p 40

Bash

$ ./SubTubular.exe search-channel https://www.youtube.com/@RussellBrand
--for '"freedom of speech" | "free speech" | censorship | "cancel culture"'
--top 500 --pad 40

or short

$ ./SubTubular.exe channel RussellBrand
-f '"freedom of speech" | "free speech" | censorship | "cancel culture"'
-t 500 -p 40

will let you find something like the following. Note that title, description and keywords are matched as well as subtitles.

Who Benefits From Online Censorship?
02/04/2022 22:00 https://youtu.be/CoUW0iR8ewU
  in description: a new bill to regulate online speech.
                  #Censorship #Canada #FreeSpeech

                  References
                  https://reclaimthenet.org/canadas-internet-censorship-bill-is-a-major-threat-to-free-speech-online/

                  https://chrishedges.substack.c
  in keywords: censorship
  English (auto-generated)
    00:00 censorship it's everywhere whether it's russia today all canadians or me
          censorship is back in fashion why and who does it benefit is it the vulnerable
          https://youtu.be/CoUW0iR8ewU?t=0
    00:48 controversial bc11 otherwise known as the internet censorship bill i can see
          why they want to call it fc11 sounds a    https://youtu.be/CoUW0iR8ewU?t=48
    02:53 speech shut up the main criticism the bill has faced from a flurry of
          free speech advocates of various ideological and political
          persuasions is that the    https://youtu.be/CoUW0iR8ewU?t=173

Exploring a channel or playlist via its keywords

What else has Russell Brand been talking about recently on his channel?

SubTubular.exe search-channel https://www.youtube.com/@RussellBrand
--keywords --top 100

or short

SubTubular.exe channel RussellBrand -k -t 100

will look at the keywords the top 100 videos of the searched playlist are tagged with and list them with their number of occurrences, most used first.

100x News | 100x politics | 8x pandemic | 6x covid | 5x Putin | 5x Ukraine | 4x cold war
4x fauci | 4x invasions | 4x latest news | 4x military | 4x military industrial complex
4x NATO | 4x news | 4x Russia | 4x russia ukraine war | 4x the cold war | 4x ukraine 2014
4x ukraine crisis | 4x Vladimir Putin | 4x War | 4x world war | 4x World War 3 | 4x WW3
4x WWIII | 3x biden | 3x bill gates | 3x Cold War | 3x nord stream | 3x Nord Stream pipeline
3x russian army | 3x ukraine russia war | 3x Ukraine war | 3x vaccines | 3x WEF
2x big tech | 2x censorship | 2x china | 2x chinese | 2x coronavirus | 2x cover-up
2x covid-19 | 2x elon | 2x Elon Musk | 2x follow the science | 2x Institute of Virology
2x investigation | 2x jabs | 2x joe biden | 2x lab | 2x lab leak | 2x leak | 2x leaked
2x market | 2x new prime minister uk | 2x outbreak | 2x Peter Daszak | 2x putin
2x rachael maddow | 2x rishi | 2x rishi sunak | 2x science | 2x scientists
2x stop the spread | 2x theory | 2x trump | 2x ukraine | 2x ukraine war | 2x unvaccinated
2x vaccinated | 2x vaccine | 2x Virology | 2x virus | 2x war | 2x wet market
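
The keyword listing boils down to counting tags across the videos in scope. A minimal Python sketch of that idea follows; the case-sensitive counting and the alphabetical tie-break are assumptions inferred from the sample output above, and the function name keyword_summary is made up for this example.

```python
from collections import Counter


def keyword_summary(videos_keywords):
    """Count keywords across videos and format them, most used first."""
    counts = Counter(k for keywords in videos_keywords for k in keywords)
    # most used first; ties ordered alphabetically, ignoring case
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    return " | ".join(f"{n}x {k}" for k, n in ordered)


print(keyword_summary([["News", "politics", "covid"], ["News", "politics"], ["News", "news"]]))
```

Note that News and news are counted separately in this sketch, just like in the listing above.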

Find material for a supercut of a phrase

I have here a pile of rocks that needs grinding. Let's make a supercut of Jörg Sprave's laughter. And while we're at it, let me show you its features:

Windows CMD

> SubTubular.exe search-channel https://www.youtube.com/user/JoergSprave
--for "haha | laugh* | ""let me show you its features""" --top 100 --cache-hours 0
--order-by uploaded asc --html --out "path/to/my output file.html" --show file

or short

> SubTubular.exe channel JoergSprave -f "haha | laugh* | ""let me show you its features"""
-t 100 -h 0 -r uploaded asc -m -o "path/to/my output file.html" -s file

PowerShell

PS > .\SubTubular.exe search-channel https://www.youtube.com/user/JoergSprave
--for 'haha | laugh* | ""let me show you its features""' --top 100 --cache-hours 0
--order-by uploaded asc --html --out "path/to/my output file.html" --show file

or short

PS > .\SubTubular.exe channel JoergSprave -f 'haha | laugh* | ""let me show you its features""'
-t 100 -h 0 -r uploaded asc -m -o "path/to/my output file.html" -s file

Bash

$ ./SubTubular.exe search-channel https://www.youtube.com/user/JoergSprave
--for 'haha | laugh* | "let me show you its features"' --top 100 --cache-hours 0
--order-by uploaded asc --html --out "path/to/my output file.html" --show file

or short

$ ./SubTubular.exe channel JoergSprave -f 'haha | laugh* | "let me show you its features"'
-t 100 -h 0 -r uploaded asc -m -o "path/to/my output file.html" -s file

will, thankfully, yield results like the ones below at any given time.

Note how

  • --top|-t 100 only searches the top 100 videos in the Uploads playlist of the channel
  • --cache-hours|-h 0 disables playlist caching to make sure we get the freshest laughs
  • --order-by|-r uploaded asc will sort the results by uploaded date instead of score and ascending (latest last) instead of descending (latest first)
  • --html|-m will generate an HTML output file including time-stamped hyperlinks to the found results
  • --out|-o "path/to/my output file.html" will save the output file to a custom path instead of the default output folder; the path being quoted because it contains spaces
  • --show|-s file will open the output file after it has been written so you don't have to navigate to it

The 200 Joule Repeating Rubber X-Bow Project!
18/05/2022 22:00 https://youtu.be/iiUOVlnj65w
  English (auto-generated)
    00:16 today because it's shooting let me show you its features repeating crossbows
          like the adder the stinger and    https://youtu.be/iiUOVlnj65w?t=16

The Inventor who wouldn't give up...
01/06/2022 22:00 https://youtu.be/JO-A3Z6S3b4
  English (auto-generated)
    01:47 accidents like the last one [Laughter] so after i had repaired it
          https://youtu.be/JO-A3Z6S3b4?t=107

Tips & best practices

Writing queries

To start with, you'll want to get familiar with the syntax of the shell you're using - at least to the degree that you know how to quote arguments. There are examples above to give you an idea. You'll end up quoting the --for|-f parameter a lot because some control characters used by the LIFTI query syntax conflict with control characters of your shell. The best example of this is the | pipe, which LIFTI uses as an OR operator, but which most common shells use to forward the output of the preceding command to the following one. Since we don't want that, we'll have to quote any query that contains an OR pipe - and maybe escape nested quotes depending on the shell.
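
The quoting patterns used in the examples above can be summarized in a small Python helper. This is a sketch that only reproduces the conventions shown in this document (doubling nested double quotes for CMD and PowerShell, single quotes for Bash); the function name quote_for is made up, and other special characters of your shell, such as % in CMD, are not handled.

```python
import shlex


def quote_for(shell: str, query: str) -> str:
    """Quote a LIFTI query as a single --for argument for the given shell."""
    if shell == "cmd":
        # wrap in double quotes and double every nested double quote
        return '"' + query.replace('"', '""') + '"'
    if shell == "powershell":
        # wrap in single quotes; nested double quotes are doubled as in the examples above
        return "'" + query.replace("'", "''").replace('"', '""') + "'"
    if shell == "bash":
        return shlex.quote(query)
    raise ValueError(f"unknown shell: {shell}")


query = '"little ?jimmy" | "little sally"'
for shell in ("cmd", "powershell", "bash"):
    print(shell, quote_for(shell, query))
```

Writing the query once in its plain LIFTI form and quoting it per shell afterwards keeps the query itself readable while you iterate on it.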

Next, learn the features of the LIFTI query syntax and try them out one by one until you understand them. It helps to do that with a channel, playlist or videos you know a bit of the content of - so you know what you should find.

You'll probably want to use an iterative process for designing your full-text queries. Start with a simple one and see what it matches, then progressively tweak it until you're happy with the results. Keep in mind that not immediately finding what you're looking for in a playlist could also just mean you have to increase the --top number of videos to search.

Searching auto-generated subtitles

If you can't seem to find what you're looking for, here are some things to keep in mind:

  • Make sure the videos you search have subtitles. Not all do. Or at least not immediately. Allow for some time before the auto-generated subtitles of newly-uploaded videos are available.
  • Try fuzzy matching for names and words with different or uncommon spellings.
  • Keep your multi-word phrases short or use nearness expressions. Make use of wild cards and fuzzy matching. Otherwise, only exact matches are returned - so the longer your phrase, the less likely it is to match anything.
  • Omit punctuation (dots and commas). At the time of writing, the auto-generated subtitles are not structured into sentences.
  • Don't overestimate the capabilities of YouTube's speech recognition algorithm (yet). Auto-generated subtitles don't always make sense, semantically speaking. Similar sounding words may be misunderstood, especially for speakers with poor pronunciation, high throughput, an accent or simply due to background noise. A statement about defense could for example easily be misinterpreted as being about the fence.
  • You'll find that the speech recognition algorithm will replace
    • inaudible words with ? and
    • swear words with [ __ ] .

Feel free to contribute your own best practices in the issues.

Fair use

Do not use this software with the intent of infringing on any creator's freedom of speech or any viewer's freedom of choice.

Specifically, you may not use this software or its output to target content for flagging, banning or demonetizing.

Those to whom this limitation applies should feel encouraged to explore the origins of their right to censor third party conversation and come back another day with better intentions <3