#!/usr/bin/env python3
'''
Overview:
This script is based on: https://github.com/mirabdullahyaser/Summarizing-Youtube-Videos-with-OpenAI-Whisper-and-GPT-3
I added the functionality to get the transcript from YouTube when available to save Whisper some work.
Usage
A. Summarising (after retrieving YouTube's caption/subtitles or transcribing with Whisper)
# process a YouTube video, passing the video ID. will produce a list of Q&As for the video by default when --mode is not specified
youtube2llm.py analyse --vid 2JmfDKOyQcI
# process a transcript file (e.g. when you already have the caption/transcript file saved locally, saving the web requests to YouTube)
youtube2llm.py analyse --tf output/FbquCdNZ4LM-transcript.txt
# have the LLM use some tones/personalities specified in the system prompts
youtube2llm.py analyse --nc --lmodel mistral --lmtone doubtful_stylistic_british --vid=6KeiPitz5QM
# don't send chapters, when it's making it too verbose or fragmented
youtube2llm.py analyse --vid=FbquCdNZ4LM --nc
# produce a note for this video
youtube2llm.py analyse --vid Lsf166_Rd6M --nc --mode note
# produce a list of definitions made in this video
youtube2llm.py analyse --vid Lsf166_Rd6M --nc --mode definition
# download and transcribe the audio file (don't use YouTube's auto caption)
youtube2llm.py analyse --vid=MNwdq2ofxoA --nc --lmodel mistral --dla
# the video has no subtitles, so the script falls back to Whisper for transcription
youtube2llm.py analyse --vid=TVbeikZTGKY --nc
# this video in Bahasa Indonesia has no auto caption or official subtitles, so we specify the Indonesian language code so Whisper's output is better
youtube2llm.py analyse --vid PDpyUMOOcyw --lang id
# this video has age filter on and can't be accessed without logging in
youtube2llm.py analyse --vid=FbquCdNZ4LM --nc
pytube.exceptions.AgeRestrictedError: FbquCdNZ4LM is age restricted, and can't be accessed without logging in.
# in this case, I will use `yt-dlp -f` to get the audio file (m4a or mp4) and then run the audio file through audio2llm.py to transcribe it with Whisper
# another example of such video: https://www.youtube.com/watch?v=77ivEdhHKB0
# uses mistral (via ollama) for summarisation. the script still uses OpenAI's API for the ask and embed modes (ollama support there is still TODO)
youtube2llm.py analyse --vid=sYmCnngKq00 --nc --lmodel mistral
# run a custom prompt against a video (the script first retrieves the transcript from YouTube or generates it with Whisper)
youtube2llm.py analyse --vid="-3vmxQet5LA" --prompt "all the public speaking techniques and tactics shared"
# run a custom prompt against a transcript file
youtube2llm.py analyse --tf output/-3vmxQet5LA-transcript.txt --prompt "1. what is the Joke Structure and 2. what is the memory palace technique"
B. Embedding. Generates a CSV file with vectors from any transcript (txt) file
# this generates output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv
youtube2llm.py embed --tf output/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3.txt
C. Asking the embedding file a question / query
# without q(uery) specified, defaults to:
"what questions can I ask about what's discussed in the video so I understand the main argument and points that the speaker is making? and for each question please answer each and elaborate them in detail in the same response"
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv
# or pass on a query
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv --q "what is rule and what is norm"
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media-transcript_embedding.csv --q "if we could distill the transcript into 4 arguments or points, what would they be?"
# batching some video IDs in a text file (line-separated)
while read -r f || [ -n "$f" ]; do youtube2llm.py analyse --vid="$f" --nc --lmodel mistral; done < list_of_youtube_video_ids.txt
'''
import os
import hashlib
import datetime
import tiktoken
import argparse
import whisper
import openai
import ollama
import ast # for converting embeddings saved as strings back to arrays
import pandas as pd # for storing text and embeddings data
from pathlib import Path
from scipy import spatial # for calculating vector similarities for search
from yt_dlp import YoutubeDL, DownloadError # I only use YoutubeDL to retrieve the chapters and the title of the video
# from openai.embeddings_utils import cosine_similarity, get_embedding # not used yet,
# see https://platform.openai.com/docs/guides/embeddings/use-cases > text search using embeddings
# and https://cookbook.openai.com/examples/recommendation_using_embeddings
# for some new methods introduced in https://github.com/ALucek/chunking-strategies/blob/main/chunking.ipynb
from chunking_evaluation.chunking import (
ClusterSemanticChunker,
LLMSemanticChunker,
FixedTokenChunker,
RecursiveTokenChunker,
KamradtModifiedChunker
)
from chunking_evaluation.utils import openai_token_count
from prompts import system_prompts, user_prompts
YOUTUBE_VIDEO_URL = "https://www.youtube.com/watch?v={}"
## summarization-related
GPT_MODEL = 'gpt-3.5-turbo'
GPT_MODEL = 'gpt-4o'
## embedding-related.
## reference: https://platform.openai.com/docs/guides/embeddings/embedding-models
EMBEDDING_MODEL = "text-embedding-ada-002" # 1536 output dimension, "Most capable 2nd generation embedding model, replacing 16 first generation models"
EMBEDDING_MODEL = "text-embedding-3-large" # 3072 output dimension, "Most capable embedding model for both english and non-english tasks"
EMBEDDING_MODEL = "text-embedding-3-small" # 1536 output dimension, "Increased performance over 2nd generation ada embedding model"
EMBEDDING_MODEL = "nomic-embed-text"
EMBEDDING_MODEL = "mxbai-embed-large"
BATCH_SIZE = 1000
# as per https://github.com/openai/openai-cookbook/blob/3f8d3f34054526173c0c9cd110d21d90fe993c3f/examples/Get_embeddings_from_dataset.ipynb or https://cookbook.openai.com/examples/get_embeddings_from_dataset
embedding_encoding = "cl100k_base" # this is the encoding for text-embedding-ada-002
max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191 # and BGE is 8192 https://pyvespa.readthedocs.io/en/latest/examples/mother-of-all-embedding-models-cloud.html
encoding = tiktoken.get_encoding(embedding_encoding)
## prep the LLM for summarization.
## TODO: should probably put into summarize_text instead,
## if the `client` object is only used there. but right now I'm using client globally
## across the different modes supported by this script (summarize, embed, ask).
## or can just initiate this object separately in each method. but that's sloppy.
## so this works:
## - if summarize mode, then GPT_MODEL can be overridden with ollama's 'mistral' for example
## - if embed or ask mode, then GPT_MODEL stays hardcoded as 'gpt-*'
client = None
use_api = False # set to False to use ollama library or set to True to use ollama API
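# in practice there are three code paths: OpenAI's API for gpt-* models, the ollama python library when a non-gpt model is used and use_api is False, and ollama's OpenAI-compatible endpoint (the `client` in __main__ is pointed at http://localhost:11434/v1) when use_api is True.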
# I don't merge with the branching under main (to use OpenAI or local ollama (API/library)) because the logic became
# a bit complicated. and I'm just benchmarking the difference between the two methods for ollama anyway
def get_captions(video_id, language="en"):
caption = ''
raw_caption = ''
transcripts = []
'''
Tried 3 methods before deciding on youtube_transcript_api
    # method #1: yt-dlp (the new youtube-dl). didn't get it to work: I couldn't figure out how to retrieve the subtitles without downloading them, and I'm not sure that's even supported by yt-dlp
ydl_opts = {
'format': 'm4a/bestaudio/best',
# See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments
'postprocessors': [{ # Extract audio using ffmpeg
'key': 'FFmpegExtractAudio',
'preferredcodec': 'm4a',
}]
}
ydl = YoutubeDL(ydl_opts)
info = ydl.extract_info(url, download=False)
chapters = [c.get('title') for c in info.get('chapters')]
ttml_urls = [c.get('url') for c in info.get('automatic_captions').get('en') if c.get('ext') == 'ttml']
# then, just run requests.get on ttml_urls[0]?
# e.g. https://www.youtube.com/api/timedtext?v=uA5GV-XmwtM&ei=eTSRZdnHHI2xjuMPyZeK-AE&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1704040169&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=ED91E4A2584EBD0048F5FD22738F76C5FF1ABF11.364F3A4E50CA060E54F0784709DE38A2229E6757&key=yt8&kind=asr&lang=en&fmt=ttml
# ydl.list_subtitles(video_id, info.get('automatic_captions'), 'subtitles') # this works, returns a table of available transcripts, but I'm not sure how to then get the transcript text
# method #2 # with youtube_dl (old youtube-dl)
ydl = youtube_dl.YoutubeDL({'writesubtitles': True, 'allsubtitles': True, 'writeautomaticsub': True})
res = ydl.extract_info(url, download=False)
if res['requested_subtitles'] and (res['requested_subtitles']['en'] or res['requested_subtitles']['a.en']):
print('Grabbing vtt file from ' + res['requested_subtitles']['en']['url'])
raw_caption = requests.get(res['requested_subtitles']['en']['url'], stream=True)
        caption = re.sub(r'\d{2}\W\d{2}\W\d{2}\W\d{3}\s\W{3}\s\d{2}\W\d{2}\W\d{2}\W\d{3}', '', raw_caption.text)
        with open("caption.vtt", "w") as f1:
            f1.write(caption)
if len(res['subtitles']) > 0:
print('manual captions')
else:
print('automatic_captions')
else:
print('Youtube Video does not have any english captions')
    # method #3: pytube. not sure about this one either; it worked for a while but doesn't anymore
    youtube_video = YouTube(YOUTUBE_VIDEO_URL.format(video_id))
    youtube_video.captions.get_by_language_code('a.en')
'''
from youtube_transcript_api import YouTubeTranscriptApi # and this for the transcript
# method #4: https://pypi.org/project/youtube-transcript-api/
# YouTubeTranscriptApi.get_transcript('uA5GV-XmwtM')
try:
transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
except Exception as e:
print("An unexpected error occurred:", e)
for t in transcripts:
print ("transcript found for language: "+t.language_code)
        # previously this returned the LAST transcript the video has, which happened to work most of the time for English videos
        if(t.language_code == language.lower()): # ok, fixed the above. there should be a method that fetches directly by language code, but I just want this to work for now
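            # untested alternative (sketch): youtube_transcript_api's TranscriptList also supports a direct lookup by language code, e.g.
            #   raw_caption = transcripts.find_transcript([language]).fetch()
            # which would avoid looping over every available transcript.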
# print(f"{t.language_code}, {language}")
raw_caption = t.fetch()
if raw_caption:
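        # flatten the fetched caption entries into a single plain-text transcript; the per-entry timestamps are dropped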
caption = ' '.join([t.get('text') for t in raw_caption])
return caption
def get_chapters(info):
chapters = []
if info.get('chapters'):
        chapters = [c.get('title', '') for c in info.get('chapters', [])]
return chapters
def get_youtube_metadata(url, save=True):
ydl_opts = {}
ydl = YoutubeDL(ydl_opts)
info = ydl.extract_info(url, download=False)
if(save):
import json
        with open('output/' + info.get('id', '') + '-metadata.json', 'w') as f:
            f.write(json.dumps(info))
return info
def process_youtube_video(url, video_id, language="en", force_download_audio=False, transcription_model='base'):
caption = None
transcript = ''
video_title = ''
chapters = []
if(not force_download_audio):
caption = get_captions(video_id, language)
import torch
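    # run Whisper on the GPU (CUDA) when available, otherwise fall back to the CPU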
devices = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = whisper.load_model(transcription_model, device=devices)
    youtube_metadata = get_youtube_metadata(url, save=True) # returns a dict and optionally saves it to a {video_id}-metadata.json file
chapters = get_chapters(youtube_metadata)
video_title = youtube_metadata.get('title')
if not caption or force_download_audio:
        # download the mp4 so Whisper can transcribe it
from pytube import YouTube
print(url)
# trying to fix age restriction-related error
# from pytube.innertube import _default_clients
# _default_clients["ANDROID_MUSIC"] = _default_clients["WEB"]
# youtube_video = YouTube(url, use_oauth=True, allow_oauth_cache=False)
youtube_video = YouTube(url)
video_id = youtube_video.vid_info.get('videoDetails').get('videoId')
if(True): # use pytube
# this breaks since 2024-05-07. open issue (with a quick fix workaround done on cipher.py) at: https://github.com/pytube/pytube/issues/1918
streams = youtube_video.streams.filter(only_audio=True)
stream = streams.first() # taking first object of lowest quality
OUTPUT_AUDIO = Path(__file__).resolve().parent.joinpath('output', video_id+'.mp4')
stream.download(filename=OUTPUT_AUDIO)
transcription = model.transcribe(OUTPUT_AUDIO.as_posix(), verbose=True, fp16=False, language=language) # language="id", or "en" by default
else: # use yt-dlp, based on one version of workaround in pytube/issues/#1918
import tempfile
with tempfile.TemporaryDirectory() as temporary_directory:
audio_details = download_youtube_mp3(video_id, temporary_directory)
'''
with open(audio_details["file_path"], "rb") as f:
bytes_data = f.read()
'''
transcription = model.transcribe(audio_details["file_path"], verbose=True, fp16=False, language=language) # language="id", or "en" by default
transcript = transcription['text']
else:
# youtube caption is available, either auto caption or the one included by the video poster
transcript = caption
return transcript, chapters, video_title
def download_youtube_mp3(video_id, temporary_directory):
url = f"https://www.youtube.com/watch?v={video_id}"
ydl_opts = {
"format": 'bestaudio/best',
"max_filesize": 20 * 1024 * 1024,
"outtmpl": f"{temporary_directory}/%(id)s.%(ext)s",
"noplaylist": True,
"verbose": True,
"postprocessors": [{
'key': 'FFmpegExtractAudio',
'preferredcodec': 'mp3',
'preferredquality': '192',
}],
}
ydl = YoutubeDL(ydl_opts)
try:
meta = ydl.extract_info(url, download=True)
except DownloadError as e:
raise e
else:
video_id = meta["id"]
return {
"title": meta["title"],
"file_path": f"{temporary_directory}/{video_id}.mp3"
}
def llm_process(transcript, llm_mode, chapters=[], use_chapters=True, prompt='', video_title='', llm_personality='doubtful_stylistic_british'):
    # needs a refactor. GPT_MODEL is declared global here because it's also used when initialising the client object, which is shared by the llm_process, embed, and ask functionalities of this script
global GPT_MODEL
if llm_mode in['summary', 'kp']:
llm_mode = 'summary'
if(llm_mode == 'tag'):
# GPT_MODEL = 'gpt-3.5-turbo-16k'
GPT_MODEL = 'gpt-4o'
system_prompt = system_prompts[llm_personality]
user_prompt = user_prompts[llm_mode]
if(prompt):
user_prompt = prompt
if(use_chapters and chapters):
chapters_string = "\n".join(chapters)
user_prompt += f" These are the different topics discussed in the conversation:\n{chapters_string}.\nPlease orient the summary and organise them into different sections based on each topic."
print("user_prompt: "+user_prompt)
# result=""
# previous = ""
result = "System Prompt: "+system_prompt+"\n\n-------\n"
result = result + "User Prompt: "+user_prompt+"\n\n-------\n"
if(video_title):
result += "video_title: " + video_title+"\n\n-------\n"
## Chunking. For managing token limit, e.g.
# openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 36153 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
# while the rate limit error is like this
# openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo-16k in organization org-xxxx on tokens_usage_based per min: Limit 60000, Used 22947, Requested 45463. Please try again in 8.41s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens_usage_based', 'param': None, 'code': 'rate_limit_exceeded'}}
n = 1300
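    # n is the chunk size in tokens: the RecursiveTokenChunker below uses openai_token_count as its length function, so these numbers count tokens, not characters or words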
if(GPT_MODEL == 'gpt-4o' or GPT_MODEL.endswith('-16k') or GPT_MODEL.endswith('-1106')):
# n = 10000 # maximum context length is 16385 for gpt-3.5-turbo-16k, and 4097 for gpt-3.5-turbo
n = 5300 # the response was stifled when it was 10k before
print(f"n used: {n}")
# start chunking
'''
# v0: my original version
st = transcript.split()
snippet= [' '.join(st[i:i+n]) for i in range(0,len(st),n)]
print(f"snippets generated (old): {len(snippet)}")
# trying two new methods, based on https://github.com/ALucek/chunking-strategies/blob/main/chunking.ipynb
# v1:
from chromadb.utils import embedding_functions
embedding_function = embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ["OPENAI_KEY"], model_name="text-embedding-3-large")
cluster_chunker = ClusterSemanticChunker(
embedding_function=embedding_function,
        max_chunk_size=10000, # with n=1300 this produced 7 chunks (more than the 2 from the original and recursive methods). set higher (5k) it generates 5 chunks. not sure what 20k would do...? still 5. interesting. the original video has 8-9 segments btw
length_function=openai_token_count
)
snippet = cluster_chunker.split_text(transcript)
print(f"snippets generated (cluster_chunker): {len(snippet)}")
'''
# v2:
recursive_token_chunker = RecursiveTokenChunker(
chunk_size=n,
chunk_overlap=0,
length_function=openai_token_count,
separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research at https://research.trychroma.com/evaluating-chunking
)
snippet = recursive_token_chunker.split_text(transcript)
print(f"snippets generated (recursive_token_chunker): {len(snippet)}")
# end chunking
## start sending the values to the LLM
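    # each chunk is summarised independently and the responses are simply concatenated; no state is carried between chunks (the commented-out `previous` mechanism would have passed the prior answer along)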
for i in range(0, len(snippet), 1):
print("Processing transcribed snippet {} of {}".format(i+1,len(snippet)))
if('gpt' not in GPT_MODEL and use_api is False):
# result += "use ollama library\n----\n"
try:
# to try this, https://github.com/ollama/ollama/issues/2929#issuecomment-2327457681
options = ollama.Options(temperature=0.0, top_k=30, top_p=0.8, num_thread=8)
gpt_response = ollama.chat( # err, but I'm still using this endpoint? not the OpenAI's "client.chat.completions.create..." way
model=GPT_MODEL,
messages=[
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": user_prompt + "\n\n\"" + "Video Title: " + video_title + "\n\n" + snippet[i] + "\"\n Do not include anything that is not in the transcript."
# "content": "\"" + snippet[i] + "\"\n Do not include anything that is not in the transcript. For additional context here is the previous written message: \n " + previous
}
], options=options
)
# previous = gpt_response['message']['content']
# result += gpt_response.choices[0].message.content + "\n\n-------\n\n(based on the snippet: "+ snippet[i]+")\n\n-------\n\n"
result += gpt_response['message']['content'] + "\n\n-------\n\n"
except Exception as e:
print("An unexpected error occurred:", e)
print(result)
else:
try:
response = client.chat.completions.create( # new API
model=GPT_MODEL,
messages=[
{'role': 'system', 'content': system_prompt},
{"role": "user", "content": user_prompt + "\n\n\"" + "Video Title: " + video_title + "\n\n" + snippet[i]}
# {"role": "user", "content": "\"" + snippet[i] + "\"\n For additional context here is the previous written message: \n " + previous}
],
# max_tokens=4096,
temperature=0
)
# previous = response.choices[0].message.content
result += response.choices[0].message.content+"\n\n-----\n\n"
# print(f"result appended: {result}")
except Exception as e:
print("An unexpected error occurred:", e)
print(result)
return result
def create_embedding(transcript, embedding_filename):
# following and migrated based on
# https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb >
# 3. Embed document chunks and 4. Store document chunks and embeddings
    # Usage stat: Tokens used: 16,274 from a 17,426-word document (76,016 chars)
embeddings = []
transcripts = []
SAVE_PATH = "output/embeddings/"+embedding_filename.replace('.txt', '')+"-transcript_embedding.csv"
    # OK, I think it goes like this: the transcript needs to be split into a list of strings first, and what gets processed below should be that list, NOT the raw string...
    # https://github.com/Azure/azure-openai-samples/blob/main/use_cases/passage_based_qna/day_in_life_of_data_scientist.ipynb # some good methods, with normalize_text
    # or maybe the original approach was right after all: loop once, but append instead of extend. perhaps that was the only problem earlier?
    # this seems worth checking too, https://cookbook.openai.com/examples/embedding_long_inputs
for batch_start in range(0, len(transcript), BATCH_SIZE):
batch_end = batch_start + BATCH_SIZE
batch = transcript[batch_start:batch_end]
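        # `transcript` is a plain string here, so BATCH_SIZE slices characters rather than tokens; 1,000-character chunks stay well under the embedding models' token limits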
# print(f"Chunking batch {batch_start} to {batch_end-1}")
# print(f"Chunked content: {batch}")
transcripts.append(batch)
print(f"{len(transcript)} split into {len(transcripts)} strings.")
    for batch_start in range(0, len(transcripts)): # no need for BATCH_SIZE here, since each string in the transcripts list is already within an affordable token length
batch_end = batch_start + 1
batch = transcripts[batch_start:batch_end]
# print(f"Running batch {batch_start} to {batch_end-1}")
# print(f"Batch content: {batch}")
response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
# print(f"response {response}")
for i, be in enumerate(response.data):
# print(f"(i, be.index): {(i, be.index)} and len(be.embedding): {len(be.embedding)}")
assert i == be.index # double check embeddings are in same order as input
batch_embeddings = [e.embedding for e in response.data]
# print("len(embeddings) before: {}".format(len(embeddings)))
embeddings.extend(batch_embeddings)
# print("len(embeddings) after: {}".format(len(embeddings)))
df = pd.DataFrame({"text": transcripts, "embedding": embeddings})
df.to_csv(SAVE_PATH, index=False, mode="w")
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
"""Return the number of tokens in a string."""
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
# Fallback for unsupported models
encoding = tiktoken.get_encoding("cl100k_base")
return len(encoding.encode(text))
def strings_ranked_by_relatedness(
query: str,
df: pd.DataFrame,
relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
top_n: int = 100
) -> tuple[list[str], list[float]]:
"""Returns a list of strings and relatednesses, sorted from most related to least."""
query_embedding_response = client.embeddings.create(model=EMBEDDING_MODEL, input=query, )
query_embedding = query_embedding_response.data[0].embedding
strings_and_relatednesses = [
(row["text"], relatedness_fn(query_embedding, row["embedding"]))
for i, row in df.iterrows()
]
# print(f"strings_and_relatednesses: {strings_and_relatednesses}")
# e.g. [(" Welcome back. I took a couple weeks off, but it's good to be back here. .... We'll see you next time.", 0.7000641373282386)]
strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
strings, relatednesses = zip(*strings_and_relatednesses)
return strings[:top_n], relatednesses[:top_n]
def strings_ranked_by_relatedness_ollama(
query: str,
df: pd.DataFrame,
relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
top_n: int = 100
) -> tuple[list[str], list[float]]:
"""
Returns a list of strings and relatednesses, sorted from most related to least.
    Still needs its own method because there is not yet an OpenAI-compatible embeddings API for ollama (right?)
"""
response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=query)
query_embedding = response.get('embedding', [])
    if not query_embedding:
        print("Failed to generate embedding for the query.")
        return [], []
similarities = [
(row['text'], relatedness_fn(query_embedding, row['embedding']))
for _, row in df.iterrows()
]
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
strings, relatednesses = zip(*similarities)
return strings[:top_n], relatednesses[:top_n]
'''
top_matches = similarities[:top_n]
# Display results
print("Top matches:")
for text, similarity in top_matches:
print(f"Similarity: {similarity:.4f}\nText: {text}\n")
'''
def query_message(
query: str,
df: pd.DataFrame,
model: str,
token_budget: int
) -> str:
"""Return a message for GPT, with relevant source texts pulled from a dataframe."""
if('nomic' in EMBEDDING_MODEL or 'mxbai' in EMBEDDING_MODEL):
strings, relatednesses = strings_ranked_by_relatedness_ollama(query, df, top_n=5)
else:
strings, relatednesses = strings_ranked_by_relatedness(query, df, top_n=5)
# print(f"checking strings_ranked_by_relatedness for query: {query}")
# introduction = 'Use the below transcript of a podcast episode to answer the subsequent question. If the answer cannot be found in the text, write "I could not find an answer."'
    introduction = 'Use the below transcript of a podcast episode to answer the subsequent question.'
question = f"\n\nQuestion: {query}"
message = introduction
for string in strings:
# print(f"string: {string} with length of: "+str(len(string)))
next_article = f'\nTranscript section:\n"""\n{string}\n"""'
if (num_tokens(message + next_article + question, model=model) > token_budget):
            # message += string[:token_budget] # not sure yet what this line was supposed to do
break
else:
message += next_article
return message + question
def ask(
query: str,
df: pd.DataFrame,
model: str,
token_budget: int = 4096 - 500,
print_message: bool = False,
) -> str:
"""Answers a query using GPT (or ollama? I am not sure if this handles the ollama switch. but perhaps yes because there's already OpenAI-compatible endpoint for chat in ollama, and I have initiated the `client` accordingly as a global thing') and a dataframe of relevant texts and embeddings."""
response_message = ''
message = query_message(query, df, model=model, token_budget=token_budget)
messages = [
{"role": "system", "content": "You answer questions about the podcast episode."},
{"role": "user", "content": message},
]
if('gpt' not in model and use_api is False):
response = ollama.chat(
model=model, # whichever ollama model was specified
messages=messages,
# options=options
)
response_message = response['message']['content']
else:
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=0.5
)
response_message = response.choices[0].message.content
if print_message:
print("ask -> user prompt -> content: " + message)
print(f"ask's response: {response_message}")
return response_message
def ask_the_embedding(question, embeddings_filename, print_message=False):
# following this tut https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb?ref=mlq.ai
df = pd.read_csv(embeddings_filename) # read as dataframe
# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval) # df has two columns: "text" and "embedding"
return ask(question, df, model=GPT_MODEL, print_message=print_message)
def create_embedding_ollama(transcript, embedding_filename):
"""Creates embeddings using Ollama's mxbai-embed-large or nomic model and saves to a CSV."""
SAVE_PATH = "output/embeddings/"+embedding_filename.replace('.txt', '')+"-transcript_embedding.csv"
embeddings_list = []
transcripts = []
for batch_start in range(0, len(transcript), BATCH_SIZE):
batch_end = batch_start + BATCH_SIZE
batch = transcript[batch_start:batch_end]
transcripts.append(batch)
print(f"Transcript split into {len(transcripts)} chunks.")
for batch in transcripts:
response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=batch)
embedding = response.get('embedding', [])
if embedding:
embeddings_list.append(embedding)
# Save embeddings to CSV
df = pd.DataFrame({"text": transcripts, "embedding": embeddings_list})
df.to_csv(SAVE_PATH, index=False, mode="w")
print(f"Embeddings saved to {SAVE_PATH}")
def ask_the_embedding_ollama(question, embeddings_filename, print_message=True):
"""Query embeddings using cosine similarity."""
# Load the embeddings from CSV
df = pd.read_csv(embeddings_filename)
df['embedding'] = df['embedding'].apply(ast.literal_eval)
return ask(question, df, model=GPT_MODEL, print_message=print_message)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='analyse, embed, and ask anything about the content of the youtube video')
parser.add_argument('action', type=str, default='ask', help='analyse, embed, ask')
parser.add_argument('--vid', type=str, help='youtube video id') # making it optional so we can create embedding for any transcript
parser.add_argument('--q', type=str, help='your question')
parser.add_argument('--tf', type=str, help='transcript filename, if exists, to create embedding for. in this case, value of vid is ignored')
parser.add_argument('--ef', type=str, help='embedding filename to use')
# only QnAs mode is implemented at the moment. want to merge with the llm_process method in audio2llm.py
parser.add_argument('--tmodel', type=str, default='base', help='the Whisper model to use for transcription (tiny/base/small/medium/large. default: base)')
parser.add_argument('--mode', type=str, default='QnAs', help='QnAs, note, summary/kp, tag, topix, thread, tp, cbb, definition, distinctions, misconceptions, ada, translation')
parser.add_argument('--lmtone', type=str, default='default', help="customise LLM's tone. doubtful_stylistic_british is one you can use")
parser.add_argument('--lmodel', type=str, default='gpt-4o', help='the GPT model to use for summarization (default: gpt-4o)')
parser.add_argument('--prompt', type=str, help='prompt to use, but chapters will be concatenated as well')
parser.add_argument('--nc', action='store_true', help="don't pass chapters to analyse")
parser.add_argument('--dla', action='store_true', help="download and transcribe the audio, don't use youtube auto caption")
parser.add_argument('--lang', type=str, default='en', help='language code of the video')
time_begin = datetime.datetime.now() # in audio2llm.py I put this inside the llm_process method because it saves the text file in the method, unlike here
args = parser.parse_args()
force_download_audio = False
if(args.dla):
force_download_audio = True
if(args.lmodel): # hackish because I haven't made GPT_MODEL local (still global)
GPT_MODEL = args.lmodel
if('gpt' in GPT_MODEL):
print(f"base :: using {GPT_MODEL}")
openai.api_key = os.getenv("OPENAI_KEY")
client = openai.OpenAI(api_key=openai.api_key)
else:
print(f"base :: using ollama {GPT_MODEL}")
# use ollama. perhaps add support for (simonw's) llm some time. or huggingface's (via sentence_transformer?)
client = openai.OpenAI(
base_url = 'http://localhost:11434/v1',
api_key='ollama', # required, but unused
)
if args.action == 'analyse':
chapters = []
transcript = ''
llm_result = ''
if(args.tf and os.path.exists(args.tf)): # no need to retrieve the transcript from youtube when it has been provided
video_id = os.path.basename(args.tf)
with open(args.tf) as f:
transcript = f.read()
else:
video_id = args.vid
transcript, chapters, video_title = process_youtube_video(YOUTUBE_VIDEO_URL.format(video_id), video_id, language=args.lang, force_download_audio=force_download_audio, transcription_model=args.tmodel)
with open('output/'+video_id+'-transcript.txt', "w") as f:
f.write(transcript)
        print(f'Transcript acquired, length: {len(transcript)}')
with_chapters = True
if(args.nc):
with_chapters = False
mode = args.mode
if(args.prompt):
llm_result += f"prompt: {args.prompt}\n\n------\n\n"
print(f"prompt: {args.prompt}")
mode = f"prompt-{hashlib.md5(args.prompt.encode('utf-8')).hexdigest()}"
llm_result = llm_process(transcript, llm_mode=args.mode, chapters=chapters, use_chapters=with_chapters, prompt=args.prompt, video_title=video_title, llm_personality=args.lmtone)
time_end = datetime.datetime.now()
print(f"time taken: {str(time_end - time_begin)}")
llm_result += f"\n\n-----\n\ntime taken: {str(time_end - time_begin)}"
print(f'LLM result for the Video:\n{llm_result}')
llm_result_filename = f"output/{video_id}-{mode}-{args.lmodel}.md"
with open(llm_result_filename, "w") as f:
f.write(llm_result)
elif args.action == 'embed': # not strong enough, TODO to refactor
if args.vid:
transcript_id = args.vid # need this to construct the below. TODO: refactor so the file naming is more structured and simple
transcript_filename = 'output/'+transcript_id+'-transcript.txt'
if(args.tf and os.path.exists(args.tf)): # override the value of vid if tf was provided
transcript_filename = args.tf
transcript_id = os.path.basename(transcript_filename)
print(f'Creating embedding from transcript: \n{transcript_filename}, transcript_id: {transcript_id}')
with open(transcript_filename) as f:
transcript = f.read()
# print(f'Transcript acquired: \n{transcript}')
if('text-embedding' in EMBEDDING_MODEL): # uses OpenAI's embedding model
create_embedding(transcript, transcript_id)
elif('nomic' in EMBEDDING_MODEL or 'mxbai' in EMBEDDING_MODEL): # uses ollama's local embedding model
create_embedding_ollama(transcript, transcript_id)
elif args.action == 'ask':
question = "what questions can I ask about what's discussed in the video so I understand the main argument and points that the speaker is making? and for each question please answer each and elaborate them in detail in the same response"
'''
question = 'what are the three questions that the video provide answer for? and for each question please answer each and elaborate them in detail in the same response'
another good one: "what are all arguments that the essay is making?"
'''
if args.q:
question = args.q
if args.ef and os.path.isfile(args.ef):
answer = question + "\n======\n"
# embeddings_filename = "output/embeddings/"+video_id+"-transcript_embedding.csv"
if('nomic' in EMBEDDING_MODEL or 'mxbai' in EMBEDDING_MODEL):
answer += ask_the_embedding_ollama(question, args.ef)
else:
answer += ask_the_embedding(question, args.ef)
embeddings_filename = args.ef
print(f'Answer:\n{answer}')
import re
pattern = r"([^/]+)-transcript-transcript_embedding.csv" # not so robust way of inferring the video_id from the embedding file, which is just gonna be used to save the answer file sih
match_video_id = re.search(pattern, embeddings_filename.split('/')[-1])
if match_video_id:
video_id = match_video_id.group(1)
with open('output/'+video_id+'-answer-'+hashlib.md5(question.encode('utf-8')).hexdigest()+'.txt', "w") as f:
f.write(answer)
else:
print(f"Cannot extract video_id from {embeddings_filename}.")
'''
TODO
√ 0. implement some [embedding](https://www.mlq.ai/openai-whisper-gpt-3-fine-tuning-youtube-video/) so I can dig deeper into bits like
"tell me more about The conversation emphasizes the importance of asking good questions and how they can lead to breakthroughs in writing."
which, would be a good next step to getting that "living blog" concept implemented
√ 1. add argparse
√ 2. chunk the text properly, consider proper sentences (NLTK) # and clean it up too?
# I have implemented this in summarize.py and md2notes.py,
# but youtube transcripts (auto-captions) are another beast (no punctuation),
# so it's sort of useless to do more advanced chunking unless the transcript is generated using Whisper
# perhaps some overlap would be beneficial tho? i.e. use the langchain text_splitters
https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
https://platform.openai.com/tokenizer
√ 3. save the transcript and summary/note to file to save some token and processing costs when doing this trial
√ 4. perhaps for other project: to retrieve web articles and summarise (similar structure with this but use readability and stuff) # done in url2markdown.py
like this tool: https://totheweb.com/learning_center/tools-convert-html-text-to-plain-text-for-content-review/
read a file containing a list of line-separated URLs, retrieve the text, embed, draw the underlying theme, generate a list of talking points, questions that can be answered from these articles
help me see wider, deeper, clearer
√ 5. tune the prompt, still seems to be patchy, not as good as audio2llm.py. or is it the temperature?
https://platform.openai.com/docs/guides/prompt-engineering/strategy-provide-reference-text
or https://github.com/e-johnstonn/GPT-Doc-Summarizer/blob/master/my_prompts.py
√ 6. get timestamped response to dig in. e.g. "on minute xyz they discussed this point" (perhaps incorporate as answer to my convo as per #1 ?) get ReAct / agent involved?
√ 7. try this https://github.com/openai/openai-cookbook/blob/main/examples/RAG_with_graph_db.ipynb # implemented elsewhere (see *embedding*.py)
√ 8. and this https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb # implemented elsewhere (see *embedding*.py)
~ 9. implement / port this on other local model (mistral, llama2, etc) so the cost is managed
the embed and ask results are pretty bad with ollama mistral. see create_embedding_ollama
oh, ollama's mistral probably isn't meant for embeddings anyway. I likely need the bge variant I had working... somewhere... haha
here, find chromadb https://howaibuildthis.substack.com/p/choosing-the-right-embedding-model
2024-04-02:
- done for the `summarize` mode
- still TODO for the `ask` and `embed` modes
√ 10. image recognition? I want to process my screenshots and stuff # got a version of LLaVA working locally and two scripts using OpenAI's Vision for image and video
start here:
* https://huggingface.co/docs/transformers/main/tasks/image_captioning
* https://llava-vl.github.io/ # the most ready to use. but not perfect
* https://huggingface.co/blog/blip-2
* https://simonwillison.net/2023/Sep/12/llm-clip-and-chat/
CLIP is a pretty old model at this point, and there are plenty of interesting alternatives that are just waiting for someone to wrap them in a plugin.
I’m particularly excited about Facebook’s ImageBind, which can embed images, text, audio, depth, thermal, and IMU data all in the same vector space!
* https://github.com/facebookresearch/ImageBind # to try the tutorial https://itnext.io/imagebind-one-embedding-space-to-bind-them-all-b48c8623d39b
* https://platform.openai.com/docs/guides/vision # reading and processing images
* https://platform.openai.com/docs/guides/images/introduction?context=node # generating images
√ 11. the LLM output of youtube2llm.py is quite high-level, vague, and not as good (refined, granular, precise) as audio2llm.py's.
need to check whether it's the prompt, the chunking strategy, the context window, or what. seems like it's a combination of all of them.
OK refactored the prompts and modes out of audio2llm.py to a separate file (prompts.py) and allow youtube2llm.py to use them.
chunking method and size is already the same between these scripts
12. refactor the output directory # can't remember what this is about, is it just adding output_dir to the args? or to not hardcode some of the "/output"... paths here?
13. allow running an `ask` mode on a video without having to explicitly call `analyse` and then `embed` beforehand
i.e.:
./youtube2llm.py analyse --vid=zt6i6vVgiO4 --nc --lmodel mistral
./youtube2llm.py embed --tf=output/zt6i6vVgiO4-transcript.txt
./youtube2llm.py ask --ef=output/embeddings/zt6i6vVgiO4-transcript-transcript_embedding.csv --q "what are the three depression subtypes and the way depression manifests in Earth, Wind, or Fire types?"
became
./youtube2llm.py ask --vid=zt6i6vVgiO4 --nc --lmodel mistral --q "what are the three depression subtypes and the way depression manifests in Earth, Wind, or Fire types?"
14. when trying to use o1-preview, got:
An unexpected error occurred: Error code: 400 - {'error': {'message': "Your organization must qualify for at least usage tier 5 to access 'o1-preview'. See https://platform.openai.com/docs/guides/rate-limits/usage-tiers for more details on usage tiers.", 'type': 'invalid_request_error', 'param': 'model', 'code': 'below_usage_tier'}}
// lol, https://platform.openai.com/docs/guides/rate-limits/usage-tiers > "Tier 5 $1,000 paid and 30+ days since first successful payment"
// and https://platform.openai.com/settings/organization/limits shows I'm currently tier 2. haha
'''