Last active
May 14, 2023 18:03
Fetch a YouTube playlist with the title, description and subtitles of each video, and build a GPT-queryable index from the information
#!/bin/bash
set -eu

FOLDER=$1
PLAYLIST=$2

rm -f playlist.txt
mkdir -p "$FOLDER"

yt-dlp --flat-playlist -i --print-to-file url playlist.txt "$PLAYLIST"

for i in $(cat playlist.txt)
do
  # file name from the title: whitespace, brackets and slashes become underscores
  FILENAME=$(yt-dlp --get-title --skip-download "$i" | tr -s '[[:space:]]/' '_').content
  if [ -f "$FOLDER/$FILENAME" ]; then
    continue
  fi

  rm -rf tmp
  mkdir -p tmp
  cd tmp

  # fetch subtitle
  yt-dlp --skip-download \
    --sub-lang en-orig \
    --write-auto-sub \
    "$i"

  if compgen -G "*.vtt" > /dev/null; then  # true if any .vtt file exists
    # convert subtitle (vtt2text is assumed to write a .txt file for each .vtt)
    for j in *.vtt
    do
      vtt2text "$j"
    done

    # get title and description
    yt-dlp --get-title --get-description --skip-download "$i" > "$FILENAME"
    cat *.txt >> "$FILENAME"
    # FOLDER is assumed to be a path relative to the starting directory
    mv "$FILENAME" "../$FOLDER/$FILENAME"
  fi
  cd ..
done
import os
import logging
import sys
import textwrap

from llama_index import (
    GPTKeywordTableIndex,
    SimpleDirectoryReader,
    LLMPredictor,
)
from langchain import OpenAI

if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.CRITICAL)
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

    # build the index once and cache it in index.json
    if not os.path.exists("index.json"):
        subtitles_folder = sys.argv[1]
        documents = SimpleDirectoryReader(subtitles_folder).load_data()
        llm_predictor = LLMPredictor(
            llm=OpenAI(temperature=0,
                       model_name="text-davinci-003",
                       max_tokens=2048)
        )
        index = GPTKeywordTableIndex(documents, llm_predictor=llm_predictor)
        index.save_to_disk("index.json")
    else:
        index = GPTKeywordTableIndex.load_from_disk("index.json")

    # simple question loop; Ctrl-C or end of input exits
    while True:
        try:
            prompt = input("What should I figure out? ")
            response = index.query(prompt)
            response = str(response).strip()
            if not response:
                continue
            for line in textwrap.wrap(response, width=75):
                print(line)
            print("-----")
        except (KeyboardInterrupt, EOFError):
            break
And then you run python train-with-subtitles.py kubecon-europe-22
Don't forget to put your OPENAI_API_KEY in the env.
Prerequisites:
- install yt-dlp
- install some python dependencies:
pip install llama-index openai nltk
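The prerequisites above can be checked before starting a long fetch. The following is only a sketch: the function name `missing_prerequisites` is made up for illustration, and the module names (`llama_index`, `langchain`, `openai`, `nltk`) are assumed from the imports and the pip line in this gist.

```python
import importlib.util
import os
import shutil


def missing_prerequisites(
    tools=("yt-dlp", "vtt2text"),
    modules=("llama_index", "langchain", "openai", "nltk"),
    env_vars=("OPENAI_API_KEY",),
):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the command is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"command not found on PATH: {tool}")
    for module in modules:
        # find_spec returns None when a top-level module cannot be imported
        if importlib.util.find_spec(module) is None:
            problems.append(f"python module not installed: {module}")
    for var in env_vars:
        if not os.environ.get(var):
            problems.append(f"environment variable not set: {var}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Running it prints one line per missing item and nothing when everything is in place.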
I was wondering where vtt2text (line 34 of the fetch script) comes from. I don't see it in any Linux repos. I found a Python package, but it needs to be in its own script. Am I missing something here? Thanks
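In case it helps, here is a minimal stand-in for that step. It is not the vtt2text the author used (the gist does not say where that comes from). It is a sketch that assumes the tool only needs to strip the WebVTT header, cue timings and inline tags, and write a .txt file next to each .vtt, which is what the `cat *.txt` line in the fetch script expects. The names `vtt_to_text` and `main` are hypothetical.

```python
import re
import sys
from pathlib import Path

# cue timing lines such as "00:00:01.000 --> 00:00:04.000 align:start"
TIMING = re.compile(r"^(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->")
# inline tags such as <c> or <00:00:00.500> in auto-generated subtitles
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Reduce WebVTT content to plain caption text, one caption line per line."""
    lines = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        if not line or TIMING.match(line):
            continue
        # header lines; "Kind:" and "Language:" are assumed to be metadata only
        if line.startswith(("WEBVTT", "Kind:", "Language:")):
            continue
        # auto-generated subtitles often repeat the previous line; keep one copy
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return "\n".join(lines)


def main(paths):
    for name in paths:
        src = Path(name)
        text = vtt_to_text(src.read_text(encoding="utf-8"))
        # video.en-orig.vtt becomes video.en-orig.txt
        src.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, vtt2text.py (a made-up file name), the call in the fetch script would become `python /path/to/vtt2text.py "$j"`, with an absolute path because the loop runs inside the tmp directory.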
E.g., fetch all the subtitles of the videos from KubeCon Europe 2022.