--write-comments --write-info-json --skip-download --write-thumbnail --verbose --batch-file --paths
yt-dlp --write-comments --write-info-json --skip-download --verbose --batch-file ../urls.txt --paths tmp/
if -P is not provided (in conjunction with --batch-file), the output uses the default OUTPUT_TEMPLATE and is written to the current directory => use -P
any video type (streams, videos, shorts) will work when formatted as here: https://www.youtube.com/watch?v=
this command retrieves all of the channel's public video ids as a newline-separated list:
yt-dlp --flat-playlist --print "%(id)s" "https://www.youtube.com/channel/UCnKJ-ERcOd3wTpG7gA5OI_g/"
what we want is a "main csv" that stores a row for each of these ids, with a column for the path to the .info.json file (empty if nothing returned), plus a column previously_tried_to_grab_metadata so we know which ones to skip in the future
tograb = { id ∈ ids : is_empty(metadata_path) && !tried_to_grab }
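That filter can be sketched in Rust. The Row struct and its field names are assumptions taken from the csv plan in these notes (id, metadata_path, last_metadata_attempt), not a settled schema:

```rust
// One row of the "main csv": id, metadata_path, last_metadata_attempt.
#[derive(Debug, Clone)]
struct Row {
    id: String,
    metadata_path: String,         // empty if no .info.json was written yet
    last_metadata_attempt: String, // empty if we never tried
}

// tograb = { id : metadata_path is empty && we never tried before }
fn tograb(rows: &[Row]) -> Vec<String> {
    rows.iter()
        .filter(|r| r.metadata_path.is_empty() && r.last_metadata_attempt.is_empty())
        .map(|r| r.id.clone())
        .collect()
}

fn main() {
    let rows = vec![
        Row { id: "3K6Z51brJB8".into(), metadata_path: "data/meta/3K6Z51brJB8.info.json".into(), last_metadata_attempt: "2024-01-01".into() },
        Row { id: "1RoK_i-GJGo".into(), metadata_path: "".into(), last_metadata_attempt: "2024-01-01".into() }, // tried, failed: skip
        Row { id: "8_whLzXVLGI".into(), metadata_path: "".into(), last_metadata_attempt: "".into() },           // never tried: grab
    ];
    println!("{:?}", tograb(&rows)); // ["8_whLzXVLGI"]
}
```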
video:
https://www.youtube.com/watch?v=3K6Z51brJB8
short (has video comments):
https://www.youtube.com/watch?v=1RoK_i-GJGo
stream:
https://www.youtube.com/watch?v=8_whLzXVLGI
stream with chat comments:
https://www.youtube.com/watch?v=VCHIwNMGo_Y
stream with chat and auto-subs
https://www.youtube.com/watch?v=HyCBlLMrCZc
anand@jain ~/s/t/data (main)> yt-dlp --list-subs "https://www.youtube.com/watch?v=3K6Z51brJB8"
[youtube] Extracting URL: https://www.youtube.com/watch?v=3K6Z51brJB8
[youtube] 3K6Z51brJB8: Downloading webpage
[youtube] 3K6Z51brJB8: Downloading ios player API JSON
[youtube] 3K6Z51brJB8: Downloading android player API JSON
[youtube] 3K6Z51brJB8: Downloading player ef5f17ca
WARNING: [youtube] 3K6Z51brJB8: nsig extraction failed: You may experience throttling for some formats
n = X3qPQdQWVydE62f9 ; player = https://www.youtube.com/s/player/ef5f17ca/player_ias.vflset/en_US/base.js
[info] Available automatic captions for 3K6Z51brJB8:
Language Name Formats
ab Abkhazian vtt, ttml, srv3, srv2, srv1, json3
aa Afar vtt, ttml, srv3, srv2, srv1, json3
af Afrikaans vtt, ttml, srv3, srv2, srv1, json3
...
note that the live_chat track is an actual "subtitle", not an autosub; they are listed differently
anand@jain ~/s/t/data (main)> yt-dlp --list-subs "https://www.youtube.com/watch?v=VCHIwNMGo_Y"
[youtube] Extracting URL: https://www.youtube.com/watch?v=VCHIwNMGo_Y
[youtube] VCHIwNMGo_Y: Downloading webpage
WARNING: [youtube] No supported JavaScript runtime could be found. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See https://github.com/yt-dlp/yt-dlp/wiki/EJS for details on installing one. To silence this warning, you can use --extractor-args "youtube:player_client=default"
[youtube] VCHIwNMGo_Y: Downloading android sdkless player API JSON
[youtube] VCHIwNMGo_Y: Downloading web safari player API JSON
WARNING: [youtube] VCHIwNMGo_Y: Some web_safari client https formats have been skipped as they are missing a url. YouTube is forcing SABR streaming for this client. See https://github.com/yt-dlp/yt-dlp/issues/12482 for more details
[youtube] VCHIwNMGo_Y: Downloading m3u8 information
WARNING: [youtube] VCHIwNMGo_Y: Some web client https formats have been skipped as they are missing a url. YouTube is forcing SABR streaming for this client. See https://github.com/yt-dlp/yt-dlp/issues/12482 for more details
VCHIwNMGo_Y has no automatic captions
[info] Available subtitles for VCHIwNMGo_Y:
Language Formats
live_chat json
From digging through yt-dlp issues and YouTube’s caption API calls:
en-orig (“English (Original)”)
This is the ASR track in the video’s original audio language when that language is English.
It’s the one that comes from kind=asr in the timedtext API (pure speech-recognition track).
If you want “what YouTube auto-generated from the spoken English, no translation layer”, you want en-orig.
en (“English”)
This label is overloaded and can be:
A human-uploaded English subtitle (shows up under “Available subtitles”), or
An auto-generated or auto-translated English track (under “Available automatic captions”), depending on the video.
On some videos you’ll see both en-orig and en in the automatic captions list; they’re often very similar but not guaranteed identical.
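Given that, a "prefer en-orig, fall back to en" rule can be sketched as a small Rust helper. The input list is assumed to come from the track names yt-dlp reports (e.g. the automatic captions listing above); the function name is mine:

```rust
// Pick the preferred English caption track: the pure-ASR "en-orig" when
// present, otherwise plain "en". Returns None if neither exists.
fn pick_english_track<'a>(available: &[&'a str]) -> Option<&'a str> {
    for want in ["en-orig", "en"] {
        if available.contains(&want) {
            return Some(want);
        }
    }
    None
}

fn main() {
    println!("{:?}", pick_english_track(&["ab", "en", "en-orig"])); // Some("en-orig")
    println!("{:?}", pick_english_track(&["en"]));                  // Some("en")
    println!("{:?}", pick_english_track(&["live_chat"]));           // None
}
```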
Rust script, runnable with clap:
I want a csv that has id, metadata_path, last_metadata_attempt,
so I can then run something like tg (for transcript-grabber)
tg update ids will run something like yt-dlp --flat-playlist --print "%(id)s" "https://www.youtube.com/channel/UCnKJ-ERcOd3wTpG7gA5OI_g/"
to get all of MY CHANNEL's video ids (we can hardcode the channel id for now).
we then want to compare the ids we have with the list in the "main csv".
for all of the ids that 1) have an empty metadata path and 2) an empty last_metadata_attempt date (i.e. never tried), we then make a temp yt-dlp batch file with these ids/urls and run yt-dlp --write-comments --write-info-json --skip-download --write-thumbnail --verbose --batch-file (the one just created) --paths (somehow get the abs path to the root of this Rust project + ./data/meta/)
then update main.csv with the new metadata_paths and last_metadata_attempt values (as well as new rows if new ids were found when grabbing the ids)
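The command-construction step of that flow can be sketched as pure functions (the batch-file path and meta dir are placeholders; only the yt-dlp flags already used in these notes are assumed):

```rust
use std::path::Path;

// Build the argv for the metadata-grab step described above. The batch file
// holds one URL per line; `meta_dir` is <project root>/data/meta/.
fn ytdlp_metadata_args(batch_file: &Path, meta_dir: &Path) -> Vec<String> {
    vec![
        "--write-comments".into(),
        "--write-info-json".into(),
        "--skip-download".into(),
        "--write-thumbnail".into(),
        "--verbose".into(),
        "--batch-file".into(),
        batch_file.display().to_string(),
        "--paths".into(),
        meta_dir.display().to_string(),
    ]
}

// Turn bare ids into the watch URLs yt-dlp expects in the batch file.
fn batch_lines(ids: &[&str]) -> String {
    ids.iter()
        .map(|id| format!("https://www.youtube.com/watch?v={id}"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let args = ytdlp_metadata_args(Path::new("tmp/batch.txt"), Path::new("data/meta"));
    println!("yt-dlp {}", args.join(" "));
    println!("{}", batch_lines(&["3K6Z51brJB8", "8_whLzXVLGI"]));
}
```

Actually running it would then be something like `std::process::Command::new("yt-dlp").args(&args).status()`, with main.csv rewritten afterwards.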