--write-comments --write-info-json --skip-download --write-thumbnail --verbose --batch-file --paths
yt-dlp --write-comments --write-info-json --skip-download --verbose --batch-file ../urls.txt --paths tmp/
if -P is not provided (in conjunction with --batch-file), the output uses the default OUTPUT_TEMPLATE and is written to the current directory => use -P
any video type (streams, videos, shorts) will work when formatted as here: https://www.youtube.com/watch?v=
this command retrieves all of the channel's public video ids as a newline-separated list:
yt-dlp --flat-playlist --print "%(id)s" "https://www.youtube.com/channel/UCnKJ-ERcOd3wTpG7gA5OI_g/"
what we want is a "main csv" that stores a row for each of these ids, with a column for the path to the .info.json file (empty if nothing returned), plus a column previously_tried_to_grab_metadata so we know which ones to skip in the future
tograb = { id ∈ ids : is_empty(metadata_path) && !tried_to_grab }
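That filter can be sketched in Rust. The Row struct and its field names are assumptions taken from the csv plan in these notes (id, metadata_path, last_metadata_attempt), not a settled schema:

```rust
// One row of the "main csv": id, metadata_path, last_metadata_attempt.
#[derive(Debug, Clone)]
struct Row {
    id: String,
    metadata_path: String,         // empty if no .info.json was written yet
    last_metadata_attempt: String, // empty if we never tried
}

// tograb = { id : metadata_path is empty && we never tried before }
fn tograb(rows: &[Row]) -> Vec<String> {
    rows.iter()
        .filter(|r| r.metadata_path.is_empty() && r.last_metadata_attempt.is_empty())
        .map(|r| r.id.clone())
        .collect()
}

fn main() {
    let rows = vec![
        Row { id: "3K6Z51brJB8".into(), metadata_path: "data/meta/3K6Z51brJB8.info.json".into(), last_metadata_attempt: "2024-01-01".into() },
        Row { id: "1RoK_i-GJGo".into(), metadata_path: "".into(), last_metadata_attempt: "2024-01-01".into() }, // tried, failed: skip
        Row { id: "8_whLzXVLGI".into(), metadata_path: "".into(), last_metadata_attempt: "".into() },           // never tried: grab
    ];
    println!("{:?}", tograb(&rows)); // ["8_whLzXVLGI"]
}
```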
video:
https://www.youtube.com/watch?v=3K6Z51brJB8
short (has video comments):
https://www.youtube.com/watch?v=1RoK_i-GJGo
stream:
https://www.youtube.com/watch?v=8_whLzXVLGI
stream with chat comments:
https://www.youtube.com/watch?v=VCHIwNMGo_Y
stream with chat and auto-subs
https://www.youtube.com/watch?v=HyCBlLMrCZc
anand@jain ~/s/t/data (main)> yt-dlp --list-subs "https://www.youtube.com/watch?v=3K6Z51brJB8"
[youtube] Extracting URL: https://www.youtube.com/watch?v=3K6Z51brJB8
[youtube] 3K6Z51brJB8: Downloading webpage
[youtube] 3K6Z51brJB8: Downloading ios player API JSON
[youtube] 3K6Z51brJB8: Downloading android player API JSON
[youtube] 3K6Z51brJB8: Downloading player ef5f17ca
WARNING: [youtube] 3K6Z51brJB8: nsig extraction failed: You may experience throttling for some formats
n = X3qPQdQWVydE62f9 ; player = https://www.youtube.com/s/player/ef5f17ca/player_ias.vflset/en_US/base.js
[info] Available automatic captions for 3K6Z51brJB8:
Language Name Formats
ab Abkhazian vtt, ttml, srv3, srv2, srv1, json3
aa Afar vtt, ttml, srv3, srv2, srv1, json3
af Afrikaans vtt, ttml, srv3, srv2, srv1, json3
...
note that the live_chat track is an actual "subtitle", not an autosub; they are listed differently
anand@jain ~/s/t/data (main)> yt-dlp --list-subs "https://www.youtube.com/watch?v=VCHIwNMGo_Y"
[youtube] Extracting URL: https://www.youtube.com/watch?v=VCHIwNMGo_Y
[youtube] VCHIwNMGo_Y: Downloading webpage
WARNING: [youtube] No supported JavaScript runtime could be found. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See https://github.com/yt-dlp/yt-dlp/wiki/EJS for details on installing one. To silence this warning, you can use --extractor-args "youtube:player_client=default"
[youtube] VCHIwNMGo_Y: Downloading android sdkless player API JSON
[youtube] VCHIwNMGo_Y: Downloading web safari player API JSON
WARNING: [youtube] VCHIwNMGo_Y: Some web_safari client https formats have been skipped as they are missing a url. YouTube is forcing SABR streaming for this client. See https://github.com/yt-dlp/yt-dlp/issues/12482 for more details
[youtube] VCHIwNMGo_Y: Downloading m3u8 information
WARNING: [youtube] VCHIwNMGo_Y: Some web client https formats have been skipped as they are missing a url. YouTube is forcing SABR streaming for this client. See https://github.com/yt-dlp/yt-dlp/issues/12482 for more details
VCHIwNMGo_Y has no automatic captions
[info] Available subtitles for VCHIwNMGo_Y:
Language Formats
live_chat json
From digging through yt-dlp issues and YouTube’s caption API calls:
en-orig (“English (Original)”)
This is the ASR track in the video’s original audio language when that language is English.
It’s the one that comes from kind=asr in the timedtext API (pure speech-recognition track).
If you want “what YouTube auto-generated from the spoken English, no translation layer”, you want en-orig.
en (“English”)
This label is overloaded and can be:
A human-uploaded English subtitle (shows up under “Available subtitles”), or
An auto-generated or auto-translated English track (under “Available automatic captions”), depending on the video.
On some videos you’ll see both en-orig and en in the automatic captions list; they’re often very similar but not guaranteed identical.
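Given that, a "prefer en-orig, fall back to en" rule can be sketched as a small Rust helper. The input list is assumed to come from the track names yt-dlp reports (e.g. the automatic captions listing above); the function name is mine:

```rust
// Pick the preferred English caption track: the pure-ASR "en-orig" when
// present, otherwise plain "en". Returns None if neither exists.
fn pick_english_track<'a>(available: &[&'a str]) -> Option<&'a str> {
    for want in ["en-orig", "en"] {
        if available.contains(&want) {
            return Some(want);
        }
    }
    None
}

fn main() {
    println!("{:?}", pick_english_track(&["ab", "en", "en-orig"])); // Some("en-orig")
    println!("{:?}", pick_english_track(&["en"]));                  // Some("en")
    println!("{:?}", pick_english_track(&["live_chat"]));           // None
}
```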
Rust script, runnable with clap:
I want a csv that has id, metadata_path, last_metadata_attempt,
so I can then run something like tg (for transcript-grabber)
tg update ids will run something like yt-dlp --flat-playlist --print "%(id)s" "https://www.youtube.com/channel/UCnKJ-ERcOd3wTpG7gA5OI_g/"
to get all of MY CHANNEL's video ids (we can hardcode the channel id for now).
we then want to compare the ids we have with the list in the "main csv".
for all of the ids that 1) have an empty metadata path and 2) an empty last_metadata_attempt date (i.e. never tried), we then make a temp yt-dlp batch file with these ids/urls and run yt-dlp --write-comments --write-info-json --skip-download --write-thumbnail --verbose --batch-file (the one just created) --paths (somehow get the abs path to the root of this Rust project + ./data/meta/)
then update main.csv with the new metadata_paths and last_metadata_attempt values (as well as new rows if new ids were found when grabbing the ids)
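The command-construction step of that flow can be sketched as pure functions (the batch-file path and meta dir are placeholders; only the yt-dlp flags already used in these notes are assumed):

```rust
use std::path::Path;

// Build the argv for the metadata-grab step described above. The batch file
// holds one URL per line; `meta_dir` is <project root>/data/meta/.
fn ytdlp_metadata_args(batch_file: &Path, meta_dir: &Path) -> Vec<String> {
    vec![
        "--write-comments".into(),
        "--write-info-json".into(),
        "--skip-download".into(),
        "--write-thumbnail".into(),
        "--verbose".into(),
        "--batch-file".into(),
        batch_file.display().to_string(),
        "--paths".into(),
        meta_dir.display().to_string(),
    ]
}

// Turn bare ids into the watch URLs yt-dlp expects in the batch file.
fn batch_lines(ids: &[&str]) -> String {
    ids.iter()
        .map(|id| format!("https://www.youtube.com/watch?v={id}"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let args = ytdlp_metadata_args(Path::new("tmp/batch.txt"), Path::new("data/meta"));
    println!("yt-dlp {}", args.join(" "));
    println!("{}", batch_lines(&["3K6Z51brJB8", "8_whLzXVLGI"]));
}
```

Actually running it would then be something like `std::process::Command::new("yt-dlp").args(&args).status()`, with main.csv rewritten afterwards.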