How to extract closed caption transcript from YouTube video?

囚心锁ツ 2020-12-22 19:32

Is it possible to extract the closed caption transcript from YouTube videos?

We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has clo

10 answers
  • 2020-12-22 19:33

    You can view, copy, or download a timecoded XML file of a YouTube video's closed captions by accessing

    http://video.google.com/timedtext?lang=[LANGUAGE]&v=[YOUTUBE VIDEO IDENTIFIER]
    

    For example http://video.google.com/timedtext?lang=pt&v=WSVKbw7LC2w

    NOTE: this method does not download autogenerated closed captions, even if you get the language right (maybe there's a special code for autogenerated languages).
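
    A minimal Python sketch of the same idea, assuming the endpoint returns XML of the form <transcript><text start="..." dur="...">...</text></transcript> and that caption text may be entity-escaped twice. Both are assumptions, and the endpoint itself may no longer respond. The function names and the SAMPLE string are illustrative only; no network request is made here.

```python
import html
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def timedtext_url(video_id, lang="en"):
    """Build the caption URL in the form quoted in the answer above."""
    query = urlencode({"lang": lang, "v": video_id})
    return "http://video.google.com/timedtext?" + query


def parse_timedtext(xml_text):
    """Return (start_seconds, text) pairs from a timedtext-style XML string."""
    root = ET.fromstring(xml_text)
    cues = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        # The XML parser decodes one level of entities; unescape once more
        # in case the text was escaped twice (assumed behaviour).
        text = html.unescape(node.text or "").replace("\n", " ")
        cues.append((start, text))
    return cues


# Hypothetical sample in the assumed format, not real output.
SAMPLE = (
    '<transcript>'
    '<text start="1.5" dur="2.0">Hello &amp;#39;world&amp;#39;</text>'
    '<text start="3.5" dur="1.0">second line</text>'
    '</transcript>'
)

if __name__ == "__main__":
    print(timedtext_url("WSVKbw7LC2w", "pt"))
    for start, text in parse_timedtext(SAMPLE):
        print(start, text)
```

    To try it against a real video, the URL from timedtext_url could be fetched with urllib.request.urlopen and the response body passed to parse_timedtext.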

  • 2020-12-22 19:35

    There is a free Python tool called YouTube Transcript API.

    You can use it in scripts or as a command line tool:

    pip install youtube_transcript_api
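
    A rough sketch of using it from a script. It assumes an older (pre-1.0) release of the package, where YouTubeTranscriptApi.get_transcript(video_id) returns a list of dicts with "text", "start", and "duration" keys; newer releases may expose a different interface, so check the README for the installed version. format_transcript and fetch_entries are illustrative names, not part of the library.

```python
def format_transcript(entries):
    """Turn caption entries into 'HH:MM:SS text' lines."""
    lines = []
    for entry in entries:
        total = int(entry["start"])          # drop fractional seconds
        hours, rest = divmod(total, 3600)
        minutes, seconds = divmod(rest, 60)
        text = " ".join(entry["text"].split())  # collapse newlines and spaces
        lines.append(f"{hours:02d}:{minutes:02d}:{seconds:02d} {text}")
    return lines


def fetch_entries(video_id, languages=("en",)):
    """Fetch caption entries; needs network access and the package installed."""
    # Assumption: pre-1.0 interface of youtube_transcript_api.
    from youtube_transcript_api import YouTubeTranscriptApi
    return YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))


if __name__ == "__main__":
    # Hand-written entries in the assumed shape, so this runs offline.
    demo = [
        {"text": "ladies and\ngentlemen", "start": 1.429, "duration": 2.82},
        {"text": "thank you", "start": 3725.0, "duration": 1.0},
    ]
    print("\n".join(format_transcript(demo)))
```

    With network access, format_transcript(fetch_entries(video_id)) would give one line per caption for a whole webcast.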
    
  • 2020-12-22 19:40

    Another option is to use youtube-dl:

    youtube-dl --skip-download --write-auto-sub $youtube_url
    

    The default format is vtt and the other available format is ttml (--sub-format ttml).

    --write-sub
           Write subtitle file
    
    --write-auto-sub
           Write automatically generated subtitle file (YouTube only)
    
    --all-subs
           Download all the available subtitles of the video
    
    --list-subs
           List all available subtitles for the video
    
    --sub-format FORMAT
           Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
    
    --sub-lang LANGS
           Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
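
    For scripting over many videos, the options above can be assembled from Python. This is only a sketch: it uses the flags listed above, assumes youtube-dl is on the PATH, and the function names are made up for illustration.

```python
import subprocess


def caption_command(url, auto=True, langs=None, sub_format=None):
    """Build a youtube-dl argument list that downloads captions only."""
    cmd = ["youtube-dl", "--skip-download"]
    # Auto-generated captions and uploaded captions use different flags.
    cmd.append("--write-auto-sub" if auto else "--write-sub")
    if langs:
        cmd += ["--sub-lang", ",".join(langs)]
    if sub_format:
        cmd += ["--sub-format", sub_format]
    # "--" stops option parsing, so a URL or ID starting with "-" is safe.
    cmd += ["--", url]
    return cmd


def download_captions(url, **options):
    """Run youtube-dl; raises CalledProcessError if it exits with an error."""
    subprocess.run(caption_command(url, **options), check=True)


if __name__ == "__main__":
    print(caption_command("https://www.youtube.com/watch?v=WSVKbw7LC2w",
                          auto=False, langs=["pt", "en"], sub_format="ttml"))
```

    Passing the arguments as a list avoids shell quoting problems with unusual URLs or titles.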
    

    You can use ffmpeg to convert the subtitle file to another format:

    ffmpeg -i input.vtt output.srt
    

    This is what the VTT subtitles look like:

    WEBVTT
    Kind: captions
    Language: en
    
    00:00:01.429 --> 00:00:04.249 align:start position:0%
    
    ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>
    
    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
     </c>
    
    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
    you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>
    
    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
     </c>
    
    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
    such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>
    
    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice
    
    
    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>
    
    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
     </c>
    
    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
    the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>
    
    00:00:11.030 --> 00:00:11.040 align:start position:0%
    <c.colorE5E5E5>the circumstances that's brought us
     </c>
    

    Here are the same subtitles without the part at the top of the file and without tags:

    00:00:01.429 --> 00:00:04.249 align:start position:0%
    
    ladies and gentlemen I'd like to thank
    
    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen I'd like to thank
    
    
    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen I'd like to thank
    you for coming tonight especially at
    
    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you for coming tonight especially at
    
    
    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you for coming tonight especially at
    such short notice
    
    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice
    
    
    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm sure mr. Irving will fill you in on
    
    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure mr. Irving will fill you in on
    
    
    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure mr. Irving will fill you in on
    the circumstances that's brought us
    

    You can see that each subtitle text is repeated three times. There is a new subtitle text every eighth line (3rd, 11th, 19th, and 27th).

    This converts the VTT subtitles to a simpler format:

    sed '1,/^$/d' *.vtt| # remove the part at the top
    sed 's/<[^>]*>//g'| # remove tags
    awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds
    

    This is what the output of the command above looks like:

    00:00:01 ladies and gentlemen I'd like to thank
    00:00:04 you for coming tonight especially at
    00:00:05 such short notice
    00:00:07 I'm sure mr. Irving will fill you in on
    00:00:09 the circumstances that's brought us
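
    The same simplification can be sketched in Python. Note that this is a different technique from the awk command: instead of relying on the eight-line period, it skips any caption line identical to the one just emitted, so it also copes with files where the spacing differs. The trade-off is that a line genuinely spoken twice in a row would be printed only once. simplify_vtt is an illustrative name.

```python
import re

TAG = re.compile(r"<[^>]*>")  # inline timing and colour tags


def simplify_vtt(vtt_text):
    """Return 'HH:MM:SS text' lines, one per new caption line."""
    result = []
    start = None   # start time of the current cue, without milliseconds
    last = None    # most recently emitted caption text
    for raw in vtt_text.splitlines():
        if "-->" in raw:
            # Timing line such as "00:00:01.429 --> 00:00:04.249 align:start"
            start = raw.split("-->")[0].strip().split(".")[0]
            continue
        if start is None:
            continue  # still in the header before the first cue
        text = TAG.sub("", raw).strip()
        if not text or text == last:
            continue  # blank line or a repeat of the previous caption
        result.append(f"{start} {text}")
        last = text
    return result


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as handle:
        print("\n".join(simplify_vtt(handle.read())))
```

    Saved under a name of your choice, the script takes the path of a .vtt file as its only argument and prints the simplified captions.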
    

    This prints the closed captions of a video in the simplified format:

    cap()(cd /tmp;rm -f -- *.vtt;youtube-dl --skip-download --write-auto-sub -- "$1";sed '1,/^$/d' -- *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3')

    The command below downloads the captions of all videos on a channel. When there is an error like Unable to extract video data, -i (--ignore-errors) causes youtube-dl to skip the video instead of exiting with an error.

    youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(id)s.%(ext)s' https://www.youtube.com/channel/$channelid;for f in *.vtt;do sed '1,/^$/d' "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3'>"${f%.vtt}";done

  • 2020-12-22 19:45

    I did this manually without much trouble: I opened the transcript at the beginning of the video, then, with the Shift key held down, left-clicked and dragged from the 00:00 time marker over the first few lines.

    I then advanced the video to near the end. When the video stopped, I clicked the end of the last sentence whilst holding down the shift key once more. With CTRL-C I copied the text to the clipboard and pasted it into an editor.

    Done!

    Caveat: make sure that no RDP session sharing the clipboard and no software such as TeamViewer is running at the same time, because copying this much text at once will overflow their buffers.

  • 2020-12-22 19:49

    Here's how to get the transcript of a YouTube video (when available):

    • Go to YouTube and open the video of your choice.
    • Click on the "More actions" button (3 horizontal dots) located next to the Share button.
    • Click "Open transcript"

    Although the syntax may be a little goofy, this is a pretty good solution.

    Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video

  • 2020-12-22 19:49

    You can download the streaming subtitles from YouTube with KeepSubs, DownSub, and SaveSubs.

    You can choose between the automatic transcript and author-supplied closed captions. These tools also offer the possibility of automatically translating the English subtitles into other languages using Google Translate.
