Modern video sites support a specification known as WebVTT (“VTT”), which is a fairly rich format for listing video subtitles. It gives you the timing of each segment of text, as well as the ability to highlight individual words within a segment.
Unfortunately, this spec is horribly complex, and there are very few utilities for it that aren’t browser-oriented. If you want to write a command-line application to extract the text, it is quite painful. Here is a sample from an auto-generated YouTube caption file:
00:00:00.320 --> 00:00:01.740 align:start position:0%
thanks<00:00:00.510> for <00:00:00.599> joining <00:00:00.930> me <00:00:01.510> are <00:00:01.740> my
00:00:01.740 --> 00:00:01.910 align:start position:0%
thanks for joining me are my
00:00:01.910 --> 00:00:03.189 align:start position:0%
thanks for joining me are my
name<00:00:02.090> is <00:00:02.620> Richard <00:00:02.929> King <00:00:03.189> in
00:00:03.189 --> 00:00:03.580 align:start position:0%
name is Richard King in
As you can see, there are several issues with these subtitles. First, once you allow the UI to highlight words as they are spoken, the text must be repeated across segments, so tools that consume transcripts must de-duplicate the overlapping words. Second, there are no breaks where sentences end, and since this is automatically generated by YouTube, every “um” is transcribed.
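If you do need to work with the VTT file directly, the de-duplication described above might be sketched roughly as follows. The function name `dedupeVttText` is hypothetical, and the sketch assumes the rolling-caption layout shown in the sample, where repeated lines are verbatim copies of an earlier line. It ignores everything before the first timing line and does not handle cue identifiers or NOTE and STYLE blocks, so treat it as a starting point rather than a real VTT parser.

```typescript
// Hypothetical helper: collapse rolling, auto-generated VTT captions into plain text.
function dedupeVttText(vtt: string): string {
  // Matches a cue timing line such as "00:00:00.320 --> 00:00:01.740 align:start".
  const timing = /^\d\d:\d\d:\d\d\.\d\d\d --> \d\d:\d\d:\d\d\.\d\d\d/;
  // Matches inline tags such as <00:00:00.510>, <c> and </c>.
  const inlineTag = /<[^>]*>/g;
  const kept: string[] = [];
  let seenCue = false;
  for (const rawLine of vtt.split(/\r?\n/)) {
    if (timing.test(rawLine)) {
      seenCue = true;
      continue;
    }
    // Assumption: anything before the first timing line is header material.
    if (!seenCue) continue;
    const line = rawLine.replace(inlineTag, "").replace(/\s+/g, " ").trim();
    if (line === "") continue;
    // Rolling captions repeat earlier lines verbatim; keep one copy per run.
    if (kept[kept.length - 1] !== line) kept.push(line);
  }
  return kept.join(" ");
}
```

Comparing each line only with the previously kept line is a deliberate simplification: it removes the back-to-back repeats seen in the sample without discarding a phrase that the speaker genuinely says again later.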
We can make this much easier with an older format, SRT.
The SRT format looks like this:
1
00:00:00,320 --> 00:00:03,579
thanks for joining me are my name is Richard King in

2
00:00:03,580 --> 00:00:06,939
I'm a visual journalist at at 5:38
This is at least something that can be parsed in a tolerable fashion. I expect in the future that VTT libraries will improve, as it is the up-and-coming browser standard for subtitles, but if you just want to get transcripts now, the easiest way to do this is to get an SRT file, and parse it like so:
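function parseSrt(text: string): string {
  // Split on either Unix or Windows line endings.
  const lines = text.split(/\r?\n/);
  // Matches a timing line such as "00:00:00,320 --> 00:00:03,579".
  const matchBreak = /\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d/;
  // Matches the numeric index line that precedes each timing line.
  // Note: a subtitle line consisting only of digits is skipped as well.
  const matchIndex = /^\d+$/;
  let transcript = "";
  for (const rawLine of lines) {
    const line = rawLine.trim();
    // Keep only the subtitle text; skip timing and index lines.
    if (!matchBreak.test(line) && !matchIndex.test(line)) {
      transcript += " " + line;
    }
  }
  // Collapse runs of whitespace left behind by blank lines.
  return transcript.replace(/\s+/g, " ").trim();
}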
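If you also want the timings rather than just the text, the same file can be split into structured entries. The following is a minimal sketch, assuming well-formed SRT blocks separated by blank lines, each made up of an index line, a timing line, and one or more text lines. The names `SrtCue`, `srtTimeToMs`, and `parseSrtCues` are made up for illustration, and malformed timestamps simply throw.

```typescript
// Hypothetical shape for one parsed subtitle block.
interface SrtCue {
  index: number;
  startMs: number;
  endMs: number;
  text: string;
}

// Convert an SRT timestamp of the form "HH:MM:SS,mmm" into milliseconds.
function srtTimeToMs(stamp: string): number {
  const m = /^(\d+):(\d\d):(\d\d),(\d\d\d)$/.exec(stamp.trim());
  if (m === null) throw new Error(`Bad SRT timestamp: ${stamp}`);
  const [h, min, s, ms] = m.slice(1).map(Number);
  return ((h * 60 + min) * 60 + s) * 1000 + ms;
}

function parseSrtCues(text: string): SrtCue[] {
  const cues: SrtCue[] = [];
  // Blocks are separated by one or more blank lines.
  for (const block of text.trim().split(/\r?\n\s*\r?\n/)) {
    const lines = block.split(/\r?\n/).map((l) => l.trim());
    if (lines.length < 2) continue; // not a complete block
    const [start, end] = lines[1].split("-->");
    if (end === undefined) continue; // second line is not a timing line
    cues.push({
      index: Number(lines[0]),
      startMs: srtTimeToMs(start),
      endMs: srtTimeToMs(end),
      // Join multi-line subtitle text into a single line.
      text: lines.slice(2).join(" "),
    });
  }
  return cues;
}
```

Joining the `text` fields of the returned entries with spaces gives a plain transcript, while `startMs` and `endMs` stay available if you want to link text back to a position in the video.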
A great tool I found to help with getting the SRT file is Google2SRT [1].
[1] http://google2srt.sourceforge.net/en/