solr - Gary Sieling

September 19, 2017

Full-Text Search within Closed Captions

YouTube automatically generates closed captions for videos. FindLectures.com crawls these captions and lets you search for a phrase within a video, then start playback where the phrase occurs.

Machine-generated transcriptions include timestamps, but also many transcription errors. If we can obtain captions and a corrected transcript for a speech, these can be aligned using the words that do match. In the spots that differ, we can update the language with the corrected wording from the transcript.

In the example below, George W. Bush introduces the phrase “axis of evil” in a State of the Union address, and the search engine recognizes that this occurs about 13 minutes in:

Captions are stored in the search index in a simplified version of the SRT closed caption format:

00:13:00 word states like these and their
00:13:23 terrorist allies constitute an axis of evil
00:13:27 arming to threaten the peace of the world
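A minimal sketch of reading this simplified format back into structured records. The function name `parseCaptions` and the `time`, `seconds`, and `text` field names are illustrative choices, not taken from the FindLectures.com code:

```javascript
// Parse lines such as "00:13:23 terrorist allies constitute an axis of evil"
// into { time, seconds, text } records. All names here are hypothetical.
function parseCaptions(raw) {
  const linePattern = /^(\d{2}):(\d{2}):(\d{2})\s+(.*)$/;
  const captions = [];
  for (const line of raw.split('\n')) {
    const match = linePattern.exec(line.trim());
    if (!match) continue; // skip blank or malformed lines
    const [, hh, mm, ss, text] = match;
    captions.push({
      time: `${hh}:${mm}:${ss}`,
      // Offset in seconds, useful for starting playback at the right spot.
      seconds: Number(hh) * 3600 + Number(mm) * 60 + Number(ss),
      text,
    });
  }
  return captions;
}

const captions = parseCaptions(
  '00:13:00 word states like these and their\n' +
  '00:13:23 terrorist allies constitute an axis of evil\n' +
  '00:13:27 arming to threaten the peace of the world'
);
```

Keeping the offset in seconds alongside the display string makes it easy to seek the video player once a search hit is found.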

In many famous speeches, the crowd is loudest during the most memorable lines, which causes errors in machine transcriptions. Transcripts are typically available for very well-known speeches, but they rarely include timings.

This is the case for a famous speech by George H.W. Bush that includes the phrase “Read my lips – no new taxes”. At this point in the video the crowd cheers, rendering the last word or two unintelligible to a machine.

Machine transcriptions also commonly confuse words that sound alike, such as “code” and “coat”, or “word” and “world”.

Closed captions often bold words as they are spoken, so the captions may also repeat the same phrase several times:

00:13:00 word states like these and their
00:13:02 word states like these and their
00:13:03 word states like these and their
00:13:04 these and their terrorist allies
00:13:06 these and their terrorist allies
00:13:07 these and their terrorist allies
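One possible way to clean up this kind of repetition before alignment is sketched below. This is an assumption about how the duplicates could be handled, not necessarily what the original code does: it drops exact repeats, then trims any words that a line shares with the end of the text already emitted. The name `collapseRollingCaptions` is hypothetical.

```javascript
// Collapse "rolling" captions. Input and output are arrays of { time, text }.
function collapseRollingCaptions(captions) {
  const result = [];
  let seen = []; // every word emitted so far, in order
  let previousText = null;
  for (const { time, text } of captions) {
    if (text === previousText) continue; // exact repeat of the previous line
    previousText = text;
    const words = text.split(/\s+/).filter(Boolean);
    // Find the longest suffix of `seen` that is also a prefix of `words`.
    let overlap = Math.min(seen.length, words.length);
    while (overlap > 0) {
      const tail = seen.slice(seen.length - overlap);
      if (tail.every((word, i) => word === words[i])) break;
      overlap--;
    }
    const fresh = words.slice(overlap);
    if (fresh.length === 0) continue; // nothing new on this line
    result.push({ time, text: fresh.join(' ') });
    seen = seen.concat(fresh);
  }
  return result;
}

const collapsed = collapseRollingCaptions([
  { time: '00:13:00', text: 'word states like these and their' },
  { time: '00:13:02', text: 'word states like these and their' },
  { time: '00:13:03', text: 'word states like these and their' },
  { time: '00:13:04', text: 'these and their terrorist allies' },
  { time: '00:13:06', text: 'these and their terrorist allies' },
  { time: '00:13:07', text: 'these and their terrorist allies' },
]);
// collapsed has two entries:
//   00:13:00 word states like these and their
//   00:13:04 terrorist allies
```

A speaker who genuinely repeats a phrase would also be collapsed by this approach, so it trades a little accuracy for much less noise.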

There are robust algorithms to do text alignment. These were invented to aid in DNA sequencing. Sequencing a genome is like re-assembling a puzzle: it takes many small strands of DNA, and then recombines them by matching where they overlap.

For this essay I’m using the Smith-Waterman algorithm, as there is a good implementation available on npm:

npm install igenius-smith-waterman --save

A DNA sequence is represented using the letters A, C, T, and G. Since alignment algorithms operate on letters rather than words, we need to map each word in the text to a letter:

A -> the
B -> axis
C -> of
D -> evil

To increase the number of matches in the alignment, punctuation and accent marks are removed from the transcript, and all words are lower case.
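A rough sketch of that normalization step, assuming plain Unicode-aware regular expressions are good enough. The helper name `normalizeWords` is made up for illustration:

```javascript
// Lowercase the text, strip accents and punctuation, and split into words.
function normalizeWords(text) {
  return text
    .normalize('NFD')                  // split accented letters into base letter + combining mark
    .replace(/\p{M}/gu, '')            // remove the combining marks (the accents)
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')  // drop everything except letters, digits, whitespace
    .split(/\s+/)
    .filter(Boolean);                  // discard empty strings left by stripped punctuation
}

const words = normalizeWords('Read my lips – no new taxes!');
// words: ['read', 'my', 'lips', 'no', 'new', 'taxes']
```

Note that apostrophes are removed too, so “I'm” becomes “im”. That is harmless as long as both texts being aligned go through the same function.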

It is also important to include the caption timestamps in this dictionary. In DNA terms, these are like “mutations” we want to apply to the transcript.

E -> 00:13:00
F -> 00:13:23
G -> 00:13:27

Our alphabet will be much larger than DNA’s four letters. A typical speech might include 500-2,000 unique terms, so it’s important to use an implementation that supports Unicode characters.
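Here is one way such a dictionary could be built. It assigns each distinct token (word or timestamp) a character from the Unicode Private Use Area, which is an assumption on my part: that range offers 6,400 single-unit characters and cannot collide with the dash used to mark gaps. The `createDictionary` helper is hypothetical.

```javascript
const FIRST_CODE_POINT = 0xe000; // start of the Private Use Area
const LAST_CODE_POINT = 0xf8ff;  // end of the Private Use Area

function createDictionary() {
  const tokenToChar = new Map();
  const charToToken = new Map();
  return {
    // Turn a list of tokens into a string with one character per token.
    encode(tokens) {
      let encoded = '';
      for (const token of tokens) {
        if (!tokenToChar.has(token)) {
          const codePoint = FIRST_CODE_POINT + tokenToChar.size;
          if (codePoint > LAST_CODE_POINT) {
            throw new RangeError('too many distinct tokens for this range');
          }
          const ch = String.fromCharCode(codePoint);
          tokenToChar.set(token, ch);
          charToToken.set(ch, token);
        }
        encoded += tokenToChar.get(token);
      }
      return encoded;
    },
    // Turn an encoded string back into tokens; unknown characters
    // (such as the gap dash) come back as undefined.
    decode(encoded) {
      return Array.from(encoded, (ch) => charToToken.get(ch));
    },
  };
}

const dictionary = createDictionary();
const transcriptCode = dictionary.encode(['the', 'axis', 'of', 'evil']);
const captionCode = dictionary.encode(
  ['00:13:23', 'the', 'axis', 'of', 'evil', '00:13:27']
);
```

Both texts must share one dictionary so that the same word always maps to the same character.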

When we run the alignment algorithm, we give it two series of letters, and it tells us how to turn the first string into the second, and vice versa. Where there is no match, it inserts dashes.

align('ABCDEFG', 'ABCDEFFG'):
left: ABCDE-FG
right: ABCDEFFG
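To show what happens inside, here is a small self-contained sketch of Smith-Waterman with a linear gap penalty. It is not the API of the npm package above, and the scoring values (match 2, mismatch -1, gap -1) are arbitrary assumptions:

```javascript
// Local alignment of strings a and b. Returns both aligned strings,
// with '-' marking gaps, plus the best score found.
function smithWaterman(a, b, { match = 2, mismatch = -1, gap = -1 } = {}) {
  const rows = a.length + 1;
  const cols = b.length + 1;
  const score = Array.from({ length: rows }, () => new Array(cols).fill(0));
  let best = 0;
  let bestI = 0;
  let bestJ = 0;

  // Fill the scoring matrix; cells never drop below zero (local alignment).
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const fromDiag = score[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch);
      const fromUp = score[i - 1][j] + gap;
      const fromLeft = score[i][j - 1] + gap;
      score[i][j] = Math.max(0, fromDiag, fromUp, fromLeft);
      if (score[i][j] > best) {
        best = score[i][j];
        bestI = i;
        bestJ = j;
      }
    }
  }

  // Trace back from the best cell until the score reaches zero.
  let i = bestI;
  let j = bestJ;
  let left = '';
  let right = '';
  while (i > 0 && j > 0 && score[i][j] > 0) {
    const substitution = a[i - 1] === b[j - 1] ? match : mismatch;
    if (score[i][j] === score[i - 1][j - 1] + substitution) {
      left = a[i - 1] + left;
      right = b[j - 1] + right;
      i--;
      j--;
    } else if (score[i][j] === score[i - 1][j] + gap) {
      left = a[i - 1] + left; // character only in a
      right = '-' + right;
      i--;
    } else {
      left = '-' + left;      // character only in b
      right = b[j - 1] + right;
      j--;
    }
  }
  return { left, right, score: best };
}

const result = smithWaterman('ABCDEFG', 'ABCDEFFG');
// result.left:  ABCDE-FG
// result.right: ABCDEFFG
```

Because this is a local alignment, unmatched text at the very start or end of either string is left out of the result. The matrix also needs memory proportional to the product of the two lengths, which is fine for a single speech.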

To re-construct the data, we iterate letter by letter, choosing timestamp tokens from the captions, and everything else from the transcript side, which gives us a result like this:

00:00:04 Thank you. Thank
00:06:15 you very much. I
00:07:27 have many friends to
00:07:33 thank tonight. I thank the voters who
00:07:37 supported me. I thank the gallant men who
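The reconstruction loop might look roughly like the sketch below. It assumes the alignment starts at the first transcript word, that `charToToken` maps each dictionary character back to its token, and that the original (un-normalized) transcript words are kept in a parallel array so punctuation survives. All names are hypothetical, and readable letters stand in for the real dictionary characters.

```javascript
const TIMESTAMP = /^\d{2}:\d{2}:\d{2}$/;

// Walk both aligned strings in step. Timestamps come from the caption side,
// words come from the transcript side; caption-only words are discarded.
function reconstruct(transcriptAligned, captionsAligned, transcriptWords, charToToken) {
  const lines = [];
  let current = null; // the line being built
  let next = 0;       // index of the next unused transcript word
  for (let i = 0; i < captionsAligned.length; i++) {
    const captionToken = charToToken[captionsAligned[i]];
    if (captionToken !== undefined && TIMESTAMP.test(captionToken)) {
      current = { time: captionToken, words: [] };
      lines.push(current);
    }
    if (transcriptAligned[i] !== '-') {
      const word = transcriptWords[next++];
      // Words before the first timestamp have no line yet and are skipped.
      if (current) current.words.push(word);
    }
  }
  return lines.map(({ time, words }) => `${time} ${words.join(' ')}`);
}

const charToToken = {
  A: 'read', B: 'my', C: 'lips', D: 'no', E: 'new', F: 'taxes',
  X: 'access', // a caption-only mistranscription of "taxes"
  T: '00:00:04', U: '00:00:40',
};
const rebuilt = reconstruct(
  '-ABC-DEF',  // transcript side of the alignment
  'TABCUDEX',  // caption side of the alignment
  ['Read', 'my', 'lips,', 'no', 'new', 'taxes.'],
  charToToken
);
// rebuilt: ['00:00:04 Read my lips,', '00:00:40 no new taxes.']
```

In the last column the caption heard “access”, but the transcript word “taxes.” wins, which is the correction we are after.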

There is one final improvement we can make – if a famous phrase spans two lines, it is difficult for a full-text search engine to find, e.g.:

00:00:04 Read my lips -
00:00:40 No new taxes.
00:01:05 Let me tell
00:01:30 you more about the mission.

When re-constructing the text, each line can look ahead, pulling a few words from the next. This ensures that entire phrases will generally be included on each line, and yields much better quality search results.

00:00:04 Read my lips - no new taxes
00:00:40 no new taxes. Let me tell
00:01:05 Let me tell you more about the mission.
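A simple version of this look-ahead is sketched here, assuming each line borrows a fixed number of words from the line after it. The `addLookAhead` name and the default of three words are illustrative assumptions, so the output differs slightly from the listing above:

```javascript
// Append the first few words of the following line to each line.
// Input and output are arrays of { time, text }.
function addLookAhead(lines, lookAhead = 3) {
  return lines.map((line, index) => {
    const nextLine = lines[index + 1];
    if (!nextLine) return { ...line }; // the last line has nothing to borrow
    const borrowed = nextLine.text.split(/\s+/).filter(Boolean).slice(0, lookAhead);
    return { time: line.time, text: [line.text, ...borrowed].join(' ') };
  });
}

const widened = addLookAhead([
  { time: '00:00:04', text: 'Read my lips -' },
  { time: '00:00:40', text: 'No new taxes.' },
  { time: '00:01:05', text: 'Let me tell' },
  { time: '00:01:30', text: 'you more about the mission.' },
]);
// widened[0].text: 'Read my lips - No new taxes.'
// widened[1].text: 'No new taxes. Let me tell'
```

The borrowed words keep the timestamp of the line they are appended to, so a hit on “Read my lips - No new taxes” still starts playback at the beginning of the phrase.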

The full code for this demonstration is available on GitHub:

https://github.com/garysieling/transcript-alignment.git