As part of a project, I thought it might be interesting to score lectures for how often the speaker says “um” (or similar).
An interesting realization here is that an automated transcription of a lecture is better for this purpose than manual closed captions or a written transcript, because those are edited down and tend to drop the filler words.
You need to tokenize whatever text you have:
from nltk import word_tokenize
tokens = word_tokenize(transcript)
Realistically, you only care when filler words are frequent, so the count is best combined with a threshold, or fed into a polynomial function that lowers a transcript's quality score as the problem gets more severe.
check = ["um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"]

def umsScore(tokens):
    # Count the tokens that match a known filler word, ignoring case.
    cnt = 0
    for t in tokens:
        if t.lower() in check:
            cnt = cnt + 1
    return cnt
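To make the threshold and polynomial ideas concrete, here is a minimal sketch of how the count could be turned into a penalty. The names `filler_rate` and `quality_penalty`, the 2% threshold, and the `scale` factor are all assumptions of mine for illustration, not tuned values; the sketch normalizes by transcript length so that long and short lectures are comparable, and it uses a plain `split()` in place of NLTK so it runs without extra downloads.

```python
# Hypothetical helper names and constants; adjust to your own data.
FILLERS = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}

def filler_rate(tokens):
    """Return the fraction of tokens that are filler words (0.0 for empty input)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in FILLERS)
    return hits / len(tokens)

def quality_penalty(rate, threshold=0.02, scale=400.0):
    """Quadratic penalty: zero at or below the threshold, capped at 1.0."""
    excess = max(0.0, rate - threshold)
    return min(1.0, scale * excess ** 2)

# Whitespace split stands in for a real tokenizer in this toy example.
tokens = "So um the uh gradient is um basically the slope".split()
rate = filler_rate(tokens)
print(rate, quality_penalty(rate))
```

A quadratic is the simplest polynomial that ignores an occasional "um" but punishes heavy use more than proportionally; the penalty could then be subtracted from whatever quality score you already compute.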