Scoring documents for quality in Python – how often does a speaker say “um”?

As part of a project, I thought it might be interesting to score lectures for how often the speaker says “um” (or similar).

An interesting realization here is that an automated transcription of a lecture is better for this purpose than manual closed captions or a written transcript, since those tend to edit the filler words out.

You need to tokenize whatever text you have:

from nltk import word_tokenize
tokens = word_tokenize(transcript)

Realistically, you only care if this happens frequently, so the count is best used with a threshold, or fed into a polynomial function that lowers a transcript's quality score as the filler words become more frequent.

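# Filler words to look for (a set, so lookups are fast and duplicates don't matter)
check = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}

def umsScore(tokens):
  # Count the tokens that match a filler word, ignoring case
  cnt = 0
  for t in tokens:
    if t.lower() in check:
      cnt = cnt + 1

  return cnt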
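
The threshold and polynomial ideas mentioned above could be sketched roughly like this. This is only one possible approach, not a tuned formula: the names filler_rate and quality_score, the 2% threshold, the 10% scale, and the choice of a quadratic are all assumptions made for illustration. It also assumes you want to divide the count by the number of tokens, so that long and short lectures are comparable.

```python
# Same filler words as in the list above
FILLERS = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}

def filler_rate(tokens):
    """Fraction of tokens that are filler words (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in FILLERS)
    return hits / len(tokens)

def quality_score(tokens, threshold=0.02, scale=0.10):
    """Hypothetical score: 1.0 up to the threshold, then a quadratic drop to 0.0.

    threshold and scale are made-up defaults; tune them on real transcripts.
    """
    rate = filler_rate(tokens)
    if rate <= threshold:
        return 1.0  # occasional fillers are not penalized
    # How far past the threshold we are, capped at 1.0
    excess = min((rate - threshold) / scale, 1.0)
    return 1.0 - excess ** 2

# A plain split() stands in for word_tokenize to keep the example self-contained
tokens = "So um the uh derivative is um basically the slope".split()
print(filler_rate(tokens))
print(quality_score(tokens))
```

The quadratic means a transcript just over the threshold loses very little, while one full of filler words drops quickly toward zero.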