If you import Google N-Grams data into Postgres, you can use this to compute TF-IDF measures on documents.
In my environment, I have talk transcripts stored in JSON files. In this example, I'll show how to measure the distance between these and a word list (e.g. "I", "me", "my", "myself", "mine").
import json

path = "transcripts/"  # placeholder: wherever your JSON transcript files live

def get_transcript(theFile):
    try:
        with open(path + theFile, encoding="utf8") as json_data:
            d = json.load(json_data)
        return d["transcript_s"]
    except (OSError, KeyError, ValueError) as e:
        print("Error reading " + theFile + ": " + str(e))
        return None
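To make the assumed file layout concrete, each JSON file is expected to carry the text under a "transcript_s" key (the field name comes from the code above). The file name below is just a placeholder:

# A transcript file is assumed to look roughly like:
#   {"transcript_s": "Thank you all for coming today ..."}
# "talk-001.json" is a placeholder file name.
text = get_transcript("talk-001.json")
print(text[:80] if text else "no transcript found")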
Once we have a transcript, we need to tokenize the text into words. The easiest way to do this is with NLTK, which offers a number of tokenizers to choose from.
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict

def get_tokens(text):
    # words, dollar amounts (e.g. $3.50), or any other run of non-whitespace
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    return tokenizer.tokenize(text)

def get_counts(tokens):
    counts = defaultdict(int)
    for curr in tokens:
        counts[curr] += 1
    return counts
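As a quick sanity check, here's what the tokenizer and counter produce on a made-up sentence:

# Made-up sentence, just to show the token shapes the regex produces.
sample = "I paid $3.50 for my coffee, and I liked it."
tokens = get_tokens(sample)
counts = get_counts(tokens)
print(tokens)        # ['I', 'paid', '$3.50', 'for', 'my', 'coffee', ',', ...]
print(counts["I"])   # 2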
Before we compute TF-IDF, we need to know how many documents (volumes) in the N-Grams dataset contain each word. Since the same tokens come up over and over, the important thing here is to memoize the results.
import psycopg2

seen_tokens = {}

def get_docs_with_token(token):
    # Memoize: repeated tokens shouldn't hit the database twice.
    if token in seen_tokens:
        return seen_tokens[token]
    conn = psycopg2.connect(
        "dbname='postgres' "
        "user='postgres' "
        "host='localhost' "
        "password='postgres'")
    cur = conn.cursor()
    # The n-grams data is split into one table per leading character.
    table = token[0].lower()
    cur.execute(
        "select volume_count from ngrams_" + table +
        " where year = 2008 and ngram = %s", (token,))
    rows = cur.fetchall()
    result = 0
    for row in rows:
        result = row[0]
    conn.close()
    seen_tokens[token] = result
    return result
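For reference, the query above assumes the n-grams data was loaded into one table per leading character, with at least ngram, year, and volume_count columns. Here's a minimal sketch of that layout; the table name and columns are inferred from the query, and your import schema may well differ (the raw Google files also carry a match count):

import psycopg2

# Sketch of the assumed layout for one letter's table.
conn = psycopg2.connect(
    "dbname='postgres' user='postgres' host='localhost' password='postgres'")
cur = conn.cursor()
cur.execute(
    "create table if not exists ngrams_a ("
    " ngram text,"
    " year int,"
    " volume_count bigint)")
conn.commit()
conn.close()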
Once we have this, we can define the TF-IDF function for one term in our search. Note that Python's math.log is a natural log by default (there is no "ln" like you might expect). There are some options here: you may wish to dampen the values ("Relevant Search" notes that Lucene takes the square root of values); a sketch of one dampened variant follows the function below.
Note also that we're using the "volumes" reported by Google N-Grams as the number of documents in the "full" set. I've hard-coded the total number of documents in that set, since there is no point querying for it, but if you wanted to re-run this computation for every year in the dataset, it would need to be an array or a SQL query.
import math

def tfidf_token(search_token, all_tokens, all_token_counts):
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272  # total volumes in the 2008 slice of the n-grams set
    tf = 1.0 * term_count / total_terms
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        return 0.0  # the corpus has never seen this token
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf
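As an aside on the dampening option mentioned above, here is a sketch of one possible variant that takes the square root of the term frequency. This is only an illustration of the idea, not Lucene's actual scoring formula:

import math

def tfidf_token_damped(search_token, all_tokens, all_token_counts):
    # Same shape as tfidf_token above, but with a square-root dampened tf.
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        return 0.0
    tf = math.sqrt(1.0 * term_count / total_terms)
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf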
Once we have this, it's a trivial exercise to get the score for each search term and sum them up:
def tfidf_search(search, file):
    transcript = get_transcript(file)
    all_tokens = get_tokens(transcript)
    all_token_counts = get_counts(all_tokens)
    vals = [tfidf_token(token, all_tokens, all_token_counts) for token in search]
    print(vals)
    score = sum(vals)
    print(score)
    return score
Once we've done this, all sorts of interesting possibilities open up. For instance, we can see how heavily each talk leans on first-person pronouns:
personal = ["I", "i", "Me", "me", "My", "my", "myself", "Myself"]

# `files` is a list of transcript file names under `path` (see the sketch below).
for file in files:
    tfidf_search(personal, file)
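For example, to score every transcript and list the most first-person-heavy talks first, you could drive the loop like this (a sketch, assuming path points at a directory of JSON transcript files):

import os

# Build the `files` list used above and rank the results by score.
files = [f for f in os.listdir(path) if f.endswith(".json")]
scores = {f: tfidf_search(personal, f) for f in files}
for f, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f, s)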