Google Ngrams is a dataset of word frequencies over time, based on Google Books1. There are files for different numbers of words (1, 2, 3, 4, and 5 word combinations). Unzipped, the 1-word files are just under 30 GB, which is large, but manageable on a developer workstation.
The files are partitioned by the first character of the word, with one row per word per year of publication. For each word and year you get the number of times the word appears (match_count) and the number of books it appears in (volume_count). Note these can be quite large numbers, so you need bigint.
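Each row in the raw files is tab-separated: the word, the year, match_count, then volume_count. For example (the values here are only illustrative):
circumvallate	1978	335	91
circumvallate	1979	261	91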
I think it makes sense to import each file into its own table, because then you can choose to query them separately to test the performance of whatever you want to do.
create table ngrams_0 (
ngram varchar,
year int,
match_count bigint,
volume_count bigint
);
create index on ngrams_0 (
ngram,
volume_count
);
create table ngrams_1 (
like ngrams_0
including indexes
);
Postgres has a built-in bulk importer, COPY, which seems to be quite fast.
psql -c "\COPY ngrams_o FROM 'z:\googlebooks-eng-all-1gram-20120701-o' DELIMITER E'\t' ENCODING 'utf-8'"
It doesn’t do any error handling – it just stops if it doesn’t know what to do. I explored pgloader and a Multicorn-based foreign data wrapper (file_fdw is a COPY wrapper) as alternatives, but found these were more difficult to configure on Windows than I wanted.
Instead, I found there was only one error condition preventing COPY from working: backslashes in the dataset. I didn’t really want those rows anyway, so I removed them ahead of time. This technique also works well to pre-filter the files to a single year, which makes prototyping a lot easier (a variant for that is sketched after the loop below).
for f in googlebooks*
do
echo filtered.$f
grep -v '\\' $f > filtered.$f
done
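To prototype against a single year, a variant of the same loop can drop rows for other years as well (a sketch: 2008 is just an example year, and awk keeps rows whose second field is that year while still excluding backslashes):
for f in googlebooks*
do
echo filtered.$f
awk -F'\t' '$2 == 2008 && $0 !~ /\\/' "$f" > "filtered.$f"
done
Each filtered file can then be loaded with the same \COPY command in a loop (again a sketch: it assumes the table name matches the single-character partition suffix at the end of each file name, e.g. ngrams_o for the -o file):
for f in filtered.googlebooks*
do
psql -c "\COPY ngrams_${f##*-} FROM '$f' DELIMITER E'\t' ENCODING 'utf-8'"
done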
Once you do this, you may want a view across all the tables:
create view all_ngrams as
select * from ngrams_0 union all
select * from ngrams_1 union all
...;
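As a quick sanity check on the view, you can look at a single word over time (the word here is just an example):
select year, match_count, volume_count
from all_ngrams
where ngram = 'postgres'
order by year;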
Once you have this, you can do interesting queries, like TF-IDF2:
with words as (
  select doc_id, word, cnt, sum(cnt) over (partition by doc_id) wc
  from (
    select 'error' as word, 10 as cnt, 1 as doc_id
    union all
    select 'an' as word, 5 as cnt, 1 as doc_id
    union all
    select 'occurred' as word, 7 as cnt, 1 as doc_id
  ) wc
)
select *, sum(tf_idf) over () score
from (
  select doc_id, word, cnt, tf, volume_count, idf, tf * idf tf_idf
  from (
    select
      doc_id,
      words.word,
      words.cnt,
      words.cnt::float / words.wc tf,
      aw.volume_count,
      ln(206272::float / aw.volume_count) idf -- max(volume_count)
    from words
    left join all_ngrams aw on aw.ngram = words.word
  ) scores
) total_score;
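Here tf is a word’s count divided by the total count for its document, idf is the natural log of 206,272 (the largest volume_count in the dataset, per the comment) divided by the number of volumes containing the word, and the per-word tf*idf values are summed into an overall score for the document.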