Recent years have produced many promising data sets and algorithms for machine learning. New techniques like deep learning require significant computational power, often beyond what you may have on your desk. These bleeding-edge tools frequently have specific dependencies, and maintaining a development environment for them can require significant effort.
One such technique is word2vec — this family of machine learning algorithms learns semantic information about words from how they are used in context. The algorithm tracks the terms that commonly occur around a given word, producing a vector that represents its co-occurrence patterns. These vectors commonly have 50, 100, or 300 dimensions — the size is a configuration parameter of the algorithm, chosen as a trade-off between accuracy and computing power.
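The co-occurrence step can be illustrated with a short sketch. This is a simplified illustration, not gensim’s actual implementation — the corpus and window size here are invented:

```python
from collections import Counter, defaultdict

# Toy corpus; real word2vec training uses billions of tokens
corpus = "the king rules the land and the queen rules the land".split()

window = 2  # how many neighbors on each side to count
cooccurrence = defaultdict(Counter)

for i, word in enumerate(corpus):
    # Count every term within `window` positions of `word`
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooccurrence[word][corpus[j]] += 1

# Each word now maps to counts of its neighbors -- a raw
# co-occurrence vector over the vocabulary
print(cooccurrence["king"])
```

In a real corpus these raw counts form very high-dimensional, sparse vectors, which is why word2vec compresses them down to a few hundred dense dimensions.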
Given these co-occurrence vectors, word2vec then “learns” mathematical relationships between words, producing a new set of vectors where terms are located near each other if they are used in the same way.
Since terms are represented as vectors, you can add and subtract them. This algorithm is famous for identifying implicit relationships such as gender — e.g. in one of the papers on the subject, the researchers found that for their dataset “king-man+woman =~ queen.”
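That arithmetic can be demonstrated with hand-made toy vectors. These 3-dimensional values are invented for illustration — real embeddings have hundreds of dimensions learned from data:

```python
import math

# Invented toy vectors: the dimensions loosely encode "royalty",
# "gender", and a filler feature -- not real word2vec output
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.1],
    "man":    [0.1, 0.9, 0.2],
    "woman":  [0.1, 0.1, 0.2],
    "throne": [0.8, 0.5, 0.1],
    "apple":  [0.0, 0.5, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product scaled by vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, component by component
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining vocabulary by similarity to the result
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

The subtraction removes the “male” component from king, the addition restores a “female” one, and the nearest remaining vector is queen — the same ranking idea gensim’s most_similar query performs over the real model.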
Training a model like this can take days or weeks. Fortunately, there is a word2vec model pre-trained on all of Wikipedia. There is a simple Python library (gensim) for working with this model, but it requires a lot of RAM — I found that it needs a 32 GB cloud instance, more than I have available at home.
Typically you might explore a model like this with a Jupyter notebook, which allows you to write and test Python code in your browser or a console.
To make this work, you’ll need to create a 32 GB VM and provision it with Ubuntu 17.04. Then run a few commands:
apt-get update && apt-get upgrade -y
apt-get install -y git python3 python3-pip jupyter-notebook unzip
pip3 install gensim
The file for the machine learning model is distributed through BitTorrent, so you’ll also need to install rtorrent:
apt-get -y install rtorrent
rtorrent https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent
To exit rtorrent, press “ctrl-q.”
Once the download completes, extract the archive:
tar xvf en_1000_no_stem.tar.gz
You should also set up ssh keys. On your local machine, generate a key pair and print the public half:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub
Create the .ssh folder on the remote machine and paste your public key into an authorized_keys file:
mkdir ~/.ssh
vi ~/.ssh/authorized_keys
Once these steps are complete, start Jupyter on the remote machine:
jupyter-notebook --no-browser --port=8889
To connect your local machine to the server, run the following command locally to forward local port 8888 to port 8889 on the virtual machine:
ssh -N -f -L localhost:8888:localhost:8889 root@172.104.12.63
Once the tunnel is established, you’ll have access to Jupyter running on the remote machine as if you were running it locally. To access it, go to localhost:8888 in your browser.
This Jupyter instance will only be able to access the file system of the remote machine. If you need to load additional data files, they must be uploaded with scp.
Now that Jupyter is available, you can run queries against the model. Note that loading it can take a couple of minutes. Jupyter will show the code as running by placing an asterisk to the left of the code block, but you may also find it interesting to monitor the VM separately using “top”.
from gensim.models import Word2Vec
# Loading the pre-trained model takes a few minutes
model = Word2Vec.load("en_1000_no_stem/en.model")
# Cosine similarity between the vectors for two terms
model.similarity('woman', 'man')
From here, you can explore the model and do your development work as needed.
I use virtual machines to obtain more RAM on an hourly basis, which allows me to postpone buying or building a new machine. This has been a very reliable setup — you can tear down and rebuild the machine until you get it right, which forces you to document what you did.