Entries Tagged 'Data Mining' ↓

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used:
Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, phantomjs

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.

This task splits into three pieces. I run a separate server for each – I’m not sure whether the Node.js community has a preferred architecture, but this feels like a natural fit.

  • Queue server
  • Static file share for PDFs
  • Saving results

There is a fourth server; a Python server which serves static Javascript pages that coordinate the work, but outside of development this would be run as a console application with PhantomJS.

The Node.js servers all run as virtual machines on my developer workstation, configured using Vagrant and Virtualbox – these could easily be moved onto separate hosts. The virtual machines communicate with fixed IPs, and each is assigned a different port on the same physical host, so that any laptop on the network can join in the fun.

Once each of the simple servers is running, the coordination code can be written in a browser, which lets you work using the Chrome developer tools. This comes at a cost – you have to configure all the servers to send Access-Control-Allow-Origin, to allow cross-domain host. This is a nuisance that would not be present in other architectures. One alternative is to put everything behind a proxy like Squid, but I haven’t tried this yet.

The PDFs are an extract from PACER (court cases), stored on a NAS, about 1.2 MM PDFs, ½ TB of storage. They are randomly distributed among numbered folders, each two levels deep, and two digits of hex: e.g. \0f\2a. Without this innovation there would be one folder with many files in it, which is slow to traverse on NTFS and impossible on ext4, the default on my NAS (ext4 lets you have 64,000 files / directory). As an added benefit, this structure partitions the data evenly, as the partitions are generated at random.

pacer

I generated a list of just the PDFs in this data set, which a Javascript replacement for ls that I wrote, which took several hours. This list informs the queueing server which files need to be process. All it does is print out the next document to parse:

var http = require('http'),
    fs = require('fs'),
    lazy = require("lazy"),
 
var files = [];
var lines =
     new lazy(fs.createReadStream('files.txt'))
       .lines
       .map(function (line) {
	files[files.length] = line.toString();
});
 
var lineNum = 0;
 
http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Access-Control-Allow-Origin' : '*'
  });
 
  res.write(files[lineNum]);
  lineNum++;
  res.end('\n');
}).listen(80);

This document can be then be retrieved from a second server, which uses a library called node-static to serve files:

var static = require('node-static');
 
var headers = {
    'Access-Control-Allow-Origin' : '*'
};
 
var file =
  new(static.Server)
     ('/vagrant/',
     {'headers': headers});
 
require('http').createServer(
  function (request, response) {
    request.addListener('end',
      function () {
        file.serve(request, response,
          function(req, res) {
            file.serve(request, response);
        });
    }).resume();
}).listen(80);

Finally, a third server saves the results in any format needs (errors, json, text output):

var http = require('http'),
      fs = require('fs'),
      qs = require('querystring');
 
http.createServer(function (req, res) {
  var body = '';
  req.on('data', function (data) {
    body += data;
  });
 
  req.on('end', function() {
    res.writeHead(200, {
      'Content-Type': 'text/plain',
      'Access-Control-Allow-Origin' : '*'
    });
 
    var loc = data.loc || '';
    var key = data.key || '';
    var ext = data.ext || '';
    var data = qs.parse(body);
 
    var filename = '/vagrant_data/' + loc + '/' + key + '.' + ext;
 
    fs.writeFile(filename, data.data || '', function (err) {
      if (err) throw err;
    });
 
    res.write(filename);
    res.end('\n');
  });
}).listen(80);

Note that what port a server listens on is unrelated to the port you connect on – when Vagrant builds the virtual machine, it lets you set port forwarding.

Once the above servers are configured, any machine that joins your network can participate – conceivably enabling old laptops to become an inefficient server farm. The queueing server could distribute packets of work in JSON instead of urls, which would support heterogeneous work payloads (text extraction + map + reduce work).

The processing code does all the interesting work – opening a PDF, extracting text, building full text data with lunr.js, and saving processed results to the server. I like to use a browser to test this so that it’s easy to debug, but the intent is to run within phantomjs. The code below communicates with pdf.js to extract text:

var queueUrl = 'http://192.168.11.37:8002/';
 
xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", queueUrl, false);
xmlhttp.send();
 
var pdfName = xmlhttp.responseText;
var pdfUrl = 'http://192.168.11.37:8001/' + pdfName;
var pdf = PDFJS.getDocument(pdfUrl);
 
var data = '';
 
pdf.errbacks[0] =
function() {
  document.location.reload(true);
};
 
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
 
          // Defect here - sometimes the output has words
          // concatenated where they shouldn't be. But, if
          // you just add spaces you'll get spaces within 
          // words.
          for (var i = 0; i < text.bidiTexts.length; i++) {
            data += text.bidiTexts[i].str;
          }
 
	  data += ' ';
 
          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
             ... output processing goes here ...
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});

And, finally, the code which saves results back to the server:

var loc = pdfName.substr(0, pdfName.lastIndexOf('/'));
loc = 'data/'; // can configure where final data is stored
 
var key = pdfName.substr(0, pdfName.lastIndexOf('.'))
                 .substr(pdfName.lastIndexOf('/') + 1) +
                 '.text-rendition';
 
$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: key, ext: 'txt', data: data }
).done(
  function() {
   document.location.reload(true);
 }
);

This can then be put into a full text index, and stored, to be combine at at a later date:

var index = lunr(function () {
    this.field('text');
    this.ref('id');
});
 
index.add({
  text: data,
  id: pdfName
});
 
var serializedIndex =
  JSON.stringify(index.toJSON());
 
$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: pdfName + '-idx', ext: 'json', data: serializedIndex }
).done(
  function() {
   document.location.reload(true);
 }
);

In my initial test this was really slow – only 120 PDFs per minute. For a 1.2mm PDF data set – this will take about ten days. For 20 years of U.S. litigation this would take four months, without any additional processing. Surprisingly this is a CPU-bound operation; unsurprisingly that generates a ton of heat. Cursory inspection suggests the biggest performance gains would come from improving the NAS connection and caching of pdf.js libraries.

From a developer perspective, it is challenging to identify which code is synchronous and which isn’t, creating many subtle bugs. Each Javascript library models this a bit differently, and Javascript’s lack of typing information makes the problem worse.

Error handling is also difficult: not only do these need to be logged, but you need to be able to re-run the ingestion process on a subset of the entire document set. Currently this isn’t handled, but should be cautionary to anyone reproducing this.

Operationally it’s easy to add machines: almost all the work is done on the front-end. You can typically open several dozen processes on a machine by simply opening console windows or browser tabs. A system like this ought to be more introspective and watch the load on the host and respond accordingly.

If you enjoyed this, you may also be interested in these articles:

My First A/B Test… with Results

A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast to algorithms that aim to analyze data and tell a story, for instance applying agent-based modeling to political science and the study of war and peace.

Testing two variations of a site to see how people react turns out to be extremely difficult to tinker with on a blog. You need to have the right sort of problems to justify using statistics, and it’s challenging to create those problems to happen to justify the experiment. From another angle, I’ve long been leery of using split testing at all as keeping WordPress stable has been a real pain so I prefer to avoid additional operational complexity.

Enter Adzerk, which provides a hosted ad server, removing the need for additional infrastructure. The function of an ad server is to let you upload media (pictures, etc) and set up business rules for when to display each “ad”, effectively letting you run something resembling an Adsense clone, minus all the AI. Adzerk has a nice free plan, which covers you up to a ridiculously high number of impressions, so it’s not really necessary. I’ve been really happy with the site, although from their perspective I’m likely a terrible customer, as they’re not making any money off me.

The logical products to promote on a programming site seem like jobs, books, and developer tools – right now I’m just running campaigns on a couple sites consisting of links to Amazon pages. Here’s a screenshot a campaign set up very recently:

Screen Shot 2013-05-22 at 9.36.38 PM

Once you get enough impressions to compare, you can just turn off each entry and make new ones for the next test. There’s no particular reason you have to use traditional display advertising with this – as cheap as it is, one could easily build a very dynamic site using their API, for instance to pump in hot news, thumbnails for suggested stories, etc.

I’m running campaigns on a couple different sites – After a slow build-up in readership, and a burst of Hacker News traffic, I was able to finally achieve statistically significant results for clickthroughs on a test of four Javascript books:

TitleImpressionsClickthrough %Clicks
Javascript the Good Parts98010.220
Javascript Patterns100920.2222
Ext CookBook99440.1111
ExtJS in Action100600.4343

 

To test this, I made a table of each combination of values, comparing them using the chi-squared test on Evan’s Awesome A/B testing tools. Fortunately this showed ExtJS in Action to be a clear winner over all the rest as I hoped- one risk of this technique is that there is a clear loser or a couple equivalent winners, where I was hoping for one to be the best.

xno differenceno differenceloses
no differencexno differenceloses
no differenceno differencexloses
winswinswinsx

 

In all, this took around nine months to achieve, but with little effort on my part. The lesson I take is this: while this is fun, it may be better to look for big wins elsewhere- especially borrowing ideas from people who already run these tests. Additionally, the tests would complete much faster running only two variations, and many can run in parallel.

Full-Text Indexing PDFs in Javascript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information“). Most court records are PDFS available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs that represent 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them.

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.

To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for pdf.js is instructive, in that the Mozilla developers use browser features that aren’t in common use. Web Workers, for instance, let you set up background processing threads.

The pdf.js APIS make heavy use of Promises, which hold references in code to operations that haven’t completed yet. You operate on them using callbacks:

var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 // this code is called once the PDF is ready
});

This API seems immature yet- ideally you should be able to do promise.then(f(x)).then(g(x)).then(h(x)) etc, but that isn’t yet available.

For rendering PDFs the Promise pattern makes a lot of sense, as it leaves room for parallelizing the rendering process. For merely extracting the text from a PDF it feels like a lot of work – you need to be confident that your callbacks run in order and track which one is last.

The following demonstrates how to extract all the PDF text, which is then printed to the browser console log:

‘use strict’;
var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
          for (var i = 0; i < text.bidiTexts.length; i++) {
            str += text.bidiTexts[i].str;
          }
 
          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
            // later this will insert into an index
            console.log(str);
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});

It’s not trivial to identify where headings and images are. This would require hooking into the rendering code, and possibly a deep understanding of PDF commands (PDFs appear to be represented as stream of rendering commands, similar to RTF).

Lunr
Creating a Lunr index and adding text is straightforward- all the APIs operate on JSON bean objects, which is a pleasantly simple API:

doc1 = {
    id: 1,
    title: 'Foo',
    body: 'Foo foo foo!'
  };
 
doc2 = {
    id: 2,
    title: 'Bar',
    body: 'Bar bar bar!'
  }
 
doc3 = {
    id: 3,
    title: 'gary',
    body: 'Foo Bar bar bar!'
  }
 
index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
  })
 
// Add documents to the index
index.add(doc1)
index.add(doc2)
index.add(doc3)

Searching is simple – one neat tidbit I found is that you can inspect the index easily, since it’s just a JS object:

// Run a search
index.search(“foo”)
 
// Inspect the actual index to see which docs match a term
index2.tokenStore.root.f.o.o.docs

When I was first introduced to full-text indexing, I was confused by what is meant by a “document” – this generalizes beyond a PDF or Office document to any database row, possibly including large blobs of text.

Full-text search would be pretty dumb if you had to build the index every time, and Lunr makes it really easy to serialize and deserialize the index itself:

var serializedIndex = JSON.stringify(index1.toJSON())
var deserializedIndex = JSON.parse(serializedIndex)
var index2 = lunr.Index.load(deserializedIndex)

Index.toJSON also returns a “bean” style object (not a string). I’ve never seen an API like this, and I really like the idea – it gives you a clean Javascript object with only the data that requires serialization.

The following are attributes of the index:

  • corpusTokens – Sorted list of tokens
  • documentStore – list of each document – catenate
  • fields – The fields used to describe each document (similar to database columns)
  • pipeline – The pipeline object used to process tokens
  • tokenStore – Where and how often words are referenced in each document

One great thing about this type of index is that the work can be done in parallel and then combined as a map-reduce job. Only three entries from the above object need to be combined, as “fields” and “pipeline” are static. The following demonstrates the implementation of the reduction step (note jQuery is referenced):

(function reduce(a, b) {
  var j1 = a.toJSON();
  var j2 = b.toJSON();
 
  // The "unique" function does uniqueness by sorting,
  // which we need here.
  var corpusTokens =
      $.unique(
          $.merge(
              $.merge([], j1.corpusTokens),
                           j2.corpusTokens));
 
  // It's important to create new arrays and
  // objects throughout, or else you modify 
  // the source indexes, which is disastrous.
  var documentStore =
     {store: $.extend({},
                      j1.documentStore.store,
                      j2.documentStore.store),
      length: j1.documentStore.length + j2.documentStore.length};
 
  var jt1 = j1.tokenStore;
  var jt2 = j2.tokenStore;
 
  // The 'true' here triggers a deep copy
  var tokenStore = {
    root: $.extend(true, {}, jt1.root, jt2.root),
    length: jt1.length + jt2.length
  };
 
  return {version: j1.version,
          fields: $.merge([], j1.fields),
          ref: j1.ref,
          documentStore: documentStore,
          tokenStore: tokenStore,
          corpusTokens: corpusTokens,
          pipeline: $.merge([], j1.pipeline)};
})(index1, index2)

I tested this by creating three indexes: index1, index2, and index3. index1 is {doc1}, index2 is {doc2, doc3}, and index3 is {doc1, doc2, doc3}. To test the code, you need simply diff:

JSON.stringify(index3.toJSON())
 
JSON.stringify(combine(index1, index2))

Possibilities

Overall this technique has a lot of wasted network I/O, making this seem silly. On the other hand, there are listings on ebay and fiverr selling for “traffic”, which typically comes from pop-unders, botnets, hidden iframes, etc. You can easily find listings like “20,000 hits for $3”, and less in bulk. This is typically cheap because it has little commercial value other than perpetrating various forms of fraud.

You’d need a cheap VM with loads of bandwidth to use as a proxy, as well as publically available data – you couldn’t use this as a scraping technique due to browser protections against cross-domain requests. You’d also need to generate unique document IDs in a unique fashion, perhaps using the original URL.

If a traffic source runs on modern browsers, one could use this as a source of potentially cheap and unlimited processing power, even for point of combining the indexes, although provisions must be made for the natural instability of the system.

If you enjoyed this, you might also enjoy the following:

Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import java.io._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations
 
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()
 
  def characters(ch: Array[Char], start: Int, length: Int) {
    contents.append(new String(ch))
  }
 
  def endDocument() {
  }
 
  def endElement(uri: String, localName: String, qName: String) {
  }
 
  def endPrefixMapping(prefix: String) {
  }
 
  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {
  }
 
  def processingInstruction(target: String, data: String) {
  }
 
  def setDocumentLocator(locator: Locator) {
  }
 
  def skippedEntity(name: String) {
  }
 
  def startDocument() {
  }
 
  def startElement(uri: String, localName: String, qName: String, atts: Attributes) {
  }
 
  def startPrefixMapping(prefix: String, uri: String) {
  }
}
 
object pdf extends App {
  val file = """e:\data\11-1285_i4dk.pdf"""
 
  val pdf: PDFParser = new PDFParser();
 
  val stream: InputStream = new FileInputStream(file)
  val handler: ContentHandler = pdfHandler
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()
 
  pdf.parse(stream,
    handler,
    metadata,
    context)
 
  stream.close()
 
  val contents: String = pdfHandler.contents.toString()
  println(contents)
 
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz"
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"
 
  val serializedClassifier = src + classifier1
 
  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)
 
  var words = 0
  for (i <- 0 to out.size() - 1) {
    val sentence = out.get(i)
 
    var foundWord = ""
    var oldWordClass = ""
 
    for (j <- 0 to sentence.size() - 1) {
      val word = sentence.get(j)
      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + ""
 
      if (!oldWordClass.equals(wordClass)) {
        if (!oldWordClass.equals("O") && !oldWordClass.equals("")) {
          print("[/" + oldWordClass + "]")
        }
      }
 
      if (!wordClass.equals("O") && !wordClass.equals("")) {
        if (!oldWordClass.equals(wordClass)) {
          print("[" + wordClass + "]")
        }
      }
 
      oldWordClass = wordClass
 
      words = words + 1
      print(word);
      print(" ");
 
      if (words > 10) {
        words = 0
        println(" ")
      }
    }
  }
}
11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-
Slip Opinion -RRB- OCTOBER TERM ,
2012 Syllabus NOTE : Where it
is feasible , a syllabus -LRB-
headnote -RRB- will be released ,
as isbeing done in connection with
this case , at the time
the opinion is issued . The
syllabus constitutes no part of the
opinion of the Court but has
beenprepared by the Reporter of Decisions
for the convenience of the reader
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit
Timber & Lumber Co. [/ORGANIZATION], 200
U. S. 321 , 337 .
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,
IN ITS CAPACITY AS FIDUCIARY AND
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS
, INC. . EMPLOYEE BENEFITS PLAN
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF
APPEALS FOR THE THIRD CIRCUIT No.
11 -- 1285 . Argued November
27 , 2012 -- Decided April
16 , 2013 The health benefits
plan established by petitioner [ORGANIZATION]US Airways
[/ORGANIZATION]paid $ 66,866 in medical expenses
for injuries suffered by respondentMcCutchen ,
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in
a car accident caused by athird
party . The plan entitled [ORGANIZATION]US
Airways [/ORGANIZATION]to reimbursement if
[PERSON]McCutchen [/PERSON]

Implementing k-means in Scala

To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two – the exact points used are in the code below.

This implements Lloyd’s algorithm, which tries to cluster points in iterations in a simple manner:

1. Assume a certain number of clusters
2. Group the points at random
3. Compute the center of each cluster
4. For each point, compute which cluster is closest
5. Move all the points into new groupings
6. Repeat 3-5 a few times, until you’re happy with the results

I like how the functional programming style forces you to recreate all the data structures, in this case. It might be tempting to implement this in an imperative style, modifying data structures in place, but since steps 4-5 require separate data, you are protected against making it more difficult. You can see the full source below, or on github.

Since this example is fairly contrived, this converges pretty quickly:

Initial State:
  Cluster 0
  Mean: (17.83517750970944, 12.242720407317105)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (11.9096128931784, 21.1855674228972))
    (22.4668345067162, 8.9705504626857))
    (7.91362116378194, 21.325928219919))
    (22.636600400773, 2.46561420928429))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (25.1439536911272, 3.58469981317611))
    (23.5359486724204, 4.07290025106778))
    (11.7493214262468, 17.8517235677469))
    (12.4277617893575, 19.4887691804508))
    (11.931275122466, 18.0462702532436))
    (25.4645673159779, 7.54703465191098))
    (21.8031183153743, 5.69297814349064))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (16.95249500233747, 12.848199048608048)
    (11.7265904596619, 16.9636039793709))
    (10.7751248849735, 22.1517666115673))
    (23.6587920739353, 3.35476798095758))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (7.03171204763376, 19.1985058633283))
    (23.7722765903534, 3.74873642284525))
    (10.259545802461, 23.4515683763173))
    (28.1587146197594, 3.70625885635717))
    (10.1057940183815, 18.7332929859685))
    (8.90149362263775, 19.6314465074203))
    (12.4353462881232, 19.6310467981989))
    (24.3793349065557, 4.59761596097384))
    (22.5447925324242, 2.99485404382734))
    (26.8942422516129, 5.02646862012427))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (27.0339042858296, 4.4151109960116))
    (11.0118378554584, 20.9773232834654))
Iteration: 0
  Cluster 0
  Mean: (23.781370272978315, 5.754127202865132)
    (11.7265904596619, 16.9636039793709))
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.296576237184727, 20.09138475584863)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 1
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 2
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 3
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 4
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 5
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
class Point(dx: Double, dy: Double) {
  val x: Double = dx
  val y: Double = dy
 
  override def toString(): String = {
    "(" + x + ", " + y + ")"
  }
 
  def dist(p: Point): Double = {
    return x * p.x + y * p.y
  }
}
 
object kmeans extends App {
  val k: Int = 2
 
  // Correct answers to centers are (10, 20) and (25, 5)
  val points: List[Point] = List(
    new Point(10.8348626966492, 18.7800980127523),
    new Point(10.259545802461, 23.4515683763173),
    new Point(11.7396623802245, 17.7026240456956),
    new Point(12.4277617893575, 19.4887691804508),
    new Point(10.1057940183815, 18.7332929859685),
    new Point(11.0118378554584, 20.9773232834654),
    new Point(7.03171204763376, 19.1985058633283),
    new Point(6.56491029696013, 21.5098251711267),
    new Point(10.7751248849735, 22.1517666115673),
    new Point(8.90149362263775, 19.6314465074203),
    new Point(11.931275122466, 18.0462702532436),
    new Point(11.7265904596619, 16.9636039793709),
    new Point(11.7493214262468, 17.8517235677469),
    new Point(12.4353462881232, 19.6310467981989),
    new Point(13.0838514816799, 20.3398794353494),
    new Point(7.7875624720831, 20.1569764307574),
    new Point(11.9096128931784, 21.1855674228972),
    new Point(8.87507602702847, 21.4823134390704),
    new Point(7.91362116378194, 21.325928219919),
    new Point(26.4748241341303, 9.25128245838802),
    new Point(26.2100410238973, 5.06220487544192),
    new Point(28.1587146197594, 3.70625885635717),
    new Point(26.8942422516129, 5.02646862012427),
    new Point(23.7770902983858, 7.19445492687232),
    new Point(23.6587920739353, 3.35476798095758),
    new Point(23.7722765903534, 3.74873642284525),
    new Point(23.9177161897547, 8.1377950229489),
    new Point(22.4668345067162, 8.9705504626857),
    new Point(24.5349708443852, 5.00561881333415),
    new Point(24.3793349065557, 4.59761596097384),
    new Point(27.0339042858296, 4.4151109960116),
    new Point(21.8031183153743, 5.69297814349064),
    new Point(22.636600400773, 2.46561420928429),
    new Point(25.1439536911272, 3.58469981317611),
    new Point(21.4930923464916, 3.28999356823389),
    new Point(23.5359486724204, 4.07290025106778),
    new Point(22.5447925324242, 2.99485404382734),
    new Point(25.4645673159779, 7.54703465191098)).sortBy(
      p => (p.x + " " + p.y).hashCode())
 
  def clusterMean(points: List[Point]): Point = {
    val cumulative = points.reduceLeft((a: Point, b: Point) => new Point(a.x + b.x, a.y + b.y))
 
    return new Point(cumulative.x / points.length, cumulative.y / points.length)
  }
 
  def render(points: Map[Int, List[Point]]) {
    for (clusterNumber <- points.keys.toSeq.sorted) {
      System.out.println("  Cluster " + clusterNumber)
 
      val meanPoint = clusterMean(points(clusterNumber))
      System.out.println("  Mean: " + meanPoint)
 
      for (j <- 0 to points(clusterNumber).length - 1) {
        System.out.println("    " + points(clusterNumber)(j) + ")")
      }
 
      System.out.println("")
    }
  }
 
  val clusters =
    points.zipWithIndex.groupBy(
      x => x._2 % k) transform (
        (i: Int, p: List[(Point, Int)]) => for (x <- p) yield x._1)
 
  System.out.println("Initial State: ")
  render(clusters)
 
  def iterate(clusters: Map[Int, List[Point]]): Map[Int, List[Point]] = {
    val unzippedClusters =
      (clusters: Iterator[(Point, Int)]) => clusters.map(cluster => cluster._1)
 
    // find cluster means
    val means =
      (clusters: Map[Int, List[Point]]) =>
        for (clusterIndex <- clusters.keys)
          yield clusterMean(clusters(clusterIndex))
 
    // find the closest index
    def closest(p: Point, means: Iterable[Point]): Int = {
      val distances = for (center <- means) yield p.dist(center)
      return distances.zipWithIndex.min._2
    }
 
    // assignment step
    val newClusters =
      points.groupBy(
        (p: Point) => closest(p, means(clusters)))
 
    render(newClusters)
 
    return newClusters
  }
 
  var clusterToTest = clusters
  for (i <- 0 to 5) {
    System.out.println("Iteration: " + i)
    clusterToTest = iterate(clusterToTest)
  }
}

Extracting Social Media Vote Counts for Reddit, Twitter, Google+ and Hacker News

Ever wonder if your blog posts have been submitted to sites like Reddit or Hacker News, who submitted them, and how well they did? All this data is available through JSON APIs, official or not.

This code example collects the submission statistics for a WordPress blog entry – Twitter and Google+ posts, Hacker News posts/votes, and Reddit posts/upvotes/downvotes.

The output is a CSV file. Since stories can be submitted to HN and reddit multiple times, this duplicates the URL rows in the output.

You can get the source on Github here.

social-media

 

&lt;?php function google_plus_count($url) {     $curl = curl_init();     curl_setopt($curl, CURLOPT_URL, "https://clients6.google.com/rpc");     curl_setopt($curl, CURLOPT_POST, 1);     curl_setopt($curl, CURLOPT_POSTFIELDS, '[{"method":"pos.plusones.get","id":"p","params":{"nolog":true,"id":"' . $url . '","source":"widget","userId":"@viewer","groupId":"@self"},"jsonrpc":"2.0","key":"p","apiVersion":"v1"}]');     curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);     curl_setopt($curl, CURLOPT_HTTPHEADER, array(         'Content-type: application/json'     ));     $curl_results = curl_exec($curl);     curl_close($curl);     $json = json_decode($curl_results, true);     $val  = $json[0]['result']['metadata']['globalCounts']['count'];     if ($val) {         return $val;     } else {         return 0;     }     ;      } function twitter_count($url) {     $twitter_url = "http://urls.api.twitter.com/1/urls/count.json?url=" . $url;     $curl        = curl_init();     curl_setopt($curl, CURLOPT_URL, $twitter_url);     curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);     curl_setopt($curl, CURLOPT_HTTPHEADER, array(         'Content-type: application/json'     ));     $curl_results = curl_exec($curl);     curl_close($curl);     $json = json_decode($curl_results, true);     $val  = $json['count'];     if ($val) {         return $val;     } else {         return 0;     }     ; } function hn_count($url) {     $hn_url = 'http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][url][]=' . $url;     $curl   = curl_init();     curl_setopt($curl, CURLOPT_URL, $hn_url);     curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);     curl_setopt($curl, CURLOPT_HTTPHEADER, array(         'Content-type: application/json'     ));     $curl_results = curl_exec($curl);     curl_close($curl);     $json = json_decode($curl_results, true);          $i = 0;     if ($json['results']) {         $json = $json['results'];                  foreach ($json as $item) {             $item_val = $item['item'];             if ($i &gt; 0) {
                echo "\n$url,,,,";
            }
 
            echo $item_val['points'] . "," . $item_val['num_comments'] . "," . $item_val['username'] . ",";
            $i++;
        }
    }
    if ($i == 0) { // never submitted
        echo ",,,";
    }
    // points num_comments username
}
 
$sql = "select comment_count, post_name from wp_posts where post_type = 'post' and post_status='publish';";
 
$hostname = "127.0.0.1";
$username = "";
$password = "";
$db       = "";
 
$reddit_name = "";
$reddit_pass = "";
 
$dbhandle = mysql_connect($hostname, $username, $password);
$selected = mysql_select_db($db, $dbhandle);
 
$result = mysql_query($sql);
 
require_once("reddit.php");
$reddit = new reddit($reddit_name, $reddit_pass);
 
echo "SLUG,COMMENTS,GOOGLE_PLUS,TWITTER,HN_POINTS,HN_COMMENTS,HN_USER,SUBREDDIT,REDDIT_LIKES,REDDIT_SCORE,REDDIT_DOWNS,REDDIT_UPS,REDDIT_NUM_COMMENTS,REDDIT_SUBMISSION,REDDIT_AUTHOR,HN\n";
while ($row = mysql_fetch_array($result)) {
    $url = "http://garysieling.com/blog/" . $row{'post_name'};
 
    echo $url . "," . $row{'comment_count'} . ",";
    echo google_plus_count($url) . "," . twitter_count($url) . ",";
    hn_count($url);
 
    $pageInfo = $reddit-&gt;getPageInfo($url);
    $i        = 0;
    if ($pageInfo and $pageInfo-&gt;data and $pageInfo-&gt;data-&gt;children) {
        $children = $pageInfo-&gt;data-&gt;children;
 
        foreach ($pageInfo-&gt;data-&gt;children as $submission) {
 
            $submission   = $submission-&gt;data;
            $subreddit    = $submission-&gt;subreddit;
            $likes        = $submission-&gt;likes;
            $score        = $submission-&gt;score;
            $downs        = $submission-&gt;downs;
            $ups          = $submission-&gt;ups;
            $num_comments = $submission-&gt;num_comments;
            $num_reports  = $submission-&gt;num_reports;
            $author       = $submission-&gt;author;
 
            if ($i &gt; 0) {
                echo "\n" . $url . "," . $row{'comment_count'};
                echo ",,,,,,";
            }
 
            echo "$subreddit,$likes,$score,$downs,$ups,$num_comments,$num_reports,$author";
 
            $i++;
        }
    }
 
    echo "
";
 
}

Philly ETE – Building a Terabyte-scale Math Platform – Summary

Cliff Click, 0xdata

Click represents 0xdata, which is building a system that can handle R-style analysis at a large speed/scale, aimed at companies that do advertising or credit card fraud detection, where transaction volume is large, and where money is lost waiting for models to rebuild. Typically these data comes from a variety of sources, file formats, etc, and requires much pre-cleaning to be functional. Post-cleaning, it is typically loaded into a dimensional model, and in this case, models are generated using linear models, k-means, or random forests, which Click said represent 80% of the models they see.

These types of models are used because they generate fast linear equations- for instance a credit card swipe gets 2-10ms dedicated to determine if it is suspected fraud. The challenge is building the models in a time-efficient manner – a common response is to downsample data, which loses accuracy. Clicknoted that 10x data increases lead to 75-85% increases in model accuracy, which sometimes can be repeated more than once.

Their database (H2O) can be started easily from just a jar:

java -jar h2o.jar

Servers will find each other using multicast, and are designed to use the maximize CPU and memory use to reduce time to complete queries using a distributed Fork/Join. The system is designed to reduce GC pause overhead, presumably based on Click’s past experience working at Azul - the only GC parameter they set is -xmx. Notably, they apparently make heavy use of UDP for communication.

This talk introduced me to fastr - this provides R semantics within Java, which is awesome.

Interesting finds from this talk-

http://www.csg.is.titech.ac.jp/~chiba/javassist/

http://radare.org/doc/html/Chapter14.html

Job Title Trends in Computing Fields

The Bureau of Labor Statistics creates a listing of job titles, average salary, number of jobs, and projections. Their taxonomy groups people into 750 job title categories, in some odd groupings.

Few categories are set to show declines, particularly in any job type even vaguely related to the IT field. There are a few exceptions, but these tend to be positions operating a computer rather than knowledge-work. What I found interesting is that the only place I’ve seen reference to these positions is in obsolete thrift-store computer books from the ’80s, where these are commonly listed as career paths in IT once one learns Turbo Pascal or Fortran.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Data Entry Keyers234.7218.8-15.9-6.81.841.227,450
Computer Operators86.478.9-7.4-8.61.58.336,930
Word Processors and Typists115.3102.1-13.2-11.57.76.733,400

 

The BLS projects jobs related to computer hardware to grow at a much slower pace than software jobs, an unsurprising, but striking trend. Also fascinating is number of self-employed engineers is significantly lower than average, suggesting that freelancing is not as common as one might think, or that people change titles when becoming self-employed, due to increased responsibilities.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Computer Programmers363.1406.843.7125.6128.071,380
Software Developers, Applications520.8664.5143.827.62.3197.987,790
Software Developers, Systems Software392.3519.4147.232.42.3168.094,180
Computer Hardware Engineers70.076.36.39.01.322.998,810
Electrical Engineers154.0164.710.77.02.247.884,540

 

Individuals with overlapping tasks often have very different titles, for instance, a Graphic Designer and Software Engineer might both be proficient in Javascript. Contrary to the attention it is receiving in popular culture, the BLS does not see much potential for being a 3-D printer operator (row 4) but “Big Data” fits well with the last row, and it pays very well. Personally, I wish more people became DBAs (we’re hiring!) and it looks like growth in that field is strong.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Database Administrators110.8144.833.930.61.652.773,490
Network and Computer Systems Administrators347.2443.896.627.80.8155.369,160
Computer Numerically Controlled Machine Tool Programmers, Metal and Plastic16.618.31.810.8n/a4.945,900
Computer Numerically Controlled Machine Tool Programmers, Metal and Plastic16.618.31.810.8n/a4.945,900
Graphic Designers279.2316.537.313.429.4123.843,500
Computer and Information Research Scientists28.233.55.318.76.010.6100,660
Information Security Analysts, Web Developers, and Computer Network Architects302.3367.965.721.717.1110.375,660

Simulating a Line-Following Robot in R

I’ve been reading up on controlling mobile robots, and built a simple robotic movement simulator in R, using R graphing libraries. The motivation for doing this is to practice setting up the math for controlling a robot, without having to build a physical device. Starting with an over-simple model allows learning a bit at a time, building up to a full solution. The model sets up a series of waypoints (red), where the robot’s path is drawn over-top.

waypoints

The robot’s position is modeled as a combination of point and direction, and controlled using requested velocity and angle. The code below simply accepts the command and applies it exactly, although a more realistic simulation would run this through a probability function, to simulate slipping or sliding.

move <- function(pos, command) {
  loc <- c(pos$point[1], pos$point[2]) +
         c(command$speed * cos(command$angle),
           command$speed * sin(command$angle));
  list(point = c(loc[1], loc[2]),
       angle = command$angle);
}

Next, we defined a few helper functions work on angles. Find the angle of a point over the horizon:

angle <- function(goal, p) {
  atan( (goal[2] - p[2]) / (goal[1] - p[1] ) )
}

A second function to ensure angles are between -π and π:

angle2 <- function(theta) {
  atan2(sin(theta), cos(theta));
}

And a third, to find the distance between two points:

distance <- function(a, b) {
  sqrt(sum((a - b) * (a - b)))
}

The real work is done by a simulation function, which accepts a starting position and angle, waypoint locations, a controller function, and the number of iterations to test. This graphs the waypoints and each robot position. As implemented, it doesn't render anything until the end of the simulation.

This function tests how far the robot is from the waypoint. If it gets close, it sets the next waypoint as a goal. At each iteration, the simulation provides the robot position and previous command information to a controller, which then uses that information as it wishes to decide how to move. Retaining the previous command allows the controller to track error over time, and re-adjust as necessary.

This contrived example doesn't simulate sensor inaccuracies, and thus a controller can assume it knows the exact locations of the waypoints relative to the current position. In a realistic scenario, a robot would need survey sensor data to continually refine location data.

simulate <- function(waypoints, start, angle, f, iter) {
    plot(waypoints, type="o", col="#FD7871");
    command <- list(speed = 0,
                    angle = 0);
    waypoint_idx <- 1;
    position <- list(point = start,
                     angle = angle);
    for (i in 1:iter)
    {
        goal <- c(waypoint_idx, waypoints[waypoint_idx]);
        command <- f(position, goal, command);
        points(x = position$point[1], y = position$point[2],
               type = "o", col = "#FFD8AB");
        position <- move(position, command);
        if (distance(position$point, goal) < 0.1) {
            waypoint_idx <- waypoint_idx + 1;
            command <- list(speed = 0, angle = 0, err = 0);
        }
    }
    for(i in 1:length(waypoints)) {
      points(x = i, y = waypoints[i], type = "o", col = "#FD7871")
    }
}

This is an example controller, which works entirely by modifying the current angle. This controller is an attempt to implement a PID-controller, which uses the current error, net error, and change in error over time to adjust the output angles. This could modify velocity over time, although in this case it uses a constant velocity. A realistic simulation scenario might accept the output of this controller, and limit angular and velocity changes. Depending on the actual robotics hardware, the angular changes might be discretized. Velocity increases might be limited to acceleration, rather than deceleration, like cruise control in a car.

controller <- function(position, goal, last) {
  v <- -.05;
  err <- angle(position$point, goal) - position$angle;
  neterr <- last$err + err;
  list(speed = v,
       angle = 5 * angle2(position$angle + err),
       err = neterr);
}

Tuning constant multipliers in the controller allows a variety of over and under-correction strategies, shown below. The examples I produced are not textbook examples, which suggests a defect in the simulation technique, but still represents and entertaining look at the way these things can fail. In the first example, the robot clearly under-corrects:

under-correct

In the second it makes odd angle choices. Other possible defects are weaving - an overall accurate path, but inability to recognize how close it is to correctness. In this situation, error over time can be tracked to prevent this behavior.

off-correct

The source is available on Github.

Finding the beat in R

In a previous article, I described a method for detecting chords in an audio file (also available for Scala). Continuing on this theme, the following will find the onset of a drumbeat in a file, using R. I’m using a single drumstick click, which you can hear on freesound.org.

This method detects sudden volume increases- it is not made to respond to changes in pitch or timbre (i.e. a song that marks the beat by changing pitch or switching instruments, respectively). However, the methods for doing this seem to be based on the method described below.

We’re looking for the onset of the drumbeat- where the anticipation starts. From reading literature, it appears that this is believed to be what we perceive as the beat in music, rather than, say, the loudest point.

Load the file into memory:

library(sound)
file<-'432__tictacshutup__prac-perc-4.wav'
sample<-loadSample(file)
fourbeats<-appendSample(sample, sample, sample, sample)
saveSample(fourbeats, "out\\fourbeats.wav")
eightbeats<-appendSample(sample, sample,
                         sample, sample,
                         sample, sample,
                         sample, sample)
saveSample(eightbeats, "out\\eightbeats.wav")

Next, define the first order differential function. (There is a method in R called diff, which is essentially the same)

firstOrderDiff<-function(x, lag){ x[(1+lag):length(x) -
                  x[1:(length(x)-lag)]] }
wav<-sample$sound
plot(1:length(wav), abs(wav), type="l")

Clearly there is a lot going on- for the sake of example, let's zoom in:

begin<-abs(wav[1:2000])
plot(1:length(begin), abs(begin), type="l")

We're really interested in magnitude (loudness) of the sound - it's much easier to work with if you take the absolute value.

Still, there are a lot of peaks and valleys. The first order differential is approximately the derivative, and can be computed over a range (e.g. sample 99 - sample 0, sample 100 - sample 1, etc), but experimentally this seems unstable. Instead, I compute the rolling mean over a small sample, then compute the derivative. This is part of the value in working with only positive numbers, as rolling mean is useless on alternating negative and positive numbers.

library(zoo)
smoothed<-rollmean(begin, 100)
plot(abs(smoothed), type="l")

And for the key, find the max value of the derivative to determine where the sound rises fastest:

start<-which.max(firstOrderDiff(begin, 100))
abline(v=start)

In the future I will describe how to generalize this for finding each beat, and handling more types of music.

If you're interested in R (the statistical programming language), check out my review of the R Cookbook. You may also be interested in The Scientist & Engineer's Guide to Digital Signal Processing