Entries Tagged 'Data Science' ↓

Full-Text Indexing PDFs in Javascript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information“). Most court records are PDFS available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs that represent 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them.

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.

To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for pdf.js is instructive, in that the Mozilla developers use browser features that aren’t in common use. Web Workers, for instance, let you set up background processing threads.

The pdf.js APIS make heavy use of Promises, which hold references in code to operations that haven’t completed yet. You operate on them using callbacks:

var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 // this code is called once the PDF is ready
});

This API seems immature yet- ideally you should be able to do promise.then(f(x)).then(g(x)).then(h(x)) etc, but that isn’t yet available.

For rendering PDFs the Promise pattern makes a lot of sense, as it leaves room for parallelizing the rendering process. For merely extracting the text from a PDF it feels like a lot of work – you need to be confident that your callbacks run in order and track which one is last.

The following demonstrates how to extract all the PDF text, which is then printed to the browser console log:

‘use strict’;
var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
          for (var i = 0; i < text.bidiTexts.length; i++) {
            str += text.bidiTexts[i].str;
          }
 
          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
            // later this will insert into an index
            console.log(str);
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});

It’s not trivial to identify where headings and images are. This would require hooking into the rendering code, and possibly a deep understanding of PDF commands (PDFs appear to be represented as stream of rendering commands, similar to RTF).

Lunr
Creating a Lunr index and adding text is straightforward- all the APIs operate on JSON bean objects, which is a pleasantly simple API:

doc1 = {
    id: 1,
    title: 'Foo',
    body: 'Foo foo foo!'
  };
 
doc2 = {
    id: 2,
    title: 'Bar',
    body: 'Bar bar bar!'
  }
 
doc3 = {
    id: 3,
    title: 'gary',
    body: 'Foo Bar bar bar!'
  }
 
index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
  })
 
// Add documents to the index
index.add(doc1)
index.add(doc2)
index.add(doc3)

Searching is simple – one neat tidbit I found is that you can inspect the index easily, since it’s just a JS object:

// Run a search
index.search(“foo”)
 
// Inspect the actual index to see which docs match a term
index2.tokenStore.root.f.o.o.docs

When I was first introduced to full-text indexing, I was confused by what is meant by a “document” – this generalizes beyond a PDF or Office document to any database row, possibly including large blobs of text.

Full-text search would be pretty dumb if you had to build the index every time, and Lunr makes it really easy to serialize and deserialize the index itself:

var serializedIndex = JSON.stringify(index1.toJSON())
var deserializedIndex = JSON.parse(serializedIndex)
var index2 = lunr.Index.load(deserializedIndex)

Index.toJSON also returns a “bean” style object (not a string). I’ve never seen an API like this, and I really like the idea – it gives you a clean Javascript object with only the data that requires serialization.

The following are attributes of the index:

  • corpusTokens – Sorted list of tokens
  • documentStore – list of each document – catenate
  • fields – The fields used to describe each document (similar to database columns)
  • pipeline – The pipeline object used to process tokens
  • tokenStore – Where and how often words are referenced in each document

One great thing about this type of index is that the work can be done in parallel and then combined as a map-reduce job. Only three entries from the above object need to be combined, as “fields” and “pipeline” are static. The following demonstrates the implementation of the reduction step (note jQuery is referenced):

(function reduce(a, b) {
  var j1 = a.toJSON();
  var j2 = b.toJSON();
 
  // The "unique" function does uniqueness by sorting,
  // which we need here.
  var corpusTokens =
      $.unique(
          $.merge(
              $.merge([], j1.corpusTokens),
                           j2.corpusTokens));
 
  // It's important to create new arrays and
  // objects throughout, or else you modify 
  // the source indexes, which is disastrous.
  var documentStore =
     {store: $.extend({},
                      j1.documentStore.store,
                      j2.documentStore.store),
      length: j1.documentStore.length + j2.documentStore.length};
 
  var jt1 = j1.tokenStore;
  var jt2 = j2.tokenStore;
 
  // The 'true' here triggers a deep copy
  var tokenStore = {
    root: $.extend(true, {}, jt1.root, jt2.root),
    length: jt1.length + jt2.length
  };
 
  return {version: j1.version,
          fields: $.merge([], j1.fields),
          ref: j1.ref,
          documentStore: documentStore,
          tokenStore: tokenStore,
          corpusTokens: corpusTokens,
          pipeline: $.merge([], j1.pipeline)};
})(index1, index2)

I tested this by creating three indexes: index1, index2, and index3. index1 is {doc1}, index2 is {doc2, doc3}, and index3 is {doc1, doc2, doc3}. To test the code, you need simply diff:

JSON.stringify(index3.toJSON())
 
JSON.stringify(combine(index1, index2))

Possibilities

Overall this technique has a lot of wasted network I/O, making this seem silly. On the other hand, there are listings on ebay and fiverr selling for “traffic”, which typically comes from pop-unders, botnets, hidden iframes, etc. You can easily find listings like “20,000 hits for $3”, and less in bulk. This is typically cheap because it has little commercial value other than perpetrating various forms of fraud.

You’d need a cheap VM with loads of bandwidth to use as a proxy, as well as publically available data – you couldn’t use this as a scraping technique due to browser protections against cross-domain requests. You’d also need to generate unique document IDs in a unique fashion, perhaps using the original URL.

If a traffic source runs on modern browsers, one could use this as a source of potentially cheap and unlimited processing power, even for point of combining the indexes, although provisions must be made for the natural instability of the system.

If you enjoyed this, you might also enjoy the following:

Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import java.io._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations
 
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()
 
  def characters(ch: Array[Char], start: Int, length: Int) {
    contents.append(new String(ch))
  }
 
  def endDocument() {
  }
 
  def endElement(uri: String, localName: String, qName: String) {
  }
 
  def endPrefixMapping(prefix: String) {
  }
 
  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {
  }
 
  def processingInstruction(target: String, data: String) {
  }
 
  def setDocumentLocator(locator: Locator) {
  }
 
  def skippedEntity(name: String) {
  }
 
  def startDocument() {
  }
 
  def startElement(uri: String, localName: String, qName: String, atts: Attributes) {
  }
 
  def startPrefixMapping(prefix: String, uri: String) {
  }
}
 
object pdf extends App {
  val file = """e:\data\11-1285_i4dk.pdf"""
 
  val pdf: PDFParser = new PDFParser();
 
  val stream: InputStream = new FileInputStream(file)
  val handler: ContentHandler = pdfHandler
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()
 
  pdf.parse(stream,
    handler,
    metadata,
    context)
 
  stream.close()
 
  val contents: String = pdfHandler.contents.toString()
  println(contents)
 
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz"
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"
 
  val serializedClassifier = src + classifier1
 
  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)
 
  var words = 0
  for (i <- 0 to out.size() - 1) {
    val sentence = out.get(i)
 
    var foundWord = ""
    var oldWordClass = ""
 
    for (j <- 0 to sentence.size() - 1) {
      val word = sentence.get(j)
      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + ""
 
      if (!oldWordClass.equals(wordClass)) {
        if (!oldWordClass.equals("O") && !oldWordClass.equals("")) {
          print("[/" + oldWordClass + "]")
        }
      }
 
      if (!wordClass.equals("O") && !wordClass.equals("")) {
        if (!oldWordClass.equals(wordClass)) {
          print("[" + wordClass + "]")
        }
      }
 
      oldWordClass = wordClass
 
      words = words + 1
      print(word);
      print(" ");
 
      if (words > 10) {
        words = 0
        println(" ")
      }
    }
  }
}
11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-
Slip Opinion -RRB- OCTOBER TERM ,
2012 Syllabus NOTE : Where it
is feasible , a syllabus -LRB-
headnote -RRB- will be released ,
as isbeing done in connection with
this case , at the time
the opinion is issued . The
syllabus constitutes no part of the
opinion of the Court but has
beenprepared by the Reporter of Decisions
for the convenience of the reader
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit
Timber & Lumber Co. [/ORGANIZATION], 200
U. S. 321 , 337 .
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,
IN ITS CAPACITY AS FIDUCIARY AND
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS
, INC. . EMPLOYEE BENEFITS PLAN
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF
APPEALS FOR THE THIRD CIRCUIT No.
11 -- 1285 . Argued November
27 , 2012 -- Decided April
16 , 2013 The health benefits
plan established by petitioner [ORGANIZATION]US Airways
[/ORGANIZATION]paid $ 66,866 in medical expenses
for injuries suffered by respondentMcCutchen ,
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in
a car accident caused by athird
party . The plan entitled [ORGANIZATION]US
Airways [/ORGANIZATION]to reimbursement if
[PERSON]McCutchen [/PERSON]

Implementing k-means in Scala

To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two – the exact points used are in the code below.

This implements Lloyd’s algorithm, which tries to cluster points in iterations in a simple manner:

1. Assume a certain number of clusters
2. Group the points at random
3. Compute the center of each cluster
4. For each point, compute which cluster is closest
5. Move all the points into new groupings
6. Repeat 3-5 a few times, until you’re happy with the results

I like how the functional programming style forces you to recreate all the data structures, in this case. It might be tempting to implement this in an imperative style, modifying data structures in place, but since steps 4-5 require separate data, you are protected against making it more difficult. You can see the full source below, or on github.

Since this example is fairly contrived, this converges pretty quickly:

Initial State:
  Cluster 0
  Mean: (17.83517750970944, 12.242720407317105)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (11.9096128931784, 21.1855674228972))
    (22.4668345067162, 8.9705504626857))
    (7.91362116378194, 21.325928219919))
    (22.636600400773, 2.46561420928429))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (25.1439536911272, 3.58469981317611))
    (23.5359486724204, 4.07290025106778))
    (11.7493214262468, 17.8517235677469))
    (12.4277617893575, 19.4887691804508))
    (11.931275122466, 18.0462702532436))
    (25.4645673159779, 7.54703465191098))
    (21.8031183153743, 5.69297814349064))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (16.95249500233747, 12.848199048608048)
    (11.7265904596619, 16.9636039793709))
    (10.7751248849735, 22.1517666115673))
    (23.6587920739353, 3.35476798095758))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (7.03171204763376, 19.1985058633283))
    (23.7722765903534, 3.74873642284525))
    (10.259545802461, 23.4515683763173))
    (28.1587146197594, 3.70625885635717))
    (10.1057940183815, 18.7332929859685))
    (8.90149362263775, 19.6314465074203))
    (12.4353462881232, 19.6310467981989))
    (24.3793349065557, 4.59761596097384))
    (22.5447925324242, 2.99485404382734))
    (26.8942422516129, 5.02646862012427))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (27.0339042858296, 4.4151109960116))
    (11.0118378554584, 20.9773232834654))
Iteration: 0
  Cluster 0
  Mean: (23.781370272978315, 5.754127202865132)
    (11.7265904596619, 16.9636039793709))
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.296576237184727, 20.09138475584863)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 1
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 2
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 3
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 4
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 5
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
class Point(dx: Double, dy: Double) {
  val x: Double = dx
  val y: Double = dy
 
  override def toString(): String = {
    "(" + x + ", " + y + ")"
  }
 
  def dist(p: Point): Double = {
    return x * p.x + y * p.y
  }
}
 
object kmeans extends App {
  val k: Int = 2
 
  // Correct answers to centers are (10, 20) and (25, 5)
  val points: List[Point] = List(
    new Point(10.8348626966492, 18.7800980127523),
    new Point(10.259545802461, 23.4515683763173),
    new Point(11.7396623802245, 17.7026240456956),
    new Point(12.4277617893575, 19.4887691804508),
    new Point(10.1057940183815, 18.7332929859685),
    new Point(11.0118378554584, 20.9773232834654),
    new Point(7.03171204763376, 19.1985058633283),
    new Point(6.56491029696013, 21.5098251711267),
    new Point(10.7751248849735, 22.1517666115673),
    new Point(8.90149362263775, 19.6314465074203),
    new Point(11.931275122466, 18.0462702532436),
    new Point(11.7265904596619, 16.9636039793709),
    new Point(11.7493214262468, 17.8517235677469),
    new Point(12.4353462881232, 19.6310467981989),
    new Point(13.0838514816799, 20.3398794353494),
    new Point(7.7875624720831, 20.1569764307574),
    new Point(11.9096128931784, 21.1855674228972),
    new Point(8.87507602702847, 21.4823134390704),
    new Point(7.91362116378194, 21.325928219919),
    new Point(26.4748241341303, 9.25128245838802),
    new Point(26.2100410238973, 5.06220487544192),
    new Point(28.1587146197594, 3.70625885635717),
    new Point(26.8942422516129, 5.02646862012427),
    new Point(23.7770902983858, 7.19445492687232),
    new Point(23.6587920739353, 3.35476798095758),
    new Point(23.7722765903534, 3.74873642284525),
    new Point(23.9177161897547, 8.1377950229489),
    new Point(22.4668345067162, 8.9705504626857),
    new Point(24.5349708443852, 5.00561881333415),
    new Point(24.3793349065557, 4.59761596097384),
    new Point(27.0339042858296, 4.4151109960116),
    new Point(21.8031183153743, 5.69297814349064),
    new Point(22.636600400773, 2.46561420928429),
    new Point(25.1439536911272, 3.58469981317611),
    new Point(21.4930923464916, 3.28999356823389),
    new Point(23.5359486724204, 4.07290025106778),
    new Point(22.5447925324242, 2.99485404382734),
    new Point(25.4645673159779, 7.54703465191098)).sortBy(
      p => (p.x + " " + p.y).hashCode())
 
  def clusterMean(points: List[Point]): Point = {
    val cumulative = points.reduceLeft((a: Point, b: Point) => new Point(a.x + b.x, a.y + b.y))
 
    return new Point(cumulative.x / points.length, cumulative.y / points.length)
  }
 
  def render(points: Map[Int, List[Point]]) {
    for (clusterNumber <- points.keys.toSeq.sorted) {
      System.out.println("  Cluster " + clusterNumber)
 
      val meanPoint = clusterMean(points(clusterNumber))
      System.out.println("  Mean: " + meanPoint)
 
      for (j <- 0 to points(clusterNumber).length - 1) {
        System.out.println("    " + points(clusterNumber)(j) + ")")
      }
 
      System.out.println("")
    }
  }
 
  val clusters =
    points.zipWithIndex.groupBy(
      x => x._2 % k) transform (
        (i: Int, p: List[(Point, Int)]) => for (x <- p) yield x._1)
 
  System.out.println("Initial State: ")
  render(clusters)
 
  def iterate(clusters: Map[Int, List[Point]]): Map[Int, List[Point]] = {
    val unzippedClusters =
      (clusters: Iterator[(Point, Int)]) => clusters.map(cluster => cluster._1)
 
    // find cluster means
    val means =
      (clusters: Map[Int, List[Point]]) =>
        for (clusterIndex <- clusters.keys)
          yield clusterMean(clusters(clusterIndex))
 
    // find the closest index
    def closest(p: Point, means: Iterable[Point]): Int = {
      val distances = for (center <- means) yield p.dist(center)
      return distances.zipWithIndex.min._2
    }
 
    // assignment step
    val newClusters =
      points.groupBy(
        (p: Point) => closest(p, means(clusters)))
 
    render(newClusters)
 
    return newClusters
  }
 
  var clusterToTest = clusters
  for (i <- 0 to 5) {
    System.out.println("Iteration: " + i)
    clusterToTest = iterate(clusterToTest)
  }
}

Case Study: 10x File Copy Performance with Robocopy

Source data:

  • ~500,000 folders (court cases)
  • ~2.5-3 million documents
  • Source drives is replicated x2 with RAID
  • Copying to NAS over GB ethernet
  • Initial un-tuned copy was set to take ~2 weeks (after switching to Robocopy – before, it was painful just to do an ls)
  • Final copy took ~24 hours

Monitoring:

  • Initially I saw 20-40 Kbps in traffic in DD-WRT, clearly too low. After some changes this is still generally low, but with spikes up to 650 Kbps.
  • CPU use – 4/8 cores in use, even with >8 threads assigned to Robocopy
  • In Computer Management -> Performance monitoring, the disk being copied is Reading as fast as it can (set to 100 all the time)
  • The number called “Split IO / second” is very high much of the time. Research indicates this could be improved with defrag (though this might take me months to complete).

Filesystem Lessions:

  • NTFS can hold a folder with large numbers of files but takes forever to enumerate
  • When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!) which pauses any copy/ls operations
  • The copy does not appear to be I/O bound – even setting Robocopy to use many threads, only 4/8 cores are in use at 5-15% per each.
  • ext4 (destination system) supports up to 64,000 items per folder, any more and you get an error.
  • I split all 500k items into groups of 256*256 at random (for instance one might open \36\0f to see a half dozen items). These are split up using md5 on the folder names – basically this uses the filesystem as a tree map.
  • One nice consequence of this is that you can estimate how far along the process is by looking at how many folders have been copied (85/256 -> 33%, etc)

Robocopy Options:

  • Robocopy lets you remove the console logging, with /LOG:output.txt
  • Robocopy lets you set the number of threads it uses. By default this is 8, it seemed to run faster with > 8, but only the first few threads made any difference.

To investigate:

  • Ways of using virtual filesystems – it’d be nice to continue using wget to download, but split up large folders into batches for scraping. 
  • One possibility is to use wget through VirtualBox, since there are more linux based virtual filesystems – not sure on the performance ovehead

Philly ETE – Building a Terabyte-scale Math Platform – Summary

Cliff Click, 0xdata

Click represents 0xdata, which is building a system that can handle R-style analysis at a large speed/scale, aimed at companies that do advertising or credit card fraud detection, where transaction volume is large, and where money is lost waiting for models to rebuild. Typically these data comes from a variety of sources, file formats, etc, and requires much pre-cleaning to be functional. Post-cleaning, it is typically loaded into a dimensional model, and in this case, models are generated using linear models, k-means, or random forests, which Click said represent 80% of the models they see.

These types of models are used because they generate fast linear equations- for instance a credit card swipe gets 2-10ms dedicated to determine if it is suspected fraud. The challenge is building the models in a time-efficient manner – a common response is to downsample data, which loses accuracy. Clicknoted that 10x data increases lead to 75-85% increases in model accuracy, which sometimes can be repeated more than once.

Their database (H2O) can be started easily from just a jar:

java -jar h2o.jar

Servers will find each other using multicast, and are designed to use the maximize CPU and memory use to reduce time to complete queries using a distributed Fork/Join. The system is designed to reduce GC pause overhead, presumably based on Click’s past experience working at Azul - the only GC parameter they set is -xmx. Notably, they apparently make heavy use of UDP for communication.

This talk introduced me to fastr - this provides R semantics within Java, which is awesome.

Interesting finds from this talk-

http://www.csg.is.titech.ac.jp/~chiba/javassist/

http://radare.org/doc/html/Chapter14.html

Philly ETE – Database as a Value – Summary

This was the first time I’ve seen Rich Hickey’s talk on Datomic, which lent great clarity to the product. As implemented, Datomic functions as an immutable database for philosophical reasons, although in practice it doesn’t manage it’s own storage, and may eventually support deletions to satisfy legal and compliance issues around privacy.

This database technology aims to solve common problems encountered in relational database systems: inability to reason about the current state of data, concurrency issues, challenges of viewing data as of a given time, mismatched metaphors between code and data, and various frustrating design patterns. Making data immutable and adding a time dimension to data allows a number of great simplifications, and produces an index resembling git history (although unlike git, you can’t rewrite history).

There are a number of possible implementations – naive approaches would copy data or build append-logs (as is typical in database rewrite logs), but the Datomic structures bear study, as they are all based on wide, short trees. Because data nodes are immutable, the contents of a tree won’t change, but only have new data added. This lets the db track novelty, and arranged as they have it currently, it appears that you can easily traverse the history as a range scan. Each tree-node has a series of pointers to other tree locations, based on various pre-indexed ways to traverse the tree.

Architecturally, this can have many readers, which don’t require coordination (a query would specify the database/db version it queries against), but a single writer.

Data elements are designed to resemble facts (e.g. :noun :verb :o bject), so one might think of this as resembling a cell or row in a traditional database. Because these are immutable, they can also be cached anywhere, or cached split across machines, so a database could have arbitrary subsets of data in memory – query results are a merge-join of memory and disk storage. As I understood it, data is read into memory as segments (pages) like traditional databases.

This is designed to take a functional approach to database programming, so instead of running a query inside a database, you run a function on a database, e.g. f(db) -> results. This has great potential – ease of mocking data, running hypothetical inserts to see the result, passing custom code in database queries. Notably, this is not necessarily intended to work well for very large data (click streams) or non-functional uses of a database (i.e. anything that uses a db as a global variable, such as a counter).

Philly ETE – Druid Column Oriented Database

I attended a talk at Philly ETE by Metamarkets, a company doing real-time analytics for advertising. Having worked on a couple Oracle-based reporting projects, I entered with interest. Their system is built around dimensional modeling, although with atypically high volume inserts and low latency for updating reports. They attempted to build this system with a traditional relational database (star schema + pre-computed fact subsets), and then when that didn’t work out, they tried using a key-value store (!) with many possible queries pre-computed.

Rather than purchasing a database from EMC (Greenplum) or Oracle, they decided to roll their own, which apparently is cheaper(?!). Likely, the reason this works is that there are an increasing number of OSS components, such as Zookeeper, Kafka, Hive, etc, so one can construct a bundled database product (a similar model to HortonWorks). Their system is called Druid, and was recently released as an open-source project, although as of this writing they admit it’s not the easiest thing to configure and stand up.

Unsurprisingly, the solution they built makes extensive use of bitmap indexes. Fact table data is partitioned into batches of up to a 1-2 million rows – each dimension is produced per batch, then query results are merged. This limits the cardinality of dimensions – each bitmap index will have no more unique values than rows in the partition. Bitmap indexes are nice, in that they can be ANDed and ORed directly in large batches, with one bit per unique value, so are much smaller than the source data. In fact, this data is also compressed in a way which can allow for boolean operations to work – Metamarkets claims the resulting index is 30-40% the size of an inverted index. Incidentally, there is a handy Java implementation of this compression algorithm.

This system appears to allow much wider scaling than a typical RDBMS (e.g. 100 nodes), and mitigates network I/O by tracking which nodes control a segment of data (they’re not the only DB vendor who has this – perhaps it’s a feature of zookeeper?). Data segments are also considered immutable – when changes are made, a new segment is created, the old one soft-deleted, and background jobs run nightly to clean up duplicates/leftover data.

More information here-

http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/

http://metamarkets.com/druid/

Philly ETE 2013 – Day 1 Keynote Summary

Summary of talk by Claudia Perlich – Chief Scientist, m6d

The Philly ETE Keynote address was a presentation on modelling advertising data by media 6 degrees, a company which uses data modelling techniques to improve advertisement conversion rates for large brands.

The techniques presented showed that predictive models of purchasing behavior can be built in an unsupervised fashion, without knowledge of causality. Specifically, they purchase data that ties cookies to hashed URLs, about 100 million people per day, averaging tens of clicks per cookie. These models use techniques such as linear regression, and work best with niche products (since most people have credit cards, etc, they are naturally un-targeted)

Using this data, they predict whether it is worth purchasing impressions on ad exchanges, comparing users known to visit a brand to traits of a user listed in an ad exchange – the decision to purchase an ad in front of a user has to happen in <15 ms.

One interesting challenge is establishing causal relationships – doing A/B tests with blank/public service ads as a control is undesirable.  Data can establish causal relationships, but can be confounding (cause-effect may appear reversed). For instance, consider beauty; using “beautiful” on a message on a dating site might be a failure because of the reaction it induces, or because the attractive people get most of the messages.

Retargeting was also discussed at some length – this is the advertising practice of showing ads for a product when the user has previously visited a site. The effects of botnets clicking ads is visible to those who buy ads on exchanges (many impressions that don’t convert) but appealing to the exchanges and to some ad agencies (gives the appearance of having higher volume). These can make money by surfing people to random sites, normally resulting in low value ad traffic for those sites, which many unique IPs and traffic which can appeal somewhat normal.

Recently these started visiting sites that use retargeting (e.g. Chase) leading to much higher value ads being shown on many of the botnet sites, which was visible to a number of ad buyers.

Job Title Trends in Computing Fields

The Bureau of Labor Statistics creates a listing of job titles, average salary, number of jobs, and projections. Their taxonomy groups people into 750 job title categories, in some odd groupings.

Few categories are set to show declines, particularly in any job type even vaguely related to the IT field. There are a few exceptions, but these tend to be positions operating a computer rather than knowledge-work. What I found interesting is that the only place I’ve seen reference to these positions is in obsolete thrift-store computer books from the ’80s, where these are commonly listed as career paths in IT once one learns Turbo Pascal or Fortran.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Data Entry Keyers234.7218.8-15.9-6.81.841.227,450
Computer Operators86.478.9-7.4-8.61.58.336,930
Word Processors and Typists115.3102.1-13.2-11.57.76.733,400

 

The BLS projects jobs related to computer hardware to grow at a much slower pace than software jobs, an unsurprising, but striking trend. Also fascinating is number of self-employed engineers is significantly lower than average, suggesting that freelancing is not as common as one might think, or that people change titles when becoming self-employed, due to increased responsibilities.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Computer Programmers363.1406.843.7125.6128.071,380
Software Developers, Applications520.8664.5143.827.62.3197.987,790
Software Developers, Systems Software392.3519.4147.232.42.3168.094,180
Computer Hardware Engineers70.076.36.39.01.322.998,810
Electrical Engineers154.0164.710.77.02.247.884,540

 

Individuals with overlapping tasks often have very different titles, for instance, a Graphic Designer and Software Engineer might both be proficient in Javascript. Contrary to the attention it is receiving in popular culture, the BLS does not see much potential for being a 3-D printer operator (row 4) but “Big Data” fits well with the last row, and it pays very well. Personally, I wish more people became DBAs (we’re hiring!) and it looks like growth in that field is strong.

OccupationEmployment
(in thousands)
Employment change,
2010-2020
Percent
self-
employed,
2010
Job openings due to growth and replacement needs,
2010-2020
(in thousands)
2010 median annual wage (in dollars)
Title20102020Number (in
thousands)
Percent
Total, All Occupations143,068.2163,537.120,468.914.37.854,787.433,840
Database Administrators110.8144.833.930.61.652.773,490
Network and Computer Systems Administrators347.2443.896.627.80.8155.369,160
Computer Numerically Controlled Machine Tool Programmers, Metal and Plastic16.618.31.810.8n/a4.945,900
Computer Numerically Controlled Machine Tool Programmers, Metal and Plastic16.618.31.810.8n/a4.945,900
Graphic Designers279.2316.537.313.429.4123.843,500
Computer and Information Research Scientists28.233.55.318.76.010.6100,660
Information Security Analysts, Web Developers, and Computer Network Architects302.3367.965.721.717.1110.375,660

Robot Localization in R

The excellent “Artificial Intelligence for Robotics” class on Udacity starts with a Python example on teaching a robot to determine where it is, given sensor data. Assuming the robot starts with a map of the world, and can make some observations of it’s surroundings, a series of movements and successive observations can quickly narrow the location using probability distributions. Solving the problem of localization in this generic way opens a lot of possibilities for solving unrelated problems; e.g. a similar techniques might be used to build a database index of music, and match a song given a short segment.

This type of problem is a natural fit for R, given that R is built around statistical operations and lists. In this case I didn’t use much built-in functionality outside list operations, but clearly this process can be expressed concisely.

First comes set-up:

world <- c("green", "red", "red", "green", "green")
pHit <- .8
pMiss <- .2

The "world" variable is a set-up used in the course example; this is the map of the world, assumed to be circular. Probability estimates are established- error in sensor and movement is assumed. pHit and pHit are scaling factors used to describe the accuracy of sensor input.

Next, we set up the measurement data, and helper data to generate a proper probability distribution, and to normalize a distribution:

measurements <- c("red", "red", "green")
norm <- array(1/length(world), length(world))
normalize <- function(p) { p / sum(p) }

Then the sense function, which takes a measurement, and adjust the probabilities of where we are:

sense <- function(dist, value) {
  normalize(
    sapply(1:length(dist),
      function(idx){
       hit=(world[idx]==value);
       dist[idx] * (hit * pHit + (1-hit) * (pMiss))
     }
    )
  )
}
p<-norm
> p
[1] 0.2 0.2 0.2 0.2 0.2
> sense(p, "red")
[1] 0.1111111 0.3333333 0.3333333 0.1111111 0.1111111

You can see here that the "robot" has moved from knowing nothing (a constant distribution) to being somewhat confident that it is standing on one of the red squares. It can't be certain where yet, do to there being multiple red squares, as well as possible sensor error.

Next, we define a function to rotate the array, since the array is cyclic:

rotate <- function(a, i){
 c(a[(i+1):length(a)],
   a[1:i])
}
> c(1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
> rotate(c(1:10), 2)
 [1]  3  4  5  6  7  8  9 10  1  2
> rotate(c(1:10), 3)
 [1]  4  5  6  7  8  9 10  1  2  3

Define some constants for movement accuracy. pMove is the probability that if you move to the left one space, you'll be successful - pLeft and pRight are the chances of landing on either side.

#movement accuracy
pMove<-.8
pLeft<-.1
pRight<-.1

The movement function is then very simple. This isn't normalized like "sense", because the movement probabilities add up to one, but if they didn't, this would also require normalization.

move <- function(p) {
  pMove * rotate(p, 1) +
  pLeft * p +
  pRight * rotate(p, 2)
}

In the example modeled below, the robot starts on the third square (red) and moves to the left several times. Each time it moves, it loses some certainty for it's location, but then gains far more certainty upon taking measurements.

> world
[1] "green" "red"   "red"   "green" "green"
> dist
[1] 0.2 0.2 0.2 0.2 0.2
> sense(dist, "red")
[1] 0.1111111 0.3333333 0.3333333 0.1111111 0.1111111
> move(sense(dist, "red"))
[1] 0.3111111 0.3111111 0.1333333 0.1111111 0.1333333
> sense(move(sense(dist, "red")), "red")
[1] 0.16470588 0.49411765 0.21176471 0.05882353 0.07058824
> move(sense(move(sense(dist, "red")), "red"))
[1] 0.43294118 0.22470588 0.07529412 0.07882353 0.18823529
> sense(move(sense(move(sense(dist, "red")), "red")), "green")
[1] 0.54117647 0.09362745 0.03137255 0.09852941 0.23529412
> move(sense(move(sense(move(sense(dist, "red")), "red")), "green"))
[1] 0.13215686 0.04431373 0.10549020 0.25220588 0.46583333

And finally, to combine this, the movements can all be done recursively. One note here is that you must remove elements explicitly, rather than successively, subsetting an array, as the recycling rule will keep adding elements when you try to remove the last entry, resulting in infinite recursion.

moveall <- function(measurements, p){
  if(length(measurements) > 0) {
    moveall(measurements[-c(1)], move(sense(p, measurements[1])))
  } else {
    p
  };
}
> measurements
[1] "red"   "red"   "green"
> moveall(measurements, dist)
[1] 0.13215686 0.04431373 0.10549020 0.25220588 0.46583333