<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Gary Sieling</title> <atom:link href="http://garysieling.com/blog/feed" rel="self" type="application/rss+xml" /><link>http://garysieling.com/blog</link> <description>Philadelphia Software Developer</description> <lastBuildDate>Thu, 23 May 2013 02:32:52 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=481</generator> <item><title>My First A/B Test&#8230; with Results</title><link>http://garysieling.com/blog/my-first-ab-test-with-results?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=my-first-ab-test-with-results</link> <comments>http://garysieling.com/blog/my-first-ab-test-with-results#comments</comments> <pubDate>Thu, 23 May 2013 02:30:42 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Business]]></category> <category><![CDATA[Data Mining]]></category> <category><![CDATA[ab testing]]></category> <category><![CDATA[advertising]]></category> <category><![CDATA[split testing]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1061</guid> <description><![CDATA[A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast [...]]]></description> <content:encoded><![CDATA[<div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/my-first-ab-test-with-results" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div><p>A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast to algorithms that aim to analyze data and tell a story, for instance <a
href="http://www.polisci.upenn.edu/~weisiger/impermanent.pdf">applying agent-based modeling to political science</a> and the study of <a
href="http://www.amazon.com/gp/product/0801451868/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0801451868&amp;linkCode=as2&amp;tag=thesecrelifeo-20">war and peace</a>.</p><p>Testing two variations of a site to see how people react turns out to be extremely difficult to tinker with on a blog. You need the right sort of problem to justify using statistics, and it&#8217;s hard to manufacture such a problem just to justify the experiment. From another angle, I&#8217;ve long been leery of using split testing at all, as <a
href="http://garysieling.com/blog/halving-page-load-time-with-pngcrush">keeping WordPress stable</a> has <a
href="http://garysieling.com/blog/vps-io-diagnosis">been a real pain</a>, so I prefer to avoid additional operational complexity.</p><p>Enter <a
href="http://www.adzerk.com/inside-adzerk/">Adzerk</a>, which provides a hosted ad server, removing the need for additional infrastructure. An ad server lets you upload media (pictures, etc.) and set up business rules for when to display each &#8220;ad&#8221;, effectively letting you run something resembling an AdSense clone, minus all the AI. Adzerk has a <a
href="http://www.adzerk.com/pricing/">nice free plan</a>, which covers you up to a ridiculously high number of impressions, so a paid plan isn&#8217;t really necessary. I&#8217;ve been really happy with the site, although from their perspective I&#8217;m likely a terrible customer, as they&#8217;re not making any money off me.</p><p>The logical products to promote on a programming site seem to be jobs, books, and developer tools &#8211; right now I&#8217;m just running campaigns on a couple of sites, consisting of links to Amazon pages. Here&#8217;s a screenshot of a campaign set up very recently:</p><p><img
class="alignnone size-large wp-image-1062" alt="Screen Shot 2013-05-22 at 9.36.38 PM" src="http://garysieling.com/blog/wp-content/uploads/2013/05/Screen-Shot-2013-05-22-at-9.36.38-PM-578x429.png" width="578" height="429" /></p><p>Once you get enough impressions to compare, you can just turn off each entry and make new ones for the next test. There&#8217;s no particular reason you have to use traditional display advertising with this &#8211; as cheap as it is, one could easily build a very dynamic site using their API, for instance to pump in hot news, thumbnails for suggested stories, etc.</p><p>I&#8217;m running campaigns on a couple of different sites. After a <a
href="http://garysieling.com/blog/lessons-learned-from-0-to-40000-blog-readers">slow build-up in readership</a>, and a <a
href="http://garysieling.com/blog/building-a-full-text-index-in-javascript">burst of Hacker News traffic</a>, I was finally able to achieve statistically significant results for clickthroughs on a test of four JavaScript books:</p><table><tbody><tr><td><b>Title</b></td><td><b>Impressions</b></td><td><b>Clickthrough %</b></td><td><b>Clicks</b></td></tr><tr><td><a
href="http://www.amazon.com/gp/product/0596517742/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0596517742&amp;linkCode=as2&amp;tag=thesecrelifeo-20">JavaScript: The Good Parts</a></td><td>9801</td><td>0.2</td><td>20</td></tr><tr><td><a
href="http://www.amazon.com/gp/product/0596806752/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0596806752&amp;linkCode=as2&amp;tag=thesecrelifeo-20">JavaScript Patterns</a></td><td>10092</td><td>0.22</td><td>22</td></tr><tr><td><a
href="http://www.amazon.com/gp/product/1847198708/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=1847198708&amp;linkCode=as2&amp;tag=thesecrelifeo-20">Ext CookBook</a></td><td>9944</td><td>0.11</td><td>11</td></tr><tr><td><a
href="http://www.amazon.com/gp/product/1935182110/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=1935182110&amp;linkCode=as2&amp;tag=thesecrelifeo-20">ExtJS in Action</a></td><td>10060</td><td>0.43</td><td>43</td></tr></tbody></table><p>&nbsp;</p><p>To test this, I made a table of each combination of values, comparing them using the <a
href="http://www.evanmiller.org/ab-testing/chi-squared.html">chi-squared test</a> on <a
href="http://www.evanmiller.org/ab-testing/">Evan&#8217;s Awesome A/B testing tools</a>. Fortunately, this showed ExtJS in Action to be a clear winner over all the rest, as I had hoped. One risk of this technique is that it yields only a clear loser, or a couple of equivalent winners, when you were hoping for a single best option. In the table below, each row is compared against each column.</p><table><tbody><tr><td></td><td><b>Good Parts</b></td><td><b>Patterns</b></td><td><b>Ext CookBook</b></td><td><b>ExtJS in Action</b></td></tr><tr><td><b>Good Parts</b></td><td>x</td><td>no difference</td><td>no difference</td><td>loses</td></tr><tr><td><b>Patterns</b></td><td>no difference</td><td>x</td><td>no difference</td><td>loses</td></tr><tr><td><b>Ext CookBook</b></td><td>no difference</td><td>no difference</td><td>x</td><td>loses</td></tr><tr><td><b>ExtJS in Action</b></td><td>wins</td><td>wins</td><td>wins</td><td>x</td></tr></tbody></table><p>&nbsp;</p><p>In all, this took around <i>nine months</i> to achieve, but with little effort on my part. The lesson I take is this: while this is fun, it may be better to look for big wins elsewhere, especially by borrowing ideas from people who already run these tests. Additionally, the tests would complete much faster with only two variations, and many can run in parallel.</p><div
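><p>The pairwise comparison described above can be reproduced in a few lines of JavaScript. This is only a minimal sketch: it assumes a plain Pearson chi-squared test on a 2x2 table of clicks versus non-clicks, with no continuity correction, and the usual 3.841 cutoff (one degree of freedom, 5% significance). The function and variable names are made up for illustration, and the hosted tool linked above may calculate things differently.</p><pre>
```javascript
// Pearson chi-squared statistic for a 2x2 table of clicks vs. non-clicks.
// Inputs are { clicks, impressions } objects; the names are illustrative only.
function chiSquared(a, b) {
  const aMiss = a.impressions - a.clicks;
  const bMiss = b.impressions - b.clicks;
  const total = a.impressions + b.impressions;
  const cross = a.clicks * bMiss - aMiss * b.clicks;
  return (total * cross * cross) /
    (a.impressions * b.impressions *
     (a.clicks + b.clicks) * (aMiss + bMiss));
}

// Critical value for 1 degree of freedom at the 5% level (an assumed cutoff).
const CRITICAL_5_PERCENT = 3.841;

// Report how variation a fares against variation b.
function compare(a, b) {
  if (chiSquared(a, b) > CRITICAL_5_PERCENT) {
    return a.clicks / a.impressions > b.clicks / b.impressions ? 'wins' : 'loses';
  }
  return 'no difference';
}

// Figures copied from the results table above.
const books = {
  goodParts: { clicks: 20, impressions: 9801 },
  patterns: { clicks: 22, impressions: 10092 },
  extCookbook: { clicks: 11, impressions: 9944 },
  extInAction: { clicks: 43, impressions: 10060 }
};

console.log(compare(books.extInAction, books.patterns));
console.log(compare(books.patterns, books.extCookbook));
```
</pre><p>With the figures from the table, the first call reports a win for ExtJS in Action and the second reports no difference, in line with the comparison matrix above.</p></div><div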
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/my-first-ab-test-with-results" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/my-first-ab-test-with-results/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Building a full-text index of git commits using lunr.js and Github APIs</title><link>http://garysieling.com/blog/building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis</link> <comments>http://garysieling.com/blog/building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis#comments</comments> <pubDate>Mon, 20 May 2013 01:20:54 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[Full Text Search]]></category> <category><![CDATA[etl]]></category> <category><![CDATA[faceted search]]></category> <category><![CDATA[full-text search]]></category> <category><![CDATA[git]]></category> <category><![CDATA[github]]></category> <category><![CDATA[javascript]]></category> <category><![CDATA[lunr.js]]></category> <category><![CDATA[solr]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1053</guid> <description><![CDATA[Github has a nice API for inspecting repositories &#8211; it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer [...]]]></description> <content:encoded><![CDATA[<div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div><p>Github has a nice API for inspecting repositories &#8211; it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). Github APIs return JSON, which has the nice property of resembling a tree structure &#8211; results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by sha hash. The API I use here lacks diffs, which must be retrieved elsewhere.</p><p>To test this, access a URL like so. The configurable arguments are the repository owner and name fields.<br
/> <a
href="https://api.github.com/repos/torvalds/linux/commits">https://api.github.com/repos/torvalds/linux/commits</a></p><p>This is what a commit looks like:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #009900;">&#123;</span>
  <span style="color: #3366CC;">&quot;sha&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;7638417db6d59f3c431d3e1f261cc637155684cd&quot;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;url&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd&quot;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;author&quot;</span><span style="color: #339933;">:</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #3366CC;">&quot;date&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;2008-07-09T16:13:30+12:00&quot;</span><span style="color: #339933;">,</span>
    <span style="color: #3366CC;">&quot;name&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;Scott Chacon&quot;</span><span style="color: #339933;">,</span>
    <span style="color: #3366CC;">&quot;email&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;schacon@gmail.com&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;committer&quot;</span><span style="color: #339933;">:</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #3366CC;">&quot;date&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;2008-07-09T16:13:30+12:00&quot;</span><span style="color: #339933;">,</span>
    <span style="color: #3366CC;">&quot;name&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;Scott Chacon&quot;</span><span style="color: #339933;">,</span>
    <span style="color: #3366CC;">&quot;email&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;schacon@gmail.com&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;message&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;my commit message&quot;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;tree&quot;</span><span style="color: #339933;">:</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #3366CC;">&quot;url&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132&quot;</span><span style="color: #339933;">,</span>
    <span style="color: #3366CC;">&quot;sha&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;827efc6d56897b048c772eb4087f854f46256132&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
  <span style="color: #3366CC;">&quot;parents&quot;</span><span style="color: #339933;">:</span> <span style="color: #009900;">&#91;</span>
    <span style="color: #009900;">&#123;</span>
      <span style="color: #3366CC;">&quot;url&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0&quot;</span><span style="color: #339933;">,</span>
      <span style="color: #3366CC;">&quot;sha&quot;</span><span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;7d1b31e74ee336d15cbd21741bc88a537ed063a0&quot;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#93;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div><p>To make the test simple, I download these as JSON locally, then start a Python web server. Were I to make many such calls on a public site, I’d set up a proxy to the GitHub APIs.</p><pre>
python -m SimpleHTTPServer
# Python 3 equivalent: python3 -m http.server
</pre><p>This data has a number of nested objects and must be flattened to fit into the <a
href="http://lunrjs.com/">lunr.js</a> full-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining subsequent keys with underscores in between. A production-worthy solution needs to escape these to prevent collisions.</p><div
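><p>To illustrate the collision problem: with a plain underscore join, the objects {a_b: 1} and {a: {b: 2}} both flatten to the key a_b. One possible fix, sketched here in plain JavaScript without jQuery, escapes every key before joining, so that an escaped key can never contain the separator. The scheme borrows the idea behind JSON Pointer escaping (a tilde becomes ~0, and the separator becomes ~1). The names escapeKey and flatten are hypothetical, and this is just one of several workable escaping schemes.</p><pre>
```javascript
// Escape a single key so it can never contain the "_" separator.
// "~" must be escaped first so the mapping stays reversible.
function escapeKey(key) {
  return String(key).replace(/~/g, '~0').replace(/_/g, '~1');
}

// True for object literals, false for arrays, strings, numbers and null.
function isPlainObject(value) {
  return Object.prototype.toString.call(value) === '[object Object]';
}

// Flatten nested objects into one level, joining escaped keys with "_".
// Pass null as the prefix for the top-level call.
function flatten(value, prefix, out) {
  Object.keys(value).forEach(function (key) {
    const path = prefix === null ? escapeKey(key) : prefix + '_' + escapeKey(key);
    const child = value[key];
    if (isPlainObject(child)) {
      flatten(child, path, out);
    } else if (child !== null) {
      // Store everything as a string; null values are skipped.
      out[path] = String(child);
    }
  });
  return out;
}

console.log(flatten({ a_b: 1, a: { b: 2 } }, null, {}));
```
</pre><p>Here the top-level key a_b is stored under a~1b, while the nested value keeps the key a_b, so the two no longer overwrite each other.</p></div><div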
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">var</span> documents <span style="color: #339933;">=</span> <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">function</span> recurse<span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> base<span style="color: #339933;">,</span> obj<span style="color: #339933;">,</span> value<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>$.<span style="color: #660066;">isPlainObject</span><span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>value<span style="color: #339933;">,</span> <span style="color: #000066; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>k<span style="color: #339933;">,</span> v<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      recurse<span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> base <span style="color: #339933;">+</span> obj <span style="color: #339933;">+</span> <span style="color: #3366CC;">&quot;_&quot;</span><span style="color: #339933;">,</span> k<span style="color: #339933;">,</span> v<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span> <span style="color: #000066; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
    process<span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> base <span style="color: #339933;">+</span> obj<span style="color: #339933;">,</span> value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">function</span> process<span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> key<span style="color: #339933;">,</span> value<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>documents.<span style="color: #660066;">length</span> <span style="color: #339933;">&lt;=</span> doc_num<span style="color: #009900;">&#41;</span>
    documents<span style="color: #009900;">&#91;</span>doc_num<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">!==</span> <span style="color: #003366; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span>
    documents<span style="color: #009900;">&#91;</span>doc_num<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span>key<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> value <span style="color: #339933;">+</span> <span style="color: #3366CC;">''</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
$.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> commit<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>commit<span style="color: #339933;">,</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>k<span style="color: #339933;">,</span> v<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    recurse<span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> <span style="color: #3366CC;">''</span><span style="color: #339933;">,</span> k<span style="color: #339933;">,</span> v<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><p>Normally, one sets up a lunr full-text index by specifying all the fields, much like Solr&#8217;s numerous XML config files. Lunr doesn&#8217;t have nearly as many configuration options, since you only specify the ‘boost’ parameter to increase the value of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.</p><p>Given the simplicity of field objects, you can infer the field list from JSON payloads. The code below provides two modes: one where you inspect the entire JSON payload, and one where you limit how many commits you check, a good option when the JSON data is consistent.</p><p>The function accepts configuration objects resembling <a
href="http://www.garysieling.com/blog/tag/extjs">ExtJS</a> config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">function</span> inferIndex<span style="color: #009900;">&#40;</span>documents<span style="color: #339933;">,</span> config<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000066; font-weight: bold;">return</span> lunr<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">this</span>.<span style="color: #660066;">ref</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'id'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">var</span> found <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">var</span> idx <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">this</span><span style="color: #339933;">;</span>
&nbsp;
    $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>documents<span style="color: #339933;">,</span>
      <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> doc<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
&nbsp;
        <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>config <span style="color: #339933;">&amp;&amp;</span>
            config.<span style="color: #660066;">limit</span> <span style="color: #339933;">&amp;&amp;</span>
            config.<span style="color: #660066;">limit</span> <span style="color: #339933;">&lt;</span> doc_num<span style="color: #009900;">&#41;</span>
          <span style="color: #000066; font-weight: bold;">return</span><span style="color: #339933;">;</span>
&nbsp;
        $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>doc<span style="color: #339933;">,</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>k<span style="color: #339933;">,</span> v<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>found<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>config <span style="color: #339933;">&amp;&amp;</span> config<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
              idx.<span style="color: #660066;">field</span><span style="color: #009900;">&#40;</span>k<span style="color: #339933;">,</span> config<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #009900;">&#125;</span> <span style="color: #000066; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
              idx.<span style="color: #660066;">field</span><span style="color: #009900;">&#40;</span>k<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #009900;">&#125;</span>
            found<span style="color: #009900;">&#91;</span>k<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #003366; font-weight: bold;">true</span><span style="color: #339933;">;</span>
          <span style="color: #009900;">&#125;</span>
        <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">var</span> index <span style="color: #339933;">=</span>
  inferIndex<span style="color: #009900;">&#40;</span>documents<span style="color: #339933;">,</span>
    <span style="color: #009900;">&#123;</span>limit<span style="color: #339933;">:</span> <span style="color: #CC0000;">1</span><span style="color: #339933;">,</span>
     <span style="color: #3366CC;">'commit_author_name'</span><span style="color: #339933;">:</span><span style="color: #009900;">&#123;</span>boost<span style="color: #339933;">:</span><span style="color: #CC0000;">10</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><p>Inserting flattened documents into the index becomes simple. The method below provides a callback, should you desire to add calculated fields.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #006600; font-style: italic;">// Add each flattened document to the index. doc_cb is an optional callback</span>
<span style="color: #006600; font-style: italic;">// that receives a document and returns it with any calculated fields added.</span>
<span style="color: #000066; font-weight: bold;">function</span> addDocuments<span style="color: #009900;">&#40;</span>index<span style="color: #339933;">,</span> documents<span style="color: #339933;">,</span> doc_cb<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>documents<span style="color: #339933;">,</span>
    <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>doc_num<span style="color: #339933;">,</span> attrs<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000066; font-weight: bold;">var</span> doc <span style="color: #339933;">=</span>
        $.<span style="color: #660066;">extend</span><span style="color: #009900;">&#40;</span>
          <span style="color: #009900;">&#123;</span>id<span style="color: #339933;">:</span> doc_num<span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> attrs<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
      <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>doc_cb<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        doc <span style="color: #339933;">=</span> doc_cb<span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
      index.<span style="color: #660066;">add</span><span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
addDocuments<span style="color: #009900;">&#40;</span>index<span style="color: #339933;">,</span> documents<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><p>At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in a category, like a SQL group by. I&#8217;ve found it particularly convenient to facet on author, date, or author&#8217;s company.</p><p>If you have access to the original documents, you can easily construct facets based on the results of a lunr search:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">function</span> facet<span style="color: #009900;">&#40;</span>index<span style="color: #339933;">,</span> query<span style="color: #339933;">,</span> data<span style="color: #339933;">,</span> field<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000066; font-weight: bold;">var</span> results <span style="color: #339933;">=</span> index.<span style="color: #660066;">search</span><span style="color: #009900;">&#40;</span>query<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">var</span> facets <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
  $.<span style="color: #660066;">each</span><span style="color: #009900;">&#40;</span>results<span style="color: #339933;">,</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>index<span style="color: #339933;">,</span> searchResult<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">var</span> doc <span style="color: #339933;">=</span> data<span style="color: #009900;">&#91;</span>searchResult.<span style="color: #660066;">ref</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
    facets<span style="color: #009900;">&#91;</span>doc<span style="color: #009900;">&#91;</span>field<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span>
      <span style="color: #009900;">&#40;</span>facets<span style="color: #009900;">&#91;</span>doc<span style="color: #009900;">&#91;</span>field<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">===</span> <span style="color: #003366; font-weight: bold;">undefined</span> <span style="color: #339933;">?</span> <span style="color: #CC0000;">0</span> <span style="color: #339933;">:</span>
      facets<span style="color: #009900;">&#91;</span>doc<span style="color: #009900;">&#91;</span>field<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #CC0000;">1</span><span style="color: #339933;">;</span> <span style="color: #009900;">&#125;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">return</span> facets<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div><p>Commit messages in repositories where I work often contain names of clients who requested a feature or bug fix. Consequently, a search faceted by author shows who worked with each client the most; the same technique can also tell you who has worked with various pieces of technology.</p><p>The following query demonstrates this technique:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">var</span> facets <span style="color: #339933;">=</span>
   facet<span style="color: #009900;">&#40;</span>index<span style="color: #339933;">,</span>
        <span style="color: #3366CC;">'driver'</span><span style="color: #339933;">,</span>
        documents<span style="color: #339933;">,</span>
        <span style="color: #3366CC;">'commit_author_name'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><pre>
{"Wolfram Sang":24,"Linus Torvalds":3}
</pre><p>The approach shown here works well, but building the facets requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. Solr also has options to skip lower-casing data, since lower-casing may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.</p><p>If you enjoyed this, you may also be interested in:</p><ul><li><a
href="http://garysieling.com/blog/building-a-full-text-index-in-javascript">Building a full-text index in Javascript</a></li><li><a
href="http://garysieling.com/blog/converting-git-commit-history-to-a-solr-full-text-index">Converting git commit history to a solr full-text index</a></li><li><a
href="http://garysieling.com/blog/building-a-naive-bayes-classifier-in-the-browser-using-map-reduce">Building a Naive Bayes Classifier in the Browser using Map-Reduce</a></li></ul><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Full-Text Indexing PDFs in Javascript</title><link>http://garysieling.com/blog/building-a-full-text-index-in-javascript?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-a-full-text-index-in-javascript</link> <comments>http://garysieling.com/blog/building-a-full-text-index-in-javascript#comments</comments> <pubDate>Thu, 16 May 2013 02:58:02 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[Data Mining]]></category> <category><![CDATA[Data Science]]></category> <category><![CDATA[Javascript Code Examples]]></category> <category><![CDATA[Proof of Concepts]]></category> <category><![CDATA[full text search]]></category> <category><![CDATA[javascript]]></category> <category><![CDATA[jquery]]></category> <category><![CDATA[lucene]]></category> <category><![CDATA[lunr.js]]></category> <category><![CDATA[pdf.js]]></category> <category><![CDATA[phonegap]]></category> <category><![CDATA[scraping]]></category> <category><![CDATA[solr]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1041</guid> <description><![CDATA[I once worked for a company that sold access to legal and financial databases (as they call it, &#8220;intelligent information&#8221;). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the [...]]]></description> <content:encoded><![CDATA[<div
style="clear:both"></div><p>I once worked for a company that sold access to legal and financial databases (as they call it, &#8220;<a
href="http://thomsonreuters.com/">intelligent information</a>&#8221;). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs that represent 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them.</p><p>Mozilla Labs has received a lot of attention lately for a project impressive in its ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! In a different vein, <a
href="https://twitter.com/olivernn">Oliver Nightingale</a> is implementing a full-text indexer in Javascript. Combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.</p><p>As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications to the text, removing grammatical features which are irrelevant to search. For example, it might strip “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and its effect on the resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.</p><p>Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.</p><p>Building a Javascript full-text indexer enables search in places that were previously difficult, such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research into encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.</p><p>To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for <a
href="http://mozilla.github.io/pdf.js/">pdf.js</a> is instructive, in that the Mozilla developers use browser features that aren&#8217;t in common use. Web Workers, for instance, let you set up background processing threads.</p><p>The pdf.js APIs make heavy use of Promises, which hold references in code to operations that haven’t completed yet. You operate on them using callbacks:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #006600; font-style: italic;">// Source document: http://www.pacer.gov/documents/pacermanual.pdf (loaded below from a local copy)</span>
<span style="color: #000066; font-weight: bold;">var</span> pdf <span style="color: #339933;">=</span> PDFJS.<span style="color: #660066;">getDocument</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'pacermanual.pdf'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pdf.<span style="color: #660066;">then</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>pdf<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
 <span style="color: #006600; font-style: italic;">// this code is called once the PDF is ready</span>
<span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><p>This API still seems immature: ideally you should be able to chain calls as promise.then(f).then(g).then(h) etc., but that isn&#8217;t yet available.</p><p>For rendering PDFs the Promise pattern makes a lot of sense, as it leaves room for parallelizing the rendering process. For merely extracting the text from a PDF it feels like a lot of work: you need to be confident that your callbacks run in order, and you need to track which one is last.</p><p>The following demonstrates how to extract all the PDF text, which is then printed to the browser console log:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #3366CC;">'use strict'</span><span style="color: #339933;">;</span>
<span style="color: #006600; font-style: italic;">// Source document: http://www.pacer.gov/documents/pacermanual.pdf (loaded below from a local copy)</span>
<span style="color: #000066; font-weight: bold;">var</span> pdf <span style="color: #339933;">=</span> PDFJS.<span style="color: #660066;">getDocument</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'pacermanual.pdf'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pdf.<span style="color: #660066;">then</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>pdf<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
 <span style="color: #000066; font-weight: bold;">var</span> maxPages <span style="color: #339933;">=</span> pdf.<span style="color: #660066;">pdfInfo</span>.<span style="color: #660066;">numPages</span><span style="color: #339933;">;</span>
 <span style="color: #006600; font-style: italic;">// accumulates the text of every page</span>
 <span style="color: #000066; font-weight: bold;">var</span> str <span style="color: #339933;">=</span> <span style="color: #3366CC;">''</span><span style="color: #339933;">;</span>
 <span style="color: #000066; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">var</span> j <span style="color: #339933;">=</span> <span style="color: #CC0000;">1</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;=</span> maxPages<span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">var</span> page <span style="color: #339933;">=</span> pdf.<span style="color: #660066;">getPage</span><span style="color: #009900;">&#40;</span>j<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #006600; font-style: italic;">// the callback function - we create one per page</span>
    <span style="color: #000066; font-weight: bold;">var</span> processPageText <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">function</span> processPageText<span style="color: #009900;">&#40;</span>pageIndex<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000066; font-weight: bold;">return</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>pageData<span style="color: #339933;">,</span> content<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000066; font-weight: bold;">return</span> <span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>text<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #006600; font-style: italic;">// bidiTexts has a property identifying whether this</span>
          <span style="color: #006600; font-style: italic;">// text is left-to-right or right-to-left</span>
          <span style="color: #000066; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">var</span> i <span style="color: #339933;">=</span> <span style="color: #CC0000;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> text.<span style="color: #660066;">bidiTexts</span>.<span style="color: #660066;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            str <span style="color: #339933;">+=</span> text.<span style="color: #660066;">bidiTexts</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>.<span style="color: #660066;">str</span><span style="color: #339933;">;</span>
          <span style="color: #009900;">&#125;</span>
&nbsp;
          <span style="color: #000066; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>pageData.<span style="color: #660066;">pageInfo</span>.<span style="color: #660066;">pageIndex</span> <span style="color: #339933;">===</span>
              maxPages <span style="color: #339933;">-</span> <span style="color: #CC0000;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #006600; font-style: italic;">// later this will insert into an index</span>
            console.<span style="color: #660066;">log</span><span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
          <span style="color: #009900;">&#125;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#40;</span>j<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000066; font-weight: bold;">var</span> processPage <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">function</span> processPage<span style="color: #009900;">&#40;</span>pageData<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000066; font-weight: bold;">var</span> content <span style="color: #339933;">=</span> pageData.<span style="color: #660066;">getTextContent</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
      content.<span style="color: #660066;">then</span><span style="color: #009900;">&#40;</span>processPageText<span style="color: #009900;">&#40;</span>pageData<span style="color: #339933;">,</span> content<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    page.<span style="color: #660066;">then</span><span style="color: #009900;">&#40;</span>processPage<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
 <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><p>It’s not trivial to identify where headings and images are. This would require hooking into the rendering code, and possibly a deep understanding of PDF commands (PDFs appear to be represented as a stream of rendering commands, similar to RTF).</p><p><strong>Lunr</strong><br
/> Creating a Lunr index and adding text is straightforward: all the APIs operate on JSON “bean” objects, which makes for a pleasantly simple API:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">var</span> doc1 <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    id<span style="color: #339933;">:</span> <span style="color: #CC0000;">1</span><span style="color: #339933;">,</span>
    title<span style="color: #339933;">:</span> <span style="color: #3366CC;">'Foo'</span><span style="color: #339933;">,</span>
    body<span style="color: #339933;">:</span> <span style="color: #3366CC;">'Foo foo foo!'</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">var</span> doc2 <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    id<span style="color: #339933;">:</span> <span style="color: #CC0000;">2</span><span style="color: #339933;">,</span>
    title<span style="color: #339933;">:</span> <span style="color: #3366CC;">'Bar'</span><span style="color: #339933;">,</span>
    body<span style="color: #339933;">:</span> <span style="color: #3366CC;">'Bar bar bar!'</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">var</span> doc3 <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    id<span style="color: #339933;">:</span> <span style="color: #CC0000;">3</span><span style="color: #339933;">,</span>
    title<span style="color: #339933;">:</span> <span style="color: #3366CC;">'gary'</span><span style="color: #339933;">,</span>
    body<span style="color: #339933;">:</span> <span style="color: #3366CC;">'Foo Bar bar bar!'</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000066; font-weight: bold;">var</span> index <span style="color: #339933;">=</span> lunr<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">this</span>.<span style="color: #660066;">field</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'title'</span><span style="color: #339933;">,</span> <span style="color: #009900;">&#123;</span>boost<span style="color: #339933;">:</span> <span style="color: #CC0000;">10</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000066; font-weight: bold;">this</span>.<span style="color: #660066;">field</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'body'</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000066; font-weight: bold;">this</span>.<span style="color: #660066;">ref</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'id'</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #006600; font-style: italic;">// Add documents to the index</span>
index.<span style="color: #660066;">add</span><span style="color: #009900;">&#40;</span>doc1<span style="color: #009900;">&#41;</span>
index.<span style="color: #660066;">add</span><span style="color: #009900;">&#40;</span>doc2<span style="color: #009900;">&#41;</span>
index.<span style="color: #660066;">add</span><span style="color: #009900;">&#40;</span>doc3<span style="color: #009900;">&#41;</span></pre></td></tr></table></div><p>Searching is simple &#8211; one neat tidbit I found is that you can inspect the index easily, since it&#8217;s just a JS object:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #006600; font-style: italic;">// Run a search</span>
index.<span style="color: #660066;">search</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'foo'</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #006600; font-style: italic;">// Inspect the actual index to see which docs match a term</span>
index.<span style="color: #660066;">tokenStore</span>.<span style="color: #660066;">root</span>.<span style="color: #660066;">f</span>.<span style="color: #660066;">o</span>.<span style="color: #660066;">o</span>.<span style="color: #660066;">docs</span></pre></td></tr></table></div><p>When I was first introduced to full-text indexing, I was confused by what is meant by a “document”: the term generalizes beyond a PDF or Office document to any database row, possibly including large blobs of text.</p><p>Full-text search would be pretty dumb if you had to rebuild the index every time, and Lunr makes it really easy to serialize and deserialize the index itself:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">var</span> serializedIndex <span style="color: #339933;">=</span> JSON.<span style="color: #660066;">stringify</span><span style="color: #009900;">&#40;</span>index1.<span style="color: #660066;">toJSON</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #000066; font-weight: bold;">var</span> deserializedIndex <span style="color: #339933;">=</span> JSON.<span style="color: #660066;">parse</span><span style="color: #009900;">&#40;</span>serializedIndex<span style="color: #009900;">&#41;</span>
<span style="color: #000066; font-weight: bold;">var</span> index2 <span style="color: #339933;">=</span> lunr.<span style="color: #660066;">Index</span>.<span style="color: #660066;">load</span><span style="color: #009900;">&#40;</span>deserializedIndex<span style="color: #009900;">&#41;</span></pre></td></tr></table></div><p>Index.toJSON returns a “bean” style object (not a string). I’ve never seen an API like this, and I really like the idea: it gives you a clean Javascript object with only the data that requires serialization.</p><p>The following are attributes of the index:</p><ul><li>corpusTokens &#8211; Sorted list of tokens</li><li>documentStore &#8211; The list of documents (concatenated when indexes are combined)</li><li>fields &#8211; The fields used to describe each document (similar to database columns)</li><li>pipeline &#8211; The pipeline object used to process tokens</li><li>tokenStore &#8211; Where and how often words are referenced in each document</li></ul><p>One great thing about this type of index is that the work can be done in parallel and then combined as a map-reduce job. Only three entries from the above object need to be combined, as “fields” and “pipeline” are static. The following demonstrates the implementation of the reduction step (note jQuery is referenced):</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span> reduce<span style="color: #009900;">&#40;</span>a<span style="color: #339933;">,</span> b<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #000066; font-weight: bold;">var</span> j1 <span style="color: #339933;">=</span> a.<span style="color: #660066;">toJSON</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #000066; font-weight: bold;">var</span> j2 <span style="color: #339933;">=</span> b.<span style="color: #660066;">toJSON</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
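  <span style="color: #006600; font-style: italic;">// corpusTokens must stay sorted and free of duplicates. $.unique is</span>
  <span style="color: #006600; font-style: italic;">// documented for DOM elements only, so sort and dedupe by hand.</span>
  <span style="color: #000066; font-weight: bold;">var</span> corpusTokens <span style="color: #339933;">=</span>
      j1.<span style="color: #660066;">corpusTokens</span>.<span style="color: #660066;">concat</span><span style="color: #009900;">&#40;</span>j2.<span style="color: #660066;">corpusTokens</span><span style="color: #009900;">&#41;</span>.<span style="color: #660066;">sort</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
        .<span style="color: #660066;">filter</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>token<span style="color: #339933;">,</span> i<span style="color: #339933;">,</span> all<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #000066; font-weight: bold;">return</span> i <span style="color: #339933;">===</span> <span style="color: #CC0000;">0</span> <span style="color: #339933;">||</span> token <span style="color: #339933;">!==</span> all<span style="color: #009900;">&#91;</span>i <span style="color: #339933;">-</span> <span style="color: #CC0000;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>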
&nbsp;
  <span style="color: #006600; font-style: italic;">// It's important to create new arrays and</span>
  <span style="color: #006600; font-style: italic;">// objects throughout, or else you modify </span>
  <span style="color: #006600; font-style: italic;">// the source indexes, which is disastrous.</span>
  <span style="color: #000066; font-weight: bold;">var</span> documentStore <span style="color: #339933;">=</span>
     <span style="color: #009900;">&#123;</span>store<span style="color: #339933;">:</span> $.<span style="color: #660066;">extend</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                      j1.<span style="color: #660066;">documentStore</span>.<span style="color: #660066;">store</span><span style="color: #339933;">,</span>
                      j2.<span style="color: #660066;">documentStore</span>.<span style="color: #660066;">store</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
      length<span style="color: #339933;">:</span> j1.<span style="color: #660066;">documentStore</span>.<span style="color: #660066;">length</span> <span style="color: #339933;">+</span> j2.<span style="color: #660066;">documentStore</span>.<span style="color: #660066;">length</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">var</span> jt1 <span style="color: #339933;">=</span> j1.<span style="color: #660066;">tokenStore</span><span style="color: #339933;">;</span>
  <span style="color: #000066; font-weight: bold;">var</span> jt2 <span style="color: #339933;">=</span> j2.<span style="color: #660066;">tokenStore</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #006600; font-style: italic;">// The 'true' here triggers a deep copy</span>
  <span style="color: #000066; font-weight: bold;">var</span> tokenStore <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    root<span style="color: #339933;">:</span> $.<span style="color: #660066;">extend</span><span style="color: #009900;">&#40;</span><span style="color: #003366; font-weight: bold;">true</span><span style="color: #339933;">,</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> jt1.<span style="color: #660066;">root</span><span style="color: #339933;">,</span> jt2.<span style="color: #660066;">root</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
    length<span style="color: #339933;">:</span> jt1.<span style="color: #660066;">length</span> <span style="color: #339933;">+</span> jt2.<span style="color: #660066;">length</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">return</span> <span style="color: #009900;">&#123;</span>version<span style="color: #339933;">:</span> j1.<span style="color: #660066;">version</span><span style="color: #339933;">,</span>
          fields<span style="color: #339933;">:</span> $.<span style="color: #660066;">merge</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> j1.<span style="color: #660066;">fields</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
          ref<span style="color: #339933;">:</span> j1.<span style="color: #660066;">ref</span><span style="color: #339933;">,</span>
          documentStore<span style="color: #339933;">:</span> documentStore<span style="color: #339933;">,</span>
          tokenStore<span style="color: #339933;">:</span> tokenStore<span style="color: #339933;">,</span>
          corpusTokens<span style="color: #339933;">:</span> corpusTokens<span style="color: #339933;">,</span>
          pipeline<span style="color: #339933;">:</span> $.<span style="color: #660066;">merge</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> j1.<span style="color: #660066;">pipeline</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#40;</span>index1<span style="color: #339933;">,</span> index2<span style="color: #009900;">&#41;</span></pre></td></tr></table></div><p>I tested this by creating three indexes: index1, index2, and index3. index1 is {doc1}, index2 is {doc2, doc3}, and index3 is {doc1, doc2, doc3}. To test the code, give the reduction above a callable name (combine below) and simply diff the two serialized results:</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;">JSON.<span style="color: #660066;">stringify</span><span style="color: #009900;">&#40;</span>index3.<span style="color: #660066;">toJSON</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
JSON.<span style="color: #660066;">stringify</span><span style="color: #009900;">&#40;</span>combine<span style="color: #009900;">&#40;</span>index1<span style="color: #339933;">,</span> index2<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div><p><strong>Possibilities<br
/> </strong><br
/> Overall this technique wastes a lot of network I/O, which makes it seem silly. On the other hand, there are listings on eBay and Fiverr selling &#8220;traffic&#8221;, which typically comes from pop-unders, botnets, hidden iframes, etc. You can easily find listings like “20,000 hits for $3”, and less in bulk. This is typically cheap because it has little commercial value other than perpetrating various forms of fraud.</p><p>You’d need a cheap VM with loads of bandwidth to use as a proxy, as well as publicly available data; you couldn’t use this as a scraping technique due to browser protections against cross-domain requests. You&#8217;d also need to generate unique document IDs in a consistent fashion, perhaps using the original URL.</p><p>If a traffic source runs on modern browsers, one could use this as a source of potentially cheap and unlimited processing power, even to the point of combining the indexes, although provisions must be made for the natural instability of the system.</p><p>If you enjoyed this, you might also enjoy the following:</p><ul><li><a
href="http://garysieling.com/blog/building-a-full-text-index-of-git-commits-using-lunr-js-and-github-apis">Building a full-text index of git commits using lunr.js and Github APIs</a></li><li><a
href="http://garysieling.com/blog/extracting-pdf-text-with-scala">Extracting PDF Text with Scala</a></li><li><a
href="http://garysieling.com/blog/building-a-naive-bayes-classifier-in-the-browser-using-map-reduce">Building a Naive Bayes Classifier in the Browser using Map-Reduce</a></li><li><a
href="http://garysieling.com/blog/scraping-adsense-ads-with-phantomjs">Scraping Adsense Ads with PhantomJS</a></li><li><a
href="http://garysieling.com/blog/converting-git-commit-history-to-a-solr-full-text-index">Converting Git Commit History to a Solr Full Text Index</a></li></ul><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/building-a-full-text-index-in-javascript/feed</wfw:commentRss> <slash:comments>6</slash:comments> </item> <item><title>Lessons Learned from 0 to 40,000 Readers</title><link>http://garysieling.com/blog/lessons-learned-from-0-to-40000-blog-readers?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=lessons-learned-from-0-to-40000-blog-readers</link> <comments>http://garysieling.com/blog/lessons-learned-from-0-to-40000-blog-readers#comments</comments> <pubDate>Fri, 10 May 2013 03:05:58 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Business]]></category> <category><![CDATA[hacker news]]></category> <category><![CDATA[security]]></category> <category><![CDATA[twitter]]></category> <category><![CDATA[wordpress]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1030</guid> <description><![CDATA[Starting Out I started writing a little over a year ago, after finding “Technical Blogging” by Antonio Cangiano through Hacker News. Since then, a bit over 40,000 people have read articles I&#8217;ve written, not a huge number in the grand scheme of things, but enough to draw a few lessons. The more I write, the [...]]]></description> <content:encoded><![CDATA[<div
style="clear:both"></div><p><strong>Starting Out</strong></p><p>I started writing a little over a year ago, after finding “<a
href="https://www.amazon.com/gp/yourstore/home/?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;linkCode=ur2&amp;tag=thesecrelifeo-20">Technical Blogging</a>” by Antonio Cangiano through Hacker News. Since then, a bit over 40,000 people have read articles I&#8217;ve written, not a huge number in the grand scheme of things, but enough to draw a few lessons. The more I write, the more good things happen on their own, but at random moments.</p><p>Before starting to write, I came upon ~650 followers on Google Plus through a <a
href="https://plus.google.com/106419647632534512037/posts">mutual-follow program on Hacker News</a> and was interested in the SEO potential. If you’re signed into Google, people you follow are promoted in search results. Before this I built a site for a family member selling <a
href="http://www.makingbeehives.com">beehive plans</a> and learned a bit about SEO. It turns out that making money by selling $10 print books in a narrow niche is difficult, especially when you can’t do the writing yourself, so I’ve pivoted to pursue technical subjects.</p><p><strong>Choosing Material</strong></p><p>Antonio recommends writing in a focused niche, but I am still developing direction. I started with vague ideas based on things that interest me &#8211; <a
href="http://garysieling.com/blog/category/database-performance-tuning">database tuning</a>, <a
href="http://garysieling.com/blog/category/data-mining">machine learning</a>, <a
href="http://garysieling.com/blog/category/data-science">data science</a>, <a
href="http://garysieling.com/blog/tag/statistics">statistics</a>, and practical applications of these. In practice, I’ve written on wider subjects &#8211; anything within “full stack web development” is fair game, trying to focus on new, or popular tech &#8211; <a
href="http://garysieling.com/blog/tag/scala">Scala</a>, DevOps (<a
href="http://garysieling.com/blog/tag/vagrant">Vagrant</a>/Chef/<a
href="http://garysieling.com/blog/category/virtualization">Virtualization</a>), <a
href="http://garysieling.com/blog/tag/hadoop">Hadoop</a>, <a
href="http://garysieling.com/blog/tag/r">R</a>, and <a
href="http://garysieling.com/blog/tag/scraping">scraping</a>. Typically I write with an audience in mind &#8211; e.g. a niche forum or subreddit. When this works well, it often ends up in the forum without my help, as in the case of writing about <a
href="http://garysieling.com/blog/tag/flippa">Flippa auction data</a>, where I received links from the <a
href="http://experienced-people.net/forums/forum.php">experienced-people forum</a> and the Flippa corporate blog.</p><p>Choosing a strategy for generating material can be challenging. There are several models for success: just write what you’re interested in; pick topics with a good ratio of searches to existing quality content; pick an area where you own a product; or document knowledge from your consulting services. The last approach is particularly valuable, as it’s easy to establish business value. This has been used to build info-products with great success by <a
href="https://training.kalzumeus.com/">Patrick Mckenzie</a>, <a
href="http://doubleyourfreelancingrate.com/">Brennan Dunn</a>, and <a
href="http://nathanbarry.com/books/">Nathan Barry</a>.</p><p>I’ve followed the first two approaches &#8211; there are a few tools to help you pick topics, such as the very <a
href="http://marketsamurai.com/c/gms12345">spammy looking, but useful Market Samurai</a>. For a person with a generic IT blog, several types of articles naturally arise from this approach: proof of concepts, weird undocumented error messages, things that ought to be in product documentation, and things that are in product documentation, but can’t be found through Google for whatever reason.</p><p>I’ve found a few motivating examples of the last case- “<a
href="http://garysieling.com/blog/tag/r">R</a>” is hard to search for, and a lot of their API documentation is in PDFs (weird&#8230;). Before receiving VC funding, <a
href="http://garysieling.com/blog/tag/extjs">ExtJS</a> had a period where their primary developers would mock bloggers on their forums, which seems to have somewhat limited the press they received outside their forum. They also made their help in a Rich Javascript UI that Google had trouble spidering.</p><p><a
href="http://garysieling.com/blog/scraping-adsense-ads-with-phantomjs">Detailed proof of concepts</a> lead to people with the title “Founder” contacting you about jobs, an improvement over typical recruiter spam. <a
href="http://garysieling.com/blog/tag/extjs">Writing resembling technical documentation</a> or <a
href="http://garysieling.com/blog/fixing-vagrant-error-chefexceptionscookbooknotfound">error messages</a> leads to a lot of people contacting me for help. I get enough traffic from this approach that I’m established as an authority who can <a
href="http://garysieling.com/blog/category/book-reviews">review Javascript books</a>.</p><p>Another article generation technique that works well is “_Old Concept_ with _New Tool_”. E.g. &#8220;<a
href="http://garysieling.com/blog/building-a-json-webservice-in-r">Building a JSON Webservice in R</a>&#8220;, “<a
href="http://garysieling.com/blog/implementing-k-means-in-scala">Implementing K-Means in Scala</a>” &#8211; basically trend-surfing. A key success of this is that someone actually applied to a job where I work because of a post called “<a
href="http://garysieling.com/blog/scala-vs-clojure">Scala vs. Clojure</a>.” The appeal of this tactic for the author is the ease of writing; for the reader, it allows them to see a familiar and easily digestible concept in a new language.</p><p><strong>Editing</strong></p><p>I have a long list of potential titles, which I update whenever I recognize something that might be of interest. I also have an editing checklist, which I refine over time. For programming blogs, how to format code is important &#8211; I use a WordPress plugin based on <a
href="http://qbnz.com/highlighter/">GeSHi</a>. I found that code written in a REPL is often very dense, and I&#8217;ve had complaints from people on Reddit, so I try to format the code with more whitespace than I would normally use. I also push most of my code to github (if there is enough to justify it) &#8211; this lets me get feedback from people who are using it.</p><p><strong>Promotion</strong></p><p>Recently, Google Webmaster Tools showed a massive spike (400%) in how often they showed my blog in search. This was because they put my wordpress <a
href="http://www.garysieling.com/blog/tag/scraping">/tag/scraping</a> at the top of search results for a day. Few people clicked through &#8211; my WP tag pages were not set up to look appealing, and I’ve since begun to fix that. To that end, I wrote <a
href="http://garysieling.com/blog/extracting-social-media-vote-counts-for-reddit-twitter-google-and-hacker-news">a script to extract counts</a> from Twitter, HN, and Reddit, to estimate popularity, so I could make each tag page a “Guide” to the subject in question, filtered to the most popular and useful articles I’ve written on each subject.</p><p>The other SEO lesson is that the “not provided” keyword in Google Analytics removes a lot of insight. When a user is logged into their Gmail account, Google blocks keywords from being sent to the recipient site. A lot of traffic is also missing the name of the referring domain entirely, a frequent occurrence when one of my articles is posted on Reddit. For me, “not provided” has risen from 50% of search traffic a year ago to 80% today. Since Google Analytics lets you see top articles by referring domain, having detailed URLs helps make up for the lack of keyword data.</p><p>I must confess, I’ve thought for a long time that Twitter was kind of stupid, but recently I’ve been disabused of this notion, although I may be the last person to figure it out. While attending <a
href="http://garysieling.com/blog/?s=philly%20ete">Philly ETE</a> I realized that during the keynotes people sit there checking their phones &#8211; looking around the room you see Twitter apps. Writing summaries of conference talks using the conference hashtag resulted in a bunch of local people reading my articles and following me. Software that follows Twitter hashtags for you is essentially a niche search engine, of which it turns out there are many &#8211; <a
href="https://www.hnsearch.com/">hnsearch</a>, and the WordPress plugin search.</p><p>I’ve written several articles which have been posted to Twitter by 20+ people. I only noticed in hindsight &#8211; t.co is the 6th most productive source of visits for me. Some of these posts are automated &#8211; e.g. from HN or DZone. One of the most popular of these posts was an entirely hypothetical commentary on using map-reduce techniques to <a
href="http://garysieling.com/blog/building-a-naive-bayes-classifier-in-the-browser-using-map-reduce">farm database queries out to browsers</a>. It’s the only thing I’ve written that seems to have received significant attention on Google+ (19 events on G+, 24 on Twitter). Discovering how often this was shared prompted this post &#8211; I was surprised as it was something I’d whipped off in half an hour.</p><p>A few people have shared my articles on Reddit, which has introduced me to new subreddits, but most of these posts have fared poorly due to poor targeting. Posts I’ve made received more votes, even though they are self posts, because they are at least relevant. Unlike Reddit or Hacker News, DZone actively promotes its writers. If your articles do well on the social media/voting side of their forum, they invite you to join their network, which I have found very helpful.</p><p><strong>Technical Considerations</strong></p><p>Most default WordPress themes render terribly on mobile devices, so it’s worth testing, preferably on both Android and iOS. For me, about 10% of my visitors are on phones. The most common rendering issue is that they render the same view a desktop would see, but scaled down so small that you have to zoom and scroll both left-right and up-down. Installing a mobile plugin made my blog render usably, although not prettily, which has reduced the number of people who just give up.</p><p>For people who enter through mobile, most come from an app (Reddit, Hacker News, or Twitter), an email link, or an RSS subscription. An important consideration is whether it’s actually possible for someone interested to follow you if they are on a phone. On a desktop you can easily copy a URL to an RSS reader, or type in an email address, but on a phone it’s harder, which I suspect calls for tighter integration with Twitter, etc., but I haven’t proved this out yet.</p><p>The second technical issue worth paying attention to is <a
href="http://garysieling.com/blog/halving-page-load-time-with-pngcrush">page load time</a>. Even when it’s fast for you, it’s not for others. I set this blog up on a <a
href="http://www.linode.com/?r=e0aff3e71285079f7ffaaee9b7a92b6bcb3d6295">Linode host</a>, and have been slowly improving this (e.g. APC, Image sizes, more memory from <a
href="http://www.linode.com/?r=e0aff3e71285079f7ffaaee9b7a92b6bcb3d6295">Linode</a>, Apache caching). The <a
href="http://wpengine.com/blog/">WPEngine blog</a> is extremely helpful for finding useful tidbits &#8211; the wasted effort I&#8217;ve gone through tuning this site seems to be what they are trying to help businesses avoid.</p><p>If you run your own server, you eventually will get hacked, especially if you use popular software &#8211; I’ve recently started using <a
href="http://wordpress.org/extend/plugins/better-wp-security/">Better WP Security</a> to mitigate this risk.</p><p>It should come as no surprise that spam has become very sophisticated- for instance, there are people scraping comments from other sites to post on yours. I accepted one comment I never was able to figure out &#8211; a character named “Weng Fu”, who goes around to tech blogs praising the glories of VB6 (!), presumably an exercise in trolling. I forgot about this comment until recently, when I found that Google Webmaster Tools shows I rank well for “Weng Fu VB6” &#8211; who knew.</p><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/lessons-learned-from-0-to-40000-blog-readers/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>ExtJs JSON Reader Example</title><link>http://garysieling.com/blog/extjs-json-reader-example?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=extjs-json-reader-example</link> <comments>http://garysieling.com/blog/extjs-json-reader-example#comments</comments> <pubDate>Wed, 08 May 2013 08:33:22 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[extjs]]></category> <category><![CDATA[extjs 3.4]]></category> <category><![CDATA[extjs examples]]></category> <category><![CDATA[extjs tutorials]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1012</guid> <description><![CDATA[I received the following email from a reader: Thank you very much for finding time to read my mail. I came across your blog http://garysieling.com/blog/extjs-pie-chart-example It would be greatly helpful, if you could provide me with the code of binding the data dynamically to the DS. I have already generated the data in JSON format via [...]]]></description> <content:encoded><![CDATA[<div
style="clear:both"></div><div>I received the following email from a reader:</div><div></div><div></div><div><em>Thank you very much for finding time to read my mail.</em></div><div></div><div><em>I came across your blog <a
href="http://garysieling.com/blog/extjs-pie-chart-example" target="_blank">http://garysieling.com/blog/<wbr
/>extjs-pie-chart-example</a></em></div><div></div><div></div><div><em>It would be greatly helpful, if you could provide me with the code of binding the data dynamically to the DS. I have already generated the data in JSON format via Servlet:</em></p><div></div><div></div><p><em>{&#8220;success&#8221;:true,&#8221;campaignList&#8221;<wbr
/>:[{"NumberOfCampaigns":1,"<wbr
/>CamapaignScheduleDate":"Mar 23, 2013 12:00:00 AM"}]}</em></p><div></div><div></div><p><em>How do I attach this here..</em></p><div></div><div></div><p><em>var store = Ext.create(&#8216;Ext.data.Store&#8217;, {</em><br
/> <em>    model: &#8216;PopulationPoint&#8217;,</em><br
/> <em>    data: <span
style="color: #cc0000;">MY_DATA</span></em><br
/> <em>  });</em></p><div></div><div></div><div><em>Any help would be of great use.</em></div></div><div><em><span
style="font-family: verdana, sans-serif;"> </span></em></div><div><div><em>Thanks and regards</em></div><div></div><div><strong>Response:</strong></div><div></div><div>There are a couple of ways to approach this problem. The linked example that I wrote uses a raw JavaScript object to render a pie chart. Let&#8217;s say your servlet returns JSON in a variable &#8211; you could parse this yourself and use the result in the store, but Ext provides JsonStore (a store preconfigured with a JsonReader) for this purpose:</div><div></div><div
class="wp_syntax"><table><tr><td
class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">var</span> store <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">new</span> Ext.<span style="color: #660066;">data</span>.<span style="color: #660066;">JsonStore</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#123;</span>
    url<span style="color: #339933;">:</span> <span style="color: #3366CC;">'/servlet'</span><span style="color: #339933;">,</span>
    root<span style="color: #339933;">:</span> <span style="color: #3366CC;">'campaignList'</span><span style="color: #339933;">,</span>
    fields<span style="color: #339933;">:</span> <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#123;</span>name<span style="color: #339933;">:</span><span style="color: #3366CC;">'NumberOfCampaigns'</span><span style="color: #339933;">,</span> type<span style="color: #339933;">:</span> <span style="color: #3366CC;">'int'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
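             // the payload quoted above spells this key 'CamapaignScheduleDate'; mapping aliases it
             <span style="color: #009900;">&#123;</span>name<span style="color: #339933;">:</span><span style="color: #3366CC;">'CampaignScheduleDate'</span><span style="color: #339933;">,</span> mapping<span style="color: #339933;">:</span><span style="color: #3366CC;">'CamapaignScheduleDate'</span><span style="color: #339933;">,</span> type<span style="color: #339933;">:</span><span style="color: #3366CC;">'date'</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#93;</span>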
<span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div><div></div><div></div><div>A couple things to note here- this assumes you know the types in advance, and in particular, you want to test dates carefully to make sure you render them in a format Ext can parse.  If needed, you can specify a date format as well, like so: dateFormat: &#8216;m-d-Y g:i A&#8217;</div><div></div><div>You can add a &#8220;mapping&#8221; to the fields, to alias columns from the servlet. The root value is optional &#8211; if a servlet returns a raw array this is not necessary. You may need to add &#8220;totalProperty&#8221; to the store as well &#8211; this specifies the name of a property in the JSON payload which specifies the total number of records. This is only needed for paging scenarios, where not all the results are returned at once.</div></div><div
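></div><p>As a rough illustration of what &#8220;root&#8221;, &#8220;mapping&#8221;, and &#8220;totalProperty&#8221; do, here is a minimal plain-JavaScript sketch. It is not Ext&#8217;s implementation: readPayload is a hypothetical helper written for this example, and the only data it uses is the payload quoted in the email above, including its &#8220;CamapaignScheduleDate&#8221; spelling, which is the kind of thing a mapping is handy for.</p>

```javascript
// Hypothetical helper, NOT part of Ext: mimics how a JSON reader applies
// root, per-field mapping, and totalProperty to a parsed response.
function readPayload(payload, config) {
  // root is optional: without it the payload itself is the array of rows
  var rows = config.root ? payload[config.root] : payload;

  var records = rows.map(function (row) {
    var record = {};
    config.fields.forEach(function (field) {
      // mapping names the key in the response; default to the field name
      var raw = row[field.mapping || field.name];
      record[field.name] = field.type === 'int' ? parseInt(raw, 10) : raw;
    });
    return record;
  });

  // totalProperty only matters for paging; fall back to the rows returned
  var total = config.totalProperty ? payload[config.totalProperty] : records.length;
  return { records: records, total: total };
}

// The payload from the reader's email, as a parsed object
var payload = {
  success: true,
  campaignList: [
    { NumberOfCampaigns: 1, CamapaignScheduleDate: 'Mar 23, 2013 12:00:00 AM' }
  ]
};

var result = readPayload(payload, {
  root: 'campaignList',
  fields: [
    { name: 'NumberOfCampaigns', type: 'int' },
    // alias the misspelled response key to a cleaner field name
    { name: 'CampaignScheduleDate', mapping: 'CamapaignScheduleDate' }
  ]
});
```

<p>The date is left as a string here on purpose, since date parsing is the part worth testing against your real data.</p><div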
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/extjs-json-reader-example/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Entity recognition with Scala and Stanford NLP Named Entity Recognizer</title><link>http://garysieling.com/blog/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer</link> <comments>http://garysieling.com/blog/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer#comments</comments> <pubDate>Tue, 07 May 2013 00:40:21 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[Data Mining]]></category> <category><![CDATA[Data Science]]></category> <category><![CDATA[java]]></category> <category><![CDATA[natural language processing]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[pdf]]></category> <category><![CDATA[scala]]></category> <category><![CDATA[scraping]]></category> <category><![CDATA[stanford]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=1002</guid> <description><![CDATA[The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it&#8217;s fairly good at finding nouns, but not always at identifying the type of each noun. In this example, the entities I&#8217;d like to [...]]]></description> <content:encoded><![CDATA[<div
style="clear:both"></div><p>The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it&#8217;s fairly good at finding nouns, but not always at identifying the type of each noun.</p><p>In this example, the entities I&#8217;d like to see are different &#8211; companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.</p><p>For this text, selecting different options sometimes led to the classifier picking different options for a noun &#8211; one time it&#8217;s a person, another time it&#8217;s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes &#8211; if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I&#8217;ve noticed similar issues with company names.</p><div
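></div><p>The voting idea could look roughly like the following, sketched in JavaScript rather than Scala to keep it short. It assumes each classifier&#8217;s output has already been flattened into one label per token, and that all classifiers tokenize the text identically; majorityLabel and voteTokens are hypothetical helpers, not part of Stanford NER.</p>

```javascript
// Return the label most classifiers agreed on for a single token.
// Ties go to the label that was seen first; an empty list yields null.
function majorityLabel(votes) {
  if (votes.length === 0) {
    return null;
  }
  var counts = {};
  votes.forEach(function (label) {
    counts[label] = (counts[label] || 0) + 1;
  });
  // Object.keys preserves insertion order for these string keys,
  // so keeping `best` on equal counts favors the earliest label.
  return Object.keys(counts).reduce(function (best, label) {
    return Math.max(counts[best], counts[label]) === counts[best] ? best : label;
  });
}

// taggings[c][i] is the label classifier c assigned to token i.
function voteTokens(taggings) {
  return taggings[0].map(function (unused, i) {
    var votes = taggings.map(function (labels) {
      return labels[i];
    });
    return majorityLabel(votes);
  });
}

// Made-up labels for a two-token input from three classifiers
var voted = voteTokens([
  ['PERSON', 'O'],
  ['ORGANIZATION', 'O'],
  ['PERSON', 'O']
]);
```

<p>A real version would also need a rule for the case where the classifiers disagree on where an entity starts and ends, which this sketch ignores.</p><div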
class="wp_syntax"><table><tr><td
class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.parser.pdf._</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.metadata._</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.parser._</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io._</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.xml.sax._</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">edu.stanford.nlp.ie.crf.CRFClassifier</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">edu.stanford.nlp.ling.CoreAnnotations</span>
&nbsp;
object pdfHandler <span style="color: #000000; font-weight: bold;">extends</span> <span style="color: #003399;">ContentHandler</span> <span style="color: #009900;">&#123;</span>
  val contents<span style="color: #339933;">:</span> <span style="color: #003399;">StringBuffer</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">StringBuffer</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
&nbsp;
  def characters<span style="color: #009900;">&#40;</span>ch<span style="color: #339933;">:</span> <span style="color: #003399;">Array</span><span style="color: #009900;">&#91;</span>Char<span style="color: #009900;">&#93;</span>, start<span style="color: #339933;">:</span> Int, length<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    // SAX may reuse ch as a buffer; only the slice from start to start + length belongs to this event
    contents.<span style="color: #006633;">append</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#40;</span>ch, start, length<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def endDocument<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def endElement<span style="color: #009900;">&#40;</span>uri<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, localName<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, qName<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def endPrefixMapping<span style="color: #009900;">&#40;</span>prefix<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def ignorableWhitespace<span style="color: #009900;">&#40;</span>ch<span style="color: #339933;">:</span> <span style="color: #003399;">Array</span><span style="color: #009900;">&#91;</span>Char<span style="color: #009900;">&#93;</span>, start<span style="color: #339933;">:</span> Int, length<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def processingInstruction<span style="color: #009900;">&#40;</span>target<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def setDocumentLocator<span style="color: #009900;">&#40;</span>locator<span style="color: #339933;">:</span> Locator<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def skippedEntity<span style="color: #009900;">&#40;</span>name<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def startDocument<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def startElement<span style="color: #009900;">&#40;</span>uri<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, localName<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, qName<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, atts<span style="color: #339933;">:</span> <span style="color: #003399;">Attributes</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def startPrefixMapping<span style="color: #009900;">&#40;</span>prefix<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, uri<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
object pdf <span style="color: #000000; font-weight: bold;">extends</span> App <span style="color: #009900;">&#123;</span>
  val file <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;&quot;</span><span style="color: #0000ff;">&quot;e:<span style="color: #000099; font-weight: bold;">\d</span>ata<span style="color: #000099; font-weight: bold;">\1</span>1-1285_i4dk.pdf&quot;</span><span style="color: #0000ff;">&quot;&quot;</span>
&nbsp;
  val pdf<span style="color: #339933;">:</span> PDFParser <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> PDFParser<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  val stream<span style="color: #339933;">:</span> <span style="color: #003399;">InputStream</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">FileInputStream</span><span style="color: #009900;">&#40;</span>file<span style="color: #009900;">&#41;</span>
  val handler<span style="color: #339933;">:</span> <span style="color: #003399;">ContentHandler</span> <span style="color: #339933;">=</span> pdfHandler
  val metadata<span style="color: #339933;">:</span> Metadata <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Metadata<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
  val context<span style="color: #339933;">:</span> ParseContext <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> ParseContext<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
&nbsp;
  pdf.<span style="color: #006633;">parse</span><span style="color: #009900;">&#40;</span>stream,
    handler,
    metadata,
    context<span style="color: #009900;">&#41;</span>
&nbsp;
  stream.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
&nbsp;
  val contents<span style="color: #339933;">:</span> <span style="color: #003399;">String</span> <span style="color: #339933;">=</span> pdfHandler.<span style="color: #006633;">contents</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
  println<span style="color: #009900;">&#40;</span>contents<span style="color: #009900;">&#41;</span>
&nbsp;
  val src <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;stanford-ner-2013-04-04/classifiers/&quot;</span>
  val classifier1 <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;english.all.3class.distsim.crf.ser.gz&quot;</span>
  val classifier2 <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;english.conll.4class.distsim.crf.ser.gz&quot;</span>
  val classifier3 <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;english.muc.7class.distsim.crf.ser.gz&quot;</span>
&nbsp;
  val serializedClassifier <span style="color: #339933;">=</span> src <span style="color: #339933;">+</span> classifier1
&nbsp;
  val classifier <span style="color: #339933;">=</span> CRFClassifier.<span style="color: #006633;">getClassifierNoExceptions</span><span style="color: #009900;">&#40;</span>serializedClassifier<span style="color: #009900;">&#41;</span>
  val out <span style="color: #339933;">=</span> classifier.<span style="color: #006633;">classify</span><span style="color: #009900;">&#40;</span>contents<span style="color: #009900;">&#41;</span>
&nbsp;
  var words <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span>
  <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">&lt;-</span> <span style="color: #cc66cc;">0</span> to out.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    val sentence <span style="color: #339933;">=</span> out.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>i<span style="color: #009900;">&#41;</span>
&nbsp;
    var foundWord <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;&quot;</span>
    var oldWordClass <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;&quot;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>j <span style="color: #339933;">&lt;-</span> <span style="color: #cc66cc;">0</span> to sentence.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      val word <span style="color: #339933;">=</span> sentence.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>j<span style="color: #009900;">&#41;</span>
      val wordClass <span style="color: #339933;">=</span> word.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>classOf<span style="color: #009900;">&#91;</span>CoreAnnotations.<span style="color: #006633;">AnswerAnnotation</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;&quot;</span>
&nbsp;
      <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>oldWordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span>wordClass<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>oldWordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;O&quot;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&amp;&amp;</span> <span style="color: #339933;">!</span>oldWordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          print<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;[/&quot;</span> <span style="color: #339933;">+</span> oldWordClass <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;]&quot;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
      <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>wordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;O&quot;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&amp;&amp;</span> <span style="color: #339933;">!</span>wordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>oldWordClass.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span>wordClass<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          print<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;[&quot;</span> <span style="color: #339933;">+</span> wordClass <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;]&quot;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
      oldWordClass <span style="color: #339933;">=</span> wordClass
&nbsp;
      words <span style="color: #339933;">=</span> words <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span>
      print<span style="color: #009900;">&#40;</span>word<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      print<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
      <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>words <span style="color: #339933;">&gt;</span> <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        words <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span>
        println<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div><pre>
11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-
Slip Opinion -RRB- OCTOBER TERM ,
2012 Syllabus NOTE : Where it
is feasible , a syllabus -LRB-
headnote -RRB- will be released ,
as isbeing done in connection with
this case , at the time
the opinion is issued . The
syllabus constitutes no part of the
opinion of the Court but has
beenprepared by the Reporter of Decisions
for the convenience of the reader
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit
Timber &#038; Lumber Co. [/ORGANIZATION], 200
U. S. 321 , 337 .
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,
IN ITS CAPACITY AS FIDUCIARY AND
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS
, INC. . EMPLOYEE BENEFITS PLAN
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF
APPEALS FOR THE THIRD CIRCUIT No.
11 -- 1285 . Argued November
27 , 2012 -- Decided April
16 , 2013 The health benefits
plan established by petitioner [ORGANIZATION]US Airways
[/ORGANIZATION]paid $ 66,866 in medical expenses
for injuries suffered by respondentMcCutchen ,
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in
a car accident caused by athird
party . The plan entitled [ORGANIZATION]US
Airways [/ORGANIZATION]to reimbursement if
[PERSON]McCutchen [/PERSON]
</pre><div
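style="clear:both"></div><p>The tagging loop above prints an opening or closing marker whenever the class changes from one token to the next. As a rough sketch of the same idea without the printing, the snippet below collapses runs of consecutive tokens that share a class into (class, phrase) pairs. It does not call Stanford NER at all: <code>EntitySpans</code>, <code>group</code> and the sample input are made up for illustration, and the (token, class) pairs are assumed to have been pulled out of the classifier output beforehand.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="scala" style="font-family:monospace;">object EntitySpans {
  // Hypothetical helper, not part of the Stanford NER API.
  // Input: (token, class) pairs in document order.
  // Output: one (class, phrase) pair per run of tokens sharing a class,
  // with the "O" (outside any entity) runs dropped.
  def group(tagged: List[(String, String)]): List[(String, String)] =
    tagged match {
      case Nil => Nil
      case (_, label) :: _ =>
        // span splits the list at the first token whose class differs
        val (run, rest) = tagged.span(_._2 == label)
        val phrase = run.map(_._1).mkString(" ")
        if (label == "O") group(rest) else (label, phrase) :: group(rest)
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample in the shape of the output shown above
    val sample = List(
      ("US", "ORGANIZATION"), ("Airways", "ORGANIZATION"),
      ("v.", "O"), ("McCutchen", "PERSON"))
    group(sample).foreach(println)
  }
}
</pre></td></tr></table></div><p>Like the loop above, this treats two adjacent entities of the same class as a single span. The recursion is not tail-recursive, so a very long document would probably be better served by a fold.</p><div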
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Extracting PDF text with Scala</title><link>http://garysieling.com/blog/extracting-pdf-text-with-scala?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=extracting-pdf-text-with-scala</link> <comments>http://garysieling.com/blog/extracting-pdf-text-with-scala#comments</comments> <pubDate>Mon, 06 May 2013 23:18:51 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[java]]></category> <category><![CDATA[pdf]]></category> <category><![CDATA[pdfbox]]></category> <category><![CDATA[scala]]></category> <category><![CDATA[scraping]]></category> <category><![CDATA[tikia]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=999</guid> <description><![CDATA[This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn&#8217;t seem to have the ability to fill in interface [...]]]></description> <content:encoded><![CDATA[<div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/extracting-pdf-text-with-scala" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div><p>This example extracts the text contents of a PDF so they can be used in other systems. It demonstrates a few basic differences from Java: triple-quoted (multi-line) strings (hooray!), wildcard imports, primitive arrays, and what implementing an interface looks like. The big downside is that the Eclipse Scala plugin doesn&#8217;t seem to be able to fill in interface methods on an object.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="scala" style="font-family:monospace;"><span style="color: #0000ff; font-weight: bold;">import</span> java.<span style="color: #000000;">io</span>.<span style="color: #000080;">_</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">import</span> org.<span style="color: #000000;">apache</span>.<span style="color: #000000;">tika</span>.<span style="color: #000000;">parser</span>.<span style="color: #000000;">pdf</span>.<span style="color: #000080;">_</span>
<span style="color: #0000ff; font-weight: bold;">import</span> org.<span style="color: #000000;">apache</span>.<span style="color: #000000;">tika</span>.<span style="color: #000000;">metadata</span>.<span style="color: #000080;">_</span>
<span style="color: #0000ff; font-weight: bold;">import</span> org.<span style="color: #000000;">apache</span>.<span style="color: #000000;">tika</span>.<span style="color: #000000;">parser</span>.<span style="color: #000080;">_</span>
<span style="color: #0000ff; font-weight: bold;">import</span> org.<span style="color: #000000;">xml</span>.<span style="color: #000000;">sax</span>.<span style="color: #000080;">_</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">object</span> pdfHandler <span style="color: #0000ff; font-weight: bold;">extends</span> ContentHandler <span style="color: #F78811;">&#123;</span>
	<span style="color: #0000ff; font-weight: bold;">def</span> characters<span style="color: #F78811;">&#40;</span>ch <span style="color: #000080;">:</span> Array<span style="color: #F78811;">&#91;</span>Char<span style="color: #F78811;">&#93;</span>, start<span style="color: #000080;">:</span> Int, length<span style="color: #000080;">:</span> Int<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
		// Print only the slice SAX handed us; the rest of the buffer may hold other data
		println<span style="color: #F78811;">&#40;</span><span style="color: #0000ff; font-weight: bold;">new</span> String<span style="color: #F78811;">&#40;</span>ch, start, length<span style="color: #F78811;">&#41;</span><span style="color: #F78811;">&#41;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> endDocument<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> endElement<span style="color: #F78811;">&#40;</span>uri<span style="color: #000080;">:</span> String, localName<span style="color: #000080;">:</span> String, qName<span style="color: #000080;">:</span> String<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> endPrefixMapping<span style="color: #F78811;">&#40;</span>prefix<span style="color: #000080;">:</span> String<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> ignorableWhitespace<span style="color: #F78811;">&#40;</span>ch<span style="color: #000080;">:</span> Array<span style="color: #F78811;">&#91;</span>Char<span style="color: #F78811;">&#93;</span>, start<span style="color: #000080;">:</span> Int, length<span style="color: #000080;">:</span> Int<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> processingInstruction<span style="color: #F78811;">&#40;</span>target<span style="color: #000080;">:</span> String, data<span style="color: #000080;">:</span> String<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> setDocumentLocator<span style="color: #F78811;">&#40;</span>locator<span style="color: #000080;">:</span> Locator<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> skippedEntity<span style="color: #F78811;">&#40;</span>name<span style="color: #000080;">:</span> String<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> startDocument<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> startElement<span style="color: #F78811;">&#40;</span>uri<span style="color: #000080;">:</span> String, localName<span style="color: #000080;">:</span> String, qName<span style="color: #000080;">:</span> String, atts<span style="color: #000080;">:</span> Attributes<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">def</span> startPrefixMapping<span style="color: #F78811;">&#40;</span>prefix<span style="color: #000080;">:</span> String, uri<span style="color: #000080;">:</span> String<span style="color: #F78811;">&#41;</span> <span style="color: #F78811;">&#123;</span>
	<span style="color: #F78811;">&#125;</span>
<span style="color: #F78811;">&#125;</span>
&nbsp;
<span style="color: #0000ff; font-weight: bold;">object</span> pdf <span style="color: #0000ff; font-weight: bold;">extends</span> App <span style="color: #F78811;">&#123;</span>
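	<span style="color: #0000ff; font-weight: bold;">val</span> folder <span style="color: #000080;">=</span> <span style="color: #6666FF;">&quot;&quot;&quot;\\nas\Files\Data\pacer2\&quot;&quot;&quot;</span>
	<span style="color: #0000ff; font-weight: bold;">val</span> subfolder <span style="color: #000080;">=</span> <span style="color: #6666FF;">&quot;&quot;&quot;\00\00\gov.uscourts.rid.6064\&quot;&quot;&quot;</span>
	<span style="color: #0000ff; font-weight: bold;">val</span> file <span style="color: #000080;">=</span> <span style="color: #6666FF;">&quot;&quot;&quot;gov.uscourts.rid.6064.20.0.pdf&quot;&quot;&quot;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">val</span> pdf <span style="color: #000080;">:</span> PDFParser <span style="color: #000080;">=</span> <span style="color: #0000ff; font-weight: bold;">new</span> PDFParser<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span>
&nbsp;
	<span style="color: #0000ff; font-weight: bold;">val</span> stream <span style="color: #000080;">:</span> InputStream <span style="color: #000080;">=</span> <span style="color: #0000ff; font-weight: bold;">new</span> FileInputStream<span style="color: #F78811;">&#40;</span>folder <span style="color: #000080;">+</span> subfolder <span style="color: #000080;">+</span> file<span style="color: #F78811;">&#41;</span>
	<span style="color: #0000ff; font-weight: bold;">val</span> handler <span style="color: #000080;">:</span> ContentHandler <span style="color: #000080;">=</span> pdfHandler
	<span style="color: #0000ff; font-weight: bold;">val</span> metadata <span style="color: #000080;">:</span> Metadata <span style="color: #000080;">=</span> <span style="color: #0000ff; font-weight: bold;">new</span> Metadata<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span>
	<span style="color: #0000ff; font-weight: bold;">val</span> context <span style="color: #000080;">:</span> ParseContext <span style="color: #000080;">=</span> <span style="color: #0000ff; font-weight: bold;">new</span> ParseContext<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span>
&nbsp;
	pdf.parse<span style="color: #F78811;">&#40;</span>stream,
		handler,
		metadata,
		context<span style="color: #F78811;">&#41;</span>
&nbsp;
	stream.close<span style="color: #F78811;">&#40;</span><span style="color: #F78811;">&#41;</span>
<span style="color: #F78811;">&#125;</span></pre></td></tr></table></div><p>Output:</p><pre>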
UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF RHODE ISLAND
...
It is hereby agreed by and between the parties that the above-captioned matter be
dismissed, with prejudice, no interest, no costs.
</pre><div
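style="clear:both"></div><p>If the text is needed as a value rather than printed, one option is a handler that accumulates it. The sketch below is an assumption on my part rather than the code used for this post: <code>TextCollector</code> and <code>TextCollectorDemo</code> are made-up names, and it extends the JDK&#8217;s <code>org.xml.sax.helpers.DefaultHandler</code>, which already implements <code>ContentHandler</code> with no-op methods, so only <code>characters</code> has to be written out.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="scala" style="font-family:monospace;">import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: collects character data instead of printing it.
class TextCollector extends DefaultHandler {
  val contents = new java.lang.StringBuilder

  // The parser may pass a larger, reused buffer, so append only the
  // slice that starts at `start` and is `length` characters long.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    contents.append(ch, start, length)
  }
}

object TextCollectorDemo {
  def main(args: Array[String]): Unit = {
    val handler = new TextCollector
    // Simulate a parser callback on a buffer with unrelated data at both ends
    val buffer = "xxHello PDFyy".toCharArray
    handler.characters(buffer, 2, 9)
    println(handler.contents.toString)
  }
}
</pre></td></tr></table></div><p>Since <code>parse</code> in the listing above takes a <code>ContentHandler</code>, an instance of this class should be usable in place of <code>pdfHandler</code>, with the extracted text read from <code>contents</code> once parsing finishes.</p><div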
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/extracting-pdf-text-with-scala" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/extracting-pdf-text-with-scala/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Implementing k-means in Scala</title><link>http://garysieling.com/blog/implementing-k-means-in-scala?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=implementing-k-means-in-scala</link> <comments>http://garysieling.com/blog/implementing-k-means-in-scala#comments</comments> <pubDate>Fri, 03 May 2013 21:37:34 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[Data Mining]]></category> <category><![CDATA[Data Science]]></category> <category><![CDATA[artificial intelligence]]></category> <category><![CDATA[clustering]]></category> <category><![CDATA[data mining]]></category> <category><![CDATA[data science]]></category> <category><![CDATA[functional programming]]></category> <category><![CDATA[java]]></category> <category><![CDATA[k-means]]></category> <category><![CDATA[scala]]></category> <category><![CDATA[statistics]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=997</guid> <description><![CDATA[To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two &#8211; the exact points used are in the code below. This implements Lloyd&#8217;s algorithm, which tries to cluster points in iterations in a simple manner: 1. Assume a certain number of [...]]]></description> <content:encoded><![CDATA[<div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/implementing-k-means-in-scala" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div><p>To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around each &#8211; the exact points used are in the code below.</p><p>This implements Lloyd&#8217;s algorithm, which clusters points iteratively in a simple manner:</p><p>1. Assume a certain number of clusters<br
/> 2. Group the points at random<br
/> 3. Compute the center of each cluster<br
/> 4. For each point, compute which cluster is closest<br
/> 5. Move all the points into new groupings<br
/> 6. Repeat steps 3-5 a few times, until you&#8217;re happy with the results</p><p>In this case, I like how the functional programming style forces you to rebuild the data structures on every pass. It might be tempting to implement this imperatively, modifying the clusters in place, but steps 4-5 need the old groupings to stay intact while the new ones are built, so the functional style protects you from making the problem harder than it is. You can see the full source below, or <a
href="https://github.com/garysieling/scala-k-means">on GitHub</a>.</p><p>Since this example is fairly contrived, it converges quickly: in the output below, the clusters stop changing after iteration 1.</p><pre>
Initial State:
  Cluster 0
  Mean: (17.83517750970944, 12.242720407317105)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (11.9096128931784, 21.1855674228972))
    (22.4668345067162, 8.9705504626857))
    (7.91362116378194, 21.325928219919))
    (22.636600400773, 2.46561420928429))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (25.1439536911272, 3.58469981317611))
    (23.5359486724204, 4.07290025106778))
    (11.7493214262468, 17.8517235677469))
    (12.4277617893575, 19.4887691804508))
    (11.931275122466, 18.0462702532436))
    (25.4645673159779, 7.54703465191098))
    (21.8031183153743, 5.69297814349064))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (16.95249500233747, 12.848199048608048)
    (11.7265904596619, 16.9636039793709))
    (10.7751248849735, 22.1517666115673))
    (23.6587920739353, 3.35476798095758))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (7.03171204763376, 19.1985058633283))
    (23.7722765903534, 3.74873642284525))
    (10.259545802461, 23.4515683763173))
    (28.1587146197594, 3.70625885635717))
    (10.1057940183815, 18.7332929859685))
    (8.90149362263775, 19.6314465074203))
    (12.4353462881232, 19.6310467981989))
    (24.3793349065557, 4.59761596097384))
    (22.5447925324242, 2.99485404382734))
    (26.8942422516129, 5.02646862012427))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (27.0339042858296, 4.4151109960116))
    (11.0118378554584, 20.9773232834654))
Iteration: 0
  Cluster 0
  Mean: (23.781370272978315, 5.754127202865132)
    (11.7265904596619, 16.9636039793709))
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.296576237184727, 20.09138475584863)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 1
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 2
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 3
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 4
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 5
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
</pre><div
class="wp_syntax"><table><tr><td
class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">class</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span>dx<span style="color: #339933;">:</span> <span style="color: #003399;">Double</span>, dy<span style="color: #339933;">:</span> <span style="color: #003399;">Double</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  val x<span style="color: #339933;">:</span> <span style="color: #003399;">Double</span> <span style="color: #339933;">=</span> dx
  val y<span style="color: #339933;">:</span> <span style="color: #003399;">Double</span> <span style="color: #339933;">=</span> dy
&nbsp;
  override def toString<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> <span style="color: #003399;">String</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #0000ff;">&quot;(&quot;</span> <span style="color: #339933;">+</span> x <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;, &quot;</span> <span style="color: #339933;">+</span> y <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;)&quot;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def dist<span style="color: #009900;">&#40;</span>p<span style="color: #339933;">:</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> <span style="color: #003399;">Double</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
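    <span style="color: #666666; font-style: italic;">// Euclidean distance between this point and p</span>
    <span style="color: #000000; font-weight: bold;">return</span> math.<span style="color: #006633;">hypot</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">-</span> p.<span style="color: #006633;">x</span>, y <span style="color: #339933;">-</span> p.<span style="color: #006633;">y</span><span style="color: #009900;">&#41;</span>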
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
object kmeans <span style="color: #000000; font-weight: bold;">extends</span> App <span style="color: #009900;">&#123;</span>
  val k<span style="color: #339933;">:</span> Int <span style="color: #339933;">=</span> <span style="color: #cc66cc;">2</span>
&nbsp;
  <span style="color: #666666; font-style: italic;">// Expected cluster centers are roughly (10, 20) and (25, 5)</span>
  val points<span style="color: #339933;">:</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#40;</span>
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">10.8348626966492</span>, <span style="color: #cc66cc;">18.7800980127523</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">10.259545802461</span>, <span style="color: #cc66cc;">23.4515683763173</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.7396623802245</span>, <span style="color: #cc66cc;">17.7026240456956</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">12.4277617893575</span>, <span style="color: #cc66cc;">19.4887691804508</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">10.1057940183815</span>, <span style="color: #cc66cc;">18.7332929859685</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.0118378554584</span>, <span style="color: #cc66cc;">20.9773232834654</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">7.03171204763376</span>, <span style="color: #cc66cc;">19.1985058633283</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">6.56491029696013</span>, <span style="color: #cc66cc;">21.5098251711267</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">10.7751248849735</span>, <span style="color: #cc66cc;">22.1517666115673</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">8.90149362263775</span>, <span style="color: #cc66cc;">19.6314465074203</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.931275122466</span>, <span style="color: #cc66cc;">18.0462702532436</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.7265904596619</span>, <span style="color: #cc66cc;">16.9636039793709</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.7493214262468</span>, <span style="color: #cc66cc;">17.8517235677469</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">12.4353462881232</span>, <span style="color: #cc66cc;">19.6310467981989</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">13.0838514816799</span>, <span style="color: #cc66cc;">20.3398794353494</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">7.7875624720831</span>, <span style="color: #cc66cc;">20.1569764307574</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">11.9096128931784</span>, <span style="color: #cc66cc;">21.1855674228972</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">8.87507602702847</span>, <span style="color: #cc66cc;">21.4823134390704</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">7.91362116378194</span>, <span style="color: #cc66cc;">21.325928219919</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">26.4748241341303</span>, <span style="color: #cc66cc;">9.25128245838802</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">26.2100410238973</span>, <span style="color: #cc66cc;">5.06220487544192</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">28.1587146197594</span>, <span style="color: #cc66cc;">3.70625885635717</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">26.8942422516129</span>, <span style="color: #cc66cc;">5.02646862012427</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">23.7770902983858</span>, <span style="color: #cc66cc;">7.19445492687232</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">23.6587920739353</span>, <span style="color: #cc66cc;">3.35476798095758</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">23.7722765903534</span>, <span style="color: #cc66cc;">3.74873642284525</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">23.9177161897547</span>, <span style="color: #cc66cc;">8.1377950229489</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">22.4668345067162</span>, <span style="color: #cc66cc;">8.9705504626857</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">24.5349708443852</span>, <span style="color: #cc66cc;">5.00561881333415</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">24.3793349065557</span>, <span style="color: #cc66cc;">4.59761596097384</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">27.0339042858296</span>, <span style="color: #cc66cc;">4.4151109960116</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">21.8031183153743</span>, <span style="color: #cc66cc;">5.69297814349064</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">22.636600400773</span>, <span style="color: #cc66cc;">2.46561420928429</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">25.1439536911272</span>, <span style="color: #cc66cc;">3.58469981317611</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">21.4930923464916</span>, <span style="color: #cc66cc;">3.28999356823389</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">23.5359486724204</span>, <span style="color: #cc66cc;">4.07290025106778</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">22.5447925324242</span>, <span style="color: #cc66cc;">2.99485404382734</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">25.4645673159779</span>, <span style="color: #cc66cc;">7.54703465191098</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">sortBy</span><span style="color: #009900;">&#40;</span>
      p <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#40;</span>p.<span style="color: #006633;">x</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; &quot;</span> <span style="color: #339933;">+</span> p.<span style="color: #006633;">y</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">hashCode</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
  def clusterMean<span style="color: #009900;">&#40;</span>points<span style="color: #339933;">:</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> <span style="color: #003399;">Point</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    val cumulative <span style="color: #339933;">=</span> points.<span style="color: #006633;">reduceLeft</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>a<span style="color: #339933;">:</span> <span style="color: #003399;">Point</span>, b<span style="color: #339933;">:</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span>a.<span style="color: #006633;">x</span> <span style="color: #339933;">+</span> b.<span style="color: #006633;">x</span>, a.<span style="color: #006633;">y</span> <span style="color: #339933;">+</span> b.<span style="color: #006633;">y</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#40;</span>cumulative.<span style="color: #006633;">x</span> <span style="color: #339933;">/</span> points.<span style="color: #006633;">length</span>, cumulative.<span style="color: #006633;">y</span> <span style="color: #339933;">/</span> points.<span style="color: #006633;">length</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def render<span style="color: #009900;">&#40;</span>points<span style="color: #339933;">:</span> <span style="color: #003399;">Map</span><span style="color: #009900;">&#91;</span>Int, <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>clusterNumber <span style="color: #339933;">&lt;-</span> points.<span style="color: #006633;">keys</span>.<span style="color: #006633;">toSeq</span>.<span style="color: #006633;">sorted</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;  Cluster &quot;</span> <span style="color: #339933;">+</span> clusterNumber<span style="color: #009900;">&#41;</span>
&nbsp;
      val meanPoint <span style="color: #339933;">=</span> clusterMean<span style="color: #009900;">&#40;</span>points<span style="color: #009900;">&#40;</span>clusterNumber<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;  Mean: &quot;</span> <span style="color: #339933;">+</span> meanPoint<span style="color: #009900;">&#41;</span>
&nbsp;
      <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>j <span style="color: #339933;">&lt;-</span> <span style="color: #cc66cc;">0</span> to points<span style="color: #009900;">&#40;</span>clusterNumber<span style="color: #009900;">&#41;</span>.<span style="color: #006633;">length</span> <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;    &quot;</span> <span style="color: #339933;">+</span> points<span style="color: #009900;">&#40;</span>clusterNumber<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#40;</span>j<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;)&quot;</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  val clusters <span style="color: #339933;">=</span>
    points.<span style="color: #006633;">zipWithIndex</span>.<span style="color: #006633;">groupBy</span><span style="color: #009900;">&#40;</span>
      x <span style="color: #339933;">=&gt;</span> x._2 <span style="color: #339933;">%</span> k<span style="color: #009900;">&#41;</span> transform <span style="color: #009900;">&#40;</span>
        <span style="color: #009900;">&#40;</span>i<span style="color: #339933;">:</span> Int, p<span style="color: #339933;">:</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Point</span>, Int<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>x <span style="color: #339933;">&lt;-</span> p<span style="color: #009900;">&#41;</span> yield x._1<span style="color: #009900;">&#41;</span>
&nbsp;
  <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Initial State: &quot;</span><span style="color: #009900;">&#41;</span>
  render<span style="color: #009900;">&#40;</span>clusters<span style="color: #009900;">&#41;</span>
&nbsp;
  def iterate<span style="color: #009900;">&#40;</span>clusters<span style="color: #339933;">:</span> <span style="color: #003399;">Map</span><span style="color: #009900;">&#91;</span>Int, <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> <span style="color: #003399;">Map</span><span style="color: #009900;">&#91;</span>Int, <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #666666; font-style: italic;">// find cluster means, ordered by cluster number so that</span>
    <span style="color: #666666; font-style: italic;">// position i in the result is the mean of cluster i</span>
    val means <span style="color: #339933;">=</span>
      <span style="color: #009900;">&#40;</span>clusters<span style="color: #339933;">:</span> <span style="color: #003399;">Map</span><span style="color: #009900;">&#91;</span>Int, <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span>
        <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>clusterIndex <span style="color: #339933;">&lt;-</span> clusters.<span style="color: #006633;">keys</span>.<span style="color: #006633;">toSeq</span>.<span style="color: #006633;">sorted</span><span style="color: #009900;">&#41;</span>
          yield clusterMean<span style="color: #009900;">&#40;</span>clusters<span style="color: #009900;">&#40;</span>clusterIndex<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// find the closest index</span>
    def closest<span style="color: #009900;">&#40;</span>p<span style="color: #339933;">:</span> <span style="color: #003399;">Point</span>, means<span style="color: #339933;">:</span> Iterable<span style="color: #009900;">&#91;</span><span style="color: #003399;">Point</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> Int <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
      val distances <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>center <span style="color: #339933;">&lt;-</span> means<span style="color: #009900;">&#41;</span> yield p.<span style="color: #006633;">dist</span><span style="color: #009900;">&#40;</span>center<span style="color: #009900;">&#41;</span>
      <span style="color: #000000; font-weight: bold;">return</span> distances.<span style="color: #006633;">zipWithIndex</span>.<span style="color: #006633;">min</span>._2
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// assignment step</span>
    val newClusters <span style="color: #339933;">=</span>
      points.<span style="color: #006633;">groupBy</span><span style="color: #009900;">&#40;</span>
        <span style="color: #009900;">&#40;</span>p<span style="color: #339933;">:</span> <span style="color: #003399;">Point</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> closest<span style="color: #009900;">&#40;</span>p, means<span style="color: #009900;">&#40;</span>clusters<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
    render<span style="color: #009900;">&#40;</span>newClusters<span style="color: #009900;">&#41;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">return</span> newClusters
  <span style="color: #009900;">&#125;</span>
&nbsp;
  var clusterToTest <span style="color: #339933;">=</span> clusters
  <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">&lt;-</span> <span style="color: #cc66cc;">0</span> to <span style="color: #cc66cc;">5</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Iteration: &quot;</span> <span style="color: #339933;">+</span> i<span style="color: #009900;">&#41;</span>
    clusterToTest <span style="color: #339933;">=</span> iterate<span style="color: #009900;">&#40;</span>clusterToTest<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div><div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/implementing-k-means-in-scala" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/implementing-k-means-in-scala/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Scala zip/zipAll Example</title><link>http://garysieling.com/blog/scala-zipzipall-example?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=scala-zipzipall-example</link> <comments>http://garysieling.com/blog/scala-zipzipall-example#comments</comments> <pubDate>Fri, 03 May 2013 20:09:28 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[functional programming]]></category> <category><![CDATA[java]]></category> <category><![CDATA[scala]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=995</guid> <description><![CDATA[The zip function combines two lists into tuples. If the lists are of differing lengths, the shorter length is used. If you don&#8217;t like this behavior, the zipAll function will keep all elements, filling in specified values for the blanks (compare this to the recycling rule in R, which lets you continuously cycle through the [...]]]></description> <content:encoded><![CDATA[<div
class="gpo_bottomcontainer"><div
class="gpo_buttons"> <g:plusone href="http://garysieling.com/blog/scala-zipzipall-example" size="standard" count="true"></g:plusone></div></div><div
style="clear:both"></div><p>The zip function combines two lists into a list of tuples, pairing elements by position. If the lists differ in length, the result is truncated to the length of the shorter one. If you don&#8217;t want that behavior, the zipAll function keeps every element and fills the gaps with default values you supply: the first default stands in for missing elements of the left list, the second for missing elements of the right list. (Compare this to the recycling rule in R, which instead cycles repeatedly through the shorter list.)</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="java" style="font-family:monospace;">val a <span style="color: #339933;">=</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;a&quot;</span>, <span style="color: #0000ff;">&quot;b&quot;</span>, <span style="color: #0000ff;">&quot;c&quot;</span>, <span style="color: #0000ff;">&quot;d&quot;</span><span style="color: #009900;">&#41;</span>
val b <span style="color: #339933;">=</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span>, <span style="color: #cc66cc;">2</span>, <span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
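&nbsp;
println<span style="color: #009900;">&#40;</span>a.<span style="color: #006633;">zip</span><span style="color: #009900;">&#40;</span>b<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>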
println<span style="color: #009900;">&#40;</span>a.<span style="color: #006633;">zipAll</span><span style="color: #009900;">&#40;</span>b, <span style="color: #0000ff;">&quot;for missing values&quot;</span>, <span style="color: #cc66cc;">100</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div><p>And here is the output for each:</p><pre>
List((a,1), (b,2), (c,3))
List((a,1), (b,2), (c,3), (d,100))
</pre><div
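style="clear:both"></div><p>For comparison, here is a rough sketch of what R-style recycling might look like in Scala. Note that zipRecycle is a hypothetical helper written for this illustration, not something from the Scala standard library, and it assumes that recycling means repeating the shorter list until the longer one runs out.</p><pre>
object ZipRecycle extends App {
  val a = List("a", "b", "c", "d")
  val b = List(1, 2, 3)

  // Hypothetical helper (an assumption for this sketch, not a library method):
  // pairs elements by index, wrapping around the shorter list with modulo.
  def zipRecycle[S, T](xs: List[S], ys: List[T]): List[(S, T)] =
    if (xs.isEmpty || ys.isEmpty) Nil
    else {
      // Vectors give cheap indexed access
      val xv = xs.toVector
      val yv = ys.toVector
      val n = math.max(xv.length, yv.length)
      List.tabulate(n)(i =&gt; (xv(i % xv.length), yv(i % yv.length)))
    }

  println(zipRecycle(a, b))
}
</pre><p>With the lists above this should print List((a,1), (b,2), (c,3), (d,1)): the last pair wraps around to the first element of the shorter list instead of using a filler value.</p><div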
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/scala-zipzipall-example/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Scala Tree Sort Example</title><link>http://garysieling.com/blog/scala-tree-sort-example?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=scala-tree-sort-example</link> <comments>http://garysieling.com/blog/scala-tree-sort-example#comments</comments> <pubDate>Fri, 03 May 2013 18:05:25 +0000</pubDate> <dc:creator>Gary</dc:creator> <category><![CDATA[Code Examples]]></category> <category><![CDATA[java]]></category> <category><![CDATA[scala]]></category> <category><![CDATA[tree]]></category> <guid
isPermaLink="false">http://garysieling.com/blog/?p=985</guid> <description><![CDATA[This demonstrates basic language features &#8211; case classes, iteration, anonymous functions, etc. abstract class Node case class LeafNode&#40;data: String&#41; extends Node; case class FullNode&#40;data: String, left: Node, right: Node&#41; extends Node case class LeftNode&#40;data: String, left: Node&#41; extends Node case class RightNode&#40;data: String, right: Node&#41; extends Node &#160; object test extends App &#123; val words [...]]]></description> <content:encoded><![CDATA[<div
style="clear:both"></div><p>This tree sort example demonstrates basic Scala language features: case classes, pattern matching, iteration, anonymous functions, etc.</p><div
class="wp_syntax"><table><tr><td
class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">abstract</span> <span style="color: #000000; font-weight: bold;">class</span> Node
<span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000000; font-weight: bold;">class</span> LeafNode<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">extends</span> Node<span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000000; font-weight: bold;">class</span> FullNode<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, left<span style="color: #339933;">:</span> Node, right<span style="color: #339933;">:</span> Node<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">extends</span> Node
<span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000000; font-weight: bold;">class</span> LeftNode<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, left<span style="color: #339933;">:</span> Node<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">extends</span> Node
<span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000000; font-weight: bold;">class</span> RightNode<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, right<span style="color: #339933;">:</span> Node<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">extends</span> Node
&nbsp;
object test <span style="color: #000000; font-weight: bold;">extends</span> App <span style="color: #009900;">&#123;</span>
  val words <span style="color: #339933;">=</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;revertively&quot;</span>, <span style="color: #0000ff;">&quot;dispelled&quot;</span>, <span style="color: #0000ff;">&quot;overmoral&quot;</span>,
    <span style="color: #0000ff;">&quot;sylphid&quot;</span>, <span style="color: #0000ff;">&quot;nonhabitability&quot;</span>, <span style="color: #0000ff;">&quot;noiselessness&quot;</span>,
    <span style="color: #0000ff;">&quot;undisconnected&quot;</span>, <span style="color: #0000ff;">&quot;shoveling&quot;</span>, <span style="color: #0000ff;">&quot;visalia&quot;</span>, <span style="color: #0000ff;">&quot;ilo&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  def construct<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> <span style="color: #003399;">List</span><span style="color: #009900;">&#91;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> Node <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    def insert<span style="color: #009900;">&#40;</span>tree<span style="color: #339933;">:</span> Node, value<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span> Node <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
      tree match <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000066; font-weight: bold;">null</span> <span style="color: #339933;">=&gt;</span> LeafNode<span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span>
        <span style="color: #000000; font-weight: bold;">case</span> LeafNode<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">&gt;</span> data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          LeftNode<span style="color: #009900;">&#40;</span>data, LeafNode<span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
          RightNode<span style="color: #009900;">&#40;</span>data, LeafNode<span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">case</span> LeftNode<span style="color: #009900;">&#40;</span>data, left<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">&gt;</span> data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          LeftNode<span style="color: #009900;">&#40;</span>value, LeftNode<span style="color: #009900;">&#40;</span>data, left<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
          FullNode<span style="color: #009900;">&#40;</span>data, left, LeafNode<span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">case</span> RightNode<span style="color: #009900;">&#40;</span>data, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">&gt;</span> data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          FullNode<span style="color: #009900;">&#40;</span>data, LeafNode<span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span>, right<span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
          RightNode<span style="color: #009900;">&#40;</span>value, RightNode<span style="color: #009900;">&#40;</span>data, right<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">case</span> FullNode<span style="color: #009900;">&#40;</span>data, left, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">&gt;</span> data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          FullNode<span style="color: #009900;">&#40;</span>data, insert<span style="color: #009900;">&#40;</span>left, value<span style="color: #009900;">&#41;</span>, right<span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
          FullNode<span style="color: #009900;">&#40;</span>data, left, insert<span style="color: #009900;">&#40;</span>right, value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    var tree<span style="color: #339933;">:</span> Node <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>item <span style="color: #339933;">&lt;-</span> A<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      tree <span style="color: #339933;">=</span> insert<span style="color: #009900;">&#40;</span>tree, item<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">return</span> tree
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
  val f <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>A<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  words.<span style="color: #006633;">foreach</span><span style="color: #009900;">&#40;</span>f<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  var x <span style="color: #339933;">=</span> construct<span style="color: #009900;">&#40;</span>words<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  def recurseNode<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> Node, depth<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    def display<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, depth<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">&lt;-</span> <span style="color: #cc66cc;">1</span> to depth <span style="color: #339933;">*</span> <span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">print</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;-&quot;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#125;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    A match <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000066; font-weight: bold;">null</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;[]&quot;</span>, depth<span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> LeafNode<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">null</span>, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">null</span>, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> FullNode<span style="color: #009900;">&#40;</span>data, left, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span>left, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span>right, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> RightNode<span style="color: #009900;">&#40;</span>data, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">null</span>, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span>right, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> LeftNode<span style="color: #009900;">&#40;</span>data, left<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span>left, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        recurseNode<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">null</span>, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def output<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> Node, recurse<span style="color: #339933;">:</span> <span style="color: #009900;">&#40;</span>Node, Int<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> Unit<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    recurse<span style="color: #009900;">&#40;</span>A, <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def renderTree<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> Node<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    output<span style="color: #009900;">&#40;</span>x, recurseNode<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  renderTree<span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  def sortedRender<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> Node, depth<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    def display<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">:</span> <span style="color: #003399;">String</span>, depth<span style="color: #339933;">:</span> Int<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    A match <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">case</span> <span style="color: #000066; font-weight: bold;">null</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> LeafNode<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> FullNode<span style="color: #009900;">&#40;</span>data, left, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        sortedRender<span style="color: #009900;">&#40;</span>left, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        sortedRender<span style="color: #009900;">&#40;</span>right, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> RightNode<span style="color: #009900;">&#40;</span>data, right<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        sortedRender<span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">null</span>, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
        sortedRender<span style="color: #009900;">&#40;</span>right, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #000000; font-weight: bold;">case</span> LeftNode<span style="color: #009900;">&#40;</span>data, left<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span>
        sortedRender<span style="color: #009900;">&#40;</span>left, depth <span style="color: #339933;">+</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>
        display<span style="color: #009900;">&#40;</span>data, depth<span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  def renderTreeSorted<span style="color: #009900;">&#40;</span>A<span style="color: #339933;">:</span> Node<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
    output<span style="color: #009900;">&#40;</span>x, sortedRender<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #009900;">&#41;</span>
  renderTreeSorted<span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://garysieling.com/blog/scala-tree-sort-example/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>