Entries from June 2012 ↓

A brief introduction to Weka

Weka is a GPL data mining tool written in Java, published by the University of Waikato. It includes an extensive series of pre-implemented machine learning algorithms, including well known classification and clustering algorithms. If you’ve ever been curious how Bayes Theorem works, this is a great tool to get up and running.

Weka uses a custom data format, called ARFF files (Attribute Relation File Format). This, in essence, specifies a table of data, along with a CSV style data listing. The data types are scaled down from a type of database: numerics, strings, dates, and nominal attributes (i.e. an equivalent to enumeration or pick list).

You can connect to any database with a JDBC connection string, provided the appropriate jars are on the classpath. Weka ships with a file called DatabaseUtils.props, which maps database types to the Weka types listed above.

Once you get some data in, you can try out different algorithms (and see how difficult it is to build predictive systems!)

 

Expert Search Statistics

The following are some interesting statistics about the Github expert-finder.

Unique repositories: 18,977

Source git repos (GB): 250+ GB

Solr Index Size: 3.2 GB

Time to build index: ~12 hours spread over several days (had to restart indexer several times)

Number of commits:  4,579,236

 

Advertisers used by banned sellers in Flippa auctions

In a previous post, I listed the top Flippa advertisers, gained through the node.js web scraper. Which advertisers are mentioned most often in auctions by banned sellers?

Number of domains listed by banned sellers Total number of domains Advertiser % Banned
140530044adsense5%
109121567amazon5%
108221016clickbank5%
37013454unknown3%
3554315commission junction8%
1641410infolinks12%

As you can see, there is a big drop in the “unknown” category, and a big increase in banned accounts associated with Infolinks and CJ. After visual inspection, the following factors seem to be at play:

  • Auctions without much information listed are likely to end unsold
  • In a small number of cases, advertisers banned from adsense or other networks may move on to other, smaller networks, but it is nearly impossible to measure this
  • Infolinks is commonly used in template style listings, as a potential revenue increase for the new site owner – presumably this means it isn’t, or sellers would use it to increase the value of their listings. This is a useful finding, as one factor that clearly identifies over-hyped listings.

Converting git commit history to a solr full-text index

I built a 4 million document archive from Github commits, which lets you search for open source experts, ranked by commit count. Click here to try the demo.

Solr is a relatively recent addition to the world of Lucene (2007); it adds a web-app UI over lucene, scaling (highly available reads), and configuration. For those unfamiliar, full-text indexing products build databases of words used in documents, allowing fast searching of text within a document. They handle language features such as synonyms (ran/run) and stemmed words (sear, seared, sears). Unlike typical database indices, they perform very well for finding similar words in close proximity.

I took solr training at a recent conference, Lucene Revolution, which is sponsored by Lucid Imagination, a solr consulting company. I decided to test it out in a few small projects, having previously done a FAST proof-of-concept. A colleague and I brainstormed ideas, and I wrote ETL code to convert several company git repositories into full text indexes. This could easily be expanded to link relevant JIRA tickets, Sharepoint documents, or other source control systems – in fact, one of the repositories I used for testing was a converted CVS repository.

Interest is growing in Solr, at least in part because a competitor, FAST, was acquired by Microsoft, who currently use it in Sharepoint. A solr index is conceptually similar to a single database table, and thus could also be used as a key-value store. Each column has detailed configuration, linking to various Java classes, so you can control whether data is hashed, compressed, or stored. Configuration files also control languages features such as stemming, heuristics, and synonyms. Each feature is stored in separate sets of files in the index, so you can easily figure out which features to turn off to save space.

I settled on JGit to read git repositories, after testing a couple Java-based libraries. JGit is a re-implementation of git in Java, and differs from other libraries which wrap the command line interface. It is also the basis for the Eclipse plugin EGit. EGit does not handle some line ending settings correctly, which concerned me, but it seems to work fine for this project. Reading repositories is by far the slowest part of this script. It only reads what you ask for— it is much faster to read just the commit history, without file diffs.

Solr provides a simple Java interface called solrj, which lets you push a list of rows (“documents”) to the index. Because of the single-table structure, it is quite common to have denormalized data, as well as data repeating in multiple columns. For example, you might concatenate all the fields you wish to search on, but store only indexed values so they can never be viewed. However, they can referenced through individual fields.

One of the important design considerations is what to use for a unique ID; each document needs an ID for later updates or deletions. When you push an existing document to a solr repository, the document is deleted, and the replacement is added (a soft delete). If you change the schema, everything is deleted. I used the commit ID, because I only indexed commit metadata; however, if I added entries for each file in the diff, I would need to use a different value for the ID. Database style sequence-based IDs would be a poor choice because it would make updates very difficult.

Some transformations are best done in Java, such as entity name normalization or extracting data from external systems. Because I combined several git repositories into one full-text index, I needed to normalize author names (“gsieling” and “Gary Sieling”). Intellij adds a line to commits identifying itself, so I added an artificial field identifying the commit author’s IDE of choice.

In a very crude and unfair comparison, the unoptimized Solr index is 90 MB of commits versus 2,000 MB of git history. The trade-offs that make Solr indexes small also make it fast, even without any kind of scaling.

This is post the first of many—in the future I will present interesting practical uses of this project, as well as options for parsing code. You can see the full source of this project on Github here.

Diagnosing Connection Leaks in Node.js and Postgres

In building a website scraper with Chrome and Node.js, I made mistakes that led to connection leaks. In this application, the scraper runs in a browser and connects to a node.js server, which saves data off to a database.

Once you know what the issues look like, they are easy to see, but otherwise often difficult to diagnose without code inspection. The database driver, nodepg, provides two modes for connection management. You either use built-in connection pooling, or manage connections on your own. As with connection management in any language,  the distinction can be subtle, even when the result is drastically different. If you use the nodepg callback API, the connection is naturally closed on its own:

var pg = require('pg'); 
var conString = "tcp://postgres:1234@localhost/postgres";
pg.connect(conString, function(err, client) {
client.query("SELECT NOW() as when", function(err, result) {
}); });

On the other hand, if you construct a connection object on your own, you must manage closing it:

var client = new pg.Client(conString);
client.connect();
var query = client.query("...");
// must keep track of client and close it...
query.on('end', function() { client.end(); });

If you refactor code between the methods, you must take care to manage the process carefully. The key to finding the issue is to eliminate servers from the mix. There are numerous limited resources that can run out in this environment – e.g. database connections, net TCP connections, TCP connections per process, memory, or per-process open files. Some require OS configuration, while others are triggered by application defects.

To narrow the issue, I used Apache Bench (ab), installed through XAMPP, to eliminate the front-end (browser/parser). It’s a little tricky to send post requests correctly with ab, so I proxied test requests through Fiddler, which is a little tricky it itself. The biggest win was installing pgbouncer on the NodeJS server. On it’s own, this eliminates most of the memory issues (even with connection leaks!), in spite nodepg having it’s own connection pool. The real value though, is the log message that occurs when instantiating a connection fails – you can clearly see whether the failure was connection to pgbouncer, or connection out from pgbouncer to the database.

I wish the API tracked where connections were created, then logged unreleased sessions when the request ended. This is fairly easy to set up in Java/Tomcat in a finalizer. In theory, the web server should be able to clear memory easily after a request finishes. This is likely a significant architectural downside to a single threaded web server like NodeJS. On the other hand, it forces you to fix all the issues, which may be better long term.

To detect this type of issue, my script needs to run for a while. My application typically died after parsing about 5,000 pages. After fixing the issue, the application makes it to at least 15,000 pages, and uses about 80% less memory – it switches from being memory to CPU bound.

Proxying Apache Bench through Fiddler

The directions as follows work for a local VM environment (VirtualBox/Vagrant), on a Windows host. Apache Bench is installed from XAMPP. If pointing ab directly at the VM, use the machine IP address (127.0.0.1 is fine). If proxying through Fiddler, it must use localhost, or the requests do not display in Fiddler. You must also select “Allow Remote Computers to Connect” in the Tools->Fiddler Options menu, under “Connections.”

POST request without Fiddler:

ab -n 1 -p post.txt -T application/x-www-form-urlencoded http://127.0.0.1:8080/index.js

POST request with Fiddler (you can also use the machine IP or hostname here):

ab -n 1 -X localhost:8888 -p post.txt -T application/x-www-form-urlencoded http://localhost:8080/index.js

Also note that complicated URLs may need to be quoted on the command line:

ab -n 100000 "http://127.0.0.1:8080/index.js?mode=data&fn=set_data&l=1&a0=%5B%22lozanotek%20/%20Viewfinder%22%2C%22search_langOverride%3D%26language%3D%26q%3Dfork_false%26repo%3D%26start_value%3D10659%26type%3DRepositories%22%2C%22Simple%20filter%20and%20classes%20that%20allow%20you%20to%20find%20views%20depending%20on%20the%20context%20for%20the%20request%22%2C%22%28JavaScript%29%22%5D"

Advertisers referenced in Flippa auctions

The following data show revenue generating partners most commonly mentioned in Flippa auctions. This data is from approximately 76,000 auctions representing over 58,000 domains – note that in many cases more than one advertiser is mentioned per site. This is a broad section of revenue generation strategies, from CPC, CPA, affiliate sales, link sales, etc. While the vast majority use the most well-known partners (Adsense, Amazon, CJ, Clickbank), there is a long tail of various types.

dyerware.com

AdvertiserNumber of Auctions
Adsense35286
Amazon25426
Clickbank24670
None15745
Commission Junction5100
Infolinks1709
Cpalead1614
Adbrite1543
ShareASale1100
Linkshare1059
Chitika899
text-link-ads391
MaxBounty365
Bidvertiser354
BuySellAds318
Smowtion260
Clicksor213
Neverblue207
Kontera185
Copeac160
Yahoo Publisher Network155
Inlinks112
Adhitz95
Pepperjam82
tnx66
Azoogle65
Adfly56
ValueClick Media/Fastclick54
Linkworth49
Leadbolt38
Tribal Fusion37
Clickbooth32
Rightmedia.com30
Burst Media21
Technorati Media17
Casale Media16
Skimlinks15
Vibrant Media14
Media.net14
Infinityads13
AdEngage12
Advertising.com12
Convert2media11
7search11
Adknowledge7
Revenueads7
Intelllinks6
Bannerconnect6
Adversal6
Pinball Publisher Network5
Inuvo5
Ads4dough5
Revresponse4
Pocketcents4
Searchfeed4
CX digital media4
Federated Media3
Pulse3603
Revenue Pilot3
Dog Time Media3
Affiliateer3
Blam Ads2
Surgedirect1
Dynamic Oxygen/exitjunction1
Ad Pepper Media1
Amped Media1
MaxOnline1
Adtegrity1
Adstract1
Intentclick1