Gary's Guide To Postgres

Please find below some of the most popular articles I've written on Gary's Guide To Postgres.

Table of Contents

1/3 of old Flippa website auctions point to abandoned sites

A brief introduction to Weka

Building a Website Scraper using Chrome and Node.js

Data Warehousing, NoSQL, and the Cloud

Detecting auction spam with Weka

Diagnosing Connection Leaks in Node.js and Postgres

Generating ARFF files for Weka from Postgres

Using Prolog to Generate Test Data



1/3 of old Flippa website auctions point to abandoned sites

Flippa is an auction site for buying and selling websites as businesses. Browsing the listings turns up many low-quality products. Careful inspection often reveals interesting, high-quality listings, but they are lost in the noise. Occasionally there are successful e-commerce sites, unmaintained high-traffic developer forums, or fire sales on start-ups. Often these are educational, but [...] Read More...

A brief introduction to Weka

Weka is a GPL-licensed data mining tool written in Java, published by the University of Waikato. It includes an extensive set of pre-implemented machine learning algorithms, including well-known classification and clustering algorithms. If you’ve ever been curious how Bayes’ theorem works, this is a great tool for getting up and running. Weka uses a [...] Read More...

Building a Website Scraper using Chrome and Node.js

A couple of months back, I did a proof of concept to build a scraper entirely in JavaScript, using WebKit (Chrome) as a parser and front-end. Having investigated seemingly expensive SaaS scraping software, I wanted to tease out what the challenges are, and open the door to some interesting projects. I have some background in [...] Read More...

Data Warehousing, NoSQL, and the Cloud

With the advent of NoSQL, cloud computing, and slick new databases, we seem to have forgotten whence we came. I recently went to a conference on the open source search product Solr/Lucene. One of the keynote speakers, the Chief Data Scientist of Hortonworks, discussed what turned him to NoSQL databases, in this case, a [...] Read More...

Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java, providing a host of data mining algorithms. I am using it to build a proof-of-concept model that can classify auctions based on their value: fraudulent listing, zero-valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate [...] Read More...

Diagnosing Connection Leaks in Node.js and Postgres

In building a website scraper with Chrome and Node.js, I made mistakes that led to connection leaks. In this application, the scraper runs in a browser and connects to a Node.js server, which saves data to a database. Once you know what the issues look like, they are easy to see, but otherwise often difficult [...] Read More...

Generating ARFF files for Weka from Postgres

Since all my scraped data is in Postgres, this is the easiest way to get it out – the fastest iteration possible. At some point I’ll probably switch to a Java library. It’s interesting to see, but probably the only lesson from this is that all ETL scripts are ugly. WITH advertisers_ranked AS ( SELECT [...] Read More...

Using Prolog to Generate Test Data

I’ve built several reporting systems where work was divided evenly between a charting UI and database scripts – an ETL job, report SQL, and database schema. It’s nice to divide work between UI and database developers to take advantage of specialization, but not having data is always an issue for the first week or two [...] Read More...