Weka is an open-source data-mining tool written in Java that provides a host of data-mining algorithms. I am using it to build a proof-of-concept model that classifies auctions by their value: fraudulent listing, zero-valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, an auction site for websites and online businesses, to support data-mining experiments, and in particular to see how difficult it would be to detect spam or fraudulent auctions.
An ideal classifier would identify which listings are overpriced, underpriced, or worthless due to fraud or puffery. I suspect that many auctions fit into the zero-valued and fraudulent categories. Browsing the listings, one sees domains with nothing more than WordPress and an installed template and no true potential, trademark infringement, unfixable copyright infringement (for example, 10,000 articles about movies copied from IMDB), and so on. There are also high-risk assets with potential, such as discarded startups. Some sites have high traffic that is declining, either because of Google algorithm changes or because they are part of the MySpace ecosystem.
In previous experiments, I noted that template-based auctions can be detected programmatically, in unexpected ways. While this does not reveal fraud, it does identify sellers who build sites solely to sell on Flippa. The data set contained attributes for whether the seller has been banned, what advertisers are referenced in the auction, description length, and number of header tags used in the description. Generally these are meant to determine if an auction’s text is built from a template.
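As a rough illustration of how attributes like these could be derived, here is a minimal sketch. It assumes the auction description is available as an HTML string; the function name, the feature names, and the advertiser names ("AdSense", "Chitika") are placeholders I made up for the example, not the ones used in the real data set.

```python
from html.parser import HTMLParser


class HeaderCounter(HTMLParser):
    """Counts <h1>..<h6> opening tags in an HTML fragment."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.HEADER_TAGS:
            self.count += 1


def template_features(description_html, advertisers):
    """Return template-detection features for one auction description."""
    counter = HeaderCounter()
    counter.feed(description_html)
    counter.close()
    text = description_html.lower()
    features = {
        "description_length": len(description_html),
        "header_tag_count": counter.count,
    }
    # One boolean per advertiser: is its name mentioned anywhere in the text?
    for name in advertisers:
        features["mentions_" + name.lower()] = name.lower() in text
    return features


# Hypothetical description and advertiser list, for illustration only.
example = template_features(
    "<h2>Revenue</h2><p>Earns with AdSense.</p><H3>Traffic</H3>",
    ["AdSense", "Chitika"],
)
print(example)
```

A plain substring match is the crudest possible way to detect an advertiser mention; it is only meant to show the shape of the feature vector.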
Only a small portion of sellers are banned. A naive algorithm to predict whether an auction is from a banned seller would assume that all auctions are good, which achieves about 96% accuracy. Many algorithms in Weka fall back to this behavior if the data is otherwise inconclusive. This is a terrible algorithm from a buyer’s point of view: it is better to err on the side of false positives to discourage risky purchases.
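The naive algorithm is easy to reproduce outside Weka, which makes it a useful sanity check. The sketch below always predicts the most common label and reports the resulting accuracy, using the class counts from the results reported in this post. Weka ships a classifier of this kind under the name ZeroR; the function name here is my own.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Class counts taken from the results reported in this post.
labels = ["not banned"] * 48943 + ["banned"] * 2048
label, accuracy = majority_baseline_accuracy(labels)
print(f"always predict {label!r}: {accuracy:.4%}")
```

Any classifier worth keeping has to beat this number, or at least make more useful mistakes.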
In this test, each auction has 129 attributes as detailed above. There is a boolean attribute for each advertising company that may be mentioned in a listing. I generated Weka’s ARFF file directly from Postgres. The best performing algorithm was BayesNet.
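For readers who have not seen ARFF, the following sketch shows roughly what such a file looks like. It is not the export I used: the real file has 129 attributes and came out of Postgres, whereas this example takes a few in-memory rows, and the relation name, attribute names, and advertiser names are all assumptions made for illustration.

```python
def to_arff(relation, advertisers, rows):
    """Render rows as ARFF text.

    Each row is a tuple of (description_length, header_tag_count,
    set_of_mentioned_advertisers, seller_banned).
    """
    lines = [f"@relation {relation}", ""]
    lines.append("@attribute description_length numeric")
    lines.append("@attribute header_tag_count numeric")
    # One nominal true/false attribute per advertiser.
    for name in advertisers:
        lines.append(f"@attribute mentions_{name} {{true,false}}")
    # The class attribute goes last.
    lines.append("@attribute seller_banned {true,false}")
    lines.append("")
    lines.append("@data")
    for length, headers, mentioned, banned in rows:
        values = [str(length), str(headers)]
        values += ["true" if name in mentioned else "false" for name in advertisers]
        values.append("true" if banned else "false")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Hypothetical rows, for illustration only.
arff_text = to_arff(
    "auctions",
    ["adsense", "chitika"],
    [(1200, 3, {"adsense"}, False), (450, 7, set(), True)],
)
print(arff_text)
```

Declaring the booleans as nominal {true,false} attributes rather than as 0/1 numerics keeps them categorical when the file is loaded.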
This is the result of the naive algorithm, which assumes everything is ok:
Correctly Classified Instances: 48943 95.9836 %
Incorrectly Classified Instances: 2048 4.0164 %
This is the result of BayesNet, which is, technically, a worse result:
Correctly Classified Instances: 48853 95.8071 %
Incorrectly Classified Instances: 2138 4.1929 %
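However, comparing the confusion matrices shows where the difference comes from. The naive algorithm never flags a banned seller, so all 2048 of its errors are false negatives. BayesNet adds 146 false positives but correctly flags 56 banned sellers, so the extra errors are all on the cautious side. Not only that, the number of false positives is small relative to the 48,943 auctions from sellers who were not banned. A small victory, certainly, and one that shows more than anything how much work is needed to tune these algorithms.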
=== Confusion Matrix, naive ===
classified as:   not banned   banned
                      48943        0   (actual: not banned)
                       2048        0   (actual: banned)
=== Confusion Matrix, Bayes Net ===
classified as:   not banned   banned
                      48797      146   (actual: not banned)
                       1992       56   (actual: banned)
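Accuracy alone hides what these matrices show, so it helps to compute precision and recall as well, treating "banned" as the positive class. The sketch below does that from the four cells of each matrix; the function name and the dictionary keys are my own choices, not Weka output.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from a 2x2 confusion matrix.

    "banned" is the positive class. Precision and recall are None
    when their denominator is zero.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Cell values copied from the two confusion matrices above.
naive = metrics(tn=48943, fp=0, fn=2048, tp=0)
bayes_net = metrics(tn=48797, fp=146, fn=1992, tp=56)

for name, result in (("naive", naive), ("BayesNet", bayes_net)):
    print(name, result)
```

The naive algorithm has no precision to speak of, because it never flags anything, and its recall is zero. BayesNet flags 202 auctions, of which 56 really are from banned sellers (a precision of about 28%), and it catches under 3% of the banned sellers overall.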
To improve this in the future, I will augment the source data with new fields that may indicate problems, such as the ratio of price to traffic. Whether a seller has been banned is a crude way of identifying low-quality listings, as the ban may relate to only one listing. I will also look at whether sites continue to exist after the sale, whether they use trademarked content, and whether they are re-sold and hold their value.