In one of my earlier pieces I explored decision trees in python, which let you train a machine learning algorithm to predict or classify data.
I like this style of model because the model itself is valuable; I’m more interested in finding underlying patterns than in attempting to predict the future. Decision trees are nice if you want to predict one particular feature; however, they aren’t as good for exploratory analysis, and they force-fit a hierarchy onto the data.
Association rules are an alternative model with a somewhat similar character (they also produce probabilistic logic models), but they do not focus on a specific attribute. This is more the style of what Amazon does when it says “people who bought this also bought that” – it doesn’t matter which item you start or end with. We have to use R instead of python – the only mention of association rules in scikit-learn was a discussion of dropping them from the product to allow a timely release, so maybe they will be available there at some point in the future.
To construct this, I’ve built database views that contain only the values I want included in my model – this makes it relatively easy to pick and choose which features end up in the resulting model (if you find yourself running the code below a lot, it makes sense to materialize the view – there’s a sketch of that after the connection code below).
This uses the RPostgreSQL library; nothing special here.1
library("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
host="dbhost",
user="postgres",
password="postgres",
port="5432",
dbname="dbname")
rs <- dbSendQuery(con,"select * from schema.my_view")
results <- fetch(rs,n=-1)
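Since I mentioned materializing the view above, here's a minimal sketch of doing that over the same connection – the materialized view name is just a placeholder:

# one-time: persist the view's result set so repeated runs don't re-execute the query
dbSendQuery(con, "create materialized view schema.my_view_mat as select * from schema.my_view")
# after the underlying tables change, refresh it
dbSendQuery(con, "refresh materialized view schema.my_view_mat")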
After a slight lag you'll have all the data loaded into memory (hopefully you're doing "small" data 🙂 ) - if not, there are other R libraries that claim to help (I'm in the ~100k row range).
One of the nice things about association rules is that they are designed to work with categorical data, so if you bring down numeric or date attributes, you need to bin them. This is a nice contrast to decision trees in python, where you have to force-fit all categorical attributes into boolean features (e.g. a feature might be "country_is_USA", "country_is_Canada", etc., with a numeric value of 0 or 1).
You can see here that we need to do very little to transform the data - while the arules paper2 has binning examples, I don't have any attributes that require that treatment.
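For reference, if you did have a numeric column, binning it into a factor is a one-liner with base R's cut() - the column name and break points here are hypothetical:

# bin a hypothetical numeric "age" column into three labelled ranges
results$age_band <- cut(results$age,
                        breaks = c(0, 30, 60, Inf),
                        labels = c("young", "middle", "older"))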
# association rules want categorical data, so coerce every column to a factor
for (column in names(results)) {
  results[[column]] <- as.factor(results[[column]])
}
lapply(results, class)   # sanity check: every column should now report "factor"

# the coercion to "transactions" is defined by arules, so load it before converting
library("arules")
results_matrix <- as(results, "transactions")
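Before mining anything, it's worth a quick look at the transactions object to confirm the conversion did what you expect:

# transaction sizes, most frequent items, density, etc.
summary(results_matrix)
# the 20 most common items (i.e. column=value pairs)
head(sort(itemFrequency(results_matrix), decreasing = TRUE), 20)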
Here is the key step - we call apriori from the arules library with a handful of arguments. "support" is the fraction of rows that contain every item in a rule, and "confidence" is the fraction of rows matching the left-hand side that also match the right-hand side. The two most important arguments are "minlen" and "maxlen", which control how many conditions you'd like in your rules. If you pick "big" numbers (the default maximum is 10) this will create astronomical numbers of rules, as each additional rule length increases the rule count by orders of magnitude. Not only will this cause you to run out of RAM, it counter-intuitively fails after it claims to have generated all the rules, while it is writing the results to some unnamed file.
library("arules")
rules <- apriori(results_matrix,
parameter = list(support = 0.1, confidence = 0.6, maxlen = 5, minlen=2))
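When the rule set is small enough to hold in memory (it often won't be, as described below), the usual way to eyeball the output is to sort by a quality measure and inspect the top handful:

# the ten rules with the highest lift (lift above 1 means the rule beats chance)
top_rules <- sort(rules, by = "lift")
inspect(top_rules[1:10])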
If you want to actually see these rules, I found I was forced to write them out to a file, because a) it's nearly impossible to generate fewer than tens or hundreds of thousands of rules, and b) nearly every operation on them will run out of memory before it completes.
One example of this problem is the built-in "save to file" methods, which, oddly, convert the entire structure to a string before writing it to a file. I find it baffling that the R community can get these complex algorithms to work, but then fails to handle "large" file I/O gracefully. After messing around with various API methods from both arules and the R core, I ended up writing the rules to a file myself in batches:
# write the rules out in batches, since converting them all at once exhausts memory
batchSize <- 10000
maxSize <- length(rules)
i <- 1
while (i <= maxSize) {
  print(i)                                # crude progress indicator
  j <- min(i + batchSize - 1, maxSize)    # last rule in this batch, without overlapping the next
  write(rules[i:j],
        file = "D:\\projects\\tree\\apriori-large.csv",
        quote = FALSE,
        sep = ",",
        append = (i > 1),                 # only the first batch starts a fresh file
        col.names = FALSE)
  i <- i + batchSize
}
What I found is that in addition to giving me rules which predict certain behaviors, this also uncovered hidden business rules in my data which I hadn't appreciated (there are also some obvious ones: attributes which are filled in by a common subsystem are filled in together - more interesting here would be the exception cases).
As one final consideration, it is relatively easy to make a dataframe that masks the original values with booleans, i.e. "this row has a value or doesn't have a value", which in some cases is more interesting than patterns in the actual data (especially when the original data is unique IDs). These also train super-fast. Note here that I'm using several different representations of null, since these all appear in my data.
# values that all mean "empty" in this data (c() coerces them all to strings)
targets <- c(0, " ", "", NA, "-1")
# TRUE wherever a cell is one of the empty markers, column by column
mask <- apply(results, 2, "%in%", targets)
rules <- apriori(mask,
                 parameter = list(support = 0.2,
                                  confidence = 0.7,
                                  maxlen = 5,
                                  minlen = 2))
The nice thing about association rules is that in addition to predicting outcomes, we can use them to explore concepts in the data (although without much depth), and they don't force the data into a hierarchy. One problem remains, though, which is filtering the rules in an appropriate way - since these resemble what Prolog does, it might be possible to use Prolog to build concept trees from the data. Unfortunately I've yet to get SWI-Prolog to load my ruleset without running out of RAM - there is also a probabilistic Prolog, which looks promising.
It's worth noting that while you can filter these rules based on the confidence of the rule, that isn't actually very helpful: your application may force two attributes to be filled out at the same time, which gives them a high correlation, but not a useful one. Association rules will also give you strange variations on this: if attributes A1 and A2 are always filled in at the same time, you get a lot of rules like {A1} => {A2} and {A2} => {A1} (expected), but then also {A1, B} => {A2}, {A2, B} => {A1}, and so on.
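Reasonably recent versions of arules do offer some help with that last problem: is.redundant() flags rules whose shorter, more general version is at least as confident, which knocks out most of the {A1, B} => {A2} style variants (it can't know that two fields are only correlated because your application fills them in together, though). A minimal sketch:

# drop rules that add items to the left-hand side without improving confidence
pruned <- rules[!is.redundant(rules)]
# then keep only rules that beat chance by a reasonable margin
interesting <- subset(pruned, subset = lift > 1.5)
length(interesting)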
The arules library has a fair number of options, which I will explore in future posts. I also intend to approach this from a different angle, to find outlier data (sort of the opposite of what we're doing here).