I’m running some tests on sklearn decision trees, and the lessons learned so far may be interesting.
I’ve put my measurement code at the end; it tracks the percentage of correct predictions, along with counts of positive, negative, false-positive, and false-negative results.
- When running predictions, if you have a defect where you accidentally include the ‘answer’ column among the test columns, the tree predicts perfectly, so the false-positive/false-negative percentages in the measurement code below divide by zero, which turns out to be a useful sanity check.
- For my data, running with criterion=’entropy’ gives about a 2% increase in accuracy, but other people I talked to on Twitter have seen the opposite.
- criterion=’entropy’ is noticeably slower than the default (‘gini’).
- The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points)
- For my use case, I found that limiting the depth of the trees and forcing each node to contain a large number of samples (50-500) produced much simpler trees with only a small decrease in accuracy (see the parameter sketch after this list).
- Forcing nodes to contain more samples decreased accuracy by roughly 0-5%, scaling with the minimum sample count required at each node (50-500).
- I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, I originally had ID columns, which let sklearn pick out data created in a certain time window (since the IDs are sequential), but that isn’t useful for what I’m trying to do.
- You have to turn class-based (categorical) attribute values into integers; sklearn appears to use a numpy float type internally for performance reasons (see the encoding sketch after this list).
- sklearn appears to only use range-based (threshold) rules. Combine this with the above and you get a lot of rules like “status > 1.5”.
- The tree could conceivably encode equality conditions within its structure, although it’d be hard to spot (e.g. “status > 1.5” followed by “status < 2.5” is equivalent to “status = 2” if status is an integer).
- I’m more interested in discovering useful rules than in making future predictions; for that, it helps a lot to generate JSON from the tree (see the export sketch after this list).
- Within the JSON, the “entropy”/“impurity” fields show you how clean the rule is (0 = good). The “value” field shows how many items fit the rule (small counts are probably not useful, at least for me).
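For reference, the criterion, depth, and sample-size settings mentioned above are all constructor arguments on DecisionTreeClassifier. A minimal sketch, assuming training arrays named train and train_v (parallel to the test/test_v used in the measurement code at the end); the specific values are illustrative, not the ones I settled on:

from sklearn import tree

# criterion='entropy' instead of the default 'gini'; max_depth and
# min_samples_leaf rein in the very deep default trees (values illustrative)
clf = tree.DecisionTreeClassifier(criterion='entropy',
                                  max_depth=10,
                                  min_samples_leaf=50)
clf = clf.fit(train, train_v)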
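For the integer conversion, sklearn’s LabelEncoder is one option (a plain dict mapping works just as well); the ‘status’ values here are made up for illustration:

from sklearn import preprocessing

# map string categories onto integers so the tree can split on them
# with threshold rules ("status > 1.5" and so on)
le = preprocessing.LabelEncoder()
status_codes = le.fit_transform(['open', 'closed', 'open', 'pending'])
# le.classes_ records which integer ended up standing for which category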
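sklearn doesn’t ship a JSON exporter (export_graphviz is the built-in option), so the JSON generation is hand-rolled; a rough sketch of the idea, walking the fitted clf.tree_ arrays and assuming feature_names is the list of column names used for training:

import json

def node_to_dict(t, i, feature_names):
    # leaf nodes are marked with a child index of -1 in sklearn's tree arrays
    if t.children_left[i] == -1:
        return {"impurity": float(t.impurity[i]),
                "value": t.value[i].tolist()}
    # internal node: samples where the rule holds go left, the rest go right
    return {"rule": "{0} <= {1}".format(feature_names[t.feature[i]],
                                        t.threshold[i]),
            "impurity": float(t.impurity[i]),
            "value": t.value[i].tolist(),
            "left": node_to_dict(t, t.children_left[i], feature_names),
            "right": node_to_dict(t, t.children_right[i], feature_names)}

print(json.dumps(node_to_dict(clf.tree_, 0, feature_names), indent=2))

And finally, the measurement code itself: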
# Per-test counters: overall pass rate, predicted positives/negatives,
# and false positives/negatives (against the known answers in test_v)
testsRun = 0
testsPassed = 0
testsFalseNegative = 0
testsFalsePositive = 0
testsPositive = 0
testsNegative = 0
for t in test:
    # newer sklearn versions expect a 2D array here: clf.predict([t])[0]
    prediction = clf.predict(t)[0]
    if prediction == 0:
        testsNegative = testsNegative + 1
    else:
        testsPositive = testsPositive + 1
    if prediction == test_v[testsRun]:
        testsPassed = testsPassed + 1
    else:
        if prediction == 0:
            testsFalseNegative = testsFalseNegative + 1
        else:
            testsFalsePositive = testsFalsePositive + 1
    testsRun = testsRun + 1
print "Percent Pass: {0}".format(100.0 * testsPassed / testsRun)
print "Percent Positive: {0}".format(100.0 * testsPositive / testsRun)
print "Percent Negative: {0}".format(100.0 * testsNegative / testsRun)
# these two divide by zero if there are no misclassifications at all;
# that's the 'answer column' sanity check mentioned above
print "Percent False positive: {0}".format(100.0 * testsFalsePositive / (testsFalsePositive + testsFalseNegative))
print "Percent False negative: {0}".format(100.0 * testsFalseNegative / (testsFalsePositive + testsFalseNegative))