I’m running some tests on sklearn decision trees, and the lessons learned so far may be interesting.
I’ve put my measurement code at the end; it tracks the percentage of correct predictions, along with counts of positive, negative, false-positive, and false-negative results.
- When running predictions, if you have a defect where you accidentally include the ‘answer’ column among the test columns, the tree predicts perfectly, so the false-positive/false-negative percentages in the measurement code below divide by zero, which turns out to be a useful sanity check.
- For my data, running with criterion=’entropy’ gives about a 2% increase in accuracy, but other people I talked to on Twitter have seen the opposite.
- criterion=’entropy’ is noticeably slower than the default (‘gini’).
- The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points)
- For my use case, I found that limiting the depth of the trees and forcing each node to contain a large number of samples (50-500) produced much simpler trees with only a small decrease in accuracy (see the parameter sketch after this list).
- Forcing nodes to contain more samples decreased accuracy by roughly 0-5%, scaling with the minimum sample count required at each node (50-500).
- I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, I originally had ID columns, which let sklearn pick out data created in a certain time window (since the IDs are sequential), but that isn’t useful for what I’m trying to do.
- You have to turn class-based (categorical) attribute values into integers; sklearn appears to use a numpy float type internally for performance reasons (see the encoding sketch after this list).
- sklearn appears to only use range-based (threshold) rules. Combine this with the above and you get a lot of rules like “status > 1.5”.
- The tree could conceivably encode equality conditions within its structure, although it’d be hard to spot (e.g. “status > 1.5” followed by “status < 2.5” is equivalent to “status = 2” if status is an integer).
- I’m more interested in discovering useful rules than in making future predictions; for that, it helps a lot to generate JSON from the tree (see the export sketch after this list).
- Within the JSON, the “entropy”/“impurity” fields show you how clean the rule is (0 = good). The “value” field shows how many items fit the rule (small counts are probably not useful, at least for me).
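For reference, the criterion, depth, and sample-size settings mentioned above are all constructor arguments on DecisionTreeClassifier. A minimal sketch, assuming training arrays named train and train_v (parallel to the test/test_v used in the measurement code at the end); the specific values are illustrative, not the ones I settled on:

from sklearn import tree

# criterion='entropy' instead of the default 'gini'; max_depth and
# min_samples_leaf rein in the very deep default trees (values illustrative)
clf = tree.DecisionTreeClassifier(criterion='entropy',
                                  max_depth=10,
                                  min_samples_leaf=50)
clf = clf.fit(train, train_v)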
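For the integer conversion, sklearn’s LabelEncoder is one option (a plain dict mapping works just as well); the ‘status’ values here are made up for illustration:

from sklearn import preprocessing

# map string categories onto integers so the tree can split on them
# with threshold rules ("status > 1.5" and so on)
le = preprocessing.LabelEncoder()
status_codes = le.fit_transform(['open', 'closed', 'open', 'pending'])
# le.classes_ records which integer ended up standing for which category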
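sklearn doesn’t ship a JSON exporter (export_graphviz is the built-in option), so the JSON generation is hand-rolled; a rough sketch of the idea, walking the fitted clf.tree_ arrays and assuming feature_names is the list of column names used for training:

import json

def node_to_dict(t, i, feature_names):
    # leaf nodes are marked with a child index of -1 in sklearn's tree arrays
    if t.children_left[i] == -1:
        return {"impurity": float(t.impurity[i]),
                "value": t.value[i].tolist()}
    # internal node: samples where the rule holds go left, the rest go right
    return {"rule": "{0} <= {1}".format(feature_names[t.feature[i]],
                                        t.threshold[i]),
            "impurity": float(t.impurity[i]),
            "value": t.value[i].tolist(),
            "left": node_to_dict(t, t.children_left[i], feature_names),
            "right": node_to_dict(t, t.children_right[i], feature_names)}

print(json.dumps(node_to_dict(clf.tree_, 0, feature_names), indent=2))

And finally, the measurement code itself: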
# Per-test counters: overall pass rate, predicted positives/negatives,
# and false positives/negatives (against the known answers in test_v)
testsRun = 0
testsPassed = 0
testsFalseNegative = 0
testsFalsePositive = 0
testsPositive = 0
testsNegative = 0
for t in test:
    # newer sklearn versions expect a 2D array here: clf.predict([t])[0]
    prediction = clf.predict(t)[0]
    if prediction == 0:
        testsNegative = testsNegative + 1
    else:
        testsPositive = testsPositive + 1
    if prediction == test_v[testsRun]:
        testsPassed = testsPassed + 1
    else:
        if prediction == 0:
            testsFalseNegative = testsFalseNegative + 1
        else:
            testsFalsePositive = testsFalsePositive + 1
    testsRun = testsRun + 1
print "Percent Pass: {0}".format(100.0 * testsPassed / testsRun)
print "Percent Positive: {0}".format(100.0 * testsPositive / testsRun)
print "Percent Negative: {0}".format(100.0 * testsNegative / testsRun)
# these two divide by zero if there are no misclassifications at all;
# that's the 'answer column' sanity check mentioned above
print "Percent False positive: {0}".format(100.0 * testsFalsePositive / (testsFalsePositive + testsFalseNegative))
print "Percent False negative: {0}".format(100.0 * testsFalseNegative / (testsFalsePositive + testsFalseNegative))