Decision Tree Testing Lessons

I’m running some tests on sklearn decision trees [1], and the lessons learned so far may be interesting.

I’ve put my measurement code at the end – it tracks the percentage correct and the numbers of positive, negative, false-positive, and false-negative predictions.

  • When running predictions, if you have a defect where you include the ‘answer’ column in the test columns, the measurement code below gives you a division by zero (perfect predictions mean no false positives or false negatives, so the last two percentages divide by zero), which makes it a good sanity check
  • For my data, running with criterion='entropy' [2] gives a ~2% increase in accuracy, but other people I talked to on Twitter have seen the opposite (see the first sketch after this list)
  • criterion='entropy' is noticeably slower than the default ('gini')
  • The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points)
  • For my use case, limiting the depth of the tree and forcing each node to contain a large number of samples (50-500) produced much simpler trees with only a small decrease in accuracy
  • Forcing nodes to contain more samples decreased accuracy by roughly 0-5%, scaling with the minimum sample count across that 50-500 range
  • I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, I originally had ID columns, which let sklearn pick up on data created in a certain time window (since the IDs are sequential), but that isn’t useful for what I want to do (see the data-prep sketch after this list)
  • You have to turn class-based (categorical) attribute values into integers (sklearn appears to use a numpy float class internally for performance reasons)
  • sklearn appears to only use range-based rules. Combine this with the above and you get a lot of rules like "status > 1.5"
  • The tree could conceivably encode equality conditions within its structure, although it’d be hard to tell (e.g. "status > 1.5" followed by "status < 2.5" is equivalent to "status = 2" if status is an integer)
  • I’m more interested in discovering useful rules than in future predictions; it helps a lot to generate JSON [3] (a rough sketch follows this list)
  • Within the JSON, the "entropy" and "impurity" fields show you how clean the rule is (0 = good). The "value" field shows how many items fit the rule (small numbers are probably not useful, at least for me)
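
Here’s a minimal sketch of how those settings are passed to sklearn. The training variables (train, train_v) and the specific numbers (max_depth=10, min_samples_leaf=50) are illustrative assumptions, not values from my dataset:

    from sklearn.tree import DecisionTreeClassifier

    # 'gini' is the default; 'entropy' gained ~2% accuracy on my data
    # but runs noticeably slower.
    clf = DecisionTreeClassifier(
        criterion='entropy',
        max_depth=10,         # cap depth to keep the tree simple (illustrative value)
        min_samples_leaf=50)  # force each leaf to hold many samples (illustrative value)
    clf.fit(train, train_v)
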
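The column cleanup and integer encoding can be sketched like this; I’m assuming pandas here, and the column names ('id', 'status') are hypothetical:

    import pandas as pd

    df = pd.read_csv('data.csv')

    # Sequential IDs let the tree learn a time window rather than
    # anything generalizable, so drop them before training.
    df = df.drop(['id'], axis=1)

    # Turn class-based (categorical) values into integer codes;
    # sklearn operates on numeric arrays internally.
    df['status'] = df['status'].astype('category').cat.codes
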
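The JSON conversion itself is covered in the linked post [3]; a rough sketch using the attributes sklearn exposes on the fitted tree (children_left is -1 at a leaf) might look like:

    import json

    def tree_to_json(tree, node=0):
        # impurity of 0 means a perfectly clean rule; 'value' holds
        # the number of samples per class reaching this node.
        result = {
            'impurity': float(tree.impurity[node]),
            'value': tree.value[node].tolist()
        }
        if tree.children_left[node] != -1:  # internal node: has a split rule
            result['rule'] = 'feature[{0}] <= {1}'.format(
                tree.feature[node], tree.threshold[node])
            result['left'] = tree_to_json(tree, tree.children_left[node])
            result['right'] = tree_to_json(tree, tree.children_right[node])
        return result

    print json.dumps(tree_to_json(clf.tree_), indent=2)
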
Here’s the measurement code:

    # clf is the fitted classifier; test holds the feature rows and
    # test_v the expected value for each row.
    testsRun = 0
    testsPassed = 0
    testsFalseNegative = 0
    testsFalsePositive = 0
    testsPositive = 0
    testsNegative = 0
    for t in test:
      # Note: newer sklearn versions expect 2D input here, e.g. clf.predict([t])
      prediction = clf.predict(t)[0]
      if prediction == 0:
        testsNegative = testsNegative + 1
      else:
        testsPositive = testsPositive + 1

      if prediction == test_v[testsRun]:
        testsPassed = testsPassed + 1
      else:
        if prediction == 0:
          testsFalseNegative = testsFalseNegative + 1
        else:
          testsFalsePositive = testsFalsePositive + 1

      testsRun = testsRun + 1

    print "Percent Pass: {0}".format(100 * testsPassed / testsRun)
    print "Percent Positive: {0}".format(100 * testsPositive / testsRun)
    print "Percent Negative: {0}".format(100 * testsNegative / testsRun)
    # These two divide by zero when there are no errors at all:
    # the 'answer column' sanity check mentioned above.
    print "Percent False positive: {0}".format(100 * testsFalsePositive / (testsFalsePositive + testsFalseNegative))
    print "Percent False negative: {0}".format(100 * testsFalseNegative / (testsFalsePositive + testsFalseNegative))
    
1. http://www.garysieling.com/blog/building-decision-tree-python-postgres-data
2. http://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
3. http://www.garysieling.com/blog/convert-scikit-learn-decision-trees-json