Galvanize Week 4: Supervised Learning, where trees grow down instead of up

Whoo! Picked up momentum last week. We're officially 1/3 of the way through the program. Personally, I was more excited about this week since there were lots of new topics to learn:

  • k-Nearest Neighbors (kNN)
  • Decision Trees
  • Bagging & Random Forests
  • Support Vector Machines (SVMs) & Kernels
  • Boosting
  • Profit curves
  • Sampling Methods for Imbalanced Classes

It was a lot of material for one week, so in the interest of reviewing for my own sake, this post will mostly contain high-level summaries.

We started off with k-Nearest Neighbors (kNN) to transition from last week's intro to classification problems. kNN is a pretty simple classification method in which an unknown point's class is determined by the classes of the points nearest to it. kNN's training procedure: save the training data. Then, given some new data point to classify, find some number (k) of the nearest training points and take a "majority vote" among their classes to get the new point's predicted class. "Nearest" can be defined by various distance metrics- we looked at Euclidean distance and cosine similarity.
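
For reference, here's a minimal kNN sketch with scikit-learn- the wine dataset and k = 5 are arbitrary picks for illustration, not what we actually used in class:

```python
# Minimal kNN sketch: "training" just stores the data; the distance work
# happens at prediction time.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Each test point gets the majority vote of its 5 nearest training points.
print(knn.score(X_test, y_test))
```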

Next up was Decision Trees, which are essentially flowcharts for your data. A Decision Tree is made up of nodes (where a split of a tree occurs based on the value of a certain feature) and edges (the "branches" that connect the output of one node to the input of the next). The first node is called the root node, and the end nodes that declare the outcome are called leaf nodes. The number of node levels is referred to as tree depth.

[Figure: example decision tree, from Advanced Analytics with Spark]

Each node's output is determined by some boolean logic- for binary splits, a discrete feature's value will either match or not match the node's splitting value, and a continuous feature's value will either be above or below the node's threshold value. At a particular node, the feature to split on is the one that gives us the biggest information gain. Information gain is the reduction in entropy (disorder) of the classifications when the training data is split on some feature at a specific value. There are other metrics, like Gini impurity, for choosing splits.
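
Here's a rough NumPy sketch of that calculation- the `entropy` and `information_gain` helpers are names I made up for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Reduction in entropy from splitting the labels by a boolean mask."""
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy example: splitting on a continuous feature at a threshold of 3.0
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 2.5, 3.5, 4.0, 5.0])
print(information_gain(y, x < 3.0))  # a perfect split -> 1.0 bit of gain
```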

Decision Trees are nice for a variety of reasons- they're easy to interpret, they can handle both continuous and discrete inputs and outputs (without needing to normalize inputs!), they have low prediction cost, etc. However, they can be computationally expensive to train, they may need plenty of training data so that the splits at each node aren't just local optima, and they might not generalize well due to overfitting (high variance).

To address overfitting, we can either "prune" the tree (or simply cap its depth), or we can employ ensemble methods to average away the variance and get a model that generalizes well.
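
As a quick sketch, scikit-learn's `DecisionTreeClassifier` lets you cap depth with `max_depth`- the dataset and depth of 3 here are arbitrary choices for illustration:

```python
# Compare an unconstrained tree to a depth-capped ("pruned") one.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

deep = DecisionTreeClassifier(random_state=0)                  # grows until leaves are pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)  # capped depth, less variance

print(cross_val_score(deep, X, y, cv=5).mean())
print(cross_val_score(shallow, X, y, cv=5).mean())
```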

Ensemble methods!

In ensemble methods, multiple Decision Trees are grown, which don't require pruning and can be "bushy". When it comes time to predict, the result is determined by majority vote (classification) or by averaging (regression). While individual trees might have accidentally fit noisy data, the final aggregated output is less sensitive (reduced variance) provided that the trees are not correlated.

In bagging (bootstrap aggregating), the training data is bootstrapped to produce n training sets, and n trees are fit to them respectively. Since the bootstrapped samples are drawn with replacement, they act as approximately independent realizations of the training data.

Since each bootstrapped sample is expected to contain roughly 2/3 of the distinct training points, we can use the remaining points to test each tree and get an out-of-bag (oob) score. This means we can conveniently validate with our training set instead of shrinking our training set to make room for a separate validation set.
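
A sketch with scikit-learn's `BaggingClassifier` (its default base estimator is an unpruned decision tree; the dataset and 100 trees are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(
    n_estimators=100,   # 100 trees, each fit to its own bootstrapped sample
    oob_score=True,     # score each tree on the ~1/3 of points it never saw
    random_state=0,
)
bag.fit(X, y)
print(bag.oob_score_)
```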

However, these trees probably look fairly similar. To encourage varied tree structures, we can look to Random Forests. Oh, we had some fun making up other names-

  • arbitrary arboretum
  • variable vegetation
  • frivolous jungle

But back to Random Forests. A Random Forest follows the same bootstrapping process as bagging, but differs in tree construction: at each node of each tree, only a random subset of the features is considered when picking the node's split variable & value. This way, we force some decorrelation amongst our trees.
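
In scikit-learn that feature subsetting is the `max_features` parameter on `RandomForestClassifier`- here's a sketch with my own (arbitrary) settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # only a random sqrt(p)-sized subset of features per split
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)
```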

Someone in class asked an interesting question in a way that made me laugh:

So... if we're growing a bunch of shitty trees in Random Forest... does that mean we need more of them?

From my googling, I've read that 10, 30, 100, or 500 are pretty standard numbers of trees... with not much support. However, this paper, How Many Trees in a Random Forest, compares numbers of trees across 29 datasets and suggests somewhere between 64 and 128 trees for a Random Forest. Generally, for bagging and Random Forests, having "too many trees" isn't really a problem. We'd expect test error to decrease and level off, so there's no overfitting problem- just diminishing returns. Too many trees would also be computationally expensive (but parallelizable!).

On the other hand, the number of trees matters for boosting. In boosting, trees are built in series: each new tree sees its predecessor's errors and tries to improve on them. The rate at which these trees "learn" is controlled by $\alpha$ (analogous to the learning rate in gradient descent!). If $\alpha$ is too large, subsequent trees may learn errors that are actually just noise (high variance). So $\alpha$ should be small, and each tree is kept simple- these shallow trees are the "weak learners". More trees will be needed since the learning rate is small, but the boosted model should generalize well. However, too many trees lead to both overfitting and computational expense. We used AdaBoost, Gradient Boosting, and Stochastic Gradient Boosting- all of which have different parameters to tune: tree depth, number of trees, number of features, etc. Thankfully, grid search can compare all combinations of these parameters for us :)
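
As a sketch, tuning a gradient boosted model with grid search might look like this- the parameter grid is made up purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "learning_rate": [0.01, 0.1],  # the alpha from above
    "n_estimators": [100, 500],    # more trees when the learning rate is small
    "max_depth": [2, 3],           # shallow "weak learners"
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```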

While it's extremely easy to throw data into a bunch of models and compare error metrics, it's important to spend more time investigating feature importance (how much does my RSS or information gain change if I shuffle a particular feature's column in my input matrix?) and partial dependence plots (what effect does a particular feature have on my model's output across its range of values, holding all other features constant?).
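
If you're in scikit-learn, permutation importance and partial dependence plots are available out of the box- a rough sketch (the random forest and feature index 0 are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and see how much the test score drops.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)

# Model output over one feature's range, averaging over the other features.
PartialDependenceDisplay.from_estimator(rf, X_test, features=[0])
plt.show()
```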

Support vector machines (SVMs) and kernel tricks still seem like black boxes to me right now. For classification problems where the classes don't appear to be linearly separable, you can apply some kernel function to your data to project it into some other feature space where the classes are more linearly separable. Quora: What is the kernel trick?
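
A toy sketch of the idea with scikit-learn's `SVC`- the RBF kernel and the concentric-circles data are my choices for illustration:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicitly maps the data into a space where a line works

print(linear.score(X, y))  # struggles
print(rbf.score(X, y))     # near perfect on this toy data
```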

Phew.
