Weeks 7 & 8: Decision Trees, Random Forests and Boosting
Decision Tree:
Decision Trees are among the simplest classification methods.
Sometimes we do not have discrete features but rather continuous features.
In continuous space it is convenient to think of a decision tree as partitioning the space: each internal node thresholds one feature, and each leaf corresponds to an axis-aligned region.
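As a tiny illustration of this view, here is a hand-built tree on two continuous features (the thresholds 0.5 and 0.3 and the leaf labels are made up for illustration); each leaf covers an axis-aligned rectangle of the plane:

```python
def tree_predict(x1, x2):
    # Each internal node thresholds one continuous feature, so the
    # leaves tile the (x1, x2) plane with axis-aligned rectangles.
    if x1 <= 0.5:
        return "A"          # left half-plane
    elif x2 <= 0.3:
        return "B"          # lower-right rectangle
    else:
        return "C"          # upper-right rectangle
```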
Pruning Trees
Generative and Discriminative Models
Most of what we have talked about so far (PCA, k-means, intrinsic dimension) lives in the generative domain. We already talked about trees, which are discriminative models, and we are going to talk about boosting, which is also a discriminative model.
One way to describe generative modeling is to ask: what is the distribution of pitches for males, and what is the distribution of pitches for females? Suppose we have a dataset and, for each class, we measure the mean and the standard deviation, or variance. Because we don't know anything more, we make the very common assumption that each class follows a normal distribution. That means we assign a probability to each pitch under each class. Given those two probability distributions, we can decide on the optimal threshold, and that is where the two density curves intersect: above that threshold, the pitch is more likely to be a female's; below it, more likely to be a male's. That is the generative model approach. The important thing is that we are thinking about a model that actually generates random samples from the distribution we are observing.

The discriminative approach, started in earnest by Vapnik in 1985, is completely different. It says: I don't really care what the distribution is, I only care about the error of my threshold. So I can sweep over all possible thresholds and decide where I make the smallest number of mistakes. I can draw a different graph in this case, the number of mistakes as a function of the threshold: one curve is for the rule where the female's pitch is higher than the male's, and the black line is for the rule where the male's is higher than the female's. The black line is clearly much worse, and to choose the best rule we take the minimum of one curve or the other. The theory at this level doesn't tell us how to choose between the two minima; we just pick one of them and use that threshold.
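The two approaches can be sketched in a few lines; the pitch values below are invented for illustration:

```python
import math
import statistics

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Invented pitch samples (Hz), just for illustration.
male = [110, 120, 125, 130, 140]
female = [190, 200, 210, 215, 225]

# Generative: fit a Gaussian to each class, put the threshold where the
# two densities cross (searched on a grid between the two class means).
mu_m, sd_m = statistics.mean(male), statistics.stdev(male)
mu_f, sd_f = statistics.mean(female), statistics.stdev(female)
grid = [mu_m + k * (mu_f - mu_m) / 1000 for k in range(1001)]
gen_t = min(grid, key=lambda x: abs(gauss_pdf(x, mu_m, sd_m) - gauss_pdf(x, mu_f, sd_f)))

# Discriminative: just pick the candidate threshold with the fewest mistakes.
pts = sorted(male + female)
def mistakes(t):  # rule: predict female when pitch > t
    return sum(x > t for x in male) + sum(x <= t for x in female)
disc_t = min([(a + b) / 2 for a, b in zip(pts, pts[1:])], key=mistakes)
```

On clean, well-separated data like this, both thresholds land in the gap between the classes and make zero training mistakes; the difference between the approaches only shows up when the Gaussian assumption is wrong.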
Now, it may seem that this is not a very good way to use the data, because if we really can model it as a normal distribution, then we can get much better results. The problem, however, is that we don't actually know that the distribution is normal, and in most real-life cases the distribution does not behave exactly like a normal distribution.
So, here is an example. Suppose we have the bulk of the males here and the bulk of the females here, and then a few males and females with extremely high-pitched voices, regardless of whether they are male or female. Those are outliers, and we would rather not pay too much attention to them; basically, we can give up on them.

But what happens if we take the generative approach with a Gaussian? We compute a mean and a variance for each class, draw the two distributions accordingly, and choose a threshold where they cross. The result is that this threshold is a very poor one: it is wrong on all of these females and doesn't even help us on the outlier males. So even though this threshold is, in some sense, optimal under the Gaussian model, it is actually pretty bad.

How can you combat that? If you use the discriminative approach, that works very well. The discriminative approach finds the best threshold even with those outliers present, because a few outliers barely change the curve that tells you the smallest number of mistakes a threshold can make. That empirical minimum is an approximation of the smallest number of mistakes you would make if you knew the exact distribution, it approaches that minimum very quickly, and there is theory to back this up, regardless of what the underlying distribution is.

A standard counter-argument is to say: well, you are just not using the right distribution. Indeed, if you use the "right" distribution, one that allows for the very high pitches of a few people, then you find these two distributions and the threshold comes out right.
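To make the robustness claim concrete, here is a sketch (with invented numbers again) measuring how far each kind of threshold moves when one extreme high-pitch outlier is added to each class; the Gaussian-crossing threshold shifts substantially while the minimum-mistakes threshold does not move at all:

```python
import math
import statistics

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def generative_threshold(a, b):
    # Fit one Gaussian per class; threshold = crossing point of the two
    # densities, searched on a grid between the two class means.
    mu_a, sd_a = statistics.mean(a), statistics.stdev(a)
    mu_b, sd_b = statistics.mean(b), statistics.stdev(b)
    lo, hi = sorted((mu_a, mu_b))
    grid = [lo + k * (hi - lo) / 1000 for k in range(1001)]
    return min(grid, key=lambda x: abs(gauss_pdf(x, mu_a, sd_a) - gauss_pdf(x, mu_b, sd_b)))

def discriminative_threshold(a, b):
    # Threshold with the fewest training mistakes (predict class b when x > t).
    pts = sorted(a + b)
    def mistakes(t):
        return sum(x > t for x in a) + sum(x <= t for x in b)
    return min([(p + q) / 2 for p, q in zip(pts, pts[1:])], key=mistakes)

male, female = [110, 120, 125, 130, 140], [190, 200, 210, 215, 225]
male_o, female_o = male + [300], female + [310]   # extreme high-pitch outliers

gen_shift = abs(generative_threshold(male_o, female_o) - generative_threshold(male, female))
disc_shift = abs(discriminative_threshold(male_o, female_o) - discriminative_threshold(male, female))
```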
That seems to contradict the claim that the discriminative approach is the right thing to do. The hidden problem, however, is that we might not even have enough data to know what the right distribution is, and we want to make the decision regardless. As Vapnik puts it: if you have the problem of making the right decision, don't turn it into the even harder problem of identifying the distribution.

There is an important difference here between what is called traditional statistics and modern machine learning. In traditional statistics, you have data, you do some analysis of it, and from that you generate a model of the world, an estimate of the state of the world; then, when you have to make a decision, you use decision theory to translate that estimate into a decision. The new approach of machine learning skips over estimating the world state: it goes from data directly to a prediction, without worrying about whether the thing in the middle is a model of the world or not. In some sense this is very powerful, because this is the way boosting works and, essentially, the way neural networks work; in another sense it is fragile, because it doesn't reveal anything substantial about the world. To compare them in a table,
we can set generative models against discriminative models as follows:

                         Generative                Discriminative
  Goal                   probability estimate      classification rule
  Performance measure    likelihood                misclassification rate
  Mismatch problems      outliers very significant outliers are just ordinary mistakes

In the generative setting, outliers are very significant because they are samples that appear where the model's probability is extremely small. In the discriminative setting, they are just misclassifications like any others, so they are not particularly important: you don't have to get rid of the outliers, they simply count as a few additional mistakes.

To summarize: the goal of a generative model is to explain how the data is generated, and the goal of a discriminative model is to predict a property of the data. Generative models can be much more accurate if they are correct, that is, if you are using the right class of models. Discriminative models are more robust against poor modeling and against outliers: they depend much less strongly on the assumptions you make when you design them. Of course, they depend somewhat, but far less extremely than generative models. So, next we're going to talk about my favorite discriminative model, which is boosting.
Boosting
Boosting uses a particular kind of training set: a weighted training set. Instead of treating all of the examples as equally important, boosting changes the weights as it goes, concentrating more and more on the examples that are hard to classify.
Initially, of course, it gives the weak learner the examples weighted equally; that is just the regular training set, and it gets some rule back. Based on that rule, it changes the weights of the examples, and in fact it changes them in exactly the right way so that the rule that was just generated looks completely random on the weighted set: with weight one half it is correct, and with weight one half it is incorrect. Given that, the weak learner cannot give you back the same rule, because that rule is now exactly half and half; it has to give you a different rule, and you take that one. This process repeats, potentially many times, and you collect many of these weak rules. The interesting thing is that, contrary to what you usually expect from learning, these rules typically become worse and worse rather than better and better: as you change the distribution, you make the distribution over the examples harder and harder, and it becomes harder and harder to find a rule that is even slightly better than random guessing. What we then do with all of these rules is take a weighted sum of them, and that gives us what we call the strong rule, which is what we use to predict.
Let's look at this in a slightly different notation that hopefully makes things clearer. Suppose we have a training set (x1, y1), ..., (xm, ym), with labels that are minus 1 or plus 1, and we run the following iterative process: we define a distribution over the examples, and then ask the weak learner for a rule of thumb that is slightly better than random guessing, which means its weighted error epsilon_t is smaller than one half. After many such iterations, we combine the weak rules into the final rule. That is the general scheme of boosting. There are several boosting algorithms.
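In symbols, the scheme just described reads (standard boosting notation, not verbatim from the lecture):

```latex
\text{Given } (x_1, y_1), \ldots, (x_m, y_m), \quad y_i \in \{-1, +1\}:\\
\text{for } t = 1, \ldots, T:\\
\quad \text{construct a distribution } D_t \text{ over } \{1, \ldots, m\};\\
\quad \text{find a weak rule } h_t \text{ with error }
  \epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] < \tfrac{1}{2}.\\
\text{Output the final rule } H(x) = \operatorname{sign}\!\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big).
```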
The first one that became useful and famous is AdaBoost. That is the algorithm that Rob Schapire and I developed (published in 1997), and here is the way AdaBoost works. It starts with a uniform distribution, and then at every iteration, given the current distribution and the rule you got from the weak learner, you update the distribution in the following way: you multiply the weight of every example on which the weak rule is correct by a number smaller than 1, and the weight of every example on which the weak rule is incorrect by a number larger than 1. This multiplier is chosen exactly so that the current rule, with respect to the new distribution, is half the time correct and half the time incorrect, or in other words has correlation zero. The weight alpha_t turns out to be alpha_t = (1/2) ln((1 - epsilon_t) / epsilon_t). The final rule simply takes the weak rules with these alpha_t weights, sums them, and takes the sign of the sum: if the sum is positive, the label is plus 1; if it is negative, the label is minus 1.
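As a concrete sketch (a minimal toy implementation, not the lecture's exact pseudocode), here is AdaBoost on one-dimensional data with threshold stumps as the weak learner; note how correct examples are multiplied by e^(-alpha) < 1 and incorrect ones by e^(alpha) > 1:

```python
import math

def adaboost(xs, ys, T=5):
    """Minimal AdaBoost on 1-D data with threshold stumps as weak rules."""
    m = len(xs)
    w = [1.0 / m] * m                      # uniform initial distribution
    thresholds = [v - 0.5 for v in sorted(set(xs))]
    ensemble = []                          # list of (alpha, polarity, theta)
    for _ in range(T):
        # Weak learner: the stump h(x) = p if x > theta else -p
        # with the smallest weighted error under the current weights.
        best = None
        for theta in thresholds:
            for p in (+1, -1):
                preds = [p if x > theta else -p for x in xs]
                err = sum(wi for wi, pr, y in zip(w, preds, ys) if pr != y)
                if best is None or err < best[0]:
                    best = (err, p, theta, preds)
        eps, p, theta, preds = best
        if eps == 0:                       # perfect weak rule: cannot reweight
            ensemble.append((1.0, p, theta))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, p, theta))
        # Reweight so the current rule is exactly 50/50 on the new weights.
        w = [wi * math.exp(-alpha * y * pr) for wi, y, pr in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    s = sum(a * (p if x > theta else -p) for a, p, theta in ensemble)
    return 1 if s >= 0 else -1
```

For example, on xs = [1, 2, 3, 4, 5, 6] with labels [-1, -1, +1, +1, -1, -1] (an interval concept that no single stump can represent), the sign of the boosted weighted sum classifies every training point correctly.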
Boosting as Gradient Descent
We can get more insight by considering AdaBoost as a gradient descent algorithm. We have used gradient descent for minimizing functions, for finding an optimum, and what we are going to do here is find an optimum over the prediction function itself, the sum of the weak learners. This is called variational optimization. Variational optimization is essentially very similar to gradient descent, except that the space you work in consists not of vectors of some dimension d but of vectors of infinite dimension, that is, functions. You can think about something like the shape of the wing of an airplane as a continuous function that you want to optimize for some goal, say to minimize drag or to maximize lift; you would use variational optimization methods to do that, and it is just like gradient descent except that the derivatives are now derivatives over functions. So we have x, which is our input, and y, which is our output, plus one or minus one. We have a training set with particular x's and labels: here are the plus examples, and here are the minus examples. What we want is to find a function that is negative where the examples are mostly minuses and positive where they are mostly pluses. But we have only a finite number of examples, so we have some ambiguity about how to do it. So let's see how this becomes a variational problem.
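Concretely (this is a standard way of writing the view described above, not verbatim from the lecture), AdaBoost can be read as coordinate-wise descent on the exponential loss over the span of the weak rules: the example weights are proportional to the pointwise gradient of the loss, and the optimal step size recovers the alpha_t formula:

```latex
L(F) = \frac{1}{m}\sum_{i=1}^{m} e^{-y_i F(x_i)}, \qquad
F_t = F_{t-1} + \alpha_t h_t, \qquad
D_t(i) \propto e^{-y_i F_{t-1}(x_i)}, \qquad
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
```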
To do boosting you need some kind of a weak learner.
How can you represent a tree as a boosted sum of weak rules?
Boosting tends not to overfit
We got training error zero, but the only thing that will stop the boosting algorithm in its tracks is getting a weak rule that is perfect: if the weak rule makes no mistakes, epsilon_t is zero, the weight update divides by zero, and you cannot continue.
In this experiment, we kept going, and what you see is that the training error, of course, cannot change because it is already zero, and yet the test error continues to decrease in a very significant way: from an error of about 9% down to about 3%. As I said before, this is for 10 trees, 100 trees, and 1,000 trees.

This is very different from what we usually expect when doing machine learning with models that have more and more parameters. What we expect is that after we reach training error zero, the test error will start to increase, because we are overfitting. But here, on the contrary, boosting continues to improve the test error after the training error is already zero. At a high level, this means we are using only about 10,000 training examples, that is the size of this training set, and we are fitting 1,000 trees that together have about 2,000,000 parameters. How can you ever fit 2,000,000 parameters using just 10,000 examples? What you expect is that the curse of dimensionality will completely destroy your predictive ability. But that doesn't happen.

To explain what is going on, let's think again about the margins that we talked about a few videos ago. Here we have our examples, the horizontal axis is the margin, and what we see is that all of the examples are to the right of zero, so all of the examples suffer zero error. In terms of this zero-one loss we're done; there's nothing more we can do. But the loss function that boosting is minimizing is this exponential function, not the zero-one loss, and therefore it is basically going to keep pushing these examples further to the right. So later the picture looks like this, and what we have is that we have no examples with a small margin.
Even though the training error didn't change, the distribution of the margin did change in a significant way. And that, we're claiming, is the reason that we are getting better performance on the test data.
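The point about the loss function can be stated precisely: the zero-one training error is upper-bounded by the exponential loss, and the exponential loss keeps decreasing even after the zero-one term hits zero, which is what keeps pushing the margins to the right:

```latex
\frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{y_i F(x_i) \le 0\}
\;\le\; \frac{1}{m}\sum_{i=1}^{m} e^{-y_i F(x_i)}.
```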
So let's think about it again. The loss function here that we are trying to minimize, we are bounding by this exponential function, but we might bound it by something else. Hinge loss is this kind of loss, and it corresponds to the error of support vector machines; soft support vector machines use this kind of error. Then you have LogitBoost, which minimizes a slightly different function, and then BrownBoost, another algorithm that I invented, which has yet another kind of loss function. But all of these loss functions try to push examples not just across zero, but further away from zero.

Here is the experimental evidence that this is actually happening. What we have here is the margin theta; actually, something we call the normalized margin, which is the margin divided by the sum of the absolute values of all of the alpha weights, so that the number is between minus one and one. Now we are looking at the cumulative distribution of the margins of the training examples as a function of which iteration we're in. If we look at iteration number five, we see that all of the examples are to the right of zero, which is consistent with what we know: at iteration five all of the training examples are classified correctly. But if we go to iteration 100 or 1,000, we see that the examples around here were pushed to have a margin above 0.5, and what we paid for it is that some examples that had margin one, meaning all of the rules were correct on them, got a slightly smaller margin. This is consistent with what we were saying: after iteration five the training error doesn't go down, but the margin distribution changes and the minimal margin you have becomes larger. That is associated with having smaller test error.
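The normalized margin mentioned above is easy to compute from the voting weights; a small sketch (the alpha and prediction values in the usage example are hypothetical):

```python
def normalized_margin(alphas, preds, y):
    # alphas: voting weights of the weak rules
    # preds:  the weak rules' outputs h_t(x) in {-1, +1} on one example x
    # y:      the true label, -1 or +1
    vote = sum(a * p for a, p in zip(alphas, preds))
    return y * vote / sum(abs(a) for a in alphas)   # always in [-1, 1]
```

For instance, with alphas [0.5, 0.25], weak predictions [+1, -1], and true label +1, the normalized margin is (0.5 - 0.25) / 0.75 = 1/3.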
And what you see here is that after 100 iterations you have this line, and after 1,000 iterations you have the dotted line, and there is almost no change: even though we added 900 trees, we are essentially standing in place, and we didn't really change our prediction by very much.

This was actually proven by Rob Schapire, myself, Peter Bartlett, and Wee Sun Lee: we have a general theorem that says this is not just something we observe empirically, it has a theoretical explanation. What we are talking about here is convex combinations of some kind of base functions; the basis functions we are using are the weak rules, and we take a convex combination of them. We are not concerned here with how the convex combination is found; we are only interested in the difference between the training error and the test error. What we have shown is that the probability of making a mistake with the rule F on the true distribution is bounded by two terms. One term is the fraction of the training examples whose margin is smaller than theta; that is where we said, okay, we don't have any examples with margin smaller than 0.5. The other term depends only on theta and some other parameters which are not critical at this point: the size m of the training sample and the VC dimension d of the base class; roughly, it is of order sqrt(d / (m theta^2)). But the important thing, the thing that basically explains the experiments,

is that there is no dependence on the number of weak rules that are combined. So when we say that this has 2,000,000 parameters, that is really not so important, because the important thing is how much of a margin we can get, and the number of parameters can be ignored as long as you have a significant margin. And if you know something about support vector machines, there are very similar theorems in that case too.

So this suggests an optimization problem that looks something like this: we want to maximize the margin theta, because we want 1/theta to be small, but we also want a small number of examples below the margin. So this is a combined problem where we count the examples below theta while trying to make theta as big as possible. Unfortunately, this problem is at least as hard as finding the combination that minimizes the number of mistakes, and that problem is NP-hard, so we really don't know how to solve it in any effective way.

What I want to talk about now is a little bit of the idea of the proof. It is not the formal proof, for that you'll have to go to the paper, but the idea is actually quite intuitive and simple. Suppose we have positive and negative examples, or male and female, in this plane, and they are separated by this line over here, though not perfectly: here, here, and here are examples on the wrong side. So we have a separator with some small number of mistakes, but with the additional property that it has a large margin: none of the examples gets too close to the separating line. And that, we claim, is what controls the amount of test error you would get. So why is that?
Basically, it is because if you look at the possibilities of wiggling this plane, or this line, you see that you can wiggle it quite a lot without really changing the training error. We have a lot of room to wiggle things, and because of that, we are going to be more confident about our predictions. Why confident? It boils down to looking at individual examples and our confidence level on each. If we have examples here that are pretty far from all of these good hyperplanes, then they are unlikely to change prediction under some alteration of the training set. For instance, suppose we sample a new training set from the same distribution: we are likely to get one of these hyperplanes as the best rule, and all of these best rules will agree on the predictions here. Similarly, on examples that are far from the margin you are very confident, because you know that even if your classifier changes a little, it is not going to change your prediction on those examples. On the other hand, examples that fall within the small margin area are quite likely to be predicted differently if you get a slightly different training set; those examples you are not confident about. Luckily, in this case you don't have too many of them, so on most of the examples you can be very confident. If you had a lot of examples right inside the margin area, you wouldn't have a large margin. Why are large margins good?
You can trade off model bias for data variation: by making the model more flexible, you decrease the model bias, but you increase the data variation. That is what's called the bias-variance dilemma.
Applications of Boosting
Those were called Haar features, and you can still find them today in OpenCV.
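These rectangle features are cheap to evaluate thanks to the integral-image trick used in the Viola-Jones detector; here is a minimal sketch (the tiny image in the usage example is made up for illustration):

```python
def integral_image(img):
    # ii has an extra zero row/column; ii[r][c] = sum of img[0:r][0:c].
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    # Sum over any rectangle in O(1) from four corners of the integral image.
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

def haar_two_rect(ii, top, left, height, width):
    # Two-rectangle Haar feature: sum of left half minus sum of right half.
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```

Each feature evaluation costs only a handful of array lookups, which is what made scanning many thousands of candidate features per window feasible; AdaBoost then selects a small subset of them as weak rules.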
How do random forests differ from general decision tree learning?
We try a random subset of the features before every split; this decorrelates the trees and increases the power of averaging.
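A sketch of the key difference (the function name is hypothetical): a plain decision tree scans every feature at a split, while a random-forest tree scans only a fresh random subset, typically about the square root of the feature count for classification:

```python
import random

def candidate_features(n_features, rng, forest=True):
    # A plain decision tree considers every feature at every split;
    # a random-forest tree draws a fresh random subset (about sqrt(d))
    # before each split, which decorrelates the trees in the ensemble.
    if not forest:
        return list(range(n_features))
    k = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))
```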
How do boosted gradient trees differ from random forests?