Weeks 5 & 6: K-Means and Intrinsic Dimensions

K-means is an algorithm whose goal is to find the centers of the clusters in the data.


To start the k-means algorithm, we first define a set of centroids randomly. In each iteration we then:
1. Associate each point with its closest centroid, minimizing the distance from the point to the centroid.
2. Update each centroid to be the mean of the points associated with it.


We know that the k-means algorithm is complete when the centroids stop changing. At this point, the algorithm has converged.
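The loop described above can be sketched in plain NumPy. This is a minimal illustration, not the course's reference implementation; the function and variable names are mine:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initialization, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k random data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: associate each point with its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (if a cluster ends up empty, keep its centroid where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: the centroids stopped changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On well-separated data this typically recovers one centroid per cluster; with unlucky initialization it can settle in a local optimum, which is why practical implementations restart several times.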



K-Means: the theory

The main property that we are going to keep track of in k-means is the Root Mean Squared Error, or RMSE: the square root of the average squared distance from each point to its assigned centroid.


At every step of the k-means algorithm, the RMSE cannot increase: it can only decrease or stay the same. The assignment step moves each point to a centroid that is at least as close as its old one, and the update step replaces each centroid with the mean of its points, which is exactly the point minimizing the squared distances to them.
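A quick numerical check of this claim, as a sketch (the data and names are mine): run k-means while recording the RMSE at every iteration, and verify that it never rises.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated blobs, shifted to roughly (-4,-4) and (4,4).
X = rng.normal(size=(300, 2)) + rng.choice([-4, 4], size=(300, 1))

centroids = X[rng.choice(len(X), size=3, replace=False)]
history = []
for _ in range(20):
    # Assignment step: distance of every point to every centroid.
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = d.argmin(axis=1)
    # Record the RMSE under the current centroids and assignment.
    history.append(np.sqrt(np.mean(d.min(axis=1) ** 2)))
    # Update step (keep a centroid in place if its cluster is empty).
    centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(3)
    ])
# `history` is non-increasing: each step lowers (or keeps) the RMSE.
```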












K-Means: measures of performance, part 1



There are 3 ways of measuring k-means performance:
1. Number of labelling errors
2. Error in the locations of the centroids
3. RMSE (root mean squared error)









For the RMSE, we do not need to know anything about the true centers: we can simply take the centroids we obtained and calculate the RMSE.
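A minimal way to compute it, given data `X`, the final `centroids`, and the point-to-centroid assignment `labels` (the names are mine, not from the lecture):

```python
import numpy as np

def rmse(X, centroids, labels):
    """Root mean squared distance from each point to its assigned centroid."""
    diffs = X - centroids[labels]          # per-point offset from its centroid
    return np.sqrt(np.mean(np.sum(diffs ** 2, axis=1)))
```

Note that nothing here refers to ground truth: only the data and the clustering we produced.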







K-means is not about finding the correct clusters, but simply about partitioning the space into parts.







The first thing is that if you take two points in a Gaussian distribution in high dimensions, then with very high probability the two points have a particular distance between them: no closer and no further. That is very strange. We are not used to that in two dimensions, where random points in a Gaussian have various distances between them. But in high dimensions we have what is called concentration, which means that every pairwise distance is very close to the same value. As a result, you can't really use things like nearest neighbors.

Another thing: if we take the data that we generated, which is basically a ring in high dimensions, and instead of projecting it onto the first two dimensions we just choose a random pair of dimensions and project it there, then the data will look exactly like a single Gaussian. The special dimensions in which we see a ring have effectively disappeared; they are very hard to find among the many, many dimensions.

The last thing is that the RMSE decreases with the number of centroids, but it decreases only extremely slowly. You can add more and more centroids to your k-means, but you are basically not improving the RMSE. And finally, k-means really breaks down, so what you get from it is not very useful.

So here is our data. In this case we have 500 points, four centers, and the radius is two, and everything works very nicely.
We see that there is very clear separation of the clusters. There are mistakes, because the clusters are relatively close to each other, but the mistakes are where we expect them to be.
Here is what we get when the dimension is 20 and the number of examples is 500. What we see is that, yes, it still works in the sense that the centers are close. But if you look at the reduction in the amount of error, it is not much: the error dropped from about 90 to a little below 80. In the scaled plot it looks very much like before, but at full scale we see that it reduced very little.

Now, if we go to an even larger dimension, 500, with just 100 examples, everything breaks down. We have more dimensions than examples, which in general people expect to never work. In terms of the reduction in the RMSE, it is almost invisible, very, very tiny. And zoomed in, the curve is completely without an elbow: just a continuous, roughly linear reduction. If we look at the examples and the mistakes, they are all over the place. The centers are where we expect them, in a square, but the centroids are all bundled up in the middle, and you have to remember that in the other dimensions they are scattered up and down as well. They basically have nothing to do with the plane on which the four centers live. At this ratio of dimension to number of examples, it is next to impossible to find any structure, not to mention actually finding where the centers are.

So, to summarize: we can measure the performance in various ways, but most ways require knowledge of the ground truth, i.e., how the data was generated, and that is almost never the case when we want to use k-means. The only measure that stays with us even when we don't know the ground truth is the behavior of the RMSE, and we showed how to use the RMSE to estimate the number of centers. Something to think about: can you use the RMSE to estimate whether a particular centroid is in the right location? There is no exact answer to this, but in some situations you can do it very well. Something for you to think about and discuss.

From the experiments we did, we saw that making the clusters closer makes the clustering harder. Decreasing the number of examples per cluster increases the error in the centroids, but doesn't necessarily hurt the labelling error or the RMSE. Increasing the number of clusters makes the distribution here into a ring, the way that we did it, and k-means will find some kind of good partitioning with a low RMSE, but it is not going to relate to the original distribution; it just breaks the ring into small parts, which can be used in something like vector quantization. And going to high dimensions breaks k-means: if you move from data at 10 or 20 dimensions to data at 1000 dimensions, you can expect a lot of trouble when using k-means.
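The "elbow" idea for estimating the number of centers can be sketched as follows. This is an illustration under my own setup (four well-separated Gaussian clusters), using scikit-learn's `KMeans`, whose `inertia_` attribute is the sum of squared distances, so RMSE = sqrt(inertia / n):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
# 100 points around each of the 4 centers, noise std 1.
X = np.vstack([c + rng.normal(0, 1, (100, 2)) for c in centers])

rmses = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rmses.append(np.sqrt(km.inertia_ / len(X)))
# The drop from k=3 to k=4 is large; from k=4 onward it is small.
# That "elbow" at k=4 is our estimate of the number of centers.
```

In the high-dimensional regime discussed above, this curve loses its elbow entirely, which is exactly why the estimate stops working there.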





















In PCA we project a point onto the line by moving in both the x and y directions (perpendicular projection).



In regression we project a point onto the line by moving only in the y direction.
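The contrast between the two projections can be checked numerically. A small sketch with made-up 2-D data (names and data are mine): the PCA residual is perpendicular to the fitted line, while the regression residual is purely vertical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 0.5 * x + rng.normal(0, 0.1, 200)
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)                  # PCA works on centered data

# PCA: first principal direction; project along the perpendicular,
# so both coordinates of a point move.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v = Vt[0]                                # unit vector along the principal axis
pca_proj = np.outer(Xc @ v, v)           # orthogonal projection onto the line

# Regression: fit y = a*x + b by least squares; project by moving
# only the y coordinate of each point.
a, b = np.polyfit(x, y, 1)
reg_proj = np.column_stack([x, a * x + b])
```

The residual `Xc - pca_proj` has zero component along `v` (it is perpendicular to the line), while `reg_proj` leaves every x coordinate untouched.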
 









Initialization of K-Means







Data in High Dimensions








Very high-dimensional data leads to concentration of distances around a single value: the distance between two points in the same cluster becomes almost the same as the distance between points in different clusters.
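Concentration is easy to see numerically. A sketch (setup is mine): draw random Gaussian points in 2 and in 1000 dimensions and compare the relative spread of the pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(0)
spreads = {}
for d in (2, 1000):
    P = rng.normal(size=(200, d))
    # All pairwise distances via the identity |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (P ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * P @ P.T
    dists = np.sqrt(np.clip(d2, 0, None))[np.triu_indices(200, k=1)]
    spreads[d] = dists.std() / dists.mean()   # relative spread
# In 2 dimensions the relative spread is large (roughly 0.5); in 1000
# dimensions it collapses (roughly 0.02): the distances concentrate.
```

This is exactly why nearest-neighbor-style reasoning degrades in high dimensions: "nearest" and "farthest" are barely different.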

What is the curse of dimensionality? In high-dimensional data, the data points are very sparse. The curse of dimensionality occurs when the data has so many dimensions that our statistical methods break down. As the dimension grows, the data also becomes more difficult for us to visualize.

Intrinsic Dimensions

PCA gives one definition: if the variance is dominated by the d largest eigenvalues, then the data set has intrinsic dimension d.
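For example (a sketch with synthetic data of my own construction): data that really has 2 degrees of freedom, embedded in 10 coordinates, shows an eigenvalue spectrum where the top 2 eigenvalues carry essentially all the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(500, 2))                   # 2 true degrees of freedom
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]    # random 2-D plane in R^10
X = coords @ basis.T + rng.normal(0, 0.01, (500, 10))  # embed, add tiny noise

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]     # eigenvalues, descending
explained = eigvals / eigvals.sum()
# The first 2 eigenvalues dominate the variance -> intrinsic dimension 2,
# even though the ambient dimension is 10.
```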












What is the intrinsic dimension? It is the number of degrees of freedom in the data: how many parameters are needed to describe a point, which can be much smaller than the number of coordinates the data is given in.












Fractional Dimensions


























