To learn K-means clustering

Can someone explain K-means clustering? I want to learn about it: its applications, how it is used in architecture, and how it is implemented in Grasshopper. To be frank, my knowledge of this is zero, so please explain from the basics. Good luck.

It is a method for grouping multidimensional data (oftentimes points) into clusters, so that all the points in one cluster have more in common with each other than they do with points in another cluster.

There’s no single answer to how or why it’s used; that’s like asking what role multiplication has in architecture.

Is k-means part of Grasshopper 1.0? I remember writing an algorithm for 2.0, but possibly I did it once before. I think it’s an iterative approach (a rough sketch follows the list):

  1. Randomly pick k points out of the set.
  2. Assign all remaining points to these seed points based on proximity.
  3. Move the seed coordinates to the average of all the points per cluster.
  4. Assign all points to the new seed points based on proximity.
  5. Repeat steps 3 and 4 until the clusters no longer change.
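Here is a minimal sketch of those five steps in plain Python, just to make the loop concrete. The `distance` and `k_means` names, the sample data, and the stopping rule are my own illustration, not the Grasshopper implementation:

```python
import math
import random

def distance(a, b):
    # Euclidean distance between two points of any dimension.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(points, k, max_iterations=100):
    # 1. Randomly pick k points out of the set as the initial seeds.
    seeds = random.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # 2./4. Assign every point to its nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, seeds[i]))
            clusters[nearest].append(p)
        # 3. Move each seed to the average of the points in its cluster.
        new_seeds = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_seeds.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                       for d in range(dim)))
            else:
                # Keep an empty cluster's seed where it is.
                new_seeds.append(seeds[i])
        # 5. Stop once the seeds (and therefore the clusters) no longer change.
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, clusters

# Example: six 2D points forming two loose groups.
data = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]
centroids, groups = k_means(data, k=2)
print(centroids)
print(groups)
```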

The main problem with k-means is that you have to decide up front what the value of k is supposed to be. This can be difficult, if not impossible, to get right. You may end up having to calculate different solutions for a whole range of k values and then pick the one which has the best clustering.

As a (slightly morbid) example let’s consider trying to categorise deaths. Let’s say we have collected information on all the people who died last year and we want to figure out how to change policies or funding to reduce this total number.

Per death there’s a lot of information (age, gender, smoker, underlying conditions, in hospital or at home or outside,…) so it’s definitely a multidimensional problem.

So we run a k-means clustering with k=2 and we end up with 2 clusters. One of them clearly encompasses those who died of old age, while the other cluster is an odd sock drawer of everyone else. Clearly 2 is too small a value to work in this case.

Next up we’ll try k=3 and we get [old age], [accident], and [disease]. Better, but still very coarse. Traffic or workplace accidents? Cancer or infection? Healthy or unhealthy old age?

So you can try k=6 or k=20 and get more distinct groupings. But at some point you’ll get diminishing returns, where the additional clusters get arbitrary. It can be difficult to tell when this happens, just as it can be difficult to figure out what overarching quality connects data points inside any single cluster.


Simplifying… consider that you have a set of plants that you want to distribute in a garden according to their properties and needs. The big ones with the big ones, the ones that require more water with the ones that require more water, etc., by similarity. Your garden has four areas, so how do you choose the area to place each plant? One way to do this is to measure all the properties of each plant and represent this data as a point in a feature space. You put four random points in this space, and for each plant you look for its closest point. Group the plants by the index of the closest point and average each group. Update the points by moving each one to the average of its group, and repeat the process until they converge. The final four groups then represent how you should gather the plants by similarity in the four areas of the garden.
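To make the garden example concrete, here is a small sketch assuming scikit-learn is installed (so it runs outside Grasshopper); the plant names and the numbers for height and water need are made up purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical plants: each becomes a point in feature space
# (height in metres, water need on a 0..1 scale).
plants = {
    "fern":   (0.3, 0.9),
    "cactus": (0.4, 0.1),
    "oak":    (8.0, 0.5),
    "pine":   (10.0, 0.4),
    "ivy":    (0.5, 0.7),
    "palm":   (6.0, 0.8),
    "rose":   (1.2, 0.6),
    "aloe":   (0.5, 0.2),
}
names = list(plants)

# Scale each feature to 0..1 so height (metres) doesn't dominate water need.
features = MinMaxScaler().fit_transform(list(plants.values()))

# k = 4 garden areas.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
for area in range(4):
    print("area", area, ":", [n for n, g in zip(names, labels) if g == area])
```

Scaling the features first matters here, otherwise the property with the largest numeric range would decide the clusters on its own.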

It is worth noting that any vector-based clustering method will always arrive at a solution - but that does not mean it is the best solution. Building on the excellent example by @Dani_Abalde, the centroids of each cluster will stabilize by pulling points closer and closer to the group averages, but that does not mean points are not misclassified. You can begin to understand the density of the points by looking at the SSE (sum of squared errors) or the silhouette score, to see how many clusters are needed to best represent the data.
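As a sketch of that idea, the snippet below scans a range of k values and reports both the SSE (KMeans' `inertia_`) and the silhouette score - again assuming scikit-learn and using made-up data with three loose groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2D data with three loose groups, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = model.inertia_                      # sum of squared distances to centroids
    sil = silhouette_score(X, model.labels_)  # -1 (bad) .. +1 (good) separation
    print(f"k={k}  SSE={sse:8.1f}  silhouette={sil:.3f}")
```

A sharp drop in SSE followed by diminishing returns (the "elbow"), or the highest silhouette score, are the usual hints for a sensible k.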

Thanks to all