Clustering a list of numbers

Petras_Vestartas · March 20, 2020, 6:46pm

Hi,

I am trying find a method how to extract a list of numbers from a list that are the most self-similar?
I would like to filter out values that stand out too much from a general list.

From screenshot attached first list would return
0.065, 0.066, 0.067, 0.067, 0.066, 0.066
second
0.049, 0.049, 0.049, 0.049.0.051,0.052

One idea was to use K-Means clustering for the list.
But I am wondering if there is a simple mathematical way to get this?
I do not mind using .net math numeric libraries if needed.

MostCommon.gh (20.7 KB)

rawitscher-torres · March 20, 2020, 8:28pm

Hi Petras,

try this as a starting point


using System.Linq;

 private void RunScript(DataTree<double> data, double threshold, ref object A)
  {
    A = GroupSimilarNumbers(data, threshold);
  }

  // <Custom additional code> 

  public DataTree<double> GroupSimilarNumbers(DataTree<double> data, double threshold)
  {
    DataTree<double> Out = new  DataTree<double>();

    for (int i = 0; i < data.BranchCount; i++)
    {
      Dictionary<double, double> groups = new Dictionary<double, double>();

      for (int j = 0; j < data.Branches[i].Count - 1; j++)
      {
        double val = data.Branches[i][j];
        double nextVal = data.Branches[i][j + 1];

        if(Similar(val, nextVal, threshold))
        {
          if(!groups.ContainsValue(val))
            groups.Add(val, val);
        }

        if(groups.Count > 0)
        {
          foreach (KeyValuePair<double, double> entry in groups)
          {

            if(Similar(val, entry.Value, threshold))
            {

              if(!groups.ContainsValue(val))
              {
                groups.Add(val, val);
                break;
              }
            }

            if(Similar(nextVal, entry.Value, threshold))
            {
              if(!groups.ContainsValue(nextVal))
              {
                groups.Add(nextVal, nextVal);
                break;
              }
            }
          }

        }


      }

      Out.AddRange(groups.Select(a => a.Value).ToArray(), new GH_Path(i));
    }

    return Out;
  }


  public  bool Similar(double a, double b, double threshold)
  {

    return Math.Abs((a - b)) <= threshold;
  }

Important Note: The algorithm would be much simpler if all the branches are sorted before hand. There is actually a small bug in this version because of this. So just sort the data before hand.

Dani_Abalde · March 20, 2020, 8:41pm

Another option with just gh components, use that numbers as a coordinate with [Construct Point], group them with [Points Group] component and use [List Item] to filter the numbers with the Indices output.

Dani_Abalde · March 20, 2020, 8:43pm

It would only make sense to use KMeans here if you know in advance how many groups you’re going to do, the K number.

tobias.stoltmann · October 13, 2021, 3:16pm

@Petras_Vestartas
I am asking myself the same question at the moment.
I would like to group them by their similiarity or deviations, e.g. in your first example when searching for a 0.005 devaition in

[0.164, 0.065, 0.066, 0.067, 0.067, 0.066, 0.066, 0.071]

it would return
[0.164], [0.065, 0.066, 0.067, 0.067, 0.066, 0.066], [0.071]]

I could think of simply iterating through the list, but it seems a bit un-elegant to me and a bit too thoughtless.

PeterFotiadis · October 13, 2021, 5:19pm

You can do that (but as Dani said, blah, blah) … but why kill the mosquito with the bazooka? Tons of LINQ related stuff around.

Like:

PeterFotiadis · October 14, 2021, 8:34am

Get some entry Level take (as exposed in the SO link above - but why use floats ? [unless you have 1Z numbers on hand]). Obviously delta dictates the N of clusters (contrary to some KMeans).

LINQ_GroupRelativeCloseNumbers.gh (114.6 KB)

To do: In one query … OrderBy things [with various options] AFTER the GroupBy.