Finding similar branches in a data tree


I have a data tree with multiple branches, where each branch contains a list of numbers. I would like to identify which of the branches are similar. Two branches are similar if one of them shares at least 50% (or any other arbitrary percentage) of its numbers with the other branch. The lists can have different lengths, the order of the numbers does not matter, and each list contains only unique numbers. For example, say we have three branches with the following numbers:

  1. 1, 2, 3
  2. 2, 5, 4
  3. 3, 4, 1

Branches 1 and 3 would be similar, as they both contain the numbers 1 and 3 (two of their three numbers). Branch 2 is not similar to any other branch, as it shares only one number with each of the other two branches.

If anyone has a way of implementing this, then I would be very grateful. Sorry for the shitty branch numbers.
Similar (4.0 KB)


Here is a C# script that performs this kind of analysis.
Two lists are considered similar if the ratio (number of common values)/(count of the longer list) is greater than or equal to the similarity coefficient.
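
To make the ratio concrete, here is a minimal Python sketch of the test described above, applied to the three branches from the question. This is not the C# script from the attachment; the names `similarity`, `is_similar` and `coefficient` are made up for illustration.

```python
def similarity(a, b):
    """Ratio of shared values to the size of the longer list."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty branch is similar to nothing
    common = len(set_a & set_b)
    # float() keeps the division exact under IronPython 2 as well
    return float(common) / max(len(set_a), len(set_b))


def is_similar(a, b, coefficient=0.5):
    """True when the ratio reaches the similarity coefficient."""
    return similarity(a, b) >= coefficient


branches = [[1, 2, 3], [2, 5, 4], [3, 4, 1]]
print(is_similar(branches[0], branches[2]))  # shares 1 and 3: 2/3
print(is_similar(branches[0], branches[1]))  # shares only 2: 1/3
print(is_similar(branches[1], branches[2]))  # shares only 4: 1/3
```

Dividing by the longer list is the stricter reading of the question: a short list that is fully contained in a much longer one would not count as similar. Dividing by the shorter list instead would accept that case.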

My first approach was to loop twice and compare each new candidate to all the lists already stored in a “similar” set. But with 50% similarity, all the lists ended up in the same set.
So this can be disabled with the compareToAll input, in which case candidates are only compared to the first list of each set.
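
The chaining effect can be reproduced with a small Python sketch. It is only a guess at the logic, not the C# script itself: `group_branches` and `compare_to_all` are hypothetical names that mirror the compareToAll input, and a candidate is assumed to join the first set in which it matches any of the lists it is compared to.

```python
def similarity(a, b):
    """Ratio of shared values to the size of the longer list."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return float(len(set_a & set_b)) / max(len(set_a), len(set_b))


def group_branches(branches, coefficient=0.5, compare_to_all=True):
    """Greedy grouping; each group is a list of branch indices."""
    groups = []
    for i, branch in enumerate(branches):
        for group in groups:
            # Compare against every member, or only the first one.
            members = group if compare_to_all else group[:1]
            if any(similarity(branch, branches[j]) >= coefficient
                   for j in members):
                group.append(i)
                break
        else:
            groups.append([i])  # no set accepted it: start a new one
    return groups


# Each list overlaps 50% with its neighbour, but the first and the
# last list have nothing in common.
chain = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
print(group_branches(chain, compare_to_all=True))   # [[0, 1, 2]]
print(group_branches(chain, compare_to_all=False))  # [[0, 1], [2]]
```

With `compare_to_all=True` the third list gets pulled in through the second one, even though it shares nothing with the first. That is how a low threshold can merge everything into a single set.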

With your dataset it gives three groups. (13.2 KB)

Edit: I just realized it is better to include a new candidate in a set only if it is similar to ALL the lists already in that set, not just one of them. This gives 4 groups. I also added a check with Text Join. (20.3 KB)
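
As a rough Python sketch of that stricter rule (again not the attached C# script; `group_strict` is a made-up name), a candidate joins a set only when it reaches the coefficient against every list already in it:

```python
def similarity(a, b):
    """Ratio of shared values to the size of the longer list."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return float(len(set_a & set_b)) / max(len(set_a), len(set_b))


def group_strict(branches, coefficient=0.5):
    """Greedy grouping where a candidate must match ALL members."""
    groups = []  # each group is a list of branch indices
    for i, branch in enumerate(branches):
        for group in groups:
            if all(similarity(branch, branches[j]) >= coefficient
                   for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # rejected everywhere: new set
    return groups


# The three branches from the question.
print(group_strict([[1, 2, 3], [2, 5, 4], [3, 4, 1]]))  # [[0, 2], [1]]

# A chain of 50% overlaps is no longer merged into one set.
print(group_strict([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]))  # [[0, 1], [2]]
```

Because the pass is greedy, the result can depend on the order of the branches: a list goes into the first set that accepts it, even if a later set would also fit.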


Hi, magicteddy. Your gh file is the original file, without the C# script. Would you mind re-uploading it?

Thanks, but as Tao Lin pointed out, you uploaded the original file :smiley:

Of course it is, I wrote the script in another file :man_facepalming:
It’s updated in my post above.