Python statistics get slower and slower

This simple Python script gets significantly slower each time the input wire is reconnected (or the data changes?), by 0.2 to 0.3 seconds each time! It is only one line using the ‘statistics’ module:

import statistics

a = statistics.variance(x)

statistics_2023Feb25a.gh (10.2 KB)

Oddly, the Profiler time seems to be retained from the last time the GH file was saved. After rebooting Windows, the last Profiler time is shown at first, but when the input wire is reconnected, the time drops to less than one second, then increases again with each reconnection.

This value would be very useful if not for the degrading Profiler times. Why does this happen?

Is there a C# alternative that doesn’t suffer the same behavior?

I would look for a native GH component in Math (the same menu section where the GhPython component itself lives). Otherwise, the variance formulae are not difficult to implement manually: Variance - Wikipedia

For performance I’d try these:

https://numerics.mathdotnet.com/DescriptiveStatistics.html

statistics is not actually included in IronPython 2.7. Has McNeel included it in Rhino?

statistics is an unusual pure-Python CPython 3.4 module: “aimed at the level of graphing and scientific calculators” and “not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages”

Looking at the source, it goes out of its way to support both Fraction and Decimal, so it will never be as fast as math or any core library written in C.
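To illustrate the Fraction/Decimal point, here is a small sketch meant for plain CPython 3.4+ (not the GhPython component). The sample data is made up for the example; the only library calls are statistics.variance, Fraction and Decimal from the standard library.

```python
import statistics
from decimal import Decimal
from fractions import Fraction

# Made-up sample data: mean is 5, sum of squared differences is 32, n - 1 is 7
data = [2, 4, 4, 4, 5, 5, 7, 9]

# The same numbers as floats, exact fractions and decimals
as_floats = [float(v) for v in data]
as_fractions = [Fraction(v) for v in data]
as_decimals = [Decimal(v) for v in data]

# statistics.variance keeps the input type for the result, which is part of
# the type-handling work the module does on every call
var_float = statistics.variance(as_floats)
var_fraction = statistics.variance(as_fractions)
var_decimal = statistics.variance(as_decimals)

print(type(var_float).__name__, var_float)
print(type(var_fraction).__name__, var_fraction)
print(type(var_decimal).__name__, var_decimal)
```

The Fraction result is exact (32/7), which is nice for a calculator-style module but is not something a float-only loop ever has to pay for.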

Yeah, I looked there first, of course. The mystery here is why the Python gets slower and slower?

I worked around it by other means. (Sketchy purple group in this post.)

The mystery is why running a CPython module in IronPython doesn’t raise an error. Getting slower and slower is not unexpected either, though.

I have no clue about CPython vs. IronPython. Not interested in “inside baseball”.

Pure-python modules are not CPython modules.

But indeed statistics is bundled with Rhino.

@Joseph_Oster Why this increasing slowness happens is a mystery indeed. My best guess is that something in the statistics module and/or its dependencies causes IronPython to do weird things, maybe in combination with how the tree is automagically chopped into its branches for list access (lots of New Implicit Grasshopper Cycles messages in the script window output pane). But instead of wasting time digging into IronPython or the statistics module and its dependencies:

Here is a simple Python implementation that should work just fine:

statistics_2023Feb25a.gh (14.0 KB)

Thank you very much, that’s way, WAY FASTER! Very nice. I can just barely understand it but am too much of a dilettante at Python to ever write it myself.

# two-pass variance, from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm
# see example explanation also at https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Example
# regarding floating point issues

s = sum(x)
m = float(s) / len(x)  # float() so integer inputs are not truncated under IronPython 2.7

variance = sum([(i - m)*(i - m) for i in x]) / ((len(x) - 1) * 1.0)

a = variance

:+1:

Let me rewrite it a bit so it is hopefully a bit clearer:

n = len(x)
the_sum = sum(x)
# float() keeps integer inputs from truncating the mean under IronPython 2.7
mean = float(the_sum) / n

# use a list comprehension to get all the squares of the difference from the mean
squares_of_difference_from_mean = [(xi - mean) * (xi - mean) for xi in x]

# I call the following just "sum of squares", but as shown above
# it is the sum of the squares of the difference from the mean
sum_of_squares = sum(squares_of_difference_from_mean)

# using 1.0 here to ensure we get a float result in n_1,
# which itself stands for n - 1
n_1 = n - 1.0

# Finally the variance in the two-pass variance algorithm
variance = sum_of_squares / n_1
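If it helps, the same two-pass steps can be wrapped in a function. This is only a sketch: the name sample_variance and the ValueError guard are my own additions, not part of the attached file, and the guard exists because dividing by n - 1 breaks down with fewer than two values.

```python
def sample_variance(values):
    """Two-pass sample variance (divides by n - 1). Hypothetical helper."""
    # Force floats up front so integer inputs cannot trigger integer
    # division under Python 2 style semantics (IronPython 2.7)
    values = [float(v) for v in values]
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")

    # First pass: the mean
    mean = sum(values) / n

    # Second pass: sum of squared differences from the mean
    sum_of_squares = sum((v - mean) * (v - mean) for v in values)

    return sum_of_squares / (n - 1)


# Made-up sample data; inside the GhPython component this would be
# a = sample_variance(x)
example = [2, 4, 4, 4, 5, 5, 7, 9]
result = sample_variance(example)
print(result)
```

Converting to float once at the top also means the mean and the final division need no extra 1.0 tricks.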

Anyway, good that this is helpful :slight_smile:

Thank you again, it was clear enough, though Python has some peculiar ways of writing “for” loops after the fact, among other things. I haven’t translated this style of notation to code in almost fifty years, though it says the same thing as both Python examples you posted:

[image: variance formula]

From the Wikipedia page you referenced.
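For anyone reading along, the two-pass sample variance that both Python versions compute can be written in LaTeX as follows. This is my own transcription of the standard formula, with n the number of values and x_i the individual values:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
s^2 = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n - 1}
```

In the rewritten script, \bar{x} is mean, the numerator is sum_of_squares and the denominator is n_1. As a quick check by hand, the values 2, 4, 4, 4, 5, 5, 7, 9 have mean 5 and squared differences summing to 32, so s^2 = 32/7, roughly 4.571.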

Very useful, thanks! A good candidate for a standard GH math component? :wink: