IronPython is super slow on sets and dicts with wide spanned integers

trademacher · March 14, 2023, 11:34am

Could someone please verify that the following code takes superlong in Rhinos IronPython:

offset = int(1E6) 
array = range(100000) + [n + offset for n in range(100000)]
set(array)

Please note that:

the creation of the set is the slow part
the size of the set is only 200k, so nothing special
the execution time drops dramatically when the inner offset is set from 1E6 to 1E5
the same code is rocket fast in cPython and shows absolutely no offset dependency
converting the integers into strings before applying set is reasonably fast

Here is a more detailed script that shows the timings for different set sizes and offsets:

import timeit
def _timeit_setup(size, offset):
    return "a = range(%d) + [i + %d for i in range(%d)]" % (size,offset,size)

sizes = [i*10000 for i in range(1,10)]
offsets = [int(10**i) for i in range(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

The last step takes more than 45s (!) on my computer. That is really annoying.

I noticed the same issue for integer dictionary keys. So, there is obviously something odd with the (Rhino?)IronPython implementation of integer hashables.

A possible solution for that would be to use System.Collection.Generic.HashSet instead of the native set implementation. Unfortunately, the interface is pretty different and not as ‘mighty’ (e.g. set.intersection_update …) as the set interface. Any other ideas?

farouk.serragedine · March 14, 2023, 11:44am

Trademacher:

Could someone please verify that the following code takes superlong in Rhinos IronPython:
offset = int(1E6) 
array = range(100000) + [n + offset for n in range(100000)]
set(array)
Please note that:

the creation of the set is the slow part

the size of the set is only 200k, so nothing special

the execution time drops dramatically when the inner offset is set from 1E6 to 1E5

the same code is rocket fast in cPython and shows absolutely no offset dependency

converting the integers into strings before applying set is reasonably fast

Here is a more detailed script that shows the timings for different set sizes and offsets:
import timeit
def _timeit_setup(size, offset):
    return "a = range(%d) + [i + %d for i in range(%d)]" % (size,offset,size)

sizes = [i*10000 for i in range(1,10)]
offsets = [int(10**i) for i in range(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))
The last step takes more than 45s (!) on my computer. That is really annoying.

I noticed the same issue for integer dictionary keys. So, there is obviously something odd with the (Rhino?)IronPython implementation of integer hashables.

A possible solution for that would be to use System.Collection.Generic.HashSet instead of the native set implementation. Unfortunately, the interface is pretty different and not as ‘mighty’ (e.g. set.intersection_update …) as the set interface. Any other ideas?

I believe IronPython uses a different hashing algorithm compared to CPython. I agree that It is painfully slow. The easiest solution would be to make a 2ndary plugin in C++/C# to run specific functions which are extremely slow.

603419608 · March 14, 2023, 1:40pm

Trademacher:

import timeit
def _timeit_setup(size, offset):
    return "a = range(%d) + [i + %d for i in range(%d)]" % (size,offset,size)

sizes = [i*10000 for i in range(1,10)]
offsets = [int(10**i) for i in range(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

Hi Trademacher
IronPython can be slow at times.
I think there are two ways to improve your code.
1： you can use xrange instead of range .xrange
2： Replace list()+list() with set()|set()

import timeit

def _timeit_setup(size, offset):
    return "a = set(range(%d)) | set(i + %d for i in xrange(%d))" % (size,offset,size)

sizes = [i*10000 for i in xrange(1,10)]
offsets = [int(10**i) for i in xrange(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

trademacher · March 14, 2023, 2:10pm

Thanks for your comments.

Unfortunately, I think it’s not the same thing. If I’m not mistaking the setup-argument code in timeit is excluded from the overall timed execution. So, in your example the critical set-creation is done outside of the timed scope.
Actually, you need to run:

for size in sizes:
    for offset in offsets:
        timedCode = "set(range(%d)) | set(i + %d for i in xrange(%d))" % (size, offset, size)
        time = timeit.timeit(timedCode,  number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

which results in more or less the same timings.

Interestingly, your code shows that

set(myUglySet)

seem to work fine again.

James_Parrott · March 14, 2023, 2:13pm

I had to terminate Rhino via Task Manager before your code finished. So yes I can indeed verify that it takes super long.

First of all, do you really need to hash integers? If you need to store a collection of unique numbers, isn’t computational effort better spent maintaining a sorted list, rather than computing a hash for every single number?

Otherwise, I think you’re on the right lines trying to pick a better concrete implementation for your needs than IronPython’s one size fits all. I was trying to find out how sets are implemented in Iron Python and found the old module sets. I couldn’t find a HashSet in Rhino but there is a SortedSet. This, sets.Set, (and a dict with all values None) were all much faster for me (with much smaller test data):

slow_set_of_int.gh (15.9 KB)

trademacher · March 14, 2023, 2:24pm

Hi James,

thanks for your response. And yes I/we need to work with sets and dicts because we are using labels (that get stored in RhinoGeometry-objects using some tricks) to identify (Finite)Elements.

You can load the HashSets like this:

import clr
import System
clr.AddReference("System.Core")
from System.Collections.Generic import HashSet

This seems to work fine. But I’m not sure about yet.

James_Parrott · March 14, 2023, 2:31pm

Ah brilliant - thanks.

By the way, I looked a bit more closely at the image I posted - the SortedSet actually seems to speed up by a factor of 20 as the offset size increases. So if it doesn’t spend too long on an initial sort, it’s perfect for your application. I tried the tests in reverse order to rule out cacheing, and picked a wider range of offsets and it confirms it:

fast_sorted_sets

set is quirky in Iron Python (I actually reported a bug in set comprehensions). But its .Net backend does have its uses.

trademacher · March 14, 2023, 2:45pm

We use sets very often. And large parts of our code framework are used in cPython and ironPython. By now, it seems that i’ll end up in subclassing the HashSet-Classes and giving them the same interface as the python-set. Than we can easily switch between both implementations. Whew! That’s a long way to go.

James_Parrott · March 14, 2023, 3:13pm

You know what’s best for your application. It’s old code, but there’s an ‘ABC’ interface here called BaseSet might have done a lot of that for you:

github.com

IronLanguages/ironpython2/blob/master/Src/StdLib/Lib/sets.py

"""Classes to represent arbitrary sets (including sets of sets).

This module implements sets using dictionaries whose values are
ignored.  The usual operations (union, intersection, deletion, etc.)
are provided as both methods and operators.

Important: sets are not sequences!  While they support 'x in s',
'len(s)', and 'for x in s', none of those operations are unique for
sequences; for example, mappings support all three as well.  The
characteristic operation for sequences is subscripting with small
integers: s[i], for i in range(len(s)).  Sets don't support
subscripting at all.  Also, sequences allow multiple occurrences and
their elements have a definite order; sets on the other hand don't
record multiple occurrences and don't remember the order of element
insertion (which is why they don't support s[i]).

The following classes are provided:

BaseSet -- All the operations common to both mutable and immutable
    sets. This is an abstract class, not meant to be directly

This file has been truncated. show original

trademacher · March 14, 2023, 3:25pm

Awesome! Thanks!
That’s really helpful to get the interface right.

Topic		Replies	Views
Set type broken? Scripting	1	688	December 27, 2013
Is there a Python dictionary without values? Rhino Developer developer , python	25	11846	April 21, 2019
How to speed up ghpython codes? Scripting grasshopper , python , rhino6	16	1765	January 28, 2019
Exception from hashlib.py, OS X 10.10.5, Python 2.7.10. (was: Which Python Standard Library should Rhino's IronPython use?) Scripting mac , python	27	5249	September 23, 2015
IronPython stopped loading in 32bit Rhino for Windows	3	1533	April 18, 2014

IronPython is super slow on sets and dicts with wide spanned integers

Related topics