IronPython is super slow on sets and dicts with wide spanned integers

Could someone please verify that the following code takes superlong in Rhinos IronPython:

offset = int(1E6) 
array = range(100000) + [n + offset for n in range(100000)]
set(array)

Please note that:

  • the creation of the set is the slow part
  • the size of the set is only 200k, so nothing special
  • the execution time drops dramatically when the inner offset is set from 1E6 to 1E5
  • the same code is rocket fast in cPython and shows absolutely no offset dependency
  • converting the integers into strings before applying set is reasonably fast

Here is a more detailed script that shows the timings for different set sizes and offsets:

import timeit
def _timeit_setup(size, offset):
    return "a = range(%d) + [i + %d for i in range(%d)]" % (size,offset,size)

sizes = [i*10000 for i in range(1,10)]
offsets = [int(10**i) for i in range(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

The last step takes more than 45s (!) on my computer. That is really annoying.

I noticed the same issue for integer dictionary keys. So, there is obviously something odd with the (Rhino?)IronPython implementation of integer hashables.

A possible solution for that would be to use System.Collection.Generic.HashSet instead of the native set implementation. Unfortunately, the interface is pretty different and not as ‘mighty’ (e.g. set.intersection_update …) as the set interface. Any other ideas?

1 Like

I believe IronPython uses a different hashing algorithm compared to CPython. I agree that It is painfully slow. The easiest solution would be to make a 2ndary plugin in C++/C# to run specific functions which are extremely slow.

Hi Trademacher
IronPython can be slow at times.
I think there are two ways to improve your code.
1: you can use xrange instead of range .xrange
2: Replace list()+list() with set()|set()

import timeit

def _timeit_setup(size, offset):
    return "a = set(range(%d)) | set(i + %d for i in xrange(%d))" % (size,offset,size)

sizes = [i*10000 for i in xrange(1,10)]
offsets = [int(10**i) for i in xrange(7)]

for size in sizes:
    for offset in offsets:
        setup = _timeit_setup(size, offset)
        time = timeit.timeit("set(a)", setup=setup, number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

Thanks for your comments.

Unfortunately, I think it’s not the same thing. If I’m not mistaking the setup-argument code in timeit is excluded from the overall timed execution. So, in your example the critical set-creation is done outside of the timed scope.
Actually, you need to run:

for size in sizes:
    for offset in offsets:
        timedCode = "set(range(%d)) | set(i + %d for i in xrange(%d))" % (size, offset, size)
        time = timeit.timeit(timedCode,  number=10)
        print("size=%d, offset=%d: %.6fs" % (size, offset, time))

which results in more or less the same timings.

Interestingly, your code shows that

set(myUglySet) 

seem to work fine again.

I had to terminate Rhino via Task Manager before your code finished. So yes I can indeed verify that it takes super long.

First of all, do you really need to hash integers? If you need to store a collection of unique numbers, isn’t computational effort better spent maintaining a sorted list, rather than computing a hash for every single number?

Otherwise, I think you’re on the right lines trying to pick a better concrete implementation for your needs than IronPython’s one size fits all. I was trying to find out how sets are implemented in Iron Python and found the old module sets. I couldn’t find a HashSet in Rhino but there is a SortedSet. This, sets.Set, (and a dict with all values None) were all much faster for me (with much smaller test data):

slow_set_of_int.gh (15.9 KB)

Hi James,

thanks for your response. And yes I/we need to work with sets and dicts because we are using labels (that get stored in RhinoGeometry-objects using some tricks) to identify (Finite)Elements.

You can load the HashSets like this:

import clr
import System
clr.AddReference("System.Core")
from System.Collections.Generic import HashSet

This seems to work fine. But I’m not sure about yet.

Ah brilliant - thanks.

By the way, I looked a bit more closely at the image I posted - the SortedSet actually seems to speed up by a factor of 20 as the offset size increases. So if it doesn’t spend too long on an initial sort, it’s perfect for your application. I tried the tests in reverse order to rule out cacheing, and picked a wider range of offsets and it confirms it:

fast_sorted_sets

set is quirky in Iron Python (I actually reported a bug in set comprehensions). But its .Net backend does have its uses.

We use sets very often. And large parts of our code framework are used in cPython and ironPython. By now, it seems that i’ll end up in subclassing the HashSet-Classes and giving them the same interface as the python-set. Than we can easily switch between both implementations. Whew! That’s a long way to go.

You know what’s best for your application. It’s old code, but there’s an ‘ABC’ interface here called BaseSet might have done a lot of that for you:

2 Likes

Awesome! Thanks!
That’s really helpful to get the interface right.