Issues with multithreading a function

Hi. Is it possible to multithread a function that takes a couple of lists as input, where each of those lists has a few thousand items (8760 in this case)?

Here is an example function:

def mainFunction(years, months, days, hours):
    # lists: years, months, days, hours contain 8760 values each
    L1 = []; L2 = []; L3 = []; L4 = []; L5 = []; L6 = []; L7 = []; L8 = []
    
    for i in xrange(8760):
        v1_0, v1_1, v1_2 = subFunction1(years[i], months[i], days[i], hours[i])
        v2_0, v2_1, v2_2, v2_3, v2_4 = subFunction2(v1_0, v1_1, v1_2)  # five return values; only the first two get appended below
        v3_0, v3_1, v3_2 = subFunction3(v2_0, v2_1, v2_2, v2_3, v2_4)

        L1.append(v1_0); L2.append(v1_1); L3.append(v1_2); L4.append(v2_0); L5.append(v2_1); L6.append(v3_0); L7.append(v3_1); L8.append(v3_2)
    
    return L1, L2, L3, L4, L5, L6, L7, L8

So basically the mainFunction takes four lists as input (years, months, days, hours), each of them containing 8760 values.
Inside the mainFunction, I call subFunction1, subFunction2, and subFunction3 and append the values they return to the lists L1, L2 ... L8. At the end, the mainFunction returns all of these lists.

I tried making a parallel function inside the mainFunction, and calling it for each integer from 0 to 8759:

def mainFunction(years, months, days, hours):
    # lists: years, months, days, hours contain 8760 values each
    L1 = []; L2 = []; L3 = []; L4 = []; L5 = []; L6 = []; L7 = []; L8 = []
    
    def parallelFunction(i):
        v1_0, v1_1, v1_2 = subFunction1(years[i], months[i], days[i], hours[i])
        v2_0, v2_1, v2_2, v2_3, v2_4 = subFunction2(v1_0, v1_1, v1_2)  # five return values; only the first two get appended below
        v3_0, v3_1, v3_2 = subFunction3(v2_0, v2_1, v2_2, v2_3, v2_4)
        
        L1.append(v1_0); L2.append(v1_1); L3.append(v1_2); L4.append(v2_0); L5.append(v2_1); L6.append(v3_0); L7.append(v3_1); L8.append(v3_2)
    
    ghpythonlib.parallel.run(parallelFunction, xrange(8760), False)
    #or System.Threading.Tasks.Parallel.ForEach(xrange(8760), parallelFunction)
    
    return L1, L2, L3, L4, L5, L6, L7, L8

However, the final run time was equal to, or a little higher than, that of the initial non-threaded function above.

I also tried packing the variables of the initial input lists, and then unpacking them inside the parallel function, as suggested in Steve’s GitHub example:

def mainFunction(years, months, days, hours):
    # lists: years, months, days, hours contain 8760 values each
    L1 = []; L2 = []; L3 = []; L4 = []; L5 = []; L6 = []; L7 = []; L8 = []
    
    def parallelPVWatts(subList):
        year, month, day, hour = subList
        
        v1_0, v1_1, v1_2 = subFunction1(year, month, day, hour)
        v2_0, v2_1, v2_2, v2_3, v2_4 = subFunction2(v1_0, v1_1, v1_2)  # five return values; only the first two get appended below
        v3_0, v3_1, v3_2 = subFunction3(v2_0, v2_1, v2_2, v2_3, v2_4)
        
        L1.append(v1_0); L2.append(v1_1); L3.append(v1_2); L4.append(v2_0); L5.append(v2_1); L6.append(v3_0); L7.append(v3_1); L8.append(v3_2)
    
    listOfLists = [[years[i], months[i], days[i], hours[i]] for i in xrange(8760)]

    ghpythonlib.parallel.run(parallelPVWatts, listOfLists, False)
    #or System.Threading.Tasks.Parallel.ForEach(listOfLists, parallelPVWatts)
    
    return L1, L2, L3, L4, L5, L6, L7, L8

This did not decrease the run time either.

Can it be said that multithreading (with the aim of decreasing the run time) cannot be applied in cases like this, where there are a couple of lists with thousands of items in them?

By the way, I did not post subFunction1, subFunction2, and subFunction3 because they are quite long. Generating the years, months, days, and hours lists also requires additional functions.

I would be very grateful for any kind of reply.

Multithreading also has overhead associated with it, and running multiple threads may even be slower. I have seen several examples where single-threaded was as fast as, or even faster than, multithreaded.
This has to do with details of CPU architecture, cache access, memory access, etc.
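
For instance, here is a quick way to observe that overhead inside a GhPython component (a sketch; tinyTask is a made-up, deliberately trivial function, so thread management dominates the cost):

import time
import ghpythonlib.parallel

def tinyTask(x):
    # deliberately cheap work: spawning and scheduling threads
    # costs more than the work itself
    return x * 0.5 + 1.0

items = range(8760)

t0 = time.clock()
serialResults = [tinyTask(x) for x in items]
t1 = time.clock()
parallelResults = ghpythonlib.parallel.run(tinyTask, items, False)
t2 = time.clock()

print "serial:   %.4f s" % (t1 - t0)
print "parallel: %.4f s" % (t2 - t1)

With work this cheap per item, the parallel version will often be no faster, and sometimes slower.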

Thank you Menno.

I tried running the code on another PC, and again got similar or slightly higher run times than with the regular function. I think this may not be a hardware issue; rather, it is either a problem with my code, or it may simply not be possible to multithread a function like this (one that takes a couple of lists).

@djordje,

it may be worth a try to test your code without using append, if you know how many objects will end up in the list you’re currently appending to. Instead of using append, preallocate a list with the expected number of None items outside the multithreaded worker function, e.g.:

import System.Threading.Tasks as tasks  # Task Parallel Library

results = [None] * len(items)
def MyWorker(w):
    # write each result to its own index; no append, so order is preserved
    results[w] = AnotherFunction(items[w])
tasks.Parallel.ForEach(xrange(len(items)), MyWorker)
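
Applied to the worker from your first post, that pattern might look roughly like this (a sketch; subFunction1, subFunction2, subFunction3 as in your example):

def mainFunction(years, months, days, hours):
    n = len(years)  # 8760
    # preallocate fixed-size result lists; each call writes to its own slot
    L1 = [None] * n; L2 = [None] * n; L3 = [None] * n; L4 = [None] * n
    L5 = [None] * n; L6 = [None] * n; L7 = [None] * n; L8 = [None] * n

    def parallelFunction(i):
        v1_0, v1_1, v1_2 = subFunction1(years[i], months[i], days[i], hours[i])
        v2_0, v2_1, v2_2, v2_3, v2_4 = subFunction2(v1_0, v1_1, v1_2)
        v3_0, v3_1, v3_2 = subFunction3(v2_0, v2_1, v2_2, v2_3, v2_4)
        # index assignment instead of append: each slot is written once, order preserved
        L1[i] = v1_0; L2[i] = v1_1; L3[i] = v1_2; L4[i] = v2_0
        L5[i] = v2_1; L6[i] = v3_0; L7[i] = v3_1; L8[i] = v3_2

    ghpythonlib.parallel.run(parallelFunction, range(n), False)
    return L1, L2, L3, L4, L5, L6, L7, L8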

c.

[quote=“djordje, post:3, topic:34200”]
this may not be an issue with hardware
[/quote]

It’s about how the code interacts with the hardware, as indicated by menno, and by clement (dynamic allocation is always slower than fixed allocation).

Depending on the size and the distribution of the data (and more), you may cause “context switching” on the CPU, and placing locks often turns out to be more costly than streamlining everything into one thread.

There’s a reason why parallel computing hasn’t become widespread even though multiple processors have been around for quite some time.

If you need locks in your code (or if your parallel library places locks for you), you may also want to look into an alternative approach using “ring buffers” (a pattern, or approach to programming, that avoids the need for locks altogether), which can sometimes speed up algorithms several times over. Locks are a performance killer.
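
In essence, the trick looks roughly like this (a toy single-producer/single-consumer sketch, written here in Python to match the thread; a real implementation such as the Disruptor is far more careful about memory ordering and wrap-around):

class RingBuffer(object):
    # fixed-size, preallocated storage; the producer thread only ever
    # advances head, the consumer thread only ever advances tail,
    # so neither needs a lock
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0  # next write index (touched by producer only)
        self.tail = 0  # next read index (touched by consumer only)

    def push(self, item):
        nextHead = (self.head + 1) % self.capacity
        if nextHead == self.tail:
            return False  # buffer full; producer must wait or retry
        self.slots[self.head] = item
        self.head = nextHead
        return True

    def pop(self):
        if self.tail == self.head:
            return None  # buffer empty; consumer must wait or retry
        item = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return item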

// Rolf


Hi @clement,

Thank you for the example code!!
Indeed, the waiting time has decreased by 2 seconds in comparison with the parallel function from the first post!
But it is still on par with the non-threaded function.
Still, thank you for the suggestion; I might use it for some future parallel examples.

Hi @RIL,

How can I check if locks are placed in my code?

Is it possible for you to post a “ring buffer” example of the initial IronPython code above?

Hi djordje,

You would have to look into the source code of the parallel library. I have never used the Python language or its derivatives myself, so I can’t help you there (I’m a Delphi Pascal dude).

But in any case, here’s some info on the philosophy of ring buffers (the Disruptor pattern) at Martin Fowler’s site. In the introduction they present an “unexpected” real-world use case.

The parallel approach isn’t always what you want. Performance is what you want, and in order to achieve it you have to know exactly what it is that chokes your algorithm.

Edit: For the full and more sophisticated approach (Disruptor), read this:
http://lmax-exchange.github.io/disruptor/

This is perhaps something for the McNeel guys to look into as well (if they didn’t already do that), for cases where there’s a need to frighten ol’ Rhino into performing better. :slight_smile:

// Rolf

Thank you Rolf,

Will try reading both articles.

@Steve, @dale,

Would it be possible to hear your comment on this topic please? Thank you.

Have you analyzed the time it takes to execute subFunction1, 2, and 3? Is there one execution of those functions that takes an overwhelming amount of the time?

It’s pretty hard to really give any quality input on this since I don’t have a clue about the characteristics of what is going on in the script.

Hi @Steve,

None of the subFunctions 1, 2, 3 is heavy. In fact, when I print their run times for each integer from 0 to 8759, I get 0.0 for all three of them.
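
Maybe the per-call readings are simply below the timer resolution, though. A sketch of how the totals could be accumulated across the whole loop instead (reusing the subfunctions from my first post):

import time

totals = {"sub1": 0.0, "sub2": 0.0, "sub3": 0.0}

for i in range(8760):
    t = time.clock()
    v1_0, v1_1, v1_2 = subFunction1(years[i], months[i], days[i], hours[i])
    totals["sub1"] += time.clock() - t

    t = time.clock()
    v2_0, v2_1, v2_2, v2_3, v2_4 = subFunction2(v1_0, v1_1, v1_2)
    totals["sub2"] += time.clock() - t

    t = time.clock()
    v3_0, v3_1, v3_2 = subFunction3(v2_0, v2_1, v2_2, v2_3, v2_4)
    totals["sub3"] += time.clock() - t

# accumulated seconds over all 8760 iterations
print totals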

I hope it is not a problem if I attach the code as a Grasshopper definition?
There are four components in the definition: the one on top is the unchanged, non-threaded version.
The three below it correspond to the two examples from my initial post, with the third one created based on Clement’s advice above.

In all four components the mainFunction is actually mainParallel and it starts at line 375.
Function parallelFunction is in fact parallelPVWatts.
Functions subFunctions 1,2,3 are actually: NRELsunPosition, POAirradiance, pvwatts respectively.

Please let me know if there is any other information that I need to provide.

The grasshopper definition requires ghpython plugin to be installed:
multithreading_test.gh (482.8 KB)

Those scripts are pretty complicated and I just don’t have the bandwidth at the moment to completely understand what they are doing. I also don’t have EnergyPlus installed on my computer, so I guess there is no way for me to really run these scripts.

If you are getting very small runtimes for all of your functions, then what sort of overall runtime lengths are you getting?

The NRELsunPosition, POAirradiance and pvwatts functions can be found in the smallest top left component:
They span lines 5240 to 5369 (NRELsunPosition), 5408 to 5544 (POAirradiance), and 5546 to 5618 (pvwatts). They are all simple in the sense that none of them uses any loops or calls other functions. They take a certain number of variables as input and perform one-line mathematical operations on each of them.

You don’t have to have EnergyPlus installed. Only Honeybee components require it; these are Ladybug ones, which do not require any other installation.

Here are two screenshots with component’s runtimes:

In both I am getting similar run times to the regular non-threaded component (the one in the top left corner).

Hi djordje

I looked at your definition, and it seems that there is some restructuring to do; multithreading that function should probably not be the immediate goal. In fact, on my system, the main() function is called 7 times (remember what happens when you graft and have list access: the code has to run for each branch), and each of those calls takes only 0.38s. It’s easier to see this in the newer GhPython in the Rhino WIP (more news about this soon).

0.38s * 7 = 2.66s

The whole component takes 9.1s.

9.1s - 2.66s = 6.44s. That’s 6 seconds outside that part. The larger chunk of time is spent outside the function! Additionally, performing thread spawning in a loop will only make things slower. So, you are spending more time than before by multithreading this.

In this case, I’d suggest that you start by first studying where most of the time is spent (profiling).
Also, be absolutely certain about what is happening in the multithreaded environment! xrange should not be used, for example. Also, you need fixed-size arrays. append() is not guaranteed to happen in the right order, and all your items may be appended in the wrong order, if at all.
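
A quick way to see the append problem for yourself (a minimal repro sketch to run in a GhPython component; the worker here is made up):

import ghpythonlib.parallel

out = []
def worker(i):
    out.append(i)  # each task appends whenever it happens to finish

ghpythonlib.parallel.run(worker, range(100), False)
# out is very likely not [0, 1, 2, ...]; items can even go missing,
# since concurrent appends to a shared list are not guaranteed to be safe
print out[:10]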

Giulio

Giulio Piacentino
for Robert McNeel & Associates
giulio@mcneel.com

Hi Giulio,

Thank you for your reply and detailed explanations!

0.38s * 7 = 2.66s
The whole component takes 9.1s.
9.1s - 2.66s = 6.44s. That’s 6 seconds outside that part. The larger chunk of time is spent outside the function!

I am aware of the heaviness of other parts of the component. Some of them are too complicated for me to even try to multithread; some of them are part of the global library, managed by other Ladybug developers.
Still, I am surprised that the main function accounts for only 29% of the runtime of the whole component; I thought the percentage it takes was a bit higher. So what I could do is try to multithread my personal code from the component, especially the main function, as it uses the simplest code with a single for loop. If I manage to understand how to multithread it, then I can apply this to some other components as well.
Regardless of all of this, I am wondering why the multithreaded mainParallel functions are still not faster than the non-multithreaded main function. I printed the run time of the non-threaded main function and of each threaded mainParallel function:

The run times of the mainParallel functions are even higher than that of the main function.
Why is that so?

Is this the reason:

Additionally, performing thread spawning in a loop will only make things slower. So, you are spending more time than before by multithreading this.

What does this mean? That threading should not be performed if the _PVsurface input has a data tree with more than one branch?
Can you clarify this, please?

I tried inputting a single item to the _PVsurface input. Still, the run times of the threaded functions are higher than that of the non-threaded one:

xrange should not be used, for example.

Understood. But even with range instead, there is no difference. All three screenshots above were recorded after I replaced xrange with range.

Also, you need fixed-size arrays. append() is not guaranteed to happen in the right order, and all your items may be appended in the wrong order, if at all.

Clement suggested this in his reply above (that’s the third, bottom-right component in the multithreading_test.gh file). But I was not aware that without a fixed-size array the items would not be appended to the list in the correct order. Thank you for the clarification on this.
multithreading_test2.gh (494.0 KB)

I do not have a clear recipe for when multithreading makes sense and when not, especially because the multithreaded setup, as you noticed, also has a cost in terms of the different algorithms you can pick (no append, for example). In this particular case, you are multithreading a function that takes 0.38s.

Imagine you need to test-drive a car by doing 8 km in it with a special little device. Except you have 8 cars available and 8 little devices. Getting out of your car, finding 8 drivers (I have 8 different cores here) for each different car, giving them each a little device, and then each one having to do just 1 km, might actually take a lot longer than just doing the 8 km in the single car yourself.

So, multithreading the 0.38s function makes little sense, as you noticed.

What you could try instead is multithreading the entire component, using one thread for each of the ~2s operations that create each branch of the output. I suppose that would give better results. It would be less tricky if the inputs and outputs were a single class, because you could avoid the data tree and the grafting: just processing a plain list of inputs into a plain list of outputs.
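
Schematically, something like this (a sketch; processBranch, branchInputs, and main are placeholders for your per-branch work and data):

import ghpythonlib.parallel

# one plain list of inputs per datatree branch (7 branches in your file)
# branchInputs = [branch0_data, branch1_data, ...]

def processBranch(branchData):
    # the ~2s of work that currently runs once per branch,
    # i.e. one full main() pass over that branch's 8760 items
    return main(branchData)

branchResults = ghpythonlib.parallel.run(processBranch, branchInputs, False)

This way each thread gets a substantial chunk of work, so the threading overhead stays small relative to the work itself.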

I hope this gives some ideas,

Giulio

Giulio Piacentino
for Robert McNeel & Associates
giulio@mcneel.com

Giulio, thank you once again for the detailed and down-to-earth explanations. I appreciate it!

I do not have a clear recipe for when multithreading makes sense and when not, especially because the multithreaded setup, as you noticed, also has a cost in terms of the different algorithms you can pick (no append, for example).

I won’t have issues with avoiding append.

In this particular case, you are multithreading a function that takes 0.38s.
Imagine you need to test-drive a car by doing 8 km in it with a special little device. Except you have 8 cars available and 8 little devices. Getting out of your car, finding 8 drivers (I have 8 different cores here) for each different car, giving them each a little device, and then each one having to do just 1 km, might actually take a lot longer than just doing the 8 km in the single car yourself.

So, multithreading the 0.38s function makes little sense, as you noticed.

Isn’t that the run time on your PC? On mine the run time is higher, as seen in the screenshots above. Isn’t the percentage of the function’s run time relative to the whole component’s run time what matters? On both your PC and mine it’s around 30%.
Here is another test on the mentioned AMD Athlon 64 X2 3000+ (1.6 GHz, 2 cores, 2 threads). The mainParallel and main functions now take up to 1.8 seconds:

The run times of the threaded mainParallel functions are similar to that of the non-threaded main one. It is difficult for me to understand how a CPU which has 2 cores and 2 threads, and therefore 2 cars, cannot cover the distance better than 1 car.

What you could try to do, is multithreading the entire component, using one thread for each of the ~2s operations that creates each branch of the output. I suppose that would give better results. It will be less tricky if the inputs and outputs were a single class because you could avoid the datatree and the grafting: just processing a plain list of inputs into a plain list of outputs.

Would it be too much to ask you to clarify this a bit, please? My apologies if this topic has taken up too much of your time.