Parallel for loop

Hi All,

Part of a script I have been working on requires identifying regions of a mesh that sit at a certain angle. I wrote some Python code that goes through the mesh normals, calculates each normal's angle to the z-axis, appends it to a list, checks whether it is within tolerance of a specified angle and, if so, adds the corresponding vertex to another list:

    import rhinoscriptsyntax as rs

    mesh_obj = rs.coercemesh(Mesh)
    vertices = mesh_obj.Vertices
    normals = mesh_obj.Normals

    Gradient = 23
    tolerance = 1

    Angles = []
    points = []

    for index, normal in enumerate(normals):
        Angle = rs.Angle2(([0,0,0], normals[index]), ([0,0,0], [0,0,1]))[0]  # angle to the z-axis
        Angles.append(Angle)
        if Angle > Gradient - tolerance and Angle < Gradient + tolerance:  # within tolerance of the target angle
            points.append(rs.AddPoint(vertices[index]))

It works as intended; however, for large meshes it takes a long time. It seems like this could be multithreaded somehow to drastically reduce the runtime.

Is this possible in Rhino Python? Is there a faster way to calculate these angles?

Thanks

The fact that you are populating two arrays inside the loop (Angles and points) makes it a lot harder to multi-thread. You need to keep the order intact, which probably means having several threads running consecutive ranges of indices and then recombining the partial results afterwards.

Another solution (probably easier to implement) is to not append values to lists, but to size them ahead of time and then set values by index. That way you don't have to worry about race conditions.
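Schematically, the first option (consecutive ranges recombined afterwards) might look like the following sketch; process_range is a made-up placeholder for the per-range work, and the threading itself is omitted:

    # Sketch only: split the indices into consecutive chunks, one per worker,
    # then stitch the partial results back together in order.
    n = len(normals)
    chunk_size = max(1, n // 4)                       # e.g. 4 workers
    ranges = [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]
    partials = [process_range(lo, hi) for lo, hi in ranges]  # each on its own thread
    angles = [a for part in partials for a in part]   # recombine; order preserved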

I am not too concerned about keeping the order of the points list, as that order does not have much meaning for my purposes. If order does not matter, does the solution become easier?

How could I implement your alternative solution? I know how large the angles list should be, so rather than

Angles=[]

I would have something like

Angles=[0]*len(normals)

Correct?

Where would I go from there?

Sorry, I don't speak Python.

In C# it would look like this:

double[] angles = new double[length];
Point3d[] points = new Point3d[length];

for (int i = 0; i < length; i++)
{
  double angle = Vector3d.VectorAngle(normals[i], Vector3d.ZAxis); // note: VectorAngle returns radians
  angles[i] = angle;
  if (angle > gradient - tolerance && angle < gradient + tolerance)
    points[i] = vertices[i];
  else
    points[i] = Point3d.Unset; // Point3d.Unset marks the non-matching slots
}

But the above obviously hasn't been parallelized yet.


Oh hang on, you're using the RhinoScript Python classes and actually adding points to the Rhino document. That's your performance impact right there. It also means you probably can't multithread this, as adding geometry to the document may not be a thread-safe operation in Rhino.

If you want speed, in this case the solution is to write more efficient code rather than multi-thread inefficient code.

Ah, that seems to speed things up significantly.

I changed the code so that the points are added in one go with rs.AddPoints at the end, rather than inside the loop. It now looks like:

    for index, normal in enumerate(normals):
        Angle = rs.Angle2(([0,0,0], normals[index]), ([0,0,0], [0,0,1]))[0]
        Angles.append(Angle)
        if Angle > Gradient - tolerance and Angle < Gradient + tolerance:
            points.append(vertices[index])

    points = rs.AddPoints(points)  # one document call instead of one per point

Thanks for the help!

Out of interest, how could I go about making the angle calculation parallel (forgetting about the points)? Is this possible with Rhino Python?

Hi wattzie,

Multithreading in the RhinoPython editor is possible, but you will have to install Grasshopper and ghpython first:

1. Download and install the latest Grasshopper version.
2. Download the ghpython.gha file.
3. Fire up Rhino and type "Grasshopper" into the command box.
4. When the Grasshopper window loads, go to File -> Special Folders -> Components Folder.
5. Copy the downloaded ghpython.gha file to this folder, then right click on it -> Properties and click "unblock" if such a button appears. If it does not, you are good to go.
6. Close Grasshopper and Rhino, then start them both again.
7. I am not sure if this has been fixed in the latest release of ghpython, but just in case it hasn't, do this too: double click the Grasshopper canvas and type "Python script". Click the icon at the very bottom of the search box that appears; this adds a ghpython component to your canvas. Open it (double click on it), type print "hello world", and click "Ok".
8. Close the Grasshopper window and open the RhinoPython editor.
Paste the following code:


import rhinoscriptsyntax as rs
import ghpythonlib.parallel
import Rhino
import time

def pickVertices(_vertices, _normals):
    angles = []   # collected for completeness; only the points are returned
    points = []
    for i, normal in enumerate(_normals):
        angle = rs.Angle2(([0,0,0], _normals[i]), ([0,0,0], [0,0,1]))[0]
        angles.append(angle)
        if (angle > gradient - tolerance) and (angle < gradient + tolerance):
            points.append(_vertices[i])
    return points

def pickVerticesParall(_vertNrmls):
    # Called once per (vertex, normal) pair by ghpythonlib.parallel.run,
    # which flattens the per-item results into a single list.
    vertex, normal = _vertNrmls
    points = []
    angle = rs.Angle2(([0,0,0], normal), ([0,0,0], [0,0,1]))[0]
    if (angle > gradient - tolerance) and (angle < gradient + tolerance):
        points.append(vertex)
    return points


meshId = rs.GetObject("pick up your mesh", 32, True)

if meshId:
    parallel = rs.GetString("use parallel?", "no", ["yes", "no"])
    meshObj = rs.coercemesh(meshId)
    vertices = [vertex for vertex in meshObj.Vertices]
    normals = [normal for normal in meshObj.Normals]

    gradient = 23
    tolerance = 1

    if parallel == "yes":
        time1a = time.time()
        vertNrmls = [(vertices[i],normals[i]) for i in range(len(vertices))]
        ptsObj = ghpythonlib.parallel.run(pickVerticesParall, vertNrmls, True)
        time1b = time.time()
        print round(time1b - time1a, 3)
    elif parallel == "no":
        time2a = time.time()
        ptsObj = pickVertices(vertices, normals)
        time2b = time.time()
        print round(time2b - time2a, 3)

try:
    ptIds = rs.AddPoints(ptsObj)
except:
    pass

meshVerticesParallel.py (1.6 KB)

Check both the parallel and non-parallel methods, and the differences in processing time will be printed out.
Bear in mind that an even quicker way is to use RhinoCommon methods directly instead of rhinoscriptsyntax functions. You will get a further speed gain if you replace the rs.Angle2(...) line (line 22 in the attached script) with:

    angle = Rhino.RhinoMath.ToDegrees(Rhino.Geometry.Vector3d.VectorAngle(normal, Rhino.Geometry.Vector3d.ZAxis))  # VectorAngle returns radians

Just a side note. I'm pretty sure that ghpythonlib.parallel is simply a wrapper around one of the System.Threading.Tasks methods (probably this one). So you should be able to use those directly without having to install Grasshopper and the GHPython component. I'm sure Steve can elaborate on this…

That's true, but it is pretty hard to get the syntax correct in Python for using the generic Parallel.For loop. The ghpythonlib.parallel function wraps Parallel.For for simpler access. It also does the pre-allocation work and calls the function on the first item serially, in case the function does some one-time caching that is not thread-safe.
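For what it's worth, here is a minimal sketch of what such a wrapper could look like in IronPython (illustrative only: the names are made up and the real ghpythonlib.parallel source differs):

    import System.Threading.Tasks as tasks

    def parallel_run_sketch(function, data):
        # Pre-allocate the output so each task writes only to its own slot;
        # no appends means no race on the list itself.
        results = [None] * len(data)

        def helper(i):
            results[i] = function(data[i])

        # Run the first item serially, in case the function does some
        # one-time, non-thread-safe caching on first use.
        helper(0)
        tasks.Parallel.ForEach(range(1, len(data)), helper)
        return results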

I also think that making things parallel without examining the true bottlenecks in your code is just adding complication. In this case, the speed bottleneck is in adding individual points to the document using rhinoscriptsyntax (which forces the document to redraw every time a point is added).

Good to know, thanks Steve…

Thanks again @djordje!

The RhinoCommon method speeds things up significantly. It seems to be doing the same thing as rs.Angle2. Is there a similar speed difference between all the rhinoscriptsyntax functions and their RhinoCommon counterparts? It seems like there is a whole other world of scripting methods that I am yet to discover.

You are right @stevebaer, the most significant gains came from optimising the code. I think the ability to do things in parallel will come in handy in the future.

Thanks again for the help

Yes, because all rhinoscriptsyntax functions are Python scripts which use RhinoCommon.
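Schematically (this is illustrative, not the actual rhinoscriptsyntax source), a function like rs.AddPoint boils down to something like:

    import Rhino
    import scriptcontext

    def add_point_schematic(xyz):
        pt = Rhino.Geometry.Point3d(xyz[0], xyz[1], xyz[2])
        id = scriptcontext.doc.Objects.AddPoint(pt)  # document write + undo record
        scriptcontext.doc.Views.Redraw()             # redraw on every single call
        return id

which is why calling it once per vertex in a tight loop costs so much more than using the RhinoCommon geometry classes directly.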

The big difference in performance between using geometry classes directly and using them via the Rhino document is potentially fivefold:

  • When geometry is added to the document, it is first duplicated. This is no big deal for small things like points, but breps and meshes of course take some time to duplicate. The reason we do this is to make sure that nobody but us can directly change the geometry instances in our document, that way madness lies. Conversely, when you get geometry from the document and modify it, it will also first have to be duplicated.
  • If geometry is replaced or deleted, copies of the old geometry are placed in undo/redo records. This too takes additional resources, both memory and processor.
  • Changes to the document are often accompanied by viewport redraws. This impact can be especially large if shading meshes have to be calculated as well.
  • Geometry inside the document doesn’t exist on its own. We also have to construct an attribute class which maintains information about what layer an object is on, what its name is, to which groups it belongs, linetype, arrowheads, printwidth, etc. etc. etc.
  • There are events associated with object addition/removal to and from the document. Other plugins may choose to handle these events and do all sorts of expensive calculations when they happen. This leads to utterly unpredictable performance impacts.

So there are excellent performance reasons to not put your geometry in the document. In addition to performance, not sharing your geometry through the Rhino document is also safer as it prevents every Tom, Dick and Harry from messing with it behind your back.
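To make the contrast concrete, here is a minimal sketch (assuming the mesh id comes from rs.GetObject) that keeps all the per-vertex work on the geometry class and touches the document exactly once at the end:

    import rhinoscriptsyntax as rs
    import Rhino
    import math

    mesh_id = rs.GetObject("Pick a mesh", 32)
    mesh = rs.coercemesh(mesh_id)      # work on the geometry class, off-document

    gradient, tolerance = 23, 1
    z_axis = Rhino.Geometry.Vector3d.ZAxis

    hits = []
    for i, n in enumerate(mesh.Normals):
        # VectorAngle returns radians; convert to degrees for the comparison
        angle = math.degrees(Rhino.Geometry.Vector3d.VectorAngle(
            Rhino.Geometry.Vector3d(n), z_axis))
        if abs(angle - gradient) < tolerance:
            hits.append(Rhino.Geometry.Point3d(mesh.Vertices[i]))

    rs.AddPoints(hits)                 # single document operation at the very end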

@stevebaer,

I modified the parallel.py code for my application but see no speedup with additional threads. Below is a subset of the code that demonstrates the problem. I removed two lines from the parallel.py code:

helper(pieces[0])
pieces = pieces[1:]

because my function has no one-time setup that needs to run serially.

In my actual code, I am coloring the 2.2M vertices of a mesh, so the line using exp(log(...)) is replaced with much longer code that examines the Z of each vertex and computes a color for it. While the function is not the same, the result is: adding threads slowly adds to the time to complete all the calculations. Looking at the Resource Monitor, I can see that when all threads are used, CPU utilization is 100% and the frequency, as shown by Task Manager, is at 100% (4 GHz). So the slowdown is not due to thermal throttling (the CPU is water-cooled and stays at 60 degrees C or less). Running the code in parallel provides no speed benefit but increases the power 18X (my CPU has 18 cores). This is not what I wanted. Here is a view of the Resource Monitor showing 24 to 36 threads being used; the CPU activity continues to increase but the execution time does not improve.

And here is a view showing 1 to 12 threads being used:


The CPU activity increases significantly with more threads, but the same number of computations is completed in about the same time. What in the world are the CPUs finding to do to keep themselves so busy?

Here is the threading code example:

from System import Environment
import System.Threading.Tasks as tasks
from time import time
from math import exp, log

printR = False
threads = Environment.ProcessorCount
print 'Threads available = ',threads

# get_items can be called in parallel.
def get_items(range_list):
	items = []
	for i in xrange(range_list[0], range_list[1]):
		# Placeholder for the real per-vertex work (computing a color from each vertex Z).
		items.append(exp(log(exp(log(i+1.)))))
	return items

def run(function, m):
	global threads, printR
	time1 = time()
	# Initialize pieces for holding range_list used by each thread.
	pieces = []
	# Find number of data items in each list needed in order to make threads number of range_list's.
	n = m // threads
	# Find any remainder.
	r = m % n
	# Find number of items in output results.
	rr = m // n
	# If there is a remainder, reduce m to prevent overrun.
	if r > 0: m -= n
	# Fill data list with range list and add to pieces to be processed.
	last_i = 0
	for i in xrange(n,m+n,n):
		# Make enumeration index k with values of 0,1,2,3,...
		k = i // n - 1
		# Add any remainder onto end of last range_list.
		if k == rr-1: i += r
		# Define range of numbers.
		data = [last_i,i]
		# Save i for start of next range_list.
		last_i = i
		# Add list of enumeration index and range_list to pieces.
		pieces.append([k, data])
	if printR:
		for piece in pieces: print piece
	# Size results output list to hold rr entries.
	results = range(rr)
	if printR: print '    Time to setup pieces and results for parallel execution = ', time() - time1
	# Define helper procedure for feeding data to function and output to results list.
	def helper(piece):
		i, data  = piece
		local_result = function(data)
		results[i] = local_result
	# Execute function in parallel using threads.
	tasks.Parallel.ForEach(pieces, helper)
	# Un-comment the next line and comment-out the line above to execute function sequentially.
	#for piece in pieces: helper(piece)
	if printR: print '    Time to run parallel tasks = ', time() - time1
	return results
	
num_points = 3000000
for num_threads in xrange(1, threads+1):
	threads = num_threads
	time1 = time()
	itemss = run(get_items, num_points)
	print 'Time to load items lists using {} threads = {}'.format(threads, time() - time1)
"""
# Use this code to combine sub-lists into one list.
time1 = time()
all_items = []
for items in itemss:
	for item in items:
		all_items.append(item)
print 'Time to combine {} items lists into one list of {} items using 1 thread = {}'.format(threads, len(all_items), time() - time1)
"""

The output from the above code on my 18 core computer is:

Threads available =  36
Time to load items lists using 1 threads = 2.9946
Time to load items lists using 2 threads = 3.2071
Time to load items lists using 3 threads = 3.3686
Time to load items lists using 4 threads = 3.558
Time to load items lists using 5 threads = 3.663
Time to load items lists using 6 threads = 3.72
Time to load items lists using 7 threads = 3.7716
Time to load items lists using 8 threads = 3.8885
Time to load items lists using 9 threads = 3.924
Time to load items lists using 10 threads = 3.9836
Time to load items lists using 11 threads = 4.0232
Time to load items lists using 12 threads = 4.0669
Time to load items lists using 13 threads = 4.0995
Time to load items lists using 14 threads = 4.1049
Time to load items lists using 15 threads = 4.2463
Time to load items lists using 16 threads = 4.1852
Time to load items lists using 17 threads = 4.2158
Time to load items lists using 18 threads = 4.2758
Time to load items lists using 19 threads = 4.2225
Time to load items lists using 20 threads = 4.3516
Time to load items lists using 21 threads = 4.3416
Time to load items lists using 22 threads = 4.4082
Time to load items lists using 23 threads = 4.4222
Time to load items lists using 24 threads = 4.5297
Time to load items lists using 25 threads = 4.4754
Time to load items lists using 26 threads = 4.5431
Time to load items lists using 27 threads = 4.6073
Time to load items lists using 28 threads = 4.6128
Time to load items lists using 29 threads = 4.7737
Time to load items lists using 30 threads = 4.731
Time to load items lists using 31 threads = 4.82
Time to load items lists using 32 threads = 4.9496
Time to load items lists using 33 threads = 4.9331
Time to load items lists using 34 threads = 5.0725
Time to load items lists using 35 threads = 5.4179
Time to load items lists using 36 threads = 5.1221

I do not believe the slowdown is due to thread overhead, as this code example spends lots of time computing each piece. The compute time can be increased, and the relative thread overhead decreased, by making num_points = 3000000 larger. But the outcome is exactly the same, with a slowly increasing execution time as the number of threads is increased.

Would this work better if I used VB code instead? Is the Python interpreter refusing to allow threads to execute in parallel due to the global interpreter lock (GIL)? I read that this is the case with some Python implementations, but I also read that IronPython has no GIL and can fully exploit multiprocessor systems.

I am trying to develop an app for processing the .obj files from drone maps, to display 3D elevation and slope maps and to support interactive queries for cut & fill applications. I have gone over the code again and again to optimize its performance, developing more efficient algorithms and making extensive use of RhinoCommon calls instead of rhinoscriptsyntax. These improvements have reduced the time of operations by 10X to 100X, but many still require several minutes to complete. This slow performance seriously damages the interactive feel of the application. The performance is particularly bad for high-resolution models with 1 thread, but the potential improvement is over 10X with today's affordable 16-18-core CPUs. My challenge is to get Rhino to harness this capability like you have done for the Raytraced view.

Regards,
Terry.

This doesn't look like it has anything to do with Rhino. If you run this script outside of Rhino through IronPython, do you get similar results? If so, it may be something to ask the IronPython developers about.

@djordje,

When I run your angles code on my 18-core computer, the parallel option takes 82 sec while the non-parallel option runs in only 40 sec. Something is not working. I looked at the Performance Monitor, and it certainly shows all 36 threads 100% busy when running the parallel option, but all this work does not save any time. Could there be a Global Interpreter Lock that is holding back the Python code?

Regards,
Terry.

@stevebaer,

How can I run my real code (not this example, which mimics the same behavior) outside Rhino, since it has lines like:

if contourGeo.Contains(P3d(xc0,yc0,z0), XYplane, 0.01) == PointContainment.Inside:

that are specific to Rhino.

I need something that works inside the Rhino environment to run my 5500 line Python script that makes hundreds of calls to Rhino methods.

Can you point me to an example of a Python script that uses tasks.Parallel.ForEach to provide a significant speedup, which I can run on my computer? I tried djordje's angle example further up in this topic, but it runs over 2X slower with the parallel option. How does this example run on your computer?

Regards,
Terry.

Sorry Terry, I was not referring to your "real" code. You had a sample that didn't use any Rhino calls and appeared to show no benefit from multiple CPUs being utilized. I was wondering if the same results happen outside of the Rhino environment. I don't have time at the moment to run tests, but will try to get to this soon.

@stevebaer,

It would be great to learn of your results from running djordje's angle code and other tests. My computer really speeds up my photogrammetry work in Agisoft PhotoScan, using all 36 threads of my CPU and all 3500+ CUDA cores on my Nvidia GTX 1080 Ti GPU. And Rhino makes great use of the Nvidia GPU to speed up the Raytraced view, achieving even higher utilization of the CUDA cores than PhotoScan; very impressive. So you can see why I am now excited to bring parallel processing to my long-running Python scripts. For me, getting my Python scripts to exploit parallel execution would be the best thing since sliced bread. I have spent hundreds of hours tuning my code to get its execution times down to something almost approaching reasonable. The potential exists to make its operations 2X to 10X faster if I can get the obviously parallelizable portions of the code to actually run in parallel.

Looking forward to your help.

Regards,
Terry.