300X Speedup by Porting Python code to C++ DLL with ctypes interface


#1

@dale, @Helvetosaur
I have a large mesh with 4.4 million faces. It is a 3D model of a map created from 400 drone images. I have colored the mesh using a Python script but it takes time to generate the vertex colors before I can apply the colors using:

meshVertexColors(mesh, colors)

The results look nice:

but can be painfully slow to iterate in order to bring focus on an area of interest:

If I moved generation of the colors array to C-code I think this could be much faster, perhaps less than a second. I have read about using ctypes for interfacing to C-code from Python. My question is, can I use the method you outlined in “Creating your first C/C++ plugin for Rhino” http://developer.rhino3d.com/guides/cpp/your-first-plugin-windows/ to compile my C-code and create a C-code dll library which I can then call from Python using a ctypes interface?

I am worried about starting along the path of using your guide since it is targeted towards creation of a stand-alone C-code plugin. But this seems like overkill if all I need is the dll which I can reference in Python using:

import ctypes
# Load the shared library into c types.
libc = ctypes.CDLL("./myclib.so")

Any guidance you can provide would be much appreciated as I have not done any C-code work in the Rhino environment (or any other environment for that matter). But I have programmed in many other languages and written 100,000+ lines of code. So I am not too worried about the C-code itself, but I am definitely concerned about getting the C-code compiled and linked into a dll library which I can call from Python.

I also have many other map types that could benefit from using a fast C-code function so learning how to do this would be of significant benefit. My Drone Maps Python script is now 7300+ lines, works quite well and has a nice GUI:


But it is painfully slow for some operations.

I have already exploited the parallel code I lifted from Grasshopper but this provides no more than 2-4X speed up. I need more like 10-100X speedup which hopefully C-code can provide.

Regards,
Terry.


#2

C# & Alea (GPU) can give you 100X speedup. CPU only probably not.

http://www.aleagpu.com/release/3_0_4/doc/gallery.html

Edit: … and Hybridizer, which supports generics as well:

// Rolf


#3

Hi Terry, we implemented ShapeOp (a “header-only C++ library for static and dynamic geometry processing”) in GHPython using ctypes. It was a bit crashy last time I had it running, but perhaps some of the implementation detail might be useful (it was stable upon release):

Edit: This code in particular:


#4

Thanks for your nice example.

After much trial and error, I was able to create a simple x64 DLL with Microsoft Visual Studio 2017, load it into my Python program, and get it to run.

I get a 13X speedup using a DLL to add 1,000,000 numbers. This grows to 72X for 100,000,000 numbers. This is a very nice speedup. Now I will try coding some of my slower Python functions in a C++ DLL.
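
For readers who want to see the calling pattern behind a test like this, here is a minimal sketch. The DLL from this post is not shown in the thread, so the function name `sum_array` and its C signature `double sum_array(const double *values, int n)` are assumptions. To keep the sketch runnable without a compiled library, a Python callback wrapped with `ctypes.CFUNCTYPE` stands in for the exported function; with a real DLL you would take the function from `ct.CDLL(path)` and give it the same `argtypes` and `restype`.

```python
import ctypes as ct

# Assumed C signature (hypothetical): double sum_array(const double *values, int n)
SUM_PROTO = ct.CFUNCTYPE(ct.c_double, ct.POINTER(ct.c_double), ct.c_int)

def _sum_impl(values, n):
    # Stand-in body; a real DLL would run this loop in compiled code.
    total = 0.0
    for i in range(n):
        total += values[i]
    return total

# Wrapping the Python function gives an object that is called like a DLL export.
sum_array = SUM_PROTO(_sum_impl)

numbers = [0.5 * i for i in range(1000)]
# Pack the Python list into a contiguous C array of doubles.
c_numbers = (ct.c_double * len(numbers))(*numbers)
# The array is passed directly; ctypes converts it to a double* for the call.
result = sum_array(c_numbers, len(numbers))
```

The stand-in gives no speedup, of course. It only shows the marshalling step (list to C array, typed call, float result) that stays the same once the loop lives in a DLL.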

Regards,
Terry.


#5

@AndersDeleuran @RIL,

I ported my Python code for creating a slope map into a C++ DLL. The execution time decreased from 6 sec in Python to only 0.02 sec with the C++ DLL. This means the C++ DLL is 300X faster than Python. So I am very happy I learned how to do this. Being a retired Intel engineer I am not surprised at this result; the Intel CPU I helped design runs at 250 ps per operation (4 GHz frequency) while the Python interpreter runs more like 75 ns per operation at best. Thus using a C++ DLL can provide significant speedup if you are replacing pure Python code with no Rhinoscript calls.
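
The arithmetic behind that comparison can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; treating one clock cycle as one operation is the post's simplification, not a measurement.

```python
# Figures quoted in the post above.
clock_hz = 4 * 10**9                 # 4 GHz CPU clock
python_op_ps = 75 * 1000             # 75 ns per Python operation, in picoseconds

# One clock cycle in picoseconds: 10^12 ps per second / cycles per second.
cycle_ps = 10**12 // clock_hz

# Ratio of the two per-operation times.
estimated_speedup = python_op_ps // cycle_ps

# Measured slope-map timings from the same post, in milliseconds.
python_ms, dll_ms = 6000, 20
measured_speedup = python_ms // dll_ms
```

Both ratios come out at 300, which is why the measured result matches the hardware estimate so closely.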

For code with significant Rhinoscript calls, I have found that the System.Threading.Tasks code (as used in the Grasshopper parallel code) can provide up to 6X speedup, for example when generating contours for a mesh. In this case the Python script spends most of its time inside Rhino.Geometry.Mesh.CreateContourCurves, which runs outside the Python interpreter. Because the interpreter is idle much of the time, many calls to CreateContourCurves can be launched in parallel.
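
The idea can be illustrated outside Rhino with a rough analog. This sketch assumes CPython 3 and uses `concurrent.futures` rather than System.Threading.Tasks, and `contour_at` is a hypothetical stand-in for a slow native call; it is not the Rhino API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def contour_at(z):
    # Hypothetical stand-in for a slow native call such as CreateContourCurves.
    # time.sleep leaves the interpreter free while it waits.
    time.sleep(0.01)
    return ("contour", z)

levels = [0.5 * k for k in range(20)]

# Serial: the interpreter waits for each call in turn.
serial = [contour_at(z) for z in levels]

# Parallel: several calls are in flight at once; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(contour_at, levels))
```

The results are identical in both cases; only the waiting overlaps. The gain depends on how much of each call is spent outside the interpreter.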

So the combination of C++ DLLs and System.Threading.Tasks has provided a tremendous improvement to my drone maps software for generating:
Elevation Maps
Slope Maps
Contour Maps
Automatic Identification of Trees
2D Profiles of Terrain
3D Area of Ditches
3D Volume of Piles
Trimming Mesh to Any Curve without any crashes or non-results
These operations now take just a few seconds on meshes with up to 4.4M faces.

@AndersDeleuran what kind of speedup did you see when you ported your solver functions to C++ DLLs? Also, is there a way to call Rhinoscript functions from the C++ DLL? Can you provide details for this?

This is the Python code:

#
# Setup code common to either Python or C++DLL.
#
vertices = meshGeo.Vertices
# Flatten vertices to x,y,z list for use below in constructing flat ctypes float array.
fvertices = vertices.ToFloatArray() # This also speeds up accessing the vertices below.
maxX, maxY = self.data.maxX, self.data.maxY # These are the maximum X and Y of the mesh computed elsewhere.
Xgs = 1./Xbox_size  # Xbox_size is 2' for my 4.4M faces mesh.
gsx = int(round(Xgs*maxX))
gsy = int(round(Xgs*maxY))
bins = (gsx+1)*(gsy+1)
count = vertices.Count
cnormals = (ct.c_float * (3*count))()
vvertices = (ct.c_float * (3*count))()
slopePts = (ct.c_float * count)()
bin_slope = (ct.c_float * bins)()
xave = (ct.c_float * bins)()
yave = (ct.c_float * bins)()
zave = (ct.c_float * bins)()
cXgs = ct.c_float(Xgs)
normals = list(meshGeo.Normals) # This speeds up accessing the normals below.
for i in xrange(count):
	j = 3*i
	j1 = j+1
	j2 = j+2
	v = normals[i]
	cnormals[j]  = v.X
	cnormals[j1]  = v.Y
	cnormals[j2]  = v.Z
	vvertices[j] = fvertices[j]
	vvertices[j1] = fvertices[j1]
#
# Python code below was ported to C++ DLL.
#
for i in xrange(bins):
	bin_slope[i] = 0.0
	xave[i] = 0.0
	yave[i] = 0.0
	zave[i] = 0.0
for i in xrange(0,3*count,3):
	# Get x,y coordinates of vertex.
	x,y = vvertices[i],vvertices[i+1]
	# Scale x and y to bin units.
	ix,iy = int(Xgs*x), int(Xgs*y)
	# Get index of bin.
	j = iy*gsx+ix
	# Vector add normal to bin.
	xave[j] = xave[j]+cnormals[i]
	yave[j] = yave[j]+cnormals[i+1]
	zave[j] = zave[j]+cnormals[i+2]
# Compute slope for each bin.
for i in xrange(bins):
	vx,vy,vz = xave[i], yave[i], zave[i]
	if vz == 0.0: vz = 1.
	bin_slope[i] = abs(100.*sqrt(vx*vx + vy*vy)/vz)
# Compute slope for each vertex.
for i in xrange(0,3*count,3):
	# Get x,y coordinates of vertex.
	x,y = vvertices[i],vvertices[i+1]
	# Scale x and y to bin units.
	ix,iy = int(Xgs*x), int(Xgs*y)
	# Get index of bin.
	j = iy*gsx+ix
	# Use bin slope.
	slopePts[i // 3] = bin_slope[j]
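
As a quick check of the slope formula used in the listing above, here is a small standalone sketch. The helper name `slope_percent` is introduced only for illustration. It also shows why the listing can sum the normals in each bin without dividing by the number of vertices: scaling the vector does not change the ratio.

```python
from math import sqrt

def slope_percent(vx, vy, vz):
    """Percent grade of a surface whose (summed) normal is (vx, vy, vz)."""
    if vz == 0.0:
        vz = 1.0  # same guard as the listing above
    return abs(100.0 * sqrt(vx * vx + vy * vy) / vz)

flat = slope_percent(0.0, 0.0, 1.0)      # horizontal ground
diagonal = slope_percent(1.0, 0.0, 1.0)  # normal tilted 45 degrees
steep = slope_percent(3.0, 4.0, 1.0)     # horizontal length 5, vertical 1
scaled = slope_percent(6.0, 8.0, 2.0)    # same direction, twice the length
```

Here `flat` is 0% grade, `diagonal` is 100% grade, and `steep` and `scaled` give the same value because they point in the same direction.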

This is the Python code that calls the C++ DLL:

name = r'C:\Users\Terry\source\repos\slopes\x64\Release\slopes.dll'  # Raw string, so no backslash needs escaping.
so = ct.cdll.LoadLibrary(name)
# DLL runs in 20 ms for 2.2M vertices and 4.4M faces mesh.
so.slopes(ct.byref(cnormals), ct.byref(vvertices), ct.byref(slopePts), ct.byref(xave), ct.byref(yave), ct.byref(zave), ct.byref(bin_slope), cXgs, gsx, gsy, count)
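
The call above relies on ctypes' default argument conversions. An optional refinement, sketched here under the assumption that the C signature is exactly the one in the slopes.cpp listing, is to declare the argument types once so that ctypes checks and converts every argument. The helper name `declare_slopes` is hypothetical.

```python
import ctypes as ct

FLOAT_P = ct.POINTER(ct.c_float)

def declare_slopes(func):
    """Attach the C signature of slopes() to a ctypes function object."""
    # Seven float arrays: normals, vvertices, slopePts, xave, yave, zave, bin_slope.
    # Then float Xgs and ints gsx, gsy, count.
    func.argtypes = [FLOAT_P] * 7 + [ct.c_float, ct.c_int, ct.c_int, ct.c_int]
    func.restype = None  # the C function returns void
    return func

# Intended use with the variables from this post (not executed here):
#   declare_slopes(so.slopes)
#   so.slopes(cnormals, vvertices, slopePts, xave, yave, zave, bin_slope,
#             Xgs, gsx, gsy, count)
```

With `argtypes` set, the c_float arrays can be passed directly without `ct.byref`, a plain Python float is accepted for `Xgs`, and passing the wrong kind of object raises an error in Python instead of corrupting memory in the DLL.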

and this is the C++ DLL code:

// slopes.cpp : Computes the average slope at each vertex in a mesh based upon the average slope of its surrounding bin.
#include "stdafx.h"
#include "math.h"
#define DLLEXPORT extern "C" __declspec(dllexport)
DLLEXPORT void slopes(float *normals, float *vvertices, float *slopePts, float *xave, float *yave, float *zave, float *bin_slope, float Xgs, int gsx, int gsy, int count)
{
	float x, y, vx, vy, vz, slope;
	int ix, iy, j;
	int bins = (gsx + 1)*(gsy + 1);
	// Zero average and bin_slope arrays.
	for (int i = 0; i < bins; ++i)
	{
		xave[i] = 0.0;
		yave[i] = 0.0;
		zave[i] = 0.0;
		bin_slope[i] = 0.0;
	}
	// Find average x,y,z components of normal for each bin.
	for (int i = 0; i < 3*count; i+=3) // 3X count and step of 3 are needed to stride across flat vvertices list with x,y,z per vertex.
	{
		// Get x, y coordinates of vertex.
		x = vvertices[i];
		y = vvertices[i+1];
		// Scale x and y to bin units.
		ix = (int)(Xgs*x);
		iy = (int)(Xgs*y);
		// Get index of bin.
		j = iy * gsx + ix;
		// Vector add normal to bin.
		xave[j] = xave[j] + normals[i];
		yave[j] = yave[j] + normals[i+1];
		zave[j] = zave[j] + normals[i+2];
	}
	// Compute slope for each bin from average x,y,z components.
	for (int i = 0; i < bins; ++i)
	{
		vx = xave[i];
		vy = yave[i];
		vz = zave[i];
		// Prevent division by zero that happens in rare cases.
		if (vz == 0.0) { vz = 1.; }
		bin_slope[i] = fabs(100.*sqrt(vx*vx + vy * vy) / vz); // Slope in % grade.
	}
	// Compute slope for each vertex.
	for (int i = 0; i < 3*count; i+=3) // 3X count and step of 3 are needed to stride across flat vvertices list with x,y,z per vertex.
	{
		// Get x, y coordinates of vertex.
		x = vvertices[i];
		y = vvertices[i+1];
		// Scale x and y to bin units.
		ix = int(Xgs*x);
		iy = int(Xgs*y);
		// Get index of bin.
		j = iy * gsx + ix;
		// Use bin slope average.
		slope = bin_slope[j];
		if (slope < 1000.)
		{
			slopePts[i / 3] = slope;
		}
		else
		{
			// Limit maximum computed slope to 1000%.
			slopePts[i / 3] = 1000.;
		}
	}
}

Regards,
Terry.


(Tom) #6

I’ve done P/Invoking in both IronPython and C#, though for a different reason: code protection. A speed increase of 300X over IronPython (which is itself implemented in C#) is totally unrealistic as a general expectation. It “maybe” comes from the difference between interpreted and compiled code, memory usage, the way of measuring, etc.
Writing C modules for time-critical parts of Python is quite normal and nothing special. I bet that if you write efficient C# code and compile it as a Rhino or Grasshopper plugin, you will get pretty similar (slightly slower) results without this P/Invoke mess.


#7

ShapeOp was/is a pure C/C++ project with no/little prior implementation outside of those domains, so can’t really add much here I’m afraid. That said, we implemented Kangaroo2 within GHPython more or less simultaneously (@DanielPiker was also part of the ShapeOp effort), and to be honest I found the performance was pretty much the same across the two frameworks (likely K2 being pure .NET vs. the cost of implementing through ctypes with ShapeOp).

Speaking specifically to scripting components: I’ve found that one can severely reduce bottlenecks (especially input/output parameter costs) by following a few rules of thumb discussed in this old thread. On that note, I would probably always recommend going with compiled C# components prior to dropping to a lower-level language.

And if that still is not fast enough, maybe drop to the C++ Rhino API.

Edit: What Tom said :slight_smile:


#8

@TomTom,

I agree that 300X speedup is not typical. My observations are that it only happens for Python code that contains no Rhinoscript calls (which are wonderfully fast because they are implemented in C# or C++). When there is a mix of both Python and Rhinoscript calls (like to Rhino.Geometry.Mesh.CreateContourCurves) there is an opportunity for using Tasks.Parallel code to speed things up. But I have not yet learned how to move such mixed code to a C++ DLL. Can you provide a reference for how to do this?

One of the reasons I posted the results of my work was so that new users like me could more clearly understand the details of how to implement a ctypes interface to a C++ DLL. It took me hours to decipher conflicting descriptions of how to construct this interface and get it working. Now that I’ve done it, it’s no big deal. So I can understand that for an old hand like yourself, this is probably of little interest. But for me it is still an exciting improvement, one of dozens I have made in the last month to improve my elevation map generation time from 45 sec down to 1.8 sec. I have started reading through your posts and I am finding some interesting items that could benefit my work. Thanks for sharing on the Rhino forum.

Regards,
Terry.


#9

@AndersDeleuran,

I do not use Grasshopper but the old thread you provided still had some interesting observations.

I am still a bit lost as to what is the best approach for speeding up mixed Python + Rhinoscript code. Can I get a ctypes C++ DLL to understand Rhinoscript commands, or should I go to the C++ Rhino API you mention?

Regards,
Terry.


(Tom) #10

You can write C++, C# or VB.NET plugins to create custom commands or GH components. For Visual Studio there is even a wizard/template for it. From there you can use RhinoCommon.dll to access all functionality. Python uses rhinoscriptsyntax, which is a wrapper around RhinoCommon. I don’t know the exact procedure for C++, but in C#/VB it’s pretty easy. P/Invoking is difficult and a bit dangerous, so you really should know what you are doing. I’m not so sure about it myself, which is why I try to avoid it. In my last comment I meant that using ctypes is common and you will find a lot of resources for it. But you are still accessing unmanaged memory from a managed environment, which can be very problematic.
Tasks.Parallel is part of the .NET Framework (C#/VB.NET), so it is pretty easy to call from a .NET language, at least if you work thread-safe. This is what @AndersDeleuran is saying as well: just write a custom command/plugin/component and you will notice that your performance will be similar. The C# compiler is so good that it actually does a lot of optimisation for you. And if you need to work with pointers, you can always use them by setting the compiler to unsafe mode. If you are more familiar with C++ you can use that as well, and it offers a managed environment too.
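
To make the unmanaged-memory concern concrete for the slopes example earlier in this thread: the C++ loop writes to `xave[j]` without any bounds check, so a vertex with a negative coordinate, or one beyond `maxX`/`maxY`, would write outside the arrays. A minimal debugging sketch, assuming the same flat x,y,z layout and bin formula as that listing (the helper name `bins_in_range` is hypothetical):

```python
def bins_in_range(flat_xyz, Xgs, gsx, gsy):
    """Return True if every vertex maps to a bin inside the grid.

    flat_xyz is a flat [x0, y0, z0, x1, y1, z1, ...] sequence. The bin
    coordinates are computed as in the slopes listing, where int()
    truncates toward zero just like the C cast.
    """
    for i in range(0, len(flat_xyz), 3):
        ix = int(Xgs * flat_xyz[i])
        iy = int(Xgs * flat_xyz[i + 1])
        if ix < 0 or iy < 0 or ix > gsx or iy > gsy:
            return False
    return True

# Two vertices inside a 3 x 2 grid of bins (gsx=2, gsy=1).
ok = bins_in_range([0.0, 0.0, 1.0, 3.9, 1.9, 2.0], 0.5, 2, 1)
# Second vertex has a negative x, which would index before the array start.
bad = bins_in_range([0.0, 0.0, 1.0, -4.0, 1.0, 2.0], 0.5, 2, 1)
```

A pure-Python scan like this is slow on millions of vertices, so it is better suited to debugging a crash than to production use. The same test could be moved into the C++ loop at negligible cost.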