How to speed up copying mesh faces in C++ API vs Python + Rhinocommon

Terry_Chappell · October 1, 2019, 7:25pm

I am trying to improve the performance of my Python script by moving the Python code below, which creates several meshes, into a C++ API DLL.

faces = meshGeo.Faces
# Make .NET List of all faces to use in AddFaces below.
facesL = List[MeshFace](nvert)
# Add all faces from unified mesh to .NET List for quick access using GetRange.
facesL.AddRange(faces)
lower_index = 0
# Make duplicate mesh without faces.
dmesh = meshGeo.DuplicateMesh()
dmesh.Faces.Clear()
last_index = upper_face_index[-1][0]
for upper_index, name in upper_face_index:
	# If not last mesh, duplicate the no-faces mesh.
	if upper_index != last_index: next_meshGeo = dmesh.DuplicateMesh()
	# Use dmesh when making the last mesh.
	else: next_meshGeo = dmesh
	# Copy over the faces used in this mesh.
	next_meshGeo.Faces.AddFaces(facesL.GetRange(lower_index, upper_index - lower_index))
	# Use upper_index for lower limit in next range.
	lower_index = upper_index
	# Compact the mesh to remove unused vertices, colors, textures and normals.
	next_meshGeo.Compact()
	# Add mesh to document.
	doc.Objects.AddMesh(next_meshGeo)

The corresponding C++ API code is:

DLLEXPORT void make_meshes(uint32_t doc_serial_number, uint32_t mesh_serial_number, uint32_t nmi, int32_t* upper_face_indices, uint32_t* mesh_serial_numbers) {
	// Get doc from RuntimeSerialNumber passed as uint_32_t.
	CRhinoDoc* pDoc = CRhinoDoc::FromRuntimeSerialNumber(doc_serial_number);
	const CRhinoObject* obj = pDoc->LookupObjectByRuntimeSerialNumber(mesh_serial_number);
	const ON_Mesh* mesh = ON_Mesh::Cast(obj->Geometry());
	// Make new meshes.
	int lower_index = 0;
	// Duplicate mesh.
	ON_Mesh dmesh(*mesh);
	// Remove all faces.
	dmesh.DeleteComponent(ON_COMPONENT_INDEX(ON_COMPONENT_INDEX::mesh_face, 14));
	// Get index of last face.
	int32_t last_index = upper_face_indices[nmi-1];
	for (int i = 0; i < nmi; ++i) {
		// Get upper face index for this mesh.
		int upper_index = upper_face_indices[i];
		// Find number of faces in this mesh.
		int num = upper_index - lower_index;
		// If not last mesh, duplicate the no-faces mesh.
		if (upper_index != last_index) {
			// Duplicate no-faces mesh.
			ON_Mesh new_mesh(dmesh);
			// Set capacity of new mesh.  Without this, face copying is 30% slower.
			new_mesh.m_F.SetCapacity(num);
			// Copy over faces used in this mesh.
			for (int j = 0; j < num; ++j) {
				new_mesh.m_F[j] = mesh->m_F[lower_index + j];
			}
			new_mesh.Compact();
			// Add new mesh to Rhino document.
			CRhinoMeshObject* meshObject = pDoc->AddMeshObject(new_mesh);
			mesh_serial_numbers[i] = meshObject ? meshObject->RuntimeSerialNumber() : 0;
			// Use upper_index for lower limit in next range.
			lower_index = upper_index;
		}
	}
}

The slow C++ part is where the old faces are copied into the new_mesh:

for (int j = 0; j < num; ++j) { new_mesh.m_F[j] = mesh->m_F[lower_index + j]; }

In Python this is done using the higher-performance .NET List instead of a Python list:

next_meshGeo.Faces.AddFaces(facesL.GetRange(lower_index,upper_index-lower_index))

The Python+.NET+Rhinocommon code is 50% faster. Is there a way to do this faster in the C++ API?

I have moved several blocks of Python code that talk to Rhino into a C++ API DLL and seen significant speedups, but not in this case. Is it simply that a .NET List is faster than a C++ API ON_SimpleArray for copy operations? Or is the Rhinocommon meshGeo.Faces.AddFaces(IEnumerable) procedure faster than looping through new_mesh.m_F[j] = mesh->m_F[lower_index + j];? Since only a subset of the faces is being copied, I do not see a way to use the possibly more efficient C++ option of
new_mesh.m_F.Append(num, mesh->m_F.Array()) to perform the copy.

Regards,
Terry.

mnewberg · October 1, 2019, 8:03pm

The Python methods appear to be calling

int faceIndex = pMesh->m_F.Count();
if( pMesh->SetQuad(faceIndex, vertex1, vertex2, vertex3, vertex4) )
  rc = faceIndex;
pMesh->DestroyRuntimeCache();

over and over again as seen in this code:

~~https://github.com/mcneel/rhinocommon/blob/master/c/on_mesh.cpp~~
https://github.com/mcneel/rhino3dm/blob/master/src/librhino3dmio_native/on_mesh.cpp

Have you tried that?

*Updated with rhino3dm repo.

mnewberg · October 2, 2019, 4:58pm

@Terry_Chappell

I was able to get a 40% increase in speed by removing an extra ON_Mesh copy (AddMeshObject copies the mesh), and switching dmesh.DeleteComponent to dmesh.m_F.SetCount(0).

void make_meshes(uint32_t doc_serial_number, uint32_t mesh_serial_number, uint32_t nmi, int32_t* upper_face_indices, uint32_t* mesh_serial_numbers) {
	// Get doc from RuntimeSerialNumber passed as uint_32_t.
	CRhinoDoc* pDoc = CRhinoDoc::FromRuntimeSerialNumber(doc_serial_number);
	const CRhinoObject* obj = pDoc->LookupObjectByRuntimeSerialNumber(mesh_serial_number);
	const ON_Mesh* mesh = ON_Mesh::Cast(obj->Geometry());
	ON_3dmObjectAttributes attribs;
	pDoc->GetDefaultObjectAttributes(attribs);
	// Make new meshes.
	int lower_index = 0;
	// Duplicate mesh.
	ON_Mesh *dmesh = mesh->Duplicate();
	// Remove all faces.
	dmesh->m_F.SetCount(0);

	// Get index of last face.
	int32_t last_index = upper_face_indices[nmi - 1];
	for (unsigned int i = 0; i < nmi; ++i) {
		// Get upper face index for this mesh.
		int upper_index = upper_face_indices[i];
		// Find number of faces in this mesh.
		int num = upper_index - lower_index;
		// If not last mesh, duplicate the no-faces mesh.
		if (upper_index != last_index) {
			// Duplicate no-faces mesh.
			ON_Mesh *new_mesh = dmesh->Duplicate(); // Will be deleted by CRhinoMeshObject
			// Set capacity of new mesh.  Without this, face copying is 30% slower.
			new_mesh->m_F.SetCapacity(num);
			// Copy over faces used in this mesh.
			for (int j = 0; j < num; ++j) {
				auto &face = mesh->m_F[lower_index + j];
				new_mesh->SetQuad(j, face.vi[0], face.vi[1], face.vi[2], face.vi[3]);
			}
			new_mesh->Compact();

			CRhinoMeshObject *meshObject = new CRhinoMeshObject(attribs);

			meshObject->SetMesh(new_mesh);

			if (!pDoc->AddObject(meshObject)) {
				delete meshObject;
				meshObject = NULL;
			}

			uint32_t mesh_serialNumber = meshObject ? meshObject->RuntimeSerialNumber() : 0;

			mesh_serial_numbers[i] = mesh_serialNumber;
			// Use upper_index for lower limit in next range.
			lower_index = upper_index;
		}
	}

	delete dmesh;
}

Terry_Chappell · October 2, 2019, 11:38pm

@mnewberg

Thanks for the reference to the Rhinocommon C++ code and the new additions you made to the code. I had previously examined the Rhinocommon code but renewed my efforts with your recommendation. I also added detailed timing information to the C++ code and found that duplicating the mesh was taking 75% of the time. So from my re-examination of the Rhinocommon code I pulled the details for using memcpy to quickly copy components from the starting mesh and combined this with your code for adding the new meshes to the document. The result is that our new, hybrid code runs about 2.75 times as fast as my starting code using Python + .NET List + AddFaces with GetRange.

The results:

My starting version using Python + .NET List and AddRange ran in 5.06 sec. on a 5M face test case with a 387 MB .OBJ file.
Your version in the post above ran in 3.96 sec.

The hybrid version below runs in 1.84 sec.

 // This runs about 2.75 times as fast as Python using Rhinocommon and .NET List.
 DLLEXPORT void make_meshes(uint32_t doc_serial_number, uint32_t mesh_serial_number,
 	int32_t nmi, int32_t* upper_face_indices, uint32_t* mesh_serial_numbers,
 	int32_t& duration_d1, int32_t& duration_d2, int32_t& duration_d3,
 	int32_t& duration_d4, int32_t& duration_d5, int32_t& duration_d6, int32_t& duration_d7) {
 	// Get doc from RuntimeSerialNumber passed as uint_32_t.
 	CRhinoDoc* pDoc = CRhinoDoc::FromRuntimeSerialNumber(doc_serial_number);
 	// Get starting mesh from doc using its RuntimeSerialNumber.
 	const CRhinoObject* obj = pDoc->LookupObjectByRuntimeSerialNumber(mesh_serial_number);
 	// Get geometry of starting mesh.
 	const ON_Mesh* mesh = ON_Mesh::Cast(obj->Geometry());
 	// Do something with Attributes.
 	ON_3dmObjectAttributes attribs;
 	pDoc->GetDefaultObjectAttributes(attribs);
 	//
 	// Make sub-meshes based upon face indices in upper_face_indices.
 	//
 	// Zero lower_index at start.
 	int lower_index = 0;
 	chrono::steady_clock::time_point time1 = chrono::steady_clock::now();
 	for (int i = 0; i < nmi; ++i) {
 		// Get upper face index for this mesh.
 		int upper_index = upper_face_indices[i];
 			chrono::steady_clock::time_point time2 = chrono::steady_clock::now();
 			// Create new mesh for sub-mesh.
 			ON_Mesh* new_mesh = new ON_Mesh();
 			//
 			// Copy over vertices.
 			//
 			int32_t count = mesh->m_V.Count();
 			new_mesh->m_V.SetCapacity(count);
 			ON_3fPoint* vdest = new_mesh->m_V.Array();
 			::memcpy(vdest, mesh->m_V.Array(), count*sizeof(ON_3fPoint));
 			new_mesh->m_V.SetCount(count);
 			//
 			// Copy over colors.
 			//
 			chrono::steady_clock::time_point time3 = chrono::steady_clock::now();
 			int32_t ccount = mesh->m_C.Count();
 			if (ccount) {
 				new_mesh->m_C.SetCapacity(ccount);
 				ON_Color* dest = new_mesh->m_C.Array();
 				::memcpy(dest, mesh->m_C.Array(), ccount*sizeof(uint32_t));
 				new_mesh->m_C.SetCount(ccount);
 				memset(&(new_mesh->m_Ctag), 0, sizeof(new_mesh->m_Ctag));
 			}
 			//
 			// Copy over textures.
 			//
 			chrono::steady_clock::time_point time4 = chrono::steady_clock::now();
 			int32_t tcount = mesh->m_T.Count();
 			if (tcount) {
 				new_mesh->m_T.SetCapacity(tcount);
 				ON_2fPoint* dest = new_mesh->m_T.Array();
 				::memcpy(dest, mesh->m_T.Array(),tcount*sizeof(ON_2fPoint));
 				new_mesh->m_T.SetCount(tcount);
 				memset(&(new_mesh->m_Ttag), 0, sizeof(new_mesh->m_Ttag));
 			}
 			//
 			// Copy over faces for just this sub-mesh.
 			//
 			chrono::steady_clock::time_point time5 = chrono::steady_clock::now();
 			// Find number of faces in this mesh.
 			int num = upper_index - lower_index;
 			// Set face capacity of new mesh. Copy 30% slower without this.
 			new_mesh->m_F.SetCapacity(num);
 			// Set destination of copy to be start of new mesh-face array.
 			ON_MeshFace* dest = new_mesh->m_F.Array();
 			// Offset source by lower_index into starting-mesh face array.
 			const ON_MeshFace* src = mesh->m_F.Array() + lower_index;
 			::memcpy(dest, src, num * sizeof(ON_MeshFace));
 			new_mesh->m_F.SetCount(num);
 			//
 			// Compact the mesh to remove unused vertices, colors &textures.
 			//
 			chrono::steady_clock::time_point time6 = chrono::steady_clock::now();
 			new_mesh->Compact();
 			//
 			// Add new mesh to Rhino document and return its RuntimeSerialNumber.
 			//
 			chrono::steady_clock::time_point time7 = chrono::steady_clock::now();
 			CRhinoMeshObject *meshObject = new CRhinoMeshObject(attribs);
 			meshObject->SetMesh(new_mesh);
 			if (new_mesh->IsValid()) {
 				pDoc->AddObject(meshObject);
 				mesh_serial_numbers[i] = meshObject ? meshObject->RuntimeSerialNumber() : 0;
 			}
 			// Use upper_index for lower limit of next sub-mesh.
 			lower_index = upper_index;
 			chrono::steady_clock::time_point time8 = chrono::steady_clock::now();
 			duration_d1 += (int)chrono::duration_cast<chrono::microseconds> (time3 - time2).count();
 			duration_d2 += (int)chrono::duration_cast<chrono::microseconds> (time4 - time3).count();
 			duration_d3 += (int)chrono::duration_cast<chrono::microseconds> (time5 - time4).count();
 			duration_d4 += (int)chrono::duration_cast<chrono::microseconds> (time6 - time5).count();
 			duration_d5 += (int)chrono::duration_cast<chrono::microseconds> (time7 - time6).count();
 			duration_d6 += (int)chrono::duration_cast<chrono::microseconds> (time8 - time7).count();
 	}
 	//RhinoApp().RunScript(pDoc->RuntimeSerialNumber(), L"_Zoom _All _Extents", 0);
 	chrono::steady_clock::time_point time9 = chrono::steady_clock::now();
 	duration_d7 = (int)chrono::duration_cast<chrono::microseconds> (time9 - time1).count();
 }
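
The speedup figures can be checked directly against the timings reported above (5.06 sec., 3.96 sec. and 1.84 sec.). A minimal sketch of the arithmetic follows; the helper names are purely illustrative and are not part of any Rhino API:

```cpp
#include <cassert>

// Ratio of old run time to new run time: how many times as fast the new code is.
double speedup_ratio(double old_seconds, double new_seconds) {
    assert(old_seconds > 0.0 && new_seconds > 0.0);
    return old_seconds / new_seconds;
}

// Percentage by which the run time dropped, relative to the old run time.
double percent_time_saved(double old_seconds, double new_seconds) {
    assert(old_seconds > 0.0 && new_seconds > 0.0);
    return 100.0 * (old_seconds - new_seconds) / old_seconds;
}
```

With these timings, speedup_ratio(5.06, 1.84) is about 2.75, so the hybrid code runs about 2.75 times as fast as the Python version, and percent_time_saved(5.06, 1.84) is about 64%.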

Questions:

I see you added 3 lines associated with Attributes. What do these do? The resulting meshes look the same without these 3 references to Attributes.
Can you go into more details as to why:

CRhinoMeshObject* meshObject = pDoc->AddMeshObject(new_mesh);

runs 20% slower than:

CRhinoMeshObject *meshObject = new CRhinoMeshObject(attribs);
meshObject->SetMesh(new_mesh);
pDoc->AddObject(meshObject)

Why is :: in front of the memcpy lines? The code seems to run the same without them. Is this just a shorthand for indicating it comes from the std namespace?

Currently I am pretty happy with our result: The new, hybrid C++ code runs about 2.75 times as fast as the old Python + .NET List + AddRange code for generating the sub-meshes. For a mesh with 5M faces and 32 sub-meshes, the total time to read its 387 MB .OBJ file, generate its unified mesh (with all faces) and then create the 32 sub-meshes based upon its 32 MTL files is 3.46 sec today vs 6.84 sec yesterday. The largest improvement came from eliminating duplication of the starting mesh in favor of using memcpy to quickly copy over all of its vertices, colors and textures and then the subset of faces for a sub-mesh. McNeel’s choice to use ON_SimpleArrays to hold these components is what enables the use of a simple memcpy operation to very quickly add components to a new sub-mesh.
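
The approach depends on Compact() discarding the vertices that a sub-mesh does not reference after all vertices have been copied in. A minimal standalone sketch of that idea follows. Vertex, Face and compact_vertices are hypothetical stand-ins, not OpenNURBS types, and this sketch renumbers vertices in order of first use, which may differ from what ON_Mesh::Compact actually does:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the mesh vertex and face types.
struct Vertex { float x, y, z; };
struct Face { int vi[4]; };

// Drop vertices that no face references and renumber the face indices to match.
// Kept vertices are ordered by first use. Returns the number of vertices removed.
std::size_t compact_vertices(std::vector<Vertex>& vertices, std::vector<Face>& faces) {
    const int unused = -1;
    std::vector<int> remap(vertices.size(), unused);
    std::vector<Vertex> kept;
    for (Face& f : faces) {
        for (int& index : f.vi) {
            assert(index >= 0 && static_cast<std::size_t>(index) < vertices.size());
            if (remap[index] == unused) {
                // First time this vertex is seen: give it the next new index.
                remap[index] = static_cast<int>(kept.size());
                kept.push_back(vertices[index]);
            }
            index = remap[index];
        }
    }
    const std::size_t removed = vertices.size() - kept.size();
    vertices.swap(kept);
    return removed;
}
```

Because every sub-mesh starts with the full vertex array, the cost of this pass grows with the total vertex count for each sub-mesh, which is one place further timing could be worth collecting.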

I am so glad to find someone with an interest in the C++ API code. I have just recently learned how to create a DLL that uses the C++ API to access Rhino objects more quickly than Rhinocommon, without resorting to a plugin. This way I get to do all the high-level architecture of the script in Python and then tap into the performance of the C++ API where needed.

Any ideas for making our code faster?

Regards,
Terry.

mnewberg · October 3, 2019, 7:43pm

@Terry_Chappell

I generally stay away from using memcpy unless required. The built-in ON_SimpleArray template (copy constructor / Append method) already uses memcpy and handles the pointer arithmetic correctly for you.

In my example I used the default object attributes, which I believe is the same value you would get if you don’t pass one to the constructor. Normally when I am writing a command I use the object attributes of the source object. This puts the new geometry on the same layer, group, material and plot color as the source geometry. Sometimes this is really helpful.

The reason the CRhinoMeshObject method of adding geometry is faster is that AddMeshObject takes memory on the C++ stack (new_mesh) and makes a new copy on the heap; later your function deletes the stack memory. The CRhinoMeshObject method takes ownership of memory already on the heap and will delete it when the CRhinoMeshObject is no longer used. If the CRhinoMeshObject fails to get added, it is your responsibility to delete the memory correctly to prevent memory leaks, as seen in the example.
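
To make the difference concrete, here is a minimal standalone sketch of the two hand-off styles. Mesh and Document are hypothetical stand-ins, not Rhino SDK classes, and std::unique_ptr is used only to make the ownership transfer explicit:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical mesh type that counts how many deep copies are made.
struct Mesh {
    std::vector<int> faces;
    static int copies;
    Mesh() = default;
    Mesh(const Mesh& other) : faces(other.faces) { ++copies; }
};
int Mesh::copies = 0;

// Hypothetical document that owns its meshes on the heap.
struct Document {
    std::vector<std::unique_ptr<Mesh>> objects;

    // Copy style: the caller keeps its mesh and the document stores a new heap copy.
    Mesh* add_copy(const Mesh& mesh) {
        objects.push_back(std::unique_ptr<Mesh>(new Mesh(mesh)));
        return objects.back().get();
    }

    // Ownership style: the document takes over the caller's heap mesh; no copy is made.
    Mesh* add_owned(std::unique_ptr<Mesh> mesh) {
        assert(mesh != nullptr);
        objects.push_back(std::move(mesh));
        return objects.back().get();
    }
};
```

Calling add_copy increments Mesh::copies once per mesh, while add_owned leaves it unchanged, which is the extra work the AddMeshObject route pays for on large meshes.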

The :: in ::memcpy refers to the global namespace (scope), not std. It makes sure the global version is used in case a local version was created. You can find all the C++ namespace rules here: https://en.cppreference.com/w/cpp/language/namespace
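
A small standalone sketch of the lookup difference follows. The demo namespace and its functions are made up for illustration; only ::memcpy from <string.h> is a real library function:

```cpp
#include <cassert>
#include <cstddef>
#include <string.h>

namespace demo {

int local_calls = 0;  // counts calls to demo::memcpy

// A local function that happens to share the library function's name.
void* memcpy(void* dest, const void* src, std::size_t n) {
    ++local_calls;
    return ::memcpy(dest, src, n);  // forward to the global C library version
}

// Unqualified lookup starts in namespace demo and finds demo::memcpy first.
void copy_unqualified(int* dest, const int* src, std::size_t count) {
    assert(dest != nullptr && src != nullptr);
    memcpy(dest, src, count * sizeof(int));
}

// The leading :: starts lookup at global scope, bypassing demo::memcpy.
void copy_global(int* dest, const int* src, std::size_t count) {
    assert(dest != nullptr && src != nullptr);
    ::memcpy(dest, src, count * sizeof(int));
}

}  // namespace demo
```

Both functions copy the same bytes, but only copy_unqualified goes through the local wrapper. When no local memcpy exists, as in your DLL, the two spellings resolve to the same function, which is why removing the :: made no difference.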

Below is the code cleaned up, which should run exactly like your sample above:

int lower_index = 0;
for (int i = 0; i < nmi; ++i) {
	// Get upper face index for this mesh.
	int upper_index = upper_face_indices[i];
	// Create new mesh for sub-mesh.
	ON_Mesh* new_mesh = new ON_Mesh();
	
	new_mesh->m_V = mesh->m_V;
	new_mesh->m_C = mesh->m_C;
	new_mesh->m_T = mesh->m_T;
	
	// Find number of faces in this mesh.
	int num = upper_index - lower_index;
	// Set face capacity of new mesh. Copy 30% slower without this.
	new_mesh->m_F.SetCapacity(num);
	// Offset source by lower_index into starting-mesh face array.
	const ON_MeshFace* src = mesh->m_F.Array() + lower_index;
	// Append copies the num faces in one call, so no destination pointer is needed.
	new_mesh->m_F.Append(num, src);
	
	new_mesh->Compact();
	
	if (new_mesh->IsValid()) {

		CRhinoMeshObject *meshObject = new CRhinoMeshObject(attribs);
		meshObject->SetMesh(new_mesh);
		if (!pDoc->AddObject(meshObject)) {
			delete meshObject;
			meshObject = NULL;
		}

		mesh_serial_numbers[i] = meshObject ? meshObject->RuntimeSerialNumber() : 0;
	}
	// Use upper_index for lower limit of next sub-mesh.
	lower_index = upper_index;
}
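
The Append call above copies only a sub-range because the source pointer is offset by lower_index before it is passed in. A minimal standalone sketch of the same pattern, using std::vector and a hypothetical Face struct as stand-ins for ON_SimpleArray and ON_MeshFace:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ON_MeshFace: four vertex indices per face.
struct Face {
    int vi[4];
};

// Copy faces [lower, upper) from src into a new vector in one bulk operation.
std::vector<Face> copy_face_range(const std::vector<Face>& src,
                                  std::size_t lower, std::size_t upper) {
    assert(lower <= upper && upper <= src.size());
    std::vector<Face> dst;
    dst.reserve(upper - lower);  // single allocation, analogous to SetCapacity(num)
    // Offsetting the source pointer selects the sub-range, analogous to Array() + lower_index.
    const Face* first = src.data() + lower;
    dst.insert(dst.end(), first, first + (upper - lower));
    return dst;
}
```

For example, copy_face_range(faces, 3, 7) returns the four faces at indices 3 through 6 without an element-by-element loop in user code.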

Terry_Chappell · October 3, 2019, 10:41pm

Matt,

Thanks for looking over the latest code. I replaced the memcpy with the cleaner ON_SimpleArray Copy/Append method and found the code ran at the same speed. Here are the latest timings of the complete script:

So I guess we are done. No more speed tricks?

Regards,
Terry.

Terry_Chappell · October 4, 2019, 2:19am

Matt,

How do you know that ON_SimpleArray uses memcpy for its Append method? Is there code available that shows these details? Or documentation?

Regards,
Terry.

mnewberg · October 4, 2019, 12:17pm

@Terry_Chappell

C++ templates need to be defined in the header, so the source is available to read. If you are using Visual Studio, you can also right-click the call and select “Go To Definition” to get there quickly.

https://github.com/mcneel/opennurbs/blob/b844466e88fcf5a42c9d6d526abb3dcc44f939b5/opennurbs_array_defs.h#L504-L529

void ON_SimpleArray<T>::Append( int count, const T* buffer ) 
{
  if ( count > 0 && nullptr != buffer ) 
  {
    const size_t sizeof_buffer = count * sizeof(T);
    void* temp = nullptr;
    if ( count + m_count > m_capacity ) 
    {
      int newcapacity = NewCapacity();
      if ( newcapacity < count + m_count )
        newcapacity = count + m_count;
      if ( buffer >= m_a && buffer < (m_a + m_capacity) )
      {
        // buffer is in the block of memory about to be reallocated
        temp = onmalloc(sizeof_buffer);
        memcpy(temp, buffer, sizeof_buffer);
        buffer = (const T*)temp;
      }
      Reserve( newcapacity );
    }

This file has been truncated.
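
The excerpt stops before the copy itself, so here is a minimal standalone sketch of the same pattern: a single memcpy per append, plus the safeguard for a source buffer that lives inside the array being grown. SimpleArray below is a hypothetical class written for illustration; it is not the OpenNURBS implementation, and its growth policy is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <type_traits>

// Hypothetical minimal array for trivially copyable T; not the OpenNURBS class.
template <class T>
class SimpleArray {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy needs trivially copyable T");
public:
    SimpleArray() = default;
    SimpleArray(const SimpleArray&) = delete;
    SimpleArray& operator=(const SimpleArray&) = delete;
    ~SimpleArray() { std::free(data_); }

    int count() const { return count_; }
    const T* data() const { return data_; }
    const T& operator[](int i) const { return data_[i]; }

    // Append count elements from buffer with a single memcpy.
    void append(int count, const T* buffer) {
        if (count <= 0 || buffer == nullptr) return;
        const std::size_t bytes = static_cast<std::size_t>(count) * sizeof(T);
        void* temp = nullptr;
        if (count_ + count > capacity_) {
            if (points_into_storage(buffer)) {
                // Growing may move the storage, so save the source bytes first.
                temp = std::malloc(bytes);
                assert(temp != nullptr);
                std::memcpy(temp, buffer, bytes);
                buffer = static_cast<const T*>(temp);
            }
            grow(count_ + count);
        }
        std::memcpy(data_ + count_, buffer, bytes);
        count_ += count;
        std::free(temp);  // free(nullptr) is a no-op
    }

private:
    bool points_into_storage(const T* p) const {
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        const auto begin = reinterpret_cast<std::uintptr_t>(data_);
        return data_ != nullptr && addr >= begin && addr < begin + capacity_ * sizeof(T);
    }
    void grow(int min_capacity) {
        // Assumed policy: double the capacity, or jump to the requested minimum.
        int new_capacity = capacity_ > 0 ? 2 * capacity_ : 4;
        if (new_capacity < min_capacity) new_capacity = min_capacity;
        void* p = std::realloc(data_, static_cast<std::size_t>(new_capacity) * sizeof(T));
        assert(p != nullptr);
        data_ = static_cast<T*>(p);
        capacity_ = new_capacity;
    }
    T* data_ = nullptr;
    int count_ = 0;
    int capacity_ = 0;
};
```

Appending an array to itself, as in a.append(a.count(), a.data()), exercises the safeguard: the source is copied to a temporary buffer before the storage is reallocated.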

Terry_Chappell · October 4, 2019, 6:38pm

Matt,

Thanks for the mentoring. This will be very helpful to my work. In fact, after just a few minutes inspecting its methods, I was able to get a script working for quickly importing a point cloud. I had spent hours flailing around without this knowledge.

Many thanks,
Terry.

AlanMattano · December 18, 2019, 11:58am

I notice that copying a surface or mesh can take a long time. This is because there is a texture assigned to the material.

Idea:
What if you remove the assigned material, make the copy, and then reassign the material to both the original object and the copy?

In other words, test whether it is faster to copy without the material or with the default one.