Will a mesh with a pre-defined bounding box display faster after being imported?

@dale, @stevebaer,

I am importing a large mesh (100M faces) into Rhino using a Python script that calls a C++ DLL. It works great, importing this large mesh in only 8 sec. But after it is imported, it takes 2 minutes before it is displayed in the viewport. If the bounding box of the mesh were already defined before Rhino tried to display it, would it be displayed faster?

If so, how do I add the bounding box to the mesh? I already have the bounding box from running my script, so there is no need to calculate it again. Under ON_Mesh in the Rhino C++ SDK I did not see a way to do this. I see that GetBBox will trigger the generation of a bounding box, but I do not see a method for attaching a pre-existing bounding box to the mesh.

If not, is there anything else I can do to make it easier and faster for Rhino to display this freshly imported mesh?

I tried importing the 4.7 GB .obj file using Rhino's File -> Import, but this fails. Using my script, once the mesh is imported and finally displayed, the viewport can be moved and zoomed responsively. But it is painful to wait 2 minutes for it to be displayed the first time.

Here is a link to the 100M-face mesh .obj file if you want to try importing it:

Regards,
Terry.

I doubt the bounding box makes much of a difference in the timing to initial draw. We have to take all of the data that a mesh is composed of and make a copy of it on the GPU for drawing.

For me, as an innocent bystander, two questions come up:

Why, if Terry can get the data into Rhino in 8 seconds, does it take Rhino 2 minutes to get it to the GPU?

Is there any way for Terry to put it into the GPU directly as part of the import and would that speed things up?

@Terry_Chappell the m_vertex_bbox on an ON_Mesh is private and I don’t advise attempting to set it through any sort of pointer offset hack.

How are you creating the CRhinoMeshObject that gets added to the document?

Good point. The reason my script runs so fast is that it uses lots of parallelism. This works well on my new computer with a 16-core, 24-thread processor (Intel i9-12900KF). The memory system is also fast: 128 GB of 3400 MHz DDR4 and a 2 TB PCIe 4.0 x4 SSD with up to 7000 MB/s sequential read speed and over 5000 MB/s write. The GPU is an Nvidia RTX 3080 Ti, which has fast memory and sits on a PCIe 4.0 x16 bus with high bandwidth. So I think the hardware is easily capable of over 3 GB/s transfer rates. In fact, I timed the step that adds the 3 GB mesh to the Rhino document and found it runs at 9 GB/s, so the Rhino code for this is quite good (probably a bunch of memcpy calls in C++, since the 4-channel DRAM has over 50 GB/s of bandwidth).

Thus I am mystified as to how 2 minutes slips by before I can see an image of the mesh. The CPU can execute 5,000,000,000 instructions per second on one core, which scales to at least 50,000,000,000 per second with all 16 cores if the parallelism can be exploited. So in 2 minutes it can execute on the order of 6,000,000,000,000 instructions (6 trillion). If you send me the display code I may be able to get it to run faster.

Now, in the grand scheme of things, making the Rhino display engine run 100x faster is not going to solve world hunger. So this problem is not that big of a deal. But for me, retired from Intel and looking for interesting problems to work on, I find this challenge fascinating.

Regards,
Terry.

Is your mesh far from the origin? If so, many more calculations are needed to get things properly set up for display.

There are also many calculations made for things like snapping and edge display that may be at play.

Are there ngons?
Are all vertices being set in float or double precision?
Are there vertex colors?
Are there texture coordinates?

All of these will add to calculation time.
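
In case it helps with checking, here is a minimal sketch (mine, not from the SDK docs) of how those properties can be queried on an ON_Mesh; the accessor names come from the opennurbs headers shipped with the Rhino C++ SDK, and ReportMeshDisplayCost is a hypothetical helper:

// Hedged sketch: report the ON_Mesh properties that add display setup cost.
void ReportMeshDisplayCost(const ON_Mesh* mesh)
{
	if (nullptr == mesh)
		return;
	RhinoApp().Print("Ngons: %u\n", mesh->NgonCount());
	RhinoApp().Print("Double precision vertices: %s\n", mesh->HasDoublePrecisionVertices() ? "yes" : "no");
	RhinoApp().Print("Vertex colors: %s\n", mesh->HasVertexColors() ? "yes" : "no");
	RhinoApp().Print("Texture coordinates: %s\n", mesh->HasTextureCoordinates() ? "yes" : "no");
}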

Steve,

I am using the faster AddObject method. This skips the checks performed by AddMesh (or whatever it is called), but I do not need them because my import code already does all of those checks itself. I especially like my implementation of the duplicate-face and duplicate-vertex checks. It runs 4x faster than using a C++ unordered_multimap the way Rhino does. I know Rhino uses one because when I tried it I got exactly the same times as Rhino for these checks over a 100x range of mesh sizes. This motivated me to find a faster approach, which uses my own custom hash with highly efficient miss handling.
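
To sketch the general technique (an illustration, not my actual DLL code; the FNV-1a mix, the table sizing, and all names are placeholder choices), a flat open-addressing table over the raw vertex bytes avoids the per-node allocations of unordered_multimap and makes misses cheap:

#include <cstdint>
#include <cstring>
#include <vector>

struct V3f { float x, y, z; };

// FNV-1a hash over the raw bytes of a vertex (placeholder mix).
static uint64_t HashV3f(const V3f& v) {
	uint64_t h = 1469598103934665603ULL;
	unsigned char b[sizeof(V3f)];
	std::memcpy(b, &v, sizeof(V3f));
	for (unsigned char c : b) { h ^= c; h *= 1099511628211ULL; }
	return h;
}

// For each vertex, return the index of its first bit-identical occurrence.
std::vector<int> FirstOccurrence(const std::vector<V3f>& verts) {
	const size_t cap = 2 * verts.size() + 1;   // low load factor keeps probe chains short
	std::vector<int> table(cap, -1);
	std::vector<int> first(verts.size());
	for (int i = 0; i < (int)verts.size(); ++i) {
		size_t slot = (size_t)(HashV3f(verts[i]) % cap);
		for (;;) {                              // linear probing handles misses
			int j = table[slot];
			if (j < 0) { table[slot] = i; first[i] = i; break; }
			if (0 == std::memcmp(&verts[j], &verts[i], sizeof(V3f))) { first[i] = j; break; }
			slot = (slot + 1) % cap;
		}
	}
	return first;
}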

Anyway, I use AddObject which is very fast.

Regards,
Terry.

Good; that eliminates a copy.

You can check in your own code how long the cached bbox calculation is taking by just calling BoundingBox on your ON_Mesh.

Not far from origin. No ngons. No colors. Float vertices, no double precision. A vanilla case.

Good idea. I will time the bbox gen step.

This is the time between adding geometry to the document and getting pixels to draw in the right spot on the screen. There are potentially many other non-display calculations involved when objects are initially added.

Not really in any practical manner.

I did some timings on my two Nvidia cards:
RTX 3080 Ti with 12 GB of memory: 80 sec
GT 1030 with 2 GB of memory: 107 sec
so the card seems to play some small role in the speed.

@stevebaer,

It takes only 0.1 sec to create the bounding box for the 100M-face mesh, timed using:

chrono::steady_clock::time_point time1, time2;
time1 = chrono::steady_clock::now();
ON_BoundingBox bBox0 = mesh->BoundingBox();
time2 = chrono::steady_clock::now();
RhinoApp().Print("\nTime to create and get bounding box = %.4f", 1e-6 * chrono::duration_cast<chrono::microseconds> (time2 - time1).count());

Time to create and get bounding box = 0.0969 sec

After this, getting the bounding box takes less than 100 us:

ON_BoundingBox bBox1 = mesh->BoundingBox();

Time to get bounding box = 0.0000 sec

So as you surmised, computing the bounding box is not slowing down the display. With my best Nvidia card, there is still 80.9 sec left to account for.

Regards,
Terry.

That’s what I figured, as computing the bounding box is just one pass through the vertex list.
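
For reference, that single pass amounts to something like this minimal standalone sketch over a packed float array (my illustration; the real ON_Mesh caches the result after the first call):

#include <algorithm>
#include <cfloat>
#include <vector>

struct BBox { float mn[3], mx[3]; };

// Single pass over packed x,y,z triples; roughly what a 0.1 sec pass
// over a 100M-face mesh's vertices amounts to.
BBox ComputeBBox(const std::vector<float>& xyz) {
	BBox b = {{ FLT_MAX, FLT_MAX, FLT_MAX }, { -FLT_MAX, -FLT_MAX, -FLT_MAX }};
	for (size_t i = 0; i + 2 < xyz.size(); i += 3)
		for (int k = 0; k < 3; ++k) {
			b.mn[k] = std::min(b.mn[k], xyz[i + k]);
			b.mx[k] = std::max(b.mx[k], xyz[i + k]);
		}
	return b;
}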

How long does it take to draw a single frame? If you run TestMaxSpeed, you should be able to figure out how long it takes to draw the mesh and how long it takes to initially set it up for drawing.

@stevebaer ,

I ran TestMaxSpeed twice, once with the layer holding the mesh turned off and once with it turned on and the mesh visible. The times were the same: 3.27 sec and 30.63 FPS.

My script reads the .obj file, creates 1 mesh of 100M faces from it, adds the mesh to a layer that is not visible and then adds it to the Rhino document. When I turn on the layer, it takes 81 sec for the mesh to be displayed.

So what does this tell us?

By the way, this mesh test case was created by arraying one 2M-face mesh 49 times (a 7x7 grid) and then exporting the result to a .obj file. When my script reads the .obj file it sees that it is comprised of 49 meshes stored serially (vertices, faces, vertices, faces … 49 times). So it reads each mesh one at a time, creates all the data needed to make a mesh (faces, vertices, vertex normals, face normals), concatenates this data into master lists, and finally creates one combined mesh from the master lists. This yields a single mesh with 100M faces. The script can also keep the meshes separate when importing and recreate the original document with 49 individual meshes. For actual cases I do have single meshes with 100M faces, so I am emulating that case with this combined-mesh approach.
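
For the curious, the concatenation step amounts to something like the sketch below (an illustration, not my actual import code; I believe opennurbs also offers ON_Mesh::Append for this). Field names are from the opennurbs headers:

// Append a sub-mesh's data to a combined ON_Mesh, offsetting face indices
// by the number of vertices already present.
void AppendSubMesh(ON_Mesh& combined, const ON_Mesh& sub)
{
	const int offset = combined.m_V.Count();
	combined.m_V.Append(sub.m_V.Count(), sub.m_V.Array());   // vertices
	combined.m_N.Append(sub.m_N.Count(), sub.m_N.Array());   // vertex normals
	for (int i = 0; i < sub.m_F.Count(); ++i) {
		ON_MeshFace f = sub.m_F[i];
		f.vi[0] += offset; f.vi[1] += offset;
		f.vi[2] += offset; f.vi[3] += offset;                // triangles store vi[3] == vi[2]
		combined.m_F.Append(f);
	}
}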

Regards,
Terry.

That tells us that the drawing of a frame is not a significant portion of the time. Beyond that I would have to run a performance analysis on the model in Visual Studio to gain insight into what is taking the time.

@steve,

I also tried importing the .obj file as 49 separate meshes, but the performance is the same: slow to display.

I saved the 1 mesh of 100M faces case to a .3dm file. A link to it is attached below if you want to use it as a test case for your Visual Studio analysis.

If I bring up Rhino and open this file, it does not take too long to load. But then if I turn on the layer for the mesh, it takes 80 sec for it to appear using my Nvidia RTX 3080 Ti GPU.

Any improvements you can make will be greatly appreciated by many. Many of us now have multicore processors, which I detect using this code in my C++ DLL:

//
// Procedures to find number of cores, logical cores and more.
//
#include <windows.h>
#include <malloc.h>
typedef BOOL(WINAPI* LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);

// Helper function to count set bits in the processor mask.
DWORD CountSetBits(ULONG_PTR bitMask) {
	DWORD LSHIFT = sizeof(ULONG_PTR) * 8 - 1;
	DWORD bitSetCount = 0;
	ULONG_PTR bitTest = (ULONG_PTR)1 << LSHIFT;
	DWORD i;
	for (i = 0; i <= LSHIFT; ++i) {
		bitSetCount += ((bitMask & bitTest) ? 1 : 0);
		bitTest /= 2;
	}
	return bitSetCount;
}

DLLEXPORT int _cdecl get_core_count() {
	LPFN_GLPI glpi;
	BOOL done = FALSE;
	PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = NULL;
	PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = NULL;
	DWORD returnLength = 0;
	DWORD logicalProcessorCount = 0;
	DWORD numaNodeCount = 0;
	DWORD processorCoreCount = 0;
	DWORD processorL1CacheCount = 0;
	DWORD processorL2CacheCount = 0;
	DWORD processorL3CacheCount = 0;
	DWORD processorPackageCount = 0;
	DWORD byteOffset = 0;
	PCACHE_DESCRIPTOR Cache;

	glpi = (LPFN_GLPI)GetProcAddress( GetModuleHandle(TEXT("kernel32")), "GetLogicalProcessorInformation");
	if (NULL == glpi) {	RhinoApp().Print("\nGetLogicalProcessorInformation is not supported. Core count set to 8.\n"); return (8); }

	while (!done) {
		DWORD rc = glpi(buffer, &returnLength);
		if (FALSE == rc) {
			if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
				if (buffer)	free(buffer);
				buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(returnLength);
				if (NULL == buffer)	{ RhinoApp().Print("\nError: Allocation failure. Core count set to 8.\n"); return (8); }
			}
			else { RhinoApp().Print("\nERROR %d. Core count set to 8.\n", GetLastError()); if (buffer) free(buffer); return (8); }
		}
		else { done = TRUE; }
	}
	ptr = buffer;
	while (byteOffset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= returnLength) {
		switch (ptr->Relationship) {
		case RelationNumaNode:
			// Non-NUMA systems report a single record of this type.
			numaNodeCount++;
			break;
		case RelationProcessorCore:
			processorCoreCount++;
			// A hyperthreaded core supplies more than one logical processor.
			logicalProcessorCount += CountSetBits(ptr->ProcessorMask);
			break;
		case RelationCache:
			// Cache data is in ptr->Cache, one CACHE_DESCRIPTOR structure for each cache. 
			Cache = &ptr->Cache;
			if (Cache->Level == 1) { processorL1CacheCount++; }
			else if (Cache->Level == 2)	{ processorL2CacheCount++; }
			else if (Cache->Level == 3)	{ processorL3CacheCount++; }
			break;
		case RelationProcessorPackage:
			// Logical processors share a physical package.
			processorPackageCount++;
			break;
		default:
			RhinoApp().Print("\nError: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.\n");
			break;
		}
		byteOffset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
		ptr++;
	}
	/*
	RhinoApp().Print("\nGetLogicalProcessorInformation results:\n");
	RhinoApp().Print("Number of NUMA nodes: %d\n", numaNodeCount);
	RhinoApp().Print("Number of physical processor packages: %d\n", processorPackageCount);
	RhinoApp().Print("Number of processor cores: %d\n", processorCoreCount);
	RhinoApp().Print("Number of logical processors: %d\n", logicalProcessorCount);
	RhinoApp().Print("Number of processor L1/L2/L3 caches: %d/%d/%d\n", processorL1CacheCount, processorL2CacheCount, processorL3CacheCount);
	*/
	free(buffer);
	// Returning the logical processor count is ~10% faster for importing multiple serially read meshes.
	return logicalProcessorCount;
	// Not as fast:
	//return processorCoreCount;
}

Here is an example where I call it from my C++ code to set up the number of parallel reads of the .obj file when sampling it to find the start/stop locations of vertices, textures, normals and faces in the file.

	// Define lists for storing the locations of vertices, textures, normals and faces in the .obj file.
	int32_t npar = get_core_count();
	auto vmin = make_unique<uint64_t[]>(npar);  auto vmax = make_unique<uint64_t[]>(npar);
	auto vtmin = make_unique<uint64_t[]>(npar); auto vtmax = make_unique<uint64_t[]>(npar);
	auto vnmin = make_unique<uint64_t[]>(npar); auto vnmax = make_unique<uint64_t[]>(npar);
	auto fmin = make_unique<uint64_t[]>(npar);  auto fmax = make_unique<uint64_t[]>(npar);

My code then uses binary-chop searches to find the exact start/stop locations of the mesh elements, and then reads and parses them in parallel (24-way in my CPU's case). This way I am able to read the ~100 MB of data for each 2M-face mesh in about 30 ms.

My wife always tells me to work fuzzy to focused if I want to get things done, so I tried to follow her paradigm in developing this code. At first it has only an approximate idea of where the data is in the .obj file, but it takes only 100 ms to gain this knowledge. Then it finds the exact locations, which is very fast, a few ms, because it only has to search very small regions of the 4.7 GB file. Finally it reads and parses the data in parallel, since it now knows the exact locations. The alternative, just reading the file from the start, parsing every line and storing the data, is 100x slower.

This is the best trick I found for quickly reading the .obj file. Perhaps there are similar tricks to be discovered for accelerating the display of my big meshes.
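
To make the fuzzy step concrete, here is a minimal sketch (the function name and the assumption that the file is already read or mapped into memory are mine, not the actual DLL code): divide the file into npar byte ranges, then nudge each boundary forward to the next newline so every thread parses whole lines.

#include <cstdint>
#include <utility>
#include <vector>

// Split [0, size) into npar ranges aligned to '\n' so each range holds
// whole lines; 'data' is the .obj file contents in memory.
std::vector<std::pair<uint64_t, uint64_t>>
ChunkOnLineBoundaries(const char* data, uint64_t size, int npar)
{
	std::vector<std::pair<uint64_t, uint64_t>> chunks;
	uint64_t start = 0;
	for (int i = 1; i <= npar; ++i) {
		uint64_t end = (size * i) / npar;        // fuzzy: an approximate boundary
		while (end < size && data[end] != '\n')  // focused: align it to a line break
			++end;
		if (end < size)
			++end;                               // keep the newline in this chunk
		chunks.emplace_back(start, end);         // [start, end) goes to one thread
		start = end;
	}
	return chunks;
}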

Regards,
Terry.

I haven’t had a lot of luck with the VS performance analysis tool lately, as it has been pretty flaky in recent releases. Spending an hour getting a tooled build of Rhino, only to have it fail in the analysis tool, is pretty frustrating. I probably won’t be trying this until I switch to VS2022.