Performance of Parallel Grasshopper invocations in RhinoCompute

I’ve got a process being run inside a RhinoCompute endpoint via a plugin I wrote. One step of this process involves calling a Grasshopper definition I created, which takes a handful of numbers as input and generates Brep geometry as output. I need to invoke this definition multiple times, so I started looking at doing so on parallel threads instead of serially on a single thread, for performance.
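
For context, here’s roughly the shape of what I’m timing. `SolveDefinition` is just a hypothetical stand-in for the plugin code that loads the .gh file, pushes the inputs, and solves it, not the actual implementation; the logging mirrors the thread ID and millisecond numbers in the charts below:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class GhBenchmark
{
    // Hypothetical placeholder: loads the Grasshopper definition, pushes the
    // numeric inputs, solves it, and returns the resulting Breps.
    static object SolveDefinition(double[] inputs) => throw new NotImplementedException();

    public static void RunSerial(IEnumerable<double[]> jobs)
    {
        foreach (var inputs in jobs)
        {
            var sw = Stopwatch.StartNew();
            SolveDefinition(inputs);
            Console.WriteLine($"thread {Thread.CurrentThread.ManagedThreadId}: {sw.ElapsedMilliseconds} ms");
        }
    }

    public static void RunParallel(IEnumerable<double[]> jobs)
    {
        Parallel.ForEach(jobs, inputs =>
        {
            var sw = Stopwatch.StartNew();
            SolveDefinition(inputs);
            Console.WriteLine($"thread {Thread.CurrentThread.ManagedThreadId}: {sw.ElapsedMilliseconds} ms");
        });
    }
}
```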

Here’s an execution timeline for when I call my GH definitions in serial, one after another:
[image: execution timeline of the serial invocations]

Each line is one execution of the definition; the number at the far left of each line is the .NET managed thread ID, and the number at the far right is how long that invocation took in milliseconds.

Here is the same chart, but when the GH calls are done via Parallel.ForEach:
[image: execution timeline of the Parallel.ForEach invocations]

I’m guessing that since this entire process happens in the context of a single Compute instance, all the threads share the same “Rhino runtime instance”, so there is some sort of locking or resource contention occurring inside the Rhino runtime. The result is that the parallel invocations actually take about 10% longer on average, and the individual invocation times are less stable.
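
One experiment I can run to test the contention theory is to cap the degree of parallelism and watch how the average per-call time scales; something like the sketch below (`solveDefinition` is the same hypothetical stand-in as above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

static class ContentionProbe
{
    // Run the same batch of jobs at increasing degrees of parallelism. If the
    // average per-call time climbs as the cap rises, the invocations are
    // contending for something shared inside the runtime.
    public static void Run(IReadOnlyList<double[]> jobs, Action<double[]> solveDefinition)
    {
        foreach (var degree in new[] { 1, 2, 4, 8 })
        {
            var times = new ConcurrentBag<long>();
            var options = new ParallelOptions { MaxDegreeOfParallelism = degree };

            Parallel.ForEach(jobs, options, inputs =>
            {
                var sw = Stopwatch.StartNew();
                solveDefinition(inputs);   // same hypothetical stand-in as above
                times.Add(sw.ElapsedMilliseconds);
            });

            Console.WriteLine($"degree {degree}: avg {times.Average():F0} ms per call");
        }
    }
}
```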

Are these results a surprise to any of the good folks at McNeel? Might there be specific components I’m using in my Grasshopper definition which could cause these slowdowns? The GH definitions are not all that efficiently written right now: I got lazy and did some volumetric boolean operations where more pre-calculation could have kept things out of Brep-land, so I do have some options for shaving milliseconds inside them. Might there be some way to invoke GH, or some option I can set somewhere, which would give me better parallel performance? Or is GH just fundamentally not designed for this kind of usage?

In production, this system will have multiple Compute instances, so I could refactor it to farm the individual GH calls back out to the same cluster and get parallelization that way, but that adds network IO and some complexity around partitioning the workload to reduce the impact of the extra network calls. A rough sketch of what that fan-out would look like is below.
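
The fan-out itself would be simple enough; something like this, where the request body is only my guess at the shape the compute `/grasshopper` endpoint expects (I’d copy the real schema from the RhinoCompute client source), and the host name and definition path are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

static class ClusterFanOut
{
    // Hypothetical cluster address.
    static readonly HttpClient Http = new HttpClient { BaseAddress = new Uri("http://my-compute-cluster/") };

    static async Task<string> SolveRemoteAsync(double[] inputs)
    {
        // Assumed payload shape; the real field names/encoding should be
        // lifted from the RhinoCompute client rather than trusted from memory.
        var payload = new
        {
            pointer = "definitions/my-definition.gh",   // path to the .gh on the server (placeholder)
            values = inputs                             // however the endpoint wants the inputs encoded
        };
        var body = new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json");
        var response = await Http.PostAsync("grasshopper", body);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();   // serialized Brep geometry comes back in the response
    }

    public static async Task<string[]> SolveAllAsync(IEnumerable<double[]> jobs)
    {
        // Network IO parallelizes with plain async fan-out; batching several
        // jobs per request would cut down the number of round trips.
        var tasks = new List<Task<string>>();
        foreach (var inputs in jobs)
            tasks.Add(SolveRemoteAsync(inputs));
        return await Task.WhenAll(tasks);
    }
}
```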