Effect of Hyperthreading on Render Speed

Hello everyone, this is gonna be a long read so bear with me.

My boss recently upgraded his desktop from two Sandy Bridge Xeon E5-2609s to two Sandy Bridge Xeon E5-2687Ws, which is an upgrade from 4c/4t per CPU (the E5-2609 is not hyperthreaded) to 8c/16t per CPU. Since both CPUs are Sandy Bridge and he is doubling the total number of cores in his system, we expected that running the E5-2687Ws with hyperthreading (HT) disabled would give roughly double the render speed in Flamingo, which it did, and he was quite satisfied with that.

We then tried turning on hyperthreading, expecting a further increase in performance, and were surprised to see that performance actually dropped by about 8%. I should note here that the all-core boost clock stayed the same whether hyperthreading was on or off, so as far as I can tell it wasn't additional heat throttling the CPU; CPU temps never showed above 80 °C. We were both rather confused as to how hyperthreading could reduce the overall performance of the system when the CPU clock remained the same. You would think that doubling the total number of threads would result in an increase in performance. So I've been researching hyperthreading, SMT, and CPU caches for the past 12 hours or so and have come up with a theory, and I would like some volunteers to help me test it.

I think this performance discrepancy has a lot to do with how much cache is baked onto the CPU.

So from what I’ve been able to understand, CPUs have three levels of cache: L1, L2, and L3. The actual processing cores can only access the L1 cache directly. If a core needs data or instructions that aren’t in its L1 cache, the L1 will look upstream and ask the L2 cache whether it has them; if the L2 doesn’t, it looks further upstream at the L3 cache. These levels decrease in speed and increase in size as you go up the ladder: L1 is the smallest and fastest, L3 is the largest and slowest (comparatively; it’s still miles faster than DRAM). L1 and L2 cache are typically tied to a single core and cannot be accessed by other cores, while L3 cache is typically shared among a subset of cores, or all of them.
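If anyone wants to see those cache levels show up directly, a rough little C++ sketch like the one below should do it: it just sums a buffer whose size grows past each cache level, and the measured throughput should drop each time the working set spills out of L1, then L2, then L3. The sizes and pass counts here are just illustrative guesses, nothing to do with Flamingo itself.

```cpp
// cache_levels.cpp -- rough sketch: sum a buffer of increasing size and watch
// throughput drop as the working set spills out of L1, then L2, then L3.
// Buffer sizes below are illustrative. Build with optimizations, e.g.
//   g++ -O2 cache_levels.cpp -o cache_levels
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Working-set sizes straddling typical L1 (32-64 KB), L2 (256 KB - 1 MB)
    // and L3 (tens of MB) capacities.
    const std::size_t sizesKB[] = {16, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536};

    for (std::size_t kb : sizesKB) {
        const std::size_t n = kb * 1024 / sizeof(std::uint64_t);
        std::vector<std::uint64_t> buf(n, 1);

        // Sweep the buffer many times so the timing is dominated by
        // cache/memory traffic rather than allocation or loop startup.
        const int passes = 200;
        volatile std::uint64_t sink = 0;  // keep the compiler from deleting the loop

        auto t0 = std::chrono::steady_clock::now();
        for (int p = 0; p < passes; ++p) {
            std::uint64_t sum = 0;
            for (std::size_t i = 0; i < n; ++i) sum += buf[i];
            sink += sum;
        }
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        double gbPerSec = (double)kb / (1024.0 * 1024.0) * passes / seconds;
        std::printf("%8zu KB : %6.2f GB/s\n", kb, gbPerSec);
        (void)sink;
    }
    return 0;
}
```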

So, in a CPU with simultaneous multithreading (or hyperthreading, as Intel’s marketing calls it), each physical core runs two hardware threads that can execute instructions simultaneously, but both threads share the L1 and L2 cache assigned to that core. So, if one thread has a lot of data to process, it will fill up the cache and leave little space for the second thread to work with. My guess is that the computations Flamingo does fill up the entire cache and leave little, if anything, for the other thread. So instead of 32 threads (in my boss’s case) being processed simultaneously, there is really only enough cache for 16, and the CPU ends up swapping between threads, so we actually lose performance to the overhead of splitting the tasks and combining the results.
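I haven’t found a clean way to prove the sibling-thread contention part, but here is a rough sketch of the kind of test I have in mind: pin two memory-heavy threads onto the two logical CPUs of one physical core, then onto two different physical cores, and compare the runtimes. This is Linux-only (pthread affinity), and the CPU numbers 0/16 vs 0/1 are just my assumption about the topology; the real hyperthread pairs would need to be checked first with something like `lscpu -e`.

```cpp
// smt_contention.cpp -- rough sketch (Linux only): run two memory-heavy threads
// either on the two hyperthreads of one physical core, or on two separate
// physical cores, and compare the combined runtime. The CPU numbers used below
// are assumptions about the topology -- check real sibling pairs with `lscpu -e`.
//   g++ -O2 -pthread smt_contention.cpp -o smt_contention
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

static void pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Pinning happens shortly after the thread starts; good enough for a
    // long-running loop like the one below.
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Each worker repeatedly sums a buffer sized to roughly fill a 256 KB L2,
// so two workers on the same physical core should fight over that cache.
static void worker(std::uint64_t* out) {
    const std::size_t n = 256 * 1024 / sizeof(std::uint64_t);
    std::vector<std::uint64_t> buf(n, 1);
    std::uint64_t sum = 0;
    for (int pass = 0; pass < 50000; ++pass)
        for (std::size_t i = 0; i < n; ++i) sum += buf[i];
    *out = sum;
}

static double run_pair(int cpuA, int cpuB) {
    std::uint64_t a = 0, b = 0;
    auto t0 = std::chrono::steady_clock::now();
    std::thread ta(worker, &a), tb(worker, &b);
    pin_to_cpu(ta, cpuA);
    pin_to_cpu(tb, cpuB);
    ta.join();
    tb.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    // Assumed topology: CPUs 0 and 16 are the two hyperthreads of core 0,
    // while CPUs 0 and 1 are two different physical cores. Adjust as needed.
    std::printf("same physical core : %.2f s\n", run_pair(0, 16));
    std::printf("different cores    : %.2f s\n", run_pair(0, 1));
    return 0;
}
```

If my theory is right, the "same physical core" case should take noticeably longer than the "different cores" case, because the two threads keep evicting each other’s data from the shared L1/L2.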

So my boss’s current CPUs have 64 KB of L1 and 256 KB of L2 cache per core, and 20 MB of L3 cache shared between all 8 cores on a CPU. With SMT on, the two threads on each core have to share that L1 and L2, or one thread simply uses it all while the other sits idle. So if one Flamingo thread fills the L1 and L2 with its data, a second Flamingo thread on the same core can’t fit in that cache. I think that’s why, when we enabled hyperthreading, ran a render, and looked at CPU usage in Task Manager, half the threads were barely being used, along with the overall performance decrease.
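Put in numbers, the per-thread cache budget roughly halves when both hyperthreads of a core are busy. This little back-of-the-envelope (just the E5-2687W figures from above, nothing measured) is all I’m basing that on:

```cpp
// cache_budget.cpp -- back-of-the-envelope: per-thread share of the E5-2687W's
// caches with one vs two busy threads per core (figures from the spec sheet).
#include <cstdio>

int main() {
    const double l1KB = 64, l2KB = 256;   // per physical core
    const double l3KB = 20 * 1024;        // shared by all 8 cores on the die
    for (int threadsPerCore : {1, 2}) {
        std::printf("%d thread(s)/core: ~%.0f KB L1, ~%.0f KB L2, ~%.0f KB L3 per thread\n",
                    threadsPerCore, l1KB / threadsPerCore, l2KB / threadsPerCore,
                    l3KB / (8.0 * threadsPerCore));
    }
    return 0;
}
```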

Now, when I went home and ran a render on my 12-core system with SMT both on and off, I got a fairly significant difference. With SMT on I got 23M rays/sec, and without SMT I only got 18M in the same scene, which works out to roughly a 28% improvement. I think this is likely because my AMD Ryzen 9 3900X has 64 KB of L1 and 512 KB of L2 per core, along with 64 MB of L3 cache, of which each core shares 16 MB with two other cores. I think the larger L2 and L3 caches are the main drivers of the better SMT performance on my machine, because now the cache has enough room to run two threads of Flamingo per core.

This also seems to hold true for newer Intel CPUs: I have a laptop with an i7-8850H, which is 6c/12t, and I saw the same slight performance decrease with SMT on vs off. The L1 and L2 cache per core is the same as on the Sandy Bridge Xeons.

So it seems from my results that the difference in SMT performance is likely due to cache size, but I don’t know whether it’s specifically the L2 cache size or the L3. I would therefore like to ask: if anyone here is running a Skylake-X or newer Intel HEDT processor (those have 1 MB of L2 cache per core), could you please render any scene you like in Flamingo with SMT on vs off and report back whether you see any significant performance difference?

For now, the conclusion I have come up with is that if you have a lower-tier Intel CPU (consumer Core i3, i5, i7, or i9, or a low-end Xeon) and you do a lot of rendering in Flamingo, you will likely get better performance by disabling hyperthreading. AMD seems to get a performance benefit from SMT, so I would leave it on if you have a modern AMD Zen CPU. If anyone here is very knowledgeable about computer architectures, I’d also like to know whether my thoughts are completely wrong or at least somewhat understandable.

Hi @tnielsen,

Welcome to Discourse!

I can assure you that nobody here (at McNeel) is researching processor architecture or assigning grades to CPUs, etc.

That said, you might be interested in V-Ray’s benchmark and what they deem the best CPU you can buy for rendering with their tool:

https://benchmark.chaosgroup.com/next/cpu

– Dale

Unfortunately, it seems that V-Ray scales much better than Flamingo does: some of the CPU tests for the Sandy Bridge E5-2687W show a 30% performance uplift in V-Ray when using hyperthreading, versus the roughly 8% real-world performance drop we saw in Flamingo. Since I’m trying to find the best configuration for Flamingo and Flamingo only, I don’t think V-Ray will give me any useful information.

@nathanletwory Sounds interesting. Does this bring anything new to your thinking about Cycles? Or is Cycles already designed to automatically work well with CPU hyperthreading regardless of cache size like Vray apparently is?