@stevebaer,
I modified the parallel.py code for my application but see no speedup with additional threads. Below is a subset of the code that demonstrates the problem. I removed two lines from the parallel.py code:
helper(pieces[0])
pieces = pieces[1:]
because in my code there is no setup work in the function that needs to run only once.
In my actual code I am coloring the 2.2M vertices of a mesh, so the line using exp(log(...)) is replaced with much longer code that examines the Z of each vertex and computes a color for that vertex. While the function is not the same, the result is: adding threads slowly adds to the time needed to complete all the calculations. Looking at the Resource Monitor I can see that when all threads are used, CPU utilization is 100% and the frequency, as shown by Task Manager, is at 100% = 4 GHz. So the slowdown is not due to thermal throttling (the CPU is water cooled and stays at 60 degrees C or less).

Running the code in parallel provides no speed benefit but increases the power draw 18X (my CPU has 18 cores). This is not what I wanted. Here is a view of the Resource Monitor showing 24 to 36 threads being used. The CPU activity continues to increase but the execution time does not improve. Be sure to click on the picture to see the full view.
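For reference, the per-vertex work is roughly of this shape. This is only a simplified stand-in for my longer coloring code; the names color_for_z, z_min, z_max and the two-color linear ramp are placeholders, not the actual implementation:

```python
# Simplified stand-in for the real coloring code: map a vertex Z value
# to an RGB tuple by linear interpolation between blue (low) and red (high).
# z_min, z_max, and the two-color ramp are placeholder assumptions.
def color_for_z(z, z_min, z_max):
    # Normalize the height into [0, 1] and clamp out-of-range values.
    t = (z - z_min) / float(z_max - z_min)
    t = max(0.0, min(1.0, t))
    # Blend from blue (0, 0, 255) at z_min to red (255, 0, 0) at z_max.
    return (int(round(255 * t)), 0, int(round(255 * (1.0 - t))))
```

The real code does this for each of the 2.2M vertices, which is why I expected the work to parallelize well.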
And here is a view showing 1 to 12 threads being used:
The CPU activity increases significantly with more threads, but the same number of computations is being done in about the same time. What in the world are the CPUs finding to do to keep themselves so busy?
Here is the threading code example:
from System import Environment
import System.Threading.Tasks as tasks
from time import time
from math import exp, log

printR = False
threads = Environment.ProcessorCount
print 'Threads available = ', threads

# get_items can be called in parallel.
def get_items(range_list):
    items = []
    for i in xrange(range_list[0], range_list[1]):
        # Still get ptZ for slope map to enable zMax limit on coloring.
        items.append(exp(log(exp(log(i + 1.)))))
    return items

def run(function, m):
    global threads, printR
    time1 = time()
    # Initialize pieces for holding range_list used by each thread.
    pieces = []
    # Find number of data items in each list needed in order to make threads number of range_list's.
    n = m // threads
    # Find any remainder.
    r = m % n
    # Find number of items in output results.
    rr = m // n
    # If there is a remainder, reduce m to prevent overrun.
    if r > 0: m -= n
    # Fill data list with range list and add to pieces to be processed.
    last_i = 0
    for i in xrange(n, m + n, n):
        # Make enumeration index k with values of 0,1,2,3,...
        k = i // n - 1
        # Add any remainder onto end of last range_list.
        if k == rr - 1: i += r
        # Define range of numbers.
        data = [last_i, i]
        # Save i for start of next range_list.
        last_i = i
        # Add list of enumeration index and range_list to pieces.
        pieces.append([k, data])
    if printR:
        for piece in pieces: print piece
    # Size results output list to hold rr entries.
    results = range(rr)
    if printR: print ' Time to setup pieces and results for parallel execution = ', time() - time1
    # Define helper procedure for feeding data to function and output to results list.
    def helper(piece):
        i, data = piece
        local_result = function(data)
        results[i] = local_result
    # Execute function in parallel using threads.
    tasks.Parallel.ForEach(pieces, helper)
    # Un-comment the next line and comment-out the line above to execute function sequentially.
    #for piece in pieces: helper(piece)
    if printR: print ' Time to run parallel tasks = ', time() - time1
    return results

num_points = 3000000
for num_threads in xrange(1, threads + 1):
    threads = num_threads
    time1 = time()
    itemss = run(get_items, num_points)
    print 'Time to load items lists using {} threads = {}'.format(threads, time() - time1)

"""
# Use this code to combine sub-lists into one list.
time1 = time()
all_items = []
for items in itemss:
    for item in items:
        all_items.append(item)
print 'Time to combine {} items lists into one list of {} items using 1 thread = {}'.format(threads, len(all_items), time() - time1)
"""
The output from the above code on my 18 core computer is:
Threads available = 36
Time to load items lists using 1 threads = 2.9946
Time to load items lists using 2 threads = 3.2071
Time to load items lists using 3 threads = 3.3686
Time to load items lists using 4 threads = 3.558
Time to load items lists using 5 threads = 3.663
Time to load items lists using 6 threads = 3.72
Time to load items lists using 7 threads = 3.7716
Time to load items lists using 8 threads = 3.8885
Time to load items lists using 9 threads = 3.924
Time to load items lists using 10 threads = 3.9836
Time to load items lists using 11 threads = 4.0232
Time to load items lists using 12 threads = 4.0669
Time to load items lists using 13 threads = 4.0995
Time to load items lists using 14 threads = 4.1049
Time to load items lists using 15 threads = 4.2463
Time to load items lists using 16 threads = 4.1852
Time to load items lists using 17 threads = 4.2158
Time to load items lists using 18 threads = 4.2758
Time to load items lists using 19 threads = 4.2225
Time to load items lists using 20 threads = 4.3516
Time to load items lists using 21 threads = 4.3416
Time to load items lists using 22 threads = 4.4082
Time to load items lists using 23 threads = 4.4222
Time to load items lists using 24 threads = 4.5297
Time to load items lists using 25 threads = 4.4754
Time to load items lists using 26 threads = 4.5431
Time to load items lists using 27 threads = 4.6073
Time to load items lists using 28 threads = 4.6128
Time to load items lists using 29 threads = 4.7737
Time to load items lists using 30 threads = 4.731
Time to load items lists using 31 threads = 4.82
Time to load items lists using 32 threads = 4.9496
Time to load items lists using 33 threads = 4.9331
Time to load items lists using 34 threads = 5.0725
Time to load items lists using 35 threads = 5.4179
Time to load items lists using 36 threads = 5.1221
I do not believe the slowdown is due to thread overhead, as this code example spends lots of time computing each piece. The compute time can be increased, and the relative thread overhead decreased, by making num_points larger than 3000000. But the outcome is exactly the same: a slowly increasing execution time as the number of threads is increased.
Would this work better if I wrote it in VB instead? Is the Python interpreter refusing to allow threads to execute in parallel because of the global interpreter lock (GIL)? I have read that this is the case with some Python implementations, but I have also read that IronPython has no GIL and can fully exploit multiprocessor systems.
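To frame the question: in CPython, the GIL serializes pure-Python bytecode, so a CPU-bound test like the sketch below (using the standard threading module and a made-up sum-of-squares work function, not my mesh code) shows no multi-thread speedup there no matter how many threads are started. My understanding is that IronPython has no GIL, which is why I expected Parallel.ForEach to scale:

```python
import threading

def burn(range_list, out, i):
    # CPU-bound work: sum of squares over the given half-open range.
    total = 0
    for v in range(range_list[0], range_list[1]):
        total += v * v
    out[i] = total

def run_threaded(m, num_threads):
    # Split [0, m) across num_threads threads and collect partial sums.
    n = m // num_threads
    out = [0] * num_threads
    ts = []
    for i in range(num_threads):
        start = i * n
        stop = m if i == num_threads - 1 else (i + 1) * n
        t = threading.Thread(target=burn, args=([start, stop], out, i))
        ts.append(t)
        t.start()
    for t in ts:
        t.join()
    return sum(out)
```

On CPython, timing run_threaded with 1 versus many threads shows essentially no improvement for a loop like this: the threads take turns holding the GIL, so they add overhead without running Python bytecode in parallel. Whether IronPython's threads behave differently here is exactly what I am trying to find out.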
I am trying to develop an app for processing the .obj files from drone maps to display 3D elevation and slope maps and to support interactive queries for cut-and-fill applications. I have gone over the code again and again to optimize its performance by developing more efficient algorithms and making extensive use of RhinoCommon calls instead of rhinoscriptsyntax. These improvements have reduced the time of operations by 10X to 100X, but many still require several minutes to complete. This slow performance seriously damages the interactive feel of the application. The performance is particularly bad for high-resolution models with 1 thread, but the potential for improvement is over 10X with today's affordable 16-18-core CPUs. My challenge is to get Rhino to harness this capability like you have done for the Raytraced view.
Regards,
Terry.