How to use GPU in my C++ DLL called from Python script

@dale @stevebaer

Currently I am using Python with calls to a C++ DLL to accelerate execution. But even taking advantage of the 24 cores running at 5 GHz in my CPU, the parallel-programmed C++ code is still quite slow, taking over 20 seconds for one operation (finding intersecting faces in a mesh). If I could run this operation on my NVIDIA 3080 Ti GPU, which has over 10,000 CUDA cores running at around 1.5 GHz, then I might get close to a 100X speedup (10,000/24 * 1.5 GHz/5 GHz).

What is the best way to add support for CUDA code to my C++ DLL in Microsoft Visual Studio? I used a guide written by @dale to include Rhino support in the DLL, so I think I need to start from there in order to add support for CUDA code. On the other hand, I see that NVIDIA recommends Nsight Visual Studio Edition Version 2022.4 to bring GPU computing into Microsoft Visual Studio.

If I start with Nsight, what are the steps to add Rhino support? Also, which version of Nsight should be used, 2022.4 or an older one? The last time I tried to use Microsoft Visual Studio 2022 with Rhino, it had some errors because Rhino did not yet fully support it. Thus I am concerned about which version of Nsight to use.

My Python script calls the C++ DLL to accelerate slow operations. This has been working fine for the last several years, providing over 100X speedup for some operations.

Now I want to take the next step and exploit my GPU for further speedup.

Regards,
Terry.

You don’t have to install Nsight, but it probably can’t hurt either. I hardly ever use Nsight when working on the Cycles kernels for CUDA/OptiX, which are used in Rhino 7 Raytraced and Rhino Render.

You need the CUDA Toolkit installed so that the code you write for the GPU can be compiled to PTX. I imagine there are getting-started tutorials for CUDA programming out there. Those probably go through the steps required to get up and running with this.

Typically you’ll do something like the following (a rough sketch follows the list):

  1. write code to run on the GPU
  2. compile it to the correct format - PTX is good, since it will allow the GPU drivers to build the final version
  3. write code that sets up your kernel
  4. write code that provides data to your kernel
  5. write code to retrieve results created by your kernel
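
For illustration, here is a minimal, untested sketch of those steps using the CUDA driver API. The PTX file name vecadd.ptx and the kernel name addKernel are placeholders, error checking is omitted for brevity, and the program links against cuda.lib:

#include <cuda.h>    // CUDA driver API
#include <cstdio>
#include <vector>

int main()
{
    const int n = 256;                      // one block of 256 threads below
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Steps 1-2 happen offline: the kernel source is compiled to PTX,
    // e.g. nvcc -ptx kernel.cu -o vecadd.ptx. The kernel should be declared
    // extern "C" in the .cu file so its name is not mangled.
    // Step 3: load the module and look up the kernel by name.
    CUmodule mod;   cuModuleLoad(&mod, "vecadd.ptx");
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "addKernel");

    // Step 4: allocate device buffers and copy the input data over.
    CUdeviceptr dA, dB, dC;
    cuMemAlloc(&dA, n * sizeof(float));
    cuMemAlloc(&dB, n * sizeof(float));
    cuMemAlloc(&dC, n * sizeof(float));
    cuMemcpyHtoD(dA, a.data(), n * sizeof(float));
    cuMemcpyHtoD(dB, b.data(), n * sizeof(float));

    // Launch the kernel, one thread per element.
    void* args[] = { &dC, &dA, &dB };
    cuLaunchKernel(fn, 1, 1, 1, n, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    // Step 5: copy the result back to the host.
    cuMemcpyDtoH(c.data(), dC, n * sizeof(float));
    printf("c[0] = %f\n", c[0]);

    cuMemFree(dA); cuMemFree(dB); cuMemFree(dC);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}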

Nathan,

This is quite different from what I do now.

I create a C++ DLL using Visual Studio following the steps outlined by Dale.
I call C++ functions in the DLL from my Python script using the ctypes interface to pass/receive values.

I do not understand how this flow connects with the one you describe.

I would need a detailed working example from you that I can reproduce in order to master your procedure.

I created a detailed working example of how to create a C++ DLL which can be called from Python in one of my prior posts. It is derived from Dale’s outline but includes a fully working example to help lead the user more quickly to success. Below is the link to my post.

So I am looking for a bit more help if you can provide it.

@dale @stevebaer can either of you pitch in here?

Regards,
Terry.

I don’t have any experience with this, sorry.

– Dale

@stevebaer

Say Steve, what’s your take on this?

Regards,
Terry.

You state you want to utilize the GPU to gain more performance for actions like finding intersections.

Now, to use the GPU from your C++ DLL you need to use the CUDA Toolkit.

So you prepare data in your Python script. You pass it on to your C++ function in your DLL. Your C++ function passes that on to your GPU kernel implementation.
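
A rough sketch of that hand-off (the function names here are made up purely for illustration; the Python side would load the DLL with ctypes and pass its arrays as pointers):

#include "cuda_runtime.h"   // for cudaError_t / cudaSuccess

// Assumed to live in a .cu file compiled by nvcc and linked into the DLL;
// it would allocate device memory, launch the kernel, and copy results back.
extern "C" cudaError_t launchAddKernel(float* c, const float* a,
                                       const float* b, unsigned int size);

// Exported from the C++ DLL; a Python script can call this via ctypes,
// exactly as it already calls the other exported functions.
extern "C" __declspec(dllexport)
int addVectorsOnGpu(float* c, const float* a, const float* b, unsigned int size)
{
    // Forward the host pointers; the CUDA side does the device work.
    return (launchAddKernel(c, a, b, size) == cudaSuccess) ? 0 : 1;
}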

To gain a better understanding of GPU programming using CUDA I suggest reading through:

Probably the driver API documentation, which covers how to use the API to load code you’ve written and compiled for the GPU, how to prepare data to pass into the kernel, etc.


Nathan’s suggestion is pretty much exactly the same as what I would recommend. You would need to use the CUDA Toolkit, and the NVIDIA guides are a good place to start.

@nathanletwory

I created a CUDA project using the CUDA 12.0 Runtime template:

Using this template creates a project that already contains some working CUDA code in Kernel.cu

The included C++ CUDA example code Kernel.cu is simple; it adds 2 arrays. Most of the code is shown below.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);
    
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors a and b from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

After a build of this CUDA project there are the following files in the Release directory:

Kernel.cu happens to include a main() section so it can be run by itself. But the function of interest for calling from a C++ DLL is:

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

This is what I want to call from my C++ DLL code. Once I get this working, I should be able to expand this solution to add many other procedures to Kernel.cu and call these from my C++ DLL. Of course I will need to carefully consult the CUDA guide:
CUDA C++ Programming Guide (nvidia.com)
in order to properly pass values and set up the number of Threads per Block, Blocks per Grid and possibly Thread Block Clusters.
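
As a point of reference, the usual Threads per Block / Blocks per Grid bookkeeping looks roughly like this (a generic sketch with a placeholder kernel, not code from the guide):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// Placeholder kernel: one thread per element, with a bounds check because
// the last block may be only partly used.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void launchScale(float* devData, int n, float factor)
{
    const int threadsPerBlock = 256;   // a common starting point
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(devData, n, factor);
}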

Currently the header in my C++ DLL code, which supports Rhino operations, looks like this:

#include "StdAfx.h"
#include "rhinoSdkPlugInDeclare.h"
//#include "SampleImportGeomviewPlugIn.h"
#include "Resource.h"

#include <string>
#include <cstdlib>
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <windows.h>

#include <sstream>      // std::istringstream

//#include <winsock2.h>
#include <ppl.h>
//#include <concurrent_vector.h>
#include <array>
#include "stdafx.h"
#include <chrono>
#include <math.h>
#include <list>
#include <queue>

#include <unordered_set>

#include <thread> // sleep_for
//#include <map>
//#include <unordered_map>
#include <filesystem>
#include <sys/stat.h>
#include <wchar.h>
//#include <random>
#include <cstdlib>
#include <tuple>
#include <set>

using namespace std::this_thread;     // sleep_for, sleep_until
using namespace std::chrono_literals; // ns, us, ms, s, h, etc.
#define COMPILER_SUPPORTS_128_BIT_INTEGERS 1

//#include <vector>
//#include <tuple>
//#include <algorithm>
//#include <iostream>
#include <locale.h>

// The plug-in object must be constructed before any plug-in classes derived
// from CRhinoCommand. The #pragma init_seg(lib) ensures that this happens.
#pragma warning( push )
#pragma warning( disable : 4073 )
#pragma init_seg( lib )
#pragma warning( pop )

using namespace concurrency;
using namespace std;

#define DLLEXPORT extern "C" __declspec(dllexport)
#define UCLASS()
#define GENERATED_BODY()
#define UFUNCTION()

What do I need to add to this header in order to be able to call the addWithCuda procedure in the CUDA code compiled into the Kernel.cu OBJ file? Or is there something else that needs to be changed so that my C++ DLL code can call the CUDA procedure?
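
(For context, a minimal sketch of one plausible wiring, offered as an assumption rather than a verified answer: make the prototype visible to the DLL code and link the nvcc-built Kernel.cu object file plus cudart.lib into the DLL project.)

#include "cuda_runtime.h"   // for cudaError_t

// Same prototype as in Kernel.cu; the object file that nvcc produces for
// Kernel.cu, plus cudart.lib, would then need to be added to the DLL's
// linker inputs so that this symbol resolves.
cudaError_t addWithCuda(int* c, const int* a, const int* b, unsigned int size);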

Lastly, in my C++ DLL would I call addWithCuda using:

    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

just as shown in the main() section of the Kernel.cu code?

Regards,
Terry.

Here is another attempt to call a CUDA function from a C++ DLL.

I created a new CUDA 12.0 Runtime project called cudaSuperProjector. Then I followed these steps:

  1. A header for the CUDA code was created, called cudaSuperProjector.h:
class __declspec(dllexport) cudaSuperProjector {
public:
    cudaSuperProjector() { }
    ~cudaSuperProjector() { }
    void sumVectors(float* c, float* a, float* b, int N);
};
  2. The code for this project is in cudaSuperProjector.cu:
#include <stdio.h>

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include "cudaSuperProjector.h"

__global__ void addKernel(float* c, const float* a, const float* b) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(float* c, const float* a, const float* b, unsigned int size) {
    float* dev_a = 0;
    float* dev_b = 0;
    float* dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(float));

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernel <<<1, size >>> (dev_c, dev_a, dev_b);
    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(float), cudaMemcpyDeviceToHost);
    return cudaStatus;
}

void cudaSuperProjector::sumVectors(float* c, float* a, float* b, int N) {
    cudaError_t cudaStatus = addWithCuda(c, a, b, N);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSuperProjector::sumVectors failed!");
    }
}
  3. Then I went to Properties of the project and in General set the value of Configuration Type to Dynamic Library (.dll). Now everything for creating a library is ready. I built this project and in the x64\Release folder I found cudaSuperProjector.dll and cudaSuperProjector.lib. Then I created a directory C:\Users\Terry\source\repos\cudaSuperProjector\lib and in it I placed a copy of cudaSuperProjector.dll and cudaSuperProjector.lib from the x64\Release folder. Also I created a directory C:\Users\Terry\source\repos\cudaSuperProjector\include and copied over cudaSuperProjector.h from the x64\Release folder.

  4. Then a separate C++ DLL project called SuperProjector was created to call the sumVectors CUDA function in cudaSuperProjector.dll. In this project a C++ file called SuperProjector.cpp was added:

#include <stdio.h>

#include "C:\Users\Terry\source\repos\cudaSuperProjector\cudaSuperProjector.h"

int main(int argc, char** argv) {

    float a[6] = { 0, 1, 2, 3, 4, 5 };
    float b[6] = { 1, 2, 3, 4, 5, 6 };
    float c[6] = { };

    cudaSuperProjector csp;
    csp.sumVectors(c, a, b, 6);
    printf("c = {%f, %f, %f, %f, %f, %f}\n",
        c[0], c[1], c[2], c[3], c[4], c[5]);

    return 0;
}
  5. In Properties of the project SuperProjector, under VC++ Directories -> Library Directories, I added the path to the cudaSuperProjector DLL and LIB files: C:\Users\Terry\source\repos\cudaSuperProjector\lib;
    Then in VC++ Directories -> Include Directories I added the path to the cudaSuperProjector header: C:\Users\Terry\source\repos\cudaSuperProjector\include;
    With these changes the SuperProjector Properties under VC++ Directories look like this:

Then I went to Linker -> Input and added C:\Users\Terry\source\repos\cudaSuperProjector\lib\cudaSuperProjector.lib.
With this change, the SuperProjector Properties under Linker -> Input look like this:

  6. The project SuperProjector builds a DLL without error. But when I go to run it I get the error:
    Unable to start program because C:\Users\Terry\source\repos\SuperProjector\x64\Release\SuperProjector.dll is not a valid Win32 application.

Any ideas?

Regards,
Terry.

The way it works for Cycles, the raytracing engine we use, is that there is code that compiles to a PTX file separately from the main executable.

We use the driver API to load the PTX file, find the kernel(s) defined in it, and finally execute them.

This is laid out in the second link from my previous post.

I just finished updating my last post. So maybe you could take another look at it?

I read through the contents of both your links and see what they are describing. Translating this into working code is the trick. I am very close now with the simple example I have given above. I don’t know if you have time to copy the files and give it a try.

I will look over the driver_api link information again to see if it can help.

Regards,
Terry.

I am currently working on my macOS M1 laptop.

I don’t know about the vcxproj setup, since my build process has been hand-crafted: I build the Cycles kernels via a batch file. The entry batch file is at build_all_cubins.bat, which creates .cubin and .ptx files; the actual compile command in that set of batch files is here. The CUDA source code entry file is at cycles/kernel.cu at master_rhino-7.0_20200203_tmp · mcneel/cycles · GitHub.

The detection of either a .cubin file or the .ptx file is done here. This is then loaded as a module here.

An actual path trace pass starts here with getting the path trace kernel, followed by setting up data and finally launching the kernel.

Couldn’t you just port the relevant portions of your C++ code to shaders, maybe compute shaders, which are what GPUs typically process?
You wouldn’t even need CUDA for this. I’ve only dabbled a little in writing OpenGL shaders for visual purposes, but there’s usually a vertex and a fragment shader that you pass data to from C++. The vertex shader takes care of the point maths (it does lots of operations in parallel), and the fragment shader takes care of the pixels that get rasterized. I don’t know if you need the latter part if you only want to do data processing. Keep in mind that I’m not a professional like the other guys here.

as for the error in your step #6, it just means that you cannot start a .dll for debugging, and must instead start some executable that loads it, which you do in the project properties > debugger


this question comes up often when people try to make a rhino plugin, and forget to set the path to the rhino executable here, instead of the plugin .rhp they are building

in your case though, your plugin is probably already set to start rhino for debugging, and you just need to choose it as the “Startup Project”, instead of your cuda .dll project – either way, the point is just that you need to have the debugger start rhino, instead of trying to directly execute a .dll