CUDA language bindings – With modern C++

By Raymond Glover

In this tutorial I explain the basics of writing cross-platform CUDA-enabled C++ extensions for Python/Node.js/Java applications, and introduce kernelpp, a miniature framework for heterogeneous computing. The accompanying code for this tutorial is available here.


At the intersection of data and optimization, the ability to use hardware acceleration effectively is an increasingly important consideration for application developers. This is especially true in cloud computing, which is democratizing the data center, and with it the availability of dedicated compute appliances such as GPUs and FPGAs.

In parallel to this, C++ has become a de facto choice among application developers for delivering core components that target multiple underlying platforms. Whether for desktop, mobile, embedded or cloud, C++ is a common denominator that can run practically anywhere without compromising on performance or flexibility.

With C/C++ acting as the glue between data-intensive, performance-sensitive applications and the underlying hardware, it’s become a catalyst for heterogeneous computing.

Target Audience

This tutorial is aimed at those considering extending their Python/Node.js/Java applications with native, hardware-accelerated components, or adopting one of the many pre-existing libraries, but who don’t know where to start or what the landscape for such things looks like.

In it, we’ll construct a minimum working example of a native extension for these three popular languages, and identify important aspects and principles to keep in mind when building your own. I’ll be assuming you have a basic understanding of C++, and of at least one of the target languages.




Part 1: Native Extensions

Let’s outline what we’re going to build. Imagine we have an application, and a crucial part of it is some computationally expensive algorithm. We’ve benchmarked the application and determined that implementing this algorithm with a high-performance numerical library such as Eigen, cuBLAS or NPP would likely be a worthwhile investment.

Our aim will be to construct this as a native extension, balancing performance, portability, and productivity. For lack of a better name, we’ll call it mwe (minimum working examples). Like most native extensions, there are three components of it to consider:

[Figure: the three components of a native extension – loader, extension and binding]

1. Loader – The language-specific interface to our extension. At a minimum, it’ll be responsible for finding and loading the extension itself. Whilst we want our extension to have a common API across all target languages, we also want it to feel idiomatic and ergonomic within the context of each. The loader is responsible for this.

2. Extension (mwe) – The core implementation of our algorithm(s), shared across target languages.

3. Binding – Acts as the interface between the extension and the target language runtime, describing the native extension through the target language’s type system. In reality this will be a shared library (e.g. a .dll or .so file) accessible to the loader.
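
To make the second of these components more concrete, below is a hypothetical public header for mwe. The function name and signature are placeholders for illustration only, not the interface we’ll actually build in part 2:

    // include/mwe.h (hypothetical sketch)
    #pragma once
    #include <vector>

    namespace mwe {
        // The computationally expensive routine we want to accelerate.
        // The implementation may run on the CPU, or dispatch to the GPU
        // when suitable hardware is available.
        std::vector<float> run(const std::vector<float>& input);
    }

The loader and binding exist only to make a function like this callable, and pleasant to call, from each target language.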


Possible approaches

Before we get started, I’ll skim over some common methods by which native libraries can be integrated into other languages. I think you can group them into one of three categories:

1. Raw C – Most modern languages provide a C interoperability layer or foreign function interface (FFI). For all but the most trivial APIs, a C-based integration would be both verbose and error-prone, and even if we wrote one, we’d find it hard to resist reinventing one of the other two approaches as we did so (a sketch of what such a surface looks like follows this list). The additional complexity of supporting multiple versions of the same language can also be extremely burdensome; we’d want to avoid, or at least isolate, such complexities.

2. IDL bridges and generators – Popular examples include SWIG and the more nascent djinni. These bindings aim to abstract away the idiosyncrasies of individual languages by providing an interface definition language (IDL) with which to define data structures and operations at the language boundary. The binding will try to figure out the rest, including how to marshal data across this boundary, during an automated generation phase. As you’d imagine, the trade-off for this ease of use is a loss of flexibility and the internal complexity of such generators.

3. IDL-like C++ wrappers – IDL-like wrappers are a middle ground between the first two approaches; by using techniques like metaprogramming and compile-time introspection (both prominent features of C++), we can aim to build a language-specific wrapping mechanism that avoids the opaque abstractions in 2 and the boilerplate and incidental complexity of 1. Popular examples include boost.python and nan.
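
To illustrate the first approach, here is a sketch of the kind of C surface we’d end up writing for an extension like mwe; the names (mwe_context, mwe_run and so on) are hypothetical:

    #include <cstddef>

    // Every C++ type has to be flattened into C types, object lifetimes
    // are managed by hand, and errors travel as integer codes rather
    // than exceptions.
    extern "C" {
        typedef struct mwe_context mwe_context;   // opaque handle

        mwe_context* mwe_create(void);
        int          mwe_run(mwe_context* ctx, const float* data, std::size_t len);
        void         mwe_destroy(mwe_context* ctx);
    }

Each target language would then need its own FFI declarations (ctypes in Python, node-ffi in Node.js, JNA in Java) kept in sync with this header by hand.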

For a relatively simple extension like mwe, we could consider the convenience of method 2. However, it’s not unrealistic to expect real-world extensions to eventually encounter shortcomings with this approach, in particular when expressing parts of an API with idioms specific to a language, or idioms that differ slightly between languages; e.g. asynchronous callbacks, exceptions, or the nuances of garbage collection.

This seems somewhat unavoidable; if you’re using a mechanism that aims to hide the differences between two programming languages, then you also lose the ability to consider either one in isolation. To compensate, higher-level concepts are introduced in the binding itself, such as the typemaps or features mechanisms in SWIG. However, this added flexibility introduces a new layer of complexity not readily understood by non-experts.

Instead, we’ll use method 3, which I believe to be the most pragmatic. By exploiting the features of C++11/14, we’ll build three lightweight wrappers around mwe (one per language) in an IDL-like abstraction. This also lets us enforce a strict separation of concerns between bindings, which should help keep each one simpler. We’ll use three binding libraries that align with these aims:

  • pybind11 (Python) – Seamless operability between C++11 and Python
  • v8pp (node.js) – Bind C++ functions and classes into V8 JavaScript engine
  • jni.hpp (Java) – A modern, type-safe, header-only, C++14 wrapper for JNI
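
As a taste of what this third approach looks like in practice, here is a minimal pybind11 binding (assuming pybind11 2.2 or later, which provides the PYBIND11_MODULE macro). The add function is a stand-in for whatever mwe will eventually expose:

    #include <pybind11/pybind11.h>

    // A stand-in for an mwe entry point.
    int add(int a, int b) { return a + b; }

    // Defines the Python module "mwe"; pybind11 deduces the argument and
    // return types from the C++ signature of add.
    PYBIND11_MODULE(mwe, m) {
        m.doc() = "minimum working example";
        m.def("add", &add, "Add two integers");
    }

Compiled into a shared library of the same name, this is directly importable from Python (import mwe; mwe.add(1, 2)); v8pp and jni.hpp play an analogous role for V8 and the JVM.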


The build system

In recent years, the tooling to develop complex cross-platform applications has significantly matured. Tools like Bazel, Buck and CMake orchestrate the building, testing, packaging and deployment for a variety of platforms and toolchains. The oldest and probably most widely used of these is CMake.

CMake is unusual (but not unique) in that it’s really a meta-build system used to generate a build environment, rather than the target build artifacts (executables and so on). So for instance, on Windows, CMake can be used to generate a Visual Studio solution (an .sln file), whilst on Linux it’s usually used to generate a Make-based project (a Makefile). CMake has support for many other build systems. Furthermore, the latest versions of some IDEs and other productivity tools now have official support for CMake, making it available to a wider set of developers.
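
For instance, from an empty build directory, cmake -G "Visual Studio 14 2015" .. asks CMake to emit a Visual Studio 2015 solution, whereas cmake -G "Unix Makefiles" .. (the default on most Linux systems) emits a Makefile; a subsequent cmake --build . then delegates to whichever build system was generated.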



🔨  Step 1 – Tools & Code

  1. Install CMake.
  2. You’ll also need a recent C++ compiler compatible with CUDA 8, for example Visual Studio 2015, gcc 5.3 or Xcode 7 for Windows, Linux and Mac respectively.
  3. Using Git, clone the repository with submodules:
     git clone https://github.com/rayglover-ibm/cuda-bindings.git --recursive
    


CMake has first-class support for C, C++, Objective-C and Fortran compilers. Extending CMake to support other languages is certainly possible (and in some cases preferable). That being said, it’s not the go-to tool for building and packaging everything. Integrations with Apache Maven (a Java build and package manager) and Gradle (Android’s integrated build system) can configure and drive CMake builds; this is preferable when building complex packages for their respective platforms (e.g. .apk packages for Android), even if it sounds less convenient at first glance.


Project structure

Here is what the complete directory structure for mwe looks like:

.
├───include  . . . . . . . . . . . . . . . .  (1)
├───src  . . . . . . . . . . . . . . . . . .  (2) 
├───cmake  . . . . . . . . . . . . . . . . .  (4)
├───third_party
│    ├───kernelpp  . . . . . . . . . . . . .  (3)
│    └───gsl_lite  . . . . . . . . . . . . .  (3)
│
└───bindings
    ├───nodejs . . . . . . . . . . . . . . .  (5)
    │   ├───mwe
    │   └───third_party
    │       ├───node-cmake . . . . . . . . .  (3)
    │       └───v8pp . . . . . . . . . . . .  (3)
    │
    ├───python . . . . . . . . . . . . . . .  (5)
    │   ├───mwe
    │   └───thirdparty
    │       └───pybind11 . . . . . . . . . .  (3)
    │
    └───java   . . . . . . . . . . . . . . .  (5)
        ├───src
        │   ├───main
        │   └───test
        └───third_party
            └───jni.hpp  . . . . . . . . . .  (3)
  1. Contains the public interface for mwe
  2. Core implementation of mwe
  3. Submodule
  4. Build helpers
  5. Binding implementations

The core implementation resides in src, and each individual binding resides within bindings. Third-party libraries are placed as descendants of their dependent component(s).

Each binding is structured in a way considered idiomatic in the respective language. This is relevant for languages that require packages, modules, or source files to be arranged in a certain way in the file system. We’ll discuss this in more detail in part 3.


In part 2, we implement our mwe microlibrary.