CUDA language bindings – Part 3: Bindings
As I mentioned earlier, bindings reside in `bindings/<lang>` directories, and the structure within each is specific to the target language's canonical representation of a package or module. The implementation of the bindings, however, follows a common template:
- From the target language's perspective, each native module is named `binding` and is subsequently consumed by a loader called `mwe`. This encapsulation allows us to:
  - Hide the incidental complexity of searching for the binding on the file system and loading it into the runtime.
  - Make it convenient (and more productive) to write language-specific utilities in the target language itself.
  - Make it easier to refine the public interface to the binding on a per-target-language basis. This could, for example, make it easier for users to integrate your native library into third-party libraries (e.g. NumPy in Python).
- When you have multiple target languages, it can become burdensome to maintain high-quality documentation for each as your module evolves. To make this easier, each binding has a collection of integration tests, some of which should be simple enough to be understood by users of the binding and serve as minimal working examples.
- With CMake we can configure, build and test the bindings in an automated way that fits a single workflow. In our case, each build configuration is defined in `bindings/<lang>/CMakeLists.txt`.
Python
Since our single operation so far deals only with `int` primitives (and no objects per se), and a 32-bit integer is a natively supported type in most languages, the bindings should be relatively trivial. This is the case for Python.
However, since our kernel operation returns `maybe<int>` instead of `int`, we need to either implement `maybe<T>` or find a substitute in each language. If our target language were, say, F# (which supports discriminated unions), Haskell (where Maybe and Either are built-in types) or Go (which doesn't support exceptions), then it would feel natural (and occasionally preferable) to use these analogous types and mechanisms (e.g. pattern matching, destructuring or multiple return values) to complement `maybe<T>`.
Instead, because exceptions and try/catch-style handling are the generally accepted way to deal with exceptional behavior in Python, Javascript and Java, we'll use this idiom. To facilitate this, we'll introduce custom type converters to handle this behavior transparently in each language.
Our Python binding implementation, using pybind11, is similar to the one in the pybind11 documentation here; the difference is that we transparently convert the `maybe<T>` to `T`, or throw an exception that will bubble up to Python.
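To give a feel for what such a converter involves, here's a minimal sketch of a pybind11 type caster for `maybe<T>`. It's illustrative rather than the exact code from the repository; in particular, the `maybe<T>` accessors (`holds_error()`, `error()`) are assumptions about its interface:

```cpp
#include <stdexcept>
#include <pybind11/pybind11.h>
#include "mwe.h"

namespace pybind11 { namespace detail {

// Illustrative type caster: unwraps mwe::maybe<T> when returning to Python.
template <typename T>
struct type_caster<mwe::maybe<T>>
{
    PYBIND11_TYPE_CASTER(mwe::maybe<T>, _("maybe"));

    // C++ -> Python: return the contained value, or raise a Python exception.
    static handle cast(mwe::maybe<T> src, return_value_policy policy, handle parent)
    {
        if (src.holds_error())                      // assumed accessor
            throw std::runtime_error(src.error());  // surfaces in Python as a RuntimeError
        return make_caster<T>::cast(src.template get<T>(), policy, parent);
    }

    // Python -> C++ conversion isn't needed for return values.
    bool load(handle, bool) { return false; }
};

}} // namespace pybind11::detail
```

With a caster like this registered, the module code below needs no special error handling; a returned `maybe<int>` either becomes a Python `int` or raises an exception.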
So, in `bindings/python/mwe/binding.cpp`, we declare a module initializer with the `PYBIND11_PLUGIN` macro:
#include <pybind11/pybind11.h>
PYBIND11_PLUGIN(binding)
{
pybind11::module m("binding", "python binding example"); (1)
m.def("add", &mwe::add, "A function which adds two numbers"); (2)
return m.ptr(); (3)
}
- Declare an extension module, called `binding`.
- Declare a function, `add`.
- Return the internal representation of `binding` back to the Python interpreter.
The Python loader, in `bindings/python/mwe/__init__.py`, is trivially simple, and just exposes the contents of the binding at the package level.
from mwe.binding import *
When the above `import` statement is run, our initializer function is invoked, which ultimately instantiates our binding in the Python interpreter. Pybind11 handles the rest.
That’s all there is to it. We can build and test our binding with CMake:
🔨 Step 3 – Build and test the python binding
To work, you'll need Python 3 installed on your system. On Linux, you may already have Python installed, but you will also need the Python development package, typically called `python-devel` or `python-dev`. On Windows or Mac, I recommend Miniconda for 64-bit Python 3.x.
mkdir build_py && cd build_py
cmake -G "Visual Studio 14 2015 Win64" -Dmwe_WITH_PYTHON=ON .. (1)
cmake --build . --config Debug
ctest . -VV -C Debug (2)
Note that at (1) we're using the CMake `-Dmwe_WITH_PYTHON=ON` option, which enables the Python binding and its associated test suite. When we run the tests at (2), CMake will configure Python to execute these tests.
After the build completes, the files of interest to us are arranged as a canonical Python package in the build directory:
.
└───bindings
└───python
└───mwe . . . . . . . . . . . . . . . . . . . (1)
__init__.py . . . . . . . . . . . . . (2)
binding.cp35-win_amd64.pyd . . . . . . (3)
- The mwe package root, as described here.
- The package entry point.
- The compiled extension module, named according to the platform, architecture and python version the module was compiled for.
We can also inspect the module interactively, making sure to set the `PYTHONPATH` environment variable to point at the location of the `mwe` package:
$ PYTHONPATH=./bindings/python python -i
>>> import mwe as mwe
>>> help(mwe)
Help on package mwe:
NAME
mwe
PACKAGE CONTENTS
binding
FUNCTIONS
add(...) method of builtins.PyCapsule instance
add(arg0: int, arg1: int) -> int
A function which adds two numbers
>>> mwe.add(1, 2)
3
Node.js (Javascript)
Within the node.js environment, the Javascript engine, V8, is exposed to various native modules. Many of these come packaged with node.js itself and form part of the node.js ecosystem. Alongside these, we'll be introducing our own `mwe` module written with v8pp. In this setting, v8pp acts as the metaprogramming layer over V8, helping us with many of the incidental details of this process.
Javascript is a peculiar language in that it doesn't have classes per se, and so doesn't fundamentally distinguish between a class and its instances. Instead we have the object, a fundamental data type which is, in essence, a collection of properties. Furthermore, in Javascript all functions are objects too, so there exists a kind of duality between functions and objects. People familiar with Javascript will also know about the prototypal object model, and how these concepts can be used together to create something akin to class hierarchies and inheritance.
So, given that classes and modules are not concepts in Javascript, and that functions and objects are generated at runtime, we need a mechanism to describe the module with a statically compiled language like C++. With V8, this is achieved with the V8 template concept. Not to be confused with C++ templates, a V8 template is a blueprint for Javascript functions and objects to be created by V8 at runtime.
To help us understand what’s going on, we can describe how our module will be created with the roughly equivalent (and slightly unusual) Javascript:
var m = { add: function() { ... } };      (1)
Object.setPrototypeOf(module.exports, m); (2)
- Declare the object `m` with a single property `"add"`, the value of which is our `add` implementation.
- Override the prototype of the `module.exports` built-in property, making `add` available to consumers through the module prototype chain.
Subsequently, a user can consume the module and call `add()` in an external script in the usual way: `require("./binding").add(...);`.
The difference between this Javascript and the equivalent C++ we'll write is that in C++ we instead instantiate `m` from an object template. To achieve this, in `bindings/nodejs/mwe/binding.cpp`, we first create a standard node.js addon with the macro `NODE_MODULE` (imported from `node.h`), taking a name and a single module initializer function where the template is defined and instantiated:
#include <node.h>
#include <v8.h>
#include <v8pp/module.hpp>
void init(v8::Local<v8::Object> exports) { ... }
NODE_MODULE(binding, init)
Our module, again called `binding`, will be initialized when it's loaded into the V8 runtime. The `init` function is also passed a `v8::Object` we've named `exports`; this is equivalent to the `exports` built-in shown above.
Within `init` we define the object template, `m`:
void init(v8::Local<v8::Object> exports)
{
v8pp::module m(v8::Isolate::GetCurrent()); (1)
m.set("add", &mwe::add); (2)
exports->SetPrototype(m.new_instance()); (3)
}
- Declare `m`, a `v8pp::module` that wraps a `v8::ObjectTemplate`.
- Set `add` on the object template to point to `mwe::add`.
- Create an instance of `m`, and assign it to the module prototype.
You may notice the similarities with the Python binding; (2) is almost identical. Again, since `mwe::add` actually returns `maybe<int>` rather than `int`, we write a v8pp converter for the generic `maybe<T>` template which will throw a Javascript exception if an incoming `maybe<T>` holds an error, or recursively convert `T` to a `v8::Value` (which in our case is a one-step conversion from `int` to a `v8::Number`). For v8pp, these conversion specializations are fairly easy to write, although you should become familiar with some core V8 concepts like isolates, scopes and handles, which are described in the V8 embedder's guide, before you write your own.
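As a rough illustration (not the repository's exact converter, and again assuming `holds_error()`/`error()` accessors on `maybe<T>`, plus v8pp's translation of a thrown `std::exception` into a Javascript exception), such a specialization might look like:

```cpp
#include <stdexcept>
#include <v8pp/convert.hpp>
#include "mwe.h"

namespace v8pp {

// Illustrative converter: unwraps mwe::maybe<T> when returning to Javascript.
template <typename T>
struct convert<mwe::maybe<T>>
{
    using from_type = mwe::maybe<T>;
    using to_type = typename convert<T>::to_type;

    static to_type to_v8(v8::Isolate* isolate, mwe::maybe<T> value)
    {
        if (value.holds_error())                      // assumed accessor
            throw std::runtime_error(value.error());  // re-raised in JS by the function wrapper
        return convert<T>::to_v8(isolate, value.template get<T>());
    }
};

} // namespace v8pp
```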
🔨 Step 4 – Build and test the nodejs binding
To work You’ll need node.js installed on your system from here. The build will also automatically download the headers and libraries to build against your version of node.js.
mkdir build_js && cd build_js
cmake -G "Visual Studio 14 2015 Win64" -Dmwe_WITH_NODEJS=ON ..
cmake --build . --config Debug
ctest . -VV -C Debug
Once built, we should have the following folder structure in the build directory:
.
└───bindings
└───nodejs
└───mwe . . . . . . . . . . . . . . . . . . . (1)
package.json . . . . . . . . . . . . . (2)
index.js . . . . . . . . . . . . . . . (3)
binding.node . . . . . . . . . . . . . (4)
- The npm package root, following the structure described here
- The package description file.
- The package entrypoint.
- The native module.
Java
Lastly, the Java binding. Most Java developers may already be familiar with the Java Native Interface (JNI), the mechanism (and specification) native libraries use to declare native extensions to the Java runtime.
There are two distinct parts to a JNI extension. Owing to Java being a statically typed language, we first need to declare the interface we wish to expose in Java at compilation time. To do this, we use the `native` keyword to mark out which methods will be implemented by the binding.
All functions in Java are defined at class scope, and this has implications for how we declare and subsequently use our binding. The natural approach is to define the API as a facade of static `native` methods on a class called `com.mwe.Binding`, and to use its static initializer block to load the native `binding` library:
package com.mwe;
public class Binding
{
static { System.loadLibrary("binding"); } (1)
public static native int add(int a, int b); (2)
}
- Static initializer block which loads the native library.
- A native method declaration.
The second step is to write the C++ glue that binds to the Java class we just defined. Using JNI directly is particularly error prone and lacks type safety (often leading to runtime errors). Instead, we'll be using another IDL-like template wrapper.
JNI bindings will usually export a function called `JNI_OnLoad`; at runtime this function is called by the JVM during a call to `System.loadLibrary`. Each Java class we define has an associated C++ struct in the wrapper with a static `register_jni` function which, when passed the JNI environment, registers the class function(s). A skeleton implementation could look like the following:
#include <jni/jni.hpp>
struct Binding
{
static constexpr auto Name() { return "com/mwe/Binding"; };
static void register_jni(jni::JNIEnv& env) { ... }
};
extern "C" JNIEXPORT jint JNICALL JNI_OnLoad(JavaVM* vm, void*)
{
Binding::register_jni(jni::GetEnv(*vm));
return jni::Unwrap(jni::jni_version_1_2);
}
Each bindable C++ struct has a static function, `Name()`, which returns the fully qualified name of the associated Java class. In our case the name is `"com/mwe/Binding"`. For each `native` method we declared on this class in Java, we register the associated native implementation using `jni::MakeNativeMethod(...)`. A complete implementation of `Binding` looks like:
#include <jni/jni.hpp>
#include "mwe.h"
using namespace jni;
struct Binding
{
static constexpr auto Name() { return "com/mwe/Binding"; };
using _this = jni::Class<Binding>;
static jint add(JNIEnv& env, _this, jint a, jint b)
{
mwe::maybe<int> r = mwe::add(a, b);
return util::try_throw(env, r) ? 0 : r.get<int>();
}
static void register_jni(JNIEnv& env)
{
RegisterNatives(env, _this::Find(env),
MakeNativeMethod<decltype(&add), &add>("add"));
}
};
The binding is now complete. Notice that in this case `mwe::add` is not called directly by the metatemplate library, but instead by a helper which performs the relevant type conversions and error handling. Writing custom converters doesn't seem to be a feature of jni.hpp, but this is only mildly inconvenient in our case.
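For completeness, a helper along the lines of `util::try_throw` could be written against the raw JNI API roughly as follows; this is a sketch rather than the repository's implementation, and the `maybe<T>` accessors (`holds_error()`, `error()`) are assumptions:

```cpp
#include <jni.h>
#include "mwe.h"

namespace util {

// Illustrative helper: if r holds an error, schedule a Java RuntimeException
// and return true; otherwise return false and let the caller unwrap the value.
template <typename T>
bool try_throw(JNIEnv& env, mwe::maybe<T>& r)
{
    if (!r.holds_error())                                  // assumed accessor
        return false;

    // The exception becomes pending and is raised in Java once the native
    // method returns, so the caller just returns a dummy value (0 above).
    jclass cls = env.FindClass("java/lang/RuntimeException");
    env.ThrowNew(cls, r.error().c_str());                  // assumed accessor
    return true;
}

} // namespace util
```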
We build `Binding.cpp` as a shared library (a `.so`, `.dll` or `.dylib` file, depending on the platform), and `Binding.java` as an archive (a `.jar`). To consume it within a Java application, we import the Java class like any other. Here is a complete example that imports all the static methods of `Binding` with `import static com.mwe.Binding.*`:
import static com.mwe.Binding.*;
public class BindingTest {
public static void main(String[] args) {
int c = add(5, 3);
assert c == 8;
}
}
🔨 Step 5 – Build and test the Java binding
To work, you'll need a JDK installed on your system. You should also make sure the `JAVA_HOME` environment variable is set to the location of your JDK installation.
mkdir build_java && cd build_java
cmake -G "Visual Studio 14 2015 Win64" -Dmwe_WITH_JAVA=ON ..
cmake --build . --config Debug
ctest . -VV -C Debug
Once built (by supplying the `-Dmwe_WITH_JAVA=ON` option to CMake), we should have the following folder structure in the build directory:
.
└───bindings
├───java
│ mwe.jar . . . . . . . . . . . . . . . . . . (1)
└───Debug
binding.dll . . . . . . . . . . . . . . . . (2)
- The mwe Java package
- The native library
Memory management (in brief)
Many numerical libraries have interfaces that describe a set of data structures like matrices and n-dimensional vectors of data. This is also the lingua franca of other domains like image processing and machine learning. To leverage the data-parallel capabilities of heterogeneous hardware (especially GPUs), or the libraries that do so, you should become familiar with some of the pre-existing mechanisms for dealing with this kind of data.
The mwe repository contains an extra minimum working example of vector addition, introduced by overloading our `add` kernel (in `src/kernels/add.h`) with an additional operation:
template <compute_mode> static error_code op(
const gsl::span<int> a,
const gsl::span<int> b,
gsl::span<int> result);
Here we're declaring an operation taking three views of 1-dimensional vectors, implemented by adding the first two together into `result`. In the last part of this tutorial, I'll single out one aspect of this operation worthy of further explanation: `gsl::span<T>`, and its implications for memory management.
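As a rough host-side illustration of these views in use (assuming, purely for the example, that the operation is ultimately exposed through a span-taking `mwe::add` overload):

```cpp
#include <cassert>
#include <vector>
#include <gsl/span>
#include "mwe.h"

int main()
{
    std::vector<int> a{1, 2, 3, 4};
    std::vector<int> b{10, 20, 30, 40};
    std::vector<int> result(a.size());

    // The spans are non-owning views; the vectors above own the memory.
    mwe::add(gsl::span<int>(a), gsl::span<int>(b), gsl::span<int>(result)); // assumed overload

    assert(result[0] == 11 && result[3] == 44);
    return 0;
}
```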
The gsl::span<T>
A `gsl::span<T>` is a view over contiguous memory, intended as an alternative to the error-prone `(pointer, length)` idiom. It also supports describing the shape and span of n-dimensional data. It forms part of the Guidelines Support Library (GSL), a support library for the C++ Core Guidelines, an ambitious attempt led by members of the C++ standards committee to produce a set of rules and best practices for writing modern C++.
Passing views and sub-views of n-dimensional vectors around, rather than the data itself, forms an important part of writing language bindings that manipulate large amounts of data through the layers of a binding. By passing views, we can aim to avoid unnecessary memory allocation and copying. When incorporating other hardware (e.g. GPUs) this topic becomes especially important, and can make or break application performance since some copying (back and forth between various bits of hardware) becomes unavoidable. This should impact the way your library is designed, and indeed, some languages have special mechanisms to support this in a standardized way (e.g. the Python Buffer Protocol).
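To make this concrete, here's a sketch of how the Python binding could hand NumPy-owned memory to the span-based operation without copying, using pybind11's support for the buffer protocol. The span-taking `mwe::add` overload is again an assumption, and a real implementation would validate dtype, contiguity and writability:

```cpp
#include <pybind11/pybind11.h>
#include <gsl/span>
#include "mwe.h"

namespace py = pybind11;

void add_arrays(py::buffer a, py::buffer b, py::buffer result)
{
    // request() exposes the memory owned by the Python objects; the
    // buffer_info objects keep that memory alive for the duration of the call.
    py::buffer_info ia = a.request(), ib = b.request(), ir = result.request(true);

    // Non-owning views over Python-owned memory: no allocation, no copy.
    gsl::span<int> sa(static_cast<int*>(ia.ptr), ia.size);
    gsl::span<int> sb(static_cast<int*>(ib.ptr), ib.size);
    gsl::span<int> sr(static_cast<int*>(ir.ptr), ir.size);

    mwe::add(sa, sb, sr); // assumed span-taking overload
}
```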
Ownership and Garbage Collection
Let's be more concrete about what a view-like type such as `gsl::span<T>` actually is in memory management terms. When a view goes out of scope, the data being viewed remains valid because it's owned by something else; the view is a non-owning reference to some data. In more technical terms, the lifetime of this data is not bounded by the lifetime of the view. This is in contrast to, say, a `std::vector<T>` and the data it owns; in that case the lifetime of the data contained within the `vector` is bounded by the lifetime of the `vector` itself.
The concepts of lifetime and ownership are a pervasive part of binding to languages that have automatic memory management (AMM) schemes. In our case, the data referenced by the `gsl::span<T>`s is owned by the target language runtime. For our kernel operation to work correctly, we may need to interact with this memory management scheme to ensure the referenced data (and by implication, the `gsl::span<T>`s) remains valid for the duration of the operation.
Typically, when working within the confines of a language with AMM, an object's lifetime is determined, transparently, by a garbage collector. As soon as these objects are referenced from outside the runtime (say, by a native extension), you will occasionally need to make the AMM scheme aware of your intentions more explicitly. The specific mechanisms used to do this vary across languages, but the basic principles of lifetime and ownership are the same.
To use our `gsl::span<T>`, we need to ensure the AMM of the target runtime doesn't free or (in the case of a compacting garbage collector) move the underlying data. To this end, the IDL-like wrappers we've used in this tutorial provide varying degrees of support in addition to what's provided by the raw API. The source code accompanying this tutorial offers a demonstration in each language, but here is an overview:
Memory management overview

| Runtime | Strategy | Raw API | Additional wrapper support |
|---|---|---|---|
| CPython | Reference counting | Py_INCREF, etc. | ✓ |
| V8 | Tracing and compacting GC | Local/Persistent handles and Scopes | Some |
| JVM | Implementation defined | Local/global references | Limited |
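As one example of what the raw API column means in practice: on the JVM, a common way to guarantee that the data behind a Java `int[]` stays valid (and isn't moved) for the duration of a native call is to pin it with the JNI array functions. Here's a rough raw-JNI sketch, again assuming a span-taking `mwe::add` overload (jni.hpp offers higher-level equivalents), with error checks omitted for brevity:

```cpp
#include <jni.h>
#include <gsl/span>
#include "mwe.h"

jint add_arrays(JNIEnv& env, jintArray a, jintArray b, jintArray result)
{
    // Pin the arrays: the JVM must not free or relocate them until released.
    jint* pa = env.GetIntArrayElements(a, nullptr);
    jint* pb = env.GetIntArrayElements(b, nullptr);
    jint* pr = env.GetIntArrayElements(result, nullptr);
    jsize n  = env.GetArrayLength(a);

    // jint is a 32-bit integer, so viewing it as int is safe on common platforms.
    mwe::add(gsl::span<int>(reinterpret_cast<int*>(pa), n),
             gsl::span<int>(reinterpret_cast<int*>(pb), n),
             gsl::span<int>(reinterpret_cast<int*>(pr), n)); // assumed overload

    // Unpin: JNI_ABORT discards our (non-)changes; 0 writes the result back.
    env.ReleaseIntArrayElements(a, pa, JNI_ABORT);
    env.ReleaseIntArrayElements(b, pb, JNI_ABORT);
    env.ReleaseIntArrayElements(result, pr, 0);
    return 0;
}
```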
Summary
I hope you found this tutorial interesting and insightful. The main aim was to show how, with the features of modern C++, it's becoming far easier to extend applications without resorting to complex language bridges or esoteric binding code that often requires you to be an expert in several domains at once. For heterogeneous computing, we've also shown that the modern features of C++ make it productive to write kernels for a variety of hardware configurations, without necessarily resorting to complex frameworks, or to unifying approaches that have historically failed to achieve performance portability.
Further Reading / Presentations
- Writing Good C++14 – An introduction to the C++ core guidelines initiative (B. Stroustrup)
- Evolving array_view and string_view for safe C++ code – A presentation on the GSL support library (N. Macintosh)
- How to Optimize Data Transfers in CUDA C/C++ – A useful summary of the data transfer mechanisms in CUDA