CUDA Libraries

As writing efficient GPU kernels is extremely difficult, CUDA provides many libraries that solve common problems. For example, the CUBLAS library contains highly optimised BLAS¹ routines which run on the GPU. Here is an overview of the libraries which CUDA provides; a short usage sketch follows the list:

  • CUBLAS: Optimised BLAS routines, which include matrix multiplication and other matrix and vector operations.
  • CURAND: Library for random number generation on the GPU. This is useful for Monte Carlo simulations.
  • CUFFT: Library for calculating Fast Fourier Transforms.
  • CUSOLVER: Library for solving linear algebra problems, such as diagonalising a matrix (eigenvalue and eigenvector decomposition).
  • CUSPARSE: Library with support for using sparse matrices.
  • CUDNN: Optimised routines for deep neural networks.
  • CUTENSOR: Accelerated tensor linear algebra library providing tensor contractions, reductions and element-wise operations.
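
Several of these libraries are reachable through familiar Julia functions rather than their raw C interfaces. Below is a minimal sketch (the array sizes are arbitrary, and it assumes a recent CUDA.jl with a working GPU, where the backslash solve dispatches to CUSOLVER for CuArrays) showing CURAND, CUSOLVER and CUBLAS being used implicitly:

```julia
using CUDA, LinearAlgebra

# CURAND: generate random numbers directly on the GPU
A = CUDA.rand(Float32, 512, 512)
b = CUDA.rand(Float32, 512)

# CUSOLVER: `\` dispatches to a GPU factorisation and solve
x = CUDA.@sync A \ b

# CUBLAS: the residual check uses a GPU matrix-vector product and norm
norm(A * x - b)
```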

Many of the BLAS routines already have a higher-level interface; matrix multiplication, for example. One can multiply two matrices very easily using:

```julia
using CUDA

n = 2
a = CUDA.rand(Float32, n, n)
b = CUDA.rand(Float32, n, n)
# @sync blocks until the asynchronous GPU computation has finished
CUDA.@sync c = a * b
```

Or the in-place version:

```julia
using LinearAlgebra

# write the product a*b into the preallocated array c
mul!(c, a, b)
```

Since each CUDA array has its own type - the CuArray - Julia can dispatch to the most optimised routines. In this case, CUDA.jl defines a method of the mul! function from LinearAlgebra.jl which calls the underlying CUDA.CUBLAS.gemm!, where GEMM stands for “General Matrix Multiply”².
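
To make the connection explicit, here is a hedged sketch that performs the same multiplication through the high-level mul! and through the low-level wrapper directly, assuming gemm! follows the same argument order as LinearAlgebra.BLAS.gemm!:

```julia
using CUDA, LinearAlgebra

a = CUDA.rand(Float32, 2, 2)
b = CUDA.rand(Float32, 2, 2)
c = CUDA.zeros(Float32, 2, 2)

# high level: mul! dispatches on the CuArray argument types ...
mul!(c, a, b)

# ... to the low-level wrapper, computing C = alpha*A*B + beta*C
CUDA.CUBLAS.gemm!('N', 'N', 1f0, a, b, 0f0, c)
```

Both calls produce the same result; the high-level route is simply more readable and portable.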

Naming conventions in CUDA are incredibly old-fashioned, and it is not always clear what each function does. You will have to check the documentation to see which function you need for your code. Even then, it is often difficult to decipher exactly which algorithms are available. Creating high-quality, well-documented, high-level front-end APIs for CUDA.jl is an ongoing process. Fortunately, many packages already provide GPU acceleration using these underlying libraries. You will find many GPU implementations in packages like NNlibCUDA.jl, which is a dependency of Flux.jl - one of the largest machine learning libraries in Julia.
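
As a rough illustration (not from the original text, and assuming a recent Flux.jl version), moving a model to the GPU is typically a one-liner, with the CUDA libraries above doing the work behind the scenes:

```julia
using Flux, CUDA

# a small dense layer; `gpu` moves its parameters to CuArrays
model = Dense(10 => 2) |> gpu

# input data also needs to live on the GPU
x = CUDA.rand(Float32, 10)

# the forward pass now runs through CUBLAS-backed matrix operations
y = model(x)
```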

Many of the high-level BLAS operations already extend the LinearAlgebra.jl functions to work with types from CUDA.jl, so try simply changing the input types of your existing code. If you run into errors, track down the functions which are not implemented and find a suitable GPU-compatible replacement. This may mean having to write your own CUDA kernel, as sketched below.
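
If it does come to that, CUDA.jl lets you write the kernel itself in Julia and launch it with the @cuda macro. Here is a minimal sketch of an element-wise “axpy” kernel; the kernel name, sizes and launch configuration are illustrative rather than taken from any library:

```julia
using CUDA

# y .= a .* x .+ y, computed one element per GPU thread
function axpy_kernel!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)          # guard the final, partially filled block
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing             # kernels must not return a value
end

n = 1024
x = CUDA.rand(Float32, n)
y = CUDA.rand(Float32, n)

threads = 256
blocks = cld(n, threads)
CUDA.@sync @cuda threads=threads blocks=blocks axpy_kernel!(y, 2f0, x)
```

Each thread handles one array element, so the number of blocks is chosen to cover the whole array and the bounds check protects against out-of-range indices in the last block.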

Footnotes


  1. Basic Linear Algebra Subprograms (BLAS).
  2. One can track down the actual implementation by using the @which and @edit macros, which allow one to find the source code for a method. This can also be done with a debugger, but debugging is very slow in Julia.