
Notes on Performance


NumPy is highly optimized for single-core performance, using any and all SIMD instructions available on the running system (SSE/AVX/NEON). NumPy will indirectly use multiple cores if the version of the package you are using was compiled against a library, primarily BLAS, that contains multi-core code. Simple, parallelizable operations, like taking the sin() of every element in an array, do not use multiple cores unless NumPy has been specially compiled with that capability.
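One quick way to see which BLAS (if any) a given NumPy build is linked against is np.show_config(); a minimal sketch (the output format varies by NumPy version and build):

import numpy as np

# Print the build configuration, including which BLAS/LAPACK implementation
# (e.g., OpenBLAS, MKL, Apple Accelerate) this NumPy was compiled against.
np.show_config()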

Numba dynamically compiles functions into native code. Given the optimizations already in NumPy and the overhead of invoking the compiler, Numba isn't necessarily any faster than NumPy on a single core. Numba's advantage is that it can distribute parallelizable code across all available CPU cores, which can lead to a significant speedup.

Numba has an important stealth feature: it transparently supports NumPy random() in its parallelized code. This is very useful since Numba only needs PRNG state for each core rather than for each agent, which is a much more manageable amount of data.

Numba can also compile to CUDA kernel code and can interoperate with PyCUDA. We have not yet looked into using the same PRNG state for multiple agents through Numba-compiled CUDA code.
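As a rough sketch of what the Numba CUDA side could look like (we have not benchmarked this; the kernel, sizes, and seed below are illustrative only), numba.cuda.random allocates one xoroshiro128+ state per thread, which a thread could then reuse across many agents:

import math
import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

@cuda.jit
def gpu_sines(rng_states, out):
    i = cuda.grid(1)
    if i < out.size:
        # Index the per-thread RNG state by the global thread id.
        out[i] = math.sin(xoroshiro128p_uniform_float32(rng_states, i))

threads_per_block = 256
blocks = 4096
n = threads_per_block * blocks
rng_states = create_xoroshiro128p_states(n, seed=42)  # one state per thread
out = cuda.device_array(n, dtype=np.float32)
gpu_sines[blocks, threads_per_block](rng_states, out)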

PyCUDA provides low-level access to Nvidia's CUDA toolkit (kernels are written in C++), but that comes with some challenges. For example, marshaling kernel parameters across to the GPU can be finicky: it depends on structure packing, which requires knowing how 32-bit and 64-bit values are laid out.
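As an illustration (the kernel below is hypothetical, not code we use), scalar arguments have to be passed as NumPy scalars whose widths exactly match the C++ signature, which is where the packing issue bites:

import numpy as np
import pycuda.autoinit          # creates a context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}
""")
scale = mod.get_function("scale")

n = 1 << 20
src = np.random.rand(n).astype(np.float32)
dst = np.empty_like(src)

# Scalars must match the kernel's C types exactly: np.float32 for float,
# np.int32 for int. Passing a plain Python int/float can mis-pack the arguments.
scale(drv.Out(dst), drv.In(src), np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))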

PyCUDA provides access to CUDA's PRNG primitives, but, on their own, these are not directly useful at the scale of the numbers of agents we want to simulate, e.g., 10⁹.
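A rough back-of-the-envelope calculation, assuming cuRAND's default XORWOW generator whose per-thread state is on the order of 48 bytes, shows why one PRNG state per agent does not scale:

# Rough sizing only; the exact sizeof(curandState) depends on the generator chosen.
agents = 1_000_000_000
state_bytes = 48    # approximate size of a curandStateXORWOW_t
print(f"~{agents * state_bytes / 2**30:.0f} GiB of PRNG state")    # ~45 GiB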

Taichi appears to have the great advantage of transparently switching between compiling for CPU, CUDA, and Metal (Apple's GPU compute API).
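For example, the backend is chosen at initialization and the kernels themselves are unchanged (availability of each backend depends on the machine):

import taichi as ti

# ti.gpu picks whichever GPU backend is available (CUDA, Metal, Vulkan, ...);
# ti.cpu falls back to multi-threaded, vectorized CPU code.
ti.init(arch=ti.gpu)    # or arch=ti.cpu, arch=ti.cuda, arch=ti.metal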

Examples

  • All numbers below are from an Apple MacBook Pro with Apple M1 Max Silicon and 32GB of RAM.
  • Reading a random sine value from the results comes from my C++ days, where the compiler would optimize all of the computation away if the results were not used.
  • We run each function once before timing to make sure all code is already JITted.
| TLDR            | 2×10⁸ elements | Speed Up |
|-----------------|----------------|----------|
| NumPy           | 1.8670 seconds | 1x       |
| Numba           | 1.5907 seconds | ~1.2x    |
| Numba optimized | 0.2674 seconds | ~7x      |
| Taichi          | 0.0206 seconds | ~90x     |
import time
import numpy as np

N = 100_000_000

def numpy_sines(n):
    return np.sin(np.random.rand(n))

sines = numpy_sines(N)

t0 = time.perf_counter_ns()
sines = numpy_sines(N)
t1 = time.perf_counter_ns()

print(f"numpy_sines({N=}) took {(t1-t0)/1_000_000_000} seconds")
index = np.random.randint(0, N)
print(f"sines[{index=}] = {sines[index]}")

numpy_sines(N=100000000) took 0.90880412 seconds
sines[index=4184225] = 0.8232138393787853

import numba as nb

@nb.jit(nopython=True, nogil=True, cache=False)
def numba_sines(N):
    sines = np.sin(np.random.rand(N))
    return sines

sines = numba_sines(np.uint32(N))  # warm-up run triggers JIT compilation before timing

t0 = time.perf_counter_ns()
sines = numba_sines(np.uint32(N))
t1 = time.perf_counter_ns()

print(f"numba_sines({N=}) took {(t1-t0)/1_000_000_000} seconds")
index = np.random.randint(0, N)
print(f"sines[{index=}] = {sines[index]}")

numba_sines(N=100000000) took 0.804739 seconds
sines[index=25089351] = 0.31106103585469225

from numba import float64, uint32, prange

@nb.jit(float64[:](uint32), nopython=True, nogil=True, parallel=True)
def numba_opt_sines(N):
    sines = np.empty(N)
    for i in prange(N):
        sines[i] = np.sin(np.random.rand())
    return sines


sines = numba_opt_sines(np.uint32(N))  # warm-up run triggers JIT compilation before timing

t0 = time.perf_counter_ns()
sines = numba_opt_sines(np.uint32(N))
t1 = time.perf_counter_ns()

print(f"numba_opt_sines({N=}) took {(t1-t0)/1_000_000_000} seconds")
index = np.random.randint(0, N)
print(f"sines[{index=}] = {sines[index]}")

numba_opt_sines(N=100000000) took 0.132869041 seconds
sines[index=59459871] = 0.822506090513809

# %pip install taichi

import taichi as ti
import taichi.math as tm

ti.init(arch=ti.gpu)

sines = ti.field(dtype=ti.f32, shape=N)

@ti.kernel
def taichi_gpu_sines():
    for i in sines:
        sines[i] = tm.sin(ti.random(float))

index = np.random.randint(0, N)
taichi_gpu_sines()  # warm-up run so the kernel is already compiled before timing

t0 = time.perf_counter_ns()
taichi_gpu_sines()
extract = sines[index]  # read one result back: defeats dead-code elimination and ensures the asynchronous kernel finishes inside the timed region
t1 = time.perf_counter_ns()

print(f"taichi_gpu_sines({N=}) took {(t1-t0)/1_000_000_000:0.6f} seconds")
print(f"sines[{index=}] = {extract}")

[Taichi] Starting on arch=metal
taichi_gpu_sines(N=100000000) took 0.011076 seconds
sines[index=22445860] = 0.276404470205307
