
How it works

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 15, 2021 · 10 revisions

Every array element is mapped to an element of an active page, and every active page is mapped to a frozen page stored on a graphics card. The mapping is in interleaved order, so that bandwidth from multiple PCIe bridges is available when array accesses are multithreaded (accesses are automatically thread-safe). Multiple frozen pages can be served by the same active page (over time), but multiple active pages cannot point to the same frozen page. Similarly, multiple array elements can map to the same active page, but multiple active pages cannot contain the same array element.

By default there are 4 virtual GPUs per physical GPU (configurable through the memMult parameter). This overlaps data transfers during multithreaded access and increases bandwidth. Setting X active pages per virtual GPU therefore yields 4X total active pages per physical GPU. Indexing is seamless: array index 55123 may reside on a GTX 1080 while index 55124 accesses an RX 6800 in the same system.

When elements are only a few bytes in size, increasing the number of virtual GPUs (memMult={50,100,100,...}) gives better performance than increasing the cache per virtual GPU. When elements are bigger, both random access and sequential access gain performance from bigger pages and more caching. The worst case is small elements (like char) combined with truly random access (std::mt19937); the best case is sequential access from multiple threads using an element type whose sizeof is 4 kB - 128 kB.

The "virtual gpu"s are "directly mapped" to the virtual array, and each one has its own least-recently-used (LRU) cache eviction policy, so every K-th page belongs to the same pool of LRU-cached pages. This makes the "associativity" of the cache tunable through the "memMult" and "maxActivePagesPerGpu" parameters: "maxActivePagesPerGpu" is effectively the number of "cache lines" per LRU cache, and "memMult" is the number of LRU caches per physical GPU.

OpenCL-capable graphics cards support tens (if not hundreds) of OpenCL command queues per context, and there can be multiple contexts per card. Each command queue works independently, which benefits from using more threads than there are logical cores; an FX8150, for example, can work with 64 threads. While a page is not in RAM (not in the cache), the accessing thread fetches the data from the graphics card and yields execution so other threads can continue with other tasks such as computation or I/O.

It is a very simple, thin layer that (currently) uses no background thread.