zip_vector is a compressed variable-length array that uses vectorized block codecs to compress and decompress integers with variable bit-width deltas. The integer block codecs are optimized for vector instruction sets using Google's Highway C++ library for portable SIMD/vector intrinsics.

zip_vector reduces memory footprint and can speed up in-order traversal of arrays by lowering global memory bandwidth using lightning-fast vector compression.

The implementation transparently compresses and decompresses fixed-size pages to and from an indexed non-linear slab using simple block compression codecs. It is optimized for sequential access using custom iterators; out-of-order accesses are possible at the cost of performance. Performance depends on the sum of compression latency and the latency of transferring compressed blocks to and from global memory being less than the cost of transferring uncompressed blocks to and from global memory.
The implementation supports 64-bit and 32-bit array containers using block compression codecs that perform width reduction for absolute values, signed deltas for relative values, and special blocks for constant values and sequences.
`zip_vector<int32_t>`

- { 8, 16, 24 } bit signed and unsigned fixed-width values.
- { 8, 16, 24 } bit signed deltas with per-block initial value.
- constants and sequences using per-block initial value and delta.

`zip_vector<int64_t>`

- { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
- { 8, 16, 24, 32, 48 } bit signed deltas with per-block initial value.
- constants and sequences using per-block initial value and delta.
The order of compression and decompression of minimally sized blocks ensures that accesses to uncompressed data happen in the L1 and L2 caches, whereas global memory accesses read and write compressed data, effectively increasing global memory bandwidth. Nevertheless, the primary goal is to reduce the in-memory footprint.
The zip_vector template is intended to be similar to std::vector. The current prototype does not yet implement all of the traits present in std::vector, nor is it thread-safe, but it does support iterators.
```cpp
zip_vector<int64_t> vec;

vec.resize(8192);

// write a simple sequence; a page of these compresses as deltas
for (size_t i = 0; i < vec.size(); i++) {
    vec[i] = i;
}

int64_t s1 = 0, s2 = 0;

// indexed in-order traversal
for (size_t i = 0; i < vec.size(); i++) {
    s1 += vec[i];
}

// iterator traversal
for (auto v : vec) {
    s2 += v;
}

assert(s1 == s2);
```
Internally, a slab is organised to contain blocks of compressed and uncompressed page data, written in the order the vector is accessed. Pages hold a fixed power-of-2 number of elements, but blocks in the slab are variable length due to the use of block compression codecs. Blocks containing compressed pages are located via an offsets array indexed by right-shifting the array index.
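Because blocks are variable length, a page's location cannot be computed from its index alone. A minimal sketch of the lookup, with hypothetical names (`page_shift`, `offsets`) and an assumed page size; the library's actual internals differ:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the page lookup described above; the names and
// layout here are hypothetical, not the library's actual internals.
struct slab_index {
    static constexpr size_t page_shift = 9;  // assumed 2^9 = 512 elements per page
    std::vector<size_t> offsets;             // slab offset of each page's block

    // Blocks are variable length, so a page's slab offset must be looked
    // up rather than computed: right-shift the element index to find the
    // page, then index the offsets array.
    size_t block_offset(size_t element_index) const {
        return offsets[element_index >> page_shift];
    }
};
```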
Pages are scanned and compressed using fast but simple block compression codecs that perform width reduction for small absolute values, encode values as signed deltas, or emit special blocks for constant values and sequences with constant deltas:
- 8, 16, 24, 32, and 48 bit signed and unsigned absolute values.
- 8, 16, 24, 32, and 48 bit signed deltas with per-block initial value.
- constants and sequences using per-block initial value and delta.
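As a concrete illustration of the scan, here is a minimal scalar sketch of the statistics a page scan could gather before choosing among these codecs. It mirrors the rules above but is only a sketch; the library's actual scan is vectorized and its interfaces differ:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Scalar sketch of a page scan (illustrative only). It finds the value
// range and the delta range so a caller can pick the narrowest absolute
// or delta width, or a constant/sequence block. Assumes n >= 1.
struct page_stats {
    int64_t min_value, max_value;   // range for absolute width reduction
    int64_t min_delta, max_delta;   // range for signed delta encoding
    bool constant_delta;            // true -> sequence (constant if delta == 0)
};

page_stats scan_page(const int64_t *page, size_t n) {
    page_stats s { page[0], page[0], 0, 0, true };
    int64_t first_delta = n > 1 ? page[1] - page[0] : 0;
    s.min_delta = s.max_delta = first_delta;
    for (size_t i = 1; i < n; i++) {
        int64_t d = page[i] - page[i - 1];
        s.min_value = std::min(s.min_value, page[i]);
        s.max_value = std::max(s.max_value, page[i]);
        s.min_delta = std::min(s.min_delta, d);
        s.max_delta = std::max(s.max_delta, d);
        s.constant_delta = s.constant_delta && (d == first_delta);
    }
    return s;
}
```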
Compression efficiency ranges from 12.5% (8-bit) to 75% (48-bit) of the original size, or 100% in the worst case (plus ~1% metadata overhead). The codecs can compress sign-extended values, so canonical pointers on x86_64 use at most 48 bits. Pages of temporally coherent pointers can sometimes be compressed with 16-bit or 24-bit deltas, or even as a constant sequence using just an initial value and delta. On x86_64, the codecs use runtime cpuid feature detection to select generic code or AVX-512-optimized code.
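For instance, a canonical x86_64 pointer is just its low 48 bits sign-extended to 64 bits, so a 48-bit signed codec reproduces it exactly. A small sketch of that round trip (illustrative of the property, not the codec itself):

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low 48 bits of a word back to 64 bits. A canonical
// x86_64 pointer already has bits 48..63 equal to bit 47, so dropping
// them on encode and sign-extending on decode is lossless.
int64_t sign_extend_48(uint64_t v) {
    return (int64_t)(v << 16) >> 16;  // shift up, arithmetic shift back down
}

int main() {
    int local = 0;
    int64_t p = (int64_t)(intptr_t)&local;     // canonical pointer value
    assert(sign_extend_48((uint64_t)p) == p);  // 48-bit round trip is exact
}
```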
There is one active area in the slab per thread to cache the current page uncompressed. When a page boundary is crossed, the active area is scanned and recompressed into the slab. Blocks can transition from compressed to uncompressed and back, and they can change size when recompressed.
If, after scanning, the active area for the current page can't be compressed, the offsets array is updated to point to uncompressed data, which is subsequently updated in place. When scanned again, it may transition back to a compressed state, in which case the uncompressed area is returned to a binned free list. If high-entropy data is written in order, the slab ends up structured equivalently to a regular linear array.
Page dirty status is tracked so that if there are no write accesses to a block, scanning and compression can be skipped; only decompression is necessary when crossing block boundaries. Please note this initial prototype implementation is not thread-safe.
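To make the write-back rule concrete, here is a rough sketch of the page-crossing path, with invented names (`active_area`, `cross_page`) that do not match the library's internals:

```cpp
#include <cstddef>

// Invented names for illustration; this is not the library's interface.
struct active_area {
    size_t page;   // which page is currently decompressed in the active area
    bool   dirty;  // set by writes through operator[] or iterators
};

// When crossing a page boundary, the scan-and-recompress write-back is
// only needed if the outgoing page was written; decompression of the
// incoming page is always needed.
template <typename Slab>
void cross_page(Slab &slab, active_area &active, size_t new_page) {
    if (active.dirty) {
        slab.scan_and_recompress(active.page);  // block may change size or state
        active.dirty = false;
    }
    slab.decompress_into_active(new_page);
    active.page = new_page;
}
```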
It is recommended to build using Ninja:

```sh
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build -- --verbose
```
- Support additional bit widths
  - The current scheme uses the widths 2^n and 2^(n+1) + 2^n.
  - 1, 2, 3, 4, 6, and 12 bit deltas are still to be implemented.
- Improve the bitmap slab allocator
  - The current bitmap allocator uses a naive exhaustive first-fit algorithm.
  - The slab bitmap could be partitioned based on page index to reduce worst-case scan performance, at the expense of requiring rebalancing of the slab partitions when the slab is resized.
  - The slab could be made denser by using a stochastic best-fit algorithm that records size statistics to guide tactical choices about which bitmap chunks are split to match statistical demand for frequent block sizes.
- Add multi-threading support
  - One approach is to add a per-page reference count semaphore and simply keep pages that are being concurrently accessed uncompressed, with the last thread recompressing the page (see the sketch below). There is some complexity in synchronizing slab resizes.
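Purely to illustrate the shape of that last item, a hypothetical per-page reference count; nothing like this exists in the code yet:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch only: the prototype is not thread-safe and does not
// implement this. A page stays uncompressed while any thread holds it;
// the last thread out performs the scan and recompression.
struct page_ref {
    std::atomic<int> refs {0};
};

template <typename Slab>
void release_page(Slab &slab, page_ref &ref, size_t page) {
    if (ref.refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        slab.scan_and_recompress(page);
    }
}
```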
This table shows the vectorized block codecs that have so far been implemented.
| | bits | 48 | 32 | 24 | 16 | 12 | 8 | 6 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| absolute | i64 | X | X | X | X | | X | | | | | |
| | u64 | X | X | X | X | | X | | | | | |
| | i32 | | | X | X | | X | | | | | |
| | u32 | | | X | X | | X | | | | | |
| relative | i64 | X | X | X | X | | X | | | | | |
| | u64 | X | X | X | X | | X | | | | | |
| | i32 | | | X | X | | X | | | | | |
| | u32 | | | X | X | | X | | | | | |
Benchmarks are split into high-level benchmarks for the array class and low-level benchmarks for the block compression codecs.

- Zip Vector Array Benchmarks (benchmark program: `bench-zip-vector`)
- Low Level Codec Benchmarks (benchmark program: `bench-zvec-codecs`)
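Assuming the Ninja build above and that the binaries are emitted under `build/` (an assumption about the build layout, not documented output), the benchmark programs can be run directly:

```sh
# paths assume the default layout of the Ninja build above
./build/bench-zip-vector
./build/bench-zvec-codecs
```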
These benchmarks measure in-order, read-only traversal of a compressed array, showing throughput for arrays whose statistics target each of the block compression codecs. std::vector is compared against zip_vector using 1D iteration and 2D iteration, the latter with a page-sized block stride to take advantage of LLVM/Clang's auto-vectorization.
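The 2D pattern amounts to a page-strided nested loop so the inner loop stays within one decompressed page. A minimal sketch, assuming a hypothetical 512-element page size (the benchmark's actual stride and access mechanism may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// "2D" traversal sketch: the outer loop advances one assumed page at a
// time, so the inner loop runs over a single decompressed page and is a
// simple counted loop the compiler can auto-vectorize.
template <typename Vec>
int64_t sum_2d(Vec &vec, size_t page_size = 512) {
    int64_t sum = 0;
    for (size_t j = 0; j < vec.size(); j += page_size) {
        size_t end = std::min(vec.size(), j + page_size);
        for (size_t i = j; i < end; i++) {
            sum += vec[i];
        }
    }
    return sum;
}
```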
- Clang 14.0.0, Intel Core i9-7980XE, 4.3GHz, AVX-512
- GCC 11.2.0, Intel Core i9-7980XE, 4.3GHz, AVX-512
zip_vector<int64_t> with 2D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_2D-abs-8 | 16MiW | 0.136 | 7,339,965,481 | 27999.7 |
zip_vector_2D-rel-8 | 16MiW | 0.253 | 3,945,984,200 | 15052.7 |
zip_vector_2D-abs-16 | 16MiW | 0.211 | 4,743,869,411 | 18096.4 |
zip_vector_2D-rel-16 | 16MiW | 0.298 | 3,355,174,115 | 12799.0 |
zip_vector_2D-abs-24 | 16MiW | 0.372 | 2,687,863,404 | 10253.4 |
zip_vector_2D-rel-24 | 16MiW | 0.479 | 2,089,221,836 | 7969.7 |
zip_vector_2D-abs-32 | 16MiW | 0.358 | 2,793,575,309 | 10656.6 |
zip_vector_2D-rel-32 | 16MiW | 0.399 | 2,504,306,913 | 9553.2 |
zip_vector_2D-abs-48 | 16MiW | 0.482 | 2,076,496,569 | 7921.2 |
zip_vector_2D-rel-48 | 16MiW | 0.548 | 1,824,931,526 | 6961.6 |
zip_vector<int32_t> with 2D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_2D-abs-8 | 16MiW | 0.074 | 13,465,854,621 | 51368.2 |
zip_vector_2D-rel-8 | 16MiW | 0.075 | 13,334,215,010 | 50866.0 |
zip_vector_2D-abs-16 | 16MiW | 0.149 | 6,723,057,823 | 25646.4 |
zip_vector_2D-rel-16 | 16MiW | 0.148 | 6,767,145,742 | 25814.6 |
zip_vector_2D-abs-24 | 16MiW | 0.286 | 3,492,346,682 | 13322.2 |
zip_vector_2D-rel-24 | 16MiW | 0.287 | 3,481,434,937 | 13280.6 |
zip_vector<int64_t> with 1D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_1D-abs-8 | 16MiW | 1.113 | 898,209,739 | 3426.4 |
zip_vector_1D-rel-8 | 16MiW | 1.220 | 819,679,680 | 3126.8 |
zip_vector_1D-abs-16 | 16MiW | 1.171 | 853,726,092 | 3256.7 |
zip_vector_1D-rel-16 | 16MiW | 1.256 | 795,872,012 | 3036.0 |
zip_vector_1D-abs-24 | 16MiW | 1.339 | 746,892,358 | 2849.2 |
zip_vector_1D-rel-24 | 16MiW | 1.454 | 687,776,124 | 2623.7 |
zip_vector_1D-abs-32 | 16MiW | 1.279 | 782,121,557 | 2983.6 |
zip_vector_1D-rel-32 | 16MiW | 1.367 | 731,442,921 | 2790.2 |
zip_vector_1D-abs-48 | 16MiW | 1.458 | 685,831,325 | 2616.2 |
zip_vector_1D-rel-48 | 16MiW | 1.533 | 652,166,701 | 2487.8 |
zip_vector<int32_t> with 1D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_1D-abs-8 | 16MiW | 1.083 | 923,105,940 | 3521.4 |
zip_vector_1D-rel-8 | 16MiW | 1.083 | 923,441,839 | 3522.7 |
zip_vector_1D-abs-16 | 16MiW | 1.155 | 865,819,832 | 3302.8 |
zip_vector_1D-rel-16 | 16MiW | 1.153 | 867,467,088 | 3309.1 |
zip_vector_1D-abs-24 | 16MiW | 1.287 | 777,083,587 | 2964.3 |
zip_vector_1D-rel-24 | 16MiW | 1.288 | 776,600,612 | 2962.5 |
std::vector<int64_t>
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
std_vector_1D-abs-8 | 16MiW | 0.491 | 2,037,519,162 | 7772.5 |
std_vector_1D-rel-8 | 16MiW | 0.486 | 2,055,820,499 | 7842.3 |
std_vector_1D-abs-16 | 16MiW | 0.488 | 2,047,901,254 | 7812.1 |
std_vector_1D-rel-16 | 16MiW | 0.486 | 2,057,335,612 | 7848.1 |
std_vector_1D-abs-24 | 16MiW | 0.488 | 2,047,893,255 | 7812.1 |
std_vector_1D-rel-24 | 16MiW | 0.485 | 2,060,771,399 | 7861.2 |
std_vector_1D-abs-32 | 16MiW | 0.491 | 2,036,940,546 | 7770.3 |
std_vector_1D-rel-32 | 16MiW | 0.508 | 1,970,149,975 | 7515.5 |
std_vector_1D-abs-48 | 16MiW | 0.488 | 2,050,546,412 | 7822.2 |
std_vector_1D-rel-48 | 16MiW | 0.489 | 2,045,396,563 | 7802.6 |
std::vector<int32_t>
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
std_vector_1D-abs-8 | 16MiW | 0.227 | 4,399,989,194 | 16784.6 |
std_vector_1D-rel-8 | 16MiW | 0.224 | 4,456,231,246 | 16999.2 |
std_vector_1D-abs-16 | 16MiW | 0.224 | 4,460,175,033 | 17014.2 |
std_vector_1D-rel-16 | 16MiW | 0.233 | 4,293,298,885 | 16377.6 |
std_vector_1D-abs-24 | 16MiW | 0.229 | 4,370,102,289 | 16670.6 |
std_vector_1D-rel-24 | 16MiW | 0.232 | 4,307,159,106 | 16430.5 |
zip_vector<int64_t> with 2D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_2D-abs-8 | 16MiW | 0.388 | 2,577,607,383 | 9832.8 |
zip_vector_2D-rel-8 | 16MiW | 0.505 | 1,978,698,920 | 7548.1 |
zip_vector_2D-abs-16 | 16MiW | 0.465 | 2,151,404,781 | 8207.0 |
zip_vector_2D-rel-16 | 16MiW | 0.529 | 1,891,680,714 | 7216.2 |
zip_vector_2D-abs-24 | 16MiW | 0.682 | 1,467,151,472 | 5596.7 |
zip_vector_2D-rel-24 | 16MiW | 0.788 | 1,268,457,762 | 4838.8 |
zip_vector_2D-abs-32 | 16MiW | 0.610 | 1,638,794,975 | 6251.5 |
zip_vector_2D-rel-32 | 16MiW | 0.668 | 1,497,905,530 | 5714.1 |
zip_vector_2D-abs-48 | 16MiW | 0.795 | 1,258,082,232 | 4799.2 |
zip_vector_2D-rel-48 | 16MiW | 0.882 | 1,133,306,070 | 4323.2 |
zip_vector<int32_t> with 2D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_2D-abs-8 | 16MiW | 0.526 | 1,899,431,185 | 7245.8 |
zip_vector_2D-rel-8 | 16MiW | 0.526 | 1,900,954,056 | 7251.6 |
zip_vector_2D-abs-16 | 16MiW | 0.596 | 1,677,024,963 | 6397.3 |
zip_vector_2D-rel-16 | 16MiW | 0.596 | 1,679,239,464 | 6405.8 |
zip_vector_2D-abs-24 | 16MiW | 0.785 | 1,273,253,906 | 4857.1 |
zip_vector_2D-rel-24 | 16MiW | 0.783 | 1,276,540,527 | 4869.6 |
zip_vector<int64_t> with 1D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_1D-abs-8 | 16MiW | 1.020 | 980,372,353 | 3739.8 |
zip_vector_1D-rel-8 | 16MiW | 1.134 | 881,642,641 | 3363.2 |
zip_vector_1D-abs-16 | 16MiW | 1.091 | 916,833,601 | 3497.4 |
zip_vector_1D-rel-16 | 16MiW | 1.161 | 861,238,444 | 3285.4 |
zip_vector_1D-abs-24 | 16MiW | 1.298 | 770,233,637 | 2938.2 |
zip_vector_1D-rel-24 | 16MiW | 1.423 | 702,710,323 | 2680.6 |
zip_vector_1D-abs-32 | 16MiW | 1.238 | 807,769,314 | 3081.4 |
zip_vector_1D-rel-32 | 16MiW | 1.279 | 781,678,405 | 2981.9 |
zip_vector_1D-abs-48 | 16MiW | 1.407 | 710,610,011 | 2710.8 |
zip_vector_1D-rel-48 | 16MiW | 1.508 | 663,060,725 | 2529.4 |
zip_vector<int32_t> with 1D iteration
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
zip_vector_1D-abs-8 | 16MiW | 0.762 | 1,312,336,224 | 5006.2 |
zip_vector_1D-rel-8 | 16MiW | 0.761 | 1,313,326,131 | 5009.9 |
zip_vector_1D-abs-16 | 16MiW | 0.849 | 1,178,253,224 | 4494.7 |
zip_vector_1D-rel-16 | 16MiW | 0.849 | 1,178,316,529 | 4494.9 |
zip_vector_1D-abs-24 | 16MiW | 1.030 | 970,811,341 | 3703.4 |
zip_vector_1D-rel-24 | 16MiW | 1.034 | 967,127,596 | 3689.3 |
std::vector<int64_t>
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
std_vector_1D-abs-8 | 16MiW | 0.565 | 1,769,713,987 | 6750.9 |
std_vector_1D-rel-8 | 16MiW | 0.568 | 1,761,752,989 | 6720.6 |
std_vector_1D-abs-16 | 16MiW | 0.570 | 1,754,592,031 | 6693.2 |
std_vector_1D-rel-16 | 16MiW | 0.569 | 1,758,711,679 | 6709.0 |
std_vector_1D-abs-24 | 16MiW | 0.570 | 1,754,429,283 | 6692.6 |
std_vector_1D-rel-24 | 16MiW | 0.567 | 1,764,868,512 | 6732.4 |
std_vector_1D-abs-32 | 16MiW | 0.570 | 1,754,771,878 | 6693.9 |
std_vector_1D-rel-32 | 16MiW | 0.571 | 1,750,488,167 | 6677.6 |
std_vector_1D-abs-48 | 16MiW | 0.570 | 1,755,106,344 | 6695.2 |
std_vector_1D-rel-48 | 16MiW | 0.569 | 1,756,333,139 | 6699.9 |
std::vector<int32_t>
benchmark | size(W) | time(ns) | word/sec | MiB/s |
---|---|---|---|---|
std_vector_1D-abs-8 | 16MiW | 0.373 | 2,679,107,795 | 10220.0 |
std_vector_1D-rel-8 | 16MiW | 0.369 | 2,707,621,476 | 10328.8 |
std_vector_1D-abs-16 | 16MiW | 0.369 | 2,708,236,438 | 10331.1 |
std_vector_1D-rel-16 | 16MiW | 0.370 | 2,699,387,340 | 10297.3 |
std_vector_1D-abs-24 | 16MiW | 0.368 | 2,717,198,314 | 10365.3 |
std_vector_1D-rel-24 | 16MiW | 0.372 | 2,685,961,839 | 10246.1 |
Benchmarks of the low-level integer block compression codecs using AVX-512 for 32-bit and 64-bit datatypes.
Figure 1: Benchmark ZVec 64-bit Block Encode, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 2: Benchmark ZVec 64-bit Synthesize Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 3: Benchmark ZVec 64-bit Encode Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 4: Benchmark ZVec 64-bit Decode Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
Benchmarks of the ZVec AVX-512 codecs with a 32-bit datatype.
Figure 5: Benchmark ZVec 32-bit Block Encode, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 6: Benchmark ZVec 32-bit Synthesize Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 7: Benchmark ZVec 32-bit Encode Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
Figure 8: Benchmark ZVec 32-bit Decode Block, Intel Core i9-7980XE, 4.3GHz, AVX-512
zip_vector and the zvec block codecs are available under PLEASE LICENSE, an ISC-derived license that relies on the authors' implied copyright as detailed under the Berne Convention. The primary difference from the ISC license is that PLEASE LICENSE emphasizes implied copyright, which is why it is positioned where the copyright text would normally be placed, and there is no restriction stating that the copyright notice must be retained. From this perspective it may seem close to public domain; however, it is not public domain, since copyright is not relinquished but explicitly retained in an implied form to defend the free use of the work.
All rights to this work are granted for all purposes, with exception of
author's implied right of copyright to defend the free use of this work.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.