This is an optimized library for ChaCha, a stream cipher with a 256-bit key and a 64-bit nonce.
HChaCha is also implemented, which is used to build XChaCha, a variant which extends the nonce from 64 bits to 192 bits. See Extending the Salsa20 nonce.
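
As a rough illustration of the nonce extension, the construction from "Extending the Salsa20 nonce" feeds the first 16 bytes of the 24-byte nonce to HChaCha to derive a 32-byte subkey, and uses the remaining 8 bytes as the ordinary 64-bit ChaCha nonce. The helper below only shows that split. `xchacha_split_nonce` is a hypothetical name for illustration and is not part of this library, whose internals may be organized differently.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the library: splits a 24-byte XChaCha
 * nonce into the 16 bytes fed to HChaCha and the 8 bytes that remain for
 * the 64-bit ChaCha nonce. */
void xchacha_split_nonce(const uint8_t nonce24[24],
                         uint8_t hchacha_iv[16],
                         uint8_t chacha_iv[8]) {
    memcpy(hchacha_iv, nonce24, 16);     /* first 128 bits go to HChaCha */
    memcpy(chacha_iv, nonce24 + 16, 8);  /* last 64 bits go to ChaCha */
}
```

The 32-byte HChaCha output would then serve as the key for a regular ChaCha call with the 8-byte remainder as its nonce.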
The most optimized version for the underlying CPU, that passes internal tests, is selected at runtime.
All assembler is PIC safe.
If you encrypt anything without using a MAC (HMAC, Poly1305, etc), you will be found, and made fun of.
The library can be initialized, i.e. the most optimized implementation that passes internal tests will be automatically selected, in two ways, neither of which is thread safe:
- `int chacha_startup(void);` explicitly initializes the library, and returns a non-zero value if no suitable implementation is found that passes internal tests.
- Do nothing and use the library like normal. It will auto-initialize itself when needed, and hard exit if no suitable implementation is found.
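
The selection step can be pictured roughly as follows. This is only a sketch of the idea (walk a list ordered from most to least optimized, skip anything the CPU cannot run or that fails its self-test), not the library's actual code. `impl_desc`, `select_impl`, the flag values, and the two demo self-tests are all hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Hypothetical descriptor for one implementation; these names are
 * illustrative and are not the library's internal types. */
typedef struct {
    const char *name;         /* e.g. "chacha_avx2" */
    unsigned required_flags;  /* CPU features this version needs */
    int (*self_test)(void);   /* returns 0 when the version passes */
} impl_desc;

/* Walks a list ordered from most to least optimized and returns the first
 * entry that the CPU supports and whose self-test passes, or NULL. */
const impl_desc *select_impl(const impl_desc *list, size_t count,
                             unsigned cpu_flags) {
    size_t i;
    for (i = 0; i < count; i++) {
        if ((list[i].required_flags & cpu_flags) != list[i].required_flags)
            continue;  /* CPU lacks a needed feature */
        if (list[i].self_test() != 0)
            continue;  /* failed its internal test */
        return &list[i];
    }
    return NULL;       /* nothing suitable; the caller decides how to fail */
}

/* Stand-in self-tests so the sketch can be exercised on its own. */
int demo_test_pass(void) { return 0; }
int demo_test_fail(void) { return 1; }
```

Because neither initialization path is thread safe, a multi-threaded program would presumably want to call `chacha_startup()` once, and check its return value, before starting any threads that use the library.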
Common assumptions:
- `chacha_key`, `chacha_iv`, and `chacha_iv24` variables can be accessed through their `b` member, which is an array of unsigned bytes.
- `rounds` is an even number 2 or greater.
- If `in` is `NULL`, the output will be stored to `out` (useful for things like random number generation or generating intermediate keys).

`in` and `out` are assumed to be word aligned. Incremental support has no alignment requirements, but will slow down if non word-aligned pointers are passed.
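
A caller that wants to check these assumptions before a one-shot call could use something like the sketch below. Both helpers are hypothetical and not part of the library. "Word aligned" is assumed here to mean aligned to `sizeof(size_t)`, which the text above does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* rounds must be an even number, 2 or greater. */
int rounds_ok(size_t rounds) {
    return rounds >= 2 && (rounds % 2) == 0;
}

/* Assumption: "word aligned" is taken to mean aligned to sizeof(size_t). */
int is_word_aligned(const void *p) {
    return ((uintptr_t)p % sizeof(size_t)) == 0;
}
```

If either pointer fails the alignment check, the incremental calls are the documented fallback, at some cost in speed.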
void chacha(const chacha_key *key, const chacha_iv *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);
void xchacha(const chacha_key *key, const chacha_iv24 *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);
Encrypts `inlen` bytes from `in` to `out`, using `key`, `iv`, and `rounds`.
Incremental `in` and `out` buffers are not required to be word aligned. Unaligned buffers will require copying to aligned buffers however, which will incur a speed penalty.
void chacha_init(chacha_state *S, const chacha_key *key, const chacha_iv *iv, size_t rounds);
void xchacha_init(chacha_state *S, const chacha_key *key, const chacha_iv24 *iv, size_t rounds);
Initializes the `chacha_state` with `key`, `iv`, and `rounds`, and sets the internal block counter to 0.
size_t chacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);
size_t xchacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);
Generates/xors up to `inlen + 63` bytes depending on how many bytes are in the internal buffer, and returns the number of encrypted bytes written to `out`.
size_t chacha_final(chacha_state *S, uint8_t *out);
size_t xchacha_final(chacha_state *S, uint8_t *out);
Generates/crypts any leftover data in the state to `out`, and returns the number of bytes written.
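
To make the `inlen + 63` bound concrete, here is a small model of the byte counts only (no cipher involved). It assumes the update call emits whole 64-byte blocks and keeps the remainder buffered until the final call, which is how the descriptions above read. `model_update` and `model_final` are hypothetical names, and the library's real bookkeeping may differ in detail.

```c
#include <stddef.h>

#define BLOCK_BYTES 64  /* one ChaCha block */

/* Models byte counts only, not the cipher. `buffered` is how many bytes
 * (0..63) are waiting in the state; it is updated in place. Returns how
 * many bytes an update call would be expected to write. */
size_t model_update(size_t *buffered, size_t inlen) {
    size_t total = *buffered + inlen;
    size_t written = (total / BLOCK_BYTES) * BLOCK_BYTES; /* whole blocks */
    *buffered = total % BLOCK_BYTES;                      /* kept for later */
    return written;
}

/* A final call flushes whatever is still buffered. */
size_t model_final(size_t *buffered) {
    size_t written = *buffered;
    *buffered = 0;
    return written;
}
```

Under this model the worst case is 63 buffered bytes plus `inlen` new ones, so an output buffer with room for `inlen + 63` bytes per update call is always enough.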
void hchacha(const uint8_t key[32], const uint8_t iv[16], uint8_t out[32], size_t rounds);
Computes HChaCha into `out`, using `key`, `iv`, and `rounds`.
const size_t rounds = 20;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100];
chacha(&key, &iv, in, out, 100, rounds);
Encrypting incrementally, i.e. with multiple calls to collect/write data. Note that passing in data to be encrypted will not always result in data being written out. The implementation collects data until there is at least 1 block (64 bytes) of data available.
const size_t rounds = 20;
chacha_state S;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100], *out_pointer = out;
size_t i, bytes_written;
chacha_init(&S, &key, &iv, rounds);
/* add one byte at a time, extremely inefficient */
for (i = 0; i < 100; i++) {
    bytes_written = chacha_update(&S, in + i, out_pointer, 1);
    out_pointer += bytes_written;
}
bytes_written = chacha_final(&S, out_pointer);
x86-64, SSE2-32, and SSE3-32 versions are lightly modified from DJB's public domain implementations.
- Generic: chacha_ref

x86 (32-bit):
- 386 compatible: chacha_x86
- SSE2: chacha_sse2
- SSSE3: chacha_ssse3
- AVX: chacha_avx
- XOP: chacha_xop
- AVX2: chacha_avx2

x86-64:
- x86-64 compatible: chacha_x86
- SSE2: chacha_sse2
- SSSE3: chacha_ssse3
- AVX: chacha_avx
- XOP: chacha_xop
- AVX2: chacha_avx2

x86-64 will almost always be slower than SSE2, but on some older AMDs it may be faster.

ARM:
- ARMv6: chacha_armv6
- NEON: chacha_neon
See asm-opt#configuring for full configure options.
If you would like to use Yasm with a gcc-compatible compiler, pass `--yasm` to configure.
The Visual Studio projects are generated assuming Yasm is available. You will need to have Yasm.exe somewhere in your path to build them.
./configure
make lib
and `make install-lib` OR copy `bin/chacha.lib` and `app/include/chacha.h` to your desired location.
./configure --pic
make shared
make install-shared
./configure
make util
bin/chacha-util [bench|fuzz]
Benchmarking will implicitly test every available version. If any fail, it will exit with an error indicating which versions did not pass. Features tested include:
- Partial block generation
- Single block generation
- Multi block generation
- Counter handling when the 32-bit low half overflows to the upper half
- Streaming and XOR modes
- Incremental encryption
- Input/Output alignment
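
The counter case in the list above refers to the 64-bit block counter being held as two 32-bit words. A minimal sketch of the carry being tested, with a hypothetical `counter_increment` helper that is not taken from the library:

```c
#include <stdint.h>

/* Increments a 64-bit block counter stored as two 32-bit words.
 * counter[0] is the low half, counter[1] the high half. */
void counter_increment(uint32_t counter[2]) {
    counter[0] += 1;      /* unsigned arithmetic wraps to 0 at 2^32 */
    if (counter[0] == 0)
        counter[1] += 1;  /* carry into the upper half */
}
```

An implementation that forgets the carry would repeat keystream after 2^32 blocks, which is why the boundary gets its own test.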
Fuzzing tests every available implementation for the current CPU against the reference implementation. Features tested are:
- HChaCha output
- One-shot ChaCha
- Incremental ChaCha with potentially unaligned output
As I have not updated any benchmarks yet, raw cycle counts should have ~10-20 cycles added from the overhead of targets not being hardcoded.
Impl. | 1 byte, 8 rounds | 1 byte, 12 rounds | 1 byte, 20 rounds | 576 bytes, 8 rounds | 576 bytes, 12 rounds | 576 bytes, 20 rounds | 8192 bytes, 8 rounds | 8192 bytes, 12 rounds | 8192 bytes, 20 rounds |
---|---|---|---|---|---|---|---|---|---|
SSSE3-64 | 237 | 300 | 437 | 1.71 | 2.23 | 3.30 | 1.46 | 1.90 | 2.82 |
SSE2-64 | 262 | 337 | 500 | 1.98 | 2.65 | 3.97 | 1.68 | 2.29 | 3.42 |
SSSE3-32 | 287 | 350 | 487 | 2.04 | 2.69 | 3.99 | 1.72 | 2.37 | 3.59 |
SSE2-32 | 312 | 400 | 562 | 2.43 | 3.26 | 4.95 | 2.12 | 2.90 | 4.52 |
Impl. | 8 | 12 | 20 |
---|---|---|---|
SSSE3-64 | 162 | 237 | 362 |
SSSE3-32 | 175 | 250 | 375 |
SSE2-64 | 200 | 275 | 450 |
SSE2-32 | 200 | 275 | 450 |
Impl. | 1 byte, 8 rounds | 1 byte, 12 rounds | 1 byte, 20 rounds | 576 bytes, 8 rounds | 576 bytes, 12 rounds | 576 bytes, 20 rounds | 8192 bytes, 8 rounds | 8192 bytes, 12 rounds | 8192 bytes, 20 rounds |
---|---|---|---|---|---|---|---|---|---|
AVX-64 | 176 | 240 | 364 | 1.22 | 1.68 | 2.64 | 1.04 | 1.46 | 2.29 |
SSSE3-64 | 180 | 248 | 384 | 1.35 | 1.88 | 2.94 | 1.18 | 1.65 | 2.59 |
AVX-32 | 184 | 248 | 380 | 1.50 | 2.03 | 3.10 | 1.24 | 1.72 | 2.68 |
SSSE3-32 | 228 | 292 | 428 | 1.84 | 2.47 | 3.74 | 1.65 | 2.23 | 3.41 |
Impl. | 8 | 12 | 20 |
---|---|---|---|
AVX-64 | 116 | 180 | 308 |
AVX-32 | 128 | 192 | 320 |
SSSE3-64 | 128 | 192 | 328 |
SSSE3-32 | 136 | 204 | 336 |
Timings are with Turbo Boost and Hyperthreading, so their accuracy is not concrete. For reference, OpenSSL and Crypto++ give ~0.8cpb for AES-128-CTR and ~1.1cpb for AES-256-CTR, and ~7.4cpb for SHA-512.
Impl. | 1 byte, 8 rounds | 1 byte, 12 rounds | 1 byte, 20 rounds | 576 bytes, 8 rounds | 576 bytes, 12 rounds | 576 bytes, 20 rounds | 8192 bytes, 8 rounds | 8192 bytes, 12 rounds | 8192 bytes, 20 rounds |
---|---|---|---|---|---|---|---|---|---|
AVX2-64 | 146 | 194 | 313 | 0.68 | 0.97 | 1.48 | 0.52 | 0.71 | 1.08 |
AVX2-32 | 170 | 218 | 337 | 0.83 | 1.11 | 1.66 | 0.62 | 0.83 | 1.24 |
AVX-64 | 146 | 194 | 316 | 1.06 | 1.50 | 2.33 | 0.94 | 1.32 | 2.05 |
AVX-32 | 158 | 206 | 328 | 1.32 | 1.82 | 2.81 | 1.12 | 1.57 | 2.47 |
(these are all literally the same version, timing differences are noise)
Impl. | 8 | 12 | 20 |
---|---|---|---|
AVX2-64 | 81 | 155 | 251 |
AVX2-32 | 87 | 155 | 254 |
AVX-64 | 87 | 155 | 274 |
AVX-32 | 87 | 152 | 251 |
Timings are with Turbo on, so accuracy is not concrete, and I'm not sure how to adjust for it either. Depending on clock speed (3.1 GHz vs 4.0 GHz), OpenSSL gives between 0.73 and 0.94 cpb for AES-128-CTR, 1.03 and 1.33 cpb for AES-256-CTR, and 10.96 and 14.1 cpb for SHA-512.
Impl. | 1 byte, 8 rounds | 1 byte, 12 rounds | 1 byte, 20 rounds | 576 bytes, 8 rounds | 576 bytes, 12 rounds | 576 bytes, 20 rounds | 8192 bytes, 8 rounds | 8192 bytes, 12 rounds | 8192 bytes, 20 rounds |
---|---|---|---|---|---|---|---|---|---|
XOP-64 | 194 | 269 | 418 | 1.09 | 1.47 | 2.25 | 0.93 | 1.22 | 1.80 |
AVX-64 | 245 | 344 | 544 | 1.41 | 1.97 | 3.14 | 1.20 | 1.63 | 2.51 |
XOP-32 | 247 | 322 | 471 | 1.44 | 1.96 | 3.01 | 1.26 | 1.70 | 2.59 |
AVX-32 | 276 | 375 | 573 | 1.88 | 2.53 | 3.78 | 1.62 | 2.16 | 3.23 |
Impl. | 8 | 12 | 20 |
---|---|---|---|
XOP-64 | 84 | 160 | 309 |
XOP-32 | 91 | 165 | 318 |
AVX-64 | 144 | 243 | 441 |
AVX-32 | 144 | 237 | 441 |
I don't have access to the cycle counter yet, so cycles are computed by taking the elapsed microseconds times the clock speed in Hz (666 MHz) divided by 1 million. For comparison, on long messages, OpenSSL 1.0.0e gives 52.3 cpb for aes-128-cbc (woof), and djb's armneon6 Salsa20/20 implementation gives 8.2 cpb.
Impl. | 1 byte, 8 rounds | 1 byte, 12 rounds | 1 byte, 20 rounds | 576 bytes, 8 rounds | 576 bytes, 12 rounds | 576 bytes, 20 rounds | 8192 bytes, 8 rounds | 8192 bytes, 12 rounds | 8192 bytes, 20 rounds |
---|---|---|---|---|---|---|---|---|---|
NEON-32 | 460 | 573 | 814 | 3.53 | 4.73 | 7.13 | 3.06 | 4.26 | 6.47 |
ARMv6-32 | 437 | 565 | 793 | 5.33 | 7.07 | 10.87 | 5.07 | 6.93 | 10.73 |
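
The cycle estimate used for these ARM timings is plain arithmetic. A sketch of it is below, together with the cycles-per-byte division. Both function names are hypothetical and are not taken from the benchmark utility.

```c
#include <stdint.h>

/* Estimated cycles = elapsed microseconds * clock speed in Hz / 1,000,000.
 * At 666 MHz this works out to 666 cycles per microsecond. */
uint64_t usec_to_cycles(uint64_t usec, uint64_t clock_hz) {
    return usec * clock_hz / 1000000u;
}

/* Cycles per byte for a message of the given length. */
double cycles_per_byte(uint64_t cycles, uint64_t bytes) {
    return (double)cycles / (double)bytes;
}
```

Because the result is only as fine-grained as the microsecond timer, short messages will carry proportionally more measurement error than long ones.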
NEON shares the same implementation as ARMv6 as NEON latencies are too high for a single block.
Impl. | 8 | 12 | 20 |
---|---|---|---|
NEON-32 | 294 | 446 | 658 |
ARMv6-32 | 294 | 446 | 658 |
Public Domain, or MIT