Submission for the Hutter Prize.
v1 submitted on June 10, 2021.
v2 submitted on August 1, 2021. v2 is faster than v1 but has a slightly worse compression ratio.
v3 submitted on August 9, 2021. v3 has a slightly better compression ratio than v2 but is slower. v2 did not clear the 1% improvement threshold when measured on a test computer.
This submission contains some small modifications on top of the recent STARLIT Hutter Prize winner.
Below is the cmix-hp v3 result:

| Metric | Value |
|---|---|
| cmix-hp compressor's executable file size (S1) | 396929 bytes |
| cmix-hp self-extracting archive size (S2) | 113733212 bytes |
| Total size (S) | 114130141 bytes |
| Previous record (L) | 115352938 bytes |
| cmix-hp improvement (1 - S/L) | 1.06% |
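
The total size and the improvement figure in the table above follow directly from S1, S2 and L. The short Python check below only repeats that arithmetic; the function and variable names are chosen here for illustration and are not part of cmix-hp.

```python
# Figures copied from the results table above (all in bytes).
S1 = 396_929        # compressor executable size
S2 = 113_733_212    # self-extracting archive size
L = 115_352_938     # previous record


def improvement(total: int, previous: int) -> float:
    """Relative improvement 1 - S/L, returned as a fraction."""
    return 1 - total / previous


S = S1 + S2
print(S)                            # 114130141
print(f"{improvement(S, L):.2%}")   # 1.06%
```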

| Experiment platform | |
|---|---|
| Operating system | Ubuntu 20.04 |
| Processor | Intel Core i7-7700K @ 4.20GHz (Geekbench score 1288) |
| Memory | 32 GB DDR4 |
| Decompression running time | 42.8 hours |
| Decompression RAM max usage | 6905 MiB |
| Decompression disk usage | ~35 GB |

Time, disk, and RAM usage are approximately symmetric for compression and decompression.
Here is a comparison between different entries on this computer:

| Name | Running time (hours) | S1 (bytes) | S2 (bytes) |
|---|---|---|---|
| STARLIT | 44.13 | 401264 | 114920105 |
| cmix-hp v1 | 45.44 | 397242 | 113688192 |
| cmix-hp v2 | 41.94 | 396779 | 113760510 |
| cmix-hp v3 | 42.8 | 396929 | 113733212 |
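
To make the comparison easier to read, the sketch below adds S1 and S2 for each row and reports how many bytes each entry saves relative to the STARLIT row as measured on this computer. The values are copied from the table; the helper names are hypothetical. Note that the STARLIT row sums to 115321369 bytes, which is not the same number as the previous record L quoted earlier, so the savings here are relative to the table row only.

```python
# (name, running time in hours, S1 bytes, S2 bytes), copied from the table above.
ENTRIES = [
    ("STARLIT", 44.13, 401_264, 114_920_105),
    ("cmix-hp v1", 45.44, 397_242, 113_688_192),
    ("cmix-hp v2", 41.94, 396_779, 113_760_510),
    ("cmix-hp v3", 42.8, 396_929, 113_733_212),
]


def totals(entries):
    """Map each entry name to its total size S1 + S2 in bytes."""
    return {name: s1 + s2 for name, _hours, s1, s2 in entries}


def saved_vs(baseline, entries):
    """Bytes saved by each entry relative to the named baseline entry."""
    sizes = totals(entries)
    return {name: sizes[baseline] - size for name, size in sizes.items()}


if __name__ == "__main__":
    sizes = totals(ENTRIES)
    for name, saved in saved_vs("STARLIT", ENTRIES).items():
        print(f"{name:<12} total={sizes[name]:>11} saved={saved:>9}")
```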
The submission has several small tweaks on top of STARLIT. The most substantial change is a very large PPM model whose memory is swapped to disk to reduce RAM usage.
- Changed mod_ppmd from v3 to v2, which supports a higher memory limit.
- Used one PPM model (order-25) instead of two.
- Used mmap to store PPM memory to disk (reducing RAM usage).
- Added a limit to the cmix context mixer size to make it more memory efficient.
- Made the PAQ8HP context mixer use a hash map, to reduce memory usage.
- Memory tuning for PAQ8HP: shifted more memory to wordModel.
- Reduced memory usage for some cmix models, and removed "Direct" models.
- Removed some unused preprocessor code.
- Exported fewer model predictions from PAQ8HP to the cmix mixer.
- Removed some cmix context mixers.
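
The mmap item in the list above is the key memory trick: the buffer is backed by a file, so the operating system can page it out to disk instead of keeping it all in RAM. The Python sketch below illustrates the general idea only. It is not the code in src/models/ppmd.cpp (which is C++), and the function name, file name and buffer size are hypothetical.

```python
import mmap
import os
import tempfile


def open_disk_backed_buffer(path: str, size: int) -> mmap.mmap:
    """Create a file of `size` bytes and map it into memory.

    Reads and writes go through the page cache, so the OS is free to
    evict pages to the backing file when RAM is tight.
    """
    with open(path, "w+b") as f:
        f.truncate(size)                    # reserve `size` bytes on disk
        return mmap.mmap(f.fileno(), size)  # mapping keeps its own descriptor


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buf = open_disk_backed_buffer(os.path.join(tmp, "ppm_heap.bin"), 1 << 20)
        buf[0:4] = b"cmix"   # used like an ordinary bytearray
        buf.flush()          # push dirty pages to the backing file
        print(len(buf), bytes(buf[0:4]))
        buf.close()
```

The trade-off is the one described for `mmap_to_disk`: lower peak RAM in exchange for slower access and heavy disk writes.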
Thanks to Kaido Orav for these suggested improvements:
- Swapped certain byte regions during preprocessing - a trick originally used in PAQ8HP.
- Improved handling of zero state in PAQ8HP mixer.
- Better handling of UTF characters in PAQ8HP.
- Added back two of the cmix context mixers (which were removed in v2).
The installation and usage instructions for cmix-hp are the same as for STARLIT. For convenience, most of the information below is copied from the STARLIT documentation.
One important note: it is recommended to change one variable in the source code for PPM. From line 26 in src/models/ppmd.cpp:
```cpp
// If mmap_to_disk is set to false (recommended setting), PPM will only use RAM
// for memory.
// If mmap_to_disk is set to true, PPM memory will be saved to disk using mmap.
// This will reduce RAM usage, but will be slower as well. *Warning*: this will
// write a *lot* of data to disk, so can reduce the lifespan of SSDs. Not
// recommended for normal usage.
bool mmap_to_disk = true;
```
This variable is set to true by default, to comply with the Hutter Prize RAM limit.
Building the cmix-hp compressor from source requires the clang-12, upx-ucl, and make packages. On Ubuntu, these packages can be installed by running the following scripts:
```bash
./install_tools/install_upx.sh
./install_tools/install_clang-12.sh
```
A bash script is provided for compiling the cmix-hp compressor from source on Ubuntu. The script places the cmix-hp executable, named `cmix`, in the `./run` directory. The script can be run as:
```bash
./build_and_construct_comp.sh
```
To run the cmix-hp compressor, use:
```bash
cd ./run
cmix -e <PATH_TO_ENWIK9> enwik9.comp
```
Expected output:
```
78125 bytes -> 415377 bytes in 74.17 s.
199784 bytes -> 1131233 bytes in 184.88 s.
Detected block types: TEXT: 100.0%
934188796 bytes -> 113536067 bytes in 155972.59 s.
free(): invalid size
Command terminated by signal 6
```
The `free(): invalid size` error at the end of the run does not affect the validity of the compressed output.
The compressor is expected to output an executable file named `archive9` in the same directory (`./run`). When executed, `archive9` is expected to reproduce the original enwik9 as a file named `enwik9_restored`. The executable `archive9` should be launched without arguments from the directory containing it:
```bash
cd ./run
./archive9
```
Expected output:
```
78125 bytes -> 415377 bytes in 72.31 s.
113536067 bytes -> 934188796 bytes in 154012.40 s.
```
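
After decompression finishes, it is worth confirming that `enwik9_restored` is byte-for-byte identical to the original enwik9. The Python sketch below is one way to do that by comparing SHA-256 digests. It is not part of cmix-hp; the helper names are hypothetical, and the two file paths are passed on the command line.

```python
import hashlib
import sys


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a 1 GB input does not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a: str, path_b: str) -> bool:
    """True when both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (paths are placeholders): python verify.py <PATH_TO_ENWIK9> enwik9_restored
if __name__ == "__main__" and len(sys.argv) == 3:
    ok = files_identical(sys.argv[1], sys.argv[2])
    print("match" if ok else "MISMATCH")
```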
Thanks to:
- Artemiy Margaritov for releasing STARLIT - it is a great achievement.
- Eugene Shelwien for releasing mod_ppmd - an essential component for this submission.
- Kaido Orav for suggesting improvements for cmix-hp v2.