Minimal multi-threaded grep.
No bells, no whistles, no case insensitivity, no regular expressions. Just plain text search. Works similarly to:
sed -n '/needle/p' haystack.txt
This prints all lines containing the string needle in the file haystack.txt. Because the search is multi-threaded, the printed lines are not guaranteed to appear in the same order as in the file.
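To illustrate why the output order varies, here is a minimal Python sketch of a chunked, multi-threaded substring search. It is not fast-grep's actual implementation: the name find_lines, the workers parameter and the chunking strategy are assumptions made for illustration only. Each thread appends its matches when it finishes its chunk, so the overall order depends on thread scheduling.

```python
import threading

def find_lines(data, needle, workers=4):
    """Return every line of `data` (bytes) that contains `needle` (bytes).

    Hypothetical helper: the result order depends on which thread
    finishes first, so it is not guaranteed to follow the input order.
    """
    # Cut the buffer into roughly equal chunks. Each cut is moved forward
    # to just after the next newline so that no line is split in two.
    bounds = [0]
    for i in range(1, workers):
        start = max(bounds[-1], len(data) * i // workers)
        cut = data.find(b"\n", start)
        bounds.append(len(data) if cut == -1 else cut + 1)
    bounds.append(len(data))

    matches = []
    lock = threading.Lock()

    def scan(start, end):
        # Plain substring test, no regular expressions.
        found = [ln for ln in data[start:end].splitlines() if needle in ln]
        with lock:
            matches.extend(found)

    threads = [
        threading.Thread(target=scan, args=(bounds[i], bounds[i + 1]))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return matches
```

Note that CPython's global interpreter lock prevents this sketch from showing a real speed-up. It only shows the structure of the approach.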
EXAMPLE
Search for lines containing the string '.se-' in the file cred, but skip lines containing the string '-|-|--':
./fast-grep -v '-|-|--' '.se-' cred
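Judging from the example above, -v takes a string of its own and drops any matching line that also contains it. The per-line decision might look like the following Python sketch. The name keep_line and its parameters are hypothetical and not taken from fast-grep's source.

```python
def keep_line(line, needle, exclude=None):
    """Decide whether a line should be printed.

    Hypothetical helper: keep the line if it contains `needle` and,
    when an exclude string is given, does not contain `exclude`.
    Both tests are plain substring checks.
    """
    if needle not in line:
        return False
    return exclude is None or exclude not in line


# Usage, mirroring the command-line example above.
sample = ["user.se-abc", "user.se-|-|--", "user.com-abc"]
kept = [ln for ln in sample if keep_line(ln, ".se-", "-|-|--")]
```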
LIMITATIONS
- Only tested with GNU/Linux and OS X.
- Will fail for large files (> 2 GB) on a 32-bit OS. This could be fixed by using LFS and calling mmap64 repeatedly, mapping 1 or 2 GB in each iteration, but that is currently not implemented.
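The windowed mapping idea mentioned in the last limitation can be sketched in Python as follows. This only illustrates the technique and is not code from fast-grep, where it is not implemented. The name search_in_windows, the window parameter and the carry buffer are assumptions. The file is mapped one window at a time, and a line that straddles two windows is reassembled before it is tested.

```python
import mmap
import os

def search_in_windows(path, needle, window=1 << 30):
    """Yield lines (bytes) containing `needle`, mapping the file piecewise.

    Hypothetical helper: at most `window` bytes are mapped at a time,
    so the address space needed stays bounded for very large files.
    """
    # Offsets passed to mmap must be multiples of the allocation
    # granularity, so round the window size down to such a multiple.
    gran = mmap.ALLOCATIONGRANULARITY
    window = max(gran, window - window % gran)
    size = os.path.getsize(path)
    carry = b""  # unfinished line left over from the previous window

    with open(path, "rb") as f:
        for offset in range(0, size, window):
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                pos = 0
                while True:
                    nl = m.find(b"\n", pos)
                    if nl == -1:
                        # No newline left: keep the tail for the next window.
                        carry += m[pos:]
                        break
                    line = carry + m[pos:nl]
                    carry = b""
                    if needle in line:
                        yield line
                    pos = nl + 1

    # The last line may lack a trailing newline.
    if carry and needle in carry:
        yield carry
```

Unlike the multi-threaded search, this sketch is sequential and yields lines in file order.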
BENCHMARK
Searching a 9.3G ASCII text file named cred for the string ".se-" (406006 matching lines).
Computers:
- HP 8560w laptop (Core i7m, Linux), GNU sed 4.2.1, GNU grep 2.14, perl v5.12.4, Python 2.7.5
- Mac (Core i5, OS X), BSD sed (version?), BSD grep 2.5.1, perl v5.16.2, Python 2.7.5
- Ubuntu (Xeon W3550), GNU sed 4.2.1, GNU grep 2.10, perl v5.14.2, Python 2.7.3
Commands:
- Sed
time sed -n '/\.se-/p' cred > se.txt
- Perl
time perl -e 'while (<>) { /\.se-/ && print; }' cred > se.txt
- Python
time python2 -c 'for line in open("cred"):
    if ".se-" in line: print line.rstrip()' > se.txt
- Grep
time grep '\.se-' cred > se.txt
- Fast-grep
time ./fast-grep '.se-' cred > se.txt
Results:
Computer  | 1 sed | 2 perl | 3 python | 4 grep | 5 fast-grep                  |
HP laptop | 1m15s | 1m7s   | 0m57s    | 0m19s  | 0m5s                         |
Mac OS X  | 3m2s  | 0m45s  | 0m33s    | 2m29s  | 0m5s                         |
Ubuntu    | 1m13s | 1m14s  | 1m13s    | 1m11s  | 1m17s (1m10s w/o threading)  |
Note: fast-grep runs considerably slower the first time. Once the file contents are cached in memory it performs well. This requires a computer with sufficient RAM; otherwise the filesystem becomes the bottleneck. None of the benchmarks in the table are from a first run, so all programs had an equal chance to benefit from cached file contents.
Note2: Sed and grep run faster on Linux, while perl and python run faster on OS X. To be completely fair, the tests should be re-run on the same hardware. The sed and grep differences can be explained by the different implementations (BSD/GNU), while file reading seems to be faster on the Linux machine (ext4/HFS+).
Note3: The Ubuntu machine didn't have enough RAM to cache the whole file in memory, so all programs had to do a lot of disk accesses. In this case the benchmark really measures the speed of the disk rather than the speed of the program.
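The warm-cache procedure described in the first note (discarding the first run so every program sees cached file contents) could be scripted roughly as below. This is a sketch under assumptions, not the script used for the table above: time_warm and runs are hypothetical names, and output is discarded instead of being redirected to se.txt.

```python
import subprocess
import time

def time_warm(cmd, runs=3):
    """Run `cmd` runs + 1 times and return the timings of the last `runs`.

    Hypothetical helper: the first run is treated as a warm-up that lets
    the OS cache the input file, so its timing is dropped.
    """
    timings = []
    for _ in range(runs + 1):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings[1:]
```

For example, time_warm(["./fast-grep", ".se-", "cred"]) would return three wall-clock timings in seconds. The file still has to fit in RAM for the warm-up to have any effect.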
LICENSE
~~=) All Rights Reversed - No Rights Reserved (=~~
Setting Orange, the 28th day of The Aftermath in the YOLD 3179
Albert Veli