Automatically determine which hash to use #1

Closed
rob-p opened this issue Nov 13, 2022 · 19 comments
Labels
enhancement New feature or request

Comments

@rob-p
Contributor

rob-p commented Nov 13, 2022

Hi @jermp,

It would be great if, in piscem (and perhaps more generally upstream in sshash), the width of the hash function could be adjusted automatically based on the estimated collision probability. Specifically, if the number of keys is large, this function (https://github.com/COMBINE-lab/piscem-cpp/blob/5231b57776422c5fcd77b48de3b39a21161726c2/include/util.hpp#L355) currently terminates the program with the relevant message.

I wonder if, instead, a more robust behavior might be to check the collision probability and, if it is too high, simply use the larger hash width. Alternatively (though I'm not sure it's a good idea), one could always try the 64-bit hash first and, if a collision occurs, retry with the 128-bit hash.

I don't think this will be super common for "normal"-sized indices, but there has already been some interest in building quite large indices with piscem (e.g. https://genomic.social/web/@raw937/109334346658954871), and I imagine that in such cases this issue would come up rather quickly.
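
For concreteness, here is a minimal sketch of the kind of check I have in mind; collision_probability and choose_hash_bits are hypothetical helpers (not the piscem-cpp function linked above), using the standard birthday-bound approximation:

#include <cmath>
#include <cstdint>

// Birthday-bound approximation: the probability of at least one collision among
// n keys under a b-bit hash is ~ 1 - exp(-n*(n-1) / 2^(b+1)).
// Hypothetical helper, not the check currently in piscem-cpp's util.hpp.
inline double collision_probability(uint64_t num_keys, int hash_bits) {
    if (num_keys < 2) return 0.0;
    long double pairs = 0.5L * (long double)num_keys * (long double)(num_keys - 1);
    long double buckets = std::ldexp(1.0L, hash_bits);  // 2^hash_bits
    return (double)(1.0L - std::exp(-pairs / buckets));
}

// Instead of terminating, pick the hash width: fall back to 128 bits when the
// 64-bit collision probability exceeds a small threshold.
inline int choose_hash_bits(uint64_t num_keys, double max_prob = 1e-6) {
    return collision_probability(num_keys, 64) <= max_prob ? 64 : 128;
}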

@rob-p rob-p added the enhancement New feature or request label Nov 13, 2022
@jermp
Collaborator

jermp commented Nov 13, 2022

Hi @rob-p,
for large indexes, I would always default to 128-bit hash codes. This will not affect the index size, obviously. The hasher is only used by the PTHash table for the minimizers. There will only be a slight penalty at query time, since computing 128-bit hash codes is more expensive than 64-bit ones, but I wouldn't regard this as a "real" issue :)
Also, I saw that someone integrated xxHash into PTHash in place of MurmurHash2 (jevinskie/pthash@f767dff). Perhaps that would even bridge the performance gap.

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

Thanks! Any suggestions on when to check, or how best to genericize the code for this? I suspect we would also have to record the hash size in the serialized output.

@jermp
Collaborator

jermp commented Nov 13, 2022

If we just default to 128-bit hashes, it's only a matter of changing one template, here: https://github.com/jermp/sshash/blob/master/include/util.hpp#L26. No need to save anything to the index file.
We should, however, bring the SSHash version used in piscem-cpp up to date, ideally through a git submodule, as we discussed in COMBINE-lab/piscem-cpp#2 some time ago.
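
Concretely, the change at util.hpp#L26 would amount to something like the following (the typedef name here is just illustrative; the actual one is at the line linked above):

// include/util.hpp (sketch): switch the base hasher used by the PTHash MPHF
// over the minimizers from 64-bit to 128-bit hash codes.
// typedef pthash::murmurhash2_64  base_hasher_type;  // 64-bit hash codes
typedef pthash::murmurhash2_128 base_hasher_type;     // 128-bit hash codes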

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

But when we load the index, don't we need to know which hash it was built with? Are you suggesting we always use the 128-bit hash? I think many common indices (e.g., the human genome and transcriptome) are well within the safe 64-bit hash space.

I agree about unifying with upstream. We should see what the full set of necessary changes is.

@jermp
Collaborator

jermp commented Nov 13, 2022

But when we load the index, don't we need to know which hash it was built with? Are you suggesting we always use the 128-bit hash? I think many common indices (e.g., the human genome and transcriptome) are well within the safe 64-bit hash space.

Yes, I am.

But if we still want to make a choice, then it must be made statically, before building, as done here: https://github.com/jermp/pthash/blob/master/src/build.cpp#L199, based on an estimate of how many minimizers to expect.
Then we would need to specify a template parameter in this class, https://github.com/jermp/sshash/blob/master/include/minimizers.hpp, which must also be exposed by the sshash::dictionary class. PTHash will then take care of the serialization.
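
Roughly, the templating would look something like this (a sketch only: the encoder and the exact pthash::single_phf parameters are assumptions for illustration, not sshash's actual code):

#include <pthash.hpp>

// Sketch: template the minimizers MPHF on the hasher so that either
// pthash::murmurhash2_64 or pthash::murmurhash2_128 can be plugged in.
template <typename Hasher>
struct minimizers {
    typedef pthash::single_phf<Hasher,                         // 64- or 128-bit hasher
                               pthash::dictionary_dictionary,  // encoder (assumed)
                               true>                           // minimal MPHF
        mphf_type;

    uint64_t lookup(uint64_t minimizer) const { return m_mphf(minimizer); }

private:
    mphf_type m_mphf;
};

// The same template parameter would then be exposed by sshash::dictionary, e.g.
// template <typename Hasher> struct dictionary { minimizers<Hasher> m_minimizers; /* ... */ };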

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

Always using 128 bits is, of course, cleaner and simpler ;P. Any idea what the effect would be on construction and queries?

@jermp
Collaborator

jermp commented Nov 13, 2022

Yeah, exactly :D
Here I've built some large indexes (over the axolotl genome, for example) using 128-bit hash codes:
https://github.com/jermp/sshash#large-scale-benchmark. I wouldn't say 128-bit hash codes are going to hurt construction; they may slightly affect random lookups. But as we keep growing in size, sooner or later we'll have to switch to 128-bit hashes anyway!

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

I totally agree. I just wonder whether we want to take a bowtie2-like approach (dynamically switch based on what we are indexing) or always use 128-bit hashes, even for small references. The effect isn't nearly as big for us as for bowtie2, because our index size is unaffected. If the random-lookup differences are very small, it may be worth taking the hit to keep things simpler; but if the effect is non-trivial, we may want to make it dynamic. I could try building a small (txome/genome) index with both and testing it out later, but I'm afk right now.
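
To make the "dynamic" option concrete, something like the following is what I'm imagining (purely illustrative: Index64/Index128 stand in for an index templated on the 64- and 128-bit hashers, and the cutoff is a placeholder for a real collision-probability check):

#include <cstdint>
#include <variant>

// Purely illustrative sketch of a bowtie2-style dynamic choice. Because the
// hasher is a compile-time template parameter, "dynamic" really means
// instantiating both variants and picking one at build/load time.
struct Index64  { /* index built with 64-bit hash codes  */ };
struct Index128 { /* index built with 128-bit hash codes */ };
using AnyIndex = std::variant<Index64, Index128>;

AnyIndex make_index(uint64_t estimated_minimizers) {
    // Placeholder cutoff: below some key count the 64-bit collision risk is
    // negligible; above it, switch to 128-bit hashes.
    constexpr uint64_t threshold = 1ull << 29;  // illustrative value only
    if (estimated_minimizers < threshold) return AnyIndex{Index64{}};
    return AnyIndex{Index128{}};
}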

@jermp
Collaborator

jermp commented Nov 13, 2022

Yes. I will run some experiments too and post them here in this thread.

@jermp
Collaborator

jermp commented Nov 13, 2022

OK, I've tried on human chromosome 13. To test, I created two folders, build_64 and build_128, and compiled the library in build_128 after switching it to 128-bit hash codes. Then I ran the following commands in both folders:

./sshash build -i ~/DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.ust_k31.fa.gz -k 31 -m 17 -o chr13.sshash --verbose
./sshash bench -i chr13.sshash
./sshash query -i chr13.sshash -q ../data/queries/SRR5833294.10K.fastq.gz 

and obtained essentially identical results.

64-bit hashes:

avg_nanosec_per_positive_lookup 708.357
avg_nanosec_per_negative_lookup 789.448
avg_nanosec_per_positive_lookup_advanced 786.834
avg_nanosec_per_negative_lookup_advanced 874.468
avg_nanosec_per_access 295.47
iterator: avg_nanosec_per_kmer 26.5712

2022-11-13 16:39:35: performing queries from file '../data/queries/SRR5833294.10K.fastq.gz'...
2022-11-13 16:39:35: DONE
==== query report:
num_kmers = 460000
num_positive_kmers = 29737 (6.46457%)
num_searches = 11763/29737 (39.5568%)
num_extensions = 17974/29737 (60.4432%)
elapsed = 273.431 millisec / 0.273431 sec / 0.00455718 min / 594.415 ns/kmer

128-bit hashes:

avg_nanosec_per_positive_lookup 711.307
avg_nanosec_per_negative_lookup 819.396
avg_nanosec_per_positive_lookup_advanced 815.91
avg_nanosec_per_negative_lookup_advanced 841.34
avg_nanosec_per_access 296.331
iterator: avg_nanosec_per_kmer 24.5083

2022-11-13 16:38:03: performing queries from file '../data/queries/SRR5833294.10K.fastq.gz'...
2022-11-13 16:38:03: DONE
==== query report:
num_kmers = 460000
num_positive_kmers = 29737 (6.46457%)
num_searches = 11763/29737 (39.5568%)
num_extensions = 17974/29737 (60.4432%)
elapsed = 278.481 millisec / 0.278481 sec / 0.00464135 min / 605.393 ns/kmer

Can you confirm this behavior on your end as well?

@jermp
Collaborator

jermp commented Nov 13, 2022

On the whole human genome:

./sshash build -i ~/DNA_datasets/Homo_sapiens.GRCh38.dna.toplevel.ust_k31.fa.gz -k 31 -m 20 -o human.sshash --verbose
./sshash bench -i human.sshash
./sshash query -i human.sshash -q ../data/queries/SRR5833294.10K.fastq.gz

64-bit hashes:

avg_nanosec_per_positive_lookup 1460.99
avg_nanosec_per_negative_lookup 1648.1
avg_nanosec_per_positive_lookup_advanced 1825.97
avg_nanosec_per_negative_lookup_advanced 1643.93
avg_nanosec_per_access 606.522
iterator: avg_nanosec_per_kmer 25.9827

2022-11-13 19:14:00: performing queries from file '../data/queries/SRR5833294.10K.fastq.gz'...
2022-11-13 19:14:00: DONE
==== query report:
num_kmers = 460000
num_positive_kmers = 426405 (92.6967%)
num_searches = 92561/426405 (21.7073%)
num_extensions = 333844/426405 (78.2927%)
elapsed = 177.664 millisec / 0.177664 sec / 0.00296107 min / 386.226 ns/kmer

128-bit hashes:

2022-11-13 19:13:44: performing queries from file '../data/queries/SRR5833294.10K.fastq.gz'...
2022-11-13 19:13:44: DONE
==== query report:
num_kmers = 460000
num_positive_kmers = 426405 (92.6967%)
num_searches = 92561/426405 (21.7073%)
num_extensions = 333844/426405 (78.2927%)
elapsed = 176.937 millisec / 0.176937 sec / 0.00294895 min / 384.646 ns/kmer

2022-11-13 19:13:47: performing queries from file '../data/queries/SRR5833294.10K.fastq.gz'...
2022-11-13 19:13:47: DONE
==== query report:
num_kmers = 460000
num_positive_kmers = 426405 (92.6967%)
num_searches = 92561/426405 (21.7073%)
num_extensions = 333844/426405 (78.2927%)
elapsed = 181.181 millisec / 0.181181 sec / 0.00301968 min / 393.872 ns/kmer

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

I'll test chromosome 5 now on my end — and then maybe see what I get with xxhash ;P

@jermp
Collaborator

jermp commented Nov 13, 2022

Great! Thank you @rob-p.

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

So it seems my machine is uniformly slower than yours (expected, since it's quite old now). These are with the default hasher.

build

./sshash build -i chr5_utig.fa  -k 31 -m 17 -o chr5.sshash --verbose

64-bit hasher

./sshash bench -i chr5.sshash
avg_nanosec_per_positive_lookup 987.98
avg_nanosec_per_negative_lookup 1015.84
avg_nanosec_per_positive_lookup_advanced 907.097
avg_nanosec_per_negative_lookup_advanced 1013.81
avg_nanosec_per_access 425.64
iterator: avg_nanosec_per_kmer 22.8699

128-bit hasher

./sshash bench -i chr5_128.sshash
avg_nanosec_per_positive_lookup 1082.97
avg_nanosec_per_negative_lookup 1066.39
avg_nanosec_per_positive_lookup_advanced 1023.78
avg_nanosec_per_negative_lookup_advanced 1035.59
avg_nanosec_per_access 449.019
iterator: avg_nanosec_per_kmer 23.8181

@jermp
Collaborator

jermp commented Nov 13, 2022

Pretty similar results 64 vs. 128 bits, no?

@rob-p
Contributor Author

rob-p commented Nov 13, 2022

Yeah, pretty similar. It looks like my submodule-fu is not awesome, so pulling in the xxhash fork you listed earlier is non-trivial for me. I'll try to work on that later and report back (have to get the little guy up now ;P).

@rob-p
Contributor Author

rob-p commented Nov 14, 2022

So I got xxHash to work, and the results don't seem appreciably different (basically within measurement noise), either from what is above or between the two modes:

64-bit hasher

./sshash bench -i chr5_xxhash_64.sshash
avg_nanosec_per_positive_lookup 1132.35
avg_nanosec_per_negative_lookup 1027.64
avg_nanosec_per_positive_lookup_advanced 1021.32
avg_nanosec_per_negative_lookup_advanced 1022.37
avg_nanosec_per_access 433.611
iterator: avg_nanosec_per_kmer 23.2661

128-bit hasher

./sshash bench -i chr5_xxhash_128.sshash
avg_nanosec_per_positive_lookup 1025.05
avg_nanosec_per_negative_lookup 1021.49
avg_nanosec_per_positive_lookup_advanced 946.707
avg_nanosec_per_negative_lookup_advanced 1050.17
avg_nanosec_per_access 416.959
iterator: avg_nanosec_per_kmer 23.3242

Also, 128-bit wyhash looks reasonable as well (still no differences that I'd say are much bigger than measurement noise):

./sshash bench -i chr5_wyhash_128.sshash
avg_nanosec_per_positive_lookup 911.362
avg_nanosec_per_negative_lookup 1056.89
avg_nanosec_per_positive_lookup_advanced 949.793
avg_nanosec_per_negative_lookup_advanced 1027.2
avg_nanosec_per_access 432.319
iterator: avg_nanosec_per_kmer 23.0188
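
For context, a rough sketch of what an xxHash-based hasher for PTHash could look like (this is not the adapter from the jevinskie fork, nor necessarily what produced the numbers above; the hash_type typedef and the static hash(key, seed) shape are assumptions about PTHash's hasher concept and should be checked against its hasher.hpp):

#include <cstdint>
#define XXH_INLINE_ALL
#include "xxhash.h"

// Sketch of xxHash-based hashers in the spirit of PTHash's murmurhash2_64 /
// murmurhash2_128. The wrapper types are simplified stand-ins.
struct xxhash3_64 {
    typedef uint64_t hash_type;  // PTHash wraps this in a hash64 type; plain uint64_t here
    static uint64_t hash(uint64_t key, uint64_t seed) {
        return XXH3_64bits_withSeed(&key, sizeof(key), seed);
    }
};

struct xxhash3_128 {
    struct hash_type { uint64_t first, second; };  // stand-in for PTHash's hash128 wrapper
    static hash_type hash(uint64_t key, uint64_t seed) {
        XXH128_hash_t h = XXH3_128bits_withSeed(&key, sizeof(key), seed);
        return {h.low64, h.high64};
    }
};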

@jermp
Collaborator

jermp commented Nov 14, 2022

Excellent! Thank you for these results. So, I would say:

  • we don't have to bother to include xxHash;
  • just default to 128-bit hashes? I think this is very reasonable.

@jermp
Collaborator

jermp commented Nov 9, 2023

Done, as of jermp/sshash@3bc3c16.

@jermp jermp closed this as completed Nov 9, 2023