Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit Intersection Size #38

Open
dnbaker opened this issue Feb 6, 2020 · 9 comments
Open

Emit Intersection Size #38

dnbaker opened this issue Feb 6, 2020 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@dnbaker
Copy link
Owner

dnbaker commented Feb 6, 2020

Currently, dashing emits union sizes with --sizes, but not intersection size. You can get to that by subtracting the estimated set cardinality, but it would be preferable to emit it directly.

@dnbaker dnbaker added the enhancement New feature or request label Feb 6, 2020
@dnbaker dnbaker self-assigned this Feb 6, 2020
@dnbaker
Copy link
Owner Author

dnbaker commented Feb 11, 2020

Fixed as of 22e340a. (Emits intersection size instead of union size.)

@dnbaker dnbaker closed this as completed Feb 11, 2020
@mihkelvaher
Copy link

mihkelvaher commented Feb 20, 2020

Maybe it's still WIP, but with
dashing dist myfastas/* --size
v0.4.2 outputs paths/sizes ##Names and a matrix, but with v0.4.3-16-g4a39 instead of the matrix there's this:

terminate called after throwing an instance of 'sketch::exception::NotImplementedError'
  what():  union_size not available for type double bns::us::union_size(const T&, const T&) [with T = sketch::hll::hllbase_t<>]
Aborted

Using just dashing dist myfastas/* without --size works fine.

EDIT: working with the v0.4.4 release
EDIT2: v0.4.4 emits union sizes, not intersection

@dnbaker dnbaker reopened this Feb 20, 2020
@dnbaker
Copy link
Owner Author

dnbaker commented Feb 20, 2020

Hi Mihkel,

Thanks for reporting. I made a mistake making this change.

It's been patched, both in master at this commit and v0.4.5.

Would you give it another try for me?

Sample output (the files with cp are copies, so the expected intersection is complete):

Dashing version: v0.4.3-16-g4a39
#Path	Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
##Names	bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz	bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz	bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz	bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz	bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz	bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz	bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz	bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	-	4.82926e+06	4.82926e+06	4.82926e+06	0	0	0	0	0	0	0	0	0	0	00
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	-	-	4.82926e+06	4.82926e+06	0	0	0	0	0	0	0	0	0	0	00
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz	-	-	-	4.82926e+06	0	0	0	0	0	0	0	0	0	0	0	0
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz	-	-	-	-	0	0	0	0	0	0	0	0	0	0	0	0
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	-	-	-	-	-	2.71886e+06	2.71886e+06	2.71886e+06	0	0	0	0	0	0	00
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	-	-	-	-	-	-	2.71886e+06	2.71886e+06	0	0	0	0	0	0	00
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz	-	-	-	-	-	-	-	2.71886e+06	0	0	0	0	0	0	0	0
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz	-	-	-	-	-	-	-	-	0	0	0	0	0	0	0	0
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	-	-	-	-	-	-	-	-	-	2.43384e+06	2.43384e+06	2.43384e+06	1.70487e+06	1.70487e+06	1.70487e+06	1.70487e+06
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	2.43384e+06	2.43384e+06	1.70487e+06	1.70487e+06	1.70487e+06	1.70487e+06
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	2.43384e+06	1.70487e+06	1.70487e+06	1.70487e+06	1.70487e+06
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	-	1.70487e+06	1.70487e+06	1.70487e+06	1.70487e+06
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	2.36853e+06	2.36853e+06	2.36853e+06
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	2.36853e+06	2.36853e+06
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	2.36853e+06
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-

@mihkelvaher
Copy link

TL;DR: -T vs default gives different results.

Using another k-mer program with the same k I found some differences that should not be there.
dashing dist fastas/* --sizes -T --use-full-khash-sets --kmer-length 32
Dashing vs real
60 vs 54
747 vs 711
...
Omitting -T gives the exact results (which should be the case with --use-full-khash-sets?).

Also, a minor detail (but probably a bit tricky to fix): -T reports intersections with self to be 0.

@dnbaker
Copy link
Owner Author

dnbaker commented Feb 25, 2020

Thanks for the find, I'll investigate this soon. You're right that it'd be a big tricky to have -T report intersections along the diagonal, but checking out the off-diagonal entries is important.

@mihkelvaher
Copy link

mihkelvaher commented Feb 27, 2020

In practice, the zero info on the diagonal is more or less irrelevant so it can be pushed to the bottom of the TODO list.

Comparing the results of different data structures vs full khash sets I found that --sizes doesn't work with bloom nor super minhash. Maybe it's intentional.

Some testing results, might be interesting:
Currently, bb-minhash seems to outperform others (intersection sizes in correlation with full):

HLL_E	0,14208915
HLL_I	0,14208915
HLL_J	0,153491658
countingRangeMinhash	0,833072532
rangeMinhash	0,848715951
bb-minhash	0,93973125

The values are found on a small set of full genomes where only smaller intersection sizes are observed with --sketch-size 17. Taking into account more (larger) intersection sizes the correlations become more similar with bb-minhash still at the top.

@dnbaker
Copy link
Owner Author

dnbaker commented Apr 7, 2020

I'm closing this for now, but feel free to open if you have any further issues.

@dnbaker dnbaker closed this as completed Apr 7, 2020
@mihkelvaher
Copy link

I managed to hide the comment previously so well I had trouble myself finding it 😄 :

Any plans on implementing --sizes with bloom filter (super minhash also didn't work the last time I tried)?
I'm asking this just because using bloom filters the estimated intersection size should be more likely overestimated than underestimated (?) which is preferred for my project (afraid more of FNs than FPs).

@dnbaker
Copy link
Owner Author

dnbaker commented Apr 7, 2020

I see. I can add this sometime relatively soon. I would guess that the bloom filter would overestimate.

I'm not sure if there's any way to arrive at cardinality estimates using SuperMinHash (and thereby intersection sizes), but I'll see what I can do. It uses random number generators and sampling to get values with given probabilities.

I'll probably have time in a couple of weeks, and thanks for the reminder.

@dnbaker dnbaker reopened this Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants