Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The subtraction function of Dashing #80

Open
liaoherui opened this issue May 17, 2021 · 8 comments
Open

The subtraction function of Dashing #80

liaoherui opened this issue May 17, 2021 · 8 comments

Comments

@liaoherui
Copy link

Firstly, thanks a lot for your excellent tool! It's really cool!

This is my question:

Currently, I have three files, A.fq, B.fa, C.fa. And I want to calculate the distance between A.fq and B.fa, but not consider k-mers in C.fa. In other words, I need to do dist(A-C, B-C). I am not sure whether dashing can be used to do this?

If it works, how about many "B", I mean, dist(A-C, "genome_path"-C), where "genome_path" refers to many fasta files (just like the input file (genome_path.txt) of dist function of dashing), not only one B.

Thanks a lot for your answering in advance!

@dnbaker
Copy link
Owner

dnbaker commented May 17, 2021

Hi there!

This is currently not supported, but it's something you could use Dashing to perform parts of the computation. If you perform sketching with --full-khash-sets, then Dashing will create 64-bit hash sets and write them to disk in a gzip file. The first 8 bytes is a number indicating the number of hashes, and the rest of the file is 64-bit. You could then load the hash sets for A and B, filtering by C.

I see that it could be very useful, and we'll consider supporting it directly as development continues. Let me know if you need any further help or have any question..

Thanks,

Daniel

@liaoherui
Copy link
Author

liaoherui commented May 18, 2021

Hi, Daniel

Got it. Thanks a lot for your prompt reply!

I will try to use the "sketch" function. In this case, is it possible to load these hash files by python? Cause I am not very familiar with c++.

Another not so important question:
(Besides, by using the "sketch" function, is it possible to save hash with genome labels like {kmer1->{A, C}, kmer2->{B}} in the future version? If so, it will be really helpful for the development of k-mer-based analysis tool! But I guess it could be time-consuming or is not the target of Dashing. If so, you can ignore this question.)

Regards,
Herui Liao

@liaoherui
Copy link
Author

Hi, Daniel

I have just tested the "--full-khash-sets" parameters with one genome. But there is an error.

The command I used is dashing_s128 sketch -p12 --use-full-khash-sets -k31 test.fasta

And the error info is:
image

I have assigned 200G memory for this job, which should be enough for one small bacterial genome (~5M).

@dnbaker
Copy link
Owner

dnbaker commented May 18, 2021

Hi Liao,

You're right, that probably isn't running out of memory. That was a bug, unfortunately, which I've found/fixed in this branch, and it's now been merged into main here.

You can download the new binaries from https://github.com/dnbaker/dashing/tree/main/release/.

You could do a lot of functionality from within Python. To parse each of these k-mer files, here's some Python code:

def parse_khs(path):
    import numpy as np
    import gzip
    data = np.frombuffer(gzip.open(path, "r").read(), dtype=np.uint64)
    ld = data[0]
    data = data[1:]
    assert ld == len(data)
    return data

This yields a hash set in vector form. So after using the fixed/rebuilt code to eliminate the segmentation fault, you might try something like this:

def filtered_jaccard(a, b, c):
    adata, bdata, cdata = map(parse_khs, (a, b, c))
    cset = set(c)
    afilt = set(adata) - cset
    bfilt = set(bdata) - cset
    isz = afilt & bfilt
    union = afilt | bfilt
    return len(isz) / len(union)

@liaoherui
Copy link
Author

Hi, Daniel

Thanks a lot for your suggestions and the new version of dashing!

However, when I tried this new binary, it seems there are still some problems. As shown in the picture below, the new version can not be executed while the old version works well.

image

The same with s256 and s512.

@dnbaker
Copy link
Owner

dnbaker commented May 20, 2021

Thanks for letting me know.

What's your operating system?

You could check if it has permissions (cmhod 755 dashing_s128), in case that fixes it. Otherwise, maybe cloning fresh and building manually might help?

git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing

You can then install it either with sudo make install or by manually copying the executable into a folder in the PATH environmental variable.

@liaoherui
Copy link
Author

Thanks! Will try.

My operation system is Linux, and the version info is shown below.
image

@dnbaker
Copy link
Owner

dnbaker commented Aug 15, 2021

By the way - while there's no such functionality in this software, I added to Dashing2 a feature which supports something like this. The feature uses the flag --filterset <path>, which reads all k-mers from that path and then skips them when sketching from the remaining files.

And your request inspired me to provide it - so thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants