--outprefix option and cmp subcommand are invalid #68

Open
XiaomingXu1995 opened this issue Mar 23, 2023 · 7 comments

@XiaomingXu1995

Hi Daniel,
I ran dashing2 with "./dashing2_savx2 sketch -F bacteria.list -S 1024 --threads 48 -o bacteria.sketch" to generate the sketches, which produced bacteria.sketch and bacteria.sketch.name.txt.
The cached sketch files are saved next to the input files. I tried to specify a directory for them with "--outprefix" or "--prefix", but neither works, so the directory holding the original input genomes gets cluttered.
Even without "--cache", the cached files still end up in the input genome directory. How can I turn off the cached files?

In addition, I want to use the cmp or dist subcommand to compute all-vs-all pairwise distances from bacteria.sketch, but "./dashing2_savx2 dist --help" gives me no help information, so I do not know how to use it.

Best,
Xiaoming

@XiaomingXu1995 XiaomingXu1995 changed the title --outprefix option and cmp subcommand is invaild --outprefix option and cmp subcommand are invaild Mar 24, 2023
@dnbaker
Owner

dnbaker commented Mar 27, 2023

Hi Xiaoming -

Thank you for this issue! I'm looking into it. I really appreciate the feedback and I'll let you know when it's fixed up.

Best wishes,

Daniel

@dnbaker dnbaker mentioned this issue Mar 29, 2023
@dnbaker
Owner

dnbaker commented Mar 29, 2023

Checking back in on this - I seem to have fixed this issue on my machine and have updated the main branch accordingly. (See the linked PR.)

You could build from source now; I'm also working on updating binaries, but that won't be done until tomorrow. Please let me know how this works for you.

Thanks again!

Daniel

@XiaomingXu1995
Author

Thanks for your update.

I built the latest source and found that the "--cache" and "--prefix" options now work. The cached sketch files are no longer placed next to the genome files.
When I run ./dashing2 sketch -F bacteria.list -S 1024 -p 48 -o bacteria.sketch --prefix bacteria_sketch --cache, the output sketch files are bacteria.sketch and bacteria.sketch.names.txt, and the cached files are stored in the bacteria_sketch directory.

However, there are new problems when computing the distance.
I tried to compute the all-vs-all pairwise distances of these genomes from the pre-sketched file (bacteria.sketch) with:
./dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 48
but it failed with this error log:

 Don't have permission to map.
: Invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid argument
Aborted (core dumped)

In addition, bacteria.sketch was overwritten, so I needed to regenerate the sketch file.
When I regenerated it, I did not know how to make use of the cached sketch files in the bacteria_sketch directory.

So this raises two new questions:

  • How to use the pre-sketched file bacteria.sketch to compute the all-vs-all distance?
  • Does Dashing2 support using the cached file saved in the bacteria_sketch directory to generate the new sketch file?

Thank you very much!
Best,
Xiaoming

@dnbaker
Owner

dnbaker commented Mar 30, 2023

Hi Xiaoming,

You found another bug. Thank you! There's a large feature set (lots of ways to run computation), and we were erasing the sketched data because I was opening the file in the wrong mode when loading it for analysis. The code path I've primarily used with this concatenated sketch file calls cmp directly with -o enabled, in order to mmap the sketches for large runs. I should have made sure this path was tested.
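To illustrate this class of bug in general terms (this is not the dashing2 code, which is C++), here is a minimal Python sketch. It assumes a stand-in file of arbitrary bytes rather than a real sketch file, and shows that a read-only open plus read-only mmap leaves the file intact, while opening the same file in a truncating mode empties it immediately.

```python
import mmap
import os
import tempfile


def demo_open_modes():
    """Return (bytes mapped, size after read-only mmap, size after a "wb" open)."""
    fd, path = tempfile.mkstemp(suffix=".sketch")
    os.close(fd)
    try:
        # Stand-in for packed sketch data; not the real dashing2 format.
        with open(path, "wb") as f:
            f.write(b"\x01\x02\x03\x04" * 256)

        # Read-only open plus read-only mapping: contents stay intact.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                mapped_len = len(m)
        size_after_read = os.path.getsize(path)

        # A truncating mode ("wb") empties the file as soon as it is opened,
        # even if nothing is ever written to it.
        with open(path, "wb"):
            pass
        size_after_wb = os.path.getsize(path)
        return mapped_len, size_after_read, size_after_wb
    finally:
        os.remove(path)


result = demo_open_modes()
print(result)  # (1024, 1024, 0)
```

Keeping a copy of a packed sketch file before passing it to a tool that may open it for writing is a cheap safeguard against this kind of data loss.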

Your command is correct! I'm able to run that command exactly after fixing this bug.

dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 4

So:

  • Your command is correct; I'm fixing this bug and pushing it.
  • Yes, you can re-create the concatenated sketch file if you have all the cached sketches in bacteria_sketch.

For the second point, we don't have a command for it, but I just added a Python function that does this. In python/parse.py, there's a new function, convert_sketches_to_packed_sketch.

You call it like:

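# Assumes the dashing2 repository root is on the import path, so python/parse.py is importable.
from python.parse import convert_sketches_to_packed_sketch
import glob
# Collect the cached per-genome sketch files.
paths = list(glob.iglob("bacteria_sketch/*ss"))
# Pack them into a single concatenated sketch file.
individual_sketches = convert_sketches_to_packed_sketch(paths, "packed_bacteria.sketch")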

Now, it might be faster to just re-sketch, but that's an option.
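If you go the packing route, it may help to gather the cached files in a reproducible order and fail early when the cache directory is empty. The helper below is a hypothetical convenience, not part of dashing2: the name collect_cached_sketches, the sorting, and the file names in the demo are all assumptions made for illustration, and whether the order of paths matters to the packed file is not stated in this thread.

```python
import glob
import os
import tempfile


def collect_cached_sketches(cache_dir, pattern="*ss"):
    """Return a sorted list of files in cache_dir matching pattern.

    Hypothetical helper: the default pattern mirrors the glob used above.
    """
    paths = sorted(glob.glob(os.path.join(cache_dir, pattern)))
    paths = [p for p in paths if os.path.isfile(p)]
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {cache_dir!r}")
    return paths


# Small self-check with made-up file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.fna.ss", "a.fna.ss", "notes.txt"):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"x")
    found = [os.path.basename(p) for p in collect_cached_sketches(tmp)]

print(found)  # ['a.fna.ss', 'b.fna.ss']
```

The resulting list could then be passed as the first argument to convert_sketches_to_packed_sketch in place of the raw glob result.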

Here's the PR: #70.

Thanks again! I really appreciate it. I've updated the main branch again, and new binaries will be available tonight or tomorrow.

Best,

Daniel

@XiaomingXu1995
Author

Hi Daniel,
Thanks for the quick update!
I ran dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 4, but the output distance file bacteria.dist is in a binary format, not human-readable. Do you see the same problem?

Best,
Xiaoming

@dnbaker
Owner

dnbaker commented Apr 3, 2023

Hi Xiaoming,

Thanks again! I can reproduce it, but only with some builds, which is confusing to me.

The build of dashing2 on my laptop works, but I only got normal text output after removing -flto from the build command.
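A quick way to tell which kind of output a given build produced is a heuristic check on the first few kilobytes of the file. The sketch below is an assumption-laden illustration, not something dashing2 provides: looks_like_text is a hypothetical helper, and the two demo files contain made-up content rather than real dashing2 output.

```python
import codecs
import os
import struct
import tempfile


def looks_like_text(path, sample_size=4096):
    """Heuristic: True if the sample has no NUL bytes and decodes as UTF-8."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    # Incremental decoding tolerates a multi-byte character cut off at the
    # end of the sample instead of reporting it as invalid.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Self-check on two made-up files: a tab-separated table and packed floats.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text.dist")
    binary_path = os.path.join(tmp, "binary.dist")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("#name\tA\tB\nA\t0.0\t0.25\nB\t0.25\t0.0\n")
    with open(binary_path, "wb") as f:
        f.write(struct.pack("<4f", 0.0, 0.25, 0.25, 0.0))
    text_ok = looks_like_text(text_path)
    binary_ok = looks_like_text(binary_path)

print(text_ok, binary_ok)  # True False
```

Running looks_like_text("bacteria.dist") after each build would show whether that build wrote text or binary output.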

I've made changes in this PR:

#72

I'm adding some new binaries; could you please give it another shot? Here are the OSX binaries, and I'll get the Linux ones up later today. (https://github.com/dnbaker/dashing2-binaries/tree/main/osx/v2.1.14) I think I'll wait until you confirm that it's fixed before merging, in case there are more issues.

Thanks!

Daniel

@XiaomingXu1995
Author

Hi Daniel,
Thanks for your update!

I get human-readable output when compiling without the -flto option.
Computing distances works well both directly from genome files and from pre-generated sketches.

I have also tested the v2.1.14 binaries: dashing2_savx2 on an AMD workstation and dashing2_s512bw on an Intel workstation.
Both work well.

Thank you for your work again!
Best,
Xiaoming
