rule plot_stats fails with "OverflowError: value too large to convert to npy_uint32" #36

Open
gernophil opened this issue Aug 19, 2022 · 4 comments

Comments

@gernophil

Hey everyone,

I am trying to run this pipeline with 144 samples, so the resulting files are quite big. I managed to get it almost to the end, but the last rule (plots_stats) fails with OverflowError: value too large to convert to npy_uint32. I guess I just have too many rows in my calls.tsv.gz to be handled. The complete error log is:

Traceback (most recent call last):
  File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
    sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
    return unstack(self, level, fill_value)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
    constructor=obj._constructor_expanddim)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self.index = index.remove_unused_levels()
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
    uniques = algos.unique(lab)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
    table = htable(len(values))
  File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32

Any ideas?
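
For context, the last frames of the traceback show `htable(len(values))` failing inside `Int64HashTable.__cinit__`, which suggests that the length of the stacked index no longer fits into an unsigned 32-bit integer (maximum 4294967295). A rough way to check whether a given table is likely to hit this limit is sketched below. The function name and the `n_fields` parameter (number of per-sample columns) are assumptions for illustration, and the product is only an upper bound, since `stack` drops missing values by default.

```python
# Largest value an unsigned 32-bit integer (npy_uint32) can hold.
UINT32_MAX = 2**32 - 1


def fits_uint32(n_rows, n_samples, n_fields):
    """Return (n_values, ok) for a fully stacked calls table.

    n_values is an upper bound on the number of entries produced by
    stacking every (sample, field) column of every row.
    n_fields is a hypothetical parameter: the number of columns per sample.
    """
    n_values = n_rows * n_samples * n_fields
    return n_values, n_values <= UINT32_MAX


if __name__ == "__main__":
    # Example numbers only; the real row and field counts are not given in this thread.
    print(fits_uint32(10_000_000, 144, 4))
```

With 144 samples, the limit is reached sooner the more columns each sample has, so reducing either the number of rows or the number of samples per run lowers the count.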

@dlaehnemann

Sorry, I have no quick and easy ideas for a fix. One way to make this script work better on such large datasets could be to replace the pandas code with polars, which should be quicker and more memory-efficient:
https://pola-rs.github.io/polars-book/user-guide/

As this is not a long or complicated script, switching the library used for handling the dataframes should not be overly difficult. But unless you already know polars, it will surely take a moment to find all the right syntax (with the added benefit of learning polars ;).

@dlaehnemann

Also, another caveat: switching to polars does not guarantee that this will run through. It's just more likely.

@gernophil
Author

Thanks for that. I will definitely take a look at polars. I have never used it before. I took a different approach for now: I just split calls.tsv.gz in half (and copied the header to the 2nd half) and ran the rule separately on these files. It has been running for 4h now, but no error so far. Fingers crossed :).
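
The manual split described above could be scripted roughly as follows. This is only a sketch: `split_tsv_gz` and its parameters are hypothetical names, and `n_header` (the number of header lines to repeat in each half) is an assumption. The `stack([0, 1])` call in the traceback hints at two-level column headers, so `n_header=2` may be the right value for this file.

```python
import gzip
from itertools import islice


def split_tsv_gz(src, dst_a, dst_b, n_header=1):
    """Split a gzipped TSV into two halves, repeating the header in both.

    Returns the number of data rows written to each half.
    """
    # First pass: count the data rows without loading the file into memory.
    with gzip.open(src, "rt") as fh:
        n_rows = max(sum(1 for _ in fh) - n_header, 0)
    half = (n_rows + 1) // 2

    # Second pass: copy the header to both outputs, then distribute the rows.
    with gzip.open(src, "rt") as fh, \
            gzip.open(dst_a, "wt") as out_a, \
            gzip.open(dst_b, "wt") as out_b:
        header = list(islice(fh, n_header))
        out_a.writelines(header)
        out_b.writelines(header)
        for i, line in enumerate(fh):
            (out_a if i < half else out_b).write(line)
    return half, n_rows - half
```

The file is read twice so that only one line is held in memory at a time, which matters for a table that is already too large for the plotting script.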

@dlaehnemann

Fingers crossed! 😅

As a more general solution, some meaningful way of (programmatically) stratifying samples might make sense: for example, an annotation column in config/samples.tsv that defines the groups you want to split your samples into, a rule that splits calls.tsv.gz into those groups, and then having the rule plots_stats work on those smaller files.
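
A minimal sketch of that stratification idea in pandas, not an existing part of the workflow: it assumes that config/samples.tsv has a `sample` column plus a hypothetical `group` annotation column, and that the calls table has two-level columns with the sample name on the first level (as the `calls.loc[:, samples].stack([0, 1])` line in the traceback suggests). The function name is made up for illustration.

```python
import pandas as pd


def split_calls_by_group(calls, samples, group_col="group"):
    """Yield (group, sub_table) pairs holding only that group's sample columns.

    calls:   DataFrame with MultiIndex columns, sample name on level 0.
    samples: DataFrame with a "sample" column and a grouping column
             (group_col is a hypothetical annotation column).
    """
    present = set(calls.columns.get_level_values(0))
    for group, members in samples.groupby(group_col)["sample"]:
        # Skip samples listed in the sheet but absent from the calls table.
        cols = [name for name in members if name in present]
        yield group, calls.loc[:, cols]


if __name__ == "__main__":
    # Tiny made-up example; real tables would be read from
    # calls.tsv.gz and config/samples.tsv.
    columns = pd.MultiIndex.from_product([["s1", "s2", "s3"], ["DP", "GQ"]])
    calls = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=columns)
    samples = pd.DataFrame({"sample": ["s1", "s2", "s3"],
                            "group": ["a", "a", "b"]})
    for group, sub in split_calls_by_group(calls, samples):
        print(group, sub.shape)
```

In a workflow, each sub-table would be written to its own file by a splitting rule, so that the plotting rule only ever sees one group at a time.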

We don't currently have the capacity to provide something like this, but we always welcome pull requests and will try to review and merge them quickly.

And if you are looking for a more actively maintained snakemake workflow for variant calling, we are putting a lot of effort into this one:
https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows/dna-seq-varlociraptor
