"ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat" when running deeprvat_annotations #121

Open
pichuan opened this issue Aug 10, 2024 · 8 comments

Comments

@pichuan

pichuan commented Aug 10, 2024

Hi,

This is a follow-up step after #117 (comment).

(I'm opening a new issue so that each issue stays focused on a single thread. If you'd prefer that I put this in the same issue, please let me know.)

So, after #117 (comment), if we assume that running --cache_version 84 with vep is acceptable, I was able to finish running the two vep commands.

From there, if I run:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

I see this error:

[Sat Aug 10 03:44:17 2024]
Error in rule add_gene_ids:
    jobid: 5
    output: output_dir/annotations/chckpts/add_gene_ids.chckpt
    shell:
        deeprvat_annotations add-gene-ids output_dir/annotations/tmp/protein_coding_genes.parquet output_dir/annotations/annotations.parquet output_dir/annotations/annotations.parquet && touch output_dir/annotations/chckpts/add_gene_ids.chckpt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-08-10T034348.372053.snakemake.log

And, when I run that command on its own, here is what I'm seeing:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ deeprvat_annotations add-gene-ids output_dir/annotations/tmp/protein_coding_genes.parquet output_dir/annotations/annotations.parquet output_dir/annotations/annotations.parquet && touch output_dir/annotations/chckpts/add_gene_ids.chckpt
/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/array/chunk_types.py:129: UserWarning: A NumPy version >=1.22.4 and <2.3.0 is required for this version of SciPy (detected version 1.21.2)
  import scipy.sparse
/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
Traceback (most recent call last):
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/pichuan/deeprvat/deeprvat/annotations/annotations.py", line 1993, in add_gene_ids
    merged = annotations.merge(genes, on=["gene_base"], how="left")
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/frame.py", line 10093, in merge
    return merge(
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 110, in merge
    op = _MergeOperation(
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 707, in __init__
    self._maybe_coerce_merge_keys()
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1340, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat

The error here is ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat.

Can you help me understand what might be wrong here?

Thank you!

@Marcel-Mueck
Collaborator

Dear @pichuan, it looks like the Gene column inside your annotations.parquet file is cast to np.float64, possibly when reading in the VEP output. This is odd, since this column in the VEP output should contain Ensembl stable IDs, all of which should start with ENSG... Maybe you could check whether the tsv files from the VEP output (e.g. example/annotations/output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv) contain a column called Gene that holds only values starting with ENSG. If in doubt, I would recommend rerunning vep using the "updated" cache version (as I described in issue 117):
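
This is not the vep command from issue 117. It is only a minimal sketch of how a Gene column can end up as float64 and how the suggested ENSG check could be done with pandas. The toy TSV, the made-up gene ID, the helper name `non_ensg_values`, and the assumption that "-" is treated as a missing value when the file is read are all illustrative and may not match what deeprvat_annotations actually does.

```python
import io

import pandas as pd

# Toy stand-in for a VEP tab-delimited output file. The real file has many
# more columns; "-" is used here as the placeholder for a missing value.
VEP_TSV = (
    "#Uploaded_variation\tLocation\tGene\n"
    "chr21_100_A/G\tchr21:100\t-\n"
    "chr21_200_C/T\tchr21:200\t-\n"
)

# Assumption: the reader treats "-" as missing. A column with no other values
# is then inferred as float64 (all NaN) instead of as strings.
inferred = pd.read_csv(io.StringIO(VEP_TSV), sep="\t", na_values="-")
print(inferred["Gene"].dtype)  # float64

# Merging that column against string gene IDs raises the ValueError from the
# traceback. The gene ID below is made up for illustration.
genes = pd.DataFrame({"Gene": ["ENSG00000000001"], "gene_id": [1]})
merge_error = None
try:
    inferred.merge(genes, on="Gene", how="left")
except ValueError as err:
    merge_error = err
print(merge_error)

# Reading the column explicitly as text keeps the raw values for inspection.
as_text = pd.read_csv(io.StringIO(VEP_TSV), sep="\t", dtype={"Gene": str})


def non_ensg_values(gene_column: pd.Series) -> list:
    """Return the distinct values that do not start with 'ENSG'."""
    values = gene_column.astype(str)
    return sorted(values[~values.str.startswith("ENSG")].unique())


print(non_ensg_values(as_text["Gene"]))  # ['-']
```

If every value reported by the helper is a placeholder rather than an ENSG ID, the VEP run itself probably did not assign genes, which would point back to the cache version rather than to the merge step.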

@pichuan
Author

pichuan commented Aug 14, 2024

Thank you, @Marcel-Mueck.

After updating the cache version and rerunning the vep command in #117 (comment), I reran:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

This time, I'm getting an error that looks different:

[Wed Aug 14 04:55:56 2024]
Error in rule filter_by_exon_distance:
    jobid: 3
    output: output_dir/annotations/chckpts/filter_by_exon_distance.chckpt
    shell:
        deeprvat_annotations filter-annotations-by-exon-distance output_dir/annotations/annotations.parquet reference/gencode.v44.annotation.gtf.gz output_dir/annotations/tmp/protein_coding_genes.parquet output_dir/annotations/annotations.parquet && touch output_dir/annotations/chckpts/filter_by_exon_distance.chckpt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The traceback was:

Traceback (most recent call last):
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pichuan/miniforge3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/pichuan/deeprvat/deeprvat/annotations/annotations.py", line 682, in filter_annotations_by_exon_distance
    f"dropped dublicates in data frame (dropped {len_after_filtering - len(filtered_merge)}rows/ {np.round(100*(len_after_filtering - len(filtered_merge))/len_after_filtering)}%)."
ZeroDivisionError: division by zero

Any advice for that error? Thank you!

@Marcel-Mueck
Collaborator

Hello @pichuan, it looks like your annotation dataframe is empty after filtering on exon distance. This would explain the ZeroDivisionError you got. However, I could not reproduce this error with the example data as-is, only after manually removing most of its variants, so it is hard to reconstruct what went wrong in your run. Would it be possible to show the content of the annotations_tmp.parquet file, which is used as input for the filter_by_exon_distance rule? The file should be in example/annotations/output.
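
To illustrate why an empty dataframe leads to this error: in the traceback, the percentage in the log message is divided by the row count after filtering, which is zero here. A minimal sketch of a guarded version is below. The function name and arguments are hypothetical and this is not the code in annotations.py.

```python
def dropped_percentage(rows_before: int, rows_after: int) -> float:
    """Percentage of rows dropped, or 0.0 when there were no rows to start with."""
    if rows_before == 0:
        # Nothing to drop from an empty table; avoid dividing by zero.
        return 0.0
    return round(100 * (rows_before - rows_after) / rows_before, 1)


print(dropped_percentage(0, 0))      # 0.0
print(dropped_percentage(200, 150))  # 25.0
```

A guard like this would only hide the symptom. The empty dataframe would still need to be explained.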

@pichuan
Author

pichuan commented Aug 29, 2024

Hi @Marcel-Mueck, thanks for the answer. I have recently gotten a lot more on my plate, so I have not been able to follow up.

I have just ssh'ed into the same machine and tried to find the file:

Here is my directory structure:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ls -lh
total 16K
lrwxrwxrwx 1 pichuan pichuan   29 Aug  9 23:17 annotation_data -> /home/pichuan/annotation_data
drwxrwxr-x 3 pichuan pichuan 4.0K Jul 31 02:58 input_dir
drwxrwxr-x 3 pichuan pichuan 4.0K Jul 31 02:58 output_dir
drwxrwxr-x 4 pichuan pichuan 4.0K Jul 31 02:58 preprocessing_workdir
drwxrwxr-x 2 pichuan pichuan 4.0K Aug  8 17:16 reference
lrwxrwxrwx 1 pichuan pichuan   22 Aug  9 23:30 repo_dir -> /home/pichuan/repo_dir

I'm not sure I have any files of that name, though:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find $HOME -name annotations_tmp.parquet

This found nothing.

When I looked in a recent error log, I found:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ grep parquet .snakemake/log/2024-08-29T035236.528921.snakemake.log
    input: reference/gencode.v44.annotation.gtf.gz, output_dir/annotations/tmp/protein_coding_genes.parquet, output_dir/annotations/chckpts/add_gene_ids.chckpt
        deeprvat_annotations filter-annotations-by-exon-distance output_dir/annotations/annotations.parquet reference/gencode.v44.annotation.gtf.gz output_dir/annotations/tmp/protein_coding_genes.parquet output_dir/annotations/annotations.parquet && touch output_dir/annotations/chckpts/filter_by_exon_distance.chckpt

So, maybe you're referring to output_dir/annotations/annotations.parquet?

It looks like this:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ python -c "import pandas as pd; print(pd.read_parquet('output_dir/annotations/annotations.parquet'))"
    Consequence_stop_lost  SIFT  Consequence_missense_variant    QKI_k5  ...   af  maf        maf_mb  gene_id
0                     0.0   NaN                           0.0 -0.000015  ...  0.0  0.0  10000.000000      NaN
1                     0.0   NaN                           0.0  0.000196  ...  0.0  0.0  10000.000000      NaN
2                     0.0   NaN                           0.0  0.000003  ...  0.0  0.0  10000.000000      NaN
3                     0.0   NaN                           0.0 -0.000019  ...  0.0  0.0  10000.000000      NaN
4                     0.0   NaN                           0.0 -0.000070  ...  0.1  0.1      3.333333      NaN
5                     0.0   NaN                           0.0 -0.000100  ...  0.0  0.0  10000.000000      NaN
6                     0.0   NaN                           0.0  0.000189  ...  0.0  0.0  10000.000000      NaN
7                     0.0   NaN                           0.0 -0.004565  ...  0.0  0.0  10000.000000      NaN
8                     0.0   NaN                           0.0  0.000002  ...  0.0  0.0  10000.000000      NaN
9                     0.0   NaN                           0.0  0.018481  ...  0.0  0.0  10000.000000      NaN
10                    0.0   NaN                           0.0 -0.000105  ...  0.0  0.0  10000.000000      NaN
11                    0.0   NaN                           0.0  0.000035  ...  0.0  0.0  10000.000000      NaN
12                    0.0   NaN                           0.0  0.000010  ...  0.1  0.1      3.333333      NaN
13                    0.0   NaN                           0.0  0.000085  ...  0.0  0.0  10000.000000      NaN

[14 rows x 41 columns]

which certainly looks wrong to me.

Given that I'm not familiar with what generated that file, I just tried removing it and rerunning the snakemake command to see if I can find out how it was generated.

Let me know if you have any observations to share about my steps above. Thanks!

(It's also possible that I should get a completely new machine. I might not be able to find time to do that until later...)

@Marcel-Mueck
Collaborator

Hey @pichuan, thank you for the update on the issue. Yes, sorry, I posted the wrong name for the annotation file; annotations.parquet is correct. It is interesting that no variants are mapped to any gene IDs. I could not reproduce this with the example data. Which GTF file version are you using for the pipeline? You can see it in the annotation config; the tested version is gtf_file_name : gencode.v44.annotation.gtf.gz
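
To put a number on "no variants are mapped", one could compute the fraction of rows with a non-missing gene_id. This is only a sketch: the helper name `gene_id_coverage` is hypothetical, and the toy frame merely mimics the all-NaN gene_id column shown in the output above.

```python
import pandas as pd


def gene_id_coverage(annotations: pd.DataFrame, column: str = "gene_id") -> float:
    """Fraction of rows whose gene ID column is not missing."""
    if len(annotations) == 0:
        return 0.0
    return float(annotations[column].notna().mean())


# Toy frame: every gene_id is NaN, as in the printed annotations.parquet.
toy = pd.DataFrame({"maf": [0.0, 0.1, 0.0], "gene_id": [float("nan")] * 3})
print(gene_id_coverage(toy))  # 0.0
```

On the real data this would be called as `gene_id_coverage(pd.read_parquet("output_dir/annotations/annotations.parquet"))`. A result of 0.0 means that no row received a gene ID.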

@pichuan
Author

pichuan commented Aug 29, 2024

Here is what I have:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ grep gtf_file_name ../config/deeprvat_annotation_config.yaml 
gtf_file_name : gencode.v44.annotation.gtf.gz

Just to be safe, I looked through all the files that I have with that name:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find $HOME -type f -name gencode.v44.annotation.gtf.gz -exec ls -l {} \;
-rw-rw-r-- 1 pichuan pichuan 49721965 Jul 31 02:58 /home/pichuan/deeprvat/tests/annotations/test_data/create_gene_id_file/create_gene_id_file_small/input/gencode.v44.annotation.gtf.gz
-rw-rw-r-- 1 pichuan pichuan 49721965 Jul 31 02:58 /home/pichuan/deeprvat/tests/annotations/test_data/filter_by_exon_distance/filter_by_exon_distance_small/input/gencode.v44.annotation.gtf.gz
-rw-rw-r-- 1 pichuan pichuan 49721965 Jul 31 02:58 /home/pichuan/deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz

Checking:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find $HOME -type f -name gencode.v44.annotation.gtf.gz -exec md5sum {} \;
ee330cfe6d0654ba9b9cf434d5c1bfb1  /home/pichuan/deeprvat/tests/annotations/test_data/create_gene_id_file/create_gene_id_file_small/input/gencode.v44.annotation.gtf.gz
ee330cfe6d0654ba9b9cf434d5c1bfb1  /home/pichuan/deeprvat/tests/annotations/test_data/filter_by_exon_distance/filter_by_exon_distance_small/input/gencode.v44.annotation.gtf.gz
ee330cfe6d0654ba9b9cf434d5c1bfb1  /home/pichuan/deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz
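
The checksums only show that the three copies are identical to each other. To confirm the release from the file contents rather than the file name, one option is to print the leading comment lines of the GTF. The sketch below assumes the file starts with "#"-prefixed header lines describing the release. The helper name `read_gtf_header` is hypothetical.

```python
import gzip


def read_gtf_header(path: str, max_lines: int = 10) -> list:
    """Return the leading comment lines (starting with '#') of a gzipped GTF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Stop at the first record line or once enough lines are collected.
            if not line.startswith("#") or len(header) >= max_lines:
                break
            header.append(line.rstrip("\n"))
    return header
```

For example, `read_gtf_header("reference/gencode.v44.annotation.gtf.gz")` would return the header lines of the reference file used by the pipeline.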

@bfclarke
Contributor

Hi @pichuan, I just wanted to give you a quick update that we're still working on your issue, but it's taking a bit longer because of team members being on vacation. We'll get back to you with a proposed fix as soon as we can.

@pichuan
Author

pichuan commented Sep 10, 2024

Thanks @bfclarke for the update, and thanks to the team for looking into this. No problem at all!
