Problems with empty annotation intersection #150

ZabalaAitor · 2024-06-18T11:48:08Z

Description of the bug

Hello,

I am trying to run nf-core/circRNA on sncRNA samples, and I encountered an error during the annotation part for some of the samples. I noticed that the samples with errors have an empty intersect.bed file.

I am wondering what information is supposed to be in the intersect.bed file and what biological reasons could cause it to be empty.

Thank you very much,

Aitor Zabala

Command used and terminal output

nextflow run nf-core/circRNA \
	-r dev \
	-profile apptainer \
	--input /data/azabala/NIM_005/samplesheet.csv \
	--phenotype /data/azabala/NIM_005/phenotype.csv \
	--module circrna_discovery,mirna_prediction \
	--outdir /scratch/azabala/sncRNA/results_circRNA \
	--tool 'circrna_finder' \
	--max_cpus 36 \
	--max_memory 512GB \
	-w /scratch/azabala/work_sncRNA_circRNA \
	--genome GRCh38 \
	--save_reference false \
	-resume

...............................


Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION (HC19)` terminated with an error exit status (1)

Command executed:

  annotation.py --input HC19.intersect.bed --exon_boundary 200 --output HC19.annotation.bed
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION":
      python: $(python --version | sed 's/Python //g')
      pandas: $(python -c "import pandas; print(pandas.__version__)")
      numpy: $(python -c "import numpy; print(numpy.__version__)")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/azabala/.nextflow/assets/nf-core/circRNA/bin/annotation.py", line 55, in <module>
      df = df.groupby(['chr', 'start', 'end', 'strand']).aggregate({
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/groupby/generic.py", line 894, in aggregate
      result = op.agg()
               ^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 169, in agg
      return self.agg_dict_like()
             ^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 478, in agg_dict_like
      arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 601, in normalize_dictlike_arg
      raise KeyError(f"Column(s) {cols_sorted} do not exist")
  KeyError: "Column(s) ['gene_id', 'transcript_id'] do not exist"

Work dir:
  /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container: Apptainer
OS: Linux
nf-core/circrna: dev

nictru · 2024-06-18T13:17:56Z

Hey,
This happens if the GTF file does not meet the expectations. In this case, the gene_id and transcript_id fields in the attributes column are missing. Please make sure to use an appropriate GTF file.
Also, the pipeline version seems to be a bit outdated; please update it using nextflow pull nf-core/circrna.
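
For anyone who wants to check a GTF file for this themselves, here is a minimal sketch that is not part of the pipeline. It assumes a standard tab-separated GTF with the attributes in column 9; the helper name count_missing_attributes is hypothetical. It counts, per feature type, how many records lack gene_id or transcript_id.

```python
import re
from collections import Counter

# Matches key "value" pairs in the GTF attributes column (column 9).
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def count_missing_attributes(gtf_lines, required=("gene_id", "transcript_id")):
    """Count, per (feature type, attribute), the records lacking that attribute.

    Hypothetical helper for illustration; gtf_lines is any iterable of lines.
    """
    missing = Counter()
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        feature = fields[2]
        present = {key for key, _ in ATTRIBUTE_PATTERN.findall(fields[8])}
        for name in required:
            if name not in present:
                missing[(feature, name)] += 1
    return missing
```

An open file handle can be passed directly, since iterating over it yields lines. A non-zero count for an exon or transcript feature would point at the GTF; counts that only involve gene features are less conclusive.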

ZabalaAitor · 2024-06-19T14:14:16Z

Hey,

I used the default GTF file provided by iGenomes, which I believe should have the correct format. Regarding the pipeline, I did update it using nextflow pull nf-core/circrna, but it's possible that the update didn't complete properly due to issues with the HPC environment. I'll look into it to ensure the pipeline is fully updated.

Thanks,

nictru · 2024-06-19T14:24:02Z

I am sure the GTF will have the correct format; otherwise, errors will look different. The problem occurs because the GTF contains regions on sequences not present in the FASTA file.

This problem will also occur on the latest pipeline version, as I have not yet had time to fix it - this was just a side note.

EDIT: This message was a mixup - forget about it

ZabalaAitor · 2024-06-24T10:19:27Z

The FASTA file is also provided by iGenomes...

nictru · 2024-06-24T18:13:27Z

Oh I'm sorry, I got mixed up between two issues. This issue does not have anything to do with the FASTA file. The one with the FASTA file compatibility problems is #151.

Still, the error you encounter is due to missing gene_id and transcript_id entries in the GTF file. nf-core also discourages the usage of iGenomes as stated here. Maybe look inside the GTF file and see for yourself, but I can also add a check to the pipeline, which will give a user-friendly message if this happens again. To fix this I can recommend reference data from here.

ZabalaAitor · 2024-06-27T09:48:25Z

I tried using another GTF file and encountered an error while running CIRIquant because it is unable to find the GTF file, whereas other tools, such as circRNA_finder, are able to find it.

I have written about the issue in #155. Please feel free to delete or close that entry if you prefer to resolve the issue here.

Thank you very much for your time and assistance.

ZabalaAitor · 2024-07-02T12:27:04Z

This error persists despite using different GTF files. Could it be because there are no circRNAs in those samples?

nictru · 2024-07-02T13:36:29Z

You are absolutely right, this can also occur if no circRNAs are found. I should have thought about this earlier. You can confirm this is the case by switching to /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5 and investigating the GTF file there.

If it is really the case, I will implement a clear error message pointing this out for future users.
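
A quick way to confirm this for a given sample is to check whether its intersect.bed contains any records at all. The following is a minimal sketch under stated assumptions: has_no_records is a hypothetical helper, not something that exists in annotation.py, and how the pipeline should react to an empty file is left open here.

```python
def has_no_records(path):
    """Return True if the file has no non-blank, non-comment lines.

    Hypothetical helper for illustration. A file like this would give the
    annotation step nothing to group or aggregate.
    """
    with open(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                return False
    return True
```

Calling it on HC19.intersect.bed inside the work directory would return True for the failing sample described in this thread, since that file is empty.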

ZabalaAitor · 2024-07-03T12:32:25Z

I cannot find the GTF file in that directory, but the intersect.bed file is empty.

nictru · 2024-07-03T12:40:56Z

Yes okay, this is the reason then. Is the data you used confidential? If not, I would like to use it as test data for coming up with a clean solution.

nictru · 2024-07-12T09:23:39Z

Hey @ZabalaAitor, please re-execute the pipeline with the branch connected to the PR I just opened (#159) and provide me with the updated error message

xfk274280 · 2024-08-01T03:22:47Z

Oh I'm sorry, I got mixed up between two issues. This issue does not have anything to do with the FASTA file. The one with the FASTA file compatibility problems is #151.

Still, the error you encounter is due to missing gene_id and transcript_id entries in the GTF file. nf-core also discourages the usage of iGenomes as stated here. Maybe look inside the GTF file and see for yourself, but I can also add a check to the pipeline, which will give a user-friendly message if this happens again. To fix this I can recommend reference data from here.

The error occurs because transcript_id is absent from the rows whose feature type is gene in the GTF (Gene Transfer Format) file. Also, which branch should be regarded as the most up to date: dev or 150-problems-with-empty-annotation-intersection?

df_incomplete = df_incomplete[df_incomplete != ""]
if len(df_incomplete) > 0:

1 1223243 1223968 1:1223243-1223968:- 11.0 - 1 ensembl_havana gene 1216908 1232067 . - . gene_id "ENSG00000078808"; gene_version "18"; gene_name "SDF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
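
To illustrate why a row like the one above ends up without a transcript_id, here is a minimal parsing sketch. It assumes the intersect file is tab-separated, with six BED columns followed by the nine GTF columns. That layout is inferred from the row shown and is not confirmed by the pipeline source. The function name parse_intersect_row is hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attributes column.
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_intersect_row(line, bed_columns=6):
    """Split one intersect row into BED fields and selected GTF attributes.

    Assumption: bed_columns BED fields come first, then the nine GTF fields.
    Attributes that are absent are returned as empty strings.
    """
    fields = line.rstrip("\n").split("\t")
    bed, gtf = fields[:bed_columns], fields[bed_columns:]
    attributes = dict(ATTRIBUTE_PATTERN.findall(gtf[8])) if len(gtf) >= 9 else {}
    return {
        "chr": bed[0],
        "start": int(bed[1]),
        "end": int(bed[2]),
        "strand": bed[5],
        "feature": gtf[2] if len(gtf) >= 3 else "",
        "gene_id": attributes.get("gene_id", ""),
        "transcript_id": attributes.get("transcript_id", ""),
    }
```

Under these assumptions, the row above yields feature "gene", gene_id "ENSG00000078808" and an empty transcript_id, because gene-level GTF records carry no transcript_id attribute.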

ZabalaAitor added the bug (Something isn't working) label Jun 18, 2024

nictru changed the title ~~ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION'~~ Problems with empty annotation intersection Jul 12, 2024

nictru linked a pull request Jul 12, 2024 that will close this issue

Fix problems with empty annotation intersection #159

Merged

nictru closed this as completed in #159 Aug 6, 2024
