Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotation with 0 fusions junctions (STAR alignment) #73

Open
andre-gabriel-42 opened this issue Nov 7, 2023 · 1 comment
Open

Annotation with 0 fusions junctions (STAR alignment) #73

andre-gabriel-42 opened this issue Nov 7, 2023 · 1 comment

Comments

@andre-gabriel-42
Copy link

Hello.. I am currently having an issue with the annotation part. I am using Linux to perform CIRCexplorer2 analysis.

After aligning everything with STAR as described by you,

for instance:

STAR --runThreadN 10 --genomeDir /mnt/work1/genodata/STAR-GRCh38.99/ --readFilesIn /mnt/work2/andre_phd/circRNA_Jan2023/raw/T18Und_circRNA_1.fastq.gz /mnt/work2/andre_phd/circRNA_Jan2023/raw/T18Und_circRNA_2.fastq.gz --readFilesCommand zcat --chimSegmentMin 10

Afterwards, I did the parsing step as follows:

CIRCexplorer2 parse -t STAR Chimeric.out.junction > CIRCexplorer2_Diff0_T18und_parse.log

and I was successuful.

Then, when I tried annotating, I used the following command and got the following mistake:

andregabriel@brutus:~/Desktop/Alignments_André_CIRCexplorer/Diff0_T18und_circRNA$ CIRCexplorer2 annotate -r /home/andregabriel/Desktop/Alignments_André_CIRCexplorer/Homo_sapiens.GRCh38.99.gtf -g /mnt/work1/genodata/STAR-GRCh38.99/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b back_spliced_junction.bed -o circularRNA_Diff0_T18und.txt > CIRCexplorer2_Diff0_T18und_annotate.log

Traceback (most recent call last):
  File "/home/andregabriel/.local/bin/CIRCexplorer2", line 8, in <module>
    sys.exit(main())
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/command_parse.py", line 50, in main
    annotate.annotate(docopt(annotate.__doc__, version=__version__),
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/helper.py", line 38, in wrapper
    fn(*args)
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/annotate.py", line 37, in annotate
    annotate_fusion(options['--ref'], options['--bed'], fusion_tmp,
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/annotate.py", line 50, in annotate_fusion
    genes, novel_genes, gene_info, chrom_info = parse_ref(ref_f, 1)
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/parser.py", line 76, in parse_ref
    gene_id, iso_id, chrom, strand = line.split()[:4]
ValueError: not enough values to unpack (expected 4, got 2)

While exploring the GitHub repository of the initial version of CIRCexplorer, I came across the information that for successful annotation, the GTF file needs to be of the "Gene Predictions and RefSeq Genes with Gene Names" type, essentially a GenePred file (https://genome.ucsc.edu/FAQ/FAQformat.html#format9).

According to the documentation, the expected format for the gene annotation file is as follows:

geneName: Name of the gene
isoformName: Name of the isoform
chrom: Reference sequence
strand: + or - for the strand
txStart: Transcription start position
txEnd: Transcription end position
cdsStart: Coding region start
cdsEnd: Coding region end
exonCount: Number of exons
exonStarts: Exon start positions
exonEnds: Exon end positions (https://circexplorer2.readthedocs.io/en/2.3.0/modules/annotate/).

As such, I used gtfToGenePred to convert the GRCh38/99 GTF file. At this time CIRCexplorer no longer generates errors when using the annotation command. However, it still fails to annotate the obtained junctions, despite not encountering any errors this time.

Annotated 0 fusion junctions!
Start to fix fusion junctions...
Fixed 0 fusion junctions!

I attempted to change the approach and tested it with the hg19 genome instead, just to check if it would work. Although I haven't had much success yet, I've outlined the following steps that I used:

So:

  1. hg19 Genome Preparation:
    I used STAR to index the genome.
    STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /mnt/work1/genodata/hg19 --genomeFastaFiles /mnt/work1/genodata/hg19/bwa_index/hg19.bowtie.fa --sjdbGTFfile /mnt/work1/genodata/hg19/IlluminaiGenomes/UCSC_genes.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 300 --outFileNamePrefix /home/andregabriel/Desktop/Alignments_André_CIRCexplorer_hg19_Mock/D17Und_Diff0/

  2. Alignment:
    I used STAR to align one of the samples with the newly indexed genome.
    STAR --runThreadN 6 --genomeDir /mnt/work1/genodata/hg19/Indexed_hg19_STAR/ --readFilesIn /mnt/work2/andre_phd/circRNA_Jan2023/raw/T17Und_circRNA_1.fastq.gz /mnt/work2/andre_phd/circRNA_Jan2023/raw/T17Und_circRNA_2.fastq.gz --readFilesCommand zcat --chimSegmentMin 10

  3. Parsing:
    I utilized the CIRCexplorer2 parse command to extract relevant information from the alignment results, focusing specifically on chimeric junctions. Note that for this sample when I used GRCh38, CIRCexplorer2 parse converted 156628 fusion reads, whereas with hg19, it converted 162404 fusion reads. So, parsing continues to identify fusion reads consistently. It's interesting to see the variation, but I assume this is normal.
    CIRCexplorer2 parse -t STAR Chimeric.out.junction > CIRCexplorer2_Diff0_T17und_parse.log

  4. gtfToGenePred:
    I used the gtfToGenePred command to convert the GTF file into a Gene Prediction file, which I just assume that it should facilitate the annotation of fusion junctions in the next step (5).
    gtfToGenePred -genePredExt -geneNameAsName2 /mnt/work1/genodata/hg19/IlluminaiGenomes/UCSC_genes.gtf UCSC_geneGTFtoGENEPRED.txt

  5. Annotation:
    Finally, I attempted to use the CIRCexplorer2 annotate command to annotate fusion junctions based on the fusion reads (.bed file) obtained during parsing and the gene annotation files (specifically, the hg19 .fa and the output from gtfToGenePred).
    Unfortunately, the command didn't produce any annotated fusion junctions again. I wonder if this is due to how the file obtained in the gtfToGenePred step is structured. The file is being read and there are no issues in the CIRCexplorer code, which was a problem previously when we only provided the GTF file. There seems to be an issue with obtaining annotation information.
    CIRCexplorer2 annotate -r /mnt/work1/genodata/hg19/Indexed_hg19_STAR/UCSC_geneGTFtoGENEPRED.txt -g /mnt/work1/genodata/hg19/bwa_index/hg19.bowtie.fa -b back_spliced_junction.bed -o circularRNA_Diff0_T17und.txt > CIRCexplorer2_Diff0_T17und_annotate.log

I appreciate your insights on how to address this matter effectively. Thank you.

Best regards,

Cheers,

@xingma
Copy link
Contributor

xingma commented Nov 8, 2023

To obtain the desired annotation file format, which is slightly different from the direct output of "gtfToGenePred," use the following command: paste <(cut -f 12 UCSC_geneGTFtoGENEPRED.txt) <(cut -f 1-10 UCSC_geneGTFtoGENEPRED.txt) > ref_all.txt. This command combines specific columns from the "UCSC_geneGTFtoGENEPRED.txt" file to create the annotation file with the required format "Gene Predictions and RefSeq Genes with Gene Names".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants