[Question] How can I use precomputed genes via GFF format with GECCO? #17

jolespin · 2024-10-01T21:07:55Z

I have already run Pyrodigal on ~50k genomes and would like to run GECCO on these genomes without rerunning Pyrodigal in the backend. Is there similar usage to antiSMASH where precompute gene models can be provided via GFF? If so, what command can I run? If not, would this be in scope to add in a future update?

The text was updated successfully, but these errors were encountered:

althonos · 2024-10-02T09:26:29Z

Hi @jolespin, you can use pre-computed genes if they are in GenBank, but not in GFF at the moment. I will try to update for GFFs once I get some more time 👍

jolespin · 2024-10-02T15:36:08Z

Awesome thank you. Is there a reliable GFF to Genbank CLI tool you recommend in the meantime? Also, saw you updated Pyrodgial CLI with multithreaded mode! Nice.

althonos · 2024-10-07T11:24:15Z

You can use this simple script with Biopython:

import argparse
import csv
import pathlib
import collections

import Bio.SeqIO
from Bio.SeqFeature import SeqFeature, SimpleLocation


parser = argparse.ArgumentParser() 
parser.add_argument("--gff", required=True, type=pathlib.Path)
parser.add_argument("--fna", required=True, type=pathlib.Path)
parser.add_argument("--gbk")
args = parser.parse_args()


if args.gbk is None:
    args.gbk = args.fna.with_suffix(".gbk")


rows = collections.defaultdict(list)
with args.gff.open() as gff:
    reader = csv.reader(gff, dialect="excel-tab")
    for row in reader:
        rows[row[0]].append(row)

with args.gbk.open("w") as gbk:

    for seq in Bio.SeqIO.parse(args.fna, "fasta"):

        seq_rows = rows[seq.id]
        seq_rows.sort(key=lambda row: int(row[3]))

        for row in seq_rows:
            start, end = map(int, row[3:5])
            id_ = row[-1].split(";")[0][3:]
            strand = int(row[6] + '1')
            location = SimpleLocation(start - 1, end, strand)
            feat = SeqFeature(location=location, type="CDS")
            feat.qualifiers['locus_tag'] = id_
            seq.features.append(feat)

        seq.annotations['molecule_type'] = 'DNA'
        Bio.SeqIO.write(seq, gbk, "genbank")

Usage:

$ python convert.py --gff seq.gff --fna seq.fna --gbk seq.gbk

It will just copy the coordinates and IDs of the genes from the GFF and the sequence from the FASTA file into a GenBank file, and then you can tell GECCO to use the CDS features:

$ gecco -v run -g seq.gbk --cds-feature CDS

althonos added the enhancement New feature or request label Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] How can I use precomputed genes via GFF format with GECCO? #17

[Question] How can I use precomputed genes via GFF format with GECCO? #17

jolespin commented Oct 1, 2024

althonos commented Oct 2, 2024

jolespin commented Oct 2, 2024 •

edited

Loading

althonos commented Oct 7, 2024

[Question] How can I use precomputed genes via GFF format with GECCO? #17

[Question] How can I use precomputed genes via GFF format with GECCO? #17

Comments

jolespin commented Oct 1, 2024

althonos commented Oct 2, 2024

jolespin commented Oct 2, 2024 • edited Loading

althonos commented Oct 7, 2024

jolespin commented Oct 2, 2024 •

edited

Loading