Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] How can I use precomputed genes via GFF format with GECCO? #17

Open
jolespin opened this issue Oct 1, 2024 · 3 comments
Open
Labels
enhancement New feature or request

Comments

@jolespin
Copy link

jolespin commented Oct 1, 2024

I have already run Pyrodigal on ~50k genomes and would like to run GECCO on these genomes without rerunning Pyrodigal in the backend. Is there similar usage to antiSMASH where precompute gene models can be provided via GFF? If so, what command can I run? If not, would this be in scope to add in a future update?

@althonos
Copy link
Member

althonos commented Oct 2, 2024

Hi @jolespin, you can use pre-computed genes if they are in GenBank, but not in GFF at the moment. I will try to update for GFFs once I get some more time 👍

@althonos althonos added the enhancement New feature or request label Oct 2, 2024
@jolespin
Copy link
Author

jolespin commented Oct 2, 2024

Awesome thank you. Is there a reliable GFF to Genbank CLI tool you recommend in the meantime? Also, saw you updated Pyrodgial CLI with multithreaded mode! Nice.

@althonos
Copy link
Member

althonos commented Oct 7, 2024

You can use this simple script with Biopython:

import argparse
import csv
import pathlib
import collections

import Bio.SeqIO
from Bio.SeqFeature import SeqFeature, SimpleLocation


parser = argparse.ArgumentParser() 
parser.add_argument("--gff", required=True, type=pathlib.Path)
parser.add_argument("--fna", required=True, type=pathlib.Path)
parser.add_argument("--gbk")
args = parser.parse_args()


if args.gbk is None:
    args.gbk = args.fna.with_suffix(".gbk")


rows = collections.defaultdict(list)
with args.gff.open() as gff:
    reader = csv.reader(gff, dialect="excel-tab")
    for row in reader:
        rows[row[0]].append(row)

with args.gbk.open("w") as gbk:

    for seq in Bio.SeqIO.parse(args.fna, "fasta"):

        seq_rows = rows[seq.id]
        seq_rows.sort(key=lambda row: int(row[3]))

        for row in seq_rows:
            start, end = map(int, row[3:5])
            id_ = row[-1].split(";")[0][3:]
            strand = int(row[6] + '1')
            location = SimpleLocation(start - 1, end, strand)
            feat = SeqFeature(location=location, type="CDS")
            feat.qualifiers['locus_tag'] = id_
            seq.features.append(feat)

        seq.annotations['molecule_type'] = 'DNA'
        Bio.SeqIO.write(seq, gbk, "genbank")

Usage:

$ python convert.py --gff seq.gff --fna seq.fna --gbk seq.gbk

It will just copy the coordinates and IDs of the genes from the GFF and the sequence from the FASTA file into a GenBank file, and then you can tell GECCO to use the CDS features:

$ gecco -v run -g seq.gbk --cds-feature CDS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants