The gtfInsert toolkit is a collection of Python scripts for processing and enhancing transcript annotations in GTF files. It supports genomic analysis by parsing, manipulating, and reassembling GTF files.
- find_overlap_transcript: Identifies overlapping transcripts in .combined.gtf files.
- parse_gtf_to_dict_by_cmp_ref: Converts GTF data into a dictionary format, indexed by cmp_ref values.
- extract_key_value_pairs: Extracts key-value pairs from the tracking file generated by gffcompare.
- check_key_unique (Optional): Verifies the uniqueness of extracted keys.
- update_keys_with_tracking_key_value_pair: Replaces incorrect transcript IDs with correct gene IDs from the tracking file.
- parse_gtf_to_dict_by_geneID: Parses the reference GTF file into a JSON format for transcript insertion.
- separate_transcripts_with_same_cmp_ref_geneID: Segregates transcripts with the same cmp_ref and geneID based on start positions.
- insert_by_start_position: Utilizes an insertion sort algorithm to integrate new transcripts into the reference GTF file.
- parse_json_gtf: Converts JSON data back to the standard GTF format.
- find_novel_transcripts: Appends novel transcripts (identified by specific class codes, such as u) to the final GTF file produced by the steps above.
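As a rough illustration of the first two scripts in this list (collecting overlapping transcripts and indexing them by cmp_ref), here is a minimal sketch. It is not the toolkit's own code: the names `parse_attributes` and `index_by_cmp_ref` are hypothetical, and it assumes that only transcript lines carry `class_code` and `cmp_ref`, with each transcript's exon lines following it directly.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn 'gene_id "G1"; transcript_id "T1";' into a dict."""
    return dict(ATTR_RE.findall(field))


def index_by_cmp_ref(lines, codes):
    """Group selected transcript lines, plus the exon lines that follow
    them, under the transcript's cmp_ref value (hypothetical helper)."""
    groups = defaultdict(list)
    current = None  # cmp_ref of the transcript being collected, if any
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        if cols[2] == "transcript":
            attrs = parse_attributes(cols[8])
            keep = attrs.get("class_code") in codes and "cmp_ref" in attrs
            current = attrs["cmp_ref"] if keep else None
        if current is not None:
            groups[current].append(line.rstrip("\n"))
    return dict(groups)
```

Calling `index_by_cmp_ref(open_file, {"k", "m", "n", "j", "e"})` would keep only transcripts with those class codes; transcripts with any other code, and their exons, are skipped.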
- Preparation: Begins with overlapping transcript identification and structuring for insertion into the reference GTF.
- Reference Processing: Converts the reference GTF to a JSON format for efficient transcript integration.
- Organization and Insertion: Organizes new transcripts by start positions for accurate placement in the reference GTF.
- Conversion to GTF: Transforms the JSON data back into the GTF format.
- Novel Transcript Append: Novel transcripts are added to the end of the GTF file.
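The reference-processing and conversion steps amount to a round trip: GTF lines are grouped into a JSON-serializable structure keyed by gene ID, then flattened back into GTF lines. The sketch below shows one way this could look. The layout (a dict mapping each gene ID to a list of column lists) and the names `gtf_to_gene_dict` and `gene_dict_to_gtf` are assumptions for illustration, not the toolkit's actual JSON schema.

```python
import json
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gtf_to_gene_dict(lines):
    """Group GTF records (as lists of 9 columns) by gene_id.

    Python dicts keep insertion order, so genes stay in file order.
    """
    genes = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments and malformed lines
        match = GENE_ID_RE.search(cols[8])
        if match:
            genes.setdefault(match.group(1), []).append(cols)
    return genes


def gene_dict_to_gtf(genes):
    """Flatten the grouped structure back into tab-separated GTF lines."""
    return ["\t".join(cols) for records in genes.values() for cols in records]


def save_json(genes, path):
    """Write the grouped structure to disk as JSON."""
    with open(path, "w") as handle:
        json.dump(genes, handle)
```

Because every record is stored as plain strings, the structure survives `json.dump` and `json.load` unchanged, and records of one gene stay together in their original order.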
- Streamlined approach for efficient and accurate transcript annotation.
- The insertion sort algorithm enables precise transcript integration.
- Modular design for workflow flexibility.
- Performance: The full pipeline completes in about 40 seconds on a system with 16 GB RAM and a 2.6 GHz 6-core Intel Core i7.
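The insertion step can be pictured as the inner loop of insertion sort: given records already ordered by start position, walk back from the end until the right slot is found, then insert. The following is a minimal sketch under that reading, not the toolkit's implementation; the record layout (a dict with a `start` key and the block's GTF `lines`) and the name `insert_by_start` are hypothetical.

```python
def insert_by_start(records, new_record):
    """Insert new_record into records, which must already be sorted by
    'start', keeping the list sorted. Records with an equal start stay
    ahead of the new one, so existing reference order is preserved."""
    i = len(records)
    # Walk left past every record that starts after the new one.
    while i > 0 and records[i - 1]["start"] > new_record["start"]:
        i -= 1
    records.insert(i, new_record)
    return records


# Usage: a reference gene with three transcript blocks, plus one new block.
reference = [
    {"start": 100, "lines": ["ref transcript A"]},
    {"start": 300, "lines": ["ref transcript B"]},
    {"start": 800, "lines": ["ref transcript C"]},
]
insert_by_start(reference, {"start": 250, "lines": ["new transcript"]})
starts = [block["start"] for block in reference]  # [100, 250, 300, 800]
```

Each insertion costs time linear in the number of records for that gene, which stays small when records are first grouped by gene ID.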
The gtfInsert toolkit, developed as part of a genomic analysis project, enhances transcript annotations in GTF files. It is designed for GTF files generated by gffcompare, which compares GTF/GFF files from sources like StringTie against reference GTF inputs. Below is an overview of the toolkit's functionality for handling novel and overlapping transcripts:
- Appending Novel Transcripts: The toolkit includes a Python script written specifically to append novel transcripts, identified by class codes {'r', 'u', 'i', 'y', 'p'}, to the reference GTF file. These class codes are assigned in the XXX.combined.gtf or XXX.annotated.gtf files generated by gffcompare.
- Efficient Insertion of Overlapping Transcripts: To handle overlapping transcripts, classified under {'k', 'm', 'n', 'j', 'e'}, a function called parse_gtf_to_dict has been developed. This function reduces the search cost of insertion by mapping each unique gene ID to its corresponding transcripts, exons, and CDS records in a JSON format.
- Conversion from JSON to GTF Format: The toolkit includes a function that converts the JSON data back to the GTF format, keeping the output compatible with standard tools.
- Finding and Sorting Overlapping Transcripts: A script, find_overlap_transcript, gathers all transcripts and exons with the specified class codes from the XXX.combined.gtf file. This ensures that each transcript, along with its subsequent exons, is inserted into the reference GTF file at the correct location, determined by attributes like cmp_ref in the GTF file.
- Handling Multiple Transcripts Under the Same Gene ID: The separate_transcripts_with_same_cmp_ref_geneID function organizes transcripts and exons that share the same cmp_ref and geneID. They are ordered by start position, which is required for accurate insertion and sorting of these transcripts into the reference GTF file.
- Key-Value Pair Extraction: A script extracts key-value pairs from a gffcompare tracking file. These pairs map transcript IDs to gene IDs and are needed for the subsequent update step.
- Updating Keys in JSON Files: Another script updates the keys in one JSON file based on mappings provided in another JSON file. This keeps gene and transcript IDs accurate in the modified GTF files.
- Insertion Sort Algorithm: The toolkit uses an insertion sort algorithm to integrate transcripts from XXX.annotated.gtf or XXX.combined.gtf into the reference GTF file, placing each transcript according to its start position.
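The key-value extraction and key update steps could look roughly like the sketch below. It rests on two assumptions that should be checked against real data: that the third column of the tracking file has the form `gene|transcript` (or `-` when no reference matches), and that the keys being replaced are reference transcript IDs. The function names `extract_pairs` and `update_keys` are hypothetical.

```python
def extract_pairs(tracking_lines):
    """Map reference transcript ID -> reference gene, read from column 3
    of a tab-separated tracking file (assumed form: 'gene|transcript')."""
    pairs = {}
    for line in tracking_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4 or "|" not in cols[2]:
            continue  # no reference match ('-') or unexpected layout
        gene, transcript = cols[2].rsplit("|", 1)
        pairs[transcript] = gene
    return pairs


def update_keys(records_by_key, pairs):
    """Rename keys found in pairs; keys without a mapping are kept.

    If several old keys map to the same new key, their record lists
    are concatenated in iteration order.
    """
    updated = {}
    for key, records in records_by_key.items():
        new_key = pairs.get(key, key)
        updated.setdefault(new_key, []).extend(records)
    return updated
```

Merging on key collisions is a design choice made for this sketch: two transcripts of the same gene end up under one gene ID, ready to be ordered by start position.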
The gtfInsert toolkit offers researchers an efficient way to enhance transcript annotations in GTF files. Its suite of scripts and organized workflow streamlines the process of merging gffcompare output into a reference annotation while keeping transcripts correctly placed and identified.
For more detailed insights and examples, the toolkit's GitHub repository provides extensive information and resources: GitHub Repository.