gtfInsert Toolkit - README

The gtfInsert toolkit is a collection of Python scripts for processing and extending transcript annotations in GTF files. It parses GTF files, inserts transcripts reported by gffcompare into a reference annotation at the correct positions, and reassembles the result as a standard GTF file.

Overview of Scripts

Parsing and Preparation

find_overlap_transcript: Identifies overlapping transcripts in .combined.gtf files.
parse_gtf_to_dict_by_cmp_ref: Converts GTF data into a dictionary format, indexed by cmp_ref values.
extract_key_value_pairs: Extracts key-value pairs from the tracking file generated by gffcompare.
check_key_unique (Optional): Verifies the uniqueness of extracted keys.
update_keys_with_tracking_key_value_pair: Replaces dictionary keys that hold transcript IDs with the corresponding gene IDs taken from the tracking file.
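
The key-value extraction step can be sketched as follows. This is a minimal illustration, not the toolkit's own code: it assumes the usual gffcompare .tracking layout (tab-separated transfrag ID, locus ID, "ref_gene|ref_transcript" or "-", class code, then one "qN:gene|transcript|..." column per query file) with a single entry per query column, and it assumes the query transcript ID is the key and the reference gene ID is the value. The function body and the sample rows are hypothetical.

```python
def extract_key_value_pairs(tracking_lines):
    """Map each query transcript ID to its reference gene ID.

    Assumption: which ID is the key and which is the value is a guess
    made for this sketch; the toolkit may pair them differently.
    """
    pairs = {}
    for line in tracking_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[2] == "-":
            continue  # malformed row, or no reference match to map to
        ref_gene = cols[2].split("|")[0]
        for entry in cols[4:]:
            if ":" not in entry:
                continue  # "-" means no transfrag from this query file
            # "q1:STRG.1|STRG.1.1|3|..." -> ["STRG.1", "STRG.1.1", "3", ...]
            fields = entry.split(":", 1)[1].split("|")
            if len(fields) > 1:
                pairs[fields[1]] = ref_gene
    return pairs


# Hypothetical tracking rows, for illustration only.
example = [
    "TCONS_00000001\tXLOC_000001\tGeneA|TxA1\tj\tq1:STRG.1|STRG.1.1|3|1.0|2.0|5.0|900\n",
    "TCONS_00000002\tXLOC_000002\t-\tu\tq1:STRG.2|STRG.2.1|1|0.5|1.0|2.0|300\n",
]
print(extract_key_value_pairs(example))  # {'STRG.1.1': 'GeneA'}
```

Rows without a reference match (class code u in the sample) are skipped because they have no gene ID to map to.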

Reference GTF Processing

parse_gtf_to_dict_by_geneID: Parses the reference GTF file into a JSON structure keyed by gene ID, ready for transcript insertion.
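
A rough sketch of parsing a GTF file into a JSON-serializable dictionary keyed by gene ID is shown below. The record layout (field names such as "seqname" and "attributes") and the function name parse_gtf_by_gene_id are assumptions for illustration; the toolkit's actual JSON schema may differ.

```python
import json
import re

# Matches quoted attributes such as: gene_id "G1";
# Unquoted values (some GTF dialects write exon_number 1) are not captured.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_by_gene_id(lines):
    """Group GTF records by gene_id, keeping file order within each gene."""
    genes = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF row
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        genes.setdefault(gene_id, []).append({
            "seqname": cols[0], "source": cols[1], "feature": cols[2],
            "start": int(cols[3]), "end": int(cols[4]), "score": cols[5],
            "strand": cols[6], "frame": cols[7], "attributes": attrs,
        })
    return genes


# Hypothetical reference rows, for illustration only.
reference = [
    'chr1\tref\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tref\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
print(json.dumps(parse_gtf_by_gene_id(reference), indent=2))
```

Keying by gene ID means a later insertion step can fetch one gene's records with a single dictionary lookup instead of scanning the whole file.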

Transcript Organization

separate_transcripts_with_same_cmp_ref_geneID: Separates transcripts that share the same cmp_ref and geneID and orders them by start position.
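
One way to picture this step: collect each transcript row together with the exon rows that follow it into a block, group the blocks by (cmp_ref, gene ID), and sort each group by the transcript's start. The sketch below assumes records are plain dictionaries in file order; the function name group_transcript_blocks and the record fields are hypothetical, and the toolkit's own grouping rules may differ.

```python
from collections import defaultdict


def group_transcript_blocks(records):
    """Group transcript blocks by (cmp_ref, gene_id), sorted by start.

    A block is a transcript record followed by its exon records.
    Only transcript records need the cmp_ref and gene_id fields.
    """
    groups = defaultdict(list)
    current = None
    for rec in records:
        if rec["feature"] == "transcript":
            current = [rec]
            groups[(rec["cmp_ref"], rec["gene_id"])].append(current)
        elif current is not None:
            current.append(rec)  # exon belonging to the latest transcript
    return {
        key: sorted(blocks, key=lambda block: block[0]["start"])
        for key, blocks in groups.items()
    }


# Hypothetical records, for illustration only.
records = [
    {"feature": "transcript", "start": 500, "cmp_ref": "R1", "gene_id": "G1"},
    {"feature": "exon", "start": 500},
    {"feature": "transcript", "start": 100, "cmp_ref": "R1", "gene_id": "G1"},
    {"feature": "exon", "start": 100},
]
for key, blocks in group_transcript_blocks(records).items():
    print(key, [block[0]["start"] for block in blocks])
```

The tuple keys are convenient in memory; writing the result to JSON would need them joined into strings first.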

Insertion and Sorting

insert_by_start_position: Uses an insertion sort to place new transcripts into the reference annotation by start position.
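
A single insertion-sort step, as described above, might look like the sketch below. It assumes the existing transcript blocks for a gene are already sorted by start position and that each block is a list whose first element is the transcript record; both the data layout and this function body are assumptions for illustration, not the toolkit's implementation.

```python
def insert_by_start_position(blocks, new_block):
    """Insert new_block into blocks, which is sorted by transcript start.

    Walks back from the end of the list, as insertion sort does, until
    it finds a block that does not start after the new one.
    """
    new_start = new_block[0]["start"]
    pos = len(blocks)
    while pos > 0 and blocks[pos - 1][0]["start"] > new_start:
        pos -= 1
    blocks.insert(pos, new_block)
    return blocks


# Hypothetical blocks, for illustration only.
existing = [
    [{"id": "ref1", "start": 100}],
    [{"id": "ref2", "start": 300}],
]
insert_by_start_position(existing, [{"id": "new1", "start": 200}])
print([block[0]["id"] for block in existing])  # ['ref1', 'new1', 'ref2']
```

A new block whose start equals an existing one lands after it, so reference transcripts keep their original relative order.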

Finalization

parse_json_gtf: Converts JSON data back to the standard GTF format.
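
Converting the JSON structure back to GTF amounts to joining the eight fixed columns and the attribute string with tabs. The sketch below assumes a hypothetical record layout (a dictionary of gene ID to a list of records with "seqname", "source", "feature", "start", "end", "score", "strand", "frame", and "attributes" fields); the toolkit's real JSON layout may differ.

```python
def format_attributes(attrs):
    """Render attributes as: key "value"; key "value";"""
    return " ".join(f'{key} "{value}";' for key, value in attrs.items())


def record_to_gtf_line(rec):
    """Build one tab-separated GTF line from a record dictionary."""
    cols = [
        rec["seqname"], rec["source"], rec["feature"],
        str(rec["start"]), str(rec["end"]), rec["score"],
        rec["strand"], rec["frame"], format_attributes(rec["attributes"]),
    ]
    return "\t".join(cols)


def json_to_gtf(genes):
    """Serialize every record of every gene, in stored order."""
    return "".join(
        record_to_gtf_line(rec) + "\n"
        for records in genes.values()
        for rec in records
    )


# Hypothetical record, for illustration only.
genes = {
    "G1": [{
        "seqname": "chr1", "source": "ref", "feature": "exon",
        "start": 100, "end": 200, "score": ".", "strand": "+", "frame": ".",
        "attributes": {"gene_id": "G1", "transcript_id": "T1"},
    }]
}
print(json_to_gtf(genes), end="")
```

Because Python dictionaries preserve insertion order, the output keeps the order in which genes and records were stored.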

Novel Transcripts Handling

find_novel_transcripts: Appends novel transcripts (identified by specific class codes, such as u) to the final GTF file produced above.
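
Selecting novel transcripts by class code can be sketched as below. The class-code set is the one listed under Usage and Examples. The sketch assumes that class_code appears only on transcript rows, as in gffcompare output, so exon rows are kept when their transcript_id belongs to a selected transcript; the function name select_novel_lines is hypothetical.

```python
import re

NOVEL_CLASS_CODES = {"r", "u", "i", "y", "p"}
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')
CLASS_CODE_RE = re.compile(r'class_code "([^"]+)"')


def select_novel_lines(lines, class_codes=NOVEL_CLASS_CODES):
    """Return transcript rows with a wanted class code, plus their exons."""
    keep_ids = set()
    selected = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments and malformed rows
        tid = TRANSCRIPT_ID_RE.search(cols[8])
        if tid is None:
            continue
        if cols[2] == "transcript":
            code = CLASS_CODE_RE.search(cols[8])
            if code and code.group(1) in class_codes:
                keep_ids.add(tid.group(1))
        if tid.group(1) in keep_ids:
            selected.append(line.rstrip("\n"))
    return selected
```

The selected lines could then be written after the last line of the final GTF file, matching the append step described above.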

Workflow Process

Preparation: Identifies overlapping transcripts and structures them for insertion into the reference GTF.
Reference Processing: Converts the reference GTF to a JSON format so that transcripts can be looked up and integrated efficiently.
Organization and Insertion: Orders new transcripts by start position and inserts them at the matching place in the reference GTF.
Conversion to GTF: Transforms the JSON data back into the GTF format.
Novel Transcript Append: Appends novel transcripts to the end of the GTF file.

Technical Features

Streamlined approach for efficient and accurate transcript annotation.
The insertion sort algorithm places each new transcript at its correct position by start coordinate.
Modular design for workflow flexibility.
Performance: The full pipeline completes in about 40 seconds on a 2.6 GHz 6-core Intel Core i7 system with 16 GB of RAM.

Usage and Examples

The gtfInsert toolkit was developed as part of a genomic analysis project to enhance transcript annotations in GTF files. It is designed for GTF files generated by gffcompare, which compares GTF/GFF files from sources such as StringTie against a reference GTF. The toolkit's main functions for handling novel and overlapping transcripts are:

Appending Novel Transcripts: A Python script appends novel transcripts, identified by class codes {'r', 'u', 'i', 'y', 'p'}, to the reference GTF file. These class codes are assigned in the XXX.combined.gtf or XXX.annotated.gtf files generated by gffcompare.
Efficient Insertion of Overlapping Transcripts: Overlapping transcripts, classified under {'k', 'm', 'n', 'j', 'e'}, are handled with a function called parse_gtf_to_dict. This function maps each unique gene ID to its corresponding transcripts, exons, and CDS in a JSON format, so the insertion step can look up a gene's records directly by ID instead of searching the whole file.
Conversion from JSON to GTF Format: A function converts the JSON data back to the standard GTF format, so the output remains usable by tools that read GTF.
Finding and Sorting Overlapping Transcripts: A script, find_overlap_transcript, gathers all transcripts and exons with the specified class codes from the XXX.combined.gtf file. This script ensures that each transcript, along with its subsequent exons, is inserted into the reference GTF file at the correct location, determined by attributes like cmp_ref in the GTF file.
Handling Multiple Transcripts Under the Same Gene ID: The separate_transcripts_with_same_cmp_ref_geneID function effectively organizes transcripts and exons sharing the same cmp_ref and geneID. This organization is based on their start positions, which is crucial for the accurate insertion and sorting of these transcripts into the reference GTF file.
Key-Value Pair Extraction: A script extracts key-value pairs from a gffcompare tracking file. These pairs map transcript IDs to gene IDs, which the subsequent update process requires.
Updating Keys in JSON Files: Another critical script updates keys in a JSON file based on mappings provided in another JSON file. This process is vital for maintaining the accuracy of gene and transcript IDs in the modified GTF files.
Insertion Sort Algorithm: The toolkit uses an insertion sort algorithm to integrate transcripts from XXX.annotated.gtf or XXX.combined.gtf into the reference GTF file, placing each transcript according to its start position.
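
The key-update and uniqueness-check steps listed above can be illustrated with the sketch below. It assumes both JSON files load into plain dictionaries: one mapping keys to lists of records, the other mapping old keys to new keys. The function names find_duplicate_keys and update_keys, and the choice to merge records when two old keys map to the same new key, are assumptions for illustration.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return the keys that occur more than once, sorted."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


def update_keys(data, key_map):
    """Rename keys of data using key_map; unmapped keys are kept as-is.

    Records are merged when several old keys map to the same new key
    (an assumption of this sketch).
    """
    updated = {}
    for old_key, records in data.items():
        new_key = key_map.get(old_key, old_key)
        updated.setdefault(new_key, []).extend(records)
    return updated


# Hypothetical data, for illustration only.
data = {"STRG.1.1": ["tx_a"], "STRG.1.2": ["tx_b"], "GeneZ": ["tx_c"]}
key_map = {"STRG.1.1": "GeneA", "STRG.1.2": "GeneA"}
print(update_keys(data, key_map))
# {'GeneA': ['tx_a', 'tx_b'], 'GeneZ': ['tx_c']}
```

Running a duplicate check on the extracted keys before the update helps catch cases where one key would silently overwrite another.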

Conclusion

The gtfInsert toolkit gives researchers a suite of scripts and an organized workflow for enhancing transcript annotations in GTF files. It inserts overlapping transcripts into the reference annotation in positional order and appends novel transcripts, simplifying an otherwise complex annotation process.

For more details and examples, see the toolkit's GitHub repository: dxu104/gtfInsert.
