dxu104/gtfInsert

gtfInsert Toolkit - README

The gtfInsert toolkit is a collection of Python scripts for processing and enhancing transcript annotations in GTF files. It parses GTF files, manipulates their transcript records, and reassembles them into a single updated annotation.

Overview of Scripts

Parsing and Preparation

  1. find_overlap_transcript: Identifies overlapping transcripts in .combined.gtf files.
  2. parse_gtf_to_dict_by_cmp_ref: Converts GTF data into a dictionary format, indexed by cmp_ref values.
  3. extract_key_value_pairs: Extracts key-value pairs from the tracking file generated by gffcompare.
  4. check_key_unique (Optional): Verifies the uniqueness of extracted keys.
  5. update_keys_with_tracking_key_value_pair: Replaces incorrect transcript IDs with correct gene IDs from the tracking file.
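
Steps 3 to 5 above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: it assumes the keys are reference transcript IDs and the values are reference gene IDs, both read from the third column of a gffcompare .tracking file (formatted as gene_id|transcript_id, or "-" when there is no reference match). The function bodies, the `update_keys` name, and the sample rows are hypothetical.

```python
def extract_key_value_pairs(tracking_lines):
    """Map reference transcript ID -> reference gene ID (column 3 of .tracking)."""
    pairs = {}
    conflicts = set()
    for line in tracking_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or "|" not in fields[2]:
            continue  # "-" means the transfrag has no reference match
        ref_gene, ref_transcript = fields[2].split("|", 1)
        if pairs.get(ref_transcript, ref_gene) != ref_gene:
            conflicts.add(ref_transcript)  # same key seen with two different genes
        pairs[ref_transcript] = ref_gene
    return pairs, conflicts


def update_keys(records_by_cmp_ref, transcript_to_gene):
    """Re-key a cmp_ref-indexed dict by gene ID, merging entries that share a gene."""
    updated = {}
    for cmp_ref, records in records_by_cmp_ref.items():
        gene_id = transcript_to_gene.get(cmp_ref, cmp_ref)  # unknown keys stay as-is
        updated.setdefault(gene_id, []).extend(records)
    return updated


# Hypothetical tracking rows (tab-separated), for illustration only.
tracking = [
    "TCONS_00000001\tXLOC_000001\tGeneA|TxA1\tj\tq1:STRG.1|STRG.1.1|3|0.5|0.7|2.1|900\n",
    "TCONS_00000002\tXLOC_000002\t-\tu\tq1:STRG.2|STRG.2.1|1|0.1|0.2|1.0|300\n",
    "TCONS_00000003\tXLOC_000001\tGeneA|TxA2\tk\tq1:STRG.3|STRG.3.1|2|0.3|0.4|1.5|700\n",
]
pairs, conflicts = extract_key_value_pairs(tracking)
by_gene = update_keys({"TxA1": ["STRG.1.1"], "TxA2": ["STRG.3.1"]}, pairs)
print(pairs, conflicts, by_gene)
```

The `conflicts` set plays the role of the optional uniqueness check: a non-empty set would mean one transcript ID was paired with more than one gene ID.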

Reference GTF Processing

  1. parse_gtf_to_dict_by_geneID: Parses the reference GTF file into a JSON format for transcript insertion.
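
A reference GTF can be turned into a gene-indexed, JSON-serializable dictionary along these lines. The sketch assumes the standard nine tab-separated GTF columns and quoted attribute values; the record layout and the helper names (`parse_gtf_line`, `parse_gtf_to_dict_by_gene_id`) are illustrative choices, not necessarily those used by the toolkit.

```python
import json
import re

# Matches attribute pairs such as: gene_id "GeneA";
# Assumption: values are quoted; unquoted values would be skipped.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Turn one GTF line into a plain dict."""
    cols = line.rstrip("\n").split("\t")
    return {
        "seqname": cols[0], "source": cols[1], "feature": cols[2],
        "start": int(cols[3]), "end": int(cols[4]),
        "score": cols[5], "strand": cols[6], "frame": cols[7],
        "attributes": dict(ATTR_RE.findall(cols[8])),
    }


def parse_gtf_to_dict_by_gene_id(lines):
    """Group GTF records under their gene_id, preserving file order."""
    genes = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        record = parse_gtf_line(line)
        genes.setdefault(record["attributes"]["gene_id"], []).append(record)
    return genes


# Hypothetical reference lines, for illustration only.
reference = [
    'chr1\tref\ttranscript\t100\t500\t.\t+\t.\tgene_id "GeneA"; transcript_id "TxA1";\n',
    'chr1\tref\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "TxA1";\n',
    'chr1\tref\ttranscript\t900\t1200\t.\t-\t.\tgene_id "GeneB"; transcript_id "TxB1";\n',
]
genes = parse_gtf_to_dict_by_gene_id(reference)
print(json.dumps(genes, indent=2))
```

Keying the dictionary by gene ID means a later insertion step can fetch a gene's records directly instead of rescanning the file.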

Transcript Organization

  1. separate_transcripts_with_same_cmp_ref_geneID: Segregates transcripts with the same cmp_ref and geneID based on start positions.
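
One way to picture this step: split the flat record list into blocks (a transcript followed by its exons), bucket the blocks by (cmp_ref, gene ID), and order each bucket by transcript start. The record fields and function names below are assumptions made for the sketch; the toolkit's own data layout may differ.

```python
from collections import defaultdict


def split_into_blocks(records):
    """Group records into [transcript, exon, exon, ...] blocks, in input order."""
    blocks = []
    for record in records:
        if record["feature"] == "transcript":
            blocks.append([record])
        elif blocks:
            blocks[-1].append(record)  # exon belongs to the preceding transcript
    return blocks


def separate_transcripts(records):
    """Bucket blocks by (cmp_ref, gene_id) and sort each bucket by start."""
    buckets = defaultdict(list)
    for block in split_into_blocks(records):
        head = block[0]
        buckets[(head["cmp_ref"], head["gene_id"])].append(block)
    for blocks in buckets.values():
        blocks.sort(key=lambda block: block[0]["start"])
    return dict(buckets)


# Hypothetical records, for illustration only.
records = [
    {"feature": "transcript", "id": "T2", "start": 500, "cmp_ref": "TxA1", "gene_id": "GeneA"},
    {"feature": "exon", "id": "T2", "start": 500},
    {"feature": "transcript", "id": "T1", "start": 100, "cmp_ref": "TxA1", "gene_id": "GeneA"},
    {"feature": "exon", "id": "T1", "start": 100},
    {"feature": "exon", "id": "T1", "start": 300},
]
separated = separate_transcripts(records)
for key, blocks in separated.items():
    print(key, [block[0]["id"] for block in blocks])
```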

Insertion and Sorting

  1. insert_by_start_position: Utilizes an insertion sort algorithm to integrate new transcripts into the reference GTF file.
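
The insertion itself can be illustrated with the inner step of insertion sort: scan an already sorted list of transcript blocks from the end and place the new block at the first position where order by start is preserved. The block structure (a list whose first element is the transcript record) is an assumption carried over for this sketch.

```python
def insert_by_start_position(blocks, new_block):
    """Insert new_block into blocks (sorted by transcript start), in place."""
    new_start = new_block[0]["start"]
    index = len(blocks)
    # Walk left past every block that starts after the new one.
    while index > 0 and blocks[index - 1][0]["start"] > new_start:
        index -= 1
    blocks.insert(index, new_block)
    return blocks


# Hypothetical reference blocks for one gene, already sorted by start.
gene_blocks = [
    [{"id": "TxA1", "start": 100}],
    [{"id": "TxA2", "start": 300}],
    [{"id": "TxA3", "start": 700}],
]
insert_by_start_position(gene_blocks, [{"id": "STRG.1.1", "start": 250}])
print([block[0]["id"] for block in gene_blocks])
```

Because the scan stops at the first block whose start is less than or equal to the new start, a new transcript with the same start as an existing one is placed after it.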

Finalization

  1. parse_json_gtf: Converts JSON data back to the standard GTF format.
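
Converting back is largely a matter of rejoining the nine columns with tabs and re-serializing the attributes. The sketch below assumes a hypothetical record layout (a dict per GTF line with an "attributes" sub-dict); the toolkit's actual JSON schema is not documented here and may differ.

```python
import json


def record_to_gtf_line(record):
    """Serialize one record dict as a tab-separated GTF line."""
    attributes = " ".join(f'{key} "{value}";' for key, value in record["attributes"].items())
    columns = [
        record["seqname"], record["source"], record["feature"],
        str(record["start"]), str(record["end"]),
        record["score"], record["strand"], record["frame"],
        attributes,
    ]
    return "\t".join(columns)


def json_to_gtf(genes):
    """Flatten a gene-indexed dict back into GTF text, gene by gene."""
    lines = []
    for records in genes.values():
        lines.extend(record_to_gtf_line(record) for record in records)
    return "\n".join(lines) + "\n"


# Hypothetical JSON content, for illustration only.
genes = json.loads(
    '{"GeneA": [{"seqname": "chr1", "source": "ref", "feature": "exon", '
    '"start": 100, "end": 200, "score": ".", "strand": "+", "frame": ".", '
    '"attributes": {"gene_id": "GeneA", "transcript_id": "TxA1"}}]}'
)
print(json_to_gtf(genes), end="")
```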

Novel Transcripts Handling

  1. find_novel_transcripts: Appends novel transcripts (identified by specific class codes, such as u) to the final GTF file produced above.
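
Selecting transcripts by class code might look like the sketch below. It relies on two properties of gffcompare output: class_code appears on transcript lines only, and each transcript line precedes its exon lines. The function name `select_by_class_code` and the sample lines are hypothetical.

```python
import re

CLASS_CODE_RE = re.compile(r'class_code "([^"]+)"')
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def select_by_class_code(gtf_lines, class_codes):
    """Keep transcript lines with a wanted class code, plus their exon lines."""
    selected = []
    kept_ids = set()
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a GTF record (comment or blank line)
        transcript_id = TRANSCRIPT_ID_RE.search(cols[8])
        if transcript_id is None:
            continue
        if cols[2] == "transcript":
            code = CLASS_CODE_RE.search(cols[8])
            if code and code.group(1) in class_codes:
                kept_ids.add(transcript_id.group(1))
                selected.append(line)
        elif transcript_id.group(1) in kept_ids:
            selected.append(line)  # exon of a transcript kept earlier
    return selected


# Hypothetical combined.gtf lines, for illustration only.
combined = [
    'chr1\tStringTie\ttranscript\t10\t90\t.\t+\t.\ttranscript_id "TCONS_1"; gene_id "XLOC_1"; class_code "u";\n',
    'chr1\tStringTie\texon\t10\t90\t.\t+\t.\ttranscript_id "TCONS_1"; gene_id "XLOC_1"; exon_number "1";\n',
    'chr1\tStringTie\ttranscript\t200\t400\t.\t+\t.\ttranscript_id "TCONS_2"; gene_id "XLOC_2"; cmp_ref "TxA1"; class_code "=";\n',
    'chr1\tStringTie\texon\t200\t400\t.\t+\t.\ttranscript_id "TCONS_2"; gene_id "XLOC_2"; exon_number "1";\n',
]
novel = select_by_class_code(combined, {"r", "u", "i", "y", "p"})
print("".join(novel), end="")
```

Appending the result is then a matter of opening the final GTF file in append mode and writing the selected lines. The same function, called with a different set of class codes, would gather overlapping transcripts instead.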

Workflow Process

  1. Preparation: Begins with overlapping transcript identification and structuring for insertion into the reference GTF.
  2. Reference Processing: Converts the reference GTF to a JSON format for efficient transcript integration.
  3. Organization and Insertion: Organizes new transcripts by start positions for accurate placement in the reference GTF.
  4. Conversion to GTF: Transforms the JSON data back into the GTF format.
  5. Novel Transcript Append: Novel transcripts are added to the end of the GTF file.

Technical Features

  • Gene-ID-indexed JSON intermediate, which avoids repeated searches through the GTF file during insertion.
  • The insertion sort algorithm enables precise transcript integration.
  • Modular design for workflow flexibility.
  • Performance: The full pipeline completes in about 40 seconds on a system with 16 GB RAM and a 2.6 GHz 6-core Intel Core i7.

Usage and Examples

The gtfInsert toolkit was developed as part of a genomic analysis project to enhance transcript annotations in GTF files. It is designed for GTF files generated by gffcompare, which compares GTF/GFF files from sources such as StringTie against a reference GTF. The functions below cover the handling of both novel and overlapping transcripts:

  1. Appending Novel Transcripts: A Python script appends novel transcripts, identified by the class codes {'r', 'u', 'i', 'y', 'p'}, to the reference GTF file. gffcompare assigns these class codes in the XXX.combined.gtf or XXX.annotated.gtf files it generates.

    [Figure: Appending Novel Transcripts]

  2. Efficient Insertion of Overlapping Transcripts: Overlapping transcripts, classified under {'k', 'm', 'n', 'j', 'e'}, are handled with a function called parse_gtf_to_dict. It maps each unique gene ID to its corresponding transcripts, exons, and CDS records in a JSON format, so the insertion step can look a gene up by key instead of searching the whole file.

    [Figure: Efficient Insertion Process]

  3. Conversion from JSON to GTF Format: A function converts the JSON data back to the GTF format, so the output remains compatible with tools that expect GTF input.

  4. Finding and Sorting Overlapping Transcripts: The find_overlap_transcript script gathers all transcripts and exons with the specified class codes from the XXX.combined.gtf file. Each transcript, along with the exons that follow it, is then inserted into the reference GTF file at the correct location, determined by attributes such as cmp_ref.

    [Figure: Sorting Overlapping Transcripts]

  5. Handling Multiple Transcripts Under the Same Gene ID: The separate_transcripts_with_same_cmp_ref_geneID function organizes transcripts and exons that share the same cmp_ref and geneID by their start positions. This ordering is required for the accurate insertion and sorting of these transcripts into the reference GTF file.

    [Figure: Handling Multiple Transcripts]

  6. Key-Value Pair Extraction: A script extracts key-value pairs from a gffcompare tracking file. These pairs map transcript IDs to gene IDs and are required by the subsequent update step.

    [Figure: Key-Value Pair Extraction]

  7. Updating Keys in JSON Files: Another script updates the keys in one JSON file based on mappings provided in a second JSON file. This keeps the gene and transcript IDs in the modified GTF files accurate.

    [Figure: Updating Keys in JSON]

  8. Insertion Sort Algorithm: The toolkit uses an insertion sort algorithm to integrate transcripts from XXX.annotated.gtf or XXX.combined.gtf into the reference GTF file, placing each transcript according to its start position.

Conclusion

The gtfInsert toolkit gives researchers an efficient way to enhance transcript annotations in GTF files. Its scripts and organized workflow reduce the complex process of merging transcript annotations to a series of modular steps.

For more detailed insights and examples, the toolkit's GitHub repository provides extensive information and resources: GitHub Repository.
