Bug fixes and final changes #10

RSWilson1 · 2024-03-04T08:59:42Z

Fixed
#9 - Changed to add flanking to end coordinates.
#8 - Fixed errors occurring with GRCh38. Fixed tabix cmd. Using a short version of the long argument in the cmd line still doesn't break but this seems to be argparse functionality rather than an error.
#7 - Removed parameter documentation as no longer needed.

Updated to make bed file zero-based.
Updated Readme with more information and for clarity.
Removed assembly file as no longer needed with new mapping function.
Updated tests for new changes.

This change is

… present in bed. Fixed error with vcf file containing floats.

…d new option names. - Updated readme to make -gff and --assembly_summary/--assembly_report. - Renamed --assembly_summary to --assembly_report to make clear the file required, matches file name in Refseq FTP. - Updated args cmds to match name change.

- Updated assembly return function to has default mapping - Updated subtract_and_replace to be zero for zero-based bed - Updated to add addition_and_replace to add flanking to end position - Added line to convert 1-based gff coordinates to 0-based bed coordinates - added line to generate pkl upon running gff.

updated tests for new arg name

PEP8 formatting for gene_annotation2bed

Removed assembly file as an input as no longer required. Updated tests for removal of argument.

droped NaNs from gff_df to remove GRCh38 error. Updated docstring Changed df import to use str instead of string

Updated imports and added print statement

Removed redundant to do's Changed error to runtime error

pep8speaks · 2024-03-04T08:59:49Z

Hello @RSWilson1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file gene_annotation2bed.py:

Line 430:80: E501 line too long (83 > 79 characters)
Line 445:80: E501 line too long (82 > 79 characters)
Line 451:80: E501 line too long (83 > 79 characters)
Line 452:80: E501 line too long (132 > 79 characters)
Line 561:80: E501 line too long (93 > 79 characters)

In the file scripts/igv_report.py:

Line 63:80: E501 line too long (104 > 79 characters)
Line 70:80: E501 line too long (82 > 79 characters)
Line 87:31: E261 at least two spaces before inline comment

Comment last updated at 2024-03-05 16:58:49 UTC

jethror1

Reviewed 5 of 6 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @RSWilson1)

gene_annotation2bed.py line 203 at r2 (raw file):

    # print(gff_df[gff_df["source"] == "RefSeq"])
    # print(gff_df[gff_df["source"] == "BestRefSeq"])
    # print(gff_df[gff_df["source"] != "BestRefSeq"])

needed?

Code quote:

    # GFF what to filter on. Select mRNA? or starts with NM_?
    # print(gff_df.head())
    # print(gff_df[gff_df["transcript_id"].str.startswith("NR_")])
    # print(gff_df[gff_df["source"] == "RefSeq"])
    # print(gff_df[gff_df["source"] == "BestRefSeq"])
    # print(gff_df[gff_df["source"] != "BestRefSeq"])

gene_annotation2bed.py line 427 at r2 (raw file):

    gff_df["transcript_id"] = gff_df["transcript_id"].str.split(".").str[0]
    merged_hgnc_df = gff_df.merge(hgnc_df, on="hgnc_id", how="inner")
    # check for loss of hgnc ids

what is the requirement for this? Should you only be printing or raising an exception?

Code quote:

# check for loss of hgnc ids

gene_annotation2bed.py line 714 at r2 (raw file):

    -------
    result : list
        list of values with flanking added.

this returns an int and not a list

Code quote:

    result : list
        list of values with flanking added.

gene_annotation2bed.py line 733 at r2 (raw file):

    -------
    result : list
        list of values with flanking subtracted, minimum value = 0.

this returns an int and not a list

Code quote:

    Returns
    -------
    result : list
        list of values with flanking subtracted, minimum value = 0.

gene_annotation2bed.py line 813 at r2 (raw file):

    # Merge the coordinates DataFrame with the BED DataFrame
    # Set dtypes for the first DataFrame
    bed_df = bed_df.astype({

consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769

hgnc_id could probably be uint16 and coordinates safely can be uint32, and genes/transcripts/annotation are likely to be repeated quite a lot using categorical type would probably have a decent saving

Raises error for lost ids and requires check and removal

…ts to work with new dtypes.

RSWilson1

Reviewable status: 5 of 9 files reviewed, 5 unresolved discussions (waiting on @jethror1)

gene_annotation2bed.py line 203 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

needed?

ahh I knew I forgot something! Will remove as definitely not needed

gene_annotation2bed.py line 714 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

this returns an int and not a list

corrected.

gene_annotation2bed.py line 733 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

this returns an int and not a list

corrected.

gene_annotation2bed.py line 813 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769

hgnc_id could probably be uint16 and coordinates safely can be uint32, and genes/transcripts/annotation are likely to be repeated quite a lot using categorical type would probably have a decent saving

Updated dtypes, some use Int32 as it handles NAs, but as this is downstream of the gff (the point which uses maximum memory) it shouldn't impact the running much.

jethror1

Reviewed 2 of 4 files at r3, 1 of 1 files at r4.
Reviewable status: 8 of 10 files reviewed, 4 unresolved discussions (waiting on @RSWilson1)

gene_annotation2bed.py line 280 at r3 (raw file):

    coordinates_df['start'] = pd.Series(dtype=np.uint32)
    coordinates_df['end'] = pd.Series(dtype=np.uint32)
    coordinates_df['gene'] = ""

do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category

Code quote:

coordinates_df['gene'] = ""

gene_annotation2bed.py line 827 at r3 (raw file):

        "start_flank": np.uint32,
        "end_flank": np.uint32,
        "hgnc_id": "Int32",

could this not be uint16? this would allow for an ID <~60,000 and should half memory usage for the column

Code quote:

"Int32"

Changed dtype for gene column

RSWilson1

done

Reviewable status: 7 of 10 files reviewed, 4 unresolved discussions (waiting on @jethror1 and @RSWilson1)

gene_annotation2bed.py line 427 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

what is the requirement for this? Should you only be printing or raising an exception?

So I have changed this to raise an error with the ID needing removing to run again, as otherwise it's kinda a silent fail. I could add a force argument to allow them to continue without removing the ids which might be convenient.

gene_annotation2bed.py line 280 at r3 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category

done.

gene_annotation2bed.py line 827 at r3 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

could this not be uint16? this would allow for an ID <~60,000 and should half memory usage for the column

Yep fixed it to work now.

jethror1

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @RSWilson1)

RSWilson1 added 13 commits February 19, 2024 15:47

Added printing out HGNC_ids which weren't found in merge and won't be…

321f59c

… present in bed. Fixed error with vcf file containing floats.

Added handling for no IGV fasta provided.

f7f7f43

fixed if else logic for lost ids

293404b

updated tests for new arg name

Removed old arg for handling gene symbols

0a46a68

PEP8 formatting for gene_annotation2bed

Removed handling of assembly file.

9a56623

Removed assembly file as an input as no longer required. Updated tests for removal of argument.

changed creating empty cols

a3951b2

droped NaNs from gff_df to remove GRCh38 error. Updated docstring Changed df import to use str instead of string

Updated requirements docs and readme

539eef9

Updated imports and added print statement

Fixing small changes to get tests passing.

bec0b56

Fixed genome_build supplied for tests

8c66631

Removed redundant to do's Changed error to runtime error

Updated readme with latest changes and formatting

edd3d1a

Fixed tabix for igv reports and removed redundant arguments.

22d3941

RSWilson1 requested a review from jethror1 March 4, 2024 08:59

RSWilson1 self-assigned this Mar 4, 2024

RSWilson1 added 3 commits March 4, 2024 09:15

updated py version in yaml

449f9f9

updated comment

6151380

Updated doc string for script

963f28b

jethror1 requested changes Mar 4, 2024

View reviewed changes

RSWilson1 added 5 commits March 4, 2024 16:34

Updated dtypes

07caaf3

Raises error for lost ids and requires check and removal

Update dtypes to reduce memory usage by ~2Gb for GRCh37.

865943c

Updated dtypes to improve memory usage and changed gff2pandas and tes…

98c7c46

…ts to work with new dtypes.

Remove redundant prints

72b4982

Removed print

1827ba5

RSWilson1 commented Mar 5, 2024

View reviewed changes

Example of a mixed dataset

9791357

jethror1 requested changes Mar 5, 2024

View reviewed changes

RSWilson1 added 2 commits March 5, 2024 16:53

Changed dtype for hgnc_id now works

69751da

Changed dtype for gene column

PEP8

f412ddb

PEP8

1b8599c

RSWilson1 commented Mar 5, 2024

View reviewed changes

jethror1 approved these changes Mar 5, 2024

View reviewed changes

jethror1 merged commit 8d77f1f into main Mar 5, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fixes and final changes #10

Bug fixes and final changes #10

RSWilson1 commented Mar 4, 2024 •

edited by woook

Loading

pep8speaks commented Mar 4, 2024 •

edited

Loading

jethror1 left a comment

RSWilson1 left a comment

jethror1 left a comment

RSWilson1 left a comment

jethror1 left a comment

Bug fixes and final changes #10

Bug fixes and final changes #10

Conversation

RSWilson1 commented Mar 4, 2024 • edited by woook Loading

pep8speaks commented Mar 4, 2024 • edited Loading

Comment last updated at 2024-03-05 16:58:49 UTC

jethror1 left a comment

Choose a reason for hiding this comment

RSWilson1 left a comment

Choose a reason for hiding this comment

jethror1 left a comment

Choose a reason for hiding this comment

RSWilson1 left a comment

Choose a reason for hiding this comment

jethror1 left a comment

Choose a reason for hiding this comment

RSWilson1 commented Mar 4, 2024 •

edited by woook

Loading

pep8speaks commented Mar 4, 2024 •

edited

Loading