Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes and final changes #10

Merged
merged 25 commits into from
Mar 5, 2024
Merged

Bug fixes and final changes #10

merged 25 commits into from
Mar 5, 2024

Conversation

RSWilson1
Copy link
Collaborator

@RSWilson1 RSWilson1 commented Mar 4, 2024

Fixed
#9 - Changed to add flanking to end coordinates.
#8 - Fixed errors occurring with GRCh38. Fixed tabix cmd. Using a short version of the long argument in the cmd line still doesn't break but this seems to be argparse functionality rather than an error.
#7 - Removed parameter documentation as no longer needed.

Updated to make bed file zero-based.
Updated Readme with more information and for clarity.
Removed assembly file as no longer needed with new mapping function.
Updated tests for new changes.


This change is Reviewable

… present in bed.

Fixed error with vcf file containing floats.
…d new option names.

- Updated readme to make -gff and --assembly_summary/--assembly_report.
- Renamed --assembly_summary to --assembly_report to make clear the file required, matches file name in Refseq FTP.
- Updated args cmds to match name change.
- Updated assembly return function to has default mapping
- Updated subtract_and_replace to be zero for zero-based bed
- Updated to add addition_and_replace to add flanking to end position
- Added line to convert 1-based gff coordinates to 0-based bed coordinates
- added line to generate pkl upon running gff.
updated tests for new arg name
PEP8 formatting for gene_annotation2bed
Removed assembly file as an input as no longer required.
Updated tests for removal of argument.
droped NaNs from gff_df to remove GRCh38 error.
Updated docstring
Changed df import to use str instead of string
Updated imports and added print statement
Removed redundant to do's
Changed error to runtime error
@RSWilson1 RSWilson1 requested a review from jethror1 March 4, 2024 08:59
@RSWilson1 RSWilson1 self-assigned this Mar 4, 2024
@pep8speaks
Copy link

pep8speaks commented Mar 4, 2024

Hello @RSWilson1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 430:80: E501 line too long (83 > 79 characters)
Line 445:80: E501 line too long (82 > 79 characters)
Line 451:80: E501 line too long (83 > 79 characters)
Line 452:80: E501 line too long (132 > 79 characters)
Line 561:80: E501 line too long (93 > 79 characters)

Line 63:80: E501 line too long (104 > 79 characters)
Line 70:80: E501 line too long (82 > 79 characters)
Line 87:31: E261 at least two spaces before inline comment

Comment last updated at 2024-03-05 16:58:49 UTC

Copy link
Contributor

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @RSWilson1)


gene_annotation2bed.py line 203 at r2 (raw file):

    # print(gff_df[gff_df["source"] == "RefSeq"])
    # print(gff_df[gff_df["source"] == "BestRefSeq"])
    # print(gff_df[gff_df["source"] != "BestRefSeq"])

needed?

Code quote:

    # GFF what to filter on. Select mRNA? or starts with NM_?
    # print(gff_df.head())
    # print(gff_df[gff_df["transcript_id"].str.startswith("NR_")])
    # print(gff_df[gff_df["source"] == "RefSeq"])
    # print(gff_df[gff_df["source"] == "BestRefSeq"])
    # print(gff_df[gff_df["source"] != "BestRefSeq"])

gene_annotation2bed.py line 427 at r2 (raw file):

    gff_df["transcript_id"] = gff_df["transcript_id"].str.split(".").str[0]
    merged_hgnc_df = gff_df.merge(hgnc_df, on="hgnc_id", how="inner")
    # check for loss of hgnc ids

what is the requirement for this? Should you only be printing or raising an exception?

Code quote:

# check for loss of hgnc ids

gene_annotation2bed.py line 714 at r2 (raw file):

    -------
    result : list
        list of values with flanking added.

this returns an int and not a list

Code quote:

    result : list
        list of values with flanking added.

gene_annotation2bed.py line 733 at r2 (raw file):

    -------
    result : list
        list of values with flanking subtracted, minimum value = 0.

this returns an int and not a list

Code quote:

    Returns
    -------
    result : list
        list of values with flanking subtracted, minimum value = 0.

gene_annotation2bed.py line 813 at r2 (raw file):

    # Merge the coordinates DataFrame with the BED DataFrame
    # Set dtypes for the first DataFrame
    bed_df = bed_df.astype({

consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769

hgnc_id could probably be uint16 and coordinates safely can be uint32, and genes/transcripts/annotation are likely to be repeated quite a lot using categorical type would probably have a decent saving

Copy link
Collaborator Author

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 9 files reviewed, 5 unresolved discussions (waiting on @jethror1)


gene_annotation2bed.py line 203 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

needed?

ahh I knew I forgot something! Will remove as definitely not needed


gene_annotation2bed.py line 714 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

this returns an int and not a list

corrected.


gene_annotation2bed.py line 733 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

this returns an int and not a list

corrected.


gene_annotation2bed.py line 813 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769

hgnc_id could probably be uint16 and coordinates safely can be uint32, and genes/transcripts/annotation are likely to be repeated quite a lot using categorical type would probably have a decent saving

Updated dtypes, some use Int32 as it handles NAs, but as this is downstream of the gff (the point which uses maximum memory) it shouldn't impact the running much.

Copy link
Contributor

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r3, 1 of 1 files at r4.
Reviewable status: 8 of 10 files reviewed, 4 unresolved discussions (waiting on @RSWilson1)


gene_annotation2bed.py line 280 at r3 (raw file):

    coordinates_df['start'] = pd.Series(dtype=np.uint32)
    coordinates_df['end'] = pd.Series(dtype=np.uint32)
    coordinates_df['gene'] = ""

do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category

Code quote:

coordinates_df['gene'] = ""

gene_annotation2bed.py line 827 at r3 (raw file):

        "start_flank": np.uint32,
        "end_flank": np.uint32,
        "hgnc_id": "Int32",

could this not be uint16? this would allow for an ID <~60,000 and should half memory usage for the column

Code quote:

"Int32"

Copy link
Collaborator Author

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Reviewable status: 7 of 10 files reviewed, 4 unresolved discussions (waiting on @jethror1 and @RSWilson1)


gene_annotation2bed.py line 427 at r2 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

what is the requirement for this? Should you only be printing or raising an exception?

So I have changed this to raise an error with the ID needing removing to run again, as otherwise it's kinda a silent fail. I could add a force argument to allow them to continue without removing the ids which might be convenient.


gene_annotation2bed.py line 280 at r3 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category

done.


gene_annotation2bed.py line 827 at r3 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

could this not be uint16? this would allow for an ID <~60,000 and should half memory usage for the column

Yep fixed it to work now.

Copy link
Contributor

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @RSWilson1)

@jethror1 jethror1 merged commit 8d77f1f into main Mar 5, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants