-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fixes and final changes #10
Conversation
… present in bed. Fixed error with vcf file containing floats.
…d new option names. - Updated readme to make -gff and --assembly_summary/--assembly_report. - Renamed --assembly_summary to --assembly_report to make clear the file required, matches file name in Refseq FTP. - Updated args cmds to match name change.
- Updated assembly return function to has default mapping - Updated subtract_and_replace to be zero for zero-based bed - Updated to add addition_and_replace to add flanking to end position - Added line to convert 1-based gff coordinates to 0-based bed coordinates - added line to generate pkl upon running gff.
updated tests for new arg name
PEP8 formatting for gene_annotation2bed
Removed assembly file as an input as no longer required. Updated tests for removal of argument.
droped NaNs from gff_df to remove GRCh38 error. Updated docstring Changed df import to use str instead of string
Updated imports and added print statement
Removed redundant to do's Changed error to runtime error
Hello @RSWilson1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-03-05 16:58:49 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 6 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @RSWilson1)
gene_annotation2bed.py
line 203 at r2 (raw file):
# print(gff_df[gff_df["source"] == "RefSeq"]) # print(gff_df[gff_df["source"] == "BestRefSeq"]) # print(gff_df[gff_df["source"] != "BestRefSeq"])
needed?
Code quote:
# GFF what to filter on. Select mRNA? or starts with NM_?
# print(gff_df.head())
# print(gff_df[gff_df["transcript_id"].str.startswith("NR_")])
# print(gff_df[gff_df["source"] == "RefSeq"])
# print(gff_df[gff_df["source"] == "BestRefSeq"])
# print(gff_df[gff_df["source"] != "BestRefSeq"])
gene_annotation2bed.py
line 427 at r2 (raw file):
gff_df["transcript_id"] = gff_df["transcript_id"].str.split(".").str[0] merged_hgnc_df = gff_df.merge(hgnc_df, on="hgnc_id", how="inner") # check for loss of hgnc ids
what is the requirement for this? Should you only be printing or raising an exception?
Code quote:
# check for loss of hgnc ids
gene_annotation2bed.py
line 714 at r2 (raw file):
------- result : list list of values with flanking added.
this returns an int and not a list
Code quote:
result : list
list of values with flanking added.
gene_annotation2bed.py
line 733 at r2 (raw file):
------- result : list list of values with flanking subtracted, minimum value = 0.
this returns an int and not a list
Code quote:
Returns
-------
result : list
list of values with flanking subtracted, minimum value = 0.
gene_annotation2bed.py
line 813 at r2 (raw file):
# Merge the coordinates DataFrame with the BED DataFrame # Set dtypes for the first DataFrame bed_df = bed_df.astype({
consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769
hgnc_id could probably be uint16
and coordinates safely can be uint32
, and genes/transcripts/annotation are likely to be repeated quite a lot using categorical
type would probably have a decent saving
Raises error for lost ids and requires check and removal
…ts to work with new dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 9 files reviewed, 5 unresolved discussions (waiting on @jethror1)
gene_annotation2bed.py
line 203 at r2 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
needed?
ahh I knew I forgot something! Will remove as definitely not needed
gene_annotation2bed.py
line 714 at r2 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
this returns an int and not a list
corrected.
gene_annotation2bed.py
line 733 at r2 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
this returns an int and not a list
corrected.
gene_annotation2bed.py
line 813 at r2 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
consider where you're setting types to use more strict int types to reduce memory usage, I wrote up a bit in Slack here: https://binfx.slack.com/archives/G01HV61U4BX/p1682010142425769
hgnc_id could probably be
uint16
and coordinates safely can beuint32
, and genes/transcripts/annotation are likely to be repeated quite a lot usingcategorical
type would probably have a decent saving
Updated dtypes, some use Int32 as it handles NAs, but as this is downstream of the gff (the point which uses maximum memory) it shouldn't impact the running much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r3, 1 of 1 files at r4.
Reviewable status: 8 of 10 files reviewed, 4 unresolved discussions (waiting on @RSWilson1)
gene_annotation2bed.py
line 280 at r3 (raw file):
coordinates_df['start'] = pd.Series(dtype=np.uint32) coordinates_df['end'] = pd.Series(dtype=np.uint32) coordinates_df['gene'] = ""
do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category
Code quote:
coordinates_df['gene'] = ""
gene_annotation2bed.py
line 827 at r3 (raw file):
"start_flank": np.uint32, "end_flank": np.uint32, "hgnc_id": "Int32",
could this not be uint16
? this would allow for an ID <~60,000 and should half memory usage for the column
Code quote:
"Int32"
Changed dtype for gene column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Reviewable status: 7 of 10 files reviewed, 4 unresolved discussions (waiting on @jethror1 and @RSWilson1)
gene_annotation2bed.py
line 427 at r2 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
what is the requirement for this? Should you only be printing or raising an exception?
So I have changed this to raise an error with the ID needing removing to run again, as otherwise it's kinda a silent fail. I could add a force argument to allow them to continue without removing the ids which might be convenient.
gene_annotation2bed.py
line 280 at r3 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
do you not set a type for this column? I assume by default it will either be string or object, would be good to force it to category
done.
gene_annotation2bed.py
line 827 at r3 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
could this not be
uint16
? this would allow for an ID <~60,000 and should half memory usage for the column
Yep fixed it to work now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @RSWilson1)
Fixed
#9 - Changed to add flanking to end coordinates.
#8 - Fixed errors occurring with GRCh38. Fixed tabix cmd. Using a short version of the long argument in the cmd line still doesn't break but this seems to be argparse functionality rather than an error.
#7 - Removed parameter documentation as no longer needed.
Updated to make bed file zero-based.
Updated Readme with more information and for clarity.
Removed assembly file as no longer needed with new mapping function.
Updated tests for new changes.
This change is