Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutational load function (SHM) #536

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

MKanetscheider
Copy link
Collaborator

Added mutational_load function to calculate differences between sequence and germline alignment. This is especially useful/insightful for BCR due to SHM and help to understand how much mutational actually occurred. However, this is a rather simple approach!

Closes #...

  • CHANGELOG.md updated
  • Tests added (For bug fixes or new features)
  • Tutorial updated (if necessary)

src/scirpy/tl/_mutational_load.py Outdated Show resolved Hide resolved
src/scirpy/tl/__init__.py Show resolved Hide resolved
src/scirpy/tl/_mutational_load.py Outdated Show resolved Hide resolved
mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []}

for row in range(len(airr_df)):
fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do the numbers of the indices come from? Can we be sure they will remain stable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These indices come from the IMGT unique numbering scheme (https://pubmed.ncbi.nlm.nih.gov/12477501/). This scheme is a standard approach to ensure that we can compare different V-regions of different cells. The neat thing is that sequences are aligned in a way that fwr 1-3 and cdr1-2 are always on the same spot in the germline and sequence alignment that's why these fixed indices work. cdr3 and fwr4 can be inferred by knowing the junction length and total sequence length as it is used in my code.

src/scirpy/tl/_mutational_load.py Outdated Show resolved Hide resolved
),
}

for v, coordinates in regions.items():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for v, coordinates in regions.items():
for region, coordinates in regions.items():

One letter loop variables should only be used if they follow certain conventions, e.g. i/j/k for counters in for loops,
or k, v for key, value pairs from dict.items().

Since you use v for the dict key, this can be confusing and I suggest to use a "proper" variable name like region here.

@grst
Copy link
Collaborator

grst commented Aug 19, 2024

In terms of implementation, I think we're getting there :)
Still need to try it out myself to check if I like the overall workflow/interface when using this in a juptyer notebook.

@MKanetscheider
Copy link
Collaborator Author

Hi Gregor,
I worked on the test case for the mutational_load function and finished some kind of "beta" version. I would be very grateful if you could have a look if I'm going in the right direction here :)
Additionally, I discovered some bugs while testing, which I quickly resolved, but maybe not elegantly so please have also a look on that.

For some reason pushing these changes seem to have broken something with MuData, but I have no idea why and what I could possibly have done to cause this 😢 The error massage seems to be everywhere the same:
ImportError: cannot import name 'AlignedViewMixin' from 'anndata._core.aligned_mapping'
Could you help me solve this?

@grst
Copy link
Collaborator

grst commented Aug 29, 2024

Breaking mudata is not your fault. It was caused by an anndata release and should be fixed by now. Just rerun the tests :)

Copy link

codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.75%. Comparing base (08e0cc3) to head (4ad7d59).
Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
src/scirpy/tl/_mutational_load.py 85.88% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #536      +/-   ##
==========================================
+ Coverage   81.43%   81.75%   +0.31%     
==========================================
  Files          49       50       +1     
  Lines        4213     4451     +238     
==========================================
+ Hits         3431     3639     +208     
- Misses        782      812      +30     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst grst mentioned this pull request Oct 17, 2024
6 tasks
Comment on lines +63 to +66
sequence_alignment
Awkward array key to access sequence alignment information
germline_alignment
Awkward array key to access germline alignment information -> best practice mask D-gene segment (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do these information get there? Does dandelion / immcantation preprocessing populate these fields with "best practice" values? Does cellranger also provide them without any further processing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these columns are populated by such tools like Dandelion and Immcantation. In fact, it's the result of re-annotation with igBLAST or imgt/highv-quest that is run "under the hood" of these tools (see also https://immcantation.readthedocs.io/en/stable/getting_started/10x_tutorial.html#assign-v-d-and-j-genes-using-igblast). Both dandelion and Immcantation follow the AIRR Community Standard, meaning that both "sequence_alignment" and "germline_alignment" should always result in the IMGT-gapped sequence (see also https://immcantation.readthedocs.io/en/stable/datastandards.html)
As far as I am aware does cellranger not provide it in this format, which is the main reason why we have to re-annotate cellranger output in the first place 😢

@grst
Copy link
Collaborator

grst commented Nov 21, 2024

@MKanetscheider, two more considerations

  1. Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like length? If they were using a different reference, then the results would be pretty nonsensical, wouldn't they?

  2. Especially in the case of region = "subregion", we end up with a ton on columns in adata.obs. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments

    • chain_idx_key (just calculate it for all chains)
    • chains (just calculate it for all chains)
    • region (just calculate it for all regions)
    • frequency (just calculate both)

    The function would then add the chain-level attributes mutation_count, mutation_freq, {cdr,fwd}{1,2,3,4}_mutation_{count,load} v_segment_mutation_{count,freq}.

    Afterwards, they could be retrieved like any other chain attribute:

    df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])

LMK what you think!

@MKanetscheider
Copy link
Collaborator Author

@MKanetscheider, two more considerations

  1. Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like length? If they were using a different reference, then the results would be pretty nonsensical, wouldn't they?

Unfortunately yes, if sequences are not IMGT aligned this whole function would be either non-functional any more or just returning some random nonsense 😢 I think I actually do already check for length, because I count differences via hamming-distance, which raises a ValueError if germline and sequence alignments have different lengths. However, I would also like to have a better safety net, but I couldn't come up with anything else so far...

  1. Especially in the case of region = "subregion", we end up with a ton on columns in adata.obs. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments

    • chain_idx_key (just calculate it for all chains)
    • chains (just calculate it for all chains)
    • region (just calculate it for all regions)
    • frequency (just calculate both)

    The function would then add the chain-level attributes mutation_count, mutation_freq, {cdr,fwd}{1,2,3,4}_mutation_{count,load} v_segment_mutation_{count,freq}.
    Afterwards, they could be retrieved like any other chain attribute:

    df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])

LMK what you think!

I think this sounds rather amazing...as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment...I would really love to see it in a more compact format
But do you think that we might run into any performance problems if we always calculate everything with one function call? The function would have to calculate every hamming distance of each sequence/germline alignment pair, which could be troublesome for this big datasets that Felix and you are trying to reach, right?

@grst
Copy link
Collaborator

grst commented Nov 23, 2024

I think I actually do already check for length, because I count differences via hamming-distance, which raises a ValueError if germline and sequence alignments have different lengths

I think that's good enough then. After all it's also clearly stated in the documentation.

But do you think that we might run into any performance problems if we always calculate everything with one function call?

I don't think it would be an issue, because it will still be linear over the number of chains. The part Felix has been working on compares all-vs-all sequences, which is quadratic over the number of sequences and therefore a much harder problem.

Comment on lines +26 to +27
if num_chars == 0:
return np.nan # can be used as a flag for filtering
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this basically returns None when no (non-ignored) characters were compared. Could you please elaborate what's the reasoning behind this instead of returning 0?

@grst
Copy link
Collaborator

grst commented Nov 24, 2024

I think this sounds rather amazing...as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment...I would really love to see it in a more compact format
But do you think that we might run into any performance problems if we always calculate everything with one function call? The function would have to calculate every hamming distance of each sequence/germline alignment pair, which could be troublesome for this big datasets that Felix and you are trying to reach, right?

I gave it a try in #573.
Speed-wise, it is about as fast as running all regions on VJ_1/VDJ_1 chains in your implementation, but including all chains. It could also be further optimized, but it's likely not a bottleneck... it still completes within a few minutes on 1M cells.

I'd still need to deal with a bunch of edge cases, but I think the approach is viable.

Comment on lines +177 to +181
"cdr3": (312, 312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6),
"fwr4": (
312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6,
len(airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"]),
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reference for this calculation?

From Lefranc 2023:

rearranged CDR3-IMGT of 13 amino acids (and toa JUNCTION of 15 amino acids, 2nd-CYS 104 and J-TRP or J-PHE 118 being included in the JUNCTION definition). This numbering is convenient to use since 80% of the IMGT/LIGM-DBIG and TR rearranged sequences have a CDR3 IMGT length less than or equal to 13 amino acids (IMGT statistics, October 2001). For rearranged CDR3-IMGT less than 13 amino acids, gaps are created from the top of the loop, in the following order 111, 112, 110, 113, 109, 114, etc. (Table 4A). For rearranged CDR3-IMGT more than 13 amino acids, additional positions are created, between positions 111 and 112 at the top of the CDR3-IMGT loop, in the following order 112.1, 111.1, 112.2, 111.2, 112.3, 111.3, etc. (Table 4B).

From that text, I'd rather deduce something like

312 + max(junction_length, 13 * 3) # * 3 because 13 amino acids into nucleotides

But I haven't looked at the actual data myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants