Mutational load function (SHM) #536

MKanetscheider · 2024-08-09T14:57:16Z

Added a mutational_load function to calculate differences between the sequence and germline alignments. This is especially useful for BCR data because of SHM, as it helps to understand how many mutations actually occurred. However, this is a rather simple approach!

Closes #...

CHANGELOG.md updated
Tests added (For bug fixes or new features)
Tutorial updated (if necessary)

grst · 2024-08-15T12:05:22Z

src/scirpy/tl/_mutational_load.py

+            mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []}
+
+            for row in range(len(airr_df)):
+                fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78]


Where do the numbers of the indices come from? Can we be sure they will remain stable?

These indices come from the IMGT unique numbering scheme (https://pubmed.ncbi.nlm.nih.gov/12477501/), a standard approach that makes the V-regions of different cells comparable. The neat thing is that sequences are aligned such that fwr1-3 and cdr1-2 always occupy the same positions in the germline and sequence alignments, which is why these fixed indices work. cdr3 and fwr4 can be inferred from the junction length and the total sequence length, as is done in my code.
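To make the fixed indices concrete, here is a minimal sketch of how an IMGT-gapped nucleotide alignment could be sliced into regions. The boundaries are the IMGT amino-acid positions (FR1 1-26, CDR1 27-38, FR2 39-55, CDR2 56-65, FR3 66-104) multiplied by three, which matches the `[:78]` and `312` values in the diff. `IMGT_REGIONS` and `split_regions` are hypothetical names used for illustration only, and the length check is an assumption, not something the PR is confirmed to do.

```python
# Nucleotide slice boundaries (0-based, end-exclusive) derived from the
# IMGT unique numbering: amino-acid position * 3.
IMGT_REGIONS = {
    "fwr1": (0, 78),     # aa 1-26
    "cdr1": (78, 114),   # aa 27-38
    "fwr2": (114, 165),  # aa 39-55
    "cdr2": (165, 195),  # aa 56-65
    "fwr3": (195, 312),  # aa 66-104
}


def split_regions(alignment: str) -> dict[str, str]:
    """Slice an IMGT-gapped nucleotide alignment into FWR1-3 and CDR1-2."""
    # Assumption: anything shorter than 312 nt cannot cover FWR1 through FWR3.
    if len(alignment) < 312:
        raise ValueError("alignment is shorter than the IMGT V-region (312 nt)")
    return {name: alignment[start:end] for name, (start, end) in IMGT_REGIONS.items()}


# Toy input: 312 placeholder characters, not a real sequence.
toy = "A" * 78 + "C" * 36 + "G" * 51 + "T" * 30 + "N" * 117
regions = split_regions(toy)
```

Because the gaps are part of the alignment, the same slices apply to both the sequence and the germline string of a chain.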

grst · 2024-08-19T15:27:39Z

src/scirpy/tl/_mutational_load.py

+                    ),
+                }
+
+                for v, coordinates in regions.items():


Suggested change

for v, coordinates in regions.items():

for region, coordinates in regions.items():

One-letter loop variables should only be used if they follow certain conventions, e.g. i/j/k for counters in for loops,
or k, v for key, value pairs from dict.items().

Since you use v for the dict key, this can be confusing, and I suggest using a "proper" variable name like region here.

grst · 2024-08-19T15:30:43Z

In terms of implementation, I think we're getting there :)
I still need to try it out myself to check whether I like the overall workflow/interface when using this in a Jupyter notebook.

MKanetscheider · 2024-08-29T10:02:52Z

Hi Gregor,
I worked on the test case for the mutational_load function and finished a kind of "beta" version. I would be very grateful if you could check whether I'm going in the right direction here :)
Additionally, I discovered some bugs while testing, which I resolved quickly but maybe not elegantly, so please have a look at that as well.

For some reason, pushing these changes seems to have broken something with MuData, but I have no idea why or what I could have done to cause this 😢 The error message seems to be the same everywhere:
ImportError: cannot import name 'AlignedViewMixin' from 'anndata._core.aligned_mapping'
Could you help me solve this?

grst · 2024-08-29T20:31:02Z

Breaking mudata is not your fault. It was caused by an anndata release and should be fixed by now. Just rerun the tests :)

codecov · 2024-08-30T18:28:31Z

Codecov Report

Attention: Patch coverage is 87.50000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.75%. Comparing base (08e0cc3) to head (4ad7d59).
Report is 20 commits behind head on main.

Files with missing lines	Patch %	Lines
src/scirpy/tl/_mutational_load.py	85.88%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #536      +/-   ##
==========================================
+ Coverage   81.43%   81.75%   +0.31%     
==========================================
  Files          49       50       +1     
  Lines        4213     4451     +238     
==========================================
+ Hits         3431     3639     +208     
- Misses        782      812      +30

review-notebook-app · 2024-10-17T19:49:59Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

grst · 2024-11-20T21:00:28Z

src/scirpy/tl/_mutational_load.py

+    sequence_alignment
+        Awkward array key to access sequence alignment information
+    germline_alignment
+        Awkward array key to access germline alignment information -> best practice mask D-gene segment (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2)


How does this information get there? Does dandelion / immcantation preprocessing populate these fields with "best practice" values? Does cellranger also provide them without any further processing?

Yes, these columns are populated by tools such as Dandelion and Immcantation. In fact, they are the result of re-annotation with IgBLAST or IMGT/HighV-QUEST, which is run "under the hood" of these tools (see also https://immcantation.readthedocs.io/en/stable/getting_started/10x_tutorial.html#assign-v-d-and-j-genes-using-igblast). Both Dandelion and Immcantation follow the AIRR Community Standard, meaning that both "sequence_alignment" and "germline_alignment" should always contain the IMGT-gapped sequence (see also https://immcantation.readthedocs.io/en/stable/datastandards.html).
As far as I am aware, cellranger does not provide it in this format, which is the main reason why we have to re-annotate cellranger output in the first place 😢

grst · 2024-11-21T19:37:23Z

@MKanetscheider, two more considerations

1. Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like length? If they were using a different reference, then the results would be pretty nonsensical, wouldn't they?
2. Especially in the case of region = "subregion", we end up with a ton of columns in adata.obs. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments:
   - chain_idx_key (just calculate it for all chains)
   - chains (just calculate it for all chains)
   - region (just calculate it for all regions)
   - frequency (just calculate both)

The function would then add the chain-level attributes mutation_count, mutation_freq, {cdr,fwd}{1,2,3,4}_mutation_{count,load}, v_segment_mutation_{count,freq}.

Afterwards, they could be retrieved like any other chain attribute:
```
df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])
```

LMK what you think!

MKanetscheider · 2024-11-23T10:48:29Z

> @MKanetscheider, two more considerations
>
> Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like length? If they were using a different reference, then the results would be pretty nonsensical, wouldn't they?

Unfortunately yes: if the sequences are not IMGT-aligned, this whole function would either stop working or just return random nonsense 😢 I think I already check for length, because I count differences via the Hamming distance, which raises a ValueError if the germline and sequence alignments have different lengths. However, I would also like to have a better safety net, but I haven't come up with anything else so far...
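For readers following along, the length check described here can be sketched as below. This is a minimal stand-in, assuming a plain position-by-position comparison; `hamming_distance` is a hypothetical helper and not necessarily the function the PR calls.

```python
def hamming_distance(sequence: str, germline: str) -> int:
    """Count positions at which two equal-length alignments differ."""
    # Unequal lengths mean the two strings are not aligned to each other,
    # so a position-wise comparison would be meaningless.
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline alignments differ in length")
    return sum(s != g for s, g in zip(sequence, germline))
```

Note that this only catches alignments of unequal length. Two equally long strings aligned against a non-IMGT reference would still pass.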

> Especially in the case of region = "subregion", we end up with a ton of columns in adata.obs. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments
>
> - chain_idx_key (just calculate it for all chains)
> - chains (just calculate it for all chains)
> - region (just calculate it for all regions)
> - frequency (just calculate both)
>
> The function would then add the chain-level attributes mutation_count, mutation_freq, {cdr,fwd}{1,2,3,4}_mutation_{count,load} v_segment_mutation_{count,freq}.
> Afterwards, they could be retrieved like any other chain attribute:
> df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])
> LMK what you think!

I think this sounds rather amazing... As you are already well aware, this whole function is rather ugly and not that user-friendly at the moment. I would really love to see it in a more compact format.
But do you think we might run into performance problems if we always calculate everything with one function call? The function would have to calculate the Hamming distance of each sequence/germline alignment pair, which could be troublesome for the big datasets that Felix and you are targeting, right?

grst · 2024-11-23T18:21:39Z

> I think I actually do already check for length, because I count differences via hamming-distance, which raises a ValueError if germline and sequence alignments have different lengths

I think that's good enough then. After all it's also clearly stated in the documentation.

> But do you think that we might run into any performance problems if we always calculate everything with one function call?

I don't think it would be an issue, because it will still be linear over the number of chains. The part Felix has been working on compares all-vs-all sequences, which is quadratic over the number of sequences and therefore a much harder problem.
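As a rough illustration of this scaling argument, the sketch below counts the comparisons each approach needs. The function names are hypothetical, and the counts ignore the cost of each individual comparison.

```python
def per_chain_comparisons(n_chains: int) -> int:
    """Mutational load: one sequence/germline comparison per chain."""
    return n_chains


def all_vs_all_comparisons(n_sequences: int) -> int:
    """Pairwise distances: every unordered pair of sequences is compared once."""
    return n_sequences * (n_sequences - 1) // 2


n = 1_000_000
linear = per_chain_comparisons(n)
quadratic = all_vs_all_comparisons(n)
```

For one million entries this is 1,000,000 comparisons versus 499,999,500,000.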

grst · 2024-11-24T13:21:17Z

src/scirpy/tl/_mutational_load.py

+    if num_chars == 0:
+        return np.nan  # can be used as a flag for filtering


So this basically returns NaN when no (non-ignored) characters were compared. Could you please elaborate on the reasoning behind this instead of returning 0?
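One possible reading of the distinction, sketched under assumptions: NaN marks "nothing could be compared", whereas 0 marks "compared and identical". The function name `mutation_frequency` and the set of ignored characters (`N`, `.`, `-`) are illustrative guesses, not taken from the PR.

```python
import numpy as np


def mutation_frequency(sequence: str, germline: str, ignore_chars: str = "N.-") -> float:
    """Fraction of compared positions that differ, skipping ignored characters."""
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline alignments differ in length")
    num_chars = 0
    num_mismatches = 0
    for s, g in zip(sequence, germline):
        # Skip positions where either side is a gap or a masked base.
        if s in ignore_chars or g in ignore_chars:
            continue
        num_chars += 1
        num_mismatches += s != g
    if num_chars == 0:
        return np.nan  # nothing was compared; distinct from "no mutations"
    return num_mismatches / num_chars


freqs = np.array([
    mutation_frequency("ACGT", "ACGA"),
    mutation_frequency("ACGT", "ACGT"),
    mutation_frequency("....", "NNNN"),
])
# Chains with no comparable positions can be filtered out explicitly.
valid = freqs[~np.isnan(freqs)]
```

With 0 as the return value, the third chain would be indistinguishable from an unmutated one.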

grst · 2024-11-24T15:05:49Z

> I think this sounds rather amazing...as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment...I would really love to see it in a more compact format
> But do you think that we might run into any performance problems if we always calculate everything with one function call? The function would have to calculate every hamming distance of each sequence/germline alignment pair, which could be troublesome for this big datasets that Felix and you are trying to reach, right?

I gave it a try in #573.
Speed-wise, it is about as fast as running all regions on VJ_1/VDJ_1 chains in your implementation, but including all chains. It could also be further optimized, but it's likely not a bottleneck... it still completes within a few minutes on 1M cells.

I'd still need to deal with a bunch of edge cases, but I think the approach is viable.

grst · 2024-11-24T15:18:26Z

src/scirpy/tl/_mutational_load.py

+                        "cdr3": (312, 312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6),
+                        "fwr4": (
+                            312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6,
+                            len(airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"]),
+                        ),


Do you have a reference for this calculation?

From Lefranc 2003:

> rearranged CDR3-IMGT of 13 amino acids (and to a JUNCTION of 15 amino acids, 2nd-CYS 104 and J-TRP or J-PHE 118 being included in the JUNCTION definition). This numbering is convenient to use since 80% of the IMGT/LIGM-DB IG and TR rearranged sequences have a CDR3 IMGT length less than or equal to 13 amino acids (IMGT statistics, October 2001). For rearranged CDR3-IMGT less than 13 amino acids, gaps are created from the top of the loop, in the following order 111, 112, 110, 113, 109, 114, etc. (Table 4A). For rearranged CDR3-IMGT more than 13 amino acids, additional positions are created, between positions 111 and 112 at the top of the CDR3-IMGT loop, in the following order 112.1, 111.1, 112.2, 111.2, 112.3, 111.3, etc. (Table 4B).

From that text, I'd rather deduce something like

312 + max(junction_length, 13 * 3) # * 3 because 13 amino acids into nucleotides

But I haven't looked at the actual data myself.
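To see where the two candidate formulas diverge, here is a small sketch that evaluates both for a few junction lengths (in nucleotides). Neither is confirmed as correct here; which one applies depends on whether the alignment actually gaps the CDR3 to the IMGT length, which has not been checked against real data. The function names are hypothetical.

```python
FWR3_END = 312  # end of FWR3 in IMGT-gapped nucleotide coordinates


def cdr3_end_pr(junction_len: int) -> int:
    """CDR3 end as computed in the PR: junction minus the two anchor codons."""
    return FWR3_END + junction_len - 6


def cdr3_end_review(junction_len: int) -> int:
    """CDR3 end as suggested in the review comment above."""
    return FWR3_END + max(junction_len, 13 * 3)


for junction_len in (30, 45, 60):
    print(junction_len, cdr3_end_pr(junction_len), cdr3_end_review(junction_len))
```

The two variants disagree for every junction length, so comparing them against a handful of annotated chains should settle which one matches the data.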

MKanetscheider and others added 3 commits August 9, 2024 16:54

- 9f8558f Added mutational_load function to calculate differences between sequence and germline alignment
- 4755736 [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
- 4b020b6 Merge branch 'scverse:main' into mutational_load

grst reviewed Aug 13, 2024

MKanetscheider and others added 5 commits August 15, 2024 12:22

- 56a8594 Rewrote mutational_load function based on previous feedback and added function to api.rst
- c599a39 [pre-commit.ci] auto fixes from pre-commit.com hooks
- 5c7c92c Fixed an issue with pre-commit
- c84708c Merge branch 'mutational_load' of https://github.com/MKanetscheider/scirpy into mutational_load
- 12ada2f [pre-commit.ci] auto fixes from pre-commit.com hooks

grst reviewed Aug 15, 2024

MKanetscheider and others added 2 commits August 18, 2024 12:18

- a416e06 Further optimized mutational_load function and formating of docstring documentation
- c0e795c [pre-commit.ci] auto fixes from pre-commit.com hooks

grst reviewed Aug 19, 2024

MKanetscheider and others added 7 commits August 20, 2024 09:18

- 9793062 Fixed small issues with the code layout as suggested by grst
- d12f7b5 Merge branch 'scverse:main' into mutational_load
- 906df48 Merge branch 'mutational_load' of https://github.com/MKanetscheider/scirpy into mutational_load
- 196177e Added a first beta-test case, which revealed some bugs that were also fixed
- 11101a5 [pre-commit.ci] auto fixes from pre-commit.com hooks
- cf53b72 Specified 'except' condition
- 9c6a56c Merge branch 'mutational_load' of https://github.com/MKanetscheider/scirpy into mutational_load

MKanetscheider and others added 2 commits August 30, 2024 20:17

- e5c4d76 Merge branch 'main' into mutational_load
- ea80d69 [pre-commit.ci] auto fixes from pre-commit.com hooks

MKanetscheider and others added 2 commits October 15, 2024 10:28

- e662e36 Merge branch 'main' into mutational_load
- ae9563b Add notebook section about somatic hypermutation

grst mentioned this pull request Oct 17, 2024

BCR tutorial #542

Merged

6 tasks

grst added 2 commits November 20, 2024 21:06

- 4ad7d59 Merge branch 'main' into mutational_load
- 8abcd01 Update SHM description text in tutorial

grst reviewed Nov 20, 2024

Merge branch 'main' into mutational_load

42f79da

grst reviewed Nov 24, 2024
