Dealing with alleles and multiple gene calls #561

grst · 2024-10-08T19:51:47Z

I want to report some kind of bug/issue that I recognized during my work in Scirpy and has to do with scirpy.tl.define_clonotype cluster (but likely also with scirpy.tl.define_clonotype, although this was not tested). Scirpy considers any string inside v_call (same problem with j_call) as a unique V-gene assignment, and this is perfectly fine for working with Cell Ranger annotation. However, if we are working with re-annotated data, which is done with IgBlastn or IMGT/Highv-quest this is not true any more. First, the annotation contains alleles, which are depicted like this IGHV3-33*001. The problem is that if another cell would have IGHV3-33*002 these two cells would always be separated by Scirpy (if v_gene = True), because Scirpy thinks that those are totally different genes, although they only differ in their alleles, even if everything else even the junction sequence can be quite similar.
Another issue with using IgBlastn or IMGT/Highv-quest re-annotation is that they often leave multiple possible gene assignments into the v_call column if they are all similar likely.
What I did so far with my datasets was that I manually manipulated the v_call and j_call column prior to loading the datatset into an AnnData object so that they only contain the first v/j_call and without the allele information.

My idea here would be to adapt the clonotype cluster function so that it automatically ignores multiple v_call's/j_call's i.e. only considers the first one and also ignores the allele information for clustering, but doesn't manipulate the call itself. Immcantation has a own parameter on how to work with multiple calls for a gene (see "parameter first= FALSE": https://scoper.readthedocs.io/en/stable/topics/hierarchicalClones/).
Actually I encountered this problem already some time ago and discussed it with @felixpetschko but eventually we both forgot about it until now. Either way, I think it's good if @grst can also have a look on this problem and help with a solution, because if I remeber correctly it's not that trivial to "fix" this. Maybe there is some elegant workaround available?

Originally posted by @MKanetscheider in #542 (comment)

grst · 2024-10-08T19:52:24Z

@MKanetscheider, I made this a separate issue.

grst · 2024-10-08T20:01:31Z

@zktuong, would be curious if and how you have dealt with this before?

From a technical perspective I could see two solutions:

Simple workaround: Provide a "preprocessing" function that maniuplates the v/j_call string to strip away the allele information and only keep the first gene (what @MKanetscheider has been doing manually so far)
The define_clonotype_cluster function already uses a distance matrix to deterime distance between genes. For now, that's a boring identity matrix. This matrix could be replaced with something like

	IGHV3-33*002	IGHV3-33*001	IGHV27	...
IGHV3-33*002	1	1	0	0
IGHV3-33*001	1	1	0	0
IGHV27	0	0	1	0
...	0	0	0	1

zktuong · 2024-10-08T22:44:24Z

yea i just only do:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1660C13-L1660C79
V = [re.sub(STRIPALLELENUM, "", str(v)) for v in input_vdj[VCALL]]

but i don't select the first one:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1669C4-L1669C55
V = [",".join(list(set(v.split(",")))) for v in V]

can be easily changed to just v.split(",")[0] in my case

grst · 2024-10-11T11:10:54Z

I think the matrix solution would be quite elegant, as it would also allow us to deal with the case of multiple genes, e.g. to ensure IGHx,IGHy matches both IGHx and IGHy we could have

	IGHx	IGHy	IGHx,IGHy
IGHx	1	0	1
IGHy	0	1	1
IGHx,IGHy	1	1	1

Basically all we'd need is a function that builds such a matrix from the gene names in plug it into the clonotype definition module.

@felixpetschko, do you think you could look into that?

CC @FFinotello

felixpetschko · 2024-11-05T10:17:47Z

Hi @grst, I just took a look at my improved implementation of the v/j-gene matching and I think the matrix solution you suggested should work fine. If this implementation is not that urgent I would actually tackle this problem in a few weeks when I completed my thesis if that's fine.

grst · 2024-11-05T10:21:54Z

sounds good, thanks!

grst added this to scirpy-dev Oct 8, 2024

github-project-automation bot moved this to ToDo in scirpy-dev Oct 8, 2024

grst assigned felixpetschko Nov 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dealing with alleles and multiple gene calls #561

Dealing with alleles and multiple gene calls #561

grst commented Oct 8, 2024 •

edited

Loading

grst commented Oct 8, 2024

grst commented Oct 8, 2024

zktuong commented Oct 8, 2024 •

edited

Loading

grst commented Oct 11, 2024 •

edited

Loading

felixpetschko commented Nov 5, 2024 •

edited

Loading

grst commented Nov 5, 2024

Dealing with alleles and multiple gene calls #561

Dealing with alleles and multiple gene calls #561

Comments

grst commented Oct 8, 2024 • edited Loading

grst commented Oct 8, 2024

grst commented Oct 8, 2024

zktuong commented Oct 8, 2024 • edited Loading

grst commented Oct 11, 2024 • edited Loading

felixpetschko commented Nov 5, 2024 • edited Loading

grst commented Nov 5, 2024

grst commented Oct 8, 2024 •

edited

Loading

zktuong commented Oct 8, 2024 •

edited

Loading

grst commented Oct 11, 2024 •

edited

Loading

felixpetschko commented Nov 5, 2024 •

edited

Loading