Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with alleles and multiple gene calls #561

Open
grst opened this issue Oct 8, 2024 · 6 comments
Open

Dealing with alleles and multiple gene calls #561

grst opened this issue Oct 8, 2024 · 6 comments
Assignees

Comments

@grst
Copy link
Collaborator

grst commented Oct 8, 2024

I want to report some kind of bug/issue that I recognized during my work in Scirpy and has to do with scirpy.tl.define_clonotype cluster (but likely also with scirpy.tl.define_clonotype, although this was not tested). Scirpy considers any string inside v_call (same problem with j_call) as a unique V-gene assignment, and this is perfectly fine for working with Cell Ranger annotation. However, if we are working with re-annotated data, which is done with IgBlastn or IMGT/Highv-quest this is not true any more. First, the annotation contains alleles, which are depicted like this IGHV3-33*001. The problem is that if another cell would have IGHV3-33*002 these two cells would always be separated by Scirpy (if v_gene = True), because Scirpy thinks that those are totally different genes, although they only differ in their alleles, even if everything else even the junction sequence can be quite similar.
Another issue with using IgBlastn or IMGT/Highv-quest re-annotation is that they often leave multiple possible gene assignments into the v_call column if they are all similar likely.
What I did so far with my datasets was that I manually manipulated the v_call and j_call column prior to loading the datatset into an AnnData object so that they only contain the first v/j_call and without the allele information.

My idea here would be to adapt the clonotype cluster function so that it automatically ignores multiple v_call's/j_call's i.e. only considers the first one and also ignores the allele information for clustering, but doesn't manipulate the call itself. Immcantation has a own parameter on how to work with multiple calls for a gene (see "parameter first= FALSE": https://scoper.readthedocs.io/en/stable/topics/hierarchicalClones/).
Actually I encountered this problem already some time ago and discussed it with @felixpetschko but eventually we both forgot about it until now. Either way, I think it's good if @grst can also have a look on this problem and help with a solution, because if I remeber correctly it's not that trivial to "fix" this. Maybe there is some elegant workaround available?

Originally posted by @MKanetscheider in #542 (comment)

@grst
Copy link
Collaborator Author

grst commented Oct 8, 2024

@MKanetscheider, I made this a separate issue.

@grst
Copy link
Collaborator Author

grst commented Oct 8, 2024

@zktuong, would be curious if and how you have dealt with this before?

From a technical perspective I could see two solutions:

  1. Simple workaround: Provide a "preprocessing" function that maniuplates the v/j_call string to strip away the allele information and only keep the first gene (what @MKanetscheider has been doing manually so far)
  2. The define_clonotype_cluster function already uses a distance matrix to deterime distance between genes. For now, that's a boring identity matrix. This matrix could be replaced with something like
IGHV3-33*002 IGHV3-33*001 IGHV27 ...
IGHV3-33*002 1 1 0 0
IGHV3-33*001 1 1 0 0
IGHV27 0 0 1 0
... 0 0 0 1

@zktuong
Copy link
Contributor

zktuong commented Oct 8, 2024

yea i just only do:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1660C13-L1660C79
V = [re.sub(STRIPALLELENUM, "", str(v)) for v in input_vdj[VCALL]]

but i don't select the first one:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1669C4-L1669C55
V = [",".join(list(set(v.split(",")))) for v in V]

can be easily changed to just v.split(",")[0] in my case

@grst
Copy link
Collaborator Author

grst commented Oct 11, 2024

I think the matrix solution would be quite elegant, as it would also allow us to deal with the case of multiple genes, e.g. to ensure IGHx,IGHy matches both IGHx and IGHy we could have

IGHx IGHy IGHx,IGHy
IGHx 1 0 1
IGHy 0 1 1
IGHx,IGHy 1 1 1

Basically all we'd need is a function that builds such a matrix from the gene names in plug it into the clonotype definition module.

@felixpetschko, do you think you could look into that?

CC @FFinotello

@felixpetschko
Copy link
Collaborator

felixpetschko commented Nov 5, 2024

Hi @grst, I just took a look at my improved implementation of the v/j-gene matching and I think the matrix solution you suggested should work fine. If this implementation is not that urgent I would actually tackle this problem in a few weeks when I completed my thesis if that's fine.

@grst
Copy link
Collaborator Author

grst commented Nov 5, 2024

sounds good, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ToDo
Development

No branches or pull requests

3 participants