Questions in alignment results #32
Dear Adam, About (1):
About (2): Tell me if some points remain unclear and I will try to edit and clarify my answer! Best,
Dear Quentin,

Thank you very much. The response is clear. I have started running IGoR -align on my data, and ran some tests on the demo for -infer and -evaluate -output, as well as pygor. I ran into four further questions about IGoR model inference and evaluation.

1 (infer). According to the documentation webpage, the -chain option can be set to alpha, beta, light, heavy_naive, or heavy_memory; however, under share/igor/models/human there are only three folders: tcr_alpha, tcr_beta, and bcr_heavy. So the questions are:
(1) Where is the model for the light chain, or is it unfinished?
(2) Is there any difference between heavy_naive and heavy_memory? Do they both use the bcr_heavy folder, with the differences hard-coded in igor?
(3) By the way, what are the default values of --L_thresh and --P_ratio_thresh?

2 (evaluate).
(1) Why do -evaluate and -output need another round of iteration?
(2) Do you have any detailed suggestions on how to evaluate the model fit after -infer is done, e.g. which measure's distribution to look at? Using pygor?
(3) How do I know that the iteration has converged, and how accurate (e.g. standard error) is each model parameter?
(4) When running -evaluate after -infer, should I point -set_custom_model to the final_parms.txt and final_marginals.txt under the _inference folder?

3 (pygor). I tested pygor, which can read the best_scenarios output file as well as the model files and store them in container classes.
(1) Does pygor also provide functions for probabilistic calculations based on the input IGoR model, for example calculating the generative probability of a given scenario?
(2) By the way, in the inference_logs.txt file (which has little documentation?), does the seq_likelihood column mean (something similar to) Pgen, and does the seq_best_scenario column mean the generative probability of the best scenario?

4 (coverage). I tested -output --coverage (which also has little documentation?).
(1) Does VD_genes mean V_gene and D_gene, VDJ_genes mean V_gene, D_gene and J_gene, etc.?
(2) When I tried to run --coverage on the demo foo, igor ran through for V_gene and J_gene, but reported an error for D_gene (or VD_genes, VDJ_genes, etc.); below are the commands and output.
Command:
Output:
Command:
Output:
Command:
Or
Output:
Thank you very much for your kind attention. Looking forward to your reply. Best wishes,
Dear Adam, About the
|
Dear Quentin,

Thank you very much for the response. I successfully ran IGoR on mouse IgH last month, with some tricks to bypass a few issues, and the Pgen results looked quite good and reasonable.

One issue (on computational performance) is the memory usage when running IGoR -align or -infer. I run IGoR on my MacBook with 32 GB of physical memory and 12 CPU cores, and the input is about 3M (3 million) reads in total. When running IGoR -align, I found that V alignment could use all 12 CPU cores and more memory than the physical limit, since macOS can probably move the excess automatically to disk or elsewhere and label it "compressed memory". However, D alignment could only use 1~3 CPU cores, and asked for more and more memory as it processed more reads. Once the memory usage exceeded the physical limit, progress slowed down dramatically. I believe the alignment should be independent for each input read, so I split the input into several parts to lower the memory usage and thus speed up the IGoR -align step.

Similarly, when running IGoR -infer on the whole 3M input reads, IGoR asked for much more memory than the physical limit and got stuck before showing the progress bar. I am not sure why the EM algorithm, which traverses each input sequence iteratively, requests more memory for more input reads; its space complexity appears to be linear in the input sample size. To get through IGoR -infer, I had to split the 3M input reads into 15 parts and manually run IGoR -infer on each part (0.2M reads) in turn, using the final model of the previous part as the initial model of the next part, and iterating several times over all 15 parts. I finally chose four models after several iterations, and found that the Pgen predicted by IGoR -evaluate -output looked stable in magnitude and also matched our expectation. Do you think this splitting trick is applicable and acceptable?
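For readers who want to reproduce the splitting trick described above, here is a minimal Python sketch of the bookkeeping only. It does not call IGoR: `run_infer` is a hypothetical callback standing in for whatever wrapper runs IGoR -infer on one part with a given starting model and returns the resulting model, and the function names are illustrative, not part of IGoR or pygor. Whether this chained procedure is equivalent to a single EM run over all reads is exactly the open question in the post.

```python
def split_into_parts(items, n_parts):
    """Split a list of reads into n_parts contiguous, near-equal parts."""
    size, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional item.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def chained_inference(parts, initial_model, run_infer, n_passes=1):
    """Run inference part by part, feeding each final model to the next run.

    run_infer(part, model) is a hypothetical user-supplied function that
    runs one inference on `part` starting from `model` and returns the
    resulting model.
    """
    model = initial_model
    for _ in range(n_passes):
        for part in parts:
            model = run_infer(part, model)
    return model
```

With 3M reads and `n_parts=15`, each part holds 0.2M reads, matching the setup in the post.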
Another issue concerns using only the CDR3 region as input. I tested it with the --ntCDR3 option in the IGoR -align step. However, the alignment results often miss the true answer, which can be recovered by using --thresh 0 instead of --ntCDR3, although that is slower. On the other hand, for some input sequences, even when the alignment files contain the true V, D, J genes, IGoR -evaluate -output may not find any (best) scenario and leaves Pgen=NaN. I have not figured out the reason or a solution for this problem.

By the way, is there any way to output P_read as well as Pgen for each input sequence, and to output P_recomb and P_error for each (best) scenario?

Thank you very much for your kind attention. Best wishes,
Dear Adam,
I'm not sure why the V alignment blew up your memory; this is rather weird. I'll check and keep you posted. About D alignments: I think this is simply because D alignment is very fast, so the different threads must compete to write the alignments for each sequence to the hard drive. If your hard drive is not very fast, this is most likely the limiting step, and the threads spend a lot of time waiting for disk writes. Again, I'm not sure why the memory usage is blowing up; I will check.
Yes, this is indeed an issue. IGoR requests more memory for more input reads simply because it loads (and filters) all alignments before starting EM and stores them in memory. To decrease this memory usage you can alter your alignment filter thresholds (this will also increase your computation speed, though possibly at the expense of accuracy).
Ah yes, this had already been brought to my attention in issue #7. I thought I had fixed this problem; maybe I overlooked something. I will check.
P_read is the sequence likelihood (in the Best wishes, |
Hi Quentin @qmarcou,
I am a new bioinformatics postdoc in Prof. Frederick Alt's lab. I have read your 2018 Nature Communications paper introducing the fantastic tool IGoR, and I am now trying to apply IGoR to our mouse data. However, I have some questions about IGoR, starting with the align step.
(1) I have run the IGoR demo, written a script to display the alignments, and compared them with IgBLAST results. I made two strange observations: 1) The D and J alignments often have a one-bp mismatch at the 5' end, which IgBLAST drops. 2) Although I saw no mismatch in the displayed alignment, IGoR -align reported many mismatch indexes.
For example, the third read sequence in the demo (seq_index = 2) is TCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGACCCGGGTACAACGACGAGCA, and IGoR top alignments include:
With my scripts, I displayed the alignment, as shown below
On the other hand, IgBLAST result is
I looked at some other examples and saw the same one-bp mismatch at the 5' end of D or J in the IGoR align results, but not in the IgBLAST results, and I do not know why. Also, I am not sure why IGoR reports more mismatch indexes than are actually displayed (maybe they lie outside the alignment region?).
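The "outside the alignment region" guess can be illustrated with a small Python sketch. It assumes, purely for illustration, that an alignment is described by an offset giving the read index at which the first nucleotide of the full gene template would sit (possibly negative), and that mismatches are counted over the whole overlap between template and read. This convention is an assumption made for the sketch, not a statement of how IGoR defines its output; the function names are hypothetical.

```python
def mismatch_positions(read, gene, offset):
    """Read-coordinate indexes where gene and read differ over their overlap.

    offset is the (assumed) read index where gene[0] would sit; it may be
    negative when the gene starts before the read does.
    """
    mismatches = []
    for g_pos, g_nt in enumerate(gene):
        r_pos = offset + g_pos
        # Only compare positions that fall inside the read.
        if 0 <= r_pos < len(read) and read[r_pos] != g_nt:
            mismatches.append(r_pos)
    return mismatches


def mismatches_in_window(mismatches, start, end):
    """Keep only the mismatches inside the displayed region [start, end)."""
    return [m for m in mismatches if start <= m < end]


# Toy example: the template overlaps read positions 3..11,
# but only positions 3..7 are displayed as "the alignment".
all_mm = mismatch_positions("AAACCCGGGTTT", "CCCGGATTA", 3)
shown_mm = mismatches_in_window(all_mm, 3, 8)
```

In this toy case `all_mm` is non-empty while `shown_mm` is empty, which is the kind of discrepancy described above: no mismatch visible in the displayed region, yet several mismatch indexes reported.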
(2) In order to run IGoR -align, I am preparing the input files. Along with the three V, D, J sequence fasta files, I saw V_gene_CDR3_anchors.csv and J_gene_CDR3_anchors.csv, which contain an anchor_index for most V and J segments. I examined the IMGT annotation (e.g. http://www.imgt.org/ligmdb/view?id=U66059) and guessed that the anchor_index for Vs is the start position of 2nd-CYS, and the anchor_index for Js is the start position of J-MOTIF (both 0-based). Is this guess correct?
I read the IGoR documentation webpage, which says "The index should correspond to the first letter of the cysteine (for V) or tryptophan/phenylalanin (for J) for the nucleotide sequence of the gene.", and "If the considered sequences are nucleotide CDR3 sequences (delimited by its anchors on 3' and 5' sides) using the command --ntCDR3 alignments will be performed using gene anchors information as offset bounds.". So if I do not use --ntCDR3, is it necessary to provide the anchor_index for V and J?
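One way to sanity-check a guessed anchor_index against the documentation sentence quoted above is to read the codon starting at that index and confirm it encodes cysteine (for V) or tryptophan/phenylalanine (for J). The Python sketch below assumes 0-based indexing, as guessed in the question, and uses the standard genetic code; the function names and toy sequences are illustrative and not taken from IGoR or IMGT.

```python
# Standard genetic code codons for the anchor residues.
CYS_CODONS = {"TGT", "TGC"}          # cysteine (V anchor)
TRP_PHE_CODONS = {"TGG", "TTT", "TTC"}  # tryptophan / phenylalanine (J anchor)


def codon_at(seq, anchor_index):
    """Return the 3-nt codon starting at a 0-based anchor_index."""
    return seq[anchor_index:anchor_index + 3].upper()


def anchor_is_plausible(seq, anchor_index, gene_type):
    """Check that the codon at anchor_index matches the expected residue."""
    codon = codon_at(seq, anchor_index)
    if gene_type == "V":
        return codon in CYS_CODONS
    if gene_type == "J":
        return codon in TRP_PHE_CODONS
    raise ValueError("gene_type must be 'V' or 'J'")


# Toy sequences (not real germline genes).
v_ok = anchor_is_plausible("GCCTGTGCC", 3, "V")   # codon TGT
j_ok = anchor_is_plausible("AATGGGGC", 2, "J")    # codon TGG
```

A check like this only confirms that the index points at a codon of the right type; it cannot tell a conserved anchor from another cysteine or phenylalanine codon elsewhere in the gene.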
Thank you very much for your kind attention. Looking forward to your reply.
Best regards,
Adam Yongxin Ye