Crash during alignment stage after 3.1 update #57

Open
anearman opened this issue Nov 20, 2024 · 9 comments
Comments

@anearman

Hello! I was running 3.0 almost successfully yesterday but it crashed during gnomon training so I updated to 3.1. Today it appears to not make it through the alignment stages with the following error:

ERROR ~ Error executing process > 'egapx:rnaseq_short_plane:star:run_star (4)'

Caused by:
Process egapx:rnaseq_short_plane:star:run_star (4) terminated with an error exit status (3)

I'm attempting to annotate a small trypanosomatid genome (~34 MB) with a proteome and ample RNAseq data. I was able to run the example files with no problems on 3.0, though I haven't checked them on 3.1. Attached are the various log files.

Thank you!

issue.zip

@victzh

victzh commented Nov 20, 2024

I tried to replicate the issue and found that we don't have proteins for the taxonomy branch of your sequence. I see that you supplied the proteins yourself, but as far as I can tell that does not work as reliably as when the taxonomy branch is covered by our own protein sets. If you can provide me with a link to download your proteins, I can try to replicate it again.

Thanks,

Victor.

@anearman
Author

Hi Victor,

Thanks for getting back! Attached is the proteome I used. Originally I had a collection of UniProt formatted proteins for all Trypanosomatids (~500MB) but that proved to be too much, so I restricted it to a functional annotation I previously performed for this species.

LpasUniProt.fasta.gz

Side note: these organisms don't have introns. Is there a way to account for this in the annotation process?

@victzh

victzh commented Nov 21, 2024

Thanks, I will try it again with your protein data. About introns: I don't know, but there should be a way, since we annotate many different kinds of organisms. I will ask my colleagues.

@murphyte

Protists, including trypanosomes, are currently out of scope for EGAPx, as stated on the home page. It's not just a matter of the protein sets -- we need to do additional development to adequately support protists and fungi. It's on our roadmap, but it'll likely be a while before we are ready to support trypanosomes.

That doesn't explain the run_star failure. We do have logic to automate selection of max_intron size, and that logic is not set up for Trypanosomes, so it may be picking an unusual value that might cause failures.

@victzh

victzh commented Nov 21, 2024

I did a run with your proteins and NCBI's version of the sequence and SRA reads. It ran through STAR successfully, and even the memory requirements for it were not extreme. It failed later for me, with a message suggesting that the sequence has too many similarities to prokaryotes, so it is probably contaminated.

But anyway, as already mentioned, we don't support this taxonomy branch yet, so even if it runs successfully after cleaning the sequence with our other product, FCS (Foreign Contamination Screen), the results are not going to be valid.

@anearman
Author

Yeah, I did see the lack of support for trypanosomes but wanted to see if I could sneak it through anyway. Perhaps this is why I was seeing a slightly different failure in v3.0 during gnomon training. I have a full annotation for this species already, but I'm having a hard time getting it to meet table2asn standards.

Mostly this was an easy (hopefully) test run before trying to push through several much larger genomes with our consortium project. We'll likely have to run most of those on the HPC, but it would be nice to be able to do some of the smaller ones locally.

I can rerun the example files to see if the run_star failure persists for v3.1 and report back if you think that will be helpful.

@anearman
Author

> I did a run with your proteins and NCBI's version of the sequence and SRA reads. It ran through STAR successfully, and even the memory requirements for it were not extreme. It failed later for me, with a message suggesting that the sequence has too many similarities to prokaryotes, so it is probably contaminated.
>
> But anyway, as already mentioned, we don't support this taxonomy branch yet, so even if it runs successfully after cleaning the sequence with our other product, FCS (Foreign Contamination Screen), the results are not going to be valid.

Thanks, Victor! Any thoughts as to why I'm having the run_star failure with 3.1 and did not have it with 3.0?

@victzh

victzh commented Nov 21, 2024

It may be an accidental fault in STAR. An error of this kind can happen if STAR fails and samtools can't read a full data chunk. On the other hand, the task should be retried, and if it is just a fluke it should complete. I don't see these retries in your run.trace.txt file. Can you send me the config file you have for Singularity, please? And what are the specs of the machine you're running it on (CPUs, RAM)?
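One way to check for retries is to scan the trace file for tasks that failed with no further attempt recorded. The following is a minimal sketch, not part of EGAPx. It assumes run.trace.txt is a standard tab-separated Nextflow trace with `name` and `status` columns, and that each attempt of a task is written as its own row under the same task name. The function names `summarize_trace` and `failed_without_retry` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize_trace(trace_text):
    """Group trace rows by task name and collect the status of every row.

    Assumes a tab-separated trace with a header line that includes
    'name' and 'status' columns (an assumption about the file layout).
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    statuses = defaultdict(list)
    for row in reader:
        statuses[row["name"]].append(row["status"])
    return dict(statuses)


def failed_without_retry(statuses):
    """Return task names whose only recorded row is a FAILED one.

    Assumes a retried task would show up as more than one row
    under the same name.
    """
    return sorted(
        name
        for name, seen in statuses.items()
        if "FAILED" in seen and len(seen) == 1
    )
```

For example, `failed_without_retry(summarize_trace(open("run.trace.txt").read()))` would list the tasks that failed once and were never attempted again, which is the pattern described above.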

@anearman
Author

If I remember correctly, I tried to continue after the first time it failed, then it failed again, so I deleted everything in the working directory, deleted the project directory, and started everything fresh after a restart and general update check.

I set the docker config to 31 CPUs and 120 GB RAM and then set a 20 GB swap. When running on 3.0, there seemed to be no problems until the end, after completing ~480 tasks. The only thing that might be strange is that I have Nextflow installed as a mamba environment stacked on the Python environment for egapx, but it seemed to work fine until 3.1 was installed. The only other thing I found was that my samtools version was slightly out of date, so I just updated it.

I didn't change anything in the Singularity config file so it just says:
singularity.enabled = true

I did edit the docker config file to:
docker.enabled = true
process {
    memory = 120.GB
    cpus = 31
}
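
Since the docker config above sets memory and cpus as defaults in the process scope, it can help to compare those values against what the host actually provides. This is a rough sketch under stated assumptions: it is not part of EGAPx, `host_resources` relies on `os.sysconf` names that are available on Linux, and both function names are hypothetical.

```python
import os


def host_resources():
    """Return (logical CPU count, physical RAM in GiB) for the current host.

    Assumes a Linux host where SC_PAGE_SIZE and SC_PHYS_PAGES are defined.
    """
    cpus = os.cpu_count() or 1
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, ram_bytes / 1024**3


def exceeds_host(requested_cpus, requested_gb, host_cpus, host_gb):
    """List which requested limits are larger than what the host offers."""
    problems = []
    if requested_cpus > host_cpus:
        problems.append("cpus")
    if requested_gb > host_gb:
        problems.append("memory")
    return problems
```

Calling `exceeds_host(31, 120, *host_resources())` returns an empty list when the machine can satisfy both requested values, and names the offending setting otherwise.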
