
gtdb_ver95_alllca_taxid.csv.tar.gz #1

Open
bheimbu opened this issue Aug 30, 2023 · 16 comments
@bheimbu

bheimbu commented Aug 30, 2023

Hi,

I'm wondering where gtdb_ver95_alllca_taxid.csv.tar.gz comes from. Did you write it yourself or download it from somewhere? I'm using your pipeline to analyze the microbiome of Australian termites, but I want to use GTDB ver202 or later.

Additionally, I'd like to know where the gtf file ("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/AIM2/tpm_functional_annotation/functional_annotation/all_functions_all_taxonomy/gtf_files_Dec2019/named-gtffiles/filename-230-13-prokka.map.gtf") comes from, please. It is referenced in hpc_tpmcal.md. Did you use this code snippet to create it?

Also, it is not clear to me where this file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/all-samples-prokTPM.txt) from here comes from.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 4, 2023

Hi @bheimbu ,

Thank you for your interest in the scripts!

  1. The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but the GTDB team has since published code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump
  2. The gtf files were generated from PROKKA. You found the right code snippet for the conversion.
  3. The file all-samples-prokTPM.txt was generated with the SALMON software: I ran SALMON on all files and concatenated the outputs to produce this file.
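The concatenation step in point 3 could look roughly like the sketch below. This is an assumption, not the original script: the directory layout (`salmon_out/<sample>/quant.sf`) and the demo input files are invented to make the idea runnable.

```shell
# Hypothetical sketch: merge per-sample Salmon quant.sf files into one table,
# tagging each row with its sample name. Two tiny demo quant.sf files stand in
# for the real Salmon outputs; paths and names are assumptions.
mkdir -p salmon_out/S1 salmon_out/S2
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\nBEC328_contig1:440-1066\t627\t388.472\t75.633197\t9.000\n' > salmon_out/S1/quant.sf
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\nBEC328_contig2:250-951\t702\t463.470\t147.919989\t21.000\n' > salmon_out/S2/quant.sf

out=all-samples-prokTPM.txt
: > "$out"
for qf in salmon_out/*/quant.sf; do
    sample=$(basename "$(dirname "$qf")")   # sample name = directory name
    # drop each file's header line, prepend the sample name as a new column
    tail -n +2 "$qf" | awk -v s="$sample" 'BEGIN{OFS="\t"}{print s,$0}' >> "$out"
done
```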

Good luck with your analysis!
Let me know if you have any other doubts!

@bheimbu
Author

bheimbu commented Sep 4, 2023

Hi @Jigyasa3 ,

thanks for getting back to me. I'll see how far I can get. Actually, I'm trying to implement your pipeline as a Snakemake workflow to make it more reproducible, so I may have some more questions in the future -- just to let you know.

Cheers Bastian

@bheimbu
Author

bheimbu commented Sep 5, 2023

Hi @Jigyasa3,

could you please clarify on:

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but the GTDB team has since published code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump

I cannot find the mentioned code on that page.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 5, 2023

Hi @bheimbu ,

The file gtdb_ver95_alllca_taxid.csv.tar.gz is essentially a taxdump for a specific version of GTDB. The GitHub page I linked lets you create a taxdump file for any version of the GTDB database. I haven't used it yet; I found it recently and was pleased to see that the GTDB team has streamlined the process of using the database for DIAMOND/BLAST analysis.

They give details of the method in their README file. I recommend asking them directly as I haven't used it myself.

@bheimbu
Author

bheimbu commented Sep 6, 2023

Hi @Jigyasa3,

I'm really sorry to bother you, but when I use the code from https://github.com/shenwei356/gtdb-taxdump, I get the following files: delnodes.dmp, merged.dmp, names.dmp, nodes.dmp, and taxid.map. None of these comes close to your gtdb_ver95_alllca_taxid.csv.tar.gz.

Is there a script or some line of code that you could share with me?

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 6, 2023

Hi @bheimbu ,

I did a Google search for you. Here are some suggestions.

  1. To incorporate the taxdump files from GTDB into DIAMOND, check this link: https://www.biostars.org/p/412823/ and the DIAMOND manual: https://gensoft.pasteur.fr/docs/diamond/2.0.4/3_Command_line_options.html
  2. The GTDB equivalent of the files required as input to DIAMOND: https://github.com/shenwei356/gtdb-taxdump/issues/6
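For point 2, gtdb-taxdump's taxid.map ("accession<TAB>taxid") is not quite the layout DIAMOND's --taxonmap option expects (the NCBI prot.accession2taxid format: accession, accession.version, taxid, gi, with a header line). A hedged awk reshaping is sketched below; the demo input is made up, and whether your fasta headers actually carry these accessions depends on how the database was built.

```shell
# Demo taxid.map (invented accessions/taxids, tab-separated)
printf 'GCF_000005845.2\t12345\nGCF_000006945.1\t67890\n' > taxid.map

# Reshape into the prot.accession2taxid layout DIAMOND reads:
# strip the version suffix for column 1, keep the full accession in column 2,
# and fill the unused gi column with 0.
awk 'BEGIN{FS=OFS="\t"; print "accession","accession.version","taxid","gi"}
     {split($1,a,"."); print a[1], $1, $2, 0}' taxid.map > prot.accession2taxid

# Then (not run here) build a taxonomy-aware DIAMOND database:
# diamond makedb --in proteins.faa -d gtdb --taxonmap prot.accession2taxid \
#     --taxonnodes nodes.dmp --taxonnames names.dmp
```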

@bheimbu
Author

bheimbu commented Sep 7, 2023

Hi,

thanks for all your help again. Aside from the fact that the links do not work, I'm wondering why you cannot tell me how you created gtdb_ver95_alllca_taxid.csv.tar.gz, since you say

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written.

Anyway, if you don't want to share this information with me, I have to respect that.

I would have some more questions related to your pipeline:

  1. Prokka outputs *fna and *faa files, but they don't have the same fasta headers, right? So once you use fetchMGs to extract COGs using the Prokka files (*faa and *fna) as input, you only get COG protein sequences, but no protein-coding nucleotide sequences (see this related post). So how did you do it?

  2. Anyway, I'm a bit confused because these files

while read line; do while read cogs; do cp ${line}/${cogs}*fna allfetchm_nucoutput/${line}-${cogs}.fna; done < allcogs.txt; done < filesnames.txt

don't appear again anywhere in your pipeline, so are they really important?

I'm really sorry to bother you with all these questions, but I just want to get things right.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 7, 2023

Hey @bheimbu ,

  1. Sorry, the only reason I am redirecting you to other resources for creating the gtdb_ver95_alllca_taxid.csv.tar.gz file is that I have already left the university and no longer have access to my university's cluster to check old scripts. From what I remember, I joined the metadata file from GTDB with the taxdump files to create gtdb_ver95_alllca_taxid.csv.tar.gz; it essentially adds an LCA taxonomy to each taxid. By the way, the links do work, but you will have to copy and paste them. Somehow clicking on a link redirects you to the issues page of this repository.

  2. Prokka appends _1 to the end of each protein fasta header, so the first part of the header is common between the protein and nucleotide headers. I just matched that first part. To verify that I was matching the correct nucleotide headers, I a) manually compared the annotations of some nucleotide sequences and their corresponding protein sequences, and b) used the EMBOSS online tool to translate some nucleotide sequences to proteins, which should be 100% identical to the original protein sequences.

  3. Yes, you are right: the filenames.txt file created from this while loop is not used again. It was just to keep track of how many files I was working with.
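The header-matching idea in point 2 could be sketched like this. The IDs below are invented for illustration (they are not real Prokka output), and the exact suffix pattern is an assumption based on the description above.

```shell
# Hedged sketch: strip the trailing _N suffix from protein IDs and match on
# the shared prefix against nucleotide IDs. Demo IDs are made up.
printf 'contig1_1\ncontig2_1\n' > protein_ids.txt
printf 'contig1\ncontig2\ncontig3\n' > nucleotide_ids.txt

sed 's/_[0-9]*$//' protein_ids.txt | sort > protein_prefixes.txt
sort nucleotide_ids.txt > nuc_sorted.txt

# comm -12 keeps only IDs present in both sorted lists
comm -12 protein_prefixes.txt nuc_sorted.txt > matched_ids.txt
```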

Let me know if you need anything!

@bheimbu
Author

bheimbu commented Sep 8, 2023

Hi,

thanks for the clarification; I didn't know you had left OIST. I'll have a second look at the links you provided.

I'll see what I can do about gtdb_ver95_alllca_taxid.csv.tar.gz.

There are certainly more questions coming, but so far so good ;)

Have a nice weekend,

Bastian

@bheimbu
Author

bheimbu commented Oct 2, 2023

Hi @Jigyasa3,

to be honest, I'm stuck. Right now I'm trying to combine all my files as in combiningallfiles.md, but I'm failing at the first line.

My salmon quant files look like this:

Name	Length	EffectiveLength	TPM	NumReads
BEC328_contig1:440-1066	627	388.472	75.633197	9.000
BEC328_contig2:250-951	702	463.470	147.919989	21.000
BEC328_contig3:214-567	354	131.388	198.776314	8.000
BEC328_contig6:26-460	435	200.515	227.934551	14.000
BEC328_contig7:281-601	321	106.969	152.595815	5.000
BEC328_contig9:578-1465	888	649.470	60.318628	12.000
BEC328_contig10:7-867	861	622.470	73.424146	14.000

So tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_") is not even possible, because there are no file_name and gene_name columns. Sometimes I really wonder whether we are using the same software versions.

What do your fullproteinnames actually look like -- I'm just curious?!

Cheers Bastian

PS: This file also makes me wonder: cogs<-read.csv("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/allcogs-allsamples-finalkrakenoutput.csv"), since you mentioned before that DIAMOND, not kraken2, was actually used.

@Jigyasa3
Collaborator

Jigyasa3 commented Oct 5, 2023

Hi @bheimbu ,

Yes, I think the software versions are different, but you can still run this code because the data is similar even though the column names differ.
To run tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_"), you can add the filename to your file using:

  1. an awk command:
     awk '{ print FILENAME","$0 }' your_tpm_file_name > your_new_tpm_file_name
  2. in R: the first column will now be the filename (i.e. the "file_name" column of the R script), and your "Name" column is already the "gene_name" column. So you can combine them together.
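The two steps above can be demonstrated end to end in the shell. This is only a sketch: the file name, column layout, and demo row are assumptions, and the final awk line just mirrors what the R paste(..., sep="_") call would produce.

```shell
# Demo quant file (invented single row, tab-separated like Salmon output)
printf 'Name\tTPM\nBEC328_contig1:440-1066\t75.63\n' > BEC328.quant

# Step 1: prepend the source file name to every row (the awk suggestion above)
awk 'BEGIN{OFS=","} {print FILENAME, $0}' BEC328.quant > BEC328.tpm

# Step 2: glue file_name and gene_name with "_", mirroring the R line
# tpm$fullproteinnames <- paste(tpm$file_name, tpm$gene_name, sep="_")
awk -F'[,\t]' 'NR>1{print $1"_"$2}' BEC328.tpm > fullproteinnames.txt
```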

Sorry, as I said before I don't have access to the intermediate files as I am not at OIST anymore. But the final files generated from these scripts are publicly available if it helps- https://figshare.com/articles/dataset/Tables_for_main_figures/19173407

@bheimbu
Author

bheimbu commented Oct 5, 2023

Thanks for letting me know,

I will try your suggestions tomorrow. Have a good one,

Bastian

@bheimbu
Author

bheimbu commented Dec 19, 2023

Hi,

it's been a while. I hope you're fine and preparing for the holidays. I have a question:

BLASTp analysis against ANNOTREE database

#The "all-wood-gtdb.fasta.dmnd" was created by adding protein sequences from the ANNOTREE database corresponding to the gene(s) of interest-

diamond blastp --db ${DB_DIR}/all-wood-gtdb.fasta.dmnd --query ${IN_DIR}/${file1} --outfmt 6 --out ${OUT_DIR}/wood-gtdb-matches-${file1}.txt --threads 15

Where does all-wood-gtdb.fasta.dmnd come from? I tried with this database and it works, but I cannot relate the results to my kofam results, as the output contains no KEGG IDs, only "gene_id" and "gtdb_id".

Cheers Bastian

@bheimbu
Author

bheimbu commented Jan 25, 2024

Hi,

a different thing: I'd like to publish a Snakemake workflow using some of your scripts (adjusted to my needs). That's why I'd like to ask whether you want to be a co-author. Let me know your decision.

If not, I'll clearly state that your code was used extensively.

Cheers Bastian

@Jigyasa3
Collaborator

Hi @bheimbu ,

Thanks for the message! Sorry, I was very busy during and after the holidays! The Annotree KEGG IDs and sequences come from here: http://annotree.uwaterloo.ca/annotree/app/.
If you search for a KEGG ID of interest, Annotree has the option to download a CSV file that contains the KEGG ID, protein sequence, bacterial ID, etc.
You can then extract the protein sequence and KEGG ID in fasta format and use that as a database for diamond blastp.
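The CSV-to-fasta step described above could look like this sketch. The column layout of the AnnoTree download (geneId, keggId, sequence) and the demo rows are assumptions; check the actual header of your downloaded file before adapting it.

```shell
# Invented stand-in for an AnnoTree CSV download (assumed columns)
printf 'geneId,keggId,sequence\ng1,K00001,MKTAYIAK\ng2,K00002,MSLLTEVE\n' > annotree_hits.csv

# Write a fasta with the KEGG ID kept in the header, so later diamond blastp
# hits can be related back to kofam results
awk -F',' 'NR>1{print ">"$1"|"$2"\n"$3}' annotree_hits.csv > annotree_K.fasta

# Then (not run here): diamond makedb --in annotree_K.fasta -d annotree_K
```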

I am super excited that you are ready to publish the Snakemake workflow! I would request that you add my name to the code that was used verbatim from my GitHub repository; otherwise, you are welcome to acknowledge my name. Could you please also add the paper associated with my code for reference?

Good luck!

@bheimbu
Author

bheimbu commented Jan 26, 2024

Hi,

I'm happy that you are on board. Can you give me your address details and ORCID ID (via email at [email protected])?

Of course, I will reference you. I'm preparing the manuscript right now and would be happy if you would provide some comments and feedback once it is finished.

Cheers Bastian
