
gtdb_ver95_alllca_taxid.csv.tar.gz #1

Open
bheimbu opened this issue Aug 30, 2023 · 16 comments
@bheimbu

bheimbu commented Aug 30, 2023

Hi,

I'm wondering where gtdb_ver95_alllca_taxid.csv.tar.gz comes from. Did you write it yourself or download it from somewhere? I'm using your pipeline to analyze the microbiome of Australian termites, but I want to use GTDB ver202 or later.

Additionally, I'd like to know where the gtf file ("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/AIM2/tpm_functional_annotation/functional_annotation/all_functions_all_taxonomy/gtf_files_Dec2019/named-gtffiles/filename-230-13-prokka.map.gtf") comes from, please. It is referenced in hpc_tpmcal.md. Did you use this code snippet to create it?

Also, it is not clear to me where this file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/all-samples-prokTPM.txt) from here comes from.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 4, 2023

Hi @bheimbu ,

Thank you for your interest in the scripts!

  1. The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but the GTDB team has since published code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump
  2. The gtf files were generated from PROKKA. You found the right code snippet for the conversion.
  3. The file all-samples-prokTPM.txt was generated with the SALMON software: I ran SALMON on all files and concatenated the outputs to produce this file.
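The concatenation step in point 3 could look roughly like the sketch below. This is an assumption, not the original script: the directory layout (`salmon_out/<sample>/quant.sf`) and the demo input files are invented to make the idea runnable.

```shell
# Hypothetical sketch: merge per-sample Salmon quant.sf files into one table,
# tagging each row with its sample name. Two tiny demo quant.sf files stand in
# for the real Salmon outputs; paths and names are assumptions.
mkdir -p salmon_out/S1 salmon_out/S2
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\nBEC328_contig1:440-1066\t627\t388.472\t75.633197\t9.000\n' > salmon_out/S1/quant.sf
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\nBEC328_contig2:250-951\t702\t463.470\t147.919989\t21.000\n' > salmon_out/S2/quant.sf

out=all-samples-prokTPM.txt
: > "$out"
for qf in salmon_out/*/quant.sf; do
    sample=$(basename "$(dirname "$qf")")   # sample name = directory name
    # drop each file's header line, prepend the sample name as a new column
    tail -n +2 "$qf" | awk -v s="$sample" 'BEGIN{OFS="\t"}{print s,$0}' >> "$out"
done
```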

Good luck with your analysis!
Let me know if you have any other doubts!

@bheimbu
Author

bheimbu commented Sep 4, 2023

Hi @Jigyasa3 ,

thanks for getting back to me. I'll see how far I can get. Actually, I'm trying to implement your pipeline as a Snakemake workflow to make it more reproducible, so I may have some more questions in the future -- just to let you know.

Cheers Bastian

@bheimbu
Author

bheimbu commented Sep 5, 2023

Hi @Jigyasa3,

could you please clarify on:

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but the GTDB team has since published code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump

I cannot find the mentioned code on that page.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 5, 2023

Hi @bheimbu ,

The file gtdb_ver95_alllca_taxid.csv.tar.gz is essentially a taxdump for a specific version of GTDB. The GitHub page I linked lets you create a taxdump file for any version of the GTDB database. I haven't used it yet; I found it recently and was pleased to see that the GTDB team has streamlined the process of using the database for DIAMOND/BLAST analysis.

They give details of the method in their README file. I recommend asking them directly as I haven't used it myself.

@bheimbu
Author

bheimbu commented Sep 6, 2023

Hi @Jigyasa3,

I'm really sorry to bother you, but when I use the code from https://github.com/shenwei356/gtdb-taxdump, I get the following files: delnodes.dmp, merged.dmp, names.dmp, nodes.dmp, and taxid.map. None of these comes close to your gtdb_ver95_alllca_taxid.csv.tar.gz.

Is there a script or some line of code that you could share with me?

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 6, 2023

Hi @bheimbu ,

I did a Google search for you. Here are some suggestions.

  1. To incorporate the taxdump files from GTDB into DIAMOND, check this link: https://www.biostars.org/p/412823/ and the DIAMOND manual: https://gensoft.pasteur.fr/docs/diamond/2.0.4/3_Command_line_options.html
  2. The GTDB equivalent of the files required as input to DIAMOND: https://github.com/shenwei356/gtdb-taxdump/issues/6
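For point 2, gtdb-taxdump's taxid.map ("accession<TAB>taxid") is not quite the layout DIAMOND's --taxonmap option expects (the NCBI prot.accession2taxid format: accession, accession.version, taxid, gi, with a header line). A hedged awk reshaping is sketched below; the demo input is made up, and whether your fasta headers actually carry these accessions depends on how the database was built.

```shell
# Demo taxid.map (invented accessions/taxids, tab-separated)
printf 'GCF_000005845.2\t12345\nGCF_000006945.1\t67890\n' > taxid.map

# Reshape into the prot.accession2taxid layout DIAMOND reads:
# strip the version suffix for column 1, keep the full accession in column 2,
# and fill the unused gi column with 0.
awk 'BEGIN{FS=OFS="\t"; print "accession","accession.version","taxid","gi"}
     {split($1,a,"."); print a[1], $1, $2, 0}' taxid.map > prot.accession2taxid

# Then (not run here) build a taxonomy-aware DIAMOND database:
# diamond makedb --in proteins.faa -d gtdb --taxonmap prot.accession2taxid \
#     --taxonnodes nodes.dmp --taxonnames names.dmp
```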

@bheimbu
Author

bheimbu commented Sep 7, 2023

Hi,

thanks for all your help again. Aside from the fact that the links do not work, I'm wondering why you cannot tell me how you created gtdb_ver95_alllca_taxid.csv.tar.gz, since you say

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written.

Anyway, if you don't want to share this information with me, I have to respect that.

I would have some more questions related to your pipeline:

  1. Prokka outputs *fna and *faa files, but they don't have the same fasta headers, right? So once you use fetchMGs to extract COGs using the Prokka files (*faa and *fna) as input, you only get COG protein sequences, but no protein-coding nucleotide sequences (see this related post). So how did you do it?

  2. Anyway, I'm a bit confused because these files

while read line; do while read cogs; do cp ${line}/${cogs}*fna allfetchm_nucoutput/${line}-${cogs}.fna; done < allcogs.txt; done < filesnames.txt

don't appear again anywhere in your pipeline, so are they really important?

I'm really sorry to bother you with all these questions, but I just want to get things right.

Cheers Bastian

@Jigyasa3
Collaborator

Jigyasa3 commented Sep 7, 2023

Hey @bheimbu ,

  1. Sorry, the only reason I am redirecting you to other resources for creating the gtdb_ver95_alllca_taxid.csv.tar.gz file is that I have already left the university and no longer have access to my university's cluster to check old scripts. From what I remember, I joined the metadata file from GTDB with the taxdump files to create gtdb_ver95_alllca_taxid.csv.tar.gz; it essentially adds an LCA taxonomy to each taxid. By the way, the links do work, but you will have to copy and paste them. Somehow clicking on a link redirects you to the issues page of this repository.

  2. Prokka appends _1 to the end of each protein fasta header, so the first part of the header is common between the protein and nucleotide headers. I just matched that first part. To verify that I was matching the correct nucleotide headers, I a) manually compared the annotations of some nucleotide sequences and their corresponding protein sequences, and b) used the EMBOSS online tool to translate some nucleotide sequences to proteins, which should be 100% identical to the original protein sequences.

  3. Yes, you are right: the filenames.txt file created from this while loop is not used again. It was just to keep track of how many files I was working with.
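The header-matching idea in point 2 could be sketched like this. The IDs below are invented for illustration (they are not real Prokka output), and the exact suffix pattern is an assumption based on the description above.

```shell
# Hedged sketch: strip the trailing _N suffix from protein IDs and match on
# the shared prefix against nucleotide IDs. Demo IDs are made up.
printf 'contig1_1\ncontig2_1\n' > protein_ids.txt
printf 'contig1\ncontig2\ncontig3\n' > nucleotide_ids.txt

sed 's/_[0-9]*$//' protein_ids.txt | sort > protein_prefixes.txt
sort nucleotide_ids.txt > nuc_sorted.txt

# comm -12 keeps only IDs present in both sorted lists
comm -12 protein_prefixes.txt nuc_sorted.txt > matched_ids.txt
```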

Let me know if you need anything!

@bheimbu
Author

bheimbu commented Sep 8, 2023

Hi,

thanks for the clarification; I didn't know you had left OIST. I'll have a second look at the links you provided.

I'll see what I can do about gtdb_ver95_alllca_taxid.csv.tar.gz.

There are certainly more questions coming, but so far so good ;)

Have a nice weekend,

Bastian

@bheimbu
Author

bheimbu commented Oct 2, 2023

Hi @Jigyasa3,

to be honest, I'm stuck. Right now I'm trying to combine all my files as in combiningallfiles.md, but I'm failing at the first line.

My salmon quant files look like this:

Name	Length	EffectiveLength	TPM	NumReads
BEC328_contig1:440-1066	627	388.472	75.633197	9.000
BEC328_contig2:250-951	702	463.470	147.919989	21.000
BEC328_contig3:214-567	354	131.388	198.776314	8.000
BEC328_contig6:26-460	435	200.515	227.934551	14.000
BEC328_contig7:281-601	321	106.969	152.595815	5.000
BEC328_contig9:578-1465	888	649.470	60.318628	12.000
BEC328_contig10:7-867	861	622.470	73.424146	14.000

So tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_") is not even possible, because there are no file_name and gene_name columns. Sometimes I really wonder whether we are using the same software versions.

What do your fullproteinnames actually look like -- I'm just curious?!

Cheers Bastian

PS: This file also makes me wonder: cogs<-read.csv("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/allcogs-allsamples-finalkrakenoutput.csv"), since you mentioned before that DIAMOND, not kraken2, was actually used.

@Jigyasa3
Collaborator

Jigyasa3 commented Oct 5, 2023

Hi @bheimbu ,

Yes, I think the software versions are different, but you can still run this code because the data is similar even though the column names differ.
To run tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_"), you can add the filename to your file using:

  1. an awk command:
     awk '{ print FILENAME","$0 }' your_tpm_file_name > your_new_tpm_file_name
  2. in R: the first column will now be the filename (i.e. the "file_name" column of the R script), and your "Name" column is already the "gene_name" column. So you can combine them together.
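The two steps above can be demonstrated end to end in the shell. This is only a sketch: the file name, column layout, and demo row are assumptions, and the final awk line just mirrors what the R paste(..., sep="_") call would produce.

```shell
# Demo quant file (invented single row, tab-separated like Salmon output)
printf 'Name\tTPM\nBEC328_contig1:440-1066\t75.63\n' > BEC328.quant

# Step 1: prepend the source file name to every row (the awk suggestion above)
awk 'BEGIN{OFS=","} {print FILENAME, $0}' BEC328.quant > BEC328.tpm

# Step 2: glue file_name and gene_name with "_", mirroring the R line
# tpm$fullproteinnames <- paste(tpm$file_name, tpm$gene_name, sep="_")
awk -F'[,\t]' 'NR>1{print $1"_"$2}' BEC328.tpm > fullproteinnames.txt
```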

Sorry, as I said before I don't have access to the intermediate files as I am not at OIST anymore. But the final files generated from these scripts are publicly available if it helps- https://figshare.com/articles/dataset/Tables_for_main_figures/19173407

@bheimbu
Author

bheimbu commented Oct 5, 2023

Thanks for letting me know,

I will try your suggestions tomorrow. Have a good one,

Bastian

@bheimbu
Author

bheimbu commented Dec 19, 2023

Hi,

it's been a while. I hope you're fine and preparing for the holidays. I have a question:

BLASTp analysis against ANNOTREE database

#The "all-wood-gtdb.fasta.dmnd" was created by adding protein sequences from the ANNOTREE database corresponding to the gene(s) of interest-

diamond blastp --db ${DB_DIR}/all-wood-gtdb.fasta.dmnd --query ${IN_DIR}/${file1} --outfmt 6 --out ${OUT_DIR}/wood-gtdb-matches-${file1}.txt --threads 15

Where does all-wood-gtdb.fasta.dmnd come from? I tried with this database and it works, but I cannot relate the results to my kofam results, as the output contains no KEGG IDs, only "gene_id" and "gtdb_id".

Cheers Bastian

@bheimbu
Author

bheimbu commented Jan 25, 2024

Hi,

a different thing: I'd like to publish a Snakemake workflow using some of your scripts (adjusted to my needs). That's why I'd like to ask whether you want to be a co-author. Let me know your decision.

If not, I'll clearly state that your code was used extensively.

Cheers Bastian

@Jigyasa3
Collaborator

Hi @bheimbu ,

Thanks for the message! Sorry, I was very busy during and after the holidays! The Annotree KEGG IDs and sequences come from here: http://annotree.uwaterloo.ca/annotree/app/.
If you search for a KEGG ID of interest, Annotree has the option to download a CSV file that contains the KEGG ID, protein sequence, bacterial ID, etc.
You can then extract the protein sequence and KEGG ID in fasta format and use that as a database for diamond blastp.
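The CSV-to-fasta step described above could look like this sketch. The column layout of the AnnoTree download (geneId, keggId, sequence) and the demo rows are assumptions; check the actual header of your downloaded file before adapting it.

```shell
# Invented stand-in for an AnnoTree CSV download (assumed columns)
printf 'geneId,keggId,sequence\ng1,K00001,MKTAYIAK\ng2,K00002,MSLLTEVE\n' > annotree_hits.csv

# Write a fasta with the KEGG ID kept in the header, so later diamond blastp
# hits can be related back to kofam results
awk -F',' 'NR>1{print ">"$1"|"$2"\n"$3}' annotree_hits.csv > annotree_K.fasta

# Then (not run here): diamond makedb --in annotree_K.fasta -d annotree_K
```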

I am super excited that you are ready to publish the Snakemake workflow! I would request that you add my name to the code that was used verbatim from my GitHub repository; otherwise, you are welcome to acknowledge my name. Could you please also add the paper associated with my code for reference?

Good luck!

@bheimbu
Author

bheimbu commented Jan 26, 2024

Hi,

I'm happy that you are on board. Can you give me your address details and ORCID ID (via email at [email protected])?

Of course, I will reference you. I'm preparing the manuscript right now and would be happy if you would provide some comments and feedback once it is finished.

Cheers Bastian
