Problem in downloading database #126

Subhajeet1997 · 2023-09-14T17:07:34Z

I used the command "hgtector database -o db_dir --default" to download the database. The protein files were downloaded successfully, but when it moved on to the genome files, it showed the following error:
Using local file GCF_963082495.1_Q8283_protein.faa.gz.
Using local file GCF_963378075.1_MU0083_Flye_MinION_protein.faa.gz.
Using local file GCF_963378095.1_MU0053_Flye_MinION.2_protein.faa.gz.
Using local file GCF_963378105.1_MU0102_Flye_MinION_protein.faa.gz.
Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
Done.
Extracting downloaded genomic data...Killed
What is the reason behind it?

qiyunzhu · 2023-09-14T17:41:26Z

Hi @Subhajeet1997 Thanks for reporting. I have not seen this problem before. It seems to be a problem outside HGTector's Python code. Perhaps the gzip library isn't correctly installed on your computer. To debug, you may grab a downloaded .gz file (say, filename.gz) and attempt to open it using the following Python code:

import gzip
f = gzip.open('filename.gz', 'rb')
print(f.read().decode().splitlines()[0])
f.close()

If you get the same error, then my guess is correct.

Subhajeet1997 · 2023-09-14T17:47:14Z

Yes, I tried to open a gzipped file using your script. It shows the following error:
Traceback (most recent call last):
  File "/home/sutripa/test_1/python.py", line 3, in <module>
    print(f.read().decode().splitlines()[0])
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 2066: invalid start byte

Subhajeet1997 · 2023-09-14T17:49:02Z

But gzip is properly installed on my system. When I tried to unzip the same file with gzip -d "filename", it unzipped without any problem.

qiyunzhu · 2023-09-14T18:37:04Z

I see. The gzip program and Python may use different libraries, so perhaps the Python side is not right. It could also be that the gzipped file you tested is not a text file, which would cause the decoding error. Can you please try a text file? Alternatively, you can change the line print(f.read().decode().splitlines()[0]) to _ = f.read(). This will test whether it is a gzip library issue or a file content issue.
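
The two cases above (a gzip library problem versus a file content problem) can be told apart with a minimal sketch like the one below. It decompresses the file in chunks and never decodes it as text, so a binary payload cannot trigger a UnicodeDecodeError. The function name check_gzip and the chunk size are illustrative choices and are not part of HGTector.

```python
import gzip
import zlib


def check_gzip(path, chunk_size=1 << 20):
    """Decompress `path` chunk by chunk without decoding it as text.

    Returns a tuple (ok, n_bytes, message), where n_bytes is the number
    of decompressed bytes read before the end of file or the first error.
    """
    n_bytes = 0
    try:
        with gzip.open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                n_bytes += len(chunk)
    except (OSError, EOFError, zlib.error) as err:
        # OSError covers missing files and invalid gzip headers,
        # EOFError covers truncated files, zlib.error covers corrupt data.
        return False, n_bytes, f'{type(err).__name__}: {err}'
    return True, n_bytes, 'decompressed without error'
```

Calling check_gzip('filename.gz') on one of the downloaded files should return ok = True if Python's gzip module can read it. Because only one chunk is held in memory at a time, it also works on large files.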

Subhajeet1997 · 2023-09-15T06:28:24Z

test.txt.gz
import gzip
f = gzip.open('test.txt.gz', 'rb')
print(f.read().decode().splitlines()[0])
f.close()
I ran this script on a gzipped text file and it ran successfully.
I then tried again with "hgtector database -o hgtector_database --default --threads 50"
but I still get the same error:
Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
Done.
Extracting downloaded genomic data...Killed
If I can't download the database this way, I will use the recent pre-built database instead. Can you give me the exact link so that I can download it using wget? I can't properly understand the links provided on the GitHub page. Please help me run the tool. It is essential for my analysis.

Subhajeet1997 · 2023-09-20T12:56:24Z

Hello, I couldn't download the database by the default method, so I downloaded the pre-built database named "hgtdb_20230102" and unzipped it. It contains the files "db.faa, genome.map.gz, genomes.tsv, lineages.txt, taxdump, taxon.map.gz". I then tried to compile the database manually using the following commands:
echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL
diamond makedb --threads 50 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db db
It shows the following error:
"Error: Invalid taxonomy mapping file format."
Please help.
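
Before looking at DIAMOND itself, it may be worth confirming that the mapping file written by the echo/zcat command above really has the intended layout: a header line "accession.version" and "taxid" separated by a tab, followed by two tab-separated columns with a numeric taxid. The sketch below does only that. The function name check_taxon_map is hypothetical, and whether a particular DIAMOND version accepts this two-column layout is a separate question that this check does not answer.

```python
import gzip


def check_taxon_map(path, header='accession.version\ttaxid', max_lines=1000):
    """Return a list of problems found in the header and in the first
    `max_lines` data lines of a taxonomy mapping file (plain or gzipped).
    An empty list means no problem was found in the inspected lines."""
    opener = gzip.open if path.endswith('.gz') else open
    problems = []
    with opener(path, 'rt') as f:
        first = f.readline().rstrip('\n')
        if first != header:
            problems.append(f'line 1: unexpected header {first!r}')
        for lineno, line in enumerate(f, start=2):
            if lineno > max_lines + 1:
                break
            fields = line.rstrip('\n').split('\t')
            if len(fields) != 2:
                problems.append(f'line {lineno}: expected 2 columns, got {len(fields)}')
            elif not fields[1].isdigit():
                problems.append(f'line {lineno}: taxid {fields[1]!r} is not a number')
    return problems
```

For example, check_taxon_map('prot.accession2taxid.FULL') returning an empty list suggests that the file itself is well formed, in which case the DIAMOND version becomes the more likely suspect.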

qiyunzhu · 2023-09-21T19:18:56Z

Hello @Subhajeet1997 Thanks for the follow-up. I just tried to compile the "hgtdb_20230102" database using DIAMOND v2.1.8 (the latest version), and it worked. I also tried it on the demo database "ref107" and it worked too. Therefore, I am afraid that I cannot reproduce the error you encountered. Which DIAMOND version did you use? If it is too old (like 0.7.x), there could be a problem. Otherwise, perhaps you can check the integrity of the downloaded database file. There is an MD5 checksum attached in the repository for this purpose.
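
The integrity check mentioned above can be done with the md5sum command line tool, or with a short Python sketch like this one. It reads the file in chunks so that a large archive does not need to fit in memory. The function name md5sum is an illustrative choice, and the archive file name and the published checksum must be taken from the repository.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks until f.read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

If the returned string differs from the published checksum, the download is incomplete or corrupted and should be repeated.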

qiyunzhu · 2023-09-21T19:43:20Z

Also, I just built a small custom database using the hgtector database command, and didn't get the Killed error. I did some searching and found that this error might be related to a memory leak. I don't know how to handle this...

Subhajeet1997 · 2023-09-22T06:25:51Z

Yes, you are right, my DIAMOND is an older version (v0.9.25.126). I will update DIAMOND and try to compile the database again. For now, I have compiled the database using makeblastdb. It compiled successfully and I have started a search using BLAST. It is obviously slow compared to DIAMOND, taking 2-2.5 days to run, so I am waiting for the output. I hope I will get some results.

Subhajeet1997 · 2023-09-27T15:16:31Z

Hey, the BLAST run finished successfully and I got results. But I have another question: what are the default values of "--maxhits --evalue --identity --coverage"? I ran with the defaults. Is running in default mode acceptable?

qiyunzhu · 2023-10-01T18:36:37Z

Hi @Subhajeet1997 The default parameters are stored in config.yml:

  # search cutoffs
  maxseqs: 500        # maximum number of sequences to return
  evalue: 1.0e-5      # maximum E-value cutoff (note: keep decimal point)
  identity: 0         # minimum percent identity cutoff
  coverage: 0         # minimum percent query coverage cutoff

  # hits filtering
  maxhits: 0          # maximum number of hits to preserve (0 for unlimited)
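
To illustrate what these defaults mean in practice, here is a minimal sketch of how such cutoffs could be applied to a list of hits. It is not HGTector's actual implementation. The dictionary keys, the function name filter_hits and the inclusive comparisons are assumptions made for the example. maxseqs is not shown because it limits what the search tool returns in the first place.

```python
def filter_hits(hits, evalue=1.0e-5, identity=0, coverage=0, maxhits=0):
    """Keep hits that pass the cutoffs, preserving their original order.

    Each hit is a dict with the (hypothetical) keys 'evalue', 'identity'
    and 'coverage'. maxhits=0 means the number of hits is unlimited.
    """
    kept = [h for h in hits
            if h['evalue'] <= evalue
            and h['identity'] >= identity
            and h['coverage'] >= coverage]
    return kept[:maxhits] if maxhits > 0 else kept


hits = [
    {'id': 'A', 'evalue': 1e-50, 'identity': 95.0, 'coverage': 98.0},
    {'id': 'B', 'evalue': 1e-3, 'identity': 40.0, 'coverage': 60.0},
    {'id': 'C', 'evalue': 1e-10, 'identity': 35.0, 'coverage': 50.0},
]
print([h['id'] for h in filter_hits(hits)])  # ['A', 'C']
```

With the defaults, only the E-value cutoff removes anything (hit B here), because identity and coverage cutoffs of 0 accept every hit and maxhits of 0 imposes no limit.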

kirtivel · 2023-10-13T02:09:21Z

Hello Prof. Zhu (@qiyunzhu),
I have a question about creating a custom database restricted to specific taxa. I want to identify the HGT genes in a Metabacillus strain, and I presumed that I cannot download the entire repository of bacterial faa files. Hence, after installing HGTector 2.0, I ran the following command to exclude all taxa except Bacillota:

hgtector database -c bacteria -o db1 -t 1117,766,57723,201174,200783,67819,67818,976,1936987,3018035,67814,29547,1930617,204428,1090,200795,200938,2138240,200930,1297,68297,74152,65842,32066,142182,1134404,256845,544448,2818505,1293497,40117,203682,1224,1853220,203691,508458,200940,3027942,200918,74201 -e

The taxonomy ID for Bacillota is 1239, which is what I want to download. But even this is taking an awfully long time (approx. 13 h). The download is proceeding without any error, but it is too slow. Here are my system and Wi-Fi details:

Lenovo Ideapad, 16GB memory
12th Gen i5 - 1235U x 12 Processor
1 TB Disk capacity
OS = Ubuntu 22.04.3 LTS
Wifi speed = 32Mb/s

Do I require more disk space for this download? Or is there anything wrong with the command? If you think that my disk space is not enough, could you suggest another way to do this? Thank you.
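
As a rough aid for the disk space question, the figures above give an upper bound on how much data the connection could have transferred so far. The sketch assumes that "32Mb/s" means 32 megabits per second and that the link was fully used for the whole 13 hours. Real throughput is probably lower, since the download consists of many separate files. The function name max_transfer_gb is illustrative.

```python
def max_transfer_gb(speed_megabits_per_s, hours):
    """Upper bound on transferred data in gigabytes (10**9 bytes)."""
    bytes_per_s = speed_megabits_per_s * 1e6 / 8  # megabits/s -> bytes/s
    return bytes_per_s * hours * 3600 / 1e9


print(round(max_transfer_gb(32, 13), 1))  # 187.2
```

Under these assumptions, at most about 187 GB could have been downloaded in 13 hours, which is well below the 1 TB disk capacity. The actual usage can be compared against this number with df -h or du -sh on the output directory.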
