Problem in downloading database #126

Open
Subhajeet1997 opened this issue Sep 14, 2023 · 12 comments

Comments

@Subhajeet1997

I used the command "hgtector database -o db_dir --default" to download the database. The protein files downloaded successfully, but the genome step then failed with the following error:
Using local file GCF_963082495.1_Q8283_protein.faa.gz.
Using local file GCF_963378075.1_MU0083_Flye_MinION_protein.faa.gz.
Using local file GCF_963378095.1_MU0053_Flye_MinION.2_protein.faa.gz.
Using local file GCF_963378105.1_MU0102_Flye_MinION_protein.faa.gz.
Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
Done.
Extracting downloaded genomic data...Killed
What is the reason behind this?

@qiyunzhu
Contributor

Hi @Subhajeet1997 Thanks for reporting. I have not seen this problem before. It seems to be a problem outside HGTector's Python code. Perhaps the gzip library is not correctly installed on your computer. To debug, you can take one of the downloaded .gz files (say, filename.gz) and try to open it with the following Python code:

import gzip
f = gzip.open('filename.gz', 'rb')
print(f.read().decode().splitlines()[0])
f.close()

If you get the same error, then my guess is correct.

@Subhajeet1997
Author

Yes, I tried to open a gzipped file using your script. It shows the following error:
Traceback (most recent call last):
File "/home/sutripa/test_1/python.py", line 3, in <module>
print(f.read().decode().splitlines()[0])
^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 2066: invalid start byte

@Subhajeet1997
Author

But gzip is properly installed on my system. When I unzipped the same file with gzip -d "filename", it decompressed without any problem.

@qiyunzhu
Contributor

I see. The gzip program and Python may use different libraries, so perhaps the Python side is not right. It could also be that the gzipped file you tested is not a text file, which would cause the decoding error. Can you please try a text file? Alternatively, you can change the line print(f.read().decode().splitlines()[0]) to _ = f.read(). This will test whether the problem lies in the gzip library or in the file content.
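
The two checks suggested above (does the file decompress, and does its content decode as text) can be combined into one small script. This is only a sketch: the function name check_gzip is made up for illustration, and reading in chunks is an added choice to keep memory use low, not something HGTector itself is known to do.

```python
import codecs
import gzip
import zlib


def check_gzip(path, chunk_size=1 << 20):
    """Return (decompress_ok, is_utf8_text) for a gzipped file.

    The file is read in chunks, so memory use stays bounded even for
    very large archives. is_utf8_text is None if decompression fails.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    is_text = True
    try:
        with gzip.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                if is_text:
                    try:
                        decoder.decode(chunk)
                    except UnicodeDecodeError:
                        # Content is binary, but keep reading to test gzip.
                        is_text = False
        if is_text:
            try:
                # Flush: catches a multi-byte character cut off at the end.
                decoder.decode(b"", final=True)
            except UnicodeDecodeError:
                is_text = False
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated archive, or corrupt compressed data.
        return False, None
    return True, is_text
```

A result of (True, False) would point to the file content (not text), while (False, None) would point to the archive or the gzip library.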

@Subhajeet1997
Author

Subhajeet1997 commented Sep 15, 2023

test.txt.gz
import gzip
f = gzip.open('test.txt.gz', 'rb')
print(f.read().decode().splitlines()[0])
f.close()
I ran this script on the gzipped text file and it ran successfully.
I then tried again with "hgtector database -o hgtector_database --default --threads 50".
The error is still the same:
Using local file GCF_963394915.1_CCUG_26878_T_protein.faa.gz.
Done.
Extracting downloaded genomic data...Killed
If I can't download the database this way, I will use the recent pre-built database instead. Can you give me the exact link so that I can download it with wget? I can't properly follow the links provided on the GitHub page. Please help me run the tool; it is essential for my analysis.

@Subhajeet1997
Author

Hello, I couldn't download the database by the default method, so I downloaded the pre-built database named "hgtdb_20230102" and unzipped it. It contains the files "db.faa, genome.map.gz, genomes.tsv, lineages.txt, taxdump, taxon.map.gz". I then tried to compile the database manually using the following commands:
echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL
diamond makedb --threads 50 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db db
It shows the following error:
"Error: Invalid taxonomy mapping file format."
Please help.
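
For reference, the header-prepending step in the first command above can be sketched in Python with a basic row check added. The function name write_accession2taxid is hypothetical, and the two-column "accession, tab, integer taxid" layout of taxon.map.gz is an assumption inferred from the header that the command prepends. The check only validates row layout; it does not establish why DIAMOND rejected the file.

```python
import gzip


def write_accession2taxid(taxon_map_gz, out_path):
    """Write a header line followed by the rows of a gzipped taxon map.

    Returns the number of data rows written. Raises ValueError on the
    first row that is not 'accession<TAB>integer taxid' (assumed layout).
    """
    n = 0
    with gzip.open(taxon_map_gz, "rt") as src, open(out_path, "w") as dst:
        dst.write("accession.version\ttaxid\n")
        for lineno, line in enumerate(src, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not fields[1].isdigit():
                raise ValueError(f"line {lineno}: unexpected row {line!r}")
            dst.write("\t".join(fields) + "\n")
            n += 1
    return n
```

Called as write_accession2taxid("taxon.map.gz", "prot.accession2taxid.FULL"), it produces the same file as the shell one-liner when every row passes the check.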

@qiyunzhu
Contributor

Hello @Subhajeet1997 Thanks for the follow-up. I just tried to compile the "hgtdb_20230102" database using DIAMOND v2.1.8 (the latest version), and it worked. I also tried it on the demo database "ref107" and it worked too. Therefore, I am afraid that I cannot reproduce the error you encountered. Which DIAMOND version did you use? If it's too old (like 0.7.x), there could be a problem. Otherwise, you can check the integrity of the downloaded database file. There is an MD5 checksum attached in the repository for this check.
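
The integrity check mentioned above can be done with Python's standard hashlib module if the md5sum program is not at hand. This is a minimal sketch: the helper name md5sum is made up here, and the archive path and expected checksum have to be supplied by you from the repository.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the returned string with the published checksum; any difference means the download is incomplete or corrupted.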

@qiyunzhu
Contributor

Also, I just built a small custom database using the hgtector database command and didn't get the Killed error. I did some searching and found that this error might be related to a memory leak. I don't know how to handle this...

@Subhajeet1997
Author

Yes, you are right: my DIAMOND is an older version, v0.9.25.126. I will update DIAMOND and try to compile the database again. For now, I have compiled the database using makeblastdb; it compiled successfully and I have started one search using BLAST. It is noticeably slower than DIAMOND, taking 2-2.5 days to run, so I am waiting for the output. I hope to get some results.

@Subhajeet1997
Author

Hey, the BLAST run finished successfully and I got results. But I have another question: what are the default parameters for "--maxhits --evalue --identity --coverage"? I ran with the defaults; is running in default mode acceptable?

@qiyunzhu
Contributor

qiyunzhu commented Oct 1, 2023

Hi @Subhajeet1997 The default parameters are stored in config.yml:

  # search cutoffs
  maxseqs: 500        # maximum number of sequences to return
  evalue: 1.0e-5      # maximum E-value cutoff (note: keep decimal point)
  identity: 0         # minimum percent identity cutoff
  coverage: 0         # minimum percent query coverage cutoff

  # hits filtering
  maxhits: 0          # maximum number of hits to preserve (0 for unlimited)

@kirtivel

Hello Prof. Zhu (@qiyunzhu),
I have a question regarding the creation of a custom database using specific taxa. I want to identify HGT genes in a Metabacillus strain, and I presumed that I cannot download the entire repository of bacterial faa files. Hence, after installing hgtector 2.0, I ran the following command to exclude all taxa except Bacillota:

hgtector database -c bacteria -o db1 -t 1117,766,57723,201174,200783,67819,67818,976,1936987,3018035,67814,29547,1930617,204428,1090,200795,200938,2138240,200930,1297,68297,74152,65842,32066,142182,1134404,256845,544448,2818505,1293497,40117,203682,1224,1853220,203691,508458,200940,3027942,200918,74201 -e

The taxon ID for Bacillota is 1239, which is what I want to download. But even this is taking an awfully long time (approx. 13 h). The download proceeds without any error, but it is too slow. Following are my system and Wi-Fi details:

  1. Lenovo Ideapad, 16GB memory
  2. 12th Gen i5 - 1235U x 12 Processor
  3. 1 TB Disk capacity
  4. OS = Ubuntu 22.04.3 LTS
  5. Wifi speed = 32Mb/s

Do I require more disk space for this download? Or is there anything wrong with the command? If you think that my disk space is not enough, could you suggest another way to do this? Thank you.
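
As a rough sanity check on the disk-space question, the figures quoted above give an upper bound on how much data could have arrived so far. The sketch assumes that "32Mb/s" means 32 megabits per second and that the link was fully saturated for the whole 13 hours; actual throughput can be lower, and the total size of the requested download is not known from this calculation.

```python
def max_download_gb(link_megabits_per_s, hours):
    """Upper bound on data transferred, in gigabytes (10^9 bytes)."""
    bytes_per_s = link_megabits_per_s * 1e6 / 8  # megabits/s -> bytes/s
    return bytes_per_s * hours * 3600 / 1e9


print(round(max_download_gb(32, 13), 1))  # prints 187.2
```

Under these assumptions, at most about 187 GB could have been transferred in 13 hours, which is well below the 1 TB disk capacity listed.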
