Using QLever for UniProt
Instructions for building a QLever index for the complete UniProt data, written by Hannah Bast on 27.04.2022, last updated on 16.03.2023.
I downloaded all RDF and OWL files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf as follows (at the time of the download, the files were from 12.10.2022).
DATE=2022-10-12
curl -s https://ftp.expasy.org/databases/uniprot/current_release/rdf/RELEASE.meta4 \
| sed 's/<metalink.*/<metalink>/' \
| xmllint --xpath '/metalink/files/file/url[@location="ch"]/text()' - \
> uniprot.download-urls.${DATE}
mkdir -p rdf.${DATE}
> uniprot.${DATE}.download-log
cat uniprot.download-urls.${DATE} \
| while read URL; do wget --no-verbose -P rdf.${DATE} ${URL} 2>&1 | tee -a uniprot.${DATE}.download-log; done
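With several hundred files to fetch, it can help to confirm afterwards that every URL in the list produced a file on disk. The following is a minimal sketch, not part of the original procedure: the function name missing_downloads is made up for illustration, and it assumes that wget stored each file under the basename of its URL (which is what wget -P does for plain URLs like the ones above).

```shell
# Sketch (hypothetical helper): print every URL from the list for which no
# non-empty file exists in the download directory.
# Assumption: each file is stored under the basename of its URL.
missing_downloads() (
  url_list="$1"
  dir="$2"
  while read -r url; do
    [ -n "${url}" ] || continue                              # skip blank lines
    [ -s "${dir}/$(basename "${url}")" ] || echo "${url}"   # missing or empty
  done < "${url_list}"
)

# Usage with the file names from the commands above (only if the list exists).
DATE=${DATE:-2022-10-12}
if [ -f "uniprot.download-urls.${DATE}" ]; then
  missing_downloads "uniprot.download-urls.${DATE}" "rdf.${DATE}"
fi
```

Any URL printed by this check can be passed to the wget command above again. The check only looks at file presence and non-zero size, so it does not detect truncated downloads.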
In total, there were 723 files with RDF data, with a combined size of 788 GB. I converted these files to compressed Turtle using Apache Jena and GNU parallel as follows (this takes over a day). The total size of the resulting ttl.xz files was 702 GB.
XML2TTL="apache-jena-3.17.0/bin/rdfxml --output=ttl 2> /dev/null"
mkdir -p ttl.${DATE}
> rdf2ttl.commands.txt
for RDF in rdf.${DATE}/*.{owl,owl.xz,rdf,rdf.xz}; do \
echo "xzcat -f ${RDF} | ${XML2TTL} | xz -c > ttl.${DATE}/$(basename ${RDF} | sed 's/\(rdf\|rdf.xz\|owl\|owl.xz\)$/ttl.xz/') && echo 'DONE converting ${RDF}'" >> rdf2ttl.commands.txt; done
cat rdf2ttl.commands.txt | parallel
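Since the conversion runs for more than a day, a quick check for inputs without a corresponding output can be useful before starting the index build. The sketch below is an illustration under assumptions: the helper names ttl_name and missing_conversions are made up, and the name mapping mirrors the sed expression used above (written with sed -E for portability). It only detects missing or zero-byte outputs. A conversion that failed midway can still leave a small valid .xz file behind, which this check does not catch.

```shell
# Sketch (hypothetical helpers): map an input file name to the name of its
# Turtle output, following the same renaming rule as the loop above.
ttl_name() {
  basename "$1" | sed -E 's/\.(rdf|owl)(\.xz)?$/.ttl.xz/'
}

# Print every input file whose output is missing or empty.
missing_conversions() (
  rdf_dir="$1"
  ttl_dir="$2"
  for rdf in "${rdf_dir}"/*.owl "${rdf_dir}"/*.owl.xz \
             "${rdf_dir}"/*.rdf "${rdf_dir}"/*.rdf.xz; do
    [ -e "${rdf}" ] || continue   # skip glob patterns that matched nothing
    [ -s "${ttl_dir}/$(ttl_name "${rdf}")" ] || echo "${rdf}"
  done
)

# Usage with the directory names from the commands above (only if they exist).
DATE=${DATE:-2022-10-12}
if [ -d "rdf.${DATE}" ] && [ -d "ttl.${DATE}" ]; then
  missing_conversions "rdf.${DATE}" "ttl.${DATE}"
fi
```

The lines for any files reported here can be picked out of rdf2ttl.commands.txt and run again.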
Note that earlier versions of the UniProt RDF/XML files used inconsistent definitions for the base prefix [1]. This is no longer a problem, thanks to the UniProt team for the fix!
Clone the current QLever master and merge a small PR specifically written for UniProt, which changes two settings in the code that cannot yet be changed via the command line or the settings file. Namely, it lowers the maximum size for a literal kept in RAM (from 1024 to 128) and does not keep any literals of the predicates rdf:value and up:md5Checksum in RAM. These settings are crucial; without them, the index build will run out of RAM.
git clone --recursive git@github.com:ad-freiburg/qlever
cd qlever
git merge origin/uniprot-settings
docker build -t qlever.uniprot .
To compile natively (without Docker), follow the instructions that the qlever script prints when you type qlever install-binaries.
Use the qlever script with the preconfigured Qleverfile for UniProt as follows. The first command downloads the Qleverfile, the second command builds the index.
. qlever uniprot
qlever index
If you want to use your natively compiled code, set USE_DOCKER = false in the Qleverfile. If you want to build all six permutations instead of just PSO and POS (which are enough for almost all queries), set PSO_AND_POS_ONLY = false in the Qleverfile (or remove the whole line with that variable, since the default is to build all six permutations).
Here are the stats (produced by qlever index-stats) of the index build on an AMD Ryzen 9 5900X PC (12 cores) with 128 GB of RAM:
Parse input : 22.6 h
Build vocabularies : 28.2 h
Convert to global IDs : 4.6 h
PSO & POS permutations : 29.9 h
TOTAL index build time : 85.3 h
183 GB uniprot.full.index.pos
209 GB uniprot.full.index.pso
1.5 TB uniprot.full.vocabulary.external
332 GB uniprot.full.vocabulary.external.idsAndOffsets.mmap
40 GB uniprot.full.vocabulary.internal
2.3 TB total
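As a quick sanity check, the totals in the stats above can be recomputed from the individual rows. This is only an arithmetic sketch: the helper name sum_values is made up, and the size check assumes that the 1.5 TB entry corresponds to roughly 1500 GB.

```shell
# Sketch (hypothetical helper): sum one number per input line and print the
# total with one decimal place.
sum_values() {
  awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Phase times in hours, copied from the stats above; prints 85.3.
printf '%s\n' 22.6 28.2 4.6 29.9 | sum_values

# File sizes in GB, taking 1.5 TB as 1500 GB (an assumption); prints 2264.0,
# which is consistent with the reported total of 2.3 TB.
printf '%s\n' 183 209 1500 332 40 | sum_values
```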
To start the server, type the following in the same directory. The server is then up in 1.5 minutes.
qlever start
[1] In the 2021-11-17 version of the UniProt data, the base prefix was defined inconsistently across files. It was defined as @prefix : <http://purl.uniprot.org/core/> in most files, but had a different definition in others. This is not forbidden, but it is confusing. For the indexing with QLever, we identified four files with an inconsistent definition of that prefix and fixed them as follows (this is very fast, since the respective files are small).
xzcat -f rdf.${DATE}/taxonomy-hierarchy.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/taxonomy-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/uniparc-patents.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix schema:/; s/:mentions/schema:mentions/' | xz -c > ttl.${DATE}/uniparc-patents.ttl.xz &
xzcat -f rdf.${DATE}/go-hierarchy.owl.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/go-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/void.rdf | ${XML2TTL} | sed '/@prefix :/d' | xz -c > ttl.${DATE}/void.ttl.xz &
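To see what the sed expressions in these commands do, it can help to run one of them on a tiny input. The sketch below wraps the expression used for taxonomy-hierarchy and go-hierarchy in a function (the name fix_base_prefix is made up) and applies it to two made-up sample lines. The sample only illustrates the shape of the data and is not copied from the actual files.

```shell
# Sketch (hypothetical helper): rename the empty prefix to rdfs: and rewrite
# the one predicate that used it, exactly as in the commands above.
fix_base_prefix() {
  sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/'
}

# Made-up sample input: a prefix declaration and one triple using the prefix.
printf '%s\n' \
  '@prefix :  <http://www.w3.org/2000/01/rdf-schema#> .' \
  '<http://purl.uniprot.org/taxonomy/9606>  :subClassOf  <http://purl.uniprot.org/taxonomy/9605> .' \
  | fix_base_prefix
```

On this sample, the first line becomes an @prefix rdfs: declaration and the predicate in the second line becomes rdfs:subClassOf. Note that the second substitution has no g flag, so it rewrites only the first :subClassOf per line, and it would also alter a line in which the predicate is already written as rdfs:subClassOf. It is therefore only suitable for files that use the empty prefix throughout.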