
Building the suffix array

Tibo Vande Moortele edited this page Aug 7, 2024 · 18 revisions

Build using the CLI

Build using Docker

Build using the HPC

Index storage setup

Important

This section needs to be executed on a Unipept server!

Set the correct UniProt version

export UNIPROT_VERSION="2024.04"

Navigate to the data share

cd /mnt/data

Create the folder structure for the new index version

sudo mkdir -p "uniprot-$UNIPROT_VERSION"/{index,suffix-array,tables}

Set the right permissions

sudo chmod -R 777 "uniprot-$UNIPROT_VERSION"

Save the version number

echo "$UNIPROT_VERSION" | tr '-' '.' > "uniprot-$UNIPROT_VERSION/suffix-array/.version"
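The steps above can be rehearsed without touching the data share. The following is a minimal sketch that recreates the same layout in a throwaway directory and shows what the `tr '-' '.'` step does to a version written with a hyphen. The `SANDBOX` location (from `mktemp -d`) and the variable names `STORED_VERSION` and `NORMALISED` are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch only: rehearses the layout in a temporary directory instead of /mnt/data.
UNIPROT_VERSION="2024.04"
SANDBOX="$(mktemp -d)"   # hypothetical scratch location, not the real data share

# Same three subfolders as the real setup, written as a loop instead of
# brace expansion so it also works in plain POSIX sh.
for sub in index suffix-array tables; do
    mkdir -p "$SANDBOX/uniprot-$UNIPROT_VERSION/$sub"
done

# tr replaces every '-' with '.', so a version typed as 2024-04 is stored as
# 2024.04; a version that already uses dots passes through unchanged.
echo "$UNIPROT_VERSION" | tr '-' '.' > "$SANDBOX/uniprot-$UNIPROT_VERSION/suffix-array/.version"

STORED_VERSION="$(cat "$SANDBOX/uniprot-$UNIPROT_VERSION/suffix-array/.version")"
NORMALISED="$(echo "2024-04" | tr '-' '.')"
echo "stored: $STORED_VERSION, normalised: $NORMALISED"
```

Both values come out as `2024.04`, which is why the `.version` file has the same content whichever separator was used when exporting `UNIPROT_VERSION`.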

Creating the input files

Important

This section needs to be executed on a Unipept server!

Clone the unipept-database repository

git clone https://github.com/unipept/unipept-database

Build all the binaries:

./unipept-database/scripts/build_binaries.sh

Start a new screen session:

screen

Run the build_database script:

sudo ./unipept-database/scripts/build_database.sh -i "/mnt/data/uniprot-$UNIPROT_VERSION/index" -d /mnt/data/tmp -m 2g suffix-array swissprot,trembl https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz,https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz "/mnt/data/uniprot-$UNIPROT_VERSION/tables"

Create the proteins.tsv and taxons.tsv files that are required to build/deploy the suffix array

lz4cat "/mnt/data/uniprot-$UNIPROT_VERSION/tables/uniprot_entries.tsv.lz4" | cut -f2,4,7,8 > "/mnt/data/uniprot-$UNIPROT_VERSION/suffix-array/proteins.tsv"
lz4cat "/mnt/data/uniprot-$UNIPROT_VERSION/tables/taxons.tsv.lz4" > "/mnt/data/uniprot-$UNIPROT_VERSION/suffix-array/taxons.tsv"
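To make the column selection concrete, here is a small sketch of what `cut -f2,4,7,8` does to one tab-separated row. The values `c1` to `c8` are placeholders and say nothing about what the real columns of `uniprot_entries.tsv` contain. The variable names are likewise chosen for illustration.

```shell
#!/bin/sh
# Sketch only: c1..c8 are placeholder values, not real UniProt fields.
SAMPLE_ROW="$(printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8')"

# cut splits on tabs by default; -f2,4,7,8 keeps those four columns in order.
SELECTED="$(printf '%s\n' "$SAMPLE_ROW" | cut -f2,4,7,8)"

# Count the tab-separated fields that remain.
FIELD_COUNT="$(printf '%s\n' "$SELECTED" | awk -F '\t' '{ print NF }')"
echo "$FIELD_COUNT columns: $SELECTED"
```

The same `awk -F '\t' '{ print NF }'` check can be applied to the first line of the generated proteins.tsv (for example via `head -n 1`) to confirm that it has four columns before the file is transferred.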

Moving the input files to the HPC VO

Important

This section needs to be executed on a Unipept server!

Set the HPC Virtual Organisation

export HPC_VO_LOCATION="/kyukon/data/gent/vo/000/gvo00038"

Copy the files to the VO data directory on the HPC

scp "/mnt/data/uniprot-$UNIPROT_VERSION/suffix-array/proteins.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"
scp "/mnt/data/uniprot-$UNIPROT_VERSION/suffix-array/taxons.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"
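Because the transfer can take a while, it may be worth confirming that both input files exist and are non-empty before starting it. The sketch below shows one way to do that. `check_inputs` is a hypothetical helper, not part of any Unipept repository, and the demonstration runs on throwaway files so it can be tried anywhere.

```shell
#!/bin/sh
# Sketch only: check_inputs is a hypothetical helper, not part of any Unipept repository.
# It returns 0 when every given file exists and is non-empty, 1 otherwise.
check_inputs() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}

# Demonstration on throwaway files so the sketch runs anywhere.
DEMO_DIR="$(mktemp -d)"
printf 'row\n' > "$DEMO_DIR/proteins.tsv"
: > "$DEMO_DIR/taxons.tsv"          # deliberately left empty

if check_inputs "$DEMO_DIR/proteins.tsv"; then FIRST=ok; else FIRST=bad; fi
if check_inputs "$DEMO_DIR/proteins.tsv" "$DEMO_DIR/taxons.tsv" 2>/dev/null; then SECOND=ok; else SECOND=bad; fi
echo "first: $FIRST, second: $SECOND"
```

On the server, the helper would be called with the two real paths under `/mnt/data/uniprot-$UNIPROT_VERSION/suffix-array/` and the `scp` commands run only if it succeeds.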

Running the PBS job

Clone the unipept-index repository

git clone https://github.com/unipept/unipept-index

Go to the root of the repository and update the submodules

cd unipept-index
git submodule update --init --recursive

Swap to the high-memory gallade cluster

module swap cluster/gallade

Submit the PBS script to start the process

VSC_DATA_VO=/kyukon/data/gent/vo/000/gvo00038 qsub sa-builder/build.pbs

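VSC_DATA_VO must contain the path to the virtual organisation's data directory, which is the same path that was stored in HPC_VO_LOCATION when the input files were transferred.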
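The `VSC_DATA_VO=... qsub ...` form relies on a standard shell feature: a variable assignment placed in front of a command is exported to that one command only. The sketch below illustrates this with `sh -c` standing in for `qsub`, an assumption made so that it runs without a PBS installation. How `build.pbs` itself consumes the variable is not shown here.

```shell
#!/bin/sh
# Sketch only: 'sh -c' stands in for qsub so this runs without a PBS installation.
unset VSC_DATA_VO

# The VAR=value prefix exports the variable to that single command only.
SEEN_BY_CHILD="$(VSC_DATA_VO=/kyukon/data/gent/vo/000/gvo00038 sh -c 'printf %s "$VSC_DATA_VO"')"

# Afterwards the variable is still unset in the calling shell.
SEEN_BY_PARENT="${VSC_DATA_VO:-unset}"

echo "child saw: $SEEN_BY_CHILD"
echo "parent has: $SEEN_BY_PARENT"
```

This means the assignment has to be repeated on every submission, unless the variable is already exported in the session.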

Troubleshooting

Error: attribute name space is experimental

error[E0658]: `#[diagnostic]` attribute name space is experimental
   --> /user/gent/437/vsc43736/.cargo/registry/src/index.crates.io-6f17d22bba15001f/axum-0.7.5/src/handler/mod.rs:130:5
    |
130 |     diagnostic::on_unimplemented(
    |     ^^^^^^^^^^
    |
    = note: see issue #111996 <https://github.com/rust-lang/rust/issues/111996> for more information
    = help: add `#![feature(diagnostic_namespace)]` to the crate attributes to enable

For more information about this error, try `rustc --explain E0658`.
error: could not compile `axum` (lib) due to previous error

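Solution: The installed Rust toolchain is too old for this release of the dependency. Downgrade the failing package (here `axum` 0.7.5) to a version that still compiles with the installed toolchain, for example with `cargo update -p axum --precise <version>`.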
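Whether a downgrade is needed can be checked in advance, since the `#[diagnostic]` attribute namespace was stabilised in Rust 1.78 and older compilers reject it with E0658. The following sketch compares a version string against that threshold. `supports_diagnostic_namespace` is a hypothetical helper, and the two version strings are illustrative inputs rather than versions observed on the cluster.

```shell
#!/bin/sh
# Sketch only: supports_diagnostic_namespace is a hypothetical helper.
# It succeeds when a "major.minor[.patch]" Rust version is at least 1.78.
supports_diagnostic_namespace() {
    major="${1%%.*}"      # text before the first dot
    rest="${1#*.}"        # text after the first dot
    minor="${rest%%.*}"   # text between the first and second dot
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 78 ]; }
}

# Illustrative inputs, not versions observed on the cluster.
if supports_diagnostic_namespace "1.77.2"; then OLD=yes; else OLD=no; fi
if supports_diagnostic_namespace "1.78.0"; then NEW=yes; else NEW=no; fi
echo "1.77.2: $OLD, 1.78.0: $NEW"
```

With a real toolchain, the argument could be taken from `rustc --version | cut -d' ' -f2`. If the check fails, downgrading the package is the way to keep building with the compiler that is available.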