This is the Sensitive Data Archive (SDA) Command Line Interface (sda-cli). This tool was created to unify and simplify the tools needed to perform the most common user actions in the SDA.
This tool can be used to encrypt and upload data when submitting to the archive, and to download and decrypt with retrieving data from the archive.
It is recommended to use precompiled executables for sda-cli
which can be found at https://github.com/NBISweden/sda-cli/releases
To get help on the usage of the tool, please use the following command
./sda-cli help
The main functionalities implemented in this tool are explained in the following sections.
If any command seems a bit puzzling, you can find guidance on how to interpret it right here and here.
The files stored in the SDA/BP archive are encrypted using the crypt4gh standard. The following sections explain how to encrypt and upload files to the archive.
The files that are uploaded to the SDA/BP services, need to be encrypted with the correct public key. Depending on the service you want to use, this key can be downloaded using this command if you are uploading to the SDA:
wget https://raw.githubusercontent.com/NBISweden/EGA-SE-user-docs/main/crypt4gh_key.pub
or this command if you are uploading to Big Picture:
wget https://raw.githubusercontent.com/NBISweden/EGA-SE-user-docs/main/crypt4gh_bp_key.pub
Now that the public key is downloaded, the file(s) can be encrypted using the binary file created in the first step of this guide. To encrypt a specific file, use the following command:
./sda-cli encrypt [-key <public_key>] <file_to_encrypt>
where <public_key>
the key downloaded in the previous step. The tool also allows for encrypting multiple files at once, by listing them separated with space like:
./sda-cli encrypt -key <public_key> <file_1_to_encrypt> <file_2_to_encrypt> <file_3_to_encrypt>
This command comes with the -continue
option, which will continue encrypting files, even if one of them fails. To enable this feature, the command should be executed with the -continue=true
option.
If no public key is provided, the tool will look for it from a previous login session.
To encrypt files with more than one public keys, repeatedly use the -key
flag, e.g.
./sda-cli encrypt -key <public_key1> -key <public_key2> <file_to_encrypt>
will encrypt a file using two keys so that it can be decrypted with either of the corresponding private keys. Encryption with more than two keys is possible, as well. Another option is to provide as argument to -key
a file with concatenated public keys generated e.g. from a command like
cat <pub_key1> <pub_key2> > <concatenated_pub_keys>
Passing a combination of the above arguments is allowed, as well:
./sda-cli encrypt -key <concatenated_public_keys> -key <public_key3> <file_to_encrypt>
Note: The encrypt
command will create four files containing hashes (both md5 and sha256) for the encrypted and unencrypted files, respectively.
Developers' Notes: The tool is creating a key pair when encrypting the files. This key pair is temporary for security reasons.
Once your files are encrypted, they are ready to be submitted to the SDA/BP archive. The s3 storage requires users to be authenticated, therefore a configuration file needs to be downloaded before starting the uploading of the files.
The configuration file can be downloaded by logging in with a Life Science RI account:
- For BigPicture use https://login.bp.nbis.se/
Follow the dialogue to get authenticated and then click on Download inbox s3cmd credentials
to download the configuration file named s3cmd.conf. The configuration file should be placed in the root folder of the repository.
Alternatively, you can download the configuration file using the login command.
The access token used to authenticate to the inbox can be supplied in 3 different ways; in the config file, as an ENV ACCESSTOKEN
or on the command line -accessToken <access-token>
as part of the upload command.
The Priority order for the access token is:
- Using the
-accessToken
flag - From the ENV
ACCESSTOKEN
- In the config file
Now that the configuration file is downloaded, the file(s) can be uploaded to the archive using the binary file created in the first step of this guide. To upload a specific file, use the following command:
./sda-cli upload -config <configuration_file> <encrypted_file_to_upload>
where <configuration_file>
the file downloaded in the previous step and <encrypted_file_to_upload>
a file encrypted in the earlier steps. The tool also allows for uploading multiple files at once, by listing them separated with space like:
./sda-cli upload -config <configuration_file> <encrypted_file_1_to_upload> <encrypted_file_2_to_upload>
Note that the files will be uploaded in the base folder of the user.
One can also upload entire directories recursively, i.e. including all contained files and folders while keeping the local folder structure. This can be achieved with the -r
flag, e.g. running:
./sda-cli upload -config <configuration_file> -r <folder_to_upload>
will upload <folder_to_upload>
as is, i.e. with the same inner folder and file structure as the local one, to the archive.
It is also possible to specify multiple directories and files for upload with the same command. For example,
./sda-cli upload -config <configuration_file> -r <encrypted_file_1_to_upload> <encrypted_file_2_to_upload> <folder_1_to_upload> <folder_2_to_upload>
However, if -r
is omitted in the above, any folders will be skipped during upload.
The user can specify a different path for uploading files/folders with the -targetDir
flag followed by the name of the folder. For example, the command:
./sda-cli upload -config <configuration_file> -r <encrypted_file_1_to_upload> <folder_1_to_upload> -targetDir <upload_folder>
will create <upload_folder>
under the user's base folder with contents <upload_folder>/<encrypted_file_1_to_upload>
and <upload_folder>/<folder_1_to_upload>
. Note that the given <upload_folder>
may well be a folder path, e.g. <folder1/folder2>
, and in this case <encrypted_file_1_to_upload>
will be uploaded to folder1/folder2/<encrypted_file_1_to_upload>
.
As a side note it is possible to include all the contents of a directory with /.
, for example,
./sda-cli upload -config <configuration_file> -r <folder_to_upload>/. -targetDir <new_folder_name>
will upload all contents of <folder_to_upload>
to <new_folder_name>
recursively, effectively renaming <folder_to_upload>
upon upload to the archive.
It is possible to combine the encryption and upload steps with the use of the flag --encrypt-with-key
followed by the path of the crypt4gh public key to be used for encryption. In this case, the input list of file arguments can only contain unencrypted source files. For example the following,
./sda-cli upload -config <configuration_file> --encrypt-with-key <public_key> <unencrypted_file_to_upload>
will encrypt <unencrypted_file_to_upload>
using <public_key>
as public key and upload the created <file_to_upload.c4gh>
in the base folder of the user.
Encrypt on upload can be combined with any of the flags above. For example,
./sda-cli upload -config <configuration_file> --encrypt-with-key <public_key> -r <folder_to_upload_with_unencrypted_data> -targetDir <new_folder_name>
will first encrypt all files in <folder_to_upload_with_unencrypted_data>
and then upload the folder recursively (selecting only the created c4gh
files) under <new_folder_name>
.
Notes: The tool calls the encrypt module internally, therefore similar behavior to that command is expected, including the creation of hash files. In addition,
- For encryption with multiple public keys, concatenate all public keys into one file and pass it as the argument to
encrypt-with-key
. - If the input includes encrypted files, the tool will exit without performing further tasks.
- The encrypted files will be created next to their unencrypted counterparts.
- The tool will not overwrite existing encrypted files. It will exit early if encrypted counterparts of the source files already exist with the same source path.
- If the flag
--force-overwrite
is used, the tool will overwrite any already existing file. - The cli will exit if the input has any un-encrypred files. To override that, use the flag
--force-unencrypted
.
All the following cases require a configuration file to be downloaded before starting the operation, and the list
parameter should be used.
This feature returns all the files in the user's inbox recursively and can be executed using:
./sda-cli list [-config <configuration_file>]
It also allows for requesting files/filepaths with a specified prefix using:
./sda-cli list [-config <configuration_file>] <prefix>
This command will return any file/path starting with the defined <prefix>
.
If no config is given by the user, the tool will look for a previous login from the user.
To list datasets or the files within a dataset that the user has access to, the --datasets
flag should be used and the login URL must be provided:
./sda-cli list -config <configuration_file> --datasets --url <login_url> (--bytes)
where <login_url>
is the URL where the user goes for authentication. This command returns a list of accessible datasets, including the number
of files in each dataset and the total size of it. The --bytes
flag is optional and it will display the size of the dataset in bytes.
To list the files within a particular dataset, the user must provide the dataset ID (which can be obtained by executing the previous command):
./sda-cli list -config <configuration_file> -dataset <datasetID> -url <login_url> (--bytes)
This command returns a list of files within the dataset, including their file IDs, file sizes and file paths. The --bytes
flag is optional and it will display the size of the files in bytes.
Files and datasets can be downloaded using the download
parameter.
This utilizes the Download API which enables secure downloads from the SDA/BP archive.
The download API allows for downloading files from the archive and it requires the user to have access to the dataset, therefore a configuration file needs to be downloaded before starting the downloading of the files. For downloading files the user also needs to know the download service URL and the dataset ID. The user has several options for downloading:
- specific files by using their paths
- specific files by using their file ids
- multiple files recursively
- multiple files by providing a text file with the paths of the files to download
- all the files of the dataset
- specific encrypted files
For downloading one specific file the user needs to provide the path or the id (the id should NOT have "/") of this file by running the command below:
./sda-cli download -config <configuration_file> -dataset-id <datasetID> -url <download-service-URL> [<filepath> or <fileid>]
where <configuration_file>
the file downloaded in the previous step, <dataset_id>
the ID of the dataset and <filepath>
the path of the file (or <fileid>
the id of the file) in the dataset.
The tool also allows for downloading multiple files at once, by listing their filepaths (or file ids) separated with space and it also allows for selecting a folder where the files will be downloaded, using the outdir
argument:
./sda-cli download -config <configuration_file> -dataset-id <datasetID> -url <download-service-url> -outdir <outdir> <path/to/file1> <other/path/to/file2> ... (or <fileID_1> <fileID_2> ...)
For downloading the content of a folder (including subfolders) the user need to add the --recursive
flag followed by the path(s) of the folder(s):
./sda-cli download -config <configuration_file> -dataset-id <datasetID> -url <download-service-url> -outdir <outdir> --recursive path/to/folder1 path/to/folder2 ...
For downloading multiple files the user can provide a text file with the paths of the files to download. The file should contain one path per line.
In this case user needs to use the --from-file
flag and at the end user needs to provide the path of the text file with the paths of the files to download:
./sda-cli download -config <configuration_file> -dataset-id <datasetID> -url <download-service-url> -outdir <outdir> --from-file <path/to/text_file>
For downloading the whole dataset the user needs add the --dataset
flag and NOT providing any filepaths:
./sda-cli download -config <configuration_file> -dataset-id <datasetID> -url <download-service-url> -outdir <outdir> --dataset
where the dataset will be downloaded in the <outdir>
directory be keeping the original folder structure of the dataset.
When a public key is provided, you can download files that are encrypted on the server-side with that public key. The command is similar to downloading the unencrypted files except that a public key is provided through the -pubkey
flag. For example:
./sda-cli download -pubkey <public-key-file> -config <configuration_file> -dataset-id <datasetID> -url <download-service-url> -outdir <outdir> <filepath_1_to_download> <filepath_2_to_download> ...
After a successful download, the encrypted files can be decrypted using the private key corresponding to the provided public key.
Given that the instructions in the download section have been followed, the key pair and the data files should be stored in some location. The last step is to decrypt the files in order to access their content. That can be achieved using the following command:
./sda-cli decrypt -key <keypair_name>.sec.pem <file_to_decrypt>
where <keypair_name>.sec.pem
is the private key created in the relevant section and <file_to_decrypt>
is one of the files downloaded following the instructions of the download section. The --force-overwrite
flag can be used to over-write the decrypted file if it exists and the --clean
flag to delete the encrypted file after successful decryption.
You can login to download the configuration file needed for some of the the tool's operation using the login command:
./sda-cli login <login_target>
where login_target
is the URL to the sda-auth
service from the sensitive-data-archive project.
This will open a link for the user where they can go and log in.
After the login is complete, a configuration file will be created in the tool's directory with the name of .sda-cli-session
You can get the current version of the sda-cli by running:
./sda-cli version
You can download a (partial) file using the htsget server. You can define the name of the file you want to save the result to by using the output
variable. If output
is not defined, the file will keep it's original name. If the file already exists, it will not be over-written unless you use the --force-overwrite
flag. filename
, host
, and key
are required. reference
refers to the part of the file you wish to download.
htsget -dataset <datasetID> -filename <filename> -reference <reference-number> -host <htsget-hostname> -pubkey <public-key-file> -output <file>
for example:
./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename htsnexus_test_NA12878 -reference 11 -host http://localhost:8088 -pubkey testing/c4gh.pub.pem -output ~/tmp/result.c4gh --force-overwrite
This section contains the information required to install, modify and run the sda-cli
tool.
The sda-cli
is written in golang. In order to be able to modify, build and run the tool, golang (>= 1.22) needs to be installed. The instructions for installing go
can be found here.
To build the sda-cli
tool run the following command from the root folder of the repository
go build
This command will create an executable file in the root folder, named sda-cli
.
The github actions include a release workflow that builds binaries for different operating systems. In order to create a new release, create a tag either using the graphical interface or through the command line. That should trigger the creation of a release with the latest code of the specified branch.
In order for the automatic release to get triggered, the releases should be of the format vX.X.X
, e.g. v1.0.0
.
Before pushing a change to the releaser, make sure to run the check for the configuration file, running:
goreleaser check -f .goreleaser.yaml