Merge pull request #59 from phac-nml/v1.1.3
SISTR v1.1.3 release
kbessonov1984 authored Nov 26, 2024
2 parents 5aad179 + 1a02167 commit 089dc4a
Showing 19 changed files with 85,554 additions and 81 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/github-actions.yaml
@@ -0,0 +1,37 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python application

on:
  push:
    branches: [ "master", "v1.1.3" ]
  pull_request:
    branches: [ "master", "v1.1.3" ]

permissions:
  contents: read

jobs:
  build:

    runs-on: ubuntu-22.04

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.10
      uses: actions/setup-python@v4
      with:
        python-version: "3.10"
    - name: Install dependencies
      run: |
        sudo apt-get update
        sudo apt-get install mash ncbi-blast+ libssl-dev libcurl4-openssl-dev mafft libssl-dev ca-certificates -y
        sudo apt-get install python3-pip python3-dev python3-biopython -y
        python3 -m pip install --upgrade pip setuptools
        pip3 install pytest fastcluster openpyxl pycurl pandas scipy "numpy<2"
        python setup.py install
        sistr_init
    - name: Test with pytest
      run: |
        pytest -o log_cli=true --basetemp=tmp-pytest
124 changes: 124 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,127 @@
# 1.1.3

Serovar nomenclature update following the USA cantaloupe outbreaks in November 2023. The O24 and O25 antigens cannot be typed reliably in the wet lab, so certain serovar pairs were collapsed (Table 1). For each pair, SISTR reports the selected serovar and drops the other one. O24 and O25 are no longer reported in the antigenic formula (Table 2).

<h3>Table 1 - serovar pairs that were collapsed</h3>

|Serovar pair | Serovar selected in v1.1.3 |
|------------------------|----------------------------|
|Soahanina - Sundsvall | Sundsvall |
|Martonos - Finkenwerder | Finkenwerder |
|Midway - Florida | Florida |
|Lindern - Charity | Charity |
|Bahrenfeld - Onderstepoort | Onderstepoort |
|Schalkwijk - Moussoro | Schalkwijk |
|Amberg - Boecker | Boecker |
| Carrau - Madelia | Carrau |
| Chichiri - Uzaramo | Uzaramo |
| Poano - Stafford | Poano |
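
As an illustration only, Table 1 can be expressed as a lookup from the dropped serovar to the serovar reported in v1.1.3. The names `COLLAPSED_SEROVARS` and `reported_serovar` are hypothetical and are not part of SISTR; the pairs themselves are transcribed from the table above.

```python
# Dropped serovar -> serovar reported by SISTR v1.1.3 (transcribed from Table 1).
# Hypothetical helper for illustration; not SISTR's own code.
COLLAPSED_SEROVARS = {
    "Soahanina": "Sundsvall",
    "Martonos": "Finkenwerder",
    "Midway": "Florida",
    "Lindern": "Charity",
    "Bahrenfeld": "Onderstepoort",
    "Moussoro": "Schalkwijk",
    "Amberg": "Boecker",
    "Madelia": "Carrau",
    "Chichiri": "Uzaramo",
    "Stafford": "Poano",
}


def reported_serovar(serovar: str) -> str:
    """Return the v1.1.3 name for a serovar, or the input unchanged."""
    return COLLAPSED_SEROVARS.get(serovar, serovar)
```

A lookup like this may help when reconciling results produced by older SISTR versions with v1.1.3 output.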

### Changes of serovar assignments in `sistr/data/genomes-to-serovar.txt` file

|genome accession | serovar previous | serovar current |
|-----------------|------------------|-----------------|
| SAL_DA9822AA | Soahanina |Sundsvall
| SRR1815423 | Soahanina | Sundsvall
| SRR2889947 | Soahanina | Sundsvall
| SRR2889992 | Soahanina | Sundsvall
| SRR3996854 |Soahanina | Sundsvall
| SRR3669910 |Soahanina | Sundsvall
| SRR3732330 |Soahanina | Sundsvall
|SRR3713652 | Soahanina | Sundsvall
| SRR3713653 |Soahanina | Sundsvall
|SRR3978444 |Soahanina | Sundsvall
| SRR2011392 |Soahanina | Sundsvall
| SRR1068363 |Soahanina | Sundsvall
|ERR161888 |Soahanina | Sundsvall
|SAL_BA5034AA |Soahanina | Sundsvall
|SAL_EA3233AA |Soahanina | Sundsvall
|SAL_GA9094AA |Soahanina | Sundsvall
|SRR1158155 |Soahanina | Sundsvall
|SRR2751907 |Soahanina | Sundsvall
|SRR4237685 |Soahanina | Sundsvall
|SRR5010548 |Soahanina | Sundsvall
|09_6055 |Madelia | Carrau
|11_0879 |Madelia | Carrau
|SAL_BA1830AA |Madelia | Carrau
|SAL_CA7979AA |Madelia | Carrau
|SAL_DA4289AA |Madelia | Carrau
|SAL_DA7475AA |Madelia | Carrau
|SAL_EA4948AA |Madelia | Carrau
|SAL_FA5821AA |Madelia | Carrau
|SAL_HA4780AA |Madelia | Carrau
|SAL_HA4886AA |Madelia | Carrau
|SRR1269415 |Madelia | Carrau
|SRR1548430 |Madelia | Carrau
|SRR1805645| Madelia | Carrau
|SRR2104612 |Madelia | Carrau
|SRR2911800 |Madelia | Carrau
|SRR3933147 |Madelia | Carrau
|SRR4098716 |Madelia | Carrau
|SRR1258654 |Madelia | Carrau
|SRR1582141 |Madelia | Carrau
|SRR4019409 |Madelia | Carrau
|SRR4244476 |Madelia | Carrau
|SRR2075023 |Madelia | Carrau
|SRR5132365 |Madelia | Carrau
|SRR5051381 |Madelia | Carrau
|SRR3743984 |Madelia | Carrau
|SRR5054238 |Madelia | Carrau
|SRR3928735 |Madelia | Carrau
|SRR1586586 |Madelia | Carrau
|SRR2976043 |Madelia | Carrau
|SRR2962333 |Madelia | Carrau
|SRR3928732 |Madelia | Carrau
|SRR3928736 |Madelia | Carrau
|SRR2962332 |Madelia | Carrau
|SAL_EA2874AA |Bahrenfeld | Onderstepoort
|SAL_FA0525AA |Bahrenfeld | Onderstepoort
|SRR3173783 |Bahrenfeld | Onderstepoort
|SAL_DA7014AA |Martonos | Finkenwerder
|SRR1300569 |Martonos | Finkenwerder
|SRR1973814 |Martonos | Finkenwerder

### Changes to `Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv` antigen to serovar lookup database
Removed the following entries:
1. `Martonos,"6,14,24",d,"1,5",,H,FALSE,enterica`
2. `Midway,"6,14,24",d,"1,7",,H,FALSE,enterica`
3. `Lindern,"6,14,[24]",d,"e,n,x",,H,FALSE,enterica`
4. `Bahrenfeld,"6,14,[24]","e,h","1,5",,H,FALSE,enterica`
5. `Moussoro,"1,6,14,25",i,"e,n,z15",,H,FALSE,enterica`
6. `Amberg,"6,14,24","l,v","1,7",,H,FALSE,enterica`
7. `Madelia,"1,6,14,25",y,"1,7",,H,FALSE,enterica`
8. `Soahanina,"6,14,24",z,"e,n,x",,H,FALSE,enterica`
9. `Chichiri,"6,14,24","z4,z24",-,,H,TRUE,enterica`
10. `II 4:a:z39,"1,4,12,[27]",a,z39,,B,FALSE,salamae`

The `O_antigen` field of the following entries was modified as shown in Table 2:

<h3>Table 2 - updated antigenic formulas for the O24-25 serovars</h3>

| Before | After (SISTR v1.1.3)|
|--------|-------|
|Sundsvall,"[1],6,14,[<b>25</b>]",z,"e,n,x",,H,FALSE,enterica| Sundsvall,"6,14",z,"e,n,x",,H,FALSE,enterica |
|Finkenwerder,"[1],6,14,[<b>25</b>]",d,"1,5",,H,FALSE,enterica | Finkenwerder,"6,14",d,"1,5",,H,FALSE,enterica |
|Florida,"[1],6,14,[<b>25</b>]",d,"1,7",,H,FALSE,enterica | Florida,"6,14",d,"1,7",,H,FALSE,enterica |
| Charity,"[1],6,14,[<b>25</b>]",d,"e,n,x",,H,FALSE,enterica | Charity,"6,14",d,"e,n,x",,H,FALSE,enterica |
| Onderstepoort,"1,6,14,[<b>25</b>]","e,h","1,5",,H,FALSE,enterica | Onderstepoort,"6,14","e,h","1,5",,H,FALSE,enterica |
| Schalkwijk,"6,14,[<b>24</b>]",i,"e,n,z15",,H,FALSE,enterica | Schalkwijk,"6,14",i,"e,n,z15",,H,FALSE,enterica |
| Boecker,"[1],6,14,[<b>25</b>]","l,v","1,7",,H,FALSE,enterica |Boecker,"6,14","l,v","1,7",,H,FALSE,enterica |
| Carrau,"6,14,[<b>24</b>]",y,"1,7",,H,FALSE,enterica | Carrau,"6,14",y,"1,7",,H,FALSE,enterica |
| Uzaramo,"1,6,14,<b>25</b>","z4,z24",-,,H,TRUE,enterica | Uzaramo,"6,14","z4,z24",-,,H,TRUE,enterica |
| Poano,"[1],6,14,[<b>25</b>]",z,"l,z13,z28",,H,FALSE,enterica | Poano,"6,14",z,"l,z13,z28",,H,FALSE,enterica |
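
The Before/After pairs in Table 2 can be reproduced with a small string transformation, sketched below. This is an assumption-laden illustration, not the procedure used to edit the lookup table: `simplify_o_antigen` is a hypothetical name, and the rule (drop factors 1, 24 and 25, bracketed or not) is only inferred from the ten rows above. It would be wrong for other serovars, for example it would strip the `1` from `1,4,[5],12`.

```python
# Factors removed in every Table 2 row (inferred from the table, not from SISTR code).
DROPPED_FACTORS = {"1", "24", "25"}


def simplify_o_antigen(o_antigen: str) -> str:
    """Drop O factors 1, 24 and 25 (optional or not) from an O antigen string."""
    kept = []
    for factor in o_antigen.split(","):
        factor = factor.strip()
        bare = factor.strip("[]")  # "[25]" -> "25"
        if bare not in DROPPED_FACTORS:
            kept.append(factor)
    return ",".join(kept)
```

The input is the plain `O_antigen` value, without the `<b>` markup used for emphasis in the table.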

### New output field `antigenic_formula`
- Added `antigenic_formula` field that aggregates the O, H1 and H2 antigen values in a single location for convenience
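
A minimal sketch of what such an aggregation could look like is given below. The colon-separated `O:H1:H2` layout is the conventional way of writing a *Salmonella* antigenic formula and is an assumption here; the changelog does not state the exact format SISTR emits, and `antigenic_formula` as a function name is hypothetical.

```python
def antigenic_formula(o_antigen: str, h1: str, h2: str) -> str:
    """Join the O, H1 and H2 antigen values in the conventional O:H1:H2 order.

    The separator is an assumption; check real SISTR output before relying on it.
    """
    return ":".join([o_antigen, h1, h2])
```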

### New argument `--list-of-serovars`
- Added `--list-of-serovars` option that lets the user provide a single-column text file listing the serovars of interest to match against the SISTR prediction. The result is reported in the `predicted_serovar_in_list` field as `Y` if there is a match and `N` otherwise. This is useful when only certain serovars may be reported
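
The check described above amounts to a set membership test. The sketch below is illustrative and does not reproduce SISTR's implementation: the function names are hypothetical, and the handling of blank lines, surrounding whitespace and letter case is an assumption.

```python
import io


def load_serovar_list(handle):
    """Read one serovar per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in handle if line.strip()}


def predicted_serovar_in_list(predicted_serovar, serovars_of_interest):
    """Return "Y" if the predicted serovar is in the list, otherwise "N"."""
    return "Y" if predicted_serovar in serovars_of_interest else "N"


# Stand-in for a single-column text file passed via --list-of-serovars.
serovar_file = io.StringIO("Typhimurium\nEnteritidis\n\nCarrau\n")
serovars = load_serovar_list(serovar_file)
```

With this example list, a prediction of `Carrau` would be flagged `Y` and a prediction of `Madelia` would be flagged `N`.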

### New d-tartrate message for `Paratyphi B`, `Paratyphi B var. Java` and `I 1,4,[5],12:i:-` serovars
- If the Paratyphi B or Paratyphi B var. Java serovar is predicted and the `--qc` option is selected, the following message appears in the `qc_messages` field: `Perform d-tartrate test (dT) to differentiate between Paratyphi B and Paratyphi B var. Java. The dT+ result is indicative of variant Java.`
- If the monophasic `I 1,4,[5],12:i:-` serovar is predicted, the `qc_messages` field suggests a d-tartrate test via this message:
`Perform d-tartrate test (dT) as both dT+ and dT- I 1,4,[5],12:i:- subtypes exist.`
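
The two rules above can be summarised as a lookup from predicted serovar to message. This is a hedged sketch rather than SISTR's code: `d_tartrate_message` is a hypothetical name, and exact string matching on the serovar is an assumption. The message texts are copied from the entries above.

```python
PARATYPHI_B_MESSAGE = (
    "Perform d-tartrate test (dT) to differentiate between Paratyphi B and "
    "Paratyphi B var. Java. The dT+ result is indicative of variant Java."
)
MONOPHASIC_MESSAGE = (
    "Perform d-tartrate test (dT) as both dT+ and dT- "
    "I 1,4,[5],12:i:- subtypes exist."
)


def d_tartrate_message(serovar):
    """Return the d-tartrate QC message for a serovar, or None if none applies."""
    if serovar in ("Paratyphi B", "Paratyphi B var. Java"):
        return PARATYPHI_B_MESSAGE
    if serovar == "I 1,4,[5],12:i:-":
        return MONOPHASIC_MESSAGE
    return None
```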

# 1.1.1

* Fixed issue with sorting of BLAST results (causing cgMLST types to be different between BLAST versions). Pull request #43.
103 changes: 61 additions & 42 deletions README.rst
@@ -16,15 +16,15 @@
:target: https://sistr-app.onrender.com/
:alt: web app deployed on Render.com Cloud Hosting

Serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.
*Salmonella* serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.
`Mash MinHash <https://mash.readthedocs.io/en/latest/>`_ can also be used for serovar prediction.

.. epigraph::

`Latest stable version <https://github.com/phac-nml/sistr_cmd/releases/latest>`_


*Don't want to use a command-line app?* Try the `SISTR web app <https://github.com/phac-nml/sistr_cmd#web-application>`_ deployed on Galaxy and Render.com platforms
*Don't want to use a command-line app?* Try the SISTR web interface deployed on the Galaxy and Render.com online platforms (see the `Web application`_ section)


Citation
@@ -64,7 +64,7 @@ Installation
============

Using Conda [Recommended]
-----------
---------------------------

You can install ``sistr_cmd`` using `Conda <https://conda.io/miniconda.html>`_ from the `BioConda channel <https://bioconda.github.io/>`_:

@@ -115,10 +115,19 @@ SISTR can be publically accessed as a web application via:

- Galaxy EU instance at https://usegalaxy.eu/root?tool_id=sistr_cmd |galaxy|
- Render.com Cloud Hosting Platform-as-a-Service (PaaS) hosts a **DEMO** SISTR web application https://sistr-app.onrender.com/ |render|
**NOTE:** The SISTR web application hosted on Render.com might take up to 20 seconds to load on the first run and will shutdown after 15 min of inactivity

SISTR web application source code is available at https://github.com/phac-nml/sistr-web-app allowing easy web interface deployment on any infrastructure types (on-premises, cloud/remote).
**NOTE 1:** The SISTR web application hosted on Render.com might take up to 20 seconds to load on the first run and will shut down after 15 min of inactivity

**NOTE 2:** The SISTR web application source code is available at https://github.com/phac-nml/sistr-web-app, allowing easy web interface deployment on any infrastructure type (on-premises, cloud/remote).


Database
=========
SISTR automatically initializes its database of *Salmonella* serovar determination antigens, cgMLST profiles and the Mash sketch of reference genomes by downloading it from a remote location.
The SISTR v1.3 database received minor updates that collapse some of the serovars with O24/O25 antigens, as detailed in the `CHANGELOG.md <CHANGELOG.md>`_ file

- SISTR v1.1 database is available at https://zenodo.org/records/13618515 or via a direct URL https://zenodo.org/records/13618515/files/SISTR_V_1.1_db.tar.gz?download=1 (used with SISTR < 1.1.3)
- SISTR v1.3 database is available at https://zenodo.org/records/13693495 or via a direct URL https://zenodo.org/records/13693495/files/SISTR_V_1.1.3_db.tar.gz?download=1 (used with SISTR >= 1.1.3)
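
The version rule in the two bullets above can be sketched as a small helper that picks the archive URL for a given SISTR version. The function names are hypothetical and this is not how ``sistr_init`` is implemented; only the two URLs and the ``1.1.3`` cut-off come from the text above, and the version strings are assumed to be dot-separated integers.

```python
DB_URL_BEFORE_1_1_3 = (
    "https://zenodo.org/records/13618515/files/SISTR_V_1.1_db.tar.gz?download=1"
)
DB_URL_FROM_1_1_3 = (
    "https://zenodo.org/records/13693495/files/SISTR_V_1.1.3_db.tar.gz?download=1"
)


def parse_version(version):
    """Turn a dotted version such as "1.1.3" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def database_url(sistr_version):
    """Pick the database archive matching the SISTR version (cut-off at 1.1.3)."""
    if parse_version(sistr_version) >= (1, 1, 3):
        return DB_URL_FROM_1_1_3
    return DB_URL_BEFORE_1_1_3
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexicographically.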


Dependencies
@@ -129,7 +138,7 @@ These are the external dependencies required for ``sistr_cmd``:
- Python (>= v2.7 OR >= v3.4)
- BLAST+ (>= v2.2.30)
- MAFFT (>=v7.271 (2016/1/6))
- `Mash v1.0+ <https://github.com/marbl/Mash/releases>`_ [optional]
- `Mash v2.0+ <https://github.com/marbl/Mash/releases>`_ [optional]

Python Dependencies
-------------------
@@ -167,7 +176,7 @@ If you run ``sistr -h``, you should see the following usage info:
Serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.
Note about using the "--use-full-cgmlst-db" flag:
The "centroid" allele database is ~10% the size of the full set so analysis is much quicker with the "centroid" vs "full" set of alleles. Results between 2 cgMLST allele sets should not differ.
The "centroid" allele database is ~10% the size of the full set so analysis is much quicker with the "centroid" vs "full" set of alleles. Results between 2 cgMLST allele sets should not differ.
If you find this program useful in your research, please cite as:
@@ -210,12 +219,16 @@ If you run ``sistr -h``, you should see the following usage info:
serovar prediction results.
-t THREADS, --threads THREADS
Number of parallel threads to run sistr_cmd analysis.
-l LIST_OF_SEROVARS, --list-of-serovars LIST_OF_SEROVARS
A path to a single column text file containing list of
serovar(s) to check serovar prediction against. Result
reported in the "predicted_serovar_in_list"
field as Y (present) or N (absent) value.
-v, --verbose Logging verbosity level (-v == show warnings; -vvv ==
show debug info)
-V, --version show program's version number and exit
Example Usage
-------------

@@ -279,32 +292,11 @@ Summary of output options:


Primary results output (``-o sistr-results``)
------------------------------------------

Tab-delimited results output (``-f tab``):

.. code-block:: tab
cgmlst_ST cgmlst_distance cgmlst_genome_match cgmlst_matching_alleles cgmlst_subspecies fasta_filepath genome h1 h2 o_antigen qc_messages qc_status serogroup serovar serovar_antigen serovar_cgmlst
660408169 0.00909090909091 LT2 327 enterica /home/peter/Downloads/sistr-LT2-example/LT2.fasta LT2 i 1,2 1,4,[5],12 PASS B Typhimurium Typhimurium Typhimurium
CSV results output (``-f csv``):

.. code-block:: csv
cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst
660408169,0.00909090909091,LT2,327,enterica,/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium
How the results should look in a table:

.. csv-table::

cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst
660408169,0.00909090909091,LT2,327,enterica,/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium


JSON results output:
---------------------------------------------
SISTR supports various text output formats specified by the ``-f`` option with ``json`` being the default.

JSON results output (``-f json``):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: json
[
@@ -328,6 +320,32 @@ JSON results output:
}
]
Tab-delimited results output (``-f tab``):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text
cgmlst_ST cgmlst_distance cgmlst_genome_match cgmlst_matching_alleles cgmlst_subspecies fasta_filepath genome h1 h2 o_antigen qc_messages qc_status serogroup serovar serovar_antigen serovar_cgmlst
660408169 0.00909090909091 LT2 327 enterica /home/peter/Downloads/sistr-LT2-example/LT2.fasta LT2 i 1,2 1,4,[5],12 PASS B Typhimurium Typhimurium Typhimurium
CSV results output (``-f csv``):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Raw ``csv`` output results opened in a text editor

.. code-block:: csv
cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst
660408169,0.00909090909091,LT2,327,enterica,/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium
The same ``csv`` results rendered as a table

.. csv-table::

cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst
660408169,0.00909090909091,LT2,327,enterica,/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium
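
For downstream scripting, the ``csv`` output can be read with Python's standard ``csv`` module, which handles the quoted antigen fields that themselves contain commas. The sketch below parses the example row shown above from an in-memory string; with a real run you would open the results file instead, and the variable names are illustrative only.

```python
import csv
import io

# The example CSV output from above, held in memory for illustration.
CSV_OUTPUT = (
    'cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,'
    'cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,'
    'qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst\n'
    '660408169,0.00909090909091,LT2,327,enterica,'
    '/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2",'
    '"1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium\n'
)

# DictReader maps each row to the header names, keeping "1,4,[5],12" intact.
rows = list(csv.DictReader(io.StringIO(CSV_OUTPUT)))
first = rows[0]
```

Splitting lines on commas by hand would break the ``h2`` and ``o_antigen`` values, which is why a CSV-aware reader is used here.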


cgMLST allele search results
-------------------------------------

@@ -337,7 +355,7 @@ These results may be useful for understanding unexpected or low confidence serov
Schema:
~~~~~~~

.. code-block:: json
.. code-block:: text
{
<genome name>: {
@@ -414,14 +432,15 @@ Schema:
"seq": string
}
}}
}
}
Example:
~~~~~~~~

Here's some truncated example allele search results output:
Here are some truncated example allele search results in JSON format for the ``LT2`` sample:

.. code-block:: json
.. code-block:: text
{
"LT2": {
@@ -472,7 +491,7 @@ cgMLST allelic profiles output (``--cgmlst-profiles cgmlst-profiles.csv``)
--------------------------------------------------------------------------

With the ``-p``/``--cgmlst-profiles`` commandline argument, you can output the 330 loci cgMLST allelic profiles for your input genomes (i.e. the allele designation for each cgMLST locus for each input genome).
You can use this information to construct phylogenetic trees from this data using a tool such as `Phyloviz Online <https://online.phyloviz.net/index>`_.
You can use this information to construct phylogenetic trees from this data using a tool such as `Phyloviz Online <https://online.phyloviz.net/index>`_ by uploading cgMLST profiles data.
This type of analysis may be useful to explore why unexpected serovar prediction results were generated (e.g. your genomes are genetically very different from each other).

Example truncated cgMLST profiles output:
@@ -485,13 +504,13 @@ Example truncated cgMLST profiles output:


QC by ``sistr_cmd`` (``--qc``)
-------------------
------------------------------

If you are running ``sistr_cmd`` with the ``--qc`` commandline argument, ``sistr_cmd`` will run some basic QC to determine the level of confidence in the serovar prediction.

The ``qc_status`` field should contain a value of ``PASS`` if your genome passes all QC checks, otherwise, it will be ``WARNING`` or ``FAIL`` if there are issues with your results and/or input genome sequence.

The ``qc_messages`` field will contain useful information about why you may have a low confidence serovar prediction result. The QC messages will be delimited by `` | ``.
The ``qc_messages`` field will contain useful information about why you may have a low confidence serovar prediction result. The QC messages will be delimited by the `` | `` symbol.
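
Given that delimiter, a ``qc_messages`` value can be split back into individual messages as sketched below. ``split_qc_messages`` is a hypothetical helper, and the sample string is a placeholder rather than real SISTR output.

```python
def split_qc_messages(qc_messages):
    """Split a qc_messages value on " | " and drop empty entries."""
    return [message for message in qc_messages.split(" | ") if message]


# Placeholder text standing in for a real qc_messages value.
example = "first message | second message"
messages = split_qc_messages(example)
```

An empty ``qc_messages`` field yields an empty list, which makes it easy to count or filter messages per genome.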

For example, here are the QC messages for an unusually small *Salmonella* assembly where the predicted serovar was "-:-:-":

@@ -507,10 +526,10 @@ The QC messages produced by ``sistr_cmd`` should help you understand your serova

Galaxy workflows
================
The `galaxy <https://github.com/phac-nml/sistr_cmd/tree/master/galaxy>`_ folder contains Galaxy Project SISTR workflows that allow to process samples in large batches.
The `galaxy <./galaxy/>`_ folder contains Galaxy SISTR workflows that can be readily imported into an existing Galaxy server instance to process WGS samples in large batches, starting from raw reads and finishing with serovar results.


- `Galaxy-Workflow-Assembly-Serotyping-withReport-for-SISTR_v1.1.1+galaxy1-recipe.ga <https://github.com/phac-nml/sistr_cmd/tree/master/galaxy/Galaxy-Workflow-Assembly-Serotyping-withReport-for-SISTR_v1.1.1+galaxy1-recipe.ga>`_
- `Galaxy-Workflow-Assembly-Serotyping-withReport-for-SISTR_v1.1.1+galaxy1-recipe.ga <./galaxy/Galaxy-Workflow-Assembly-Serotyping-withReport-for-SISTR_v1.1.1+galaxy1-recipe.ga>`_
+ Summary: Assembles genomes from raw reads, performs serotyping and generates overall report
+ Uses tool dependencies: ``sistr 1.1.1+galaxy1``, ``shovill 1.0.4+galaxy1`` and ``tp_cat 0.1.0``

2 changes: 1 addition & 1 deletion setup.py
@@ -47,7 +47,7 @@ def run(self):
'install': CustomInstallCommand
},
install_requires=[
'numpy>=1.11.1,<1.23.5',
'numpy>=1.11.1,<2',
'tables>=3.3.0,<4',
'pandas>=0.22.0,<3',
'pycurl>=7.43.0,<8',