Skip to content

Commit

Permalink
move most of README.md to WIKI
Browse files Browse the repository at this point in the history
  • Loading branch information
MrTomRod committed May 25, 2022
1 parent 6adad95 commit 2ca3fad
Show file tree
Hide file tree
Showing 7 changed files with 50 additions and 253 deletions.
259 changes: 13 additions & 246 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,253 +1,16 @@
# Scoary 2

Associate genes with traits!
Scoary 2 associates orthogenes (e.g. generated using [OrthoFinder][orthofinder]
or [Roary][roary] to traits. It reports a list of genes sorted by strength of
association per trait. The results can be explored interactively with a simple, static HTML/JS app.

# Installation
# Wiki

1) [Python](https://pypi.org/project/scoary-2/) 3.10+: `pip install scoary-2`
2) [Docker](https://hub.docker.com/r/troder/scoary-2): `docker pull troder/scoary-2`

# Usage

Examples:

1) Dataset from Scoary 1

```shell
# Dataset from Scoary 1: genes in Roary gene count format
scoary \
--genes Gene_presence_absence.csv \
--traits Tetracycline_resistance.csv \
--outdir out \
--n-permut 1000
```

See output [here](https://gfv-oberburg.ch/FILES/SCOARY2_TETR/overview.html).

2) OrthoFinder-based data

```shell
scoary \
--genes N0.tsv \
--gene-data-type 'gene-list:\t' \
--gene-info N0_best_names.tsv \
--traits traits.tsv \
--trait-data-type 'gaussian:\t' \
--trait-info trait_info.tsv \
--isolate-info isolate_info.tsv \
--n-permut 200 \
--n-cpus 7 \
--outdir out
# the following are optional: gene-info, trait-info, isolate-info
```

See output of a huge, unfiltered dataset [here](https://gfv-oberburg.ch/FILES/SCOARY2.2/overview.html).
(Takes a few seconds to load.)

## Docker cookbook

Run interactive shell:

```shell
docker run \
-u $(id -u ${USER}):$(id -g ${USER}) \
-it --rm \
-v /path/to/data:/data:Z \
troder/scoary-2 \
/bin/bash
```

## Options

Below is the output of `scoary --help`:

```text
POSITIONAL ARGUMENTS
GENES
Type: str
Path to gene presence/absence table: columns=isolates, rows=genes
TRAITS
Type: str
Path to trait presence/absence table: columns=traits, rows=isolates
OUTDIR
Type: str
Directory to place output files
FLAGS
--multiple_testing=MULTIPLE_TESTING
Type: str
Default: 'bonferroni:0.999'
"method:cutoff" for filtering genes after Fisher's test, where cutoff is a number and method is one of [bonferroni, sidak, holm-sidak, holm, simes-hochberg, hommel, fdr_bh, fdr_by, fdr_tsbh, fdr_tsbky]
--gene_info=GENE_INFO
Type: Optional[str]
Default: None
Path to file that describes genes: columns=arbitrary properties, rows=genes
--trait_info=TRAIT_INFO
Type: Optional[str]
Default: None
Path to file that describes traits: columns=arbitrary properties, rows=traits
--isolate_info=ISOLATE_INFO
Type: Optional[str]
Default: None
Path to file that describes isolates: columns=arbitrary properties, rows=isolates
--newicktree=NEWICKTREE
Type: Optional[str]
Default: None
Path to a custom tree in Newick format
--pairwise=PAIRWISE
Type: bool
Default: True
If False, only perform Fisher's test. If True, also perform pairwise comparisons algorithm.
--n_permut=N_PERMUT
Type: int
Default: 500
Post-hoc label-switching test: perform N permutations of the phenotype by random label switching. Low p-values suggest that the effect is not merely lineage-specific.
--restrict_to=RESTRICT_TO
Type: Optional[str]
Default: None
Comma-separated list of isolates to which to restrict this analysis
--ignore=IGNORE
Type: Optional[str]
Default: None
Comma-separated list of isolates to be ignored for this analysis
--n_cpus=N_CPUS
Type: int
Default: 1
--trait_data_type=TRAIT_DATA_TYPE
Type: str
Default: 'binary:,'
"<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>" How to read the traits table. Example: "gene-list:\t" for OrthoFinder N0.tsv table
--gene_data_type=GENE_DATA_TYPE
Type: str
Default: 'gene-count:,'
"<data_type>:<?delimiter>" How to read the genes table. Example: "gene-list:\t" for OrthoFinder N0.tsv table
--random_state=RANDOM_STATE
Type: Optional[int]
Default: None
Set a fixed seed for the random number generator
--limit_traits=LIMIT_TRAITS
Type: Optional[int, int]
Default: None
Limit the analysis to traits n to m. Useful for debugging. Example: "(0, 10)"
```

# Overview of the algorithm

![algorithm flowchart](media/ScoaryWorkflow.drawio.svg)

# Usage of the picking algorithm in Python

Python bindings to the _pairwise comparisons algorithm_, as described in
[Read, 1995](https://doi.org/10.1006/jtbi.1995.0047),
[Maddison, 2000](https://doi.org/10.1006/jtbi.1999.1050) and
[Brynildsrud, 2016](https://doi.org/10.1186/s13059-016-1108-8).

<details>

<summary>Simple pair picking</summary>

```python
from pprint import pprint
from scoary import ScoaryTree, pick_single, print_tree

tree = [['isolate1', 'isolate2'], [['isolate3', 'isolate4'], ['isolate5', 'isolate6']]]

label_to_trait_a = {
'isolate1': True,
'isolate2': False,
'isolate3': True,
'isolate4': False,
'isolate5': True,
'isolate6': False,
}

label_to_trait_b = {
'isolate1': True,
'isolate2': False,
'isolate3': True,
'isolate4': False,
'isolate5': True,
'isolate6': False,
}

print_tree(
ScoaryTree.from_list(tree),
label_to_trait_a, label_to_trait_b
)
# /-11_isolate1
# /-|
# | \-00_isolate2
# |
# --| /-11_isolate3
# | /-|
# | | \-00_isolate4
# \-|
# | /-11_isolate5
# \-|
# \-00_isolate6

result = pick_single(tree, label_to_trait_a, label_to_trait_b, calc_pvals=True)
pprint(result)
# {'best_fisher_p': 0.25,
# 'max_contrasting_pairs': 3,
# 'max_opposing_pairs': 0,
# 'max_supporting_pairs': 3,
# 'worst_pval': 0.25}
```

</details>

<details>

<summary>Parallel pair picking</summary>

This takes advantage of [Numba](https://numba.pydata.org/) optimizations.

```python
import pandas as pd
from scoary import pick

tree = [['isolate1', 'isolate2'], ['isolate3', 'isolate4']]

# e.g. phenotype
label_to_trait_a = {
'isolate1': True,
'isolate2': False,
'isolate3': False,
'isolate4': True,
}

# e.g. presence/absence of genes
trait_b_df = pd.DataFrame(
columns=['isolate1', 'isolate2', 'isolate3', 'isolate4'],
data=[
[True, True, False, False], # gene 1
[True, False, True, False], # gene 2
[True, False, False, True], # ...
[False, True, True, False],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
]
)

max_contr, max_suppo, max_oppos, best, worst = pick(
tree=tree,
label_to_trait_a=label_to_trait_a,
trait_b_df=trait_b_df,
calc_pvals=True
)

print(f'{max_contr=}\n{max_suppo=}\n{max_oppos=}\n{best=}\n{worst=}')
# max_contr=array([1, 2, 2, 2, 2, 2, 2, 2])
# max_suppo=array([1, 1, 2, 0, 1, 1, 1, 1])
# max_oppos=array([1, 1, 0, 2, 1, 1, 1, 1])
# best=array([1. , 1. , 0.5, 0.5, 1. , 1. , 1. , 1. ])
# worst=array([1. , 1. , 0.5, 0.5, 1. , 1. , 1. , 1. ])
```

</details>
- [Installation](wiki/Installation)
- [Usage](wiki/Usage)
- [Input](wiki/Input)
- [Output](wiki/Output)
- [Tutorial](wiki/Tutorial)

# Todo:

Expand All @@ -263,3 +26,7 @@ print(f'{max_contr=}\n{max_suppo=}\n{max_oppos=}\n{best=}\n{worst=}')
- [ ] `overview.html` & `trait.html`: ~~set proper title~~, add logo/favicon
- [ ] `overview.html` & `trait.html`: link to table download
- [ ] Improve `README.md`

[orthofinder]: https://github.com/davidemms/OrthoFinder/

[roary]: https://sanger-pathogens.github.io/Roary/
4 changes: 0 additions & 4 deletions media/ScoaryWorkflow.drawio.svg

This file was deleted.

6 changes: 5 additions & 1 deletion scoary/load_traits.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ def filter_df(df: pd.DataFrame, restrict_to: str = None, ignore: str = None) ->

def load_binary(traits: str, delimiter: str, restrict_to: str = None, ignore: str = None,
limit_traits: (int, int) = None):
logger.debug(f'Loading binary traits: {traits=} {delimiter=}')
dtypes = defaultdict(lambda: int)
dtypes["index_column"] = str
traits_df = pd.read_csv(traits, delimiter=delimiter, index_col=0, dtype=dtypes, na_values=STR_NA_VALUES)
Expand Down Expand Up @@ -71,6 +72,7 @@ def load_binary(traits: str, delimiter: str, restrict_to: str = None, ignore: st

def load_numeric(traits: str, delimiter: str, restrict_to: str = None, ignore: str = None,
limit_traits: (int, int) = None):
logger.debug(f'Loading numeric traits: {traits=} {delimiter=}')
dtypes = defaultdict(lambda: float)
dtypes["index_column"] = str
numeric_df = pd.read_csv(traits, delimiter=delimiter, index_col=0, dtype=dtypes, na_values=STR_NA_VALUES)
Expand Down Expand Up @@ -236,7 +238,7 @@ def fn_km(kmeans, gm, ns, trait, proc_id, container_binarized_traits, container_

def fn_gm(kmeans, gm, ns, trait, proc_id, container_binarized_traits, container_binarization_info, extract_covariances):
print_status(ns, trait, proc_id)
logger.debug(f'Binarizing {trait=} using gaussian mixture')
logger.debug(f'Binarizing {trait=} using gaussian mixture.')
try:
binarized_data, metadata = apply_gm(gm, ns.numeric_df[trait], certainty_cutoff=ns.cutoff,
extract_covariances=extract_covariances)
Expand Down Expand Up @@ -333,6 +335,8 @@ def binarize(
from .init_multiprocessing import init, mp
mgr, ns, counter, lock = init()

logger.info(f'Binarizing traits: {method=} {alternative=} {cutoff=} {covariance_type=} {n_cpus=} {random_state=}')

ns = BinarizeTraitNamespace.create_namespace(ns, {
'start_time': datetime.now(),
'lock': lock,
Expand Down
3 changes: 2 additions & 1 deletion scoary/scoary.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,8 @@ def scoary(
Low p-values suggest that the effect is not merely lineage-specific.
:param restrict_to: Comma-separated list of isolates to which to restrict this analysis
:param ignore: Comma-separated list of isolates to be ignored for this analysis
:param n_cpus: Number of CPUs that should be used
:param n_cpus: Number of CPUs that should be used. There is overhead in multiprocessing, so if the dataset is
small, use n_cpus=1
:param trait_data_type: "<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>" How to read the traits
table. Example: "gene-list:\\t" for OrthoFinder N0.tsv table
:param gene_data_type: "<data_type>:<?delimiter>" How to read the genes table. Example: "gene-list:\\t" for
Expand Down
2 changes: 1 addition & 1 deletion scoary/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,7 @@ def load_info_file(
info_df = pd.read_csv(info_file, index_col=0, delimiter='\t')

assert info_df.index.name == merge_col, \
f'The file {info_file} is improperly formatted: The first column must be named "{merge_col}".' \
f'The file {info_file} is improperly formatted: The first column must be named "{merge_col}". ' \
f'Current name: {info_df.index.name}. Remaining columns: {info_df.columns.tolist()}'

if expected_overlap_set is not None:
Expand Down
1 change: 1 addition & 0 deletions tests/init_tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
},
'tetracycline': {
'genes': 'Gene_presence_absence.csv',
'gene-info': 'gene-info.tsv',
'traits': 'Tetracycline_resistance.csv',
'traits-numeric': 'Tetracycline_resistance_numeric.csv',
'restrict_to': 'Restrict_to.csv',
Expand Down
28 changes: 28 additions & 0 deletions tests/test_scoary.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,16 @@ def test_scoary_multi_threaded(self):
outdir=self.tempdir
)

def test_scoary_gene_info(self):
scoary(
genes=get_path('tetracycline', 'genes'),
gene_info=get_path('tetracycline', 'gene-info'),
traits=get_path('tetracycline', 'traits'),
n_permut=1000,
n_cpus=1,
outdir=self.tempdir
)

def test_scoary_long(self):
scoary(
genes=get_path('new_ds', 'genes-hog'),
Expand Down Expand Up @@ -72,6 +82,24 @@ def test_scoary_long_numeric(self):
pairwise=False
)

def test_scoary_gauss_kmeans(self):
scoary(
genes=get_path('new_ds', 'genes-hog'),
gene_info=get_path('new_ds', 'genes-hog-info'),
gene_data_type='gene-list:2',
traits=get_path('new_ds', 'traits-lc'),
trait_data_type=f'gaussian:kmeans:\t',
trait_info=get_path('new_ds', 'traits-lc-meta'),
isolate_info=get_path('new_ds', 'isolate-meta'),
n_permut=200,
restrict_to=RESTRICT_TO,
random_state=42,
n_cpus=7,
outdir=self.tempdir,
limit_traits=(0, 100),
pairwise=False
)

def test_scoary_real(self):
scoary(
multiple_testing='bonferroni:0.1',
Expand Down

0 comments on commit 2ca3fad

Please sign in to comment.