From d9353027285b0317a4401bc675bfefddb1f7cd14 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Fri, 29 Sep 2023 11:23:33 -0700 Subject: [PATCH 01/43] add arXiv badge --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index c825d22..5c296e1 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,5 @@ # MAVE Minimum Information Model + +[![arXiv](https://img.shields.io/badge/arXiv-2306.15113-b31b1b.svg?style=flat-square)](https://arxiv.org/abs/2306.15113) + JSON Schema for validating MAVE experiment metadata From f0c3f30d2add90a87f370be9a4a9e8df4b29b1b6 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Fri, 29 Sep 2023 11:38:02 -0700 Subject: [PATCH 02/43] basic README updates --- README.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/README.md b/README.md index 5c296e1..516130a 100644 --- a/README.md +++ b/README.md @@ -3,3 +3,15 @@ [![arXiv](https://img.shields.io/badge/arXiv-2306.15113-b31b1b.svg?style=flat-square)](https://arxiv.org/abs/2306.15113) JSON Schema for validating MAVE experiment metadata + +## How to use this repository + +This repository contains an implementation of the schema described in the [Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. + +The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. +The implementation is found in the `schema` directory. + +The `examples` directory contains examples of this type of document describing real experiments, as well as a simple Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). +Many other implementations of the JSON Schema standard are available in other languages (see [here](https://json-schema.org/implementations.html)). 
+ +Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more human-readable. From ee71fb121a44d24c197855a47d2e4304791c969a Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Fri, 29 Sep 2023 11:53:47 -0700 Subject: [PATCH 03/43] add instructions for reading the schema --- README.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/README.md b/README.md index 516130a..0fcce98 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ JSON Schema for validating MAVE experiment metadata +*Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). + ## How to use this repository This repository contains an implementation of the schema described in the [Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. @@ -11,7 +13,22 @@ This repository contains an implementation of the schema described in the [Atlas The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. The implementation is found in the `schema` directory. +In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for describing one of these experiments. + The `examples` directory contains examples of this type of document describing real experiments, as well as a simple Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). Many other implementations of the JSON Schema standard are available in other languages (see [here](https://json-schema.org/implementations.html)). Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more human-readable. 
+ +## Reading the schema + +The `schema` directory contains a YAML representation of the minimum information standard and controlled vocabulary. +There are multiple levels of required information that can be browsed hierarchically. +Most fields include a description that details the intention of that field and the type of information that is to be provided. + +For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that can be used to describe the experiment. + +The schema structure and terms are also described below, although the YAML documents in the `schema` directory should be considered the authoritative source of information if there are discrepancies. + +## Controlled vocabulary terms + From 2258d7446258887f34952071baaf8e586ae8e813 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Fri, 29 Sep 2023 13:50:10 -0700 Subject: [PATCH 04/43] added gdoc content to README --- README.md | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 181 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 0fcce98..7508507 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,187 @@ Most fields include a description that details the intention of that field and t For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that can be used to describe the experiment. -The schema structure and terms are also described below, although the YAML documents in the `schema` directory should be considered the authoritative source of information if there are discrepancies. +The general schema structure and terms are also described below. +The YAML documents in the `schema` directory should be considered the authoritative structure and source of information where there are discrepancies. 
## Controlled vocabulary terms +### Important acronyms + +#### Multiplexed Assays of Variant Effects (MAVEs) + +Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element followed by multiplexed high-throughput phenotyping (potentially multiple phenotypic modalities). + +#### Variant Effect Map (VEM) + +A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) on a single or multiplexed set of phenotypes. + +#### Atlas of Variant Effects (AVE) ### + +A combined resource for variant effects measured across model systems and contexts applicable to the study of the structure and function of the genome and its products, as well as the consequences of its perturbation in health and disease. + +### Experimental vocabulary (genetic perturbation, phenotype and context) + +#### Genetic perturbation + +This section describes the scope and characteristics of variant introduction. + +**Library scope** – the collection of DNA elements introduced into the library. +DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, repressors, etc), or unknown functions. +For a given DNA element we distinguish the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants etc). + +Controlled vocabulary terms (one or many): +- Coding +- Intronic +- Non-coding regulatory +- Non-coding other (eg tRNA) + +**Variant Library characteristics** – methods used to generate the library + +*Variant generation method* – how was the variant library created (e.g. doped oligo, mutagenic PCR, primer-based, base editor) + +Controlled vocabulary categorical term (can pick both category options): +- Editing at endogenous locus +- In vitro variant construct generation + +*In vitro construct generation method* (if applicable) +- Oligo-directed mutagenic PCR (e.g. 
NNK PCR) +- Error-prone PCR +- Nicking mutagenesis +- Microarray synthesis +- Site-directed mutagenesis +- Doped oligo synthesis +- Oligo pool synthesis +- Proprietary method +- Other (please describe) + +*Integration/expression of exogenous construct* (if applicable) +- Entire element replacement at the native locus (e.g. with integrases, not base editing) + +*Integration of extra-local construct* (e.g. with landing pad; if applicable) +- Viral Integration +- Episomal delivery +- Transfection of RNA + +*Endogenous genome editing* (if applicable) +- CRISPR/Cas system +- SpCas9 +- SaCas9 +- AsCas12a +- RfxCas13d +- CRISPR/Cas system functionality + - Wildtype nuclease + - Base Editor + - Prime Editor + +**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism (e.g. viral transduction, electroporation, transfection and MOI) + +Controlled vocabulary terms (one or many): +- Electroporation +- Lipofection +- Nucleofection +- Microinjection +- Chemical-based transfection +- Transduction: AAV +- Transduction: lentivirus +- Transformation: chemical or heat shock +- Other (please specify) + +#### Phenotypic assay + +A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants (e.g. cell size and mode of adjudication, action potential characteristic(s) and mode of measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression (bulk RNA-seq)). + +**Dimensionality of phenotyping assays** – how many phenotypes and of what complexity are included in the map + +Controlled vocabulary terms (select one): +- Single functional read-out +- Single dimension (e.g. FACS fluorescence from a single protein was used) +- High-dimensional data (e.g. 
ML/AI enabled cell imaging/classification) +- The outcomes of multiple phenotypic assays were combined to make this map + +**Phenotypic assay examines** – terms selected from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) + +- DNA + - OBI_0000913 Promoter activity reporter gene assay RNA + - “Other”, e.g. structure, methylation + +- RNA + - OBI_0001177 Bulk RNA-sequencing + - OBI_0002631 Single cell RNA-sequencing and single cell combinatorial index RNA-sequencing assay + - OBI_0003094 Fluorescence in-situ hybridization (FISH) assay + - “Other” + +- Protein + - OBI_0000916 Flow cytometry assay + - OBI_0003096 Imaging Mass Cytometry assay + - OBI_0002161 Evolution of ligands by exponential enrichment assay + - “Other” + +- Morphology & Function + - OBI_0002119 Single cell imaging + - OBI_0003091 Multiplexed fluorescent antibody imaging + - OBI_0001146 Binding assays + - OBI_0000891 Cell Proliferation Assay, including fluorescence image-based cell proliferation assay + - OBI_0000699 Survival assessment assay + - “Other” + +**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or [https://mondo.monarchinitiative.org/](https://mondo.monarchinitiative.org/) + +#### Context - Characteristics of the model system that influence expression of phenotype + +**Cellular model system and genetic background** – genetically encoded characteristics of the model system that potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) + +Controlled vocabulary terms (one or many): +- Immortalized human cells (e.g. HEK293, HeLa cells; please specify below) +- Murine primary cells +- Induced pluripotent stem cells from male +- Induced pluripotent stem cells from female +- Patient derived primary cells (e.g. T-cells, adipocytes) +- Yeast +- E. coli +- Other bacteria +- Bacteriophage +- Molecular display (e.g. 
ribosome display) +- Other (please specify - includes all other OBI ontology terms) + +Commonly used cell lines and model systems + +| Cell | CLO Term | NCBI Taxonomy ID | +|------|----------|------------------| +| Yeast | n/a | 4932 | +| HEK293T | 37372 or 37373 | 9606 | +| HAP1 | missing | 9606 | +| HeLa | 3684 | 9606 | +| *E. coli* | n/a | 562 | +| iPSC-derived | 37308 | 9606 | +| *C. elegans* | n/a | 6239 | +| *C. savignyi* | n/a | 51511 | +| *D. melanogaster* | n/a | 7227 | +| HepG2 | 3704 | 9606 | +| Human hepatocytes | 182 | 9606 | +| K562 | 7050 | 9606 | +| Mouse embryonic stem cells | 37317 | 10090 | +| NIH3T3 | missing | 10090 | +| Bacteriophage | n/a | 38018 | +| Cell-free | n/a | n/a | + +**Environmental variables** – variance of environmental factors included in the experiment (e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) + +Controlled vocabulary terms (select one): +- Yes - If yes, please describe this in detail in the free text methods describing your assay. +- No + +#### Variant sequencing characteristics +This section details the method for accurately capturing variant frequency associated with outcome of phenotypic assay. 
+ +**Library profiling strategy** – approach used to quantify variants in the population + +Controlled vocabulary terms (select one): +- Direct sequencing +- Shotgun sequencing +- Barcode sequencing + +Controlled vocabulary terms (select one): +- Single segment (short read) +- Single segment (long read) +- Multi-segment From edf60778fd07c4998bc28a72778b54b3916b8a40 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 10:56:46 -0700 Subject: [PATCH 05/43] support validating multiple examples --- examples/source_validation.py | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/examples/source_validation.py b/examples/source_validation.py index 1a68980..bdcaa5a 100644 --- a/examples/source_validation.py +++ b/examples/source_validation.py @@ -1,4 +1,4 @@ -from jsonschema import validate +from jsonschema import validate, ValidationError import pathlib import yaml @@ -10,6 +10,15 @@ if __name__ == "__main__": with open(SCHEMA_DIR / "experiment.yml") as ysf: experiment_schema = yaml.safe_load(ysf) - with open(EXAMPLES_DIR / "experiment1.yml") as ye1f: - experiment_record = yaml.safe_load(ye1f) - validate(experiment_record, experiment_schema) + for example_file in EXAMPLES_DIR.glob("*.yml"): + with open(example_file) as ye1f: + experiment_record = yaml.safe_load(ye1f) + print("validating", example_file) + try: + validate(experiment_record, experiment_schema) + except ValidationError as e: + print("failed to validate:", e.message) + else: + print("validation successful") + + From a683ff2113c4f7eeabd626145f63a3874bee0b13 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 10:57:13 -0700 Subject: [PATCH 06/43] rename example1 to match publication --- examples/{experiment1.yml => Seuma_2018.yml} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename examples/{experiment1.yml => Seuma_2018.yml} (100%) diff --git a/examples/experiment1.yml b/examples/Seuma_2018.yml similarity index 100% rename from 
examples/experiment1.yml rename to examples/Seuma_2018.yml From 703f1b605b560071b1d3ce762bbac3bb8f470cc4 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 11:26:45 -0700 Subject: [PATCH 07/43] add findlay BRCA1 SGE example --- examples/Findlay_2018.yml | 57 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 examples/Findlay_2018.yml diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml new file mode 100644 index 0000000..ecdc980 --- /dev/null +++ b/examples/Findlay_2018.yml @@ -0,0 +1,57 @@ +title: BRCA1 Saturation Genome Editing +abstract: >- + Variants of uncertain significance fundamentally limit the clinical utility of genetic information. The challenge + they pose is epitomized by BRCA1, a tumour suppressor gene in which germline loss-of-function variants predispose + women to breast and ovarian cancer. Although BRCA1 has been sequenced in millions of women, the risk associated with + most newly observed variants cannot be definitively assigned. Here we use saturation genome editing to assay 96.5% of + all possible single-nucleotide variants (SNVs) in 13 exons that encode functionally critical domains of BRCA1. + Functional effects for nearly 4,000 SNVs are bimodally distributed and almost perfectly concordant with established + assessments of pathogenicity. Over 400 non-functional missense SNVs are identified, as well as around 300 SNVs that + disrupt expression. We predict that these results will be immediately useful for the clinical interpretation of BRCA1 + variants, and that this approach can be extended to overcome the challenge of variants of uncertain significance in + additional clinically actionable genes. 
+document: + title: Accurate classification of BRCA1 variants with saturation genome editing + system: + Nature + date: "2018-09-12" + ref: https://doi.org/10.1038/s41586-018-0461-z +variantLibrary: + scope: + type: coding + targetSequences: + - id: NM_007294.3 + sequenceAlphabet: DNA + generationMethod: + type: endogenous locus library + system: SpCas9 + mechanism: nuclease + description: Array-synthesized oligo pools (Agilent) + deliveryMethod: + type: other + description: Lipofection - TurboFectin +phenotypicAssay: + dimensionality: + type: multiple functional readouts + method: + type: survival assessment assay + method: + type: bulk RNA-sequencing + relevance: + - system: https://www.omim.org/ + code: "604370" + label: BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 1; BROVCA1 + - system: https://www.omim.org/ + code: "113705" + label: BRCA1 DNA REPAIR-ASSOCIATED PROTEIN; BRCA1 + - system: https://mondo.monarchinitiative.org/ + code: MONDO:0004984 + label: basal-like breast carcinoma + - system: https://mondo.monarchinitiative.org/ + code: MONDO:0011450 + label: breast-ovarian cancer, familial, susceptibility to, 1 + modelSystem: + type: immortalized human cells + description: HAP1 + profilingStrategy: direct sequencing + sequencingMethod: multi-segment From abebeafb17bc5fc44d3717c25a2fd28a0929394c Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 12:44:37 -0700 Subject: [PATCH 08/43] whitespace changes --- examples/Seuma_2018.yml | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/examples/Seuma_2018.yml b/examples/Seuma_2018.yml index ad3efe4..86caf04 100644 --- a/examples/Seuma_2018.yml +++ b/examples/Seuma_2018.yml @@ -1,9 +1,19 @@ title: Amyloid-Beta Deep Mutational Scan -abstract: Multiplexed assays of variant effects (MAVEs) guide clinical variant interpretation and reveal disease mechanisms. 
To date, MAVEs have focussed on a single mutation type–amino acid (AA) substitutions–despite the diversity of coding variants that cause disease. Here we use Deep Indel Mutagenesis (DIM) to generate a comprehensive atlas of diverse variant effects for a disease protein, the amyloid beta (Aβ) peptide that aggregates in Alzheimer's disease (AD) and is mutated in familial AD (fAD). The atlas identifies known fAD mutations and reveals that many variants beyond substitutions accelerate Aβ aggregation and are likely to be pathogenic. Truncations, substitutions, insertions, single- and internal multi-AA deletions differ in their propensity to enhance or impair aggregation, but likely pathogenic variants from all classes are highly enriched in the polar N-terminal region of Aβ. This comparative atlas highlights the importance of including diverse mutation types in MAVEs and provides important mechanistic insights into amyloid nucleation. +abstract: >- + Multiplexed assays of variant effects (MAVEs) guide clinical variant interpretation and reveal disease mechanisms. To + date, MAVEs have focussed on a single mutation type–amino acid (AA) substitutions–despite the diversity of coding + variants that cause disease. Here we use Deep Indel Mutagenesis (DIM) to generate a comprehensive atlas of diverse + variant effects for a disease protein, the amyloid beta (Aβ) peptide that aggregates in Alzheimer's disease (AD) and + is mutated in familial AD (fAD). The atlas identifies known fAD mutations and reveals that many variants beyond + substitutions accelerate Aβ aggregation and are likely to be pathogenic. Truncations, substitutions, insertions, + single- and internal multi-AA deletions differ in their propensity to enhance or impair aggregation, but likely + pathogenic variants from all classes are highly enriched in the polar N-terminal region of Aβ. 
This comparative atlas + highlights the importance of including diverse mutation types in MAVEs and provides important mechanistic insights + into amyloid nucleation. document: title: >- - An atlas of amyloid aggregation: the impact of substitutions, insertions, deletions and truncations - on amyloid beta fibril nucleation. + An atlas of amyloid aggregation: the impact of substitutions, insertions, deletions and truncations on amyloid beta + fibril nucleation. system: Nature Communications date: "2022-11-18" From d3a6446a3ac5a378b01041554dea94b0d041a74b Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 12:51:40 -0700 Subject: [PATCH 09/43] explain ga4gh hashes --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index 7508507..d5a91d0 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,13 @@ For many fields, there is an enumerated list of valid values corresponding to th The general schema structure and terms are also described below. The YAML documents in the `schema` directory should be considered the authoritative structure and source of information where there are discrepancies. +### Generating sequence identifiers + +Some examples (e.g. `examples/Seuma_2018.yml`) include target sequence identifiers and hashes. +These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see [here](en/stable/impl-guide/computed_identifiers.html)) for details. + +Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries. 
+ ## Controlled vocabulary terms ### Important acronyms From af279192495ed0a9c3bc920d673fc419bc02935e Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 4 Oct 2023 16:01:39 -0700 Subject: [PATCH 10/43] add PTEN VAMP-seq example --- examples/Matreyek_2018.yml | 77 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) create mode 100644 examples/Matreyek_2018.yml diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml new file mode 100644 index 0000000..e66eb32 --- /dev/null +++ b/examples/Matreyek_2018.yml @@ -0,0 +1,77 @@ +title: PTEN VAMP-seq +abstract: >- + Determining the pathogenicity of genetic variants is a critical challenge, and functional assessment is often the + only option. Experimentally characterizing millions of possible missense variants in thousands of clinically + important genes requires generalizable, scalable assays. We describe variant abundance by massively parallel + sequencing (VAMP-seq), which measures the effects of thousands of missense variants of a protein on intracellular + abundance simultaneously. We apply VAMP-seq to quantify the abundance of 7,801 single-amino-acid variants of PTEN and + TPMT, proteins in which functional variants are clinically actionable. We identify 1,138 PTEN and 777 TPMT variants + that result in low protein abundance, and may be pathogenic or alter drug metabolism, respectively. We observe + selection for low-abundance PTEN variants in cancer, and show that p.Pro38Ser, which accounts for ~10% of PTEN + missense variants in melanoma, functions via a dominant-negative mechanism. Finally, we demonstrate that VAMP-seq is + applicable to other genes, highlighting its generalizability. 
+document: + title: >- + Multiplex assessment of protein variant abundance by massively parallel + sequencing + system: + Nature Genetics + date: "2018-05-21" + ref: https://doi.org/10.1038/s41588-018-0122-z +variantLibrary: + scope: + type: coding + targetSequences: + - sequence: "ATGACAGCCATCATCAAAGAGATCGTTAGCAGAAACAAAAGGAGATATCAAGAGGATGGA\ + TTCGACTTAGACTTGACCTATATTTATCCAAACATTATTGCTATGGGATTTCCTGCAGAA\ + AGACTTGAAGGCGTATACAGGAACAATATTGATGATGTAGTAAGGTTTTTGGATTCAAAG\ + CATAAAAACCATTACAAGATATACAATCTTTGTGCTGAAAGACATTATGACACCGCCAAA\ + TTTAATTGCAGAGTTGCACAATATCCTTTTGAAGACCATAACCCACCACAGCTAGAACTT\ + ATCAAACCCTTTTGTGAAGATCTTGACCAATGGCTAAGTGAAGATGACAATCATGTTGCA\ + GCAATTCACTGTAAAGCTGGAAAGGGACGAACTGGTGTAATGATATGTGCATATTTATTA\ + CATCGGGGCAAATTTTTAAAGGCACAAGAGGCCCTAGATTTCTATGGGGAAGTAAGGACC\ + AGAGACAAAAAGGGAGTAACTATTCCCAGTCAGAGGCGCTATGTGTATTATTATAGCTAC\ + CTGTTAAAGAATCATCTGGATTATAGACCAGTGGCACTGTTGTTTCACAAGATGATGTTT\ + GAAACTATTCCAATGTTCAGTGGCGGAACTTGCAATCCTCAGTTTGTGGTCTGCCAGCTA\ + AAGGTGAAGATATATTCCTCCAATTCAGGACCCACACGACGGGAAGACAAGTTCATGTAC\ + TTTGAGTTCCCTCAGCCGTTACCTGTGTGTGGTGATATCAAAGTAGAGTTCTTCCACAAA\ + CAGAACAAGATGCTAAAAAAGGACAAAATGTTTCACTTTTGGGTAAATACATTCTTCATA\ + CCAGGACCAGAGGAAACCTCAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATC\ + GATAGCATTTGCAGTATAGAGCGTGCAGATAATGACAAGGAATATCTAGTACTTACTTTA\ + ACAAAAAATGATCTTGACAAAGCAAATAAAGACAAAGCCAACCGATACTTTTCTCCAAAT\ + TTTAAGGTGAAGCTGTACTTCACAAAAACAGTAGAGGAGCCGTCAAATCCAGAGGCTAGC\ + AGTTCAACTTCTGTAACACCAGATGTTAGTGACAATGAACCTGATCATTATAGATATTCT\ + GACACCACTGACTCTGATCCAGAGAATGAACCTTTTGATGAAGATCAGCATACACAAATT\ + ACAAAAGTCTGA" + sequenceAlphabet: DNA + generationMethod: + type: in-vitro construct library + system: oligo-directed mutagenic PCR + integration: extra-local construct insertion + description: Integration using Tet-on landing pad system + deliveryMethod: + type: chemical or heat shock transformation +phenotypicAssay: + dimensionality: + type: single dimension + method: + type: flow cytometry assay + description: VAMP-seq + 
relevance: + - system: https://www.omim.org/ + code: "601728" + label: PHOSPHATASE AND TENSIN HOMOLOG; PTEN + - system: https://www.omim.org/ + code: "158350" + label: COWDEN SYNDROME 1; CWS1 + - system: https://mondo.monarchinitiative.org/ + code: MONDO:0017623 + label: PTEN hamartoma tumor syndrome + - system: https://mondo.monarchinitiative.org/ + code: MONDO:0017623 + label: Cowden syndrome 1 + modelSystem: + type: immortalized human cells + description: HEK 293T TetBxb1BFP + profilingStrategy: barcode sequencing + sequencingMethod: single-segment (short read) From 588e0e16fa2924e0f21d72b7b851ea23718a862d Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Sat, 7 Oct 2023 15:16:58 -0700 Subject: [PATCH 11/43] add zenodo badge --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index d5a91d0..eebed65 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ # MAVE Minimum Information Model -[![arXiv](https://img.shields.io/badge/arXiv-2306.15113-b31b1b.svg?style=flat-square)](https://arxiv.org/abs/2306.15113) +[![arXiv](https://img.shields.io/badge/arXiv-2306.15113-b31b1b.svg)](https://arxiv.org/abs/2306.15113) +[![DOI](https://zenodo.org/badge/634403007.svg)](https://zenodo.org/badge/latestdoi/634403007) JSON Schema for validating MAVE experiment metadata From a6315f3cd0d9fb1c953fec885c28662c8ea1ef41 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Sat, 7 Oct 2023 17:09:58 -0700 Subject: [PATCH 12/43] remove unused score set fields --- schema/experiment.yml | 42 ------------------------------------------ 1 file changed, 42 deletions(-) diff --git a/schema/experiment.yml b/schema/experiment.yml index fd537d4..ac3f823 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -1,4 +1,3 @@ -# https://docs.google.com/document/d/1HTy5oLd0FWwcqirCh6WXFcBAiwHl5rr5hnjg_kEYKsE/edit $schema: https://json-schema.org/draft/2020-12/schema title: MAVE experiment definition $defs: @@ -300,41 +299,6 @@ $defs: - modelSystem - 
profilingStrategy - sequencingMethod -# scoreSet: -# description: a set of assayed variants and their phenotypic scores. -# type: object -# properties: -# variantScores: -# description: a set of variants and their phenotypic scores. -# type: array -# items: -# - type: object -# properties: -# variant: -# type: object -# description: a variant associated with a phenotypic score. -# properties: -# vrs: -# # $ref: "https://w3id.org/ga4gh/vrs/1.3/vrs.json#/definitions/Variation" -# type: object -# description: a computable description of the variant using GA4GH VRS. -# mave-hgvs: -# type: string -# description: a human-readable description of the variant using MAVE-HGVS. -# anyOf: -# - required: -# - vrs -# - required: -# - mave-hgvs -# score: -# type: number -# description: a phenotypic score associated with a variant. -# required: -# - variant -# - score -# minItems: 1 -# required: -# - variant_scores type: object additionalProperties: false properties: @@ -352,14 +316,8 @@ properties: $ref: "#/$defs/VariantLibrary" phenotypicAssay: $ref: "#/$defs/phenotypicAssay" -# scoreSets: -# type: array -# items: -# - $ref: "#/$defs/scoreSet" -# minItems: 1 required: - title - abstract - variantLibrary - phenotypicAssay -# - scoreSet \ No newline at end of file From 8cb7ba85a206d7433ba7224004477d2a395fed0a Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Mon, 9 Oct 2023 14:26:56 -0700 Subject: [PATCH 13/43] add hard wrap at 120 characters --- README.md | 74 ++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 51 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index eebed65..fd604a0 100644 --- a/README.md +++ b/README.md @@ -5,37 +5,49 @@ JSON Schema for validating MAVE experiment metadata -*Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). 
+*Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data +repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). ## How to use this repository -This repository contains an implementation of the schema described in the [Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. +This repository contains an implementation of the schema described in the +[Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a +multiplexed assay experiment. -The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. +The schema defines a set of required and optional fields and possible values that can be used to validate a minimum +information document. The implementation is found in the `schema` directory. -In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for describing one of these experiments. +In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for +describing one of these experiments. -The `examples` directory contains examples of this type of document describing real experiments, as well as a simple Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). -Many other implementations of the JSON Schema standard are available in other languages (see [here](https://json-schema.org/implementations.html)). +The `examples` directory contains examples of this type of document describing real experiments, as well as a simple +Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). 
+Many other implementations of the JSON Schema standard are available in other languages (see +[here](https://json-schema.org/implementations.html)). -Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more human-readable. +Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more +human-readable. ## Reading the schema The `schema` directory contains a YAML representation of the minimum information standard and controlled vocabulary. There are multiple levels of required information that can be browsed hierarchically. -Most fields include a description that details the intention of that field and the type of information that is to be provided. +Most fields include a description that details the intention of that field and the type of information that is to be +provided. -For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that can be used to describe the experiment. +For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that can +be used to describe the experiment. The general schema structure and terms are also described below. -The YAML documents in the `schema` directory should be considered the authoritative structure and source of information where there are discrepancies. +The YAML documents in the `schema` directory should be considered the authoritative structure and source of information +where there are discrepancies. ### Generating sequence identifiers Some examples (e.g. `examples/Seuma_2018.yml`) include target sequence identifiers and hashes. -These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see [here](en/stable/impl-guide/computed_identifiers.html)) for details. 
+These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see +[here](en/stable/impl-guide/computed_identifiers.html)) for details. Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries. @@ -45,15 +57,19 @@ Generating these stable identifiers is not required but is recommended, particul #### Multiplexed Assays of Variant Effects (MAVEs) -Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element followed by multiplexed high-throughput phenotyping (potentially multiple phenotypic modalities). +Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element +followed by multiplexed high-throughput phenotyping (potentially multiple phenotypic modalities). #### Variant Effect Map (VEM) -A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) on a single or multiplexed set of phenotypes. +A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) +on a single or multiplexed set of phenotypes. #### Atlas of Variant Effects (AVE) ### -A combined resource for variant effects measured across model systems and contexts applicable to the study of the structure and function of the genome and its products, as well as the consequences of its perturbation in health and disease. +A combined resource for variant effects measured across model systems and contexts applicable to the study of the +structure and function of the genome and its products, as well as the consequences of its perturbation in health and +disease. 
### Experimental vocabulary (genetic perturbation, phenotype and context) @@ -62,8 +78,10 @@ A combined resource for variant effects measured across model systems and contex This section describes the scope and characteristics of variant introduction. **Library scope** – the collection of DNA elements introduced into the library. -DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, repressors, etc), or unknown functions. -For a given DNA element we distinguish the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants etc). +DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, +repressors, etc), or unknown functions. +For a given DNA element we distinguish the mode of variant programming/engineering +(e.g. all SNV, indels, ClinVar variants etc). Controlled vocabulary terms (one or many): - Coding @@ -73,7 +91,8 @@ Controlled vocabulary terms (one or many): **Variant Library characteristics** – methods used to generate the library -*Variant generation method* – how was the variant library created (e.g. doped oligo, mutagenic PCR, primer-based, base editor) +*Variant generation method* – how was the variant library created +(e.g. doped oligo, mutagenic PCR, primer-based, base editor) Controlled vocabulary categorical term (can pick both category options): - Editing at endogenous locus @@ -109,7 +128,8 @@ Controlled vocabulary categorical term (can pick both category options): - Base Editor - Prime Editor -**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism (e.g. viral transduction, electroporation, transfection and MOI) +**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism +(e.g. 
viral transduction, electroporation, transfection and MOI) Controlled vocabulary terms (one or many): - Electroporation @@ -124,7 +144,10 @@ Controlled vocabulary terms (one or many): #### Phenotypic assay -A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants (e.g. cell size and mode of adjudication, action potential characteristic(s) and mode of measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression (bulk RNA-seq)). +A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large +amount of genetic variants (e.g. cell size and mode of adjudication, action potential characteristic(s) and mode of +measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression +(bulk RNA-seq)). **Dimensionality of phenotyping assays** – how many phenotypes and of what complexity are included in the map @@ -134,7 +157,8 @@ Controlled vocabulary terms (select one): - High-dimensional data (e.g. 
ML/AI enabled cell imaging/classification) - The outcomes of multiple phenotypic assays were combined to make this map -**Phenotypic assay examines** – terms selected from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) +**Phenotypic assay examines** – terms selected from OBI subtree with root +[OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) - DNA - OBI_0000913 Promoter activity reporter gene assay RNA @@ -160,11 +184,13 @@ Controlled vocabulary terms (select one): - OBI_0000699 Survival assessment assay - “Other” -**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or [https://mondo.monarchinitiative.org/](https://mondo.monarchinitiative.org/) +**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or +[https://mondo.monarchinitiative.org/](https://mondo.monarchinitiative.org/) #### Context - Characteristics of the model system that influence expression of phenotype -**Cellular model system and genetic background** – genetically encoded characteristics of the model system that potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) +**Cellular model system and genetic background** – genetically encoded characteristics of the model system that +potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) Controlled vocabulary terms (one or many): - Immortalized human cells (e.g. HEK293, HeLa cells; please specify below) @@ -200,7 +226,9 @@ Commonly used cell lines and model systems | Bacteriophage | n/a | 38018 | | Cell-free | n/a | n/a | -**Environmental variables** – variance of environmental factors included in the experiment (e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) 
+**Environmental variables** – variance of environmental factors included in the experiment +(e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, +KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) Controlled vocabulary terms (select one): - Yes - If yes, please describe this in detail in the free text methods describing your assay. From 29c53770d1dc7ff3ca47e5a7e4061383f21b49ec Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Mon, 9 Oct 2023 14:41:55 -0700 Subject: [PATCH 14/43] fix incorrect date for Seuma --- README.md | 5 ++++- examples/{Seuma_2018.yml => Seuma_2022.yml} | 0 2 files changed, 4 insertions(+), 1 deletion(-) rename examples/{Seuma_2018.yml => Seuma_2022.yml} (100%) diff --git a/README.md b/README.md index fd604a0..b398312 100644 --- a/README.md +++ b/README.md @@ -45,7 +45,10 @@ where there are discrepancies. ### Generating sequence identifiers -Some examples (e.g. `examples/Seuma_2018.yml`) include target sequence identifiers and hashes. +S* `examples/Findlay_2018.yml` describes a saturation genome editing experiment on BRCA1, involving CRISPR-based editing +of the endogenous locus and measuring cell survival in HAP1 cells + +ome examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see [here](en/stable/impl-guide/computed_identifiers.html)) for details. 
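As a minimal sketch of how the target sequence hashes mentioned in the README hunk above could be computed: this assumes the GA4GH `sha512t24u` truncated digest (SHA-512, first 24 bytes, base64url-encoded) and the `ga4gh:SQ.` prefix used for sequence identifiers, and it assumes the sequence is a plain uppercase ASCII string with no whitespace. The function names are hypothetical and are not part of this repository; the VRS implementation guide remains the authoritative description.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return the GA4GH truncated digest: SHA-512, first 24 bytes, base64url."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Build a 'ga4gh:SQ.<digest>' identifier (hypothetical helper name)."""
    return "ga4gh:SQ." + sha512t24u(sequence.encode("ascii"))


if __name__ == "__main__":
    # Assumed input: a short DNA string used purely for illustration.
    print(sequence_identifier("ACGT"))
```

The 32-character digest on its own is the kind of value a `sha512t24u` field would hold, while the prefixed form is what would typically go in an `id` field.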
diff --git a/examples/Seuma_2018.yml b/examples/Seuma_2022.yml similarity index 100% rename from examples/Seuma_2018.yml rename to examples/Seuma_2022.yml From 23e27e2c4c1df6326dc7af16ff2a3a402f4ba63b Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Mon, 9 Oct 2023 15:16:47 -0700 Subject: [PATCH 15/43] improve instructions --- README.md | 43 ++++++++++++++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index b398312..5562cfd 100644 --- a/README.md +++ b/README.md @@ -43,12 +43,45 @@ The general schema structure and terms are also described below. The YAML documents in the `schema` directory should be considered the authoritative structure and source of information where there are discrepancies. -### Generating sequence identifiers +### Applying the schema to your datasets + +Unless you are an experienced YAML user who is able to read the `schema/experiment.yml` file yourself, we recommend +choosing the most closely-related example file as a starting point and modifying it as needed. + +The repository currently contains three examples: + +* `examples/Findlay_2018.yml` describes a saturation genome editing (SGE) experiment on BRCA1, involving CRISPR-based +editing of the endogenous locus and measuring cell survival in HAP1 cells +([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/30209399/)) +* `examples/Matreyek_2018.yml` describes a deep mutational scan of PTEN, expressed using a designed construct integrated +into the genome using a landing pad system and measuring cell fluorescence, also known as VAMP-seq +([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/29785012/)) +* `examples/Seuma_2022.yml` describes a deep mutational scan of amyloid beta, expressed episomally and measuring the +effect on yeast growth ([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/36400770/)) + +The schema starts with some descriptive metadata, such as the title and abstract. 
+We recommend that the title in particular focus on describing the dataset specific to the document rather than the +overall study. +The `title` and `abstract` are required properties. + +The next section (`document`) describes the publication (if any). +This part of the schema is optional, but if it is included, the `ref` property that provides an accession number +(such as a [DOI](https://www.doi.org/)) is required. + +The following sections `variantLibrary` and `phenotypicAssay` describe the experiment that was performed and both are +required. +Each has several subsections that provide structure for detailing the important experimental design decisions captured +by the schema. +We refer users to the examples and the list of [controlled vocabulary terms](#controlled-vocabulary-terms) below to help +complete this section, as it will be different for each experiment. + +*Note:* We anticipate that the standard will be adopted by established resources such as +[MaveDB](https://www.mavedb.org) that will provide users with the ability to download a minimum information file after +data deposition. -S* `examples/Findlay_2018.yml` describes a saturation genome editing experiment on BRCA1, involving CRISPR-based editing -of the endogenous locus and measuring cell survival in HAP1 cells +### Generating sequence identifiers -ome examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. +Some examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see [here](en/stable/impl-guide/computed_identifiers.html)) for details. 
@@ -92,7 +125,7 @@ Controlled vocabulary terms (one or many): - Non-coding regulatory - Non-coding other (eg tRNA) -**Variant Library characteristics** – methods used to generate the library +**Variant library characteristics** – methods used to generate the library *Variant generation method* – how was the variant library created (e.g. doped oligo, mutagenic PCR, primer-based, base editor) From 820ba7489c33557d9e8f06313f8bf801d831b70c Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Tue, 10 Oct 2023 12:16:53 -0700 Subject: [PATCH 16/43] add ontologies section --- README.md | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 5562cfd..29f4983 100644 --- a/README.md +++ b/README.md @@ -107,6 +107,19 @@ A combined resource for variant effects measured across model systems and contex structure and function of the genome and its products, as well as the consequences of its perturbation in health and disease. +### Ontologies and identifiers + +For describing assay readouts, we make use of terms from the +[Ontology for Biomedical Investigations](https://obi-ontology.org/). + +For describing human phenotypes relevant to the assay, we suggest using terms from [OMIM](https://www.omim.org/) or the +[Mondo Disease Ontology](https://mondo.monarchinitiative.org/). + +For describing human cell lines, we use terms from the [Cell Line Ontology](http://obofoundry.org/ontology/clo.html), +where available. +We encourage users to provide an [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) that specifically denotes +the organism (including strain, where applicable). 
+ ### Experimental vocabulary (genetic perturbation, phenotype and context) #### Genetic perturbation @@ -220,8 +233,8 @@ Controlled vocabulary terms (select one): - OBI_0000699 Survival assessment assay - “Other” -**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or -[https://mondo.monarchinitiative.org/](https://mondo.monarchinitiative.org/) +**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or the +[Mondo Disease Ontology](https://mondo.monarchinitiative.org/) #### Context - Characteristics of the model system that influence expression of phenotype From e01c5b3d19cc46e60fa1a5a454026b1624e8717b Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 10:16:59 -0400 Subject: [PATCH 17/43] json build and formatting --- .requirements.txt | 3 +- README.md | 31 +-- schema/Makefile | 7 + schema/experiment.json | 421 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 446 insertions(+), 16 deletions(-) create mode 100644 schema/Makefile create mode 100644 schema/experiment.json diff --git a/.requirements.txt b/.requirements.txt index 0fe2a14..09d8f24 100644 --- a/.requirements.txt +++ b/.requirements.txt @@ -1,3 +1,4 @@ pytest pyyaml -jsonschema \ No newline at end of file +jsonschema +ga4gh.gks.metaschema diff --git a/README.md b/README.md index 29f4983..183ae81 100644 --- a/README.md +++ b/README.md @@ -5,25 +5,25 @@ JSON Schema for validating MAVE experiment metadata -*Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data +*Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). 
## How to use this repository -This repository contains an implementation of the schema described in the -[Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a +This repository contains an implementation of the schema described in the +[Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. -The schema defines a set of required and optional fields and possible values that can be used to validate a minimum +The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. The implementation is found in the `schema` directory. -In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for +In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for describing one of these experiments. The `examples` directory contains examples of this type of document describing real experiments, as well as a simple Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). -Many other implementations of the JSON Schema standard are available in other languages (see +Many other implementations of the JSON Schema standard are available in other languages (see [here](https://json-schema.org/implementations.html)). 
Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more @@ -93,12 +93,12 @@ Generating these stable identifiers is not required but is recommended, particul #### Multiplexed Assays of Variant Effects (MAVEs) -Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element +Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element followed by multiplexed high-throughput phenotyping (potentially multiple phenotypic modalities). #### Variant Effect Map (VEM) -A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) +A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) on a single or multiplexed set of phenotypes. #### Atlas of Variant Effects (AVE) ### @@ -112,12 +112,12 @@ disease. For describing assay readouts, we make use of terms from the [Ontology for Biomedical Investigations](https://obi-ontology.org/). -For describing human phenotypes relevant to the assay, we suggest using terms from [OMIM](https://www.omim.org/) or the -[Mondo Disease Ontology](https://mondo.monarchinitiative.org/). +For describing human diseases relevant to the assay, we recommend using terms from [OMIM](https://www.omim.org/) or +the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/). -For describing human cell lines, we use terms from the [Cell Line Ontology](http://obofoundry.org/ontology/clo.html), +For describing human cell lines, we use terms from the [Cell Line Ontology](http://obofoundry.org/ontology/clo.html), where available. 
-We encourage users to provide an [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) that specifically denotes +We encourage users to provide an [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) that specifically denotes the organism (including strain, where applicable). ### Experimental vocabulary (genetic perturbation, phenotype and context) @@ -275,8 +275,8 @@ Commonly used cell lines and model systems | Bacteriophage | n/a | 38018 | | Cell-free | n/a | n/a | -**Environmental variables** – variance of environmental factors included in the experiment -(e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, +**Environmental variables** – variance of environmental factors included in the experiment +(e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) Controlled vocabulary terms (select one): @@ -284,9 +284,10 @@ Controlled vocabulary terms (select one): - No #### Variant sequencing characteristics + This section details the method for accurately capturing variant frequency associated with outcome of phenotypic assay. 
-**Library profiling strategy** – approach used to quantify variants in the population +**Library profiling strategy** – approach used to quantify variants in the population Controlled vocabulary terms (select one): - Direct sequencing diff --git a/schema/Makefile b/schema/Makefile new file mode 100644 index 0000000..3c83ecc --- /dev/null +++ b/schema/Makefile @@ -0,0 +1,7 @@ +JSYAMLS:=experiment.yml +JSONS:=${JSYAMLS:.yml=.json} + +all: ${JSONS} + +%.json: %.yml + jsy2js.py <$< >$@ diff --git a/schema/experiment.json b/schema/experiment.json new file mode 100644 index 0000000..d437ea2 --- /dev/null +++ b/schema/experiment.json @@ -0,0 +1,421 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "title": "MAVE experiment definition", + "$defs": { + "Document": { + "type": "object", + "additionalProperties": false, + "description": "a written document describing a work", + "properties": { + "title": { + "type": "string", + "description": "the title of the document" + }, + "system": { + "type": "string", + "description": "the name of the system the document is registered in, such as a journal or preprint server" + }, + "date": { + "type": "string", + "format": "date", + "description": "the date the document was registered" + }, + "ref": { + "type": "string", + "format": "uri", + "description": "a Uniform Resource Identifier for the document" + } + }, + "required": [ + "ref" + ] + }, + "Coding": { + "type": "object", + "additionalProperties": false, + "properties": { + "system": { + "description": "identity of the terminology system", + "type": "string", + "format": "uri" + }, + "version": { + "description": "version of the terminology system", + "type": "string" + }, + "code": { + "description": "a code within the terminology system", + "type": "string", + "pattern": "\\S+( \\S+)*" + }, + "label": { + "description": "a human-readable description of the concept associated with the code", + "type": "string" + } + } + }, + "ReferenceSequence": { + "type": 
"object", + "additionalProperties": false, + "properties": { + "id": { + "type": "string" + }, + "sha512t24u": { + "type": "string", + "pattern": "[0-9A-Za-z_\\-]{32}" + }, + "sequence": { + "type": "string", + "pattern": "^[A-Z*\\-]*$" + }, + "sequenceAlphabet": { + "type": "string", + "enum": [ + "DNA", + "RNA", + "protein" + ] + } + } + }, + "EndogenousLocusLibraryMethod": { + "type": "object", + "additionalProperties": false, + "description": "a methodology for generating a variant library at an endogenous locus", + "properties": { + "type": { + "const": "endogenous locus library", + "default": "endogenous locus library" + }, + "system": { + "type": "string", + "description": "the system used to generate the library", + "enum": [ + "SpCas9", + "SaCas9", + "AsCas12a", + "RfsCas13d" + ] + }, + "mechanism": { + "description": "the functional mechanism of the library generation method", + "type": "string", + "enum": [ + "nuclease", + "base editor", + "prime editor" + ] + }, + "description": { + "description": "additional details about the variant library generation method", + "type": "string" + } + }, + "required": [ + "type", + "system", + "mechanism" + ] + }, + "InVitroConstructLibraryMethod": { + "type": "object", + "additionalProperties": false, + "description": "a methodology for generating and integrating an exogenous variant library", + "properties": { + "type": { + "const": "in-vitro construct library", + "default": "in-vitro construct library" + }, + "system": { + "type": "string", + "description": "the type of method used to generate the library", + "enum": [ + "oligo-directed mutagenic PCR", + "error-prone PCR", + "nicking mutagenesis", + "microarray synthesis", + "site-directed mutagenesis", + "doped oligo synthesis", + "oligo pool synthesis", + "proprietary method", + "other" + ] + }, + "integration": { + "description": "the mechanism for integration or expression of an exogenous construct", + "type": "string", + "enum": [ + "native locus replacement", 
+ "extra-local construct insertion", + "random locus viral integration", + "episomal delivery", + "plasmid (not integrated)", + "transfection of RNA" + ] + }, + "description": { + "description": "additional details about the variant library generation method", + "type": "string" + } + }, + "if": { + "properties": { + "system": { + "const": "other" + } + } + }, + "then": { + "required": [ + "type", + "system", + "description" + ] + }, + "else": { + "required": [ + "type", + "system" + ] + } + }, + "VariantLibrary": { + "type": "object", + "additionalProperties": false, + "description": "a collection of sequences that are derived from a common target sequence", + "properties": { + "targetSequences": { + "description": "the collection of sequences used as references from which all variants in the library are defined", + "type": "array", + "items": { + "$ref": "#/$defs/ReferenceSequence" + }, + "minItems": 1 + }, + "scope": { + "description": "the functional scope of DNA elements introduced into the library. DNA elements can have known or unknown functions. Example functions include a gene, an exon or set of exons included in a transcript, a set of enhancers, a set of repressors, etc.", + "type": "object", + "properties": { + "type": { + "description": "the scope type for elements introduced into the library", + "type": "string", + "enum": [ + "coding", + "intronic", + "non-coding, regulatory", + "non-coding, other" + ] + }, + "description": { + "type": "string", + "description": "additional details about the DNA element scope. For example, distinguishing the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants etc)." 
+ } + }, + "if": { + "properties": { + "type": { + "const": "non-coding, other" + } + } + }, + "then": { + "required": [ + "description" + ] + } + }, + "generationMethod": { + "description": "the method used for generating the library", + "oneOf": [ + { + "$ref": "#/$defs/EndogenousLocusLibraryMethod" + }, + { + "$ref": "#/$defs/InVitroConstructLibraryMethod" + } + ] + }, + "deliveryMethod": { + "description": "how the variant library was delivered to the model system for phenotype evaluation.", + "type": "object", + "properties": { + "type": { + "type": "string", + "enum": [ + "electroporation", + "nucleofection", + "chemical-based transfection", + "adeno-associated virus transduction", + "lentivirus transduction", + "chemical or heat shock transformation", + "other" + ] + }, + "description": { + "type": "string", + "description": "additional details about the delivery method" + } + }, + "required": [ + "type" + ] + } + }, + "required": [ + "scope", + "targetSequences", + "generationMethod", + "deliveryMethod" + ] + }, + "phenotypicAssay": { + "description": "a physical adjudication of a model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants.", + "type": "object", + "properties": { + "dimensionality": { + "type": "object", + "description": "dimensionality of phenotyping assay. 
Describes how many phenotypes and of what complexity are included in the map.", + "properties": { + "type": { + "description": "a coding defining the dimensionality of the assay as single or multiple functional readouts.", + "type": "string", + "enum": [ + "single dimension", + "high-dimensional data", + "multiple functional readouts" + ] + }, + "description": { + "type": "string", + "description": "additional details about the dimensionality of the assay" + } + }, + "required": [ + "type" + ] + }, + "method": { + "description": "the assay method, defining the molecular properties interrogated.", + "type": "object", + "properties": { + "type": { + "type": "string", + "enum": [ + "promoter activity reporter gene assay", + "bulk RNA-sequencing", + "single-cell RNA sequencing assay", + "fluorescence in-situ hybridization (FISH) assay", + "flow cytometry assay", + "imaging mass cytometry assay", + "evolution of ligands by exponential enrichment assay", + "single cell imaging", + "multiplexed fluorescent antibody imaging", + "binding assays", + "cell proliferation assay", + "survival assessment assay", + "other" + ] + }, + "description": { + "type": "string", + "description": "additional details about the assay method." + } + }, + "required": [ + "type" + ] + }, + "relevance": { + "description": "the disease or biological processes the assay is relevant to.", + "type": "array", + "items": { + "$ref": "#/$defs/Coding" + }, + "minItems": 1 + }, + "modelSystem": { + "description": "the model system context that influences expression of the phenotype.", + "type": "object", + "properties": { + "type": { + "description": "the model system.", + "type": "string", + "enum": [ + "immortalized human cells", + "murine primary cells", + "induced pluripotent stem cells from human male", + "induced pluripotent stem cells from human female", + "patient derived primary cells (e.g. 
T-cells, adipocytes)", + "yeast", + "bacteria", + "bacteriophage", + "molecular display", + "other" + ] + }, + "description": { + "type": "string", + "description": "additional details about the model system." + } + } + }, + "profilingStrategy": { + "description": "the strategy used to profile the variant library", + "type": "string", + "enum": [ + "direct sequencing", + "shotgun sequencing", + "barcode sequencing" + ] + }, + "sequencingMethod": { + "description": "the sequencing method used", + "type": "string", + "enum": [ + "single-segment (short read)", + "single-segment (long read)", + "multi-segment" + ] + } + }, + "required": [ + "dimensionality", + "method", + "relevance", + "modelSystem", + "profilingStrategy", + "sequencingMethod" + ] + } + }, + "type": "object", + "additionalProperties": false, + "properties": { + "title": { + "description": "the title of the MAVE experiment", + "type": "string" + }, + "abstract": { + "description": "an abstract describing the MAVE experiment", + "type": "string" + }, + "document": { + "description": "the primary document describing this experiment", + "$ref": "#/$defs/Document" + }, + "variantLibrary": { + "description": "characteristics of the variant library generation process", + "$ref": "#/$defs/VariantLibrary" + }, + "phenotypicAssay": { + "$ref": "#/$defs/phenotypicAssay" + } + }, + "required": [ + "title", + "abstract", + "variantLibrary", + "phenotypicAssay" + ] +} \ No newline at end of file From c341e2cd8c2b13b4781858cb67f8d2e7ccf42510 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 07:31:12 -0700 Subject: [PATCH 18/43] bring over updates from VNP --- README.md | 29 ++++++++++------------------- 1 file changed, 10 insertions(+), 19 deletions(-) diff --git a/README.md b/README.md index 29f4983..e724507 100644 --- a/README.md +++ b/README.md @@ -7,6 +7,9 @@ JSON Schema for validating MAVE experiment metadata *Purpose:* To provide an overarching organization and definitions for terms relevant to 
tech development and data repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). +This "controlled vocabulary" and standard is intended to give structure to minimum required information for data and +meta-data sharing for scientists using variant effect mapping technology. + ## How to use this repository @@ -14,6 +17,12 @@ This repository contains an implementation of the schema described in the [Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. +Overall, it is felt that minimum standard reporting should include information on +(1) means and characteristics of genetic perturbation, +(2) details of the phenotypic assay employed to identify variant effects, +(3) information on the cellular and environmental context(s) in which the assays were carried out, and +(4) details of sequencing strategy for variant-effect associations. + The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. The implementation is found in the `schema` directory. @@ -89,25 +98,7 @@ Generating these stable identifiers is not required but is recommended, particul ## Controlled vocabulary terms -### Important acronyms - -#### Multiplexed Assays of Variant Effects (MAVEs) - -Experimental assays involving scaled, pooled genetic perturbation of a naturally occurring or synthetic DNA element -followed by multiplexed high-throughput phenotyping (potentially multiple phenotypic modalities). - -#### Variant Effect Map (VEM) - -A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) -on a single or multiplexed set of phenotypes. 
- -#### Atlas of Variant Effects (AVE) ### - -A combined resource for variant effects measured across model systems and contexts applicable to the study of the -structure and function of the genome and its products, as well as the consequences of its perturbation in health and -disease. - -### Ontologies and identifiers +### Overview of ontologies and identifiers For describing assay readouts, we make use of terms from the [Ontology for Biomedical Investigations](https://obi-ontology.org/). From 1dfb4cad18b9953e1f2029eb6e4988724ecca23d Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 10:35:28 -0400 Subject: [PATCH 19/43] Markdown lint --- README.md | 64 +++++++++++++++++++++++++++++++++---------------------- 1 file changed, 38 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 183ae81..a01d849 100644 --- a/README.md +++ b/README.md @@ -133,22 +133,25 @@ For a given DNA element we distinguish the mode of variant programming/engineeri (e.g. all SNV, indels, ClinVar variants etc). Controlled vocabulary terms (one or many): + - Coding - Intronic - Non-coding regulatory - Non-coding other (eg tRNA) - + **Variant library characteristics** – methods used to generate the library -*Variant generation method* – how was the variant library created +*Variant generation method* – how was the variant library created (e.g. doped oligo, mutagenic PCR, primer-based, base editor) Controlled vocabulary categorical term (can pick both category options): + - Editing at endogenous locus - In vitro variant construct generation *In vitro construct generation method* (if applicable) -- Oligo-directed mutagenic PCR (e.g. NNK PCR) + +- Oligo-directed mutagenic PCR (e.g. 
NNK PCR) - Error-prone PCR - Nicking mutagenesis - Microarray synthesis @@ -159,28 +162,32 @@ Controlled vocabulary categorical term (can pick both category options): - Other (please describe) *Integration/expression of exogenous construct* (if applicable) + - Entire element replacement at the native locus (e.g. with integrases, not base editing) *Integration of extra-local construct* (e.g. with landing pad; if applicable) + - Viral Integration - Episomal delivery - Transfection of RNA *Endogenous genome editing* (if applicable) + - CRISPR/Cas system - SpCas9 - SaCas9 - AsCas12a - RfxCas13d - CRISPR/Cas system functionality - - Wildtype nuclease - - Base Editor - - Prime Editor + - Wildtype nuclease + - Base Editor + - Prime Editor -**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism +**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism (e.g. viral transduction, electroporation, transfection and MOI) Controlled vocabulary terms (one or many): + - Electroporation - Lipofection - Nucleofection @@ -201,37 +208,38 @@ measurement, expression of a particular factor and mode of measurement (FACS, sc **Dimensionality of phenotyping assays** – how many phenotypes and of what complexity are included in the map Controlled vocabulary terms (select one): + - Single functional read-out - Single dimension (e.g. FACS fluorescence from a single protein was used) - High-dimensional data (e.g. ML/AI enabled cell imaging/classification) - The outcomes of multiple phenotypic assays were combined to make this map -**Phenotypic assay examines** – terms selected from OBI subtree with root +**Phenotypic assay examines** – terms selected from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) -- DNA - - OBI_0000913 Promoter activity reporter gene assay RNA - - “Other”, e.g. 
structure, methylation +- DNA + - OBI_0000913 Promoter activity reporter gene assay RNA + - “Other”, e.g. structure, methylation - RNA - - OBI_0001177 Bulk RNA-sequencing - - OBI_0002631 Single cell RNA-sequencing and single cell combinatorial index RNA-sequencing assay - - OBI_0003094 Fluorescence in-situ hybridization (FISH) assay - - “Other” + - OBI_0001177 Bulk RNA-sequencing + - OBI_0002631 Single cell RNA-sequencing and single cell combinatorial index RNA-sequencing assay + - OBI_0003094 Fluorescence in-situ hybridization (FISH) assay + - “Other” -- Protein - - OBI_0000916 Flow cytometry assay - - OBI_0003096 Imaging Mass Cytometry assay - - OBI_0002161 Evolution of ligands by exponential enrichment assay - - “Other” +- Protein + - OBI_0000916 Flow cytometry assay + - OBI_0003096 Imaging Mass Cytometry assay + - OBI_0002161 Evolution of ligands by exponential enrichment assay + - “Other” - Morphology & Function - - OBI_0002119 Single cell imaging - - OBI_0003091 Multiplexed fluorescent antibody imaging - - OBI_0001146 Binding assays - - OBI_0000891 Cell Proliferation Assay, including fluorescence image-based cell proliferation assay - - OBI_0000699 Survival assessment assay - - “Other” + - OBI_0002119 Single cell imaging + - OBI_0003091 Multiplexed fluorescent antibody imaging + - OBI_0001146 Binding assays + - OBI_0000891 Cell Proliferation Assay, including fluorescence image-based cell proliferation assay + - OBI_0000699 Survival assessment assay + - “Other” **Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/) @@ -242,6 +250,7 @@ Controlled vocabulary terms (select one): potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) Controlled vocabulary terms (one or many): + - Immortalized human cells (e.g. 
HEK293, HeLa cells; please specify below) - Murine primary cells - Induced pluripotent stem cells from male @@ -280,6 +289,7 @@ Commonly used cell lines and model systems KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) Controlled vocabulary terms (select one): + - Yes - If yes, please describe this in detail in the free text methods describing your assay. - No @@ -290,11 +300,13 @@ This section details the method for accurately capturing variant frequency assoc **Library profiling strategy** – approach used to quantify variants in the population Controlled vocabulary terms (select one): + - Direct sequencing - Shotgun sequencing - Barcode sequencing Controlled vocabulary terms (select one): + - Single segment (short read) - Single segment (long read) - Multi-segment From 4cf20f436e6a74cfbc5ed43f76220c44fe695518 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 10:36:34 -0400 Subject: [PATCH 20/43] Links and text updates --- README.md | 90 +++++++++++++++++++++++++++---------------------------- 1 file changed, 44 insertions(+), 46 deletions(-) diff --git a/README.md b/README.md index a01d849..afdcb60 100644 --- a/README.md +++ b/README.md @@ -16,74 +16,69 @@ multiplexed assay experiment. The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. -The implementation is found in the `schema` directory. +The implementation is found in the [schema](schema/) directory. In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for describing one of these experiments. 
-The `examples` directory contains examples of this type of document describing real experiments, as well as a simple +The [examples](examples/) directory contains examples of this type of document describing real experiments, as well as a simple Python script that will run the schema validation using [jsonschema](https://pypi.org/project/jsonschema/). Many other implementations of the JSON Schema standard are available in other languages (see [here](https://json-schema.org/implementations.html)). -Please note that although we are using the JSON Schema standard, the files here are in YAML format because it is more -human-readable. +Please note that although we are using the JSON Schema standard, the schema source file is written in YAML format for ease in human +reading/writing, and processed to JSON using the provided [Makefile](schema/Makefile). ## Reading the schema -The `schema` directory contains a YAML representation of the minimum information standard and controlled vocabulary. -There are multiple levels of required information that can be browsed hierarchically. -Most fields include a description that details the intention of that field and the type of information that is to be -provided. +The [schema](schema/) directory contains JSON and YAML representations of the minimum information standard and controlled vocabulary +expressed as JSON Schema. There are multiple levels of required information that can be browsed hierarchically. +Most fields include a description that details the intention of that field and the type of information that is to be provided. -For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that can +For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that must be used to describe the experiment. -The general schema structure and terms are also described below. 
-The YAML documents in the `schema` directory should be considered the authoritative structure and source of information -where there are discrepancies. +The general schema structure and terms are also described below. The YAML documents in the [schema](schema/) directory should be +considered the authoritative structure and source of information where discrepancies exist. ### Applying the schema to your datasets -Unless you are an experienced YAML user who is able to read the `schema/experiment.yml` file yourself, we recommend +Unless you are an experienced YAML user who is able to read the [schema/experiment.yml](schema/experiment.yml) file yourself, we recommend choosing the most closely-related example file as a starting point and modifying it as needed. The repository currently contains three examples: -* `examples/Findlay_2018.yml` describes a saturation genome editing (SGE) experiment on BRCA1, involving CRISPR-based -editing of the endogenous locus and measuring cell survival in HAP1 cells +- [examples/Findlay_2018.yml](examples/Findlay_2018.yml) describes a saturation genome editing (SGE) experiment on BRCA1, involving CRISPR-based +editing of the endogenous locus and measuring cell survival in HAP1 cells ([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/30209399/)) -* `examples/Matreyek_2018.ml` describes a deep mutational scan of PTEN, expressed using a designed construct integrated -into the genome using a landing pad system and measuring cell fluorescence, also known as VAMP-seq +- [examples/Matreyek_2018.yml](examples/Matreyek_2018.yml) describes a deep mutational scan of PTEN, expressed using a designed construct integrated +into the genome using a landing pad system and measuring cell fluorescence, also known as VAMP-seq ([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/29785012/)) -* `examples/Seuma_2022.yml` describes a deep mutatational scan of amyloid beta, expressed episomally and measuring the +- 
[examples/Seuma_2022.yml](examples/Seuma_2022.yml) describes a deep mutational scan of amyloid beta, expressed episomally and measuring the effect on yeast growth ([PubMed reference](https://pubmed.ncbi.nlm.nih.gov/36400770/)) The schema starts with some descriptive metadata, such as the title and abstract. -We recommend that the title in particular focus on describing the dataset specific to the document rather than the -overall study. +The title and abstract should reflect the experimental dataset reflected in a study (which may optionally reference a published document that may +have a differing title). The `title` and `abstract` are required properties. -The next section (`document`) describes the publication (if any). -This part of the schema is optional, but if it is included, the `ref` property that provides an accession number -(such as a [DOI](https://www.doi.org/)) is required. +The next section (`document`) describes a publication associated with the experiment (if any). +This part of the schema is optional, but if used, must minimally include a `ref` property with a URI (such as a [DOI](https://www.doi.org/)) +linking to the publication. -The following sections `variantLibrary` and `phenotypicAssay` describe the experiment that was performed and both are -required. -Each has several subsections that provide structure for detailing the important experimental design decisions captured -by the schema.
+We refer users to the examples and the list of [controlled vocabulary terms](#controlled-vocabulary-terms) below to help complete this section, +as it will be different for each experiment. -*Note:* We anticipate that the standard will be adopted by established resources such as -[MaveDB](https://www.mavedb.org) that will provide users with the ability to download a minimum information file after -data deposition. +*Note:* We anticipate that the standard will be adopted by established resources such as [MaveDB](https://www.mavedb.org) that will provide +users with the ability to download a minimum information file after data deposition. ### Generating sequence identifiers Some examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. -These values were generated according to the [GA4GH VRS](https://vrs.ga4gh.org/) standard (see -[here](en/stable/impl-guide/computed_identifiers.html)) for details. +These values were generated according to the [GA4GH VRS v1.3](https://vrs.ga4gh.org/) standard (see +[here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html)) for details. Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries. @@ -103,13 +98,16 @@ on a single or multiplexed set of phenotypes. #### Atlas of Variant Effects (AVE) ### -A combined resource for variant effects measured across model systems and contexts applicable to the study of the -structure and function of the genome and its products, as well as the consequences of its perturbation in health and -disease. +A combined resource for variant effects measured across model systems and contexts applicable to the study of the +structure and function of the genome and its products, as well as the consequences of its perturbation in health and +disease. Read more at . 
### Ontologies and identifiers -For describing assay readouts, we make use of terms from the +Concept codes follow the `Coding` model, which describes concepts as objects with a `code` and `label` used by a +`system` (or `version` of a `system`). + +For describing assay readouts, we recommend the use of terms from the [Ontology for Biomedical Investigations](https://obi-ontology.org/). For describing human diseases relevant to the assay, we recommend using terms from [OMIM](https://www.omim.org/) or @@ -126,10 +124,10 @@ the organism (including strain, where applicable). This section describes the scope and characteristics of variant introduction. -**Library scope** – the collection of DNA elements introduced into the library. -DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, -repressors, etc), or unknown functions. -For a given DNA element we distinguish the mode of variant programming/engineering +**Library scope** – the collection of DNA elements introduced into the library. +DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, +repressors, etc), or unknown functions. +For a given DNA element we distinguish the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants etc). Controlled vocabulary terms (one or many): @@ -200,9 +198,9 @@ Controlled vocabulary terms (one or many): #### Phenotypic assay -A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large -amount of genetic variants (e.g. cell size and mode of adjudication, action potential characteristic(s) and mode of -measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression +A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large +amount of genetic variants (e.g. 
cell size and mode of adjudication, action potential characteristic(s) and mode of +measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression (bulk RNA-seq)). **Dimensionality of phenotyping assays** – how many phenotypes and of what complexity are included in the map @@ -246,7 +244,7 @@ Controlled vocabulary terms (select one): #### Context - Characteristics of the model system that influence expression of phenotype -**Cellular model system and genetic background** – genetically encoded characteristics of the model system that +**Cellular model system and genetic background** – genetically encoded characteristics of the model system that potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) Controlled vocabulary terms (one or many): From 668bd658da23129cfa80b392e6cf0f020ba46d04 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 07:58:31 -0700 Subject: [PATCH 21/43] add replicate field --- examples/Matreyek_2018.yml | 7 +++++++ schema/experiment.yml | 17 +++++++++++++++++ 2 files changed, 24 insertions(+) diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index e66eb32..f466f83 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -54,6 +54,13 @@ variantLibrary: phenotypicAssay: dimensionality: type: single dimension + replication: + type: biological and technical + description: 8 biological replicate experiments were performed from three + different transfections (4, 3, and 1 experimental replicate for these + transfections). Technical replicates were performed as part of QC, but + the technical replicates were collapsed and analyzed as one experiment + after passing. 
method: type: flow cytometry assay description: VAMP-seq diff --git a/schema/experiment.yml b/schema/experiment.yml index ac3f823..a960de7 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -226,6 +226,23 @@ $defs: description: additional details about the dimensionality of the assay required: - type + replication: + type: object + description: replication of phenotyping assay. Describes what kind of replication was performed. + properties: + type: + description: a coding defining the kind of replication performed. + type: string + enum: + - biological + - technical + - biological and technical + - no replication + description: + type: string + description: additional details about the replicate structure of the assay, including number of replicates + required: + - type method: description: the assay method, defining the molecular properties interrogated. type: object From a224311c11db261b6445d8f3ef872727c55c489c Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 08:21:46 -0700 Subject: [PATCH 22/43] add identifiers for modelSystem --- examples/Findlay_2018.yml | 4 ++++ examples/Matreyek_2018.yml | 7 +++++++ examples/Seuma_2022.yml | 7 ++++++- schema/experiment.yml | 6 ++++++ 4 files changed, 23 insertions(+), 1 deletion(-) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index ecdc980..94ba09f 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -53,5 +53,9 @@ phenotypicAssay: modelSystem: type: immortalized human cells description: HAP1 + identifiers: + - system: https://www.ncbi.nlm.nih.gov/taxonomy + code: NCBI:txid9606 + label: Homo sapiens profilingStrategy: direct sequencing sequencingMethod: multi-segment diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index f466f83..7543044 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -80,5 +80,12 @@ phenotypicAssay: modelSystem: type: immortalized human cells description: HEK 293T TetBxb1BFP + 
identifiers: + - system: https://www.ebi.ac.uk/ols/ontologies/clo + code: CLO:0037372 + label: HEK293T cell + - system: https://www.ncbi.nlm.nih.gov/taxonomy + code: NCBI:txid9606 + label: Homo sapiens profilingStrategy: barcode sequencing sequencingMethod: single-segment (short read) diff --git a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index 86caf04..d16af93 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -51,6 +51,11 @@ phenotypicAssay: label: Alzheimer’s Disease modelSystem: type: yeast - description: Yeast (S.cerevisiae) + description: Saccharomyces cerevisiae [psi-pin-] + (MATa ade1-14 his3 leu2-3,112 lys2 trp1 ura3-52) + identifiers: + - system: https://www.ncbi.nlm.nih.gov/taxonomy + code: NCBI:txid4932 + label: Saccharomyces cerevisiae profilingStrategy: direct sequencing sequencingMethod: single-segment (short read) \ No newline at end of file diff --git a/schema/experiment.yml b/schema/experiment.yml index a960de7..20e1f60 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -295,6 +295,12 @@ $defs: description: type: string description: additional details about the model system. + identifiers: + description: relevant ontology terms or identifiers for the model system. 
+ type: array + items: + $ref: "#/$defs/Coding" + minItems: 1 profilingStrategy: description: the strategy used to profile the variant library type: string From 37227ec30a5bddcd153b8833b88d49317d295218 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 09:27:22 -0700 Subject: [PATCH 23/43] added data source to schema --- examples/Findlay_2018.yml | 12 ++++++++++++ examples/Matreyek_2018.yml | 9 +++++++++ examples/Seuma_2022.yml | 5 +++++ schema/experiment.yml | 27 +++++++++++++++++++++++++++ 4 files changed, 53 insertions(+) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index 94ba09f..15ddf7e 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -16,6 +16,18 @@ document: Nature date: "2018-09-12" ref: https://doi.org/10.1038/s41586-018-0461-z +datasets: + - system: MaveDB + accession: urn:mavedb:00000097 + ref: https://mavedb.org/#/experiment-sets/urn:mavedb:00000097 + description: processed scores, including scores for each replicate of each exon + - system: GEO + accession: GSE117159 + ref: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117159 + description: raw sequencing data + - system: website + accession: https://sge.gs.washington.edu/BRCA1/ + ref: https://sge.gs.washington.edu/BRCA1/ variantLibrary: scope: type: coding diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index 7543044..8f03f71 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -18,6 +18,15 @@ document: Nature Genetics date: "2018-05-21" ref: https://doi.org/10.1038/s41588-018-0122-z +datasets: + - system: MaveDB + accession: urn:mavedb:00000013-a + ref: https://mavedb.org/#/experiments/urn:mavedb:00000013-a + description: processed scores, including scores for each replicate experiment + - system: BioProject + accession: PRJNA428380 + ref: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA428380 + description: raw sequencing data variantLibrary: scope: type: coding diff --git 
a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index d16af93..bdda850 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -18,6 +18,11 @@ document: Nature Communications date: "2022-11-18" ref: https://doi.org/10.1038/s41467-022-34742-3 +datasets: + - system: MaveDB + accession: urn:mavedb:00000113-a + ref: https://mavedb.org/#/experiments/urn:mavedb:00000113-a + description: processed scores variantLibrary: scope: type: coding diff --git a/schema/experiment.yml b/schema/experiment.yml index 20e1f60..c778597 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -42,6 +42,27 @@ $defs: label: description: a human-readable description of the concept associated with the code type: string + Dataset: + type: object + additionalProperties: false + description: a dataset available from an external source + properties: + system: + type: string + description: the name of the system the dataset is available from, such as a database + accession: + type: string + description: accession number for the dataset in the system + ref: + type: string + format: uri + description: a Uniform Resource Identifier for the dataset + description: + description: additional details about the dataset, such as whether it contains raw or processed data + type: string + required: + - system + - accession ReferenceSequence: type: object additionalProperties: false @@ -334,6 +355,12 @@ properties: document: description: the primary document describing this experiment $ref: "#/$defs/Document" + datasets: + description: datasets associated with this experiment + type: array + items: + $ref: "#/$defs/Dataset" + minItems: 1 variantLibrary: description: characteristics of the variant library generation process $ref: "#/$defs/VariantLibrary" From bfc63d75b6ec67a828d872df87530ccd364be5e2 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 09:28:12 -0700 Subject: [PATCH 24/43] add missing description --- examples/Findlay_2018.yml | 1 + 1 file changed, 1
insertion(+) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index 15ddf7e..f71dce6 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -28,6 +28,7 @@ datasets: - system: website accession: https://sge.gs.washington.edu/BRCA1/ ref: https://sge.gs.washington.edu/BRCA1/ + description: processed scores and visualizations hosted by the investigators variantLibrary: scope: type: coding From 96d3dcc0f59330ddd1d2ddcbf0b5ef9ff54a5cc4 Mon Sep 17 00:00:00 2001 From: Alan Rubin Date: Wed, 11 Oct 2023 09:36:41 -0700 Subject: [PATCH 25/43] add replication section --- examples/Findlay_2018.yml | 3 +++ examples/Seuma_2022.yml | 4 ++++ 2 files changed, 7 insertions(+) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index f71dce6..a95dc6d 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -46,6 +46,9 @@ variantLibrary: phenotypicAssay: dimensionality: type: multiple functional readouts + replication: + type: biological + description: two biological replicates were performed method: type: survival assessment assay method: diff --git a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index bdda850..4fc7008 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -41,6 +41,10 @@ variantLibrary: phenotypicAssay: dimensionality: type: single dimension + replication: + type: biological and technical + description: Three biological replicates (transformations) were performed and five technical replicate selections + were done for each. Sequencing was performed by combining six equimolar samples of each technical replicate. method: type: survival assessment assay description: Survival assessment assay (growth in -adenine) From e87f53b87ec169de44d3eee32653a6055cfbad55 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 15:26:49 -0400 Subject: [PATCH 26/43] library scope edits --- README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index afdcb60..7d8e276 100644 --- a/README.md +++ b/README.md @@ -96,7 +96,7 @@ followed by multiplexed high-throughput phenotyping (potentially multiple phenot A dataset that reports the effects of variation in a DNA element (a gene, transcript, set of regulatory regions, etc.) on a single or multiplexed set of phenotypes. -#### Atlas of Variant Effects (AVE) ### +#### Atlas of Variant Effects (AVE) A combined resource for variant effects measured across model systems and contexts applicable to the study of the structure and function of the genome and its products, as well as the consequences of its perturbation in health and @@ -130,12 +130,15 @@ repressors, etc), or unknown functions. For a given DNA element we distinguish the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants etc). -Controlled vocabulary terms (one or many): +Controlled vocabulary terms for `scope.type` (one or many): + +- coding +- intronic +- non-coding, regulatory +- non-coding, other -- Coding -- Intronic -- Non-coding regulatory -- Non-coding other (eg tRNA) +Libraries may be further described with `scope.description`. The `description` field should be populated for any +library of type `non-coding, other` (e.g. tRNA libraries). **Variant library characteristics** – methods used to generate the library From cad3f148e17a41ec5fcad3b257e0cde14a2d6cce Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 17:19:39 -0400 Subject: [PATCH 27/43] add refGet ref --- README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 50f3fdc..a4dd76a 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,6 @@ repositories associated with the [Atlas of Variant Effects Alliance](https://www This "controlled vocabulary" and standard is intended to give structure to minimum required information for data and meta-data sharing for scientists using variant effect mapping technology. - ## How to use this repository This repository contains an implementation of the schema described in the @@ -86,8 +85,8 @@ users with the ability to download a minimum information file after data deposit ### Generating sequence identifiers Some examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. -These values were generated according to the [GA4GH VRS v1.3](https://vrs.ga4gh.org/) standard (see -[here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html)) for details. +These values were generated according to the [GA4GH VRS v1.3](https://vrs.ga4gh.org/) and [refGet](http://samtools.github.io/hts-specs/refget.html) +standards (see [here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html)) for details. Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries. From 9fb312790c4cfb53dc5cdf6d11b3a9177c8219cf Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 17:24:07 -0400 Subject: [PATCH 28/43] minor edits --- README.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index a4dd76a..42f0de2 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ JSON Schema for validating MAVE experiment metadata *Purpose:* To provide an overarching organization and definitions for terms relevant to tech development and data repositories associated with the [Atlas of Variant Effects Alliance](https://www.varianteffect.org). -This "controlled vocabulary" and standard is intended to give structure to minimum required information for data and +This "controlled vocabulary" and standard is intended to give structure to minimum required information for data and meta-data sharing for scientists using variant effect mapping technology. ## How to use this repository @@ -16,10 +16,10 @@ This repository contains an implementation of the schema described in the [Atlas of Variant Effects Alliance](https://www.varianteffect.org) minimum information model for describing a multiplexed assay experiment. -Overall, it is felt that minimum standard reporting should include information on -(1) means and characteristics of genetic perturbation, -(2) details of the phenotypic assay employed to identify variant effects, -(3) information on the cellular and environmental context(s) in which the assays were carried out, and +Overall, it is felt that minimum standard reporting should include information on +(1) means and characteristics of genetic perturbation, +(2) details of the phenotypic assay employed to identify variant effects, +(3) information on the cellular and environmental context(s) in which the assays were carried out, and (4) details of sequencing strategy for variant-effect associations. 
The schema defines a set of required and optional fields and possible values that can be used to validate a minimum @@ -94,7 +94,7 @@ Generating these stable identifiers is not required but is recommended, particul ### Overview of ontologies and identifiers -Concept codes follow the `Coding` model, which describes concepts as objects with a `code` and `label` used by a +Concept codes used by the schema follow the `Coding` model, which describes concepts as objects with a `code` and `label` used by a `system` (or `version` of a `system`). For describing assay readouts, we recommend the use of terms from the @@ -105,6 +105,7 @@ the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/). For describing human cell lines, we use terms from the [Cell Line Ontology](http://obofoundry.org/ontology/clo.html), where available. + We encourage users to provide an [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) that specifically denotes the organism (including strain, where applicable). @@ -118,7 +119,7 @@ This section describes the scope and characteristics of variant introduction. DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, repressors, etc), or unknown functions. For a given DNA element we distinguish the mode of variant programming/engineering -(e.g. all SNV, indels, ClinVar variants etc). +(e.g. all SNV, indels, ClinVar variants, etc). Controlled vocabulary terms for `scope.type` (one or many): From 1750970bee33d40f6cfdb3093f1344e3521eb091 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 18:43:22 -0400 Subject: [PATCH 29/43] closes #2 --- README.md | 86 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 46 insertions(+), 40 deletions(-) diff --git a/README.md b/README.md index 42f0de2..e283e5c 100644 --- a/README.md +++ b/README.md @@ -74,7 +74,7 @@ The next section (`document`) describes a publication associated with the experi This part of the schema is optional, but if used, must minimally include a `ref` property with a URI (such as a [DOI](https://www.doi.org/)) linking to the publication. -The following sections `variantLibrary` and `phenotypicAssay` describe the experiment that was performed and both are required. +The [variantLibrary](#variant-library) and [phenotypicAssay](#phenotypic-assay) describe the experiment that was performed and both are required. Each has several subsections that provide structure for detailing the important experimental design decisions captured by the schema. We refer users to the examples and the list of [controlled vocabulary terms](#controlled-vocabulary-terms) below to help complete this section, as it will be different for each experiment. @@ -109,15 +109,17 @@ where available. We encourage users to provide an [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) that specifically denotes the organism (including strain, where applicable). -### Experimental vocabulary (genetic perturbation, phenotype and context) +### Variant Library -#### Genetic perturbation +This section describes the scope and characteristics of a variant library: a collection of sequence variants for a MAVE experiment +that are derived from a common target sequence. -This section describes the scope and characteristics of variant introduction. +#### Library scope + +the functional scope of DNA elements introduced into the library. +DNA elements can have known or unknown functions. 
Example functions include a gene, an exon or set of exons included in a +transcript, a set of enhancers, a set of repressors, etc. -**Library scope** – the collection of DNA elements introduced into the library. -DNA elements can have known (e.g. a gene, an exon or set of exons included in a transcript, a set of enhancers, -repressors, etc), or unknown functions. For a given DNA element we distinguish the mode of variant programming/engineering (e.g. all SNV, indels, ClinVar variants, etc). @@ -128,55 +130,59 @@ Controlled vocabulary terms for `scope.type` (one or many): - non-coding, regulatory - non-coding, other -Libraries may be further described with `scope.description`. The `description` field should be populated for any +Libraries may be further described with `description`. The `description` field must be populated for any library of type `non-coding, other` (e.g. tRNA libraries). -**Variant library characteristics** – methods used to generate the library +#### Library generation method + +The methods used to generate the library. A library may create and integrate an *in vitro* construct or directly edit an +endogenous locus. The library generation method is defined by its `type`, which may be one of: -*Variant generation method* – how was the variant library created -(e.g. doped oligo, mutagenic PCR, primer-based, base editor) +- in-vitro construct library +- endogenous locus library -Controlled vocabulary categorical term (can pick both category options): +##### In-vitro construct library method -- Editing at endogenous locus -- In vitro variant construct generation +A methodology for generating and integrating an exogenous variant library. -*In vitro construct generation method* (if applicable) +For *in-vitro* constructs, `system` is one of the following controlled vocabulary terms: -- Oligo-directed mutagenic PCR (e.g. 
NNK PCR) -- Error-prone PCR -- Nicking mutagenesis -- Microarray synthesis -- Site-directed mutagenesis -- Doped oligo synthesis -- Oligo pool synthesis -- Proprietary method -- Other (please describe) +- oligo-directed mutagenic PCR +- error-prone PCR +- nicking mutagenesis +- microarray synthesis +- site-directed mutagenesis +- doped oligo synthesis +- oligo pool synthesis +- proprietary method +- other -*Integration/expression of exogenous construct* (if applicable) +In addition, `integration` refers to the mechanism for integration or expression of an exogenous construct and is one of +the following controlled vocabulary terms: -- Entire element replacement at the native locus (e.g. with integrases, not base editing) +- native locus replacement +- extra-local construct insertion +- random locus viral integration +- episomal delivery +- plasmid (not integrated) +- transfection of RNA -*Integration of extra-local construct* (e.g. with landing pad; if applicable) +`system` and `integration` are required properties. `description` may be used to further describe the generation method +`system` and `integration` parameters, and is required if the `system` is set to `other`. -- Viral Integration -- Episomal delivery -- Transfection of RNA +##### Endogenous locus library method -*Endogenous genome editing* (if applicable) +A methodology for generating a variant library at an endogenous locus. + +For endogenous editing, `system` refers to the CRISPR/Cas system used, and is one of the following controlled vocabulary terms: -- CRISPR/Cas system - SpCas9 - SaCas9 - AsCas12a -- RfxCas13d -- CRISPR/Cas system functionality - - Wildtype nuclease - - Base Editor - - Prime Editor - -**Delivery method** – how the variant induction machinery and/or construct was delivered to the cell/organism -(e.g. 
viral transduction, electroporation, transfection and MOI) +- RfxCas13d + +In addition, `mechanism` is used to define the functional mechanism of the method, and is one of the following controlled vocabulary +terms: Controlled vocabulary terms (one or many): From 3e3ef7974fdb42564dad10dc6e0a4f017191652c Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 18:56:11 -0400 Subject: [PATCH 30/43] variant library delivery method --- README.md | 41 +++++++++++++++++++++++++---------------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index e283e5c..47b74b6 100644 --- a/README.md +++ b/README.md @@ -116,14 +116,11 @@ that are derived from a common target sequence. #### Library scope -the functional scope of DNA elements introduced into the library. -DNA elements can have known or unknown functions. Example functions include a gene, an exon or set of exons included in a +The variant library should be defined by the functional scope of DNA elements introduced into the library. +DNA elements can have known or unknown functions. Example functions include a gene, an exon or set of exons included in a transcript, a set of enhancers, a set of repressors, etc. -For a given DNA element we distinguish the mode of variant programming/engineering -(e.g. all SNV, indels, ClinVar variants, etc). - -Controlled vocabulary terms for `scope.type` (one or many): +We define the scope type using the following controlled vocabulary terms: - coding - intronic @@ -184,17 +181,29 @@ For endogenous editing, `system` refers to the CRISPR/Cas system used, and is on In addition, `mechanism` is used to define the functional mechanism of the method, and is one of the following controlled vocabulary terms: -Controlled vocabulary terms (one or many): +- nuclease +- base editor +- prime editor + +`system` and `mechanism` are required properties.
`description` may be used to further describe the generation method +`system` and `mechanism` parameters. + +#### Delivery method + +The delivery method specifies how the variant induction machinery and/or construct was delivered to the cell/organism +(e.g. viral transduction, electroporation, transfection and MOI). + +The delivery method is specified by the `type` property and must be one of the following controlled vocabulary terms: + +- electroporation +- nucleofection +- chemical-based transfection +- adeno-associated virus transduction +- lentivirus transduction +- chemical or heat shock transformation +- other -- Electroporation -- Lipofection -- Nucleofection -- Microinjection -- Chemical-based transfection -- Transduction: AAV -- Transduction: lentivirus -- Transformation: chemical or heat shock -- Other (please specify) +The `type` field is required. Additional detail about the delivery method may be provided with the `description` property. #### Phenotypic assay From 6351f6dbd93853fe49f7ffa3ad54f8c00bea366c Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 19:04:35 -0400 Subject: [PATCH 31/43] target sequences --- README.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 47b74b6..586e4db 100644 --- a/README.md +++ b/README.md @@ -86,7 +86,7 @@ users with the ability to download a minimum information file after data deposit Some examples (e.g. `examples/Seuma_2022.yml`) include target sequence identifiers and hashes. These values were generated according to the [GA4GH VRS v1.3](https://vrs.ga4gh.org/) and [refGet](http://samtools.github.io/hts-specs/refget.html) -standards (see [here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html)) for details. +standards (see [here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html) for details). 
Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries. @@ -114,6 +114,17 @@ the organism (including strain, where applicable). This section describes the scope and characteristics of a variant library: a collection of sequence variants for a MAVE experiment that are derived from a common target sequence. +#### Target sequences + +A collection of sequences used as references from which all variants in the library are defined. This collection is defined as a +set of `ReferenceSequence` objects, each defined by the following properties: + +`id`: an identifier for the sequence. +`sha512t24u`: the GA4GH `SQ.` identifier (see [here](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html) for +details). +`sequence`: the literal sequence as a string of [IUPAC single character codes](https://www.bioinformatics.org/sms/iupac.html). +`sequenceAlphabet`: one of `na` (nucleic acids) or `aa` (amino acids) for interpreting IUPAC character codes in the `sequence`. + #### Library scope The variant library should be defined by the functional scope of DNA elements introduced into the library. @@ -205,7 +216,7 @@ The delivery method is specified by the `type` property and must be one of the f The `type` field is required. Additional detail about the delivery method may be provided with the `description` property. -#### Phenotypic assay +### Phenotypic assay A physical adjudication of model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants (e.g. cell size and mode of adjudication, action potential characteristic(s) and mode of From 66dbfed7978b8116bf2834fe45bfbee480ed2119 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 19:05:07 -0400 Subject: [PATCH 32/43] minor edits --- schema/experiment.yml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/schema/experiment.yml b/schema/experiment.yml index c778597..7893973 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -161,7 +161,7 @@ $defs: VariantLibrary: type: object additionalProperties: false - description: a collection of sequences that are derived from a common target sequence + description: a collection of sequence variants that are derived from a common target sequence properties: targetSequences: description: the collection of sequences used as references from which all variants in the library are defined @@ -228,6 +228,7 @@ $defs: description: a physical adjudication of a model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants. type: object + additionalProperties: false properties: dimensionality: type: object From 0433ca42ad17df0136c1d1524b5f6128c69ccac7 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 19:16:16 -0400 Subject: [PATCH 33/43] phenotypic assay dimensionality --- README.md | 20 ++++++++++++++------ examples/Findlay_2018.yml | 2 +- examples/Matreyek_2018.yml | 2 +- examples/Seuma_2022.yml | 2 +- 4 files changed, 17 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 586e4db..363c1ff 100644 --- a/README.md +++ b/README.md @@ -223,14 +223,22 @@ amount of genetic variants (e.g. cell size and mode of adjudication, action pote measurement, expression of a particular factor and mode of measurement (FACS, sc-RNA-seq), or transcript expression (bulk RNA-seq)). -**Dimensionality of phenotyping assays** – how many phenotypes and of what complexity are included in the map +#### Dimensionality -Controlled vocabulary terms (select one): +Dimensionality defines how many phenotypes and of what complexity are included in the map. 
+ +Dimensionality is primarily defined by its `type`, which must be one of the following controlled vocabulary terms: + +- single-dimensional data +- high-dimensional data +- combined functional data + +where `single-dimensional data` refers to experiments with a single dimension (e.g. FACS fluorescence from a single protein was used), +`high-dimensional data` refers to experiments with multiple dimensions (e.g. ML/AI enabled cell imaging/classification), and +`combined functional data` refers to experiments where multiple phenotypic assays were combined to make a map. -- Single functional read-out -- Single dimension (e.g. FACS fluorescence from a single protein was used) -- High-dimensional data (e.g. ML/AI enabled cell imaging/classification) -- The outcomes of multiple phenotypic assays were combined to make this map +The `type` field is required. Additional information about the `dimensionality` of an experiment may be provided using the +`description` field. **Phenotypic assay examines** – terms selected from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index a95dc6d..5ce298f 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -45,7 +45,7 @@ variantLibrary: description: Lipofection - TurboFectin phenotypicAssay: dimensionality: - type: multiple functional readouts + type: combined functional data replication: type: biological description: two biological replicates were performed diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index 8f03f71..89ebe90 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -62,7 +62,7 @@ variantLibrary: type: chemical or heat shock transformation phenotypicAssay: dimensionality: - type: single dimension + type: single-dimensional data replication: type: biological and technical description: 8 biological replicate experiments were performed from three diff 
--git a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index 4fc7008..1e7bc0b 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -40,7 +40,7 @@ variantLibrary: type: chemical or heat shock transformation phenotypicAssay: dimensionality: - type: single dimension + type: single-dimensional data replication: type: biological and technical description: Three biological replicates (transformations) were performed and five technical replicate selections From 783f0cc244096813ad078003067ed9934ce404ab Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 19:20:43 -0400 Subject: [PATCH 34/43] notebook fails when failing --- examples/source_validation.py | 2 +- schema/experiment.json | 74 +++++++++++++++++++++++++++++++++-- schema/experiment.yml | 4 +- 3 files changed, 74 insertions(+), 6 deletions(-) diff --git a/examples/source_validation.py b/examples/source_validation.py index bdcaa5a..7fe02f5 100644 --- a/examples/source_validation.py +++ b/examples/source_validation.py @@ -17,7 +17,7 @@ try: validate(experiment_record, experiment_schema) except ValidationError as e: - print("failed to validate:", e.message) + raise RuntimeError(f"failed to validate: {e.message}") from e else: print("validation successful") diff --git a/schema/experiment.json b/schema/experiment.json index d437ea2..cdf2938 100644 --- a/schema/experiment.json +++ b/schema/experiment.json @@ -54,6 +54,34 @@ } } }, + "Dataset": { + "type": "object", + "additionalProperties": false, + "description": "a dataset available from an external source", + "properties": { + "system": { + "type": "string", + "description": "the name of the system the dataset is available from, such as a database" + }, + "accession": { + "type": "string", + "description": "accession number for the dataset in the system" + }, + "ref": { + "type": "string", + "format": "uri", + "description": "a Uniform Resource Identifier for the dataset" + }, + "description": { + "description": "additional details about the
dataset, such as whether it contains raw or processed data", + "type": "string" + } + }, + "required": [ + "system", + "accession" + ] + }, "ReferenceSequence": { "type": "object", "additionalProperties": false, @@ -183,7 +211,7 @@ "VariantLibrary": { "type": "object", "additionalProperties": false, - "description": "a collection of sequences that are derived from a common target sequence", + "description": "a collection of sequence variants that are derived from a common target sequence", "properties": { "targetSequences": { "description": "the collection of sequences used as references from which all variants in the library are defined", @@ -272,6 +300,7 @@ "phenotypicAssay": { "description": "a physical adjudication of a model system that allows for systematic interrogation of a functional read-out for a large amount of genetic variants.", "type": "object", + "additionalProperties": false, "properties": { "dimensionality": { "type": "object", @@ -281,9 +310,9 @@ "description": "a coding defining the dimensionality of the assay as single or multiple functional readouts.", "type": "string", "enum": [ - "single dimension", + "single-dimensional data", "high-dimensional data", - "multiple functional readouts" + "combined functional data" ] }, "description": { @@ -295,6 +324,29 @@ "type" ] }, + "replication": { + "type": "object", + "description": "replication of phenotyping assay. 
Describes what kind of replication was performed.", + "properties": { + "type": { + "description": "a coding defining the kind of replication performed.", + "type": "string", + "enum": [ + "biological", + "technical", + "biological and technical", + "no replication" + ] + }, + "description": { + "type": "string", + "description": "additional details about the replicate structure of the assay, including number of replicates" + } + }, + "required": [ + "type" + ] + }, "method": { "description": "the assay method, defining the molecular properties interrogated.", "type": "object", @@ -357,6 +409,14 @@ "description": { "type": "string", "description": "additional details about the model system." + }, + "identifiers": { + "description": "relevant ontology terms or identifiers for the model system.", + "type": "array", + "items": { + "$ref": "#/$defs/Coding" + }, + "minItems": 1 } } }, @@ -404,6 +464,14 @@ "description": "the primary document describing this experiment", "$ref": "#/$defs/Document" }, + "datasets": { + "description": "datasets associated with this experiment", + "type": "array", + "items": { + "$ref": "#/$defs/Dataset" + }, + "minItems": 1 + }, "variantLibrary": { "description": "characteristics of the variant library generation process", "$ref": "#/$defs/VariantLibrary" diff --git a/schema/experiment.yml b/schema/experiment.yml index 7893973..b94565d 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -240,9 +240,9 @@ $defs: readouts. type: string enum: - - single dimension + - single-dimensional data - high-dimensional data - - multiple functional readouts + - combined functional data description: type: string description: additional details about the dimensionality of the assay From eb2babdcb35deebf2eb9633ea0e549c371658d12 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 19:28:03 -0400 Subject: [PATCH 35/43] add replication --- README.md | 11 +++++++++++ schema/experiment.yml | 1 + 2 files changed, 12 insertions(+) diff --git a/README.md b/README.md index 363c1ff..c3d2a19 100644 --- a/README.md +++ b/README.md @@ -240,6 +240,17 @@ where `single-dimensional data` refers to experiments with a single dimension (e The `type` field is required. Additional information about the `dimensionality` of an experiment may be provided using the `description` field. +#### Replication + +Assay replication work performed is defined by its `type`, which must be one of the following controlled vocabulary terms: + +- biological +- technical +- biological and technical +- no replication + +The `type` field is required. Additional detail about the replication method may be provided with the `description` property. + **Phenotypic assay examines** – terms selected from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) diff --git a/schema/experiment.yml b/schema/experiment.yml index b94565d..48f2837 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -344,6 +344,7 @@ $defs: - modelSystem - profilingStrategy - sequencingMethod + - replication type: object additionalProperties: false properties: From f17c1f0c610dc06fa65689a9ef6c79f41480b2aa Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 20:57:17 -0400 Subject: [PATCH 36/43] method, relevance, and model system --- README.md | 90 +++++++++++++++++++++--------------------- schema/experiment.json | 9 +++-- schema/experiment.yml | 11 ++++-- 3 files changed, 56 insertions(+), 54 deletions(-) diff --git a/README.md b/README.md index c3d2a19..aa4a7d7 100644 --- a/README.md +++ b/README.md @@ -214,7 +214,7 @@ The delivery method is specified by the `type` property and must be one of the f - chemical or heat shock transformation - other -The `type` field is required. 
Additional detail about the delivery method may be provided with the `description` property. +The `type` property is required. Additional detail about the delivery method may be provided with the `description` property. ### Phenotypic assay @@ -237,8 +237,8 @@ where `single-dimensional data` refers to experiments with a single dimension (e `high-dimensional data` refers to experiments with multiple dimensions (e.g. ML/AI enabled cell imaging/classification), and `combined functional data` refers to experiments where multiple phenotypic assays were combined to make a map. -The `type` field is required. Additional information about the `dimensionality` of an experiment may be provided using the -`description` field. +The `type` property is required. Additional information about the `dimensionality` of an experiment may be provided using the +`description` property. #### Replication @@ -249,58 +249,54 @@ Assay replication work performed is defined by its `type`, which must be one of - biological and technical - no replication -The `type` field is required. Additional detail about the replication method may be provided with the `description` property. +The `type` property is required. Additional detail about the replication method may be provided with the `description` property. -**Phenotypic assay examines** – terms selected from OBI subtree with root -[OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) +#### Method -- DNA - - OBI_0000913 Promoter activity reporter gene assay RNA - - “Other”, e.g. 
structure, methylation - -- RNA - - OBI_0001177 Bulk RNA-sequencing - - OBI_0002631 Single cell RNA-sequencing and single cell combinatorial index RNA-sequencing assay - - OBI_0003094 Fluorescence in-situ hybridization (FISH) assay - - “Other” - -- Protein - - OBI_0000916 Flow cytometry assay - - OBI_0003096 Imaging Mass Cytometry assay - - OBI_0002161 Evolution of ligands by exponential enrichment assay - - “Other” +The assay method, defining the molecular properties interrogated by the experiment. Terms are derived from OBI subtree with root +[OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) where appropriate. Term mappings to OBI concept identifiers +are available in the controlled vocabulary definitions file. The method is specified by the `type` property, which must be one of +the following controlled vocabulary terms: -- Morphology & Function - - OBI_0002119 Single cell imaging - - OBI_0003091 Multiplexed fluorescent antibody imaging - - OBI_0001146 Binding assays - - OBI_0000891 Cell Proliferation Assay, including fluorescence image-based cell proliferation assay - - OBI_0000699 Survival assessment assay - - “Other” +- promoter activity detection by reporter gene assay +- bulk RNA-sequencing +- single-cell RNA sequencing assay +- fluorescence in-situ hybridization (FISH) assay +- flow cytometry assay +- imaging mass cytometry assay +- systematic evolution of ligands by exponential enrichment assay +- single cell imaging +- multiplexed fluorescent antibody imaging +- binding assay +- cell proliferation assay +- survival assessment assay +- other -**Disease/biological process relevance** – choose terms from [OMIM](https://www.omim.org/) or the -[Mondo Disease Ontology](https://mondo.monarchinitiative.org/) +#### Relevance -#### Context - Characteristics of the model system that influence expression of phenotype +The disease or biological processes the assay is relevant to. 
Relevance is specified by an array of `Coding` objects (see +[note](#overview-of-ontologies-and-identifiers)). We recommend relevance to be described by terms from [OMIM](https://www.omim.org/) +or the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/). -**Cellular model system and genetic background** – genetically encoded characteristics of the model system that -potentially affect the outcome of the assay (e.g. species, animal strain, genetic ancestry, biological sex) +#### Model system -Controlled vocabulary terms (one or many): +The model system context that influences expression of the phenotype. The model system is specified by the `type` property and must +be one of the following controlled vocabulary terms: -- Immortalized human cells (e.g. HEK293, HeLa cells; please specify below) -- Murine primary cells -- Induced pluripotent stem cells from male -- Induced pluripotent stem cells from female -- Patient derived primary cells (e.g. T-cells, adipocytes) -- Yeast -- E. coli -- Other bacteria -- Bacteriophage -- Molecular display (e.g. ribosome display) -- Other (please specify - includes all other OBI ontology terms) +- immortalized human cells +- murine primary cells +- induced pluripotent stem cells from human male +- induced pluripotent stem cells from human female +- patient derived primary cells (e.g. T-cells, adipocytes) +- yeast +- bacteria +- bacteriophage +- molecular display +- other -Commonly used cell lines and model systems +We recommend that cell lines are further described by relevant concepts using the `codings` array of `Coding` objects (see +[note](#overview-of-ontologies-and-identifiers)). We recommend that cell lines are described using the Cell Line Ontology +where applicable. 
Some commonly used cell lines and model systems are listed below: | Cell | CLO Term | NCBI Taxonomy ID | |------|----------|------------------| @@ -321,6 +317,8 @@ Commonly used cell lines and model systems | Bacteriophage | n/a | 38018 | | Cell-free | n/a | n/a | +The `type` property is required. Additional detail about the model system may be provided with the `description` property. + **Environmental variables** – variance of environmental factors included in the experiment (e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) diff --git a/schema/experiment.json b/schema/experiment.json index cdf2938..c53b556 100644 --- a/schema/experiment.json +++ b/schema/experiment.json @@ -354,16 +354,16 @@ "type": { "type": "string", "enum": [ - "promoter activity reporter gene assay", + "promoter activity detection by reporter gene assay", "bulk RNA-sequencing", "single-cell RNA sequencing assay", "fluorescence in-situ hybridization (FISH) assay", "flow cytometry assay", "imaging mass cytometry assay", - "evolution of ligands by exponential enrichment assay", + "systematic evolution of ligands by exponential enrichment assay", "single cell imaging", "multiplexed fluorescent antibody imaging", - "binding assays", + "binding assay", "cell proliferation assay", "survival assessment assay", "other" @@ -445,7 +445,8 @@ "relevance", "modelSystem", "profilingStrategy", - "sequencingMethod" + "sequencingMethod", + "replication" ] } }, diff --git a/schema/experiment.yml b/schema/experiment.yml index 48f2837..5852395 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -42,6 +42,9 @@ $defs: label: description: a human-readable description of the concept associated with the code type: string + required: + - system + - code Dataset: type: object additionalProperties: false @@ -272,16 +275,16 @@ $defs: type: type: string enum: - - promoter activity 
reporter gene assay + - promoter activity detection by reporter gene assay - bulk RNA-sequencing - single-cell RNA sequencing assay - fluorescence in-situ hybridization (FISH) assay - flow cytometry assay - imaging mass cytometry assay - - evolution of ligands by exponential enrichment assay + - systematic evolution of ligands by exponential enrichment assay - single cell imaging - multiplexed fluorescent antibody imaging - - binding assays + - binding assay - cell proliferation assay - survival assessment assay - other @@ -317,7 +320,7 @@ $defs: description: type: string description: additional details about the model system. - identifiers: + codings: description: relevant ontology terms or identifiers for the model system. type: array items: From cc3e3b9283ae3f131e1d8c439af72b7033ef3d8d Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 20:57:40 -0400 Subject: [PATCH 37/43] update JSON representation --- schema/experiment.json | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/schema/experiment.json b/schema/experiment.json index c53b556..2c141e0 100644 --- a/schema/experiment.json +++ b/schema/experiment.json @@ -51,7 +51,11 @@ "label": { "description": "a human-readable description of the concept associated with the code", "type": "string" - } + }, + "required": [ + "system", + "code" + ] } }, "Dataset": { @@ -410,7 +414,7 @@ "type": "string", "description": "additional details about the model system." }, - "identifiers": { + "codings": { "description": "relevant ontology terms or identifiers for the model system.", "type": "array", "items": { From 5b4578160ffe182042b4dc91fd6590e3e6af2560 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 21:14:36 -0400 Subject: [PATCH 38/43] going green --- examples/Findlay_2018.yml | 2 +- examples/Matreyek_2018.yml | 2 +- examples/Seuma_2022.yml | 2 +- schema/experiment.json | 12 ++++++------ schema/experiment.yml | 6 +++--- 5 files changed, 12 insertions(+), 12 deletions(-) diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index 5ce298f..da3eceb 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -69,7 +69,7 @@ phenotypicAssay: modelSystem: type: immortalized human cells description: HAP1 - identifiers: + codings: - system: https://www.ncbi.nlm.nih.gov/taxonomy code: NCBI:txid9606 label: Homo sapiens diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index 89ebe90..0f93fe0 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -89,7 +89,7 @@ phenotypicAssay: modelSystem: type: immortalized human cells description: HEK 293T TetBxb1BFP - identifiers: + codings: - system: https://www.ebi.ac.uk/ols/ontologies/clo code: CLO:0037372 label: HEK293T cell diff --git a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index 1e7bc0b..1db0bd5 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -62,7 +62,7 @@ phenotypicAssay: type: yeast description: Saccharomyces cerevisiae [psi-pin-] (MATa ade1-14 his3 leu2-3,112 lys2 trp1 ura3-52) - identifiers: + codings: - system: https://www.ncbi.nlm.nih.gov/taxonomy code: NCBI:txid4932 label: Saccharomyces cerevisiae diff --git a/schema/experiment.json b/schema/experiment.json index 2c141e0..388ffba 100644 --- a/schema/experiment.json +++ b/schema/experiment.json @@ -51,12 +51,12 @@ "label": { "description": "a human-readable description of the concept associated with the code", "type": "string" - }, - "required": [ - "system", - "code" - ] - } + } + }, + "required": [ + "system", + "code" + ] }, "Dataset": { "type": "object", diff --git a/schema/experiment.yml b/schema/experiment.yml index 
5852395..4db8f2c 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -42,9 +42,9 @@ $defs: label: description: a human-readable description of the concept associated with the code type: string - required: - - system - - code + required: + - system + - code Dataset: type: object additionalProperties: false From 7624a76e2163c539b2ed15fd1d89ded5eaa831ad Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 21:29:22 -0400 Subject: [PATCH 39/43] complete phenotypic assay property descriptions --- README.md | 32 +++++++++++++------------------- examples/Findlay_2018.yml | 2 +- examples/Matreyek_2018.yml | 2 +- examples/Seuma_2022.yml | 2 +- schema/experiment.json | 4 ++-- schema/experiment.yml | 4 ++-- 6 files changed, 20 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index aa4a7d7..3708882 100644 --- a/README.md +++ b/README.md @@ -319,29 +319,23 @@ where applicable. Some commonly used cell lines and model systems are listed bel The `type` property is required. Additional detail about the model system may be provided with the `description` property. -**Environmental variables** – variance of environmental factors included in the experiment -(e.g. addition of specific compounds to cell media, temperature controls, time course, CRISPR interference by KRAB, -KRAB-MeCP2, CRISPR activation by VPR, SAM, or SunTag, etc.) +#### Profiling strategy -Controlled vocabulary terms (select one): +The variant profiling strategy used to capture variant frequency associated with outcome of phenotypic assay. The profiling +strategy must be one of the following controlled vocabulary terms: -- Yes - If yes, please describe this in detail in the free text methods describing your assay. -- No +- direct sequencing +- shotgun sequencing +- barcode sequencing -#### Variant sequencing characteristics +`profilingStrategy` is a required property.
-This section details the method for accurately capturing variant frequency associated with outcome of phenotypic assay. +#### Sequencing read type -**Library profiling strategy** – approach used to quantify variants in the population +The sequencing read type used in the assay. The read type must be one of the following controlled vocabulary terms: -Controlled vocabulary terms (select one): +- single-segment (short read) +- single-segment (long read) +- multi-segment -- Direct sequencing -- Shotgun sequencing -- Barcode sequencing - -Controlled vocabulary terms (select one): - -- Single segment (short read) -- Single segment (long read) -- Multi-segment +`sequencingReadType` is a required property. diff --git a/examples/Findlay_2018.yml b/examples/Findlay_2018.yml index da3eceb..b40d330 100644 --- a/examples/Findlay_2018.yml +++ b/examples/Findlay_2018.yml @@ -74,4 +74,4 @@ phenotypicAssay: code: NCBI:txid9606 label: Homo sapiens profilingStrategy: direct sequencing - sequencingMethod: multi-segment + sequencingReadType: multi-segment diff --git a/examples/Matreyek_2018.yml b/examples/Matreyek_2018.yml index 0f93fe0..2a49ac5 100644 --- a/examples/Matreyek_2018.yml +++ b/examples/Matreyek_2018.yml @@ -97,4 +97,4 @@ phenotypicAssay: code: NCBI:txid9606 label: Homo sapiens profilingStrategy: barcode sequencing - sequencingMethod: single-segment (short read) + sequencingReadType: single-segment (short read) diff --git a/examples/Seuma_2022.yml b/examples/Seuma_2022.yml index 1db0bd5..44be8dc 100644 --- a/examples/Seuma_2022.yml +++ b/examples/Seuma_2022.yml @@ -67,4 +67,4 @@ phenotypicAssay: code: NCBI:txid4932 label: Saccharomyces cerevisiae profilingStrategy: direct sequencing - sequencingMethod: single-segment (short read) \ No newline at end of file + sequencingReadType: single-segment (short read) \ No newline at end of file diff --git a/schema/experiment.json b/schema/experiment.json index 388ffba..64644c2 100644 --- a/schema/experiment.json +++ 
b/schema/experiment.json @@ -433,7 +433,7 @@ "barcode sequencing" ] }, - "sequencingMethod": { + "sequencingReadType": { "description": "the sequencing method used", "type": "string", "enum": [ @@ -449,7 +449,7 @@ "relevance", "modelSystem", "profilingStrategy", - "sequencingMethod", + "sequencingReadType", "replication" ] } diff --git a/schema/experiment.yml b/schema/experiment.yml index 4db8f2c..af1a8b4 100644 --- a/schema/experiment.yml +++ b/schema/experiment.yml @@ -333,7 +333,7 @@ $defs: - direct sequencing - shotgun sequencing - barcode sequencing - sequencingMethod: + sequencingReadType: description: the sequencing method used type: string enum: @@ -346,7 +346,7 @@ $defs: - relevance - modelSystem - profilingStrategy - - sequencingMethod + - sequencingReadType - replication type: object additionalProperties: false From b7f5e124285fe0f82d8dbd77d7723271cf866016 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 21:32:19 -0400 Subject: [PATCH 40/43] closes #5 --- concept_vocabulary.tsv | 65 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 concept_vocabulary.tsv diff --git a/concept_vocabulary.tsv b/concept_vocabulary.tsv new file mode 100644 index 0000000..f834675 --- /dev/null +++ b/concept_vocabulary.tsv @@ -0,0 +1,65 @@ +Domain Term exactMapping(s) Definition +LibraryDeliveryMethod adeno-associated virus transduction Library delivery using adeno-associated virus transduction +LibraryDeliveryMethod chemical or heat shock transformation Library delivery using chemical or heat shock transformation +LibraryDeliveryMethod chemical-based transfection Library delivery using chemical-based transfection +LibraryDeliveryMethod electroporation Library delivery using electroporation +LibraryDeliveryMethod lentivirus transduction Library delivery using lentivirus transduction +LibraryDeliveryMethod nucleofection Library delivery using nucleofection +LibraryGenerationMechanism base editor A base 
editor mechanism of CRISPR/Cas mediated variant library generation +LibraryGenerationMechanism prime editor A prime editor mechanism of CRISPR/Cas mediated variant library generation +LibraryGenerationMechanism nuclease A nuclease mechanism of CRISPR/Cas mediated variant library generation +LibraryGenerationSystem AsCas12a CRISPR/Cas mediated variant library generation by the AsCas12a system +LibraryGenerationSystem doped oligo synthesis Doped oligo synthesis mediated variant library generation +LibraryGenerationSystem error-prone PCR Error-prone Polymerase Chain Reaction (PCR) mediated variant library generation +LibraryGenerationSystem microarray synthesis Microarray synthesis mediated variant library generation +LibraryGenerationSystem nicking mutagenesis Nicking mutagenesis mediated variant library generation +LibraryGenerationSystem oligo pool synthesis Oligo pool synthesis mediated variant library generation +LibraryGenerationSystem oligo-directed mutagenic PCR Oligo-directed mutagenic Polymerase Chain Reaction (PCR) mediated variant library generation +LibraryGenerationSystem proprietary method A proprietary method for variant library generation +LibraryGenerationSystem RfsCas13d CRISPR/Cas mediated variant library generation by the RfsCas13d system +LibraryGenerationSystem SaCas9 CRISPR/Cas mediated variant library generation by the SaCas9 system +LibraryGenerationSystem site-directed mutagenesis Site-directed mutagenesis mediated variant library generation +LibraryGenerationSystem SpCas9 CRISPR/Cas mediated variant library generation by the SpCas9 system +LibraryIntegrationMechanism episomal delivery Library expression by episomal delivery +LibraryIntegrationMechanism extra-local construct insertion Library integration at a designated integration site, e.g. with Landing Pad +LibraryIntegrationMechanism native locus replacement Entire element replacement at the native locus (e.g. 
with integrases) +LibraryIntegrationMechanism plasmid (not integrated) Expression of gene products from a non-integrating plasmid +LibraryIntegrationMechanism random locus viral integration Integration of a virus into a random locus +LibraryIntegrationMechanism transfection of RNA Direct transfection of RNA +PhenotypicAssayDimensionality high-dimensional data Assay with inherent high-dimensional data +PhenotypicAssayDimensionality multiple functional readouts Assay with multiple functional readouts +PhenotypicAssayDimensionality single dimension Assay with a single dimensional readout +PhenotypicAssayMethod binding assay OBI:0001146 Phenotypic assay measuring binding (e.g. between two proteins) +PhenotypicAssayMethod bulk RNA-sequencing OBI:0003090 Phenotypic assay using bulk RNA-sequencing +PhenotypicAssayMethod cell proliferation assay OBI:0000891 Phenotypic assay measuring cell proliferation +PhenotypicAssayMethod systematic evolution of ligands by exponential enrichment assay OBI:0002161 Phenotypic assay measuring evolution of ligands by exponential enrichment +PhenotypicAssayMethod flow cytometry assay OBI:0000916 Phenotypic assay measuring fluorescence by flow cytometry +PhenotypicAssayMethod fluorescence in-situ hybridization (FISH) assay OBI:0003094 Phenotypic assay using fluorescence in-situ hybridization (FISH) +PhenotypicAssayMethod imaging mass cytometry assay OBI:0003096 Phenotypic assay using imaging mass cytometry +PhenotypicAssayMethod multiplexed fluorescent antibody imaging OBI:0003091 Phenotypic assay using multiplexed fluorescent antibody imaging +PhenotypicAssayMethod promoter activity detection by reporter gene assay OBI:0000913 Phenotypic assay measuring promoter activity using a reporter gene +PhenotypicAssayMethod single-cell imaging Phenotypic assay using single cell imaging +PhenotypicAssayMethod single-cell RNA sequencing assay OBI:0002631 Phenotypic assay using single-cell RNA sequencing +PhenotypicAssayMethod survival assessment assay
OBI:0000699 Phenotypic assay using a survival assessment assay +PhenotypicAssayModelSystem bacteria Model system of bacteria (E. coli) +PhenotypicAssayModelSystem bacteriophage Model system of bacteriophage +PhenotypicAssayModelSystem immortalized human cells Model system of immortalized human cells (H. sapiens) +PhenotypicAssayModelSystem induced pluripotent stem cells from human female Model system of induced pluripotent stem cells from human female +PhenotypicAssayModelSystem induced pluripotent stem cells from human male Model system of induced pluripotent stem cells from human male +PhenotypicAssayModelSystem molecular display Model system of molecular display +PhenotypicAssayModelSystem murine primary cells Model system of mouse primary cells (M. musculus) +PhenotypicAssayModelSystem patient derived primary cells (e.g. T-cells, adipocytes) Model system of patient derived primary cells (e.g. T-cells, adipocytes) +PhenotypicAssayModelSystem yeast Model system of yeast (S. cerevisiae) +PhenotypicAssayProfilingStrategy barcode sequencing Library profiling strategy of sequencing of a barcode associated with the variant library +PhenotypicAssayProfilingStrategy direct sequencing Library profiling strategy of direct sequencing of the target variant library +PhenotypicAssayProfilingStrategy shotgun sequencing Library profiling strategy of shotgun sequencing +PhenotypicAssaySequencingMethod multi-segment Library sequencing method of sequencing of multiple segments using short or long reads +PhenotypicAssaySequencingMethod single-segment (long read) Library sequencing method of sequencing of a single segment using long reads (e.g. Oxford Nanopore or PacBio) +PhenotypicAssaySequencingMethod single-segment (short read) Library sequencing method of sequencing of a single segment using short reads (e.g. 
Illumina) +VariantLibrary base editor functionality A base editor mechanism of a CRISPR/Cas variant library generation method +VariantLibrary prime editor functionality A prime editor mechanism of a CRISPR/Cas variant library generation method +VariantLibrary wildtype nuclease functionality A wildtype nuclease mechanism of a CRISPR/Cas variant library generation method +VariantLibraryScope coding The protein-coding sequence of a gene +VariantLibraryScope intronic Intronic sequence in between exons of a gene +VariantLibraryScope non-coding, other Non-coding sequence corresponding to non-regulatory elements +VariantLibraryScope non-coding, regulatory Non-coding sequence corresponding to regulatory elements (e.g. enhancers or promoters) \ No newline at end of file From 27cb3286c7e9906809c993d5c402cf45f813bf22 Mon Sep 17 00:00:00 2001 From: "Alex H. Wagner, PhD" Date: Wed, 11 Oct 2023 21:33:39 -0400 Subject: [PATCH 41/43] link to cv.tsv --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3708882..8621f6d 100644 --- a/README.md +++ b/README.md @@ -255,7 +255,7 @@ The `type` property is required. Additional detail about the replication method The assay method, defining the molecular properties interrogated by the experiment. Terms are derived from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) where appropriate. Term mappings to OBI concept identifiers -are available in the controlled vocabulary definitions file. The method is specified by the `type` property, which must be one of +are available in the [controlled vocabulary tsv](controlled_vocabulary.tsv). The method is specified by the `type` property, which must be one of the following controlled vocabulary terms: - promoter activity detection by reporter gene assay From c5b1643daef7b983fd560a7d399638bcfd3dc5a9 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 21:35:06 -0400 Subject: [PATCH 42/43] fix typo in tsv --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8621f6d..9d22949 100644 --- a/README.md +++ b/README.md @@ -255,7 +255,7 @@ The `type` property is required. Additional detail about the replication method The assay method, defining the molecular properties interrogated by the experiment. Terms are derived from OBI subtree with root [OBI_0000070: “assay”](http://purl.obolibrary.org/obo/OBI_0000070) where appropriate. Term mappings to OBI concept identifiers -are available in the [controlled vocabulary tsv](controlled_vocabulary.tsv). The method is specified by the `type` property, which must be one of +are available in the [concept vocabulary tsv](concept_vocabulary.tsv). The method is specified by the `type` property, which must be one of the following controlled vocabulary terms: - promoter activity detection by reporter gene assay From c7d3c399e19accce939928b98b5069f0f60ac850 Mon Sep 17 00:00:00 2001 From: "Alex H. 
Wagner, PhD" Date: Wed, 11 Oct 2023 21:46:01 -0400 Subject: [PATCH 43/43] align vocab with schema --- concept_vocabulary.tsv | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/concept_vocabulary.tsv b/concept_vocabulary.tsv index f834675..b01b6c2 100644 --- a/concept_vocabulary.tsv +++ b/concept_vocabulary.tsv @@ -27,8 +27,8 @@ LibraryIntegrationMechanism plasmid (not integrated) Expression of gene product LibraryIntegrationMechanism random locus viral integration Integration of a virus into a random locus LibraryIntegrationMechanism transfection of RNA Direct transfection of RNA PhenotypicAssayDimensionality high-dimensional data Assay with inherent high-dimensional data -PhenotypicAssayDimensionality multiple functional readouts Assay with multiple functional readouts -PhenotypicAssayDimensionality single dimension Assay with a single dimensional readout +PhenotypicAssayDimensionality combined functional data Assay with multiple, combined functional readouts +PhenotypicAssayDimensionality single-dimensional data Assay with a single-dimensional readout PhenotypicAssayMethod binding assay OBI:0001146 Phenotypic assay measuring binding (e.g. between two proteins) PhenotypicAssayMethod bulk RNA-sequencing OBI:0003090 Phenotypic assay using bulk RNA-sequencing PhenotypicAssayMethod cell proliferation assay OBI:0000891 Phenotypic assay measuring cell proliferation
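
The final patch aligns terms in `concept_vocabulary.tsv` with the enums in the schema. A check of that alignment might be sketched as below. This is a minimal illustration, not code from the repository: it assumes the TSV is tab-delimited with the four header columns shown in the patch, and the inline `SAMPLE_TSV` rows, the `SCHEMA_ENUM` subset, and the helper names `terms_by_domain` and `unmatched_terms` are hypothetical.

```python
import csv
import io

# A few rows copied from concept_vocabulary.tsv.
# Assumption: columns are tab-delimited, and an empty exactMapping(s) cell is blank.
SAMPLE_TSV = (
    "Domain\tTerm\texactMapping(s)\tDefinition\n"
    "PhenotypicAssayMethod\tbinding assay\tOBI:0001146\t"
    "Phenotypic assay measuring binding (e.g. between two proteins)\n"
    "PhenotypicAssayMethod\tflow cytometry assay\tOBI:0000916\t"
    "Phenotypic assay measuring fluorescence by flow cytometry\n"
    "PhenotypicAssayMethod\tsingle-cell imaging\t\t"
    "Phenotypic assay using single cell imaging\n"
)

# Subset of the phenotypic assay `type` enum as it appears in schema/experiment.yml.
SCHEMA_ENUM = {"binding assay", "flow cytometry assay", "single cell imaging"}


def terms_by_domain(tsv_text):
    """Group the Term column by the Domain column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = {}
    for row in reader:
        grouped.setdefault(row["Domain"], set()).add(row["Term"])
    return grouped


def unmatched_terms(vocab_terms, enum_terms):
    """Return (terms only in the vocabulary, terms only in the schema enum)."""
    return sorted(vocab_terms - enum_terms), sorted(enum_terms - vocab_terms)


vocab = terms_by_domain(SAMPLE_TSV)
only_in_vocab, only_in_schema = unmatched_terms(
    vocab["PhenotypicAssayMethod"], SCHEMA_ENUM
)
print("only in vocabulary:", only_in_vocab)
print("only in schema enum:", only_in_schema)
```

With these sample rows the check flags `single-cell imaging` in the vocabulary against `single cell imaging` in the schema enum, a spelling difference that is visible in the patches above.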