Data Format
Kris Davie edited this page Jun 26, 2020 · 53 revisions
This wiki page describes the naming conventions of the data format. We follow the standard loom file conventions where possible, with some extensions.
We define M and N as the number of rows and the number of columns, respectively, in the expression matrix.
- title: a string label identifying the data contained in the loom file
- SCopeTreeL1: top-level category in SCope
- SCopeTreeL2: second-level category in SCope
- SCopeTreeL3: third-level category in SCope
- MetaData: a JSON string containing all the metadata related to annotations, embeddings, clusterings and markers, e.g.:
MetaData = {
  "annotations": [{
    "name": "",
    "values": []
  }],
  "metrics": [{
    "name": ""
  }],
  "embeddings": [{
    "id": -1,
    "name": "SCENIC 25PC, 60 perplexity"
  }, {
    "id": 0,
    "name": "Seurat 82PC, 30 perplexity"
  }, {
    "id": 1,
    "name": "DDRTree (monocle)",
    "trajectory": {
      "nodes": ["V_1", "..."],
      "edges": [{
        "source": "V_1",
        "target": "V_5"
      }, "..."],
      "coordinates": [{
        "x": 0,
        "y": 1
      }, "..."]
    }
  }],
  "clusterings": [{
    "id": 0,
    "group": "Seurat",
    "name": "Seurat resolution 2.0",
    "clusters": [{
      "id": 0,
      "description": "Mushroom Body (1)",
      "cell_type_annotation": [{
        "data": {
          "curator_name": "Kristofer Davie",
          "curator_id": "0000-0003-2182-1249",
          "timestamp": 1581609997591,
          "obo_id": "GO:0043005",
          "ols_iri": "http://purl.obolibrary.org/obo/GO_0043005",
          "annotation_label": "neuron projection",
          "markers": ["ey"],
          "publication": "https://doi.org/10.1016/j.cell.2018.05.057",
          "comment": "This cluster seems to be enriched in neuronal projection genes"
        },
        "validate_hash": "3070ff3f9913cda19183bc74e310c61e675540b9f52a6d4d1724919e744d363a",
        "votes": {
          "votes_for": {
            "total": 0,
            "voters": []
          },
          "votes_against": {
            "total": 1,
            "voters": [{
              "voter_name": "Kristofer Davie",
              "voter_id": "0000-0003-2182-1249",
              "voter_hash": "de4787fba14bb1f412bd33055f6ca155172bba34b4b88ef0662ad58b823a06de"
            }]
          }
        }
      }]
    }],
    "clusterMarkerMetrics": [{
      "accessor": "avg_logFC",
      "name": "Avg. logFC",
      "description": "Average log fold change from Wilcox test (Seurat)"
    }]
  }],
  "regulonThresholds": [{
    "regulon": "Abd-B_(17g)",
    "defaultThresholdValue": 0.00909688,
    "defaultThresholdName": "tenPercentOfMax",
    "allThresholds": {
      "tenPercentOfMax": 0.00909688
    },
    "motifData": "idmmpmm__Abd-B.png"
  }]
}
Where:
- annotations: a unique list of values for each metadata annotation, e.g. Age, Genotype
- metrics: a unique list of values for each metadata metric, e.g. n_umi, n_genes, ...
- embeddings: names and IDs for each embedding in Embeddings_X and Embeddings_Y.
- clusterings: descriptions and sources for each clustering. The ID matches Z in ClusterMarkers_Z.
- regulonThresholds: AUC thresholds for each regulon. Make sure each regulon name follows the convention [TF-name]_([number-of-genes-in-regulon]g).
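As a rough illustration of the fields above, the following sketch assembles a minimal MetaData dictionary, checks each regulon name against the `[TF-name]_([number-of-genes-in-regulon]g)` convention, and serialises the result to a JSON string. The helper names `parse_regulon_name` and `build_metadata` are hypothetical and not part of SCope; the example values are copied from the MetaData example above.

```python
import json
import re

# Convention from the text: [TF-name]_([number-of-genes-in-regulon]g),
# e.g. "Abd-B_(17g)". The exact regular expression is an assumption.
REGULON_NAME = re.compile(r"^(?P<tf>.+)_\((?P<n_genes>\d+)g\)$")


def parse_regulon_name(name):
    """Return (tf_name, n_genes); raise ValueError if the name breaks the convention."""
    match = REGULON_NAME.match(name)
    if match is None:
        raise ValueError(f"regulon name does not follow the convention: {name!r}")
    return match.group("tf"), int(match.group("n_genes"))


def build_metadata(embeddings, clusterings, regulon_thresholds):
    """Assemble the MetaData dictionary and serialise it to a JSON string."""
    for entry in regulon_thresholds:
        parse_regulon_name(entry["regulon"])  # fail early on malformed names
    metadata = {
        "annotations": [],
        "metrics": [],
        "embeddings": embeddings,
        "clusterings": clusterings,
        "regulonThresholds": regulon_thresholds,
    }
    return json.dumps(metadata)


metadata_json = build_metadata(
    embeddings=[{"id": -1, "name": "SCENIC 25PC, 60 perplexity"}],
    clusterings=[],
    regulon_thresholds=[{
        "regulon": "Abd-B_(17g)",
        "defaultThresholdValue": 0.00909688,
        "defaultThresholdName": "tenPercentOfMax",
        "allThresholds": {"tenPercentOfMax": 0.00909688},
        "motifData": "idmmpmm__Abd-B.png",
    }],
)
```

The resulting string is what would be stored in the MetaData attribute; validating regulon names before serialising catches naming mistakes at write time rather than when the file is loaded.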
- Gene: an M array of type string storing the gene symbols
- Regulons: an M-by-X matrix of type string, where X is the number of regulons inferred by SCENIC. It stores the genes present in each regulon.
- ClusterMarkers_Z: an M-by-X matrix of type string, where X is the number of clusters found across all the analyses and Z is the index of the corresponding Seurat clustering in the Clusterings column attribute.
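The link between a clustering ID and its `ClusterMarkers_Z` attribute can be sketched as follows. This is a minimal illustration that assumes only that Z equals the `id` field of the clustering in MetaData; the function name `cluster_marker_attributes` and the second clustering ("Seurat resolution 4.0") are made up for the example.

```python
import json


def cluster_marker_attributes(metadata_json):
    """Map each clustering name to the row attribute expected to hold its markers."""
    metadata = json.loads(metadata_json)
    return {
        clustering["name"]: f"ClusterMarkers_{clustering['id']}"
        for clustering in metadata.get("clusterings", [])
    }


# Toy MetaData string with two clusterings (the second one is hypothetical).
example = json.dumps({
    "clusterings": [
        {"id": 0, "group": "Seurat", "name": "Seurat resolution 2.0", "clusters": []},
        {"id": 1, "group": "Seurat", "name": "Seurat resolution 4.0", "clusters": []},
    ]
})
attribute_names = cluster_marker_attributes(example)
```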
- Embedding: an N-by-Y matrix of type float. It stores the coordinates of the cells in a Y-dimensional space. Currently only Y=2 is supported. Columns should be named '_X' and '_Y'.
- Embeddings_X: an N-by-Z matrix of type float. It stores the X coordinates of the Z extra embeddings.
- Embeddings_Y: an N-by-Z matrix of type float. It stores the Y coordinates of the Z extra embeddings.
- CellID: an N array of type string storing the cell IDs
- Clusterings: an N-by-Y matrix of type int. It stores the clusters from multiple analyses, such as Seurat runs with different resolution parameters.
- RegulonsAUC: an N-by-Y matrix of type float. It stores the AUC values computed by scoring Y regulons on the cells with AUCell.
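To show how the embedding attributes might fit together, here is a small sketch that looks up per-cell coordinates for a given embedding ID. It rests on two assumptions that the text does not state: that ID -1 refers to the main Embedding attribute, and that the extra embeddings are stored in columns named after their ID. Plain dictionaries stand in for the loom column attributes, and `embedding_coordinates` is a hypothetical helper.

```python
def embedding_coordinates(embedding_id, embedding, embeddings_x, embeddings_y):
    """Return a list of (x, y) pairs, one per cell, for the requested embedding.

    Assumptions (not stated in the format description): ID -1 refers to the
    main Embedding attribute, and extra embeddings live in columns named
    after their ID.
    """
    if embedding_id == -1:
        xs, ys = embedding["_X"], embedding["_Y"]
    else:
        key = str(embedding_id)
        xs, ys = embeddings_x[key], embeddings_y[key]
    if len(xs) != len(ys):
        raise ValueError("X and Y coordinates must both have one entry per cell")
    return list(zip(xs, ys))


# Toy data for N = 3 cells.
embedding = {"_X": [0.1, 0.2, 0.3], "_Y": [1.0, 1.1, 1.2]}
embeddings_x = {"0": [5.0, 6.0, 7.0]}
embeddings_y = {"0": [8.0, 9.0, 10.0]}

main_coords = embedding_coordinates(-1, embedding, embeddings_x, embeddings_y)
extra_coords = embedding_coordinates(0, embedding, embeddings_x, embeddings_y)
```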
- nUMI: an N array of type integer storing the number of UMIs per cell
- nGene: an N array of type integer storing the number of genes per cell
- Age: an N array of type integer storing the age of the fly from which the cell originates
- Replicate: an N array of type integer storing the replicate information of the cell
- Sex: an N array of type boolean storing the sex of the fly from which the cell originates
- Genotype: an N array of type string storing the genotype of the fly from which the cell originates
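The per-cell attributes above relate to the `annotations` entries in MetaData, which hold a unique list of values per annotation. One possible way to derive those entries is sketched below; the helper name `annotation_entries` and the toy values are invented for illustration, and plain lists stand in for the loom column attributes.

```python
def annotation_entries(column_attributes, names):
    """Build MetaData "annotations" entries from per-cell attribute values."""
    entries = []
    for name in names:
        # sorted(set(...)) gives a deterministic, duplicate-free list of values.
        values = sorted(set(column_attributes[name]))
        entries.append({"name": name, "values": values})
    return entries


# Toy per-cell values for N = 5 cells (made up for this example).
cells = {
    "Age": [0, 0, 3, 9, 3],
    "Genotype": ["genotype_a", "genotype_b", "genotype_b", "genotype_a", "genotype_b"],
}
annotations = annotation_entries(cells, ["Age", "Genotype"])
```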
- Subclusters: a JSON string containing information about subclusters, including embeddings (very inefficient for now)
Subclusters = {
  "clustering_0": {
    "embedding": {
      "cluster_0": {
        "x": [],
        "y": []
      },
      "cluster_1": {
        "x": [],
        "y": []
      }
    },
    "markers": {
      "cluster_0": {
        "subcluster_0": [],
        "subcluster_1": [],
        "subcluster_2": []
      }
    }
  }
}
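A minimal sanity check for the Subclusters string could look like the sketch below. It assumes, without confirmation from the text, that the `x` and `y` lists of a cluster hold one coordinate per cell and must therefore have equal length; `validate_subclusters` is a hypothetical helper and the coordinates are toy values.

```python
import json


def validate_subclusters(subclusters_json):
    """Parse the Subclusters string and check x/y lengths for every cluster."""
    subclusters = json.loads(subclusters_json)
    for clustering_name, clustering in subclusters.items():
        for cluster_name, coords in clustering.get("embedding", {}).items():
            if len(coords["x"]) != len(coords["y"]):
                raise ValueError(
                    f"{clustering_name}/{cluster_name}: x and y differ in length"
                )
    return subclusters


# Toy Subclusters string following the layout shown above.
subclusters_json = json.dumps({
    "clustering_0": {
        "embedding": {
            "cluster_0": {"x": [0.5, 1.5], "y": [2.0, 3.0]},
        },
        "markers": {
            "cluster_0": {"subcluster_0": [], "subcluster_1": []},
        },
    }
})
parsed = validate_subclusters(subclusters_json)
```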
- GeneSets: an M-by-X matrix of type string, where X is the number of external gene sets.