Data Format

Kris Davie edited this page Jun 26, 2020 · 53 revisions

This wiki describes the naming conventions of the data format. We follow the standard conventions for loom files where possible, with some extensions.

We define M and N as the number of rows and the number of columns respectively in the expression matrix.

SCope Loom Standards

Global attributes

  • title a string label identifying the data contained in the loom file
  • SCopeTreeL1 top level category in SCope
  • SCopeTreeL2 second level category in SCope
  • SCopeTreeL3 third level category in SCope
  • MetaData a JSON string containing all the metadata related to annotations, metrics, embeddings, clusterings and regulon thresholds, e.g.:
MetaData = {
    "annotations": [{
        "name": "",
        "values": []
    }],
    "metrics": [{
        "name":""
    }],
    "embeddings": [{
        "id": -1,
        "name": "SCENIC 25PC, 60 perplexity"
    }, {
        "id": 0,
        "name": "Seurat 82PC, 30 perplexity"
    }, {
        "id": 1,
        "name": "DDRTree (monocle)",
        "trajectory": {
           "nodes": ["V_1", "..."],
           "edges": [{
              "source": "V_1", 
              "target": "V_5"
           }, "..."],
           "coordinates": [{
              "x": 0, 
              "y": 1
           }, "..."]
        }
    }],
    "clusterings": [{
        "id": 0,
        "group": "Seurat",
        "name": "Seurat resolution 2.0",
        "clusters": [{
            "id": 0,
            "description": "Mushroom Body (1)",
            "cell_type_annotation": [{
                "data": {
                    "curator_name": "Kristofer Davie",
                    "curator_id": "0000-0003-2182-1249",
                    "timestamp": 1581609997591,
                    "obo_id": "GO:0043005",
                    "ols_iri": "http://purl.obolibrary.org/obo/GO_0043005",
                    "annotation_label": "neuron projection",
                    "markers": ["ey"],
                    "publication": "https://doi.org/10.1016/j.cell.2018.05.057",
                    "comment": "This cluster seems to be enriched in neuronal projection genes"
                },
                "validate_hash": "3070ff3f9913cda19183bc74e310c61e675540b9f52a6d4d1724919e744d363a",
                "votes": {
                    "votes_for": {
                        "total": 0,
                        "voters": []
                    },
                    "votes_against": {
                        "total": 1,
                        "voters": [{
                            "voter_name": "Kristofer Davie",
                            "voter_id": "0000-0003-2182-1249",
                            "voter_hash": "de4787fba14bb1f412bd33055f6ca155172bba34b4b88ef0662ad58b823a06de"
                        }]
                    }
                }
            }]
        }],
        "clusterMarkerMetrics": [{
           "accessor": "avg_logFC", 
           "name": "Avg. logFC",
           "description": "Average log fold change from Wilcox test (Seurat)"  
        }]
    }],
    "regulonThresholds": [{
        "regulon": "Abd-B_(17g)",
        "defaultThresholdValue": 0.00909688,
        "defaultThresholdName": "tenPercentOfMax",
        "allThresholds": {
              "tenPercentOfMax": 0.00909688
        },
        "motifData": "idmmpmm__Abd-B.png"
      }]
}

Where:

  • annotations a unique list of values for each metadata annotation, e.g. Age, Genotype
  • metrics a unique list of values for each metadata metric, e.g. n_umi, n_genes, ...
  • embeddings names and IDs for each embedding in Embeddings_X and Embeddings_Y.
  • clusterings descriptions and sources of each clustering. The id matches Z in ClusterMarkers_Z.
  • regulonThresholds AUC thresholds for each regulon. Regulon names must follow the convention [TF-name]_([number-of-genes-in-regulon]g), e.g. Abd-B_(17g).
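
As a rough illustration of the fields above, the following Python sketch assembles a minimal MetaData dictionary, checks regulon names against the [TF-name]_([number-of-genes-in-regulon]g) convention, and serializes the result to a JSON string. The helper names (REGULON_NAME, is_valid_regulon_name, build_metadata), the regular expression and the annotation values are assumptions made for this example; only the key names come from the example above. Writing the string into a loom file is left out.

```python
import json
import re

# Assumed pattern for "[TF-name]_([number-of-genes-in-regulon]g)", e.g. "Abd-B_(17g)".
REGULON_NAME = re.compile(r"^.+_\(\d+g\)$")


def is_valid_regulon_name(name):
    """Return True if the name looks like '<TF-name>_(<n>g)'."""
    return REGULON_NAME.match(name) is not None


def build_metadata(annotations, metrics, embeddings, clusterings, regulon_thresholds):
    """Assemble the MetaData dictionary; raise ValueError on a bad regulon name."""
    for threshold in regulon_thresholds:
        if not is_valid_regulon_name(threshold["regulon"]):
            raise ValueError("unexpected regulon name: " + threshold["regulon"])
    return {
        "annotations": annotations,
        "metrics": metrics,
        "embeddings": embeddings,
        "clusterings": clusterings,
        "regulonThresholds": regulon_thresholds,
    }


# Illustrative values only; the annotation values "A" and "B" are placeholders.
metadata = build_metadata(
    annotations=[{"name": "Genotype", "values": ["A", "B"]}],
    metrics=[{"name": "n_umi"}],
    embeddings=[{"id": -1, "name": "SCENIC 25PC, 60 perplexity"}],
    clusterings=[],
    regulon_thresholds=[{
        "regulon": "Abd-B_(17g)",
        "defaultThresholdValue": 0.00909688,
        "defaultThresholdName": "tenPercentOfMax",
        "allThresholds": {"tenPercentOfMax": 0.00909688},
        "motifData": "idmmpmm__Abd-B.png",
    }],
)

# The JSON string that would be stored in the MetaData global attribute.
metadata_json = json.dumps(metadata)
print(metadata_json)
```

Going through json.dumps rather than writing the string by hand avoids syntax slips such as trailing commas, which strict JSON parsers reject.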

Row attributes

  • Gene an M array of type string storing the symbols of the genes
  • Regulons an M-by-X matrix of type string where X is the number of regulons inferred by SCENIC. This stores the genes present in each of the regulons.
  • ClusterMarkers_Z an M-by-X matrix of type string where X is the number of clusters found in the analysis and Z is the index of the corresponding clustering (e.g. a Seurat clustering) in the Clusterings column attribute.
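
The link between clusterings and marker attributes can be sketched as follows: for each clustering id listed in MetaData, the matching row attribute is expected to be named ClusterMarkers_<id>. The function name and the sample input (including the second clustering, "Seurat resolution 4.0") are hypothetical and serve only as illustration.

```python
import json


def cluster_marker_attribute_names(metadata_json):
    """Map each clustering id in MetaData to its expected ClusterMarkers_Z name."""
    metadata = json.loads(metadata_json)
    return {
        clustering["id"]: "ClusterMarkers_{}".format(clustering["id"])
        for clustering in metadata.get("clusterings", [])
    }


# Hypothetical MetaData string with two clusterings.
example = json.dumps({
    "clusterings": [
        {"id": 0, "group": "Seurat", "name": "Seurat resolution 2.0", "clusters": []},
        {"id": 1, "group": "Seurat", "name": "Seurat resolution 4.0", "clusters": []},
    ]
})
names = cluster_marker_attribute_names(example)
print(names)
```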

Column attributes

  • Embedding an N-by-Y matrix of type float. This stores the coordinates of the cells in a Y-dimensional space. Currently only Y=2 is supported. Columns should be named '_X' and '_Y'.
  • Embeddings_X an N-by-Z matrix of type float. Used for storing the X coordinates of the Z extra embeddings.
  • Embeddings_Y an N-by-Z matrix of type float. Used for storing the Y coordinates of the Z extra embeddings.
  • CellID an N array of type string storing the cell IDs
  • Clusterings an N-by-Y matrix of type int. This stores the clusters from multiple analyses, such as Seurat runs with different resolution parameters.
  • RegulonsAUC an N-by-Y matrix of type float. This stores the AUC values computed by scoring Y regulons on the cells with AUCell.
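
To make the split between Embedding and Embeddings_X / Embeddings_Y more concrete, here is a minimal Python sketch that looks up the coordinates of one embedding. It uses plain dictionaries as stand-ins for the column attributes and rests on two assumptions not stated on this page: that id -1 refers to the main Embedding attribute, and that the extra embeddings are stored in columns keyed by the embedding id as a string. The function name and the toy data are hypothetical.

```python
def embedding_coordinates(column_attrs, embedding_id):
    """Return (x, y) coordinate lists of length N for one embedding."""
    if embedding_id == -1:
        # Assumption: id -1 refers to the main 'Embedding' attribute.
        main = column_attrs["Embedding"]
        return main["_X"], main["_Y"]
    # Assumption: extra embeddings are keyed by their id as a string.
    key = str(embedding_id)
    return column_attrs["Embeddings_X"][key], column_attrs["Embeddings_Y"][key]


# Toy data for N = 3 cells; all values are made up.
column_attrs = {
    "CellID": ["cell_1", "cell_2", "cell_3"],
    "Embedding": {"_X": [0.0, 1.5, -2.0], "_Y": [1.0, 0.5, 3.0]},
    "Embeddings_X": {"0": [0.1, 0.2, 0.3]},
    "Embeddings_Y": {"0": [0.4, 0.5, 0.6]},
}

x, y = embedding_coordinates(column_attrs, 0)
for cell_id, cx, cy in zip(column_attrs["CellID"], x, y):
    print(cell_id, cx, cy)
```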

Currently Not Standardized

  • nUMI an N array of type integer storing the number of UMIs per cell
  • nGene an N array of type integer storing the number of genes per cell
  • Age an N array of type integer storing the age of the fly from which the cell originates
  • Replicate an N array of type integer storing the replicate information of the cell
  • Sex an N array of type boolean storing the sex of the fly from which the cell originates
  • Genotype an N array of type string storing the genotype of the fly from which the cell originates

Deprecated

Global attribute

  • Subclusters a JSON string containing information about subclusters including embeddings - Very inefficient for now
Subclusters = {
  "clustering_0" : {
    "embedding": {
      "cluster_0": {
        "x": [],
        "y": []
      },
      "cluster_1": {
        "x": [],
        "y": []
      }  
    },
    "markers": {
      "cluster_0": {
        "subcluster_0": [],
        "subcluster_1": [],
        "subcluster_2": []
      }
    }
  }
}

Row Attribute

  • GeneSets an M-by-X matrix of type string where X is the number of external gene sets.