Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE

TRACE [1] supports you in analyzing the global and local quality 🕵🏽‍♀️ of two-dimensional embeddings, based on Regl-scatterplot [2].

Installation

OPTION 1: Using Docker 🐋

Make sure you have Docker Compose installed. Then build and start the containers for the backend and frontend.

docker-compose build
docker-compose up

This will mount the /frontend, /backend, and /data directories into the respective containers.

Open http://localhost:3000 with your browser to see the result.

OPTION 2: Without Docker

Required packages

Backend: Install the required Python packages from backend/pip_requirements.txt or backend/conda_requirements.yaml (tested with Python 3.11).

Frontend: Install the packages in frontend/package.json using e.g. npm install.

First, start the backend within the right Python environment:

conda activate backend_env/
python main.py
# or
python -m uvicorn main:app --reload

Then start the frontend development server:

npm run dev

Open http://localhost:3000 with your browser to see the result.

Data Preparation

The easiest way to load your data into TRACE is to use the Dataset class to add embeddings and compute quality measures. This creates the necessary AnnData structure under the hood. Examples can be found in the notebooks of each dataset folder.

trace_data = Dataset(
    hd_data=data,
    name="Gaussian Line",
    verbose=True,
    hd_metric="euclidean",
)

How is the AnnData object structured?

The TRACE backend can load data structured in the AnnData format. It includes the following fields:

  • adata.X: high-dimensional data

  • [optional] adata.obs: dataframe with metadata, e.g. cluster labels

  • adata.obsm: low-dimensional embeddings, one entry for each embedding, e.g. adata.obsm["t-SNE (exag. 5)"] for a t-SNE embedding

  • adata.uns: unstructured data, including:

    • adata.uns["methods"]: a dictionary that organizes all available embeddings into groups (exactly one level, with group names as keys and lists of embedding names as values, as in the example). This defines which embeddings can be selected in the interface. For example, one could group by DR method and list all corresponding two-dimensional embedding keys in adata.obsm:
      {
          "t-SNE": ["t-SNE (exag. 5)", "t-SNE (exag. 1)"],
          "UMAP": ["UMAP 20", "UMAP 100"]
      }
    • [optional] adata.uns["neighbors"]: an n × k array of the k nearest high-dimensional neighbors of each point
    • [optional] adata.uns["t-SNE (exag. 5)"]: dictionary with additional data for each embedding, such as quality scores or parameters used to obtain the embedding. For example:
      {
          "quality": {"qnx@50": [...], "qnx@200": [...]},
          "parameters": {"perplexity": 100, "exaggeration": 5, "epochs": 750}
      }
    • [optional] 🌈 You can add custom colors for metadata features by adding a list of HEX values to trace_data.adata.uns["featureName_colors"]. For categorical features, the number of colors should match the number of categories. The colors for continuous features will be mapped to the [min, max] range of the feature values.
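
Because the interface only offers embeddings that are listed in adata.uns["methods"], it can help to check that every listed name has a matching entry in adata.obsm before saving. The following is a minimal sketch of such a check and is not part of TRACE: the helper check_methods is hypothetical, and plain dictionaries and lists stand in for the AnnData fields so that the snippet runs without any dependencies.

```python
def check_methods(methods, obsm_keys):
    """Return a list of problems in an adata.uns["methods"]-style dictionary.

    Hypothetical helper for illustration; it is not part of TRACE.
    """
    problems = []
    known = set(obsm_keys)
    for group, names in methods.items():
        # Exactly one level of grouping: every value must be a list of names.
        if not isinstance(names, list):
            problems.append(f"group {group!r} must map to a list of embedding names")
            continue
        for name in names:
            if name not in known:
                problems.append(f"{name!r} (group {group!r}) has no entry in obsm")
    return problems


# Plain stand-ins for adata.uns["methods"] and the keys of adata.obsm.
methods = {
    "t-SNE": ["t-SNE (exag. 5)", "t-SNE (exag. 1)"],
    "UMAP": ["UMAP 20", "UMAP 100"],
}
obsm_keys = ["t-SNE (exag. 5)", "t-SNE (exag. 1)", "UMAP 20"]

# "UMAP 100" is listed in a group but was never added as an embedding.
for problem in check_methods(methods, obsm_keys):
    print(problem)
```

With a real object, the same check could presumably be called as check_methods(adata.uns["methods"], adata.obsm.keys()).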

1. Adding 2-dimensional embeddings

After preprocessing your data and computing a range of 2-dimensional embeddings using your favorite DR method, add the data and the embeddings to the data object:

# Repeat for each embedding
trace_data.add_embedding(
    name="tSNE (perplexity 30)",
    embedding=tsne_emb,
    category="tSNE",
)

2. Computing High-Dimensional Neighbors and Quality Measures

To provide snappy interactions in TRACE, the HD neighbors and a range of quality measures need to be precomputed. We use ANNOY to obtain the approximate neighbors and provide implementations of the following quality measures to be visualized via point colors in TRACE:

  • neighborhood preservation measures the fraction of the k nearest high-dimensional neighbors of each point that are preserved in the low-dimensional embedding.
  • landmark distance correlation: landmark points are sampled from the high-dimensional data, either randomly or with k-means++ (which supports only Euclidean distance). We then compute the pairwise distances between all landmarks in the high-dimensional space and in each embedding, and the rank correlation of their distance vectors. Points that are not landmarks are colored according to their nearest landmark in the embedding.
  • random triplet accuracy quantifies the fraction of random triplets (i, j, k) for which the relative order of j and k with respect to i in the high-dimensional space is preserved in the embedding.
  • point stability measures how much the distances between each point and a random sample of other points vary across all embeddings. If a point has a very different global or local position in the embeddings, the stability will be low.
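
To make two of these definitions concrete, here is a minimal pure-Python sketch of neighborhood preservation and random triplet accuracy. It is an illustration under simplifying assumptions, not the TRACE implementation: it uses exact brute-force Euclidean neighbors instead of ANNOY, and the function names and the seed parameter are hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """Brute-force k nearest neighbors of every point (the point itself excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(others[:k])
    return neighbors


def neighborhood_preservation(hd_points, ld_points, k):
    """Per point: fraction of its k HD neighbors that are also among its k LD neighbors."""
    hd_nn = knn_indices(hd_points, k)
    ld_nn = knn_indices(ld_points, k)
    return [len(set(h) & set(l)) / k for h, l in zip(hd_nn, ld_nn)]


def random_triplet_accuracy(hd_points, ld_points, num_triplets, seed=0):
    """Fraction of random triplets (i, j, k) whose distance order is kept in the embedding."""
    rng = random.Random(seed)
    preserved = 0
    for _ in range(num_triplets):
        i, j, k = rng.sample(range(len(hd_points)), 3)
        hd_j_closer = math.dist(hd_points[i], hd_points[j]) < math.dist(hd_points[i], hd_points[k])
        ld_j_closer = math.dist(ld_points[i], ld_points[j]) < math.dist(ld_points[i], ld_points[k])
        preserved += hd_j_closer == ld_j_closer
    return preserved / num_triplets


# Toy data: five 3D points on a line and two candidate 2D embeddings.
hd = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (10, 0, 0)]
ld_good = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]  # keeps all distances
ld_bad = [(0, 0), (10, 0), (1, 0), (3, 0), (2, 0)]   # shuffles the order

print(neighborhood_preservation(hd, ld_good, k=2))
print(neighborhood_preservation(hd, ld_bad, k=2))
print(random_triplet_accuracy(hd, ld_good, num_triplets=20))
```

The brute-force neighbor search is quadratic in the number of points, which is why precomputing approximate neighbors matters for larger datasets.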

To compute all available quality measures:

trace_data.compute_quality(filename="./gauss_line.h5ad", hd_metric="euclidean")
trace_data.print_quality()

How can I choose the parameters of the quality measures?

Instead of calling the compute_quality function, you can also call each function separately:

trace_data.precompute_HD_neighbors(maxK=200)
trace_data.compute_neighborhood_preservation(
    neighborhood_sizes=[200, 100, 50]
)
trace_data.compute_global_distance_correlation(
    max_landmarks=1000, LD_landmark_neighbors=True,
    hd_metric="euclidean", sampling_method="random",
)
trace_data.compute_random_triplet_accuracy(
    num_triplets=10
)
trace_data.compute_point_stability(num_samples=50)

# align the embeddings such that point movement is minimized
trace_data.align_embeddings(reference_embedding="PCA")
trace_data.save_adata(filename="./gauss_line.h5ad")
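
This README does not describe how align_embeddings works internally. One common way to reduce point movement between two embeddings is a Procrustes-style alignment, which is sketched below for the 2D case purely as an illustration. The sketch assumes translation and rotation only (no reflection or scaling), and the function name align_to_reference is hypothetical; TRACE may use a different procedure.

```python
import math


def align_to_reference(embedding, reference):
    """Translate and rotate a 2D embedding to best match a reference (least squares).

    Hypothetical helper for illustration; not TRACE's align_embeddings.
    """
    n = len(embedding)
    # Centroids of both point sets.
    ex = sum(x for x, _ in embedding) / n
    ey = sum(y for _, y in embedding) / n
    rx = sum(x for x, _ in reference) / n
    ry = sum(y for _, y in reference) / n
    # Center both point sets at the origin.
    emb_c = [(x - ex, y - ey) for x, y in embedding]
    ref_c = [(x - rx, y - ry) for x, y in reference]
    # The rotation angle that minimizes the summed squared distances
    # follows from the dot and cross products of corresponding points.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(emb_c, ref_c))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(emb_c, ref_c))
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate the centered embedding and move it onto the reference centroid.
    return [(cos_t * x - sin_t * y + rx, sin_t * x + cos_t * y + ry) for x, y in emb_c]


# Toy example: the embedding is the reference rotated by 90 degrees and shifted.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
embedding = [(5.0, 5.0), (5.0, 6.0), (3.0, 5.0)]
aligned = align_to_reference(embedding, reference)
print(aligned)
```

In this toy case the aligned points coincide with the reference up to floating-point error, because the two point sets differ only by a rotation and a shift.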

3. Add Dataset Configuration

To include a dataset in the dashboard, add its file path to the configuration in data_configs.yaml. For the Gaussian Line dataset this would be:

"GaussLine": {
    "filepath": "../data/gauss_line/gauss_line.h5ad",
    "description": "Gaussian clusters shifted along a line from Böhm et al. (2022)",
}

Example Datasets

Gaussian Line 🟢 🟠 🟣

A small example dataset that is included in the repository.

Mammoth 🦣

This dataset from Wang et al. can be downloaded from their PaCMAP repository. It then needs to be processed using the mammoth.ipynb notebook.

Single-Cell Mouse Data 🐁

The processed dataset of gene expressions from Guilliams et al. is not available online; please reach out if you are interested. A raw version is available under GSE192742.

Citation

TRACE was presented as a demo paper at ECML-PKDD 2024. If you find the tool useful and use it in your research, we'd appreciate it if you cited our paper:

@inproceedings{heiter2024pattern,
  title={Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE},
  author={Heiter, Edith and Martens, Liesbet and Seurinck, Ruth and Guilliams, Martin and De Bie, Tijl and Saeys, Yvan and Lijffijt, Jefrey},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={379--382},
  year={2024},
  organization={Springer}
}

[1] TRACE stands for Two-dimensional representation Analysis and Comparison Engine
[2] Lekschas, Fritz. "Regl-Scatterplot: A Scalable Interactive JavaScript-based Scatter Plot Library." Journal of Open Source Software (2023)
