From 2c210825242ba4c3c591db57836aae0c9f53a7ae Mon Sep 17 00:00:00 2001
From: Eduardo Reyes <62916582+EdRey05@users.noreply.github.com>
Date: Mon, 15 Jan 2024 16:06:43 -0500
Subject: [PATCH] Updated README.md

---
 README.md | 71 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 48 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 06f6cbe..b7476b1 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Streamlit app (see GIF -->) to automate the creation of these plots with Python.

At the end of the project, we were able to identify less than 10 gene pairs showing the behavior of interest. That information was used in combination with other data from different techniques (in silico and in vitro) to prioritize - further studies evaluating the effect of inhibtion of those genes in cancer cell models. + further studies evaluating the effect of inhibition of those genes in cancer cell models.

@@ -23,7 +23,6 @@

To see in full screen, right click on image and select "Open in new tab"

-

Problem

@@ -53,11 +52,11 @@
  • I first generated a Google Colab notebook that was dataset-specific to produce batches of 40-50 plots. This exclusively makes 4 groups from the original dataset based on the expression of RET and one other gene, which required to manually write in the code all 40-50 names of the - other gene (View tool).
  • + other gene (View tool).
  • Then, I found a way to generalize some steps and created a Jupyter notebook that used ipywidgets to interactively get user inputs, allowing dynamic selection of any measured variable to divide the dataset into 2 or more groups and - re-plotting curves easily (View tool).
  • + re-plotting curves easily (View tool).
  • Finally, I discovered Streamlit and adapted my interactive notebook to a data app (GIF above) that used a similar approach but has more interactivy, improved outputs and better user experience.
  • Although the app works well for several datasets, I noticed high variability in the formatting of clinical @@ -65,7 +64,7 @@
  • -

    NOTE: I am not planning on deploying my app to any server, it runs locally and I also set it up in Github Codespaces to share with others (see last section).

    +

    NOTE: I am not planning on deploying my app to a hosted server (for now); it runs locally or in GitHub Codespaces (see last section).

    Read the instructions and watch another demo of the Streamlit app here: Demo_KM_plotter
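
The four-group split described above (RET plus one other gene) can be sketched in a few lines of Python. This is only an illustration: the median cutoff, the `split_into_four_groups` helper, the `GENE_X` name, and the sample values are all assumptions made for the example, not the notebook's actual code.

```python
from statistics import median


def split_into_four_groups(samples, gene_a, gene_b):
    """Assign each sample to one of four groups by high/low expression of two genes.

    `samples` is a list of dicts mapping gene name -> expression value.
    The median cutoff is an assumption made for this sketch.
    """
    cut_a = median(s[gene_a] for s in samples)
    cut_b = median(s[gene_b] for s in samples)
    groups = {"high/high": [], "high/low": [], "low/high": [], "low/low": []}
    for s in samples:
        level_a = "high" if s[gene_a] > cut_a else "low"
        level_b = "high" if s[gene_b] > cut_b else "low"
        groups[f"{level_a}/{level_b}"].append(s)
    return groups


# Made-up expression values, for illustration only.
samples = [
    {"id": "P1", "RET": 9.0, "GENE_X": 1.0},
    {"id": "P2", "RET": 8.0, "GENE_X": 7.0},
    {"id": "P3", "RET": 2.0, "GENE_X": 6.0},
    {"id": "P4", "RET": 1.0, "GENE_X": 2.0},
]
groups = split_into_four_groups(samples, "RET", "GENE_X")
```

Each of the four lists could then be passed to a survival-curve fitter to draw one curve per group.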

    @@ -130,7 +129,7 @@ to the root directory for that experimental group to find the pairs of images to insert.
  • A big iterable is generated with names, counts, and image locations which are analyzed to separate in groups of up to 20 for a single slide (see app info).
  • -
  • I implemented this approach first in a Google Colab notebook (View tool) and then created a Streamlit app (GIF above). The app has the same functionality +
  • I implemented this approach first in a Google Colab notebook (View tool) and then created a Streamlit app (GIF above). The app has the same functionality but better user experience, especially to read additional info on the input/output and the design of the slides.
  • The app allows quick and easy automation, as the user only needs to upload a zip file with as @@ -139,19 +138,33 @@ -

    NOTE: I am not planning on deploying my app to any server, it runs locally and I also set it up in Github Codespaces to share with others (see last section).

    +

    NOTE: I am not planning on deploying my app to a hosted server (for now); it runs locally or in GitHub Codespaces (see last section).

    Read the instructions and watch another demo of the Streamlit app here: Demo_PPTX_PLA
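
Separating the collected images into groups of up to 20 per slide, as described above, could look roughly like this. The `chunk` helper and the file names are hypothetical, and treating one image pair as one item is an assumption of the sketch.

```python
def chunk(items, size=20):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical file names standing in for the image pairs found in the zip file.
image_pairs = [(f"cell_{n}_a.png", f"cell_{n}_b.png") for n in range(45)]
slides = chunk(image_pairs)
```

With 45 items this yields three groups, the last one only partially filled, so each group maps onto a single slide.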


    -

    001 - Extract RNA expression data from CCLE/DepMap

    +

    001 - Interactive extraction of RNA Seq data from CCLE/DepMap

    Expand this to read more...

    Summary

    -

    Some summary here. +

    During my graduate studies, I came across the Cancer Cell Line + Encyclopedia, which is a project containing information on over 1,800 cell models, including RNA + Seq gene expression data. Although my research group does not specialize in bioinformatics, we used the CCLE dataset at times to + cross-validate some observations, potential gene candidates, or to look for cell model options we didn't have in-house.

    + I downloaded the 2019 dataset and noticed it was very large (around 1,800 by 20,000), and we only used a few cell + lines in our analyses. Moreover, we wanted different combinations of these cell lines per file, so I created a basic tool as a + Google Colab notebook (View tool) + to search and retrieve only the cell lines of interest.

    However, I noticed that the dataset was merged with the Achilles project to make the DepMap project, + which added a few more cell lines and several more datasets from other types of genomics, proteomics, and metabolomics assays. They also + reshaped datasets, reassigned IDs to make all datasets consistent, etc. Once I noticed the new version, I adapted my tool to work for the + new dataset (at that time, 23Q2), and generated a similar notebook.

    Finally, when I discovered Streamlit, the first data app I built attempted to replicate my tool for DepMap. I soon + discovered how easy it was to add widgets and interactive plots that would allow the user not only to extract the data, but also to automate + basic data exploration and visualization of the cell lines and gene expression in a very user-friendly manner.

    @@ -164,27 +177,39 @@

    Problem

      -
    • A.
    • -
    • B.
    • -
    • C.
    • -
    • D.
    • -
    • E.
    • -
    • F.
    • +
    • The RNA Seq dataset is very large and it no longer has cell line names, as they were changed to Achilles IDs which are + encoded in another file.
    • +
    • We needed to pre-process both datasets before mapping the IDs, but asking the user to get the required files from the + website was confusing and led to errors as the datasets change 2-4 times a year.
    • +
    • The notebook tool required the user to have the required files already stored in a specific Google Drive folder + (or to have access to a Google account that had them).
    • +
    • The notebook tool was only able to search based on cell line name, but sometimes we needed just to explore + what models are available for some tissues.
    • +
    • The notebook tool only provided a simple view of the search results showing the cell line name followed + by tissue, no more information.
    • +
    • While the notebook tool provided some degree of automation, it was not easy to de-select cell lines and only gave + the raw data for the user to plot or analyze.

    Solution

      -
    • A.
    • -
    • B.
    • -
    • C.
    • -
    • D.
    • -
    • E.
    • -
    • F.
    • +
    • I set the Streamlit app to automatically download the required files for the current release at the time + (23Q2). It takes a minute or two, but the user does not need a Google account or to upload anything + to be able to use the app.
    • +
    • The pre-processing is tailored to that specific data release and caches the prepared dataframe to improve efficiency.
    • +
    • I added a second search mode, so the user can search names of cell lines (or parts of them), and also search + by tissue type.
    • +
    • The app displays more interactive search results, allowing the user to check boxes of cell lines to keep (instead of entering numbers), + and I provide the Achilles ID, clean cell line name, tissue type and cancer type.
    • +
    • The CSV output is the same as the notebook tool's; however, the app has several widgets to preview the selected data.
    • +
    • Although it is not perfect, the preview area shows the generated dataset and lets the user easily type + in genes of interest to make a bar chart or a heatmap. These visualizations are interactive (plotly) and the user can + take snapshots if needed.
    -

    NOTE: I am not planning on deploying my app to any server, it runs locally and I also set it up in Github Codespaces to share with others (see last section).

    +

    NOTE: I am not planning on deploying my app to a hosted server (for now); it runs locally or in GitHub Codespaces (see last section).

    Read the instructions and watch another demo of the Streamlit app here: Demo_RNA_DepMap
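
The two search modes and the Achilles ID mapping mentioned in the Problem and Solution lists above can be illustrated with a small sketch. It uses plain Python structures rather than the dataframes the app works with, and every name in it (`search_cell_lines`, the record fields, the IDs, the cell lines, the values) is invented for the example.

```python
def search_cell_lines(records, query, mode="name"):
    """Return records whose name (or tissue) contains `query`, ignoring case."""
    if mode not in ("name", "tissue"):
        raise ValueError("mode must be 'name' or 'tissue'")
    needle = query.strip().lower()
    return [r for r in records if needle in r[mode].lower()]


# Made-up metadata standing in for the file that maps Achilles IDs to names.
records = [
    {"achilles_id": "ACH-X00001", "name": "LINE-A1", "tissue": "Lung"},
    {"achilles_id": "ACH-X00002", "name": "LINE-A2", "tissue": "Breast"},
    {"achilles_id": "ACH-X00003", "name": "LINE-B1", "tissue": "Lung"},
]

# Made-up expression values keyed by Achilles ID, as in the newer datasets.
expression = {
    "ACH-X00001": {"RET": 3.2},
    "ACH-X00002": {"RET": 0.4},
    "ACH-X00003": {"RET": 1.7},
}

by_name = search_cell_lines(records, "line-a")            # partial-name search
by_tissue = search_cell_lines(records, "lung", mode="tissue")

# Map the selected Achilles IDs back to readable cell line names.
id_to_name = {r["achilles_id"]: r["name"] for r in records}
selected = {
    id_to_name[r["achilles_id"]]: expression[r["achilles_id"]] for r in by_tissue
}
```

Keeping the ID-to-name lookup separate from the search step means either search mode can feed the same extraction code.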

    @@ -199,7 +224,7 @@

    If you have a Github account, you can create a Github Codespace with all the requirements to run my apps. You only have to log into you account, click on the button below, create your Codespace (we all have 60h of free usage per month!), and follow the instructions in this video→.

    - ***Note that due to size limits, I did everything quickly but added notes so pause, read and see where I clicked!

    + ***Due to size limits, I did everything in the video quickly but added notes, so pause, read, and see where I clicked!