Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSS enrichment with a region extension 2000bp #1638

Open
Montsecb opened this issue Feb 26, 2024 · 4 comments
Open

TSS enrichment with a region extension 2000bp #1638

Montsecb opened this issue Feb 26, 2024 · 4 comments

Comments

@Montsecb
Copy link

Hi there,
I was trying to calculate the TSS enrichment score using signac version 1.11 by adding an extension up to 2000 bp from TSSs. I see that the results produced have a high enrichment value (up to 150 in some cells). I would like to know if these values are normal or if they are too high, as I usually see enrichment values ranging up to 60 or 80.

I'm leaving the code and the results here using 1000 and 2000 bp.

seu <- TSSEnrichment(object = seu, fast=F, region_extension = 2000)

Here are the results using a region extension of 1000 bp(left) and 2000 bp (right)
Captura de Pantalla 2024-02-26 a las 9 41 56

@timoast
Copy link
Collaborator

timoast commented Jun 13, 2024

Hi @Montsecb, this is indeed not expected, thanks for reporting the issue. There may be small changes in TSS enrichment score depending on the size of the flanking region used but certainly not to this extent. I will try to look into this, for now I'd suggest sticking with the region_extension = 1000 which is the default and seems to give more expected distribution

@nigiord
Copy link

nigiord commented Jul 24, 2024

I looked at the code of Signac::TSSEnrichment() while investigating drastic changes in values compared to ArchR, and I’m rather surprised by what I found. It seems like the documentation of the function is wrong.

#' Compute TSS enrichment score per cell
#'
#' Compute the transcription start site (TSS) enrichment score for each cell,
#' as defined by ENCODE:
#' \url{https://www.encodeproject.org/data-standards/terms/}.
#'
#' The computed score will be added to the object metadata as "TSS.enrichment".
#'

If we look at the link provided, we have the following definition for TSS enrichment:

Transcription Start Site (TSS) Enrichment Score - The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 2000 bp in either direction (for a total of 4000bp). This distribution is then normalized by taking the average read depth in the 100 bps at each of the end flanks of the distribution (for a total of 200bp of averaged data) and calculating a fold change at each position over that average read depth. This means that the flanks should start at 1, and if there is high read signal at transcription start sites (highly open regions of the genome) there should be an increase in signal up to a peak in the middle. We take the signal value at the center of the distribution after this normalization as our TSS enrichment metric. Used to evaluate ATAC-seq.

But if we look at the actual code of the function:

  • the default region_extension is actually 1000 pb (for a total of 2000 bp), so not in line with ENCODE definition
  • for the "signal value at the center of the distribution", Signac takes a 1001-bp region
    # Take signal value at center of distribution after normalization as
    # TSS enrichment score, average the 1001 bases at the center
    center_region <- seq.int(from = (region_extension - 500), to = (region_extension + 500))
    object$TSS.enrichment <- rowMeans(x = norm.matrix[, center_region], na.rm = TRUE)
    This is HUGE considering you use by default a total length of 2000 bp (so 50% of the distribution span)

As a comparison, ArchR is more in line with ENCODE definition and uses by default a region extension of 2000 (total length 4000), 100 bp at each flank for background, and signal window of 101 bp centered at the TSS. Furthermore they provide a way to feed those 3 values as parameter to the function and to arbitrary exclude extra chromosomes (not just mitochondrial) which is convenient to compare with other tools.

I would suggest to either fix TSSEnrichment to comply with ENCODE definition, or provides a better description of what is computed. Note that the former would drastically change TSSEnrichment computed in users’ previous analyses (the QC thresholds would have to be adapted) but since it currently does not follow any standard in ATACseq I think that would be the right course. It would also be nice if we could have the following extra parameters:

  • center_window (default 1001 if we keep what is currently used), window length to compute signal value at the center of the distribution
  • flank_window (default 100), window length to compute signal value at both flank of the distribution
  • exclude_chr (default c("chrM", "Mt", "MT")), chromosomes that will be excluded from the computation

I could have a look if you are open to pull requests.

Cheers,
−Nils

@timoast
Copy link
Collaborator

timoast commented Jul 26, 2024

@nigiord when this function was implemented in Signac ENCODE did not actually say what size region was being used for the central and flanking part. It looks like they've since added more documentation for this, and we are now doing something that isn't in line with the ENCODE terminology.

I agree we should change this to be consistent with ENCODE and other ATAC analysis packages. The current values are not wrong, but it is obviously preferable if the definition of TSS enrichment score can be consistent across different tools. This would change the TSS scores being returned, but as long as this change is clearly documented in the release notes I'm ok with that.

It would also be nice if we could have the following extra parameters:
center_window (default 1001 if we keep what is currently used), window length to compute signal value at the center of the distribution
flank_window (default 100), window length to compute signal value at both flank of the distribution
exclude_chr (default c("chrM", "Mt", "MT")), chromosomes that will be excluded from the computation

Agreed these would be nice to have, I'm certainly open to a PR if you're interested in contributing. This would also enable users to get the current TSS scores if we were to update the defaults to be in line with ENCODE

@nigiord
Copy link

nigiord commented Jul 26, 2024

Hi @timoast, thank you for your answer! It seems simple enough since the code is rather neat and well commented so I’ll try to suggest a PR as soon as possible.

Regarding @Montsecb issue with very high TSS values when extending the region to -2000:+2000, the code that implements region_extension looks fine to me, so I think we lack information to properly understand what’s actually happening. Could you generate the TSSPlot() of your object and overlay the windows used by Signac in both cases? The following code should do the trick:

TSSPlot(seuratobj) +
  ggplot2::geom_vline(
    data = data.frame(
      x = c(-500, 500, -901, 901, -1901, 1901),
      label = c(
        "center window", "center window",
        "flank default", "flank default",
        "flank 2000bp", "flank 2000bp"
      )
    ),
    mapping = aes(xintercept = x, color = label),
    linetype = "dashed",
  ) +
  theme(legend.position = "right")

I sometimes have very high TSS value but it’s always with barcodes that fail other QC metrics though (usually number of fragments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants