Skip to content

tdhock/PeakSegOptimal

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PeakSegOptimal: Optimal Segmentation Subject to Up-Down Constraints

https://travis-ci.org/tdhock/PeakSegOptimal.png?branch=master

NOTE: the name of the package has been changed from coseg to PeakSegOptimal, to avoid confusion with the CoSeg package on CRAN (which is completely unrelated to this package).

Installation

From CRAN:

install.packages("PeakSegOptimal")

From GitHub:

if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/PeakSegOptimal")

PeakSeg solvers

Linear time exact C++ solver for the optimal change points using the Poisson loss and the PeakSeg constraint.

PeakSeg Segment Neighborhood problem

There are two R functions that solve the Segment Neighborhood problem (best models with at most 1,…,S segments). The constraints are first segment mean <= second segment mean >= third segment mean, etc. For S segments and N data points, complexity is O(S N log N) memory and O(S N log N) time.

  • PeakSegPDPA is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step. Input parameter is the maximum number of segments S, and outputs are the optimal models from 1 to S segments.
  • PeakSegPDPAchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output. Input parameter is the maximum number of peaks P (implying S = P*2+1 segments), and outputs are the optimal models from 0 to P peaks.

PeakSeg Optimal Partitioning problem

There are two R functions that solve the Optimal Partitioning problem (best model for a given penalty parameter). Input parameter is a non-negative penalty constant (bigger for fewer peaks, smaller for more peaks), and output is the model that minimizes the penalized cost using the specified penalty constant. For N data points, complexity is O(N log N) memory and time. The first segment mean is constrained to be down (mu_1 <= mu_2), as is the last (muN-1 >= mu_N).

  • PeakSegFPOP is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step.
  • PeakSegFPOPchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output.

Note about inequality constraints

Note that as defined in the PeakSeg ICML’15 paper, the PeakSeg problem has strict inequality constraints (segment mean 1 < segment mean 2 > segment mean 3, etc). However, the solvers in this package use non-strict inequality constraints (segment mean 1 <= segment mean 2 >= segment mean 3, etc), and so may return a model with adjacent segment means that are equal. In this case, the minimum of the PeakSeg problem with strict inequality constraints is undefined. For example,

library(coseg)
data.vec <- as.integer(c(1, 2, 3))
fit <- PeakSegPDPA(data.vec, max.segments=3L)

For the data 1, 2, 3 the optimal model with at most 3 segments actually has only 2 segments:

> fit$mean.mat[3,]
[1] 1.0 2.5 2.5

This implies that the optimal model with strict inequality constraints is undefined for these data. For example, the model with means 1, 3, 2 satisfies the strict inequality constraint, and has a PoissonLoss of

> PoissonLoss(c(1,2,3), c(1,3,2))
[1] 1.723334

But that model is not optimal, because there is another model that satisfies the strict inequality constraints, but has a better cost. For example,

> PoissonLoss(c(1,2,3), c(1,2.9,2.1))
[1] 1.644766

But that model is not optimal, for the same reason (exercise for the reader).

Related work: peak calling on the whole genome

An on-disk implementation of PeakSegFPOP is available in PeakSegFPOP or PeakSegPipeline. These programs can be used to compute optimal peak models for large bedGraph files, since they stores the optimal cost in a temporary file on disk (rather than in memory). If you want to use PeakSeg to perform peak calling on the whole genome, please try using these other implementations.

implementationtimememorydisk
PeakSegPipeline::PeakSegFPOP_diskO(N log N)O(log N)O(N log N)
PeakSegOptimal::PeakSegFPOPO(N log N)O(N log N)0

Note that although both implementations are O(N log N) time complexity for N data points, the PeakSegPipeline implementation is slower due to disk read/write overhead.

About

Constrained Optimal Segmentation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages