GitHub - tdhock/PeakSegOptimal: Constrained Optimal Segmentation

PeakSegOptimal: Optimal Segmentation Subject to Up-Down Constraints

NOTE: the name of the package has been changed from coseg to PeakSegOptimal, to avoid confusion with the CoSeg package on CRAN (which is completely unrelated to this package).

Installation

From CRAN:

install.packages("PeakSegOptimal")

From GitHub:

if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/PeakSegOptimal")

PeakSeg solvers

Linear time exact C++ solver for the optimal change points using the Poisson loss and the PeakSeg constraint.

PeakSeg Segment Neighborhood problem

There are two R functions that solve the Segment Neighborhood problem (best models with at most 1,…,S segments). The constraints are first segment mean <= second segment mean >= third segment mean, etc. For S segments and N data points, complexity is O(S N log N) memory and O(S N log N) time.

PeakSegPDPA is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step. Input parameter is the maximum number of segments S, and outputs are the optimal models from 1 to S segments.
PeakSegPDPAchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output. Input parameter is the maximum number of peaks P (implying S = P*2+1 segments), and outputs are the optimal models from 0 to P peaks.

PeakSeg Optimal Partitioning problem

There are two R functions that solve the Optimal Partitioning problem (best model for a given penalty parameter). Input parameter is a non-negative penalty constant (bigger for fewer peaks, smaller for more peaks), and output is the model that minimizes the penalized cost using the specified penalty constant. For N data points, complexity is O(N log N) memory and time. The first segment mean is constrained to be down (mu_1 <= mu_2), as is the last (mu_N-1 >= mu_N).

PeakSegFPOP is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step.
PeakSegFPOPchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output.

Note about inequality constraints

Note that as defined in the PeakSeg ICML’15 paper, the PeakSeg problem has strict inequality constraints (segment mean 1 < segment mean 2 > segment mean 3, etc). However, the solvers in this package use non-strict inequality constraints (segment mean 1 <= segment mean 2 >= segment mean 3, etc), and so may return a model with adjacent segment means that are equal. In this case, the minimum of the PeakSeg problem with strict inequality constraints is undefined. For example,

library(coseg)
data.vec <- as.integer(c(1, 2, 3))
fit <- PeakSegPDPA(data.vec, max.segments=3L)

For the data 1, 2, 3 the optimal model with at most 3 segments actually has only 2 segments:

> fit$mean.mat[3,]
[1] 1.0 2.5 2.5

This implies that the optimal model with strict inequality constraints is undefined for these data. For example, the model with means 1, 3, 2 satisfies the strict inequality constraint, and has a PoissonLoss of

> PoissonLoss(c(1,2,3), c(1,3,2))
[1] 1.723334

But that model is not optimal, because there is another model that satisfies the strict inequality constraints, but has a better cost. For example,

> PoissonLoss(c(1,2,3), c(1,2.9,2.1))
[1] 1.644766

But that model is not optimal, for the same reason (exercise for the reader).

Related work: peak calling on the whole genome

An on-disk implementation of PeakSegFPOP is available in PeakSegFPOP or PeakSegPipeline. These programs can be used to compute optimal peak models for large bedGraph files, since they stores the optimal cost in a temporary file on disk (rather than in memory). If you want to use PeakSeg to perform peak calling on the whole genome, please try using these other implementations.

implementation	time	memory	disk
PeakSegPipeline::PeakSegFPOP_disk	O(N log N)	O(log N)	O(N log N)
PeakSegOptimal::PeakSegFPOP	O(N log N)	O(N log N)	0

Note that although both implementations are O(N log N) time complexity for N data points, the PeakSegPipeline implementation is slower due to disk read/write overhead.

Name		Name	Last commit message	Last commit date
Latest commit History 265 Commits
R		R
data		data
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS		NEWS
README.org		README.org
build.sh		build.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

PeakSeg solvers

PeakSeg Segment Neighborhood problem

PeakSeg Optimal Partitioning problem

Note about inequality constraints

Related work: peak calling on the whole genome

About

Releases

Packages

Contributors 2

Languages

tdhock/PeakSegOptimal

Folders and files

Latest commit

History

Repository files navigation

Installation

PeakSeg solvers

PeakSeg Segment Neighborhood problem

PeakSeg Optimal Partitioning problem

Note about inequality constraints

Related work: peak calling on the whole genome

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages