R: PeakSeg penalized solver with caching

PeakSegFPOP_dir {PeakSegDisk}

R Documentation

PeakSeg penalized solver with caching

Description

Main function/interface for the PeakSegDisk package. Run the low-level solver, PeakSegFPOP_file, on one genomic segmentation problem directory, and read the result files into R. Actually, this function will first check if the result files are already present (and consistent), and if so, it will simply read them into R (without running PeakSegFPOP_file) – this is a caching mechanism that can save a lot of time. To run the algo on an integer vector, use PeakSegFPOP_vec; for a data.frame, use PeakSegFPOP_df. To compute the optimal model for a given number of peaks, use sequentialSearch_dir.

Usage

PeakSegFPOP_dir(problem.dir, 
    penalty.param, db.file = NULL)

Arguments

`problem.dir`	Path to a directory like sampleID/problems/chrXX-start-end which contains a coverage.bedGraph file with the aligned read counts for one genomic segmentation problem. This must be a plain text file with the following four columns: chrom (character chromosome name), chromStart (integer start position), chromEnd (integer end position), count (integer aligned read count on chrom from chromStart+1 to chromEnd); see also https://genome.ucsc.edu/goldenPath/help/bedgraph.html. Note that the standard coverage.bedGraph file name is required; for full flexibility the user can run the algo on an arbitrarily named file via `PeakSegFPOP_file` (see that man page for an explanation of how storage on disk happens).
`penalty.param`	non-negative numeric penalty parameter (larger values for fewer peaks), or character scalar which can be interpreted as such. 0 means max peaks, Inf means no peaks.
`db.file`	character scalar: file for writing temporary cost function database – there will be a lot of disk writing to this file. Default NULL means to write the same disk where the input bedGraph file is stored; another option is tempfile() which may result in speedups if the input bedGraph file is on a slow network disk and the temporary storage is a fast local disk.

Details

Finds the optimal change-points using the Poisson loss and the PeakSeg constraint (changes in mean alternate between non-decreasing and non-increasing). For N data points, the functional pruning algorithm is O(\log N) memory. It is O(N \log N) time and disk space. It computes the exact solution to the optimization problem in vignette("Examples", package="PeakSegDisk").

Value

Named list of two data.tables:

`segments`	has one row for every segment in the optimal model,
`loss`	has one row and contains the following columns:

penalty: same as input parameter
segments: number of segments in optimal model
peaks: number of peaks in optimal model
bases: number of positions described in bedGraph file
bedGraph.lines: number of lines in bedGraph file
total.loss: total Poisson loss = \sum_i m_i-z_i*\log(m_i) = mean.pen.cost*bases-penalty*peaks
mean.pen.cost: mean penalized cost = (total.loss+penalty*peaks)/bases
equality.constraints: number of adjacent segment means that have equal values in the optimal solution
mean.intervals: mean number of intervals/candidate changepoints stored in optimal cost functions – useful for characterizing the computational complexity of the algorithm
max.intervals: maximum number of intervals
megabytes: disk usage of *.db file
seconds: timing of PeakSegFPOP_file

Author(s)

Toby Dylan Hocking

Examples


data(Mono27ac, package="PeakSegDisk", envir=environment())
data.dir <- file.path(
  tempfile(),
  "H3K27ac-H3K4me3_TDHAM_BP",
  "samples",
  "Mono1_H3K27ac",
  "S001YW_NCMLS",
  "problems",
  "chr11-60000-580000")
dir.create(data.dir, recursive=TRUE, showWarnings=FALSE)
write.table(
  Mono27ac$coverage, file.path(data.dir, "coverage.bedGraph"),
  col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")

## Compute one model with penalty=1952.6
(fit <- PeakSegDisk::PeakSegFPOP_dir(data.dir, 1952.6))
summary(fit)#same as fit$loss

## Visualize that model.
ann.colors <- c(
  noPeaks="#f6f4bf",
  peakStart="#ffafaf",
  peakEnd="#ff4c4c",
  peaks="#a445ee")
if(require(ggplot2)){
  lab.min <- Mono27ac$labels[1, chromStart]
  lab.max <- Mono27ac$labels[.N, chromEnd]
  plist <- coef(fit)
  gg <- ggplot()+
    theme_bw()+
    geom_rect(aes(
      xmin=chromStart/1e3, xmax=chromEnd/1e3,
      ymin=-Inf, ymax=Inf,
      fill=annotation),
      color="grey",
      alpha=0.5,
      data=Mono27ac$labels)+
    scale_fill_manual("label", values=ann.colors)+
    geom_step(aes(
      chromStart/1e3, count),
      color="grey50",
      data=Mono27ac$coverage)+
    geom_segment(aes(
      chromStart/1e3, mean,
      xend=chromEnd/1e3, yend=mean),
      color="green",
      size=1,
      data=plist$segments)+
    geom_vline(aes(
      xintercept=chromEnd/1e3, linetype=constraint),
      color="green",
      data=plist$changes)+
    scale_linetype_manual(
      values=c(
        inequality="dotted",
        equality="solid"))
  print(gg)
  print(gg+coord_cartesian(xlim=c(lab.min, lab.max)/1e3, ylim=c(0, 10)))
  ## Default plotting method only shows model.
  print(gg <- plot(fit))
  ## Data can be added on top of model.
  print(
    gg+
      geom_step(aes(
        chromStart, count),
        color="grey50",
        data=Mono27ac$coverage)
  )
}

[Package PeakSegDisk version 2023.11.27 Index]