API CHANGE: support peaksData to return data.frame or matrix #289

jorainer · 2023-05-22T08:34:02Z

This is related to issue #287 and was discussed here: change the MsBackend API to support peaksData to return a list of data.frame instead (or in addition) to a list of matrix. This would enable native support of peak variables (annotations) that are not numeric.

This change should not break any functionality, but we need to check also if and how strong it's impact on the performance is.

The text was updated successfully, but these errors were encountered:

jorainer · 2023-05-22T12:32:43Z

Comparing impact of returning a data.frame instead of matrix by peaksData:

library(Spectra)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                  package = "msdata")
be <- backendInitialize(MsBackendMzR(), fl)

mem <- backendInitialize(MsBackendMemory(), spectraData(be))
mem2 <- backendInitialize(MsBackendMemory2(), spectraData(be))

class(mem@peaksData[[1L]])
[1] "matrix" "array" 
class(mem2@peaksData[[1L]])
[1] "data.frame"

The MsBackendMemory2 is essentially identical to the MsBackendMemory only that it keeps "mz" and "intensity" values in a data.frame instead of a matrix.

Tests at the `MsBackend` level

Performance of accessing peaksData, mz and intensity.

library(testthat)
library(microbenchmark)
microbenchmark(peaksData(mem), peaksData(mem2))
Unit: microseconds
            expr   min     lq    mean median     uq    max neval cld
  peaksData(mem) 5.486 5.6200 5.81063 5.6945 5.8110 13.633   100   a
 peaksData(mem2) 5.663 5.7825 6.25971 5.8550 5.9495 42.090   100   a

No difference (not unexpectedly).

expect_equal(mz(mem), mz(mem2))
microbenchmark(mz(mem), mz(mem2), times = 13)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
  mz(mem) 36.67263 38.38423 40.14132 38.97847 41.48369 49.28339    13  a 
 mz(mem2) 46.58165 48.36403 51.12181 51.03753 52.70381 58.58926    13   b

The matrix-based backend is about 1.3 times faster. Note: access using $mz and $intensity instead of [[ would increase performance again

library(IRanges)
mymz <- function(x) {
    NumericList(lapply(x@peaksData, function(z) z$mz), compress = FALSE)
}
expect_equal(mz(mem), mymz(mem2))
microbenchmark(mz(mem), mz(mem2), mymz(mem2), times = 13)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
    mz(mem) 36.32873 36.84343 38.59994 38.31872 39.07635  43.59662    13   a
   mz(mem2) 46.81475 48.03289 65.21417 50.94912 52.18459 246.95200    13   a
 mymz(mem2) 36.11362 36.58824 38.32284 38.28542 39.72803  41.07391    13   a

Also for spectraData, the matrix-based backend is about 1.25 times faster (not shown).

Tests at the `Spectra` level

Performing some tests at the Spectra level. Here we might see the difference to subset matrix or data.frame by row.

sps <- Spectra(mem)
sps2 <- Spectra(mem2)

Accessing m/z:

expect_equal(mz(sps), mz(sps2))
microbenchmark(mz(sps), mz(sps2), times = 13)
Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval cld
  mz(sps) 35.9689 36.66129 38.18032 37.12317 38.36346 42.75073    13  a 
 mz(sps2) 45.1387 46.18861 47.53825 46.72444 48.27922 52.71166    13   b

Similar to above. Next we compare performance for compareSpectra:

sps_sub <- filterMsLevel(sps)[1:20]
sps2_sub <- filterMsLevel(sps2)[1:20]

expect_equal(compareSpectra(sps_sub), compareSpectra(sps2_sub))
microbenchmark(compareSpectra(sps_sub), compareSpectra(sps2_sub), times = 13)
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max
  compareSpectra(sps_sub) 47.81649 50.13000 52.51268 51.86670 54.86539 58.57741
 compareSpectra(sps2_sub) 74.27192 75.84804 78.10415 76.80175 81.55561 84.79143
 neval cld
    13  a 
    13   b

data.frame backend is about 1.5 times slower. The reason being most likely accessing columns of the peaks matrix which could eventually be improved by directly using $mz and $intensity.

For the filtering operations the subset by row will be an issue. Here we know that matrix is faster than data.frame.

sps_f <- filterIntensity(sps, intensity = 1)
sps2_f <- filterIntensity(sps2, intensity = 1)

microbenchmark(mz(sps), mz(sps2), mz(sps_f), mz(sps2_f), times = 13)
Unit: milliseconds
       expr      min        lq      mean    median        uq       max neval
    mz(sps)  37.1891  38.26193  38.93608  38.79268  39.34904  41.02237    13
   mz(sps2)  47.2803  50.00244  52.43369  52.15033  54.37638  58.17806    13
  mz(sps_f) 194.8442 198.56253 207.51396 203.34011 207.82562 243.59039    13
 mz(sps2_f) 386.7848 392.46101 404.23477 397.99145 404.28471 456.45066    13
  cld
 a   
  b  
   c 
    d

The data.frame backend is thus 2 times slower.

jorainer · 2023-05-22T13:28:40Z

Maybe a workable solution would be to allow the backend to return a matrix if columns = c("mz", "intensity") in the peaksData call and a data.frame when additional columns are requested. Thus peaksData,MsBackend should return a list of two-dimensional data structures (e.g., matrix or data.frame) but we don't impose that it has to be always a matrix (like we do now). The functions in Spectra working with these lists of peak data will use [ operations that would be supported by both matrix and data.frame.

Thus, we would only run into the above described performance issues if the user has data with additional peak annotations and if he/she requests them from the backend by specifically using e.g. columns = c("mz", "intensity", "peak_annotation") in the peaksData call.

Any opinion @lgatto @sgibb ?

jorainer · 2023-05-25T13:24:05Z

I'm developing currently in the jomain branch. First change:

peaksData,MsBackendMemory returns a list of matrix by default (with parameter columns = c("mz", "intensity").
peaksData,MsBackendMemory returns a list of data.frames if additional/other peak variables are requested.

that should enable to start working on ensuring peaksData,Spectra works with filtering peak matrices again.

- `peaksData,MsBackendMemory` returns by default a `list` of `matrix` or a `list` of `data.frame`s if other peak variables than `"mz"`, `"intensity"` are requested. Issue #289.

jorainer · 2023-05-25T13:38:47Z

In Spectra, the .peaksapply will be the key function. That function ensures that the lazy processing queue is applied properly and the eventually filtered/changes peak data is returned.

jorainer · 2023-05-25T13:47:15Z

The good news: this update already enables the main issue/problem discussed in the last dev call:

Create a Spectra with an additional peak annotation:

library(Spectra)

df <- data.frame(rtime = c(1.1, 1.2, 1.3, 1.4),
                 msLevel = 1L)
df$mz <- list(c(13, 14.1, 22, 23, 24, 49),
              c(45.1, 56),
              c(34.3, 134.4, 344, 443),
              c(12.1, 31))
df$intensity <- list(c(100, 300, 30, 120, 12, 34),
                     c(345, 234),
                     c(123, 124, 145, 3),
                     c(122, 421))

#' add some arbitrary information for each peak to the data.frame
df$ann <- list(c("a", NA, "b", "c", "d", NA),
               c("e", "f"),
               c("g", "h", "i", NA),
               c("j", "k"))
B <- Spectra(df, peaksVariables = c("mz", "intensity", "ann"))

peaksData(B)[[1L]]
       mz intensity
[1,] 13.0       100
[2,] 14.1       300
[3,] 22.0        30
[4,] 23.0       120
[5,] 24.0        12
[6,] 49.0        34

peaksData(B, columns = peaksVariables(B))[[1L]]
    mz intensity  ann
1 13.0       100    a
2 14.1       300 <NA>
3 22.0        30    b
4 23.0       120    c
5 24.0        12    d
6 49.0        34 <NA>

So, in the first case a matrix is returned, in the second a data.frame. This also enables (and does no longer break) filtering:

B2 <- filterMzValues(B, 23, tolerance = 1)
peaksData(B2, columns = c("mz", "intensity"))[[1L]]
     mz intensity
[1,] 22        30
[2,] 23       120
[3,] 24        12

And including additional peak variables:

peaksData(B2, columns = peaksVariables(B))[[1L]]
  mz intensity ann
3 22        30   b
4 23       120   c
5 24        12   d

jorainer · 2023-05-25T13:51:22Z

What is not yet working (needs some fixes on the Spectra level) is when individual peak variables are extracted:

B2$ann[[1L]]
[1] "a" NA  "b" "c" "d" NA

jorainer · 2023-05-25T14:49:21Z

In addition, spectraData,Spectra needs to ensure peaks variables are correctly subsetted/processed if lazy processing queue is not empty.

- Ensure lazy evaluation of processing queue in `peaksData,Spectra` does not break if requested `columns` do not contains `"mz"` and `"intensity"`. Issue #289.

- `applyProcessing` ensures that all peaks variables are properly updates/subset depending on the processing queue (issue #289).

jorainer · 2023-05-31T10:58:20Z

spectraData now also correctly subsets the peaks data if needed:

spectraData(B2, column = "ann")
DataFrame with 4 rows and 1 column
     ann
  <list>
1  b,c,d
2       
3       
4

jorainer · 2023-05-31T10:59:43Z

And applyProcessing correctly updates the peaks data and additional peaks variables in the backend:

B3 <- applyProcessing(B2)
B3@backend@peaksDataFrame
[[1]]
  ann
3   b
4   c
5   d

[[2]]
[1] ann
<0 rows> (or 0-length row.names)

[[3]]
[1] ann
<0 rows> (or 0-length row.names)

[[4]]
[1] ann
<0 rows> (or 0-length row.names)

jorainer · 2023-05-31T11:00:58Z

We need also to ensure that replacing peaks variables works properly with a non-empty processing queue (this is in fact a bug in the current implementation).

- `spectraData<-,Spectra` and `$<-,Spectra` throw an error if processing queue is not empty and values for peaks variables are requested to be replaced. Issue #289.

- Add support for peak variables to `MsBackendDataFrame`. Issue #289. - Add examples for peak variables to `MsBackend` and `Spectra` documentation. - Expand documentation for peaks variables to the `Spectra` vignette.

- Add support for peaks variables to `MsBackendDataFrame` and add/test its funcitionality (issue #289).

jorainer · 2023-06-05T07:27:24Z

Replacing peaks variables works (in fact, throws an error if the processing queue is not empty). After applyProcessing peaks variables can be replaced with spectraData<- and $<-. All this was added in the recent commits (in the jomain branch).

jorainer added the API label May 22, 2023

jorainer changed the title ~~API CHANGE: peaksData to return data.frame instead of matrix~~ API CHANGE: support peaksData to return data.frame or matrix May 23, 2023

jorainer added a commit that referenced this issue May 31, 2023

refactor: peaksData,Spectra to correctly filter peaks matrix

f0d13c6

- Ensure lazy evaluation of processing queue in `peaksData,Spectra` does not break if requested `columns` do not contains `"mz"` and `"intensity"`. Issue #289.

jorainer added a commit that referenced this issue May 31, 2023

refactor: applyProcessing correctly manages processing queue

533a225

- `applyProcessing` ensures that all peaks variables are properly updates/subset depending on the processing queue (issue #289).

jorainer added a commit that referenced this issue May 31, 2023

refactor: spectraData<- and $<- to check peaks variables

1482b9d

- `spectraData<-,Spectra` and `$<-,Spectra` throw an error if processing queue is not empty and values for peaks variables are requested to be replaced. Issue #289.

jorainer added a commit that referenced this issue Jun 5, 2023

refactor: extend support for peaks variables (issue #289)

db17225

jorainer added a commit that referenced this issue Jun 5, 2023

tests: add peaks variables unit tests for MsBackendDataFrame

a6fabdf

- Add support for peaks variables to `MsBackendDataFrame` and add/test its funcitionality (issue #289).

jorainer mentioned this issue Jun 15, 2023

Add peak variable support to MsBackendMemory and MsBackendDataFrame #297

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API CHANGE: support peaksData to return data.frame or matrix #289

API CHANGE: support peaksData to return data.frame or matrix #289

jorainer commented May 22, 2023

jorainer commented May 22, 2023

jorainer commented May 22, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 31, 2023

jorainer commented May 31, 2023

jorainer commented May 31, 2023

jorainer commented Jun 5, 2023

API CHANGE: support peaksData to return data.frame or matrix #289

API CHANGE: support peaksData to return data.frame or matrix #289

Comments

jorainer commented May 22, 2023

jorainer commented May 22, 2023

Tests at the MsBackend level

Tests at the Spectra level

jorainer commented May 22, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 25, 2023

jorainer commented May 31, 2023

jorainer commented May 31, 2023

jorainer commented May 31, 2023

jorainer commented Jun 5, 2023

Tests at the `MsBackend` level

Tests at the `Spectra` level