Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API CHANGE: support peaksData to return data.frame or matrix #289

Open
jorainer opened this issue May 22, 2023 · 11 comments
Open

API CHANGE: support peaksData to return data.frame or matrix #289

jorainer opened this issue May 22, 2023 · 11 comments
Labels

Comments

@jorainer
Copy link
Member

This is related to issue #287 and was discussed here: change the MsBackend API to support peaksData to return a list of data.frame instead (or in addition) to a list of matrix. This would enable native support of peak variables (annotations) that are not numeric.

This change should not break any functionality, but we need to check also if and how strong it's impact on the performance is.

@jorainer jorainer added the API label May 22, 2023
@jorainer
Copy link
Member Author

Comparing impact of returning a data.frame instead of matrix by peaksData:

library(Spectra)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                  package = "msdata")
be <- backendInitialize(MsBackendMzR(), fl)

mem <- backendInitialize(MsBackendMemory(), spectraData(be))
mem2 <- backendInitialize(MsBackendMemory2(), spectraData(be))

class(mem@peaksData[[1L]])
[1] "matrix" "array" 
class(mem2@peaksData[[1L]])
[1] "data.frame"

The MsBackendMemory2 is essentially identical to the MsBackendMemory only that it keeps "mz" and "intensity" values in a data.frame instead of a matrix.

Tests at the MsBackend level

Performance of accessing peaksData, mz and intensity.

library(testthat)
library(microbenchmark)
microbenchmark(peaksData(mem), peaksData(mem2))
Unit: microseconds
            expr   min     lq    mean median     uq    max neval cld
  peaksData(mem) 5.486 5.6200 5.81063 5.6945 5.8110 13.633   100   a
 peaksData(mem2) 5.663 5.7825 6.25971 5.8550 5.9495 42.090   100   a

No difference (not unexpectedly).

expect_equal(mz(mem), mz(mem2))
microbenchmark(mz(mem), mz(mem2), times = 13)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
  mz(mem) 36.67263 38.38423 40.14132 38.97847 41.48369 49.28339    13  a 
 mz(mem2) 46.58165 48.36403 51.12181 51.03753 52.70381 58.58926    13   b

The matrix-based backend is about 1.3 times faster. Note: access using $mz and $intensity instead of [[ would increase performance again

library(IRanges)
mymz <- function(x) {
    NumericList(lapply(x@peaksData, function(z) z$mz), compress = FALSE)
}
expect_equal(mz(mem), mymz(mem2))
microbenchmark(mz(mem), mz(mem2), mymz(mem2), times = 13)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
    mz(mem) 36.32873 36.84343 38.59994 38.31872 39.07635  43.59662    13   a
   mz(mem2) 46.81475 48.03289 65.21417 50.94912 52.18459 246.95200    13   a
 mymz(mem2) 36.11362 36.58824 38.32284 38.28542 39.72803  41.07391    13   a

Also for spectraData, the matrix-based backend is about 1.25 times faster (not shown).

Tests at the Spectra level

Performing some tests at the Spectra level. Here we might see the difference to subset matrix or data.frame by row.

sps <- Spectra(mem)
sps2 <- Spectra(mem2)

Accessing m/z:

expect_equal(mz(sps), mz(sps2))
microbenchmark(mz(sps), mz(sps2), times = 13)
Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval cld
  mz(sps) 35.9689 36.66129 38.18032 37.12317 38.36346 42.75073    13  a 
 mz(sps2) 45.1387 46.18861 47.53825 46.72444 48.27922 52.71166    13   b

Similar to above. Next we compare performance for compareSpectra:

sps_sub <- filterMsLevel(sps)[1:20]
sps2_sub <- filterMsLevel(sps2)[1:20]

expect_equal(compareSpectra(sps_sub), compareSpectra(sps2_sub))
microbenchmark(compareSpectra(sps_sub), compareSpectra(sps2_sub), times = 13)
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max
  compareSpectra(sps_sub) 47.81649 50.13000 52.51268 51.86670 54.86539 58.57741
 compareSpectra(sps2_sub) 74.27192 75.84804 78.10415 76.80175 81.55561 84.79143
 neval cld
    13  a 
    13   b

data.frame backend is about 1.5 times slower. The reason being most likely accessing columns of the peaks matrix which could eventually be improved by directly using $mz and $intensity.

For the filtering operations the subset by row will be an issue. Here we know that matrix is faster than data.frame.

sps_f <- filterIntensity(sps, intensity = 1)
sps2_f <- filterIntensity(sps2, intensity = 1)

microbenchmark(mz(sps), mz(sps2), mz(sps_f), mz(sps2_f), times = 13)
Unit: milliseconds
       expr      min        lq      mean    median        uq       max neval
    mz(sps)  37.1891  38.26193  38.93608  38.79268  39.34904  41.02237    13
   mz(sps2)  47.2803  50.00244  52.43369  52.15033  54.37638  58.17806    13
  mz(sps_f) 194.8442 198.56253 207.51396 203.34011 207.82562 243.59039    13
 mz(sps2_f) 386.7848 392.46101 404.23477 397.99145 404.28471 456.45066    13
  cld
 a   
  b  
   c 
    d

The data.frame backend is thus 2 times slower.

@jorainer
Copy link
Member Author

Maybe a workable solution would be to allow the backend to return a matrix if columns = c("mz", "intensity") in the peaksData call and a data.frame when additional columns are requested. Thus peaksData,MsBackend should return a list of two-dimensional data structures (e.g., matrix or data.frame) but we don't impose that it has to be always a matrix (like we do now). The functions in Spectra working with these lists of peak data will use [ operations that would be supported by both matrix and data.frame.

Thus, we would only run into the above described performance issues if the user has data with additional peak annotations and if he/she requests them from the backend by specifically using e.g. columns = c("mz", "intensity", "peak_annotation") in the peaksData call.

Any opinion @lgatto @sgibb ?

@jorainer jorainer changed the title API CHANGE: peaksData to return data.frame instead of matrix API CHANGE: support peaksData to return data.frame or matrix May 23, 2023
@jorainer
Copy link
Member Author

I'm developing currently in the jomain branch. First change:

  • peaksData,MsBackendMemory returns a list of matrix by default (with parameter columns = c("mz", "intensity").
  • peaksData,MsBackendMemory returns a list of data.frames if additional/other peak variables are requested.

that should enable to start working on ensuring peaksData,Spectra works with filtering peak matrices again.

jorainer added a commit that referenced this issue May 25, 2023
- `peaksData,MsBackendMemory` returns by default a `list` of `matrix` or a
  `list` of `data.frame`s if other peak variables than `"mz"`, `"intensity"`
  are requested. Issue #289.
@jorainer
Copy link
Member Author

In Spectra, the .peaksapply will be the key function. That function ensures that the lazy processing queue is applied properly and the eventually filtered/changes peak data is returned.

@jorainer
Copy link
Member Author

The good news: this update already enables the main issue/problem discussed in the last dev call:

Create a Spectra with an additional peak annotation:

library(Spectra)

df <- data.frame(rtime = c(1.1, 1.2, 1.3, 1.4),
                 msLevel = 1L)
df$mz <- list(c(13, 14.1, 22, 23, 24, 49),
              c(45.1, 56),
              c(34.3, 134.4, 344, 443),
              c(12.1, 31))
df$intensity <- list(c(100, 300, 30, 120, 12, 34),
                     c(345, 234),
                     c(123, 124, 145, 3),
                     c(122, 421))

#' add some arbitrary information for each peak to the data.frame
df$ann <- list(c("a", NA, "b", "c", "d", NA),
               c("e", "f"),
               c("g", "h", "i", NA),
               c("j", "k"))
B <- Spectra(df, peaksVariables = c("mz", "intensity", "ann"))
peaksData(B)[[1L]]
       mz intensity
[1,] 13.0       100
[2,] 14.1       300
[3,] 22.0        30
[4,] 23.0       120
[5,] 24.0        12
[6,] 49.0        34

peaksData(B, columns = peaksVariables(B))[[1L]]
    mz intensity  ann
1 13.0       100    a
2 14.1       300 <NA>
3 22.0        30    b
4 23.0       120    c
5 24.0        12    d
6 49.0        34 <NA>

So, in the first case a matrix is returned, in the second a data.frame. This also enables (and does no longer break) filtering:

B2 <- filterMzValues(B, 23, tolerance = 1)
peaksData(B2, columns = c("mz", "intensity"))[[1L]]
     mz intensity
[1,] 22        30
[2,] 23       120
[3,] 24        12

And including additional peak variables:

peaksData(B2, columns = peaksVariables(B))[[1L]]
  mz intensity ann
3 22        30   b
4 23       120   c
5 24        12   d

@jorainer
Copy link
Member Author

What is not yet working (needs some fixes on the Spectra level) is when individual peak variables are extracted:

B2$ann[[1L]]
[1] "a" NA  "b" "c" "d" NA 

@jorainer
Copy link
Member Author

In addition, spectraData,Spectra needs to ensure peaks variables are correctly subsetted/processed if lazy processing queue is not empty.

jorainer added a commit that referenced this issue May 31, 2023
- Ensure lazy evaluation of processing queue in `peaksData,Spectra` does not
  break if requested `columns` do not contains `"mz"` and `"intensity"`. Issue
  #289.
jorainer added a commit that referenced this issue May 31, 2023
- `applyProcessing` ensures that all peaks variables are properly updates/subset
  depending on the processing queue (issue #289).
@jorainer
Copy link
Member Author

spectraData now also correctly subsets the peaks data if needed:

spectraData(B2, column = "ann")
DataFrame with 4 rows and 1 column
     ann
  <list>
1  b,c,d
2       
3       
4       

@jorainer
Copy link
Member Author

And applyProcessing correctly updates the peaks data and additional peaks variables in the backend:

B3 <- applyProcessing(B2)
B3@backend@peaksDataFrame
[[1]]
  ann
3   b
4   c
5   d

[[2]]
[1] ann
<0 rows> (or 0-length row.names)

[[3]]
[1] ann
<0 rows> (or 0-length row.names)

[[4]]
[1] ann
<0 rows> (or 0-length row.names)

@jorainer
Copy link
Member Author

We need also to ensure that replacing peaks variables works properly with a non-empty processing queue (this is in fact a bug in the current implementation).

jorainer added a commit that referenced this issue May 31, 2023
- `spectraData<-,Spectra` and `$<-,Spectra` throw an error if processing queue
  is not empty and values for peaks variables are requested to be
  replaced. Issue #289.
jorainer added a commit that referenced this issue Jun 1, 2023
- Add support for peak variables to `MsBackendDataFrame`. Issue #289.
- Add examples for peak variables to `MsBackend` and `Spectra` documentation.
- Expand documentation for peaks variables to the `Spectra` vignette.
jorainer added a commit that referenced this issue Jun 5, 2023
- Add support for peaks variables to `MsBackendDataFrame` and add/test its
  funcitionality (issue #289).
@jorainer
Copy link
Member Author

jorainer commented Jun 5, 2023

Replacing peaks variables works (in fact, throws an error if the processing queue is not empty). After applyProcessing peaks variables can be replaced with spectraData<- and $<-. All this was added in the recent commits (in the jomain branch).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant