Tabular format: compact version of peak level table? #12

Closed
RalfG opened this issue Nov 11, 2019 · 2 comments

@RalfG
Collaborator

RalfG commented Nov 11, 2019

Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (library, spectrum, peak and peak interpretation) with the following columns: cv_param_group, accession, name, value_accession, and value (and some additional grouping columns):


Library level

| cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|
| | MS:xxxxxxx | format version | | 0.1 |
| | MS:xxxxxxx | title | | library_001 |
| | MS:xxxxxxx | description | | spectral library 001 |
...

Spectrum level

| spectrum_index | ion_group | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|
| 1 | 1 | | MS:xxxxxxx | index | | 1 |
| 1 | 1 | | MS:xxxxxxx | title | | peptide1 |
| 1 | 1 | | MS:xxxxxxx | is decoy spectrum | | FALSE |
| 1 | 1 | 1 | MS:xxxxxxx | calibrated retention index | | xx |
| 1 | 1 | 1 | UO:0000000 | unit | UO:0000031 | minute |
...

Peak level

| spectrum_index | peak_index | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|
| 1 | 1 | | MS:xxxxxxx | m/z | | 725.123 |
| 1 | 1 | | MS:xxxxxxx | theoretical m/z | | 725.1244 |
| 1 | 1 | | MS:xxxxxxx | intensity | | 2138.325 |
...

Peak interpretation level

| spectrum_index | peak_index | peak_interpretation_index | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | | | peptidoform ion series type | | y |
| 1 | 1 | 1 | | | peptidoform ion series start ordinal | | 1 |
| 1 | 1 | 1 | | | product ion series charge state | | 1 |
...

This works perfectly fine for the library, spectrum and peak interpretation levels (where there are a lot of possible attributes per entry), but for the peak level, it might be better to have a compact form:

Peak level (compact)

| spectrum_index | peak_index | product ion m/z | product ion intensity |
|---|---|---|---|
| 1 | 1 | 138.0661469 | 190.7953186 |
| 1 | 2 | 219.1087494 | 29.48472786 |
| 1 | 3 | 305.0644836 | 1067.439087 |
...    

This could be extended with a few optional columns.
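
To make the difference between the two peak-level layouts concrete, here is a minimal Python sketch that pivots the long form (one attribute per row) into the compact form (one row per peak). The sample data, the reduced column set, and the function name `long_to_compact` are illustrative assumptions, not part of any specification.

```python
import csv
import io

# Hypothetical long-form peak table (tab-separated), reduced to the
# columns needed for the pivot; real files would carry accessions too.
LONG_PEAKS = (
    "spectrum_index\tpeak_index\tname\tvalue\n"
    "1\t1\tm/z\t138.0661469\n"
    "1\t1\tintensity\t190.7953186\n"
    "1\t2\tm/z\t219.1087494\n"
    "1\t2\tintensity\t29.48472786\n"
)


def long_to_compact(long_tsv, names=("m/z", "intensity")):
    """Pivot one-attribute-per-row peak data into one tuple per peak."""
    peaks = {}  # (spectrum_index, peak_index) -> {attribute name: value}
    for row in csv.DictReader(io.StringIO(long_tsv), delimiter="\t"):
        key = (int(row["spectrum_index"]), int(row["peak_index"]))
        peaks.setdefault(key, {})[row["name"]] = float(row["value"])
    # Missing attributes become None so every row has the same width.
    return [
        key + tuple(attrs.get(name) for name in names)
        for key, attrs in sorted(peaks.items())
    ]


compact = long_to_compact(LONG_PEAKS)
for row in compact:
    print(*row, sep="\t")
```

Four long-form rows collapse into two compact rows here; with more attributes per peak the saving grows accordingly.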

To keep everything well standardized and machine readable, I would add an additional table, Peak level columns, defining the columns used in the Peak level (compact) table, which could also contain info about the units used (if applicable). E.g.:

Peak level columns additional table

| column_index | accession | name | unit_accession | unit_name |
|---|---|---|---|---|
| 0 | | spectrum_index | | |
| 1 | | peak_index | | |
| 2 | MS:1001225 | product ion m/z | MS:1000040 | m/z |
| 3 | MS:1001226 | product ion intensity | MS:1000132 | percent of base peak |
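
As a rough illustration of how a reader could use such a column-definition table, the sketch below parses it first and then uses it to validate and type the compact peak table. It assumes tab-separated text, a header row in the compact table, and that columns with a CV accession hold floats while the others are integer indices; the function names are hypothetical.

```python
import csv
import io

# Column definitions as proposed above, serialized as TSV for this sketch.
COLUMN_DEFS = (
    "column_index\taccession\tname\tunit_accession\tunit_name\n"
    "0\t\tspectrum_index\t\t\n"
    "1\t\tpeak_index\t\t\n"
    "2\tMS:1001225\tproduct ion m/z\tMS:1000040\tm/z\n"
    "3\tMS:1001226\tproduct ion intensity\tMS:1000132\tpercent of base peak\n"
)
COMPACT_PEAKS = (
    "spectrum_index\tpeak_index\tproduct ion m/z\tproduct ion intensity\n"
    "1\t1\t138.0661469\t190.7953186\n"
    "1\t2\t219.1087494\t29.48472786\n"
    "1\t3\t305.0644836\t1067.439087\n"
)


def read_column_defs(text):
    """Return the column definitions ordered by column_index."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return sorted(rows, key=lambda row: int(row["column_index"]))


def read_compact_peaks(text, column_defs):
    """Parse compact peak rows, checking the header against the definitions."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    expected = [col["name"] for col in column_defs]
    header = next(reader)
    if header != expected:
        raise ValueError(f"unexpected columns: {header}")
    # Assumption: columns with an accession are floats, the rest are indices.
    casts = [float if col["accession"] else int for col in column_defs]
    return [
        {name: cast(field) for name, cast, field in zip(expected, casts, fields)}
        for fields in reader
    ]


column_defs = read_column_defs(COLUMN_DEFS)
peaks = read_compact_peaks(COMPACT_PEAKS, column_defs)
units = {col["name"]: col["unit_name"] for col in column_defs if col["unit_name"]}
```

Optional columns would then only need an extra row in the definition table; the parsing code itself would not change.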

To summarize:

  • Peak level would get very verbose if we followed the same fields as the JSON specification.
  • A solution would be a compact form, together with a small table specifying the columns.

Questions:

  • Does everyone agree with having a compact form for the peak level?
  • We could completely deviate from the JSON spec and have a compact form at all levels. This would make the file more compact in general and more database-like. A drawback is that for all levels, the number of columns can vary between libraries, which would make parsing the metadata somewhat harder. We would also have to deal with the value, value_accession duality, which we do not have at the peak level, as all values are just numbers. What does everyone think about the "full-on compact form" idea?
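
To illustrate the two drawbacks mentioned in the last question, here is a small sketch that flattens long-form spectrum-level rows into one record per spectrum. The row layout, the sample values, and the convention of folding a grouped unit into `<name> unit` and `<name> unit accession` columns are assumptions made for this example only.

```python
# Hypothetical long-form spectrum-level rows:
# (spectrum_index, cv_param_group, name, value_accession, value)
LONG_ROWS = [
    (1, None, "title", None, "peptide1"),
    (1, None, "is decoy spectrum", None, "FALSE"),
    (1, 1, "calibrated retention index", None, "xx"),  # "xx" is a placeholder
    (1, 1, "unit", "UO:0000031", "minute"),
    (2, None, "title", None, "peptide2"),
]


def long_to_wide(rows):
    """Build one dict per spectrum from one-attribute-per-row input."""
    wide = {}         # spectrum_index -> {column name: value}
    group_owner = {}  # (spectrum_index, cv_param_group) -> attribute name
    for spec, group, name, value_accession, value in rows:
        record = wide.setdefault(spec, {})
        if group is not None and name == "unit":
            # Assumes the grouped attribute row precedes its unit row.
            owner = group_owner[(spec, group)]
            record[owner + " unit"] = value
            record[owner + " unit accession"] = value_accession
        else:
            if group is not None:
                group_owner[(spec, group)] = name
            record[name] = value
    return wide


wide = long_to_wide(LONG_ROWS)
all_columns = sorted({col for record in wide.values() for col in record})
```

In this toy input, spectrum 1 ends up with five columns and spectrum 2 with one, so a wide table needs the union of all columns, and the value/value_accession pair has to be split into two columns.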
@bittremieux
Contributor

bittremieux commented Nov 12, 2019

I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).

Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and more efficient to slice and dice the data.
With these big, repository-scale spectral libraries I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.
Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.
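
One layout that would keep the number of keys small, sketched here with only the Python standard library rather than an HDF5 binding, is to concatenate all peaks into two flat arrays and add an offsets array that marks where each spectrum starts. This is a variation on the two-arrays-per-spectrum idea and an assumption of this example, not something settled in this thread; in an HDF5 file the three arrays would presumably map to three datasets. The class name `PeakStore` is hypothetical.

```python
from array import array


class PeakStore:
    """All peaks in two flat arrays plus one offsets array."""

    def __init__(self):
        self.mz = array("d")
        self.intensity = array("d")
        # Spectrum i occupies the slice offsets[i]:offsets[i + 1].
        self.offsets = array("q", [0])

    def __len__(self):
        return len(self.offsets) - 1

    def add_spectrum(self, mz, intensity):
        """Append one spectrum and return its zero-based index."""
        if len(mz) != len(intensity):
            raise ValueError("m/z and intensity must have equal length")
        self.mz.extend(mz)
        self.intensity.extend(intensity)
        self.offsets.append(len(self.mz))
        return len(self) - 1

    def get_spectrum(self, index):
        """Return the (m/z, intensity) arrays of one spectrum."""
        if not 0 <= index < len(self):
            raise IndexError(f"no spectrum with index {index}")
        start, stop = self.offsets[index], self.offsets[index + 1]
        return self.mz[start:stop], self.intensity[start:stop]


store = PeakStore()
first = store.add_spectrum(
    [138.0661469, 219.1087494, 305.0644836],
    [190.7953186, 29.48472786, 1067.439087],
)
second = store.add_spectrum([725.123], [2138.325])
mz, intensity = store.get_spectrum(first)
```

Retrieving a spectrum then costs two offset reads and two slices, regardless of how many peaks it has, and the number of stored arrays stays at three no matter how many spectra the library contains.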

Related to your final question, the main consideration here is what the goal of this version of the format is. If it is readability, then CSV is obviously superior. But I don't care about readability here; as you mention in #11, there's already the text version (and to a lesser extent the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended: storing values in arrays instead of storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.

@RalfG
Collaborator Author

RalfG commented Nov 14, 2019

Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and the metadata for each group would definitely be very useful for the spectral library format.

This means that an optimal HDF representation would look pretty different from this general tabular format. Since we want to make it possible with the specification to have multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as csv/tsv) and move the discussion on the HDF format to a new issue (#13).

@RalfG RalfG closed this as completed Jun 6, 2020