Tabular format: compact version of peak level table? #12

RalfG · 2019-11-11T13:57:00Z

Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (library, spectrum, peak and peak interpretation) with the following columns: cv_param_group, accession, name, value_accession, and value (and some additional grouping columns):

Library level

cv_param_group	accession	name	value
	MS:xxxxxxx	format version	0.1
	MS:xxxxxxx	title	library_001
	MS:xxxxxxx	description	spectral library 001
...

Spectrum level

spectrum_index	ion_group	cv_param_group	accession	name	value_accession	value
1	1		MS:xxxxxxx	index		1
1	1		MS:xxxxxxx	title		peptide1
1	1		MS:xxxxxxx	is decoy spectrum		FALSE
1	1	1	MS:xxxxxxx	calibrated retention index		xx
1	1	1	UO:0000000	unit	UO:0000031	minute
...

Peak level

spectrum_index	peak_index	accession	name	value
1	1	MS:xxxxxxx	m/z	725.123
1	1	MS:xxxxxxx	theoretical m/z	725.1244
1	1	MS:xxxxxxx	intensity	2138.325
...

Peak interpretation level

spectrum_index	peak_index	peak_interpretation_index	name	value
1	1	1	peptidoform ion series type	y
1	1	1	peptidoform ion series start ordinal	1
1	1	1	product ion series charge state	1
...

This works perfectly fine for the library, spectrum and peak interpretation levels (where there are a lot of possible attributes per entry), but for the peak level, it might be better to have a compact form:

Peak level (compact)

spectrum_index	peak_index	product ion m/z	product ion intensity
1	1	138.0661469	190.7953186
1	2	219.1087494	29.48472786
1	3	305.0644836	1067.439087
...

This could be extended with a few optional columns.
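As a rough illustration of how the verbose peak table maps onto the compact one, here is a minimal Python sketch. It assumes the last column of the verbose table holds the value, uses the placeholder accessions from the example, and the second peak is made up. The function name `long_to_compact` is only for illustration and is not part of any specification.

```python
import csv
import io

# Verbose peak-level rows: one row per attribute.
# The second peak (830.456 / 512.0) is invented for this example.
LONG_TSV = (
    "spectrum_index\tpeak_index\taccession\tname\tvalue\n"
    "1\t1\tMS:xxxxxxx\tm/z\t725.123\n"
    "1\t1\tMS:xxxxxxx\tintensity\t2138.325\n"
    "1\t2\tMS:xxxxxxx\tm/z\t830.456\n"
    "1\t2\tMS:xxxxxxx\tintensity\t512.0\n"
)


def long_to_compact(long_tsv, names=("m/z", "intensity")):
    """Pivot one-row-per-attribute peak data into one row per peak."""
    peaks = {}  # (spectrum_index, peak_index) -> {attribute name: value}
    for row in csv.DictReader(io.StringIO(long_tsv), delimiter="\t"):
        key = (int(row["spectrum_index"]), int(row["peak_index"]))
        peaks.setdefault(key, {})[row["name"]] = row["value"]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["spectrum_index", "peak_index", *names])
    for (spectrum_index, peak_index), attrs in sorted(peaks.items()):
        # Attributes missing for a peak are written as empty cells.
        writer.writerow(
            [spectrum_index, peak_index, *(attrs.get(n, "") for n in names)]
        )
    return out.getvalue()


compact = long_to_compact(LONG_TSV)
print(compact)
```

Four verbose rows collapse into two compact rows here; with more attributes per peak the saving grows accordingly.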

To keep everything well standardized and machine readable, I would add an additional table, Peak level columns, that defines the columns used in the Peak level (compact) table and can also hold information about the units used (if applicable). E.g.:

Peak level columns additional table

column_index	accession	name	unit_accession	unit_name
0		spectrum_index
1		peak_index
2	MS:1001225	product ion m/z	MS:1000040	m/z
3	MS:1001226	product ion intensity	MS:1000132	percent of base peak
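To show how a reader could use the two tables together, here is a small Python sketch. It assumes both tables are tab-separated text exactly as in the examples above, and that any column with an accession holds a numeric value. The function name `read_compact_peaks` and the validation step are my own assumptions, not something the format prescribes.

```python
import csv
import io

COLUMNS_TSV = (
    "column_index\taccession\tname\tunit_accession\tunit_name\n"
    "0\t\tspectrum_index\t\t\n"
    "1\t\tpeak_index\t\t\n"
    "2\tMS:1001225\tproduct ion m/z\tMS:1000040\tm/z\n"
    "3\tMS:1001226\tproduct ion intensity\tMS:1000132\tpercent of base peak\n"
)
PEAKS_TSV = (
    "spectrum_index\tpeak_index\tproduct ion m/z\tproduct ion intensity\n"
    "1\t1\t138.0661469\t190.7953186\n"
    "1\t2\t219.1087494\t29.48472786\n"
    "1\t3\t305.0644836\t1067.439087\n"
)


def read_compact_peaks(columns_tsv, peaks_tsv):
    """Read the compact peak table, using the column table as its schema."""
    columns = sorted(
        csv.DictReader(io.StringIO(columns_tsv), delimiter="\t"),
        key=lambda c: int(c["column_index"]),
    )
    rows = csv.reader(io.StringIO(peaks_tsv), delimiter="\t")
    header = next(rows)
    expected = [c["name"] for c in columns]
    if header != expected:
        raise ValueError(f"header {header} does not match column table {expected}")

    spectrum_col = header.index("spectrum_index")
    spectra = {}  # spectrum_index -> list of {column name: float value}
    for row in rows:
        # Assumption: columns that carry an accession are numeric measurements.
        values = {
            c["name"]: float(row[int(c["column_index"])])
            for c in columns
            if c["accession"]
        }
        spectra.setdefault(int(row[spectrum_col]), []).append(values)
    return columns, spectra


columns, spectra = read_compact_peaks(COLUMNS_TSV, PEAKS_TSV)
print(columns[2]["unit_name"], spectra[1][0])
```

The column table acts as a schema: a parser can check the header against it and pick up the units without hard-coding column names.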

To summarize:

The peak level would get very verbose if we followed the same fields as the JSON specification.
A solution would be a compact form, together with a small table specifying the columns.

Questions:

Does everyone agree with having a compact form for the peak level?
We could completely deviate from the JSON spec and have a compact form for all levels. This would make the file more compact in general and more database-like. A drawback is that, for all levels, the number of columns can vary between libraries, which would make parsing the metadata somewhat harder. We would also have to deal with the value/value_accession duality, which we do not have at the peak level, as all values are just numbers. What does everyone think about the "full-on compact form" idea?
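To make the drawback a bit more concrete, here is a hedged Python sketch of what a "full-on compact form" could look like for the spectrum level. It is only one possible approach: the grouping columns (ion_group, cv_param_group) are left out for brevity, the " (accession)" column suffix is something I made up to carry value_accession, and the second spectrum is invented.

```python
import csv
import io

# Simplified spectrum-level rows; spectrum 2 ("peptide2") is invented.
SPECTRUM_TSV = (
    "spectrum_index\taccession\tname\tvalue_accession\tvalue\n"
    "1\tMS:xxxxxxx\ttitle\t\tpeptide1\n"
    "1\tMS:xxxxxxx\tis decoy spectrum\t\tFALSE\n"
    "1\tUO:0000000\tunit\tUO:0000031\tminute\n"
    "2\tMS:xxxxxxx\ttitle\t\tpeptide2\n"
)


def spectrum_rows_to_wide(long_tsv):
    """Pivot attribute rows into one record per spectrum."""
    columns = []  # union of all column names, in order of first appearance
    records = {}  # spectrum_index -> {column name: value}
    for row in csv.DictReader(io.StringIO(long_tsv), delimiter="\t"):
        record = records.setdefault(int(row["spectrum_index"]), {})
        fields = {row["name"]: row["value"]}
        if row["value_accession"]:
            # The value/value_accession duality costs an extra column.
            fields[row["name"] + " (accession)"] = row["value_accession"]
        for column, value in fields.items():
            if column not in columns:
                columns.append(column)
            record[column] = value
    return columns, records


columns, records = spectrum_rows_to_wide(SPECTRUM_TSV)
print(columns)
print(records[2])
```

The set of columns depends on the data, and spectrum 2 fills only one of them, which is the parsing difficulty mentioned above.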


bittremieux · 2019-11-12T18:20:57Z

I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).

Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and more efficient to slice and dice the data.
With these big, repository-scale spectral libraries I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.
Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.
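A minimal sketch of the "two arrays per spectrum" layout, using only Python's standard `array` module. This is not HDF5 itself; with an HDF5 library such as h5py, each array would presumably become one dataset. The peak values are taken from the compact table earlier in this thread, and the function name `to_arrays` is just for illustration.

```python
from array import array

# (spectrum_index, m/z, intensity), values from the compact example table.
PEAKS = [
    (1, 138.0661469, 190.7953186),
    (1, 219.1087494, 29.48472786),
    (1, 305.0644836, 1067.439087),
]


def to_arrays(peaks):
    """Group peaks into two packed double arrays per spectrum."""
    spectra = {}  # spectrum_index -> (m/z array, intensity array)
    for spectrum_index, mz, intensity in peaks:
        if spectrum_index not in spectra:
            spectra[spectrum_index] = (array("d"), array("d"))
        mz_array, intensity_array = spectra[spectrum_index]
        mz_array.append(mz)
        intensity_array.append(intensity)
    return spectra


spectra = to_arrays(PEAKS)

# Retrieving a spectrum touches two arrays, regardless of the peak count.
mz_array, intensity_array = spectra[1]
print(len(mz_array), len(mz_array.tobytes()))
```

Each array is a contiguous block of 8-byte doubles, so a spectrum with k peaks is read as two blocks rather than 2 * k separate values.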

Related to your final question, the main consideration here is what the goal of this version of the format is. If it is readability, then CSV is obviously superior. But I don't care about readability here: as you mention in #11, there's already the text version (and to a lesser extent the JSON version). Instead, when going for HDF5, performance should be the main goal. That means using HDF5 the way it was intended and storing values in arrays instead of storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.

RalfG · 2019-11-14T16:13:01Z

Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and the metadata for each group would definitely be very useful for the spectral library format.

This means that an optimal HDF representation would look pretty different from this general tabular format. Since we want to make it possible with the specification to have multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as csv/tsv) and move the discussion on the HDF format to a new issue (#13).

RalfG mentioned this issue Nov 14, 2019

HDF representation #13

Open

RalfG closed this as completed Jun 6, 2020
