
HDF representation #13

Open
RalfG opened this issue Nov 14, 2019 · 15 comments

@RalfG
Collaborator

RalfG commented Nov 14, 2019

It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.

As a reference, the current TXT format looks like this:

MS:1008014|spectrum index=500
MS:1008013|spectrum name=AAAVDPTPAAPAR/2_0
MS:1008010|molecular mass=1208.6510
MS:1008015|spectrum aggregation type=MS:1008017|consensus spectrum
[1]MS:1008030|number of enzymatic termini=2
[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin
MS:1001471|peptide modification details=0
...

And JSON (for one metadata item) would take the following shape:

    {
      "accession": "MS:1001045",
      "cv_param_group": "1",
      "name": "cleavage agent name",
      "value": "Trypsin",
      "value_accession": "MS:1001251"
    },

Discussion spun off from issue #12:

@bittremieux:

I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).

Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and more efficient to slice and dice the data.
With these big, repository-scale spectral libraries I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.
Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.

Related to your final question, the main consideration here is what the goal of this version of the format is. If it's readability, then CSV is obviously superior. But I don't care about readability here: as you mention in #11, there's already the text version (and, to a lesser extent, the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended and storing values in arrays instead of storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.

@RalfG:

Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and the metadata for each group would definitely be very useful for the spectral library format. This means that an optimal HDF representation would look pretty different from this general tabular format. Since we want to make it possible with the specification to have multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as csv/tsv) and move the discussion on the HDF format to a new issue.

@RalfG
Collaborator Author

RalfG commented Nov 14, 2019

@bittremieux, would you then propose a group system that looks like this:

.
└── library (group)
    ├── spectrum_001 (group)
    ├── spectrum_002 (group)
    ├── ...
    └── spectrum_n (group)
        ├── intensity (array)
        └── mz (array)

with metadata embedded as attributes for each relevant group?

That seems very structured and efficient for reading a given spectrum, but it would mean that one library has a lot of keys (up to a million or more), and you mentioned performance degradation in that case.
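
For concreteness, a minimal h5py sketch of what this layout could look like (group names, attribute keys, and peak values here are just placeholders, not a proposed specification):

    import h5py
    import numpy as np

    # Sketch only: group names, attribute keys, and peak values are placeholders.
    with h5py.File("library.hdf5", "w") as f:
        library = f.create_group("library")
        spectrum = library.create_group("spectrum_001")

        # Metadata embedded as HDF5 attributes on the spectrum group.
        spectrum.attrs["MS:1008013|spectrum name"] = "AAAVDPTPAAPAR/2_0"
        spectrum.attrs["MS:1008010|molecular mass"] = 1208.6510

        # Peaks stored as two compact, compressed arrays per spectrum.
        spectrum.create_dataset("mz", data=np.array([147.113, 175.119, 246.156]),
                                compression="gzip")
        spectrum.create_dataset("intensity", data=np.array([1200.0, 350.0, 80.0]),
                                compression="gzip")

    # Reading one spectrum back is then just two dataset lookups.
    with h5py.File("library.hdf5", "r") as f:
        mz = f["library/spectrum_001/mz"][:]
        intensity = f["library/spectrum_001/intensity"][:]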

@bittremieux
Contributor

That seems like a pretty logical layout. And yes, for large-scale spectral libraries the number of keys will be in the millions, but still significantly lower than in the previously suggested layout where each peak would be stored separately.

This is a much better trade-off between ease of querying and performance, and I can't immediately come up with a better alternative. Also, by using groups the performance hit should be relatively modest, even for millions of keys. HDF5 was developed as a big data format after all.

Here are some relevant threads on HDF5 performance:

@sneumann
Member

sneumann commented Nov 15, 2019

Hi, the above structure library (group) ... looks a lot like mzML to me. In fact, the OpenMS people already have a tool to search spectra in one mzML file against spectra in another mzML file for exactly that purpose. We used their prototype to export MassBank to mzML (MassBank/MassBank-data#31). With mz5 you get mzML encoded as HDF5 as well, and there are already tools to convert back and forth between mzML and mz5. What's currently missing is the set of userParam and cvParam entries to keep the spectral-library-specific attributes. Yours, Steffen

@RalfG
Collaborator Author

RalfG commented Jan 21, 2020

Thanks for the input!
For future reference, here's the mz5 article: https://doi.org/10.1074/mcp.O111.011379

@tkschmidt

tkschmidt commented Jan 21, 2020

mz5 is a nice format in between HDF5 and mzML. If you want input from @mwilhelm as well, I can ping him.
Just a meta-comment/question from my side:
I'm a big fan of binary formats and enjoy HDF5 in R and Python, but the moment you develop tools in any other language it's hell. Even compiling it for your own C stuff is an adventure.

Is there really no new binary format for scientific data? Another interesting approach was BiblioSpec, which is just SQLite. Not sure how it scales to many millions of spectra (but I had SQLite spectral libraries > 100 GB that were fine and fast, years ago). Their trick is to not have a peak table but rather to store the compressed version of an array as a binary blob: https://github.com/ProteoWizard/pwiz/blob/dbd6221d39dc43b3b8a595bd9bd63661fee1daa6/pwiz_tools/BiblioSpec/src/LibToSqlite3.cpp#L262-L286

Original publication: https://www.ncbi.nlm.nih.gov/pubmed/18428681
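
For illustration, that blob trick boils down to something like the following Python sketch (not BiblioSpec's actual schema or code, just the general idea; table and column names are made up):

    import sqlite3
    import zlib

    import numpy as np

    # Sketch of the "compressed peak blob" idea; table and column names are made up.
    conn = sqlite3.connect("library.sqlite")
    conn.execute("CREATE TABLE IF NOT EXISTS spectra ("
                 "id INTEGER PRIMARY KEY, precursor_mz REAL, "
                 "peak_mz BLOB, peak_intensity BLOB)")

    mz = np.array([147.113, 175.119, 246.156], dtype=np.float64)
    intensity = np.array([1200.0, 350.0, 80.0], dtype=np.float32)

    # Compress each peak array and store it as one blob instead of one row per peak.
    conn.execute("INSERT INTO spectra (precursor_mz, peak_mz, peak_intensity) VALUES (?, ?, ?)",
                 (605.33, zlib.compress(mz.tobytes()), zlib.compress(intensity.tobytes())))
    conn.commit()

    # Reading a spectrum back is one row fetch plus decompression.
    row = conn.execute("SELECT peak_mz, peak_intensity FROM spectra WHERE id = 1").fetchone()
    mz = np.frombuffer(zlib.decompress(row[0]), dtype=np.float64)
    intensity = np.frombuffer(zlib.decompress(row[1]), dtype=np.float32)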

@mobiusklein
Collaborator

There are loads of binary formats designed for one purpose or another. They tend to be tuned to a specific use case, written for one language (usually C with a binding for another scripting language), and probably not capable of all of the features that HDF5 or SQLite3 have.
https://github.com/rainwoodman/bigfile
https://github.com/Blosc/bcolz

Both HDF5 and SQLite3 can store array data, though HDF5 may have an edge on certain features being built-in, and even more plugin support. My experience has been that HDF5 also makes those arrays infinitely more transparent to the caller with less work to manipulate them before fully loading them into memory. I don't think the design goal is to create large arrays to be sliced and indexed on disk though.

Does HDF5 support queries over attributes? Suppose you've got a repository-scale library with a mix of ETD and HCD spectra (or positive-mode and negative-mode spectra) over a range of different activation energies, and you want to find all the library entries that are of only one of those types AND have a precursor m/z within some interval AND came from a mass analyzer with a mass error tolerance below 10 ppm AND have an activation energy close to what you used.

When first considering the problem of storing a variable number of descriptive properties in SQLite3, I thought of what was done in mzDB (10.1074/mcp.O114.039115), which was to store serialized XML "param_tree"s, which queries cannot parse or index over. SQLite3 supports JSON object-valued columns and partial indices over them with the JSON1 extension, which already ships with the SQLite3 source amalgamation today. This would let us keep the simple "param_tree" storage and the query-ability without needing a huge entity-attribute-value table. Note that JSON1 isn't ubiquitous or enabled by default though, so you might still need to statically link with or ship an up-to-date SQLite3 shared library. Alternatively, entity-attribute-value tables aren't evil, just uncomfortable to use and slow to completely enumerate.
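
As a rough sketch of that idea (assuming a SQLite3 build with JSON1 available; the table layout and CV terms here are only examples):

    import json
    import sqlite3

    # Sketch only: assumes the SQLite3 library was compiled with the JSON1 extension.
    conn = sqlite3.connect("library.sqlite")
    conn.execute("CREATE TABLE IF NOT EXISTS spectrum (id INTEGER PRIMARY KEY, params TEXT)")

    # Store the per-spectrum cvParams as a small JSON "param_tree" instead of an EAV table.
    params = {"MS:1001045": "MS:1001251|Trypsin", "MS:1008030": 2}
    conn.execute("INSERT INTO spectrum (params) VALUES (?)", (json.dumps(params),))
    conn.commit()

    # Query directly over the JSON column; a partial index can make this fast.
    rows = conn.execute(
        """SELECT id FROM spectrum WHERE json_extract(params, '$."MS:1001045"') IS NOT NULL"""
    ).fetchall()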

@tkschmidt

I also want to add that entity-attribute-value tables are a pain in the butt, especially if you use them in a standard (SQL) setup.

@bittremieux
Contributor

AFAIK you can't query entries in an HDF5 file based on the values of their attributes. You can use attributes to store metadata, but to filter on a specific value you'd need to walk the entire tree and filter manually. Not great.

Also, +1 that entity-attribute-value tables in SQL are pretty annoying.
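
To illustrate, with h5py that manual filtering would look roughly like this (assuming the group-per-spectrum layout sketched above; the attribute key is only an example):

    import h5py

    # Sketch: collect spectra that carry a given attribute by walking every group.
    # There is no attribute index, so this visits every entry in the library.
    etd_spectra = []

    def collect_etd(name, obj):
        if isinstance(obj, h5py.Group) and "MS:1000598" in obj.attrs:
            etd_spectra.append(name)

    with h5py.File("library.hdf5", "r") as f:
        f["library"].visititems(collect_etd)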

@edeutsch
Contributor

edeutsch commented Feb 5, 2020

This is always a problem, but the issue is that there are potentially hundreds of metadata attributes that we want to be able to capture. So we need a system that can capture and store any attribute. BUT, it is usually the case that there is only a relatively small subset of attributes that one would want to filter or search on. So the data model and archival/transmission format needs to be able to store anything. The optimized active-use format should probably also be able to store anything, but then optimize search and filter on a subset of terms. Some RDBMSs have SPARSE matrix support, and maybe other systems do, too. Otherwise we could implement a sparse matrix system in a custom binary format, or just reuse SQLite or HDF5, where the most common attributes are indexed as columns and the rest are just stored. But all of that is very usage-specific: metabolomics will want different columns than proteomics. So I think the best goal is to make a nice format that can store anything with a controlled vocabulary, and then everyone can try to implement their own idea of a fast storage mechanism using the technology they like, which history has shown we can't all agree on.

@mobiusklein
Collaborator

With SQLite3's JSON extension enabled, you can create partial indices over JSON-path expressions.

CREATE INDEX idx_spectrum_is_etd ON [spectrum_table](id)
    WHERE json_extract(params, '$."MS:1000598"') IS NOT NULL;

Then searching for ETD spectra will incur one index traversal followed by a fast series of reads from the spectrum table, whatever it is called. This does not require that every row in the spectrum table store a value in params under the key "MS:1000598", but it does essentially replicate whether or not that value is present for all spectra in the index.

Of course, we can solve this problem in HDF5 by simply building an index array ourselves and storing it in the library too. This complicates the library-writing code, because now we have to manage building and updating all the indices whenever we add a new entry or decide we want to add/remove an index. It also complicates reading, because the library-reading code has to manipulate these index arrays manually.
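
As a sketch of what such a hand-rolled index could look like with h5py (dataset names and values are made up):

    import h5py
    import numpy as np

    # Sketch of a manually maintained index: precursor m/z values sorted once,
    # alongside the name of the spectrum group each value belongs to.
    with h5py.File("library.hdf5", "a") as f:
        precursor_mz = np.array([502.27, 604.83, 701.35])
        spectrum_keys = np.array([b"spectrum_017", b"spectrum_001", b"spectrum_093"])
        order = np.argsort(precursor_mz)
        index = f.require_group("index")
        index.create_dataset("precursor_mz", data=precursor_mz[order])
        index.create_dataset("spectrum_key", data=spectrum_keys[order])

    # Querying a precursor m/z window becomes a binary search over the index arrays,
    # but the writer has to rebuild these datasets whenever the library changes.
    with h5py.File("library.hdf5", "r") as f:
        mz_index = f["index/precursor_mz"][:]
        keys = f["index/spectrum_key"][:]
        lo, hi = np.searchsorted(mz_index, [604.0, 605.5])
        hits = [key.decode() for key in keys[lo:hi]]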

@RalfG
Collaborator Author

RalfG commented Feb 5, 2020

My idea for HDF5 was indeed to have an additional index array with the most common attributes as columns. We could define a small set of required attributes (e.g. precursor m/z, charge...) and make it possible for the user to expand that list. Writing does indeed get a little more complex, but it should still be manageable.

SQLite with the JSON extension also looks nice, but that could easily be another representation of the format.

@mobiusklein
Collaborator

I think I came into the middle of this without knowing the data model, and just engaged in pedantry over data format engineering. Is there something I can read that goes over the data model formally? I saw some Google Docs links, but it wasn't clear whether they were current/authoritative.

Wouldn't reading be trickier too, since there would need to be a way to introspect read requests to decide which indices to use? This would be tightly coupled to the reading interface, of course. The more general/composable the interface is, the closer the implementation needs to get to implementing a query optimizer. It seems tempting to say indices are an implementation detail, but that makes library exchange more difficult unless you completely rebuild the file on receipt. On the other hand, that might be necessary with a proliferation of back-ends in the first place, requiring that libraries be exchanged in one of the less "optimized" formats before being tuned for searching by the application consuming them. Is this the intent?

@sneumann
Member

sneumann commented Feb 6, 2020

Over the last few weeks I have heard several comments offline from people asking about the current data model, not knowing its current state. Can we make a push to collect that first? I'd expect that this would also form the basis of reporting & discussions at PSI-MS in San Diego next month. Yours, Steffen

RalfG changed the title from "HDF representation of spectral libraries within the new specification" to "HDF representation" on Jun 6, 2020
@edeutsch
Contributor

@RalfG any interest in giving this a try since most things are settled now?

@hechth
Member

hechth commented Oct 21, 2023

Are there any other binary implementations/representations planned?
