Replies: 9 comments
-
Regarding the above question, my feeling is that performance will become important, while exchange between different platforms or human readability of the whole database will not. Hence, SQL seems preferable. Before going into more depth, I'd like to make sure I understand the reason for using a database. Another question is whether it makes sense to use a database for every pulse. I could imagine that the more disposable ones, for example those used only for a single experiment, are more conveniently archived together with the data instead of cluttering a pulse database. Of course, one could also maintain several databases or sections, but is that convenient? Hence, would it make sense to provide both files and a database as persistence mechanisms?
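For concreteness, here is a minimal sketch of what offering both persistence mechanisms behind one interface could look like. The class and method names are invented for this illustration and are not taken from any existing code:

import json
import os
import sqlite3


class FilePulseStorage:
    """Keeps each pulse as a JSON file, e.g. next to the experiment data."""

    def __init__(self, folder):
        self.folder = folder

    def save(self, name, pulse_dict):
        with open(os.path.join(self.folder, name + '.json'), 'w') as f:
            json.dump(pulse_dict, f)

    def load(self, name):
        with open(os.path.join(self.folder, name + '.json')) as f:
            return json.load(f)


class SqlitePulseStorage:
    """Keeps pulses as JSON blobs in one table of a shared database file."""

    def __init__(self, path):
        self.con = sqlite3.connect(path)
        self.con.execute('CREATE TABLE IF NOT EXISTS pulses '
                         '(name TEXT PRIMARY KEY, body TEXT)')

    def save(self, name, pulse_dict):
        with self.con:
            self.con.execute('INSERT OR REPLACE INTO pulses VALUES (?, ?)',
                             (name, json.dumps(pulse_dict)))

    def load(self, name):
        row = self.con.execute('SELECT body FROM pulses WHERE name = ?',
                               (name,)).fetchone()
        return json.loads(row[0])

Disposable pulses for a single experiment could then go through a FilePulseStorage pointing into that experiment's folder, while recurring pulses live in the shared backend; the calling code would not need to care which one it talks to.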
-
For us, code (and possibly class definitions) changes on a daily basis. How does this work for saving? In general, I think the problem should be split:
Addressing them in turn:
-
Persistent Storing of Pulses
On Database Systems
Addressing Pulses
-
@lumip Lukas, could it be that something is missing at the end of the first section? (This contradicts ...)

Regarding performance: Yes, pulses have to be reread from disk if reused. However, when they are constructed hierarchically, one sometimes ends up reading the same pulse quite often, which is slow if it is only stored on disk. At least this seems to be a problem with the current pulse dictionaries. Caching them in memory is much faster. I guess the issue might simply be that much more than the needed pulse is read each time. Loading the pulse definition from file into memory and saving it only for syncing is one solution that works reasonably well, but it is somewhat inconvenient and seems crude. I doubt that overall size will be a limiting factor for recyclable material, but it may be for archival storage.

Would the extensive use of references, e.g. in recursive pulse definitions, be another reason to use a database system? Do we need more than pulse indices or names that could translate into filenames? Multi-user access to pulses could be a nice feature in the future, but so far it has not been critical. Copying and weeding out the pulse repository from another team does not seem too bad.

Regarding "Addressing pulses": I think your analysis is correct and may solve the problem at a very high level.
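As a purely illustrative sketch of the in-memory caching mentioned above (the function name and file layout are assumptions, not existing code), repeatedly resolving the same hierarchically referenced pulse could be kept cheap like this:

import json
from functools import lru_cache


@lru_cache(maxsize=None)  # Every pulse file is read from disk at most once per session.
def load_pulse_definition(filename):
    """Return the parsed pulse definition; callers must not mutate the result."""
    with open(filename) as f:
        return json.load(f)


# When the files on disk change (e.g. after syncing), drop the cached copies:
# load_pulse_definition.cache_clear()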
-
could it be that something is missing at the end of the first section?

Regarding performance: Database systems should address this. The described problem is mostly due to a bad implementation of loading pulses and must be dealt with in our implementation. However, database systems generally do some caching as part of their implementation, so repeated accesses to the same data within a relatively short period of time should be a bit faster than reading a file over and over again.

Would the extensive use of references, e.g. in recursive pulse definitions, be another reason to use a database system? Do we need more than pulse indices or names that could translate into filenames? Multi-user access to pulses could be a nice feature in the future, but so far has not been critical. Copying and weeding out the pulse repository from another team does not seem too bad. Regarding "Addressing pulses": I think your analysis is correct and may solve the problem at a very high level.
-
Despite the general preference for an immediate database implementation, I would like to present my take on this. It may cover a few more issues than Persistence only, but the implementation sketch is only an example.

A few additional remarks about databases in general

Pro

Contra

Pulses as JSON files

My preference would be to stick to files in a file system for now, or a hybrid solution in the long run. I think databases become interesting for managing more than 100,000 pulses or so, since file system access per folder becomes slow.

Implementation sketch
{
    "apiVersion": 1.0,                 // Defines with which API version the pulse was created (which features are available).
    "name": "simple",                  // The JSON file name is generated from this.
    "uid": "some_unique_hash_string",  // Unique hash string for caching.
    "channel": [1, 3],                 // This pulse is multi-channel.
    "parameter":                       // Evaluated in the order of definition. Can be defined without a value to serve as a template.
    [
        {"name": "par", "start": 0, "step": 10, "end": 0.2, "type": "linear"}
    ],
    "data":                            // Array; can contain value-type or ref-type pulses and form a tree-like structure.
    [
        {"type": "ref", "path": "core/initialise/init.pulse"},
        {"type": "ramp", "start": 0, "stop": "par", "duration": "par+2"}
    ]
}
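To show how the "ref" entries could be consumed, here is a hedged loader sketch. The function is hypothetical, assumes the field names from the example above, and assumes the stored files are plain JSON without the // comments:

import json
import os


def load_pulse(path, root='.'):
    """Load a pulse file and recursively inline every ref entry in its data array."""
    with open(os.path.join(root, path)) as f:
        pulse = json.load(f)
    resolved = []
    for entry in pulse.get('data', []):
        if entry.get('type') == 'ref':
            resolved.append(load_pulse(entry['path'], root))  # follow the reference
        else:
            resolved.append(entry)
    pulse['data'] = resolved
    return pulse

Combined with the uid field, the same function could be memoized so that a pulse referenced several times is only read from disk once.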
DiscussionPro
Contra
|
-
I did a little benchmark to compare the performance of a SQL-type database (sqlite3) and just using the filesystem (JSON files) for saving a pulse tree structure. I considered the worst-case scenario, where one pulse references many, many other pulses.

Benchmark implementation

The implementation is simple. Since I could linearize a pulse tree structure in principle, I just consider a linked list in the first place. I create a shuffled list of unique integer ids in which each element references the next one. To account for the case where a pulse is referenced several times, a reference count is attached to each element and the read of that element is repeated accordingly. The read operation can work either with a sqlite database, JSON files in a folder, or a redis server.

Results

Results on a 2.3 GHz Intel Core i7 with an APPLE SSD SM256E and Python 3.5.0b2. Python 2.7.8 is similar but a few percent slower. Every 10th reference is read 10 times, to see how caching improves the situation.
Discussion

It seems SQL is not a good choice here, or my SQL skills are just too low. Redis, on the other hand, is quite fast (the data is stored in RAM and dumped to disk periodically, so it cannot get too large). It may seem strange that JSON with caching is slower than without, but the small files are probably cached on a lower level anyway and lru_cache just adds more overhead here. Conclusion: plain files are basically the fastest/easiest way.

Code

import random
import itertools
import json
import os
import sqlite3
import time
# from functools import lru_cache # Uncomment for caching. Works only in python 3.
# import redis # Uncomment for redis.
def generate_id_list(id_count):
"""Generate shuffled list of unique ids."""
id_list = [_id for _id in range(id_count)]
random.shuffle(id_list)
return id_list
def traverse(id_list, ref_count_list, func, **kwargs):
"""Traverse id_list and call func on the elements.
Add a ref_count that cycles through ref_count_list and a reference
that points to the next element.
"""
ref_iter = itertools.cycle(ref_count_list)
id_iter, next_id_iter = itertools.tee(id_list)
next(next_id_iter, None) # Point iterator to next element.
# Will go over len(id_list)-1 elements.
for _id, next_id, ref_count in zip(id_iter, next_id_iter, ref_iter):
func(_id, next_id, ref_count, **kwargs)
# Call func on the last element without reference.
func(id_list[-1], None, next(ref_iter), **kwargs)
def populate(root_id, func, **kwargs):
"""Simulate list construction using func output.
Simulate multiple queries to reference by looping over ref_count.
"""
ref_id, ref_count = func(root_id, **kwargs)
while ref_id is not None:
for _ in range(ref_count-1):
func(ref_id, **kwargs)
ref_id, ref_count = func(ref_id, **kwargs)
def prepare_sql(path):
"""Setup a simple one table id, ref_id, ref_count database."""
con = sqlite3.connect(path)
with con:
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS Ref')
cur.execute(('CREATE TABLE Ref(id INTEGER, '
'ref_id INTEGER, ref_count INTEGER)'))
return con
def save_sql(_id, ref_id, ref_count, con=None, **kwargs):
"""Write out to the database."""
with con:
cur = con.cursor()
cur.execute('INSERT INTO Ref VALUES(?,?,?)',
(_id, ref_id, ref_count))
#@lru_cache(maxsize=16) # Uncomment for caching.
def load_sql(_id, con=None, **kwargs):
"""Return ref_id, ref_count for _id."""
with con:
cur = con.cursor()
cur.execute(('SELECT ref_id, ref_count FROM Ref '
'WHERE id=%d') %_id)
return cur.fetchone()
def save_json(_id, ref_id, ref_count, path='', **kwargs):
"""Write out to a json file located in path."""
with open('%s%d.json' % (path, _id), 'w') as outfile:
data = {'ref_id': ref_id, 'ref_count': ref_count}
json.dump(data, outfile)
#@lru_cache(maxsize=16) # Uncomment for caching.
def load_json(_id, path='', **kwargs):
"""Return ref_id, ref_count for _id."""
with open('%s%d.json' % (path, _id), 'r') as infile:
data = json.load(infile)
return data['ref_id'], data['ref_count']
#def prepare_redis(host='localhost', port=6379):
# r = redis.StrictRedis(host=host, port=port, db=0)
# return r
#def save_redis(_id, ref_id, ref_count, r_server=None, **kwargs):
# r_server.rpush(_id, ref_id)
# r_server.rpush(_id, ref_count)
#def load_redis(_id, r_server=None, **kwargs):
# data = r_server.lrange(_id, 0, -1)
# if data[0] == b'None':
# return None, int(data[1])
# return int(data[0]), int(data[1])
def run_benchmark(id_count, ref_count_list,
db_path='pls.db', json_path='pls/'):
    # Make sure the folder for the JSON files exists before writing into it.
    if not os.path.isdir(json_path):
        os.makedirs(json_path)
    con = prepare_sql(db_path)
# r = prepare_redis()
id_list = generate_id_list(id_count)
save_func = [save_sql, save_json]
save_timing = []
for func in save_func:
start = time.time()
traverse(id_list, ref_count_list, func, con=con, path=json_path)
stop = time.time()
save_timing.append(stop-start)
load_func = [load_sql, load_json]
load_timing = []
for func in load_func:
start = time.time()
populate(id_list[0], func, con=con, path=json_path)
stop = time.time()
load_timing.append(stop-start)
# r.flushdb()
return save_timing, load_timing
if __name__ == '__main__':
ref_cnt_list = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
for cnt in [10, 100, 1000, 10000, 100000]:
        print(run_benchmark(cnt, ref_cnt_list))

WARNING: This will run for about 30 min.
-
I don't think this test is entirely accurate because of the following.
Regarding the caching: the size of the cache is limited to 16 entries. That is a small number compared to the number of pulses in the last few tests, which means that we mostly get cache misses and effectively only add overhead to the function calls. Hence the increase in runtime with caching enabled. This is not intended as an argument towards SQLite - as far as I remember we agreed on using JSON files during the last meeting - but I think this benchmark is not accurate in its current form. We could rewrite it, but it might yield the same result and I don't know whether this is high priority. If so, please create a ticket for us.
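For illustration (reusing the load_json logic from the benchmark above, renamed here), the cache would have to hold roughly as many entries as there are pulses before it pays off; with functools.lru_cache that is a one-line change, and cache_info() reveals whether reads actually hit:

import json
from functools import lru_cache


@lru_cache(maxsize=None)  # Unbounded; with maxsize=16 nearly every lookup misses for large id counts.
def load_json_cached(_id, path=''):
    with open('%s%d.json' % (path, _id)) as infile:
        data = json.load(infile)
    return data['ref_id'], data['ref_count']


# After a benchmark run:
# load_json_cached.cache_info()  # -> CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)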
-
Status quo
In special-measure, the pulses are all stored in one struct, ordered and accessible by their IDs. There is one global database struct in which the default pulses are stored. For a new experiment, the user works with a local copy in which unneeded IDs are overwritten. When the experiment is finished, the user saves the whole database file and its report into a folder.
Desired Improvements
Solution proposals
Relational database (SQL)
The main database could be rewritten into an SQL database. These databases only store base types and references, so additional steps are necessary to store arbitrary objects.
The pulse will be divided into its subcomponents until only base types remain. These subcomponents will then be referenced.
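As a rough illustration of this decomposition (all table and column names are invented for the sketch), each pulse could become one row plus rows for its subcomponents, which either hold base-type values or reference another pulse:

import sqlite3

con = sqlite3.connect('pulses.db')
with con:
    con.execute('CREATE TABLE IF NOT EXISTS pulse ('
                'id INTEGER PRIMARY KEY, name TEXT)')
    con.execute('CREATE TABLE IF NOT EXISTS component ('
                'pulse_id INTEGER REFERENCES pulse(id), '      # parent pulse
                'position INTEGER, '                           # order within the parent
                'kind TEXT, '                                  # e.g. "ramp" or "ref"
                'value REAL, '                                 # base-type payload, if any
                'ref_pulse_id INTEGER REFERENCES pulse(id))')  # set when kind is "ref"

Reassembling a pulse then means recursively following ref_pulse_id, which is exactly where the read-performance concerns from the earlier comments come in.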
Prewritten wrapper around the SQL database with different approaches:
Folder structure and XML representations
The main database can be represented in the XML format, a widely used standard for data exchange. For each experiment, we generate a subfolder in which the relevant data (pulse, measurements, documentation) is stored. The pulse file may reference the main database and stores the composition of subpulses.
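A small sketch of what such an experiment-local pulse file could look like, written with the standard library's xml.etree.ElementTree; the element and attribute names are made up for this example:

import xml.etree.ElementTree as ET

# Compose a pulse from a reference into the main database plus a locally defined ramp.
pulse = ET.Element('pulse', name='readout')
ET.SubElement(pulse, 'ref', database_id='137')  # subpulse stored in the main database
ET.SubElement(pulse, 'ramp', start='0', stop='0.2', duration='2')

# Store it inside the experiment's subfolder (assumed to exist) next to measurements and documentation.
ET.ElementTree(pulse).write('experiment_42/readout.pulse.xml')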
Comparison of the proposals
So, in the end, we have a trade-off between a performant database and a simple one.
Which setup do you think will fulfill your needs?