Replies: 9 comments
-
Regarding the above question, my feeling is that performance will become important, while exchange between different platforms or human readability of the whole database will not. Hence, SQL seems preferable. Before going into more depth, I'd like to make sure I understand the reason for using a database. Another question is whether it makes sense to use a database for every pulse. I could imagine that the more disposable ones, for example those used only for a single experiment, are more conveniently archived together with the data instead of cluttering a pulse database. Of course, one could also maintain several databases or sections, but is that convenient? Hence, would it make sense to provide both files and a database as persistence mechanisms?
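For concreteness, here is a minimal sketch of what offering both persistence mechanisms behind one interface could look like. The class and method names are invented for this illustration and are not taken from any existing code:

import json
import os
import sqlite3


class FilePulseStorage:
    """Keeps each pulse as a JSON file, e.g. next to the experiment data."""

    def __init__(self, folder):
        self.folder = folder

    def save(self, name, pulse_dict):
        with open(os.path.join(self.folder, name + '.json'), 'w') as f:
            json.dump(pulse_dict, f)

    def load(self, name):
        with open(os.path.join(self.folder, name + '.json')) as f:
            return json.load(f)


class SqlitePulseStorage:
    """Keeps pulses as JSON blobs in one table of a shared database file."""

    def __init__(self, path):
        self.con = sqlite3.connect(path)
        self.con.execute('CREATE TABLE IF NOT EXISTS pulses '
                         '(name TEXT PRIMARY KEY, body TEXT)')

    def save(self, name, pulse_dict):
        with self.con:
            self.con.execute('INSERT OR REPLACE INTO pulses VALUES (?, ?)',
                             (name, json.dumps(pulse_dict)))

    def load(self, name):
        row = self.con.execute('SELECT body FROM pulses WHERE name = ?',
                               (name,)).fetchone()
        return json.loads(row[0])

Disposable pulses for a single experiment could then go through a FilePulseStorage pointing into that experiment's folder, while recurring pulses live in the shared backend; the calling code would not need to care which one it talks to.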
-
For us, code (and possibly class definitions) changes on a daily basis. How does this work for saving? In general, I think the problem should be split:
Addressing them in turn:
-
Persistent Storing of Pulses
On Database Systems
Addressing Pulses
-
@lumip Lukas, could it be that something is missing at the end of the first section? (This contradicts ...)

Regarding performance: Yes, pulses have to be reread from disk if reused. However, when they are constructed hierarchically, one sometimes ends up reading the same pulse quite often, which is slow if it is only stored on disk. At least this seems to be a problem with the current pulse dictionaries. Caching them in memory is much faster. I guess the issue might simply be that much more than the needed pulse is read each time. Loading the pulse definition from file into memory and saving it only for syncing is one solution that works reasonably well, but it is somewhat inconvenient and seems crude. I doubt that overall size will be a limiting factor for recyclable material, but it may be for archival storage.

Would the extensive use of references, e.g. in recursive pulse definitions, be another reason to use a database system? Do we need more than pulse indices or names that could translate into filenames? Multi-user access to pulses could be a nice feature in the future, but so far it has not been critical. Copying and weeding out the pulse repository from another team does not seem too bad.

Regarding "Addressing pulses": I think your analysis is correct and may solve the problem at a very high level.
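As a purely illustrative sketch of the in-memory caching mentioned above (the function name and file layout are assumptions, not existing code), repeatedly resolving the same hierarchically referenced pulse could be kept cheap like this:

import json
from functools import lru_cache


@lru_cache(maxsize=None)  # Every pulse file is read from disk at most once per session.
def load_pulse_definition(filename):
    """Return the parsed pulse definition; callers must not mutate the result."""
    with open(filename) as f:
        return json.load(f)


# When the files on disk change (e.g. after syncing), drop the cached copies:
# load_pulse_definition.cache_clear()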
-
could it be that something is missing at the end of the first section?

Regarding performance: Database systems should address this. The described problem is mostly due to a bad implementation of loading pulses and must be dealt with in our implementation. However, database systems generally do some caching as part of their implementation, so repeated accesses to the same data within a relatively short period of time should be a bit faster than reading a file over and over again.

Would the extensive use of references, e.g. in recursive pulse definitions, be another reason to use a database system? Do we need more than pulse indices or names that could translate into filenames? Multi-user access to pulses could be a nice feature in the future, but so far has not been critical. Copying and weeding out the pulse repository from another team does not seem too bad. Regarding "Addressing pulses": I think your analysis is correct and may solve the problem at a very high level.
-
Despite the general preference for an immediate database implementation, I would like to present my take on this. It may cover a few more issues than Persistence only, but the implementation sketch is only an example.

A few additional remarks about databases in general

Pro

Contra

Pulses as JSON files

My preference would be to stick to files in a file system for now, or a hybrid solution in the long run. I think databases become interesting for managing more than 100,000 pulses or so, since file system access per folder becomes slow.

Implementation sketch
{
    "apiVersion": 1.0,                 // Defines with which API version the pulse was created (which features are available).
    "name": "simple",                  // The JSON file name is generated from this.
    "uid": "some_unique_hash_string",  // Unique hash string for caching.
    "channel": [1, 3],                 // This pulse is multi-channel.
    "parameter":                       // Evaluated in the order of definition. Can be defined without a value to serve as a template.
    [
        {"name": "par", "start": 0, "step": 10, "end": 0.2, "type": "linear"}
    ],
    "data":                            // Array; can contain value-type or ref-type pulses and form a tree-like structure.
    [
        {"type": "ref", "path": "core/initialise/init.pulse"},
        {"type": "ramp", "start": 0, "stop": "par", "duration": "par+2"}
    ]
}
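To show how the "ref" entries could be consumed, here is a hedged loader sketch. The function is hypothetical, assumes the field names from the example above, and assumes the stored files are plain JSON without the // comments:

import json
import os


def load_pulse(path, root='.'):
    """Load a pulse file and recursively inline every ref entry in its data array."""
    with open(os.path.join(root, path)) as f:
        pulse = json.load(f)
    resolved = []
    for entry in pulse.get('data', []):
        if entry.get('type') == 'ref':
            resolved.append(load_pulse(entry['path'], root))  # follow the reference
        else:
            resolved.append(entry)
    pulse['data'] = resolved
    return pulse

Combined with the uid field, the same function could be memoized so that a pulse referenced several times is only read from disk once.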
DiscussionPro
Contra
|
-
I did a little benchmark to compare the performance of a SQL-type database (sqlite3) and just using the filesystem (JSON files) for saving a pulse tree structure. I considered the worst-case scenario, where one pulse references many, many other pulses.

Benchmark implementation

The implementation is simple. Since I could linearize a pulse tree structure in principle, I just consider a linked list in the first place. I create a shuffled list of unique integer ids in which each element references the next one. To account for the case where a pulse is referenced several times, a reference count is attached to each element and the read of that element is repeated accordingly. The read operation can work either with a sqlite database, JSON files in a folder, or a redis server.

Results

Results on a 2.3 GHz Intel Core i7 with an APPLE SSD SM256E and Python 3.5.0b2. Python 2.7.8 is similar but a few percent slower. Every 10th reference is read 10 times, to see how caching improves the situation.
Discussion

It seems SQL is not a good choice here, or my SQL skills are just too low. Redis, on the other hand, is quite fast (the data is stored in RAM and dumped to disk periodically, so it cannot get too large). It may seem strange that JSON with caching is slower than without, but the small files are probably cached on a lower level anyway and lru_cache just adds more overhead here. Conclusion: plain files are basically the fastest/easiest way.

Code

import random
import itertools
import json
import os
import sqlite3
import time
# from functools import lru_cache # Uncomment for caching. Works only in python 3.
# import redis # Uncomment for redis.
def generate_id_list(id_count):
"""Generate shuffled list of unique ids."""
id_list = [_id for _id in range(id_count)]
random.shuffle(id_list)
return id_list
def traverse(id_list, ref_count_list, func, **kwargs):
"""Traverse id_list and call func on the elements.
Add a ref_count that cycles through ref_count_list and a reference
that points to the next element.
"""
ref_iter = itertools.cycle(ref_count_list)
id_iter, next_id_iter = itertools.tee(id_list)
next(next_id_iter, None) # Point iterator to next element.
# Will go over len(id_list)-1 elements.
for _id, next_id, ref_count in zip(id_iter, next_id_iter, ref_iter):
func(_id, next_id, ref_count, **kwargs)
# Call func on the last element without reference.
func(id_list[-1], None, next(ref_iter), **kwargs)
def populate(root_id, func, **kwargs):
"""Simulate list construction using func output.
Simulate multiple queries to reference by looping over ref_count.
"""
ref_id, ref_count = func(root_id, **kwargs)
while ref_id is not None:
for _ in range(ref_count-1):
func(ref_id, **kwargs)
ref_id, ref_count = func(ref_id, **kwargs)
def prepare_sql(path):
"""Setup a simple one table id, ref_id, ref_count database."""
con = sqlite3.connect(path)
with con:
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS Ref')
cur.execute(('CREATE TABLE Ref(id INTEGER, '
'ref_id INTEGER, ref_count INTEGER)'))
return con
def save_sql(_id, ref_id, ref_count, con=None, **kwargs):
"""Write out to the database."""
with con:
cur = con.cursor()
cur.execute('INSERT INTO Ref VALUES(?,?,?)',
(_id, ref_id, ref_count))
#@lru_cache(maxsize=16) # Uncomment for caching.
def load_sql(_id, con=None, **kwargs):
"""Return ref_id, ref_count for _id."""
with con:
cur = con.cursor()
cur.execute(('SELECT ref_id, ref_count FROM Ref '
'WHERE id=%d') %_id)
return cur.fetchone()
def save_json(_id, ref_id, ref_count, path='', **kwargs):
"""Write out to a json file located in path."""
with open('%s%d.json' % (path, _id), 'w') as outfile:
data = {'ref_id': ref_id, 'ref_count': ref_count}
json.dump(data, outfile)
#@lru_cache(maxsize=16) # Uncomment for caching.
def load_json(_id, path='', **kwargs):
"""Return ref_id, ref_count for _id."""
with open('%s%d.json' % (path, _id), 'r') as infile:
data = json.load(infile)
return data['ref_id'], data['ref_count']
#def prepare_redis(host='localhost', port=6379):
# r = redis.StrictRedis(host=host, port=port, db=0)
# return r
#def save_redis(_id, ref_id, ref_count, r_server=None, **kwargs):
# r_server.rpush(_id, ref_id)
# r_server.rpush(_id, ref_count)
#def load_redis(_id, r_server=None, **kwargs):
# data = r_server.lrange(_id, 0, -1)
# if data[0] == b'None':
# return None, int(data[1])
# return int(data[0]), int(data[1])
def run_benchmark(id_count, ref_count_list,
db_path='pls.db', json_path='pls/'):
    # Make sure the folder for the JSON files exists before writing into it.
    if not os.path.isdir(json_path):
        os.makedirs(json_path)
    con = prepare_sql(db_path)
# r = prepare_redis()
id_list = generate_id_list(id_count)
save_func = [save_sql, save_json]
save_timing = []
for func in save_func:
start = time.time()
traverse(id_list, ref_count_list, func, con=con, path=json_path)
stop = time.time()
save_timing.append(stop-start)
load_func = [load_sql, load_json]
load_timing = []
for func in load_func:
start = time.time()
populate(id_list[0], func, con=con, path=json_path)
stop = time.time()
load_timing.append(stop-start)
# r.flushdb()
return save_timing, load_timing
if __name__ == '__main__':
ref_cnt_list = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
for cnt in [10, 100, 1000, 10000, 100000]:
        print(run_benchmark(cnt, ref_cnt_list))

WARNING: This will run for about 30 min.
-
I don't think this test is entirely accurate because of the following.
Regarding the caching: the size of the cache is limited to 16 entries. That is a small number compared to the number of pulses in the last few tests, which means that we mostly get cache misses and effectively only add overhead to the function calls. Hence the increase in runtime with caching enabled. This is not intended as an argument towards SQLite - as far as I remember we agreed on using JSON files during the last meeting - but I think this benchmark is not accurate in its current form. We could rewrite it, but it might yield the same result and I don't know whether this is high priority. If so, please create a ticket for us.
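For illustration (reusing the load_json logic from the benchmark above, renamed here), the cache would have to hold roughly as many entries as there are pulses before it pays off; with functools.lru_cache that is a one-line change, and cache_info() reveals whether reads actually hit:

import json
from functools import lru_cache


@lru_cache(maxsize=None)  # Unbounded; with maxsize=16 nearly every lookup misses for large id counts.
def load_json_cached(_id, path=''):
    with open('%s%d.json' % (path, _id)) as infile:
        data = json.load(infile)
    return data['ref_id'], data['ref_count']


# After a benchmark run:
# load_json_cached.cache_info()  # -> CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)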
-
Status quo
In special-measure, the pulses are all stored in one struct, ordered and accessible by their IDs. There is one global database struct in which the default pulses are stored. For a new experiment, the user works with a local copy in which unneeded IDs are overwritten. When the experiment is finished, the user saves the whole database file and its report into a folder.
Desired Improvements
Solution proposals
Relational database (SQL)
The main database could be rewritten into an SQL database. These databases only store base types and references, so additional steps are necessary to store arbitrary objects.
The pulse will be divided into its subcomponents until only base types remain. These subcomponents will then be referenced.
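As a rough illustration of this decomposition (all table and column names are invented for the sketch), each pulse could become one row plus rows for its subcomponents, which either hold base-type values or reference another pulse:

import sqlite3

con = sqlite3.connect('pulses.db')
with con:
    con.execute('CREATE TABLE IF NOT EXISTS pulse ('
                'id INTEGER PRIMARY KEY, name TEXT)')
    con.execute('CREATE TABLE IF NOT EXISTS component ('
                'pulse_id INTEGER REFERENCES pulse(id), '      # parent pulse
                'position INTEGER, '                           # order within the parent
                'kind TEXT, '                                  # e.g. "ramp" or "ref"
                'value REAL, '                                 # base-type payload, if any
                'ref_pulse_id INTEGER REFERENCES pulse(id))')  # set when kind is "ref"

Reassembling a pulse then means recursively following ref_pulse_id, which is exactly where the read-performance concerns from the earlier comments come in.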
Prewritten wrapper around the SQL database with different approaches:
Folder structure and XML representations
The main database can be represented in the XML format, a widely used standard for data exchange. For each experiment, we generate a subfolder in which the relevant data (pulse, measurements, documentation) is stored. The pulse file may reference the main database and stores the composition of subpulses.
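A small sketch of what such an experiment-local pulse file could look like, written with the standard library's xml.etree.ElementTree; the element and attribute names are made up for this example:

import xml.etree.ElementTree as ET

# Compose a pulse from a reference into the main database plus a locally defined ramp.
pulse = ET.Element('pulse', name='readout')
ET.SubElement(pulse, 'ref', database_id='137')  # subpulse stored in the main database
ET.SubElement(pulse, 'ramp', start='0', stop='0.2', duration='2')

# Store it inside the experiment's subfolder (assumed to exist) next to measurements and documentation.
ET.ElementTree(pulse).write('experiment_42/readout.pulse.xml')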
Comparison of the proposals
So, in the end, we have a trade-off between a performant database and a simple one.
Which setup do you think will fulfill your needs?