Load data #16
Changes from 16 commits
@@ -85,3 +85,6 @@ venv/
# config specific
**/config.yml
.env
+
+# used in testing
+*.hdf5
@@ -1,15 +1,20 @@
import logging
from pathlib import Path
from typing import Tuple

import h5py
from decouple import config

from ..objects.data_raw import DataRaw
from ..objects.enums import AnalysisType, DataType
from ..objects.specifications import Specifications
from .read_config import read

CONFIG_PATH = config("CONFIG_PATH")
Comment: This python-decouple module is very cool, I have not seen it before. As a completely naive user, I tried playing around with this just in the console and got 'CONFIG_PATH not found. Declare it as envvar or define a default value.', and I can't quite figure out how it works. Is it possible to add these two lines of code to a function (e.g. a new function read_configs.get_config_path()) and add a docstring on how decouple works? Would it be possible / sensible to get config_path in main.py and pass it through load_data > get_specifications > read_configurations(), so that settings are displayed together in main.py for readability? But maybe this suggestion is not optimal for python-decouple.
Reply: I am using decouple as a way to not store the local paths of the config in the repo. It is set up in a way that it takes for granted the existence of an .env file.
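A minimal sketch of the helper the comment asks for (the name get_config_path and the example CONFIG_PATH value are illustrative, not part of this PR):

from pathlib import Path

from decouple import config


def get_config_path() -> Path:
    """Return the path of the config file.

    config("CONFIG_PATH") is resolved by python-decouple: it first looks for
    a CONFIG_PATH entry in a local .env (or settings.ini) file and falls back
    to an environment variable of the same name. One of the two must exist,
    e.g. a .env file containing CONFIG_PATH=config/config.yml.
    """
    return Path(__file__).parents[1] / config("CONFIG_PATH")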
config_path = Path(__file__).parents[1] / CONFIG_PATH

-def load_data(folder_name: str) -> Tuple[list, Specifications]:
+def load_data(folder_name: str) -> Tuple[DataRaw, Specifications]:
Comment: Is folder_name the top-level folder of the experimental data directory? This is probably obvious as I can't recall how they lay out their data, but a quick example would help, e.g. if the path is /windows/something/myproject, is the name myproject? How does the loader know the full path to the project directory? This all seems to be handled very gracefully; maybe just some docs for the naive user in the main() docstring would be useful.
Reply: The path is stored in the config file that was hidden from you, sorry.
    """Creates the configuration object and loads the data.

    Parameters

@@ -51,8 +56,31 @@ def get_specifications(folder_name: str) -> Specifications:
    return specs
Comment: I was a little confused about the difference between 'specifications' vs. 'configurations'. Instead of 'specifications', could this be a subset of configurations, e.g. project_configs (maybe later there will be other types of configs)?
Reply: Yes, this is very confusing, because also the

-def load(specs: Specifications) -> list:
-    raise NotImplementedError("TODO")
+def load(specs: Specifications) -> DataRaw:
    if specs.config["use-allen-dff"]:
        if specs.config["analysis-type"] == "sf_tf":
Comment: Is it possible to expand this abbreviation?
Reply: It is meant to be
Comment: maybe

            allen_data_files = [
                file
                for file in specs.folder_naming.all_files
                if file.datatype == DataType.ALLEN_DFF
                and file.analysistype == AnalysisType.SF_TF
            ]
            if len(allen_data_files) == 1:
                data_raw = DataRaw(
                    h5py.File(allen_data_files[0].path, "r"), is_allen=True
Comment: This may already be happening in DataRaw, but if not it might be worth opening the file in a context manager that will close it automatically in case of an exception during loading, e.g. with h5py.File(...) as f: (a sketch appears after load() below).
Reply: Yes, it makes sense, thanks!
                )
                logging.info(f"Allen data loaded: {data_raw}")
                return data_raw
            else:
                raise ValueError(
                    "There is more than one Allen file for sf_tf analysis"
                )
        else:
            raise NotImplementedError(
                "Only sf_tf analysis is implemented for Allen data"
            )
    else:
        raise NotImplementedError("Only loading for Allen data is implemented")

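A sketch of the context-manager approach suggested in the comment above (hedged: DataRaw appears to unpack everything eagerly in __init__, so closing the handle right after construction is assumed to be safe):

            if len(allen_data_files) == 1:
                # Open the HDF5 file in a context manager so it is closed even
                # if unpacking raises an exception while DataRaw is built.
                with h5py.File(allen_data_files[0].path, "r") as handle:
                    data_raw = DataRaw(handle, is_allen=True)
                logging.info(f"Allen data loaded: {data_raw}")
                return data_raw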

def read_configurations() -> dict:
Comment: This function is similar to read_config.read() but with an extra layer of logging. Could these two functions be combined? (Just a suggestion; this layering of logging could be very useful.)
Reply: You are right, it is quite useless.
(A sketch of a merged version appears after this function.)

@@ -65,7 +93,7 @@ def read_configurations() -> dict:
    """

    logging.debug("Reading configurations")
-    config = read(CONFIG_PATH)
+    config = read(config_path)
    logging.debug(f"Configurations read: {config}")

    return config
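A possible merged version, as discussed above (a sketch only: it assumes read_config.read() takes a path and parses the YAML config with PyYAML, which is listed as a dependency):

import logging
from pathlib import Path

import yaml


def read(config_path: Path) -> dict:
    """Read the YAML config and log what was loaded."""
    logging.debug("Reading configurations")
    with open(config_path, "r") as stream:
        config = yaml.safe_load(stream)
    logging.debug(f"Configurations read: {config}")
    return config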
@@ -0,0 +1,146 @@
import logging
from typing import Union

import h5py
import numpy as np


class DataRaw:
    """Class to load and contain the raw data.
    It can load data from Allen or from
    a list of Paths. Only the Allen case is implemented so far.
    """

    def __init__(self, data, is_allen: bool = True):
Comment: data does not have a type annotation (but maybe it is not possible).
Reply: It can, I forgot to add it!
        if is_allen:
            logging.info("Loading Allen data, starting to unpack...")

            self.day = self.unpack_data(data["day"], data)
Comment: Is self.unpack_data private or also planned to be called from outside the class?
Comment: Similar with group_to_dict_recursive and ref_dataset_to_array.

            logging.info("Unpacked day")

            self.imaging = self.unpack_data(data["imaging"], data)
            logging.info("Unpacked imaging")

            self.f = self.unpack_data(data["f"], data)
Comment: Could the field "f" be un-abbreviated?
Reply: It could, it is

            logging.info("Unpacked f")

            self.is_cell = self.unpack_data(data["is_cell"], data)
            logging.info("Unpacked is_cell")

            self.r_neu = self.unpack_data(data["r_neu"], data)
Comment: Maybe this is a candidate for expanding the field name too (but maybe it is obvious in the field).
Reply: OK, I'll unpack it.

            logging.info("Unpacked r_neu")

            self.stim = self.unpack_data(data["stim"], data)
            logging.info("Unpacked stim")

            self.trig = self.unpack_data(data["trig"], data)
            logging.info("Unpacked trig")
        else:
            self.day = data["day"]
            self.imaging = data["imaging"]
            self.f = data["f"]
            self.is_cell = data["is_cell"]
            self.r_neu = data["r_neu"]
            self.stim = data["stim"]
            self.trig = data["trig"]

    def __repr__(self) -> str:
        return f"DataRaw(day={self.day}, imaging={self.imaging}, f={self.f}, \
Comment: This is cool.
            is_cell={self.is_cell}, r_neu={self.r_neu}, stim={self.stim}, \
            trig={self.trig})"

    @classmethod
    def group_to_dict_recursive(cls, group: h5py._hl.group.Group) -> dict:
Comment: I'm not sure if it's worth it, but it might be nice to have an example folder directory tree, just to get a handle on how the underlying data looks and help interpret the functions.
Reply: I do have it... I could add this overall info in the class docstring.
Reply: I added two examples in the class docstring.
        """Takes a Group and resolves its content. If the Group contains
        other Groups, it calls itself recursively.
        It assumes there are no more References.

        Args:
            group (h5py._hl.group.Group):
                HDF5 Group containing references

        Returns:
            dict: the resolved dictionary
        """
        dict = {}
        for key in group:
            if isinstance(group[key], h5py._hl.group.Group):
                dict[key] = cls.group_to_dict_recursive(group[key])
            else:
                dict[key] = np.squeeze(group[key][:])
        return dict

    @classmethod
    def ref_dataset_to_array(
Comment: Could / should the order of these functions be re-arranged for readability? If I understand correctly, unpack_data calls ref_dataset_to_array, which calls group_to_dict_recursive.
Reply: OK, I can re-arrange them!
        cls,
        dataset: h5py._hl.dataset.Dataset,
        parent: Union[h5py._hl.group.Group, h5py.File],
    ) -> np.ndarray:
        """Takes a Dataset that contains references to other Datasets or
        Groups and resolves its content.

        Args:
            dataset (h5py._hl.dataset.Dataset):
                HDF5 Dataset containing references
            parent_container (Union[h5py._hl.group.Group, h5py.File]):
                is the object that contains the element.
                It is used to resolve references.

        Returns:
            np.ndarray: an array of numbers or an array of dictionaries
        """
        array = np.zeros((dataset.shape[0], dataset.shape[1]), dtype=object)

        for i in range(dataset.shape[0]):
            for j in range(dataset.shape[1]):
                ref = dataset[i][j]
                if isinstance(parent[ref], h5py._hl.group.Group):
                    array[i, j] = cls.group_to_dict_recursive(parent[ref])
                else:
                    array[i, j] = np.squeeze(parent[ref][:])

        return np.squeeze(array)

    @classmethod
    def unpack_data(
        cls,
        element: Union[h5py._hl.dataset.Dataset, h5py._hl.group.Group],
        parent: Union[h5py.File, h5py._hl.group.Group],
    ) -> Union[np.ndarray, dict]:
        """This method unpack a complex MATLAB datastructure and returns a
        nested dictionary or numpy array. Only the relevant subset (Dataset
        and Groups) of the possible datastructures is implemented.
        Datasets can be mapped to arrays. Groups can be mapped to
        dictionaires, and each entry can be a Dataset or another Group.
Comment: Typo (dictionaries).

        An array might contain numbers or point to other Arrays or Groups
        through References.
        References are a HDF5 type that can point either to an array or
        to a group.
        They need to be resolved in order to get the data. They are resolved
        by calling the methods ref_dataset_to_array.
        If element is a Group, its content is unpacked recursively.

        Args:
            element Union[h5py._hl.dataset.Dataset, h5py._hl.group.Group]:
                is either a h5py Group or Dataset.
                It is what we want to unpack.
            parent Union[h5py.File, h5py._hl.group.Group]:
                is the object that contains the element.
                It is used to resolve references.

        Returns:
            Union[np.ndarray, dict]:
                is either a numpy array or a nested dictionary.
        """
        if isinstance(element, h5py._hl.dataset.Dataset):
            if element.dtype == h5py.special_dtype(ref=h5py.Reference):
                return cls.ref_dataset_to_array(element, parent)
            else:
                return np.squeeze(element[:])
        elif isinstance(element, h5py._hl.group.Group):
            dict = {}
            for key in element:
                dict[key] = cls.unpack_data(element[key], element)
Comment: This recursive file / folder unpacking is handled very nicely.
            return dict
        else:
            return None
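For readers new to HDF5 object references, a minimal standalone illustration of what unpack_data has to resolve (not part of the PR; the file name and values are made up, but the h5py calls mirror the ones used in the tests below):

import h5py
import numpy as np

with h5py.File("references_demo.hdf5", "w") as f:
    # A normal dataset, plus a dataset whose elements are references to it.
    target = f.create_dataset("target", data=np.array([1.0, 2.0, 3.0]))
    f.create_dataset(
        "refs",
        data=[[target.ref]],
        dtype=h5py.special_dtype(ref=h5py.Reference),
    )

with h5py.File("references_demo.hdf5", "r") as f:
    ref = f["refs"][0][0]  # an h5py.Reference object, not the data itself
    resolved = f[ref][:]   # dereference against the parent, as in ref_dataset_to_array
    print(resolved)        # [1. 2. 3.]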
@@ -24,6 +24,7 @@ dependencies = [
    "fancylog",
    "PyYAML",
    "types-PyYAML",
+    "h5py",
]

[project.urls]
@@ -0,0 +1,86 @@
import h5py
import numpy as np

from load_suite2p.objects.data_raw import DataRaw

array0 = np.array([1, 2, 3, 4, 5])
Comment: Could use np.arange(1, 6) etc. as more concise, but less readable.
Reply: Yes, thanks!
array1 = np.array([3, 4, 5, 6, 7])
array2 = np.array([5, 6, 7, 8, 9])
array3 = np.array([7, 8, 9, 10, 11])
array4 = np.array([9, 10, 11, 12, 13])


def create_mock_hdf5_data():
Comment: A pytest fixture could be used in this case; it is a nice way to control the data that is passed to tests. The main benefits are:
a) Including the fixture as an argument to each test should trigger the dataset creation for each function. I think this will work, but maybe not, as the create function doesn't explicitly return anything; but that shouldn't matter.
b) Fixtures are nice because you can control the scope of the data. For example, with scope="function", create_mock_hdf5_data() will be called again for each function it is used in; with scope="class", the function will be called once and then re-used (nice to save time if you are definitely not changing the data). However, I would typically always use function scope just so you are 100% sure you are starting fresh for each test.
c) All code after the yield keyword is still run, which makes tearing down after tests easier. In this case I would suggest adding teardown code that deletes the written file.
I always think the pytest docs are a bit confusing, but this is nice: https://towardsdatascience.com/make-your-python-tests-efficient-with-pytest-fixtures-3d7a1892265f
Reply: Thank you very much, I like fixtures very much. I've just implemented better mocks!
Comment: pytest has an automatic fixture, tmp_path, that gives you access to a temporary directory that might be easier to write to rather than cwd (which, if I understand correctly, this does): https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html
Reply: Very cool! I've implemented it.
(A sketch combining both suggestions appears after create_mock_hdf5_data below.)
    with h5py.File("mytestfile.hdf5", "w") as f:
Comment: Just personal preference, but I like to un-abbreviate f to file in this case, just to be super explicit.
        # create a file and add a simple dataset
        f.create_dataset("array0", data=array0, dtype="i")

        # create a file and add a group with a dataset inside
        grp = f.create_group("mygroup")
        grp.create_dataset("array1", data=array1, dtype="f")

        # create group with subgroup and a dataset
        subgroup = grp.create_group("subgroup")
        subgroup.create_dataset("array2", data=array2, dtype="f")

        # create a dataset with references of dataset
        dataset_to_be_referenced = f.create_dataset(
            "array3", data=array3, dtype="f"
        )
        ref = dataset_to_be_referenced.ref
        ref_array = [[ref, ref], [ref, ref]]
        f.create_dataset(
            "ref_dataset",
            data=ref_array,
            dtype=h5py.special_dtype(ref=h5py.Reference),
        )

        # create a dataset with references of group with subgroup
        group_to_be_referenced = f.create_group("#ref_group#")
        subgroup2 = group_to_be_referenced.create_group("subgroup2")
        subgroup2.create_dataset("array4", data=array4, dtype="f")
        ref2 = group_to_be_referenced.ref
        ref_array2 = [[ref2, ref2], [ref2, ref2]]
        f.create_dataset(
            "ref_dataset2",
            data=ref_array2,
            dtype=h5py.special_dtype(ref=h5py.Reference),
        )

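A hedged sketch of the fixture approach discussed above, combining the yield/teardown points with the tmp_path suggestion (the fixture name mock_hdf5_path is illustrative; the actual implementation added in later commits may differ):

import pytest


@pytest.fixture(scope="function")
def mock_hdf5_path(tmp_path):
    # Write the mock file into pytest's per-test temporary directory instead
    # of the current working directory.
    path = tmp_path / "mytestfile.hdf5"
    with h5py.File(path, "w") as file:
        file.create_dataset("array0", data=array0, dtype="i")
        # ... build the rest of the mock structure as in create_mock_hdf5_data ...
    yield path
    # Code placed after yield runs as teardown; tmp_path itself is cleaned up
    # by pytest, so no manual deletion is needed here.


def test_unpack_of_simple_dataset_with_fixture(mock_hdf5_path):
    with h5py.File(mock_hdf5_path, "r") as file:
        assert np.all(DataRaw.unpack_data(file["array0"], file) == array0)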

def test_unpack_of_simple_dataset():
    create_mock_hdf5_data()
    with h5py.File("mytestfile.hdf5", "r") as f:
        assert np.all(DataRaw.unpack_data(f["array0"], f) == array0)


def test_unpack_of_dataset_in_group():
    create_mock_hdf5_data()
    with h5py.File("mytestfile.hdf5", "r") as f:
        assert np.all(DataRaw.unpack_data(f["mygroup"]["array1"], f) == array1)


def test_unpack_of_dataset_in_subgroup():
    create_mock_hdf5_data()
    with h5py.File("mytestfile.hdf5", "r") as f:
        assert np.all(
            DataRaw.unpack_data(f["mygroup"]["subgroup"]["array2"], f)
            == array2
        )


def test_unpack_of_dataset_with_references_to_dataset():
    create_mock_hdf5_data()
    with h5py.File("mytestfile.hdf5", "r") as f:
        assert np.all(DataRaw.unpack_data(f["ref_dataset"], f)[0][0] == array3)


def test_unpack_of_dataset_with_references_to_group_with_subgroup():
Comment: Overall I think these tests look very nice. It may also be worth testing that the package handles any potentially common bad inputs the user might give. This can be done by checking that an error is raised (and its content) with pytest.raises() (a sketch appears after this test file).
Reply: I wonder what a bad input could be in this case, maybe if they give the wrong file type? Or a corrupted hdf5 🤔
Comment: Hmm yes, I am not too sure. Maybe bad paths in the config file? Although this is already quite obvious from the error. Another possibility is that they pass a .mat file that is not structured in the expected way, but I'm not sure if that is possible in practice. Another test possibility (although this may be overkill) is to go through one example .mat file and save all (or a subset) of the expected arrays into separate mat files (or csv or something). Then you can load the file into load-suite2p and test all arrays in python against matlab. This would be useful as a sanity check against a real dataset, and would highlight any small numerical changes that may occur during conversion (these are likely to be extremely small, if they exist at all).
Reply: Thank you Joe for the suggestions. I will open a separate issue to address these use cases, also because they seem related to a further expansion of the load function that I aim to develop later.
Reply: #21 here we go
    create_mock_hdf5_data()
    with h5py.File("mytestfile.hdf5", "r") as f:
        assert np.all(
            DataRaw.unpack_data(f["ref_dataset2"], f)[0][0]["subgroup2"][
                "array4"
            ]
            == array4
        )
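A minimal sketch of the kind of bad-input test discussed above (illustrative only; it relies on the fact that h5py raises OSError when asked to open a file that is not valid HDF5):

import pytest


def test_opening_a_non_hdf5_file_raises(tmp_path):
    # A plain text file has no HDF5 signature, so h5py should refuse to open it.
    bad_file = tmp_path / "not_really.hdf5"
    bad_file.write_text("this is not an HDF5 file")
    with pytest.raises(OSError):
        h5py.File(bad_file, "r")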
Comment: I think this needs to be added to pyproject.toml:
from decouple import config
Reply: OK, thanks!