Conversation
Summary: In this PR, I add the logic required to open and unpack the MATLAB files generated with the MATLAB codebase that I am translating.

Great job @lauraporta, this is very nice, very well structured, the code is clean and intuitive to follow. I've left a few suggestions but nothing major. Learnt some cool new features and might do some investigating into datashuttle structure inspired by #8!
from typing import Tuple

import h5py
from decouple import config
I think this needs to be added to pyproject.toml: from decouple import config
OK thanks!
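For reference, a minimal sketch of the dependency entry (assuming a PEP 621-style pyproject.toml; the surrounding fields are illustrative, and the package is published on PyPI as python-decouple):

```toml
[project]
name = "load-suite2p"
dependencies = [
    "h5py",
    "python-decouple",
]
```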
from decouple import config

from ..objects.data_raw import DataRaw
from ..objects.enums import AnalysisType, DataType
from ..objects.specifications import Specifications
from .read_config import read

CONFIG_PATH = config("CONFIG_PATH")
This python-decouple module is very cool, I have not seen it before. As a completely naive user, I tried playing around with this just in the console and got 'CONFIG_PATH not found. Declare it as envvar or define a default value.', and I can't quite figure out how it works. Is it possible to move these two lines of code into a function (e.g. a new function read_configs.get_config_path()) and add a docstring on how decouple works?
Would it be possible / sensible to get config_path in main.py and pass it through load_data > get_specifications > read_configurations(), so that settings are displayed together in main.py for readability? But maybe this suggestion is not optimal for python-decouple.
I am using decouple as a way to not store the local paths of the config in the repo. It is set up in a way that it takes for granted the existence of a .env file containing the variable CONFIG_PATH. I forgot to specify this, my bad. I will add it to the Readme.md file.
I am not yet sure how I want to handle the configuration file, which is why this bit is still vague and unpolished.
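For anyone reading along, a minimal sketch of the pattern (the .env contents and the helper name are illustrative, not the repo's actual code):

```python
# .env -- kept out of version control, placed in the project root:
# CONFIG_PATH=/path/to/local/config.yml

from decouple import config


def get_config_path() -> str:
    """Return the path to the local config file.

    python-decouple resolves CONFIG_PATH from the environment first,
    then from a .env file in the project root, so machine-specific
    paths never have to be committed to the repo.
    """
    return config("CONFIG_PATH")
```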
-def load_data(folder_name: str) -> Tuple[list, Specifications]:
+def load_data(folder_name: str) -> Tuple[DataRaw, Specifications]:
Is folder_name the top-level folder of the experimental data directory? This is probably obvious as I can't recall how they lay out their data, but a quick example would help, e.g. if the path is /windows/something/myproject, is the name myproject? How does the loader know the full path to the project directory? This all seems to be handled very gracefully; maybe just some docs for the naive user in the main() docstring would be useful.
The path is stored in the config file, which was hidden from you, sorry!
-    raise NotImplementedError("TODO")
+def load(specs: Specifications) -> DataRaw:
+    if specs.config["use-allen-dff"]:
+        if specs.config["analysis-type"] == "sf_tf":
is it possible to expand this abbreviation?
It is meant to be spatial frequency, temporal frequency, which is quite long... I could just call it frequencies if it appears very confusing.
maybe spat_freq_temp_freq? or spat_f_temp_f?
load_suite2p/load/load_data.py
]
if len(allen_data_files) == 1:
    data_raw = DataRaw(
        h5py.File(allen_data_files[0].path, "r"), is_allen=True
This may already be happening in DataRaw, but if not it might be worth opening the file in a context manager that will close it automatically in case of an exception during loading, e.g.
with h5py.File(allen_data_files[0].path, "r") as h5py_file:
    data_raw = DataRaw(h5py_file, is_allen=True)
Yes it makes sense, thanks!
from load_suite2p.objects.data_raw import DataRaw

array0 = np.array([1, 2, 3, 4, 5])
could use np.arange(1, 6) etc. as more concise, but less readable
yes thanks!
def create_mock_hdf5_data():
    with h5py.File("mytestfile.hdf5", "w") as f:
just personal preference, I like to unabbreviate f to file in this case, just to be super explicit
array4 = np.array([9, 10, 11, 12, 13])


def create_mock_hdf5_data():
a pytest fixture could be used in this case, it is a nice way to control data that is passed to tests. The main benefits are:
a) you don't need to call the function in the code each time, you can pass it as an argument, e.g.

@pytest.fixture(scope="function")
def create_mock_hdf5_data():
    ...  # build the mock data
    yield None
    teardown_code()


def test_unpack_of_simple_dataset(create_mock_hdf5_data):
    ...

I think this will work, but maybe not, as the create function doesn't explicitly return anything; but that shouldn't matter, including it in this way should trigger the dataset creation for each function.
b) fixtures are nice because you can control the scope of the data. For example, with scope="function", create_mock_hdf5_data() will be called again for each function it is used in. For scope="class", the function will be called once and then re-used (nice to save time if you are definitely not changing the data). However, typically I would always use function scope just so you are 100% sure you are starting fresh for each test.
c) All code after the yield keyword is still run, which makes tearing down after tests easier. In this case I would suggest adding teardown code that deletes the written file.
I always think the pytest docs are a bit confusing, but this is nice: https://towardsdatascience.com/make-your-python-tests-efficient-with-pytest-fixtures-3d7a1892265f
Thank you very much, I like fixtures very much. I've just implemented better mocks!
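To make the suggestion above concrete, a self-contained sketch (the file name, dataset name, and array contents are placeholders, not the repo's actual tests):

```python
import os

import h5py
import numpy as np
import pytest


@pytest.fixture(scope="function")
def mock_hdf5_file():
    """Write a small HDF5 file for the test, then delete it afterwards
    (everything after the yield runs as teardown code)."""
    path = "mytestfile.hdf5"
    with h5py.File(path, "w") as file:
        file.create_dataset("dataset", data=np.array([1, 2, 3, 4, 5]))
    yield path
    os.remove(path)


def test_unpack_of_simple_dataset(mock_hdf5_file):
    with h5py.File(mock_hdf5_file, "r") as file:
        assert np.all(file["dataset"][()] == np.array([1, 2, 3, 4, 5]))
```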
array4 = np.array([9, 10, 11, 12, 13])


def create_mock_hdf5_data():
pytest has an automatic fixture tmp_path that gives you access to a temporary directory, which might be easier to write to rather than cwd (which, if I understand correctly, this does): https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html
Very cool! I've implemented it
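For example, combining tmp_path with the fixture idea from above (a sketch; the dataset name is made up), so nothing is written to the working directory and no teardown is needed:

```python
import h5py
import numpy as np
import pytest


@pytest.fixture
def mock_hdf5_path(tmp_path):
    """tmp_path is a built-in pytest fixture: a fresh temporary
    directory per test, cleaned up automatically by pytest."""
    path = tmp_path / "mytestfile.hdf5"
    with h5py.File(path, "w") as file:
        file.create_dataset("dataset", data=np.arange(1, 6))
    return path
```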
    assert np.all(DataRaw.unpack_data(f["ref_dataset"], f)[0][0] == array3)


def test_unpack_of_dataset_with_references_to_group_with_subgroup():
Overall I think these tests look very nice. It may also be worth testing that the package handles any potentially common bad inputs the user might give. This can be done by checking that an error is raised (and checking its content) with pytest.raises(), as in the sketch below.
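For instance (a sketch; the exact exception depends on how the loader validates its input, but h5py itself raises OSError on files without a valid HDF5 signature):

```python
import h5py
import pytest


def test_open_rejects_non_hdf5_file(tmp_path):
    """A file with the right extension but the wrong contents
    should fail loudly rather than load garbage."""
    bad_file = tmp_path / "not_really.mat"
    bad_file.write_text("this is not an HDF5 file")
    with pytest.raises(OSError):
        h5py.File(bad_file, "r")
```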
I wonder what a bad input could be in this case, maybe if they give the wrong file type? or a corrupted hdf5 🤔
hmm yes, I am not too sure. Maybe bad paths in the config file? Although this is already quite obvious from the error. Another possibility is that they pass a .mat file that is not structured in the expected way. But I'm not sure if that's possible in practice.
Another test possibility (although this may be overkill) is to go through one example .mat file and save all (or a subset) of the expected arrays into separate .mat files (or csv or something). Then you can load the file into load-suite2p and test all the arrays in python against matlab, as sketched below. This would be useful as a sanity check against a real dataset, and would highlight any small numerical changes that may occur during conversion (these are likely to be extremely small, if they exist at all).
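A rough sketch of what that comparison could look like (the reference csv, the example .mat file, and the dataset name are all hypothetical):

```python
import h5py
import numpy as np


def test_arrays_match_matlab_export():
    """Compare an array loaded through the HDF5 path against the
    same array exported directly from MATLAB as csv."""
    expected = np.loadtxt("matlab_export/f_dataset.csv", delimiter=",")
    # v7.3 .mat files are HDF5 under the hood, so h5py can open them
    with h5py.File("example_data.mat", "r") as file:
        loaded = np.asarray(file["f_dataset"])
    # assert_allclose surfaces even tiny numerical drift
    np.testing.assert_allclose(loaded, expected, rtol=1e-12)
```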
Thank you Joe for the suggestions. I will open a separate issue to address these use cases, also because they seem related to a further expansion of the load function that I aim to develop later.
#21 here we go
Hey Laura, I tested running the full pipeline now, seems to work well! A couple of suggestions just from the perspective of someone not familiar with the data:
Thank you Joe for the last feedback.
config files
I see your point, this happens because I explicitly chose not to track the config files, but yes, a kind of template should be present to initialise the program. I don't want to track the .env file itself.
Oh yes, this is a very good point, maybe a file called example.env can be un-excluded (with this syntax):
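i.e. something like this in .gitignore (a sketch; in gitignore syntax the ! negation must come after the rule it overrides):

```
# ignore local environment files...
*.env
# ...but keep the committed template
!example.env
```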