This repository has been archived by the owner on Oct 30, 2024. It is now read-only.

Load data #16

Merged
merged 27 commits into developement from load-data on Jan 26, 2023

Conversation

lauraporta
Member

No description provided.

@lauraporta lauraporta linked an issue Dec 14, 2022 that may be closed by this pull request
@lauraporta lauraporta changed the base branch from developement to main December 19, 2022 16:34
@lauraporta lauraporta changed the base branch from main to developement December 19, 2022 16:35
@lauraporta lauraporta changed the base branch from developement to main January 10, 2023 14:29
@lauraporta lauraporta changed the base branch from main to developement January 10, 2023 14:30
@lauraporta lauraporta marked this pull request as ready for review January 10, 2023 17:29
@lauraporta
Member Author

Summary

In this PR, I add the logic required to open and unpack the MATLAB files generated by the MATLAB codebase that I am translating, called allen-df files. They are specific to the analysis modality, and in this case I am focusing only on the sf-tf one.

load_data.py

The load_data method, after receiving the path of the file to load (in specs), creates a DataRaw object through the load() method.
For now I skip the implementations that go beyond using allen-df files, which is why there are many NotImplementedErrors.
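A rough sketch of that flow (hedged: the function and config-key names below follow this PR, but the bodies are simplified and are not the actual implementation):

from typing import Tuple


def load_data(folder_name: str) -> Tuple["DataRaw", "Specifications"]:
    specs = get_specifications(folder_name)  # reads paths/settings from the config
    data_raw = load(specs)                   # builds the DataRaw object from the file
    return data_raw, specs


def load(specs: "Specifications") -> "DataRaw":
    if not specs.config["use-allen-dff"]:
        raise NotImplementedError("Only allen-dff files are handled for now")
    if specs.config["analysis-type"] != "sf_tf":
        raise NotImplementedError("Only the sf_tf analysis is handled for now")
    # ...locate the allen file and wrap it in DataRaw (see data_raw.py)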

data_raw.py

In DataRaw various objects are extracted, all with the method unpack_data(), which deals with HDF5 using the h5py library. You can find an explanation of how this method works in its docstring. It is tailor-made to the specific file structure that I am opening.
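The gist of such an unpacker, as a hedged sketch (not the actual method: the real unpack_data() is tailored to the allen-df layout; this only shows the recursive dataset/reference/group handling with h5py):

import h5py


def unpack(element, file: h5py.File):
    """Recursively turn an HDF5 element into plain Python/NumPy objects."""
    if isinstance(element, h5py.Dataset):
        if h5py.check_dtype(ref=element.dtype) is not None:
            # dataset of object references: dereference each one and recurse
            return [unpack(file[ref], file) for ref in element[:].flat]
        return element[:]  # plain numeric dataset -> ndarray
    if isinstance(element, h5py.Group):
        # group (possibly containing subgroups): unpack every member
        return {key: unpack(element[key], file) for key in element}
    return element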

Tests

The methods of DataRaw are decorated with @classmethod because this helps me test them independently of the instantiation of the DataRaw class. In the test suite, I create a mock HDF5 file on the local machine and check the fundamental behaviors of my unpack_data() method.
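For example (a hedged sketch with made-up dataset names and values, not the actual test suite):

import h5py
import numpy as np

from load_suite2p.objects.data_raw import DataRaw


def test_unpack_of_simple_dataset():
    # create a small mock HDF5 file on disk...
    with h5py.File("mytestfile.hdf5", "w") as file:
        file.create_dataset("dataset", data=np.array([1, 2, 3, 4, 5]))
    # ...then call unpack_data directly on the class: being a classmethod,
    # it does not require a DataRaw instance (and hence no real allen-df file)
    with h5py.File("mytestfile.hdf5", "r") as file:
        assert np.all(
            DataRaw.unpack_data(file["dataset"], file) == np.array([1, 2, 3, 4, 5])
        )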

@JoeZiminski
Member

Great job @lauraporta, this is very nice: very well structured, and the code is clean and intuitive to follow. I've left a few suggestions but nothing major. I learnt some cool new features and might do some investigating into the datashuttle structure, inspired by #8!

from typing import Tuple

import h5py
from decouple import config
Member
@JoeZiminski Jan 20, 2023

I think this needs to be added to pyproject.toml: from decouple import config

Member Author

OK thanks!

from decouple import config

from ..objects.data_raw import DataRaw
from ..objects.enums import AnalysisType, DataType
from ..objects.specifications import Specifications
from .read_config import read

CONFIG_PATH = config("CONFIG_PATH")
Member

This python-decouple module is very cool, I have not seen it before. As a completely naive user, I tried playing around with this just in the console and got 'CONFIG_PATH not found. Declare it as envvar or define a default value.', and I can't quite figure out how it works. Is it possible to add these two lines of code to a function (e.g. a new function read_configs.get_config_path()) and add a docstring on how decouple works?

Would it be possible / sensible to get config_path in main.py and pass it through load_data > get_specifications > read_configurations(), so that settings are displayed together in main.py for readability? But maybe this suggestion is not optimal for python-decouple.

Member Author

I am using decouple as a way to not store the local config paths in the repo. It is set up in a way that takes for granted the existence of a .env file containing the variable CONFIG_PATH. I forgot to specify this, my bad. I will add it to the README.md file.
I am not yet sure how I want to handle the configuration file, which is why this bit is still vague and unpolished.
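A hedged sketch of the wrapper suggested above (the function name and default value are illustrative; decouple resolves CONFIG_PATH from an environment variable or from a local, untracked .env file):

from decouple import config


def get_config_path() -> str:
    """Return the path of the configuration file.

    python-decouple looks for CONFIG_PATH first among environment variables
    and then in a local .env file (e.g. a line CONFIG_PATH=config/config.yml),
    so the local path never has to be committed to the repository.
    """
    return config("CONFIG_PATH", default="config/config.yml")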



def load_data(folder_name: str) -> Tuple[list, Specifications]:
def load_data(folder_name: str) -> Tuple[DataRaw, Specifications]:
Member

Is folder_name the top-level folder of the experimental data directory? This is probably obvious as I can't recall how they lay out their data, but a quick example would help, e.g. if the path is /windows/something/myproject the name is myproject? How does the loader know the full path to the project directory? This seems to be all handled very gracefully; maybe just some docs for the naive user in the main() docstring would be useful.

Member Author

The path is stored in the config file, which was hidden from you, sorry ☹️

raise NotImplementedError("TODO")


def load(specs: Specifications) -> DataRaw:
    if specs.config["use-allen-dff"]:
        if specs.config["analysis-type"] == "sf_tf":
Member

is it possible to expand this abbreviation?

Member Author

It is meant to be spatial frequency, temporal frequency, which is quite long...
I could just call it frequencies if it appears very confusing.

Member

maybe spat_freq_temp_freq? or spat_f_temp_f?

]
if len(allen_data_files) == 1:
    data_raw = DataRaw(
        h5py.File(allen_data_files[0].path, "r"), is_allen=True
Member

This may already be happening in DataRaw, but if not it might be worth opening the file in a context manager that will close it automatically in case of an exception during loading, e.g.

with h5py.File(allen_data_files[0].path, "r") as h5py_file:
    data_raw = DataRaw(h5py_file, is_allen=True)

Member Author

Yes it makes sense, thanks!


from load_suite2p.objects.data_raw import DataRaw

array0 = np.array([1, 2, 3, 4, 5])
Member

could use np.arange(1, 6) etc., which is more concise but less readable

Member Author

yes thanks!



def create_mock_hdf5_data():
    with h5py.File("mytestfile.hdf5", "w") as f:
Member

Just personal preference, but I like to unabbreviate f to file in this case, just to be super explicit.

array4 = np.array([9, 10, 11, 12, 13])


def create_mock_hdf5_data():
Member

a pytest fixture could be used in this case; it is a nice way to control the data that is passed to tests. The main benefits are:
a) you don't need to call the function in the code each time, you can pass it as an argument, e.g.

@pytest.fixture(scope="function")
def create_mock_hdf5_data():
    ...  # create the mock HDF5 file
    yield None
    teardown_code()  # e.g. delete the written file


def test_unpack_of_simple_dataset(create_mock_hdf5_data):
    ...

I think this will work (maybe not, as the create function doesn't explicitly return anything, but that shouldn't matter); including it in this way should trigger the dataset creation for each test function.

b) fixtures are nice because you can control the scope of the data. For example with scope="function", create_mock_hdf5_data() will be called again for each function it is used in. For scope="class", the function will be called once and then re-used (nice to save time if you are definitely not changing the data). However, typically I would always use function scope just so you are 100% sure you are starting fresh for each test.

c) All code after the yield keyword is still run, which makes tearing down after tests easier. In this case I would suggest adding teardown code that deletes the written file.

I always think the pytest docs are a bit confusing, but this is a nice guide: https://towardsdatascience.com/make-your-python-tests-efficient-with-pytest-fixtures-3d7a1892265f

Member Author

Thank you very much, I like fixtures a lot. I've just implemented better mocks!

array4 = np.array([9, 10, 11, 12, 13])


def create_mock_hdf5_data():
Member

pytest has an automatic fixture, tmp_path, that gives you access to a temporary directory; it might be easier to write to that rather than the cwd (which, if I understand correctly, is what this does): https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html
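Combining the two suggestions, a hedged sketch of what the mock could look like as a function-scoped fixture writing into tmp_path (dataset names and values are illustrative, not the actual test code):

import h5py
import numpy as np
import pytest

from load_suite2p.objects.data_raw import DataRaw


@pytest.fixture(scope="function")
def mock_hdf5_file(tmp_path):
    # write the mock data into pytest's temporary directory instead of the cwd;
    # tmp_path is cleaned up by pytest, so no explicit teardown is needed
    path = tmp_path / "mytestfile.hdf5"
    with h5py.File(path, "w") as file:
        file.create_dataset("dataset", data=np.array([1, 2, 3, 4, 5]))
    with h5py.File(path, "r") as file:
        yield file  # file is closed automatically after the test finishes


def test_unpack_of_simple_dataset(mock_hdf5_file):
    data = DataRaw.unpack_data(mock_hdf5_file["dataset"], mock_hdf5_file)
    assert np.all(data == np.array([1, 2, 3, 4, 5]))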

Member Author

Very cool! I've implemented it

assert np.all(DataRaw.unpack_data(f["ref_dataset"], f)[0][0] == array3)


def test_unpack_of_dataset_with_references_to_group_with_subgroup():
Member

Overall I think these tests look very nice. It may also be worth testing that the package handles any potentially common bad inputs the user might give. This can be done by checking that an error is raised (and checking its content) with pytest.raises().
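For instance, a hedged sketch of such a check for a wrong file type being passed (the exception type is an assumption; current h5py versions raise OSError for a missing HDF5 file signature):

import h5py
import pytest


def test_opening_a_non_hdf5_file_raises(tmp_path):
    # a plain-text file has no HDF5 signature, so h5py should refuse to open it
    bad_file = tmp_path / "not_really_hdf5.mat"
    bad_file.write_text("this is not an HDF5 file")
    with pytest.raises(OSError):
        h5py.File(bad_file, "r")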

Member Author

I wonder what a bad input could be in this case, maybe if they give the wrong file type? or a corrupted hdf5 🤔

Member

hmm yes, I am not too sure. Maybe bad paths in the config file? Although this is already quite obvious from the error. Another possibility is that they pass a .mat file that is not structured in the expected way, but I'm not sure if that's possible in practice.

Another test possibility (although this may be overkill) is to go through one example .mat file and save all (or a subset) of the expected arrays into separate .mat files (or csv or something). Then you can load the file into load-suite2p and test all arrays in Python against MATLAB. This would be useful as a sanity check against a real dataset, and would highlight any small numerical changes that may occur during conversion (these are likely to be extremely small, if they exist at all).
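A hedged sketch of what that comparison could look like, assuming the reference arrays were exported from MATLAB to .csv and that a hypothetical unpacked_arrays fixture exposes the corresponding Python-side arrays (both the paths and the array name are made up for illustration):

import numpy as np


def test_unpacked_array_matches_matlab_export(unpacked_arrays):
    # array exported from MATLAB, e.g. with writematrix(day_roi, "day_roi.csv")
    expected = np.loadtxt("matlab_exports/day_roi.csv", delimiter=",")
    # allow only tiny numerical differences between MATLAB and Python
    np.testing.assert_allclose(unpacked_arrays["day_roi"], expected, rtol=1e-10)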

Member Author

Thank you Joe for the suggestions. I will open a separate issue to address these use cases, also because they seem related to a further expansion of the load function that I aim to develop later.

Member Author

#21 here we go

@JoeZiminski
Member

Hey Laura, I tested running the full pipeline now, and it seems to work well! A couple of suggestions just from the perspective of someone not familiar with the data:

  1. It might be worth setting up a .env file with CONFIG_PATH=config/config.yml already, and an empty config file (e.g. with contents such as the example config file that is in the README.md) already there. That way a new user doesn't have to worry about any of this and can just fill in their .yml file. This way you could change the README to be platform agnostic (e.g. "edit the config file in your preferred text editor, e.g. Notepad+ for Windows, vim for UNIX"), as at the moment the touch etc. does not apply for Windows users.
  2. You could extend the string on the input folder_name = Prompt.ask("Please provide the folder name") to include a quick example, e.g. AK_1111739_hL_RSPd_monitor_front, as in the snippet below.
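i.e. something along the lines of (hedged, simply reusing the prompt string and the example folder name from this comment):

folder_name = Prompt.ask(
    "Please provide the folder name (e.g. AK_1111739_hL_RSPd_monitor_front)"
)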

@lauraporta
Member Author

Thank you Joe for the last feedback.

config files

I see your point; this happens because I explicitly chose not to track the config files, but yes, a kind of template should be present to initialise the program. I don't want to track .env files either, because by definition they should not be tracked.
I will solve this issue in a separate PR (related to #10).

@JoeZiminski
Member

> Thank you Joe for the last feedback.
>
> config files
>
> I see your point; this happens because I explicitly chose not to track the config files, but yes, a kind of template should be present to initialise the program. I don't want to track .env files either, because by definition they should not be tracked. I will solve this issue in a separate PR (related to #10).

Oh yes this is a very good point, maybe a file called example.env can be un-excluded (with this syntax)

@lauraporta lauraporta merged commit 31d6e74 into developement Jan 26, 2023
@lauraporta lauraporta deleted the load-data branch February 7, 2023 01:20