This repository has been archived by the owner on Oct 30, 2024. It is now read-only.

Refactor PhotonData and FrequencyResponsiveness #37

Open
lauraporta opened this issue Mar 29, 2023 · 22 comments

@lauraporta

From Joe:

For these variables, I get IDE-style warnings that all class attributes should be defined in `__init__` (i.e. initialised as None). Unfortunately, the IDE complains even if a type hint is put there rather than an initialisation. It still seems worth doing, even though it's annoying.

Related to this: this function contains routines that make assumptions based on the experimental setup (e.g. baseline before and after triggers), such as `(self.n_all_triggers - 2 * self.n_session_boundary_baseline_triggers) * self.n_sessions`. Maybe this function could be factored out somewhere, since it may change depending on the experimental setup (e.g. there could be multiple versions). It could then return all these variables in `__init__`, where they are saved as class attributes.

Related to #34
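
A minimal sketch of the suggested shape, using hypothetical attribute and helper names that mirror the ones mentioned above (not the actual PhotonData implementation):

```python
class PhotonData:
    def __init__(self, n_all_triggers, n_session_boundary_baseline_triggers, n_sessions):
        self.n_all_triggers = n_all_triggers
        self.n_session_boundary_baseline_triggers = n_session_boundary_baseline_triggers
        self.n_sessions = n_sessions

        # derived attributes are declared in __init__, so the IDE sees them all here
        (
            self.n_stimulus_triggers_per_session,
            self.n_stimulus_triggers_across_all_sessions,
        ) = self._compute_trigger_counts()

    def _compute_trigger_counts(self):
        # setup-dependent assumptions (baseline triggers before and after each
        # session) are isolated here, so alternative experimental setups can
        # swap in a different version of this helper
        per_session = int(self.n_all_triggers - 2 * self.n_session_boundary_baseline_triggers)
        return per_session, per_session * self.n_sessions
```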

@lauraporta added the bug (Something isn't working) label on Mar 29, 2023
@lauraporta

More from Joe:

(If my understanding is correct, each element of `frames` is a session.) Just checking: can it always be assumed that the number of frames will be the same across sessions (e.g. if one day things go wrong halfway through a session)?

@lauraporta commented Mar 29, 2023

From Joe:
I'm not sure, but it might be worth doing a sanity check here (assuming `n_triggers_per_stimulus` should always divide evenly into `n_stimulus_triggers_across_all_sessions`), as if it doesn't but should, that information will be lost on conversion to int, e.g.

    if n_stimulus_triggers_across_all_sessions % n_triggers_per_stimulus != 0:
        raise ValueError("stimulus triggers do not divide evenly across sessions")

@lauraporta

fps -> sampling_rate ?

@lauraporta

Could this be combined with the above? e.g.

    self.n_stimulus_triggers_per_session = int(self.n_all_triggers - 2 * self.n_session_boundary_baseline_triggers)
    self.n_stimulus_triggers_across_all_sessions = self.n_stimulus_triggers_per_session * self.n_sessions

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

Maybe for readability, some of these could be moved to the above function (e.g. `n_of_stimuli_per_session = ...` could go under `n_of_stimuli_across_all_sessions = ...`). Then this function is strictly related to the calculation of baseline / stimulus frame indexes.

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

This is a nice function; it might be worth documenting the operation `frames_all_sessions = (frames_in_session.reshape(-1, 1) + np.arange(0, self.n_sessions) * self.n_frames_per_session)` in the docstring, just to say what it does and the matrix dimensions you end up with.

Originally posted by @JoeZiminski in #25 (comment)
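
For illustration, a small standalone sketch of that broadcast with made-up numbers:

```python
import numpy as np

n_sessions = 3
n_frames_per_session = 100
frames_in_session = np.array([5, 12, 40])  # hypothetical frame indexes within one session

# reshape(-1, 1) turns the frame indexes into a column vector, so adding the
# per-session offsets broadcasts into an (n_frames, n_sessions) matrix
frames_all_sessions = (
    frames_in_session.reshape(-1, 1) + np.arange(0, n_sessions) * n_frames_per_session
)

print(frames_all_sessions.shape)  # (3, 3): rows are frames, columns are sessions
print(frames_all_sessions)
# [[  5 105 205]
#  [ 12 112 212]
#  [ 40 140 240]]
```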

@lauraporta

This data frame organisation is very neat and informative, and all the information is clear and explicit. The cost of this is some repetition of information: for example, `day` (which I guess would be the same for the dataset, or at least for individual sessions) goes from a scalar to an `n_frames_per_session * n_sessions` array, and `frames_id` is expanded to `frames_id_num * num_roi`. This isn't too bad in this case, but if the number of ROIs scales to 50 or 100, there might be some performance issues. I think it will also make some operations more computationally intensive: for example, with an `n_roi x n_sample` matrix (e.g. one of these matrices per session), I believe you can calculate the mean of the baseline periods per row with `np.mean(x[:, baseline_idx], axis=1)`. Indexing these values out of one long array with repeated indexes will be slower than indexing across a 2D array.

There is always a struggle between well organised, explicitly labelled data and performance gains / memory usage. If performance does become an issue, a nested dictionary would work, where the top level has info for the entire experiment and each session holds the information for all data in that session, e.g. `{"day": day, "sessions": [{"signal": roi_by_n_frames_matrix, "time": time_array, "frames_id": ..., "sf_array": ...}, ...]}`. The benefit of this from a computational standpoint is that there is no duplication of data, and operations can be done over the ROI x time matrix per session. In general, with multiple data streams that share a time axis (e.g. ROIs within a session in this case), it is ideal to store them in a single 2D matrix (streams along one axis, time along the other). I think this suggestion is halfway between how the data originally came and its very organised state now. If computational cost does not become an issue then this is okay, but it is good to bear in mind in case of future scaling to more ROIs.

Originally posted by @JoeZiminski in #25 (comment)
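
A rough sketch of that nested-dictionary layout and the per-ROI baseline mean, with illustrative names and shapes (not the actual data):

```python
import numpy as np

n_roi, n_frames_per_session = 5, 1000
rng = np.random.default_rng(0)

# shared metadata stored once at the top level; one ROI x frames matrix per session
experiment = {
    "day": "2023-03-29",
    "sessions": [
        {
            "signal": rng.normal(size=(n_roi, n_frames_per_session)),
            "time": np.arange(n_frames_per_session) / 30.0,  # assuming 30 Hz sampling
            "frames_id": np.arange(n_frames_per_session),
        }
        for _ in range(2)
    ],
}

# mean over the baseline period for every ROI at once, one row per ROI
baseline_idx = np.arange(0, 100)
session = experiment["sessions"][0]
baseline_means = np.mean(session["signal"][:, baseline_idx], axis=1)  # shape (n_roi,)
```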

@lauraporta

Not 100% necessary, but it can be useful to initialise empty arrays as NaN; that way it is easier to find bugs / errors if for some reason they don't get fully filled as expected.

Originally posted by @JoeZiminski in #25 (comment)
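
e.g. a minimal sketch (array name and shape are made up):

```python
import numpy as np

n_roi, n_trials = 5, 10

# np.full with NaN makes "never filled" entries stand out, unlike np.empty or np.zeros
responses = np.full((n_roi, n_trials), np.nan)

# ... fill `responses` ...

# cheap check afterwards that everything really was filled
if np.isnan(responses).any():
    raise ValueError("some entries of `responses` were never filled")
```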

@lauraporta

If possible, use the numpy `.size` attribute rather than `len()` for numpy arrays, for consistency (easier said than done though; I never do this consistently).

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

          it might be worth casting these to int if they are only used as indexes

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

As the logic is the same here for the response and baseline windows (i.e. index the corresponding mask, initialise an empty array, fill it with values), this could be factored out into a function to reduce code repetition. But it's not strictly necessary if it ends up making the flow a lot harder to follow.

Originally posted by @JoeZiminski in #25 (comment)
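
A hedged sketch of what that shared helper could look like; the signal/mask layout here is an assumption, not the actual implementation:

```python
import numpy as np

def extract_window(signal: np.ndarray, window_mask: np.ndarray) -> np.ndarray:
    """Select the frames flagged by `window_mask` from an (n_roi, n_frames) signal.

    The same routine serves both the response and the baseline windows:
    build the index mask, initialise an empty (NaN) array, fill it with values.
    """
    n_roi = signal.shape[0]
    window = np.full((n_roi, int(window_mask.sum())), np.nan)
    window[:] = signal[:, window_mask]
    return window

# usage: response_window = extract_window(signal, response_mask)
#        baseline_window = extract_window(signal, baseline_mask)
```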

@lauraporta

Here you could specify the scope for the fixture: if it is reused across all tests you can use `scope="session"` (I think) and it will only be called once and then re-used for all tests. Alternatively, with `scope="function"` it will be called on each test run, which is useful if you want the fixture torn down and re-created for each test. Here I think `scope="session"` will work, but as the overhead of this function is so low it doesn't really matter. It is, however, useful for fixtures that are heavy to set up but are never changed in the tests.

Originally posted by @JoeZiminski in #25 (comment)
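
e.g. a minimal pytest sketch of the two scopes (fixture names are invented):

```python
import pytest

@pytest.fixture(scope="session")
def expensive_setup():
    # built once and re-used by every test in the whole run
    return list(range(1_000_000))

@pytest.fixture(scope="function")
def fresh_state():
    # re-created (and torn down) for each individual test
    return {"counter": 0}

def test_uses_both(expensive_setup, fresh_state):
    fresh_state["counter"] += 1
    assert len(expensive_setup) == 1_000_000
```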

@lauraporta

This could be changed to `yield` for consistency (unless I am missing something). Technically, I think `yield` is only needed if there is teardown code after it. You could use `return` when there is no teardown and `yield` when there is, or just use `yield` consistently for stylistic reasons.

Originally posted by @JoeZiminski in #25 (comment)
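
A small sketch of the two styles (the fixtures are invented for illustration):

```python
import pytest

@pytest.fixture
def plain_config():
    # no teardown needed, so a plain return is enough
    return {"use-allen-dff": True}

@pytest.fixture
def open_file(tmp_path):
    f = (tmp_path / "data.txt").open("w")
    yield f      # the test runs here
    f.close()    # teardown after the yield, runs once the test finishes
```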

@lauraporta

The abbreviation (e.g. `p_sign_test`) could be expanded here and within the function to be super explicit. (Oops, I think I meant this for the `perform_sign_tests()` function itself, but the same applies.)

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

I just noticed that (I think) this is loaded from the .yaml as a string (e.g. `config["use-allen-dff"]` ends up as the string `'true'`), so the check will always pass: e.g. `if 'false': do_x()` will still do x, because any non-empty string is truthy.

Originally posted by @JoeZiminski in #25 (comment)
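
A minimal sketch of the pitfall, assuming the value comes out of the .yaml quoted as a string:

```python
import yaml

# a quoted value is loaded as a string, not as a boolean
config = yaml.safe_load("use-allen-dff: 'false'")
print(type(config["use-allen-dff"]))  # <class 'str'>

if config["use-allen-dff"]:
    # this branch runs even for 'false': any non-empty string is truthy
    print("treated as true")

# one explicit fix: compare against the string (or store an unquoted YAML boolean)
use_allen_dff = str(config["use-allen-dff"]).lower() == "true"
```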

@lauraporta

          this can also raise if the list is empty

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

This is a small point, but I believe the tests are all currently interlinked, e.g. create the raw data object (is tested) -> from this make PhotonData (is tested) -> from this make the frequency results. This makes a lot of sense but creates some interdependency between tests (e.g. if something breaks in PhotonData it will break all tests after this). You could consider splitting this interdependency by creating a completely new fake `photondata.signal` table to feed into FrequencyAnalysis. However, this would be more work, and in practice if PhotonData does break you'd just fix it and re-run the tests. Just something to note.

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta changed the title from Refactor PhotonData to Refactor PhotonData and FrequencyResponsiveness on Mar 29, 2023
@lauraporta

Would it also be possible / does it also make sense to check all variables, and all arrays element by element? Is it possible that the `len()` might be the same but somehow the actual values are different?

Originally posted by @JoeZiminski in #25 (comment)
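
e.g. comparing values and not just lengths (a sketch with made-up arrays):

```python
import numpy as np

expected = np.array([1.0, 2.0, 3.0])
result = np.array([1.0, 2.0, 3.0])

assert result.size == expected.size                      # same length is not enough...
np.testing.assert_array_equal(result, expected)          # ...also check every element
np.testing.assert_allclose(result, expected, atol=1e-9)  # or match within a tolerance
```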

@lauraporta

I think from this point onwards this could be refactored to the below and be equivalent (in other tests too):

    p_values = list(p_values.values())
    p_values_seed_101 = np.array([0.055, 0.473, 0.324, 0.127, 0.653])
    assert np.allclose(p_values, p_values_seed_101, rtol=0, atol=1e-03)

This `np.fromiter` is very cool, I didn't know about that.

Originally posted by @JoeZiminski in #25 (comment)

@lauraporta

A quick doc on how these test values were generated might be useful. It might also be worth considering generating test data and results on the fly, as well as having fixed values. The benefit of this is that you are not tied to the random seed and can generate random values / arbitrary ranges of input data to interrogate the software a little more and possibly detect edge cases. The downside is that you have to code the same thing twice (maybe in a different way), and you are always liable to make the same error twice. Maybe a mix between hard-coded test values and some randomly generated inputs is the best of both worlds.

Originally posted by @JoeZiminski in #25 (comment)
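
A rough sketch of mixing fixed and randomly generated test values; the analysis function here is a placeholder, not part of the package:

```python
import numpy as np
import pytest

def count_above_baseline(signal, baseline):
    # placeholder standing in for the real analysis under test
    return int(np.sum(signal > baseline))

def test_fixed_values():
    # hard-coded regression check against known inputs and outputs
    assert count_above_baseline(np.array([0.1, 0.9, 1.2]), 0.5) == 2

@pytest.mark.parametrize("seed", range(5))
def test_random_inputs(seed):
    # randomly generated inputs: check invariants rather than exact numbers
    rng = np.random.default_rng(seed)
    signal = rng.normal(size=100)
    result = count_above_baseline(signal, 0.0)
    assert 0 <= result <= signal.size
```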

@lauraporta

Overall these tests are very nice and robust!! Is it also feasible to somehow test against their original analysis pipeline? e.g. run their pipeline with some .mat input data, get their p-values etc., then make sure all results from this pipeline match. That way you can be 100% sure the implementations yield identical outputs (or at least very similar, to within some known tolerance).

Originally posted by @JoeZiminski in #25 (comment)
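
A sketch of that cross-check, assuming the original pipeline's p-values were exported to a hypothetical original_pipeline_results.mat with a p_values variable:

```python
import numpy as np
from scipy.io import loadmat

def compare_to_original_pipeline(our_p_values, mat_path="original_pipeline_results.mat"):
    # p-values exported by the original MATLAB analysis (hypothetical file/variable names)
    reference = loadmat(mat_path)
    reference_p_values = reference["p_values"].squeeze()

    # identical inputs should give identical outputs, up to a known tolerance
    np.testing.assert_allclose(our_p_values, reference_p_values, rtol=0, atol=1e-3)
```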

@lauraporta self-assigned this on Mar 29, 2023
@lauraporta

Check also all the comments in the PR review #49
