Skip to content
This repository has been archived by the owner on May 27, 2024. It is now read-only.

Solve model.cond with custom materializer #36

Merged
merged 38 commits into from
May 18, 2024
Merged

Conversation

grst
Copy link
Contributor

@grst grst commented Feb 26, 2024

Trying to solve #15 using a custom formulaic materializer.

  • wait for Refactor API / Wilcox test #33
  • Testcases:
    • ~ A
    • ~ A + B
    • ~ A * B
    • ~ A + C(A) + C(A, contr.treatment(base=x))
    • ~ 0 + A
    • ~ C(A, contr.treatment(base=x))
    • ~ C(A, contr.sum)
    • ~ A + continuous + np.log(continuous)
    • ~ A * C(B, contr.treatment(base='y'))

@grst grst mentioned this pull request Apr 1, 2024
5 tasks
@codecov-commenter
Copy link

codecov-commenter commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 94.00000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 70.66%. Comparing base (7610d50) to head (9d8add0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
+ Coverage   67.67%   70.66%   +2.98%     
==========================================
  Files          12       14       +2     
  Lines         495      559      +64     
==========================================
+ Hits          335      395      +60     
- Misses        160      164       +4     
Files Coverage Δ
src/multi_condition_comparisions/_util/__init__.py 100.00% <100.00%> (ø)
src/multi_condition_comparisions/_util/checks.py 100.00% <ø> (ø)
...lti_condition_comparisions/methods/_statsmodels.py 100.00% <ø> (ø)
...rc/multi_condition_comparisions/_util/formulaic.py 98.38% <98.38%> (ø)
src/multi_condition_comparisions/methods/_edger.py 87.09% <75.00%> (-0.79%) ⬇️
.../multi_condition_comparisions/methods/_pydeseq2.py 93.33% <80.00%> (-1.79%) ⬇️
src/multi_condition_comparisions/methods/_base.py 86.00% <88.88%> (-2.00%) ⬇️

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst grst marked this pull request as ready for review April 12, 2024 10:33
@grst grst requested review from Zethson, const-ae and ilan-gold April 12, 2024 10:33
@grst
Copy link
Contributor Author

grst commented Apr 12, 2024

Finally made it! As usual with these things it was more complicated than I anticipated, but it seems to work quite robustly now.

High-level description:

  • Formulaic uses a "Materializer" class that converts the model specification into a design matrix
  • It is possible and supported to make a custom materializer that extends the base class provided by formulaic (intended to generate something different than a pandas data frame)
  • I use a custom materializer to hook into the materializer functions to extract factor metadata that we later need for bulding contrast vectors with model.cond.
  • model.cond uses this information to check categories and to fill in default values. There might be situations in which the default value can't be inferred, in that case we raise an error.

@const-ae, would be great if you could have a look, too. In particular, if the test cases are correct and if there's any additional (more complex?) model you think that should be tested.

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @grst! This is a loooooot of work (although I have the suspicion that you enjoyed it :) )

I went through the PR but quickly noticed that it's a bit over my head at the moment. I'd need more headspace to give this a review that would deserve the name "review". I might give this another pass later, but I'm sure that the other reviews will have more useful comments.

src/multi_condition_comparisions/methods/_base.py Outdated Show resolved Hide resolved
src/multi_condition_comparisions/_util/formulaic.py Outdated Show resolved Hide resolved
@grst grst mentioned this pull request Apr 15, 2024
Copy link
Collaborator

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble following exactly what is going on here or what is solved...I will need to look back. I've never done much with formulaic so still a little bit dazzled....

factor_storage: dict[str, list[FactorMetadata]] = defaultdict(list)
variable_to_factors: dict[str, set[str]] = defaultdict(set)

class CustomPandasMaterializer(PandasMaterializer):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the class declaration inside of another class? This makes reasoning about what goes on a bit challenging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because each class needs to be tied to one specific factor_storage object. The class gets instantiated by formulaic, therefore we can only pass a class, rather than an instance.

Each class is used only for one formulaic formula and stores the formula-specific metadata in the factor_storage object that is tied to it.

src/multi_condition_comparisions/methods/_base.py Outdated Show resolved Hide resolved
@grst
Copy link
Contributor Author

grst commented Apr 16, 2024

I can understand that it all looks a bit daunting. I suggest to start reading at def cond(), this is the actual user-facing function to generate contrast vectors. All the other fuzz is about finding the default categories of each categorical variable specified in the model.

@Zethson Zethson merged commit 8c4ac7a into main May 18, 2024
5 checks passed
@grst grst mentioned this pull request May 27, 2024
2 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants