Add support for nested formulae (useful e.g. in IV contexts). #108

Open
wants to merge 2 commits into
base: main

matthewwardrop
Owner

@matthewwardrop matthewwardrop commented Sep 25, 2022

Hi @bashtage,

Pursuant to #24, I did a quick draft of additional support for IV-like formulae in formulaic (in addition to the multi-part formulae that were already implemented). There are some bugs and rough edges, but would you mind taking a look and offering any suggestions? I'm also not sure whether this should be a plugin or part of the default stack, so your thoughts there would be helpful too. All naming etc. is in draft status, so feel free to suggest improvements there.

Suppose you wanted to model some data using IV. With these patches you could write:

>>> from formulaic import Formula
>>> Formula("y ~ x1 + x2 + [ x3 + x4 ~ z1 + z2]")
.lhs:
    y
.rhs:
    root:
        1 + x1 + x2 + x3_hat + x4_hat
    .deps:
        [0]:
            .lhs:
                x3 + x4
            .rhs:
                1 + z1 + z2

The consumer of the formula can then walk the resulting structure and handle each stage appropriately.
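As a rough sketch of what "walking the structure" could look like on the consumer side: the `Stage` class and `iter_stages` helper below are hypothetical stand-ins that just mirror the printed `.lhs` / `.rhs` / `.deps` layout. They are not part of formulaic's API, and the real objects in this draft may well look different.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Stage:
    """Hypothetical stand-in for one level of a nested formula (not formulaic's API)."""

    lhs: List[str]  # terms being modelled at this level
    rhs: List[str]  # terms of the `root` right-hand side
    deps: List["Stage"] = field(default_factory=list)  # nested [ ... ~ ... ] parts


def iter_stages(stage: Stage, depth: int = 0) -> Iterator[Tuple[int, Stage]]:
    """Yield stages depth-first, each dependency before the stage that needs it."""
    for dep in stage.deps:
        yield from iter_stages(dep, depth + 1)
    yield depth, stage


# Mirrors: y ~ x1 + x2 + [ x3 + x4 ~ z1 + z2 ]
formula = Stage(
    lhs=["y"],
    rhs=["1", "x1", "x2", "x3_hat", "x4_hat"],
    deps=[Stage(lhs=["x3", "x4"], rhs=["1", "z1", "z2"])],
)

order = [
    (depth, " + ".join(s.lhs), " + ".join(s.rhs))
    for depth, s in iter_stages(formula)
]
for depth, lhs, rhs in order:
    print(f"depth {depth}: {lhs} ~ {rhs}")
```

The first stage (`x3 + x4 ~ 1 + z1 + z2`) comes out before the outer model, which is the order an IV estimator would need to fit them in.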

If the nested part contains an interaction term, or is later multiplied by another term, formulaic still does the right thing:

>>> Formula("y ~ x0 + [ x1:x2 ~ z1 + z2 ] : x3")
.lhs:
    y
.rhs:
    root:
        1 + x0 + x1:x2_hat:x3
    .deps:
        [0]:
            .lhs:
                x1:x2
            .rhs:
                1 + z1 + z2

Here x1:x2_hat is treated as a single factor and is looked up by name.

Note that, with a small amount of effort, this could also be used for double ML (if we add a delta transform/operator), and for more general things like:

>>> Formula("y ~ x1 + x2 + [ x2 + x3 ~ z1 + z2 ] + [ x4 ~ z3 + [z4 ~ a1 + a2 ] ]")
.lhs:
    y
.rhs:
    root:
        1 + x1 + x2 + x2_hat + x3_hat + x4_hat
    .deps:
        [0]:
            .lhs:
                x2 + x3
            .rhs:
                1 + z1 + z2
        [1]:
            .lhs:
                x4
            .rhs:
                root:
                    1 + z3 + z4_hat
                .deps:
                    [0]:
                        .lhs:
                            z4
                        .rhs:
                            1 + a1 + a2

Though this does strain credulity a bit.

Lastly, I plan to add some utility methods to formulaic that allow easy recursive iteration over the formula, to help with evaluating dependencies and updating the dataframe as you go up the tree. This might even be integrated into the high-level tooling, if so desired, with the user passing a dep_data_resolver hook of some description.
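To give a flavour of the kind of hook I have in mind, here is a minimal sketch. Everything in it is an assumption for illustration: the nested formula is represented as plain dicts, the `resolve` function and the resolver signature `(lhs, rhs, data) -> new columns` are hypothetical, and `mean_resolver` is a toy that stands in for actually fitting `lhs ~ rhs`.

```python
from typing import Callable, Dict, List

Column = List[float]
Data = Dict[str, Column]
# Hypothetical hook signature: (lhs terms, rhs terms, data) -> fitted columns.
Resolver = Callable[[List[str], List[str], Data], Data]


def resolve(formula: dict, data: Data, dep_data_resolver: Resolver) -> Data:
    """Return a copy of `data` extended with every `<term>_hat` column needed."""
    data = dict(data)  # shallow copy so the caller's mapping is left untouched
    for dep in formula.get("deps", []):
        # Resolve the dependency's own dependencies first (bottom of the tree up).
        data = resolve(dep, data, dep_data_resolver)
        data.update(dep_data_resolver(dep["lhs"], dep["rhs"], data))
    return data


def mean_resolver(lhs: List[str], rhs: List[str], data: Data) -> Data:
    """Toy resolver: 'predicts' each lhs term by its mean instead of fitting a model."""
    missing = [t for t in rhs if t != "1" and t not in data]
    if missing:
        raise KeyError(f"unresolved terms: {missing}")
    out = {}
    for term in lhs:
        mean = sum(data[term]) / len(data[term])
        out[f"{term}_hat"] = [mean] * len(data[term])
    return out


# Mirrors: y ~ [ x4 ~ z3 + [ z4 ~ a1 + a2 ] ]
formula = {
    "lhs": ["y"],
    "rhs": ["1", "x4_hat"],
    "deps": [
        {
            "lhs": ["x4"],
            "rhs": ["1", "z3", "z4_hat"],
            "deps": [{"lhs": ["z4"], "rhs": ["1", "a1", "a2"]}],
        }
    ],
}
data = {
    "y": [1.0, 2.0],
    "x4": [2.0, 4.0],
    "z3": [0.5, 1.5],
    "z4": [1.0, 3.0],
    "a1": [0.0, 1.0],
    "a2": [1.0, 0.0],
}

resolved = resolve(formula, data, mean_resolver)
print(sorted(set(resolved) - set(data)))
```

Because the innermost dependency is resolved first, `z4_hat` already exists by the time the `x4` stage asks for it, and the outer model then only needs `x4_hat` to be present by name.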

closes: #24

@codecov

codecov bot commented Oct 5, 2022

Codecov Report

Attention: Patch coverage is 45.45455%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 99.75%. Comparing base (c064ed3) to head (891c31a).
Report is 3 commits behind head on main.

❗ Current head 891c31a differs from pull request most recent head 5b88650. Consider uploading reports for the commit 5b88650 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| formulaic/parser/parser.py | 14.28% | 6 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #108      +/-   ##
===========================================
- Coverage   100.00%   99.75%   -0.25%     
===========================================
  Files           53       39      -14     
  Lines         2850     2425     -425     
===========================================
- Hits          2850     2419     -431     
- Misses           0        6       +6     
| Flag | Coverage Δ |
|---|---|
| unittests | 99.75% <45.45%> (-0.25%) ⬇️ |


@matthewwardrop
Owner Author

@bashtage Any thoughts on this before it gets merged?

@matthewwardrop
Owner Author

@s3alfisc: I just saw your project @ https://github.com/s3alfisc/pyfixest to implement fixest for Python. That looks awesome. I had some internal work that did IV based on this PR, but I was wondering whether you would be interested in having this support too?

@s3alfisc

Hi Matthew - yes, I'd definitely be interested in that! Right now I do a lot of string parsing to get the two formulas for the first and second stage, and call 'model_matrix' twice. Likely not very efficient and clearly not too elegant, but it works =) Please let me know if I can be of any help in testing & debugging this PR!

@bashtage
Contributor

The syntax looks good to me. I will definitely switch from my own so-so parser to this.

@matthewwardrop
Owner Author

Thanks for buying in, @bashtage and @s3alfisc. It's about time I got this in. I'll rebase it on the latest codebase and let you know when it is ready for you to test.

Development

Successfully merging this pull request may close these issues.

Add support for multi-stage formulas.
3 participants