Add required variables to the Formula class #179

Open
timpiperseek opened this issue Mar 7, 2024 · 6 comments

@timpiperseek

I would like to be able to do something like the following. Apologies, I am struggling to articulate exactly what I want, but effectively it is this.

Say I have the following formula.
apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)

I am wondering if it is possible to extract the RHS terms from the formula. By terms I mean ['prior_apps', 'factor'].

I have tried doing the following.

import formulaic.parser

formula_str = "apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)"
formula_parser = formulaic.parser.DefaultFormulaParser()
tokens = list(formula_parser.get_tokens(formula_str))

but that gets me the individual parts of the string and not the terms.

I feel like it should be possible?

@matthewwardrop
Owner

matthewwardrop commented Mar 8, 2024

Hi @timpiperseek ,

Does something like the following work?

from formulaic import Formula
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    factor
    for term in f.rhs
    for factor in term.factors
)
# This would output all the factors: {1, I(prior_apps ** 2), factor, prior_apps}

(Note that interaction terms should not be enclosed in "I(...)", since that is a Python function call).

If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here.
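
To illustrate what parsing a factor's AST could look like, here is a minimal sketch using only Python's standard `ast` module. The helper name `referenced_names` is hypothetical and not part of formulaic. It simply collects every name in the expression, so the function name `I` is picked up alongside `prior_apps` and would still need to be filtered out.

```python
import ast


def referenced_names(expr: str) -> set[str]:
    """Return every bare name referenced in a Python expression string.

    Hypothetical helper for illustration only; not formulaic's API.
    """
    tree = ast.parse(expr, mode="eval")
    # ast.walk visits every node; ast.Name nodes are identifiers such as
    # `I` or `prior_apps` (literals like 2 are ast.Constant and are skipped).
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


names = referenced_names("I(prior_apps ** 2)")
print(names)  # contains both 'I' and 'prior_apps'
```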

If you are actually just looking for the terms, you can use list(f.rhs), which gives [1, prior_apps, I(prior_apps ** 2), factor, prior_apps:factor].

Does that help?

@timpiperseek
Author

Yeah, that is really close to what I am after.

What do you mean by:

If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here.

Ideally it would also identify that prior_apps**2 uses the same underlying variable as prior_apps.

@matthewwardrop
Owner

Ah... Using some internal utility functions you can do:

from formulaic import Formula
from formulaic.utils.variables import get_expression_variables
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    variable
    for term in f.rhs
    for factor in term.factors
    for variable in get_expression_variables(factor.expr, {})
    if "value" in variable.roles
)
# Outputs: {'factor', 'prior_apps'}

Note that get_expression_variables parses the AST of the Python expression. It is used internally to keep track of which variables were used when generating the model matrix.
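
The role filter above can be illustrated with a standard-library sketch. This is not formulaic's implementation: `tag_variable_roles` is a hypothetical helper, and the role label "callable" is an assumption chosen to contrast with the "value" role used in the snippet above. It tags each name by how it is used, so that function names such as `I` can be dropped.

```python
import ast
from collections import defaultdict


def tag_variable_roles(expr: str) -> dict[str, set[str]]:
    """Map each name in a Python expression to the roles it plays.

    Hypothetical illustration only. It handles calls on bare names
    (e.g. `I(...)`); for attribute calls such as `np.log(x)`, the
    root name `np` is tagged as a value.
    """
    tree = ast.parse(expr, mode="eval")
    # Name nodes that appear directly in call position, e.g. the `I` in I(...).
    call_targets = {
        node.func
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    roles: defaultdict[str, set[str]] = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            roles[node.id].add("callable" if node in call_targets else "value")
    return dict(roles)


roles = tag_variable_roles("I(prior_apps ** 2)")
# Keep only names that are used as data, mirroring the "value" filter.
values = {name for name, name_roles in roles.items() if "value" in name_roles}
print(values)
```

A name can carry both roles if it is called in one place and passed as data in another, which is why the roles are stored as a set.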

@timpiperseek
Author

Oh that is absolutely awesome, thank you.

@matthewwardrop
Owner

I'll consider adding this directly to the formula class as something like .required_variables.

matthewwardrop self-assigned this Mar 11, 2024
matthewwardrop added the enhancement (New feature or request) label Mar 11, 2024
matthewwardrop changed the title from "is it possible to extract terms without providing a dataframe" to "Add required variables to the Formula class" Mar 11, 2024
@mayer79

mayer79 commented Jun 23, 2024

This would indeed be very handy, thx.
