Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to get the baseline value for categorical variables? #169

Closed
grst opened this issue Dec 14, 2023 · 11 comments
Closed

Is there a way to get the baseline value for categorical variables? #169

grst opened this issue Dec 14, 2023 · 11 comments

Comments

@grst
Copy link

grst commented Dec 14, 2023

Hi,

I'm looking for a way to retreive the baseline value of categorical variables used in a formula, e.g.

import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

# (1) Here, I want to retrieve 'a' for condition
design = formulaic.model_matrix('~ condition', data)

# (2) Here, I want to retreive 'b' for condition
design = formulaic.model_matrix('~ C(condition, contr.treatment(base="b"))', data)

# (3) Here, I want to retreive 'b' for condition as well
data["condition"] = pd.Categorical(data["condition"], categories = ["b", "a"], dtype = "category")
design = formulaic.model_matrix('~ condition', data)

I found design.model_spec.encoder_state["condition"][1]['categories'], but it doesn't work in case (2).

Is there a way to achieve this?

Additional context

We are trying to build a "mini language" to specify contrasts for linear models, inspired by glmGamPoi, e.g.

contrast = model.cond(condition='a') - model.cond(condition='b')

In case of multiple variables or interaction terms, we would like to fill the contrast vector for variables that are not specified in cond with the value that refers to the baseline level. See also scverse/multi-condition-comparisions#15

CC @Zethson @const-ae

@matthewwardrop
Copy link
Owner

Hi @grst ,

Thanks for reaching out. It's an interesting question.

The encoder state has always been intended primarily as a way to ensure reproducibility, and not (as here) a way to introspect the state to then build out automation/tooling. The layout of this encoding state is also not guaranteed to remain fixed between major versions of Formulaic (it being considered an implementation detail); and so while changes at this point are relatively unlikely, you could find your code not working in the future. AND not all contrast encodings "leave one out" in a nice parsimonious way.

This is definitely an interesting problem, though. If you do not need support for arbitrary contrasts, you could override the C stateful transform with your own version that only supports dummy encoding, and that adds the base level to the encoder state. Would that work for you? That would side-step a lot of these issues. Otherwise, I can think a bit more about this.

Also, in case you were unaware, formulaic also offers tooling that could be used to generate contrast matrices (though it was not originally intended for that purpose). Likely it is not helpful for you here, since it uses a very different syntax, but... here you go anyway (following the conventions of https://matthewwardrop.github.io/formulaic/guides/contrasts/#how-does-contrast-coding-work):

>>> import numpy
>>> from formulaic.utils.constraints import LinearConstraints
>>> Z = LinearConstraints.from_spec("a, b-a, c-a", variable_names=["a", "b", "c"]).constraint_matrix
>>> C = numpy.linalg.inv(Z)
>>> Z, C
(array([[ 1.,  0.,  0.],
        [-1.,  1.,  0.],
        [-1.,  0.,  1.]]),
 array([[1., 0., 0.],
        [1., 1., 0.],
        [1., 0., 1.]]))

@matthewwardrop
Copy link
Owner

Given your emoticon response, I'll assume that the custom transform route best suits you... but feel free to reopen if you need more guidance or want me to revisit it!

@grst
Copy link
Author

grst commented Dec 26, 2023

Thanks for the detailed response and sorry for not responding earlier! I'll check out your suggestion with the custom transform class and let you know in case I have further questions :)

@grst
Copy link
Author

grst commented Dec 30, 2023

Hi @matthewwardrop,

I now have a follow-up question:

I got a PoC to work by passing the overridden C via context. This works well when the C is explicitly specified in the formula. However, this doesn't capture the variables that are implicitly treated as categorical (in this case condition).
Is there a way to override the default behavior as well?

from formulaic.transforms import C
import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

def custom_C(data, contrasts = None, *, levels = None, spans_intercept = True):
    print("variable", data.name)
    print("contrasts", contrasts)
    print("levels", levels)

    return C(data, contrasts, levels=levels, spans_intercept=spans_intercept)

formulaic.model_matrix("~condition + C(celltype, contr.treatment)", data=data, context = {"C": custom_C})
variable celltype
contrasts <class 'formulaic.transforms.contrasts.TreatmentContrasts'>
levels None
   Intercept  condition[T.b]  C(celltype, contr.treatment)[T.B]  C(celltype, contr.treatment)[T.C]
0        1.0               0                                  0                                  0
1        1.0               1                                  0                                  0
2        1.0               0                                  1                                  0
3        1.0               1                                  0                                  1
4        1.0               0                                  0                                  1
5        1.0               1                                  0                                  1

@grst
Copy link
Author

grst commented Jan 17, 2024

Hi @matthewwardrop, sorry for the renewed ping - I was just afraid my message got lost in the Christmas break.
Would be great to get your input on my last comment!

@matthewwardrop
Copy link
Owner

Hi @grst,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

Yeah, catching the default categorisation is a little trickier. You can see the default implementation for pandas materializers here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/pandas.py#L95C9-L95C28 . You would have to subclass this.

If this is going to be very useful to you, then it might be to others also... so I'd be willing to extend the transform state with information about the baseline too. The amount of redundant information is relatively small, so I'm not worried from that perspective. If you still want this, let me know and I can add it in.

@grst
Copy link
Author

grst commented Jun 20, 2024

Hi @matthewwardrop,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

I know what you are talking about, so no worries ;)

In the meanwhile, I found a solution that indeed involves subclassing the Python materializer. By hooking into _encode_evaled_factor and _flatten_encoded_evaled_factor, I was able to exfiltrate, amongst others, the drop_field, categories and base category:
https://github.com/scverse/pertpy/blob/74914a92b5ab49de76f47cde3f262ed8d87e2829/pertpy/tools/_differential_gene_expression/_formulaic.py#L94-L167

Maybe you could take a quick look and let me know if you could envisage any problems that might arise due to future updates of formulaic?

Of course it would be cool if there was a more straightforward solution, but I can also live with what I have now. At least it passes all my test cases.

@grst
Copy link
Author

grst commented Dec 2, 2024

Just for future reference, our code for capturing the factor metadata and building contrasts is now available as a standalone package: https://formulaic-contrasts.readthedocs.io/en/latest/index.html

@matthewwardrop
Copy link
Owner

Nice! In the current formulaic code, which will be released as 1.1 shortly, you can also potentially use ModelSpec.factor_contrasts (

def factor_contrasts(self) -> Dict[Factor, ContrastsState]:
). Also, I'm not sure if this change will break you or not... But it might?

@grst
Copy link
Author

grst commented Dec 2, 2024

Thanks for the heads-up! I'll test and check if I can simplify a few things with factor_contrasts then.

@matthewwardrop
Copy link
Owner

Actually, it shouldn't break you. It was a purely additive change (I'd forgotten already) :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants