Is there a way to get the baseline value for categorical variables? #169

grst · 2023-12-14T10:40:41Z

Hi,

I'm looking for a way to retreive the baseline value of categorical variables used in a formula, e.g.

import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

# (1) Here, I want to retrieve 'a' for condition
design = formulaic.model_matrix('~ condition', data)

# (2) Here, I want to retreive 'b' for condition
design = formulaic.model_matrix('~ C(condition, contr.treatment(base="b"))', data)

# (3) Here, I want to retreive 'b' for condition as well
data["condition"] = pd.Categorical(data["condition"], categories = ["b", "a"], dtype = "category")
design = formulaic.model_matrix('~ condition', data)

I found design.model_spec.encoder_state["condition"][1]['categories'], but it doesn't work in case (2).

Is there a way to achieve this?

Additional context

We are trying to build a "mini language" to specify contrasts for linear models, inspired by glmGamPoi, e.g.

contrast = model.cond(condition='a') - model.cond(condition='b')

In case of multiple variables or interaction terms, we would like to fill the contrast vector for variables that are not specified in cond with the value that refers to the baseline level. See also scverse/multi-condition-comparisions#15

CC @Zethson @const-ae

The text was updated successfully, but these errors were encountered:

matthewwardrop · 2023-12-20T17:46:02Z

Hi @grst ,

Thanks for reaching out. It's an interesting question.

The encoder state has always been intended primarily as a way to ensure reproducibility, and not (as here) a way to introspect the state to then build out automation/tooling. The layout of this encoding state is also not guaranteed to remain fixed between major versions of Formulaic (it being considered an implementation detail); and so while changes at this point are relatively unlikely, you could find your code not working in the future. AND not all contrast encodings "leave one out" in a nice parsimonious way.

This is definitely an interesting problem, though. If you do not need support for arbitrary contrasts, you could override the C stateful transform with your own version that only supports dummy encoding, and that adds the base level to the encoder state. Would that work for you? That would side-step a lot of these issues. Otherwise, I can think a bit more about this.

Also, in case you were unaware, formulaic also offers tooling that could be used to generate contrast matrices (though it was not originally intended for that purpose). Likely it is not helpful for you here, since it uses a very different syntax, but... here you go anyway (following the conventions of https://matthewwardrop.github.io/formulaic/guides/contrasts/#how-does-contrast-coding-work):

>>> import numpy
>>> from formulaic.utils.constraints import LinearConstraints
>>> Z = LinearConstraints.from_spec("a, b-a, c-a", variable_names=["a", "b", "c"]).constraint_matrix
>>> C = numpy.linalg.inv(Z)
>>> Z, C
(array([[ 1.,  0.,  0.],
        [-1.,  1.,  0.],
        [-1.,  0.,  1.]]),
 array([[1., 0., 0.],
        [1., 1., 0.],
        [1., 0., 1.]]))

matthewwardrop · 2023-12-22T21:07:06Z

Given your emoticon response, I'll assume that the custom transform route best suits you... but feel free to reopen if you need more guidance or want me to revisit it!

grst · 2023-12-26T19:48:55Z

Thanks for the detailed response and sorry for not responding earlier! I'll check out your suggestion with the custom transform class and let you know in case I have further questions :)

grst · 2023-12-30T14:04:31Z

Hi @matthewwardrop,

I now have a follow-up question:

I got a PoC to work by passing the overridden C via context. This works well when the C is explicitly specified in the formula. However, this doesn't capture the variables that are implicitly treated as categorical (in this case condition).
Is there a way to override the default behavior as well?

from formulaic.transforms import C
import formulaic
import pandas as pd

data = pd.DataFrame({"celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6]})

def custom_C(data, contrasts = None, *, levels = None, spans_intercept = True):
    print("variable", data.name)
    print("contrasts", contrasts)
    print("levels", levels)

    return C(data, contrasts, levels=levels, spans_intercept=spans_intercept)

formulaic.model_matrix("~condition + C(celltype, contr.treatment)", data=data, context = {"C": custom_C})

variable celltype
contrasts <class 'formulaic.transforms.contrasts.TreatmentContrasts'>
levels None
   Intercept  condition[T.b]  C(celltype, contr.treatment)[T.B]  C(celltype, contr.treatment)[T.C]
0        1.0               0                                  0                                  0
1        1.0               1                                  0                                  0
2        1.0               0                                  1                                  0
3        1.0               1                                  0                                  1
4        1.0               0                                  0                                  1
5        1.0               1                                  0                                  1

grst · 2024-01-17T14:57:19Z

Hi @matthewwardrop, sorry for the renewed ping - I was just afraid my message got lost in the Christmas break.
Would be great to get your input on my last comment!

matthewwardrop · 2024-06-14T06:20:00Z

Hi @grst,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

Yeah, catching the default categorisation is a little trickier. You can see the default implementation for pandas materializers here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/pandas.py#L95C9-L95C28 . You would have to subclass this.

If this is going to be very useful to you, then it might be to others also... so I'd be willing to extend the transform state with information about the baseline too. The amount of redundant information is relatively small, so I'm not worried from that perspective. If you still want this, let me know and I can add it in.

grst · 2024-06-20T19:07:28Z

Hi @matthewwardrop,

Apologies for the delay (I feel I'm always so apologising)... life is and remains pretty busy!

I know what you are talking about, so no worries ;)

In the meanwhile, I found a solution that indeed involves subclassing the Python materializer. By hooking into _encode_evaled_factor and _flatten_encoded_evaled_factor, I was able to exfiltrate, amongst others, the drop_field, categories and base category:
https://github.com/scverse/pertpy/blob/74914a92b5ab49de76f47cde3f262ed8d87e2829/pertpy/tools/_differential_gene_expression/_formulaic.py#L94-L167

Maybe you could take a quick look and let me know if you could envisage any problems that might arise due to future updates of formulaic?

Of course it would be cool if there was a more straightforward solution, but I can also live with what I have now. At least it passes all my test cases.

grst · 2024-12-02T10:56:51Z

Just for future reference, our code for capturing the factor metadata and building contrasts is now available as a standalone package: https://formulaic-contrasts.readthedocs.io/en/latest/index.html

matthewwardrop · 2024-12-02T15:57:17Z

Nice! In the current formulaic code, which will be released as 1.1 shortly, you can also potentially use ModelSpec.factor_contrasts (

formulaic/formulaic/model_spec.py

Line 334 in 28fd9ec

def factor_contrasts(self) -> Dict[Factor, ContrastsState]:

). Also, I'm not sure if this change will break you or not... But it might?

grst · 2024-12-02T16:00:26Z

Thanks for the heads-up! I'll test and check if I can simplify a few things with factor_contrasts then.

matthewwardrop · 2024-12-02T16:26:31Z

Actually, it shouldn't break you. It was a purely additive change (I'd forgotten already) :).

matthewwardrop closed this as completed Dec 22, 2023

This was referenced Dec 2, 2024

Test against formulaic 1.1 scverse/formulaic-contrasts#9

Closed

Investigate if code can be simplified by using factor_contrasts scverse/formulaic-contrasts#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way to get the baseline value for categorical variables? #169

Is there a way to get the baseline value for categorical variables? #169

grst commented Dec 14, 2023

matthewwardrop commented Dec 20, 2023

matthewwardrop commented Dec 22, 2023

grst commented Dec 26, 2023

grst commented Dec 30, 2023

grst commented Jan 17, 2024

matthewwardrop commented Jun 14, 2024

grst commented Jun 20, 2024

grst commented Dec 2, 2024

matthewwardrop commented Dec 2, 2024

grst commented Dec 2, 2024

matthewwardrop commented Dec 2, 2024

Is there a way to get the baseline value for categorical variables? #169

Is there a way to get the baseline value for categorical variables? #169

Comments

grst commented Dec 14, 2023

Additional context

matthewwardrop commented Dec 20, 2023

matthewwardrop commented Dec 22, 2023

grst commented Dec 26, 2023

grst commented Dec 30, 2023

grst commented Jan 17, 2024

matthewwardrop commented Jun 14, 2024

grst commented Jun 20, 2024

grst commented Dec 2, 2024

matthewwardrop commented Dec 2, 2024

grst commented Dec 2, 2024

matthewwardrop commented Dec 2, 2024