Is there a way to get the baseline value for categorical variables? #169
Comments
Hi @grst, thanks for reaching out. It's an interesting question.

The encoder state has always been intended primarily as a way to ensure reproducibility, not (as here) as something to introspect in order to build automation or tooling on top of it. The layout of the encoder state is also considered an implementation detail and is not guaranteed to remain fixed between major versions of Formulaic; so while changes at this point are relatively unlikely, you could find your code not working in the future. Also, not all contrast encodings "leave one out" in a nice parsimonious way.

This is definitely an interesting problem, though. If you do not need support for arbitrary contrasts, you could override the

Also, in case you were unaware, Formulaic offers tooling that could be used to generate contrast matrices (though it was not originally intended for that purpose). It is likely not helpful for you here, since it uses a very different syntax, but... here you go anyway (following the conventions of https://matthewwardrop.github.io/formulaic/guides/contrasts/#how-does-contrast-coding-work):
Given your emoticon response, I'll assume that the custom transform route best suits you... but feel free to reopen if you need more guidance or want me to revisit it!
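To illustrate the earlier remark that not every contrast encoding "leaves one out", here is a minimal hand-rolled sketch in plain Python. It does not use Formulaic at all; the helper names `treatment_contrasts` and `sum_contrasts` are hypothetical and exist only for this illustration. Under treatment (dummy) coding exactly one level maps to an all-zero row, which is what one would naturally call the baseline. Under sum (deviation) coding no level does, so "the baseline" is not well defined in the same sense.

```python
def treatment_contrasts(levels, base=None):
    """Dummy coding: one column per non-baseline level (hypothetical helper)."""
    base = levels[0] if base is None else base
    columns = [lvl for lvl in levels if lvl != base]
    return {lvl: [1 if lvl == col else 0 for col in columns] for lvl in levels}


def sum_contrasts(levels):
    """Deviation coding: the last level is coded -1 in every column (hypothetical helper)."""
    columns = levels[:-1]
    last = levels[-1]
    return {
        lvl: [-1] * len(columns) if lvl == last
        else [1 if lvl == col else 0 for col in columns]
        for lvl in levels
    }


levels = ["A", "B", "C"]
treatment = treatment_contrasts(levels)
deviation = sum_contrasts(levels)

# Levels whose coded row is all zeros, i.e. candidates for "the baseline".
baseline = [lvl for lvl, row in treatment.items() if not any(row)]
zero_rows = [lvl for lvl, row in deviation.items() if not any(row)]

print("treatment baseline:", baseline)
print("sum-coded all-zero levels:", zero_rows)
```

Any tooling that fills in "the baseline level" therefore has to either restrict itself to treatment-style codings or decide explicitly what to do for the others.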
Thanks for the detailed response and sorry for not responding earlier! I'll check out your suggestion with the custom transform class and let you know in case I have further questions :)
Hi @matthewwardrop, I now have a follow-up question: I got a PoC to work by passing the overridden

    from formulaic.transforms import C
    import formulaic
    import pandas as pd

    data = pd.DataFrame({
        "celltype": ['A', 'A', 'B', 'C', 'C', 'C'],
        "condition": ['a', 'b', 'a', 'b', 'a', 'b'],
        "cont": [1, 2, 3, 4, 5, 6],
    })

    # Wrapper around C that logs what it receives before delegating to it.
    def custom_C(data, contrasts=None, *, levels=None, spans_intercept=True):
        print("variable", data.name)
        print("contrasts", contrasts)
        print("levels", levels)
        return C(data, contrasts, levels=levels, spans_intercept=spans_intercept)

    # Make the formula's C(...) resolve to the wrapper via the evaluation context.
    formulaic.model_matrix("~condition + C(celltype, contr.treatment)", data=data, context={"C": custom_C})
Hi @matthewwardrop, sorry for the renewed ping - I was just afraid my message got lost in the Christmas break.
Hi @grst, apologies for the delay (I feel like I'm always apologising)... life is and remains pretty busy! Yeah, catching the default categorisation is a little trickier. You can see the default implementation for pandas materializers here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/pandas.py#L95C9-L95C28 . You would have to subclass this. If this is going to be very useful to you, then it might be useful to others too... so I'd be willing to extend the transform state with information about the baseline as well. The amount of redundant information is relatively small, so I'm not worried from that perspective. If you still want this, let me know and I can add it in.
Hi @matthewwardrop,
I know what you are talking about, so no worries ;) In the meantime, I found a solution that indeed involves subclassing the Python materializer. By hooking into Maybe you could take a quick look and let me know if you can foresee any problems that might arise from future updates of formulaic? Of course it would be cool if there were a more straightforward solution, but I can also live with what I have now. At least it passes all my test cases.
Just for future reference, our code for capturing the factor metadata and building contrasts is now available as a standalone package: https://formulaic-contrasts.readthedocs.io/en/latest/index.html
Nice! In the current formulaic code, which will be released as 1.1 shortly, you can also potentially use formulaic/formulaic/model_spec.py Line 334 in 28fd9ec
Thanks for the heads-up! I'll test and check if I can simplify a few things with
Actually, it shouldn't break you. It was a purely additive change (I'd forgotten already) :). |
Hi,
I'm looking for a way to retrieve the baseline value of categorical variables used in a formula, e.g.
I found design.model_spec.encoder_state["condition"][1]['categories'], but it doesn't work in case (2). Is there a way to achieve this?
Additional context
We are trying to build a "mini language" to specify contrasts for linear models, inspired by glmGamPoi, e.g.
In case of multiple variables or interaction terms, we would like to fill the contrast vector for variables that are not specified in cond with the value that refers to the baseline level. See also scverse/multi-condition-comparisions#15
CC @Zethson @const-ae
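The idea of filling unspecified variables with their baseline can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the glmGamPoi or Formulaic implementation: `design_row` and `cond` are hypothetical helpers, the column names follow the `variable[T.level]` treatment-coding convention, the baselines are supplied by hand, and continuous covariates are not handled.

```python
def design_row(columns, baselines, **levels):
    """Design-matrix row for one hypothetical observation.

    Variables not given in `levels` fall back to their baseline level.
    """
    unknown = set(levels) - set(baselines)
    if unknown:
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    chosen = {**baselines, **levels}
    active_terms = {f"{var}[T.{lvl}]" for var, lvl in chosen.items()}
    row = []
    for col in columns:
        if col == "Intercept":
            row.append(1.0)
        else:
            # An interaction column "x[T.a]:y[T.b]" is on only if every part is.
            parts = col.split(":")
            row.append(1.0 if all(p in active_terms for p in parts) else 0.0)
    return row


# Assumed design: ~ condition + celltype with treatment coding.
columns = ["Intercept", "condition[T.b]", "celltype[T.B]", "celltype[T.C]"]
baselines = {"condition": "a", "celltype": "A"}


def cond(**levels):
    return design_row(columns, baselines, **levels)


# Contrast "condition b vs. a", with celltype held at its baseline "A".
contrast = [x - y for x, y in zip(cond(condition="b"), cond(condition="a"))]
print(contrast)
```

The contrast is the difference of two design rows, so any variable left unspecified on both sides cancels out as long as both rows use the same baseline for it, which is why the baseline level needs to be known.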