This repository has been archived by the owner on May 27, 2024. It is now read-only.

Improve how model.cond works #15

Closed
1 of 2 tasks
grst opened this issue Dec 1, 2023 · 6 comments

@grst
Contributor

grst commented Dec 1, 2023

Description of feature

  • The implementation for finding the baseline level for each variable is currently hacky. We need to reach out to the formulaic devs for input.
  • The function should print out the actual contrast used, for diagnostics.
@grst grst added the enhancement New feature or request label Dec 1, 2023
@grst grst assigned grst and const-ae Dec 1, 2023
@const-ae
Collaborator

const-ae commented Dec 4, 2023

Hi Gregor,

I was looking into getting the default values again and was wondering if the following works or if there was a reason why we dismissed it?

import formulaic
import pandas as pd

data = pd.DataFrame({
    "celltype": ["A", "A", "B", "C", "C", "C"],
    "condition": ["a", "b", "a", "b", "a", "b"],
    "cont": [1, 2, 3, 4, 5, 6],
})

# Categories recorded by formulaic for a plain string column
formulaic.model_matrix("~ condition", data).model_spec.encoder_state["condition"][1]["categories"]

# Same lookup after reordering the categories so that "b" comes first
data["condition"] = pd.Categorical(data["condition"], categories=["b", "a"])
formulaic.model_matrix("~ condition", data).model_spec.encoder_state["condition"][1]["categories"]

@grst
Contributor Author

grst commented Dec 4, 2023

The reason was that if you use the syntax described here, it breaks:

>>> formulaic.model_matrix('~ C(condition, contr.treatment(base="b"))', data).model_spec.encoder_state['C(condition, contr.treatment(base="b"))'][1]['categories']
['a', 'b']

Of course, this is also something we could decide not to support initially, but it would be nice to have a robust way to achieve this.

@const-ae
Collaborator

const-ae commented Dec 5, 2023

Ah yes. But I think it is fine to nonetheless use 'a' as a reference level. I think the reference level should be based on whatever the data contains and not which additional modifications the formula applies.

I think a helpful analogy is to consider a continuous covariate and formulaic.model_matrix('~ I(cont + 10)', data). The formula shifts all values up by 10, but that doesn't mean that we would move the continuous intercept to -10. Applying the same logic to the categorical covariates means that I think it's fine to just look at the categories.
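
A minimal sketch of what "just look at the categories" could mean in practice, using pandas only. The helper name `reference_level` is hypothetical and not part of the package. It assumes that the baseline is the first category of a categorical column, and the first value in sorted order otherwise, which mirrors how default treatment coding is usually set up.

```python
import pandas as pd


def reference_level(column: pd.Series):
    """Hypothetical helper: guess the baseline level from the data alone."""
    if isinstance(column.dtype, pd.CategoricalDtype):
        # Assumption: the first declared category acts as the baseline.
        return column.cat.categories[0]
    # Assumption: for plain string columns the levels are sorted,
    # and the first one acts as the baseline.
    return sorted(column.dropna().unique())[0]


data = pd.DataFrame({
    "celltype": ["A", "A", "B", "C", "C", "C"],
    "condition": ["a", "b", "a", "b", "a", "b"],
    "cont": [1, 2, 3, 4, 5, 6],
})

baseline_plain = reference_level(data["condition"])

# Reordering the categories moves the baseline to "b".
data["condition"] = pd.Categorical(data["condition"], categories=["b", "a"])
baseline_reordered = reference_level(data["condition"])

print(baseline_plain, baseline_reordered)
```

This ignores anything the formula itself does to the variable, such as an explicit `contr.treatment(base=...)`, which is exactly the open question in this thread.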

@grst
Contributor Author

grst commented Dec 6, 2023

But it does change the design matrix accordingly?

image
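
The screenshot is not available in this copy of the thread. As a rough stand-in, here is a pandas-only sketch of treatment coding that shows how the choice of base level changes the design matrix. It does not call formulaic. The function `treatment_design` is hypothetical, and the column names only imitate the `name[T.level]` style.

```python
import pandas as pd


def treatment_design(values, name, base):
    """Hypothetical stand-in for treatment coding of one categorical variable."""
    levels = sorted(set(values))
    # Put the base level first; every other level gets its own dummy column.
    non_base = [lvl for lvl in levels if lvl != base]
    columns = {"Intercept": [1] * len(values)}
    for lvl in non_base:
        columns[f"{name}[T.{lvl}]"] = [int(v == lvl) for v in values]
    return pd.DataFrame(columns)


condition = ["a", "b", "a", "b", "a", "b"]

design_base_a = treatment_design(condition, "condition", base="a")
design_base_b = treatment_design(condition, "condition", base="b")

print(design_base_a)
print(design_base_b)
```

With base "a" the dummy column marks the "b" rows, and with base "b" it marks the "a" rows, so the same coefficient position means something different depending on the base.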

@grst
Contributor Author

grst commented Dec 10, 2023

Also adding here that I think this function should print out the actual contrast vector used (maybe omit zeros).
It's already a dataframe, so something like

contrasts.T.loc[lambda x: x[0] != 0]

could work nicely.
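
To illustrate the one-liner above: a small sketch assuming the contrast is stored as a one-row DataFrame whose single row has the label 0 and whose columns are the coefficient names. The coefficient names and values here are made up for the example.

```python
import pandas as pd

# Hypothetical contrast vector: one row (label 0), one column per coefficient.
contrasts = pd.DataFrame(
    [[0.0, 1.0, 0.0, -1.0]],
    columns=["Intercept", "condition[T.b]", "celltype[T.B]", "celltype[T.C]"],
)

# Transpose so coefficients become rows, then keep only the non-zero entries.
nonzero = contrasts.T.loc[lambda x: x[0] != 0]
print(nonzero)
```

Only `condition[T.b]` and `celltype[T.C]` remain, which keeps the diagnostic output short for models with many coefficients.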

@grst
Contributor Author

grst commented May 27, 2024

@grst grst closed this as completed May 27, 2024