Using the LinearModel helper class
- When would you use the lm.LinearModel helper class?
- When would you not use the helper class?
- Where should I setup and store a LinearModel?
- Setup a LinearModel
- Calling a LinearModel
- Which LinearModelType do I want?
- Predictors
- Multinomial Logistic Regression
- Caveats
When would you use the lm.LinearModel helper class?

A common situation in the model is that you want to determine the probability of an event happening for each person, and that probability is based on the characteristics of that person (e.g. their sex and age). Or, perhaps, the time that a newly infected person is expected to survive with some condition depends on some of their characteristics (e.g. their age at infection and their BMI). In these situations, you could write lots of statements like this:
prob = 0.01
for person_id in df.loc[df.is_alive].index:
    if df.at[person_id, 'age_years'] < 5:
        prob = prob * 10
    elif df.at[person_id, 'age_years'] < 10:
        prob = prob * 2
Or:
prob = pd.Series(0.01, index = df.index)
prob.loc[df.age_years.between(0, 5)] *= 10
prob.loc[df.age_years.between(5, 10)] *= 2
But this gets messy with lots of conditions, and the logic will often need to be used in several places (e.g. in initialise_population and in the main module event). So, the LinearModel is a way of neatly containing all the information about how something is determined as a function of individual characteristics.
If you find yourself doing anything like the above, use a LinearModel instead!
When would you not use the helper class?
- LinearModel helpers are very much an alpha-feature of the framework!
- The way the class works is by abstracting over direct operations on the Pandas dataframe. We don't know whether this will have a significant performance cost.
Where should I setup and store a LinearModel?

- A LinearModel can be created anywhere in your code and stored for future execution. They are not supported as PARAMETERS of the modules, but can be saved as instance variables (i.e. self.xxx) of modules and events. If you do intend to store LMs in this way, we recommend setting up a dictionary in the module's or event's __init__ method with self.lm = dict() and adding any LMs you create to the dictionary with self.lm['tobacco usage'] = LinearModel(xxx). This avoids having to initialise many instance variables in your __init__.
- An expected common usage is to define these in read_parameters(). You will have read in parameters from an external file and can then assemble these models and store them (as above). They will then be ready for use in initialise_population() and beyond!
Setup a LinearModel

- A LinearModel is created by specifying its (i) type, (ii) intercept and (iii) one or more Predictors.
- See the test examples, and how pandas operations in the Lifestyle module can be expressed with the LinearModel class: test_logistic_application_low_ex and test_logistic_application_tob
- General Outline:

my_lm = LinearModel(
    LinearModelType.ADDITIVE,  # or MULTIPLICATIVE or LOGISTIC
    0.0,                       # intercept (can be float or integer)
    Predictor('property_1').when(..., 0.1).when(..., 0.2),
    Predictor('property_2').when(True, 0.01).otherwise(0.02),
    ...
    Predictor('property_n').when(True, 0.00001).otherwise(0.00002)
)
Here is the canonical example of a LinearModelType.LOGISTIC model:

baseline_prob = 0.5
baseline_odds = baseline_prob / (1 - baseline_prob)
OR_X = 2 # Odds Ratio for Factor 'X'
OR_Y = 5 # Odds Ratio for Factor 'Y'
eq = LinearModel(
LinearModelType.LOGISTIC, # We specify the Logistic model
baseline_odds, # The intercept term is the baseline_odds
Predictor('FactorX').when(True, OR_X), # We declare that 'FactorX' is a predictor. Anyone that has 'FactorX' as True, will have a higher odds, as given by OR_X
Predictor('FactorY').when(True, OR_Y) # We declare that 'FactorY' is a predictor. Anyone that has 'FactorY' as True, will have a higher odds, as given by OR_Y
)
df = pd.DataFrame(data={
'FactorX': [False, True, False, True],
'FactorY': [False, False, True, True]
})
pred = eq.predict(df) # We provide the dataframe as an argument to the LinearModel's predict method.
# The result will be as expected.
assert all(pred.values == [
    baseline_prob,
    (baseline_odds * OR_X) / (1 + baseline_odds * OR_X),
    (baseline_odds * OR_Y) / (1 + baseline_odds * OR_Y),
    (baseline_odds * OR_X * OR_Y) / (1 + baseline_odds * OR_X * OR_Y)
])
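The combination rule used by the LOGISTIC type can be checked in plain pandas, without the tlo package. This is an illustrative sketch (not the tlo implementation) that reproduces the expected values above:

```python
import pandas as pd

baseline_prob = 0.5
baseline_odds = baseline_prob / (1 - baseline_prob)  # = 1.0
OR_X = 2
OR_Y = 5

df = pd.DataFrame(data={
    'FactorX': [False, True, False, True],
    'FactorY': [False, False, True, True]
})

# Multiply the baseline odds by each odds ratio wherever the factor is True
# (an odds ratio raised to a False/0 exponent contributes a factor of 1)
odds = baseline_odds * (OR_X ** df['FactorX']) * (OR_Y ** df['FactorY'])

# Convert the combined odds back to probabilities
prob = odds / (1 + odds)
```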
Note that:
- All models must have an intercept -- this is the value that is given to all individuals/rows in the dataframe. If there are no predictors then this is the final value for each individual/row in the ADDITIVE and MULTIPLICATIVE model types, and this value will be transformed as x/(1+x) to give the final value in the LOGISTIC model type (see below for more on this).
- The model can have no predictors -- in which case, see above! This 'hollow' mode is helpful when sketching out the logic of a module without specifying the details of how individuals' characteristics might affect things.
- The model can have any number of predictors --- and each will be applied independently and then combined according to the model type (see below).
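A minimal plain-Python sketch of the intercept-only ('hollow') behaviour described above (the intercept value is illustrative):

```python
intercept = 0.25  # illustrative intercept value

# ADDITIVE and MULTIPLICATIVE types with no predictors:
# every row receives the intercept itself
additive_result = intercept
multiplicative_result = intercept

# LOGISTIC type with no predictors: the intercept is a baseline odds,
# transformed to a probability as x / (1 + x)
logistic_result = intercept / (1 + intercept)
```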
Calling a LinearModel

Pass in the population pd.DataFrame (or a subset of the population dataframe) as an argument to the LinearModel's .predict() method. Note that the dataframe must contain columns that are named exactly as any (non-external) Predictor name. It will return a series with an index equal to the index of the dataframe that was passed in. Thus:
result = my_lm.predict(df)
Common use cases would be:
- Passing in a subset of the main population data frame:
result = my_lm.predict(df.loc[df.is_alive & (df.age_years < 5)])
- Getting the result for one person and extracting the prediction as a value (not a pd.Series):
result = my_lm.predict(df.loc[[person_id]]).values[0]
Which LinearModelType do I want?

The LinearModelType determines how the effects of each Predictor are combined to give the final result. Let y be the output of the model, x_i be indicator variables for particular conditions, beta_0 be the intercept term and beta_i the 'effects'. Then:
- The LinearModelType.ADDITIVE model will provide: y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ...
- The LinearModelType.MULTIPLICATIVE model will provide: y = beta_0 * (beta_1 * x_1) * (beta_2 * x_2) * ...
- The LinearModelType.LOGISTIC model is different, as it is designed to accept the outputs of a standard logistic regression, which are based on 'odds'. Recall that the relationship between odds and probabilities is as follows: odds = prob / (1 - prob); prob = odds / (1 + odds). The intercept is the baseline odds and the effects are Odds Ratios. The model returns the probability that is implied by the model.
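The odds/probability conversions can be written as two small helper functions (illustrative only, not part of the tlo API):

```python
def prob_to_odds(prob):
    # odds = prob / (1 - prob)
    return prob / (1 - prob)

def odds_to_prob(odds):
    # prob = odds / (1 + odds)
    return odds / (1 + odds)

# e.g. a baseline probability of 0.5 is odds of 1.0; applying an odds
# ratio of 2 gives odds of 2.0, i.e. a probability of 2/3
```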
Predictors

Predictors specify how a particular column in the pd.DataFrame should affect the result of the LinearModel. This is where the magic happens.
The following uses are supported:
- A boolean property:
Predictor('property_name').when(True, value_for_true).otherwise(value_for_false)
Predictor('property_name').when(True, value_for_true) # If False for a row, this predictor has no effect for that row
- A property for which particular values are important:
Predictor('property_name').when('string_condition', value_for_condition).otherwise(value_for_otherwise)
Predictor('property_name').when('string_condition_one', value_for_condition_one)
                          .when('string_condition_two', value_for_condition_two)
                          .otherwise(value_for_otherwise)
- A property for which particular ranges are to be selected
NB. .between is an inclusive range, such that .between(a, b) is True for the values a and b and all values in-between.
Predictor('property_name').when('.between(low_value, high_value)', value_for_condition).otherwise(value_for_otherwise)
An alternative specification uses inequalities. Note that successive .when()
conditions on a Predictor are not evaluated for a row if a prior .when()
condition has been satisfied.
Predictor('property_name').when('<10', value_for_less_than_ten)
.when('<20', value_for_10_to_20)
.when('<30', value_for_20_to_30)
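These first-match semantics mirror numpy.select, which also evaluates conditions in order and takes the first that holds for each row. A plain numpy/pandas sketch, with illustrative cut-offs and values:

```python
import numpy as np
import pandas as pd

age = pd.Series([5, 15, 25, 40])

# Conditions are evaluated in order and the first True one wins,
# mirroring a chain of .when() calls on a Predictor
effect = np.select(
    condlist=[age < 10, age < 20, age < 30],
    choicelist=[0.1, 0.2, 0.3],
    default=0.0,  # plays the role of .otherwise()
)
```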
- A condition that is based on more than one variable
Here the Predictor is defined with an empty set of parentheses and the condition is given in the .when() as a string, verbatim as it should be evaluated; the value is assigned to a row when that condition is True.
Predictor().when(string_of_compound_condition_that_will_yield_bool, value_for_compound_condition_true)
For example:
Predictor().when('(property_name_1 == True) & (property_name_2 == "M")', value_for_compound_condition_true)
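Under the hood, this kind of compound condition is just a boolean expression over dataframe columns. In plain pandas the same effect can be sketched with DataFrame.eval (the column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame(data={
    'property_name_1': [True, True, False],
    'property_name_2': ['M', 'F', 'M']
})

# Evaluate the compound condition string to a boolean mask, one value per row
mask = df.eval('(property_name_1 == True) & (property_name_2 == "M")')

# Apply an effect only where the condition holds (here: multiplicative-style,
# with an implicit 1.0 where it does not)
effect = mask.map({True: 3.0, False: 1.0})
```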
- A coefficient that is a function of the value of the Predictor
The value of the model may depend on a function of the predictor: for example, in the model y = k1 + k2*x^2, where k1 and k2 are constants. In this case, we provide the function -- either as a defined function or a lambda function -- to the .apply() method of the Predictor.
Predictor('property_name').apply(lambda x: (k1 + k2*(x**2)))
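The .apply() form is just an elementwise function of the column's value. A plain-pandas sketch for y = k1 + k2*x^2, with illustrative constants:

```python
import pandas as pd

k1, k2 = 0.5, 0.1  # illustrative constants

x = pd.Series([0.0, 1.0, 2.0])

# The same callable that would be passed to Predictor(...).apply(...)
f = lambda v: k1 + k2 * (v ** 2)

y = x.apply(f)
```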
- External Variables: an external variable is one that is not in the dataframe -- for example, the year, or some 'environmental' variable that is the same for all individuals at the time the model is used to predict.
Predictor('external_variable_name', external=True).when(2010, 5.0).otherwise(1.0)
If you have specified an 'external' variable, you must pass it as an argument to the predict method.
my_lm = LinearModel(
    LinearModelType.ADDITIVE,  # a LinearModelType must be given; ADDITIVE is shown here as an example
    0.0,
    Predictor('external_variable_name', external=True).when(2010, 5.0).otherwise(1.0)
)
result = my_lm.predict(df, year = self.sim.date.year)
- Any number of .when() conditions can be specified on a Predictor, but the order is important: for one row/individual, successive .when() conditions are not evaluated if a previous .when() condition has been satisfied. This allows a simple specification of the mapping of an effect to different contiguous ranges (see the example above).
- An .otherwise() condition is not needed. If one is not specified and no other condition is met for that row, then the predictor has no effect. That is, there is an implicit .otherwise(0.0) for ADDITIVE models, an implicit .otherwise(1.0) for MULTIPLICATIVE models, and an implicit .otherwise() condition of Odds Ratio = 1.0 for LOGISTIC models.
Multinomial Logistic Regression

The toy example here uses the output from Stata's mlogit command (which is also helpful in interpreting the output).
We run mlogit on a Stata example dataset (ice cream flavour by sex):
use https://stats.idre.ucla.edu/stat/stata/output/mlogit, clear
table ice_cream
mlogit ice_cream female, rrr
predict pchoc pvan pstraw, p
list female pchoc pvan pstraw
This produces the output:
Iteration 0: log likelihood = -210.58254
Iteration 1: log likelihood = -208.26661
Iteration 2: log likelihood = -208.24309
Iteration 3: log likelihood = -208.24309
Multinomial logistic regression Number of obs = 200
LR chi2(2) = 4.68
Prob > chi2 = 0.0964
Log likelihood = -208.24309 Pseudo R2 = 0.0111
------------------------------------------------------------------------------
ice_cream | RRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
chocolate |
female | 2.088888 .7816646 1.97 0.049 1.003216 4.349466
_cons | .319149 .0946444 -3.85 0.000 .178471 .570715
-------------+----------------------------------------------------------------
vanilla | (base outcome)
-------------+----------------------------------------------------------------
strawberry |
female | .9791667 .3263365 -0.06 0.950 .5095283 1.881676
_cons | .6170213 .1456998 -2.04 0.041 .3884207 .9801622
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
Then we get Stata to calculate the predicted probabilities of each flavour by sex:
+-----------------------------------------+
| female pchoc pvan pstraw |
|-----------------------------------------|
1. | male .1648352 .5164835 .3186813 |
2. | female .293578 .440367 .266055 |
+-----------------------------------------+
Note: these probabilities are not the same as running each flavour's linear model separately.
We implement prediction on this model in TLO as follows:
from tlo.lm import LinearModel, LinearModelType, Predictor
import pandas as pd
# Create a toy dataframe as input
df = pd.DataFrame(data={'sex': ['M', 'F']})
# The df looks like this:
#   sex
# 0   M
# 1   F

# We create a separate LinearModel instance for each category (not including the base category "vanilla")
# We use LinearModelType.MULTIPLICATIVE because we want to calculate the model probabilities manually
# The numbers are from the Stata model above (RRR=relative risk ratio)
choc_lm = LinearModel(LinearModelType.MULTIPLICATIVE,
0.319149, # this is the intercept
Predictor('sex').when('F', 2.088888)) # the relative risk ratio when female
straw_lm = LinearModel(LinearModelType.MULTIPLICATIVE,
0.6170213,
Predictor('sex').when('F', 0.9791667))
# For each model, get the RRR prediction
choc_pred = choc_lm.predict(df)
# returns:
# 0 0.319149
# 1 0.666667
# dtype: float64
straw_pred = straw_lm.predict(df)
# returns:
# 0 0.617021
# 1 0.604167
# dtype: float64
# We calculate the denominator to normalise the ratios and calculate the predicted probabilities:
denom = 1 + choc_pred + straw_pred
# returns
# 0 1.936170
# 1 2.270833
# dtype: float64
# Now calculate the predicted probabilities of each category
p_choc = choc_pred / denom
p_straw = straw_pred / denom
p_van = 1 / denom # note: vanilla is the base outcome, RRR is always '1' for the base outcome
# Look at everything together, the predicted probabilities same as Stata above
# Rows sum to 1.0
pd.concat([df, p_choc, p_van, p_straw], axis=1)
# returns:
# sex 0 1 2
# 0 M 0.164835 0.516483 0.318681
# 1 F 0.293578 0.440367 0.266055
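The normalisation step can be double-checked in plain Python using the intercepts and RRRs from the Stata output above:

```python
# Intercepts (_cons, the baseline relative risks) and RRRs from the mlogit output
choc_male = 0.319149
choc_female = 0.319149 * 2.088888
straw_male = 0.6170213
straw_female = 0.6170213 * 0.9791667

def to_probs(choc, straw):
    # vanilla is the base outcome, so its ratio is fixed at 1
    denom = 1 + choc + straw
    return (choc / denom, 1 / denom, straw / denom)

p_male = to_probs(choc_male, straw_male)      # (p_choc, p_van, p_straw) for males
p_female = to_probs(choc_female, straw_female)
```

Each tuple sums to 1.0 and matches the Stata-predicted probabilities shown earlier.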
Caveats

- Using the external=True flag on a Predictor currently has a performance penalty, which may be significant with a large population. We are monitoring real-world usage to determine how to modify the code to avoid this.
TLO Model Wiki