Using the LinearModel helper class
- When would you use the lm.LinearModel helper class?
- When would you not use the helper class?
- Where should I setup and store a LinearModel?
- Setup a LinearModel
- Calling a LinearModel
- Which LinearModelType do I want?
- Predictors
- Multinomial Logistic Regression
- Caveats
- TLDR -- give me a good shortcut for what I am mostly going to use this for
- Getting the outcome of a prediction automatically
A common situation in the model is that you want to determine the probability of an event happening for each person, where that probability is based on characteristics of that person (e.g. their sex and age). Or, perhaps, the time that a newly infected person is expected to survive with some condition depends on some of their characteristics (e.g. their age at infection and their BMI). In these situations, you could create lots of statements like this:
prob = 0.01
for person_id in df.loc[df.is_alive].index:
    if df.at[person_id, 'age_years'] < 5:
        prob = prob * 10
    elif df.at[person_id, 'age_years'] < 10:
        prob = prob * 2
Or:
prob = pd.Series(0.01, index = df.index)
prob.loc[df.age_years.between(0, 5)] *= 10
prob.loc[df.age_years.between(5, 10)] *= 2
But this gets messy when there are lots of conditions, and the same logic will often need to be used in several places (e.g. in initialise_population and in the main module event). The LinearModel is a way of neatly containing all the information about how something is determined as a function of individual characteristics.
If you find yourself doing anything like the above, use a LinearModel instead!
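For example, the age-dependent probability above could be expressed as a single model. This is a sketch only, assuming the MULTIPLICATIVE model type described below and that df is the population dataframe:
from tlo.lm import LinearModel, LinearModelType, Predictor

# baseline probability of 0.01, multiplied by 10 for under-fives and by 2 for 5-9 year olds
prob_lm = LinearModel(
    LinearModelType.MULTIPLICATIVE,
    0.01,
    Predictor('age_years').when('<5', 10).when('.between(5, 9)', 2)
)
prob = prob_lm.predict(df.loc[df.is_alive])  # a probability for every person who is alive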
- LinearModel helpers are very much an alpha-feature of the framework!
- The way the class works is by abstracting over direct operations on the Pandas dataframe. We don't know whether this will have a significant performance cost.
- A LinearModel can be created anywhere in your code and stored for future execution. They are not supported as PARAMETERS of the modules, but they can be saved as instance variables (i.e. self.xxx) of modules and events. If you do intend to store LMs in this way, we recommend setting up a dictionary in the module's or event's __init__ method with self.lm = dict() and adding any LMs you create to the dictionary with self.lm['tobacco usage'] = LinearModel(xxx). This is to avoid having to initialise many instance variables in your __init__.
- An expected common usage is to define the models in read_parameters(). You will have read in parameters from an external file and can then assemble the models and store them (as above, and see the sketch below). They will then be ready for use in initialise_population() and beyond!
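A minimal sketch of this pattern (the dictionary key, parameter names and predictor values here are purely illustrative):
def read_parameters(self, data_folder):
    # ... parameters have been read into self.parameters by this point ...
    p = self.parameters
    # self.lm was set up as an empty dict in __init__
    self.lm['tobacco usage'] = LinearModel(
        LinearModelType.MULTIPLICATIVE,
        p['baseline_prob_tobacco'],                       # hypothetical parameter name
        Predictor('sex').when('M', p['rr_tobacco_male'])  # hypothetical parameter name
    )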
- A LinearModel is created by specifying its (i) type, (ii) intercept and (iii) one or more Predictors.
- See the test examples, and how the Pandas operations in the Lifestyle module can use the LinearModel class: test_logistic_application_low_ex and test_logistic_application_tob
- General Outline:
my_lm = LinearModel(
    LinearModelType.ADDITIVE,  # or MULTIPLICATIVE or LOGISTIC
    0.0,                       # intercept can be a float or an integer
    Predictor('property_1').when(..., 0.1).when(..., 0.2),
    Predictor('property_2').when(True, 0.01).otherwise(0.02),
    ...
    Predictor('property_n').when(True, 0.00001).otherwise(0.00002)
)
Here is the canonical example of a LinearModelType.LOGISTIC model:
baseline_prob = 0.5
baseline_odds = baseline_prob / (1 - baseline_prob)
OR_X = 2 # Odds Ratio for Factor 'X'
OR_Y = 5 # Odds Ratio for Factor 'Y'
eq = LinearModel(
    LinearModelType.LOGISTIC,  # We specify the Logistic model
    baseline_odds,             # The intercept term is the baseline_odds
    Predictor('FactorX').when(True, OR_X),  # We declare that 'FactorX' is a predictor. Anyone that has 'FactorX' as True will have higher odds, as given by OR_X
    Predictor('FactorY').when(True, OR_Y)   # We declare that 'FactorY' is a predictor. Anyone that has 'FactorY' as True will have higher odds, as given by OR_Y
)
df = pd.DataFrame(data={
    'FactorX': [False, True, False, True],
    'FactorY': [False, False, True, True]
})
pred = eq.predict(df) # We provide the dataframe as an argument to the LinearModel's predict method.
# The result will be as expected.
assert all(pred.values == [
    baseline_prob,
    (baseline_odds * OR_X) / (1 + baseline_odds * OR_X),
    (baseline_odds * OR_Y) / (1 + baseline_odds * OR_Y),
    (baseline_odds * OR_X * OR_Y) / (1 + baseline_odds * OR_X * OR_Y)
])
Note that:
- All models must have an intercept -- this is the value that is given to all individuals/rows in the dataframe. If there are no predictors then this is the final value for each individual/row in the ADDITIVE and MULTIPLICATIVE model types, and this value is transformed as x/(1+x) to give the final value in the LOGISTIC model type (see below for more information on this).
- The model can have no predictors -- in which case, see above! This 'hollow' mode is helpful when sketching out the logic of a module without specifying the details of how individuals' characteristics might affect things (see the sketch after this list).
- The model can have any number of predictors -- each will be applied independently and then combined according to the model type (see below).
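For instance, a minimal sketch of such a 'hollow' intercept-only model (assuming df is the population dataframe):
# every alive person gets the same value of 0.05, because there are no predictors
eq = LinearModel(LinearModelType.MULTIPLICATIVE, 0.05)
eq.predict(df.loc[df.is_alive])  # a Series of 0.05s, indexed like the input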
Pass in the population pd.DataFrame (or a subset of the population dataframe) as an argument to the LinearModel's .predict() method. Note that the dataframe must contain columns that are named exactly as any (non-external) Predictor name. It will return a Series with an index equal to the index of the dataframe that is passed in. Thus:
result = my_lm.predict(df)
Common use cases would be:
- Passing in a subset of the main population data frame:
result = my_lm.predict(df.loc[df.is_alive & (df.age_years < 5)])
- Getting the result for one person and extracting the prediction as a value (not a pd.Series):
result = my_lm.predict(df.loc[[person_id]]).values[0]
The LinearModelType determines how the effects of each Predictor are combined to give the final result. Let y be the output of the model, x_i be indicator variables for a particular condition, beta_0 be the intercept term and beta_i the 'effects'; then:
- The LinearModelType.ADDITIVE model will provide: y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ...
- The LinearModelType.MULTIPLICATIVE model will provide: y = beta_0 * beta_1^x_1 * beta_2^x_2 * ... (i.e. each effect is applied only to the rows for which its condition holds)
- The LinearModelType.LOGISTIC model is different: it is designed to accept the outputs of a standard logistic regression, which are based on 'odds'. Recall that the relationship between odds and probabilities is: odds = prob / (1 - prob); prob = odds / (1 + odds). The intercept is the baseline odds and the effects are Odds Ratios. The model returns the probability that is implied by the model (see the sketch after this list).
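A small sketch contrasting the three types with a single boolean predictor; the expected outputs in the comments follow the formulas above:
import pandas as pd
from tlo.lm import LinearModel, LinearModelType, Predictor

df = pd.DataFrame({'x': [False, True]})

additive = LinearModel(LinearModelType.ADDITIVE, 0.1, Predictor('x').when(True, 0.2))
multiplicative = LinearModel(LinearModelType.MULTIPLICATIVE, 0.1, Predictor('x').when(True, 2.0))
logistic = LinearModel(LinearModelType.LOGISTIC, 0.25, Predictor('x').when(True, 3.0))  # intercept is the baseline odds

additive.predict(df)        # 0.1 and 0.1 + 0.2 = 0.3
multiplicative.predict(df)  # 0.1 and 0.1 * 2.0 = 0.2
logistic.predict(df)        # 0.25/(1 + 0.25) = 0.2 and (0.25*3.0)/(1 + 0.25*3.0) ≈ 0.43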
Predictors specify how a particular column in the pd.DataFrame should affect the result of the LinearModel. This is where the magic happens.
The following uses are supported:
- A boolean property:
Predictor('property_name').when(True, value_for_true).otherwise(value_for_false)
Predictor('property_name').when(True, value_for_true) # If False for a row, there will be no effect of this predictor for that row
- A property for which particular values are important:
Predictor('property_name').when('string_condition', value_for_condition).otherwise(value_for_otherwise)
Predictor('property_name').when('string_condition_one', value_for_condition_one)
                          .when('string_condition_two', value_for_condition_two)
                          .otherwise(value_for_otherwise)
- A property for which a particular range of values is to be selected:
Predictor('property_name').when('.between(low_value, high_value)', value_for_condition).otherwise(value_for_otherwise)
The pandas Series between method can be used to specify a range of values to select. By default .between specifies an inclusive range, such that .between(a, b) is True for the values a and b and all values in between. To specify a non-inclusive range, the optional inclusive keyword argument can be set to an appropriate value: for example, .between(a, b, inclusive="left") will be True for all values >= a and < b, .between(a, b, inclusive="right") will be True for all values > a and <= b, and .between(a, b, inclusive="neither") will be True for all values > a and < b.
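A quick illustration of these inclusive options on a plain pandas Series:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s.between(2, 4)                       # True for 2, 3 and 4 (both ends included by default)
s.between(2, 4, inclusive="left")     # True for 2 and 3 (lower bound included only)
s.between(2, 4, inclusive="right")    # True for 3 and 4 (upper bound included only)
s.between(2, 4, inclusive="neither")  # True for 3 only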
By default, successive .when() conditions on a Predictor are not evaluated for a row if a prior .when() condition has been satisfied. Therefore the following
Predictor('property_name').when('<10', value_for_less_than_ten)
                          .when('<20', value_for_10_to_20)
                          .when('<30', value_for_20_to_30)
will be equivalent in output to
Predictor('property_name').when('<10', value_for_less_than_ten)
                          .when('.between(10, 20, inclusive="left")', value_for_10_to_20)
                          .when('.between(20, 30, inclusive="left")', value_for_20_to_30)
Note, however, that when possible it is preferable from a performance perspective to define the conditions so that they are mutually exclusive and to annotate them as such - see the Performance tips section below for details.
- A condition that is based on more than one variable
Here the Predictor is defined with an empty set of parentheses and the condition is given in the .when() as a string, written verbatim as it should be evaluated; the value is assigned to the row when that condition is True.
Predictor().when(string_of_compound_condition_that_will_yield_bool, value_for_compound_condition_true)
For example:
Predictor().when('(property_name_1 == True) & (property_name_2 == "M")', value_for_compound_condition_true)
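As a fuller sketch of how such a predictor might sit inside a model (the column names and values here are hypothetical):
import pandas as pd
from tlo.lm import LinearModel, LinearModelType, Predictor

df = pd.DataFrame({'is_pregnant': [True, True, False],
                   'age_years': [25, 15, 30]})

lm = LinearModel(
    LinearModelType.MULTIPLICATIVE,
    0.01,
    # the whole condition is written out exactly as it should be evaluated on the dataframe
    Predictor().when('(is_pregnant == True) & (age_years < 20)', 5.0)
)
lm.predict(df)  # 0.01, 0.05, 0.01 -- only the pregnant under-20 gets the extra effect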
- A coefficient that is a function of the value of the Predictor
The value of the model may depend on a function of the predictor: for example, in the model y = k1 + k2*x^2, where k1 and k2 are constants. In this case, we provide the function -- either as a named function or a lambda function -- to the .apply() method of the Predictor.
Predictor('property_name').apply(lambda x: (k1 + k2*(x**2)))
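For example, a sketch of the y = k1 + k2*x^2 model above written as an ADDITIVE model (k1 and k2 are arbitrary illustrative constants):
k1, k2 = 0.001, 0.0001
lm = LinearModel(
    LinearModelType.ADDITIVE,
    0.0,  # zero intercept, so the result is just k1 + k2 * age_years**2
    Predictor('age_years').apply(lambda x: k1 + k2 * (x ** 2))
)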
- External Variables
An external variable is one that is not in the dataframe -- for example, the year or some 'environmental'-level variable that is the same for all individuals at the time that the model is used to predict.
Predictor('external_variable_name', external=True).when(2010, 5.0).otherwise(1.0)
If you have specified an 'external' variable, you must pass it as an argument to the predict method.
my_lm = LinearModel(
    LinearModelType.ADDITIVE,  # a model type must be specified; ADDITIVE is assumed here for illustration
    0.0,
    Predictor('year', external=True).when(2010, 5.0).otherwise(1.0)
)
result = my_lm.predict(df, year = self.sim.date.year)
Or:
lm = LinearModel(
    LinearModelType.MULTIPLICATIVE,
    1.0,
    Predictor('ext_var', external=True).when(True, 10).otherwise(-10)
)
df = pd.DataFrame(index=range(10))
ext_var = self.module.rng.rand(10)<0.5
lm.predict(df, ext_var=ext_var)
To get the best performance, the set of conditions for each predictor should be mutually exclusive (for any pair of conditions, one condition evaluating to True implies the other must evaluate to False) and exhaustive (at least one condition will always be True). Either of these properties holding can be indicated by setting conditions_are_mutually_exclusive or conditions_are_exhaustive to True in the Predictor initialiser, for example
# conditions are mutually exclusive and exhaustive
Predictor('variable_name', conditions_are_mutually_exclusive=True, conditions_are_exhaustive=True)
# conditions are mutually exclusive but not necessarily exhaustive
Predictor('variable_name', conditions_are_mutually_exclusive=True)
By default, LinearModel.predict evaluates conditions sequentially in the order they are defined, on the subset of rows that have not already matched any previous condition. For a predictor with N conditions this gives an evaluation cost that scales quadratically in N, which can become significant for predictors with many conditions. Defining the conditions such that they are mutually exclusive, and annotating the predictor with the conditions_are_mutually_exclusive=True keyword argument, allows a more efficient evaluation method to be used which scales linearly in N.
If a predictor does not specify an explicit 'catch-all' .otherwise condition, by default a 'no-effect' value is output by the predictor (zero for additive models, one for multiplicative models and an odds ratio of one for logistic models) when none of the conditions has been matched. As this requires tracking the rows for which conditions have been matched, it incurs an overhead which scales linearly with the number of conditions specified in the model. If the set of conditions is exhaustive and already covers all possible cases, this no-effect term is redundant and can safely be skipped. Setting conditions_are_exhaustive=True in the predictor indicates that the conditions are exhaustive and so this optimisation can be exploited.
Important note: in both cases it is assumed that, if conditions_are_mutually_exclusive or conditions_are_exhaustive have been set to True, these properties do actually hold for the conditions; there is no checking that this is the case. This means that if these flags are set to True incorrectly the model will produce erroneous output.
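For example, a sketch of a predictor whose age-band conditions are written to be mutually exclusive and exhaustive, so both flags can safely be set (the values are illustrative):
age_effect = (
    Predictor('age_years',
              conditions_are_mutually_exclusive=True,
              conditions_are_exhaustive=True)
    .when('.between(0, 4)', 1.5)    # ages 0-4
    .when('.between(5, 14)', 1.2)   # ages 5-14
    .when('>=15', 1.0)              # everyone else, so the conditions cover all ages
)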
The toy example here uses the output from the Stata mlogit function (the manual is very helpful). You don't need to run Stata; we're just creating an example model here. We run mlogit (ice cream flavour by sex):
use https://stats.idre.ucla.edu/stat/stata/output/mlogit, clear
table ice_cream
mlogit ice_cream female, rrr
predict pchoc pvan pstraw, p
list female pchoc pvan pstraw
This produces the output:
Iteration 0: log likelihood = -210.58254
Iteration 1: log likelihood = -208.26661
Iteration 2: log likelihood = -208.24309
Iteration 3: log likelihood = -208.24309
Multinomial logistic regression Number of obs = 200
LR chi2(2) = 4.68
Prob > chi2 = 0.0964
Log likelihood = -208.24309 Pseudo R2 = 0.0111
------------------------------------------------------------------------------
ice_cream | RRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
chocolate |
female | 2.088888 .7816646 1.97 0.049 1.003216 4.349466
_cons | .319149 .0946444 -3.85 0.000 .178471 .570715
-------------+----------------------------------------------------------------
vanilla | (base outcome)
-------------+----------------------------------------------------------------
strawberry |
female | .9791667 .3263365 -0.06 0.950 .5095283 1.881676
_cons | .6170213 .1456998 -2.04 0.041 .3884207 .9801622
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
Stata calculates the predicted probabilities of each flavour by sex:
+-----------------------------------------+
| female pchoc pvan pstraw |
|-----------------------------------------|
1. | male .1648352 .5164835 .3186813 |
2. | female .293578 .440367 .266055 |
+-----------------------------------------+
Note: these probabilities are not the same as running each flavour's linear model separately.
We implement prediction on this model in TLO as follows:
from tlo.lm import LinearModel, LinearModelType, Predictor
import pandas as pd
# Create a toy dataframe as input, two individuals
df = pd.DataFrame(data={'sex': ['M', 'F']})
# returns
# sex
# 0 M
# 1 F
# We create a separate LinearModel instance for each category
# but not including the base category "vanilla"
#
# We use LinearModelType.MULTIPLICATIVE because we will calculate
# the probabilities ourselves
#
# The numbers are from the Stata model above (RRR=relative risk ratio)
choc_lm = LinearModel(LinearModelType.MULTIPLICATIVE,
                      0.319149,  # this is the intercept
                      Predictor('sex').when('F', 2.088888))  # the relative risk ratio when female
straw_lm = LinearModel(LinearModelType.MULTIPLICATIVE,
                       0.6170213,
                       Predictor('sex').when('F', 0.9791667))
# For each model, get the RRR prediction
choc_pred = choc_lm.predict(df)
# returns:
# 0 0.319149
# 1 0.666667
# dtype: float64
straw_pred = straw_lm.predict(df)
# returns:
# 0 0.617021
# 1 0.604167
# dtype: float64
# We calculate the denominator of the logistic function (note: the base outcome is vanilla)
denom = 1 + choc_pred + straw_pred
# returns
# 0 1.936170
# 1 2.270833
# dtype: float64
# Now calculate the predicted probabilities of each category
p_choc = choc_pred / denom
p_straw = straw_pred / denom
p_van = 1 / denom # note: vanilla is the base outcome, RRR is always '1' for the base outcome
# Look at everything together, the predicted probabilities are the same as Stata above
# Rows sum to 1.0
pd.concat([df, p_choc, p_straw, p_van], axis=1)
# returns:
# sex 0 1 2
# 0 M 0.164835 0.318681 0.516483
# 1 F 0.293578 0.266055 0.440367
# we can wrap this into a utility function
def get_multinomial_probabilities(_df, *linear_models):
    """Return a dataframe of probabilities for each outcome.

    Columns are in the same order as the linear model arguments;
    the last column is the base outcome.
    """
    ratios = pd.concat([lm.predict(_df) for lm in linear_models], axis=1)
    ratios.insert(len(ratios.columns), len(ratios.columns), 1.0)
    denom = ratios.sum(axis=1)
    return ratios.div(denom, axis=0)
p = get_multinomial_probabilities(df, choc_lm, straw_lm)
# returns:
# 0 1 2
# 0 0.164835 0.318681 0.516483
# 1 0.293578 0.266055 0.440367
- Using the external=True flag on a Predictor currently has a performance penalty, which may be significant with a large population. We are currently monitoring real-world usage to determine how to modify the code to avoid this.
The most common usage of the linear model is to specify the probability of something happening for different groups, where the parameters come in the form of a value for each particular group. This is essentially a MULTIPLICATIVE-type model with an intercept of 1.0 and a whole load of Predictors.
You could write this:
eq = LinearModel(
    LinearModelType.MULTIPLICATIVE,  # <--- nearly always this is the right one to use
    1.0,                             # <--- nearly always this is the right one to use
    Predictor('column1').when(True, 1),
    Predictor('column2').when(True, 2)
)
But it's even easier to just use this shortcut:
eq = LinearModel.multiplicative(  # <--- all the "usual stuff" hidden away
    Predictor('column1').when(True, 1),
    Predictor('column2').when(True, 2)
)
If the LinearModel is giving a probability of the occurrence of something, you are often going to create some random numbers in order to see if that something will happen. The LinearModel can do that for you in one step -- just pass in the relevant np.random.RandomState (usually self.module.rng inside an HSI event or self.rng in the module) and it will return bools.
By default, if you are evaluating a model in this way for one row of the dataframe, it returns just the result for that one row (instead of a pandas Series of length 1 containing the bool). To disable this behaviour and return a pandas Series object even for single-row inputs, specify squeeze_single_row_output=False as a keyword argument to predict.
See here:
import numpy as np

eq = LinearModel(
    LinearModelType.MULTIPLICATIVE,
    0.5  # <--- the probability applied to all rows in the dataframe, as there are no predictors
)
df = pd.DataFrame(index=['row1', 'row2', 'row3', 'row4'])
rng = np.random.RandomState(0)  # <--- replace with self.module.rng in an HSI or self.rng in a module
outcome = eq.predict(df, rng=rng)
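So, for example, deciding whether one particular person experiences the event can be done in one line (a sketch, assuming person_id is a valid index in df):
does_event_happen = eq.predict(df.loc[[person_id]], rng=rng)  # returns a single bool, not a Series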