Cookbook

Add requests for new recipes to this document.

Date arithmetic

Dates should be 'Timestamp' objects and intervals should be 'Timedelta' objects.
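
As a minimal sketch of how the two types combine (the date and interval values here are arbitrary illustrations, not taken from the model):

```python
import pandas as pd

start = pd.Timestamp('2010-01-01')   # a date (illustrative value)
one_week = pd.Timedelta(days=7)      # an interval (illustrative value)

later = start + one_week             # Timestamp + Timedelta -> Timestamp
elapsed = later - start              # Timestamp - Timestamp -> Timedelta

print(later)    # 2010-01-08 00:00:00
print(elapsed)  # 7 days 00:00:00
```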

Handle partial months and years

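Note: Pandas cannot handle fractional months or years. Older versions silently truncate the fractional part, as the output below shows, and pandas 2.0 and later reject the 'M' and 'Y' units altogether. Convert the interval to days instead, e.g.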

# pandas can handle partial days
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='d')
TimedeltaIndex(['0 days 06:00:00', '0 days 12:00:00', '1 days 00:00:00',
                '1 days 12:00:00', '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq=None)

# pandas cannot handle partial months
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='M')
TimedeltaIndex([ '0 days 00:00:00',  '0 days 00:00:00', '30 days 10:29:06',
                '30 days 10:29:06', '60 days 20:58:12'],
               dtype='timedelta64[ns]', freq=None)

# pandas cannot handle partial years
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='Y')
TimedeltaIndex([  '0 days 00:00:00',   '0 days 00:00:00', '365 days 05:49:12',
                '365 days 05:49:12', '730 days 11:38:24'],
               dtype='timedelta64[ns]', freq=None)

The way to handle this is to multiply by the average number of days in a month (30.44) or in a year (365.25). For example:

# a NumPy array, so that the results below print as a TimedeltaIndex
partial_interval = np.array([0.25, 0.5, 1, 1.5, 2])

# we want timedelta for 0.25, 0.5, 1, 1.5 etc months, we need to convert to days
interval = pd.to_timedelta(partial_interval * 30.44, unit='d')
print(interval)

TimedeltaIndex([ '7 days 14:38:24', '15 days 05:16:48', '30 days 10:33:36',
                '45 days 15:50:24', '60 days 21:07:12'],
               dtype='timedelta64[ns]', freq=None)

# we want timedelta for 0.25, 0.5, 1, 1.5 etc years, we need to convert to days
interval = pd.to_timedelta(partial_interval * 365.25, unit='d')
print(interval)

TimedeltaIndex([ '91 days 07:30:00', '182 days 15:00:00', '365 days 06:00:00',
                '547 days 21:00:00', '730 days 12:00:00'],
               dtype='timedelta64[ns]', freq=None)

Adding/subtracting time intervals from a date

current_date = self.sim.date

# sample a list of numbers from an exponential distribution
# (remember to use self.rng in TLO code)
random_draw = np.random.exponential(scale=5, size=10)

# treat these numbers as years and convert them to timedeltas
# REMEMBER: Pandas cannot handle fractions of months or years,
# so convert to days using the average length of a year
random_years = pd.to_timedelta(random_draw * 365.25, unit='d')

# add to current date
future_dates = current_date + random_years
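
The steps above can be tried outside a simulation. This is only a sketch: the seeded `RandomState` stands in for `self.rng`, the fixed `Timestamp` stands in for `self.sim.date`, and both values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=0)          # assumed stand-in for self.rng
current_date = pd.Timestamp('2010-01-01')    # assumed stand-in for self.sim.date

# fractional numbers of years drawn from an exponential distribution
random_draw = rng.exponential(scale=5, size=10)

# convert years to days using the average year length, then to timedeltas
random_years = pd.to_timedelta(random_draw * 365.25, unit='D')

# Timestamp + TimedeltaIndex gives a DatetimeIndex of future dates
future_dates = current_date + random_years
print(future_dates)
```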

A regular event for an individual in the population

An event scheduled to run every day on a given person. Note the order of the mixin & superclass:

class MyRegularEventOnIndividual(IndividualScopeEventMixin, RegularEvent):
    def __init__(self, module, person):
        super().__init__(module=module, person=person, frequency=DateOffset(days=1))

    def apply(self, person):
        print('do something on person', person.index, 'on', self.sim.date)

Add to simulation e.g. in initialise_simulation():

sim.schedule_event(MyRegularEventOnIndividual(module=self, person=an_individual),
                   sim.date + DateOffset(days=1))
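
The date passed to schedule_event is ordinary pandas date arithmetic. As a small sketch (the start date is an arbitrary illustration), DateOffset works in whole calendar units, so it is a reasonable choice when the interval is a whole number of days or months:

```python
import pandas as pd
from pandas import DateOffset

today = pd.Timestamp('2010-01-31')   # illustrative stand-in for sim.date

next_day = today + DateOffset(days=1)
next_month = today + DateOffset(months=1)   # clamps to the last day of February

print(next_day)    # 2010-02-01 00:00:00
print(next_month)  # 2010-02-28 00:00:00
```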

Understanding assignment by index or row offset

When you assign a series/column of values from one dataframe/series to another dataframe/series, Pandas will by default honour the index on the collection. However, you can ignore the index by accessing the values directly. Example (run in a Python console):

import pandas as pd

# create a dataframe with one column
df1 = pd.DataFrame({'column_1': range(0, 5)})
df1.index.name = 'df1_index'
print(df1)

# df1:
#            column_1
# df1_index
# 0                 0
# 1                 1
# 2                 2
# 3                 3
# 4                 4

df2 = pd.DataFrame({'column_2': range(10, 15)})
df2.index.name = 'df2_index'
df2 = df2.sort_values(by='column_2', ascending=False) # reverse the order of rows in df2
print(df2)

# notice the df2_index:
#
#            column_2
# df2_index
# 4                14
# 3                13
# 2                12
# 1                11
# 0                10

# if we assign one column to another, Pandas will use the index to align the rows
df1['df2_col2_use_index'] = df2['column_2']

# if we assign the column's values to another, Pandas will ignore the index
df1['df2_col2_use_row_offset'] = df2['column_2'].values

# note difference when assigning using index vs '.values'
print(df1)

#            column_1  df2_col2_use_index  df2_col2_use_row_offset
# df1_index
# 0                 0                  10                       14
# 1                 1                  11                       13
# 2                 2                  12                       12
# 3                 3                  13                       11
# 4                 4                  14                       10
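
The example above uses two indexes with the same labels. A further sketch, with made-up frames whose labels only partly overlap, shows what index alignment does in that case: labels missing from the source become NaN and labels missing from the target are dropped.

```python
import pandas as pd

target = pd.DataFrame({'column_1': range(3)})        # index labels 0, 1, 2
source = pd.Series([10, 11, 12], index=[1, 2, 3])    # index labels 1, 2, 3

target['by_index'] = source             # aligned on index labels
target['by_position'] = source.values   # copied row by row, index ignored

print(target['by_index'].tolist())      # [nan, 10.0, 11.0]
print(target['by_position'].tolist())   # [10, 11, 12]
```

Note that the aligned column becomes floating point because it has to hold NaN for label 0.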

Assign values to population with specified probability

Assigning True or False at given probability

Assign True to all individuals at probability p_true (otherwise False)

df = population.props
random_draw = self.rng.random_sample(size=len(df))  # random sample for each person between 0 and 1
df['my_property'] = (random_draw < p_true)  # True with probability p_true

or randomly select a fixed proportion p_true of the rows:

df = population.props
df['my_property'] = False
# replace=False so that no row is picked twice
sampled_indices = self.rng.choice(df.index.values, size=int(len(df) * p_true), replace=False)
df.loc[sampled_indices, 'my_property'] = True

Alternatively, sample a proportion of the index and set those rows:

df = population.props
df['my_property'] = False
df.loc[df.index.to_series().sample(frac=p_true, random_state=self.rng).index, 'my_property'] = True
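
The approaches above are not equivalent: an independent draw per person gives a proportion that is only approximately p_true, while sampling rows gives a fixed count. The following sketch illustrates the difference; the seeded `RandomState`, the empty 10,000-row frame, and the value of `p_true` are stand-ins assumed for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=0)      # assumed stand-in for self.rng
p_true = 0.25                            # illustrative probability
df = pd.DataFrame(index=range(10000))    # assumed stand-in for population.props

# independent draw per person: each row is True with probability p_true
df['by_draw'] = rng.random_sample(size=len(df)) < p_true

# fixed proportion: exactly int(len(df) * p_true) rows are set to True
df['by_sample'] = False
chosen = rng.choice(df.index.values, size=int(len(df) * p_true), replace=False)
df.loc[chosen, 'by_sample'] = True

print(df['by_draw'].mean())     # close to 0.25, depends on the seed
print(df['by_sample'].mean())   # 0.25
```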

Assigning True or False at different rates based on criteria

Imagine the probability of my_property being True differs by sex.

df = population.props

# create a dataframe to hold the probabilities (or read from an Excel workbook)
prob_by_sex = pd.DataFrame(data=[('M', 0.46), ('F', 0.62)], columns=['sex', 'p_true'])

# merge with the population dataframe
df_with_prob = df[['sex']].merge(prob_by_sex, left_on=['sex'], right_on=['sex'], how='left')

# randomly sample numbers between 0 and 1
random_draw = self.rng.random_sample(size=len(df))

# assign True or False based on the draw and the individual's p_true
df['my_property'] = (random_draw < df_with_prob.p_true.values)
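
As an alternative sketch, the lookup can be done with `Series.map` instead of a merge, which keeps the population index so `.values` is not needed. This is not the method used above; the dictionary name, the seeded `RandomState`, and the synthetic population are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=0)   # assumed stand-in for self.rng

# assumed stand-in for population.props, with a synthetic 'sex' column
df = pd.DataFrame({'sex': rng.choice(['M', 'F'], size=10000)})

# hypothetical lookup table holding the probabilities from the example above
p_true_by_sex = {'M': 0.46, 'F': 0.62}

# look up each person's probability; the result keeps df's index
p_true = df['sex'].map(p_true_by_sex)

# True where the uniform draw falls below that person's probability
random_draw = rng.random_sample(size=len(df))
df['my_property'] = random_draw < p_true

# observed rates should be roughly 0.62 for F and 0.46 for M
print(df.groupby('sex')['my_property'].mean())
```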

Assigning value from a set with given probability

df = population.props

# get the categories and probabilities (read from Excel file/in the code etc)
categories = [1, 2, 3, 4]  # or categories = ['A', 'B', 'C', 'D']
probabilities = [0.1, 0.2, 0.3, 0.4]

random_choice = self.rng.choice(categories, size=len(df), p=probabilities)

# if 'categories' should be treated as a plain old number or string
df['my_category'] = random_choice

# else if 'categories' should be treated as a real Pandas Categorical
# i.e. property was set up using Types.CATEGORICAL
df['my_category'].values[:] = random_choice
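
The in-place assignment through `.values[:]` depends on pandas returning a view of the column, which may not hold in every pandas version. A more explicit sketch wraps the draw in a `pd.Categorical` with the same categories. Whether this matches how a Types.CATEGORICAL property is set up is an assumption, and the frame and seeded `RandomState` below are stand-ins.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=0)   # assumed stand-in for self.rng
categories = ['A', 'B', 'C', 'D']
probabilities = [0.1, 0.2, 0.3, 0.4]

# assumed stand-in for a property that was created as a categorical column
df = pd.DataFrame({'my_category': pd.Categorical(['A'] * 1000, categories=categories)})

random_choice = rng.choice(categories, size=len(df), p=probabilities)

# build a Categorical with the same categories so the column dtype is preserved
df['my_category'] = pd.Categorical(random_choice, categories=categories)

print(df['my_category'].dtype)   # category
print(df['my_category'].value_counts(normalize=True).sort_index())
```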
