Cookbook
Add requests in this document
- Date arithmetic
- A regular event for individual in population
- Understanding assignment by index or row offset
- Assign values to population with specified probability
Pandas timeseries documentation
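Dates should be 'Timestamp' objects and intervals should be 'Timedelta' objects.
Note: Pandas cannot represent partial months or years as a Timedelta, because their length in days varies. Convert the interval to days instead. The month and year outputs below come from an older Pandas release; newer releases raise an error for unit='M' and unit='Y'. For example: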
# pandas can handle partial days
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='d')
TimedeltaIndex(['0 days 06:00:00', '0 days 12:00:00', '1 days 00:00:00',
'1 days 12:00:00', '2 days 00:00:00'],
dtype='timedelta64[ns]', freq=None)
# pandas cannot handle partial months
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='M')
TimedeltaIndex([ '0 days 00:00:00', '0 days 00:00:00', '30 days 10:29:06',
'30 days 10:29:06', '60 days 20:58:12'],
dtype='timedelta64[ns]', freq=None)
# pandas cannot handle partial years
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='Y')
TimedeltaIndex([ '0 days 00:00:00', '0 days 00:00:00', '365 days 05:49:12',
'365 days 05:49:12', '730 days 11:38:24'],
dtype='timedelta64[ns]', freq=None)
The way to handle this is to multiply by the average number of days in a month (30.44) or a year (365.25). For example:
partial_interval = pd.Series([0.25, 0.5, 1, 1.5, 2])
# we want timedelta for 0.25, 0.5, 1, 1.5 etc months, we need to convert to days
interval = pd.to_timedelta(partial_interval * 30.44, unit='d')
print(interval)
TimedeltaIndex([ '7 days 14:38:24', '15 days 05:16:48', '30 days 10:33:36',
'45 days 15:50:24', '60 days 21:07:12'],
dtype='timedelta64[ns]', freq=None)
# we want timedelta for 0.25, 0.5, 1, 1.5 etc years, we need to convert to days
interval = pd.to_timedelta(partial_interval * 365.25, unit='d')
print(interval)
TimedeltaIndex([ '91 days 07:30:00', '182 days 15:00:00', '365 days 06:00:00',
'547 days 21:00:00', '730 days 12:00:00'],
dtype='timedelta64[ns]', freq=None)
current_date = self.sim.date
# sample a list of numbers from an exponential distribution
# (remember to use self.rng in TLO code)
random_draw = np.random.exponential(scale=5, size=10)
# treat these numbers as years and convert them to days first
# REMEMBER: Pandas cannot handle fractions of months or years
random_years = pd.to_timedelta(random_draw * 365.25, unit='d')
# add to current date
future_dates = current_date + random_years
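The date arithmetic above can be tried outside the model with the minimal sketch below. It assumes a seeded `np.random.RandomState` as a stand-in for `self.rng` and a fixed `pd.Timestamp` as a stand-in for `self.sim.date`; both are placeholders for illustration. It also shows `pd.DateOffset`, which is one way to add whole calendar months or years without converting to days.

```python
import numpy as np
import pandas as pd

# Stand-ins for self.rng and self.sim.date (assumptions for this sketch)
rng = np.random.RandomState(seed=0)
current_date = pd.Timestamp("2010-01-31")

# Fractional years: convert to days first, then build the Timedelta
years = rng.exponential(scale=5, size=3)
future_dates = current_date + pd.to_timedelta(years * 365.25, unit="D")
print(future_dates)

# Whole calendar months or years: DateOffset follows the calendar,
# clipping to the last valid day of the target month
one_month_later = current_date + pd.DateOffset(months=1)
one_year_later = current_date + pd.DateOffset(years=1)
print(one_month_later.date())  # 2010-02-28
print(one_year_later.date())   # 2011-01-31
```

DateOffset only accepts whole units, so fractional months or years still need the conversion to days.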
An event scheduled to run every day on a given person. Note the order of the mixin & superclass:
class MyRegularEventOnIndividual(IndividualScopeEventMixin, RegularEvent):
    def __init__(self, module, person):
        super().__init__(module=module, person=person, frequency=DateOffset(days=1))

    def apply(self, person):
        print('do something on person', person.index, 'on', self.sim.date)
Add to the simulation, e.g. in initialise_simulation():
sim.schedule_event(MyRegularEventOnIndividual(module=self, person=an_individual),
                   sim.date + DateOffset(days=1))
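Why the order of the mixin and the superclass matters can be illustrated in plain Python. The classes below are hypothetical stand-ins written for this sketch, not the TLO classes themselves. They assume the mixin consumes the `person` argument and forwards everything else along the method resolution order.

```python
class SketchRegularEvent:
    """Hypothetical stand-in for a base event with a fixed frequency."""

    def __init__(self, module, frequency):
        self.module = module
        self.frequency = frequency


class SketchIndividualMixin:
    """Hypothetical stand-in: keeps the person, forwards the other arguments."""

    def __init__(self, *args, person, **kwargs):
        super().__init__(*args, **kwargs)
        self.target = person


# Mixin first, base class second: the mixin's __init__ runs first,
# strips 'person', then hands the remaining arguments to the base class.
class SketchEvent(SketchIndividualMixin, SketchRegularEvent):
    def __init__(self, module, person):
        super().__init__(module=module, person=person, frequency="1 day")


event = SketchEvent(module="my_module", person=42)
print([cls.__name__ for cls in SketchEvent.__mro__])
# ['SketchEvent', 'SketchIndividualMixin', 'SketchRegularEvent', 'object']
```

With the bases reversed in this sketch, the base class would receive the unexpected `person` keyword and raise a TypeError.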
When you assign a series/column of values from one dataframe/series to another dataframe/series, Pandas will by default honour the index on the collection. However, you can ignore the index by accessing the values directly. Example (run in a Python console):
import pandas as pd
# create a dataframe with one column
df1 = pd.DataFrame({'column_1': range(0, 5)})
df1.index.name = 'df1_index'
print(df1)
# df1:
#            column_1
# df1_index
# 0                 0
# 1                 1
# 2                 2
# 3                 3
# 4                 4
df2 = pd.DataFrame({'column_2': range(10, 15)})
df2.index.name = 'df2_index'
df2 = df2.sort_values(by='column_2', ascending=False) # reverse the order of rows in df2
print(df2)
# notice the df2_index:
#
#            column_2
# df2_index
# 4                14
# 3                13
# 2                12
# 1                11
# 0                10
# if we assign one column to another, Pandas will use the index to merge the columns
df1['df2_col2_use_index'] = df2['column_2']
# if we assign the column's values to another, Pandas will ignore the index
df1['df2_col2_use_row_offset'] = df2['column_2'].values
# note difference when assigning using index vs '.values'
print(df1)
#            column_1  df2_col2_use_index  df2_col2_use_row_offset
# df1_index
# 0                 0                  10                       14
# 1                 1                  11                       13
# 2                 2                  12                       12
# 3                 3                  13                       11
# 4                 4                  14                       10
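Index alignment also matters when the two indexes only partly overlap. The small sketch below uses made-up data to show that a label with no match receives NaN when assigning by index, while assigning `.values` simply fills rows in order.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})            # index labels 0, 1, 2
s = pd.Series([10, 20, 30], index=[1, 2, 3])   # index labels 1, 2, 3

df["by_index"] = s            # aligned on labels; label 0 has no match
df["by_position"] = s.values  # raw values, assigned row by row

print(df)
```

Here `by_index` holds NaN for label 0 and the value for label 3 is silently dropped, which is easy to miss in a large population dataframe.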
Assign True to all individuals at probability p_true (otherwise False):
df = population.props
random_draw = self.rng.random_sample(size=len(df))  # random sample for each person between 0 and 1
df['my_property'] = (random_draw < p_true)  # True with probability p_true
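A quick way to check this kind of assignment is to run it on synthetic data and look at the share of True values. The sketch below assumes a seeded `np.random.RandomState` in place of `self.rng` and an empty dataframe in place of the population. A uniform draw falls below `p_true` with probability `p_true`, so the share should be close to it.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=1)       # stand-in for self.rng
df = pd.DataFrame(index=range(100_000))   # stand-in for the population dataframe
p_true = 0.3

# One uniform draw per person; True where the draw falls below p_true
random_draw = rng.random_sample(size=len(df))
df["my_property"] = random_draw < p_true

share_true = df["my_property"].mean()
print(share_true)  # close to p_true, but not exactly equal
```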
Or randomly sample a fixed number of rows, in proportion to p_true:
df = population.props
df['my_property'] = False
# replace=False so that no individual is picked twice
sampled_indices = self.rng.choice(df.index.values, size=int(len(df) * p_true), replace=False)
df.loc[sampled_indices, 'my_property'] = True
You can sample a proportion of the index and set those:
df = population.props
df['my_property'] = False
# pass self.rng as random_state so the sample is reproducible
df.loc[df.index.to_series().sample(frac=p_true, random_state=self.rng).index, 'my_property'] = True
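Unlike per-person random draws, sampling rows fixes the number of individuals set to True. The sketch below uses synthetic data and a seeded `RandomState` as a stand-in for `self.rng`. It also shows that sampling with replacement can pick the same row more than once, so the number of distinct rows may fall short of the target.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=2)   # stand-in for self.rng
df = pd.DataFrame({"my_property": False}, index=range(1000))
p_true = 0.25
n_target = int(len(df) * p_true)

# Without replacement: exactly n_target distinct rows are selected
sampled = rng.choice(df.index.values, size=n_target, replace=False)
df.loc[sampled, "my_property"] = True
print(df["my_property"].sum())  # 250

# With replacement: duplicates are possible, so distinct rows <= n_target
with_dupes = rng.choice(df.index.values, size=n_target, replace=True)
print(len(np.unique(with_dupes)))
```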
Imagine we have a different rate of my_property being true based on sex.
df = population.props
# create a dataframe to hold the probabilities (or read from an Excel workbook)
prob_by_sex = pd.DataFrame(data=[('M', 0.46), ('F', 0.62)], columns=['sex', 'p_true'])
# merge with the population dataframe
df_with_prob = df[['sex']].merge(prob_by_sex, left_on=['sex'], right_on=['sex'], how='left')
# randomly sample numbers between 0 and 1
random_draw = self.rng.random_sample(size=len(df))
# assign True where the draw falls below the individual's p_true
df['my_property'] = (random_draw < df_with_prob.p_true.values)
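When the probability depends on a single column, mapping a dictionary onto that column is a possible alternative to the merge. This is a sketch on synthetic data, not the model's own code. The `RandomState` stands in for `self.rng`, and the rates are the ones from the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=3)   # stand-in for self.rng
df = pd.DataFrame({"sex": rng.choice(["M", "F"], size=50_000)})

# Look up each individual's probability; the result keeps df's index
p_by_sex = {"M": 0.46, "F": 0.62}
p_true = df["sex"].map(p_by_sex)

# True where the uniform draw falls below that individual's probability
random_draw = rng.random_sample(size=len(df))
df["my_property"] = random_draw < p_true.values

rates = df.groupby("sex")["my_property"].mean()
print(rates)  # roughly 0.62 for F and 0.46 for M
```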
Assign a category to each individual according to specified probabilities:
df = population.props
# get the categories and probabilities (read from Excel file/in the code etc)
categories = [1, 2, 3, 4] # or categories = ['A', 'B', 'C', 'D']
probabilities = [0.1, 0.2, 0.3, 0.4]
random_choice = self.rng.choice(categories, size=len(df), p=probabilities)
# if 'categories' should be treated as a plain old number or string
df['my_category'] = random_choice
# else if 'categories' should be treated as a real Pandas Categorical
# i.e. property was set up using Types.CATEGORICAL
df['my_category'].values[:] = random_choice
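The categorical case can be tried on synthetic data as below. This sketch builds the column with `pd.Categorical` directly rather than writing into an existing property, so it does not reproduce how a Types.CATEGORICAL property is set up in the model. The `RandomState` stands in for `self.rng`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=4)   # stand-in for self.rng
categories = ["A", "B", "C", "D"]
probabilities = [0.1, 0.2, 0.3, 0.4]

df = pd.DataFrame(index=range(20_000))  # stand-in for the population dataframe
random_choice = rng.choice(categories, size=len(df), p=probabilities)

# Wrap the draws in a Categorical so the column has dtype 'category'
df["my_category"] = pd.Categorical(random_choice, categories=categories)

shares = df["my_category"].value_counts(normalize=True).sort_index()
print(df["my_category"].dtype)  # category
print(shares)                   # close to the requested probabilities
```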
TLO Model Wiki