
Cookbook

Asif Tamuri edited this page Oct 30, 2018 · 52 revisions

Add requests in this document


Date arithmetic

Pandas timeseries documentation

Dates should be 'Timestamp' objects and intervals should be 'Timedelta' objects.

Handle partial months and years

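Note: Pandas does not handle partial months and years. With unit='M' or unit='Y' the fractional part is silently dropped, as the output below shows (more recent pandas versions reject these units with a ValueError). Convert the interval to days instead, e.g.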

# pandas can handle partial days
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='d')
TimedeltaIndex(['0 days 06:00:00', '0 days 12:00:00', '1 days 00:00:00',
                '1 days 12:00:00', '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq=None)

# pandas cannot handle partial months
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='M')
TimedeltaIndex([ '0 days 00:00:00',  '0 days 00:00:00', '30 days 10:29:06',
                '30 days 10:29:06', '60 days 20:58:12'],
               dtype='timedelta64[ns]', freq=None)

# pandas cannot handle partial years
>>> pd.to_timedelta([0.25, 0.5, 1, 1.5, 2], unit='Y')
TimedeltaIndex([  '0 days 00:00:00',   '0 days 00:00:00', '365 days 05:49:12',
                '365 days 05:49:12', '730 days 11:38:24'],
               dtype='timedelta64[ns]', freq=None)

The way to handle this is to multiply by the average number of days in a month (30.44) or in a year (365.25). For example:

# a NumPy array (np) makes to_timedelta return a TimedeltaIndex, as printed below
partial_interval = np.array([0.25, 0.5, 1, 1.5, 2])

# we want timedelta for 0.25, 0.5, 1, 1.5 etc months, we need to convert to days
interval = pd.to_timedelta(partial_interval * 30.44, unit='d')
print(interval)

TimedeltaIndex([ '7 days 14:38:24', '15 days 05:16:48', '30 days 10:33:36',
                '45 days 15:50:24', '60 days 21:07:12'],
               dtype='timedelta64[ns]', freq=None)

# we want timedelta for 0.25, 0.5, 1, 1.5 etc years, we need to convert to days
interval = pd.to_timedelta(partial_interval * 365.25, unit='d')
print(interval)

TimedeltaIndex([ '91 days 07:30:00', '182 days 15:00:00', '365 days 06:00:00',
                '547 days 21:00:00', '730 days 12:00:00'],
               dtype='timedelta64[ns]', freq=None)
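
The conversion above can be wrapped in small helpers so that the constants live in one place. This is a minimal sketch: the function and constant names are hypothetical and not part of TLO or pandas, and unit='D' is used as the uppercase spelling of the day unit.

```python
import pandas as pd

# Average lengths in days: the same constants as used above
DAYS_IN_MONTH = 30.44
DAYS_IN_YEAR = 365.25


def months_to_timedelta(months):
    """Convert (possibly fractional) months to timedeltas via average days per month."""
    return pd.to_timedelta(pd.Series(months, dtype=float) * DAYS_IN_MONTH, unit='D')


def years_to_timedelta(years):
    """Convert (possibly fractional) years to timedeltas via average days per year."""
    return pd.to_timedelta(pd.Series(years, dtype=float) * DAYS_IN_YEAR, unit='D')


half_year = years_to_timedelta([0.5])
print(half_year.iloc[0])  # 182 days 15:00:00
```

Because the helpers take and return a Series, the result keeps one timedelta per input value and can be added directly to a date column.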

Adding/subtracting time intervals from a date

current_date = self.sim.date

# sample a list of numbers from an exponential distribution
# (remember to use self.rng instead of np.random in TLO code)
random_draw = np.random.exponential(scale=5, size=10)

# interpret these numbers as years and convert them to timedeltas
# REMEMBER: Pandas cannot handle fractions of months or years,
# so convert to days first (365.25 days per year on average)
random_years = pd.to_timedelta(random_draw * 365.25, unit='d')

# add to current date
future_dates = current_date + random_years
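
The same idea can be applied per person and stored in a dataframe column. The sketch below runs outside TLO, so a fixed Timestamp and a seeded RandomState stand in for self.sim.date and self.rng, and the column name date_of_next_event is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-ins (assumptions) for self.sim.date and self.rng in a TLO module
current_date = pd.Timestamp(2010, 1, 1)
rng = np.random.RandomState(seed=0)

# a tiny stand-in for the population dataframe
df = pd.DataFrame(index=range(5))

# one draw per person, interpreted as years until the event
years_until_event = pd.Series(rng.exponential(scale=5, size=len(df)), index=df.index)

# convert fractional years to days, then add to the current date
df['date_of_next_event'] = current_date + pd.to_timedelta(years_until_event * 365.25, unit='D')
print(df)
```

Adding a Timestamp to a Series of timedeltas gives a Series of dates with the same index, so the assignment lines up with the rows of df.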

A regular event for individual in population

An event scheduled to run every day on a given person. Note the order of the mixin & superclass:

class MyRegularEventOnIndividual(IndividualScopeEventMixin, RegularEvent):
    def __init__(self, module, person):
        super().__init__(module=module, person=person, frequency=DateOffset(days=1))

    def apply(self, person):
        print('do something on person', person.index, 'on', self.sim.date)

Add to simulation e.g. in initialise_simulation():

sim.schedule_event(MyRegularEventOnIndividual(module=self, person=an_individual),
                   sim.date + DateOffset(days=1))
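
DateOffset, used above for the frequency and the first scheduled date, does calendar-aware arithmetic. Unlike Timedelta, it can add whole months or years without converting to days. A minimal sketch in plain pandas, independent of the TLO classes:

```python
import pandas as pd
from pandas import DateOffset

date = pd.Timestamp(2010, 1, 31)

# whole days behave like a Timedelta
print(date + DateOffset(days=1))    # 2010-02-01 00:00:00

# whole months follow the calendar; the day is clipped to the end of February
print(date + DateOffset(months=1))  # 2010-02-28 00:00:00

# whole years keep the same month and day
print(date + DateOffset(years=1))   # 2011-01-31 00:00:00
```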

Understanding assignment by index or row offset

When you assign a series/column of values from one dataframe/series to another dataframe/series, Pandas will by default honour the index on the collection. However, you can ignore the index by accessing the values directly. Example (run in a Python console):

import pandas as pd

# create a dataframe with one column
df1 = pd.DataFrame({'column_1': range(0, 5)})
df1.index.name = 'df1_index'
print(df1)

# df1:
#            column_1
# df1_index
# 0                 0
# 1                 1
# 2                 2
# 3                 3
# 4                 4

df2 = pd.DataFrame({'column_2': range(10, 15)})
df2.index.name = 'df2_index'
df2 = df2.sort_values(by='column_2', ascending=False) # reverse the order of rows in df2
print(df2)

# notice the df2_index:
#
#            column_2
# df2_index
# 4                14
# 3                13
# 2                12
# 1                11
# 0                10

# if we assign one column to another, Pandas will use the index to align the rows
df1['df2_col2_use_index'] = df2['column_2']

# if we assign the column's values to another, Pandas will ignore the index
df1['df2_col2_use_row_offset'] = df2['column_2'].values

# note difference when assigning using index vs '.values'
print(df1)

#            column_1  df2_col2_use_index  df2_col2_use_row_offset
# df1_index
# 0                 0                  10                       14
# 1                 1                  11                       13
# 2                 2                  12                       12
# 3                 3                  13                       11
# 4                 4                  14                       10
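
The difference matters most when the two collections do not cover the same index labels. The sketch below (with made-up values) shows that index-based assignment fills unmatched rows with NaN, whereas assigning '.values' needs exactly one value per row.

```python
import pandas as pd

df1 = pd.DataFrame({'column_1': range(0, 5)})

# a series that only has values for index labels 1 and 3
partial = pd.Series([100, 300], index=[1, 3])

# index-based assignment: rows 1 and 3 get values, the other rows become NaN
df1['by_index'] = partial
print(df1)

# position-based assignment: the lengths differ, so pandas raises a ValueError
try:
    df1['by_position'] = partial.values
except ValueError as e:
    print('length mismatch:', e)
```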

Assign values to population with specified probability

Assigning True or False at given probability

Assign True to all individuals at probability p_true (otherwise False)

df = population.props
random_draw = self.rng.random_sample(size=len(df))  # random sample for each person between 0 and 1
df['my_property'] = (random_draw < p_true)  # True with probability p_true

Or randomly sample a fixed number of rows, in proportion to p_true:

df = population.props
df['my_property'] = False
# sample without replacement so that no row is picked twice
sampled_indices = self.rng.choice(df.index.values, size=int(len(df) * p_true), replace=False)
df.loc[sampled_indices, 'my_property'] = True

Alternatively, sample a proportion of the index and set those rows:

df = population.props
df['my_property'] = False
df.loc[df.index.to_series().sample(frac=p_true, random_state=self.rng).index, 'my_property'] = True
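
The approaches above differ in one respect: an independent draw per person gives a number of True values that varies around the expected proportion, whereas sampling rows without replacement gives an exact count. A minimal sketch outside TLO, where a seeded RandomState stands in for self.rng, a small dataframe stands in for population.props, and the column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=1)   # stand-in for self.rng
df = pd.DataFrame(index=range(1000))  # stand-in for population.props
p_true = 0.25

# independent draw per person: the count of True varies around p_true * len(df)
df['by_draw'] = rng.random_sample(size=len(df)) < p_true

# sampling rows without replacement: exactly int(len(df) * p_true) rows are True
df['by_sample'] = False
chosen = rng.choice(df.index.values, size=int(len(df) * p_true), replace=False)
df.loc[chosen, 'by_sample'] = True

print(df['by_draw'].sum(), df['by_sample'].sum())
```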

Assigning True or False at different rates based on criteria

Imagine the probability of my_property being true differs by sex.

df = population.props

# create a dataframe to hold the probabilities (or read from an Excel workbook)
prob_by_sex = pd.DataFrame(data=[('M', 0.46), ('F', 0.62)], columns=['sex', 'p_true'])

# merge with the population dataframe
df_with_prob = df[['sex']].merge(prob_by_sex, on='sex', how='left')

# randomly sample numbers between 0 and 1
random_draw = self.rng.random_sample(size=len(df))

# assign true or false based on draw and individual's p_true
# (use .values because the merge resets the index)
df['my_property'] = (random_draw < df_with_prob.p_true.values)
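
When the probabilities depend on a single column, Series.map is a possible alternative to the merge, and it keeps the population index intact. This is a sketch under assumptions, not the TLO method: a seeded RandomState stands in for self.rng and a generated dataframe stands in for population.props.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=2)  # stand-in for self.rng
df = pd.DataFrame({'sex': rng.choice(['M', 'F'], size=2000)})  # stand-in for population.props

# look up each person's probability from their sex
p_true_by_sex = {'M': 0.46, 'F': 0.62}
p_true = df['sex'].map(p_true_by_sex)

# True where the person's draw falls below their own probability
df['my_property'] = rng.random_sample(size=len(df)) < p_true

# the observed proportions should be close to 0.46 and 0.62
print(df.groupby('sex')['my_property'].mean())
```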

Assigning value from a set with given probability

df = population.props

# get the categories and probabilities (read from Excel file/in the code etc)
categories = [1, 2, 3, 4]  # or categories = ['A', 'B', 'C', 'D']
probabilities = [0.1, 0.2, 0.3, 0.4]

random_choice = self.rng.choice(categories, size=len(df), p=probabilities)

# if 'categories' should be treated as a plain old number or string
df['my_category'] = random_choice

# else if 'categories' should be treated as a real Pandas Categorical
# i.e. property was set up using Types.CATEGORICAL
df['my_category'].values[:] = random_choice
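
To check a draw like this outside TLO, the sampled values can be wrapped in a pandas Categorical and the observed frequencies compared with the requested probabilities. A minimal sketch in plain pandas, where a seeded RandomState stands in for self.rng; how a property set up with Types.CATEGORICAL behaves inside TLO is not covered here.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=3)   # stand-in for self.rng
df = pd.DataFrame(index=range(1000))  # stand-in for population.props

categories = ['A', 'B', 'C', 'D']
probabilities = [0.1, 0.2, 0.3, 0.4]

# rng.choice requires the probabilities to sum to 1
assert abs(sum(probabilities) - 1.0) < 1e-9

random_choice = rng.choice(categories, size=len(df), p=probabilities)

# store as a Categorical with an explicit, fixed set of categories
df['my_category'] = pd.Categorical(random_choice, categories=categories)

# observed frequencies, which should be close to the requested probabilities
print(df['my_category'].value_counts(normalize=True).sort_index())
```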