Alternate household allocation algorithm #168

Open
RomeshA opened this issue Apr 20, 2020 · 5 comments
Labels: approved (Confirmed that this is an issue and can be worked on), enhancement (New feature or request)


RomeshA commented Apr 20, 2020

Summary

In the attached files, make allocate_people_simple() scale better with population size

Files

allocation_functions.zip

Overview

The Covasim simulation needs to model individual households, but typically country data comes aggregated over individuals. This means that households need to be reconstructed based on statistical descriptions of ages, household sizes, and mixing between age groups. Currently the algorithm for doing this relies on having a 'reference person' in each household, but not all countries provide this data. So the challenge is to develop an algorithm that can operate without this information.

For this problem, there are three inputs (a sketch of these in code follows the list):

  • A set of people, each of whom has an age. For example, p=[5,48,32,12,68]. The ages can be any positive integer but in practice would never exceed 125.
  • A set of households, each of which has a size. For example, H=[1,4,2] represents 3 households, with 1 person, 4 people, and 2 people, respectively. The sum of the household sizes equals the total number of people, i.e. sum(H) == len(p), so there are exactly as many people as there are spaces in households to allocate them to. Each household size can be assumed to be a positive integer less than 15.
  • An age mixing matrix, which summarizes the probability/proportion of interactions taking place between people of different ages. This matrix looks like

        0-10        , 10-20       , 20-30
0-10    0.478648591 , 0.365561467 , 0.155789942
10-20   0.193983089 , 0.551832460 , 0.254184451
20-30   0.067361074 , 0.205434891 , 0.727204034
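
A minimal sketch of these inputs in Python (the names p, H, and M are illustrative only, not the names used in allocation_functions.py; the ages here have been extended to 7 entries so that sum(H) == len(p)):

import numpy as np

p = [5, 48, 32, 12, 68, 29, 77]  # ages of each person
H = [1, 4, 2]                    # household sizes; sum(H) == len(p)
M = np.array([                   # target age mixing matrix; each row sums to 1
    [0.478648591, 0.365561467, 0.155789942],   # 0-10
    [0.193983089, 0.551832460, 0.254184451],   # 10-20
    [0.067361074, 0.205434891, 0.727204034],   # 20-30
])
assert sum(H) == len(p)
assert np.allclose(M.sum(axis=1), 1)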

The rows of the matrix sum to 1. This matrix describes the proportion of contacts that a person with an age in a given row has with people of other ages, in each column. For the purpose of household contact allocation, we can assume that a household forms a fully-connected network of people, with an equal number of interactions between each household member. So for example, suppose that a household with 2 adults and 3 kids has ages [29, 27, 1, 5, 12]. Then we can produce an edge list containing all of the contact pairs (e.g. 29-27), which can then be expressed in the form of an age mixing matrix by aggregating the edges. For example, consider the 29 year old individual. Their contribution to the age mixing matrix will be in the 20-30 row, since that's the age bin they fall into. And their contribution to that row is calculated by binning the ages of the other household members they interact with:

       0-10 , 10-20 , 20-30
20-30  2    , 1     , 1

Similarly, for the 1 year old, their interactions could be summarized as

       0-10 , 10-20 , 20-30
0-10   1    , 1     , 2

Therefore, the age mixing matrix for the entire household is

        0-10 , 10-20 , 20-30
0-10    2    , 2     , 4
10-20   2    , 0     , 2
20-30   4    , 2     , 2

This matrix represents contacts for a single household. We can accumulate the interactions across all households, and then normalize so that the sum across each row is 1, ending up with something like

        0-10       , 10-20      , 20-30
0-10    0.36349418 , 0.4898672  , 0.14663862
10-20   0.75333864 , 0.04428216 , 0.2023792
20-30   0.56219066 , 0.08092245 , 0.35688689

Thus, the age mixing matrix for the population can be computed given a specification of the ages of people in each household, e.g. [[1,20],[30],[50,49,16,14]].
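
A minimal sketch of this calculation (compute_mixing_matrix() in the attached allocation_functions.py has its own implementation; the function name, bin width, and number of bins below are illustrative):

import numpy as np

def household_mixing_matrix(households, bin_width=10, n_bins=3):
    """Accumulate within-household contacts into a row-normalized age mixing matrix.

    households: list of lists of ages, e.g. [[29, 27, 1, 5, 12]]
    Ages are binned as [0, bin_width), [bin_width, 2*bin_width), ..., with the
    last bin absorbing anyone older.
    """
    counts = np.zeros((n_bins, n_bins))
    for ages in households:
        bins = np.minimum(np.array(ages) // bin_width, n_bins - 1)
        for i, bi in enumerate(bins):
            for j, bj in enumerate(bins):
                if i != j:  # household is fully connected; no self-contacts
                    counts[bi, bj] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize so each row sums to 1; rows with no contacts are left as zeros
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# For the worked example above, the unnormalized counts are [[2, 2, 4], [2, 0, 2], [4, 2, 2]], so
print(household_mixing_matrix([[29, 27, 1, 5, 12]]))
# gives rows [0.25, 0.25, 0.5], [0.5, 0, 0.5], [0.5, 0.25, 0.25]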

Question

Given a collection of people with specified ages, and a collection of households with specified sizes (such that the number of places in the households matches the number of people), assign people to households such that the resulting age mixing matrix optimally matches a target age mixing matrix also provided as input. Any reasonable metric to quantify the match between the age mixing matrices can be used, but it is of course most convenient if the metric of choice is 0 if the two matrices are identical (e.g. Frobenius distance).
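
For example, the Frobenius distance between the target and proposed matrices could be computed as follows (a sketch only; compute_allocation_quality() in the attached file has its own implementation):

import numpy as np

def allocation_quality(target, proposed):
    """Frobenius distance between two mixing matrices; 0 when they are identical."""
    return np.linalg.norm(np.asarray(target, dtype=float) - np.asarray(proposed, dtype=float), ord='fro')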

In the first instance, we can assume that the format of the age mixing matrix is fixed (i.e. a 16x16 matrix, 5 year age bins up to age 80) although ideally this could be changed (and in particular, it would be ideal if the bin width could vary e.g. 0-14 and 15-64 as bins).

Note that households must not consist entirely of individuals below a cutoff age (use 18 as an initial value). This ensures that every household contains at least one adult.
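
A minimal sketch of that constraint as a check (illustrative function name; validate_allocation() in the attached file is the authoritative version):

def every_household_has_adult(households, cutoff_age=18):
    """Return True if every household contains at least one person at or above cutoff_age."""
    return all(any(age >= cutoff_age for age in household) for household in households)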

Included files

  • inputs.xlsx containing the mixing matrix and population-level data about the number of people in each age group and household size. These data are in a similar format to real country data
  • allocation_functions.py which contains
    • sample_inputs() to generate new inputs of variable size. The algorithm will need to perform well up to ~100000 people but it is probably useful to test on smaller populations
    • allocate_people_simple() which contains a working algorithm. It needs to return the allocation of ages to households
    • validate_allocation() ensures that the allocation satisfies any constraints. Currently the only constraint is that each household must have at least one adult
    • compute_mixing_matrix() which has a basic implementation to calculate the mixing matrix for a given allocation of ages to households
    • compute_allocation_quality() which has a basic implementation of a distance metric between the target mixing matrix and the mixing matrix from the proposed allocation. This can be changed if a different distance metric is used.

Based on the code in allocation_functions.py, what is needed is a different implementation of allocate_people_simple() that minimizes the result of compute_allocation_quality() more efficiently. In particular, there are two key limitations:

  • The current algorithm scales poorly with the population size
  • It's also susceptible to getting stuck in local minima

There are some ways to speed up the calculation without changing the algorithm (e.g. tracking the contribution of each household to the mixing matrix, and only updating those for the two households being considered at a time). However, these would probably not be enough to allow the population size to scale up to 100,000+; what would be really great is an approach that runs in less than quadratic time.
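
A rough sketch of that incremental-update idea (illustrative helper names, not part of allocation_functions.py): keep the unnormalized contact counts of each household, so swapping two people only requires recomputing the two affected households rather than the whole population.

import numpy as np

def household_counts(ages, bin_width, n_bins):
    """Unnormalized within-household contact counts for one household."""
    counts = np.zeros((n_bins, n_bins))
    bins = np.minimum(np.array(ages) // bin_width, n_bins - 1)
    for i, bi in enumerate(bins):
        for j, bj in enumerate(bins):
            if i != j:
                counts[bi, bj] += 1
    return counts

def swap_and_update(households, counts_per_hh, total_counts, h1, i1, h2, i2,
                    bin_width=5, n_bins=16):
    """Swap person i1 of household h1 with person i2 of household h2, updating the
    running totals in place; only the two affected households are recomputed."""
    total_counts -= counts_per_hh[h1] + counts_per_hh[h2]
    households[h1][i1], households[h2][i2] = households[h2][i2], households[h1][i1]
    counts_per_hh[h1] = household_counts(households[h1], bin_width, n_bins)
    counts_per_hh[h2] = household_counts(households[h2], bin_width, n_bins)
    total_counts += counts_per_hh[h1] + counts_per_hh[h2]
    return total_counts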

@ckerr-IDM added the approved (Confirmed that this is an issue and can be worked on), enhancement (New feature or request), and highpriority (High priority task) labels Apr 20, 2020

gwincr11 commented Apr 22, 2020

Not sure how you want this submitted, but I played around with part of this algorithm and found an increase in speed in the setup of the households array:

I took this setup:

households = []

# Check that there are enough adults to seed every household with at least one adult
adults = ages >= 18
if sum(adults) < len(household_sizes):
    raise Exception('Not enough adults to fill houses')
ages = list(ages)
# Seed each household with the first adult found (linear scan of ages per household)
for _ in household_sizes:
    for i, age in enumerate(ages):
        if age >= 18:
            households.append([age])
            del ages[i]
            break

# Add remaining people to households
for n, household in zip(household_sizes, households):
    for i in range(n-1):
        household.append(ages.pop(0))

I moved it into its own function, then wrote a new version so I could time each:

def setup_households_fast(ages, household_sizes):
    households = []

    ages = list(ages)
    # sort the ages to put adults first then grab an adult for
    # each household that we have and shuffle the order
    ages.sort(reverse=True)
    adults = ages[:len(household_sizes)]
    if adults[-1] < 18:
        raise Exception('Not enough adults to fill houses')
    random.shuffle(adults)
    # Take everyone not in the adult array and shuffle them around
    remainder = ages[len(household_sizes):]
    random.shuffle(remainder)


    for size in household_sizes:
        # Create a house with at least 1 adult then
        # add people from the random ordered remainder to reach
        # the house size
        household = []
        household.append(adults.pop(0))
        size = size - 1
        household += remainder[:size]
        remainder = remainder[size:]
        households.append(household)

    return households

On an allocation size of 10,000 it saves about 38 seconds.
[Screenshot: timing comparison of the two setup functions ("Screen Shot 2020-04-22 at 11 31 21 AM")]
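
A rough sketch of the kind of timing comparison shown in the screenshot (illustrative only; setup_households() here stands for the original logic factored out into its own function, as described above):

import random
import time
import numpy as np

# Generate illustrative inputs: random ages, and household sizes that use up all people
n_people = 10_000
ages = np.random.randint(0, 90, n_people)
household_sizes = []
remaining = n_people
while remaining > 0:
    size = min(random.randint(1, 6), remaining)
    household_sizes.append(size)
    remaining -= size

t0 = time.perf_counter()
setup_households(ages, household_sizes)       # original implementation, factored out
t1 = time.perf_counter()
setup_households_fast(ages, household_sizes)  # faster version above
t2 = time.perf_counter()
print(f'original: {t1 - t0:.2f}s, fast: {t2 - t1:.2f}s')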


ckerr-IDM commented Apr 22, 2020

Supercool!! Thanks @gwincr11, this is awesome!! @RomeshA how shall we incorporate this in the code? Should we have a new function in population.py?


gwincr11 commented Apr 22, 2020

Oops, there is one error in there: you want to shuffle the adult array after checking the last member. I've fixed it in the code provided in the comment above.


gwincr11 commented Apr 24, 2020

I am about to go out of office for about ten days. But attached is some work where I was seeing what would happen if you tested different possible household combinations by spinning up a pool of processes. I was getting some decent speed improvements. I also did some other optimizations which are commented in the code.

It is pretty tricky to get a baseline time difference for large sets, as it is really dependent on the starting household and how many swaps are required to get to a good quality score, so my timing results were all over the map.
tests.zip

There are 2 versions in this zip file.

The first, v2.py, switches from using a for loop to create all possible combinations to using a permutation. It is a bit slower to build the list of possible outcomes; however, for larger datasets where household sizes get bigger, it was somewhat faster, as it seems to increase the quality score in larger jumps.

custom.py is a reimplementation of the original potential-swaps algorithm, but in a multi-process approach. My best outcome of the two was v2.py building a set of 100,000 in about an hour with a quality score of 3.291. However, it is not really reproducible due to the random nature of the input.
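
For illustration, a multi-process evaluation of candidate swaps might look roughly like this (a sketch only, not the code in custom.py; it reuses the household_mixing_matrix() sketch from the issue description above, and swap_quality()/best_swap_parallel() are hypothetical names):

import numpy as np
from multiprocessing import Pool

def swap_quality(args):
    """Apply one candidate swap to a copy of the allocation and return
    (Frobenius distance to the target matrix, swap)."""
    households, (h1, i1, h2, i2), target = args
    households = [list(h) for h in households]  # work on a per-task copy
    households[h1][i1], households[h2][i2] = households[h2][i2], households[h1][i1]
    mixing = household_mixing_matrix(households, bin_width=5, n_bins=target.shape[0])
    return np.linalg.norm(mixing - target, ord='fro'), (h1, i1, h2, i2)

def best_swap_parallel(households, candidate_swaps, target, processes=4):
    """Score candidate swaps across a pool of worker processes and keep the best one."""
    jobs = [(households, swap, target) for swap in candidate_swaps]
    with Pool(processes=processes) as pool:
        results = pool.map(swap_quality, jobs)
    return min(results, key=lambda r: r[0])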

Anyways there are some ideas in there, take them or leave them 😅


RomeshA commented Apr 25, 2020

@gwincr11 Thanks a lot for the optimizations, very much appreciate it! Doing it in parallel would definitely give a significant boost :) Hope you have a great vacation!

@cliffckerr removed the highpriority (High priority task) label May 1, 2020