Skip to content

Hot deck imputer class with methods allowing for random noise injections, targeted cell definition, and results summarized by cell.

License

Notifications You must be signed in to change notification settings

UI-Research/hot-deck

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

About

This repo contains Python code that implements hot deck imputation using Polars dataframes. It generalizes methods that are possible

Hot deck imputation involves randomly sampling individuals to create data for rows missing information. In many microsimulation settings at Urban, this concept is applied across datasets as well, where information missing in one dataset is inferred from another dataset. The basic process is as follows:

  • Define categorical cells that are avaialable in both datasets, for example race.
  • Divide the data into categories available in donor and source data.
  • Split specific cells further if desired.
  • Impute by randomly selecting observations from donor cells, and applying their values to recipients.
  • Compare the data among donor and recipient cells to ensure that relevant characteristics translate well from donor data to recipient data.

Example Implementation in Python

Install the package

In the command line, do: pip install git+https://github.com/UI-Research/hot-deck

Generate data tracking asset values and race, sex, and work

from hot_deck_class import HotDeckImputer
# Data where we know asset values, i.e. the 'donor'
donor_data = {
    'assets': [50000, 20000, 300000, 2000, 
                     10000, 10000, 200, 2000, 4000, 500000],
    'race_cell': ['Black','Black','Black','White','White',
                     'White','Black','White','Black','Black'],
    'sex_cell': ['M','F','F','M','F',
                     'M','F','F','M','F'],
    'work_cell': [1,0,1,0,1,
                     0,1,1,1,0],
    'weight': [1, 2, 1, 2, 1,
               2, 1, 2, 1, 2]
}

donor_data = pl.DataFrame(donor_data)

# Data where we don't know asset values, i.e. the 'recipient'
recipient_data = {
    'race_cell': ['Black','Black','Black','White','White',
                     'White','Black','White','Black','Black','Black','Black','White','White'],
    'sex_cell': ['M','F','F','M','F',
                     'M','F','F','M','F', 'F', 'M', 'M', 'F'],
    'work_cell': [1,0,1,0,1,
                     0,1,1,1,0,0,1,0,1],
    'weight': [1, 3, 2, 3, 2,
               1, 4, 2, 1, 3, 4, 2, 1, 1]
}

recipient_data = pl.DataFrame(recipient_data)

Instantiate HotDeckImputer

imputer = HotDeckImputer(donor_data = donor_data, 
                         imputation_var = 'assets', 
                         weight_var = 'weight', 
                         recipient_data = recipient_data)

Age dollar amounts to align data collected in different years

imputer.age_dollar_amounts(donor_year_cpi = 223.1, imp_year_cpi = 322.1)

Define cells according to race and sex

# Input as a list
variables = ['race_cell','sex_cell']
# Define every combination of race and sex, then partition data into cells
imputer.define_cells(variables)
imputer.generate_cells()
# View the definitions
imputer.cell_definitions

Split specific cells up where sample allows

imputer.split_cell("race_cell == 'Black' & sex_cell == 'F'", "work_cell")

Impute data

imputer.impute()

Add random noise to smooth the results

imputer.apply_random_noise(variation_stdev = (1/6), floor_noise = 1.5)

Generate file comparing donor data vs. recipient data

imputer.gen_analysis_file('hot_deck_stats')

About

Hot deck imputer class with methods allowing for random noise injections, targeted cell definition, and results summarized by cell.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published