Sampling weights for MergedChoiceTable #5

Closed
smmaurer opened this issue Jun 29, 2017 · 3 comments

@smmaurer (Member)

1) Feature overview

When sampling from a large set of conceivable alternatives (home location choices, travel destination choices), users may want to assign weights to different types of alternatives rather than using pure random sampling. This can be a way of generating more realistic choice sets, or of improving computational efficiency by ensuring an appropriate degree of variation among sampled alternatives.

MergedChoiceTable() should accept a parameter for sampling weights, use them in generating the merged table, and store the weights somehow in the returned object.
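
As a hypothetical sketch of how the constructor call might look (the parameter name, accepted formats, and import path are open questions here, not a settled API):

```python
import pandas as pd
from choicemodels.tools import MergedChoiceTable  # import path is a guess

# Toy inputs: two choosers, three alternatives, and one sampling weight
# per alternative (see section 2 for other possible weight formats).
choosers = pd.DataFrame({'income': [40, 60]}, index=[1, 2])
alternatives = pd.DataFrame({'price': [9, 12, 7]}, index=['a', 'b', 'c'])
weights = pd.Series([0.5, 0.3, 0.2], index=['a', 'b', 'c'])

# Hypothetical call -- 'weights' is the proposed new parameter.
mct = MergedChoiceTable(choosers, alternatives,
                        sample_size=2, weights=weights)
```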

2) Format for the weights

I think there will be three different cases to handle; a sketch of possible data formats follows the list.

  • N(weights) = N(alternatives), with the weights consistent across choosers. For example: people are choosing from a library of books where there's a long tail of unpopular titles.
  • N(weights) = N(alternatives) x N(choosers). For example: people are choosing restaurants, and are more likely to consider the ones closer to their own home.
  • Same as the prior case, but with a large enough number of choosers and alternatives that it's more efficient to provide an InteractionGenerator() than a lookup table. (See Issue #4: Calculating interactions between chooser and alternative.)
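
One way the three cases might be represented in pandas (the index names and the callable signature are illustrative, not a proposed API):

```python
import numpy as np
import pandas as pd

# Case 1: one weight per alternative, shared across all choosers --
# a Series indexed by alternative id.
alt_weights = pd.Series([0.5, 0.3, 0.2],
                        index=pd.Index(['book_a', 'book_b', 'book_c'],
                                       name='alternative_id'))

# Case 2: a distinct weight for every chooser/alternative pair --
# a Series with a two-level (chooser, alternative) MultiIndex.
pair_weights = pd.Series(
    np.random.rand(6),
    index=pd.MultiIndex.from_product(
        [[1, 2], ['rest_a', 'rest_b', 'rest_c']],
        names=['chooser_id', 'alternative_id']))

# Case 3: too many pairs to store -- a callable that produces weights
# on demand for one chooser at a time (see Issue #4).
def weight_generator(chooser_id, alternative_ids):
    """Return one weight per alternative for the given chooser."""
    return pd.Series(np.ones(len(alternative_ids)), index=alternative_ids)
```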

3) Common use cases

A common use for this will be weights that are inversely related to the distance between chooser and alternative.

Another case will be alternatives that are divided into importance strata conditioned on types of choosers. (An InteractionGenerator() could be a nice way to specify this as well.)
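
For the inverse-distance case, a minimal sketch (the exponent and the epsilon guard are assumptions for illustration):

```python
import pandas as pd

def inverse_distance_weights(distances, exponent=1.0, eps=1e-6):
    """Turn chooser-to-alternative distances into sampling weights that
    decline with distance. 'distances' is a pandas Series indexed by
    alternative id; 'eps' guards against division by zero."""
    return 1.0 / (distances + eps) ** exponent

# Example: alternatives 'a' and 'b' are nearby, 'c' is far away.
dist = pd.Series([1.0, 2.0, 10.0], index=['a', 'b', 'c'])
weights = inverse_distance_weights(dist)
```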

4) Sampling implementation

To implement the sampling, we need to refactor the underlying mnl_interaction_dataset() code. Instead of drawing a single large sample of alternatives to distribute among choosers, we'll iteratively draw a new sample for each chooser, passing appropriate weights to np.random.choice().

Refactoring the code this way will also let us try sampling without replacement, in other words ensuring that each alternative is sampled no more than once for a given chooser. This is more expensive computationally, and probably not very important for estimation or prediction -- but I think it's the nicest behavior for a procedure like this, especially for small sample sizes.
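
A minimal sketch of that per-chooser loop -- not the actual mnl_interaction_dataset() refactor, just the structure -- assuming pair weights in the MultiIndex format from section 2. Note that np.random.choice() expects its p argument to be probabilities summing to 1, so the weights are normalized first:

```python
import numpy as np
import pandas as pd

def sample_alternatives(chooser_ids, alt_ids, pair_weights,
                        sample_size, replace=False):
    """Draw a weighted sample of alternatives for each chooser.
    'pair_weights' is a Series with a (chooser, alternative) MultiIndex.
    Returns a long-format DataFrame of sampled rows; replace=False
    ensures each alternative appears at most once per chooser."""
    rows = []
    for c in chooser_ids:
        w = pair_weights.loc[c].reindex(alt_ids).to_numpy()
        p = w / w.sum()  # np.random.choice needs probabilities summing to 1
        draws = np.random.choice(alt_ids, size=sample_size,
                                 replace=replace, p=p)
        rows.extend((c, a) for a in draws)
    return pd.DataFrame(rows, columns=['chooser_id', 'alternative_id'])
```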

There's an example of iterative sampling (without weights) in section 3a of the Sampling-correction-02 notebook.

5) Estimation corrections

Will we need to implement sampling corrections in the model estimation? Probably not, especially if the sampling is used to generate more realistic choice sets.

But this is something to look into, along with methods for out-of-sample validation. I have some references that I'll add to this thread.

smmaurer commented Jul 2, 2018

Updating this issue from a year ago. We've had a number of discussions about this, but haven't implemented the full feature yet.

APPLYING sampling weights

One task is writing code to carry out the sampling when weights are passed to the MergedChoiceTable() constructor (interaction.py#70). This is in my court, I think.

GENERATING sampling weights

In a case where we have NxM distinct weights, they might sometimes be unwieldy to calculate (network distance) and/or to store (off the cuff, 100k choosers x 100k alternatives = 10 billion weights = 10 GB if each weight is 8 bits). Here are some approaches to try:

a) Precomputing

Precompute weights, store them on disk (or in memory), pass them to MergedChoiceTable(). This could be a good approach if computation time is slow but we are likely to need all the weights and to use them multiple times.
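
For example, a long-format weight table could be computed once and cached on disk; the helper name and the parquet format here are placeholders, not a recommendation:

```python
import pandas as pd

# 'compute_pair_weights' is a hypothetical stand-in for whatever
# expensive calculation (e.g. network distances) produces a long-format
# DataFrame with chooser_id, alternative_id, and weight columns.
weights = compute_pair_weights(choosers, alternatives)
weights.to_parquet('pair_weights.parquet')  # compute once, reuse later

# In later runs, skip the computation entirely:
weights = pd.read_parquet('pair_weights.parquet')
```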

b) Calculating on the fly

Pass a generator function to MergedChoiceTable(), and calculate weights on the fly for particular combinations of choosers and alternatives. This could be a good approach if memory is the bottleneck, or if we will only need weights for a small number of combinations (e.g. model estimation using a small sample of choosers).
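
A rough sketch of what such a generator might look like (the signature is speculative, and straight-line distance stands in for a real network distance):

```python
import numpy as np

def distance_weight_generator(chooser, alternatives):
    """Called once per chooser during sampling: returns one weight per
    candidate alternative, computed on demand rather than precomputed.
    Assumes (for illustration) that both tables carry x/y coordinate
    columns."""
    d = np.hypot(alternatives['x'] - chooser['x'],
                 alternatives['y'] - chooser['y'])
    return 1.0 / (d + 1e-6)
```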

c) Stratified weights

In some cases it might be more efficient to use a heuristic to calculate approximate weights, or to store group-based weights rather than individual ones.
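
For instance, distances could be cut into a handful of bands with one weight per band, so storage shrinks from N x M individual weights to a small lookup table (the band edges and weights below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Band edges (in miles) and one sampling weight per band.
edges = [0, 5, 15, 50, np.inf]
band_weights = pd.Series([1.0, 0.5, 0.2, 0.05],
                         index=pd.IntervalIndex.from_breaks(edges))

def stratified_weights(distances):
    """Map chooser-to-alternative distances (a Series) onto band weights."""
    bands = pd.cut(distances, bins=edges)
    return bands.map(band_weights).astype(float)
```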

This is what @gboeing is working on in PR #32 (distance bands). Geoff, have you done any evaluation of the speed/memory/disk performance of individual distance calculations vs. the stratified bands?

This will be something to explore in more depth as we begin using sampling weights for model estimation.

Also tagging @waddell @mxndrwgrdnr @Arezoo-bz

smmaurer changed the title from "Sampling weights for MergedChoiceTable()" to "Sampling weights for MergedChoiceTable" on Jul 2, 2018
gboeing commented Jul 2, 2018

> have you done any evaluation of the speed/memory/disk performance of individual distance calculations vs. the stratified bands?

@smmaurer I've been working on this over the past couple of weeks. I'm getting to the point where I feel a full distance matrix (i.e., individual distance calculations) will not be feasible without much better hardware. In the coarsest-grained graph of the bay area, we're looking at a matrix of 4-5 billion elements -- roughly 32-40 GB as float64 -- and I can't even instantiate that in memory at this point, let alone populate it. Regarding speed, no matter how fast an algorithm we use (pandana, etc.) to calculate these pairwise network distances, we'd still have to wrap it in a loop and run it billions of times. Essentially another nonstarter. Hence I've been focusing most of my attention and effort on calculating stratified bands, which seems tractable.

smmaurer commented Sep 7, 2018

Most of this is implemented in PR #37. Moving discussion to Issues #39, #40.

smmaurer closed this as completed on Sep 7, 2018