Sampling weights for MergedChoiceTable #5
Updating this issue from a year ago. We've had a number of discussions about this, but haven't implemented the full feature yet.

APPLYING sampling weights

One task is writing code to carry out the sampling when weights are passed to the MergedChoiceTable() constructor (interaction.py#70). This is in my court, I think.

GENERATING sampling weights

In a case where we have NxM distinct weights, they might sometimes be unwieldy to calculate (network distance) and/or to store (off the cuff, 100k choosers x 100k alternatives = 10 billion weights = 10 GB if each weight is 8 bits). Here are some approaches to try:

a) Precomputing

Precompute weights, store them on disk (or in memory), and pass them to MergedChoiceTable(). This could be a good approach if computation time is slow but we are likely to need all the weights and to use them multiple times.

b) Calculating on the fly

Pass a generator function to MergedChoiceTable(), and calculate weights on the fly for particular combinations of choosers and alternatives (a sketch of this interface follows below). This could be a good approach if memory is the bottleneck, or if we will only need weights for a small number of combinations (e.g. model estimation using a small sample of choosers).

c) Stratified weights

In some cases it might be more efficient to use a heuristic to calculate approximate weights, or to store group-based weights rather than individual ones. This is what @gboeing is working on in PR #32 (distance bands). Geoff, have you done any evaluation of the speed/memory/disk performance of individual distance calculations vs. the stratified bands? This will be something to explore in more depth as we begin using sampling weights for model estimation.

Also tagging @waddell @mxndrwgrdnr @Arezoo-bz
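A minimal sketch of what the option (b) interface might look like; the callable form, its signature, and the `weights` parameter are all hypothetical, not an existing MergedChoiceTable() API:

```python
import numpy as np

def weight_generator(chooser_id, alternative_ids):
    """Return one weight per candidate alternative for a single chooser,
    computed only when the sampler asks for this combination."""
    # Placeholder logic; in practice this might call a network-distance
    # routine for just these origin-destination pairs.
    return np.ones(len(alternative_ids)) / len(alternative_ids)

# Hypothetical: pass the callable instead of a precomputed table.
# MergedChoiceTable(choosers, alternatives, sample_size=10,
#                   weights=weight_generator)
```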
@smmaurer I've been working on this over the past couple weeks. I'm getting to the point where I feel a full distance matrix (i.e., individual distance calculations) will not be feasible without much better hardware. In the coarsest-grain graph of the bay area, we're looking at a matrix of 4-5 billion elements. At this point I can't even instantiate it in memory, let alone populate it.

Regarding speed: no matter how fast an algorithm we use (pandana, etc.) to calculate these pairwise network distances, we'll still have to wrap that algorithm in a loop and run it billions of times. Essentially another nonstarter. Hence I've been focusing most of my attention and efforts on calculating stratified bands, as it seems tractable.
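For illustration, a minimal sketch of the banded idea, with made-up band edges and weights (not the actual values used in PR #32):

```python
import numpy as np

# Network distances from one chooser to each alternative (made-up values, km)
distances = np.array([0.3, 1.2, 4.8, 9.5, 25.0])

band_edges = [1, 5, 10]                          # band boundaries in km
band_weights = np.array([1.0, 0.5, 0.2, 0.05])   # one stored weight per band

bands = np.digitize(distances, band_edges)       # band index for each alternative
weights = band_weights[bands]                    # -> [1.0, 0.5, 0.5, 0.2, 0.05]
```

The appeal is that only a handful of band weights need to be stored, rather than N x M individual values.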
1) Feature overview
When sampling from a large set of conceivable alternatives (home location choices, travel destination choices), users may want to assign weights to different types of alternatives rather than using pure random sampling. This can be a way of generating more realistic choice sets, or of improving computational efficiency by ensuring an appropriate degree of variation among sampled alternatives.
MergedChoiceTable() should accept a parameter for sampling weights, use them in generating the merged table, and store the weights somehow in the returned object.
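A hypothetical usage sketch, assuming one fixed weight per alternative; the `weights` parameter doesn't exist yet, and the import path and final signature are assumptions:

```python
import pandas as pd

choosers = pd.DataFrame({'income': [50, 75, 100]}, index=[1, 2, 3])
alternatives = pd.DataFrame({'rent': [900, 1200, 1500, 2100]},
                            index=[10, 11, 12, 13])

# One fixed weight per alternative (weights could also vary by chooser)
alt_weights = pd.Series([0.4, 0.3, 0.2, 0.1], index=alternatives.index)

# from choicemodels.tools import MergedChoiceTable
# mct = MergedChoiceTable(choosers, alternatives, sample_size=2,
#                         weights=alt_weights)
# mct.to_frame()  # long-format merged table; weights retained on the object
```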
2) Format for the weights
I think there will be three different use conditions to handle. One of them is to accept an InteractionGenerator() rather than a lookup table. (See Issue #4, "Calculating interactions between chooser and alternative".)

3) Common use cases
A common use for this will be weights that are inversely related to the distance between chooser and alternative.
Another case will be alternatives that are divided into importance strata conditioned on types of choosers. (An InteractionGenerator() could be a nice way to specify this as well.)
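For instance, inverse-distance weights for one chooser might be built like this (made-up distances and a simple 1/d kernel; other decay functions would work the same way):

```python
import numpy as np

dist = np.array([0.5, 2.0, 8.0, 20.0])   # chooser-to-alternative distances
weights = 1.0 / dist                      # weight decays with distance
weights /= weights.sum()                  # normalize to sum to 1
# -> approximately [0.748, 0.187, 0.047, 0.019]
```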
4) Sampling implementation

To implement the sampling, we need to refactor the underlying mnl_interaction_dataset() code. Instead of drawing a single large sample of alternatives to distribute among choosers, we'll iteratively draw a new sample for each chooser, passing appropriate weights to np.random.choice().
Refactoring the code this way will also let us try sampling without replacement, in other words ensuring that each alternative is sampled no more than once for a given chooser. This is more expensive computationally, and probably not very important for estimation or prediction -- but I think it's the nicest behavior for a procedure like this, especially for small sample sizes.
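A bare-bones version of that loop might look like this (illustrative variable names, with random stand-in weights):

```python
import numpy as np
import pandas as pd

alternatives = pd.Index([10, 11, 12, 13, 14], name='alternative_id')
sample_size = 3
rows = []

for chooser_id in [1, 2]:
    w = np.random.rand(len(alternatives))   # stand-in for real sampling weights
    w /= w.sum()                            # np.random.choice() needs p summing to 1
    sampled = np.random.choice(alternatives, size=sample_size,
                               replace=False, p=w)   # no repeats per chooser
    rows.extend((chooser_id, alt) for alt in sampled)

merged = pd.DataFrame(rows, columns=['chooser_id', 'alternative_id'])
```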
There's an example of iterative sampling (without weights) in section 3a of the Sampling-correction-02 notebook.
5) Estimation corrections
Will we need to implement sampling corrections in the model estimation? Probably not, especially if the sampling is used to generate more realistic choice sets.
But this is something to look into, along with methods for out-of-sample validation. I have some references that I'll add to this thread.
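For reference, the standard correction for sampling of alternatives (McFadden 1978) adds the log of the sampling probability to each systematic utility when estimating on a sampled choice set D_n:

```latex
P_n(i \mid D_n) =
  \frac{\exp\!\big(V_{in} + \ln \pi(D_n \mid i)\big)}
       {\sum_{j \in D_n} \exp\!\big(V_{jn} + \ln \pi(D_n \mid j)\big)}
```

Here \pi(D_n \mid j) is the probability of constructing the sampled set D_n given that j is the chosen alternative. Under uniform random sampling these terms cancel, which is why no correction is needed in that case; under weighted sampling they generally do not.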