IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable #11
Extract of a conversation that may be relevant; keeping it here for posterity.
I was thinking, since the constraints are the following:

"""
These are the constraints that must be in place:
constraint 1: sum w_i = 1
constraint 2: w_i >= 0.05 for all w_i (may not apply if some w_i is discarded)
constraint 3: w_i <= 0.3 for all w_i
"""

maybe we can set the space to be `(0, 0.3)` given the third constraint: what is the point of looking for a weight higher than 0.3 if we are going to discard it anyway? This way we don't need to do any "normalization" of the vector and can rely on the result of the sampling, since the chance of hitting a vector that fulfills the first `if` condition (`sum(w) <= 1 - LOWER_BOUND`, where `LOWER_BOUND` is 0.05) is higher. Here is the code:

```python
import numpy as np
from skopt.sampler import Lhs
from skopt.space import Space

dim = len(METHODS) - len(low_quality_index)
# sample dim - 1 weights; the last one is derived as 1 - sum(w),
# so that len(w_dim) == dim
space = Space([(0.0, 0.3)] * (dim - 1))
lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)
for w in lhs.generate(dimensions=space.dimensions, n_samples=1000):
    if sum(w) <= 1 - LOWER_BOUND:  # inside the simplex
        w_dim = list(np.append(w, [1 - sum(w)]))  # consequently len(w_dim) == dim
        if all_constraints(w_dim):
            w_all = fill_with_zeros(w_dim, low_quality_index)
            f = func(w_all)
            if optimum['Objective_Function'] > f:  # remember we are running a minimization
                optimum['Objective_Function'] = f
                for i, v in enumerate(METHODS):
                    optimum[v] = w_all[i]
```

Could it make sense?
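The snippet assumes two helpers that are not shown in the thread. A hypothetical sketch of what they could look like (names taken from the snippet; the tolerance and the exact checks are assumptions, and the real intogen-plus implementations may differ):

```python
LOWER_BOUND = 0.05
UPPER_BOUND = 0.3

def all_constraints(w):
    # constraint 1: the weights sum to 1 (up to floating-point error);
    # constraints 2 and 3: every retained weight lies in [LOWER_BOUND, UPPER_BOUND]
    return (abs(sum(w) - 1.0) < 1e-9
            and all(LOWER_BOUND <= wi <= UPPER_BOUND for wi in w))

def fill_with_zeros(w, low_quality_index):
    # re-insert a zero weight at the position of every discarded method,
    # so the returned vector is aligned with the full METHODS list
    w = list(w)
    for i in sorted(low_quality_index):
        w.insert(i, 0.0)
    return w
```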
We need the weights we sample.
Okay, so if I am understanding correctly, we are basically restricting the space to `(0, 0.3)`.

My concern is given by the first `if` condition: `if sum(w) <= 1 - LOWER_BOUND:  # 0.95`. I was running an LHS sampling to check it. When reading the example they mention, taken from *Latin hypercube sampling and the sensitivity analysis of a Monte Carlo epidemic model*, I was thinking that maybe we could borrow the idea of rescaling the sampled weights.
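The rescaling idea mentioned above could be sketched as a plain normalization onto the unit simplex (an assumption, since the paper's exact transformation is not quoted in the thread; `rescale_to_simplex` is a hypothetical helper):

```python
import numpy as np

def rescale_to_simplex(samples):
    # divide each sampled vector by its sum, so every rescaled vector
    # satisfies constraint 1 (sum w_i = 1) by construction
    samples = np.asarray(samples, dtype=float)
    return samples / samples.sum(axis=1, keepdims=True)
```

Note that normalizing distorts the space-filling property of the LHS samples, and constraints 2 and 3 would still have to be checked afterwards.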
Implementation: Testing phase: Combination - test

| 7 methods - exhaustive | 8 methods - exhaustive | 8 methods - LHS |
| --- | --- | --- |
| 1h 4m 53s | 7h 19s | 15m 19s |
An additional test could be running intogen on the BeeGFS partition of the new IRB cluster.
Recap - meeting with Ferran - 16/01

Although the speed improved significantly with the implementation of LHS sampling, the longest run still takes approximately 1d 12h, which is a lot. Discussing with Ferran, we came up with the following steps to optimize the computational cost of the method:
`intogen-plus/combination/intogen_combination/grid_optimizer.py`, lines 201 to 204 (commit `7564e7c`)
The current approach of scanning a grid of possible weight vectors to find an optimal method combination is inefficient and expected to blow up as we add new methods to the combination pool. Instead of an exhaustive grid search with a predefined grid step, we can use a space-filling random search strategy: a technique that randomly samples vectors from the grid in such a way that subsequent samples avoid the areas already covered by previous ones.
One common approach to do this is based on Latin hypercubes, and it is already implemented in the Python package scikit-optimize. Specifically, we can make use of the `skopt.sampler.Lhs` class: https://scikit-optimize.github.io/stable/modules/generated/skopt.sampler.Lhs.html
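For illustration, here is a minimal pure-NumPy sketch of classic Latin hypercube sampling (`skopt.sampler.Lhs` is the more featureful option suggested above; the function name `latin_hypercube` and the `(0, 0.3)` scaling are just for this example):

```python
import numpy as np

def latin_hypercube(n_samples, dim, rng=None):
    # classic Latin hypercube: split each axis into n_samples equal strata
    # and draw exactly one point per stratum, with the strata independently
    # permuted across dimensions so the samples spread over the whole space
    rng = np.random.default_rng(rng)
    u = rng.random((n_samples, dim))       # position inside each stratum
    samples = np.empty_like(u)
    for j in range(dim):
        perm = rng.permutation(n_samples)  # stratum order for axis j
        samples[:, j] = (perm + u[:, j]) / n_samples
    return samples                         # values in [0, 1)

# scale the unit hypercube to the (0, 0.3) box discussed above
weights = latin_hypercube(1000, 7, rng=42) * 0.3
```

Unlike uniform random sampling, every one-dimensional projection of the result is evenly stratified, which is what makes the search space-filling.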