IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable #11

Open
koszulordie opened this issue Nov 6, 2023 · 6 comments
Labels: enhancement (New feature or request)

@koszulordie

The current approach of scanning the grid of possible weight vectors to find an optimal method combination is inefficient and expected to blow up as we add new methods to the combination pool. Instead of an exhaustive grid search with a predefined grid step, we can use a space-filling random search strategy: vectors are sampled from the grid at random, in such a way that subsequent samples avoid the areas already covered by previous ones.

One common way to do this is based on Latin hypercubes. It is already implemented in the Python package "scikit-optimize"; specifically, we can make use of the skopt.sampler.Lhs class: https://scikit-optimize.github.io/stable/modules/generated/skopt.sampler.Lhs.html
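
As a rough sketch of how the sampler could be used (the eight-method dimensionality and the 10,000-sample budget below are illustrative placeholders, not decisions):

    # Sketch only: assumes scikit-optimize is installed; the number of methods
    # and the sample budget are placeholders, not final choices.
    from skopt.space import Space
    from skopt.sampler import Lhs

    n_methods = 8                                # one dimension per method weight
    space = Space([(0.0, 1.0)] * n_methods)

    # 'maximin' spreads the points so that later samples avoid earlier ones
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)
    candidates = lhs.generate(space.dimensions, n_samples=10000)

    # each candidate is a list of n_methods floats; the combination constraints
    # would then be checked on top of these samples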

@koszulordie koszulordie changed the title Grid search for fitting the optimal method combination is inefficient and not scalable IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable Nov 6, 2023
@FedericaBrando FedericaBrando self-assigned this Nov 20, 2023
@FedericaBrando FedericaBrando added the enhancement New feature or request label Nov 20, 2023
@FedericaBrando
Member

Extract of a conversation worth keeping here for posterity.

I need to set up the space with n dimensions, where n is the number of methods (e.g. 8), and then sample as many times as I want (e.g. 10,000). Then apply the constraints to verify whether each vector fulfils them.
The bottleneck is indeed the number of samples, because I will be iterating through them, but the worst case is O(samples), which is definitely lower than 21^methods.

I was thinking, since the constraints are the following:

    """
    These are the constraints that must be in place:
    constraint 1: sum w_i = 1
    constraint 2: w_i >= 0.05 for all w_i may not apply if some w_i is discarded
    constraint 3: w_i <= 0.3 for all w_i

    """

Maybe we can set the space to be (0, 0.3), given the third constraint: what is the point of looking for a weight higher than 0.3 if we are going to discard it anyway?

This way we don't need to do any "normalization" of the vector and can rely on the result of the sampling, since the chance of hitting a vector that fulfils the first if condition (if sum(w) <= 1 - LOWER_BOUND, where LOWER_BOUND is 0.05) is higher.

Here is the code:

    # METHODS, LOWER_BOUND, low_quality_index, all_constraints, fill_with_zeros,
    # func and optimum are defined elsewhere in the combination step.
    from skopt.space import Space
    from skopt.sampler import Lhs
    import numpy as np

    dim = len(METHODS) - len(low_quality_index)

    space = Space([(0.0, 0.3)] * dim)
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)

    for w in lhs.generate(dimensions=space.dimensions, n_samples=1000):
        if sum(w) <= 1 - LOWER_BOUND:                 # leaves room for the last weight
            w_dim = list(np.append(w, [1 - sum(w)]))  # append the leftover mass so the vector sums to 1
            if all_constraints(w_dim):
                w_all = fill_with_zeros(w_dim, low_quality_index)
                f = func(w_all)
                if optimum['Objective_Function'] > f:  # remember we are running a minimization
                    optimum['Objective_Function'] = f
                    for i, v in enumerate(METHODS):
                        optimum[v] = w_all[i]

Could this make sense?

@koszulordie
Author

koszulordie commented Dec 21, 2023

We need the weights we sample $w_i$ to be specified as proportions of the total weight, meaning $\sum_i w_i = 1$,
in order to apply the constraints on the relative voting rights that the methods are allowed to hold when they are combined. If we restrict our search to $(0, 0.3)^n$ instead of $(0,1)^n$ we still have the problem that we need to convert the vector into proportions and make sure they satisfy the additional constraint: $0.05 \leq w_i \leq 0.3$ for each $i$. In other words, sampling from $(0, 0.3)$ or from $(0.05, 0.3)$ does not mean the respective weight, once converted to a proportion, will satisfy the constraints.
Unfortunately, the method we are using does not allow LHS sampling whilst satisfying the constraints (a.k.a. constrained LHS sampling), and I have not been able to find any ready-to-use solution.
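
A toy illustration of the point (the numbers are made up): every raw component below lies inside (0, 0.3), yet both bounds are violated once the vector is rescaled into proportions.

    # Illustration only: a vector drawn inside (0, 0.3)^n does not stay inside
    # [0.05, 0.3]^n once it is converted into proportions.
    import numpy as np

    w_raw = np.array([0.29, 0.02, 0.10, 0.08])   # every component within (0, 0.3)
    w_prop = w_raw / w_raw.sum()                 # proportions, now sum == 1

    print(w_prop)        # approx. [0.59, 0.04, 0.20, 0.16]
    print(w_prop.max())  # ~0.59 > 0.3  -> violates constraint 3
    print(w_prop.min())  # ~0.04 < 0.05 -> violates constraint 2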

@FedericaBrando
Member

Okay, so if I am understanding correctly, restricting the space to $(0, 0.3)$ basically just rescales the problem rather than solving it.

My concern comes from the first if condition:

if sum(w) <= 1 - LOWER_BOUND:  # 0.95

I was running LHS sampling with $n=10000$ and it never found a $w$ with $\sum_i w_i = 1$, so I looked at other implementations of the method and came across the SciPy one: Here

Reading the example they mention, taken from "Latin hypercube sampling and the sensitivity analysis of a Monte Carlo epidemic model", I was thinking that maybe we could borrow the idea of rescaling the $w$ dimensions to the appropriate bounds $(0.05, 0.3)$ using the built-in qmc.scale(), and then apply the constraints. But, as before, maybe something that makes sense to me does not actually apply to our problem.
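
For reference, a sketch of what that could look like with the SciPy sampler (the bounds come from constraints 2 and 3; the sum-to-one check at the end is still a rejection step, nothing in qmc enforces it, and the tolerance is arbitrary):

    # Sketch: scipy.stats.qmc LHS rescaled into [0.05, 0.3] per method, followed
    # by a rejection step on the sum-to-one constraint.
    import numpy as np
    from scipy.stats import qmc

    n_methods = 8                                  # illustrative
    sampler = qmc.LatinHypercube(d=n_methods)
    unit_samples = sampler.random(n=10000)         # points in [0, 1)^d

    l_bounds = [0.05] * n_methods
    u_bounds = [0.30] * n_methods
    samples = qmc.scale(unit_samples, l_bounds, u_bounds)   # rescale each axis

    # the simplex constraint still has to be checked by rejection
    valid = samples[np.isclose(samples.sum(axis=1), 1.0, atol=0.01)]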

@FedericaBrando
Member

FedericaBrando commented Dec 28, 2023

Implementation:

Testing phase:

Combination - test CBIOP_WXS_PRAD_BROAD

| 7 methods - exhaustive | 8 methods - exhaustive | 8 methods - LHS |
| --- | --- | --- |
| 1h 4m 53s | 7h 19s | 15m 19s |

@migrau
Member

migrau commented Jan 11, 2024

An additional test could be running IntOGen on the BeeGFS partition of the new IRB cluster.

@FedericaBrando
Member

Recap - meeting with Ferran - 16/01

Although the speed improved significantly with the implementation of LHS sampling, the longest run takes approximately 1d 12h. Still a lot.

Discussing with Ferran, we came up with the following steps to optimize the computational cost of the method:

  1. Keep only the LHS optimization and get rid of optimize_with_seed():

        def optimize_with_seed(func, w0, low_quality=None):
            """
            Args:
                func: function admitting w as argument
  2. Store the change of the optimum at every iteration, in order to analyse it afterwards (see the sketch after this list):

        if optimum['Objective_Function'] > f:   # remember we are running a minimization
            optimum['Objective_Function'] = f
            for i, v in enumerate(METHODS):
                optimum[v] = w_all[i]

  3. Try different numbers of samples (100 - 1000). Ideally a number dependent on the dimension of the simplex that ensures an even distribution of the samples, in order to maximize exploration (correct me if wrong).

  4. Run with TCGA and HARTWIG.

  5. Check for time improvement - compare with v2023 and dev-O3D.

  6. Check for changes in results between runs.
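
A minimal sketch of point 2 (the `history` container, the Dirichlet stand-in for the sampled weight vectors and the placeholder objective are illustrative, not part of the current code):

    # Sketch for point 2: record every improvement of the objective so the
    # trajectory can be analysed after the run. `candidates` stands in for the
    # LHS-generated weight vectors and `objective` for the real `func`.
    import numpy as np

    def objective(w):                    # placeholder objective, not the real one
        return float(np.var(w))

    rng = np.random.default_rng(0)
    candidates = rng.dirichlet(np.ones(8), size=1000)   # stand-in simplex vectors

    best = float('inf')
    history = []                         # (iteration, objective, weights) per improvement
    for it, w in enumerate(candidates):
        f = objective(w)
        if f < best:                     # minimization: smaller is better
            best = f
            history.append((it, f, w.tolist()))
    # `history` can then be saved and inspected to see how quickly the optimum converges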
