IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable #11

Open
koszulordie opened this issue Nov 6, 2023 · 6 comments
Labels: enhancement (New feature or request)

@koszulordie

The current approach of scanning the grid of possible weight vectors to find an optimal method combination is inefficient and expected to blow up as we add new methods to the combination pool. Instead of an exhaustive grid search with a predefined grid step, we can use a space-filling random search strategy: vectors are sampled from the grid at random, in such a way that subsequent samples avoid the areas already covered by previous ones.

One common way to do this is based on Latin hypercubes. It is already implemented in the Python package "scikit-optimize"; specifically, we can make use of the skopt.sampler.Lhs class: https://scikit-optimize.github.io/stable/modules/generated/skopt.sampler.Lhs.html
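
As a rough sketch of how the sampler could be used (the eight-method dimensionality and the 10,000-sample budget below are illustrative placeholders, not decisions):

    # Sketch only: assumes scikit-optimize is installed; the number of methods
    # and the sample budget are placeholders, not final choices.
    from skopt.space import Space
    from skopt.sampler import Lhs

    n_methods = 8                                # one dimension per method weight
    space = Space([(0.0, 1.0)] * n_methods)

    # 'maximin' spreads the points so that later samples avoid earlier ones
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)
    candidates = lhs.generate(space.dimensions, n_samples=10000)

    # each candidate is a list of n_methods floats; the combination constraints
    # would then be checked on top of these samples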

@koszulordie koszulordie changed the title Grid search for fitting the optimal method combination is inefficient and not scalable IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable Nov 6, 2023
@FedericaBrando FedericaBrando self-assigned this Nov 20, 2023
@FedericaBrando FedericaBrando added the enhancement New feature or request label Nov 20, 2023
@FedericaBrando
Member

Extract of a conversation worth keeping here for posterity.

I need to set up the space with n dimensions, where n is the number of methods (e.g. 8), and then sample as many times as I want (e.g. 10,000). Then apply the constraints to verify whether each vector fulfils them.
The bottleneck is indeed the number of samples, because I will be iterating through them, but the worst case is O(samples), which is definitely lower than 21^methods.

I was thinking, since the constraints are the following:

    """
    These are the constraints that must be in place:
    constraint 1: sum w_i = 1
    constraint 2: w_i >= 0.05 for all w_i may not apply if some w_i is discarded
    constraint 3: w_i <= 0.3 for all w_i

    """

Maybe we can set the space to be (0, 0.3), given the third constraint: what is the point of looking for a weight higher than 0.3 if we are going to discard it anyway?

This way we don't need to do any "normalization" of the vector and can rely on the result of the sampling, since the chance of hitting a vector that fulfils the first if condition (if sum(w) <= 1 - LOWER_BOUND, where LOWER_BOUND is 0.05) is higher.

Here is the code:

    # METHODS, LOWER_BOUND, low_quality_index, all_constraints, fill_with_zeros,
    # func and optimum are defined elsewhere in the combination step.
    from skopt.space import Space
    from skopt.sampler import Lhs
    import numpy as np

    dim = len(METHODS) - len(low_quality_index)

    space = Space([(0.0, 0.3)] * dim)
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)

    for w in lhs.generate(dimensions=space.dimensions, n_samples=1000):
        if sum(w) <= 1 - LOWER_BOUND:                 # leaves room for the last weight
            w_dim = list(np.append(w, [1 - sum(w)]))  # append the leftover mass so the vector sums to 1
            if all_constraints(w_dim):
                w_all = fill_with_zeros(w_dim, low_quality_index)
                f = func(w_all)
                if optimum['Objective_Function'] > f:  # remember we are running a minimization
                    optimum['Objective_Function'] = f
                    for i, v in enumerate(METHODS):
                        optimum[v] = w_all[i]

Could this make sense?

@koszulordie
Author

koszulordie commented Dec 21, 2023

We need the weights we sample $w_i$ to be specified as proportions of the total weight, meaning $\sum_i w_i = 1$,
in order to apply the constraints on the relative voting rights that the methods are allowed to hold when they are combined. If we restrict our search to $(0, 0.3)^n$ instead of $(0,1)^n$ we still have the problem that we need to convert the vector into proportions and make sure they satisfy the additional constraint: $0.05 \leq w_i \leq 0.3$ for each $i$. In other words, sampling from $(0, 0.3)$ or from $(0.05, 0.3)$ does not mean the respective weight, once converted to a proportion, will satisfy the constraints.
Unfortunately, the method we are using does not allow LHS sampling whilst satisfying the constraints (a.k.a. constrained LHS sampling), and I have not been able to find any ready-to-use solution.
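
A toy illustration of the point (the numbers are made up): every raw component below lies inside (0, 0.3), yet both bounds are violated once the vector is rescaled into proportions.

    # Illustration only: a vector drawn inside (0, 0.3)^n does not stay inside
    # [0.05, 0.3]^n once it is converted into proportions.
    import numpy as np

    w_raw = np.array([0.29, 0.02, 0.10, 0.08])   # every component within (0, 0.3)
    w_prop = w_raw / w_raw.sum()                 # proportions, now sum == 1

    print(w_prop)        # approx. [0.59, 0.04, 0.20, 0.16]
    print(w_prop.max())  # ~0.59 > 0.3  -> violates constraint 3
    print(w_prop.min())  # ~0.04 < 0.05 -> violates constraint 2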

@FedericaBrando
Member

Okay, so if I am understanding correctly, restricting the space to $(0, 0.3)$ basically just rescales the problem rather than solving it.

My concern comes from the first if condition:

if sum(w) <= 1 - LOWER_BOUND:  # 0.95

I was running LHS sampling with $n=10000$ and it never found a $w$ with $\sum_i w_i = 1$, so I looked at other implementations of the method and came across the SciPy one: Here

Reading the example they mention, taken from "Latin hypercube sampling and the sensitivity analysis of a Monte Carlo epidemic model", I was thinking that maybe we could borrow the idea of rescaling the $w$ dimensions to the appropriate bounds $(0.05, 0.3)$ using the built-in qmc.scale(), and then apply the constraints. But, as before, maybe something that makes sense to me does not actually apply to our problem.
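
For reference, a sketch of what that could look like with the SciPy sampler (the bounds come from constraints 2 and 3; the sum-to-one check at the end is still a rejection step, nothing in qmc enforces it, and the tolerance is arbitrary):

    # Sketch: scipy.stats.qmc LHS rescaled into [0.05, 0.3] per method, followed
    # by a rejection step on the sum-to-one constraint.
    import numpy as np
    from scipy.stats import qmc

    n_methods = 8                                  # illustrative
    sampler = qmc.LatinHypercube(d=n_methods)
    unit_samples = sampler.random(n=10000)         # points in [0, 1)^d

    l_bounds = [0.05] * n_methods
    u_bounds = [0.30] * n_methods
    samples = qmc.scale(unit_samples, l_bounds, u_bounds)   # rescale each axis

    # the simplex constraint still has to be checked by rejection
    valid = samples[np.isclose(samples.sum(axis=1), 1.0, atol=0.01)]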

@FedericaBrando
Member

FedericaBrando commented Dec 28, 2023

Implementation:

Testing phase:

Combination - test CBIOP_WXS_PRAD_BROAD

| 7 methods - exhaustive | 8 methods - exhaustive | 8 methods - LHS |
| --- | --- | --- |
| 1h 4m 53s | 7h 19s | 15m 19s |

@migrau
Member

migrau commented Jan 11, 2024

An additional test could be running IntOGen on the BeeGFS partition of the new IRB cluster.

@FedericaBrando
Member

Recap - meeting with Ferran - 16/01

Although the speed improved significantly with the implementation of LHS sampling, the longest run takes approximately 1d 12h. Still a lot.

Discussing with Ferran, we came up with the following steps to optimize the computational cost of the method:

  1. Keep only the LHS optimization and get rid of optimize_with_seed():

        def optimize_with_seed(func, w0, low_quality=None):
            """
            Args:
                func: function admitting w as argument
  2. Store the change of the optimum at every iteration, in order to analyse it afterwards (see the sketch after this list):

        if optimum['Objective_Function'] > f:   # remember we are running a minimization
            optimum['Objective_Function'] = f
            for i, v in enumerate(METHODS):
                optimum[v] = w_all[i]

  3. Try different numbers of samples (100 - 1000). Ideally a number dependent on the dimension of the simplex that ensures an even distribution of the samples, in order to maximize exploration (correct me if wrong).

  4. Run with TCGA and HARTWIG.

  5. Check for time improvement - compare with v2023 and dev-O3D.

  6. Check for changes in results between runs.
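
A minimal sketch of point 2 (the `history` container, the Dirichlet stand-in for the sampled weight vectors and the placeholder objective are illustrative, not part of the current code):

    # Sketch for point 2: record every improvement of the objective so the
    # trajectory can be analysed after the run. `candidates` stands in for the
    # LHS-generated weight vectors and `objective` for the real `func`.
    import numpy as np

    def objective(w):                    # placeholder objective, not the real one
        return float(np.var(w))

    rng = np.random.default_rng(0)
    candidates = rng.dirichlet(np.ones(8), size=1000)   # stand-in simplex vectors

    best = float('inf')
    history = []                         # (iteration, objective, weights) per improvement
    for it, w in enumerate(candidates):
        f = objective(w)
        if f < best:                     # minimization: smaller is better
            best = f
            history.append((it, f, w.tolist()))
    # `history` can then be saved and inspected to see how quickly the optimum converges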
