Merge pull request #31 from TARGENE/rm_svp
clean repo from sieve and try link docker build to tagbot
olivierlabayle authored Sep 6, 2024
2 parents 0e9f583 + 851ea87 commit ef52065
Showing 20 changed files with 22 additions and 1,706 deletions.
@@ -2,8 +2,10 @@ name: Publish Docker Image

 on:
   workflow_dispatch:
-  release:
-    types: [published]
+  workflow_run:
+    workflows: [TagBot]
+    types:
+      - completed

jobs:
push-to-registries:
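Reassembled from the hunk above, the workflow's new trigger block presumably reads as follows (a sketch inferred from the diff lines; the indentation is assumed):

```yaml
# Publish the Docker image either manually or once the TagBot workflow completes.
on:
  workflow_dispatch:
  workflow_run:
    workflows: [TagBot]
    types:
      - completed
```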
5 changes: 1 addition & 4 deletions Project.toml
@@ -1,7 +1,7 @@
name = "TMLECLI"
uuid = "2573d147-4098-46ba-9db2-8608d210ccac"
authors = ["Olivier Labayle"]
-version = "0.9.1"
+version = "0.10.0"

[deps]
ArgParse = "c7e460c6-2fb9-53a9-8c5b-16f535851c63"
@@ -15,14 +15,12 @@ EvoTrees = "f6006082-12f8-11e9-0c9c-0d5d367ab1e5"
GLMNet = "8d5ece8b-de18-5317-b113-243142960cc6"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
-MKL = "33e6dc65-8f57-5167-99aa-e5a354878fb2"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
MLJLinearModels = "6ee0df7b-362f-4a72-a706-9e79364fb692"
MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
MLJModels = "d491faf4-2d78-11e9-2867-c94bc002c0b7"
MLJXGBoostInterface = "54119dfa-1dab-4055-a167-80440f4f7a91"
Mmap = "a63ad114-7e13-5084-954f-fe012c677804"
PackageCompiler = "9b87118b-4619-50d2-8e1e-99f35a4d4d9d"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
@@ -42,7 +40,6 @@ EvoTrees = "0.16.5"
GLMNet = "0.7"
JLD2 = "0.4.22"
JSON = "0.21.4"
-MKL = "0.6, 0.7"
MLJ = "0.20.0"
MLJBase = "1.0.1"
MLJLinearModels = "0.10.0"
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -15,7 +15,7 @@ makedocs(
modules = [TMLECLI],
pages=[
"Home" => "index.md",
-"Command Line Interface" => ["cli.md", "tmle_estimation.md", "sieve_variance.md", "make_summary.md"],
+"Command Line Interface" => ["cli.md", "tmle_estimation.md", "make_summary.md"],
"MLJ Extensions" => ["models.md", "resampling.md"],
],
pagesonly=true,
2 changes: 1 addition & 1 deletion docs/src/cli.md
@@ -25,6 +25,6 @@ Below is a description of the functionalities offered by the CLI.
## CLI Description

```@contents
-Pages = ["tmle_estimation.md", "sieve_variance.md", "make_summary.md"]
+Pages = ["tmle_estimation.md", "make_summary.md"]
Depth = 5
```
13 changes: 0 additions & 13 deletions docs/src/sieve_variance.md

This file was deleted.

84 changes: 1 addition & 83 deletions docs/src/tmle_estimation.md
@@ -133,86 +133,4 @@ For further information, we recommend you have a look at both:

## Note on Outputs

We can output results in three different formats: HDF5, JSON and JLS. By default no output is written, so you need to specify at least one. An output is generated by specifying an output filename for it. For instance `--outputs.json.filename=output.json` will output a JSON file. Note that you can generate multiple formats at once, e.g. `--outputs.json.filename=output.json --outputs.hdf5.filename=output.hdf5` will output both JSON and HDF5 result files. Another important output option is the `pval_threshold`. Each estimation result is accompanied by an influence curve vector, and by default these vectors are erased before saving the results because they typically take up too much space and are not usually needed. On some occasions you might want to keep them, which can be achieved by specifying the output's `pval_threshold`. For instance `--outputs.hdf5.pval_threshold=1.` will keep all such vectors because all p-values lie between 0 and 1.

In order to run sieve variance plateau correction after a TMLE run you need to save the results in HDF5 format with influence curve vectors. Furthermore, you will need to save the sample-ids associated with each result. A complete option set for this could be: `--outputs.hdf5.filename=output.hdf5 --outputs.hdf5.pval_threshold=0.05 --sample_ids=true`. In this case, only those results with an individual p-value of less than ``0.05`` will keep track of their influence curves and be considered for sieve variance correction.
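As a sketch, the option set above can be assembled programmatically before invoking the CLI (the executable name `tmle` is an assumption made for illustration; the flag names are the ones documented above):

```python
# Build the CLI argument list for a sieve-compatible TMLE run.
# Flag names come from the documentation above; the executable name
# "tmle" is an assumption, not confirmed here.
outputs = {
    "outputs.hdf5.filename": "output.hdf5",
    "outputs.hdf5.pval_threshold": 0.05,  # keep influence curves for p-values < 0.05
    "sample_ids": "true",                 # save sample ids for sieve correction
}
args = [f"--{key}={value}" for key, value in outputs.items()]
command = ["tmle", *args]
print(" ".join(command))
```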

## Runtime

Targeted Learning can quickly become computationally intensive compared to traditional parametric inference. Here, we illustrate typical runtimes using examples from population genetics, which is currently the main use case for this package but shouldn't be understood as its only scope. In fact, the two most prominent study designs in population genetics are perfect illustrations of the computational complexity associated with Targeted Learning.

### Preliminary

Remember that for each estimand of interest, Targeted Learning requires 3 main ingredients that drive computational complexity:

- An estimator for the propensity score: `G(T, W) = P(T|W)`.
- An estimator for the outcome's mean: `Q(T, W) = E[Y|T, W]`.
- A targeting step towards the estimand of interest.

While the targeting step has a fixed form, both `G` and `Q` require specification of learning algorithms that can range from simple generalized linear models to complex Super Learners. In general, one doesn't know how the data has been generated, and the model space should be kept as large as possible in order to provide valid inference. This means we recommend the use of Super Learning for both `G` and `Q` as it comes with asymptotic theoretical guarantees. However, Super Learning is an expensive procedure and, depending on the context, might become infeasible. Also, notice that while the targeting step is specific to a given estimand, `G` and `Q` are only specific to the variables occurring in the causal graph. This means that they can potentially be cleverly reused across the estimation of multiple estimands. Note that this clever reuse is already baked into this package, and nothing needs to be done besides specifying the learning algorithms for `G` and `Q`. The goal of the subsequent sections is to provide some examples guiding the choice of those learning algorithms.

In what follows, `Y` is an outcome of interest, `W` a set of confounding variables and `T` a genetic variation. Genetic variations are usually represented as a pair of alleles corresponding to an individual's genotype. We will further restrict the scope to bi-allelic single nucleotide variations. This means that, at a given locus where the two alleles are `A` and `C`, an individual could have any of the following genotypes: `AA`, `AC`, `CC`. Those will be our treatment values.

For all the following experiments:

- The Julia script can be found at [experiments/runtime.jl](https://github.com/TARGENE/TMLECLI.jl/tree/main/experiments/runtime.jl).
- The various estimators used below are further described in the [estimators-configs](https://github.com/TARGENE/TMLE.jl/tree/main/estimators-configs) folder.

### Multiple treatment contrasts

In a classic randomized control trial, the treatment variable can only take one of two levels: `treated` or `not treated`. In our example however, any genetic variation takes its values from three different levels. As such, the `treated` and `not treated` levels need to be defined and any of the following contrasts can be of interest:

- `AA` -> `AC`
- `AC` -> `CC`
- `AA` -> `CC`

For a given outcome and genetic variation, for each contrast, both `G` and `Q` are actually invariant. This shows a first level of reduction in computational complexity. **Both `G` and `Q` need to be fitted only once across multiple treatment contrasts and only the targeting step needs to be carried out again.**
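The reuse described in this section can be sketched with toy code (this is an illustration of the caching idea, not the package's actual implementation):

```python
from itertools import combinations

# The three possible contrasts for a bi-allelic variant with genotypes AA, AC, CC.
genotypes = ["AA", "AC", "CC"]
contrasts = list(combinations(genotypes, 2))

fit_counts = {"G": 0, "Q": 0, "targeting": 0}
cache = {}

def fit_nuisance(name):
    # G and Q depend only on the variables in the causal graph, so they are
    # fitted once and reused for every contrast.
    if name not in cache:
        fit_counts[name] += 1
        cache[name] = f"fitted-{name}"
    return cache[name]

for treated, control in contrasts:
    G = fit_nuisance("G")
    Q = fit_nuisance("Q")
    fit_counts["targeting"] += 1  # only the targeting step is estimand-specific

print(fit_counts)  # prints {'G': 1, 'Q': 1, 'targeting': 3}
```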

### The PheWAS study design

In a PheWAS, one is interested in the effect of a genetic variation across many outcomes (typically around 1000). Because the treatment variable is always the same, the propensity score `G` can be reused across all parameters, which drastically reduces computational complexity.

```@raw html
<div style="text-align:center">
<img src="assets/phewas.png" alt="PheWAS" style="width:400px;"/>
</div>
```

With this setup in mind, the computational complexity is mostly driven by the specification of the learning algorithms for `Q`, which will have to be fitted for each outcome. For 10 outcomes, we estimate the 3 Average Treatment Effects corresponding to the 3 possible treatment contrasts defined in the previous section. There are thus two levels of reuse of `G` and `Q` in this study design. The table below presents runtimes for various specifications of `G` and `Q` using a single CPU. The "Unit runtime" is the average runtime across all estimands and can roughly be extrapolated to bigger studies.

| Estimator | Unit runtime (s) | Extrapolated runtime to 1000 outcomes |
| --- | :---: | :---: |
| `glm` | 4.65 | ≈ 1h20 |
| `glmnet` | 7.19 | ≈ 2h |
| `G-superlearning-Q-glmnet` | 50.05 | ≈ 13h45 |
| `superlearning` | 168.98 | ≈ 46h |
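The extrapolated column can be roughly reproduced by scaling the unit runtime to 1000 outcomes (an assumption consistent with the numbers shown; small discrepancies come down to rounding):

```python
# Extrapolate per-estimand runtimes to a 1000-outcome PheWAS.
unit_runtimes = {  # seconds, from the table above
    "glm": 4.65,
    "glmnet": 7.19,
    "G-superlearning-Q-glmnet": 50.05,
    "superlearning": 168.98,
}
for estimator, seconds in unit_runtimes.items():
    hours = seconds * 1000 / 3600
    print(f"{estimator}: ~{hours:.1f} h")
```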

Depending on the exact setup, this means one can probably afford to use Super Learning for at least the estimation of `G` (and potentially also for `Q` for a single PheWAS). This turns out to be great news because TMLE is a doubly robust estimator. As a reminder, this means that only one of the estimators for `G` or `Q` needs to converge sufficiently fast to the ground truth to guarantee that our estimates will be asymptotically unbiased.

Finally, note that those runtime estimates should be interpreted as worst cases, because:

- Only 1 CPU is used.
- Most modern high performance computing platforms will allow further parallelization.
- In the case where only `G` is a Super Learner, since the number of parameters is still relatively low in this example, it is possible that the time to fit `G` still dominates the runtime.
- Runtimes include precompilation which becomes negligible with the size of the study.

### The GWAS study design

In a GWAS, the outcome variable is held fixed and we are interested in the effects of very many genetic variations on this outcome (typically 800 000 for a genotyping array). The propensity score cannot be reused across parameters, resulting in a more expensive run.

```@raw html
<div style="text-align:center">
<img src="assets/gwas.png" alt="GWAS" style="width:400px;"/>
</div>
```

Again, we estimate the 3 Average Treatment Effects corresponding to the 3 possible treatment contrasts. However, we now look at 3 different genetic variations and only one outcome. The table below presents runtimes for various specifications of `G` and `Q` using a single CPU. The "Unit runtime" is the average runtime across all estimands and can roughly be extrapolated to bigger studies.

| Estimator file | Continuous outcome unit runtime (s) | Binary outcome unit runtime (s) | Projected Time on HPC (200 folds //) |
| --- | :---: | :---: | :---: |
| `glm` | 5.64 | 6.14 | ≈ 6h30 |
| `glmnet` | 17.46 | 22.24 | ≈ 22h |
| `G-superlearning-Q-glmnet` | 430.54 | 438.67 | ≈ 20 days |
| `superlearning` | 511.26 | 567.72 | ≈ 24 days |
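The projected column can be roughly reproduced by averaging the two unit runtimes, scaling to 800 000 variants, and dividing by the 200 parallel folds (an assumption consistent with the numbers shown):

```python
# Project per-estimand runtimes to a GWAS of 800 000 variants on 200 folds.
unit_runtimes = {  # (continuous, binary) seconds, from the table above
    "glm": (5.64, 6.14),
    "glmnet": (17.46, 22.24),
    "G-superlearning-Q-glmnet": (430.54, 438.67),
    "superlearning": (511.26, 567.72),
}
for estimator, (continuous, binary) in unit_runtimes.items():
    seconds = (continuous + binary) / 2 * 800_000 / 200
    print(f"{estimator}: ~{seconds / 3600:.1f} h (~{seconds / 86400:.1f} days)")
```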

We can see that modern high performance computing platforms definitely enable this study design when using GLMs or GLMNets. It is unlikely, however, that you will be able to use Super Learning for any of `P(V|W)` or `E[Y|V, W]` if you don't have privileged access to such a platform. While the double robustness guarantees will generally not be satisfied, our estimate will still be targeted, which means that its bias will be reduced compared to classic inference using a parametric model.
We can output results in three different formats: HDF5, JSON and JLS. By default no output is written, so you need to specify at least one. An output is generated by specifying an output filename for it. For instance `--outputs.json.filename=output.json` will output a JSON file. Note that you can generate multiple formats at once, e.g. `--outputs.json.filename=output.json --outputs.hdf5.filename=output.hdf5` will output both JSON and HDF5 result files. Another important output option is the `pval_threshold`. Each estimation result is accompanied by an influence curve vector, and by default these vectors are erased before saving the results because they typically take up too much space and are not usually needed. On some occasions you might want to keep them, which can be achieved by specifying the output's `pval_threshold`. For instance `--outputs.hdf5.pval_threshold=1.` will keep all such vectors because all p-values lie between 0 and 1.
118 changes: 0 additions & 118 deletions experiments/parameters.binary.gwas.yaml

This file was deleted.


2 comments on commit ef52065

@olivierlabayle
Member Author


@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/114662

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.10.0 -m "<description of version>" ef520654d60f47ee585882048bddea562027b0f8
git push origin v0.10.0
