diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
-{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-25T11:56:15","documenter_version":"1.6.0"}}
+{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-08-28T14:09:46","documenter_version":"1.6.0"}}

diff --git a/dev/cli/index.html b/dev/cli/index.html

The Command Line Interface (CLI) · TMLECLI.jl

The Command Line Interface (CLI)

CLI Installation

Via Docker (requires Docker)

While we are getting close to providing a standalone application, the most reliable way to use the app is still the provided Docker container, in which the command line interface is directly accessible. For example:

docker run -it --rm -v HOST_DIR:CONTAINER_DIR olivierlabayle/targeted-estimation:TAG tmle --help

where HOST_DIR:CONTAINER_DIR maps the host directory HOST_DIR to CONTAINER_DIR inside the container, and TAG is the currently released version of the project.
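For instance, with hypothetical values (the directory paths and the 0.9.0 tag are illustrative placeholders, not a confirmed release):

docker run -it --rm -v /home/me/tmle-data:/data olivierlabayle/targeted-estimation:0.9.0 tmle --help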

Build (requires Julia)

Alternatively, provided you have Julia installed, you can build the app via:

julia --project deps/build_app.jl app

Below is a description of the functionalities offered by the CLI.

CLI Description

diff --git a/dev/index.html b/dev/index.html

Home · TMLECLI.jl

TMLECLI.jl

The goal of this package is to provide a standalone executable to run large-scale Targeted Minimum Loss-based Estimation (TMLE) on tabular datasets. To learn more about TMLE, please visit TMLE.jl, the companion package.

We also provide extensions to the MLJ universe that are particularly useful in causal inference.

diff --git a/dev/make_summary/index.html b/dev/make_summary/index.html

Merging TMLE outputs · TMLECLI.jl

Merging TMLE outputs

Usage

tmle make-summary --help
TMLECLI.make_summary — Function
make_summary(
     prefix; 
     outputs=Outputs(json=JSONOutput(filename="summary.json"))
)

Combines multiple TMLE .hdf5 output files into a single file. Multiple formats can be output at once.

Args

  • prefix: Prefix to .hdf5 files to be used to create the summary file

Options

  • -o, --outputs: Outputs configuration.
source
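For illustration, a minimal sketch of the corresponding Julia call (the "tmle_output" prefix is a hypothetical file prefix):

using TMLECLI

# Combine all .hdf5 files whose names start with "tmle_output" (hypothetical prefix)
# into a single JSON summary, using the Outputs/JSONOutput types from the signature above
make_summary("tmle_output"; outputs=Outputs(json=JSONOutput(filename="summary.json")))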
diff --git a/dev/models/index.html b/dev/models/index.html

Models · TMLECLI.jl

Models

Because TMLE.jl is built on top of MLJ, we can support any model respecting the MLJ interface. At the moment, we readily support all models from the following packages:

  • MLJLinearModels: Generalized Linear Models in Julia.
  • XGBoost.jl: Julia wrapper of the famous XGBoost package.
  • EvoTrees.jl: A pure Julia implementation of histogram based gradient boosting trees (subset of XGBoost)
  • GLMNet: A Julia wrapper of the glmnet package. See the GLMNet section.
  • MLJModels: General utilities such as the OneHotEncoder or InteractionTransformer.

Support for further packages can be added on request; please file an issue.

Also, because the estimator file used by the TMLE CLI is a pure Julia file, it can be used to install additional packages with which further models can be defined.

Finally, we also provide some additional models described in Additional models provided by TMLECLI.jl.

Additional models provided by TMLECLI.jl

GLMNet

This is a simple wrapper around the glmnetcv function from the GLMNet.jl package. The only difference is that the resampling is performed according to MLJ resampling strategies.

TMLECLI.GLMNetRegressor — Method
GLMNetRegressor(;resampling=CV(), params...)

A GLMNet regressor for continuous outcomes based on the glmnetcv function from the GLMNet.jl package.

Arguments:

  • resampling: The MLJ resampling strategy used to build the folds for glmnetcv.
  • params...: Additional keyword arguments passed to glmnetcv.

Examples:

A glmnet with alpha=0.


 model = GLMNetRegressor(resampling=CV(nfolds=3), alpha=0)
 mach = machine(model, X, y)
fit!(mach, verbosity=0)
source
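For context, a more self-contained sketch of the same example, assuming a synthetic dataset generated with MLJBase's make_regression:

using MLJBase
using TMLECLI

# Hypothetical synthetic dataset: 100 rows, 3 continuous features
X, y = make_regression(100, 3)

model = GLMNetRegressor(resampling=CV(nfolds=3), alpha=0)
mach = machine(model, X, y)
fit!(mach, verbosity=0)
predict(mach, X)  # predictions on the training features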
TMLECLI.GLMNetClassifier — Method
GLMNetClassifier(;resampling=StratifiedCV(), params...)

A GLMNet classifier for binary/multinomial outcomes based on the glmnetcv function from the GLMNet.jl package.

Arguments:

  • resampling: The MLJ resampling strategy used to build the folds for glmnetcv.
  • params...: Additional keyword arguments passed to glmnetcv.

Examples:

A glmnet with alpha=0.


 model = GLMNetClassifier(resampling=StratifiedCV(nfolds=3), alpha=0)
 mach = machine(model, X, y)
fit!(mach, verbosity=0)
source

RestrictedInteractionTransformer

This transformer generates interaction terms based on a set of primary variables in order to limit the combinatorial explosion.

TMLECLI.RestrictedInteractionTransformer — Type
RestrictedInteractionTransformer(;order=2, primary_variables=Symbol[], primary_variables_patterns=Regex[])

Definition

This transformer generates interaction terms based on a set of primary variables. All generated interaction terms are composed of a set of primary variables and at most one remaining variable in the provided table. If (T₁, T₂) define the set of primary variables and (W₁, W₂) are the remaining variables in the table, the generated interaction terms at order 2 will be:

  • T₁xT₂
  • T₁xW₂
  • W₁xT₂

but W₁xW₂ will not be generated because it would contain 2 remaining variables.

Arguments:

  • order: All interaction features up to the given order will be computed.
  • primary_variables: A set of column names from which to generate the interactions.
  • primary_variables_patterns: A set of regular expressions that can additionally be used to identify primary_variables.

source
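As an illustration, a hypothetical construction (the column names :T1, :T2 and the rs-prefixed pattern are made up for the example):

transformer = RestrictedInteractionTransformer(
    order=2,
    primary_variables=[:T1, :T2],              # hypothetical treatment columns
    primary_variables_patterns=[r"^rs[0-9]+"]  # hypothetical pattern for SNP columns
)

The resulting transformer can then be composed with other MLJ models, for instance in a pipeline.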

BiAllelicSNPEncoder

This transformer, mostly useful for genetic studies, converts bi-allelic single nucleotide polymorphism columns, encoded as Strings, to a count of one of the two alleles.

TMLECLI.BiAllelicSNPEncoder — Type
BiAllelicSNPEncoder(patterns=Symbol[])

Encodes bi-allelic SNP columns, identified by the provided patterns Regex, as a count of a reference allele determined dynamically (not necessarily the minor allele).

source
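For instance, a hypothetical encoder targeting columns named like rs12345 (the pattern is illustrative; note that the signature above defaults patterns to Symbol[], so check which element type your version expects):

encoder = BiAllelicSNPEncoder(patterns=[r"^rs[0-9]+"])  # hypothetical SNP-column pattern

A matching column of genotypes such as ["AA", "AG", "GG"] would then be encoded as allele counts, e.g. [0, 1, 2] for whichever allele is dynamically chosen as the reference.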
diff --git a/dev/objects.inv b/dev/objects.inv
Binary files a/dev/objects.inv and b/dev/objects.inv differ

diff --git a/dev/resampling/index.html b/dev/resampling/index.html

Resampling Strategies · TMLECLI.jl

Resampling Strategies

We also provide additional resampling strategies compliant with the MLJ.ResamplingStrategy interface.

AdaptiveResampling

The AdaptiveResampling strategies determine the number of cross-validation folds adaptively based on the available data. This is inspired by this paper on practical considerations for super learning.

The AdaptiveCV will determine the number of folds adaptively and perform a classic cross-validation split:

TMLECLI.AdaptiveCV — Type
AdaptiveCV(;shuffle=nothing, rng=nothing)

A CV (see MLJBase.CV) resampling strategy where the number of folds is determined data-adaptively based on the rule of thumb described here.

source
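For example, a minimal sketch using it as a drop-in MLJ resampling strategy (model, X and y are placeholders to be supplied by the user):

# Evaluate any MLJ model with an adaptively chosen number of folds
resampling = AdaptiveCV(shuffle=true, rng=123)
evaluate(model, X, y, resampling=resampling, measure=rmse)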

The AdaptiveStratifiedCV will determine the number of folds adaptively and perform a stratified cross-validation split:

TMLECLI.AdaptiveStratifiedCV — Type
AdaptiveStratifiedCV(;shuffle=nothing, rng=nothing)

A StratifiedCV (see MLJBase.StratifiedCV) resampling strategy where the number of folds is determined data-adaptively based on the rule of thumb described here.

source

JointStratifiedCV

Sometimes, the treatment variables (or some other features) are imbalanced, and naively performing cross-validation or stratified cross-validation could result in a violation of the positivity hypothesis. To overcome this difficulty, the following JointStratifiedCV performs a stratified cross-validation based on both the feature variables and the outcome variable.

TMLECLI.JointStratifiedCV — Type
JointStratifiedCV(;patterns=nothing, resampling=StratifiedCV())

Applies a stratified cross-validation strategy based on a variable constructed from X and y. A composite variable is built from:

  • x variables from X matching any of patterns and satisfying autotype(x) <: Union{Missing, Finite}.

If no pattern is provided, then only the second condition is considered.

  • y if autotype(y) <: Union{Missing, Finite}

The resampling needs to be a stratification-compliant resampling strategy, at the moment one of StratifiedCV or AdaptiveStratifiedCV.

source
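As an illustration, a hypothetical strategy stratifying on the outcome and on columns whose names start with "T" (the pattern is made up for the example):

resampling = JointStratifiedCV(patterns=[r"^T"], resampling=StratifiedCV(nfolds=3))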
diff --git a/dev/sieve_variance/index.html b/dev/sieve_variance/index.html

    n_estimators=10,
    max_tau=0.8,
    estimator_key=1
)

Sieve Variance Plateau CLI.

Args

Options

source

diff --git a/dev/tmle_estimation/index.html b/dev/tmle_estimation/index.html

    rng=123,
    cache_strategy="release-unusable",
    sort_estimands=false
)

TMLE CLI.

Args

Options

Flags

source

Specifying Estimands

The easiest way to create an estimands file is to use the companion Julia package TMLE.jl and create a Configuration structure. This structure can be serialized to a file using any of serialize (Julia serialization format), write_json (JSON) or write_yaml (YAML).
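For instance, a minimal sketch (the variable names are hypothetical, and the exact estimand keyword arguments should be checked against TMLE.jl's documentation):

using TMLE

# A hypothetical Average Treatment Effect estimand
Ψ = ATE(
    outcome=:Y,
    treatment_values=(T=(case=1, control=0),),
    treatment_confounders=(:W1, :W2)
)
configuration = Configuration(estimands=[Ψ])
TMLE.write_yaml("estimands.yaml", configuration)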

Alternatively, you can write this file manually. The following example illustrates the creation of three estimands in YAML format: an Average Treatment Effect (ATE), an Average Interaction Effect (AIE) and a Counterfactual Mean (CM).

type: "Configuration"
 estimands:
   - outcome_extra_covariates:
       - C1
[...]

[Figure: PheWAS study design]

With this setup in mind, the computational complexity is mostly driven by the specification of the learning algorithms for Q, which have to be fitted for each outcome. For 10 outcomes, we estimate the 3 Average Treatment Effects corresponding to the 3 possible treatment contrasts defined in the previous section. There are thus two levels of reuse of G and Q in this study design. The table below presents some runtimes for various specifications of G and Q using a single CPU. The "Unit runtime" is the average runtime across all estimands and can roughly be extrapolated to bigger studies.

Estimator                 | Unit runtime (s) | Extrapolated runtime to 1000 outcomes
glm                       | 4.65             | ≈ 1h20
glmnet                    | 7.19             | ≈ 2h
G-superlearning-Q-glmnet  | 50.05            | ≈ 13h45
superlearning             | 168.98           | ≈ 46h
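As a quick sanity check on the extrapolation (pure arithmetic, no package assumptions):

1000 * 4.65 / 3600  # ≈ 1.3 hours, matching the ≈ 1h20 glm row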

Depending on the exact setup, this means one can probably afford to use Super Learning for at least the estimation of G (and potentially also for Q for a single PheWAS). This turns out to be good news because TMLE is a doubly robust estimator. As a reminder, this means that only one of the estimators for G or Q needs to converge sufficiently fast to the ground truth to guarantee that our estimates will be asymptotically unbiased.

Finally, note that those runtime estimates should be interpreted as worst cases, because:

The GWAS study design

In a GWAS, the outcome variable is held fixed and we are interested in the effects of very many genetic variations on this outcome (typically 800,000 for a genotyping array). The propensity score cannot be reused across parameters, resulting in a more expensive run.

[Figure: GWAS study design]

Again, we estimate the 3 Average Treatment Effects corresponding to the 3 possible treatment contrasts. However, we now look at 3 different genetic variations and only one outcome. The table below presents some runtimes for various specifications of G and Q using a single CPU. The "Unit runtime" is the average runtime across all estimands and can roughly be extrapolated to bigger studies.

Estimator file            | Continuous outcome unit runtime (s) | Binary outcome unit runtime (s) | Projected Time on HPC (200 folds //)
glm                       | 5.64                                | 6.14                            | ≈ 6h30
glmnet                    | 17.46                               | 22.24                           | ≈ 22h
G-superlearning-Q-glmnet  | 430.54                              | 438.67                          | ≈ 20 days
superlearning             | 511.26                              | 567.72                          | ≈ 24 days
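A rough back-of-the-envelope check of these projections, assuming roughly 800,000 variants spread across 200 parallel folds (our reading of the "200 folds //" column, not a figure stated by the authors):

800_000 * 5.64 / 200 / 3600  # ≈ 6.3 hours, consistent with the ≈ 6h30 glm projection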

We can see that modern high-performance computing platforms definitely enable this study design when using GLMs or GLMNets. It is unlikely, however, that you will be able to use Super Learning for either P(V|W) or E[Y|V, W] if you don't have privileged access to such a platform. While the double robustness guarantees will generally not be satisfied, our estimate will still be targeted, which means that its bias will be reduced compared to classic inference using a parametric model.
