This is the reComBat implementation as described in our recent paper. The paper introduces a generalized version of the empirical Bayes batch correction method proposed in [1]. We use the two-design-matrix approach of Wachinger et al. [2].
reComBat is available on PyPI and can be installed via pip:
pip install reComBat
You can also clone the repository and install it locally via Poetry by executing
poetry install
in the repository directory.
The reComBat package is inspired by the code of [3] and likewise uses a scikit-learn-like API.
In a Python script, you can import it via
from reComBat import reComBat
combat = reComBat()
combat.fit(data,batches)
combat.transform(data,batches)
or
combat.fit_transform(data,batches)
All inputs (data, batches, design matrices) are pandas objects. The data and the design matrices are dataframes of the form (rows x columns) = (samples x features), indexed by an arbitrary sample index. The batches should be given as a pandas series. The design matrices (X for desired variation, C for undesired variation) can contain two types of columns: numerical and categorical. All columns in X and C are assumed categorical by default. If a column contains a numerical covariate, its column name should have the suffix "_numerical".
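The input layout described above can be sketched as follows. This is a minimal illustration with made-up values: the sample names, feature names, and covariate columns (tissue, age_numerical) are assumptions chosen for the example and are not part of reComBat.

```python
import numpy as np
import pandas as pd

# All names and values below are illustrative assumptions.
rng = np.random.default_rng(0)
samples = [f"sample_{i}" for i in range(6)]

# data: (samples x features), indexed by an arbitrary sample index
data = pd.DataFrame(
    rng.normal(size=(6, 3)),
    index=samples,
    columns=["gene_a", "gene_b", "gene_c"],
)

# batches: one batch label per sample, as a pandas Series on the same index
batches = pd.Series(
    ["b1", "b1", "b1", "b2", "b2", "b2"], index=samples, name="batch"
)

# X: design matrix of desired variation.
# "tissue" has no suffix, so it is treated as categorical (the default);
# "age_numerical" carries the suffix that marks it as numerical.
X = pd.DataFrame(
    {
        "tissue": ["liver", "lung", "liver", "lung", "liver", "lung"],
        "age_numerical": [34.0, 51.0, 47.0, 29.0, 63.0, 40.0],
    },
    index=samples,
)

# Split the columns the same way the naming convention implies
numerical_cols = [c for c in X.columns if c.endswith("_numerical")]
categorical_cols = [c for c in X.columns if not c.endswith("_numerical")]
```

Objects shaped like data and batches are what the fit, transform and fit_transform calls shown earlier expect.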
There is also a command-line interface which can be called from a bash shell.
reComBat data_file.csv batch_file.csv --<optional args>
The reComBat class has many optional arguments (see below). The fit, transform and fit_transform methods all take the arguments data and batches, which should be in the form described above.
The reComBat class has the following optional arguments:
- parametric: True or False. Choose between the parametric and the non-parametric version of the empirical Bayes method. By default, this is True, i.e. the parametric method is performed. Note that the non-parametric method has a longer run time than the parametric one.
- model: Choose which regression model should be used to standardise the data. You can choose between linear, ridge, lasso and elastic_net regression. By default, the elastic_net model is used.
- config: A Python dictionary specifying the keyword arguments for the relevant scikit-learn regression class.
For example, the LinearRegression class in scikit-learn currently has four non-deprecated keyword arguments: fit_intercept, copy_X, n_jobs, and positive. To specify each of them, we create a config dict:
config = {'fit_intercept':False,'copy_X':True,'n_jobs':1,'positive':False}
Note that for reComBat to give the correct result, the fit_intercept parameter always needs to be set to False.
For further details refer to sklearn.linear_model. The default config is None.
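Because fit_intercept must always be False, it can help to build the config dict through a small guard. The sketch below is one possible way to do this; make_config is a hypothetical helper written for this example, not a function provided by reComBat.

```python
def make_config(**sklearn_kwargs):
    """Hypothetical helper: build a config dict with fit_intercept forced to False."""
    # Refuse an explicit fit_intercept=True instead of silently overriding it
    if sklearn_kwargs.get("fit_intercept", False):
        raise ValueError("reComBat requires fit_intercept=False")
    config = dict(sklearn_kwargs)
    config["fit_intercept"] = False
    return config


# alpha is a keyword argument of scikit-learn's Ridge regression
ridge_config = make_config(alpha=1e-10)
```

The resulting dict could then be passed along with the matching model, for instance reComBat(model='ridge', config=ridge_config).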
- conv_criterion: The convergence criterion for the parametric empirical Bayes optimization. A relative, rather than absolute, convergence criterion is used. The default is 1e-4.
- max_iter: The maximum number of iterations for the parametric empirical Bayes optimization. The default is 1000.
- n_jobs: The number of parallel threads used in the non-parametric empirical Bayes optimization. A larger number of threads considerably speeds up the computation but also has higher memory requirements. The default is the number of CPUs of the machine.
- mean_only: True or False. Chooses whether only the means are adjusted (no scaling is performed) or the full algorithm is run. The default is False.
- optimize_params: True or False. Chooses whether the Bayesian parameters should be optimised or the starting values should be used. The default is True.
- reference_batch: If the data contains a reference batch, it can be specified here. The reference batch will not be adjusted. The default is None.
- verbose: True or False. Toggles verbose output. The default is True.
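To illustrate how conv_criterion and max_iter interact, the following sketch shows a generic fixed-point loop with a relative stopping rule. It is an assumption about what a relative criterion typically looks like, not reComBat's actual optimisation code; relative_change, iterate_until_converged and the toy update are hypothetical names introduced for this example.

```python
def relative_change(new, old):
    """Largest relative change between two equally long parameter lists.

    Assumes every entry of `old` is nonzero.
    """
    return max(abs(n - o) / abs(o) for n, o in zip(new, old))


def iterate_until_converged(update, start, conv_criterion=1e-4, max_iter=1000):
    """Apply `update` until the relative change falls below conv_criterion."""
    params = list(start)
    for iteration in range(1, max_iter + 1):
        new_params = update(params)
        if relative_change(new_params, params) < conv_criterion:
            return new_params, iteration
        params = new_params
    # Iteration budget exhausted without meeting the criterion
    return params, max_iter


def toy_update(params):
    """Toy update rule whose fixed point is the square root of 2."""
    return [0.5 * (x + 2.0 / x) for x in params]


estimate, n_iter = iterate_until_converged(toy_update, [1.0])
```

A relative rule of this kind stops at the same proportional precision regardless of the scale of the parameters, which an absolute threshold would not.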
The command line interface can take any of these arguments (except for config) via --<argument>=ARG. Any scikit-learn keyword arguments should be given explicitly, e.g. --alpha=1e-10. The command line interface has the following additional optional arguments:
- X_file: The csv file containing the design matrix of desired variation. The default is None.
- C_file: The csv file containing the design matrix of undesired variation. The default is None.
- data_path: The path to the data/design matrices. The default is the current directory.
- out_path: The path where the output file should be stored. The default is the current directory.
- out_file: The name of the output file (with extension).
The transform method and the command line interface output a dataframe and a csv file, respectively, of the form (samples x features) containing the adjusted data.
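When the command line interface is driven from a script, the --<argument>=ARG convention can be assembled programmatically. The sketch below only builds and prints the command; build_recombat_command is a hypothetical helper, and the file and directory names (X.csv, results, adjusted.csv) are placeholders assumed for the example.

```python
import shlex


def build_recombat_command(data_file, batch_file, **options):
    """Hypothetical helper: assemble a reComBat CLI call as an argument list."""
    command = ["reComBat", data_file, batch_file]
    # Each option is rendered in the --<argument>=ARG form
    for name, value in options.items():
        command.append(f"--{name}={value}")
    return command


command = build_recombat_command(
    "data_file.csv",
    "batch_file.csv",
    X_file="X.csv",           # placeholder design matrix file
    model="ridge",
    alpha=1e-10,              # scikit-learn keyword argument, given explicitly
    out_path="results",       # placeholder output directory
    out_file="adjusted.csv",  # placeholder output file name
)

# Shell-safe string form of the command, e.g. for logging
print(shlex.join(command))
```

With reComBat installed, the list could be handed to subprocess.run(command, check=True) to execute it.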
We included a step-by-step tutorial in the tutorial folder of the GitHub repository. We also provide a PDF version which serves as a manual.
This code is developed and maintained by members of the Machine Learning and Computational Biology Lab of Prof. Dr. Karsten Borgwardt:
- Michael Adamer (GitHub)
- Sarah Brüningk (GitHub)
References:
[1] W. Evan Johnson, Cheng Li, Ariel Rabinovic, Adjusting batch effects in microarray expression data using empirical Bayes methods, Biostatistics, Volume 8, Issue 1, January 2007, Pages 118–127, https://doi.org/10.1093/biostatistics/kxj037
[2] Christian Wachinger, Anna Rieckmann, Sebastian Pölsterl. Detect and Correct Bias in Multi-Site Neuroimaging Datasets. arXiv:2002.05049
[3] pycombat, CoAxLab, https://github.com/CoAxLab/pycombat