diff --git a/paper/figures/gp_vs_nngp.png b/paper/figures/gp_vs_nngp.png
new file mode 100644
index 0000000..0124ade
Binary files /dev/null and b/paper/figures/gp_vs_nngp.png differ
diff --git a/paper/figures/nngp_nnsize.png b/paper/figures/nngp_nnsize.png
new file mode 100644
index 0000000..bf21abb
Binary files /dev/null and b/paper/figures/nngp_nnsize.png differ
diff --git a/paper/paper.md b/paper/paper.md
index 5574f77..8b77f5e 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -16,11 +16,11 @@ authors:
     orcid: 0000-0002-5177-8598
     affiliation: "3"
 affiliations:
-  - name: Research Computing, Harvard University, Cambridge, MA, United States of America
+  - name: Research Computing, Harvard University, Cambridge, Massachusetts, United States of America
     index: 1
-  - name: McLean Hospital, Belmont, MA, United States of America
+  - name: McLean Hospital, Belmont, Massachusetts, United States of America
     index: 2
-  - name: Department of Biostatistics, Harvard School of Public Health, Cambridge, MA, United States of America
+  - name: Department of Biostatistics, Harvard School of Public Health, Cambridge, Massachusetts, United States of America
     index: 3
 
 date: 15 March 2023
@@ -230,6 +230,17 @@ Original covariate balance:
 
 ![Plot of nnGP models S3 object. Left: Estimated CERF with credible band. Right: Covariate balance of confounders before and after weighting with nnGP approach.\label{fig:nngp}](figures/readme_nngp.png){ width=100% }
 
+# Performance analyses of standard and nearest neighbor GP models
+
+The time complexity of the standard Gaussian Process (GP) model is \( O(n^3) \), while for the nearest neighbor GP (nnGP) model it is \( O(n m^3) \), where \( m \) is the number of neighbors. An in-depth discussion of how these complexities are achieved is outside the scope of this paper; readers interested in further details can refer to @Ren_2021_bayesian. This section compares the wall clock time of the standard GP and nnGP models in computing the conditional exposure response function (CERF) at a specific exposure level \( w \). We set the hyper-parameters to $\alpha = \beta = \gamma/\sigma = 1$. \autoref{fig:performance} compares the standard GP model with an nnGP model using 50 nearest neighbors. Because the standard GP and nnGP implementations in our package use different parallelization architectures, we conducted this benchmark on a single core. We varied the sample size from 3,000 to 10,000, a range where nnGP begins to show a notable efficiency advantage over the standard GP, and repeated the process 20 times with different seed values. We then plotted wall clock time against sample size for both methods, log-transforming both axes to make the growth rates easier to compare. For this set of analyses, the estimated slope of 3.09 for the standard GP (ideally 3) is consistent with its \( O(n^3) \) time complexity. The results also indicate that a sample size of 10,000 is not large enough to establish a meaningful empirical estimate of the nnGP model's time complexity.
+
+![Wall clock time (s) vs. number of data samples for the standard GP and nnGP models. All computations are conducted with $w=1$ and $\alpha = \beta = \gamma/\sigma = 1$. The process is repeated 20 times using different seed values to ensure robustness. A jitter effect is applied to enhance the visibility of data points. Both axes are displayed on log10 scales. The solid lines represent the linear regression `lm(log10(WC) ~ log10(n))`. \label{fig:performance}](figures/gp_vs_nngp.png){ width=60% }
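+
+As a rough illustration of how the reported slope can be obtained, the sketch below times model fits over a grid of sample sizes and regresses log wall clock time on log sample size. Here `run_gp_once()` is a hypothetical stand-in for fitting the standard GP model on `n` samples; it is not part of the package's API.
+
+```r
+# Minimal sketch of the timing experiment, not the exact benchmark script.
+# run_gp_once(n) is a hypothetical stand-in for fitting the standard GP
+# model on n samples at w = 1 with alpha = beta = gamma/sigma = 1.
+sizes <- seq(3000, 10000, by = 1000)
+wc <- vapply(sizes, function(n) {
+  system.time(run_gp_once(n))[["elapsed"]]
+}, numeric(1))
+
+# The slope of the log-log regression estimates the exponent k in O(n^k);
+# a value near 3 matches the standard GP's O(n^3) complexity.
+fit <- lm(log10(wc) ~ log10(sizes))
+coef(fit)[2]
+```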
+
+\autoref{fig:performance_nn} compares the performance of the nnGP model across three nearest neighbor categories (50, 100, and 200 neighbors), using sample sizes ranging from 5,000 to 100,000 in increments of 5,000. For each category, the runs exhibit a linear relationship, consistent with an \( O(n) \) time complexity, since \( m^3 \) remains constant across sample sizes within a category.
+
+![Wall clock time (s) vs. number of data samples for the nnGP model across different nearest neighbor categories (50, 100, 200) over a range of sample sizes from 5,000 to 100,000 in 5,000 increments. All computations are conducted with $w=1$ and $\alpha = \beta = \gamma/\sigma = 1$. Both axes are displayed on log10 scales. The solid lines represent the linear regression `lm(log10(WC) ~ log10(n))`. \label{fig:performance_nn}](figures/nngp_nnsize.png){ width=60% }
+
 # Software related features
 
 We have implemented several features to enhance the package's performance and usability. By utilizing the base R `parallel` package, the software can scale up on shared-memory systems. We have also implemented a logging infrastructure that tracks the software's internal progress and provides users and developers with detailed information on processed runs [@logger]. Continuous integration (CI) through GitHub Actions runs unit tests and checks code quality for every submitted pull request, and the majority of the codebase is covered by unit tests. To ensure efficient development, we follow a well-established git branching model [@driessen_2010] and the tidyverse style guide.
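+
+As a schematic illustration of this shared-memory pattern (a sketch of the general approach, not the package's internal code), independent exposure levels can be dispatched to worker processes while the logger records progress. `estimate_at_w()` is a hypothetical worker that computes the CERF at a single exposure level.
+
+```r
+library(parallel)
+library(logger)
+
+# estimate_at_w(w) is a hypothetical worker computing the CERF at level w.
+w_grid <- seq(0, 20, by = 0.5)
+log_info("Estimating CERF at {length(w_grid)} exposure levels")
+
+# mclapply() forks the current R process across cores on shared-memory
+# systems (forking is not available on Windows, where mc.cores must be 1).
+res <- mclapply(w_grid, function(w) estimate_at_w(w), mc.cores = 4)
+```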