efficient_r.Rmd

# Effcient R

Ruochen Zhang, Weipeng Li

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(profvis)  
library(parallel)
library(dplyr)
```

## Introduction

R is one of the world's most widely used statistics programming languages, which can be used for statistical analysis, graphics representation, and reporting. With an understanding of its characteristics, we can make our R programming more efficient.

R is an interpreted language, but it calls compiled C, Fortran and other languages behind the scenes.  Ususally calling the complied languages will be more efficient compared with writing codes that take time to compile, Vectorized programming can improve efficiency with little modifications to the codes as vector and matrix as the basic operation units in R and their operations call the compiled codes.

R is very flexible. For example, the variables are in dynamic types and the content and type can be modified. The flexible design brings extra burden to the operation and may make the programs slower. We can improve efficiency by paying attention to its flexibility and memory allocation.

In this tutorial, we will introduce some methods to improve efficiency, including efficient programming  and parallel computation. We will also introduce how to use profiling tools to analyze program efficiency.


## Efficient programming

### Vectorized

Accessing the underlying C/Fortran routines as quickly as possible can make the program efficient. The fewer functions calls required to achieve this, the better. Many R functions are vectorised, that is the function’s inputs and/or outputs naturally work with vectors, reducing the number of function calls required. 

We can use the example of calculating mean deviation from median of a sample to show how vectorized functions work.

To caculate:
$\frac{1}{n}\sum_{i=1}^{n} abs(x_{i}-m)$,where m is the median of the sample.

In method 1, we can calculate it with loops.
```{r}
mad_f1 <- function(x){
  n <- length(x)
  mhat <- median(x)
  s <- 0.0
  for(i in 1:n){
    s <- s + abs(x[i] - mhat)
  }
  s <- s/n
  return(s)
}
```

Using vectorized functions we can write it with `mean()` and `median()`.
```{r}
mad_f2 <- function(x) mean( abs(x - median(x)) )

```
The second method is not only more concise, it is also more efficient. We can compare the running time with `bench::mark`, which will count the running time for multiple times.

```{r}
x <- runif(10000)
bench::mark(
  mad_f1(x),
  mad_f2(x)
)
```
We can see the vectorised method are almost 4 times faster than the first one. Because the test run is interfered with by other tasks running at the same time in the operating system, so the median or the minimum time (the fastest possible) can better show the efficiency rather than the average time.

It’s also important to make full use of R functions that use vectors. Sometimes it is not straightforward to  use vectors but they can improve the efficiency by many times. Here consider estimating the integral $\int_ {0}^ {1} x^2 dx$ using a Monte-Carlo method. Essentially, we throw darts at the curve and count the number of darts that fall below the curve. A typical way of doing it is as follows:
```{r}
monte_carlo = function(N) {
  hits = 0
  for (i in seq_len(N)) {
    u1 = runif(1)
    u2 = runif(1)
    if (u1 ^ 2 > u2)
      hits = hits + 1
  }
  return(hits / N)
}
```
A version utilise vectors can be shown as follows:
```{r}
monte_carlo_vec = function(N) sum(runif(N)^2 > runif(N)) / N
```
By comparing their running time, we can see that the ectorized method is much faster.
```{r}
N = 500000
t1<-system.time(monte_carlo(N))[3]
t2<-system.time(monte_carlo_vec(N))[3]
sprintf('monte_carlo time consumption:%.3f',t1)
sprintf('monte_carlo vectorised version time consumption:%.3f',t2)
```
The monte_carlo_vec() function not uses the vectorised functions.In addition,the comparison(>), runif(), power operations(^) are also vectorised by applying to the whole vector rather than repeatedly on single elements.

### The Apply Family

Explicit loops can be slow and verbose in R. While vectorized methods help avoiding many loops, another helpful tool is the apply family. These functions serve as nice substitutes for loops and execute efficiently. The apply functions take at least two arguments: an object and a valid R expression. The expression is applied iteratively on the objects in given orders. There are many apply functions in base R, showing in the table below. As some of them are rarely used, we will introduce some important ones in this section.

  Function        Description   
------------     -------------   
  apply()         Apply functions over array margins            
  lapply()        Apply a function over a list and return a list
  sapply()        Apply a function over a list or dataframe and return a list or a matrix
  vapply()        Apply a function over a list or dataframe and return in a given type
  mapply()        Apply a function over multiple inputs of lists or dataframes
  tapply()        Apply a function over a ragged array
  eapply()        Apply a fuction over values in an environment

Table: Functions in the Apply Family

#### apply()

The apply() function applies an expression to margins of an array or matrix. It takes three arguments: X, MARGIN, and FUN. X is the array to apply the function on. MARGIN indicates the directions over which to apply the expression, with 1 for rows, 2 for columns, and c(1,2) for both rows and columns. FUN is the expression to be applied. It also takes a 'simplify' arguments, with default value of 'TRUE', indicating whether the results should be simplified. An example of apply() as an alternative of for loop is the following.

```{r}
M1 <- matrix(C<-(1:15),nrow=5)
M1_colsum<-apply(M1,2,sum)
M1_colsum
```

#### lapply(), sapply()

lapply() is designed for lists. It applies a function to every entry of the input list and returns a list of the same length. sapply() works similarly as lapply(), but permits a more flexible output. By default it returns a vector or a matrix, but also supports an array output which can be controled by the 'simplify' argument.

```{r}
Names<-c("Manish","Saurabh", "Rahul","Krishna","Venkat")
Names_lower<-lapply(Names,tolower)
Names_lower
```

```{r}
Names_upper<-sapply(Names,toupper)
Names_upper
```

#### tapply()
tapply() applies an operation on a subset of vector broken down by a given factor variable. It works similar to the group_by() function plus an aggregated operation in dplyr. Let's use the iris dataset as an example.

```{r}
data(iris)
summary(iris)
```
```{r}
tapply(iris$Sepal.Length,iris$Species,mean)
```

#### replicate()

replicate() is a nice alternative for writing a for loop in random simulation. It executes a given program for several times without introducing a count variable. The return is a multi-dimension array of the simulating results. It takes two arguments: n and expr. n indicates the number of repetition, and expr is the operation to be executed. For example, we would like to compute the mean and standard deviation of q sequence of standard normal random variables. We decide to repeat the simulation for 8 times, and each time generate 100 random sample.

```{r}
set.seed(123)
replicate(8, {
  x <- rnorm(100, 0, 1); 
  c(mean(x), sd(x)) })
```

### Memory Allocation
A good habit in R programming is to pre-allocate memory for variables before filling in values. Make sure that the size of your list or data frame does not grow after the for loop. Although R supports dynamic memory allocation when executing, it can be time-wasting in big data sets. To illustrate this, consider the different ways of creating a continuous sequence of numbers. 

The first method is to start with an empty list and gradually grow the list during the iteration.
```
method1 <- function(n) {
  vec <- c()
  for (i in seq_len(n))
    vec <- c(vec, i)
  vec
}
```

The second way is to start with a list with length $n$, and fill in the values during the iteration.
```
method2 <- function(n) {
  vec <- numeric(n)
  for (i in seq_len(n))
    vec[i] <- i
  vec
}
```

Compare the running time of the two methods. 
```
n <- 1e5
tmemory_1 <- system.time(method1(n))[3]
tmemory_2 <- system.time(method2(n))[3]

n <- 1e6
tmemory_3 <- system.time(method1(n))[3]
tmemory_4 <- system.time(method2(n))[3]
```

The following table shows the average running time on my local comupter of the two methods with different values of n. We can see from the drastic difference that as $n$ grows, the time for memory allocation can no longer be ignored. When $n=10^6$, method1 takes 36 minutes, while method2 only takes less than 1 second.

  n          Method1     Method2    
-------     ---------   ----------
  $10^5$      31.84       0.007     
  $10^6$       2155       0.063
           
Table:  Running Time (Seconds) for Different Methods of Memory Allocation

What should we do if we are not sure about the vector size before finishing the loop? In these cases it's often a good practice to allocate a large enough size for the vector, and delete the empty part after the loop. Comparing to reallocate memory for the object every time in an iteration, we only need to allocate twice. For example, we would like to find the number subarrays of $k$ number of continuous '1's in a random array. There are two ways for this.

```{r}
findones1 <- function(x,k){
  n <- length(x)
  runs <- NULL
  for(i in 1:(n-k+1)){
    if(all(x[i:(i+k-1)]==1)) runs <- c(runs,i)
  }  
  return(runs)
}
```

```{r}
findones2 <- function(x,k){
  n <- length(x)
  runs <- vector(length=n)
  count <- 0
  for(i in 1:(n-k+1)){
    if(all(x[i:(i+k-1)]==1)){
      count <- count+1
      runs[count] <- i
    }
  }
  ifelse(count > 0, runs <- runs[1:count], runs <- NULL)
  return(runs)
}
```

Test the running time of the two functions on a random sequence of length $10^6$, and set $k=4$. We can find that the first function takes a lot more time the the second function.
```{r}
set.seed(123)
n <- round(runif(n=1e6,min=0,max=1))
sprintf('Running time of findones1: %.3f',system.time(findones1(n,4))[3])
sprintf('Running time of findones2: %.3f',system.time(findones2(n,4))[3])
```


### Avoid making copies

Cumulative codes like `x <- c(x, y)` may make a copy each time running, It slows down the program when the amount of storage is large or the number of repeated modifications is large. We need to avoid making unnecessary copies to make our program more efficient. To avoid modifying the size of the variables can avoid making copies. In addition, we need to pay attention to how we modify the values of the variables.

When we modify the values in a data frame, it generates a copy every time. But modifying values in a list will not generate copies and thus is more efficient. We will compare the running time of modifying values in this two kinds of data type in the following example.
```{r}
set.seed(101)
m <- 2E4; n <- 100
x <- as.data.frame(matrix(
  runif(n*m), nrow=n, ncol=m))
time1<-system.time({
  for(j in seq(m)){
    x[[j]] <- x[[j]] + 1
  }
})
```

```{r}
set.seed(101)
m <- 2E4; n <- 100
x <- replicate(m, 
  runif(n),
  simplify=FALSE)
time2<-system.time({
  for(j in seq(m)){
    x[[j]] <- x[[j]] + 1
  }
})
x <- as.data.frame(x)
```

```{r}
time1[3]
time2[3]
```
replicate() returns a list when `simplify=FALSE`. Saving data in a list is more efficient to access than saving it in a data frame. But data frames offer more functions. We need to choose base on our needs.

## Parallel Computing

Parallel computing is a widely used technique for dealing with big datasets, which uses multiple cores and threads to complete a task at the same time. While most of our computers have multiple cores, R only runs on one of the virtual cores by default.

If a large task can be divided into independent subtasks, it is easy to conduct parallel computing on different cores. The bottleneck lies in the need to communicate between cores. For example, one of the most time-consuming task in R is random simulation, where we would like to avoid using same sequences in different cores and threads.

Luckily, there are some packages for simple parallel computing in R, and we will have a quick glimpse of these functions in this section.

### General R Functions for Parallel Computing

The parallel package in R provides a simple way of parallel computing. It gives the parallel version of the apply family, namely parApply(), parLapply(), parSapply(), etc. These functions require a temporary cluster as the main object. 

To better explain the use of parallel package, let us consider a simple task. We want to compute $$S_{n,k}=\Sigma_{i=1}^{n}\frac{1}{i^k}$$ for $n=10^6$, with $k$ ranging from 2 to 21. As the computation is independent for different $k$s, it's safe to assign the task to different cores.

Let's first compute the task with a single core.
```{r}
f10 <- function(n,k){
  s <- 0.0
  for(i in seq(n)) s <- s + 1/i^k
  s
}
f11 <- function(n,nk){
  v <- sapply(2:(nk+1), function(k) f10(n,k))
  v
}
sprintf('Running time with a single core: %.3f',system.time(f11(n=1e6, nk=20))[3])
```

Now let's do the computation with multiple cores step by step. Firstly, use `detectCores()` to check the virtual cores of your computer.
```{r}

nNodes <- detectCores()
nNodes
```

Next, create a temporary cluster with multiple cores which will be the main objects of our parallel computation.
```{r}
cpucl <- makeCluster(nNodes)
```

Conduct parallel computation with `parLapply()` or `parSapply()`. The computing time is now 0.818 seconds, reduced by 48% compared to the single core computation. Notice that in the parallel version, f10 has to be defined inside f12, because the function runs on different threads independently, and we have to make sure that the initialization and definition is done properly on every thread.

```{r}
f12 <- function(n,nk){
  f10 <- function(n,k){
    s <- 0.0
    for(i in seq(n)) s <- s + 1/i^k
    s
  }

  v <- parSapply(cpucl, 2:(nk+1), function(k) f10(n,k))
  v
}
sprintf('Running time with multiple cores: %.3f',system.time(f12(n=1e6, nk=20))[3])
```

Another way for the initialization is to pass the dependent objects to every cluster by `clusterExport()`. For example, instead of defining f10 inside f12, we can also do the following.
```{r}
clusterExport(cpucl, c("f10"))
f13 <- function(n,nk){
  v <- parSapply(cpucl, 2:(nk+1), function(k) f10(n,nk))
  v
}
sprintf('Running time with multiple cores: %.3f',system.time(f13(n=1e6, nk=20))[3])
```

If you need to execute some operations on every cluster node before doing the computation, `clusterEvalQ()` can be helpful. For example, you can import some certain libraries for further execution.
```
clusterEvalQ(cpucl, library(dplyr))
```

After the computation, always remember to stop the cluster. This ensures efficient CPU allocation for future tasks.
```{r}
stopCluster(cpucl)
```

### Parallel Computing for Random Simulations

As we mentioned at the beginning of this section, a more challenging task for parallel computation is random simulation. For example, we want to conduct a simulation of $10^7$ times. The task can be divided into ten simulation tasks of $10^6$ times, but the random sequences should be different for the ten tasks. 

`L'Ecuyer` is a popular random number generator for parallel computation in R. It has a very long period of around $2^191$. This enables us to use a separate stream for each cluster node so that the random sequences never get into sync. `nextRNGStream()` function in the parallel package sets a specific random seed for the generator, thus assigning different streams for every node.

For example, we want to estimate the probability of a Wilson confidence interval containing the true value  $p$. Recall that the Wilson confidence interval is defined as $$\frac{\hat{p}+\frac{\lambda^2}{2n}}{1+\frac{\lambda^2}{n}} \pm \frac{\lambda}{\sqrt{n}}\frac{\sqrt{\hat{p}(1-\hat{p})+\frac{\lambda^2}{4n}}}{1+\frac{\lambda^2}{n}},$$

where $\lambda = \phi^{-1}(1-\alpha)$. We want to simulate this probability with different $alpha$, $n$ and $p$.

First, let's do the computation on a single core.

```{r}
wilson <- function(n, x, conf){
  hatp <- x/n
  lam <- qnorm((conf+1)/2)
  lam2 <- lam^2 / n
  p1 <- (hatp + lam2/2)/(1 + lam2)
  delta <- lam / sqrt(n) * sqrt(hatp*(1-hatp) + lam2/4) / (1 + lam2)
  c(p1-delta, p1+delta)
}

f20 <- function(nsim){
  set.seed(123)
  n <- 30; p0 <- 0.01; conf <- 0.95
  cover <- 0
  for(i in seq(nsim)){
    x <- rbinom(1, n, p0)
    cf <- wilson(n, x, conf)
    if(p0 >= cf[1] && p0 <= cf[2]) cover <- cover+1
  }
  cover/nsim
}
sprintf('Running time with a single core: %.3f',system.time(cvg1 <- f20(nsim=1e7))[3])
```

Now let's move on to the multi-core version of this task. With parallel computation, the task only takes 29.95s, reduced by 43% comparing to the single-core version.

```{r}
nNodes <- detectCores()
cpucl <- makeCluster(nNodes)
each.seed <- function(s){
  assign(".Random.seed", s, envir = .GlobalEnv)
}
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
seed0 <- .Random.seed
seeds <- as.list(1:nNodes)

# Create different seeds for the cluster nodes
for(i in 1:nNodes){ 
  seed0 <- nextRNGStream(seed0)
  seeds[[i]] <- seed0
}
# Assign different seed for every node
junk <- clusterApply(cpucl, seeds, each.seed)
f21 <- function(isim, nsimsub){
  n <- 30; p0 <- 0.01; conf <- 0.95
  cover <- 0
  for(i in seq(nsimsub)){
    x <- rbinom(1, n, p0)
    cf <- wilson(n, x, conf)
    if(p0 >= cf[1] && p0 <= cf[2]) cover <- cover+1
  }
  cover
}
clusterExport(cpucl, c("f21", "wilson"))

f22 <- function(nsim){
  nbatch <- 40
  nsimsub <- nsim / nbatch
  cvs <- parSapply(cpucl, 1:nbatch, f21, nsimsub=nsimsub)
  sum(cvs)/(nsim*nbatch)
}
sprintf('Running time with multiple cores: %.3f',system.time(cvg2 <- f22(1e7))[3])

stopCluster(cpucl)
```

## Profiling tools in R

When we try to optimize our R programs. first we need to decide whether it is necessary to do so and where the bottleneck lies. In some simple problems, we can get the results within reasonable time and space consumption.It is also unnecessary when the optimization only results in minor improvements. If we find the bottleneck and it is necessary to optimize it, we still need to consider whether we modify the R codes or we can rewrite this part using other languages like C++ to achieve simple improvements.

To analyze the efficiency of codes, we use the profiling tools in R.

Base R `utils::rprof()`can collect profiling data for program runs, and `utils::summaryRprof()` can provide profiling summaries in text format. We also can use `profvis` package which provides an interactive graphical interface for visualising code profiling data.

Here we use an example to show how `profvis` package works.

```{r}
profvis({
  make_adder <- function(n) {
    function(x) {
      pause(0.25)
      x + n
    }
  }

  make_adder(1)(10)
  adder2 <- make_adder(2)
  adder2(10)
})

```
We can see that in flame graph, the upper panel gives the amount of time spent on each line of code. The lower panel  shows the process of the whole program. Some of R’s built-in functions don’t show in the profvis flame graph,although these functions can occupy a lot of time. We need to pay attention to this kind of problem. More discussion into it can be found in the [guide](https://rstudio.github.io/profvis/faq.html).

 For more complex programs,we should save the program as an R source file. Then we use source() method to load the function to be run, and use profvis() to call the function and display the performance analysis results after running.
```{r}
 bad_copy <- function(){
  M <- 1E5
  x <- c()
  for(i in seq(M)){
    x <- c(x, diff(range(runif(10))))
  }
  mean(x)
}
```
we can store the function in bad_copy.R. Then run `profvis(bad_copy())`


## References
[1] https://csgillespie.github.io/efficientR/programming.html

[2] https://bookdown.org/manishpatwal/bookdown-demo/

[3] https://www.math.pku.edu.cn/teachers/lidf/docs/Rbook/html/_Rbook/prog-prof.html