generated from jtr13/cctemplate
-
Notifications
You must be signed in to change notification settings - Fork 67
/
efficient_r.Rmd
476 lines (378 loc) · 19.6 KB
/
efficient_r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
# Effcient R
Ruochen Zhang, Weipeng Li
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(profvis)
library(parallel)
library(dplyr)
```
## Introduction
R is one of the world's most widely used statistics programming languages, which can be used for statistical analysis, graphics representation, and reporting. With an understanding of its characteristics, we can make our R programming more efficient.
R is an interpreted language, but it calls compiled C, Fortran and other languages behind the scenes. Ususally calling the complied languages will be more efficient compared with writing codes that take time to compile, Vectorized programming can improve efficiency with little modifications to the codes as vector and matrix as the basic operation units in R and their operations call the compiled codes.
R is very flexible. For example, the variables are in dynamic types and the content and type can be modified. The flexible design brings extra burden to the operation and may make the programs slower. We can improve efficiency by paying attention to its flexibility and memory allocation.
In this tutorial, we will introduce some methods to improve efficiency, including efficient programming and parallel computation. We will also introduce how to use profiling tools to analyze program efficiency.
## Efficient programming
### Vectorized
Accessing the underlying C/Fortran routines as quickly as possible can make the program efficient. The fewer functions calls required to achieve this, the better. Many R functions are vectorised, that is the function’s inputs and/or outputs naturally work with vectors, reducing the number of function calls required.
We can use the example of calculating mean deviation from median of a sample to show how vectorized functions work.
To caculate:
$\frac{1}{n}\sum_{i=1}^{n} abs(x_{i}-m)$,where m is the median of the sample.
In method 1, we can calculate it with loops.
```{r}
mad_f1 <- function(x){
n <- length(x)
mhat <- median(x)
s <- 0.0
for(i in 1:n){
s <- s + abs(x[i] - mhat)
}
s <- s/n
return(s)
}
```
Using vectorized functions we can write it with `mean()` and `median()`.
```{r}
mad_f2 <- function(x) mean( abs(x - median(x)) )
```
The second method is not only more concise, it is also more efficient. We can compare the running time with `bench::mark`, which will count the running time for multiple times.
```{r}
x <- runif(10000)
bench::mark(
mad_f1(x),
mad_f2(x)
)
```
We can see the vectorised method are almost 4 times faster than the first one. Because the test run is interfered with by other tasks running at the same time in the operating system, so the median or the minimum time (the fastest possible) can better show the efficiency rather than the average time.
It’s also important to make full use of R functions that use vectors. Sometimes it is not straightforward to use vectors but they can improve the efficiency by many times. Here consider estimating the integral $\int_ {0}^ {1} x^2 dx$ using a Monte-Carlo method. Essentially, we throw darts at the curve and count the number of darts that fall below the curve. A typical way of doing it is as follows:
```{r}
monte_carlo = function(N) {
hits = 0
for (i in seq_len(N)) {
u1 = runif(1)
u2 = runif(1)
if (u1 ^ 2 > u2)
hits = hits + 1
}
return(hits / N)
}
```
A version utilise vectors can be shown as follows:
```{r}
monte_carlo_vec = function(N) sum(runif(N)^2 > runif(N)) / N
```
By comparing their running time, we can see that the ectorized method is much faster.
```{r}
N = 500000
t1<-system.time(monte_carlo(N))[3]
t2<-system.time(monte_carlo_vec(N))[3]
sprintf('monte_carlo time consumption:%.3f',t1)
sprintf('monte_carlo vectorised version time consumption:%.3f',t2)
```
The monte_carlo_vec() function not uses the vectorised functions.In addition,the comparison(>), runif(), power operations(^) are also vectorised by applying to the whole vector rather than repeatedly on single elements.
### The Apply Family
Explicit loops can be slow and verbose in R. While vectorized methods help avoiding many loops, another helpful tool is the apply family. These functions serve as nice substitutes for loops and execute efficiently. The apply functions take at least two arguments: an object and a valid R expression. The expression is applied iteratively on the objects in given orders. There are many apply functions in base R, showing in the table below. As some of them are rarely used, we will introduce some important ones in this section.
Function Description
------------ -------------
apply() Apply functions over array margins
lapply() Apply a function over a list and return a list
sapply() Apply a function over a list or dataframe and return a list or a matrix
vapply() Apply a function over a list or dataframe and return in a given type
mapply() Apply a function over multiple inputs of lists or dataframes
tapply() Apply a function over a ragged array
eapply() Apply a fuction over values in an environment
Table: Functions in the Apply Family
#### apply()
The apply() function applies an expression to margins of an array or matrix. It takes three arguments: X, MARGIN, and FUN. X is the array to apply the function on. MARGIN indicates the directions over which to apply the expression, with 1 for rows, 2 for columns, and c(1,2) for both rows and columns. FUN is the expression to be applied. It also takes a 'simplify' arguments, with default value of 'TRUE', indicating whether the results should be simplified. An example of apply() as an alternative of for loop is the following.
```{r}
M1 <- matrix(C<-(1:15),nrow=5)
M1_colsum<-apply(M1,2,sum)
M1_colsum
```
#### lapply(), sapply()
lapply() is designed for lists. It applies a function to every entry of the input list and returns a list of the same length. sapply() works similarly as lapply(), but permits a more flexible output. By default it returns a vector or a matrix, but also supports an array output which can be controled by the 'simplify' argument.
```{r}
Names<-c("Manish","Saurabh", "Rahul","Krishna","Venkat")
Names_lower<-lapply(Names,tolower)
Names_lower
```
```{r}
Names_upper<-sapply(Names,toupper)
Names_upper
```
#### tapply()
tapply() applies an operation on a subset of vector broken down by a given factor variable. It works similar to the group_by() function plus an aggregated operation in dplyr. Let's use the iris dataset as an example.
```{r}
data(iris)
summary(iris)
```
```{r}
tapply(iris$Sepal.Length,iris$Species,mean)
```
#### replicate()
replicate() is a nice alternative for writing a for loop in random simulation. It executes a given program for several times without introducing a count variable. The return is a multi-dimension array of the simulating results. It takes two arguments: n and expr. n indicates the number of repetition, and expr is the operation to be executed. For example, we would like to compute the mean and standard deviation of q sequence of standard normal random variables. We decide to repeat the simulation for 8 times, and each time generate 100 random sample.
```{r}
set.seed(123)
replicate(8, {
x <- rnorm(100, 0, 1);
c(mean(x), sd(x)) })
```
### Memory Allocation
A good habit in R programming is to pre-allocate memory for variables before filling in values. Make sure that the size of your list or data frame does not grow after the for loop. Although R supports dynamic memory allocation when executing, it can be time-wasting in big data sets. To illustrate this, consider the different ways of creating a continuous sequence of numbers.
The first method is to start with an empty list and gradually grow the list during the iteration.
```
method1 <- function(n) {
vec <- c()
for (i in seq_len(n))
vec <- c(vec, i)
vec
}
```
The second way is to start with a list with length $n$, and fill in the values during the iteration.
```
method2 <- function(n) {
vec <- numeric(n)
for (i in seq_len(n))
vec[i] <- i
vec
}
```
Compare the running time of the two methods.
```
n <- 1e5
tmemory_1 <- system.time(method1(n))[3]
tmemory_2 <- system.time(method2(n))[3]
n <- 1e6
tmemory_3 <- system.time(method1(n))[3]
tmemory_4 <- system.time(method2(n))[3]
```
The following table shows the average running time on my local comupter of the two methods with different values of n. We can see from the drastic difference that as $n$ grows, the time for memory allocation can no longer be ignored. When $n=10^6$, method1 takes 36 minutes, while method2 only takes less than 1 second.
n Method1 Method2
------- --------- ----------
$10^5$ 31.84 0.007
$10^6$ 2155 0.063
Table: Running Time (Seconds) for Different Methods of Memory Allocation
What should we do if we are not sure about the vector size before finishing the loop? In these cases it's often a good practice to allocate a large enough size for the vector, and delete the empty part after the loop. Comparing to reallocate memory for the object every time in an iteration, we only need to allocate twice. For example, we would like to find the number subarrays of $k$ number of continuous '1's in a random array. There are two ways for this.
```{r}
findones1 <- function(x,k){
n <- length(x)
runs <- NULL
for(i in 1:(n-k+1)){
if(all(x[i:(i+k-1)]==1)) runs <- c(runs,i)
}
return(runs)
}
```
```{r}
findones2 <- function(x,k){
n <- length(x)
runs <- vector(length=n)
count <- 0
for(i in 1:(n-k+1)){
if(all(x[i:(i+k-1)]==1)){
count <- count+1
runs[count] <- i
}
}
ifelse(count > 0, runs <- runs[1:count], runs <- NULL)
return(runs)
}
```
Test the running time of the two functions on a random sequence of length $10^6$, and set $k=4$. We can find that the first function takes a lot more time the the second function.
```{r}
set.seed(123)
n <- round(runif(n=1e6,min=0,max=1))
sprintf('Running time of findones1: %.3f',system.time(findones1(n,4))[3])
sprintf('Running time of findones2: %.3f',system.time(findones2(n,4))[3])
```
### Avoid making copies
Cumulative codes like `x <- c(x, y)` may make a copy each time running, It slows down the program when the amount of storage is large or the number of repeated modifications is large. We need to avoid making unnecessary copies to make our program more efficient. To avoid modifying the size of the variables can avoid making copies. In addition, we need to pay attention to how we modify the values of the variables.
When we modify the values in a data frame, it generates a copy every time. But modifying values in a list will not generate copies and thus is more efficient. We will compare the running time of modifying values in this two kinds of data type in the following example.
```{r}
set.seed(101)
m <- 2E4; n <- 100
x <- as.data.frame(matrix(
runif(n*m), nrow=n, ncol=m))
time1<-system.time({
for(j in seq(m)){
x[[j]] <- x[[j]] + 1
}
})
```
```{r}
set.seed(101)
m <- 2E4; n <- 100
x <- replicate(m,
runif(n),
simplify=FALSE)
time2<-system.time({
for(j in seq(m)){
x[[j]] <- x[[j]] + 1
}
})
x <- as.data.frame(x)
```
```{r}
time1[3]
time2[3]
```
replicate() returns a list when `simplify=FALSE`. Saving data in a list is more efficient to access than saving it in a data frame. But data frames offer more functions. We need to choose base on our needs.
## Parallel Computing
Parallel computing is a widely used technique for dealing with big datasets, which uses multiple cores and threads to complete a task at the same time. While most of our computers have multiple cores, R only runs on one of the virtual cores by default.
If a large task can be divided into independent subtasks, it is easy to conduct parallel computing on different cores. The bottleneck lies in the need to communicate between cores. For example, one of the most time-consuming task in R is random simulation, where we would like to avoid using same sequences in different cores and threads.
Luckily, there are some packages for simple parallel computing in R, and we will have a quick glimpse of these functions in this section.
### General R Functions for Parallel Computing
The parallel package in R provides a simple way of parallel computing. It gives the parallel version of the apply family, namely parApply(), parLapply(), parSapply(), etc. These functions require a temporary cluster as the main object.
To better explain the use of parallel package, let us consider a simple task. We want to compute $$S_{n,k}=\Sigma_{i=1}^{n}\frac{1}{i^k}$$ for $n=10^6$, with $k$ ranging from 2 to 21. As the computation is independent for different $k$s, it's safe to assign the task to different cores.
Let's first compute the task with a single core.
```{r}
f10 <- function(n,k){
s <- 0.0
for(i in seq(n)) s <- s + 1/i^k
s
}
f11 <- function(n,nk){
v <- sapply(2:(nk+1), function(k) f10(n,k))
v
}
sprintf('Running time with a single core: %.3f',system.time(f11(n=1e6, nk=20))[3])
```
Now let's do the computation with multiple cores step by step. Firstly, use `detectCores()` to check the virtual cores of your computer.
```{r}
nNodes <- detectCores()
nNodes
```
Next, create a temporary cluster with multiple cores which will be the main objects of our parallel computation.
```{r}
cpucl <- makeCluster(nNodes)
```
Conduct parallel computation with `parLapply()` or `parSapply()`. The computing time is now 0.818 seconds, reduced by 48% compared to the single core computation. Notice that in the parallel version, f10 has to be defined inside f12, because the function runs on different threads independently, and we have to make sure that the initialization and definition is done properly on every thread.
```{r}
f12 <- function(n,nk){
f10 <- function(n,k){
s <- 0.0
for(i in seq(n)) s <- s + 1/i^k
s
}
v <- parSapply(cpucl, 2:(nk+1), function(k) f10(n,k))
v
}
sprintf('Running time with multiple cores: %.3f',system.time(f12(n=1e6, nk=20))[3])
```
Another way for the initialization is to pass the dependent objects to every cluster by `clusterExport()`. For example, instead of defining f10 inside f12, we can also do the following.
```{r}
clusterExport(cpucl, c("f10"))
f13 <- function(n,nk){
v <- parSapply(cpucl, 2:(nk+1), function(k) f10(n,nk))
v
}
sprintf('Running time with multiple cores: %.3f',system.time(f13(n=1e6, nk=20))[3])
```
If you need to execute some operations on every cluster node before doing the computation, `clusterEvalQ()` can be helpful. For example, you can import some certain libraries for further execution.
```
clusterEvalQ(cpucl, library(dplyr))
```
After the computation, always remember to stop the cluster. This ensures efficient CPU allocation for future tasks.
```{r}
stopCluster(cpucl)
```
### Parallel Computing for Random Simulations
As we mentioned at the beginning of this section, a more challenging task for parallel computation is random simulation. For example, we want to conduct a simulation of $10^7$ times. The task can be divided into ten simulation tasks of $10^6$ times, but the random sequences should be different for the ten tasks.
`L'Ecuyer` is a popular random number generator for parallel computation in R. It has a very long period of around $2^191$. This enables us to use a separate stream for each cluster node so that the random sequences never get into sync. `nextRNGStream()` function in the parallel package sets a specific random seed for the generator, thus assigning different streams for every node.
For example, we want to estimate the probability of a Wilson confidence interval containing the true value $p$. Recall that the Wilson confidence interval is defined as $$\frac{\hat{p}+\frac{\lambda^2}{2n}}{1+\frac{\lambda^2}{n}} \pm \frac{\lambda}{\sqrt{n}}\frac{\sqrt{\hat{p}(1-\hat{p})+\frac{\lambda^2}{4n}}}{1+\frac{\lambda^2}{n}},$$
where $\lambda = \phi^{-1}(1-\alpha)$. We want to simulate this probability with different $alpha$, $n$ and $p$.
First, let's do the computation on a single core.
```{r}
wilson <- function(n, x, conf){
hatp <- x/n
lam <- qnorm((conf+1)/2)
lam2 <- lam^2 / n
p1 <- (hatp + lam2/2)/(1 + lam2)
delta <- lam / sqrt(n) * sqrt(hatp*(1-hatp) + lam2/4) / (1 + lam2)
c(p1-delta, p1+delta)
}
f20 <- function(nsim){
set.seed(123)
n <- 30; p0 <- 0.01; conf <- 0.95
cover <- 0
for(i in seq(nsim)){
x <- rbinom(1, n, p0)
cf <- wilson(n, x, conf)
if(p0 >= cf[1] && p0 <= cf[2]) cover <- cover+1
}
cover/nsim
}
sprintf('Running time with a single core: %.3f',system.time(cvg1 <- f20(nsim=1e7))[3])
```
Now let's move on to the multi-core version of this task. With parallel computation, the task only takes 29.95s, reduced by 43% comparing to the single-core version.
```{r}
nNodes <- detectCores()
cpucl <- makeCluster(nNodes)
each.seed <- function(s){
assign(".Random.seed", s, envir = .GlobalEnv)
}
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
seed0 <- .Random.seed
seeds <- as.list(1:nNodes)
# Create different seeds for the cluster nodes
for(i in 1:nNodes){
seed0 <- nextRNGStream(seed0)
seeds[[i]] <- seed0
}
# Assign different seed for every node
junk <- clusterApply(cpucl, seeds, each.seed)
f21 <- function(isim, nsimsub){
n <- 30; p0 <- 0.01; conf <- 0.95
cover <- 0
for(i in seq(nsimsub)){
x <- rbinom(1, n, p0)
cf <- wilson(n, x, conf)
if(p0 >= cf[1] && p0 <= cf[2]) cover <- cover+1
}
cover
}
clusterExport(cpucl, c("f21", "wilson"))
f22 <- function(nsim){
nbatch <- 40
nsimsub <- nsim / nbatch
cvs <- parSapply(cpucl, 1:nbatch, f21, nsimsub=nsimsub)
sum(cvs)/(nsim*nbatch)
}
sprintf('Running time with multiple cores: %.3f',system.time(cvg2 <- f22(1e7))[3])
stopCluster(cpucl)
```
## Profiling tools in R
When we try to optimize our R programs. first we need to decide whether it is necessary to do so and where the bottleneck lies. In some simple problems, we can get the results within reasonable time and space consumption.It is also unnecessary when the optimization only results in minor improvements. If we find the bottleneck and it is necessary to optimize it, we still need to consider whether we modify the R codes or we can rewrite this part using other languages like C++ to achieve simple improvements.
To analyze the efficiency of codes, we use the profiling tools in R.
Base R `utils::rprof()`can collect profiling data for program runs, and `utils::summaryRprof()` can provide profiling summaries in text format. We also can use `profvis` package which provides an interactive graphical interface for visualising code profiling data.
Here we use an example to show how `profvis` package works.
```{r}
profvis({
make_adder <- function(n) {
function(x) {
pause(0.25)
x + n
}
}
make_adder(1)(10)
adder2 <- make_adder(2)
adder2(10)
})
```
We can see that in flame graph, the upper panel gives the amount of time spent on each line of code. The lower panel shows the process of the whole program. Some of R’s built-in functions don’t show in the profvis flame graph,although these functions can occupy a lot of time. We need to pay attention to this kind of problem. More discussion into it can be found in the [guide](https://rstudio.github.io/profvis/faq.html).
For more complex programs,we should save the program as an R source file. Then we use source() method to load the function to be run, and use profvis() to call the function and display the performance analysis results after running.
```{r}
bad_copy <- function(){
M <- 1E5
x <- c()
for(i in seq(M)){
x <- c(x, diff(range(runif(10))))
}
mean(x)
}
```
we can store the function in bad_copy.R. Then run `profvis(bad_copy())`
## References
[1] https://csgillespie.github.io/efficientR/programming.html
[2] https://bookdown.org/manishpatwal/bookdown-demo/
[3] https://www.math.pku.edu.cn/teachers/lidf/docs/Rbook/html/_Rbook/prog-prof.html