Benchmarking

One of the most important things to know when optimizing your code is how long it takes to run. You can use the Sys.time() function to record the time from the internal system clock at any point. However, this approach doesn’t offer many tools and can be rather inaccurate: the time any one chunk of code takes to run can vary widely between executions. One solution is to measure the runtime several times and take the average, but doing this by hand can produce inflated estimates because R has to record start and stop times around each execution. A better way is to use the microbenchmark package, which uses OS-level timing functions and evaluates your code in C to give you a more accurate impression of your code’s efficiency. Try out the microbenchmark() function with some code that produces random draws.
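For reference, the Sys.time() approach described above amounts to something like this:

```r
## naive timing: take the difference between two system clock reads
start <- Sys.time()
invisible(rcauchy(1e5, 0, 5))
elapsed <- Sys.time() - start
elapsed  # a single, noisy measurement that varies from run to run
```

Running this repeatedly will give noticeably different results each time, which is exactly the problem microbenchmark() solves.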

library(microbenchmark)
print(microbenchmark(rcauchy(1e5, 0, 5)))
## Unit: milliseconds
##                  expr min  lq mean median uq max neval
##  rcauchy(1e+05, 0, 5) 5.2 5.9  7.5    6.5  8  38   100

Note: I have to wrap microbenchmark() in print() because the structure of the object that it returns doesn’t play nicely with R Markdown. You don’t have to do this if you’re just executing code in a notebook, but you will have to when you compile.

Notice that we get summary statistics about the distribution of runtimes. By itself, this isn’t all that interesting or helpful. Where microbenchmark() really shines is in letting you compare the efficiency of different ways of accomplishing the same task. Try a couple of different ways to create a \(100 \times 100\) matrix filled with draws from a \(\mathcal{N}(0,1)\) distribution and compare the efficiency of each. You can combine multiple approaches within one microbenchmark() call by using commas to supply them as separate arguments.

x <- matrix(NA, 1e2, 1e2)
print(microbenchmark(
  nested_loop = for (i in 1:1e2) {
    
    for (j in 1:1e2) {
      
      x[i,j] <- rnorm(1, 0, 1)
      
    }
   
  },
  loop = for (i in 1:1e2) {
    
    x[i,] <- rnorm(1e2, 0, 1)
    
  },
  matrix = x <- matrix(rnorm(1e4, 0, 1), 1e2, 1e2)
))
## Unit: microseconds
##         expr   min    lq  mean median    uq   max neval
##  nested_loop 19129 23403 34331  33086 40351 80526   100
##         loop  2795  3100  5222   3602  5051 29924   100
##       matrix   605   618   762    627   761  2202   100

As you can see, nested loops are extremely inefficient, and you should avoid them whenever possible.

With these tools, we can prove to ourselves one of the most important lessons of optimization: existing functions are almost always faster than a solution we create ourselves. Write your own code to calculate the row means of a \(100 \times 100\) matrix filled with draws from whatever distribution you like, and then compare the time it takes to run with R’s built-in rowMeans() function.

x <- matrix(rweibull(1e4, 1, .7), 1e2, 1e2)

rowmeans_loop <- numeric(1e2)
print(microbenchmark(
  cbind = {
    
    rowmeans_c <- numeric(0)  # start empty so the vector is rebuilt each run
    for (i in 1:1e2) {
      
      rowmeans_c <- c(rowmeans_c, mean(x[i,]))
      
    }
    
  },
  loop = for (i in 1:1e2) {
    
    rowmeans_loop[i] <- mean(x[i,])
    
  },
  builtin = rowMeans(x)
))
## Unit: microseconds
##     expr  min   lq mean median   uq   max neval
##    cbind 2652 4132 5924   5575 6963 18003   100
##     loop 2507 2647 3697   2861 3827 11368   100
##  builtin   25   30   42     36   42   594   100

The code above compares the elapsed times for the approaches I chose, but there are plenty of others, such as using the apply() function. If there’s an existing function that does what you need, use it. This is especially true for primitive functions like sum(), which call C directly.

Purpose-built functions are often more efficient than user-written approaches. The code below illustrates another important point about optimization: benchmarking is itself a stochastic procedure. Any number of variables can affect how long it takes to execute a given line of code (background processes, memory usage, how hot your computer is running), so you may need to tweak the number of executions that microbenchmark() uses. The second comparison below is more stable because it is based on 100 times as many executions.

x <- rgamma(1, 2, 3)
print(microbenchmark(sqrt(x), x^.05, times = 10))
## Unit: nanoseconds
##     expr min  lq mean median  uq   max neval
##  sqrt(x) 108 163 1253    242 510 10076    10
##   x^0.05 184 193  728    222 726  4320    10
print(microbenchmark(sqrt(x), x^.05, times = 1000))
## Unit: nanoseconds
##     expr min  lq mean median  uq  max neval
##  sqrt(x) 104 109  119    115 121 1378  1000
##   x^0.05 176 186  206    193 198 6816  1000

Some different ways of doing the same thing have vastly disparate runtimes. For example, testing for equality (==) is much faster than testing for inclusion (%in%), even when your comparison only has one element. If your code needs to be flexible in how many values it can compare against, use %in%, but if you only ever need to check against one value, use ==. This example illustrates one of the major tradeoffs in programming: efficiency vs. flexibility. If your code only has to make a handful of comparisons of diverse inputs, then using the %in% operator won’t cost you much time. However, if there are tens or hundreds of thousands of comparisons, you may want to write a simple function that uses the faster equality test when the input only has one value.

x <- 5
print(microbenchmark(x == 5, 5 %in% x))
## Unit: nanoseconds
##      expr min  lq mean median  uq   max neval
##    x == 5  86 103  157    112 117  3932   100
##  5 %in% x 498 512  749    519 579 18367   100
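A minimal sketch of the dispatching function just described might look like this (the name compare_fast is made up for illustration):

```r
## hypothetical helper: use the faster == when there is only one value
## to check against, and fall back to %in% otherwise
compare_fast <- function(x, values) {
  if (length(values) == 1) {
    x == values    # scalar case: faster equality test
  } else {
    x %in% values  # general case: inclusion test
  }
}

compare_fast(5, 5)        # TRUE
compare_fast(5, c(4, 6))  # FALSE
```

This keeps the flexibility of %in% while taking the fast path whenever the input allows it.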

Vectorization

One area where you can get large gains in performance for relatively little effort is vectorizing your code. Vectorization treats your inputs as vectors instead of scalars, which means reducing the number of loops your code requires. How can you work with vectors and not need a loop? What vectorization actually does is shift your loops from R to C, where they are much more efficient. If you’ve ever done any programming in C, you know that it has no high-level vectorized operations; anything involving vectors has to involve a loop of some sort. Compare the speed of rowMeans() against using apply() and mean() to take the row means of a \(1000 \times 1000\) matrix of draws from a standard normal distribution.

x <- matrix(rnorm(1e6), 1e3, 1e3)
print(microbenchmark(apply(x, 1, mean), rowMeans(x)))
## Unit: milliseconds
##               expr  min   lq mean median   uq  max neval
##  apply(x, 1, mean) 16.6 19.0 22.6   21.3 23.6 62.7   100
##        rowMeans(x)  2.1  2.1  2.2    2.2  2.3  3.8   100

Note that these are both built-in functions, and there’s not a loop in sight, yet rowMeans() is roughly 10 times faster than our apply() solution. This is because of vectorization and the use of C code. An example of vectorization in R that you’re already familiar with is matrix algebra. R implements matrix algebra through external libraries like BLAS, which benefit from decades of refinement and platform-specific tuning. Generate some simulated variables and a simulated coefficient vector, then compare the speed of using a loop to calculate \(\mathbf{X}\beta\) with R’s matrix multiplication operator.

x <- cbind(rnorm(100), runif(100, -3, 7), rpois(100, 2))
beta <- c(.78, 3.02, -1.3)
xb <- function(x, beta) {
  output <- numeric(nrow(x))
  for (i in 1:nrow(x)) {
    
    output[i] <- sum(x[i, ] * beta)
    
  }
  output
}
print(microbenchmark(x %*% beta, xb(x, beta)))
## Unit: nanoseconds
##         expr   min    lq   mean median    uq     max neval
##   x %*% beta   756   851   1284   1040  1142   22652   100
##  xb(x, beta) 77800 79124 138730  80763 81993 5600622   100

R’s matrix algebra functions, which run their loops in C, are nearly two orders of magnitude faster than an approach that loops within R. Imagine how much larger this difference would be if beta were a matrix whose columns we had to traverse in addition to the rows of x.
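As a sketch of that extension (the second coefficient column in B below is made up for illustration), a coefficient matrix forces a second loop in R, while %*% still handles everything in one call:

```r
## hypothetical extension: a coefficient *matrix* requires a nested loop in R
x <- cbind(rnorm(100), runif(100, -3, 7), rpois(100, 2))
B <- cbind(c(.78, 3.02, -1.3), c(1.1, -.4, .2))  # made-up second column

xb_mat <- function(x, beta) {
  output <- matrix(NA_real_, nrow(x), ncol(beta))
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(beta)) {
      output[i, j] <- sum(x[i, ] * beta[, j])  # one cell at a time
    }
  }
  output
}

all.equal(xb_mat(x, B), x %*% B)  # TRUE, but the loop version is far slower
```

Both produce identical results; only the matrix operator keeps the traversal in C.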

Byte Compilation

Have a custom function that just can’t be vectorized? One way to speed it up is byte compilation. R is an interpreted language; C is a compiled language. R can be run on many different architectures, while C has to be compiled for the specific architecture the code will run on. Byte code is something of a compromise between the two: it is a non-human-readable intermediate representation, but it is generic code intended to be adapted to individual architectures rather than written for a specific one. Byte compilation can thus give your code some of the speed benefits of a compiled language without all of the headaches of worrying about different architectures.

Write a custom mean function, and then create a byte-compiled version. Compare the performance of the two functions on a large collection of draws from your distribution of choice. You’ll see a larger performance difference if you intentionally write an inefficient function that uses a loop instead of relying on sum().

x <- rWishart(1e3, 5, cbind(c(2.4, .7), c(.7, 2.4)))

my.mean <- function(x) {
  
  total <- 0
  
  for (i in 1:length(x)) {
    
    total <- total + x[i] / length(x)
    
  }
  
  total
  
}

my.mean.c <- compiler::cmpfun(my.mean)
print(microbenchmark(standard = my.mean(x), compiled = my.mean.c(x), times = 1e3))
## Unit: microseconds
##      expr min  lq mean median  uq  max neval
##  standard 578 605  692    613 646 3828  1000
##  compiled 578 606  706    613 645 6586  1000

In this case, the two functions perform essentially identically. Since version 3.4, R byte-compiles functions automatically via its just-in-time (JIT) compiler the first time they are run, so explicitly calling cmpfun() often yields little additional gain. The more complex the function, the bigger the potential improvement.
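To see what byte compilation buys on its own, you can temporarily disable the JIT; this sketch assumes R >= 3.4, where the JIT is on by default:

```r
## disable automatic JIT compilation so the uncompiled function stays
## interpreted, then compare it against an explicitly compiled copy
old <- compiler::enableJIT(0)

slow.mean <- function(x) {
  total <- 0
  for (i in 1:length(x)) total <- total + x[i] / length(x)
  total
}
fast.mean <- compiler::cmpfun(slow.mean)  # explicit byte compilation

x <- rnorm(1e5)
system.time(slow.mean(x))  # interpreted
system.time(fast.mean(x))  # byte-compiled

compiler::enableJIT(old)   # restore the default JIT level
```

With the JIT off, the compiled version should now show a clear advantage.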

Profiling

When your code becomes more complex, measuring how long the whole thing takes to run provides less and less information. Instead, we want to identify which parts are fast and which are slow, then use that information to target the slowest parts when optimizing, which should lead to the biggest efficiency gains. The profvis package makes this easy by profiling your code and visualizing where the time is spent.

library(profvis)
x <- matrix(NA, 1e3, 1e3)
profvis({
  for (i in 1:1e3) {
    
    for (j in 1:1e3) {
      
      x[i,j] <- rnorm(1, 0, 1)
      
    }
   
  }
}, interval = 0.005)
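If you just want numbers rather than profvis’s interactive visualization, base R’s Rprof() provides the same sampling data; a minimal sketch:

```r
## base-R alternative to profvis: Rprof() samples the call stack to a file
tmp <- tempfile()
Rprof(tmp, interval = 0.005)
x <- matrix(rnorm(1e6), 1e3, 1e3)
invisible(apply(x, 1, mean))
Rprof(NULL)                      # stop profiling
head(summaryRprof(tmp)$by.self)  # where the time went, by function
```

The by.self table ranks functions by time spent in their own code, which is usually the first place to look for optimization targets.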

Parallelization

Parallelization lets you increase the speed of your code by running instructions in parallel (side by side), instead of serially (one after another). However, parallelization is not a magic bullet, and you need to use it intelligently to speed up your code. This is due to two main factors. First, parallel code is more complicated than serial code, so the speed gains you can get from your code need to outweigh the time it takes you to parallelize. Second, only certain tasks benefit from parallelization. Copy and paste the code below into R. Before you run it, try to guess which one will run faster.

library(parallel)

fake_data <- as.data.frame(matrix(rbeta(1e6, 1, 2), 1e3, 1e3))

## serial execution
print(microbenchmark(serial = lapply(fake_data, function(x) sqrt(x) / max(x)),
                     parallel = mclapply(fake_data, function(x) sqrt(x) / max(x),
                                         mc.cores = parallel::detectCores())))
## Unit: milliseconds
##      expr  min lq mean median uq max neval
##    serial  8.6 14   23     20 26  88   100
##  parallel 53.2 66   88     73 91 350   100

Sending instructions to multiple cores takes time. If an instruction is sufficiently quick to carry out, the extra step of distributing it across cores can actually make parallel code slower than serial code. Parallelization is therefore most appropriate for code where each instruction takes a long time to execute.
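We can flip the previous result by making each task genuinely slow; this sketch fakes the work with Sys.sleep():

```r
## when each task is slow, forking pays off
## (mclapply relies on forking, so mc.cores > 1 won't work on Windows)
library(parallel)

slow_task <- function(i) { Sys.sleep(0.25); i^2 }

system.time(serial <- lapply(1:4, slow_task))            # ~1 second
system.time(par <- mclapply(1:4, slow_task, mc.cores = 2))
```

With two cores splitting four quarter-second tasks, the parallel version finishes in roughly half the serial time, overhead included.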

One way to execute code in parallel is using mclapply() in the parallel package, as we did above. This is a good approach if the operations you need to carry out can be accomplished with simple anonymous functions, or even preexisting functions.

If you need to carry out more complicated operations, the foreach() function in the foreach package can be a good option. It also gives you more control over exactly how your code is parallelized. For Windows users, this can be your only option, as mclapply() does not work on Windows due to the way it parallelizes code (forking). The foreach() function can use parallel backends from multiple different packages, but we’re going to use doParallel. Call registerDoParallel() on the result of makeCluster(), with detectCores() as the number-of-cores argument.

library(doParallel)
registerDoParallel(makeCluster(parallel::detectCores()))

The syntax for parallelized loops is slightly different from regular loops. Be sure to use the %dopar% operator after the foreach() call; its sibling %do% executes the same loop serially (…like a regular loop).

for (i in 1:10) print(i)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
foreach(i = 1:10) %dopar% print(i)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 10

We also retrieve the results of a foreach() loop differently. Normally we create an object outside a loop and then fill it in by indexing with our iteration variable, e.g. x[i]. To save the results of a foreach() loop, we instead assign the loop itself to an object. Note that intermediate objects assigned inside the loop body don’t come out of its environment; only the value of the last expression in the loop body is collected into the final object.

x <- foreach(i = 1:1e4) %dopar% rnorm(1, 0, 1)

We can also use foreach() to create two-dimensional objects like matrices and arrays. If we do not change the .combine argument in the function call, the results are returned in a list as above, except each element is now a numeric vector. Setting the .combine argument to rbind instead will yield a matrix.

x <- foreach(i = 1:15) %dopar% rbeta(5, 1, 1)
class(x)
## [1] "list"
x <- foreach(i = 1:15, .combine = rbind) %dopar% rbeta(5, 1, 1)
class(x)
## [1] "matrix"
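Other .combine values work the same way; here is a quick sketch using .combine = c with the serial %do% operator, which needs no registered backend:

```r
## .combine = c flattens the per-iteration vectors into one long vector;
## %do% runs serially, so no parallel backend is required for this sketch
library(foreach)
x <- foreach(i = 1:15, .combine = c) %do% rbeta(5, 1, 1)
length(x)  # 75: fifteen iterations of five draws each
```

Choosing .combine is just choosing which function gets folded over the per-iteration results: c, rbind, cbind, or even `+` all work.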

If we want to make a data frame instead of a matrix, just call data.frame() around the elements as the last expression in the loop body. Remember to leave this call unassigned so that its value is what the loop body returns!

x <- foreach(i = 1:15, .combine = rbind) %dopar% {
  
  chr <- sample(c('a', 'b', 'c', 'd'), 1)
  num <- rnorm(1, 2, 5)
  data.frame(letter = chr, number = num)
  
}
x

If your data are exchangeable, meaning they don’t have labels and their order conveys no meaningful information, you can set the .inorder argument to false to improve performance. If you’re carrying out operations on real data, leave it set to true, but if you are, for example, generating bootstrap measures of uncertainty, you can set it to false. When it’s set to false, a worker that finishes its task can immediately write its result to the output and begin a new one, instead of having to wait for the other workers to write their results. Compare the speed of two parallel loops, with .inorder = T and .inorder = F.

print(microbenchmark(ordered = foreach(i = 1:1e3, .combine = rbind,
                                       .inorder = T) %dopar% rbeta(5, 1, 1),
               unordered = foreach(i = 1:1e3, .combine = rbind,
                                   .inorder = F) %dopar% rbeta(5, 1, 1)))
## Unit: milliseconds
##       expr min  lq mean median  uq max neval
##    ordered 372 398  420    415 428 542   100
##  unordered 375 404  430    424 437 652   100

Here the tasks are so quick that .inorder makes no measurable difference. The real benefit of parallelization lies in repeating tasks that each take a long time to carry out, such as reading data in or saving it to disk storage.

Individual Exercise

Pick a loop or custom function that you’ve written for your own work. Profile a typical execution of it to identify opportunities for improvement. Optimize your code by replacing loops with apply() functions, parallelizing more complicated loops, or byte compiling your function. Benchmark the original and optimized code. Include both versions of your code, the results of your profiling of the original code, and the results of benchmarking both versions.