5 Rcpp or how to easily embed C++ code into a package

Rcpp (“R-C-Plus-Plus”) is a package which facilitates the interface between C++ and . is an interpreted language, a feature that makes a number of things easy (including giving us access to the console in which we can evaluate code and variables on the fly). Nevertheless, this ease of use is counterbalanced by larger computation times compared to lower level languages such as C, Fortran and C++ (but which require compilation).

The curious reader is directed towards the online book Rcpp for everyone by Masaki E. Tsuda, which is a very thorough and complete resource for understanding how to use Rcpp, in complement to the introduction that can be found in the “Rcpp” section of Hadley Wickham’s book Advanced R4.

5.1 First function in Rcpp

👉 Your turn !

  1. To make your package ready for use with Rcpp, start by running the following command:

    usethis::use_rcpp()
  2. See the changes made

  3. You should also add the following 2 Roxygen comments in the general help page of the package, as indicated in the console:

    #' @useDynLib mypkgr
    #' @importFrom Rcpp sourceCpp, .registration = TRUE
    NULL

We are now going to create a first function in Rcpp to invert a matrix. For this, we will use the C++ library Armadillo. It is a modern and simple linear algebra library, highly optimized, and interfaced with via the RcppArmadillo package.

C++ is not a very different language from . The main differences that will have an impact for us:

  • C++ is very efficient for for loops (including nested for loops – ⚠️ there is often one order that is faster than the other, due to the way C++ allocates and walks through memory).

  • Each command must end with a semicolon ;.

  • C++ is a typed language: you must declare the type of each variable before you can use it in the code.

👉 Your turn !

  1. Create a new C++ file from RStudio (via the File > New File > C++ File menu), and save it in the src folder. Take the time to read it and try to understand each line.
  2. Compile and load your package (via the “Install and Restart” button) and try using the timesTwo() function from the console.
  3. Install the RcppArmadillo 👉 package, and don’t forget to make the necessary additions to DESCRIPTION (use usethis::use_rcpp_armadillo())
  4. Using Hadley Wickham’s introduction to Rcpp in his book Advanced R, as well as the documentation for the RcppArmadillo package and for the C++ library Armadillo, try to write a short function invC() in C++ that computes the inverse of a matrix.
  5. When you have successfully compiled your invC function and it is accessible from , create a mvnpdf_invC() function from the mvnpdfsmart implementation replacing only the matrix inverse calculations with a call to invC.
  6. Evaluate the performance gain of this new implementation mvnpdf_invC.
n <- 1000
mb <- microbenchmark(mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2)),
                     mvnpdf(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmart(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfoptim(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdf_invC(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     times=100L)
mb
## Unit: microseconds
##                                                            expr      min
##              mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2))   43.706
##       mvnpdf(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 3334.735
##  mvnpdfsmart(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 2309.243
##  mvnpdfoptim(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 1737.785
##  mvnpdf_invC(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 2295.590
##         lq       mean    median       uq      max neval  cld
##    53.5255   66.97186   66.8095   76.465  101.926   100 a   
##  3453.1430 3609.39687 3541.6005 3638.443 7066.104   100  b  
##  2341.6330 2542.28823 2370.2100 2400.899 6528.512   100   c 
##  1788.3585 1899.61774 1817.4070 1847.993 6017.037   100    d
##  2340.0955 2531.82134 2363.2605 2406.680 6656.678   100   c

5.2 Optimize thanks to C++

As a general rule, not much computational time is gained by replacing an optimized function with a C++ function. Indeed, most of the base functions are in fact already wrappers around well optimized C or Fortran routines. The gain is then limited to the suppression of argument checking and type management (which is there for a reason!).

👉 Your turn !

  1. From mvnpdfsmart, propose a complete implementation in C++ for computating the density of the multivariate Normal distribution mvnpdfsmartC().

  2. Evaluate the performance gain of this new implementation mvnpdfsmartC.

You can download our proposal for mvnpdfsmartC.cpp here.

For (relatively small) additional speed gain (at the cost of code readability!), you can have a look at our optimized Armadillo C++ implementation in mvnpdfoptimC.cpp.

n <- 1000
mb <- microbenchmark(mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2)),
                     mvnpdf(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmart(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfoptim(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdf_invC(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmartC(x=matrix(1.96, nrow = 2, ncol = n), mean = rep(0, 2), varcovM = diag(2), Log=FALSE),
                     mvnpdfoptimC(x=matrix(1.96, nrow = 2, ncol = n), mean = rep(0, 2), varcovM = diag(2), Log=FALSE),
                     times=100L)
mb
## Unit: microseconds
##                                                                                                       expr
##                                                         mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2))
##                                                  mvnpdf(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdfsmart(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdfoptim(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdf_invC(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##  mvnpdfsmartC(x = matrix(1.96, nrow = 2, ncol = n), mean = rep(0,      2), varcovM = diag(2), Log = FALSE)
##  mvnpdfoptimC(x = matrix(1.96, nrow = 2, ncol = n), mean = rep(0,      2), varcovM = diag(2), Log = FALSE)
##       min        lq       mean    median        uq      max neval  cld
##    43.829   53.7100   64.21994   61.8895   73.9845  105.903   100 a   
##  3284.797 3390.5770 3620.55092 3462.2655 3582.8465 7417.392   100  b  
##  2300.387 2340.7925 2500.75113 2353.8305 2386.7535 5860.335   100   c 
##  1728.601 1776.1405 1982.84569 1793.4220 1824.0285 9680.633   100    d
##  2297.353 2336.0160 2392.90760 2356.0240 2380.2550 5090.355   100   c 
##    51.209   53.5050   58.07404   56.1905   60.2085  122.344   100 a   
##    35.670   37.8635   42.37268   41.1230   44.4645  115.866   100 a

Note that Rcpp functions can be used outside of a package architecture thanks to the Rcpp::sourceCpp() function. But, as we have seen that it is always desirable to manage all of one’s code inside a package, it is unlikely that you will need this !

Annexe 5.1: premature optimization is a bad idea

Chambers, Software for Data Analysis: Programming with R, Springer, 2008:

“Including additional C code is a serious step, with some added dangers and often a substantial amount of programming and debugging required. You should have a good reason.