5 Rcpp or how to easily embed C++ code into a package

Rcpp (“R-C-Plus-Plus”) is a package which facilitates the interface between C++ and . is an interpreted language, a feature that makes a number of things easy (including giving us access to the console in which we can evaluate code and variables on the fly). Nevertheless, this ease of use is counterbalanced by larger computation times compared to lower level languages such as C, Fortran and C++ (but which require compilation).

The curious reader is directed towards the online book Rcpp for everyone by Masaki E. Tsuda, which is a very thorough and complete resource for understanding how to use Rcpp, in complement to the introduction that can be found in the “Rcpp” section of Hadley Wickham’s book Advanced R3.

5.1 First function in Rcpp

👉 Your turn !

  1. To make your package ready for use with Rcpp, start by running the following command:

    usethis::use_rcpp()
  2. See the changes made

  3. You should also add the following 2 Roxygen comments in the general help page of the package, as indicated in the console:

    #' @useDynLib mypkgr
    #' @importFrom Rcpp sourceCpp, .registration = TRUE
    NULL

We are now going to create a first function in Rcpp to invert a matrix. For this, we will use the C++ library Armadillo. It is a modern and simple linear algebra library, highly optimized, and interfaced with via the RcppArmadillo package.

C++ is not a very different language from . The main differences that will have an impact for us:

  • C++ is very efficient for for loops (including nested for loops – ⚠️ there is often one order that is faster than the other, due to the way C++ allocates and walks through memory).

  • Each command must end with a semicolon ;.

  • C++ is a typed language: you must declare the type of each variable before you can use it in the code.

👉 Your turn !

  1. Create a new C++ file from RStudio (via the File > New File > C++ File menu), and save it in the src folder. Take the time to read it and try to understand each line.
  2. Compile and load your package (via the “Install and Restart” button) and try using the timesTwo() function from the console.
  3. Install the RcppArmadillo 👉 package, and don’t forget to make the necessary additions to DESCRIPTION (use usethis::use_rcpp_armadillo())
  4. Using Hadley Wickham’s introduction to Rcpp in his book Advanced R, as well as the documentation for the RcppArmadillo package and for the C++ library Armadillo, try to write a short function invC() in C++ that computes the inverse of a matrix.
  5. When you have successfully compiled your invC function and it is accessible from , create a mvnpdf_invC() function from the mvnpdfsmart implementation replacing only the matrix inverse calculations with a call to invC.
  6. Evaluate the performance gain of this new implementation mvnpdf_invC.
n <- 1000
mb <- microbenchmark(mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2)),
                     mvnpdf(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmart(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfoptim(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdf_invC(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     times=100L)
mb
## Unit: microseconds
##                                                            expr      min
##              mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2))   26.568
##       mvnpdf(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 1624.830
##  mvnpdfsmart(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 1205.195
##  mvnpdfoptim(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)  670.432
##  mvnpdf_invC(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE) 1166.737
##        lq       mean    median        uq      max neval
##    52.562   64.02273   64.0830   75.5630  130.462   100
##  1836.677 1989.70294 1955.1055 2053.3620 4688.022   100
##  1251.135 1341.50524 1267.0025 1293.5910 3291.357   100
##   714.261  785.75762  735.9295  759.0945 3363.558   100
##  1250.910 1391.14394 1268.7245 1296.1535 3995.983   100

5.2 Optimize thanks to C++

As a general rule, not much computational time is gained by replacing an optimized function with a C++ function. Indeed, most of the base functions are in fact already wrappers around well optimized C or Fortran routines. The gain is then limited to the suppression of argument checking and type management (which is there for a reason!).

👉 Your turn !

  1. From mvnpdfsmart, propose a complete implementation in C++ for computating the density of the multivariate Normal distribution mvnpdfsmartC().

  2. Evaluate the performance gain of this new implementation mvnpdfsmartC.

You can download our proposal for mvnpdfsmartC.cpp here.

For (relatively small) additional speed gain (at the cost of code readability!), you can have a look at our optimized Armadillo C++ implementation in mvnpdfoptimC.cpp.

n <- 1000
mb <- microbenchmark(mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2)),
                     mvnpdf(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmart(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfoptim(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdf_invC(x=matrix(1.96, nrow = 2, ncol = n), Log=FALSE),
                     mvnpdfsmartC(x=matrix(1.96, nrow = 2, ncol = n), mean = rep(0, 2), varcovM = diag(2), Log=FALSE),
                     mvnpdfoptimC(x=matrix(1.96, nrow = 2, ncol = n), mean = rep(0, 2), varcovM = diag(2), Log=FALSE),
                     times=100L)
mb
## Unit: microseconds
##                                                                                                       expr
##                                                         mvtnorm::dmvnorm(matrix(1.96, nrow = n, ncol = 2))
##                                                  mvnpdf(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdfsmart(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdfoptim(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##                                             mvnpdf_invC(x = matrix(1.96, nrow = 2, ncol = n), Log = FALSE)
##  mvnpdfsmartC(x = matrix(1.96, nrow = 2, ncol = n), mean = rep(0,      2), varcovM = diag(2), Log = FALSE)
##  mvnpdfoptimC(x = matrix(1.96, nrow = 2, ncol = n), mean = rep(0,      2), varcovM = diag(2), Log = FALSE)
##       min        lq       mean    median        uq      max neval
##    27.347   53.9150   63.23389   66.2150   73.5745   93.439   100
##  1614.539 1857.1975 2031.68202 1974.0475 2073.8620 4048.422   100
##  1171.862 1256.5475 1369.85961 1276.4120 1295.7025 3715.379   100
##   656.533  715.4295  791.64604  735.1710  748.8240 3474.832   100
##  1156.774 1253.4110 1345.08823 1274.0955 1293.2630 4705.242   100
##    34.030   37.8225   41.17343   39.6060   42.4145  141.573   100
##    18.081   21.3200   24.79147   23.1035   25.4200  135.792   100

Note that Rcpp functions can be used outside of a package architecture thanks to the Rcpp::sourceCpp() function. But, as we have seen that it is always desirable to manage all of one’s code inside a package, it is unlikely that you will need this !

Annexe 5.1: premature optimization is a bad idea

Chambers, Software for Data Analysis: Programming with R, Springer, 2008:

“Including additional C code is a serious step, with some added dangers and often a substantial amount of programming and debugging required. You should have a good reason.