Comparing asymptotic timings of git versions

In this vignette we explain how to use functions that compute asymptotic timings of different git versions of a package, which is useful for determining when a difference in performance was introduced.

Basic usage, `atime_versions` function

This section shows how to compare asymptotic timings of an R expression evaluated with different versions of a package. We begin by cloning the binsegRcpp repository:

old.opt <- options(width=100)
pkg.path <- tempfile()
dir.create(pkg.path)
git2r::clone("https://github.com/tdhock/binsegRcpp", pkg.path)
#> cloning into 'C:\Users\th798\AppData\Local\Temp\RtmpwJth9N\file9546274db6'...
#> Receiving objects:   1% (13/1258),   63 kb
#> Receiving objects:  11% (139/1258),   63 kb
#> Receiving objects:  21% (265/1258),  127 kb
#> Receiving objects:  31% (390/1258),  127 kb
#> Receiving objects:  41% (516/1258),  183 kb
#> Receiving objects:  51% (642/1258),  183 kb
#> Receiving objects:  61% (768/1258),  183 kb
#> Receiving objects:  71% (894/1258),  239 kb
#> Receiving objects:  81% (1019/1258),  239 kb
#> Receiving objects:  91% (1145/1258),  239 kb
#> Receiving objects: 100% (1258/1258),  251 kb, done.
#> Local:    master C:/Users/th798/AppData/Local/Temp/RtmpwJth9N/file9546274db6
#> Remote:   master @ origin (https://github.com/tdhock/binsegRcpp)
#> Head:     [977f385] 2022-08-24: rm rcppdeepstate yaml action

Next, to satisfy the CRAN requirement that we cannot install packages to the default library, we create a library in a temporary directory and put it first on the library search path:

tmp.lib.path <- tempfile()
dir.create(tmp.lib.path)
lib.path.vec <- c(tmp.lib.path, .libPaths())
.libPaths(lib.path.vec)
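If the temporary library is only needed for a single step, the original search path can be restored afterwards. The following is a minimal sketch using only base R; the helper name with_temp_lib is hypothetical and is not part of atime.

```r
# Remember the search path so we can check that it is restored.
paths.before <- .libPaths()

# Hypothetical helper: evaluate code with a fresh temporary library
# first on the search path, then restore the previous search path.
with_temp_lib <- function(code){
  old.paths <- .libPaths()
  tmp.lib <- tempfile("lib")
  dir.create(tmp.lib)
  on.exit({
    .libPaths(old.paths)              # restore the search path
    unlink(tmp.lib, recursive=TRUE)   # delete the temporary library
  })
  .libPaths(c(tmp.lib, old.paths))
  force(code)  # the argument is evaluated here, after the path change
}

# Inside the helper, the first library is the temporary directory.
first.lib <- with_temp_lib(.libPaths()[1])
```

After the call, .libPaths() should return the same paths as before, and anything installed into the temporary library is gone.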

Next, we define a helper function run.atime.versions that will run atime_versions, which is a simple way to compare different git versions of a function:

run.atime.versions <- function(PKG.PATH, LIB.PATH){
  if(!missing(LIB.PATH)).libPaths(LIB.PATH)
  atime::atime_versions(
    pkg.path=PKG.PATH,
    N=2^seq(2, 20),
    setup={
      max.segs <- as.integer(N/2)
      data.vec <- 1:N
    },
    expr=binsegRcpp::binseg_normal(data.vec, max.segs),
    cv="908b77c411bc7f4fcbcf53759245e738ae724c3e",
    "rm unord map"="dcd0808f52b0b9858352106cc7852e36d7f5b15d",
    "mvl_construct"="5942af606641428315b0e63c7da331c4cd44c091")
}

Here is an explanation of the arguments specified above:

pkg.path is the path to the local git repository containing the R package,
N is a numeric vector of data sizes,
setup is an R expression which will be run to create data for each size N,
expr is an R expression which will be timed for each package version. Under the hood, a different R package is created for each package version, with a name of the form Package.SHA, for example binsegRcpp.908b77c411bc7f4fcbcf53759245e738ae724c3e. This expr must refer to the package with a double or triple colon prefix, like binsegRcpp::binseg_normal above, which will be translated to several version-specific expressions, like binsegRcpp.908b77c411bc7f4fcbcf53759245e738ae724c3e::binseg_normal.
The remaining arguments specify the package versions to compare (argument names are used as labels, values are git commit SHA IDs).
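The prefix translation described above can be pictured as a rewrite of the call object. The sketch below is an illustration in base R and is not the code that atime uses; rename_pkg_prefix is a hypothetical helper, and the short SHA "abc123" is a placeholder rather than a real commit.

```r
# Hypothetical helper: replace pkg::fun (or pkg:::fun) with
# pkg.sha::fun everywhere inside the unevaluated expression e.
rename_pkg_prefix <- function(e, pkg, sha){
  if(!is.call(e)) return(e)  # symbols and constants are left as-is
  is.colon <- is.symbol(e[[1]]) &&
    as.character(e[[1]]) %in% c("::", ":::")
  if(is.colon && identical(as.character(e[[2]]), pkg)){
    e[[2]] <- as.symbol(paste0(pkg, ".", sha))
    return(e)
  }
  # Otherwise recurse into the function position and the arguments.
  # Calls with empty arguments, like x[, 1], are not handled here.
  as.call(lapply(as.list(e), rename_pkg_prefix, pkg=pkg, sha=sha))
}

orig.expr <- quote(binsegRcpp::binseg_normal(data.vec, max.segs))
new.expr <- rename_pkg_prefix(orig.expr, "binsegRcpp", "abc123")
new.text <- deparse(new.expr)
```

Here new.text is the string "binsegRcpp.abc123::binseg_normal(data.vec, max.segs)", which has the same shape as the version-specific expressions mentioned above.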

Note that your own code does not need a helper function like run.atime.versions. We define one in this vignette so that the different versions of the code can be run in a separate R process, using callr::r. This allows us to avoid CRAN warnings about unexpected files found in the package check directory, by safely removing the installed packages after the example code has run. For more typical usage, see example(atime_versions, package="atime").

In the code block below we compute the timings,

atime.ver.list <- if(requireNamespace("callr")){
  requireNamespace("atime")
  callr::r(run.atime.versions, list(pkg.path, lib.path.vec))
}else{
  run.atime.versions(pkg.path)
}
#> Loading required namespace: callr
names(atime.ver.list$measurements)
#>  [1] "N"         "expr.name" "min"       "median"    "itr/sec"   "gc/sec"    "n_itr"     "n_gc"     
#>  [9] "result"    "memory"    "time"      "gc"        "kilobytes" "q25"       "q75"       "max"      
#> [17] "mean"      "sd"
atime.ver.list$measurements[, .(N, expr.name, min, median, max, kilobytes)]
#>         N     expr.name       min     median       max   kilobytes
#>     <num>        <char>     <num>      <num>     <num>       <num>
#>  1:     4            cv 0.0002704 0.00028050 0.0003828  9287.15625
#>  2:     4  rm unord map 0.0011429 0.00121880 0.0013492  2676.03906
#>  3:     4 mvl_construct 0.0007904 0.00088055 0.0046422   278.95312
#>  4:     8            cv 0.0002772 0.00028570 0.0003860    18.74219
#>  5:     8  rm unord map 0.0011047 0.00117040 0.0056885    70.00781
#>  6:     8 mvl_construct 0.0007495 0.00078750 0.0010035    67.48438
#>  7:    16            cv 0.0002791 0.00028660 0.0003750    18.92188
#>  8:    16  rm unord map 0.0011254 0.00117590 0.0014362    70.18750
#>  9:    16 mvl_construct 0.0007927 0.00082335 0.0010313    67.66406
#> 10:    32            cv 0.0002924 0.00030190 0.0003840    19.64062
#> 11:    32  rm unord map 0.0011198 0.00114925 0.0038412    72.82031
#> 12:    32 mvl_construct 0.0008042 0.00080810 0.0010458    70.29688
#> 13:    64            cv 0.0002989 0.00030825 0.0003990    25.38281
#> 14:    64  rm unord map 0.0011732 0.00126680 0.0020850    88.72656
#> 15:    64 mvl_construct 0.0008978 0.00093855 0.0035149    86.20312
#> 16:   128            cv 0.0003013 0.00033235 0.0004542    43.35156
#> 17:   128  rm unord map 0.0011266 0.00114870 0.0013636   120.67969
#> 18:   128 mvl_construct 0.0010265 0.00109995 0.0012513   117.26562
#> 19:   256            cv 0.0003379 0.00036305 0.0004166    64.85156
#> 20:   256  rm unord map 0.0011712 0.00122605 0.0015247   165.67969
#> 21:   256 mvl_construct 0.0013660 0.00148645 0.0017355   161.51562
#> 22:   512            cv 0.0004085 0.00043350 0.0007886   107.85156
#> 23:   512  rm unord map 0.0013379 0.00138370 0.0015692   255.67969
#> 24:   512 mvl_construct 0.0020461 0.00213430 0.0023014   250.01562
#> 25:  1024            cv 0.0004806 0.00052625 0.0007722   193.85156
#> 26:  1024  rm unord map 0.0015276 0.00159040 0.0048540   435.67969
#> 27:  1024 mvl_construct 0.0034157 0.00346345 0.0041241   427.01562
#> 28:  2048            cv 0.0007000 0.00074945 0.0008686   365.85156
#> 29:  2048  rm unord map 0.0020499 0.00221235 0.0024546   795.67969
#> 30:  2048 mvl_construct 0.0064843 0.00654775 0.0068192   781.01562
#> 31:  4096            cv 0.0011759 0.00121870 0.0013393   709.85156
#> 32:  4096  rm unord map 0.0037065 0.00405860 0.0055938  1515.67969
#> 33:  4096 mvl_construct 0.0138652 0.01413085 0.0169543  1489.01562
#> 34:  8192            cv 0.0020957 0.00246470 0.0026336  1397.85156
#> 35:  8192  rm unord map 0.0064265 0.00695385 0.0131263  2955.67969
#> 36: 16384            cv 0.0047889 0.00500260 0.0108923  2773.85156
#> 37: 16384  rm unord map 0.0113848 0.01187885 0.0173260  5835.67969
#> 38: 32768            cv 0.0096927 0.00994195 0.0148679  5525.85156
#> 39: 65536            cv 0.0178206 0.02179605 0.0313384 11032.03125
#>         N     expr.name       min     median       max   kilobytes

The result is a list with a measurements data table that contains measurements of time in seconds (min, median, max) and memory usage (kilobytes) for every version (expr.name) and data size (N). A more convenient version of the data for plotting can be obtained via the code below:

best.ver.list <- atime::references_best(atime.ver.list)
names(best.ver.list$measurements)
#>  [1] "unit"       "N"          "expr.name"  "min"        "median"     "itr/sec"    "gc/sec"    
#>  [8] "n_itr"      "n_gc"       "result"     "memory"     "time"       "gc"         "kilobytes" 
#> [15] "q25"        "q75"        "max"        "mean"       "sd"         "fun.name"   "fun.latex" 
#> [22] "expr.class" "expr.latex" "empirical"
best.ver.list$measurements[, .(N, expr.name, unit, empirical)]
#>         N     expr.name      unit    empirical
#>     <num>        <char>    <char>        <num>
#>  1:     4            cv kilobytes 9.287156e+03
#>  2:     8            cv kilobytes 1.874219e+01
#>  3:    16            cv kilobytes 1.892188e+01
#>  4:    32            cv kilobytes 1.964063e+01
#>  5:    64            cv kilobytes 2.538281e+01
#>  6:   128            cv kilobytes 4.335156e+01
#>  7:   256            cv kilobytes 6.485156e+01
#>  8:   512            cv kilobytes 1.078516e+02
#>  9:  1024            cv kilobytes 1.938516e+02
#> 10:  2048            cv kilobytes 3.658516e+02
#> 11:  4096            cv kilobytes 7.098516e+02
#> 12:  8192            cv kilobytes 1.397852e+03
#> 13: 16384            cv kilobytes 2.773852e+03
#> 14: 32768            cv kilobytes 5.525852e+03
#> 15: 65536            cv kilobytes 1.103203e+04
#> 16:     4  rm unord map kilobytes 2.676039e+03
#> 17:     8  rm unord map kilobytes 7.000781e+01
#> 18:    16  rm unord map kilobytes 7.018750e+01
#> 19:    32  rm unord map kilobytes 7.282031e+01
#> 20:    64  rm unord map kilobytes 8.872656e+01
#> 21:   128  rm unord map kilobytes 1.206797e+02
#> 22:   256  rm unord map kilobytes 1.656797e+02
#> 23:   512  rm unord map kilobytes 2.556797e+02
#> 24:  1024  rm unord map kilobytes 4.356797e+02
#> 25:  2048  rm unord map kilobytes 7.956797e+02
#> 26:  4096  rm unord map kilobytes 1.515680e+03
#> 27:  8192  rm unord map kilobytes 2.955680e+03
#> 28: 16384  rm unord map kilobytes 5.835680e+03
#> 29:     4 mvl_construct kilobytes 2.789531e+02
#> 30:     8 mvl_construct kilobytes 6.748437e+01
#> 31:    16 mvl_construct kilobytes 6.766406e+01
#> 32:    32 mvl_construct kilobytes 7.029687e+01
#> 33:    64 mvl_construct kilobytes 8.620313e+01
#> 34:   128 mvl_construct kilobytes 1.172656e+02
#> 35:   256 mvl_construct kilobytes 1.615156e+02
#> 36:   512 mvl_construct kilobytes 2.500156e+02
#> 37:  1024 mvl_construct kilobytes 4.270156e+02
#> 38:  2048 mvl_construct kilobytes 7.810156e+02
#> 39:  4096 mvl_construct kilobytes 1.489016e+03
#> 40:     4            cv   seconds 2.805000e-04
#> 41:     8            cv   seconds 2.857000e-04
#> 42:    16            cv   seconds 2.866000e-04
#> 43:    32            cv   seconds 3.019000e-04
#> 44:    64            cv   seconds 3.082500e-04
#> 45:   128            cv   seconds 3.323500e-04
#> 46:   256            cv   seconds 3.630500e-04
#> 47:   512            cv   seconds 4.335000e-04
#> 48:  1024            cv   seconds 5.262500e-04
#> 49:  2048            cv   seconds 7.494500e-04
#> 50:  4096            cv   seconds 1.218700e-03
#> 51:  8192            cv   seconds 2.464700e-03
#> 52: 16384            cv   seconds 5.002600e-03
#> 53: 32768            cv   seconds 9.941950e-03
#> 54: 65536            cv   seconds 2.179605e-02
#> 55:     4  rm unord map   seconds 1.218800e-03
#> 56:     8  rm unord map   seconds 1.170400e-03
#> 57:    16  rm unord map   seconds 1.175900e-03
#> 58:    32  rm unord map   seconds 1.149250e-03
#> 59:    64  rm unord map   seconds 1.266800e-03
#> 60:   128  rm unord map   seconds 1.148700e-03
#> 61:   256  rm unord map   seconds 1.226050e-03
#> 62:   512  rm unord map   seconds 1.383700e-03
#> 63:  1024  rm unord map   seconds 1.590400e-03
#> 64:  2048  rm unord map   seconds 2.212350e-03
#> 65:  4096  rm unord map   seconds 4.058600e-03
#> 66:  8192  rm unord map   seconds 6.953850e-03
#> 67: 16384  rm unord map   seconds 1.187885e-02
#> 68:     4 mvl_construct   seconds 8.805500e-04
#> 69:     8 mvl_construct   seconds 7.875000e-04
#> 70:    16 mvl_construct   seconds 8.233500e-04
#> 71:    32 mvl_construct   seconds 8.081000e-04
#> 72:    64 mvl_construct   seconds 9.385500e-04
#> 73:   128 mvl_construct   seconds 1.099950e-03
#> 74:   256 mvl_construct   seconds 1.486450e-03
#> 75:   512 mvl_construct   seconds 2.134300e-03
#> 76:  1024 mvl_construct   seconds 3.463450e-03
#> 77:  2048 mvl_construct   seconds 6.547750e-03
#> 78:  4096 mvl_construct   seconds 1.413085e-02
#>         N     expr.name      unit    empirical

The data table above is a tall/long version of the same data, which can be plotted using the code below:

if(require(ggplot2)){
  hline.df <- with(atime.ver.list, data.frame(seconds.limit, unit="seconds"))
  gg <- ggplot()+
    theme_bw()+
    facet_grid(unit ~ ., scales="free")+
    geom_hline(aes(
      yintercept=seconds.limit),
      color="grey",
      data=hline.df)+
    geom_line(aes(
      N, empirical, color=expr.name),
      data=best.ver.list$measurements)+
    geom_ribbon(aes(
      N, ymin=min, ymax=max, fill=expr.name),
      data=best.ver.list$measurements[unit=="seconds"],
      alpha=0.5)+
    scale_x_log10()+
    scale_y_log10("median line, min/max band")
  if(require(directlabels)){
    gg+
      directlabels::geom_dl(aes(
        N, empirical, color=expr.name, label=expr.name),
        method="right.polygons",
        data=best.ver.list$measurements)+
      theme(legend.position="none")+
      coord_cartesian(xlim=c(1,2e7))
  }else{
    gg
  }
}

Figure: empirical kilobytes and seconds as a function of N for each version, on log-log axes (median line, min/max band).
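The tall layout used for plotting stacks the time and memory columns into a single empirical column, with a unit column saying which quantity each row holds. The sketch below shows that reshaping in base R for the three N=4096 rows printed earlier. It illustrates the layout only and is not how references_best builds its table; the names wide and tall are hypothetical.

```r
# Wide layout: one row per version, with separate columns for the
# median time and the memory (values copied from the N=4096 rows).
wide <- data.frame(
  N=c(4096, 4096, 4096),
  expr.name=c("cv", "rm unord map", "mvl_construct"),
  median=c(0.00121870, 0.00405860, 0.01413085),
  kilobytes=c(709.85156, 1515.67969, 1489.01562))

# Tall layout: one row per version and unit.
id.cols <- wide[c("N", "expr.name")]
tall <- rbind(
  data.frame(id.cols, unit="kilobytes", empirical=wide$kilobytes),
  data.frame(id.cols, unit="seconds", empirical=wide$median))
```

With one row per version and unit, a single facet_grid(unit ~ .) call can draw both quantities without separate plotting code for each.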

Advanced usage, `atime_versions_exprs` with `atime`

What if you want to compare different versions of one R package with another R package? Continuing the example above, we can get a list of expressions, each one for a different version of the package, via the code below:

(ver.list <- atime::atime_versions_exprs(
  pkg.path=pkg.path,
  expr=binsegRcpp::binseg_normal(data.vec, max.segs),
  cv="908b77c411bc7f4fcbcf53759245e738ae724c3e",
  "rm unord map"="dcd0808f52b0b9858352106cc7852e36d7f5b15d",
  "mvl_construct"="5942af606641428315b0e63c7da331c4cd44c091"))
#> $cv
#> binsegRcpp.908b77c411bc7f4fcbcf53759245e738ae724c3e::binseg_normal(data.vec, 
#>     max.segs)
#> 
#> $`rm unord map`
#> binsegRcpp.dcd0808f52b0b9858352106cc7852e36d7f5b15d::binseg_normal(data.vec, 
#>     max.segs)
#> 
#> $mvl_construct
#> binsegRcpp.5942af606641428315b0e63c7da331c4cd44c091::binseg_normal(data.vec, 
#>     max.segs)

The ver.list created above can be augmented with other expressions, such as the following alternative implementation of binary segmentation from the changepoint package,

expr.list <- c(ver.list, if(requireNamespace("changepoint")){
  list(changepoint=substitute(changepoint::cpt.mean(
    data.vec, penalty="Manual", pen.value=0, method="BinSeg",
    Q=max.segs-1)))
})
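Each element of such a list is an unevaluated call. When called at top level, substitute (like quote) returns the call without running it. The sketch below uses base R only to show how a list of unevaluated expressions can be evaluated after a setup step has defined the data for a given N. The expressions and the names toy.exprs, env and toy.results are hypothetical, and this is not a description of atime internals.

```r
# Two unevaluated expressions which compute the same quantity.
toy.exprs <- list(
  vectorized=quote(sum(data.vec)),
  loop=quote({
    total <- 0
    for(x in data.vec) total <- total + x
    total
  }))

# Setup step: define the data for one size N in its own environment.
env <- new.env()
env$N <- 8
evalq({
  data.vec <- 1:N
}, env)

# Evaluate every expression against the same data.
toy.results <- lapply(toy.exprs, eval, envir=env)
```

Both elements of toy.results equal 36, the sum of 1:8. Keeping the expressions unevaluated is what allows the same list to be run again for each data size.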

The expr.list created above can be provided as an argument to the atime function as in the code below,

run.atime <- function(ELIST, LIB.PATH){
  if(!missing(LIB.PATH)).libPaths(LIB.PATH)
  atime::atime(
    N=2^seq(2, 20),
    setup={
      max.segs <- as.integer(N/2)
      data.vec <- 1:N
    },
    expr.list=ELIST)
}
atime.list <- if(requireNamespace("callr")){
  requireNamespace("atime")
  callr::r(run.atime, list(expr.list, lib.path.vec))
}else{
  run.atime(expr.list)
}

Again note in the code above that we defined a helper function, run.atime, and used callr::r, to avoid CRAN issues. For a more typical usage see example(atime_versions_exprs, package="atime").

atime.list$measurements[, .(N, expr.name, median, kilobytes)]
#>         N     expr.name     median    kilobytes
#>     <num>        <char>      <num>        <num>
#>  1:     4            cv 0.00027335  9319.429688
#>  2:     4  rm unord map 0.00113670  2783.750000
#>  3:     4 mvl_construct 0.00080920   278.765625
#>  4:     4   changepoint 0.00038040  6463.679688
#>  5:     8            cv 0.00053845    18.742188
#>  6:     8  rm unord map 0.00116285    70.007812
#>  7:     8 mvl_construct 0.00088920    67.484375
#>  8:     8   changepoint 0.00044910    33.375000
#>  9:    16            cv 0.00031435    18.921875
#> 10:    16  rm unord map 0.00109740    70.187500
#> 11:    16 mvl_construct 0.00079900    67.664062
#> 12:    16   changepoint 0.00054340     3.523438
#> 13:    32            cv 0.00033250    19.640625
#> 14:    32  rm unord map 0.00120460    72.820312
#> 15:    32 mvl_construct 0.00091615    70.296875
#> 16:    32   changepoint 0.00085755    10.750000
#> 17:    64            cv 0.00029965    25.382812
#> 18:    64  rm unord map 0.00193415    88.726562
#> 19:    64 mvl_construct 0.00091755    86.203125
#> 20:    64   changepoint 0.00074250    52.734375
#> 21:   128            cv 0.00058440    43.351562
#> 22:   128  rm unord map 0.00127300   120.679688
#> 23:   128 mvl_construct 0.00122245   117.265625
#> 24:   128   changepoint 0.00113980   195.000000
#> 25:   256            cv 0.00034975    64.851562
#> 26:   256  rm unord map 0.00122055   165.679688
#> 27:   256 mvl_construct 0.00141035   161.515625
#> 28:   256   changepoint 0.00218125   704.359375
#> 29:   512            cv 0.00040850   107.851562
#> 30:   512  rm unord map 0.00135290   255.679688
#> 31:   512 mvl_construct 0.00225600   250.015625
#> 32:   512   changepoint 0.00668260  2633.656250
#> 33:  1024            cv 0.00051640   193.851562
#> 34:  1024  rm unord map 0.00164260   435.679688
#> 35:  1024 mvl_construct 0.00364840   429.195312
#> 36:  1024   changepoint 0.03029920 10144.984375
#> 37:  2048            cv 0.00070935   365.851562
#> 38:  2048  rm unord map 0.00211185   795.679688
#> 39:  2048 mvl_construct 0.00668765   781.015625
#> 40:  4096            cv 0.00134405   709.851562
#> 41:  4096  rm unord map 0.00402655  1515.679688
#> 42:  4096 mvl_construct 0.01358965  1489.015625
#> 43:  8192            cv 0.00266665  1397.851562
#> 44:  8192  rm unord map 0.00686065  2955.679688
#> 45: 16384            cv 0.00522300  2773.851562
#> 46: 16384  rm unord map 0.01224850  5835.679688
#> 47: 32768            cv 0.01022960  5533.359375
#>         N     expr.name     median    kilobytes

The results above show that timings were computed for the three different versions of the binsegRcpp code, along with the changepoint code. These data can be plotted via the default method as in the code below,

refs.best <- atime::references_best(atime.list)
plot(refs.best)

Figure: output of plot(refs.best).
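A rough way to read asymptotic behavior from such measurements is the slope of log(time) against log(N) over the largest sizes. The sketch below does this in base R for the last three median timings of cv and changepoint from the table above. It is a simple illustration and not the method used by references_best; loglog_slope is a hypothetical helper.

```r
# Median timings (seconds) copied from the last three sizes measured
# for cv and for changepoint in the table above.
meas <- data.frame(
  expr.name=rep(c("cv", "changepoint"), each=3),
  N=c(8192, 16384, 32768, 256, 512, 1024),
  median=c(0.00266665, 0.00522300, 0.01022960,
           0.00218125, 0.00668260, 0.03029920))

# Hypothetical helper: least squares slope of log(median) on log(N).
loglog_slope <- function(d){
  fit <- lm(log(median) ~ log(N), data=d)
  unname(coef(fit)[2])
}

# One slope per expression, named by expr.name.
slopes <- sapply(split(meas, meas$expr.name), loglog_slope)
```

With these values the slope is close to 1 for cv and close to 2 for changepoint, which suggests roughly linear versus roughly quadratic growth over these sizes. Three points per expression are too few for a firm conclusion.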

Cleanup

Below we remove the installed packages, in order to avoid CRAN warnings:

atime::atime_versions_remove("binsegRcpp")
#> [1] 0
options(old.opt)
