In this vignette we compare two different implementations of the cumulative median. The cumstats package provides a naive method, which uses the standard median function in a for loop. Each call to the standard median function is log-linear, so the total expected complexity is log-quadratic. The binsegRcpp package provides a different implementation that uses a log-linear algorithm, previously described in the 2017 NeurIPS research paper Maximum Margin Interval Trees by Alexandre Drouin, Toby Hocking, Francois Laviolette.
atime.list <- atime::atime(
N=2^seq(1, 20),
setup={
set.seed(1)
data.vec <- rnorm(N)
},
"cumstats::cummedian"={
cumstats::cummedian(data.vec)
},
"binsegRcpp::cum_median"={
binsegRcpp::cum_median(data.vec)
},
times=5)
plot(atime.list)
(best.list <- atime::references_best(atime.list))
#> 46 measurements with 268 references, best fit complexity: cumstats::cummedian (N^2 kilobytes, N log N seconds), binsegRcpp::cum_median (N kilobytes, N log N seconds)
(ref.dt <- best.list$ref[each.sign.rank==1])
#> unit expr.name fun.latex fun.name N empirical
#> <char> <char> <char> <char> <num> <num>
#> 1: kilobytes cumstats::cummedian N \\log N N log N 4 0.0000000
#> 2: kilobytes cumstats::cummedian N \\log N N log N 8 0.0000000
#> 3: kilobytes cumstats::cummedian N \\log N N log N 16 0.0000000
#> 4: kilobytes cumstats::cummedian N \\log N N log N 32 8.5156250
#> 5: kilobytes cumstats::cummedian N \\log N N log N 64 59.9843750
#> 6: kilobytes cumstats::cummedian N \\log N N log N 128 246.2343750
#> 7: kilobytes cumstats::cummedian N \\log N N log N 256 954.7343750
#> 8: kilobytes cumstats::cummedian N^2 N^2 16 0.0000000
#> 9: kilobytes cumstats::cummedian N^2 N^2 32 8.5156250
#> 10: kilobytes cumstats::cummedian N^2 N^2 64 59.9843750
#> 11: kilobytes cumstats::cummedian N^2 N^2 128 246.2343750
#> 12: kilobytes cumstats::cummedian N^2 N^2 256 954.7343750
#> 13: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 2 7.9531250
#> 14: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 4 2.4921875
#> 15: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 8 2.4921875
#> 16: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 16 2.4921875
#> 17: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 32 3.0859375
#> 18: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 64 3.5859375
#> 19: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 128 4.5859375
#> 20: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 256 6.5859375
#> 21: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 512 10.5859375
#> 22: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 1024 18.5859375
#> 23: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 2048 34.5859375
#> 24: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 4096 66.5859375
#> 25: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 8192 130.5859375
#> 26: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 16384 258.5859375
#> 27: kilobytes binsegRcpp::cum_median \\sqrt N sqrt N 32768 514.5859375
#> 28: kilobytes binsegRcpp::cum_median N N 256 6.5859375
#> 29: kilobytes binsegRcpp::cum_median N N 512 10.5859375
#> 30: kilobytes binsegRcpp::cum_median N N 1024 18.5859375
#> 31: kilobytes binsegRcpp::cum_median N N 2048 34.5859375
#> 32: kilobytes binsegRcpp::cum_median N N 4096 66.5859375
#> 33: kilobytes binsegRcpp::cum_median N N 8192 130.5859375
#> 34: kilobytes binsegRcpp::cum_median N N 16384 258.5859375
#> 35: kilobytes binsegRcpp::cum_median N N 32768 514.5859375
#> 36: seconds cumstats::cummedian N \\log N N log N 2 0.0001477
#> 37: seconds cumstats::cummedian N \\log N N log N 4 0.0001353
#> 38: seconds cumstats::cummedian N \\log N N log N 8 0.0002389
#> 39: seconds cumstats::cummedian N \\log N N log N 16 0.0004561
#> 40: seconds cumstats::cummedian N \\log N N log N 32 0.0009442
#> 41: seconds cumstats::cummedian N \\log N N log N 64 0.0022225
#> 42: seconds cumstats::cummedian N \\log N N log N 128 0.0039626
#> 43: seconds cumstats::cummedian N \\log N N log N 256 0.0102345
#> 44: seconds cumstats::cummedian N^2 N^2 8 0.0002389
#> 45: seconds cumstats::cummedian N^2 N^2 16 0.0004561
#> 46: seconds cumstats::cummedian N^2 N^2 32 0.0009442
#> 47: seconds cumstats::cummedian N^2 N^2 64 0.0022225
#> 48: seconds cumstats::cummedian N^2 N^2 128 0.0039626
#> 49: seconds cumstats::cummedian N^2 N^2 256 0.0102345
#> 50: seconds binsegRcpp::cum_median N \\log N N log N 64 0.0000237
#> 51: seconds binsegRcpp::cum_median N \\log N N log N 128 0.0000392
#> 52: seconds binsegRcpp::cum_median N \\log N N log N 256 0.0000782
#> 53: seconds binsegRcpp::cum_median N \\log N N log N 512 0.0001574
#> 54: seconds binsegRcpp::cum_median N \\log N N log N 1024 0.0003271
#> 55: seconds binsegRcpp::cum_median N \\log N N log N 2048 0.0006553
#> 56: seconds binsegRcpp::cum_median N \\log N N log N 4096 0.0013828
#> 57: seconds binsegRcpp::cum_median N \\log N N log N 8192 0.0029953
#> 58: seconds binsegRcpp::cum_median N \\log N N log N 16384 0.0065230
#> 59: seconds binsegRcpp::cum_median N \\log N N log N 32768 0.0145061
#> 60: seconds binsegRcpp::cum_median N^2 N^2 1024 0.0003271
#> 61: seconds binsegRcpp::cum_median N^2 N^2 2048 0.0006553
#> 62: seconds binsegRcpp::cum_median N^2 N^2 4096 0.0013828
#> 63: seconds binsegRcpp::cum_median N^2 N^2 8192 0.0029953
#> 64: seconds binsegRcpp::cum_median N^2 N^2 16384 0.0065230
#> 65: seconds binsegRcpp::cum_median N^2 N^2 32768 0.0145061
#> unit expr.name fun.latex fun.name N empirical
#> reference rank i.N i.empirical i.reference i.rank dist sign
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 3.729431e+00 7 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 2: 1.118829e+01 6 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 3: 2.983545e+01 5 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 4: 7.458862e+01 4 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 5: 1.790127e+02 3 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 6: 4.176963e+02 2 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 7: 9.547344e+02 1 128 246.2343750 4.176963e+02 2 -0.229511935 -1
#> 8: 3.729431e+00 5 128 246.2343750 2.386836e+02 2 0.013526113 1
#> 9: 1.491772e+01 4 128 246.2343750 2.386836e+02 2 0.013526113 1
#> 10: 5.967090e+01 3 128 246.2343750 2.386836e+02 2 0.013526113 1
#> 11: 2.386836e+02 2 128 246.2343750 2.386836e+02 2 0.013526113 1
#> 12: 9.547344e+02 1 128 246.2343750 2.386836e+02 2 0.013526113 1
#> 13: 4.020203e+00 15 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 14: 5.685425e+00 14 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 15: 8.040405e+00 13 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 16: 1.137085e+01 12 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 17: 1.608081e+01 11 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 18: 2.274170e+01 10 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 19: 3.216162e+01 9 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 20: 4.548340e+01 8 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 21: 6.432324e+01 7 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 22: 9.096680e+01 6 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 23: 1.286465e+02 5 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 24: 1.819336e+02 4 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 25: 2.572930e+02 3 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 26: 3.638672e+02 2 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 27: 5.145859e+02 1 16384 258.5859375 3.638672e+02 2 -0.148338013 -1
#> 28: 4.020203e+00 8 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 29: 8.040405e+00 7 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 30: 1.608081e+01 6 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 31: 3.216162e+01 5 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 32: 6.432324e+01 4 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 33: 1.286465e+02 3 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 34: 2.572930e+02 2 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 35: 5.145859e+02 1 16384 258.5859375 2.572930e+02 2 0.002176985 1
#> 36: 9.994629e-06 8 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 37: 3.997852e-05 7 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 38: 1.199355e-04 6 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 39: 3.198281e-04 5 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 40: 7.995703e-04 4 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 41: 1.918969e-03 3 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 42: 4.477594e-03 2 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 43: 1.023450e-02 1 128 0.0039626 4.477594e-03 2 -0.053064452 -1
#> 44: 9.994629e-06 6 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 45: 3.997852e-05 5 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 46: 1.599141e-04 4 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 47: 6.396563e-04 3 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 48: 2.558625e-03 2 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 49: 1.023450e-02 1 128 0.0039626 2.558625e-03 2 0.189973596 1
#> 50: 1.133289e-05 10 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 51: 2.644341e-05 9 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 52: 6.044208e-05 8 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 53: 1.359947e-04 7 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 54: 3.022104e-04 6 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 55: 6.648629e-04 5 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 56: 1.450610e-03 4 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 57: 3.142988e-03 3 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 58: 6.769513e-03 2 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 59: 1.450610e-02 1 16384 0.0065230 6.769513e-03 2 -0.016110069 -1
#> 60: 1.416611e-05 6 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> 61: 5.666445e-05 5 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> 62: 2.266578e-04 4 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> 63: 9.066312e-04 3 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> 64: 3.626525e-03 2 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> 65: 1.450610e-02 1 16384 0.0065230 3.626525e-03 2 0.254956703 1
#> reference rank i.N i.empirical i.reference i.rank dist sign
#> overall.rank each.sign.rank
#> <num> <num>
#> 1: 3 1
#> 2: 3 1
#> 3: 3 1
#> 4: 3 1
#> 5: 3 1
#> 6: 3 1
#> 7: 3 1
#> 8: 1 1
#> 9: 1 1
#> 10: 1 1
#> 11: 1 1
#> 12: 1 1
#> 13: 3 1
#> 14: 3 1
#> 15: 3 1
#> 16: 3 1
#> 17: 3 1
#> 18: 3 1
#> 19: 3 1
#> 20: 3 1
#> 21: 3 1
#> 22: 3 1
#> 23: 3 1
#> 24: 3 1
#> 25: 3 1
#> 26: 3 1
#> 27: 3 1
#> 28: 1 1
#> 29: 1 1
#> 30: 1 1
#> 31: 1 1
#> 32: 1 1
#> 33: 1 1
#> 34: 1 1
#> 35: 1 1
#> 36: 1 1
#> 37: 1 1
#> 38: 1 1
#> 39: 1 1
#> 40: 1 1
#> 41: 1 1
#> 42: 1 1
#> 43: 1 1
#> 44: 3 1
#> 45: 3 1
#> 46: 3 1
#> 47: 3 1
#> 48: 3 1
#> 49: 3 1
#> 50: 1 1
#> 51: 1 1
#> 52: 1 1
#> 53: 1 1
#> 54: 1 1
#> 55: 1 1
#> 56: 1 1
#> 57: 1 1
#> 58: 1 1
#> 59: 1 1
#> 60: 4 1
#> 61: 4 1
#> 62: 4 1
#> 63: 4 1
#> 64: 4 1
#> 65: 4 1
#> overall.rank each.sign.rank
library(data.table)
## try() to avoid CRAN error 'from' must be a finite number, on
## https://www.stats.ox.ac.uk/pub/bdr/Rblas/README.txt, due to
## https://github.com/r-lib/scales/issues/307
try(if(require(ggplot2)){
hline.df <- with(atime.list, data.frame(seconds.limit, unit="seconds"))
gg <- ggplot()+
theme_bw()+
facet_grid(unit ~ ., scales="free")+
geom_line(aes(
N, reference, group=paste(expr.name, fun.name)),
color="grey",
data=ref.dt)+
geom_hline(aes(
yintercept=seconds.limit),
color="grey",
data=hline.df)+
geom_line(aes(
N, empirical, color=expr.name),
data=best.list$measurements)+
geom_ribbon(aes(
N, ymin=min, ymax=max, fill=expr.name),
data=best.list$measurements[unit=="seconds"],
alpha=0.5)+
scale_x_log10()+
scale_y_log10("median line, min/max band")+
coord_cartesian(xlim=c(1,1e7))
if(require("directlabels")){
gg+
theme(legend.position="none")+
directlabels::geom_dl(aes(
N, empirical, color=expr.name, label=expr.name),
data=best.list$meas,
method="last.polygons")+
directlabels::geom_dl(aes(
N, reference, label.group=paste(expr.name, fun.name), label=fun.name),
data=ref.dt,
color="grey",
method="bottom.polygons")
}else{
gg
}
})
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
The plots above show significant speed differences between the two implementations of cumulative median. It is clear from the different asymptotic slopes that the empirical timings are consistent with the expected complexity, O(N log N) for binsegRcpp, and O(N2 log N) for cumstats. The code below shows that the results of the two implementations are identical.
result.wide <- dcast(atime.list$meas, N ~ expr.name, value.var="result")
result.complete <- result.wide[sapply(`cumstats::cummedian`, length) == N]
result.complete[, identical(`binsegRcpp::cum_median`,`cumstats::cummedian`)]
#> [1] TRUE
Conclusion: binsegRcpp::cum_median
should be the preferred method
for computing the cumulative median because it is much faster.