tidyfast v0.2.1
The goal of tidyfast
is to provide fast and efficient
alternatives to some tidyr
(and a few dplyr
)
functions using data.table
under the hood. Each have the
prefix of dt_
to allow for autocomplete in IDEs such as
RStudio. These should compliment some of the current functionality in
dtplyr
(but notably does not use the lazy_dt()
framework of dtplyr
). This package imports
data.table
and Rcpp
(no other
dependencies).
These are, in essence, translations from a more
tidyverse
grammar to data.table
. Most
functions herein are in places where, in my opinion, the
data.table
syntax is not obvious or clear. As such, these
functions can translate a simple function call into the fast, efficient,
and concise syntax of data.table
.
The current functions include:
Nesting and unnesting (similar to
dplyr::group_nest()
and tidyr::unnest()
):
dt_nest()
for nesting data tablesdt_unnest()
for unnesting data tablesdt_hoist()
for unnesting vectors in a list-column in a
data tablePivoting (similar to
tidyr::pivot_longer()
and
tidyr::pivot_wider()
)
dt_pivot_longer()
for fast pivoting using
data.table::melt()
dt_pivot_wider()
for fast pivoting using
data.table::dcast()
If Else (similar to
dplyr::case_when()
):
dt_case_when()
for dplyr::case_when()
syntax with the speed of data.table::fifelse()
Fill (similar to tidyr::fill()
)
dt_fill()
for filling NA
values with
values before it, after it, or both. This can be done by a grouping
variable (e.g. fill in NA
values with values within an
individual).Count and Uncount (similar to
tidyr::uncount()
and dplyr::count()
)
dt_count()
for fast counting by group(s)dt_uncount()
for creating full data from a count
tableSeparate (similar to
tidyr::separate()
)
dt_separate()
for splitting a single column into
multiple based on a match within the column (e.g., column with values
like “A.B” could be split into two columns by using the period as the
separator where column 1 would have “A” and 2 would have “B”). It is
built on data.table::tstrsplit()
. This is not well tested
yet and lacks some functionality of tidyr::separate()
.Adjust data.table
print options
dt_print_options()
for adjusting the options for
print.data.table()
tidyfast
attempts to convert syntax from
tidyr
with its accompanying grammar to
data.table
function calls. As such, we have tried to
maintain the tidyr
syntax as closely as possible without
hurting speed and efficiency. Some more advanced use cases in
tidyr
may not translate yet. We try to be transparent about
the shortcomings in syntax and behavior where known.
Each function that takes data (labeled as dt_
in the
package docs) as its first argument automatically coerces it to a data
table with as.data.table()
if it isn’t already a data
table. Each of these functions will return a data table.
You can install the development version from GitHub with:
# install.packages("remotes")
::install_github("TysonStanley/tidyfast") remotes
The initial versions of the nesting and unnesting functions were shown in a preprint. Herein is shown some simple applications and the functions’ speed/efficiency.
The following data table will be used for the nesting/unnesting examples.
set.seed(84322)
library(tidyfast)
library(data.table)
library(dplyr) # to compare with case_when()
library(tidyr) # to compare with fill() and separate()
library(ggplot2) # figures
library(ggbeeswarm) # figures
<- data.table(
dt x = rnorm(1e5),
y = runif(1e5),
grp = sample(1L:5L, 1e5, replace = TRUE),
nested1 = lapply(1:10, sample, 10, replace = TRUE),
nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
id = 1:1e5)
To make all the comparisons herein more equal, we will set the number
of threads that data.table
will use to 1.
setDTthreads(1)
We can nest this data using dt_nest()
:
<- dt_nest(dt, grp)
nested
nested#> grp data
#> 1: 1 <data.table>
#> 2: 2 <data.table>
#> 3: 3 <data.table>
#> 4: 4 <data.table>
#> 5: 5 <data.table>
We can also unnest this with dt_unnest()
:
dt_unnest(nested, col = data)
#> grp x y nested1
#> 1: 1 -1.1813164 0.004599736 2,2,1,2,1,1,...
#> 2: 1 -1.0384420 0.853208540 2,8,4,6,7,7,...
#> 3: 1 -0.6247028 0.072652533 4,2,2,1,1,1,...
#> 4: 1 -1.3651514 0.569079215 1,1,1,3,6,2,...
#> 5: 1 0.1403744 0.864617284 10, 1, 1, 1, 8, 1,...
#> ---
#> 99996: 5 -0.3437795 0.995197776 2,1,2,2,2,1,...
#> 99997: 5 1.6157744 0.241735719 10, 1, 1, 1, 8, 1,...
#> 99998: 5 -0.1321246 0.885283934 2,3,3,2,2,4,...
#> 99999: 5 -1.7019715 0.524621296 5,4,3,3,3,2,...
#> 100000: 5 0.3821493 0.032851280 2,8,4,6,7,7,...
#> nested2 id
#> 1: thing2,thing2,thing2,thing2,thing2,thing2,... 2
#> 2: thing2,thing2,thing2,thing2,thing2,thing2,... 8
#> 3: thing1,thing1,thing1,thing1,thing1,thing1,... 15
#> 4: thing1,thing1,thing1,thing1,thing1,thing1,... 17
#> 5: thing2,thing2,thing2,thing2,thing2,thing2,... 20
#> ---
#> 99996: thing1,thing1,thing1,thing1,thing1,thing1,... 99983
#> 99997: thing2,thing2,thing2,thing2,thing2,thing2,... 99990
#> 99998: thing2,thing2,thing2,thing2,thing2,thing2,... 99994
#> 99999: thing2,thing2,thing2,thing2,thing2,thing2,... 99996
#> 100000: thing2,thing2,thing2,thing2,thing2,thing2,... 99998
When our list columns don’t have data tables (as output from
dt_nest()
) we can use the dt_hoist()
function,
that will unnest vectors. It keeps all the other variables that are not
list-columns as well.
dt_hoist(dt, nested1, nested2)
#> x y grp id nested1 nested2
#> 1: 0.1720703 0.3376675 2 1 1 thing1
#> 2: 0.1720703 0.3376675 2 1 1 thing1
#> 3: 0.1720703 0.3376675 2 1 1 thing1
#> 4: 0.1720703 0.3376675 2 1 1 thing1
#> 5: 0.1720703 0.3376675 2 1 1 thing1
#> ---
#> 999996: 0.6268181 0.7851774 1 100000 1 thing2
#> 999997: 0.6268181 0.7851774 1 100000 5 thing2
#> 999998: 0.6268181 0.7851774 1 100000 7 thing2
#> 999999: 0.6268181 0.7851774 1 100000 6 thing2
#> 1000000: 0.6268181 0.7851774 1 100000 7 thing2
Speed comparisons (similar to those shown in the preprint) are
highlighted below. Notably, the timings are without the
nested1
and nested2
columns of the original
dt
object from above. Also, all dplyr
and
tidyr
functions use a tbl
version of the
dt
table.
#> # A tibble: 2 x 3
#> expression median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 dt_nest 3.43ms 2.88MB
#> 2 group_nest 6.27ms 2.56MB
#> # A tibble: 2 x 3
#> expression median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 dt_unnest 4.92ms 5.48MB
#> 2 unnest 9.84ms 5.77MB
Thanks to [@markfairbanks](https://github.com/markfairbanks), we now have pivoting
translations to data.table::melt()
and
data.table::dcast()
. Consider the following example
(similar to the example in tidyr::pivot_longer()
and
tidyr::pivot_wider()
):
<- tidyr::billboard
billboard
# note the warning - melt is telling us what
# it did with the various data types---logical (where there were just NAs
# and numeric
<- billboard %>%
longer dt_pivot_longer(
cols = c(-artist, -track, -date.entered),
names_to = "week",
values_to = "rank"
)#> Warning in melt.data.table(data = dt_, id.vars = id_vars, measure.vars = cols, :
#> 'measure.vars' [wk1, wk2, wk3, wk4, ...] are not all of the same type. By
#> order of hierarchy, the molten data value column will be of type 'double'. All
#> measure variables not of type 'double' will be coerced too. Check DETAILS in ?
#> melt.data.table for more on coercion.
longer#> artist track date.entered week rank
#> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
#> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81
#> 4: 3 Doors Down Loser 2000-10-21 wk1 76
#> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57
#> ---
#> 24088: Yankee Grey Another Nine Minutes 2000-04-29 wk76 NA
#> 24089: Yearwood, Trisha Real Live Woman 2000-04-01 wk76 NA
#> 24090: Ying Yang Twins Whistle While You Tw... 2000-03-18 wk76 NA
#> 24091: Zombie Nation Kernkraft 400 2000-09-02 wk76 NA
#> 24092: matchbox twenty Bent 2000-04-29 wk76 NA
<- longer %>%
wider dt_pivot_wider(
names_from = week,
values_from = rank
)
wider[, .(artist, track, wk1, wk2)]#> artist track wk1 wk2
#> 1: 2 Pac Baby Don't Cry (Keep... 87 82
#> 2: 2Ge+her The Hardest Part Of ... 91 87
#> 3: 3 Doors Down Kryptonite 81 70
#> 4: 3 Doors Down Loser 76 76
#> 5: 504 Boyz Wobble Wobble 57 34
#> ---
#> 313: Yankee Grey Another Nine Minutes 86 83
#> 314: Yearwood, Trisha Real Live Woman 85 83
#> 315: Ying Yang Twins Whistle While You Tw... 95 94
#> 316: Zombie Nation Kernkraft 400 99 99
#> 317: matchbox twenty Bent 60 37
Notably, there are some current limitations to these: 1)
tidyselect
techniques do not work across the board
(e.g. cannot use start_with()
and friends) and 2) the
functions are new and likely prone to edge-case bugs.
But let’s compare some basic speed and efficiency. Because of the
data.table
functions, these are extremely fast and
efficient.
#> # A tibble: 4 x 3
#> expression median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 dt_pivot_longer 1.58ms 996.21KB
#> 2 pivot_longer 8.35ms 2.67MB
#> 3 dt_pivot_wider 10.47ms 1.86MB
#> 4 pivot_wider 22.29ms 1.96MB
Also, the new dt_case_when()
function is built on the
very fast data.table::fiflese()
but has syntax like unto
dplyr::case_when()
. That is, it looks like:
dt_case_when(condition1 ~ label1,
~ label2,
condition2 ...)
To show that each method, dt_case_when()
,
dplyr::case_when()
, and data.table::fifelse()
produce the same result, consider the following example.
<- rnorm(1e6)
x
<- median(x)
medianx <-
x_cat dt_case_when(x < medianx ~ "low",
>= medianx ~ "high",
x is.na(x) ~ "unknown")
<-
x_cat_dplyr case_when(x < medianx ~ "low",
>= medianx ~ "high",
x is.na(x) ~ "unknown")
<-
x_cat_fif fifelse(x < medianx, "low",
fifelse(x >= medianx, "high",
fifelse(is.na(x), "unknown", NA_character_)))
identical(x_cat, x_cat_dplyr)
#> [1] TRUE
identical(x_cat, x_cat_fif)
#> [1] TRUE
Notably, dt_case_when()
is very fast and memory
efficient, given it is built on data.table::fifelse()
.
#> # A tibble: 3 x 3
#> expression median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 case_when 139.8ms 148.8MB
#> 2 dt_case_when 37.1ms 34.3MB
#> 3 fifelse 35ms 34.3MB
A new function is dt_fill()
, which fulfills the role of
tidyr::fill()
to fill in NA
values with values
around it (either the value above, below, or trying both). This
currently relies on the efficient C++
code from
tidyr
(fillUp()
and
fillDown()
).
= 1:10
x <- data.table(
dt_with_nas x = x,
y = shift(x, 2L),
z = shift(x, -2L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE))
# Original
dt_with_nas#> x y z a id
#> 1: 1 NA 3 NA 3
#> 2: 2 NA 4 9 3
#> 3: 3 1 5 NA 1
#> 4: 4 2 6 8 3
#> 5: 5 3 7 NA 2
#> 6: 6 4 8 NA 2
#> 7: 7 5 9 7 3
#> 8: 8 6 10 NA 2
#> 9: 9 7 NA NA 2
#> 10: 10 8 NA 4 2
# All defaults
dt_fill(dt_with_nas, y, z, a)
#> y z a
#> 1: NA 3 NA
#> 2: NA 4 9
#> 3: 1 5 9
#> 4: 2 6 8
#> 5: 3 7 8
#> 6: 4 8 8
#> 7: 5 9 7
#> 8: 6 10 7
#> 9: 7 10 7
#> 10: 8 10 4
# by id variable called `grp`
dt_fill(dt_with_nas,
y, z, a, id = list(id))
#> id y z a
#> 1: 3 NA 3 NA
#> 2: 3 NA 4 9
#> 3: 3 2 6 8
#> 4: 3 5 9 7
#> 5: 1 1 5 NA
#> 6: 2 3 7 NA
#> 7: 2 4 8 NA
#> 8: 2 6 10 NA
#> 9: 2 7 10 NA
#> 10: 2 8 10 4
# both down and then up filling by group
dt_fill(dt_with_nas,
y, z, a, id = list(id),
.direction = "downup")
#> id y z a
#> 1: 3 2 3 9
#> 2: 3 2 4 9
#> 3: 3 2 6 8
#> 4: 3 5 9 7
#> 5: 1 1 5 NA
#> 6: 2 3 7 4
#> 7: 2 4 8 4
#> 8: 2 6 10 4
#> 9: 2 7 10 4
#> 10: 2 8 10 4
In its current form, dt_fill()
is faster than
tidyr::fill()
and uses slightly less memory. Below are the
results of filling in the NA
s within each id
on a 19 MB data set.
= 1:1e6
x <- data.table(
dt3 x = x,
y = shift(x, 10L),
z = shift(x, -10L),
a = sample(c(rep(NA, 10), x), 10),
id = sample(1:3, 10, replace = TRUE))
<- data.frame(dt3)
df3
<-
marks3 ::mark(
bench::fill(dplyr::group_by(df3, id), x, y),
tidyr::dt_fill(dt3, x, y, id = list(id)),
tidyfastcheck = FALSE,
iterations = 50
)
#> # A tibble: 2 x 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 tidyr::fill(dplyr::group_by(df3, id), x, y) 62.9ms 30.7MB
#> 2 tidyfast::dt_fill(dt3, x, y, id = list(id)) 25.8ms 29.1MB
The dt_separate()
function is still under heavy
development. Its behavior is similar to tidyr::separate()
but is lacking some functionality currently. For example,
into
needs to be supplied the maximum number of possible
columns to separate.
dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))
#> Error in `[.data.table`(dt, , eval(split_it)) :
#> Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.
For current functionality, consider the following example.
<- data.table(
dt_to_split x = paste(letters, LETTERS, sep = ".")
)
dt_separate(dt_to_split, x, into = c("lower", "upper"))
#> lower upper
#> 1: a A
#> 2: b B
#> 3: c C
#> 4: d D
#> 5: e E
#> 6: f F
Testing with a 4 MB data set with one variable that has columns of
“A.B” repeatedly, shows that dt_separate()
is fast but less
memory efficient than tidyr::separate()
.
#> # A tibble: 3 x 3
#> expression median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 separate 319.3ms 11.7MB
#> 2 dt_separate 105.5ms 30.6MB
#> 3 dt_separate-mutable 94.8ms 26.7MB
The dt_count()
function does essentially what
dplyr::count()
does. Notably, this, unlike the majority of
other dt_
functions, wraps a very simple statement in
data.table
. That is, data.table
makes getting
counts very simple and concise. Nonetheless, dt_count()
fits the general API of tidyfast
. To some degree,
dt_uncount()
is also a fairly simple wrapper, although the
approach may not be as straightforward as that for
dt_count()
.
The following examples show how count and uncount can work. We’ll use
the dt
data table from the nesting examples.
<- dt_count(dt, grp)
counted
counted#> grp N
#> 1: 1 19638
#> 2: 2 19987
#> 3: 3 20033
#> 4: 4 20269
#> 5: 5 20073
<- dt_uncount(counted, N)
uncounted
uncounted[]#> grp
#> 1: 1
#> 2: 1
#> 3: 1
#> 4: 1
#> 5: 1
#> ---
#> 99996: 5
#> 99997: 5
#> 99998: 5
#> 99999: 5
#> 100000: 5
These are also quick (not that the tidyverse
functions
were at all slow here).
<- copy(dt)
dt5 <- data.frame(dt5)
df5
<-
marks5 ::mark(
bench<- dplyr::count(df5, grp),
counted_tbl <- tidyfast::dt_count(dt5, grp),
counted_dt ::uncount(counted_tbl, n),
tidyr::dt_uncount(counted_dt, N),
tidyfastcheck = FALSE,
iterations = 25
)
Please note that the tidyfast
project is released with a
Contributor
Code of Conduct. By contributing to this project, you agree to abide
by its terms.
We want to thank our wonderful contributors:
gdt
package that compliments some of tidyfast
s
functionality.Complementary Packages: