An R interface to the Python module Featuretools.
featuretoolsR provides functionality from the Python module featuretools, which aims to automate feature engineering. This package is very much a work in progress, as Featuretools offers a lot of functionality. Any PRs are much appreciated.
The latest stable release is available on CRAN.
You can get the latest version of featuretoolsR by installing it straight from GitHub:
devtools::install_github("magnusfurugard/featuretoolsR")
You’ll need a working Python environment with featuretools installed. The recommended way is to use the built-in function install_featuretools(), which automatically sets up a virtual environment for the package and installs featuretools.
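To confirm that the Python side is reachable from R, a check along these lines may help. This is a minimal sketch: featuretools_ready is a hypothetical helper, not part of featuretoolsR, and it assumes reticulate is bound to the Python environment where featuretools was installed.

```r
# Hypothetical helper (not part of featuretoolsR): returns TRUE when reticulate
# is installed and its Python session can import the featuretools module.
featuretools_ready <- function() {
  requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_module_available("featuretools")
}

if (!featuretools_ready()) {
  message("featuretools not found; consider running featuretoolsR::install_featuretools()")
}
```

If the check returns FALSE after installation, reticulate may be pointing at a different Python environment than the one the installer created.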
All functions in featuretoolsR come with documentation, but it’s advised to briefly browse through the Featuretools Python documentation. It covers concepts like entities, relationships and dfs.
An entityset is the set which contains all your entities. To create a set and add an entity straight away, you can use as_entityset.
# Libs
library(featuretoolsR)
library(magrittr)
# Create some mock data
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE), a = rep(Sys.Date(), 100))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, TRUE), b = rep(Sys.time(), 100))
# Create entityset
es <- as_entityset(
  set_1,
  index = "key",
  entity_id = "set_1",
  id = "demo",
  time_index = "a"
)
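Featuretools expects the index column to identify each row uniquely, so it can be worth validating a data.frame before passing it to as_entityset. The following is a minimal base-R sketch; check_entity_input is a hypothetical helper, not part of featuretoolsR, and the requirement that the time index be a Date or POSIXct column is an assumption about what converts cleanly to Python.

```r
# Hypothetical helper (not part of featuretoolsR): validate a data.frame
# before handing it to as_entityset() or add_entity().
check_entity_input <- function(df, index, time_index = NULL) {
  stopifnot(is.data.frame(df), index %in% names(df))
  # The index should have no missing values and no duplicates.
  index_ok <- !anyNA(df[[index]]) && anyDuplicated(df[[index]]) == 0
  # Assumption: the time index, if given, should be a Date or POSIXct column.
  time_ok <- is.null(time_index) ||
    inherits(df[[time_index]], c("Date", "POSIXct"))
  index_ok && time_ok
}

# Same shape as the mock data above.
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE), a = rep(Sys.Date(), 100))
check_entity_input(set_1, index = "key", time_index = "a")
```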
To add more entities (i.e., if you have relational data across multiple data.frames), use add_entity. This function is pipe friendly. For this demo case, we’ll use set_2.
es <- es %>%
  add_entity(
    df = set_2,
    entity_id = "set_2",
    index = "key",
    time_index = "b"
  )
With relational data, it’s useful to define a relationship between two or more entities. This can be done with add_relationship.
es <- es %>%
  add_relationship(
    parent_set = "set_1",
    child_set = "set_2",
    parent_idx = "key",
    child_idx = "key"
  )
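A parent-child relationship is most useful when every key value in the child has a matching row in the parent. The base-R sketch below checks this on data shaped like the mock sets above; the variable name orphans is illustrative only, and whether unmatched child rows cause an error or are silently ignored is not something this example establishes.

```r
# Mock data mirroring the example above (time columns omitted for brevity).
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, TRUE))

# Child key values that have no matching row in the parent.
orphans <- setdiff(set_2$key, set_1$key)

if (length(orphans) > 0) {
  warning(length(orphans), " child key value(s) have no match in the parent")
}
```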
The bread and butter of Featuretools is the dfs function (deep feature synthesis; see the official Featuretools documentation). It will attempt to create features based on the *_primitives you provide (more on primitives below).
ft_matrix <- es %>%
  dfs(
    target_entity = "set_1",
    trans_primitives = c("and", "cum_sum")
  )
To work with the new data.frame/features created by dfs, use tidy_feature_matrix, a function unique to featuretoolsR. A few “nice-to-have” arguments can be passed to clean the new data, such as removing near-zero-variance variables and replacing NaN with NA.
tidy <- tidy_feature_matrix(ft_matrix, remove_nzv = TRUE, nan_is_na = TRUE, clean_names = TRUE)
Featuretools supports a lot of primitives. They can be listed with the function list_primitives(), which returns a data.frame containing the type (aggregation (agg_primitives) or transform (trans_primitives)), the name (in the example above, “and” and “cum_sum”) and a brief description of each primitive.
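As a rough illustration of browsing the available primitives, the sketch below counts them by type and lists the transform primitives. It assumes the returned data.frame has columns named type and name, as described above, and that transform primitives are labeled "transform"; the block is guarded so that it does nothing when featuretoolsR or the Python module is unavailable.

```r
primitives <- NULL

# Only runs when featuretoolsR and the Python module are both available.
if (requireNamespace("featuretoolsR", quietly = TRUE) &&
    reticulate::py_module_available("featuretools")) {
  primitives <- featuretoolsR::list_primitives()

  # Number of primitives per type.
  print(table(primitives$type))

  # Assumption: transform primitives carry the label "transform".
  print(head(primitives[primitives$type == "transform", "name"]))
}
```

Names found this way can then be passed to the trans_primitives argument of dfs.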
reticulate - an R interface to Python.