We usually create fast-and-frugal trees (FFTs) from data by using the `FFTrees()` function (see the Main guide: FFTrees overview and the vignette on Creating FFTs with FFTrees() for details). However, we occasionally want to design and test a specific FFT (e.g., to check a hypothesis or to use particular variables based on theoretical considerations).
There are two ways to define fast-and-frugal trees manually when using the `FFTrees()` function:

1. as a sentence, using the `my.tree` argument (the easier way), or
2. as a data frame, using the `tree.definitions` argument (the harder way).
Both of these methods require some data to evaluate the performance of FFTs, but bypass the tree construction algorithms built into the FFTrees package. As manually created FFTs are not optimized for the data, the conceptual distinction between fitting and prediction disappears for such FFTs. Although we can still distinguish between two sets of 'train' vs. 'test' data, a manually defined FFT should not be expected to perform systematically better on 'train' data than on 'test' data.
my.tree
The first method is to use the `my.tree` argument, where `my.tree` is a sentence describing a (single) FFT. When this argument is specified in `FFTrees()`, the function (specifically, the auxiliary `fftrees_wordstofftrees()` function) will try to convert the verbal description into the definition of an FFT (of an `FFTrees` object).
For example, let's look at the heartdisease data to find out how some predictor variables (e.g., `sex`, `age`, etc.) predict the criterion variable (`diagnosis`):
| sex | age | thal | cp | ca | diagnosis |
|---|---|---|---|---|---|
| 1 | 63 | fd | ta | 0 | FALSE |
| 1 | 67 | normal | a | 3 | TRUE |
| 1 | 67 | rd | a | 2 | TRUE |
| 1 | 37 | normal | np | 0 | FALSE |
| 0 | 41 | normal | aa | 0 | FALSE |
| 1 | 56 | normal | aa | 0 | FALSE |
Here’s how we could verbally describe an FFT by using the first three cues in conditional sentences:
in_words <- "If sex = 1, predict True.
             If age < 45, predict False.
             If thal = {fd, normal}, predict True.
             Otherwise, predict False."
As we will see shortly, the `FFTrees()` function accepts such a description (assigned here to the character string `in_words`) as its `my.tree` argument, creates a corresponding FFT, and evaluates it on a corresponding dataset.
Here are some instructions for manually specifying trees:
- Each node must start with the word "If" and should have the form: `If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>`.
- Numeric thresholds should be specified directly (without brackets), like `age > 21`.
- For categorical variables, factor thresholds must be specified within curly braces, like `sex = {male}`. For factors with sets of values, categories within a threshold should be separated by commas, like `eyecolor = {blue,brown}`.
- To specify cue directions, standard logical comparisons (`=`, `!=`, `<`, `>=`, etc.) are valid. For numeric cues, only use `>`, `>=`, `<`, or `<=`. For factors, only use `=` or `!=`.
- Positive exits are indicated by `True`, while negative exits are specified by `False`.
- The final node of an FFT is always bi-directional (i.e., has both a positive and a negative exit). The description of the final node always mentions its positive (`True`) exit first. The text `Otherwise, predict <EXIT>` that we have included in the example above is actually not necessary (and is ignored).
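Putting these rules together, a verbal description that combines a numeric and a factor cue could look as follows (the cue names come from the heartdisease data, but the thresholds here are purely illustrative, not a recommended tree):

```r
# An illustrative my.tree description (thresholds chosen for illustration only):
my_sketch <- "If ca > 0, predict True.
              If thal = {normal}, predict True.
              Otherwise, predict False."
```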
Now, let's use our verbal description of an FFT (assigned to `in_words` above) as the `my.tree` argument of the `FFTrees()` function. This creates a corresponding FFT and applies it to the heartdisease data:
# Create FFTrees from a verbal FFT description (as my.tree):
my_fft <- FFTrees(formula = diagnosis ~ .,
                  data = heartdisease,
                  main = "My 1st FFT",
                  my.tree = in_words)
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
Let's see how well our manually constructed FFT (`my_fft`) did:
# Inspect FFTrees object:
plot(my_fft)
Figure 1: An FFT manually constructed using the `my.tree` argument of `FFTrees()`.
When manually constructing a tree, the resulting `FFTrees` object contains only a single FFT. Hence, the ROC plot (in the bottom-right panel of Figure 1) cannot show a range of FFTs, but locates the constructed FFT in ROC space.
As it turns out, the performance of our first FFT created from a verbal description is a mixed affair: The tree has a rather high sensitivity (of 91%), but its low specificity (of only 10%) allows for many false alarms. Consequently, its accuracy measures are only around baseline level.
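These performance values can also be read off programmatically. A quick sketch, relying on the `$trees$stats$train` element of an `FFTrees` object (the same element inspected later in this vignette):

```r
# Inspect key training statistics of the manually defined FFT
# (my_fft was created above; $trees$stats$train holds per-tree measures):
my_fft$trees$stats$train[, c("sens", "spec", "acc", "bacc")]
```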
Let's see if we can come up with a better FFT. The following example uses the cues `thal`, `cp`, and `ca` in the `my.tree` argument:
# Create 2nd FFTrees from an alternative FFT description (as my.tree):
my_fft_2 <- FFTrees(formula = diagnosis ~ .,
                    data = heartdisease,
                    main = "My 2nd FFT",
                    my.tree = "If thal = {rd,fd}, predict True.
                               If cp != {a}, predict False.
                               If ca > 1, predict True.
                               Otherwise, predict False.")
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
# Inspect FFTrees object:
plot(my_fft_2)
Figure 2: Another FFT manually constructed using the `my.tree` argument of `FFTrees()`.
This alternative FFT nicely balances sensitivity and specificity and performs much better overall. Nevertheless, it is still far from perfect, so check whether you can create even better ones!
tree.definitions
More experienced users may want to define and evaluate more than one FFT at a time. To achieve this, the `FFTrees()` function allows providing sets of `tree.definitions` (as a data frame). However, as questions regarding specific trees usually arise late in an exploration of FFTs, the `tree.definitions` argument is mostly used in combination with an existing `FFTrees` object `x`. In this case, the parameters of `x` (e.g., regarding the `formula`, `data`, and goals to be used) are retained, but its tree definitions (stored in `x$trees$definitions`) are replaced by those in `tree.definitions`, and the object is re-evaluated for those FFTs.
We illustrate a typical workflow by redefining some FFTs that were built in the Tutorial: FFTs for heart disease and evaluating them on the (full) heartdisease data.

First, we use our default algorithms to create an `FFTrees` object `x`:
# Create an FFTrees object x:
x <- FFTrees(formula = diagnosis ~ .,                      # criterion and (all) predictors
             data = heart.train,                           # training data
             data.test = heart.test,                       # testing data
             main = "Heart Disease 1",                     # initial label
             decision.labels = c("low risk", "high risk"), # exit labels
             quiet = TRUE)                                 # hide user feedback
As we have seen in the Tutorial, evaluating this expression yields a set of 7 FFTs. Rather than evaluating them individually (by issuing `print(x)` or `plot(x)` commands to inspect specific trees), we can obtain both their definitions and their performance characteristics on a variety of measures either by running `summary(x)` or by inspecting the corresponding parts of the `FFTrees` object. For instance, the following alternatives would both show the current definitions of the generated FFTs:
# Tree definitions of x:
# summary(x)$definitions  # from summary()
x$trees$definitions       # from FFTrees object x
#> # A tibble: 7 × 7
#> tree nodes classes cues directions thresholds exits
#> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1;0;0.5
#> 2 2 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;0;1;0.5
#> 3 3 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;1;0.5
#> 4 4 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;1;0;0.5
#> 5 5 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;0;0.5
#> 6 6 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 0;0;0;0.5
#> 7 7 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;1;1;0.5
Each line in these tree definitions defines an FFT in the context of our current `FFTrees` object `x` (see the vignette on Creating FFTs with FFTrees() for help on interpreting tree definitions). As the "ifan" algorithm responsible for creating these trees yields a family of highly similar FFTs (the FFTs vary only in their exits, and some truncate the last cue), we may want to examine alternative versions of these trees.
To demonstrate how to create and evaluate manual FFT definitions, we copy the existing tree definitions (as a data frame), select three FFTs (rows), and then create a 4th definition (with a different exit structure):
# 0. Copy and choose some existing FFT definitions:
tree_df <- x$trees$definitions    # get FFT definitions (as df)
tree_df <- tree_df[c(1, 3, 5), ]  # filter 3 particular FFTs

# 1. Add a tree with 1;1;0.5 exit structure (a "rake" tree with Signal bias):
tree_df[4, ] <- tree_df[1, ]        # initialize new FFT #4 (as copy of FFT #1)
tree_df$exits[4] <- c("1; 1; 0.5")  # modify exits of FFT #4

tree_df$tree <- 1:nrow(tree_df)  # adjust tree numbers
# tree_df
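In these definitions, an exit of 1 denotes a positive (True) exit, 0 a negative (False) exit, and 0.5 the final bi-directional node. A hypothetical base-R helper (not part of FFTrees) could verify that a manually edited exits string is well-formed:

```r
# Hypothetical helper (not part of FFTrees): check that an exit structure
# contains only 0/1 exits before a final bi-directional 0.5 exit:
valid_exits <- function(exits) {
  e <- as.numeric(trimws(strsplit(exits, ";")[[1]]))
  all(head(e, -1) %in% c(0, 1)) && tail(e, 1) == 0.5
}

valid_exits("1; 1; 0.5")  # TRUE
valid_exits("1; 0; 1")    # FALSE (final exit is not 0.5)
```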
Moreover, let's define four additional FFTs that reverse the order of the 1st and 2nd cues. As both cues are categorical (i.e., of class `c`) and have the same direction (i.e., `=`), we only need to swap the thresholds (so that they correspond to the new cue order):
# 2. Change cue orders:
tree_df[5:8, ] <- tree_df[1:4, ]          # 4 new FFTs (as copies of existing ones)
tree_df$cues[5:8] <- "cp; thal; ca"       # modify order of cues
tree_df$thresholds[5:8] <- "a; rd,fd; 0"  # modify order of thresholds accordingly

tree_df$tree <- 1:nrow(tree_df)  # adjust tree numbers
# tree_df
The resulting data frame `tree_df` contains the definitions of eight FFTs. The first three are copies of trees in `x`, but the other five are new.
We can evaluate this set by running the `FFTrees()` function with the previous `FFTrees` object `x` (i.e., with its `formula` and `data` settings) and specifying `tree_df` as the `tree.definitions` argument:
# Create a modified FFTrees object y:
y <- FFTrees(object = x,                  # use previous FFTrees object x
             tree.definitions = tree_df,  # but with new tree definitions
             main = "Heart Disease 2")    # revised label
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Using 8 FFTs from 'tree.definitions' as current trees.
#> Successfully defined 8 FFTs.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to apply FFTs to 'test' data:
#> Successfully applied FFTs to 'test' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
The resulting `FFTrees` object `y` contains the trees and summary statistics of all eight FFTs. Although it is unlikely that one of the newly created trees beats the automatically created FFTs, we find that reversing the order of the first two cues has only minimal effects on training accuracy (as measured by `bacc`):
y$trees$definitions  # tree definitions
#> # A tibble: 8 × 7
#> tree nodes classes cues directions thresholds exits
#> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1;0;0.5
#> 2 2 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 1;0;0.5
#> 3 3 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;1;0.5
#> 4 4 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 0;1;0.5
#> 5 5 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1; 1; 0.5
#> 6 6 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 1; 1; 0.5
#> 7 7 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;0;0.5
#> 8 8 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 0;0;0.5
y$trees$stats$train  # training statistics
#> # A tibble: 8 × 20
#> tree n hi fa mi cr sens spec far ppv npv dprime
#> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 150 54 18 12 66 0.818 0.786 0.214 0.75 0.846 1.69
#> 2 2 150 55 20 11 64 0.833 0.762 0.238 0.733 0.853 1.66
#> 3 3 150 44 7 22 77 0.667 0.917 0.0833 0.863 0.778 1.79
#> 4 4 150 44 7 22 77 0.667 0.917 0.0833 0.863 0.778 1.79
#> 5 5 150 63 42 3 42 0.955 0.5 0.5 0.6 0.933 1.66
#> 6 6 150 63 42 3 42 0.955 0.5 0.5 0.6 0.933 1.66
#> 7 7 150 28 2 38 82 0.424 0.976 0.0238 0.933 0.683 1.74
#> 8 8 150 28 2 38 82 0.424 0.976 0.0238 0.933 0.683 1.74
#> # … with 8 more variables: acc <dbl>, bacc <dbl>, wacc <dbl>, cost_dec <dbl>,
#> # cost_cue <dbl>, cost <dbl>, pci <dbl>, mcu <dbl>
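The accuracy measures in this table follow directly from the four frequency counts (`hi`, `fa`, `mi`, `cr`). As a quick sanity check in base R, using the counts of FFT #1 above:

```r
# Recompute accuracy statistics from the 2x2 counts of FFT #1 above:
hi <- 54; fa <- 18; mi <- 12; cr <- 66  # hits, false alarms, misses, correct rejections

sens <- hi / (hi + mi)     # sensitivity (true positive rate)
spec <- cr / (cr + fa)     # specificity (true negative rate)
bacc <- (sens + spec) / 2  # balanced accuracy

round(c(sens = sens, spec = spec, bacc = bacc), 3)
#>  sens  spec  bacc
#> 0.818 0.786 0.802
```

These values match the `sens`, `spec`, and `bacc` columns reported for tree 1.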
Note that the trees in `y` were sorted by their performance on the current `goal` (here `bacc`). For instance, the new rake tree with cue order `cp; thal; ca` and exits `1; 1; 0.5` is now FFT #6. When examining its performance on "test" data (i.e., for prediction):
# Print and plot FFT #6:
print(y, tree = 6, data = "test")
plot(y, tree = 6, data = "test")
we see that it has a balanced accuracy (`bacc`) of 70%. More precisely, its bias for predicting disease (i.e., signal or True) yields near-perfect sensitivity (96%), but very poor specificity (44%).
If we wanted to change more aspects of `x` (e.g., use different `data` or `goal` settings), we could have created a new `FFTrees` object without supplying the previous object `x`, as long as the FFTs defined in `tree.definitions` fit the settings of `formula` and `data`.
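For instance, such a call could look as follows. This is only a sketch, assuming that the cues named in `tree_df` occur in the specified data and that "Heart Disease 3" is a label of our choosing:

```r
# Sketch: evaluate manual tree definitions without a previous FFTrees object.
# (Assumes the cues in tree_df match the criterion/predictors of the formula.)
z <- FFTrees(formula = diagnosis ~ .,
             data = heart.train,
             data.test = heart.test,
             tree.definitions = tree_df,
             main = "Heart Disease 3")
```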
Here is a complete list of the vignettes available in the FFTrees package:
| | Vignette | Description |
|---|---|---|
| | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with FFTrees() | Details on the main function FFTrees() |
| 4 | Manually specifying FFTs | How to directly create FFTs with my.tree without using the built-in algorithms |
| 5 | Visualizing FFTs with plot() | Plotting FFTrees objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |