Manually specifying FFTs

Nathaniel Phillips and Hansjörg Neth

2023-01-06

Manually specifying FFTs

We usually create fast-and-frugal trees (FFTs) from data by using the FFTrees() function (see the Main guide: FFTrees overview and the vignette on Creating FFTs with FFTrees() for details). However, we occasionally want to design and test a specific FFT (e.g., to check a hypothesis or use some variables based on theoretical considerations).

There are two ways to define fast-and-frugal trees manually when using the FFTrees() function:

  1. as a sentence using the my.tree argument (the easier way), or

  2. as a data frame using the tree.definitions argument (the harder way).

Both methods require some data to evaluate the performance of the FFTs, but bypass the tree construction algorithms built into the FFTrees package. As manually created FFTs are not optimized to the data, the conceptual distinction between fitting data and predicting data disappears for such FFTs. Although we can still distinguish between ‘train’ and ‘test’ data, a manually defined FFT should not be expected to perform systematically better on ‘train’ data than on ‘test’ data.

1. Using my.tree

The first method is to use the my.tree argument, where my.tree is a sentence describing a (single) FFT. When this argument is specified in FFTrees(), the function (specifically, the auxiliary fftrees_wordstofftrees() function) tries to convert the verbal description into the definition of an FFT (stored in an FFTrees object).

For example, let’s look at the heartdisease data to find out how some predictor variables (e.g., sex, age, etc.) predict the criterion variable (diagnosis):

Table 1: Five cues and the binary criterion variable diagnosis for the first cases of the heartdisease data.
| sex | age | thal   | cp | ca | diagnosis |
|----:|----:|:-------|:---|---:|:----------|
|   1 |  63 | fd     | ta |  0 | FALSE     |
|   1 |  67 | normal | a  |  3 | TRUE      |
|   1 |  67 | rd     | a  |  2 | TRUE      |
|   1 |  37 | normal | np |  0 | FALSE     |
|   0 |  41 | normal | aa |  0 | FALSE     |
|   1 |  56 | normal | aa |  0 | FALSE     |

Here’s how we could verbally describe an FFT by using the first three cues in conditional sentences:

in_words <- "If sex = 1, predict True.
             If age < 45, predict False. 
             If thal = {fd, normal}, predict True. 
             Otherwise, predict False."

As we will see shortly, the FFTrees() function accepts such a description (assigned here to the character string in_words) as its my.tree argument, creates the corresponding FFT, and evaluates it on the corresponding dataset.

Verbally defining FFTs

Here are some instructions for manually specifying trees:

  • Each node must start with the word “If” and should correspond to the form: If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>.

  • Numeric thresholds should be specified directly (without brackets), like age > 21.

  • For categorical variables, factor thresholds must be specified within curly braces, like sex = {male}. For thresholds that comprise several factor values, separate the values within the braces by commas, like eyecolor = {blue,brown}.

  • To specify cue directions, standard logical comparisons =, !=, <, >= (etc.) are valid. For numeric cues, only use >, >=, <, or <=. For factors, only use = or !=.

  • Positive exits are indicated by True, while negative exits are specified by False.

  • The final node of an FFT is always bi-directional (i.e., has both a positive and a negative exit). Its description always mentions the positive (True) exit first. The text Otherwise, predict EXIT that we have included in the example above is not strictly necessary (and is ignored).
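For instance, the following description also conforms to these rules (a minimal sketch using the heartdisease cues cp and age; the variable name and the thresholds are chosen only for illustration):

# A sketch of another valid verbal FFT description
# (cue names from heartdisease; thresholds for illustration only):
in_words_sketch <- "If cp != {a}, predict False.
                    If age >= 60, predict True.
                    Otherwise, predict False."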

Example

Now, let’s use our verbal description of an FFT (assigned to in_words above) as the my.tree argument of the FFTrees() function. This creates a corresponding FFT and applies it to the heartdisease data:

# Create FFTrees from a verbal FFT description (as my.tree): 
my_fft <- FFTrees(diagnosis ~.,
                  data = heartdisease,
                  main = "My 1st FFT", 
                  my.tree = in_words)
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.

Let’s see how well our manually constructed FFT (my_fft) did:

# Inspect FFTrees object:
plot(my_fft)

Figure 1: An FFT manually constructed using the my.tree argument of FFTrees().

When manually constructing a tree, the resulting FFTrees object only contains a single FFT. Hence, the ROC plot (in the bottom right panel of Figure 1) cannot show a range of FFTs, but locates the single constructed FFT in ROC space.

As it turns out, the performance of our first FFT created from a verbal description is mixed: The tree has a rather high sensitivity (of 91%), but its low specificity (of only 10%) allows many false alarms. Consequently, its accuracy measures remain near baseline level.
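These values can be read off the plot, but they are also stored in the FFTrees object. As a quick sketch (assuming the same $trees$stats$train structure that we inspect for other objects below), we could extract the key accuracy measures directly:

# Extract key accuracy measures of the manually defined FFT (training data):
my_fft$trees$stats$train[ , c("sens", "spec", "acc", "bacc")]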

Creating an alternative FFT

Let’s see if we can come up with a better FFT. The following example uses the cues thal, cp, and ca in the my.tree argument:

# Create 2nd FFTrees from an alternative FFT description (as my.tree): 
my_fft_2 <- FFTrees(diagnosis ~.,
                    data = heartdisease, 
                    main = "My 2nd FFT", 
                    my.tree = "If thal = {rd,fd}, predict True.
                               If cp != {a}, predict False. 
                               If ca > 1, predict True. 
                               Otherwise, predict False.")
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
# Inspect FFTrees object:
plot(my_fft_2)

Figure 2: Another FFT manually constructed using the my.tree argument of FFTrees().

This alternative FFT balances sensitivity and specificity nicely and performs much better overall. Nevertheless, it is still far from perfect, so check whether you can create even better ones!
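To quantify the improvement, we could juxtapose the training statistics of both manually defined FFTs (a sketch; as each object contains only a single FFT, we take the first row of its statistics):

# Compare both manually defined FFTs on training data:
rbind(my_fft$trees$stats$train[1, c("sens", "spec", "acc", "bacc")],
      my_fft_2$trees$stats$train[1, c("sens", "spec", "acc", "bacc")])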

2. Using tree.definitions

More experienced users may want to define and evaluate more than one FFT at a time. To achieve this, the FFTrees() function allows providing a set of tree definitions via the tree.definitions argument (as a data frame). However, as questions regarding specific trees usually arise late in an exploration of FFTs, the tree.definitions argument is mostly used in combination with an existing FFTrees object x. In this case, the parameters of x (e.g., regarding the formula, data, and goals to be used) are reused, but its tree definitions (stored in x$trees$definitions) are replaced by those in tree.definitions and the object is re-evaluated for those FFTs.

Example

We illustrate a typical workflow by redefining some FFTs that were built in the Tutorial: FFTs for heart disease and evaluating them on the heartdisease data (using its heart.train and heart.test partitions).

First, we use the default algorithms to create an FFTrees object x:

# Create an FFTrees object x:
x <- FFTrees(formula = diagnosis ~ .,           # criterion and (all) predictors
             data = heart.train,                # training data
             data.test = heart.test,            # testing data
             main = "Heart Disease 1",          # initial label
             decision.labels = c("low risk", "high risk"),  # exit labels
             quiet = TRUE)                      # hide user feedback

As we have seen in the Tutorial, evaluating this expression yields a set of 7 FFTs. Rather than inspecting them individually (by issuing print(x) or plot(x) commands for specific trees), we can obtain both their definitions and their performance characteristics on a variety of measures, either by running summary(x) or by inspecting the corresponding parts of the FFTrees object. For instance, the following alternatives both show the current definitions of the generated FFTs:

# Tree definitions of x:
# summary(x)$definitions   # from summary()
x$trees$definitions        # from FFTrees object x
#> # A tibble: 7 × 7
#>    tree nodes classes cues             directions thresholds          exits    
#>   <int> <int> <chr>   <chr>            <chr>      <chr>               <chr>    
#> 1     1     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           1;0;0.5  
#> 2     2     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;0;1;0.5
#> 3     3     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;1;0.5  
#> 4     4     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;0;0.5
#> 5     5     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;0;0.5  
#> 6     6     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 0;0;0;0.5
#> 7     7     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;1;0.5

Each row of these tree definitions defines an FFT in the context of our current FFTrees object x (see the vignette on Creating FFTs with FFTrees() for help on interpreting tree definitions). As the “ifan” algorithm responsible for creating these trees yields a family of highly similar FFTs (the FFTs vary only in their exits, and some truncate the last cue), we may want to examine alternative versions of these trees.

Modifying tree definitions

To demonstrate how to create and evaluate manual FFT definitions, we copy the existing tree definitions (as a data frame), select three FFTs (rows), and then create a 4th definition (with a different exit structure):

# 0. Copy and choose some existing FFT definitions:
tree_df <- x$trees$definitions    # get FFT definitions (as df)
tree_df <- tree_df[c(1, 3, 5), ]  # filter 3 particular FFTs

# 1. Add a tree with 1;1;0.5 exit structure (a "rake" tree with Signal bias):
tree_df[4, ] <- tree_df[1, ]        # initialize new FFT #4 (as copy of FFT #1)
tree_df$exits[4] <- c("1; 1; 0.5")  # modify exits of FFT #4

tree_df$tree <- 1:nrow(tree_df)   # adjust tree numbers
# tree_df

Moreover, let’s define four additional FFTs that reverse the order of the 1st and 2nd cues. As both cues are categorical (i.e., of class c) and have the same direction (i.e., =), we only need to reverse the thresholds (so that they correspond to the new cue order):

# 2. Change cue orders:
tree_df[5:8, ] <- tree_df[1:4, ]     # 4 new FFTs (as copies of existing ones)
tree_df$cues[5:8] <- "cp; thal; ca"       # modify order of cues
tree_df$thresholds[5:8] <- "a; rd,fd; 0"  # modify order of thresholds accordingly

tree_df$tree <- 1:nrow(tree_df)           # adjust tree numbers
# tree_df

The resulting data frame tree_df contains the definitions of eight FFTs. The first three are copies of trees in x, but the other five are new.
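Before evaluating them, we can double-check the modified definitions (a quick sketch showing only the columns we changed):

# Inspect the modified tree definitions:
tree_df[ , c("tree", "cues", "thresholds", "exits")]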

Evaluating tree.definitions

We can evaluate this set by running the FFTrees() function with the previous FFTrees object x (i.e., with its formula and data settings) and specifying tree_df in the tree.definitions argument:

# Create a modified FFTrees object y:
y <- FFTrees(object = x,                  # use previous FFTrees object x
             tree.definitions = tree_df,  # but with new tree definitions
             main = "Heart Disease 2"     # revised label
)
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Using 8 FFTs from 'tree.definitions' as current trees.
#> Successfully defined 8 FFTs.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to apply FFTs to 'test' data:
#> Successfully applied FFTs to 'test' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.

The resulting FFTrees object y contains the trees and summary statistics of all eight FFTs. Although it is unlikely that any of the newly created trees beats the automatically created FFTs, we find that reversing the order of the first two cues has only minimal effects on training accuracy (as measured by bacc):

y$trees$definitions  # tree definitions
#> # A tibble: 8 × 7
#>    tree nodes classes cues         directions thresholds  exits    
#>   <int> <int> <chr>   <chr>        <chr>      <chr>       <chr>    
#> 1     1     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1;0;0.5  
#> 2     2     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1;0;0.5  
#> 3     3     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;1;0.5  
#> 4     4     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;1;0.5  
#> 5     5     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1; 1; 0.5
#> 6     6     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1; 1; 0.5
#> 7     7     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;0;0.5  
#> 8     8     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;0;0.5
y$trees$stats$train  # training statistics
#> # A tibble: 8 × 20
#>    tree     n    hi    fa    mi    cr  sens  spec    far   ppv   npv dprime
#>   <int> <int> <int> <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
#> 1     1   150    54    18    12    66 0.818 0.786 0.214  0.75  0.846   1.69
#> 2     2   150    55    20    11    64 0.833 0.762 0.238  0.733 0.853   1.66
#> 3     3   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 4     4   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 5     5   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 6     6   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 7     7   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> 8     8   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> # … with 8 more variables: acc <dbl>, bacc <dbl>, wacc <dbl>, cost_dec <dbl>,
#> #   cost_cue <dbl>, cost <dbl>, pci <dbl>, mcu <dbl>
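As the printed statistics truncate the bacc column, we can also look at it directly (a sketch using the object structure shown above):

# Balanced accuracy (bacc) of all eight FFTs on training data:
round(y$trees$stats$train$bacc, 2)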

Note that the trees in y were sorted by their performance on the current goal (here bacc). For instance, the new rake tree with cue order cp; thal; ca and exits 1; 1; 0.5 is now FFT #6. When examining its performance on "test" data (i.e., for prediction):

# Print and plot FFT #6:
print(y, tree = 6, data = "test")
plot(y,  tree = 6, data = "test")

we see that it has a balanced accuracy bacc of 70%. More precisely, its bias for predicting disease (i.e., signal or True) yields near-perfect sensitivity (96%), but very poor specificity (44%).

If we wanted to change more aspects of x (e.g., use different data or goal settings), we could also have created a new FFTrees object without supplying the previous object x, as long as the FFTs defined in tree.definitions fit the settings of formula and data.
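For example, such a call could look as follows (a sketch; the object name z is chosen only for illustration, and we assume that the FFTs in tree_df use only cues contained in the data):

# Create a new FFTrees object from scratch (without a previous object x):
z <- FFTrees(formula = diagnosis ~ .,
             data = heart.train,
             data.test = heart.test,
             tree.definitions = tree_df,
             main = "Heart Disease 3")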

Vignettes

Here is a complete list of the vignettes available in the FFTrees package:

| Vignette | Description |
|:---------|:------------|
| Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 Creating FFTs with FFTrees() | Details on the main function FFTrees() |
| 4 Manually specifying FFTs | How to directly create FFTs with my.tree without using the built-in algorithms |
| 5 Visualizing FFTs with plot() | Plotting FFTrees objects, from full trees to icon arrays |
| 6 Examples of FFTs | Examples of FFTs from different datasets contained in the package |