Creating FFTs with fft()

Nathaniel Phillips

2016-08-16

This function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several fast-and-frugal trees (FFTs). More details about the algorithms are coming soon…

heartdisease example

Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

The critical dependent variable is diagnosis, which indicates whether a patient has heart disease or not. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.

Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance on the test set:

set.seed(100)
# Randomly flag each case as training (TRUE) or test (FALSE)
samples <- sample(c(TRUE, FALSE), size = nrow(heartdisease), replace = TRUE)
heartdisease.train <- heartdisease[samples, ]
heartdisease.test <- heartdisease[!samples, ]
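A quick sanity check for this kind of split, sketched with a small stand-in data frame (toy and its columns are made up for illustration and are not part of the package): a logical vector and its negation partition the rows, so the two sets never overlap and together cover every case.

```r
set.seed(100)

# Toy stand-in for heartdisease (hypothetical data, for illustration only)
toy <- data.frame(id = 1:10, x = rnorm(10))

# TRUE = training case, FALSE = test case
samples <- sample(c(TRUE, FALSE), size = nrow(toy), replace = TRUE)

toy.train <- toy[samples, ]
toy.test  <- toy[!samples, ]   # the rows not drawn for training

# Every row lands in exactly one of the two sets
nrow(toy.train) + nrow(toy.test) == nrow(toy)
length(intersect(toy.train$id, toy.test$id)) == 0
```

Note that this scheme assigns each case independently, so the sizes of the two sets vary from seed to seed.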

We’ll create a new fft object called heart.fft using the fft() function. We’ll specify diagnosis as the (binary) dependent variable, and include all independent variables with formula = diagnosis ~ .:

heart.fft <- fft(
  formula = diagnosis ~.,
  data = heartdisease.train,
  data.test = heartdisease.test
  )

Elements of an fft object

The fft() function returns an object of class fft:

class(heart.fft)
## [1] "fft"

An fft object has many elements. Here are their names:

names(heart.fft)
##  [1] "formula"        "data.train"     "data.test"      "cue.accuracies"
##  [5] "fft.stats"      "lr.stats"       "cart.stats"     "auc"           
##  [9] "lr.model"       "cart.model"     "decision.train" "decision.test" 
## [13] "levelout.train" "levelout.test"

Printing an fft object

You can view basic information about the fft object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the trees use, and how well they performed.

heart.fft
## [1] "An fft object containing 8 trees using 4 cues {thal,cp,exang,slope} out of an original 13"
## [1] "Data were trained on 149 exemplars, and tested on 154 new exemplars"
## [1] "FFT AUC: (Train = 0.88, Test = 0.85)"
## [1] "My favorite tree is #5 [Training: HR = 0.91, FAR = 0.24], [Testing: HR = 0.79, FAR = 0.25]"

Cue accuracy statistics: cue.accuracies

The cue.accuracies dataframe contains marginal accuracy statistics for each cue, that is, the accuracy of each cue when it is used on its own. For each cue, the threshold that maximizes the v-statistic (HR - FAR, the hit rate minus the false-alarm rate) is chosen.

heart.fft$cue.accuracies
##    cue.name cue.class      level.threshold level.sigdirection hi.train
## 1       age   numeric                53.89                 >=       43
## 2        ca   numeric                    0                  >       41
## 3      chol   numeric               252.32                  >       36
## 4        cp character             np,aa,ta                 !=       50
## 5     exang   numeric                    1                 >=       40
## 6       fbs   numeric                    0                  >       12
## 7   oldpeak   numeric                 0.98                  >       42
## 8   restecg character hypertrophy,abnormal                  =       39
## 9       sex   numeric                    1                 >=       51
## 10    slope character                   up                 !=       52
## 11     thal character               normal                 !=       47
## 12  thalach   numeric               144.32                 <=       36
## 13 trestbps   numeric               138.74                  >       27
##    mi.train fa.train cr.train hr.train far.train    v.train dprime.train
## 1        21       42       43 0.671875 0.4941176 0.17775735    0.2299210
## 2        23       21       64 0.640625 0.2470588 0.39356618    0.5219521
## 3        28       26       59 0.562500 0.3058824 0.25661765    0.3324334
## 4        14       22       63 0.781250 0.2588235 0.52242647    0.7116992
## 5        24       11       74 0.625000 0.1294118 0.49558824    0.7239078
## 6        52       11       74 0.187500 0.1294118 0.05808824    0.1210148
## 7        22       24       61 0.656250 0.2823529 0.37389706    0.4890579
## 8        25       34       51 0.609375 0.4000000 0.20937500    0.2655188
## 9        13       47       38 0.796875 0.5529412 0.24393382    0.3487076
## 10       12       31       54 0.812500 0.3647059 0.44779412    0.6165273
## 11       17       17       68 0.734375 0.2000000 0.53437500    0.7338601
## 12       28       21       64 0.562500 0.2470588 0.31544118    0.4205425
## 13       37       16       69 0.421875 0.1882353 0.23363971    0.3436595
##    correction hr.weight hi.test mi.test fa.test cr.test   hr.test
## 1        0.25       0.5      57      18      29      50 0.7600000
## 2        0.25       0.5      52      23      13      66 0.6933333
## 3        0.25       0.5      60      15      54      25 0.8000000
## 4        0.25       0.5      55      20      17      62 0.7333333
## 5        0.25       0.5      36      39      12      67 0.4800000
## 6        0.25       0.5      65      10      67      12 0.8666667
## 7        0.25       0.5      51      24      21      58 0.6800000
## 8        0.25       0.5      44      31      35      44 0.5866667
## 9        0.25       0.5      63      12      45      34 0.8400000
## 10       0.25       0.5      51      24      27      52 0.6800000
## 11       0.25       0.5      54      21      17      62 0.7200000
## 12       0.25       0.5      48      27      14      65 0.6400000
## 13       0.25       0.5      74       1      74       5 0.9866667
##     far.test     v.test dprime.test
## 1  0.3670886 0.39291139  0.52293838
## 2  0.1645570 0.52877637  0.74061064
## 3  0.6835443 0.11645570  0.18199408
## 4  0.2151899 0.51814346  0.70573386
## 5  0.1518987 0.32810127  0.48908518
## 6  0.8481013 0.01856540  0.04122383
## 7  0.2658228 0.41417722  0.54659740
## 8  0.4430380 0.14362869  0.18112497
## 9  0.5696203 0.27037975  0.40952522
## 10 0.3417722 0.33822785  0.43766510
## 11 0.2151899 0.50481013  0.68569175
## 12 0.1772152 0.46278481  0.64224441
## 13 0.9367089 0.04995781  0.34432180
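The v values in the table can be reproduced from the four counts (hi, mi, fa, cr). The sketch below does this for the age cue in the training data, then shows one way a threshold search might work. The helper names v_stat and best_threshold, the toy data, and the fixed ">" direction are all assumptions made for illustration; this is not the package's own implementation, and it ignores the correction and hr.weight columns.

```r
# v = HR - FAR, computed from the four confusion counts
v_stat <- function(hi, mi, fa, cr) hi / (hi + mi) - fa / (fa + cr)

# Age cue, training data (row 1 of the table above)
v_stat(hi = 43, mi = 21, fa = 42, cr = 43)

# Hypothetical threshold search for one numeric cue with direction ">"
best_threshold <- function(cue, criterion, candidates = sort(unique(cue))) {
  v <- sapply(candidates, function(th) {
    decision <- cue > th
    v_stat(hi = sum(decision & criterion == 1),
           mi = sum(!decision & criterion == 1),
           fa = sum(decision & criterion == 0),
           cr = sum(!decision & criterion == 0))
  })
  # Return the candidate with the largest v (first one in case of ties)
  candidates[which.max(v)]
}

# Toy data (made up): higher cue values go with criterion = 1
best_threshold(cue = c(1, 2, 3, 4, 5, 6), criterion = c(0, 0, 0, 1, 1, 1))
```

In the toy data the cue separates the two groups perfectly at a threshold of 3, so that is the value the search returns.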

You can also view the cue accuracies in an ROC-type plot with showcues():

showcues(heart.fft, 
         main = "Heartdisease Cue Accuracy")

Tree definitions and accuracy statistics: fft.stats

The fft.stats dataframe contains all tree definitions and training (and possibly test) statistics for all (\(2^{max.levels - 1}\)) trees. For our heart.fft example, there are \(2^{4 - 1} = 8\) trees.

heart.fft$fft.stats
##   tree.num          level.name                           level.class
## 1        1 thal;cp;exang;slope character;character;numeric;character
## 5        2       thal;cp;exang           character;character;numeric
## 3        3       thal;cp;exang           character;character;numeric
## 7        4 thal;cp;exang;slope character;character;numeric;character
## 2        5       thal;cp;exang           character;character;numeric
## 6        6 thal;cp;exang;slope character;character;numeric;character
## 4        7       thal;cp;exang           character;character;numeric
## 8        8 thal;cp;exang;slope character;character;numeric;character
##   level.exit level.threshold level.sigdirection n.train hi.train mi.train
## 1  0;0;0;0.5   normal;a;0;up          !=;=;>;!=     149       22       42
## 5    0;0;0.5      normal;a;0             !=;=;>     149       26       38
## 3    0;1;0.5      normal;a;0             !=;=;>     149       39       25
## 7  0;1;1;0.5   normal;a;0;up          !=;=;>;!=     149       46       18
## 2    1;0;0.5      normal;a;0             !=;=;>     149       58        6
## 6  1;0;1;0.5   normal;a;0;up          !=;=;>;!=     149       58        6
## 4    1;1;0.5      normal;a;0             !=;=;>     149       61        3
## 8  1;1;1;0.5   normal;a;0;up          !=;=;>;!=     149       63        1
##   fa.train cr.train hr.train  far.train   v.train dprime.train n.test
## 1        2       83 0.343750 0.02352941 0.3202206    0.7917602    154
## 5        4       81 0.406250 0.04705882 0.3591912    0.7184319    154
## 3        7       78 0.609375 0.08235294 0.5270221    0.8335539    154
## 7       14       71 0.718750 0.16470588 0.5540441    0.7772158    154
## 2       20       65 0.906250 0.23529412 0.6709559    1.0197666    154
## 6       24       61 0.906250 0.28235294 0.6238971    0.9469384    154
## 4       36       49 0.953125 0.42352941 0.5295956    0.9344061    154
## 8       50       35 0.984375 0.58823529 0.3961397    0.9654334    154
##   hi.test mi.test fa.test cr.test   hr.test   far.test    v.test
## 1      21      54       0      79 0.2800000 0.00000000 0.2783123
## 5      28      47       0      79 0.3733333 0.00000000 0.3710275
## 3      47      28       6      73 0.6266667 0.07594937 0.5507173
## 7      52      23      11      68 0.6933333 0.13924051 0.5540928
## 2      59      16      20      59 0.7866667 0.25316456 0.5335021
## 6      63      12      25      54 0.8400000 0.31645570 0.5235443
## 4      65      10      37      42 0.8666667 0.46835443 0.3983122
## 8      69       6      48      31 0.9200000 0.60759494 0.3124051
##   dprime.test
## 1   1.0768926
## 5   1.2057404
## 3   0.8779473
## 7   0.7945295
## 2   0.7297365
## 6   0.7360455
## 4   0.5950893
## 8   0.5660078
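The tree count follows from the exit codes in level.exit: every level before the last exits on either 0 or 1, and the last level (coded 0.5) exits in both directions. A minimal sketch that enumerates these patterns for four levels is shown below. The variable names are illustrative, and the sketch only lists full four-level patterns; it does not model the shorter three-cue trees that appear in the output.

```r
max.levels <- 4

# Each non-final level exits on 0 or 1
exits <- expand.grid(rep(list(c(0, 1)), max.levels - 1))

# The final level is coded 0.5 because it exits both ways
exits$final <- 0.5

# Number of exit structures: 2^(max.levels - 1)
nrow(exits)

# Patterns in the same "a;b;c;0.5" notation used by level.exit
apply(exits, 1, paste, collapse = ";")
```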

You can also use the generic summary() function to get the fft.stats dataframe:

summary(heart.fft)  # Same thing as heart.fft$fft.stats

Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6. Training statistics are contained in columns 7:15 and have the .train suffix. For our heart disease dataset, tree 5 (tree.num = 5, printed with row label 2) had the highest training v (HR - FAR) value, 0.67. Test statistics are contained in columns 16:24 and have the .test suffix. In the test data, trees 4 and 3 had the highest v values (0.554 and 0.551), with tree 5 close behind at 0.534.
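To find the tree with the highest v without reading the table by eye, which.max() is one option. The sketch below copies tree.num, v.train, and v.test from the output above into a small data frame called stats (a name chosen here for illustration); with heart.fft in the session, the same calls could presumably be applied to heart.fft$fft.stats directly.

```r
# Values copied from the fft.stats output above, in tree.num order
stats <- data.frame(
  tree.num = 1:8,
  v.train  = c(0.3202206, 0.3591912, 0.5270221, 0.5540441,
               0.6709559, 0.6238971, 0.5295956, 0.3961397),
  v.test   = c(0.2783123, 0.3710275, 0.5507173, 0.5540928,
               0.5335021, 0.5235443, 0.3983122, 0.3124051)
)

# Tree with the highest training v
stats$tree.num[which.max(stats$v.train)]

# Tree with the highest test v
stats$tree.num[which.max(stats$v.test)]
```

Looking things up by tree.num rather than by row label avoids confusion, because the row labels printed in fft.stats do not match the tree numbers.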

Area under the curve (AUC): auc

AUC (area under the curve) statistics are in the auc dataframe:

heart.fft$auc
##             fft        lr      cart
## train 0.8848346 0.8611213 0.8533088
## test  0.8462447 0.7973840 0.7390717
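One plausible reading of the fft column is the area under the ROC curve traced by the eight trees, computed with the trapezoidal rule. The sketch below applies that rule to the training counts from fft.stats and gives about 0.8848, which agrees with the training value in the table. That the package computes AUC exactly this way is an assumption, not something stated in this vignette.

```r
# Training counts for the 8 trees, copied from the fft.stats output
hi <- c(22, 26, 39, 46, 58, 58, 61, 63)
mi <- c(42, 38, 25, 18,  6,  6,  3,  1)
fa <- c( 2,  4,  7, 14, 20, 24, 36, 50)
cr <- c(83, 81, 78, 71, 65, 61, 49, 35)

hr  <- hi / (hi + mi)   # hit rate of each tree
far <- fa / (fa + cr)   # false-alarm rate of each tree

# Anchor the curve at (0, 0) and (1, 1), then order the points by FAR
x <- c(0, far, 1)
y <- c(0, hr, 1)
ord <- order(x, y)
x <- x[ord]
y <- y[ord]

# Trapezoidal rule: segment width times mean height, summed over segments
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc
```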

Other information

decision.train, decision.test

The decision.train and decision.test dataframes contain the raw classification decisions of each tree for each training (and test) case.

Here are the decisions of each of the 8 trees for the first 5 training cases:

heart.fft$decision.train[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      0      0      0      1      1      1      1      1
## 2      0      0      0      0      1      1      1      1
## 3      0      0      0      0      0      0      0      1
## 4      0      0      0      0      0      0      0      0
## 5      0      0      0      0      0      0      0      0

levelout.train, levelout.test

The levelout.train and levelout.test dataframes contain the level at which each case was classified by each tree.

Here are the levels at which the first 5 training cases were classified:

heart.fft$levelout.train[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      2      2      3      4      1      1      1      1
## 2      1      1      1      1      3      3      2      2
## 3      1      1      1      1      2      2      3      4
## 4      1      1      1      1      2      2      3      4
## 5      1      1      1      1      2      2      3      4

Selecting cues

If you want to select specific cues for a tree, just include them in the formula argument.

For example, the following tree heart.as.fft will only consider the cues sex and age:

heart.as.fft <- fft(formula = diagnosis ~ age + sex,
                    data = heartdisease
                    )
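To see which cues a formula refers to before fitting anything, base R's terms() can be used. The sketch below runs on a small made-up data frame (toy is hypothetical and stands in for heartdisease), and shows how the dot in diagnosis ~ . expands to every column except the dependent variable.

```r
# Toy data frame (hypothetical), standing in for heartdisease
toy <- data.frame(diagnosis = c(0, 1),
                  age       = c(63, 67),
                  sex       = c(1, 1),
                  chol      = c(233, 286))

# Explicit cues: only age and sex appear on the right-hand side
cues.selected <- attr(terms(diagnosis ~ age + sex, data = toy), "term.labels")
cues.selected

# The dot expands to every column except the dependent variable
cues.all <- attr(terms(diagnosis ~ ., data = toy), "term.labels")
cues.all
```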

Plotting trees

Once you’ve created an fft object using fft(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 5, printed with row label 2 in fft.stats) applied to the test data:

plot(heart.fft,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease")
     )

For more details, see the plot.fft vignette: vignette("fft_plot", package = "FFTrees").

Additional arguments

The fft() function has several additional arguments that change how trees are built. Note: Not all of these arguments have been fully tested yet!