Cubist
is an R
port of the Cubist GPL
C
code released by RuleQuest at http://rulequest.com/cubist-info.html
.
See the last section of this document for information on the porting.
The other parts describes the functionality of the R
package.
Cubist is a rule-based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
This is explained better in Quinlan (1992). Wang and Witten (1997)
attempted to recreate this model using a “rational reconstruction” of
Quinlan (1992) that is the basis for the M5P
model in
Weka
(and the R package RWeka
).
An example of a model tree can be illustrated using the Ames housing
data in the modeldata
package.
library(Cubist)
data(ames, package = "modeldata")
# model the data on the log10 scale
$Sale_Price <- log10(ames$Sale_Price)
ames
set.seed(11)
<- sample(1:nrow(ames), floor(.8*nrow(ames)))
in_train_set
<-
predictors c("Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type",
"Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area",
"Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath",
"TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude")
$Sale_Price <- log10(ames$Sale_Price)
ames
<- ames[ in_train_set, predictors]
train_pred <- ames[-in_train_set, predictors]
test_pred
<- ames$Sale_Price[ in_train_set]
train_resp <- ames$Sale_Price[-in_train_set]
test_resp
<- cubist(x = train_pred, y = train_resp)
model_tree model_tree
##
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
## Number of samples: 2344
## Number of predictors: 17
##
## Number of committees: 1
## Number of rules: 11
summary(model_tree)
##
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Nov 8 19:39:40 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model:
##
## Rule 1: [107 cases, mean 0.6922308, range 0.6135074 to 0.725238, est err 0.0086873]
##
## if
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 2.8339713 + 2.89e-05 Gr_Liv_Area - 0.0011 Year_Sold
## + 2e-05 Year_Built + 1.1e-06 Total_Bsmt_SF
## + 0.0003 Bsmt_Full_Bath
##
## Rule 2: [333 cases, mean 0.7065404, range 0.6720027 to 0.7540954, est err 0.0054064]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = 0.4666239 + 1.87e-05 Gr_Liv_Area + 0.00011 Year_Built
## + 3.4e-06 Total_Bsmt_SF + 0.0009 Bsmt_Full_Bath
## + 7e-08 Lot_Area
##
## Rule 3: [176 cases, mean 0.7106560, range 0.6802335 to 0.7269588, est err 0.0030744]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.01018 + 0.00034 Year_Built + 1.79e-05 Gr_Liv_Area
## + 2.2e-06 Total_Bsmt_SF + 0.0006 Bsmt_Full_Bath
##
## Rule 4: [87 cases, mean 0.7109181, range 0.6841727 to 0.7294574, est err 0.0049527]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Gr_Liv_Area <= 1692
## then
## outcome = 0.6804143 + 1.45e-05 Gr_Liv_Area + 7e-06 Year_Built
## + 4e-07 Total_Bsmt_SF
##
## Rule 5: [35 cases, mean 0.7117702, range 0.6949773 to 0.7293048, est err 0.0047720]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.2370069 + 0.000232 Year_Built + 1.16e-05 Gr_Liv_Area
## + 8.4e-06 Total_Bsmt_SF + 0.0013 Bsmt_Full_Bath
##
## Rule 6: [1428 cases, mean 0.7131603, range 0.6135074 to 0.7540954, est err 0.0043013]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## then
## outcome = 0.3519752 + 1.26e-05 Gr_Liv_Area + 0.000171 Year_Built
## + 7.2e-06 Total_Bsmt_SF + 1.8e-07 Lot_Area
##
## Rule 7: [54 cases, mean 0.7134706, range 0.6808683 to 0.7362626, est err 0.0044510]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 0.3791743 + 1.48e-05 Gr_Liv_Area + 0.000151 Year_Built
## + 4.7e-06 Total_Bsmt_SF + 1.3e-07 Lot_Area
## + 0.0004 Bsmt_Full_Bath
##
## Rule 8: [997 cases, mean 0.7173781, range 0.6792599 to 0.74771, est err 0.0034352]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.3008866 + 1.41e-05 Gr_Liv_Area + 0.000195 Year_Built
## + 9e-06 Total_Bsmt_SF + 0.0026 Bsmt_Full_Bath + 5e-08 Lot_Area
##
## Rule 9: [166 cases, mean 0.7354736, range 0.711188 to 0.7687976, est err 0.0044621]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 0.451118 + 1.11e-05 Gr_Liv_Area + 0.000126 Year_Built
## + 7.5e-06 Total_Bsmt_SF + 4e-08 Lot_Area
##
## Rule 10: [128 cases, mean 0.7408611, range 0.7227189 to 0.7608459, est err 0.0035555]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 2024
## Gr_Liv_Area > 1692
## then
## outcome = 0.1972512 + 0.000248 Year_Built + 1.44e-05 Gr_Liv_Area
## + 9.1e-06 Total_Bsmt_SF + 1.7e-07 Lot_Area
## + 0.0018 Bsmt_Full_Bath
##
## Rule 11: [20 cases, mean 0.7468034, range 0.7163473 to 0.7624165, est err 0.0068615]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 2024
## then
## outcome = 0.7741479 - 1e-05 Gr_Liv_Area
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.0045926
## Relative |error| 0.40
## Correlation coefficient 0.89
##
##
## Attribute usage:
## Conds Model
##
## 98% 99% Year_Built
## 63% Neighborhood
## 50% 100% Gr_Liv_Area
## 42% Bldg_Type
## 38% 99% Total_Bsmt_SF
## 12% Central_Air
## 88% Lot_Area
## 52% Bsmt_Full_Bath
## 3% Year_Sold
##
##
## Time: 0.0 secs
There is no formula method for cubist()
; the predictors
are specified as matrix or data frame, The outcome is a numeric
vector.
There is a predict method for the model:
<- predict(model_tree, test_pred)
model_tree_pred ## Test set RMSE
sqrt(mean((model_tree_pred - test_resp)^2))
## [1] 0.00635
## Test set R^2
cor(model_tree_pred, test_resp)^2
## [1] 0.813
The Cubist model can also use a boosting-like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over-predicted a value, the response is adjusted downward for the next model (and so on, see this blog post). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
The committee
option can be used to control number of
model trees:
set.seed(1)
<- cubist(x = train_pred, y = train_resp, committees = 3)
com_model summary(com_model)
##
## Call:
## cubist.default(x = train_pred, y = train_resp, committees = 3)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Nov 8 19:39:40 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [107 cases, mean 0.6922308, range 0.6135074 to 0.725238, est err 0.0086873]
##
## if
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 2.8339713 + 2.89e-05 Gr_Liv_Area - 0.0011 Year_Sold
## + 2e-05 Year_Built + 1.1e-06 Total_Bsmt_SF
## + 0.0003 Bsmt_Full_Bath
##
## Rule 1/2: [333 cases, mean 0.7065404, range 0.6720027 to 0.7540954, est err 0.0054064]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = 0.4666239 + 1.87e-05 Gr_Liv_Area + 0.00011 Year_Built
## + 3.4e-06 Total_Bsmt_SF + 0.0009 Bsmt_Full_Bath
## + 7e-08 Lot_Area
##
## Rule 1/3: [176 cases, mean 0.7106560, range 0.6802335 to 0.7269588, est err 0.0030744]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.01018 + 0.00034 Year_Built + 1.79e-05 Gr_Liv_Area
## + 2.2e-06 Total_Bsmt_SF + 0.0006 Bsmt_Full_Bath
##
## Rule 1/4: [87 cases, mean 0.7109181, range 0.6841727 to 0.7294574, est err 0.0049527]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Gr_Liv_Area <= 1692
## then
## outcome = 0.6804143 + 1.45e-05 Gr_Liv_Area + 7e-06 Year_Built
## + 4e-07 Total_Bsmt_SF
##
## Rule 1/5: [35 cases, mean 0.7117702, range 0.6949773 to 0.7293048, est err 0.0047720]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.2370069 + 0.000232 Year_Built + 1.16e-05 Gr_Liv_Area
## + 8.4e-06 Total_Bsmt_SF + 0.0013 Bsmt_Full_Bath
##
## Rule 1/6: [1428 cases, mean 0.7131603, range 0.6135074 to 0.7540954, est err 0.0043013]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## then
## outcome = 0.3519752 + 1.26e-05 Gr_Liv_Area + 0.000171 Year_Built
## + 7.2e-06 Total_Bsmt_SF + 1.8e-07 Lot_Area
##
## Rule 1/7: [54 cases, mean 0.7134706, range 0.6808683 to 0.7362626, est err 0.0044510]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 0.3791743 + 1.48e-05 Gr_Liv_Area + 0.000151 Year_Built
## + 4.7e-06 Total_Bsmt_SF + 1.3e-07 Lot_Area
## + 0.0004 Bsmt_Full_Bath
##
## Rule 1/8: [997 cases, mean 0.7173781, range 0.6792599 to 0.74771, est err 0.0034352]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.3008866 + 1.41e-05 Gr_Liv_Area + 0.000195 Year_Built
## + 9e-06 Total_Bsmt_SF + 0.0026 Bsmt_Full_Bath + 5e-08 Lot_Area
##
## Rule 1/9: [166 cases, mean 0.7354736, range 0.711188 to 0.7687976, est err 0.0044621]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 0.451118 + 1.11e-05 Gr_Liv_Area + 0.000126 Year_Built
## + 7.5e-06 Total_Bsmt_SF + 4e-08 Lot_Area
##
## Rule 1/10: [128 cases, mean 0.7408611, range 0.7227189 to 0.7608459, est err 0.0035555]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 2024
## Gr_Liv_Area > 1692
## then
## outcome = 0.1972512 + 0.000248 Year_Built + 1.44e-05 Gr_Liv_Area
## + 9.1e-06 Total_Bsmt_SF + 1.7e-07 Lot_Area
## + 0.0018 Bsmt_Full_Bath
##
## Rule 1/11: [20 cases, mean 0.7468034, range 0.7163473 to 0.7624165, est err 0.0068615]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 2024
## then
## outcome = 0.7741479 - 1e-05 Gr_Liv_Area
##
## Model 2:
##
## Rule 2/1: [31 cases, mean 0.6767582, range 0.6135074 to 0.7025505, est err 0.0169966]
##
## if
## Central_Air = N
## Gr_Liv_Area <= 832
## then
## outcome = -60.1255459 - 0.649 Longitude + 3.1e-06 Gr_Liv_Area
## + 1.6e-06 Total_Bsmt_SF + 1.7e-05 Year_Built + 5e-08 Lot_Area
##
## Rule 2/2: [91 cases, mean 0.6914554, range 0.6135074 to 0.7112178, est err 0.0085649]
##
## if
## Gr_Liv_Area <= 832
## then
## outcome = -7.0404897 + 3.32e-05 Total_Bsmt_SF + 2.71e-05 Gr_Liv_Area
## + 1.81e-06 Lot_Area + 0.000148 Year_Built - 0.079 Longitude
##
## Rule 2/3: [126 cases, mean 0.7014069, range 0.6629523 to 0.7343875, est err 0.0089801]
##
## if
## Central_Air = N
## Gr_Liv_Area > 832
## then
## outcome = -18.0819225 + 0.585 Latitude + 1.28e-05 Gr_Liv_Area
## - 0.0029 Year_Sold - 0.00035 TotRms_AbvGrd
##
## Rule 2/4: [389 cases, mean 0.7057415, range 0.6629523 to 0.7540954, est err 0.0063773]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1966
## Gr_Liv_Area > 832
## then
## outcome = 1.8797961 + 1.97e-05 Gr_Liv_Area - 0.00211 TotRms_AbvGrd
## - 0.0006 Year_Sold + 9e-06 Year_Built + 4e-07 Total_Bsmt_SF
##
## Rule 2/5: [615 cases, mean 0.7130445, range 0.6807946 to 0.7433388, est err 0.0040729]
##
## if
## Neighborhood in {North_Ames, Sawyer, Northwest_Ames, Brookside,
## Crawford, Timberland, Clear_Creek}
## Gr_Liv_Area > 832
## Latitude > 42.03093
## then
## outcome = 0.4372397 + 1.06e-05 Gr_Liv_Area + 0.000134 Year_Built
## - 0.00198 TotRms_AbvGrd + 5.1e-07 Lot_Area
## + 6e-06 Total_Bsmt_SF
##
## Rule 2/6: [339 cases, mean 0.7155511, range 0.6858544 to 0.7433388, est err 0.0042090]
##
## if
## Neighborhood in {North_Ames, Old_Town, Edwards, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Meadow_Village, Briardale,
## Bloomington_Heights, Landmark}
## Year_Built > 1966
## Total_Bsmt_SF <= 2220
## Gr_Liv_Area > 832
## then
## outcome = -2.3923159 + 0.000258 Year_Built + 1.29e-05 Gr_Liv_Area
## + 6.6e-07 Lot_Area + 0.006 Bsmt_Full_Bath
## + 6e-06 Total_Bsmt_SF + 0.061 Latitude - 0.00026 TotRms_AbvGrd
##
## Rule 2/7: [63 cases, mean 0.7158888, range 0.7042104 to 0.7362626, est err 0.0022518]
##
## if
## Lot_Area <= 3612
## Neighborhood in {College_Creek, Somerset, Northridge_Heights, Gilbert,
## Sawyer, Northwest_Ames, Crawford, Timberland,
## Northridge, Stone_Brook,
## South_and_West_of_Iowa_State_University, Clear_Creek,
## Veenker, Northpark_Villa, Blueste, Greens, Green_Hills}
## then
## outcome = 0.2224039 + 0.000239 Year_Built + 9.8e-06 Gr_Liv_Area
## + 2.3e-06 Total_Bsmt_SF + 0.0013 Bsmt_Full_Bath
## + 6e-08 Lot_Area
##
## Rule 2/8: [21 cases, mean 0.7226428, range 0.7021946 to 0.747059, est err 0.0093020]
##
## if
## Year_Built > 1966
## Total_Bsmt_SF <= 2220
## Bsmt_Full_Bath > 1
## then
## outcome = 0.6899718 + 1.92e-05 Gr_Liv_Area
##
## Rule 2/9: [1326 cases, mean 0.7238501, range 0.6858544 to 0.7631194, est err 0.0037527]
##
## if
## Year_Built > 1966
## Total_Bsmt_SF <= 2220
## Bsmt_Full_Bath <= 1
## then
## outcome = -0.0413273 + 0.000249 Year_Built + 1.34e-05 Gr_Liv_Area
## + 7.8e-06 Total_Bsmt_SF + 0.0041 Bsmt_Full_Bath
## + 1.1e-07 Lot_Area - 0.00029 TotRms_AbvGrd + 0.019 Latitude
## + 0.006 Longitude
##
## Rule 2/10: [199 cases, mean 0.7242572, range 0.6844585 to 0.7503841, est err 0.0053956]
##
## if
## Neighborhood in {North_Ames, Sawyer, Northwest_Ames, Brookside,
## Crawford, Timberland, Clear_Creek}
## Latitude <= 42.03093
## then
## outcome = 2.9085627 + 1.59e-05 Gr_Liv_Area + 2.7e-06 Total_Bsmt_SF
## - 0.00064 TotRms_AbvGrd - 0.049 Latitude + 2.5e-05 Year_Built
## + 1.3e-07 Lot_Area - 0.0001 Year_Sold
##
## Rule 2/11: [494 cases, mean 0.7320032, range 0.691652 to 0.7625942, est err 0.0039935]
##
## if
## Lot_Area > 3612
## Neighborhood in {Somerset, Northridge_Heights, Northwest_Ames, Crawford,
## Timberland, Stone_Brook,
## South_and_West_of_Iowa_State_University, Clear_Creek,
## Veenker, Blueste, Greens, Green_Hills}
## Year_Built > 1966
## then
## outcome = -0.0693572 + 0.000386 Year_Built + 9.8e-06 Gr_Liv_Area
## + 0.0062 Bsmt_Full_Bath + 7.1e-06 Total_Bsmt_SF
## + 1.5e-07 Lot_Area
##
## Rule 2/12: [20 cases, mean 0.7484384, range 0.7163473 to 0.7687976, est err 0.0111278]
##
## if
## Total_Bsmt_SF > 2220
## then
## outcome = -1.0179687 + 0.000906 Year_Built - 1.49e-05 Total_Bsmt_SF
##
## Model 3:
##
## Rule 3/1: [104 cases, mean 0.6924329, range 0.6135074 to 0.7139776, est err 0.0101724]
##
## if
## Gr_Liv_Area <= 845
## then
## outcome = 0.6422932 + 7.43e-05 Gr_Liv_Area
##
## Rule 3/2: [92 cases, mean 0.6954694, range 0.6146095 to 0.724802, est err 0.0076345]
##
## if
## Neighborhood in {Iowa_DOT_and_Rail_Road, Meadow_Village, Briardale}
## Total_Bsmt_SF <= 800
## then
## outcome = -21.4519142 + 1.8e-05 Gr_Liv_Area + 2.01e-05 Total_Bsmt_SF
## - 0.215 Longitude + 7.1e-05 Year_Built + 0.044 Latitude
##
## Rule 3/3: [431 cases, mean 0.7078069, range 0.6135074 to 0.7294574, est err 0.0054390]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Somerset,
## Gilbert, Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Brookside, Crawford, Timberland, Northridge,
## South_and_West_of_Iowa_State_University, Clear_Creek,
## Veenker, Blueste, Green_Hills}
## Year_Built <= 2005
## Total_Bsmt_SF <= 800
## Gr_Liv_Area <= 1832
## then
## outcome = 0.5911446 + 2.12e-05 Gr_Liv_Area + 1.12e-05 Total_Bsmt_SF
## + 4.2e-05 Year_Built + 0.00027 TotRms_AbvGrd
##
## Rule 3/4: [170 cases, mean 0.7106175, range 0.6802335 to 0.7362626, est err 0.0053771]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 845
## then
## outcome = 0.2154946 + 0.000239 Year_Built + 9.3e-06 Gr_Liv_Area
## + 9.7e-06 Total_Bsmt_SF
##
## Rule 3/5: [1411 cases, mean 0.7132381, range 0.6135074 to 0.7540954, est err 0.0043492]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Bloomington_Heights, Northpark_Villa}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2005
## then
## outcome = 2.7340427 + 1.35e-05 Gr_Liv_Area + 0.000196 Year_Built
## + 9e-06 Total_Bsmt_SF + 0.0014 Half_Bath + 1.2e-07 Lot_Area
## + 0.026 Longitude
##
## Rule 3/6: [52 cases, mean 0.7157766, range 0.6927507 to 0.7362626, est err 0.0056324]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Total_Bsmt_SF > 1221
## then
## outcome = 0.242855 - 2.58e-05 Total_Bsmt_SF + 0.000254 Year_Built
## + 5.3e-06 Gr_Liv_Area
##
## Rule 3/7: [30 cases, mean 0.7163877, range 0.6906617 to 0.7467062, est err 0.0104669]
##
## if
## Year_Built <= 1960
## Total_Bsmt_SF <= 800
## Gr_Liv_Area > 1832
## then
## outcome = -1.2544615 + 0.000342 Year_Built + 3e-06 Gr_Liv_Area
## + 2.5e-06 Total_Bsmt_SF - 0.009 Longitude + 0.011 Latitude
##
## Rule 3/8: [991 cases, mean 0.7212871, range 0.6807946 to 0.7687976, est err 0.0040766]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Somerset,
## Gilbert, Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Brookside, Crawford, Timberland, Northridge,
## South_and_West_of_Iowa_State_University, Clear_Creek,
## Veenker, Blueste, Green_Hills}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built > 1960
## Year_Built <= 2005
## then
## outcome = -0.4926223 + 1.38e-05 Gr_Liv_Area + 1.3e-06 Total_Bsmt_SF
## + 1.1e-05 Year_Built - 0.008 Longitude + 0.01 Latitude
##
## Rule 3/9: [682 cases, mean 0.7277318, range 0.6879801 to 0.7687976, est err 0.0049696]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Mitchell, Brookside,
## Crawford, Northridge, Stone_Brook, Clear_Creek,
## Veenker, Blueste, Greens, Green_Hills}
## Gr_Liv_Area > 845
## then
## outcome = 0.4310254 + 1.22e-05 Gr_Liv_Area + 1.11e-05 Total_Bsmt_SF
## + 0.000132 Year_Built + 0.0041 Half_Bath + 5e-08 Lot_Area
## - 0.00015 TotRms_AbvGrd
##
## Rule 3/10: [235 cases, mean 0.7322826, range 0.6925699 to 0.7608459, est err 0.0041824]
##
## if
## Year_Built > 2005
## Total_Bsmt_SF <= 2006
## then
## outcome = -2.1757049 + 0.001422 Year_Built + 2.14e-05 Gr_Liv_Area
## + 1.46e-05 Total_Bsmt_SF
##
## Rule 3/11: [20 cases, mean 0.7467262, range 0.7163473 to 0.7624165, est err 0.0108212]
##
## if
## Year_Built > 2005
## Total_Bsmt_SF > 2006
## then
## outcome = 0.7768462 - 3e-05 Gr_Liv_Area + 1.88e-05 Total_Bsmt_SF
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.0043261
## Relative |error| 0.38
## Correlation coefficient 0.90
##
##
## Attribute usage:
## Conds Model
##
## 80% 97% Year_Built
## 69% Neighborhood
## 42% 100% Gr_Liv_Area
## 36% Bldg_Type
## 34% 98% Total_Bsmt_SF
## 12% 35% Bsmt_Full_Bath
## 7% 27% Latitude
## 5% Central_Air
## 5% 73% Lot_Area
## 36% TotRms_AbvGrd
## 35% Longitude
## 18% Half_Bath
## 7% Year_Sold
##
##
## Time: 0.1 secs
For this model:
<- predict(com_model, test_pred)
com_pred ## RMSE
sqrt(mean((com_pred - test_resp)^2))
## [1] 0.00592
## R^2
cor(com_pred, test_resp)^2
## [1] 0.837
Another innovation in Cubist using nearest-neighbors to adjust the predictions from the rule-based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find it’s nearest neighbors and determine the average of these training set points. See Quinlan (1993a) for the details of the adjustment as well as this blog post.
The development of rules and committees is independent of the choice
of using instances. The original C
code allowed the program
to choose whether to use instances, not use them or let the program
decide. Our approach is to build a model with the cubist()
function that is ignorant to the decision about instances. When samples
are predicted, the argument neighbors
can be used to adjust
the rule-based model predictions (or not).
We can add instances to the previously fit committee model:
<- predict(com_model, test_pred, neighbors = 5)
inst_pred ## RMSE
sqrt(mean((inst_pred - test_resp)^2))
## [1] 0.00565
## R^2
cor(inst_pred, test_resp)^2
## [1] 0.851
Note that the previous models used the implicit default of
neighbors = 0
for their predictions.
It may also be useful to see how the different models fit a single
predictor. Here is the test set data for a model with one predictor
(Gr_Liv_Area
), 100 committees, and various values of
neighbors
:
After the initial use of the instance-based correction, there is very little change in the mainstream of the data.
R modeling packages such as caret
,
tidymodels
, and mlr3
can be used to tune the
model. See the examples
here for more details.
It should be noted that this variable importance measure does not capture the influence of the predictors when using the instance-based correction.
Rules from a Cubist model can be viewed using summary
as
follows:
summary(model_tree)
##
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Nov 8 19:39:40 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model:
##
## Rule 1: [107 cases, mean 0.6922308, range 0.6135074 to 0.725238, est err 0.0086873]
##
## if
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 2.8339713 + 2.89e-05 Gr_Liv_Area - 0.0011 Year_Sold
## + 2e-05 Year_Built + 1.1e-06 Total_Bsmt_SF
## + 0.0003 Bsmt_Full_Bath
##
## Rule 2: [333 cases, mean 0.7065404, range 0.6720027 to 0.7540954, est err 0.0054064]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = 0.4666239 + 1.87e-05 Gr_Liv_Area + 0.00011 Year_Built
## + 3.4e-06 Total_Bsmt_SF + 0.0009 Bsmt_Full_Bath
## + 7e-08 Lot_Area
##
## Rule 3: [176 cases, mean 0.7106560, range 0.6802335 to 0.7269588, est err 0.0030744]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.01018 + 0.00034 Year_Built + 1.79e-05 Gr_Liv_Area
## + 2.2e-06 Total_Bsmt_SF + 0.0006 Bsmt_Full_Bath
##
## Rule 4: [87 cases, mean 0.7109181, range 0.6841727 to 0.7294574, est err 0.0049527]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Gr_Liv_Area <= 1692
## then
## outcome = 0.6804143 + 1.45e-05 Gr_Liv_Area + 7e-06 Year_Built
## + 4e-07 Total_Bsmt_SF
##
## Rule 5: [35 cases, mean 0.7117702, range 0.6949773 to 0.7293048, est err 0.0047720]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.2370069 + 0.000232 Year_Built + 1.16e-05 Gr_Liv_Area
## + 8.4e-06 Total_Bsmt_SF + 0.0013 Bsmt_Full_Bath
##
## Rule 6: [1428 cases, mean 0.7131603, range 0.6135074 to 0.7540954, est err 0.0043013]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## then
## outcome = 0.3519752 + 1.26e-05 Gr_Liv_Area + 0.000171 Year_Built
## + 7.2e-06 Total_Bsmt_SF + 1.8e-07 Lot_Area
##
## Rule 7: [54 cases, mean 0.7134706, range 0.6808683 to 0.7362626, est err 0.0044510]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 0.3791743 + 1.48e-05 Gr_Liv_Area + 0.000151 Year_Built
## + 4.7e-06 Total_Bsmt_SF + 1.3e-07 Lot_Area
## + 0.0004 Bsmt_Full_Bath
##
## Rule 8: [997 cases, mean 0.7173781, range 0.6792599 to 0.74771, est err 0.0034352]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## then
## outcome = 0.3008866 + 1.41e-05 Gr_Liv_Area + 0.000195 Year_Built
## + 9e-06 Total_Bsmt_SF + 0.0026 Bsmt_Full_Bath + 5e-08 Lot_Area
##
## Rule 9: [166 cases, mean 0.7354736, range 0.711188 to 0.7687976, est err 0.0044621]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 0.451118 + 1.11e-05 Gr_Liv_Area + 0.000126 Year_Built
## + 7.5e-06 Total_Bsmt_SF + 4e-08 Lot_Area
##
## Rule 10: [128 cases, mean 0.7408611, range 0.7227189 to 0.7608459, est err 0.0035555]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 2024
## Gr_Liv_Area > 1692
## then
## outcome = 0.1972512 + 0.000248 Year_Built + 1.44e-05 Gr_Liv_Area
## + 9.1e-06 Total_Bsmt_SF + 1.7e-07 Lot_Area
## + 0.0018 Bsmt_Full_Bath
##
## Rule 11: [20 cases, mean 0.7468034, range 0.7163473 to 0.7624165, est err 0.0068615]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 2024
## then
## outcome = 0.7741479 - 1e-05 Gr_Liv_Area
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.0045926
## Relative |error| 0.40
## Correlation coefficient 0.89
##
##
## Attribute usage:
## Conds Model
##
## 98% 99% Year_Built
## 63% Neighborhood
## 50% 100% Gr_Liv_Area
## 42% Bldg_Type
## 38% 99% Total_Bsmt_SF
## 12% Central_Air
## 88% Lot_Area
## 52% Bsmt_Full_Bath
## 3% Year_Sold
##
##
## Time: 0.0 secs
The tidyRules
function in the tidyrules
package returns rules in a tibble (an extension of data frames) with
one row per rule. The tibble provides these information about the rule:
support
, mean
, min
,
max
, error
, LHS
, RHS
and committee
. The values in LHS
and
RHS
columns are strings which can be parsed as R
expressions. These can be pasted inside the parenthesis of
dplyr::filter()
to obtain the rows of the data
corresponding to the rule and evaluate the response variable.
library(tidyrules)
<- tidyRules(model_tree)
tr tr
## # A tibble: 11 × 9
## id LHS RHS support mean min max error commi…¹
## <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 Year_Built <= 1952 & C… (2.8… 107 0.692 0.614 0.725 0.00869 1
## 2 2 Neighborhood %in% c('O… (0.4… 333 0.707 0.672 0.754 0.00541 1
## 3 3 Neighborhood %in% c('C… (0.0… 176 0.711 0.680 0.727 0.00307 1
## 4 4 Neighborhood %in% c('N… (0.6… 87 0.711 0.684 0.729 0.00495 1
## 5 5 Neighborhood %in% c('N… (0.2… 35 0.712 0.695 0.729 0.00477 1
## 6 6 Neighborhood %in% c('N… (0.3… 1428 0.713 0.614 0.754 0.00430 1
## 7 7 Bldg_Type %in% c('Dupl… (0.3… 54 0.713 0.681 0.736 0.00445 1
## 8 8 Year_Built > 1952 & To… (0.3… 997 0.717 0.679 0.748 0.00344 1
## 9 9 Neighborhood %in% c('S… (0.4… 166 0.735 0.711 0.769 0.00446 1
## 10 10 Year_Built > 2004 & To… (0.1… 128 0.741 0.723 0.761 0.00356 1
## 11 11 Year_Built > 2004 & To… (0.7… 20 0.747 0.716 0.762 0.00686 1
## # … with abbreviated variable name ¹committee
c("LHS", "RHS")] tr[,
## # A tibble: 11 × 2
## LHS RHS
## <chr> <chr>
## 1 Year_Built <= 1952 & Central_Air == 'N' & Gr_Liv_Area <= 1692 (2.8…
## 2 Neighborhood %in% c('Old_Town', 'Edwards', 'Gilbert', 'Sawyer', 'Sawye… (0.4…
## 3 Neighborhood %in% c('College_Creek', 'Edwards', 'Somerset', 'Northridg… (0.0…
## 4 Neighborhood %in% c('North_Ames', 'Crawford') & Year_Built <= 1952 & G… (0.6…
## 5 Neighborhood %in% c('North_Ames', 'Old_Town', 'Sawyer', 'Northwest_Ame… (0.2…
## 6 Neighborhood %in% c('North_Ames', 'College_Creek', 'Old_Town', 'Edward… (0.3…
## 7 Bldg_Type %in% c('Duplex', 'Twnhs') & Gr_Liv_Area > 1692 (0.3…
## 8 Year_Built > 1952 & Total_Bsmt_SF > 765 & Gr_Liv_Area <= 1692 (0.3…
## 9 Neighborhood %in% c('Somerset', 'Northridge_Heights', 'Brookside', 'Cr… (0.4…
## 10 Year_Built > 2004 & Total_Bsmt_SF <= 2024 & Gr_Liv_Area > 1692 (0.1…
## 11 Year_Built > 2004 & Total_Bsmt_SF > 2024 (0.7…
# lets look at 7th rule
$LHS[7] tr
## [1] "Bldg_Type %in% c('Duplex', 'Twnhs') & Gr_Liv_Area > 1692"
$RHS[7] tr
## [1] "(0.3791743) + (1.48e-05 * Gr_Liv_Area) + (0.000151 * Year_Built) + (4.7e-06 * Total_Bsmt_SF) + (1.3e-07 * Lot_Area) + (0.0004 * Bsmt_Full_Bath)"
These results can be used to look at specific parts of the data. For example, the 7th rule predictions are:
library(dplyr)
library(rlang)
<- function(x, index = 1, model = TRUE) {
char_to_expr <- x %>% dplyr::slice(index)
x if (model) {
<- x %>% dplyr::pull(RHS) %>% rlang::parse_expr()
x else {
} <- x %>% dplyr::pull(LHS) %>% rlang::parse_expr()
x
}
x
}
<- char_to_expr(tr, 7, model = FALSE)
rule_expr <- char_to_expr(tr, 7, model = TRUE)
model_expr
# filter the data corresponding to rule 7
%>%
ames ::filter(eval_tidy(rule_expr, ames)) %>%
dplyr::mutate(sale_price_pred = eval_tidy(model_expr, .)) %>%
dplyr::select(Sale_Price, sale_price_pred) dplyr
## # A tibble: 59 × 2
## Sale_Price sale_price_pred
## <dbl> <dbl>
## 1 0.703 0.707
## 2 0.693 0.712
## 3 0.715 0.715
## 4 0.710 0.714
## 5 0.711 0.712
## 6 0.716 0.716
## 7 0.727 0.713
## 8 0.719 0.719
## 9 0.711 0.709
## 10 0.716 0.722
## # … with 49 more rows
The summary()
method for Cubist shows the usage of each
variable in either the rule conditions or the (terminal) linear model.
In actuality, many more linear models are used in prediction that are
shown in the output. Because of this, the variable usage statistics
shown at the end of the output of the summary()
function
will probably be inconsistent with the rules also shown in the output.
At each split of the tree, Cubist saves a linear model (after feature
selection) that is allowed to have terms for each variable used in the
current split or any split above it. Quinlan (1992) discusses a
smoothing algorithm where each model prediction is a linear combination
of the parent and child model along the tree. As such, the final
prediction is a function of all the linear models from the initial node
to the terminal node. The percentages shown in the Cubist output
reflects all the models involved in prediction (as opposed to the
terminal models shown in the output).
The raw usage statistics are contained in a data frame called
usage
in the cubist
object.
The caret
and vip
packages have general
variable importance functions caret::varImp()
and
vip::vi()
. When using this function on a
cubist
argument, the variable importance is a linear
combination of the usage in the rule conditions and the model.
For example, to compute the scores:
::varImp(model_tree)
caret
# or
::vi(model_tree) vip
As previously mentioned, this code is a port of the command-line
C
code. To run the C
code, the training set
data must be converted to a specific file format as detailed on the
RuleQuest website. Two files are created. The file.data
file is a header-less, comma delimited version of the data (the
file
part is a name given by the user). The
file.names
file provides information about the columns (eg.
levels for categorical data and so on). After running the C
program, another text file called file.models
, which
contains the information needed for prediction.
Once a model has been built with the R
cubist
package, the exportCubistFiles
can be
used to create the .data
, .names
and
.model
files so that the same model can be run at the
command-line.
There are a few features in the C
code that are not yet
operational in the R
package:
C
code decide on using
instances or not. The choice is more explicit in this packageC
code supports binning of predictors