Individual Exercise Solution

In addition to regression trees, we can fit classification trees when the outcome is binary or categorical. Use fl2003.RData, a subset of the data in Fearon and Laitin (2003), to fit an ensemble model that explains onset as a function of all other variables. Determine the most important variables in the ensemble, then produce a partial dependence plot showing how two variables that are not the most important relate to the predicted probability of civil war onset in a given observation. Discuss this relationship.

# set seed for replication
set.seed(0032185)

library(randomForest) # random forest ensembles
library(pdp) # partial dependence plots
library(doParallel) # parallel processing

# register parallel backend
registerDoParallel(makeCluster(parallel::detectCores()))

# load data
load('fl2003.RData')

# split into training and test sets
train <- sample(seq_len(nrow(fl)), floor((2 / 3) * nrow(fl)))
fl_train <- fl[train, ]
fl_test <- fl[-train, ]

# fit random forest model
fl_rf <- randomForest(formula = as.factor(onset) ~ ., data = fl_train,
                      ntree = 1500, mtry = 3, nodesize = 1)
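
The split above creates fl_test, but the solution does not go on to use it. One way to put a held-out set to work is to predict its classes and tabulate them against the observed outcome. The sketch below is only an illustration under stated assumptions: it uses simulated data with hypothetical variables (x1, x2, x3, y) rather than the Fearon and Laitin data, and a smaller forest so that it runs quickly.

```r
library(randomForest)

set.seed(1)

# simulated stand-in for the exercise data (hypothetical variables)
n <- 600
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
sim$y <- factor(rbinom(n, 1, plogis(-1 + 1.5 * sim$x1)))

# two-thirds / one-third split, as in the solution
idx <- sample(seq_len(n), floor((2 / 3) * n))
sim_train <- sim[idx, ]
sim_test <- sim[-idx, ]

# classification forest on the training rows only
sim_rf <- randomForest(y ~ ., data = sim_train, ntree = 200)

# predicted classes for the held-out rows
pred <- predict(sim_rf, newdata = sim_test)

# confusion matrix and overall accuracy on the test set
conf_mat <- table(observed = sim_test$y, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
conf_mat
accuracy
```

With a rare outcome such as civil war onset, overall accuracy can look high even when few onsets are predicted correctly, so the individual cells of the confusion matrix are usually more informative than the single accuracy figure.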

# variable importance plot
varImpPlot(fl_rf)
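
The plot gives a visual ranking; the same information can also be read off as numbers with importance(), which helps when picking variables that are not at the top. A minimal sketch, again on simulated data with hypothetical variable names rather than the exercise data:

```r
library(randomForest)

set.seed(1)

# simulated data: x1 drives the outcome, x2 and x3 are noise (hypothetical)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- factor(rbinom(n, 1, plogis(2 * sim$x1)))

sim_rf <- randomForest(y ~ ., data = sim, ntree = 200)

# without importance = TRUE, the forest stores only the mean decrease in Gini
imp <- importance(sim_rf)

# sort variables from most to least important
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ,
                  drop = FALSE]
imp_sorted
```

Fitting with importance = TRUE would additionally store the permutation-based measure (mean decrease in accuracy), at the cost of extra computation.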

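# partial dependence
# note: which.class indexes the levels of the response factor; if onset is
# coded 0/1, class 1 is "0" (no onset) and which.class = 2 would give the
# predicted probability of onset instead (check fl_rf$classes)
fl_part <- partial(fl_rf, pred.var = c('instab', 'ethfrac'), rug = TRUE,
                   train = fl_train, which.class = 1, prob = TRUE,
                   parallel = TRUE,
                   paropts = list(.packages = "randomForest"))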

# 2D plot
plotPartial(fl_part, rug = TRUE, train = fl_train)

# 3D plot
plotPartial(fl_part, train = fl_train, levelplot = FALSE, drape = TRUE,
            colorkey = TRUE, screen = list(z = 240, x = -60))
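
To make clearer what the plotted surface represents, partial dependence for two variables can be computed by hand: fix both variables at a grid point for every row of the training data, predict, and average the predicted probabilities. The sketch below does this on simulated data with hypothetical variables and a coarse 3 x 3 grid; it illustrates the idea and is not a reproduction of what partial() does internally.

```r
library(randomForest)

set.seed(1)

# simulated data (hypothetical variables; y depends on x1 and x2)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- factor(rbinom(n, 1, plogis(sim$x1 + sim$x2)))

sim_rf <- randomForest(y ~ ., data = sim, ntree = 100)

# coarse grid over the two variables of interest
grid <- expand.grid(
  x1 = unname(quantile(sim$x1, c(0.1, 0.5, 0.9))),
  x2 = unname(quantile(sim$x2, c(0.1, 0.5, 0.9)))
)

# for each grid point: overwrite x1 and x2 in every row, predict the
# probability of class "1", and average over all rows
grid$yhat <- vapply(seq_len(nrow(grid)), function(i) {
  tmp <- sim
  tmp$x1 <- grid$x1[i]
  tmp$x2 <- grid$x2[i]
  mean(predict(sim_rf, newdata = tmp, type = "prob")[, "1"])
}, numeric(1))

grid
```

Selecting the probability column by its level name ("1") rather than by position avoids ambiguity about which class the averaged probability refers to.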

References

Fearon, James D., and David D. Laitin. 2003. “Ethnicity, Insurgency, and Civil War.” The American Political Science Review 97 (1): 75–90.