BART - A Bayesian machine learning workflow for complex spatial data

In our recent publication from 2020, we analyzed COVID-19 incidence and socioeconomic, infrastructural, and built environment characteristics using a multimethod approach. There are 401 counties in Germany; as shown in the figure below, they vary in size, with counties in Southern Germany generally being smaller and more densely populated. Natural log-transformed, age-adjusted incidence rates of COVID-19 as of April 1st are shown, indicating spatial variation between the northeast and the south-southwest of the study area.

After conducting a thorough spatial exploratory analysis, we used a Bayesian Additive Regression Trees (BART; Chipman, George, and McCulloch (2010)) model to identify important socioeconomic and built environment covariates associated with the COVID-19 incidence rate. BART is an ensemble-of-trees method, similar to random forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002).
Tree-based regression models have the advantage that they can flexibly fit interactions and non-linearities. A sum-of-trees model - such as BART or random forest - has an even greater ability to do so than a single-tree model. However, BART differs from these two examples in that it is based on an underlying Bayesian probability model rather than a pure algorithm (Kapelner and Bleich 2016). One advantage of the Bayesian approach is that it computes Bayesian posterior distributions to approximate the nonparametric model parameters. Its priors aim to prevent any single regression tree from dominating the ensemble, thus reducing the risk of overfitting (Kapelner and Bleich 2016; Scarpone et al. 2020).
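In short, BART models the response as a sum of m regression trees plus noise; this is the formulation from Chipman, George, and McCulloch (2010):

$$Y = \sum_{j=1}^{m} g(X; T_j, M_j) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

where each $g(X; T_j, M_j)$ denotes a regression tree with structure $T_j$ and leaf parameters $M_j$, and regularizing priors are placed on $(T_j, M_j)$ and $\sigma^2$.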

For our analysis we used the R package bartMachine (Kapelner and Bleich 2016). Unfortunately, bartMachine does not support tibble objects as input features, but apart from that it is intuitive and simple to use.
In this post I will use the data from our COVID-19 paper to analyze age-adjusted incidence rates of COVID-19 for German counties. Even though the BART model has great predictive power, we will not use it to predict new cases of COVID-19, but rather as an exploratory tool to understand which factors contribute to the spread of COVID-19 and how they interact with the incidence rate.

I want to make clear that I will use data from one specific date. We chose April 1st 2020 in order to analyze only the first wave of COVID-19 in Germany. A friend of mine is currently analyzing which factors contributed to the second and third waves. Even though many factors remained important throughout, many features gained or lost importance, and some even showed an inverted effect.

Therefore, BART is used as an exploratory tool, and with the help of Partial Dependence Plots we will gain insight into the marginal effects of the important predictors of COVID-19.

I will not focus on pre-modelling exploratory data analysis, nor explain the data engineering here. But feel free to read the paper or contact me.

Data download and pre-processing

First, we need to load the packages and set the memory size and number of cores that bartMachine should be using.

# Set to use 45 GB memory - adjust this to your resources
options(java.parameters = "-Xmx45g")

# Load packages
library(bartMachine)
library(dplyr)

# Set to run on 20 threads - adjust this to your resources
set_bart_machine_num_cores(20)
## bartMachine now using 20 cores.

I would suggest double-checking that the correct amount of RAM has been made available when loading bartMachine. If the start-up message states a completely different amount of memory, try typing the java.parameters string manually instead of copy-pasting it.
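As a minimal sketch of such a check (assuming rJava is installed, which bartMachine requires anyway), you can query the JVM's maximum heap size directly and compare it against the -Xmx setting:

# Query the JVM's maximum heap size in GB (should roughly match -Xmx)
rt <- rJava::.jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
rJava::.jcall(rt, "J", "maxMemory") / 1024^3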

Next, we need to download the data and normalize the church density (Rch_den) variable.

# Function for linear stretching. New range: 0-1
range0_1 <- function(x){(x-min(x))/(max(x)-min(x))}

# Download data
data <- read.csv("https://github.com/CHEST-Lab/BART_Covid-19/raw/master/Data/GermanyTraining.csv",
                 stringsAsFactors = F) %>%
  mutate(Rch_den = range0_1(Rch_den),
         NUTS2_Fact = as.factor(NUTS2_Fact), 
         BL_ID = as.factor(BL),
         S_109 = as.factor(S_109))

# Select variables: Lat/Long, BL, NUTS2, socioeconomic, built environment and age-adjusted case rate
data <- data[c(374, 3, 4, 5, 38, 28, 65:372)]
(Printed output of the first rows omitted here: 314 columns comprising AdjRate, the X/Y coordinates, NUTS2_Fact, BL_ID, EWZ, the socioeconomic indicators S_001–S_176, a large block of built environment variables (POI counts, per-capita rates, and densities such as ff_pop, hair_pp, and Rch_den), and Pop_Den.)

The first column (AdjRate) contains the age-adjusted incidence rates. The X and Y coordinates (longitude and latitude) contain the spatial information. NUTS2_Fact represents the ID of the government region (Regierungsbezirk), BL_ID contains the name of the federal state (Bundesland), and EWZ refers to the population count. After that follows a ton of socioeconomic, infrastructural, and built environment characteristics. We will thin out the important features later and talk about their meaning, so don’t worry about the coding of the variables.

BART needs a data.frame of predictors and a vector of the response variable.

psych::skew(data$AdjRate)
## [1] 3.64179
# Response variable
y <- log(data$AdjRate)

# Data.frame of predictors
data_bm <- select(data, -c(AdjRate))

AdjRate is highly skewed (skewness = 3.64) and was therefore log-transformed. The BART results would not actually differ with the highly skewed data, since BART is non-parametric; however, to improve visual interpretability, we will use the log-rate in all figures.
Below I have printed the summary statistics of the log-transformed AdjRate, to help interpret the model outputs.

Mean   SD     Min    Max
4.13   0.70   1.75   6.51
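These statistics can be reproduced directly from the response vector:

# Summary statistics of the log-transformed incidence rate
round(c(Mean = mean(y), SD = sd(y), Min = min(y), Max = max(y)), 2)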

First BART model

It would be fine to use the default values for the hyperparameters (e.g. the number of trees); however, bartMachine comes with a simple function to compute optimal hyperparameters, called bartMachineCV. This will take quite some time to run! Therefore I have excluded this function from the script and simply build a bartMachine with the optimal parameters.
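For completeness, the omitted tuning step looks like this (not run; a sketch using bartMachine's default cross-validation grid over k, nu/q, and num_trees - the winning combination is printed to the console during the run):

# Not run: cross-validated grid search for the hyperparameters (slow!)
# bm_cv <- bartMachineCV(X = data_bm, y = y)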

bm_All <- bartMachine(X = data_bm, y = y, 
                      k=2, nu=3, q=0.9, num_trees=100, 
                      num_iterations_after_burn_in=2000, 
                      num_burn_in = 300, 
                      seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X, y, Xy, num_trees, num_burn_in, num_iterations_after_burn_in, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_All)
## bartMachine v1.2.6 for regression
## 
## training data n = 401 and p = 366 
## built in 7.3 secs on 20 cores, 100 trees, 300 burn-in and 2000 post. samples
## 
## sigsq est for y beforehand: 0.024 
## avg sigsq estimate after burn-in: 0.08411 
## 
## in-sample statistics:
##  L1 = 56.04 
##  L2 = 13.1 
##  rmse = 0.18 
##  Pseudo-Rsq = 0.9323
## p-val for shapiro-wilk test of normality of residuals: 0.00087 
## p-val for zero-mean noise: 0.96076

The Pseudo-R² of 0.93 and RMSE of 0.18 look quite promising, but let’s check the error assumptions.

check_bart_error_assumptions(bm_All)

Both plots look good. In the second one (Assessment of Heteroskedasticity) we can detect a visible pattern, though not a prominent one. From both plots I would conclude that the overall performance of the model is good, but that it might struggle with extreme values.

plot_y_vs_yhat(bm_All, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot above indicates good overall model performance. Again, we can see that the model struggles with extreme values: low values are over-predicted and high values under-predicted. We could also map the residuals geographically to check for spatial clustering. But first, we will reduce the model complexity by removing non-important variables.

Variable Selection

BART is quite a powerful machine learning tool, but there is more to it than impressive R² values! When dealing with a high-dimensional data set (like ours with 313 predictors), often only a relatively small subset of predictor variables truly influences the response variable. Occam’s razor describes this principle: simpler models should be preferred over unnecessarily complex ones.
The var_selection_by_permute_cv function performs the permutation-based variable selection introduced by Bleich et al. (2014) to reduce model complexity. Let’s run the next chunk and investigate the most important variables. You may skip this step and just use my results, as it takes quite long to run…

# Leave num_trees_for_permute small to force variables to compete for entry into the model!
var_sel <- bartMachine::var_selection_by_permute_cv(bm_All, num_trees_for_permute = 20)

# Look at the most important variables
var_sel$important_vars_cv
##  [1] "BL_ID_Baden-Württemberg" "BL_ID_Bayern"           
##  [3] "BL_ID_Hessen"            "ff_pop"                 
##  [5] "hair_pp"                 "thea_pp"                
##  [7] "NUTS2_Fact_11"           "NUTS2_Fact_27"          
##  [9] "NUTS2_Fact_40"           "NUTS2_Fact_71"          
## [11] "S_170"                   "play_dn"                
## [13] "bir_km2"                 "cc_pop"                 
## [15] "sch_den"                 "kid_den"                
## [17] "Rch_den"                 "S_109_2"                
## [19] "EWZ"                     "Pop_Den"                
## [21] "S_004"                   "S_006"                  
## [23] "S_020"                   "S_051"                  
## [25] "S_054"                   "S_066"                  
## [27] "S_070"                   "S_080"                  
## [29] "S_104"                   "S_107"                  
## [31] "S_115"                   "S_123"                  
## [33] "S_130"                   "S_146"                  
## [35] "S_153"                   "X"                      
## [37] "Y"

Below I have categorized the important variables that will be used for the second BART model. I have also removed the NUTS2_Fact and BL_ID variables.

data_subset <- data_bm %>%
  select(c(
    # Geographical Units
    X, #Longitude
    Y, #Latitude
    
    # Political units
    S_109, #Rural/Urban
    
    # Socioeconomic
    EWZ, #Population
    Pop_Den, #Population density
    S_004, #Unemployment rate under 25
    S_006, #Household income per capita 
    S_020, #Employment rate 15-<30
    S_051, #Voter participation
    S_054, #Apprenticeship positions
    S_066, #Household income
    S_070, #Debtors rate
    S_080, #Recreational space
    S_104, #Income tax
    S_107, #Tax capacity (Steuerkraft)
    S_115, #Regional population potential
    S_123, #Child poverty
    S_130, #IC train station access
    S_146, #Commuters >150km
    S_153, #Foreign guests in tourist establishments
    S_170, #Longterm unemployment rate
    
    # Built environment
    Rch_den, #Church density
    play_dn, #Playground density
    bir_km2, #Biergarten per km²
    ff_pop, #Fast food places per capita
    hair_pp, #Hairdresser per capita
    thea_pp, #Theatre per capita
    cc_pop, #Community centre density
    sch_den, #School density
    kid_den #Kindergarten density
  ))

Second BART model

With the subset of important predictors, we will build a second BART model. Again, I have already computed the optimal hyperparameters using the bartMachineCV function.

bm_final <- bartMachine(X = data_subset, y = y, 
                        k=3, nu=3, q=0.99, num_trees=225,
                        seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X, y, Xy, num_trees, num_burn_in, num_iterations_after_burn_in, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_final)
## bartMachine v1.2.6 for regression
## 
## training data n = 401 and p = 31 
## built in 5 secs on 20 cores, 225 trees, 250 burn-in and 1000 post. samples
## 
## sigsq est for y beforehand: 0.185 
## avg sigsq estimate after burn-in: 0.12811 
## 
## in-sample statistics:
##  L1 = 89.74 
##  L2 = 34.06 
##  rmse = 0.29 
##  Pseudo-Rsq = 0.824
## p-val for shapiro-wilk test of normality of residuals: 0.00122 
## p-val for zero-mean noise: 0.94939

Compared to the first BART model, the second one saw a reduction in Pseudo-R² from 0.93 to 0.82, equating to a 12% reduction in explained variability. The RMSE increased from 0.18 to 0.29. Since the model operates on the log scale, this corresponds to predictions of the age-adjusted incidence rate being off by a multiplicative factor of roughly 1.3 (e^0.29 ≈ 1.34). Again we need to check the error assumptions and fitted vs. actual values.
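The back-transformation is a quick sanity check:

# Back-transform the log-scale RMSE into a multiplicative error factor
exp(0.29)
## [1] 1.336427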

check_bart_error_assumptions(bm_final)

The new model is expected to perform slightly worse than the first, but the Q-Q and Fitted vs. Residuals plots both look good. Again, the BART model struggles with extreme (high) values.

plot_y_vs_yhat(bm_final, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot above indicates OK overall model performance. This time, however, more values fall outside the credible intervals. We will also map the residuals to test for spatial clustering.
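As a rough quantitative check (a sketch using bartMachine's calc_credible_intervals; the first column holds the lower, the second the upper bound), one could compute the share of counties whose observed log-rate falls outside the 95% credible interval of the fitted values:

# Share of observations falling outside the 95% credible intervals
ci <- calc_credible_intervals(bm_final, data_subset, ci_conf = 0.95)
mean(y < ci[, 1] | y > ci[, 2])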

Spatial autocorrelation

Spatial autocorrelation can be assessed by mapping the residuals. I have provided a cleaned shapefile of the NUTS3 regions (counties) in Germany. The mapped residuals allow us to visualize under- and overpredictions and check for spatial clustering of residuals. From this map I could not detect a discernible pattern of clustering - the residuals appear randomly distributed.

library(sf)
library(RColorBrewer)
library(tmap)

# Download shapefile
shp <- read_sf("https://github.com/STBrinkmann/data/raw/main/RKI_sf.gpkg")

# Sort shapefile so that it has the same order as data_subset
shp <- shp[order(match(shp$EWZ, data_subset$EWZ)),]
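
# Optional sanity check (my addition): stops with an error if the rows are
# misaligned - this assumes EWZ uniquely identifies each county
stopifnot(all(shp$EWZ == data_subset$EWZ))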

# Join residuals to shapefile, then map residuals
shp$resid <- bm_final$residuals
tm_shape(shp) + 
  tm_polygons(col="resid", 
              title="BART Machine Residuals\n(log incidence rate)", 
              breaks=seq(-1.75, 1.75, 0.5), midpoint=NA, palette="RdBu") + 
  tm_layout(frame = FALSE,
            inner.margins=c(0.02, 0.02, 0.02, 0.20),
            legend.position = c(0.7, 0.22),
            legend.frame = TRUE,
            legend.outside = FALSE, 
            bg.color = "white")

To determine whether the spatial autocorrelation of the residuals is statistically significant, one can compute Moran’s I. I will not explain Moran’s I in this post, but I would highly recommend these posts: Intro to GIS and Spatial Analysis and Spatial autocorrelation analysis in R.

library(spdep)
# Define neighboring polygons
nb <- poly2nb(shp, queen=TRUE)

# Assign weights to each neighboring polygon
lw <- nb2listw(nb, style="B", zero.policy=TRUE)

# Compute Moran's I statistic using a Monte-Carlo simulation 
MC <- moran.mc(shp$resid, lw, nsim=99)
MC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  shp$resid 
## weights: lw  
## number of simulations + 1: 100 
## 
## statistic = 0.059486, observed rank = 99, p-value = 0.01
## alternative hypothesis: greater

Moran’s I ranges between -1 and 1, where high (low) values indicate positive (negative) spatial autocorrelation and 0 means the data is randomly distributed. The Moran’s I statistic of 0.06 is very close to zero; although formally significant (p = 0.01), it indicates only a negligible degree of spatial autocorrelation in the residuals.

Partial Dependence Plots

In the previous steps we have built a robust BART model to predict age-adjusted incidence rates of COVID-19 for German counties. However, as mentioned in the beginning, I am interested in using BART as an exploratory tool. The high R² indicates that the model captured the non-linear relationships between the covariates and the COVID-19 incidence rate. To visualize and explore these relationships, Partial Dependence Plots (PDPs) are a great tool. PDPs are graphical outputs that illustrate the marginal effect of each independent variable on the response variable (Friedman 2002; Scarpone et al. 2017, 2020).

The R package pdp provides great tools for computing partial dependence; however, it does not support bartMachine. That is no problem, as bartMachine comes with a built-in function to generate PDPs, called pd_plot. I will demonstrate a PDP and its corresponding histogram using the example of S_115 (Regional Population Potential). The values of S_115 are plotted on the x-axis and their marginal effect on the (log-transformed) incidence rate on the y-axis.

# Set parameters to plot PDP top and histogram bottom
par(mfrow = c(2,1))

pd_plot(bm_final, "S_115", levs = c(0.0, seq(0, 1, 0.1), 1))
## ...........
hist(bm_final$X$S_115, 20, 
     main = "Histogram of the Regional Population Potential",
     xlab = "Regional Population Potential")

The regional population potential measures the likelihood of direct interactions occurring between inhabitants. The PDP indicates small marginal changes in incidence rates for low values of regional population potential, which can be interpreted as evidence that in counties with a lower probability of human interaction, there is a lower probability of viral contagion. The greatest increase in partial dependence is observed between the 20th and 80th percentiles of regional population potential index scores (14,016 to 47,067), indicating a strong non-linear effect of this variable on incidence rates.
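The percentile values quoted above can be looked up directly from the data:

# 20th and 80th percentiles of the regional population potential (S_115)
quantile(data_subset$S_115, probs = c(0.2, 0.8))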

To improve visual interpretability in the plot above, it would make sense to log-transform the data. However, this is not possible with the built-in pd_plot function. Therefore I have slightly rewritten this function to generate a ggplot output instead of a simple base R plot (the edited function is available on GitHub). Now we can treat the result as a ggplot2 object: scale the axes, adjust the theme, combine the histogram and PDP in one plot, and so on.

Below, I have included the PDPs from our paper, visualizing the partial dependence of the 10 most prevalent variables. The variable importance can be computed using the investigate_var_importance function. Based on the spatial exploratory data analysis, we separated Germany into a low rates region (LRR) and a high rates region (HRR). You can read the description and interpretation of the PDPs here.
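As a sketch of how such a ranking can be obtained (the number of replicates here is my assumption):

# Average variable inclusion proportions across 20 re-built models
var_imp <- investigate_var_importance(bm_final, num_replicates_for_avg = 20)
head(sort(var_imp$avg_var_props, decreasing = TRUE), 10)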

Partial Dependence Plots (PDP) of the 10 most prevalent variables in the final Bayesian Additive Regression Tree (BART) model. Histograms are shown for the entire country (green), for only the low rates region (LRR, teal), and for only the high rates region (HRR, purple). The PDPs indicate marginal changes in the predicted (log-transformed, age-adjusted) incidence rate per 100,000 residents (upper y-axis) for different values of each independent variable (x-axis)

Conclusion

In this post I have demonstrated that BART is a powerful machine learning tool for modelling high-dimensional, non-linear data. The modelling also showed that although many variables can be used as inputs, the explained variability is largely determined by a small subset of them. Opening these black-box models and exploring and interpreting the results is easy and intuitive using PDPs, which help us understand the non-linear relationships between the covariates and the response variable.

In future posts I will further explore the effectiveness of the variable selection function and demonstrate out-of-sample validation to test for overfitting.

References

Bleich, Justin, Adam Kapelner, Edward I. George, and Shane T. Jensen. 2014. “Variable Selection for BART: An Application to Gene Regulation.” The Annals of Applied Statistics 8 (3). https://doi.org/10.1214/14-aoas755.
Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). https://doi.org/10.1214/ss/1009213726.
Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2010. “BART: Bayesian Additive Regression Trees.” The Annals of Applied Statistics 4 (1): 266–98. https://doi.org/10.1214/09-AOAS285.
Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4): 367–78. https://doi.org/10.1016/s0167-9473(01)00065-2.
Kapelner, Adam, and Justin Bleich. 2016. “bartMachine: Machine Learning with Bayesian Additive Regression Trees.” Journal of Statistical Software 70 (4). https://doi.org/10.18637/jss.v070.i04.
Scarpone, Christopher, Sebastian T. Brinkmann, Tim Große, Daniel Sonnenwald, Martin Fuchs, and Blake Byron Walker. 2020. “A Multimethod Approach for County-Scale Geospatial Analysis of Emerging Infectious Diseases: A Cross-Sectional Case Study of COVID-19 Incidence in Germany.” International Journal of Health Geographics 19 (1). https://doi.org/10.1186/s12942-020-00225-1.
Scarpone, Christopher, Margaret G. Schmidt, Chuck E. Bulmer, and Anders Knudby. 2017. “Semi-Automated Classification of Exposed Bedrock Cover in British Columbia’s Southern Mountains Using a Random Forest Approach.” Geomorphology 285 (May): 214–24. https://doi.org/10.1016/j.geomorph.2017.02.013.