BART - A Bayesian machine learning workflow for complex spatial data

Apr 20, 2021 · 23 min read

In our recent publication from 2020, we analyzed COVID-19 incidence and socioeconomic, infrastructural, and built environment characteristics using a multimethod approach. Germany has 401 counties, which vary greatly in size. Counties in Southern Germany are generally smaller and have higher population densities. The figure below shows the natural log-transformed, age-adjusted COVID-19 incidence rates as of April 1, 2020, highlighting spatial differences between the northeast and the south-southwest.

After conducting a thorough spatial exploratory analysis, we applied a Bayesian Additive Regression Trees (BART; (Chipman, George, and McCulloch 2010)) model to identify socioeconomic and built environment covariates associated with COVID-19 incidence. BART is an ensemble-of-trees method, similar to random forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002). Tree-based regression models are flexible enough to capture interactions and non-linearities. Summation-of-trees models (such as BART or random forests) can capture even more complex relationships than single-tree models. However, BART differs from these common algorithms because it uses an underlying Bayesian probability model rather than a purely algorithmic approach (Kapelner and Bleich 2016). One key advantage of the Bayesian framework is that it computes posterior distributions to approximate the model parameters. These priors help prevent any single regression tree from dominating the ensemble, thus lowering the risk of overfitting (Kapelner and Bleich 2016; Scarpone et al. 2020).

For our analysis, we used the R package bartMachine (Kapelner and Bleich 2016). One caveat is that bartMachine does not support tibble objects as input features, but aside from that it is intuitive and simple to use.
In this post, I will illustrate how we used the same data from our COVID-19 paper to analyze age-adjusted incidence rates in German counties using BART. Even though the BART model has strong predictive capabilities, we will not use it to forecast new COVID-19 cases. Instead, we will use it as an exploratory tool to understand which factors influence the spread of COVID-19 and how they interact with incidence rates.

I want to emphasize that we are only examining data from a single date, April 1, 2020, covering the first wave of the pandemic in Germany. A friend of mine is currently investigating which factors contributed to the second and third waves. Although some variables remain important across different time points, others gain or lose relevance, and in some cases the direction of their effects appears reversed.

In sum, BART is used here as an exploratory tool, and we will generate Partial Dependence Plots (PDPs) to visualize and interpret the marginal effects of key predictors.

If you are interested in more detail about pre-modeling exploratory data analysis or data engineering, please see the paper or contact me.

Data download and pre-processing

First, we need to load the packages and set the memory size and number of cores for bartMachine.

# Set to use 45 GB memory - adjust this to your resources
options(java.parameters = "-Xmx45g")

# Load packages
library(bartMachine)
library(dplyr)

# Set to run on 20 threads - adjust this to your resources
set_bart_machine_num_cores(20)
## bartMachine now using 20 cores.

Tip: Double-check that the correct amount of RAM is being allocated after loading bartMachine. If the message shows a different number, try manually typing out the java.parameters string instead of copy-pasting.

Next, we download the data and normalize the church density variable (Rch_den).e.

# Function for linear stretching. New range: 0-1
range0_1 <- function(x){(x-min(x))/(max(x)-min(x))}

# Download data
data <- read.csv("https://github.com/CHEST-Lab/BART_Covid-19/raw/master/Data/GermanyTraining.csv",
                 stringsAsFactors = F) %>%
  mutate(Rch_den = range0_1(Rch_den),
         NUTS2_Fact = as.factor(NUTS2_Fact), 
         BL_ID = as.factor(BL),
         S_109 = as.factor(S_109))

# Select variables: Lat/ Long, BL, NUTS2, socioeconomic, build environment and age adjusted case rate
data <- data[c(374, 3, 4, 5, 38, 28, 65:372)]
AdjRate X Y NUTS2_Fact BL_ID EWZ S_001 S_002 S_003 S_004 S_005 S_006 S_007 S_008 S_009 S_010 S_011 S_012 S_013 S_014 S_015 S_016 S_017 S_018 S_019 S_020 S_021 S_022 S_023 S_024 S_025 S_026 S_027 S_028 S_029 S_030 S_031 S_032 S_033 S_034 S_035 S_036 S_037 S_038 S_039 S_040 S_041 S_042 S_043 S_044 S_045 S_046 S_047 S_048 S_049 S_050 S_051 S_052 S_053 S_054 S_055 S_056 S_057 S_058 S_059 S_060 S_061 S_062 S_063 S_064 S_065 S_066 S_067 S_068 S_069 S_070 S_071 S_072 S_073 S_074 S_075 S_076 S_077 S_078 S_079 S_080 S_081 S_082 S_083 S_084 S_085 S_086 S_087 S_088 S_089 S_090 S_091 S_092 S_093 S_094 S_095 S_096 S_097 S_098 S_099 S_100 S_101 S_102 S_103 S_104 S_105 S_106 S_107 S_108 S_109 S_110 S_111 S_112 S_113 S_114 S_115 S_116 S_117 S_118 S_119 S_120 S_121 S_122 S_123 S_124 S_125 S_126 S_127 S_128 S_129 S_130 S_131 S_132 S_133 S_134 S_135 S_136 S_137 S_138 S_139 S_140 S_141 S_142 S_143 S_144 S_145 S_146 S_147 S_148 S_149 S_150 S_151 S_152 S_153 S_154 S_155 S_156 S_157 S_158 S_159 S_160 S_161 S_162 S_163 S_164 S_165 S_166 S_167 S_168 S_169 S_170 S_171 S_172 S_173 S_174 S_175 S_176 M_km2 T_km2 P_km2 S_km2 T_km2_1 All_R_2 M_pop T_pop P_pop S_pop T_pop_1 All_R_p tot_R lib_km2 uni_km2 sch_km2 cll_km2 kid_km2 ply_km2 the_km2 ngh_km2 cin_km2 pub_km2 bar_km2 std_km2 sc_km2 hsp_km2 doc_km2 mll_km2 dc_km2 con_km2 sup_km2 pst_km2 th_km2 cc_km2 nh_km2 ac_km2 phr_km2 rst_km2 ff_km2 caf_km2 fc_km2 bir_km2 bak_km2 chm_km2 dot_km2 har_km2 prk_km2 Rch_km2 Rmu_km2 Rth_km2 All_P_2 lib_pop uni_pop sch_pop coll_pp kid_pop play_pp thea_pp nigh_pp cin_pop pub_pop bar_pop stad_pp sc_pop hosp_pp doc_pop mall_pp dc_pop con_pop sup_pop post_pp th_pop cc_pop nh_pop ac_pop phar_pp rest_pp ff_pop cafe_pp fc_pop bier_pp bake_pp chem_pp doit_pp hair_pp park_pp Rch_pop Rmu_pop Roth_pp All_P_p tot_P lib_den uni_den sch_den coll_dn kid_den play_dn thea_dn nigh_dn cin_den pub_den bar_den stad_dn sc_den hosp_dn doc_den mall_dn dc_den con_den sup_den post_dn th_den cc_den nh_den ac_den phar_dn rest_dn ff_den cafe_dn fc_den bier_dn bake_dn chem_dn doit_dn hair_dn park_dn Rch_den Rmu_den Roth_dn All_P_d Pop_Den
27.37 9.44 54.78 108 Schleswig-Holstein 89504 8.9 76.0 19.3 7.9 12.5 43.8 14.5 60.3 8.1 31.8 94.8 24.1 -10.9 72.1 27.7 44.1 51.8 18.8 48.6 22.8 39.9 17.2 32.9 85.0 74.5 2.9 2.6 5.5 9.8 11.9 9.1 23.8 19.6 9.6 20.2 10.6 8.2 2.4 42.2 6.1 6.6 3.0 19.0 30.2 14.3 10.0 11.7 0.75 78.80 2.3 70.3 48.2 34.0 92.1 109.0 44.7 13.2 91.2 9.8 29.5 6.1 8.5 15.7 14.0 2.91 1527 2333 1278.6 2986 16.0 17.6 24258 11.3 3375 2992 3373 2965.5 3.8 52.6 24.4 6.4 40.9 1.0 14.9 4568.9 444.4 50.4 15.6 27.1 -2.4 9.47 369.1 29.6 30.1 100.3 55.7 123.9 35.2 15.8 92.5 45.2 69.1 7.0 371.4 1282.4 1.3 767.3 0.0 2 0 0 1 156 2309 13725 16.2 23.5 19.6 26.5 7.2 19.1 7.41 25.6 134.5 6.24 1.80 5.1 5 98 0 0 0 500 92 543 90 602 89 180 100 478 52.4 34.2 27.7 14.0 7.7 3.6 3.2 2.3 1.5 19.8 3.5 40.1 1.6 131.7 30.9 6.23 42.5 62.0 18.1 25.3 5.8 67.8 8.5 1.9 35.2 92.5 100.0 42826.6 4.1 51.8 24.8 97.0 34.8 48.8 16.4 0.00 0.62 0.44 0.44 0.95 2.45 0.00 33.81 24.40 24.01 52.45 134.66 120.52 0.08 0.00 0.08 0.06 0.30 1.67 0.10 0.12 0.04 0.91 0.08 0 0.08 0.00 0.91 0.00 0.04 0.30 0.41 0.20 0.00 0.12 0.00 0.06 0.37 1.71 1.24 0.85 0 0.00 1.02 0.22 0.04 1.26 0.02 0.04 0.04 0.06 12.46 4.47 0.00 4.47 3.35 16.76 91.62 5.59 6.70 2.23 50.28 4.47 0 4.47 0.00 50.28 0.00 2.23 16.76 22.35 11.17 0.00 6.70 0.00 3.35 20.11 93.85 68.15 46.93 0 0.00 55.86 12.29 2.23 69.27 1.12 2.23 2.23 3.35 684.89 613 0 0 0.00 0 0.01 0.16 0.01 0.01 0 0.27 0.00 0 0.00 0 0.26 0 0 0.03 0.03 0.01 0 0 0 0 0.02 0.40 0.29 0.18 0 0 0.23 0.01 0 0.32 0 0.03 0 0 0.33 4566
49.54 10.13 54.32 108 Schleswig-Holstein 247548 9.1 72.0 22.6 6.6 8.9 33.1 14.9 63.2 8.8 37.8 310.5 3.2 136.2 67.3 32.4 38.9 51.5 19.2 48.9 21.3 39.3 17.5 32.6 79.2 67.7 2.8 2.5 5.3 9.4 11.0 10.2 26.5 19.1 8.8 18.5 9.7 7.5 2.3 41.5 3.4 4.9 2.3 17.8 26.7 1.6 10.1 10.0 0.64 80.05 5.0 73.5 31.1 41.7 90.4 140.6 31.4 13.4 127.9 9.4 30.3 5.1 7.3 13.9 20.5 4.56 1564 27 1293.2 3304 12.1 14.5 35 18.2 4015 3296 3693 3844.7 6.0 54.4 28.7 5.8 27.7 0.2 5.0 886.5 383.0 51.7 17.1 31.5 7.3 9.84 295.2 30.1 26.0 74.6 67.8 88.6 34.9 30.2 91.0 67.0 39.6 7.3 399.2 2269.4 74.5 755.3 53.4 1 0 0 1 209 3112 22801 16.9 17.7 17.4 26.2 9.8 12.9 5.25 30.2 146.1 7.31 1.91 6.6 2 64 0 0 0 460 92 415 94 590 89 193 100 436 49.1 30.3 27.1 13.1 4.6 3.4 2.0 1.3 0.9 17.9 2.9 21.0 2.0 120.2 30.6 6.96 45.7 66.4 18.3 30.0 5.9 108.1 7.3 1.7 34.9 91.0 0.0 45821.3 4.6 51.5 26.0 97.5 38.3 48.1 13.6 0.20 0.69 0.14 0.68 0.66 2.37 9.06 31.13 6.29 30.83 30.10 107.42 265.92 0.11 0.00 0.06 0.04 0.33 0.53 0.04 0.11 0.06 0.51 0.33 0 0.21 0.00 0.67 0.00 0.05 0.17 0.56 0.25 0.02 0.04 0.00 0.02 0.61 1.34 1.35 0.81 0 0.00 0.93 0.17 0.08 1.10 0.01 0.16 0.06 0.04 10.75 4.85 0.00 2.83 1.62 14.95 23.83 2.02 4.85 2.83 23.03 14.95 0 9.70 0.00 30.30 0.00 2.42 7.68 25.45 11.31 0.81 1.62 0.00 0.81 27.47 60.59 61.40 36.76 0 0.00 42.01 7.68 3.64 49.69 0.40 7.27 2.83 1.62 487.18 1206 0 0 0.00 0 0.02 0.04 0.00 0.01 0 0.05 0.04 0 0.01 0 0.09 0 0 0.01 0.05 0.01 0 0 0 0 0.06 0.21 0.29 0.13 0 0 0.10 0.01 0 0.20 0 0.10 0 0 0.30 6862
43.20 10.73 53.87 108 Schleswig-Holstein 217198 8.6 69.7 26.5 7.0 8.6 35.7 17.2 58.0 8.5 36.3 176.7 18.6 52.5 75.6 24.3 40.5 56.1 18.5 48.4 21.0 46.9 17.6 31.9 81.1 85.6 2.7 2.5 5.2 10.0 8.2 7.0 25.0 21.5 10.5 23.2 12.7 9.6 3.1 44.7 2.2 6.0 3.1 19.5 36.1 1.7 9.1 12.7 0.72 79.62 1.0 70.7 54.5 43.0 104.4 49.2 22.7 16.3 60.1 10.0 25.6 4.6 11.1 12.8 14.2 2.81 163 2593 1298.6 3036 15.1 17.0 26859 23.9 3784 3055 3324 2718.6 5.7 37.1 56.8 15.4 152.6 1.7 16.8 2028.8 442.1 52.6 19.6 27.3 16.0 9.15 393.9 20.5 36.0 120.0 44.7 156.1 34.3 22.3 92.3 61.7 35.4 6.8 375.7 3039.8 106.4 724.8 36.9 1 0 0 1 101 1454 22491 16.1 25.6 16.5 31.0 9.3 20.1 6.82 26.6 132.3 6.25 1.89 5.7 3 9 0 0 0 532 90 707 82 705 88 276 99 444 43.6 31.3 17.9 15.0 3.9 3.0 1.6 0.7 0.6 43.2 7.8 19.7 2.4 109.0 27.1 5.13 38.0 65.3 17.8 26.6 5.8 87.4 11.1 2.0 34.3 92.3 100.0 38078.1 4.0 56.1 29.2 95.9 37.0 48.1 14.8 0.25 0.18 0.23 0.44 0.45 1.54 24.51 17.43 22.51 42.43 43.38 150.27 326.38 0.02 0.01 0.31 0.00 0.75 0.62 0.07 0.03 0.01 0.44 0.09 0 0.16 0.02 0.35 0.00 0.02 0.17 0.46 0.51 0.00 0.05 0.07 0.04 0.26 1.23 0.88 0.48 0 0.05 0.50 0.11 0.06 0.79 0.00 0.01 0.02 0.00 8.62 2.30 0.92 30.39 0.46 73.21 60.77 6.91 3.22 1.38 42.82 9.21 0 15.65 1.84 34.53 0.00 2.30 16.11 45.12 49.26 0.00 4.60 6.45 3.68 25.78 120.17 86.10 46.96 0 4.60 48.80 10.59 5.52 77.35 0.46 0.92 1.84 0.00 840.25 1825 0 0 0.02 0 0.04 0.04 0.00 0.00 0 0.06 0.00 0 0.01 0 0.03 0 0 0.01 0.03 0.04 0 0 0 0 0.01 0.23 0.10 0.08 0 0 0.04 0.00 0 0.14 0 0.00 0 0 0.13 6487
31.48 9.98 54.08 108 Schleswig-Holstein 79487 9.2 75.6 25.8 9.5 11.5 48.9 17.9 63.7 8.9 38.3 100.6 12.5 9.7 81.8 18.1 43.1 56.2 19.5 50.5 21.2 46.2 16.9 27.2 82.1 78.1 2.7 2.6 5.3 11.3 8.2 6.5 24.3 21.8 10.4 22.6 12.2 9.4 2.8 44.2 3.1 5.2 4.5 21.0 35.3 1.0 8.9 14.4 0.71 78.65 4.3 66.0 62.3 39.6 99.0 0.0 0.0 16.1 0.0 11.3 29.8 6.0 7.2 15.0 6.9 0.92 1572 2464 1316.4 2842 17.9 21.6 42238 14.9 3433 2872 3012 2184.3 5.3 50.7 48.2 4.5 40.5 0.5 6.0 7671.7 563.9 55.4 23.8 32.8 7.5 8.62 448.4 23.8 30.4 126.6 51.9 158.6 29.3 13.0 85.5 40.1 62.2 7.6 337.3 1650.6 35.2 775.4 3.5 2 0 0 1 1108 1645 26087 15.6 21.3 17.5 22.5 8.5 16.1 8.62 25.2 133.9 7.76 1.99 6.4 7 47 0 0 0 588 85 645 82 682 81 187 100 514 54.5 38.6 25.9 14.7 4.4 3.6 1.6 1.6 1.4 13.9 2.4 26.6 2.1 129.8 31.3 5.06 39.5 61.9 17.5 25.0 4.8 70.6 7.2 2.4 29.3 85.5 100.0 39620.9 4.5 56.2 27.7 97.9 34.7 48.2 17.1 0.13 0.11 0.30 0.43 0.56 1.52 11.25 9.52 26.65 38.50 50.19 136.11 108.19 0.01 0.00 0.00 0.00 0.03 0.27 0.03 0.01 0.01 0.13 0.00 0 0.06 0.01 0.10 0.01 0.03 0.01 0.25 0.06 0.00 0.01 0.00 0.01 0.32 0.56 0.24 0.21 0 0.00 0.32 0.03 0.01 0.13 0.03 0.03 0.03 0.00 2.96 1.26 0.00 0.00 0.00 2.52 23.90 2.52 1.26 1.26 11.32 0.00 0 5.03 1.26 8.81 1.26 2.52 1.26 22.65 5.03 0.00 1.26 0.00 1.26 28.94 50.32 21.39 18.87 0 0.00 28.94 2.52 1.26 11.32 2.52 2.52 2.52 0.00 265.45 211 0 0 0.00 0 0.00 0.05 0.00 0.00 0 0.05 0.00 0 0.00 0 0.02 0 0 0.00 0.05 0.00 0 0 0 0 0.13 0.16 0.08 0.09 0 0 0.17 0.00 0 0.04 0 0.04 0 0 0.10 3521
19.75 9.11 54.13 108 Schleswig-Holstein 133210 6.8 55.4 19.3 7.6 13.7 43.0 17.8 41.3 6.3 30.5 52.5 9.3 16.6 94.1 5.8 54.3 55.8 19.7 45.2 23.3 50.1 16.1 30.0 81.1 130.9 2.4 2.4 4.8 11.2 7.8 5.2 22.6 23.8 11.9 24.6 12.7 9.9 2.8 45.8 0.4 6.1 2.4 20.5 39.2 4.6 8.0 13.3 0.77 79.28 3.7 72.2 23.1 38.6 99.3 13.7 13.7 19.7 17.5 11.2 27.8 3.6 8.6 12.7 6.8 0.96 1793 2317 1260.8 2914 12.8 17.3 36985 13.7 3343 2924 3248 844.5 0.5 11.1 58.6 3.8 402.7 10.2 9.5 1607.3 518.4 46.1 25.3 21.7 0.0 5.26 411.9 23.6 28.8 125.9 43.7 130.8 20.6 2.9 85.1 7.9 54.4 8.3 366.3 1831.6 53.4 824.7 7.9 2 74 2 0 93 122 11103 11.8 13.2 19.9 20.2 6.9 24.2 5.35 19.1 94.6 7.64 1.91 6.1 14 71 19 65 14 1864 51 248 41 1823 50 576 87 583 70.8 73.5 -15.8 16.7 4.0 3.1 -0.1 0.5 0.5 84.3 11.7 2.6 4.9 91.7 16.0 1.77 31.4 70.1 11.8 18.9 2.5 47.2 8.6 1.9 20.6 85.1 63.0 31456.6 2.8 55.8 26.3 71.8 32.9 47.4 19.7 0.04 0.01 0.08 0.28 0.25 0.65 41.70 6.89 90.28 297.00 262.48 698.35 930.27 0.00 0.00 0.01 0.00 0.02 0.03 0.00 0.00 0.00 0.02 0.00 0 0.01 0.00 0.06 0.00 0.00 0.01 0.01 0.02 0.00 0.01 0.00 0.00 0.02 0.07 0.04 0.03 0 0.00 0.05 0.01 0.00 0.04 0.00 0.01 0.00 0.00 0.45 4.50 0.00 6.01 0.75 18.77 28.53 1.50 0.75 1.50 25.52 4.50 0 6.01 0.00 60.06 0.00 0.00 12.01 9.01 19.52 0.75 7.51 0.75 1.50 20.27 78.82 37.53 27.78 0 0.00 52.55 6.01 4.50 38.29 0.00 6.01 0.00 0.00 481.20 641 0 0 0.00 0 0.00 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0 0.02 0 0 0.00 0.00 0.00 0 0 0 0 0.00 0.03 0.01 0.00 0 0 0.01 0.00 0 0.01 0 0.00 0 0 0.01 4041
54.83 10.60 53.59 108 Schleswig-Holstein 197264 5.5 44.9 22.6 6.4 11.4 32.3 22.3 44.5 6.3 35.9 114.1 22.7 19.5 90.7 9.2 47.5 58.2 20.0 51.2 19.8 45.4 17.1 32.5 82.1 136.0 2.8 2.8 5.6 11.8 6.8 4.9 24.9 23.8 10.2 22.3 12.0 9.3 2.8 44.6 4.3 6.1 2.5 22.2 35.0 8.3 9.1 12.1 0.87 80.58 3.9 76.8 10.3 36.3 100.1 0.0 0.0 16.7 0.0 11.8 22.5 4.0 9.6 10.8 10.6 1.11 1903 2404 1257.6 2832 10.1 11.7 54979 20.9 3404 287 3031 1344.4 0.9 11.5 59.8 26.0 1673.3 3.3 5.0 2476.9 490.2 47.9 23.3 19.9 12.7 3.14 406.7 25.6 28.4 111.1 56.2 126.2 31.7 19.2 83.1 38.5 40.9 7.5 474.9 1251.8 52.9 791.1 1.6 2 36 3 0 155 193 35799 8.6 10.3 27.4 13.0 7.8 14.9 4.62 13.5 71.8 7.77 2.02 6.9 10 30 20 31 13 1545 63 2042 49 1804 48 343 94 572 71.7 79.7 -52.4 15.6 4.1 2.9 -0.5 0.2 0.3 23.6 3.4 4.9 3.2 94.4 15.0 1.85 21.7 61.0 9.1 13.4 2.6 41.8 9.6 1.7 31.7 83.1 71.7 21795.1 2.6 58.2 27.4 92.6 27.3 47.7 25.1 0.09 0.01 0.14 0.24 0.30 0.78 58.77 5.10 90.62 153.91 193.65 502.04 990.34 0.01 0.00 0.01 0.00 0.04 0.10 0.00 0.00 0.00 0.01 0.00 0 0.01 0.00 0.03 0.00 0.00 0.01 0.03 0.02 0.01 0.01 0.00 0.00 0.03 0.11 0.05 0.05 0 0.00 0.06 0.01 0.00 0.04 0.00 0.01 0.00 0.00 0.69 3.55 0.00 9.12 0.00 24.33 64.38 1.52 0.51 1.52 8.11 2.53 0 4.06 0.51 18.76 0.00 1.01 8.62 19.77 14.70 6.08 7.60 0.51 1.01 20.78 70.97 32.95 31.94 0 2.03 39.03 8.11 3.04 27.37 0.51 3.55 1.01 0.00 439.51 867 0 0 0.00 0 0.00 0.02 0.00 0.00 0 0.00 0.00 0 0.00 0 0.00 0 0 0.00 0.00 0.00 0 0 0 0 0.00 0.02 0.01 0.01 0 0 0.01 0.00 0 0.01 0 0.00 0 0 0.01 4296

The first column (AdjRate) contains age-adjusted incidence rates. The X and Y columns represent longitude and latitude. NUTS2_Fact represents government regions (Regierungsbezirke), BL_ID is the federal state (Bundesland), and EWZ is the population. The remaining columns cover socioeconomic, infrastructural, and built environment features. We will filter out only the most relevant features later.

BART requires a data.frame of predictors and a separate vector for the response variable:

psych::skew(data$AdjRate)
## [1] 3.64179
# Response variable
y <- log(data$AdjRate)

# Data.frame of predictors
data_bm <- select(data, -c(AdjRate))

Because AdjRate is highly skewed (skewness = 3.64), we apply a log transformation. Though BART is non-parametric and can handle skewed data, transforming the response variable aids in visually interpreting results.

Below is a summary of the log-transformed AdjRate:

Mean SD Min Max
4.13 0.7 1.75 6.51

First BART model

We can use default hyperparameter values for BART, but bartMachine offers a convenient tuning function called bartMachineCV to find optimal values. Because that process can be time-consuming, I’ve omitted it here. Below, I simply build a BART model using those pre-determined optimal hyperparameters.

bm_All <- bartMachine(X = data_bm, y = y, 
                      k=2, nu=3, q=0.9, num_trees=100, 
                      num_iterations_after_burn_in=2000, 
                      num_burn_in = 300, 
                      seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X = X, y = y, Xy = Xy, num_trees = num_trees, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_All)
## bartMachine v1.3.4.1 for regression
## 
## training data size: n = 401 and p = 366 
## built in 6.4 secs on 20 cores, 100 trees, 300 burn-in and 2000 post. samples
## 
## sigsq est for y beforehand: 0.024 
## avg sigsq estimate after burn-in: 0.09236 
## 
## in-sample statistics:
##  L1 = 58.88 
##  L2 = 14.93 
##  rmse = 0.19 
##  Pseudo-Rsq = 0.9228
## p-val for shapiro-wilk test of normality of residuals: 2e-05 
## p-val for zero-mean noise: 0.92439

The Pseudo-R² of 0.92 and RMSE of 0.19 look promising. Let’s check the error assumptions next:

check_bart_error_assumptions(bm_All)

Both diagnostic plots look reasonable. In the second one (Assessment of Heteroskedasticity), there is a mild pattern, but not a stark one. Overall, the model seems to fit well, though it might struggle slightly with extreme values.

plot_y_vs_yhat(bm_All, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot also demonstrates good model performance. Again, we see that the model tends to over-predict extremely low values and under-predict extremely high values. We could also map the residuals to check for spatial clustering, but let’s first reduce the model’s complexity by removing non-important predictors.

Variable Selection

Although BART yields excellent R² values, part of its power lies in its ability to handle high-dimensional data. In large datasets, often only a fraction of the predictors significantly influence the response variable. Occam’s razor suggests favoring simpler models over overly complex ones.

The function var_selection_by_permute (introduced by (Bleich et al. 2014)) helps reduce model complexity. Feel free to skip the following code and just trust my results—this step can be time-consuming:

# Leave the num_trees_for_permute small, to force variables to compete for entry into the model!
var_sel <- bartMachine::var_selection_by_permute_cv(bm_All, num_trees_for_permute = 20)

# Look at the most important variables
var_sel$important_vars_cv
##  [1] "BL_ID_Baden-Württemberg" "BL_ID_Bayern"           
##  [3] "BL_ID_Hessen"            "ff_pop"                 
##  [5] "hair_pp"                 "thea_pp"                
##  [7] "NUTS2_Fact_11"           "NUTS2_Fact_27"          
##  [9] "NUTS2_Fact_40"           "NUTS2_Fact_71"          
## [11] "S_170"                   "play_dn"                
## [13] "bir_km2"                 "cc_pop"                 
## [15] "sch_den"                 "kid_den"                
## [17] "Rch_den"                 "S_109_2"                
## [19] "EWZ"                     "Pop_Den"                
## [21] "S_004"                   "S_006"                  
## [23] "S_020"                   "S_051"                  
## [25] "S_054"                   "S_066"                  
## [27] "S_070"                   "S_080"                  
## [29] "S_104"                   "S_107"                  
## [31] "S_115"                   "S_123"                  
## [33] "S_130"                   "S_146"                  
## [35] "S_153"                   "X"                      
## [37] "Y"

Below is the subset of important variables (grouped thematically). I also removed NUTS2_Fact and BL_ID:

data_subset <- data_bm %>%
  select(c(
    # Geographical Units
    X, #Longitude
    Y, #Latitude
    
    # Political units
    S_109, #Rural/Urban
    
    # Socioeconomic
    EWZ, #Population
    Pop_Den, #Population density
    S_004, #Unemployment rate under 25
    S_006, #Household income per capita 
    S_020, #Employment rate 15-<30
    S_051, #Voter participation
    S_054, #Apprenticeship positions
    S_066, #Household income
    S_070, #Deptors rate
    S_080, #Recreationl Space
    S_104, #Income tax
    S_107, #Steuerkraft
    S_115, #Regional population potential
    S_123, #Child poverty
    S_130, #IC train station access
    S_146, #Commuters >150km
    S_153, #Foreign guests in tourist establishments
    S_170, #Longterm unemployment rate
    
    # Built environment
    Rch_den, #Church density
    play_dn, #Playground density
    bir_km2, #Biergarten per km²
    ff_pop, #Fast food places per capita
    hair_pp, #Hairdresser per capita
    thea_pp, #Theatre per capita
    cc_pop, #Community centre density
    sch_den, #School density
    kid_den #Kindergarten density
  ))

Second BART model

With this subset of key predictors, I now build a new BART model (again using pre-determined optimal hyperparameters):

bm_final <- bartMachine(X = data_subset, y = y, 
                        k=3, nu=3, q=0.99, num_trees=225,
                        seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X = X, y = y, Xy = Xy, num_trees = num_trees, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_final)
## bartMachine v1.3.4.1 for regression
## 
## training data size: n = 401 and p = 31 
## built in 6.5 secs on 20 cores, 225 trees, 250 burn-in and 1000 post. samples
## 
## sigsq est for y beforehand: 0.185 
## avg sigsq estimate after burn-in: 0.12859 
## 
## in-sample statistics:
##  L1 = 90.04 
##  L2 = 34.13 
##  rmse = 0.29 
##  Pseudo-Rsq = 0.8236
## p-val for shapiro-wilk test of normality of residuals: 0.00156 
## p-val for zero-mean noise: 0.95958

Compared to the first model, the Pseudo-R² decreases from 0.92 to 0.82, equating to a 11% reduction in explained variance. The RMSE increased from 0.19 to 0.29, indicating that the final model predicts the age-adjusted incidence rate of COVID-19 with an accuracy of roughly +/− 1.3 cases per 100,000. Let’s again check the diagnostic plots:

check_bart_error_assumptions(bm_final)

The new model’s Q-Q and residuals plots look fine, but as expected, it performs slightly worse on extreme values.

plot_y_vs_yhat(bm_final, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot shows okay performance, though more points lie outside the confidence intervals. Next, let’s map the residuals to check for any spatial clustering.

Spatial autocorrelation

We can quickly visualize residuals by linking them to a shapefile of German counties (NUTS3). Any obvious clustering of positive or negative residuals would indicate spatial autocorrelation.

library(sf)
library(RColorBrewer)
library(tmap)

# Download shapefile
shp <- read_sf("https://github.com/STBrinkmann/data/raw/main/RKI_sf.gpkg")

# Sort shapefile, that it has the same order as the data_subset
shp <- shp[order(match(shp$EWZ, data_subset$EWZ)),]

# Join residuals to shapefile, then map residuals
shp$resid <- bm_final$residuals
tm_shape(shp) + 
  tm_polygons(col="resid", 
              title="BART Machine Residuals\n(log incidence rate)", 
              breaks=seq(-1.75, 1.75, 0.5), midpoint=NA, palette="RdBu") + 
  tm_layout(frame = FALSE,
            inner.margins=c(0.02, 0.02, 0.02, 0.20),
            legend.position = c(0.7, 0.22),
            legend.frame = TRUE,
            legend.outside = FALSE, 
            bg.color = "white")

No clear geographical pattern emerges, suggesting no strong spatial clusters. For a formal test, we can compute Moran’s I. I will not explain Moran’s I in this post, but I would highly recommend these posts: Intro to GIS and Spatial Analysis and Spatial autocorrelation analysis in R.

library(spdep)
# Define neighboring polygons
nb <- poly2nb(shp, queen=TRUE)

# Assign weights to each neighboring polygon
lw <- nb2listw(nb, style="B", zero.policy=TRUE)

# Compute Moran's I statistic using a Monte-Carlo simulation 
MC <- moran.mc(shp$resid, lw, nsim=99)
MC
## 
## 	Monte-Carlo simulation of Moran I
## 
## data:  shp$resid 
## weights: lw  
## number of simulations + 1: 100 
## 
## statistic = 0.061242, observed rank = 97, p-value = 0.03
## alternative hypothesis: greater

Moran’s I ranges from -1 to 1, where values closer to ±1 indicate strong positive/negative spatial autocorrelation, and 0 implies randomness. With a Moran’s I of 0.06 (p=0.03), we conclude there is no significant spatial autocorrelation in the residuals.

Partial Dependence Plots

So far, we’ve built and refined a BART model to predict age-adjusted incidence rates of COVID-19 in German counties. However, our main goal is to explore the data rather than simply predict. BART’s high R² suggests it captures these non-linear relationships, and Partial Dependence Plots (PDPs) let us visualize and interpret them (Friedman 2002; Scarpone et al. 2017, 2020).

Although the pdp is great for computing PDPs, it does not directly support bartMachine. Fortunately, bartMachine provides its own PDP function, pd_plot. Below is a demonstration using S_115 (Regional Population Potential):

# Set parameters to plot PDP top and histogram bottom
par(mfrow = c(2,1))

pd_plot(bm_final, "S_115", levs = c(0.0, seq(0, 1, 0.1), 1))
## ...........
hist(bm_final$X$S_115, 20, 
     main = "Histogram of the Regional Population Potential",
     xlab = "Regional Population Potential")

Regional population potential measures the likelihood of direct human interactions in a region. This PDP suggests minimal marginal change in incidence at the lower end of the distribution, implying less chance of viral contagion in areas with fewer human interactions. The curve rises notably between the 20th and 80th percentiles (14,016 to 47,067), indicating a strong non-linear effect on incidence rates.

To log-transform the x-axis for clarity, you would need a custom PDP function, since pd_plot does not support that directly. I wrote a variant that returns ggplot2 objects GitHub), which lets you customize plots extensively.

Below is a multi-panel PDP from our publication, showing the top 10 variables. You can see more details and interpretation of these variables here.

Partial Dependence Plots (PDP) of the 10 most prevalent variables in the final Bayesian Additive Regression Tree (BART) model. Histograms are shown for the entire country (green), for only the low rates region (LRR, teal), and for only the high rates region (HRR, purple). The PDPs indicate marginal changes in the predicted (log-transformed, age-adjusted) incidence rate per 100,000 residents (upper y-axis) for different values of each independent variable (x-axis)

Conclusion

In this post, we developed and refined a BART-based workflow for modeling age-adjusted COVID-19 incidence rates across German counties. We started by applying BART to a high-dimensional dataset of socioeconomic, infrastructural, and built environment predictors, then used a permutation-based variable selection technique to isolate the most important features. Although BART can handle dozens or even hundreds of predictors, our results showed that only a smaller subset of variables drove most of the explanatory power.

By visualizing these relationships with Partial Dependence Plots, we uncovered non-linear effects and interactions that more conventional methods might have missed. These plots also help “open the black box,” translating an otherwise opaque machine learning model into actionable insights. Although BART performed well overall, we observed that it tended to under-predict very high values and over-predict very low ones, highlighting the challenges of modeling extreme outcomes.

In future posts, I will focus on out-of-sample validation to confirm the model’s robustness and detect any lingering overfitting issues. I also plan to dive deeper into how the variable selection procedure can be tuned for different research contexts. Ultimately, BART’s Bayesian foundation—combined with flexible tree-ensemble strategies—offers a compelling toolkit for complex, spatially heterogeneous data like COVID-19 incidence.

References

Bleich, Justin, Adam Kapelner, Edward I. George, and Shane T. Jensen. 2014. “Variable Selection for BART: An Application to Gene Regulation.” The Annals of Applied Statistics 8 (3). https://doi.org/10.1214/14-aoas755.

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). https://doi.org/10.1214/ss/1009213726.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2010. “BART: Bayesian additive regressiontrees.” The Annals of Applied Statistics 4 (1): 266–98. https://doi.org/10.1214/09-AOAS285.

Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4): 367–78. https://doi.org/10.1016/s0167-9473(01)00065-2.

Kapelner, Adam, and Justin Bleich. 2016. “bartMachine: Machine Learning with Bayesian Additive Regression Trees.” Journal of Statistical Software 70 (4). https://doi.org/10.18637/jss.v070.i04.

Scarpone, Christopher, Sebastian T. Brinkmann, Tim Große, Daniel Sonnenwald, Martin Fuchs, and Blake Byron Walker. 2020. “A Multimethod Approach for County-Scale Geospatial Analysis of Emerging Infectious Diseases: A Cross-Sectional Case Study of COVID-19 Incidence in Germany.” International Journal of Health Geographics 19 (1). https://doi.org/10.1186/s12942-020-00225-1.

Scarpone, Christopher, Margaret G. Schmidt, Chuck E. Bulmer, and Anders Knudby. 2017. “Semi-Automated Classification of Exposed Bedrock Cover in British Columbia’s Southern Mountains Using a Random Forest Approach.” Geomorphology 285 (May): 214–24. https://doi.org/10.1016/j.geomorph.2017.02.013.