BART - A Bayesian machine learning workflow for complex spatial data

In our recent publication from 2020, we analyzed COVID-19 incidence and socioeconomic, infrastructural, and built environment characteristics using a multimethod approach. There are 401 counties in Germany; as shown in the figure below, they vary in size, with counties in Southern Germany generally being smaller and more densely populated. Natural log-transformed, age-adjusted incidence rates of COVID-19 as of April 1st are shown, indicating spatial variation between the northeast and the south-southwest of the study area.

After conducting a thorough spatial exploratory analysis, we used a Bayesian Additive Regression Trees (BART; Chipman, George, and McCulloch (2010)) model to identify important socioeconomic and built environment covariates associated with the COVID-19 incidence rate. BART is an ensemble-of-trees method, similar to random forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002).
Tree-based regression models have the advantage that they can flexibly fit interactions and non-linearities. A sum-of-trees model - such as BART or random forest - has an even greater ability to do so than a single-tree model. However, BART differs from these two examples in that it is based on an underlying Bayesian probability model rather than a pure algorithm (Kapelner and Bleich 2016). One advantage of the Bayesian approach is that it computes Bayesian posterior distributions to approximate the nonparametric model parameters. Its priors aim to prevent any single regression tree from dominating the ensemble, thus reducing the risk of overfitting (Kapelner and Bleich 2016; Scarpone et al. 2020).
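In short, BART models the response as a sum of m regression trees plus noise; this is the formulation from Chipman, George, and McCulloch (2010):

$$Y = \sum_{j=1}^{m} g(X; T_j, M_j) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

where each $g(X; T_j, M_j)$ denotes a regression tree with structure $T_j$ and leaf parameters $M_j$, and regularizing priors are placed on $(T_j, M_j)$ and $\sigma^2$.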

For our analysis we used the R package bartMachine (Kapelner and Bleich 2016). Unfortunately, bartMachine does not support tibble objects as input features, but apart from that it is intuitive and simple to use.
In this post I will use the data from our COVID-19 paper to analyze age-adjusted incidence rates of COVID-19 for German counties. Even though the BART model has great predictive power, we will not use it to predict new cases of COVID-19, but rather as an exploratory tool to understand which factors contribute to the spread of COVID-19 and how they interact with the incidence rate.

I want to make clear that I will use data from one specific date. We chose April 1st 2020 in order to analyze only the first wave of COVID-19 in Germany. A friend of mine is currently analyzing which factors contributed to the second and third waves. Even though many factors remained important throughout, many features gained or lost importance, and some even showed an inverted effect.

Therefore, BART is used as an exploratory tool, and with the help of Partial Dependence Plots we will gain insight into the marginal effects of the important predictors of COVID-19.

I will not focus on pre-modelling exploratory data analysis, nor explain the data engineering here. But feel free to read the paper or contact me.

Data download and pre-processing

First, we need to load the packages and set the memory size and number of cores that bartMachine should be using.

# Set to use 45 GB memory - adjust this to your resources
options(java.parameters = "-Xmx45g")

# Load packages
library(bartMachine)
library(dplyr)

# Set to run on 20 threads - adjust this to your resources
set_bart_machine_num_cores(20)
## bartMachine now using 20 cores.

I would suggest double-checking that the correct amount of RAM has been made available when loading bartMachine. If the start-up message states a completely different amount of memory, try typing the java.parameters string manually instead of copy-pasting it.
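As a minimal sketch of such a check (assuming rJava is installed, which bartMachine requires anyway), you can query the JVM's maximum heap size directly and compare it against the -Xmx setting:

# Query the JVM's maximum heap size in GB (should roughly match -Xmx)
rt <- rJava::.jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
rJava::.jcall(rt, "J", "maxMemory") / 1024^3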

Next, we need to download the data and normalize the church density (Rch_den) variable.

# Function for linear stretching. New range: 0-1
range0_1 <- function(x){(x-min(x))/(max(x)-min(x))}

# Download data
data <- read.csv("https://github.com/CHEST-Lab/BART_Covid-19/raw/master/Data/GermanyTraining.csv",
                 stringsAsFactors = F) %>%
  mutate(Rch_den = range0_1(Rch_den),
         NUTS2_Fact = as.factor(NUTS2_Fact), 
         BL_ID = as.factor(BL),
         S_109 = as.factor(S_109))

# Select variables: Lat/Long, BL, NUTS2, socioeconomic, built environment and age-adjusted case rate
data <- data[c(374, 3, 4, 5, 38, 28, 65:372)]
(Printed output of the first rows omitted here: 314 columns comprising AdjRate, the X/Y coordinates, NUTS2_Fact, BL_ID, EWZ, the socioeconomic indicators S_001–S_176, a large block of built environment variables (POI counts, per-capita rates, and densities such as ff_pop, hair_pp, and Rch_den), and Pop_Den.)

The first column (AdjRate) contains the age-adjusted incidence rates. The X and Y coordinates (longitude and latitude) contain the spatial information. NUTS2_Fact represents the ID of the government region (Regierungsbezirk), BL_ID contains the name of the federal state (Bundesland), and EWZ refers to the population count. After that follows a ton of socioeconomic, infrastructural, and built environment characteristics. We will thin out the important features later and talk about their meaning, so don’t worry about the coding of the variables.

BART needs a data.frame of predictors and a vector of the response variable.

psych::skew(data$AdjRate)
## [1] 3.64179
# Response variable
y <- log(data$AdjRate)

# Data.frame of predictors
data_bm <- select(data, -c(AdjRate))

AdjRate is highly skewed (skewness = 3.64) and was therefore log-transformed. The BART results would not actually differ with the highly skewed data, since BART is non-parametric; however, to improve visual interpretability, we will use the log-rate in all figures.
Below I have printed the summary statistics of the log-transformed AdjRate, to help interpret the model outputs.

Mean   SD     Min    Max
4.13   0.70   1.75   6.51
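These statistics can be reproduced directly from the response vector:

# Summary statistics of the log-transformed incidence rate
round(c(Mean = mean(y), SD = sd(y), Min = min(y), Max = max(y)), 2)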

First BART model

It would be fine to use the default values for the hyperparameters (e.g. the number of trees); however, bartMachine comes with a simple function to compute optimal hyperparameters, called bartMachineCV. This will take quite some time to run! Therefore I have excluded this function from the script and simply build a bartMachine with the optimal parameters.
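For completeness, the omitted tuning step looks like this (not run; a sketch using bartMachine's default cross-validation grid over k, nu/q, and num_trees - the winning combination is printed to the console during the run):

# Not run: cross-validated grid search for the hyperparameters (slow!)
# bm_cv <- bartMachineCV(X = data_bm, y = y)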

bm_All <- bartMachine(X = data_bm, y = y, 
                      k=2, nu=3, q=0.9, num_trees=100, 
                      num_iterations_after_burn_in=2000, 
                      num_burn_in = 300, 
                      seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X, y, Xy, num_trees, num_burn_in, num_iterations_after_burn_in, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_All)
## bartMachine v1.2.6 for regression
## 
## training data n = 401 and p = 366 
## built in 7.3 secs on 20 cores, 100 trees, 300 burn-in and 2000 post. samples
## 
## sigsq est for y beforehand: 0.024 
## avg sigsq estimate after burn-in: 0.08411 
## 
## in-sample statistics:
##  L1 = 56.04 
##  L2 = 13.1 
##  rmse = 0.18 
##  Pseudo-Rsq = 0.9323
## p-val for shapiro-wilk test of normality of residuals: 0.00087 
## p-val for zero-mean noise: 0.96076

The Pseudo-R² of 0.93 and RMSE of 0.18 look quite promising, but let’s check the error assumptions.

check_bart_error_assumptions(bm_All)

Both plots look good. In the second one (Assessment of Heteroskedasticity) we can detect a visible pattern, though not a prominent one. From both plots I would conclude that the overall performance of the model is good, but that it might struggle with extreme values.

plot_y_vs_yhat(bm_All, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot above indicates good overall model performance. Again, we can see that the model struggles with extreme values: low values are over-predicted and high values under-predicted. We could also map the residuals geographically to check for spatial clustering. But first, we will reduce the model complexity by removing non-important variables.

Variable Selection

BART is quite a powerful machine learning tool, but there is more to it than impressive R² values! When dealing with a high-dimensional data set (like ours with 313 predictors), often only a relatively small subset of predictor variables truly influences the response variable. Occam’s razor describes this principle: simpler models should be preferred over unnecessarily complex ones.
The var_selection_by_permute_cv function performs the permutation-based variable selection introduced by Bleich et al. (2014) to reduce model complexity. Let’s run the next chunk and investigate the most important variables. You may skip this step and just use my results, as it takes quite long to run…

# Leave num_trees_for_permute small to force variables to compete for entry into the model!
var_sel <- bartMachine::var_selection_by_permute_cv(bm_All, num_trees_for_permute = 20)

# Look at the most important variables
var_sel$important_vars_cv
##  [1] "BL_ID_Baden-Württemberg" "BL_ID_Bayern"           
##  [3] "BL_ID_Hessen"            "ff_pop"                 
##  [5] "hair_pp"                 "thea_pp"                
##  [7] "NUTS2_Fact_11"           "NUTS2_Fact_27"          
##  [9] "NUTS2_Fact_40"           "NUTS2_Fact_71"          
## [11] "S_170"                   "play_dn"                
## [13] "bir_km2"                 "cc_pop"                 
## [15] "sch_den"                 "kid_den"                
## [17] "Rch_den"                 "S_109_2"                
## [19] "EWZ"                     "Pop_Den"                
## [21] "S_004"                   "S_006"                  
## [23] "S_020"                   "S_051"                  
## [25] "S_054"                   "S_066"                  
## [27] "S_070"                   "S_080"                  
## [29] "S_104"                   "S_107"                  
## [31] "S_115"                   "S_123"                  
## [33] "S_130"                   "S_146"                  
## [35] "S_153"                   "X"                      
## [37] "Y"

Below I have categorized the important variables that will be used for the second BART model. I have also removed the NUTS2_Fact and BL_ID variables.

data_subset <- data_bm %>%
  select(c(
    # Geographical Units
    X, #Longitude
    Y, #Latitude
    
    # Political units
    S_109, #Rural/Urban
    
    # Socioeconomic
    EWZ, #Population
    Pop_Den, #Population density
    S_004, #Unemployment rate under 25
    S_006, #Household income per capita 
    S_020, #Employment rate 15-<30
    S_051, #Voter participation
    S_054, #Apprenticeship positions
    S_066, #Household income
    S_070, #Debtors rate
    S_080, #Recreational space
    S_104, #Income tax
    S_107, #Tax capacity (Steuerkraft)
    S_115, #Regional population potential
    S_123, #Child poverty
    S_130, #IC train station access
    S_146, #Commuters >150km
    S_153, #Foreign guests in tourist establishments
    S_170, #Longterm unemployment rate
    
    # Built environment
    Rch_den, #Church density
    play_dn, #Playground density
    bir_km2, #Biergarten per km²
    ff_pop, #Fast food places per capita
    hair_pp, #Hairdresser per capita
    thea_pp, #Theatre per capita
    cc_pop, #Community centre density
    sch_den, #School density
    kid_den #Kindergarten density
  ))

Second BART model

With the subset of important predictors, we will build a second BART model. Again, I have already computed the optimal hyperparameters using the bartMachineCV function.

bm_final <- bartMachine(X = data_subset, y = y, 
                        k=3, nu=3, q=0.99, num_trees=225,
                        seed = 1234, verbose = FALSE)
## Warning in build_bart_machine(X, y, Xy, num_trees, num_burn_in, num_iterations_after_burn_in, : Setting the seed when using parallelization does not result in deterministic output.
## If you need deterministic output, you must run "set_bart_machine_num_cores(1)" and then build the BART model with the set seed.
summary(bm_final)
## bartMachine v1.2.6 for regression
## 
## training data n = 401 and p = 31 
## built in 5 secs on 20 cores, 225 trees, 250 burn-in and 1000 post. samples
## 
## sigsq est for y beforehand: 0.185 
## avg sigsq estimate after burn-in: 0.12811 
## 
## in-sample statistics:
##  L1 = 89.74 
##  L2 = 34.06 
##  rmse = 0.29 
##  Pseudo-Rsq = 0.824
## p-val for shapiro-wilk test of normality of residuals: 0.00122 
## p-val for zero-mean noise: 0.94939

Compared to the first BART model, the second one saw a reduction in Pseudo-R² from 0.93 to 0.82, equating to a 12% reduction in explained variability. The RMSE increased from 0.18 to 0.29. Since the model operates on the log scale, this corresponds to predictions of the age-adjusted incidence rate being off by a multiplicative factor of roughly 1.3 (e^0.29 ≈ 1.34). Again we need to check the error assumptions and fitted vs. actual values.
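The back-transformation is a quick sanity check:

# Back-transform the log-scale RMSE into a multiplicative error factor
exp(0.29)
## [1] 1.336427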

check_bart_error_assumptions(bm_final)

The new model is expected to perform slightly worse than the first, but the Q-Q and Fitted vs. Residuals plots both look good. Again, the BART model struggles with extreme (high) values.

plot_y_vs_yhat(bm_final, credible_intervals = TRUE)

The “Fitted vs. Actual Values” plot above indicates OK overall model performance. This time, however, more values fall outside the credible intervals. We will also map the residuals to test for spatial clustering.
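As a rough quantitative check (a sketch using bartMachine's calc_credible_intervals; the first column holds the lower, the second the upper bound), one could compute the share of counties whose observed log-rate falls outside the 95% credible interval of the fitted values:

# Share of observations falling outside the 95% credible intervals
ci <- calc_credible_intervals(bm_final, data_subset, ci_conf = 0.95)
mean(y < ci[, 1] | y > ci[, 2])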

Spatial autocorrelation

Spatial autocorrelation can be assessed by mapping the residuals. I have provided a cleaned shapefile of the NUTS3 regions (counties) in Germany. The mapped residuals allow us to visualize under- and overpredictions and check for spatial clustering of residuals. From this map I could not detect a discernible pattern of clustering - the residuals appear randomly distributed.

library(sf)
library(RColorBrewer)
library(tmap)

# Download shapefile
shp <- read_sf("https://github.com/STBrinkmann/data/raw/main/RKI_sf.gpkg")

# Sort shapefile so that it has the same order as data_subset
shp <- shp[order(match(shp$EWZ, data_subset$EWZ)),]
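
# Optional sanity check (my addition): stops with an error if the rows are
# misaligned - this assumes EWZ uniquely identifies each county
stopifnot(all(shp$EWZ == data_subset$EWZ))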

# Join residuals to shapefile, then map residuals
shp$resid <- bm_final$residuals
tm_shape(shp) + 
  tm_polygons(col="resid", 
              title="BART Machine Residuals\n(log incidence rate)", 
              breaks=seq(-1.75, 1.75, 0.5), midpoint=NA, palette="RdBu") + 
  tm_layout(frame = FALSE,
            inner.margins=c(0.02, 0.02, 0.02, 0.20),
            legend.position = c(0.7, 0.22),
            legend.frame = TRUE,
            legend.outside = FALSE, 
            bg.color = "white")

To determine whether the spatial autocorrelation of the residuals is statistically significant, one can compute Moran’s I. I will not explain Moran’s I in this post, but I would highly recommend these posts: Intro to GIS and Spatial Analysis and Spatial autocorrelation analysis in R.

library(spdep)
# Define neighboring polygons
nb <- poly2nb(shp, queen=TRUE)

# Assign weights to each neighboring polygon
lw <- nb2listw(nb, style="B", zero.policy=TRUE)

# Compute Moran's I statistic using a Monte-Carlo simulation 
MC <- moran.mc(shp$resid, lw, nsim=99)
MC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  shp$resid 
## weights: lw  
## number of simulations + 1: 100 
## 
## statistic = 0.059486, observed rank = 99, p-value = 0.01
## alternative hypothesis: greater

Moran’s I ranges between -1 and 1, where high (low) values indicate positive (negative) spatial autocorrelation and 0 means the data is randomly distributed. The Moran’s I statistic of 0.06 is very close to zero; although formally significant (p = 0.01), it indicates only a negligible degree of spatial autocorrelation in the residuals.

Partial Dependence Plots

In the previous steps we have built a robust BART model to predict age-adjusted incidence rates of COVID-19 for German counties. However, as mentioned in the beginning, I am interested in using BART as an exploratory tool. The high R² indicates that the model captured the non-linear relationships between the covariates and the COVID-19 incidence rate. To visualize and explore these relationships, Partial Dependence Plots (PDPs) are a great tool. PDPs are graphical outputs that illustrate the marginal effect of each independent variable on the response variable (Friedman 2002; Scarpone et al. 2017, 2020).

The R package pdp provides great tools for computing partial dependence; however, it does not support bartMachine. That is no problem, as bartMachine comes with a built-in function to generate PDPs, called pd_plot. I will demonstrate a PDP and its corresponding histogram using the example of S_115 (Regional Population Potential). The values of S_115 are plotted on the x-axis and their marginal effect on the (log-transformed) incidence rate on the y-axis.

# Set parameters to plot PDP top and histogram bottom
par(mfrow = c(2,1))

pd_plot(bm_final, "S_115", levs = c(0.0, seq(0, 1, 0.1), 1))
## ...........
hist(bm_final$X$S_115, 20, 
     main = "Histogram of the Regional Population Potential",
     xlab = "Regional Population Potential")

The regional population potential measures the likelihood of direct interactions occurring between inhabitants. The PDP indicates small marginal changes in incidence rates for low values of regional population potential, which can be interpreted as evidence that in counties with a lower probability of human interaction, there is a lower probability of viral contagion. The greatest increase in partial dependence is observed between the 20th and 80th percentiles of regional population potential index scores (14,016 to 47,067), indicating a strong non-linear effect of this variable on incidence rates.
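The percentile values quoted above can be looked up directly from the data:

# 20th and 80th percentiles of the regional population potential (S_115)
quantile(data_subset$S_115, probs = c(0.2, 0.8))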

To improve visual interpretability in the plot above, it would make sense to log-transform the data. However, this is not possible with the built-in pd_plot function. Therefore I have slightly rewritten this function to generate a ggplot output instead of a simple base R plot (the edited function is available on GitHub). Now we can treat the result as a ggplot2 object: scale the axes, adjust the theme, combine the histogram and PDP in one plot, and so on.

Below, I have included the PDPs from our paper, visualizing the partial dependence of the 10 most prevalent variables. The variable importance can be computed using the investigate_var_importance function. Based on the spatial exploratory data analysis, we separated Germany into a low rates region (LRR) and a high rates region (HRR). You can read the description and interpretation of the PDPs here.
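As a sketch of how such a ranking can be obtained (the number of replicates here is my assumption):

# Average variable inclusion proportions across 20 re-built models
var_imp <- investigate_var_importance(bm_final, num_replicates_for_avg = 20)
head(sort(var_imp$avg_var_props, decreasing = TRUE), 10)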

Partial Dependence Plots (PDP) of the 10 most prevalent variables in the final Bayesian Additive Regression Tree (BART) model. Histograms are shown for the entire country (green), for only the low rates region (LRR, teal), and for only the high rates region (HRR, purple). The PDPs indicate marginal changes in the predicted (log-transformed, age-adjusted) incidence rate per 100,000 residents (upper y-axis) for different values of each independent variable (x-axis)

Conclusion

In this post I have demonstrated that BART is a powerful machine learning tool for modelling high-dimensional, non-linear data. The modelling also showed that although many variables can be used as inputs, the explained variability is largely determined by a small subset of them. Opening these black-box models and exploring and interpreting the results is easy and intuitive using PDPs, which help us understand the non-linear relationships between the covariates and the response variable.

In future posts I will further explore the effectiveness of the variable selection function and demonstrate out-of-sample validation to test for overfitting.

References

Bleich, Justin, Adam Kapelner, Edward I. George, and Shane T. Jensen. 2014. “Variable Selection for BART: An Application to Gene Regulation.” The Annals of Applied Statistics 8 (3). https://doi.org/10.1214/14-aoas755.
Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). https://doi.org/10.1214/ss/1009213726.
Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2010. “BART: Bayesian Additive Regression Trees.” The Annals of Applied Statistics 4 (1): 266–98. https://doi.org/10.1214/09-AOAS285.
Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4): 367–78. https://doi.org/10.1016/s0167-9473(01)00065-2.
Kapelner, Adam, and Justin Bleich. 2016. “bartMachine: Machine Learning with Bayesian Additive Regression Trees.” Journal of Statistical Software 70 (4). https://doi.org/10.18637/jss.v070.i04.
Scarpone, Christopher, Sebastian T. Brinkmann, Tim Große, Daniel Sonnenwald, Martin Fuchs, and Blake Byron Walker. 2020. “A Multimethod Approach for County-Scale Geospatial Analysis of Emerging Infectious Diseases: A Cross-Sectional Case Study of COVID-19 Incidence in Germany.” International Journal of Health Geographics 19 (1). https://doi.org/10.1186/s12942-020-00225-1.
Scarpone, Christopher, Margaret G. Schmidt, Chuck E. Bulmer, and Anders Knudby. 2017. “Semi-Automated Classification of Exposed Bedrock Cover in British Columbia’s Southern Mountains Using a Random Forest Approach.” Geomorphology 285 (May): 214–24. https://doi.org/10.1016/j.geomorph.2017.02.013.