Spatial autocorrelation can severely bias transfer function
performance estimates. Telford and Birks (2009) suggested
*h*-block cross-validation as a means of obtaining unbiased
transfer function estimates. The problem is to estimate the optimal
value of *h*: too small and the performance estimates are still
over-optimistic, too large and the performance estimates are
pessimistic. Trachsel and Telford (2015) presented three methods to
estimate *h*:

- the performance of a transfer function on a spatially independent test set
- the autocorrelation in residuals of a weighted averaging transfer function
- Comparing the variance explained of a transfer function trained on random environmental data with similar spatial structure as the environmental data of influence with the correlation between the simulated data and the environmental data of influence.

```
library(palaeoSig)
library(rioja)
library(sp)
library(gstat)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(ggplot2)
# suppress warnings from MAT about duplicate samples
# (presumably 100% N. pachyderma)
<- function(...) {
MAT suppressWarnings(rioja::MAT(...))
}
```

We use the foraminifera dataset by Kucera et al. (2005). We split the dataset into two parts, a North Atlantic (NA) dataset (north of 3°N) and a South Atlantic (SA) data set (south of 3°S). We use the NA dataset as training set and the SA dataset as spatially independent test set.

```
# load data
data(Atlantic)
<- c("Core", "Latitude", "Longitude", "summ50")
meta
# N Atlantic
<- Atlantic %>%
N_Atlantic filter(Latitude > 3)
<- N_Atlantic %>%
N_Atlantic_meta select(one_of(meta)) %>%
as.data.frame() # to keep rdist.earth happy
<- N_Atlantic %>%
N_Atlantic select(-one_of(meta))
# S Atlantic
<- Atlantic %>%
S_Atlantic filter(Latitude < -3)
<- S_Atlantic %>%
S_Atlantic_meta select(one_of(meta))
<- S_Atlantic %>%
S_Atlantic select(-one_of(meta))
# calculating distances among the sampled points in the
# North Atlantic foraminifera data set
<- fields::rdist.earth(select(N_Atlantic_meta, Longitude, Latitude),
geodist miles = FALSE
)
# values of h for which h-block cross-validation is calculated
<- c(0.01, 100, 200, 400, 600, 800, 1000, 1500)
threshs
# h-block cross-validation of the NA foraminifera dataset for
# different values of h
<- map_dfr(threshs, function(h) {
res_h <- MAT(N_Atlantic, N_Atlantic_meta$summ50, k = 5, lean = FALSE)
mod <- crossval(mod, cv.method = "h-block", h.dist = geodist, h.cutoff = h)
mod
tibble(
h = h,
RMSE = performance(mod)$crossval["N05", "RMSE"],
R2 = performance(mod)$crossval["N05", "R2"]
) })
```

First, we estimate the performance of the North Atlantic (NA) foraminifera training set when applied to the spatially independent data set from the South Atlantic (SA), which in our case contains the samples used by Kucera et al. (2005) that are situated south of 3°S.

```
# Leave-one-out cross-validated RMSEP using MAT with k = 5
round(res_h[1, "RMSE"], 2)
# Predicting the South Atlantic test set
<- MAT(N_Atlantic, N_Atlantic_meta$summ50, k = 5)
mod_NA <- predict(mod_NA, newdata = S_Atlantic)$fit
pred_SA # Determining RMSEP of the SA test set
<- sqrt(mean((pred_SA[, 1] - S_Atlantic_meta$summ50)^2))
rmse_mat # RMSEP of the SA test set using MAT with k = 5
round(rmse_mat, 2)
```

The RMSEP of the NA training set is 1.15 while the RMSEP of the
spatially independent SA data set is somewhat larger: 1.92. This is
indicative of spatial autocorrelation. We have to find the the removal
distance *h* at which the *h*-block cross-validated RMSEP
and the RMSEP of the SA test set are similar.

Using linear interpolation, an *h*-block distance of 736.31 km
gives a cross-validated RMSEP equivalent to the the RMSEP of a spatially
independent test set.

The second method proposed in Trachsel and Telford is to fit a
variogram to detrended residuals of a weighted average model and use the
range of the variogram as *h*.

```
# WA model
<- crossval(WA(sqrt(N_Atlantic), N_Atlantic_meta$summ50, mono = TRUE))
modwa # residuals of the WA model
<- residuals(modwa, cv = TRUE)
wa_resid # detrend to remove edge effects (loess with span = 0.1)
<- resid(loess(wa_resid[, 1] ~ N_Atlantic_meta$summ50,
detrended_resid span = 0.1
))# copy meta data and add coordinate system
<- N_Atlantic_meta
N_Atlantic_meta_c coordinates(N_Atlantic_meta_c) <- ~ Longitude + Latitude
proj4string(N_Atlantic_meta_c) <- CRS("+proj=longlat +datum=WGS84")
# variogram of the detrended residuals of the WA model
<- variogram(detrended_resid ~ 1, data = N_Atlantic_meta_c)
v # Fitting a spherical variogram (partial sill, range and nugget are
# approximately estimated from the empirical variogram)
<- fit.variogram(v, vgm(psill = 2, "Sph", range = 1500, nugget = 0.5))
vm plot(v, vm)
```

The estimated range of a spherical variogram model fitted to the detrended residual of a WA model is 987 km.

The third method suggested is based on simulated environmental
variables with the same spatial structure as the environmental variable
of interest. The transfer function performance (r^{2}) of the
simulated variable is compared to the r^{2} between the
environmental variable of interest and the simulated variables. The
optimal value for *h* is the value that minimises the difference
these two r^{2}.

To simulate the environmental variables of interest, we first have to estimate their spatial structure with a variogram and then use kriging with the variogram to simulate environmental variables with same spatial structure as the observed variable.

```
# Estimate the variogram model for the environmental variable of interest
<- variogram(summ50 ~ 1, data = N_Atlantic_meta_c)
ve <- fit.variogram(ve, vgm(40, "Mat", 5000, .1, kappa = 1.8))
vem plot(ve, vem)
```

```
# Simulating environmental variables
<- krige(sim ~ 1,
sim locations = N_Atlantic_meta_c,
dummy = TRUE,
nsim = 100,
beta = mean(N_Atlantic_meta[, "summ50"]),
model = vem,
newdata = N_Atlantic_meta_c
)#> [using unconditional Gaussian simulation]
# convert spatialpointsdataframe back to a regular data.frame
<- as.data.frame(sim) %>%
sim select(-Longitude, -Latitude)
```

Running a MAT models for each simulation at each value of *h*
would be very slow, but since the same analogues would be chosen for any
environmental variable, we can make a much faster version of MAT.

```
# Function for h-block cross-validating several simulations at a time
<- function(y, x, noanalogues, geodist, thresh) {
mat_h1 if (!inherits(y, "dist")) {
if (is.data.frame(y) || !(ncol(y) == nrow(y) && sum(diag(y)) == 0)) {
<- dist(sqrt(y))^2 # squared chord distance
y
}
}<- as.matrix(y)
y diag(y) <- Inf
if (inherits(geodist, "dist")) {
<- as.matrix(geodist)
geodist
}sapply(seq_len(nrow(y)), function(n) {
<- geodist[n, ] >= thresh
exneigh <- x[exneigh, ]
x2 <- y[n, ][exneigh]
y2 <- which(rank(y2, ties.method = "random") <= noanalogues)
analogues colMeans(x2[analogues, ])
})
}
# h-block cross-validation of the simulated variables
<- sapply(threshs, function(h) {
simhr <- mat_h1(N_Atlantic, sim, noanalogues = 5, geodist = geodist, thresh = h)
hn diag(cor(t(hn), sim)^2)
})
# Estimating squared correlation between environmental variable of interest and
# simulated variables
<- sapply(sim, cor, N_Atlantic_meta$summ50)^2
sim_obs_r2 # Calculating sum of squares between the two squared correlations
<- apply(simhr, 2, function(x) {
so_squares sum((x - sim_obs_r2)^2)
})
```

```
%>%
simhr as.data.frame() %>%
set_names(threshs) %>%
mutate(sim_obs_r2 = sim_obs_r2) %>%
pivot_longer(cols = -sim_obs_r2, names_to = "h", values_to = "value") %>%
mutate(h = factor(h, levels = threshs)) %>%
ggplot(aes(x = sim_obs_r2, y = value)) +
geom_point() +
geom_abline() +
facet_wrap(~h) +
labs(x = "Simulated-observed environmental r²", y = "Transfer function r²")
```

```
tibble(threshs, so_squares) %>%
ggplot(aes(x = threshs, y = so_squares)) +
geom_point() +
labs(x = "h km", y = "Sum of squares")
```

Following the rule described in Trachsel and Telford the value of
*h* is estimated as *h* = 800 km.

Hence the three methods proposed result in similar values of
*h*: *h* = 736 km for the spatially independent test set,
*h* = 987 km for the variogram range method and *h* = 800
km for the variance explained method.

Kucera, M., Weinelt, M., Kiefer, T., Pflaumann, U., Hayes, A.,
Weinelt, M., Chen, M.-T., Mix, A.C., Barrows, T.T., Cortijo, E., Duprat,
J., Juggins, S., Waelbroeck, C. 2005. Reconstruction of the glacial
Atlantic and Pacific sea-surface temperatures from assemblages of
planktonic foraminifera: multi-technique approach based on
geographically constrained calibration datasets. *Quaternary Science
Reviews* 24, 951-998 doi:10.1016/j.quascirev.2004.07.014.

Telford, R.J., Birks, H.J.B. 2009. Evaluation of transfer functions
in spatially structured environments. *Quaternary Science
Reviews* 28, 1309-1316 doi:10.1016/j.quascirev.2008.12.020.

Trachsel, M., Telford, R.J. (2016). Technical note: Estimating
unbiased transfer function performances in spatially structured
environments. submitted to: *Climate of the Past* 12, 1215-1223
doi:10.5194/cp-12-1215-2016