Throughout this document we use the Danube dataset from Asadi et al. (2015) to illustrate functions and concepts. The dataset is provided in the package as danube; see the corresponding help page for details and related functions.
# Plot the physical locations of measuring stations, indicating flow volume by width
plotDanube(useStationVolume = TRUE, useConnectionVolume = TRUE, xyRatio = 1.5)
The goal is to study the extremal tail of a multivariate random vector X = (X1, …, Xd). Here, we are interested only in the extremal dependence and therefore normalize the marginal distributions Fj of Xj to standard Pareto distributions by
$$
X_j := 1/\{1 - F_j(X_j)\}, \quad j = 1, \dots, d.
(\#eq:pareto)
$$
We assume in the sequel that the vector X has been normalized to standard Pareto margins.
A multivariate Pareto distribution (MPD) is defined as the limiting distribution of the exceedances of X over a high threshold, where the multivariate threshold is chosen in terms of the l∞-norm. For any u > 0, we define the exceedance distribution as
$$
X^u = X \mid \|X\|_\infty > u.
(\#eq:exc)
$$
By sending u → ∞ and normalizing properly, the random vector X^u converges to a limiting distribution Y called the MPD: Y = lim_{u → ∞} X^u/u, where the limit is in terms of convergence in distribution. By construction, the distribution of the MPD Y is defined on the space ℒ = {x ∈ [0, ∞)d : ∥x∥∞ > 1}, and if Y possesses a density fY, then it is proportional to the so-called exponent measure density λ; for details see Engelke and Hitz (2020).
In practice, we use the approximation Y ≈ X^u for some large value u, where u can be chosen as the pth quantile of the distribution of ∥X∥∞. Given n observations X1, …, Xn of X, the function data2mpareto() first computes the standardization in @ref(eq:pareto) based on the empirical distribution functions F̂j and then selects the exceedances X^u according to @ref(eq:exc).
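For illustration, a minimal sketch of this step for the danube data is shown below; it assumes that the raw observations are stored in danube$data_clustered (see the danube help page) and uses an arbitrary threshold probability p = 0.8. The resulting object Y is reused in later examples.
# Raw observations, one column per station (assumed to be stored in danube$data_clustered)
X <- danube$data_clustered
# Standardize margins empirically and keep exceedances of the 80% quantile of the sup-norm
Y <- data2mpareto(data = X, p = 0.8)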
The extremal logistic distribution with parameter θ ∈ (0, 1) induces a multivariate Pareto distribution with density
$$
f_Y(y) \propto \left(y_1^{-1/\theta} + \dots + y_d^{-1/\theta}\right)^{\theta - d} \prod_{i=1}^d y_i^{-1/\theta - 1}.
$$
The parameter θ governs the extremal dependence, ranging from complete dependence for θ → 0 to independence for θ → 1.
The function rmpareto() generates samples from an MPD Y based on the exact algorithm described in Engelke and Hitz (2020).
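For example, a small sketch simulating from a trivariate logistic MPD with θ = 0.5; the arguments (n, model, d, par) mirror the rmpareto() call used later in this document.
set.seed(42)
# 100 exceedances from a 3-dimensional logistic MPD with dependence parameter 0.5
Ysim_log <- rmpareto(n = 100, model = "logistic", d = 3, par = 0.5)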
The d-dimensional Hüsler–Reiss distribution (Hüsler and Reiss, 1989) is parameterized by the variogram matrix Γ = {Γij}1 ≤ i, j ≤ d. The corresponding density of the exponent measure can be written for any k ∈ {1, …, d} as
$$
\lambda(y) = y_k^{-2} \prod_{i \neq k} y_i^{-1} \, \phi_{d-1}\left(\tilde y_{\setminus k}; \Sigma^{(k)}\right), \quad y \in \mathcal L,
(\#eq:fYHR)
$$
where ϕp(⋅; Σ) is the density of a centred p-dimensional normal distribution with covariance matrix Σ, ỹ = {log (yi/yk) + Γik/2}i = 1, …, d and
$$
\Sigma^{(k)} = \frac{1}{2} \left\{\Gamma_{ik} + \Gamma_{jk} - \Gamma_{ij}\right\}_{i, j \neq k}.
(\#eq:Sigmak)
$$
The matrix Σ(k) is positive definite and will play an important role in the theory of extremal graphical models. The representation of the density in @ref(eq:fYHR) seems to depend on the choice of k, but, in fact, the value of the right-hand side of this equation is independent of k. The Hüsler–Reiss multivariate Pareto distribution has density fY(y) ∝ λ(y) and the strength of dependence between the ith and jth component is parameterized by Γij, ranging from complete dependence for Γij → 0 to independence for Γij → ∞.
The extension of Hüsler–Reiss distributions to random fields are Brown–Resnick processes (Kabluchko et al., 2009), which are widely used models for spatial extremes.
Note that we can also generate samples from the corresponding max-stable distribution with the function rmstable(), following the exact algorithm in Dombry et al. (2016).
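As a sketch, the following simulates from a trivariate Hüsler–Reiss MPD and, assuming rmstable() takes the same arguments as rmpareto(), from the corresponding max-stable distribution; the variogram matrix G_sim is an arbitrary illustrative choice.
# An illustrative 3 x 3 Hüsler–Reiss variogram matrix
G_sim <- cbind(
  c(0, 1.5, 1.5),
  c(1.5, 0, 2),
  c(1.5, 2, 0)
)
# Multivariate Pareto and max-stable samples with the same parameter matrix
Ysim_HR <- rmpareto(n = 100, model = "HR", d = 3, par = G_sim)
Xsim_HR <- rmstable(n = 100, model = "HR", d = 3, par = G_sim)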
The extremal correlation χij ∈ [0, 1] measures the dependence between the largest values of the random variables Xi and Xj. It is defined as
$$
\chi_{ij} = \lim_{p \to 1} \mathbb P\left\{F_i(X_i) > p \mid F_j(X_j) > p\right\},
$$
where the boundary cases 0 and 1 correspond to asymptotic independence and complete dependence, respectively.
For n observations X1, …, Xn of the d-dimensional vector X, we can empirically estimate the d × d matrix of all pairwise extremal correlations for a fixed threshold p close to 1.
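For instance, with the raw observations X from above and an illustrative threshold probability of p = 0.8:
# Empirical extremal correlation matrix at threshold probability 0.8
chi_hat_p <- emp_chi(data = X, p = 0.8)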
In this and other functions, the default argument p = NULL means that the data are already assumed to be on MPD scale and no thresholding is performed. For the danube data, we may therefore directly use Y instead of X.
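A minimal sketch, reusing the MPD-scale data Y from above (the matrix chi_hat is used again in later comparisons):
# Empirical extremal correlations directly from the MPD-scale data (no thresholding)
chi_hat <- emp_chi(data = Y)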
For the Hüsler–Reiss distribution with parameter matrix Γ = {Γij}1 ≤ i, j ≤ d, the extremal correlation is given by
$$
\chi_{ij} = 2 - 2 \Phi(\sqrt{\Gamma_{ij}}/2),
$$
where Φ is the standard normal distribution function. We can use the functions Gamma2chi() and chi2Gamma() to switch between the two coefficients.
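As a quick numerical check of this formula, the two lines below should agree (assuming Gamma2chi() also accepts scalar arguments; pnorm is the standard normal distribution function Φ):
# Extremal correlation implied by Gamma = 2, via the package and via the explicit formula
Gamma2chi(2)
2 - 2 * pnorm(sqrt(2) / 2)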
There exist several other summary statistics for extremal dependence. The extremal variogram is introduced in Engelke and Volgushev (2022) and turns out to be particularly useful for inference of extremal graphical models discussed below.
For any root node k ∈ V and threshold probability p ∈ (0, 1), the pre-asymptotic extremal variogram is defined as the matrix Γ(k) with entries
$$
\Gamma^{(k)}_{ij} = \operatorname{Var}\left[\log\{1 - F_i(X_i)\} - \log\{1 - F_j(X_j)\} \mid F_k(X_k) > p\right], \quad i, j \in V,
$$
whenever the right-hand side exists. Note that −log {1 − Fi(Xi)} transforms Xi to a standard exponential distribution, such that Γ(k) is simply the variogram matrix of X on exponential scale, conditioned on the kth variable being large. The limit as p → 1 is called the extremal variogram and can be expressed in terms of the MPD Y:
$$
\Gamma^{(k)}_{ij} = \operatorname{Var}\left(\log Y_i - \log Y_j \mid Y_k > 1\right), \quad i, j \in V.
$$
Weak/strong extremal dependence is indicated by large/small values of the extremal variogram, respectively. The function emp_vario() estimates the (pre-asymptotic) extremal variogram, for instance for k = 1.
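For example, reusing the MPD-scale data Y from above:
# Empirical extremal variogram with respect to root node k = 1
Gamma_1_hat <- emp_vario(data = Y, k = 1)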
In general, the matrices Γ(k) can be different for k ∈ V, but for example for the Hüsler–Reiss distribution, they all coincide.
For the Hüsler–Reiss distribution with parameter matrix Γ, the extremal variogram matrices satisfy Γ = Γ(1) = … = Γ(d).
In this case it makes sense to estimate the extremal variogram Γ̂ as the average of the estimators Γ̂(k):
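In emp_vario() this corresponds to omitting the argument k; a sketch, assuming that averaging over all root nodes is the default behaviour when k is not supplied:
# Averaged empirical extremal variogram over all root nodes k
Gamma_hat <- emp_vario(data = Y)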
Let G = (V, E) be an undirected graph with index set V = {1, …, d} and edges E ⊂ V × V. The figure below shows examples of different graphical structures: a tree, a decomposable graph and a non-decomposable graph. Engelke and Hitz (2020) introduce a new notion of extremal conditional independence for MPDs, denoted by ⊥e. They define an extremal graphical model on G as an MPD Y = (Yj : j ∈ V) that satisfies the pairwise Markov property Yi ⊥e Yj ∣ YV∖{i, j} if (i, j) ∉ E, that is, Yi and Yj are conditionally independent in the extremal sense ⊥e given all other nodes whenever there is no edge between i and j in G.
If the MPD possesses a density fY, then the graph G has to be connected. Engelke and Hitz (2020) then show a Hammersley–Clifford theorem stating that for an extremal graphical model on a decomposable graph G, the density fY factorizes into the marginal densities on the cliques.
# A tree graph, a decomposable graph, and a non-decomposable (cycle) graph
plot(graph_from_literal(1--2--3, 2--4))
plot(graph_from_literal(1--2--3--1, 2--4))
plot(graph_from_literal(1--2--3--4--1))
A tree T = (V, E) is a particularly simple type of graph, which is connected and does not have cycles. This implies that there are exactly |E| = d − 1 edges. The Hammersley–Clifford theorem shown in Engelke and Hitz (2020) yields that the density of an extremal tree model Y on the tree T = (V, E) factorizes into
$$
f_Y(y) \propto \prod_{(i,j) \in E} \frac{\lambda_{ij}(y_i, y_j)}{y_i^{-2} y_j^{-2}} \prod_{i \in V} y_i^{-2},
(\#eq:treedens)
$$
where λij are the bivariate marginals of the exponent measure density λ corresponding to Y. The d-dimensional density of the MPD is therefore reduced to bivariate interactions only.
For extremal graphical models on trees, the extremal variograms Γ(k), k ∈ V, are very natural since they define a so-called additive tree metric, that is,
$$
\Gamma^{(k)}_{ij} = \sum_{(s,t) \in \mathrm{ph}(ij)} \Gamma^{(k)}_{st}, \quad i, j \in V,
(\#eq:treemetric)
$$
where ph(ij) denotes the path between node i and node j on the tree T.
By representation @ref(eq:treedens) we can specify any bivariate models λij for (i, j) ∈ E and combine them to a valid tree model. For instance, if we use bivariate Hüsler–Reiss distributions on all edges of the tree T, the resulting d-dimensional tree model is again Hüsler–Reiss and its parameter matrix Γ is implied by @ref(eq:treemetric). If we use bivariate logistic distributions on all edges, the resulting extremal tree model is not a d-dimensional logistic model, but a more flexible model with d − 1 parameters.
The function rmpareto_tree() generates samples from an extremal tree model Y.
set.seed(42)
# Generate a random tree structure on d = 4 nodes together with a valid parameter matrix
my_model <- generate_random_model(d = 4, graph_type = "tree")
# Simulate 100 exceedances from the Hüsler–Reiss extremal tree model
Ysim <- rmpareto_tree(
  n = 100,
  model = "HR",
  tree = my_model$graph,
  par = my_model$Gamma
)
# Alternatively, use bivariate logistic models with one parameter per edge of the tree
theta_vec <- c(.2, .8, .3)
Ysim <- rmpareto_tree(
  n = 100,
  model = "logistic",
  tree = my_model$graph,
  par = theta_vec
)
Note that we can also generate samples from the corresponding max-stable distribution with the function rmstable_tree().
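A sketch, reusing the tree model generated above and assuming that rmstable_tree() mirrors the call signature of rmpareto_tree():
# Max-stable counterpart of the Hüsler–Reiss tree model above
Xsim_tree <- rmstable_tree(
  n = 100,
  model = "HR",
  tree = my_model$graph,
  par = my_model$Gamma
)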
For a given tree T = (V, E) and a parametric model {fY(y; θ) : θ ∈ Θ} of MPDs on the tree T, estimation of the model parameters is fairly straightforward. If θ = {θij : (i, j) ∈ E} consists of one parameter per edge, then thanks to the factorization @ref(eq:treedens) we may fit each parameter θij separately. This can be done by (censored) maximum likelihood methods, or by other methods such as M-estimation.
If provided with a tree and data, the function fmpareto_graph_HR() estimates the d − 1 parameters of a Hüsler–Reiss model on that tree. As an example, we fit an extremal tree model to the (undirected) tree resulting from the flow connections in the danube data set and compare the fitted with the empirical extremal coefficients.
# Utility function to plot fitted parameters against true/empirical ones
plot_fitted_params <- function(G0, G1, xlab = 'True', ylab = 'Fitted'){
return(
ggplot()
+ geom_point(aes(
x = G0[upper.tri(G0)],
y = G1[upper.tri(G1)]
))
+ geom_abline(slope = 1, intercept = 0)
+ xlab(xlab)
+ ylab(ylab)
)
}
# Get flow graph from package
flow_graph <- getDanubeFlowGraph()
# Fit parameter matrix with this graph structure
flow_Gamma <- fmpareto_graph_HR(data = Y, graph = flow_graph)
# Compute likelihood/ICs and plot parameters
flow_loglik <- loglik_HR(
data = Y,
Gamma = flow_Gamma,
graph = flow_graph
)
plot_fitted_params(chi_hat, Gamma2chi(flow_Gamma), xlab = 'Empirical')
In practice, the underlying conditional independence tree T = (V, E) is usually unknown and has to be estimated in a data-driven way. For extremal tree models, it turns out that this can be done efficiently in a completely non-parametric way; see Engelke and Volgushev (2022) for details.
The method is based on the notion of a minimum spanning tree. For a set of symmetric weights ρij > 0 associated with any pair of nodes i, j ∈ V, i ≠ j, the minimum spanning tree is defined as the tree structure that minimizes the sum of the weights over its edges,
$$
T_{\mathrm{mst}} = \arg\min_{T' = (V, E')} \sum_{(i,j) \in E'} \rho_{ij},
$$
where the minimum is taken over all spanning trees on the node set V.
Engelke and Volgushev (2022) show that if Y is an extremal graphical model on an unknown tree T, then the minimum spanning tree with the extremal variogram ρij = Γij(k) as weights is unique and satisfies Tmst = T. Using empirical estimates Γ̂ij(k) as weights, we can thus consistently recover the underlying tree structure in a fully non-parametric way.
In fact, taking the average of all Γ̂(k), k ∈ V, and using this as weights ρij = Γ̂ij makes better use of the data and usually improves the performance of structure estimation significantly. Engelke and Volgushev (2022) further show that the empirical extremal correlation ρij = χ̂ij may also be used for consistent tree recovery, but the performance is typically inferior to extremal variogram based algorithms.
The function emst() estimates the extremal tree structure by a minimum spanning tree algorithm based on the different summary statistics for extremal dependence. It provides the estimated tree and the implied extremal variogram matrix through @ref(eq:treemetric).
# Fit a tree graph to the data
emst_fit <- emst(data = Y, method = "vario")
# Compute likelihood/ICs, and plot fitted graph, parameters
loglik_emst <- loglik_HR(
data = Y,
Gamma = emst_fit$Gamma,
graph = emst_fit$graph
)
plotDanubeIGraph(graph = emst_fit$graph)
plot_fitted_params(chi_hat, Gamma2chi(emst_fit$Gamma), xlab = 'Empirical')
The Hüsler–Reiss Pareto distribution Y is parametrized by the variogram matrix Γ. Using the linear transformation @ref(eq:Sigmak), we obtain the covariance matrix Σ(k) for k ∈ V. The inverse, called a precision matrix, is denoted by Θ(k). Following Hentschel et al. (2022), these (d − 1) × (d − 1) matrices can be combined into the d × d precision matrix Θ, whose entries satisfy
$$
\Theta_{ij} = \Theta^{(k)}_{ij} \quad \text{for all } i, j \neq k,
$$
for any choice of k ∈ V; the entries in the kth row and column are then determined by the fact that the rows of Θ sum to zero. Engelke and Hitz (2020) show that this matrix is well-defined and encodes extremal conditional independence as zeros on the off-diagonal:
$$
Y_i \perp_e Y_j \mid Y_{V \setminus \{i,j\}} \quad \Longleftrightarrow \quad \Theta_{ij} = 0, \quad i \neq j.
$$
While the Hüsler–Reiss distribution is parameterized by the variogram matrix Γ, other objects such as Θ(k), Σ(k), Θ, and its Moore–Penrose inverse Σ := Θ+ are often required, for instance, to identify the extremal graphical structure. The functions Gamma2Sigma(), Sigma2Gamma(), Gamma2Theta(), etc. perform these transformations. Note that all of these functions are bijections. The functions Gamma2graph(), Theta2graph(), etc. create an igraph::graph() object, which represents the corresponding extremal graph structure.
Gamma <- cbind(
c(0, 1.5, 1.5, 2),
c(1.5, 0, 2, 1.5),
c(1.5, 2, 0, 1.5),
c(2, 1.5, 1.5, 0)
)
Gamma2Sigma(Gamma, k = 1)
#> [,1] [,2] [,3]
#> [1,] 1.5 0.5 1
#> [2,] 0.5 1.5 1
#> [3,] 1.0 1.0 2
Gamma2Theta(Gamma)
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0 -0.5 -0.5 0.0
#> [2,] -0.5 1.0 0.0 -0.5
#> [3,] -0.5 0.0 1.0 -0.5
#> [4,] 0.0 -0.5 -0.5 1.0
Gamma2graph(Gamma)
#> IGRAPH 591371e U--- 4 4 --
#> + edges from 591371e:
#> [1] 1--2 1--3 2--4 3--4
For Hüsler–Reiss graphical models it suffices to specify the graph structure G = (V, E) and the entries Γij of the parameter matrix for (i, j) ∈ E. The remaining entries of the matrix Γ are then implied by the graph structure (Hentschel et al., 2022).
The function complete_Gamma() takes as inputs a partially specified Γ matrix and the corresponding graph structure, and completes the Γ matrix. The mathematical theory for completion is different depending on whether the graph is decomposable or non-decomposable.
The simplest case is a tree, where the entries Γij for (i, j) ∉ E can be easily obtained from the additive tree metric property @ref(eq:treemetric).
set.seed(42)
my_tree <- generate_random_tree(d = 4)
Gamma_vec <- c(.5, 1.4, .8)
Gamma_comp <- complete_Gamma(Gamma = Gamma_vec, graph = my_tree)
print(Gamma_comp)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.0 0.5 1.9 1.3
#> [2,] 0.5 0.0 1.4 0.8
#> [3,] 1.9 1.4 0.0 2.2
#> [4,] 1.3 0.8 2.2 0.0
plot(Gamma2graph(Gamma_comp))
This also works for more general decomposable graphs, where the matrix Γ has to be specified on all cliques. For decomposable graphs, the graph structure can also be implied by Γij = NA for (i, j) ∉ E.
G <- rbind(
c(0, 5, 7, 6, NA),
c(5, 0, 14, 15, NA),
c(7, 14, 0, 5, 5),
c(6, 15, 5, 0, 6),
c(NA, NA, 5, 6, 0)
)
Gamma_comp <- complete_Gamma(Gamma = G)
plot(Gamma2graph(Gamma_comp))
For non-decomposable graphs, a valid Γ matrix and an undirected, connected graph have to be provided.
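A sketch for the non-decomposable case, reusing the valid 4 × 4 matrix Gamma defined above together with a 4-cycle:
# A 4-cycle is the simplest non-decomposable graph
cycle_graph <- graph_from_literal(1--2--3--4--1)
# Adjust the valid Gamma matrix from above so that it is consistent with the cycle structure
Gamma_cycle <- complete_Gamma(Gamma = Gamma, graph = cycle_graph)
plot(Gamma2graph(Gamma_cycle))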
Let us first suppose that a connected graph G is given. For some data set, the function fmpareto_graph_HR() fits a Hüsler–Reiss graphical model on the graph G. If the graph is decomposable, then the parameters of each clique can be fitted separately and combined into a full Γ matrix. If the cliques are small enough, then (censored) maximum likelihood estimation is feasible, otherwise the empirical extremal variogram is used. Combining the clique estimates relies on the same principle as in the complete_Gamma() function, but with some care required to ensure that the Γij estimates are consistent if (i, j) belongs to a separator set between two or more cliques.
set.seed(42)
d <- 10
my_model <- generate_random_model(d = d, graph_type = "decomposable")
plot(my_model$graph)
Ysim <- rmpareto(n = 100, d = d, model = "HR", par = my_model$Gamma)
my_fit <- fmpareto_graph_HR(data = Ysim, graph = my_model$graph, p = NULL)
plot_fitted_params(Gamma2chi(my_model$Gamma), Gamma2chi(my_fit))
If the graph G is non-decomposable, then the empirical extremal variogram is first computed and then fitted to the graph structure of G using the function complete_Gamma().
For structure learning of more general, possibly non-decomposable graphs, we can use the eglearn method. Given an estimate Γ̂ of the parameter matrix, e.g., obtained by emp_vario(), we compute the corresponding matrices Σ̂(k) through the function Gamma2Sigma(). In order to enforce sparsity, we compute the ℓ1-penalized precision matrices for each k ∈ V. For a tuning parameter ρ ≥ 0 and choice of k, the estimated (d − 1) × (d − 1) precision matrix is
$$
\widehat\Theta^{(k)}_\rho = \underset{\Theta \succeq 0}{\arg\min} \left\{ \operatorname{tr}\left(\widehat\Sigma^{(k)} \Theta\right) - \log\det \Theta + \rho \sum_{i \neq j} \left|\Theta_{ij}\right| \right\}.
$$
Since we run d different graphical lasso algorithms and the kth only enforces sparsity for all edges that do not involve the kth node, we determine the estimated graph structure Ĝρ = (V, Êρ) by majority vote:
$$
(i,j) \in \widehat E_\rho \quad \Leftrightarrow \quad \frac{1}{d-2} \, \# \left\{k \in V \setminus \{i,j\}: \left(\widehat\Theta^{(k)}_\rho \right)_{ij} \neq 0 \right\} \geq 1/2.
$$
The best parameter ρ can be chosen, for instance, as the minimizer of the BIC of the resulting models. The extremal graphical lasso is implemented in the eglearn() function. It returns the sequence of estimated graphs and, if desired, the corresponding Γ̂ρ estimates.
# Run eglearn for a suitable list of penalization parameters
rholist <- seq(0, 0.1, length.out = 11)
eglearn_fit <- eglearn(
data = Y,
rholist = rholist,
complete_Gamma = TRUE
)
# Compute the corresponding likelihoods/ICs
logliks_eglearn <- sapply(
seq_along(rholist),
FUN = function(j) {
Gamma <- eglearn_fit$Gamma[[j]]
if(is.null(Gamma)) return(c(NA, NA, NA))
loglik_HR(
data = Y,
Gamma = Gamma,
graph = eglearn_fit$graph[[j]]
)
}
)
# Plot the BIC vs rho/number of edges
ggplot(mapping = aes(x = rholist, y = logliks_eglearn['bic', ])) +
geom_line() +
geom_point(shape = 21, size = 3, stroke = 1, fill = "white") +
geom_hline(aes(yintercept = flow_loglik['bic']), lty = "dashed") +
geom_hline(aes(yintercept = loglik_emst['bic']), lty = "dotted") +
xlab("rho") +
ylab("BIC") +
scale_x_continuous(
breaks = rholist,
labels = round(rholist, 3),
sec.axis = sec_axis(
trans = ~., breaks = rholist,
labels = sapply(eglearn_fit$graph, igraph::gsize),
name = "Number of edges"
)
)