Introduction and application to Danube data

library(graphicalExtremes)
library(dplyr)
library(ggplot2)
library(igraph)

Danube Data

Throughout this document we use the Danube dataset from Asadi et al. (2015) to illustrate functions and concepts. The dataset is provided in the package as danube, see the corresponding help page for details and related functions.

# Plot the physical locations of measuring stations, indicating flow volume by width
plotDanube(useStationVolume = TRUE, useConnectionVolume = TRUE, xyRatio = 1.5)

# Plot only the flow graph
danube_flow <- getDanubeFlowGraph()
plotDanubeIGraph(graph = danube_flow)

# Plot some pairwise scatter plots
plotData <- as_tibble(danube$data_clustered)
ggplot(plotData) + geom_point(aes(x = X1, y = X2))
ggplot(plotData) + geom_point(aes(x = X22, y = X28))

Multivariate Pareto distributions

Definition

The goal is to study the extremal tail of a multivariate random vector X = (X1, …, Xd). Here, we are interested only in the extremal dependence and therefore normalize the marginal distributions Fj of Xj to standard Pareto distributions by
$$ X_j \mapsto 1/\{1 - F_j(X_j)\}, \quad j = 1, \dots, d. $$
We assume in the sequel that the vector X has been normalized to standard Pareto margins.

A multivariate Pareto distribution (MPD) is defined as the limiting distribution of the exceedances of X over a high threshold, where the multivariate threshold is chosen in terms of the l∞-norm. For any u > 0, we define the exceedance distribution as
$$ X_u := X \mid \|X\|_\infty > u. $$
By sending u → ∞ and normalizing properly, the random vector Xu converges to a limiting distribution Y called the MPD: Y = limu → ∞ Xu/u, where the limit is in terms of convergence in distribution. By construction, the distribution of the MPD Y is defined on the space ℒ = {x ∈ [0, ∞)d : ‖x‖∞ > 1}, and if Y possesses a density fY, then it is proportional to the so-called exponent measure density λ; for details see Engelke and Hitz (2020).

In practice, we use the approximation Y ≈ Xu for some large value u, where u can be chosen as the pth quantile of the distribution of X. Given n observations X1, …, Xn of X, the function data2mpareto() first computes the standardization above based on the empirical distribution functions F̂j and then selects the exceedances Xu as defined above.

X <- danube$data_clustered
Y <- data2mpareto(data = X, p = .8)
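
To make the two steps concrete, here is a by-hand sketch of what data2mpareto() computes; the variable names below are ours, and the package function, which also handles ties and edge cases, should be preferred in practice.

# Standardize to Pareto scale via empirical ranks, then keep rescaled exceedances
n <- nrow(X)
X_pareto <- apply(X, 2, function(x) (n + 1) / (n + 1 - rank(x)))  # 1 / {1 - F_hat_j(x)}
u <- quantile(apply(X_pareto, 1, max), probs = 0.8)               # high threshold
Y_manual <- X_pareto[apply(X_pareto, 1, max) > u, ] / u           # Y approx. X_u / u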

Examples

The extremal logistic distribution with parameter θ ∈ (0, 1) induces a multivariate Pareto distribution with density
$$ f_Y(y) \propto \left(y_1^{-1/\theta} + \dots + y_d^{-1/\theta}\right)^{\theta - d} \prod_{i=1}^{d} y_i^{-1/\theta - 1}, \quad y \in \mathcal{L}. $$
The parameter θ governs the extremal dependence, ranging from complete dependence for θ → 0 to independence for θ → 1.

The function rmpareto() generates samples from an MPD Y based on the exact algorithm described in Engelke and Hitz (2020).

theta <- .5
Ysim <- rmpareto(n = 100, model = "logistic", d = 2, par = theta)

The d-dimensional Hüsler–Reiss distribution (Hüsler and Reiss, 1989) is parameterized by the variogram matrix Γ = {Γij}1 ≤ i, j ≤ d. The corresponding density of the exponent measure can be written for any k ∈ {1, …, d} as
$$ \lambda(y) = y_k^{-2} \prod_{i \neq k} y_i^{-1} \, \phi_{d-1}\!\left(\tilde{y}_{\setminus k}; \Sigma^{(k)}\right), \quad y \in (0, \infty)^d, $$
where ϕp(⋅; Σ) is the density of a centred p-dimensional normal distribution with covariance matrix Σ, ỹ = {log (yi/yk) + Γik/2}i = 1, …, d, the subscript ∖k denotes removal of the kth entry, and
$$ \Sigma^{(k)} = \frac{1}{2} \left\{ \Gamma_{ik} + \Gamma_{jk} - \Gamma_{ij} \right\}_{i, j \neq k}. $$
The matrix Σ(k) is positive definite and will play an important role in the theory of extremal graphical models. The representation of the density above seems to depend on the choice of k but, in fact, the value of the right-hand side is independent of k. The Hüsler–Reiss multivariate Pareto distribution has density fY(y) ∝ λ(y), and the strength of dependence between the ith and jth components is parameterized by Γij, ranging from complete dependence for Γij → 0 to independence for Γij → ∞.

Hüsler–Reiss distributions extend to random fields as Brown–Resnick processes (Kabluchko et al., 2009), which are widely used models for spatial extremes.

G <- cbind(c(0, 1.5), c(1.5, 0))
Ysim <- rmpareto(n = 100, model = "HR", par = G)

Note that we can also generate samples from the corresponding max-stable distribution with the function rmstable(), following the exact algorithm in Dombry et al. (2016).
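
For instance, a max-stable Hüsler–Reiss sample with the same parameter matrix can be drawn as follows (using the same arguments as rmpareto()):

Zsim <- rmstable(n = 100, model = "HR", d = 2, par = G)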

Measures of extremal dependence

Extremal correlation

The extremal correlation χij ∈ [0, 1] measures the dependence between the largest values of the random variables Xi and Xj. It is defined as
$$ \chi_{ij} = \lim_{p \to 1} \chi_{ij}(p), \qquad \chi_{ij}(p) = \mathbb{P}\left\{ F_i(X_i) > p \mid F_j(X_j) > p \right\}, $$
where the boundary cases 0 and 1 correspond to asymptotic independence and complete dependence, respectively.

For n observations X1, …, Xn of the d-dimensional vector X, we can empirically estimate the d × d matrix of all pairwise extremal correlations for a fixed threshold p close to 1.

chi_hat <- emp_chi(data = X, p = .8)

In this and other functions, the default argument p = NULL indicates that the data are already on MPD scale, so no thresholding is performed. For the danube data, we may therefore directly use Y instead of X:

chi_hat <- emp_chi(data = Y)

For the Hüsler–Reiss distribution with parameter matrix Γ = {Γij}1 ≤ i, j ≤ d, the extremal correlation is given by $$ \chi_{ij} = 2 - 2 \Phi(\sqrt{\Gamma_{ij}}/2),$$ where Φ is the standard normal distribution function. We can use the functions Gamma2chi() and chi2Gamma() to switch between the two coefficients.
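
As a quick sketch of this relation (assuming scalar arguments are accepted, as for matrices):

# Gamma = 1.5 corresponds to chi = 2 - 2 * pnorm(sqrt(1.5) / 2); chi2Gamma() inverts this
chi_12 <- Gamma2chi(1.5)
chi2Gamma(chi_12)  # recovers 1.5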

Extremal variogram

There exist several other summary statistics for extremal dependence. The extremal variogram is introduced in Engelke and Volgushev (2022) and turns out to be particularly useful for inference of extremal graphical models discussed below.

For any root node k ∈ V, the pre-asymptotic extremal variogram is defined as the matrix Γ(k) with entries
$$ \Gamma^{(k)}_{ij}(p) = \operatorname{Var}\left[ \log\{1 - F_i(X_i)\} - \log\{1 - F_j(X_j)\} \mid F_k(X_k) > p \right], \quad i, j \in V, $$
whenever the right-hand side exists. Note that −log {1 − Fi(Xi)} transforms Xi to a standard exponential distribution, such that Γ(k) is simply the variogram matrix of X on the exponential scale, conditioned on the kth variable being large.

The limit as p → 1 is called the extremal variogram and can be expressed in terms of the MPD Y:
$$ \Gamma^{(k)}_{ij} = \operatorname{Var}\left( \log Y_i - \log Y_j \mid Y_k > 1 \right), \quad i, j \in V. $$
Weak/strong extremal dependence is indicated by large/small values of the extremal variogram, respectively. The function emp_vario() estimates the (pre-asymptotic) extremal variogram, for instance for k = 1.

Gamma_1_hat <- emp_vario(data = X, k = 1, p = 0.8)
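
The definition above can also be checked by hand; the following sketch (our own variable names, illustrative only, with emp_vario() as the reference implementation) estimates Γ(1) directly from the exponential-scale data:

# Transform margins to exponential scale and condition on the first variable being large
E_scale <- -log(1 - apply(X, 2, function(x) rank(x) / (nrow(X) + 1)))
idx <- E_scale[, 1] > quantile(E_scale[, 1], 0.8)
Gamma_1_manual <- outer(
    seq_len(ncol(X)), seq_len(ncol(X)),
    Vectorize(function(i, j) var(E_scale[idx, i] - E_scale[idx, j]))
)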

In general, the matrices Γ(k) can differ across k ∈ V, but for the Hüsler–Reiss distribution, for example, they all coincide.

For the Hüsler–Reiss distribution with parameter matrix Γ, the extremal variogram matrices satisfy Γ = Γ(1) = … = Γ(d).

In this case it makes sense to estimate the extremal variogram Γ̂ as the average of the estimators Γ̂(k):

Gamma_hat <- emp_vario(data = X, p = 0.8)
Gamma_hat <- emp_vario(data = Y)

Extremal graphical models

Let G = (V, E) be an undirected graph with index set V = {1, …, d} and edges E ⊂ V × V. The figure below shows examples of different graphical structures: a tree, a decomposable graph and a non-decomposable graph. Engelke and Hitz (2020) introduce a new notion of extremal conditional independence for MPDs, denoted by ⊥e. They define an extremal graphical model on G as an MPD Y = (Yj : j ∈ V) that satisfies the pairwise Markov property
$$ Y_i \perp_e Y_j \mid Y_{V \setminus \{i, j\}}, \quad \text{if } (i, j) \notin E, $$
that is, Yi and Yj are conditionally independent in the extremal sense ⊥e given all other nodes whenever there is no edge between i and j in G.

If the MPD possesses a density fY, then the graph G has to be connected. Engelke and Hitz (2020) then show a Hammersley–Clifford theorem stating that for an extremal graphical model on a decomposable graph G, the density fY factorizes into the marginal densities on the cliques.

# A tree graph, a decomposable graph, and a non-decomposable (cycle) graph
plot(graph_from_literal(1--2--3, 2--4))
plot(graph_from_literal(1--2--3--1, 2--4))
plot(graph_from_literal(1--2--3--4--1))

Trees

A tree T = (V, E) is a particularly simple type of graph: it is connected and has no cycles. This implies that there are exactly |E| = d − 1 edges. The Hammersley–Clifford theorem shown in Engelke and Hitz (2020) yields that the density of an extremal tree model Y on the tree T = (V, E) factorizes into
$$ f_Y(y) \propto \prod_{(i, j) \in E} \frac{\lambda_{ij}(y_i, y_j)}{y_i^{-2} y_j^{-2}} \prod_{i \in V} y_i^{-2}, $$
where λij are the bivariate marginals of the exponent measure density λ corresponding to Y. The d-dimensional density of the MPD is therefore reduced to bivariate interactions only.

For extremal graphical models on trees, the extremal variograms Γ(k), k ∈ V, are very natural since they define a so-called additive tree metric, that is,
$$ \Gamma^{(k)}_{ij} = \sum_{(s, t) \in \mathrm{ph}(ij)} \Gamma^{(k)}_{st}, \quad i, j \in V, $$
where ph(ij) denotes the path between node i and node j on the tree T. For example, on the chain graph 1–2–3, the entry for the pair (1, 3) is simply the sum of the two edge entries along the path.

Simulation

By the factorization above we can specify any bivariate models λij for (i, j) ∈ E and combine them to a valid tree model. For instance, if we use bivariate Hüsler–Reiss distributions on all edges of the tree T, the resulting d-dimensional tree model is again Hüsler–Reiss and its parameter matrix Γ is implied by the additive tree metric property. If we use bivariate logistic distributions on all edges, the resulting extremal tree model is not a d-dimensional logistic model, but a more flexible model with d − 1 parameters.

The function rmpareto_tree() generates samples from an extremal tree model Y.

set.seed(42)

my_model <- generate_random_model(d = 4, graph_type = "tree")
Ysim <- rmpareto_tree(
    n = 100,
    model = "HR",
    tree = my_model$graph,
    par = my_model$Gamma
)

theta_vec <- c(.2, .8, .3)
Ysim <- rmpareto_tree(
    n = 100,
    model = "logistic",
    tree = my_model$graph,
    par = theta_vec
)

Note that we can also generate samples from the corresponding max-stable distribution with the function rmstable_tree().
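
For example, under the same assumptions as above:

Zsim <- rmstable_tree(
    n = 100,
    model = "HR",
    tree = my_model$graph,
    par = my_model$Gamma
)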

Estimation

For a given tree T = (V, E) and a parametric model {fY(y; θ) : θ ∈ Θ} of MPDs on the tree T, estimation of the model parameters is fairly straightforward. If θ = {θij : (i, j) ∈ E} consists of one parameter per edge, then thanks to the tree factorization above we may fit each parameter θij separately. This can be done by (censored) maximum likelihood methods, or by other methods such as M-estimation.

If provided with a tree and data, the function fmpareto_graph_HR() estimates the d − 1 parameters of a Hüsler–Reiss model on that tree. As an example, we fit an extremal tree model to the (undirected) tree resulting from the flow connections in the danube data set and compare the fitted extremal coefficients with the empirical ones.

# Utility function to plot fitted parameters against true/empirical ones
plot_fitted_params <- function(G0, G1, xlab = 'True', ylab = 'Fitted'){
    return(
        ggplot()
        + geom_point(aes(
            x = G0[upper.tri(G0)],
            y = G1[upper.tri(G1)]
        ))
        + geom_abline(slope = 1, intercept = 0)
        + xlab(xlab)
        + ylab(ylab)
    )
}
# Get flow graph from package
flow_graph <- getDanubeFlowGraph()

# Fit parameter matrix with this graph structure
flow_Gamma <- fmpareto_graph_HR(data = Y, graph = flow_graph)

# Compute likelihood/ICs and plot parameters
flow_loglik <- loglik_HR(
    data = Y,
    Gamma = flow_Gamma,
    graph = flow_graph
)
plot_fitted_params(chi_hat, Gamma2chi(flow_Gamma), xlab = 'Empirical')

Structure learning

In practice, the underlying conditional independence tree T = (V, E) is usually unknown and has to be estimated in a data-driven way. For extremal tree models, it turns out that this can be done efficiently in a completely non-parametric way; see Engelke and Volgushev (2022) for details.

The method is based on the notion of a minimum spanning tree. For a set of symmetric weights ρij > 0 associated with any pair of nodes i, j ∈ V, i ≠ j, the minimum spanning tree is defined as the tree structure that minimizes the sum of edge weights,
$$ T_{\mathrm{mst}} = \operatorname*{arg\,min}_{T' = (V, E')} \sum_{(i, j) \in E'} \rho_{ij}, $$
where the minimum is taken over all spanning trees T' on the node set V.

Engelke and Volgushev (2022) show that if Y is an extremal graphical model on an unknown tree T, then the minimum spanning tree with the extremal variogram ρij = Γij(k) as weights is unique and satisfies Tmst = T. Using empirical estimates Γ̂ij(k) as weights, we can thus consistently recover the underlying tree structure in a fully non-parametric way.

In fact, taking the average of all Γ̂(k), k ∈ V, and using this as weights ρij = Γ̂ij makes better use of the data and usually improves the performance of structure estimation significantly. Engelke and Volgushev (2022) further show that the empirical extremal correlation ρij = χ̂ij may also be used for consistent tree recovery, but the performance is typically inferior to extremal variogram based algorithms.
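
To see the idea concretely, here is a minimal sketch of tree recovery by hand, computing a minimum spanning tree with igraph on the averaged empirical extremal variogram weights; the function emst() introduced below wraps this computation and additionally returns the implied parameter matrix.

# Minimum spanning tree on a complete graph weighted by the empirical extremal variogram
Gamma_hat_Y <- emp_vario(data = Y)
full_graph <- igraph::make_full_graph(ncol(Y))
tree_manual <- igraph::mst(
    full_graph,
    weights = Gamma_hat_Y[igraph::as_edgelist(full_graph)]
)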

The function emst() estimates the extremal tree structure by a minimum spanning tree algorithm based on the different summary statistics for extremal dependence. It provides the estimated tree and the extremal variogram matrix implied by the additive tree metric property.

# Fit tree graph to the data
emst_fit <- emst(data = Y, method = "vario")

# Compute likelihood/ICs, and plot fitted graph, parameters
loglik_emst <- loglik_HR(
    data = Y,
    Gamma = emst_fit$Gamma,
    graph = emst_fit$graph
)
plotDanubeIGraph(graph = emst_fit$graph)
plot_fitted_params(chi_hat, Gamma2chi(emst_fit$Gamma), xlab = 'Empirical')

Hüsler–Reiss graphical models

The Hüsler–Reiss Pareto distribution Y is parametrized by the variogram matrix Γ. Using the linear transformation Γ ↦ Σ(k) defined above, we obtain the covariance matrix Σ(k) for k ∈ V. The inverse, called a precision matrix, is denoted by Θ(k). Following Hentschel et al. (2022), these (d − 1) × (d − 1) matrices can be combined into the d × d precision matrix Θ with entries
$$ \Theta_{ij} = \Theta^{(k)}_{ij} \quad \text{for any } k \neq i, j. $$
Engelke and Hitz (2020) show that this matrix is well-defined and encodes extremal conditional independence as zeros on the off-diagonal:
$$ Y_i \perp_e Y_j \mid Y_{V \setminus \{i, j\}} \quad \Longleftrightarrow \quad \Theta_{ij} = 0. $$

Transformations

While the Hüsler–Reiss distribution is parameterized by the variogram matrix Γ, other objects such as Θ(k), Σ(k), Θ, and its Moore–Penrose inverse Σ := Θ+ are often required, for instance, to identify the extremal graphical structure. The functions Gamma2Sigma(), Sigma2Gamma(), Gamma2Theta(), etc. perform these transformations. Note that all of these functions are bijections. The functions Gamma2graph(), Theta2graph(), etc. create an igraph::graph object, which represents the corresponding extremal graph structure.

Gamma <- cbind(
    c(0, 1.5, 1.5, 2),
    c(1.5, 0, 2, 1.5),
    c(1.5, 2, 0, 1.5),
    c(2, 1.5, 1.5, 0)
)
Gamma2Sigma(Gamma, k = 1)
#>      [,1] [,2] [,3]
#> [1,]  1.5  0.5    1
#> [2,]  0.5  1.5    1
#> [3,]  1.0  1.0    2
Gamma2Theta(Gamma)
#>      [,1] [,2] [,3] [,4]
#> [1,]  1.0 -0.5 -0.5  0.0
#> [2,] -0.5  1.0  0.0 -0.5
#> [3,] -0.5  0.0  1.0 -0.5
#> [4,]  0.0 -0.5 -0.5  1.0
Gamma2graph(Gamma)
#> IGRAPH 591371e U--- 4 4 -- 
#> + edges from 591371e:
#> [1] 1--2 1--3 2--4 3--4
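
Since the transformations are bijections, converting back and forth recovers the original matrix; a quick round-trip check (here assuming Sigma2Gamma() takes the corresponding k argument):

Sigma2Gamma(Gamma2Sigma(Gamma, k = 1), k = 1)  # should return the original Gamma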

Completion of Γ

For Hüsler–Reiss graphical models it suffices to specify the graph structure G = (V, E) and the entries Γij of the parameter matrix for (i, j) ∈ E. The remaining entries of the matrix Γ are then implied by the graph structure (Hentschel et al., 2022).

The function complete_Gamma() takes as inputs a partially specified Γ matrix and the corresponding graph structure, and completes the Γ matrix. The mathematical theory for completion is different depending on whether the graph is decomposable or non-decomposable.

The simplest case is a tree, where the entries Γij for (i, j) ∉ E can be easily obtained from the additive tree metric property described above.

set.seed(42)
my_tree <- generate_random_tree(d = 4)
Gamma_vec <- c(.5, 1.4, .8)
Gamma_comp <- complete_Gamma(Gamma = Gamma_vec, graph = my_tree)
print(Gamma_comp)
#>      [,1] [,2] [,3] [,4]
#> [1,]  0.0  0.5  1.9  1.3
#> [2,]  0.5  0.0  1.4  0.8
#> [3,]  1.9  1.4  0.0  2.2
#> [4,]  1.3  0.8  2.2  0.0
plot(Gamma2graph(Gamma_comp))

This also works for more general decomposable graphs, where the matrix Γ has to be specified on all cliques. For decomposable graphs, the graph structure can also be implied by Γij = NA for (i, j) ∉ E.

G <- rbind(
    c(0, 5, 7, 6, NA),
    c(5, 0, 14, 15, NA),
    c(7, 14, 0, 5, 5),
    c(6, 15, 5, 0, 6),
    c(NA, NA, 5, 6, 0)
)
Gamma_comp <- complete_Gamma(Gamma = G)
plot(Gamma2graph(Gamma_comp))

For non-decomposable graphs, a valid Γ matrix and an undirected, connected graph have to be provided.

set.seed(42)
G <- rbind(
    c(0, 5, 7, 6, 6),
    c(5, 0, 14, 15, 13),
    c(7, 14, 0, 5, 5),
    c(6, 15, 5, 0, 6),
    c(6, 13, 5, 6, 0)
)
my_graph <- generate_random_connected_graph(d = 5, m = 5)
Gamma_comp <- complete_Gamma(Gamma = G, graph = my_graph)
plot(Gamma2graph(Gamma_comp))

Estimation

Let us first suppose that a connected graph G is given. For a given data set, the function fmpareto_graph_HR() fits a Hüsler–Reiss graphical model on the graph G.

If the graph is decomposable, then the parameters of each clique can be fitted separately and combined into a full Γ matrix. If the cliques are small enough, then (censored) maximum likelihood estimation is feasible; otherwise the empirical extremal variogram is used. Combining the clique estimates relies on the same principle as in the complete_Gamma() function, but with some care required to ensure that the Γij estimates are consistent if (i, j) belongs to a separator set between two or more cliques.

set.seed(42)
d <- 10
my_model <- generate_random_model(d = d, graph_type = "decomposable")
plot(my_model$graph)

Ysim <- rmpareto(n = 100, d = d, model = "HR", par = my_model$Gamma)
my_fit <- fmpareto_graph_HR(data = Ysim, graph = my_model$graph, p = NULL)
plot_fitted_params(Gamma2chi(my_model$Gamma), Gamma2chi(my_fit))

If the graph G is non-decomposable, then the empirical extremal variogram is first computed and then fitted to the graph structure of G using the function complete_Gamma().

set.seed(1)
d <- 20
my_model <- generate_random_model(d = d, graph_type = "general")
plot(my_model$graph)

Ysim <- rmpareto(n = 100, d = d, model = "HR", par = my_model$Gamma)
my_fit <- fmpareto_graph_HR(data = Ysim, graph = my_model$graph, p = NULL, method = "vario")
plot_fitted_params(Gamma2chi(my_model$Gamma), Gamma2chi(my_fit))

General structure learning

For structure learning of more general, possibly non-decomposable graphs, we can use the eglearn method. Given an estimate Γ̂ of the parameter matrix, e.g., obtained by emp_vario(), we compute the corresponding matrices Σ̂(k) through the function Gamma2Sigma(). In order to enforce sparsity, we compute the ℓ1-penalized precision matrices for each k ∈ V. For a tuning parameter ρ ≥ 0 and choice of k, the estimated (d − 1) × (d − 1) precision matrix is
$$ \widehat{\Theta}^{(k)}_\rho = \operatorname*{arg\,min}_{\Theta \succeq 0} \; -\log \det \Theta + \operatorname{tr}\left( \widehat{\Sigma}^{(k)} \Theta \right) + \rho \sum_{i \neq j} |\Theta_{ij}|. $$
Since we run d different graphical lasso algorithms and the kth only enforces sparsity for all edges that do not involve the kth node, we determine the estimated graph structure Ĝρ = (V, Êρ) by majority vote:
$$ (i, j) \in \widehat{E}_\rho \quad \Leftrightarrow \quad \frac{1}{d - 2} \# \left\{ k \in V \setminus \{i, j\} : \left( \widehat{\Theta}^{(k)}_\rho \right)_{ij} \neq 0 \right\} \geq 1/2. $$

The best parameter ρ can be chosen, for instance, as the minimizer of the BIC of the resulting models. The extremal graphical lasso is implemented in the eglearn() function. It returns the sequence of estimated graphs and, if desired, the corresponding Γ̂ρ estimates.

# Run eglearn for a suitable list of penalization parameters
rholist <- seq(0, 0.1, length.out = 11)
eglearn_fit <- eglearn(
    data = Y,
    rholist = rholist,
    complete_Gamma = TRUE
)

# Compute the corresponding likelihoods/ICs
logliks_eglearn <- sapply(
    seq_along(rholist),
    FUN = function(j) {
        Gamma <- eglearn_fit$Gamma[[j]]
        if(is.null(Gamma)) return(c(NA, NA, NA))
        loglik_HR(
            data = Y,
            Gamma = Gamma,
            graph = eglearn_fit$graph[[j]]
        )
    }
)

# Plot the BIC vs rho/number of edges
ggplot(mapping = aes(x = rholist, y = logliks_eglearn['bic', ])) +
    geom_line() +
    geom_point(shape = 21, size = 3, stroke = 1, fill = "white") +
    geom_hline(aes(yintercept = flow_loglik['bic']), lty = "dashed") +
    geom_hline(aes(yintercept = loglik_emst['bic']), lty = "dotted") +
    xlab("rho") +
    ylab("BIC") +
    scale_x_continuous(
        breaks = rholist,
        labels = round(rholist, 3),
        sec.axis = sec_axis(
            trans = ~., breaks = rholist,
            labels = sapply(eglearn_fit$graph, igraph::gsize),
            name = "Number of edges"
        )
    )

# Compare fitted chi to empirical one
best_index <- which.min(logliks_eglearn['bic', ])
best_Gamma <- eglearn_fit$Gamma[[best_index]]
best_graph <- eglearn_fit$graph[[best_index]]
plotDanubeIGraph(graph = best_graph)
plot_fitted_params(chi_hat, Gamma2chi(best_Gamma), xlab='Empirical')

References

Asadi, P., Davison, A. C., and Engelke, S. (2015). Extremes on river networks. Ann. Appl. Stat., 9(4), 2023–2050. https://doi.org/10.1214/15-AOAS863
Dombry, C., Engelke, S., and Oesting, M. (2016). Exact simulation of max-stable processes. Biometrika, 103, 303–317.
Engelke, S., and Hitz, A. S. (2020). Graphical models for extremes (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol., 82, 871–932.
Engelke, S., and Volgushev, S. (2022). Structure learning for extremal tree models. J. R. Stat. Soc. Ser. B Stat. Methodol. https://doi.org/10.1111/rssb.12556
Hentschel, M., Engelke, S., and Segers, J. (2022). Statistical inference for Hüsler-Reiss graphical models through matrix completions. arXiv. https://doi.org/10.48550/ARXIV.2210.14292
Hüsler, J., and Reiss, R.-D. (1989). Maxima of normal random vectors: Between independence and complete dependence. Statist. Probab. Lett., 7, 283–286.
Kabluchko, Z., Schlather, M., and Haan, L. de. (2009). Stationary max-stable fields associated to negative definite functions. Ann. Probab., 37, 2042–2065.