Learning graphs from data via spectral constraints

Ze Vinicius, Daniel P. Palomar, Jiaxi Ying, and Sandeep Kumar
Hong Kong University of Science and Technology (HKUST)

2019-09-22



Installation

Check out https://mirca.github.io/spectralGraphTopology for installation instructions.

Problem Statement

Graphs are arguably among the most popular mathematical structures, finding applications in a myriad of scientific and engineering fields.

In the era of big data, graphs can be used to model a vast diversity of phenomena, including customer preferences, brain activity, and genetic structures, to name just a few. Therefore, it is of utmost importance to be able to reliably estimate such structures from noisy, often sparse, low-rank datasets.

The Laplacian matrix of a graph contains the information about its topology, i.e., how nodes are connected among themselves. By definition, a (combinatorial) Laplacian matrix is symmetric, positive semi-definite, and has rows that sum to zero.
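
As a small illustration of these properties, the snippet below uses plain base R to build the Laplacian of a weighted three-node path graph directly from its weighted adjacency matrix and verifies the three properties.

# weighted adjacency matrix of the path graph 1 -- 2 -- 3
W <- rbind(c(0, 1, 0),
           c(1, 0, 2),
           c(0, 2, 0))
# combinatorial Laplacian: degree matrix minus adjacency matrix
L <- diag(rowSums(W)) - W
isSymmetric(L)                   # TRUE: symmetric
all(eigen(L)$values >= -1e-10)   # TRUE: positive semi-definite
rowSums(L)                       # all zeros: zero row-sum property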

One common approach to estimate the Laplacian matrix of a graph (without satisfying the zero row-sum property) is via the generalized inverse of the sample covariance matrix, which is an asymptotically unbiased and efficient estimator. In this document, we call this approach the naive one. In R, this estimator can be computed simply as MASS::ginv(cov(Y)), where Y is the data matrix. However, this estimator performs very poorly when the sample size is small compared with the number of nodes, which makes its use questionable for practical purposes.

Another classical approach, the well-known graphical lasso algorithm, was proposed in [1], where an \(\ell_1\)-norm penalty term was incorporated in order to induce sparsity in the solution. The R package glasso provides an implementation of this estimator.
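
For reference, both classical estimators can be computed in a few lines. In the sketch below, the data matrix Y (one column per node) is assumed to be given, and the penalty value rho is an arbitrary choice.

# Y: n x p data matrix, one column per graph node (assumed given)
S <- cov(Y)
# naive estimator: generalized inverse of the sample covariance matrix
Theta_naive <- MASS::ginv(S)
# graphical lasso estimate of the precision matrix [1]; rho is the l1-penalty weight
gl <- glasso::glasso(S, rho = 0.1)
Theta_glasso <- gl$wi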

Our approach, in contrast, begins by defining a Laplacian linear operator \(\mathcal{L}\) that maps a vector of edge weights \(\mathbf{w}\) into a valid Laplacian matrix. Additionally, we impose constraints on the eigenvalues and eigenvectors of the Laplacian matrix, so that the underlying optimization problem may be expressed as follows: \[\begin{array}{ll} \underset{\mathbf{w}, \boldsymbol{\lambda}, \mathbf{U}}{\textsf{minimize}} & - \log \det\left({\sf Diag}(\boldsymbol{\lambda})\right) + \mathrm{tr}\left(\mathbf{S}\mathcal{L}\mathbf{w}\right) + \alpha h(\mathcal{L}\mathbf{w}) + \frac{\beta}{2}\big\|\mathcal{L}\mathbf{w} - \mathbf{U}{\sf Diag}(\boldsymbol{\lambda})\mathbf{U}^{T}\big\|^{2}_{F}\\ \textsf{subject to} & \mathbf{w} \geq 0, \boldsymbol{\lambda} \in \mathcal{S}_{\boldsymbol{\Lambda}},~\text{and}~ \mathbf{U}^{T}\mathbf{U} = \mathbf{I} \end{array}\] where \(h(\cdot)\) is a regularization function (e.g., to induce sparsity), \(\mathbf{S}\) is the sample covariance matrix, \(\alpha\) and \(\beta\) are positive hyperparameters, and \(\mathcal{S}_{\boldsymbol{\Lambda}}\) is the set that further constrains the eigenvalues of the Laplacian matrix. For example, for a \(k\)-component graph with \(p\) nodes, \(\mathcal{S}_{\boldsymbol{\Lambda}} = \left\{\{\lambda_i\}_{i=1}^{p} | \lambda_1 = \lambda_2 = \cdots = \lambda_k = 0,\; 0 < \lambda_{k+1} \leq \lambda_{k+2} \leq \cdots \leq \lambda_{p} \right\}\).

To solve this optimization problem, we employ a block majorization-minimization framework that updates one block of variables (\(\mathbf{w}\), \(\boldsymbol{\lambda}\), or \(\mathbf{U}\)) at a time while keeping the remaining ones fixed. For the mathematical details of the solution, including a convergence proof, please refer to our paper at: https://arxiv.org/pdf/1904.09792.pdf.

In order to learn bipartite graphs, we take advantage of the fact that the eigenvalues of the adjacency matrix of a bipartite graph are symmetric around 0, and we formulate the following optimization problem: \[ \begin{array}{ll} \underset{{\mathbf{w}},{\boldsymbol{\psi}},{\mathbf{V}}}{\textsf{minimize}} & \begin{array}{c} - \log \det (\mathcal{L} \mathbf{w}+\frac{1}{p}\mathbf{11}^{T})+\text{tr}({\mathbf{S}\mathcal{L} \mathbf{w}})+ \alpha h(\mathcal{L}\mathbf{w})+ \frac{\nu}{2}\Vert \mathcal{A} \mathbf{w}-\mathbf{V} {\sf Diag}(\boldsymbol{\psi}) \mathbf{V}^T \Vert_F^2, \end{array}\\ \text{subject to} & \begin{array}[t]{l} \mathbf{w} \geq 0, \ \boldsymbol{\psi} \in \mathcal{S}_{\boldsymbol{\psi}}, \ \text{and} \ \mathbf{V}^T\mathbf{V}=\mathbf{I}, \end{array} \end{array} \] where \(\mathcal{A}\) is the adjacency linear operator (the counterpart of \(\mathcal{L}\) for the adjacency matrix), \(\boldsymbol{\psi}\) and \(\mathbf{V}\) are its eigenvalues and eigenvectors, and \(\mathcal{S}_{\boldsymbol{\psi}}\) is the set that constrains the eigenvalues of the adjacency matrix.

In a similar fashion, we construct the optimization problem to estimate a \(k\)-component bipartite graph by combining the constraints related to the Laplacian and adjacency matrices.

Package usage

The spectralGraphTopology package provides three main functions to estimate k-component, bipartite, and k-component bipartite graphs, respectively: learn_k_component_graph, learn_bipartite_graph, and learn_bipartite_k_component_graph. Below and in the next subsections, we show how to apply these functions to synthetic datasets.
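
As a first example, the sketch below applies learn_k_component_graph to data generated from a small synthetic grid graph. The k argument (number of components) and the Laplacian field of the returned object are assumed to behave as elsewhere in this document, and the number of samples is an arbitrary choice.

library(spectralGraphTopology)
library(igraph)
set.seed(0)

# 8 x 8 grid graph with 64 nodes and random edge weights in [0.1, 3]
grid <- make_lattice(c(8, 8))
E(grid)$weight <- runif(gsize(grid), min = 0.1, max = 3)
Ltrue <- as.matrix(laplacian_matrix(grid))
# generate samples whose covariance is the pseudo-inverse of the true Laplacian
Y <- MASS::mvrnorm(30 * 64, mu = rep(0, 64), Sigma = MASS::ginv(Ltrue))
# the grid graph is connected, hence a single component (k = 1)
graph <- learn_k_component_graph(cov(Y), k = 1)
# relative error with respect to the true Laplacian
relative_error(Ltrue, graph$Laplacian)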

Learning a bipartite graph
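
A minimal sketch for this case is given below: we build a random weighted bipartite graph with igraph, draw samples whose covariance is the pseudo-inverse of the true Laplacian, and recover the graph from the sample covariance matrix. That learn_bipartite_graph takes the sample covariance matrix S as input and returns the estimated Laplacian in a Laplacian field is an assumption based on the other estimators in this document.

library(spectralGraphTopology)
library(igraph)
set.seed(123)

# random bipartite graph with node sets of sizes 10 and 6
n1 <- 10
n2 <- 6
bgraph <- sample_bipartite(n1, n2, type = "gnp", p = 0.7)
E(bgraph)$weight <- runif(gsize(bgraph), min = 1, max = 3)
Ltrue <- as.matrix(laplacian_matrix(bgraph))
# generate samples whose covariance is the pseudo-inverse of the true Laplacian
p <- n1 + n2
Y <- MASS::mvrnorm(100 * p, mu = rep(0, p), Sigma = MASS::ginv(Ltrue))
# estimate a bipartite graph from the sample covariance matrix
graph <- learn_bipartite_graph(S = cov(Y))
# compare the estimate with the ground truth
relative_error(Ltrue, graph$Laplacian)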

Learning a 2-component bipartite graph
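
Similarly, here is a sketch for a graph with two bipartite components, assuming that learn_bipartite_k_component_graph exposes the number of components through a k argument (an assumption based on its name and on learn_k_component_graph).

library(spectralGraphTopology)
library(igraph)
set.seed(123)

# two disjoint bipartite components, each with node sets of sizes 6 and 4
bg1 <- sample_bipartite(6, 4, type = "gnp", p = 0.8)
bg2 <- sample_bipartite(6, 4, type = "gnp", p = 0.8)
bgraph <- disjoint_union(bg1, bg2)
E(bgraph)$weight <- runif(gsize(bgraph), min = 1, max = 3)
Ltrue <- as.matrix(laplacian_matrix(bgraph))
# generate samples whose covariance is the pseudo-inverse of the true Laplacian
p <- vcount(bgraph)
Y <- MASS::mvrnorm(100 * p, mu = rep(0, p), Sigma = MASS::ginv(Ltrue))
# jointly impose the bipartite and 2-component structures
graph <- learn_bipartite_k_component_graph(S = cov(Y), k = 2)
relative_error(Ltrue, graph$Laplacian)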

Performance comparison

We use the following baseline algorithms for performance comparison:

  1. the generalized inverse of the sample covariance matrix, denoted as Naive;
  2. a quadratic program estimator given by \(\textsf{min}_\mathbf{w}\;\Vert\mathbf{S}^{\dagger} - \mathcal{L}\mathbf{w}\Vert_{F}\), where \(\mathbf{S}^{\dagger}\) is the generalized inverse of the sample covariance matrix, denoted as QP (a sketch of this estimator is given right after this list);
  3. the Combinatorial Graph Laplacian proposed in [2], denoted as CGL.
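
For concreteness, the QP baseline can be written as a small convex program over nonnegative edge weights. The sketch below uses the CVXR package (not part of spectralGraphTopology) and enumerates all candidate edges, so it is only practical for small graphs; minimizing the squared Frobenius norm yields the same solution as minimizing the norm itself.

# hypothetical sketch of the QP baseline using CVXR
library(CVXR)

qp_estimator <- function(S) {
  p <- nrow(S)
  Sinv <- MASS::ginv(S)                 # generalized inverse of the sample covariance
  edges <- t(combn(p, 2))               # all candidate edges (i, j), i < j
  m <- nrow(edges)
  # build the matrix A such that vec(L(w)) = A %*% w
  A <- matrix(0, p * p, m)
  for (e in seq_len(m)) {
    i <- edges[e, 1]; j <- edges[e, 2]
    Ei <- matrix(0, p, p)
    Ei[i, i] <- 1; Ei[j, j] <- 1; Ei[i, j] <- -1; Ei[j, i] <- -1
    A[, e] <- as.vector(Ei)
  }
  w <- Variable(m)
  # minimize ||S^dagger - L(w)||_F^2 subject to nonnegative edge weights
  problem <- Problem(Minimize(sum_squares(matrix(Sinv, ncol = 1) - A %*% w)),
                     constraints = list(w >= 0))
  w_hat <- solve(problem)$getValue(w)
  matrix(A %*% as.vector(w_hat), p, p)  # estimated Laplacian
}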

The plots below show the performance, in terms of F-score and relative error, of the proposed algorithm, denoted as SGL, and the baseline ones, when learning a grid graph with 64 nodes and edge weights drawn uniformly from the interval [0.1, 3]. For each algorithm, the shaded area and the solid curve represent the standard deviation and the mean over several Monte Carlo realizations. It can be noticed that SGL outperforms all the baseline algorithms in all sample size regimes. Such superior performance may be attributed to the highly structured nature of grid graphs.

In a similar fashion, the plots below show the algorithmic performance for modular graphs with 64 nodes and 4 modules, where the probability of connection within a module was set to 50%, whereas the probability of connection across modules was set to 1%. In this scenario, SGL outperforms the baseline algorithms QP and Naive, while having a performance similar to that of CGL. This may be explained by the fact that the edges of modular graphs do not have the nearly deterministic structure of grid graphs.

Clustering

One of the most direct applications of learning k-component graphs is the classical unsupervised machine learning problem of data clustering. For this task, we make use of two datasets: the animals dataset [3] and the Cancer RNA-Seq dataset [4].

The animals dataset consists of binary answers to questions such as “is warm-blooded?” and “has lungs?”. There are 102 such questions in total, which make up the features for 33 animal categories.
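
One possible way to cluster the animals is sketched below, under stated assumptions: the object animals holding the 33 x 102 binary answer matrix is assumed to be loaded beforehand (e.g., from the package's GitHub repository), the number of components k = 10 is an arbitrary choice, and the Adjacency field of the returned object is assumed to hold the estimated weighted adjacency matrix.

library(spectralGraphTopology)
library(igraph)

# animals: 33 x 102 binary matrix (rows = animal categories), assumed loaded beforehand
Y <- t(animals)                     # questions as samples, animals as graph nodes
S <- cov(Y)
# learn a graph with (say) 10 components; each component is a cluster
graph <- learn_k_component_graph(S, k = 10)
# read off the clusters from the connected components of the estimated graph
net <- graph_from_adjacency_matrix(graph$Adjacency, mode = "undirected", weighted = TRUE)
components(net)$membership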

The Cancer RNA-Seq dataset consists of genetic features that map to 5 types of cancer: breast carcinoma (BRCA), kidney renal clear-cell carcinoma (KIRC), lung adenocarcinoma (LUAD), colon adenocarcinoma (COAD), and prostate adenocarcinoma (PRAD). This dataset consists of 801 labeled samples, each with 20,531 genetic features.

The clustering results for these datasets are shown below. The code used for the Cancer RNA-Seq dataset can be found at our GitHub repo: https://github.com/dppalomar/spectralGraphTopology/tree/master/benchmarks/cancer-rna

Appendix I: exploiting prior knowledge of the node connections

In many practical cases, the connections between nodes are known a priori. For example, in social networks, the undirected graph of a public user A is determined by the set of users whom A follows and who also follow A. Another example arises in the field of finance, where stocks in the same industry or sector are naturally more likely to behave in a similar fashion. In those cases, we can assume that the unweighted adjacency matrix of the graph is known a priori and use state-of-the-art algorithms that leverage this fact to estimate the weighted relationships between nodes (edges). In this package, we implement the methods named “GLE-MM” and “GLE-ADMM” that were proposed in [5]. Assuming the sample covariance matrix S and the prior connectivity (unweighted adjacency) matrix A are available, the graph estimation is carried out as follows:
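
# sketch of the call pattern (S and A are constructed in the complete example below)
graph_mm   <- learn_laplacian_gle_mm(S = S, A = A, verbose = FALSE)
graph_admm <- learn_laplacian_gle_admm(S = S, A = A, verbose = FALSE)
# both return, among other quantities, the estimated Laplacian matrix
graph_mm$Laplacian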

As discussed in [5], the estimation performance of GLE-MM and GLE-ADMM is practically the same. The toy example below shows the relative error between the true Laplacian and the estimates given by the MM and ADMM methods as the number of samples increases.

library(spectralGraphTopology)
library(igraph)
set.seed(42)

# sample ratios
ratios <- c(5, 10, 100, 250, 500, 1000)
# number of nodes
p <- 100
# number of modules
m <- 4
# preference (connection probability) matrix for the stochastic block model
Prefix <- diag(0.2, m) + 0.025
# relative errors between the true Laplacian and the estimated ones
re_mm <- rep(0, length(ratios))
re_admm <- rep(0, length(ratios))
for (k in c(1:length(ratios))) {
  # Generate a modular graph
  mgraph <- sample_sbm(p, pref.matrix = Prefix, block.sizes = c(rep(p / m, m)))
  # Randomly assign weights to the edges
  E(mgraph)$weight <- runif(gsize(mgraph), min = 1e-1, max = 3)
  # Get the true Laplacian and Adjacency matrices
  Ltrue <- as.matrix(laplacian_matrix(mgraph))
  Atrue <- diag(diag(Ltrue)) - Ltrue
  # Get the unweighted true Adjacency matrix (assumed to be known)
  A <- 1 * (Ltrue < 0)
  # Generate samples from the Laplacian matrix
  Y <- MASS::mvrnorm(ratios[k] * p, mu = rep(0, p), Sigma = MASS::ginv(Ltrue))
  # Compute the sample covariance matrix
  S <- cov(Y)
  # Estimate a graph from the samples using the MM method
  graph_mm <- learn_laplacian_gle_mm(S = S, A = A, verbose = FALSE)
  # Estimate a graph from the samples using the ADMM method
  graph_admm <- learn_laplacian_gle_admm(S = S, A = A, verbose = FALSE)
  # record relative error between true and estimated Laplacians
  re_mm[k] <- relative_error(Ltrue, graph_mm$Laplacian)
  re_admm[k] <- relative_error(Ltrue, graph_admm$Laplacian)
}
xlab <- latex2exp::TeX("$\\mathit{n} / \\mathit{p}$")
colors <- c("#0B032D", "#843B62")
pch <- c(11, 7)
legend <- c("MM", "ADMM")
plot(c(1:length(ratios)), re_mm, ylim = c(0, max(re_mm, re_admm) + 0.01), xlab = xlab,
     ylab = "Relative Error", type = "b", pch = pch[1], lty = 1,
     cex = .75, col = colors[1], xaxt = "n")
lines(c(1:length(ratios)), re_admm, type = "b", lty = 1, pch = pch[2],
      cex = .75, col = colors[2])
axis(side = 1, at = c(1:length(ratios)), labels = ratios)
legend("topright", legend=legend, col=colors, pch=pch, lty=c(1, 1), bty="n")

The plot above confirms that the performance of both algorithms is indeed very similar.

References

[1] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[2] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under Laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.

[3] D. N. Osherson, J. Stern, O. Wilkie, M. Stob, and E. E. Smith, “Default probability,” Cognitive Science, vol. 15, no. 2, pp. 251–269, 1991.

[4] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository.” University of California, Irvine, School of Information; Computer Sciences, 2017.

[5] L. Zhao, Y. Wang, S. Kumar, and D. P. Palomar, “Optimization algorithms for graph Laplacian estimation via ADMM and MM,” IEEE Transactions on Signal Processing, vol. 67, no. 16, pp. 4231–4244, 2019.