Some notes on Gaussian Fields and Label Propagation

On several occasions, we find ourselves needing to propagate information among the nodes of an undirected graph.

For instance, consider graph-based Semi-Supervised Learning (SSL): here, labeled and unlabeled examples are represented by an undirected graph, referred to as the similarity graph.

The task consists of finding a label assignment for all examples such that:

  1. The final labeling is consistent with training data (e.g. positive training examples are still classified as positive at the end of the learning process), and
  2. Similar examples are assigned similar labels: this is referred to as the semi-supervised smoothness assumption.

Similarly, in networked data such as social networks, we might assume that related entities (such as friends) are associated with similar attributes (such as political and religious views, musical tastes, and so on): in social network analysis, this phenomenon is commonly referred to as homophily (love of the same).

In both cases, propagating information from a limited set of nodes to all the other nodes in the graph provides a way of predicting the attributes of those nodes when this information is missing.

In the following, we introduce a really clever method for efficiently propagating information about nodes in undirected graphs, known as the Gaussian Fields method.

Propagation as a Cost Minimization Problem

We now cast the propagation problem as a binary classification task. Let $X = \{ x_{1}, x_{2}, \ldots, x_{n} \}$ be a set of $n$ instances, of which only $l$ are labeled: $X^{+}$ denotes the positive examples, and $X^{-}$ the negative ones.

Similarity relations between instances can be represented by means of an undirected similarity graph having adjacency matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$: if two instances are connected in the similarity graph, it means that they are considered similar, and should be assigned the same label. Specifically, $\mathbf{W}_{ij} > 0$ iff the instances $x_{i}, x_{j} \in X$ are connected by an edge in the similarity graph, and $\mathbf{W}_{ij} = 0$ otherwise.
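
To make the notation concrete, here is a small, entirely made-up similarity graph encoded as an adjacency matrix with NumPy; the specific weights are my own illustrative choices, and the same toy graph is reused in the snippets below:

```python
import numpy as np

# A toy 5-node similarity graph (weights are made up purely for illustration):
# nodes 0, 1, 2 form one tightly connected cluster, nodes 3 and 4 another,
# and a single weak edge (weight 0.1) bridges the two clusters.
W = np.array([
    [0.0, 1.0, 0.8, 0.0, 0.0],
    [1.0, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# The graph is undirected, so W must be symmetric, with W_ij > 0 iff
# x_i and x_j are connected by an edge, and W_ij = 0 otherwise.
assert np.allclose(W, W.T)
assert np.all(W >= 0.0)
```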

Let $y_{i} \in \{ \pm 1 \}$ be the label assigned to the $i$-th instance $x_{i} \in X$. We can encode our assumption that similar instances should be assigned similar labels by defining a quadratic cost function over labeling functions in the form $f : X \mapsto \{ \pm 1 \}$:

\[E(f) = \frac{1}{2} \sum_{x_{i} \in X} \sum_{x_{j} \in X} \mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]^{2}.\]

Given an input labeling function $f$, the cost function $E(\cdot)$ associates, with each pair of instances $x_{i}, x_{j} \in X$, a non-negative cost $\mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]^{2}$: this quantity is $0$ when $\mathbf{W}_{ij} = 0$ (i.e. $x_{i}$ and $x_{j}$ are not linked in the similarity graph), or when $f(x_{i}) = f(x_{j})$ (i.e. they are assigned the same label).

For this reason, the cost function $E(\cdot)$ favors labeling functions that assign the same label to instances linked by an edge in the similarity graph.
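
The cost $E(\cdot)$ can be computed directly from its definition. The sketch below (the helper name `quadratic_cost` and the labelings are my own, and the graph is the toy one from the previous snippet) shows that a labeling that respects the graph structure is much cheaper than one that cuts several heavy edges:

```python
import numpy as np

def quadratic_cost(W: np.ndarray, f: np.ndarray) -> float:
    """E(f) = 1/2 * sum_{i,j} W_ij * (f_i - f_j)^2."""
    diff = f[:, None] - f[None, :]           # all pairwise differences f_i - f_j
    return 0.5 * float(np.sum(W * diff ** 2))

# Same toy similarity graph as in the previous snippet.
W = np.array([
    [0.0, 1.0, 0.8, 0.0, 0.0],
    [1.0, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

smooth = np.array([+1.0, +1.0, +1.0, -1.0, -1.0])  # labels follow the two clusters
rough = np.array([+1.0, -1.0, +1.0, -1.0, +1.0])   # labels ignore the graph

print(quadratic_cost(W, smooth))  # 0.4: only the weak bridge edge is cut
print(quadratic_cost(W, rough))   # 12.0: several heavy edges are cut
```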

Now, the problem of finding a labeling function that is both consistent with the training labels and assigns similar labels to similar instances can be cast as a cost minimization problem. Let a labeling function $f$ be represented by a vector $\mathbf{f} \in \mathbb{R}^{n}$, let $L \subset X$ denote the set of labeled instances, and let $\mathbf{y}_{i} \in \{ \pm 1 \}$ denote the label of the $i$-th instance $x_{i}$. The optimization problem can be defined as follows:

\[\begin{aligned} & \underset{\mathbf{f} \in \{ \pm 1 \}^{n}}{\text{minimize}} & & E(\mathbf{f}) \\ & \text{subject to} & & \forall x_{i} \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i}. \end{aligned}\]

The constraint $\forall x_{i} \in L : \mathbf{f}_{i} = \mathbf{y}_{i}$ fixes the label of each labeled example $x_{i} \in L$: $\mathbf{f}_{i} = +1$ if the instance has a positive label, and $\mathbf{f}_{i} = -1$ if it has a negative one, so as to achieve consistency with the training labels.

However, constraining labeling functions $f$ to only take discrete values has two main drawbacks:

  • Each function $f$ can only provide hard classifications, without yielding any measure of confidence in its predictions.
  • The cost term $E(\cdot)$ can be hard to optimize in a multi-label classification setting.

To overcome these limitations, Zhu et al. propose a (regularized) continuous relaxation of the previous optimization problem:

\[\begin{aligned} & \underset{\mathbf{f} \in \mathbb{R}^{n}}{\text{minimize}} & & E(\mathbf{f}) + \epsilon \sum_{x_{i} \in X} \mathbf{f}_{i}^{2} \\ & \text{subject to} & & \forall x_{i} \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i}, \end{aligned}\]

where the term $\sum_{x_{i} \in X} \mathbf{f}_{i}^{2} = \mathbf{f}^{T} \mathbf{f}$ is an $L_{2}$ regularizer over $\mathbf{f}$, weighted by a parameter $\epsilon > 0$, which ensures that the optimization problem has a unique global solution.

The parameter $\epsilon$ can be interpreted as the decay of the propagation process: as the distance from a labeled instance within the similarity graph increases, the confidence in the classification (as measured by the continuous label) gets closer to zero.

This optimization problem has a unique global solution that can be calculated in closed form. Specifically, the optimal (relaxed) discriminant function $f : X \mapsto \mathbb{R}$ is given by $\mathbf{\hat{f}} = \left[ \mathbf{\hat{f}}_{L}, \mathbf{\hat{f}}_{U} \right]^{T}$, where $\mathbf{\hat{f}}_{L} = \mathbf{y}_{L}$ (i.e. the labels of the labeled examples in $L$ coincide with the training labels), while $\mathbf{\hat{f}}_{U}$ is given by:

\[\mathbf{\hat{f}}_{U} = (\mathbf{L}_{UU} + \epsilon \mathbf{I})^{-1} \mathbf{W}_{UL} \mathbf{\hat{f}}_{L},\]

where $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the graph Laplacian of the similarity graph with adjacency matrix $\mathbf{W}$, and $\mathbf{D}$ is the diagonal degree matrix with $\mathbf{D}_{ii} = \sum_{j} \mathbf{W}_{ij}$. The Laplacian shows up here because the cost function can be rewritten as $E(\mathbf{f}) = \mathbf{f}^{T} \mathbf{L} \mathbf{f}$, so the relaxed problem is a constrained quadratic program whose stationarity conditions yield the linear system above.
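
Putting the pieces together, below is a minimal NumPy sketch of this closed-form solution; the function name `gaussian_fields`, the toy graph, and the value of $\epsilon$ are my own illustrative choices, not taken from Zhu et al.:

```python
import numpy as np

def gaussian_fields(W: np.ndarray, labeled: np.ndarray, y_l: np.ndarray,
                    eps: float = 1e-2) -> np.ndarray:
    """Closed-form label propagation: f_U = (L_UU + eps * I)^{-1} W_UL f_L.

    W       : (n, n) symmetric adjacency matrix of the similarity graph.
    labeled : indices of the labeled instances.
    y_l     : labels in {-1, +1} for the labeled instances.
    eps     : decay parameter of the propagation process.
    """
    n = W.shape[0]
    unlabeled = np.setdiff1d(np.arange(n), labeled)

    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W                           # graph Laplacian

    L_UU = L[np.ix_(unlabeled, unlabeled)]
    W_UL = W[np.ix_(unlabeled, labeled)]

    f = np.zeros(n)
    f[labeled] = y_l                    # labeled nodes keep their training labels
    # Solve (L_UU + eps I) f_U = W_UL f_L rather than forming the inverse explicitly.
    f[unlabeled] = np.linalg.solve(L_UU + eps * np.eye(len(unlabeled)),
                                   W_UL @ y_l)
    return f

# Same toy graph as above: two clusters {0, 1, 2} and {3, 4} joined by a weak
# bridge; only node 0 (positive) and node 4 (negative) are labeled.
W = np.array([
    [0.0, 1.0, 0.8, 0.0, 0.0],
    [1.0, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
f = gaussian_fields(W, labeled=np.array([0, 4]), y_l=np.array([+1.0, -1.0]))
print(f)           # continuous scores: sign gives the label, magnitude a rough confidence
print(np.sign(f))  # hard labels for all nodes
```

Solving the linear system directly is cheaper and numerically more stable than forming the inverse; for large, sparse similarity graphs one would typically switch to `scipy.sparse` matrices and an iterative solver such as conjugate gradient, since $\mathbf{L}_{UU} + \epsilon \mathbf{I}$ is symmetric positive definite.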
