neuralnoise.comHomepage of Dr. Pasquale Minervini, Research Associate at University College London, United Kingdom
http://neuralnoise.com///
Sat, 04 Aug 2018 11:03:53 +0100Sat, 04 Aug 2018 11:03:53 +0100Jekyll v3.1.6Knowledge Propagation in Graphs using Gaussian Fields, Part One<p>In several occasions, we find ourselves in need of <em>propagating</em> information among nodes in an undirected graph.</p>
<p>For instance, consider graph-based Semi-Supervised Learning (SSL): here, labeled and unlabeled examples are represented by an undirected graph, referred to as the <em>similarity graph</em>.</p>
<p>The task consists in finding a <em>label assignment</em> to all examples, such that:</p>
<ol>
<li>The final labeling is consistent with training data (e.g. positive training examples are still classified as positive at the end of the learning process), and</li>
<li>Similar examples are assigned similar labels: this is referred to as the <em>semi-supervised smoothness assumption</em>.</li>
</ol>
<p>Similarly, in networked data such as social networks, we might assume that related entities (such as <em>friends</em>) are associated to similar attributes (such as political and religious views, musical tastes and so on): in social network analysis, this phenomenon is commonly referred to as <em>homophily</em> (love of the same).</p>
<p>In both cases, propagating information from a limited set of nodes in a graph to all nodes provides a method for predicting the attributes of such nodes, when this information is missing.</p>
<p>In the following, we introduce a really clever method for efficiently propagating information about nodes in undirected graphs, known as the <em>Gaussian Fields</em> method.</p>
<h3 id="propagation-as-a-cost-minimization-problem">Propagation as a Cost Minimization Problem</h3>
<p>We now cast the propagation problem as a binary classification task.
Let $X = \{ x_{1}, x_{2}, \ldots, x_{n} \}$ be a set of $n$ instances, of which only $l$ are labeled: $X^{+}$ are positive examples, while $X^{-}$ are negative examples</p>
<p>Similarity relations between instances can be represented by means of an undirected similarity graph having adjacency matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$: if two instances are connected in the similarity graph, it means that they are considered <em>similar</em>, and should be assigned the same label.
Specifically, $\mathbf{W}_{ij} > 0$ iff the instances $x_{i}, x_{j} \in X$ are connected by an edge in the similarity graph, and $\mathbf{W}_{ij} = 0$ otherwise.</p>
<p>Let $y_{i} \in \{ \pm 1 \}$ be the label assigned to the $i$-th instance $x_{i} \in X$.
We can encode our assumption that <em>similar instances should be assigned similar labels</em> by defining a quadratic cost function over labeling functions in the form $f : X \mapsto \{ \pm 1 \}$:</p>
<script type="math/tex; mode=display">E(f) = \frac{1}{2} \sum_{x_{i} \in X} \sum_{x_{j} \in X} \mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]^{2}.</script>
<p>Given an input labeling function $f$, the cost function $E(\cdot)$ associates, for each pair of instances $x_{i}, x_{j} \in X$, a non-negative cost $\mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]$: this quantity is $0$ when $\mathbf{W}_{ij} = 0$ (i.e. $x_{i}$ and $X_{j}$ are not linked in the similarity graph), or when $f(x_{i}) = f(x_{j})$ (i.e. they are assigned the same label).</p>
<p>For such a reason, the cost function $E(\cdot)$ favors labeling functions that are more likely to assign the same labels to instances that are linked by an edge in the similarity graph.</p>
<p>Now, the problem of finding a labeling function that is both consistent with training labels, and assigns similar labels to similar instances, can be cast as a <em>cost minimization problem</em>. Letâ€™s represent a labeling function $f$ by a vector $\mathbf{f} \in \mathbb{R}^{n}$, $L \subset X$ denote labeled instances, and $\mathbf{y}_{i} \in \{ \pm 1 \}$ denote the label of the $x_{i}$-th instance.
The optimization problem can be defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
& \underset{\mathbf{f} \in \{ \pm 1 \}^{n}}{\text{minimize}}
& & E(\mathbf{f}) \\
& \text{subject to}
& & \forall x \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i}.
\end{aligned} %]]></script>
<p>The constraint $\forall x \in L : \mathbf{f}_{i} = \mathbf{y}_{i}$ enforces the label of each labeled example $x_{i} \in L$ to $\mathbf{f}_{i} = +1$ if the instance has a positive label, and to $\mathbf{f}_{i} = -1$ if the instance has a negative label, so to achieve consistency with training labels.</p>
<p>However, constraining labeling functions $f$ to only take discrete values has two main drawbacks:</p>
<ul>
<li>Each function $f$ can only provide <em>hard</em> classifications, without yielding any measure of confidence in the provided classification.</li>
<li>The cost term $E(\cdot)$ can be hard to optimize in a multi-label classification setting.</li>
</ul>
<p>For overcoming such limitations, Zhu et al. propose a <em>continuous relaxation</em> of the previous optimization problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
& \underset{\mathbf{f} \in \mathbb{R}^{n}}{\text{minimize}}
& & E(\mathbf{f}) \\
& \text{subject to}
& & \forall x \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i},
\end{aligned} %]]></script>
<p>where the term $\sum_{x_{i} \in X} \mathbf{f}_{i}^{2} = \mathbf{f}^{T} \mathbf{f}$ is a $L_{2}$ regularizer over $\mathbf{f}$, weighted by a parameter $\epsilon > 0$ which ensures that the optimization problem has a unique global solution.</p>
<p>The parameter $\epsilon$ can be interpreted as the <em>decay</em> of the propagation process: as the distance from a labeled instance within the similarity graph increases, the confidence in the classification (as measured by the continuous label) gets closer to zero.</p>
<p>This optimization problem has a unique, global solution that can be calculated in closed-form. Specifically, the optimal (relaxed) discriminant function $f : X \mapsto \mathbb{R}$ is given by $\mathbf{\hat{f}} = \left[ \mathbf{f}_{L}, \mathbf{f}_{U} \right]^{T}$, where $\mathbf{\hat{f}}_{L} = \mathbf{y}_{L}$ (i.e. labels for labeled examples in $L$ coincide with training labels), while $\mathbf{\hat{f}}_{U}$ is given by:</p>
<script type="math/tex; mode=display">\mathbf{\hat{f}}_{U} = (\mathbf{L}_{UU} + \epsilon \mathbf{I})^{-1} \mathbf{W}_{UL} \mathbf{\hat{f}}_{L},</script>
<p>where $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the <em>graph Laplacian</em> of the similarity graph with adjacency matrix $\mathbf{W}$, and $\mathbf{D}$ is a diagonal matrix such that $\mathbf{D}_{ii} = \sum_{j} \mathbf{W}_{ij}$.</p>
Fri, 01 Jan 2016 00:00:00 +0000
http://neuralnoise.com///2016/gaussian-fields/
http://neuralnoise.com///2016/gaussian-fields/machine learningsemi-supervised learningknowledge graph completionmachine learningsemi-supervised learningknowledge graph completion