neuralnoise.com

Looking for Postdocs!

Tue, 01 Nov 2022 01:00:00 +0100

We have an opening for a 2-year postdoc – more details are available here – on a project titled Gradient-based Learning of Complex Latent Structures, with me as the Principal Investigator (PI), and Antonio Vergari (IANC) and Edoardo Ponti (ILCC) as co-PIs. The position is entirely funded by the Edinburgh Laboratory for Integrated Artificial Intelligence (ELIAI) – if you want to know more, feel free to reach out!

You can apply at this link.

Project description

Imposing structural constraints on the latent representations learned by deep neural models has several applications: it can improve their explainability, their robustness, and their ability to generalise to out-of-domain distributions. For example, we can learn more explainable models by making them selectively decide which parts of the input to consider; and we can improve their generalisation properties by learning representations that are suitable for reasoning tasks, such as deductive reasoning and planning, and that comply with any desired constraints. For instance, the intermediate structure can represent a relational graph between objects in the world; the relationships between multiple sub-questions in a complex question; or computation graphs which can be executed to produce a prediction.

In this project, we aim to investigate how we can derive better methods for back-propagating through mixed continuous-discrete complex latent structures, and how we can leverage them for learning more explainable, data-efficient, and robust deep neural models. Discrete latent representations are not widely adopted in deep neural models because they tend not to interact well with gradient-based optimisation methods, but this has started to change recently (e.g., see Niepert et al., 2021; Minervini et al., 2022), enabling a wide range of applications and use cases.

Position

The post holder will work on projects involving the design and application of deep learning models with discrete latent structures for improving their explainability, generalisation, and robustness properties. They will be part of the new Edinburgh Laboratory for Integrated Artificial Intelligence and the Edinburgh NLP Group, a world-leading research group in Natural Language Processing.

The School of Informatics is one of the largest research centres in Computer Science in Europe, and it has been ranked #1 in the UK in terms of research power by a large margin. The Edinburgh NLP Group is consistently ranked among the world’s leading research groups in Natural Language Processing. We are offering an exciting opportunity to work in an interdisciplinary, collaborative, friendly, and supportive environment, integrating different sub-fields of Computer Science and Artificial Intelligence.

PhD Projects

Sat, 01 Oct 2022 02:00:00 +0200

As mentioned here, in September 2022 I joined the Institute for Language, Cognition and Communication (ILCC) at the School of Informatics, University of Edinburgh, one of the world’s best schools in NLP and related areas, as a faculty member in NLP! If you are interested in working with me, I have funding for multiple PhD students: make sure to apply either to the UKRI CDT in Natural Language Processing or to the ILCC 3-year PhD program!

Some more details on the ILCC PhD program – there are two deadlines for applying: the first round is on 25th November 2022, and the second round is on 27th January 2023. I strongly recommend that non-UK applicants submit their applications in the first round, to maximise their chances of funding.

Regarding the NLP CDT program – there are also two deadlines for applying: the first round is on 25th November 2022, and the second round is on 27th January 2023. Likewise, I strongly recommend that non-UK applicants submit their applications in the first round, to maximise their chances of funding.

If you are interested in working with me, you can apply via the ILCC PhD program’s and the NLP CDT program’s application portals. You will be asked to submit a research proposal: this is mostly used for assessing candidate PhD students and for matching them with potential faculty supervisors, and you can decide to work on different problems during your PhD. If you would like some feedback on your research proposal, get in touch!

Below is a (non-exhaustive but fairly up-to-date) list of PhD topics we may decide to work on – this list is also available on the Possible PhD topics in ILCC page. An older list of possible research topics is also available at this link. Feel free to propose new project topics that interest you: I’m always happy to explore new directions!

Open-Domain Complex Question Answering at Scale

Open-Domain Question Answering (ODQA) is a task where a system needs to generate the answer to a given general-domain question, and the evidence is not given as input to the system. A core limitation of modern ODQA models (and, more generally, of all models for solving knowledge-intensive tasks) is that they remain limited to answering simple, factoid questions, where the answer to the question is explicit in a single piece of evidence. In contrast, complex questions involve aggregating information from multiple documents, requiring some form of logical reasoning and sequential, multi-hop processing in order to generate the answer. Projects in this area involve proposing new ODQA models for answering complex questions, for example, by taking inspiration from models for answering complex queries in Knowledge Graphs (Arakelyan et al., 2021; Minervini et al., 2022a) and Neural Theorem Provers (Minervini et al., 2020a; Minervini et al., 2020b), and proposing methods by which neural ODQA models can learn to search in massively large text corpora, such as the entire Web.

Neuro-Symbolic and Hybrid Discrete-Continuous Natural Language Processing Models

Incorporating discrete components, such as discrete decision steps and symbolic reasoning algorithms, in neural models can significantly improve their interpretability, data efficiency, and predictive properties; for example, see (Niepert et al., 2021; Minervini et al., 2022b; Minervini et al., 2020a; Minervini et al., 2020b). However, approaches in this space rely either on ad-hoc continuous relaxations (e.g., Minervini et al., 2020a; Minervini et al., 2020b) or on gradient estimation techniques that require some assumptions on the distributions of the discrete variables (Niepert et al., 2021; Minervini et al., 2022b). Projects in this area involve devising neuro-symbolic approaches for solving NLP tasks that require some degree of reasoning and compositionality, and identifying gradient estimation techniques (for back-propagating through discrete decision steps) that are data-efficient, hyperparameter-free, and accurate, and that require fewer assumptions on the distribution of the discrete variables.

Learning from Graph-Structured Data

Graph-structured data is everywhere – e.g. consider Knowledge Graphs, social networks, protein and drug interaction networks, and molecular profiles. In this project, we aim to improve models for learning from graph-structured data and their evaluation protocols. Projects in this area involve incorporating invariances and constraints in graph machine learning models (e.g., see Minervini et al., 2017), proposing methods of transferring knowledge between graph representations, automatically identifying functional inductive biases for learning from graphs from a given domain (such as Knowledge Graphs – for example, see our NeurIPS 2022 paper on incorporating the inductive biases used by factorisation-based models into GNNs) and proposing techniques for explaining the output of black-box graph machine learning methods (such as graph embeddings).

Call for PhD Students

Fri, 01 Oct 2021 02:00:00 +0200

From September 2022, I will join the Institute for Language, Cognition and Communication (ILCC) at the School of Informatics, University of Edinburgh!

And there is more! I have funding for multiple PhD students: if you are interested in working with me, make sure to apply either to the UKRI CDT in Natural Language Processing or to the ILCC 3-year PhD program.

In general, I care about anything that can help Deep Learning models become more data-efficient, statistically robust, and explainable. As Artificial Intelligence and Machine Learning systems become more pervasive in areas like critical infrastructures, education, and healthcare, there is an increasing need for AI-based systems that we can trust. For example, the European Union is working on a new set of regulations that will require AI-based systems used in high-risk areas to produce high-quality explanations for their users and to achieve high levels of robustness and accuracy, among other things. This will automatically exclude the vast majority of the Deep Learning systems that we love and work with on a daily basis.

My research focuses on filling this gap by developing Deep Learning systems that can produce faithful explanations, that can learn from fewer examples (e.g. thanks to stronger inductive biases), and that can work even on out-of-distribution samples (such as adversarial inputs).

You may want to know a bit more about my research so far in these directions – here are some pointers. Let me know if any of these click with you, and feel free to reach out!

Bridging Neural and Symbolic Computation

One way I am trying to address some of the limitations of modern Deep Learning models is by designing hybrid approaches that inherit the strengths of both neural and symbolic systems.

For example, let’s consider the problem of answering complex symbolic queries on (potentially very large) Knowledge Graphs. In our paper Complex Query Answering with Neural Link Predictors, presented at ICLR 2021, we presented a hybrid approach where the query answering task is reduced to solving an optimisation problem whose structure follows the compositional logic structure of the query. Using orders of magnitude less training data, our approach obtains significant improvements in comparison with the purely-neural state-of-the-art models developed in this space, while also being able to produce faithful explanations to its users. This paper obtained an Outstanding Paper Award at ICLR 2021.

Or, for example, let’s consider the problem of deductive reasoning – i.e. deriving logical conclusions. Previous research shows that even BERT-based models do not generalise properly when required to perform reasoning tasks that differ from those observed during training – e.g. because they require composing multiple reasoning patterns that were never observed together at training time. We proposed several approaches for solving this problem, by designing neural models whose behaviour mimics the behaviour of logic deductive reasoners. Our approaches enable neural models to perform multi-hop reasoning over multiple documents (ACL 2019), and learn logic rules from graph-structured data (ICML 2020 and AAAI 2020).

More recently, we were wondering whether it could be possible to incorporate black-box algorithmic components, like Dijkstra’s shortest path algorithm or any ILP solver in a neural model. In our paper Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions, presented at NeurIPS 2021, we developed a very general (and extremely simple!) method for back-propagating through a massive variety of algorithmic components, effectively allowing neural models to use them as off-the-shelf components. See our presentation of this paper, as well as Yannic Kilcher’s explanation.

Incorporating Constraints in Neural Models

At other times, we would like a neural model to comply with a given set of constraints. For example, when our model predicts that $X$ is a parent of $Y$, and $Y$ is a parent of $Z$, we would also like it to predict that $X$ is a grandparent of $Z$. Constraints are key for developing statistically robust models – for example, think of adversarial perturbations in computer vision. In the case of adversarial perturbations, the model is essentially violating a single constraint, i.e. given an image $X$, if $Y$ is a semantically-invariant perturbation of $X$, the model should produce the same output for both $X$ and $Y$.

In our paper Adversarial Sets for Regularising Neural Link Predictors, presented at UAI 2017, we presented the first method for incorporating arbitrary constraints encoded in the form of First-Order Logic rules in a wide class of neural models. Our idea is very simple and general: during training, at each step, we can define an adversary that finds on which inputs the model maximally violates a given constraint, and then require the model to reduce the degree of such violations. We also show that, for a wide class of models and constraint types, we can have efficient and globally-optimal solutions to the problem of finding where the model maximally violates a constraint. This is pretty amazing, since (1) it makes the training procedure extremely efficient, adding very little overhead, and (2) if the search process does not return any significant violation of a constraint, it means that the model will never violate that constraint, for every possible input it may encounter. This provides a way of producing some kind of safety guarantees for a large set of neural models, which are very desirable in a lot of high-risk settings.

We explored further applications of these ideas in several settings. For example, in Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge (CoNLL 2018), we show that some common-sense reasoning patterns can also be represented as constraints, and incorporating these in neural Natural Language Inference (NLI) models yields improvements both on in-distribution and out-of-distribution data. In Gone At Last: Removing the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training (EMNLP 2020), we show that we can use ensembles of adversaries for de-biasing neural NLI models. In Undersensitivity in Neural Reading Comprehension (Findings of EMNLP 2020), we found that neural Question Answering (QA) models can often ignore semantically meaningful variations in the input questions, and proposed a related training process for correcting such behaviour. In Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations (ACL 2020), we identified that models for producing natural language explanations often violate self-consistency constraints, and can produce mutually inconsistent explanations.

Some notes on Gaussian Fields and Label Propagation

Sun, 01 Jan 2017 01:00:00 +0100

On several occasions, we find ourselves in need of propagating information among nodes in an undirected graph.

For instance, consider graph-based Semi-Supervised Learning (SSL): here, labeled and unlabeled examples are represented by an undirected graph, referred to as the similarity graph.

The task consists in finding a label assignment to all examples, such that:

The final labeling is consistent with training data (e.g. positive training examples are still classified as positive at the end of the learning process), and
Similar examples are assigned similar labels: this is referred to as the semi-supervised smoothness assumption.

Similarly, in networked data such as social networks, we might assume that related entities (such as friends) are associated with similar attributes (such as political and religious views, musical tastes and so on): in social network analysis, this phenomenon is commonly referred to as homophily (love of the same).

In both cases, propagating information from a limited set of nodes in a graph to all nodes provides a method for predicting the attributes of such nodes, when this information is missing.

In the following, we introduce a really clever method for efficiently propagating information about nodes in undirected graphs, known as the Gaussian Fields method.

Propagation as a Cost Minimization Problem

We now cast the propagation problem as a binary classification task. Let $X = \{ x_{1}, x_{2}, \ldots, x_{n} \}$ be a set of $n$ instances, of which only $l$ are labeled: $X^{+}$ are positive examples, while $X^{-}$ are negative examples.

Similarity relations between instances can be represented by means of an undirected similarity graph having adjacency matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$: if two instances are connected in the similarity graph, it means that they are considered similar, and should be assigned the same label. Specifically, $\mathbf{W}_{ij} > 0$ iff the instances $x_{i}, x_{j} \in X$ are connected by an edge in the similarity graph, and $\mathbf{W}_{ij} = 0$ otherwise.

Let $y_{i} \in \{ \pm 1 \}$ be the label assigned to the $i$-th instance $x_{i} \in X$. We can encode our assumption that similar instances should be assigned similar labels by defining a quadratic cost function over labeling functions in the form $f : X \mapsto \{ \pm 1 \}$:

\[E(f) = \frac{1}{2} \sum_{x_{i} \in X} \sum_{x_{j} \in X} \mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]^{2}.\]

Given an input labeling function $f$, the cost function $E(\cdot)$ assigns to each pair of instances $x_{i}, x_{j} \in X$ a non-negative cost $\mathbf{W}_{ij} \left[ f(x_{i}) - f(x_{j}) \right]^{2}$: this quantity is $0$ when $\mathbf{W}_{ij} = 0$ (i.e. $x_{i}$ and $x_{j}$ are not linked in the similarity graph), or when $f(x_{i}) = f(x_{j})$ (i.e. they are assigned the same label).

For such a reason, the cost function $E(\cdot)$ favors labeling functions that are more likely to assign the same labels to instances that are linked by an edge in the similarity graph.
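As a small illustration of this cost function, the sketch below evaluates $E(f)$ with NumPy on a hypothetical four-node chain graph. The graph, the two candidate labelings and the function names are assumptions made for this example only. It also checks the standard identity $E(f) = \mathbf{f}^{T} (\mathbf{D} - \mathbf{W}) \mathbf{f}$, which holds when $\mathbf{W}$ is symmetric and $\mathbf{D}$ is the diagonal matrix of row sums of $\mathbf{W}$.

```python
import numpy as np


def smoothness_cost(W, f):
    """E(f) = 1/2 * sum_{i,j} W_ij * (f_i - f_j)^2."""
    diff = f[:, None] - f[None, :]  # diff[i, j] = f_i - f_j
    return 0.5 * np.sum(W * diff ** 2)


def laplacian_cost(W, f):
    """Same quantity written as f^T (D - W) f, valid for symmetric W."""
    D = np.diag(W.sum(axis=1))  # diagonal degree matrix
    return f @ (D - W) @ f


# Hypothetical similarity graph: a chain 0 - 1 - 2 - 3 with unit weights.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

smooth = np.array([1.0, 1.0, -1.0, -1.0])  # label changes across one edge
rough = np.array([1.0, -1.0, 1.0, -1.0])   # label changes across every edge

cost_smooth = smoothness_cost(W, smooth)
cost_rough = smoothness_cost(W, rough)
print(cost_smooth, cost_rough)
```

On this toy graph the labeling that changes sign across a single edge has cost 4, while the alternating labeling has cost 12, so the cost prefers labelings that agree along edges.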

Now, the problem of finding a labeling function that is both consistent with training labels, and assigns similar labels to similar instances, can be cast as a cost minimization problem. Let’s represent a labeling function $f$ by a vector $\mathbf{f} \in \mathbb{R}^{n}$, let $L \subset X$ denote the labeled instances, and let $\mathbf{y}_{i} \in \{ \pm 1 \}$ denote the label of the $i$-th instance $x_{i}$. The optimization problem can be defined as follows:

\[\begin{aligned} & \underset{\mathbf{f} \in \{ \pm 1 \}^{n}}{\text{minimize}} & & E(\mathbf{f}) \\ & \text{subject to} & & \forall x_{i} \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i}. \end{aligned}\]

The constraint $\forall x_{i} \in L : \mathbf{f}_{i} = \mathbf{y}_{i}$ fixes the label of each labeled example $x_{i} \in L$ to $\mathbf{f}_{i} = +1$ if the instance has a positive label, and to $\mathbf{f}_{i} = -1$ if the instance has a negative label, so as to achieve consistency with the training labels.
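To make the discrete problem concrete, here is a minimal brute-force sketch that enumerates every $\pm 1$ assignment of the unlabeled nodes and keeps the cheapest one. It is only an illustration under assumed inputs (the weighted chain graph, the `labeled` dictionary and the function name are hypothetical), and it is exponential in the number of unlabeled nodes, so it is usable only on tiny graphs.

```python
import itertools

import numpy as np


def brute_force_labeling(W, labeled):
    """Minimise E(f) over f in {-1, +1}^n with the labeled nodes held fixed.

    `labeled` maps a node index to its training label (+1.0 or -1.0).
    """
    n = W.shape[0]
    free = [i for i in range(n) if i not in labeled]
    best_f, best_cost = None, np.inf
    for values in itertools.product([-1.0, 1.0], repeat=len(free)):
        f = np.empty(n)
        for i, y in labeled.items():
            f[i] = y              # constraint: f_i = y_i for labeled nodes
        f[free] = values          # candidate assignment for unlabeled nodes
        diff = f[:, None] - f[None, :]
        cost = 0.5 * np.sum(W * diff ** 2)
        if cost < best_cost:
            best_f, best_cost = f.copy(), cost
    return best_f, best_cost


# Hypothetical weighted chain 0 - 1 - 2 - 3; the middle edge is the weakest.
W = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
])
best_f, best_cost = brute_force_labeling(W, {0: 1.0, 3: -1.0})
print(best_f, best_cost)
```

In this example the cheapest labeling changes sign across the weakest edge, giving `[1, 1, -1, -1]` with cost 4. The output is a hard assignment with no indication of how confident the prediction for each unlabeled node is.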

However, constraining labeling functions $f$ to only take discrete values has two main drawbacks:

Each function $f$ can only provide hard classifications, without yielding any measure of confidence in the provided classification.
The cost term $E(\cdot)$ can be hard to optimize in a multi-label classification setting.

To overcome these limitations, Zhu et al. propose a continuous relaxation of the previous optimization problem:

\[\begin{aligned} & \underset{\mathbf{f} \in \mathbb{R}^{n}}{\text{minimize}} & & E(\mathbf{f}) + \epsilon \sum_{x_{i} \in X} \mathbf{f}_{i}^{2} \\ & \text{subject to} & & \forall x_{i} \in L: \; \mathbf{f}_{i} = \mathbf{y}_{i}, \end{aligned}\]

where the term $\sum_{x_{i} \in X} \mathbf{f}_{i}^{2} = \mathbf{f}^{T} \mathbf{f}$ is an $L_{2}$ regularizer over $\mathbf{f}$, weighted by a parameter $\epsilon > 0$ which ensures that the optimization problem has a unique global solution.

The parameter $\epsilon$ can be interpreted as the decay of the propagation process: as the distance from a labeled instance within the similarity graph increases, the confidence in the classification (as measured by the continuous label) gets closer to zero.

This optimization problem has a unique, global solution that can be calculated in closed form. Specifically, the optimal (relaxed) discriminant function $f : X \mapsto \mathbb{R}$ is given by $\mathbf{\hat{f}} = \left[ \mathbf{\hat{f}}_{L}, \mathbf{\hat{f}}_{U} \right]^{T}$, where $\mathbf{\hat{f}}_{L} = \mathbf{y}_{L}$ (i.e. labels for labeled examples in $L$ coincide with training labels), while $\mathbf{\hat{f}}_{U}$ is given by:

\[\mathbf{\hat{f}}_{U} = (\mathbf{L}_{UU} + \epsilon \mathbf{I})^{-1} \mathbf{W}_{UL} \mathbf{\hat{f}}_{L},\]

where $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the graph Laplacian of the similarity graph with adjacency matrix $\mathbf{W}$, and $\mathbf{D}$ is a diagonal matrix such that $\mathbf{D}_{ii} = \sum_{j} \mathbf{W}_{ij}$.
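The closed-form solution can be sketched in a few lines of NumPy. The code below is a minimal illustration rather than a reference implementation: the function name, its arguments and the two toy chain graphs are assumptions made for this example. It solves the linear system $(\mathbf{L}_{UU} + \epsilon \mathbf{I}) \mathbf{\hat{f}}_{U} = \mathbf{W}_{UL} \mathbf{\hat{f}}_{L}$ instead of forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np


def gaussian_fields(W, labeled_idx, y_labeled, eps=1e-2):
    """Closed-form label propagation on a graph with adjacency matrix W.

    Returns a vector f with f[labeled_idx] = y_labeled and
    f[unlabeled] = (L_UU + eps * I)^{-1} W_UL y_labeled.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    y_labeled = np.asarray(y_labeled, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    laplacian = np.diag(W.sum(axis=1)) - W  # L = D - W
    L_UU = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_UL = W[np.ix_(unlabeled_idx, labeled_idx)]

    A = L_UU + eps * np.eye(len(unlabeled_idx))
    f_U = np.linalg.solve(A, W_UL @ y_labeled)

    f = np.empty(n)
    f[labeled_idx] = y_labeled
    f[unlabeled_idx] = f_U
    return f


def chain(n):
    """Adjacency matrix of an unweighted chain 0 - 1 - ... - (n - 1)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W


# Both ends labeled: positive on node 0, negative on node 4.
f_two_ends = gaussian_fields(chain(5), [0, 4], [1.0, -1.0], eps=0.01)

# Only node 0 labeled: scores shrink with distance from it.
f_decay = gaussian_fields(chain(5), [0], [1.0], eps=0.5)
print(f_two_ends)
print(f_decay)
```

With both ends labeled, the middle node sits at zero and its neighbours take the values $\pm 1 / (2 + \epsilon)$, so the sign gives the predicted class and the magnitude acts as a confidence. With a single labeled node, the scores decrease monotonically along the chain, which is the decay behaviour controlled by $\epsilon$.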