Workshop
5th Workshop on Meta-Learning
Erin Grant · Fábio Ferreira · Frank Hutter · Jonathan Schwarz · Joaquin Vanschoren · Huaxiu Yao

Mon Dec 13 03:00 AM -- 12:30 PM (PST)
Event URL: https://meta-learn.github.io/

Recent years have seen rapid progress in meta-learning methods, which transfer knowledge across tasks and domains to efficiently learn new tasks, optimize the learning process itself, and even generate new learning methods from scratch. Meta-learning can be seen as the logical conclusion of the arc that machine learning has undergone in the last decade, from learning classifiers, to learning representations, and finally to learning algorithms that themselves acquire representations, classifiers, and policies for acting in environments. In practice, meta-learning has been shown to yield new state-of-the-art automated machine learning methods, novel deep learning architectures, and substantially improved one-shot learning systems. Moreover, improving one’s own learning capabilities through experience can also be viewed as a hallmark of intelligent beings, and neuroscience shows a strong connection between human and reward learning and the growing sub-field of meta-reinforcement learning.

Mon 3:00 a.m. - 3:10 a.m.
Introduction and opening remarks (Live speech)
Mon 3:10 a.m. - 3:35 a.m.
Ying Wei (Invited talk)
Ying Wei
Mon 3:35 a.m. - 3:40 a.m.
Ying Wei Q&A (Q&A)
Ying Wei
Mon 3:40 a.m. - 4:00 a.m.

When data are scarce, meta-learning can improve a learner's accuracy by harnessing previous experience from related learning tasks. However, existing methods have unreliable uncertainty estimates, which are often overconfident. Addressing these shortcomings, we introduce a novel meta-learning framework, called F-PACOH, that treats meta-learned priors as stochastic processes and performs meta-level regularization directly in the function space. This allows us to directly steer the probabilistic predictions of the meta-learner towards high epistemic uncertainty in regions of insufficient meta-training data and, thus, obtain well-calibrated uncertainty estimates. Finally, we showcase how our approach can be integrated with sequential decision making, where reliable uncertainty quantification is imperative. In our benchmark study on meta-learning for Bayesian Optimization (BO), F-PACOH significantly outperforms all other meta-learners and standard baselines.

Jonas Rothfuss · Dominique Heyn · Jinfan Chen · Andreas Krause
Mon 4:00 a.m. - 5:00 a.m.
Poster session 1 (Poster session)
Mon 5:00 a.m. - 5:25 a.m.
Carlo Ciliberto (Invited talk)
Carlo Ciliberto
Mon 5:25 a.m. - 5:30 a.m.
Carlo Ciliberto Q&A (Q&A)
Carlo Ciliberto
Mon 5:30 a.m. - 5:55 a.m.
Mihaela van der Schaar (Invited talk)
Mihaela van der Schaar
Mon 5:55 a.m. - 6:00 a.m.
Mihaela van der Schaar Q&A (Q&A)
Mihaela van der Schaar
Mon 6:00 a.m. - 7:00 a.m.
Break
Mon 7:00 a.m. - 8:00 a.m.
Panel Discussion
Mon 8:00 a.m. - 8:20 a.m.

We propose an algorithm for meta-optimization that lets the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under some loss. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the improvement is related to the target distance. Thus, by controlling curvature, the distance measure can be used to ease meta-optimization. Further, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. The algorithm is versatile and easy to implement. We achieve a new state-of-the-art for model-free agents on the Atari ALE benchmark, improve upon MAML in few-shot learning, and demonstrate how our approach opens up new possibilities by meta-learning efficient exploration in an epsilon-greedy Q-learning agent.

Sebastian Flennerhag · Yannick Schroecker · Tom Zahavy · Hado van Hasselt · David Silver · Satinder Singh
Mon 8:20 a.m. - 8:45 a.m.
Nan Rosemary Ke (Invited talk)   
Nan Rosemary Ke
Mon 8:45 a.m. - 8:50 a.m.
Nan Rosemary Ke Q&A (Q&A)
Nan Rosemary Ke
Mon 8:50 a.m. - 10:00 a.m.
Poster session 2 (Poster session)
Mon 10:00 a.m. - 10:25 a.m.
Luke Metz (Invited talk)
Luke Metz
Mon 10:25 a.m. - 10:30 a.m.
Luke Metz Q&A (Q&A)
Luke Metz
Mon 10:30 a.m. - 10:55 a.m.
Eleni Triantafillou (Invited talk)   
Eleni Triantafillou
Mon 10:55 a.m. - 11:00 a.m.
Eleni Triantafillou Q&A (Q&A)
Eleni Triantafillou
Mon 11:00 a.m. - 11:20 a.m.

Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation. Instead, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.

Vitchyr Pong · Ashvin Nair · Laura Smith · Catherine Huang · Sergey Levine
Mon 11:20 a.m. - 12:30 p.m.
Poster session 3 (Poster session)
-

Meta-learning allows an intelligent agent to leverage prior learning episodes as a basis for quickly improving performance on novel tasks. A critical challenge lies in the inherent uncertainty about whether new tasks can be considered similar to those observed before, and robust meta-learning methods would ideally reason about this to produce corresponding uncertainty estimates. We extend model-agnostic meta-learning with variational inference: we model the identity of new tasks as a latent random variable, which modulates the fine-tuning of meta-learned neural networks. Our approach requires little additional computation and doesn't make strong assumptions about the distribution of the neural network weights, and allows the algorithm to generalize to more divergent task distributions, resulting in better-calibrated uncertainty measures while maintaining accurate predictions.

Joaquin Vanschoren
-

Meta-learning aims to train a model on various tasks so that given sample data from a task, even if unforeseen, it can adapt fast and perform well. We apply techniques from compressed sensing to shed light on the effect of inner-loop regularization in meta-learning, with an algorithm that minimizes cross-task interference without compromising weight-sharing. In our algorithm, which is representative of numerous similar variations, the model is explicitly trained such that upon adding a pertinent sparse output layer, it can perform well on a new task with very few updates, where cross-task interference is minimized by the sparse recovery of the output layer. We demonstrate that this approach produces good results on few-shot regression, classification and reinforcement learning, with several benefits in terms of training efficiency, stability and generalization.

Beicheng Lou · Nathan Zhao · Jiahui Wang
-

Teaching robots to learn diverse locomotion skills under complex three-dimensional environmental settings via Reinforcement Learning (RL) is still challenging. It has been shown that training agents in simple settings before moving them on to complex settings improves the training process, but so far only in the context of relatively simple locomotion skills. In this work, we adapt the Enhanced Paired Open-Ended Trailblazer (ePOET) approach to train more complex agents to walk efficiently on complex three-dimensional terrains. First, to generate more rugged and diverse three-dimensional training terrains with increasing complexity, we extend the Compositional Pattern Producing Networks - Neuroevolution of Augmenting Topologies (CPPN-NEAT) approach and include randomized shapes. Second, we combine ePOET with Soft Actor-Critic off-policy optimization, yielding ePOET-SAC, to ensure that the agent could learn more diverse skills to solve more challenging tasks.

Joaquin Vanschoren
-

Neural processes (NPs) aim to stochastically complete unseen data points based on a given context dataset. NPs essentially leverage a given dataset as a context embedding to derive an identifier suitable for a novel task. To improve the prediction accuracy, many variants of NPs have investigated context embedding approaches that generally design novel network architectures and aggregation functions satisfying permutation invariance. This paper proposes a stochastic attention mechanism for NPs to capture appropriate context information. From the perspective of information theory, we demonstrate that the proposed method encourages context embedding to be differentiated from a target dataset. The differentiated information induces NPs to learn to derive appropriate identifiers by considering together context embeddings and features in a target dataset. We empirically show that our approach substantially outperforms various conventional NPs in 1D regression and Lotka-Volterra problems as well as image completion. In addition, we observe that the proposed method maintains performance and captures context embedding under restricted task distributions, where typical NPs suffer from a lack of effective tasks to learn context embeddings. The proposed method achieves comparable results with state-of-the-art methods on the MovieLens-10k dataset, one of the real-world problems with limited users, and performs well on the image completion task even with a very limited meta-training dataset.

Mingyu Kim · KyeongRyeol Go · Se-Young Yun
-

Currently, it is hard to reap the benefits of deep learning for Bayesian methods. We present Prior-Data Fitted Networks (PFNs), a method that allows large-scale machine learning techniques to be employed to approximate a large set of posteriors. The only requirement for PFNs is the ability to sample from a prior distribution over supervised learning tasks (or functions). The method repeatedly draws a task (or function) from this prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with samples from a new supervised learning task as input, it can then make probabilistic predictions for arbitrary other data points in a single forward propagation, effectively having learned to perform Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in such diverse areas as Gaussian process regression and Bayesian neural networks, demonstrating the generality of PFNs.

Samuel Müller · Noah Hollmann · Sebastian Pineda Arango · Josif Grabocka · Frank Hutter
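
The data-generation loop described in the PFN abstract above (draw a task from a prior, draw labelled points from it, mask one label) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy prior over noisy linear functions, the function names `sample_task` and `make_training_example`, and the dataset size are hypothetical choices, not taken from the paper, and the network that would be trained on these examples is omitted.

```python
import random

def sample_task(rng):
    """Draw one function from a toy prior (assumption): y = a*x + b + noise, with a, b ~ N(0, 1)."""
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return lambda x: a * x + b + rng.gauss(0.0, 0.1)

def make_training_example(rng, n_points=5):
    """Sample a task, draw a dataset from it, and hold out one label as the prediction target."""
    f = sample_task(rng)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
    ys = [f(x) for x in xs]
    held_out = rng.randrange(n_points)  # index of the masked label
    # The remaining labelled points form the set-valued input (context).
    context = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i != held_out]
    return context, xs[held_out], ys[held_out]

rng = random.Random(0)
context, query_x, target_y = make_training_example(rng)
```

A model trained on many such `(context, query_x, target_y)` triples would learn to predict the masked label from the rest of the set; how that model is built and trained is outside this sketch.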
-

Consistency, the theoretical property of a meta learning algorithm of being able to adapt to any task at test time under its default settings (and various assumptions), has been frequently named as desirable in the literature. An open question is whether and how theoretical consistency translates into practice, in comparison to inconsistent algorithms. In this paper, we empirically investigate this question on a set of representative meta-RL algorithms. We find that usually, theoretically consistent algorithms can indeed adapt to out-of-distribution (OOD) tasks, while inconsistent ones cannot, although they can still fail in practice due to reasons like poor exploration. We further find that theoretically inconsistent algorithms can be made consistent by continuing to train on the OOD tasks, and adapt as well or better than consistent ones. We conclude that theoretical consistency is indeed a desirable property, albeit not as advantageous in practice as often assumed.

Zheng Xiong · Luisa Zintgraf · Jacob Beck · Risto Vuorio · Shimon Whiteson
-

Catastrophic forgetting in neural networks is a significant problem for continual learning. A majority of the current methods replay previous data during training, which violates the constraints of a strict continual learning setup. Additionally, current approaches that deal with forgetting ignore the problem of catastrophic remembering, i.e. the worsening ability to discriminate between data from different tasks. In our work, we introduce Relevance Mapping Networks (RMNs). The mappings reflect the relevance of the weights for the task at hand by assigning large weights to essential parameters. We show that RMNs learn an optimized representational overlap that overcomes the twin problem of catastrophic forgetting and remembering. Our approach achieves state-of-the-art performance across many common continual learning benchmarks, even significantly outperforming data replay methods while not violating the constraints for a strict continual learning setup. Moreover, RMNs retain the ability to discriminate between old and new tasks in an unsupervised manner, thus proving their resilience against catastrophic remembering.

Prakhar Kaushik · Adam Kortylewski · Alex Gain · Alan Yuille
-

Recent generative models such as generative adversarial networks have achieved remarkable success in generating realistic images, but they require large training datasets and computational resources. The goal of few-shot image generation is to learn the distribution of a new dataset from only a handful of examples by transferring knowledge learned from structurally similar datasets. Towards achieving this goal, we propose the “Implicit Support Set Autoencoder” (ISSA) that adversarially learns the relationship across datasets using an unsupervised dataset representation, while the distribution of each individual dataset is learned using implicit distributions. Given a few examples from a new dataset, ISSA can generate new samples by inferring the representation of the underlying distribution using a single forward pass. We showcase significant gains from our method on generating high quality and diverse images for unseen classes in the Omniglot and CelebA datasets in few-shot image generation settings.

Shenyang Huang · Kuan-Chieh Wang · Guillaume Rabusseau · Alireza Makhzani
-

Many advances in machine learning can be attributed to designing systems with inductive biases well-suited for particular tasks. However, it can be challenging to ascertain what inductive biases a learning system has, much less control them in the design process. We propose a framework to capture the inductive biases in a learning system by meta-learning hyperparameters of a Gaussian process from observations of the behavior of a machine learning system. We illustrate the potential of this framework across several case studies, including investigating the inductive biases of both untrained and trained neural networks, and assessing whether a given neural network family is well-suited for a task family.

Michael Li · Erin Grant · Tom Griffiths
-

The success of neural architecture search (NAS) has historically been limited by excessive compute requirements. While modern weight-sharing NAS methods such as DARTS are able to finish the search in single-digit GPU days, extracting the final best architecture from the shared weights is notoriously unreliable. Training-Speed-Estimate (TSE), a recently developed generalization estimator with a Bayesian marginal likelihood interpretation, has previously been used in place of the validation loss for gradient-based optimization in DARTS. This prevents the DARTS skip connection collapse, which significantly improves performance on NASBench-201 and the original DARTS search space. We extend those results by applying various DARTS diagnostics and show several unusual behaviors arising from not using a validation set. Furthermore, our experiments yield concrete examples of the depth gap and topology selection in DARTS having a strongly negative impact on the search performance despite generally receiving limited attention in the literature compared to the operations selection.

Miroslav Fil · Robin Ru · Clare Lyle · Yarin Gal
-

We propose an adaptation of the curriculum training framework, applicable to state-of-the-art meta learning techniques for few-shot classification. Curriculum-based training popularly attempts to mimic human learning by progressively increasing the training complexity to enable incremental concept learning. As the meta-learner's goal is learning how to learn from as few samples as possible, the exact number of those samples (i.e. the size of the support set) arises as a natural proxy of a given task's difficulty. We define a simple yet novel curriculum schedule that begins with a larger support size and progressively reduces it throughout training to eventually match the desired shot-size of the test setup. This proposed method boosts the learning efficiency as well as the generalization capability. Our experiments with the MAML algorithm on two few-shot image classification tasks show significant gains with the curriculum training framework. Ablation studies corroborate the independence of our proposed method from the model architecture as well as the meta-learning hyperparameters.

Priyanka Agrawal
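
The curriculum in the abstract above starts with a larger support size and shrinks it until it matches the test-time shot size. A minimal sketch of one such schedule is given below; the linear decay, the function name `support_size`, and the example numbers are assumptions made for illustration, since the abstract does not specify the exact shape of the schedule.

```python
def support_size(step, total_steps, start_shots, target_shots):
    """Linearly anneal the support-set size from start_shots down to target_shots.

    The linear shape is an assumption; any monotonically decreasing schedule
    that ends at target_shots would fit the description.
    """
    if total_steps <= 1:
        return target_shots
    # Fraction of training completed, clipped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    size = round(start_shots - frac * (start_shots - target_shots))
    return max(target_shots, size)

# Five curriculum stages, going from 21 support examples down to 5.
schedule = [support_size(s, 5, start_shots=21, target_shots=5) for s in range(5)]
```

Each meta-training episode at stage `s` would then sample `schedule[s]` support examples per class before the inner-loop update.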
-

Hyperparameter optimization (HPO) is a crucial component of deploying machine learning models; however, it remains an open problem due to the resource-constrained number of possible hyperparameter evaluations. As a result, prior work focuses on exploring the direction of transfer learning for tackling the sample inefficiency of HPO. In contrast to existing approaches, we propose a novel Deep Kernel Gaussian Process surrogate with Landmark Meta-features (DKLM) that can be jointly meta-trained on a set of source tasks and then transferred efficiently to a new (unseen) target task. We design DKLM to capture the similarity between hyperparameter configurations with an end-to-end meta-feature network that embeds the set of evaluated configurations and their respective performance. As a result, our novel DKLM can learn contextualized dataset-specific similarity representations for hyperparameter configurations. We experimentally validate the performance of DKLM in a wide range of HPO meta-datasets from OpenML and demonstrate the empirical superiority of our method against a series of state-of-the-art baselines.

Hadi Jomaa · Sebastian Pineda Arango · Lars Schmidt-Thieme · Josif Grabocka
-

A longstanding goal in reinforcement learning is to build intelligent agents that show fast learning and a flexible transfer of skills akin to humans and animals. This paper investigates the integration of two frameworks for tackling those goals: episodic control and successor features. Episodic control is a cognitively inspired approach relying on episodic memory, an instance-based memory model of an agent's experiences. Meanwhile, successor features and generalized policy improvement (SF&GPI) is a meta and transfer learning framework that allows policies learned for one task to be efficiently reused for later tasks that have a different reward function. Individually, these two techniques have shown impressive results in vastly improving sample efficiency and the elegant reuse of previously learned policies. Thus, we outline a combination of both approaches in a single reinforcement learning framework and empirically illustrate its benefits.

David Emukpere · Xavier Alameda-Pineda · Chris Reinke
-

Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FedMix, that takes into account the unique challenges brought by federated learning. FedMix has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FedMix does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FedMix formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.

Elnur Gasanov · Ahmed Khaled Ragab Bayoumi · Samuel Horváth · Peter Richtarik
-

Meta-learning for few-shot classification has been challenged on its effectiveness compared to simpler pretraining methods and the validity of its claim of "learning to learn". Recent work has suggested that MAML-based models do not perform "rapid-learning" in the inner-loop but reuse features by only adapting the final linear layer. Separately, BatchNorm, a near ubiquitous inclusion in model architectures, has been shown to have an implicit learning rate decay effect on the preceding layers of a network. We study the impact of BatchNorm's implicit learning rate decay on feature reuse in meta-learning methods and find that counteracting it increases change in intermediate layers during adaptation. We also find that counteracting this learning rate decay sometimes improves performance on few-shot classification tasks.

Alexander Wang · Sasha (Alexandre) Doubov · Gary Leung
-

Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task distribution plays a vital role in the performance of the model. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on two datasets, Omniglot and miniImageNet, and with three broad classes of meta-learning models: Metric-based (i.e., Protonet, Matching Networks), Optimization-based (i.e., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (i.e., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we also demonstrate that even a handful of tasks, repeated over multiple batches, would be sufficient to achieve a performance similar to uniform sampling, which calls into question the need for additional tasks to create better models.

Ramnath Kumar · Tristan Deleu · Yoshua Bengio
-

Meta-learning models transfer the knowledge acquired from previous tasks to quickly learn new ones. They are trained on benchmarks with a fixed number of data points per task. This number is usually arbitrary and it is unknown how it affects performance at testing. Since labelling of data is expensive, finding the optimal allocation of labels across training tasks may reduce costs. Given a fixed budget of labels, should we use a small number of highly labelled tasks, or many tasks with few labels each? Should we allocate more labels to some tasks and fewer to others? We show that: 1) If tasks are homogeneous, there is a uniform optimal allocation, whereby all tasks get the same amount of data; 2) At fixed budget, there is a trade-off between number of tasks and number of data points per task, with a unique and constant optimum; 3) When trained separately, harder tasks should get more data, at the cost of a smaller number of tasks; 4) When training on a mixture of easy and hard tasks, more data should be allocated to easy tasks. Interestingly, neuroscience experiments have shown that human visual skills also transfer better from easy tasks. We prove these results mathematically on mixed linear regression, and we show empirically that the same results hold for few-shot image classification on CIFAR-FS and mini-ImageNet. Our results provide guidance for allocating labels across tasks when collecting data for meta-learning.

Alexandru Cioba · Michael Bromberg · Qian Wang · Ritwik Niyogi · Georgios Batzolis · Jezabel Garcia · Da-shan Shiu · Alberto Bernacchia
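
To make the fixed-budget trade-off in the abstract above concrete, the snippet below lists every uniform allocation that spends a label budget exactly, from one heavily labelled task to many tasks with a single label each. It only enumerates the options; it does not compute which one is optimal, which is the paper's contribution. The function name `uniform_allocations` and the budget of 12 are hypothetical.

```python
def uniform_allocations(budget):
    """Return all (num_tasks, labels_per_task) pairs that use the budget exactly."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]

# With 12 labels: one task with 12 labels, two tasks with 6 each, and so on.
allocations = uniform_allocations(12)
```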
-

A trustworthy AI system demands the ability to learn a broad range of knowledge with modest data and transfer the learned prior to the concrete task. In this work, we discuss unsupervised meta-learning. We propose to learn multi-modal task-specific priors in the latent space using an energy-based prior model, where the energy term couples a continuous latent vector and a symbolic one-hot label. Such coupling in the latent space informs the latent vector of the underlying category from the observed example. Our model can be learned in an unsupervised manner in the meta-training phase and evaluated in a semi-supervised manner in the meta-test phase. Our experiments show our method outperforms all state-of-the-art methods on miniImageNet and gives competitive results on Omniglot.

Bo Pang · Deqian Kong · Ying Nian Wu
-

Model-Agnostic Meta-Learning (MAML), a popular gradient-based meta-learning framework, assumes that the contribution of each task or instance to the meta-learner is equal. Hence, it fails to address the domain shift between base and novel classes in few-shot learning. In this work, we propose a novel robust meta-learning algorithm, NESTEDMAML, which learns to assign weights to training tasks or instances. We consider weights as hyper-parameters and iteratively optimize them using a small set of validation tasks in a nested bi-level optimization approach (in contrast to the standard bi-level optimization in MAML). We then apply NESTEDMAML in the meta-training stage, which involves (1) several tasks sampled from a distribution different from the meta-test task distribution, or (2) some data samples with noisy labels. Extensive experiments on synthetic and real-world datasets demonstrate that NESTEDMAML efficiently mitigates the effects of "unwanted" tasks or instances, leading to significant improvement over the state-of-the-art robust meta-learning methods.

Krishnateja Killamsetty · Changbin Li · Chen Zhao · Rishabh Iyer · Feng Chen
-

Meta reinforcement learning (RL) attempts to discover new RL algorithms automatically from environment interaction. In so-called black-box approaches, the policy and the learning algorithm are jointly represented by a single neural network. These methods are very flexible, but they tend to underperform in terms of generalisation to new, unseen environments. In this paper, we explore the role of symmetries in meta-generalisation. We show that a recent successful meta RL approach that meta-learns an objective for backpropagation-based learning exhibits certain symmetries (specifically the reuse of the learning rule, and invariance to input and output permutations) that are not present in typical black-box meta RL systems. We hypothesise that these symmetries can play an important role in meta-generalisation. Building off recent work in black-box supervised meta learning, we develop a black-box meta RL system that exhibits these same symmetries. We show through careful experimentation that incorporating these symmetries can lead to algorithms with a greater ability to generalise to unseen action & observation spaces, tasks, and environments.

Louis Kirsch · Sebastian Flennerhag · Hado van Hasselt · Abram Friesen · Junhyuk Oh · Yutian Chen
-

In cooperative multi-agent reinforcement learning (MARL), agents often can only partially observe the environment state, and thus communication is crucial to achieving coordination. Communicating agents must simultaneously learn to whom to communicate (i.e., communication topology) and how to interpret the received message for decision-making. Although agents can efficiently learn communication interpretation by end-to-end backpropagation, learning communication topology is much trickier since the binary decisions of whether to communicate impede end-to-end differentiation. As evidenced in our experiments, existing solutions, such as reparameterization tricks and reformulating topology learning as reinforcement learning, often fall short. This paper introduces a meta-learning framework that aims to discover and continually adapt the update rules for communication topology learning. Empirical results show that our meta-learning approach outperforms existing alternatives in a range of cooperative MARL tasks and demonstrates a reasonably strong ability to generalize to tasks different from those seen in meta-training. Preliminary analyses suggest that, interestingly, the discovered update rules occasionally resemble human-designed rules such as policy gradients, yet remain qualitatively different in most cases.

Qi Zhang · Dingyang Chen
-

Meta-learning (ML) has emerged as a promising direction in learning models under constrained resource settings like few-shot learning. The popular approaches for ML either learn a generalizable initial model or a generic parametric optimizer through episodic training. The former approaches leverage the knowledge from a batch of tasks to learn an optimal prior. In this work, we study the importance of tasks in a batch for ML. We hypothesize that the common assumption in batch episodic training where each task in a batch has an equal contribution to learning an optimal meta-model need not be true. We propose to weight the tasks in a batch according to their "importance" in improving the meta-model's learning. To this end, we introduce a training curriculum, called task attended meta-training, to weight the tasks in a batch. The task attention is a standalone unit and can be integrated with any batch episodic training regimen. The comparisons of the task-attended ML models with their non-task-attended counterparts on complex datasets like miniImageNet, FC100 and tieredImageNet validate its effectiveness.

Aroof Aimen · Bharat Ladrecha · Narayanan C Krishnan
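
The idea of weighting tasks within a batch can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say how the attention unit computes its weights, so the sketch assumes each task already has a scalar score (the `scores` argument is hypothetical) and turns the scores into weights with a softmax. The function names are illustrative only.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def uniform_meta_loss(task_losses):
    """Standard batch episodic objective: every task contributes equally."""
    return sum(task_losses) / len(task_losses)


def weighted_meta_loss(task_losses, scores):
    """Weighted objective: each task's loss is scaled by its weight.

    `scores` stands in for the output of an attention unit (an assumption
    of this sketch; the abstract does not describe how it is computed).
    """
    weights = softmax(scores)
    return sum(w * loss for w, loss in zip(weights, task_losses))
```

With equal scores the weighted objective reduces to the uniform one, so uniform weighting is the special case in which no task is considered more important than another.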
-

Self-tuning algorithms that adapt the learning process online encourage more effective and robust learning. Among all the methods available, meta-gradients have emerged as a promising approach. They leverage the differentiability of the learning rule with respect to some hyper-parameters to adapt them in an online fashion. Although meta-gradients can be accumulated over multiple learning steps to avoid myopic updates, this is rarely used in practice. In this work, we demonstrate that whilst multi-step meta-gradients do provide a better learning signal in expectation, this comes at the cost of a significant increase in variance, hindering performance. In the light of this analysis, we introduce a novel method mixing multiple inner steps that enjoys a more accurate and robust meta-gradient signal, essentially trading off bias and variance in meta-gradient estimation. When applied to the Snake game, the mixing meta-gradient algorithm can cut the variance by a factor of 3 while achieving similar or higher performance.

Clément Bonnet · Paul Caron · Thomas D Barrett · Ian Davies · Alexandre Laterre
-

While deep reinforcement learning methods have shown impressive results in robot learning, their sample inefficiency makes the learning of complex, long-horizon behaviors with real robot systems infeasible. To mitigate this issue, meta-reinforcement learning methods aim to enable fast learning on novel tasks by learning how to learn. Yet, the application has been limited to short-horizon tasks with dense rewards. To enable learning long-horizon behaviors, recent works have explored leveraging prior experience in the form of offline datasets without reward or task annotations. While these approaches yield improved sample efficiency, millions of interactions with the environment are still required to solve long-horizon tasks. In this work, we devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions than prior works. Our core idea is to leverage prior experience extracted from offline datasets during meta-learning. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose the skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task. Experimental results demonstrate that the proposed method can efficiently solve long-horizon novel target tasks by combining the strengths of meta-learning and the usage of offline datasets, while prior approaches in RL, meta-RL, and multi-task RL require substantially more environment interactions to solve the tasks.

Taewook Nam · Shao-Hua Sun · Karl Pertsch · Sung Ju Hwang · Joseph Lim
-

Prominent online experimentation approaches in industry, such as A/B testing, are often not scalable with respect to the number of candidate models. To address this shortcoming, recent work has introduced an automated online experimentation (AOE) scheme that uses a probabilistic model of user behavior to predict online performance of candidate models. While effective, these predictions of online performance may be biased due to various unforeseen circumstances, such as user modelling bias, a shift in data distribution or an incomplete set of features. In this work, we leverage advances from multi-fidelity optimization in order to combine AOE with Bayesian optimization (BO). This mitigates the effect of biased predictions, while still retaining scalability and performance. Furthermore, our approach also allows us to optimally adjust the number of users in a test cell, which is typically kept constant for online experimentation schemes, leading to a more effective allocation of resources. Our synthetic experiments show that our method yields improved performance, when compared to AOE, BO and other baseline approaches.

Steven Kleinegesse · Zhenwen Dai · Andreas Damianou · Kamil Ciosek · Federico Tomasi
-

We propose a new computationally-efficient first-order algorithm for Model-Agnostic Meta-Learning (MAML). The key enabling technique is to interpret MAML as a bilevel optimization (BLO) problem and leverage the sign-based SGD (signSGD) as a lower-level optimizer of BLO. We show that MAML, through the lens of signSGD-oriented BLO, naturally yields an alternating optimization scheme that just requires first-order gradients of a learned meta-model. We term the resulting MAML algorithm Sign-MAML. Compared to the conventional first-order MAML (FO-MAML) algorithm, Sign-MAML is theoretically-grounded as it does not impose any assumption on the absence of second-order derivatives during meta training. In practice, we show that Sign-MAML outperforms FO-MAML in various few-shot image classification tasks, and compared to MAML, it achieves a much more graceful tradeoff between classification accuracy and computation efficiency.

Chen Fan · Parikshit Ram · Sijia Liu
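
One way to read the first-order claim is sketched below on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: each task is a pair of quadratic losses `0.5 * ||theta - target||^2` (a hypothetical setup), the inner loop takes a single signSGD step, and all function names are invented for the example. Because the sign function is piecewise constant, the adapted parameters depend on the initial parameters through an identity Jacobian almost everywhere, so the meta-gradient is just the query-loss gradient evaluated at the adapted parameters, with no second-order terms.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def grad_quadratic(theta, target):
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [t - c for t, c in zip(theta, target)]


def sign_inner_step(theta, support_target, alpha):
    """One signSGD step on the support loss: theta - alpha * sign(grad)."""
    g = grad_quadratic(theta, support_target)
    return [t - alpha * sign(gi) for t, gi in zip(theta, g)]


def meta_gradient(theta, support_target, query_target, alpha):
    """Meta-gradient of the query loss with respect to the initial theta.

    d(adapted)/d(theta) is the identity almost everywhere, because
    sign(.) has zero derivative wherever it is differentiable.
    """
    adapted = sign_inner_step(theta, support_target, alpha)
    return grad_quadratic(adapted, query_target)


def meta_train(tasks, theta, alpha=0.1, beta=0.05, steps=200):
    """Average the per-task meta-gradients and take plain gradient steps."""
    for _ in range(steps):
        total = [0.0] * len(theta)
        for support_target, query_target in tasks:
            g = meta_gradient(theta, support_target, query_target, alpha)
            total = [a + b for a, b in zip(total, g)]
        theta = [t - beta * g / len(tasks) for t, g in zip(theta, total)]
    return theta
```

For example, `meta_train([([1.0, 1.0], [1.0, 1.0])], [0.0, 0.0])` runs the alternating scheme on a single toy task using only first-order gradients.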
-

Meta-learning has received considerable attention as one approach to enable deep neural networks to learn from limited data. Recent results suggest that simply fine-tuning a pre-trained network may be more effective at learning new image classification tasks from limited data than more complicated meta-learning techniques such as MAML. This is surprising as the learning behavior of MAML mimics that of fine-tuning. We investigate this phenomenon and show that the pre-trained features are more diverse and discriminative than those learned by MAML and Reptile, which specialize for adaptation in low-data regimes of a similar data distribution as the one used for training. Due to this specialization and lack of diversity, MAML and Reptile may fail to generalize to out-of-distribution tasks whereas fine-tuning can fall back on the diversity of the learned features.

Mike Huisman · Jan van Rijn · Aske Plaat
-

Few-shot learning aims to classify unknown classes of examples with a few new examples per class. There are two key routes for few-shot learning. One is to (pre-)train a classifier with examples from known classes, and then transfer the pre-trained classifier to unknown classes using the new examples. The other, called meta few-shot learning, is to couple pre-training with episodic training, which contains episodes of few-shot learning tasks simulated from the known classes. Pre-training is known to play a crucial role for the transfer route, but the role of pre-training for the episodic route is less clear. In this work, we study the role of pre-training for the episodic route. We find that pre-training plays a major role in disentangling representations of known classes, which makes the resulting learning tasks easier for episodic training. The finding allows us to shift the huge simulation burden of episodic training to a simpler pre-training stage. We justify the benefit of this shift by designing a new disentanglement-based pre-training model, which helps episodic training achieve competitive performance more efficiently.

Chia-You Chen · Hsuan-Tien Lin · Masashi Sugiyama · Gang Niu
-

Bayesian optimisation (BO) has been used to search in structured spaces described by a context-free grammar, such as chemical molecules. Previous work has used a probabilistic generative model, such as a variational autoencoder, to map the structured representations into a compact continuous embedding within which BO can take advantage of local proximity and identify good search areas. However, the resultant embedding does not fully capture the structural proximity relations of the input space, which leads to inefficient search. In this paper, we propose to use contrastive learning to learn an alternative embedding. We outline how a subtree replacement strategy can generate structurally similar pairs of samples from the input space for use in contrastive learning. We demonstrate that the resulting embedding captures more of the structural proximity relationships of the input space and improves BO performance when applied to a synthetic arithmetic expression fitting task and a real-world molecule optimisation task.

Josh Tingey · Ciarán Lee · Zhenwen Dai
-

A few-shot generative model should be able to generate data from a distribution by only observing a limited set of examples. In few-shot learning, the model is trained on data from many sets from different distributions sharing some underlying properties, such as sets of characters from different alphabets or sets of images of different types of objects. We study a latent variables approach that extends the Neural Statistician to a fully hierarchical approach with an attention-based point to set-level aggregation. We extend the previous work to iterative data sampling, likelihood-based model comparison, and adaptation-free out-of-distribution generalization. Our results show that the hierarchical formulation better captures the intrinsic variability within the sets in the small data regime. With this work we generalize deep latent variable approaches to few-shot learning, taking a step towards large-scale few-shot generation with a formulation that can readily work with current state-of-the-art deep generative models.

Giorgio Giannone · Ole Winther

Author Information

Erin Grant (UC Berkeley)
Fábio Ferreira (University of Freiburg)
Frank Hutter (University of Freiburg & Bosch)

Frank Hutter is a Full Professor for Machine Learning at the Computer Science Department of the University of Freiburg (Germany), where he was previously an assistant professor from 2013 to 2017. Before that, he was at the University of British Columbia (UBC) for eight years, for his PhD and postdoc. Frank's main research interests lie in machine learning, artificial intelligence and automated algorithm design. For his 2009 PhD thesis on algorithm configuration, he received the CAIAC doctoral dissertation award for the best thesis in AI in Canada that year, and with his coauthors, he received several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning. Since 2016, he has held an ERC Starting Grant for a project on automating deep learning based on Bayesian optimization, Bayesian neural networks, and deep reinforcement learning.

Jonathan Schwarz (DeepMind & Gatsby Unit, UCL)
Joaquin Vanschoren (Eindhoven University of Technology)
Huaxiu Yao (Stanford University)

More from the Same Authors