Industry-wide successes of machine learning at the dawn of the (so-called) big data era have led to an increasing gap between practitioners and theoreticians: the former use off-the-shelf statistical and machine learning methods, while the latter design and study the mathematical properties of such algorithms. This gap is partly addressed by Bayesian researchers, whose work often combines sound mathematical guarantees with efficient implementations and model selection criteria. In the late 1990s, a new paradigm emerged in the statistical learning community, used to derive probably approximately correct (PAC) bounds on Bayesian-flavored estimators. This PAC-Bayesian theory was pioneered by Shawe-Taylor and Williamson (1997) and McAllester (1998, 1999). It was extensively formalized by Catoni (2004, 2007) and has triggered, slowly but surely, increasing research efforts over the last decades.
We believe it is time to situate current PAC-Bayesian trends relative to other modern approaches in the (statistical) machine learning community. Indeed, we observe that, while the field has grown on its own, it has taken an undesirable distance from some related areas. First, the relation to Bayesian methods has been forsaken in numerous works, despite the potential of PAC-Bayesian theory to bring new insights to the Bayesian community and to go beyond the classical Bayesian/frequentist divide. Second, PAC-Bayesian methods share similarities with other quasi-Bayesian (or pseudo-Bayesian) approaches that study Bayesian practice from a frequentist standpoint, such as the Minimum Description Length (MDL) principle (Grünwald, 2007). Last but not least, even though practical, theory-grounded learning algorithms have emerged from PAC-Bayesian work, they remain almost unused for real-world problems.
In short, this workshop aims at gathering statisticians and machine learning researchers to discuss current trends and the future of {PAC,quasi}-Bayesian learning. From a broader perspective, we aim to bridge the gap between several communities that can all benefit from sharper statistical guarantees and sound theory-driven learning algorithms.
References
[1] J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of COLT, 1997.
[2] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of COLT, 1998.
[3] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of COLT, 1999.
[4] O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
[5] O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 56. Institute of Mathematical Statistics, 2007.
[6] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.
Sat 8:30 a.m. - 8:30 a.m. | Overture (Opening remarks) | Benjamin Guedj · Francis Bach · Pascal Germain
Sat 8:30 a.m. - 9:30 a.m. | François Laviolette - A Tutorial on PAC-Bayesian Theory (Talk) | François Laviolette
Sat 9:30 a.m. - 10:15 a.m. | Peter Grünwald - A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity (Talk)
Over the last 15 years, machine learning theorists have bounded the performance of empirical risk minimization by (localized) Rademacher complexity; Bayesians with frequentist sympathies have studied Bayesian consistency and rate-of-convergence theorems, sometimes under misspecification; and PAC-Bayesians have studied convergence properties of generalized Bayesian and Gibbs posteriors. We show that, amazingly, most such bounds readily follow from essentially a single result that bounds excess risk in terms of a novel complexity COMP$(\eta,w)$, which depends on a learning rate $\eta$ and a luckiness function $w$, the latter generalizing the concept of a 'prior'. Depending on the choice of $w$, COMP$(\eta,w)$ specializes to PAC-Bayesian (KL(posterior||prior)) complexity, MDL (normalized maximum likelihood) complexity and Rademacher complexity, and the bounds obtained are optimized for generalized Bayes, ERM, penalized ERM (such as the Lasso) or other methods. Tuning $\eta$ leads to optimal excess risk convergence rates, even for very large (polynomial entropy) classes which have always been problematic for the PAC-Bayesian approach; the optimal $\eta$ depends on 'fast rate' properties of the domain, such as central, Bernstein and Tsybakov conditions. (A schematic PAC-Bayesian bound of the kind COMP specializes to is sketched after this entry.)
Joint work with Nishant Mehta, University of Victoria. See https://arxiv.org/abs/1710.07732
Peter Grünwald
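For context on the complexity term above, here is a standard PAC-Bayesian bound of the kind COMP$(\eta,w)$ specializes to when $w$ plays the role of a prior $\pi$ (a schematic illustration, not the excess risk bound of the talk): for a loss taking values in $[0,1]$, an i.i.d. sample of size $n$ and confidence level $\delta$, with probability at least $1-\delta$, simultaneously for all posteriors $\rho$,
\[
\mathbb{E}_{h\sim\rho}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim\rho}\big[\hat{L}_n(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}.
\]
The talk replaces this fixed square-root trade-off by a learning-rate-$\eta$ complexity that can be tuned to fast-rate properties of the problem.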
Sat 11:00 a.m. - 11:45 a.m. | Jean-Michel Marin - Some recent advances on Approximate Bayesian Computation techniques (Talk)
In an increasing number of application domains, the statistical model is so complex that the point-wise computation of the likelihood is intractable. That is typically the case when the underlying probability distribution involves numerous latent variables. Approximate Bayesian Computation (ABC) is a widely used technique to bypass that difficulty. I will review some recent developments on ABC techniques, emphasizing the fact that modern machine learning approaches are useful in this field. Although intrinsically very different from PAC-Bayesian strategies - the choice of a generative model is essential within the ABC paradigm - I will highlight some links between these two methodologies. (A minimal rejection-ABC sketch follows this entry.)
Jean-Michel Marin
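As background for the ABC setting described in the abstract (not material from the talk), here is a minimal rejection-ABC sketch in Python; the simulator, prior sampler, summary statistic, tolerance and the toy Gaussian example are illustrative assumptions:

import numpy as np

def rejection_abc(observed, simulate, prior_sampler, summary, tol, n_draws=10_000, seed=None):
    """Minimal rejection ABC: keep prior draws whose simulated summary statistics
    fall within `tol` of the observed summary (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    s_obs = np.asarray(summary(observed))
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)                     # draw a parameter from the prior
        x = simulate(theta, rng)                       # simulate a dataset given theta
        if np.linalg.norm(np.asarray(summary(x)) - s_obs) <= tol:
            accepted.append(theta)                     # keep parameters close to the data
    return np.array(accepted)                          # approximate posterior sample

# Toy usage: infer the mean of a Gaussian with known unit variance.
obs = np.random.default_rng(0).normal(1.5, 1.0, size=100)
posterior_sample = rejection_abc(
    obs,
    simulate=lambda th, rng: rng.normal(th, 1.0, size=100),
    prior_sampler=lambda rng: rng.normal(0.0, 5.0),
    summary=lambda x: [x.mean()],
    tol=0.1,
    seed=1,
)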
Sat 11:45 a.m. - 12:05 p.m. | Contributed talk 1 - A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (Talk) | Behnam Neyshabur
Sat 2:00 p.m. - 2:40 p.m. | Olivier Catoni - Dimension-free PAC-Bayesian Bounds (Talk)
PAC-Bayesian inequalities have already proved to be a great tool to obtain dimension-free generalization bounds, such as margin bounds for Support Vector Machines. In this talk, we will play with PAC-Bayesian inequalities and influence functions to present new robust estimators for the mean of random vectors and random matrices, as well as for linear least squares regression. A common theme of the presentation will be to establish dimension-free bounds and to work under mild polynomial moment assumptions regarding the tail of the sample distribution. Joint work with Ilaria Giulini. (A scalar illustration of the influence-function idea follows this entry.)
Olivier Catoni
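To make the influence-function idea concrete, here is a minimal sketch of a Catoni-style robust mean estimator for scalar data, a simpler cousin of the dimension-free vector and matrix estimators in the talk; the particular influence function, scale choice and fixed-point solver are illustrative assumptions, not the talk's construction:

import numpy as np

def catoni_psi(u):
    """A non-decreasing influence function with logarithmic growth, a common choice
    in Catoni-type estimators (illustrative)."""
    return np.where(u >= 0, np.log1p(u + 0.5 * u ** 2), -np.log1p(-u + 0.5 * u ** 2))

def catoni_mean(x, delta=0.05, variance=None, n_iter=50):
    """Sketch of a robust mean estimate: solve sum_i psi(alpha * (x_i - theta)) = 0
    by fixed-point iteration, starting from the median."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    v = max(np.var(x), 1e-12) if variance is None else variance   # variance (or an upper bound)
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * v))          # scale set by a moment bound
    theta = np.median(x)                                          # robust starting point
    for _ in range(n_iter):
        theta += catoni_psi(alpha * (x - theta)).mean() / alpha
    return theta

# Toy usage: heavy-tailed sample whose empirical mean is easily corrupted.
sample = np.random.default_rng(0).standard_t(df=2.5, size=500)
robust_estimate = catoni_mean(sample)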
Sat 2:40 p.m. - 3:00 p.m. | Contributed talk 2 - Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector (Talk) | Olivier Catoni
Sat 3:30 p.m. - 4:15 p.m. | Yevgeny Seldin - A Strongly Quasiconvex PAC-Bayesian Bound (Talk)
We propose a new PAC-Bayesian bound and a way of constructing a hypothesis space, so that the bound is convex in the posterior distribution and also convex in a trade-off parameter between empirical performance of the posterior distribution and its complexity. The complexity is measured by the Kullback-Leibler divergence to a prior. We derive an alternating procedure for minimizing the bound (a minimal sketch follows this entry). We show that the bound can be rewritten as a one-dimensional function of the trade-off parameter and provide sufficient conditions under which the function has a single global minimum. When the conditions are satisfied, the alternating minimization is guaranteed to converge to the global minimum of the bound. We provide experimental results demonstrating that rigorous minimization of the bound is competitive with cross-validation in tuning the trade-off between complexity and empirical performance. In all our experiments the trade-off turned out to be quasiconvex, even when the sufficient conditions were violated. Joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger.
Yevgeny Seldin
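Here is a minimal sketch of the alternating minimization described above, assuming a finite hypothesis set with precomputed empirical losses in [0, 1], a uniform prior, and the PAC-Bayes-lambda form of the bound from the paper with Thiemann, Igel and Wintenberger; exact constants and numerical safeguards are my own illustrative choices:

import numpy as np

def pac_bayes_lambda_bound(rho, emp_loss, log_prior, n, delta=0.05, lam=1.0):
    """PAC-Bayes-lambda style bound (sketch), valid for lam in (0, 2)."""
    kl = np.sum(rho * (np.log(rho + 1e-12) - log_prior))
    comp = (kl + np.log(2.0 * np.sqrt(n) / delta)) / n
    return (rho @ emp_loss) / (1.0 - lam / 2.0) + comp / (lam * (1.0 - lam / 2.0))

def alternating_minimization(emp_loss, n, delta=0.05, n_iter=100):
    """Alternate between the closed-form Gibbs posterior (lambda fixed) and the
    closed-form trade-off parameter lambda (posterior fixed)."""
    emp_loss = np.asarray(emp_loss, dtype=float)
    m = len(emp_loss)
    log_prior = np.full(m, -np.log(m))                 # uniform prior over m hypotheses
    lam = 1.0
    for _ in range(n_iter):
        logits = log_prior - lam * n * emp_loss        # Gibbs posterior minimizes the bound
        rho = np.exp(logits - logits.max())
        rho /= rho.sum()
        kl = np.sum(rho * (np.log(rho + 1e-12) - log_prior))
        b = kl + np.log(2.0 * np.sqrt(n) / delta)
        lam = 2.0 / (np.sqrt(2.0 * n * (rho @ emp_loss) / b + 1.0) + 1.0)
    return rho, lam, pac_bayes_lambda_bound(rho, emp_loss, log_prior, n, delta, lam)

When the sufficient conditions discussed in the talk hold, this alternating scheme reaches the global minimum of the bound; the sketch does not check those conditions.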
Sat 4:15 p.m. - 5:00 p.m. | John Shawe-Taylor - Distribution Dependent Priors for Stable Learning (Talk) | John Shawe-Taylor
Sat 5:00 p.m. - 5:30 p.m. | Daniel Roy - Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes (Talk)
One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization, we need nonvacuous bounds. I will discuss recent work using PAC-Bayesian bounds and optimization to arrive at nonvacuous generalization bounds for neural networks with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization, as well as to variational inference for deep learning. Time permitting, I'll discuss new work interpreting Entropy-SGD as a PAC-Bayesian method. Joint work with Gintare Karolina Dziugaite, based on https://arxiv.org/abs/1703.11008 (a sketch of the Gaussian-posterior bound computation follows this entry).
Daniel Roy
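To illustrate why such bounds become numerically optimizable for networks with millions of parameters, here is a sketch of the closed-form KL term for a diagonal Gaussian posterior over weights, plugged into a simplified McAllester-style surrogate; the paper optimizes a tighter bound, so this objective is only an assumed simplification:

import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ), closed form."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def pac_bayes_surrogate(emp_error, mu_q, logvar_q, mu_p, logvar_p, n, delta=0.05):
    """Empirical error of the stochastic network (estimated by sampling weight
    perturbations) plus a square-root complexity term; a simplified surrogate,
    not the exact bound optimized in the paper."""
    kl = diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return emp_error + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

Because the KL has this closed form, the surrogate is differentiable in the posterior means and variances and can be minimized with stochastic gradient methods alongside the training loss.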
Sat 5:30 p.m. - 6:25 p.m. | Neil Lawrence, Francis Bach and François Laviolette (Discussion) | Neil Lawrence · Francis Bach · François Laviolette
Sat 6:25 p.m. - 6:25 p.m. | Concluding remarks | Francis Bach · Benjamin Guedj · Pascal Germain
Author Information
Benjamin Guedj (Inria & University College London)
Benjamin Guedj has been a tenured research scientist at Inria since 2014, affiliated with the Lille - Nord Europe research centre in France. He is also affiliated with the mathematics department of the University of Lille. Since 2018, he has been a Principal Research Fellow at the Centre for Artificial Intelligence and the Department of Computer Science at University College London. He is also a visiting researcher at The Alan Turing Institute. Since 2020, he has been the founder and scientific director of The Inria London Programme, a strategic partnership between Inria and UCL as part of a France-UK scientific initiative. He obtained his Ph.D. in mathematics in 2013 from UPMC (Université Pierre & Marie Curie, France) under the supervision of Gérard Biau and Éric Moulines. Prior to that, he was a research assistant at DTU Compute (Denmark). His main line of research is in statistical machine learning, from both theoretical and algorithmic perspectives. He is primarily interested in the design, analysis and implementation of statistical machine learning methods for high-dimensional problems, mainly using PAC-Bayesian theory.
Pascal Germain (INRIA Paris)
Francis Bach (Inria)
Francis Bach is a researcher at INRIA, where he has led the SIERRA project-team, part of the Computer Science Department at Ecole Normale Supérieure in Paris, France, since 2011. After completing his Ph.D. in Computer Science at U.C. Berkeley, he spent two years at Ecole des Mines, and joined INRIA and Ecole Normale Supérieure in 2007. He is interested in statistical machine learning, and especially in convex optimization, combinatorial optimization, sparse methods, kernel-based learning, vision and signal processing. In recent years he has given numerous summer-school courses on optimization. He was program co-chair of the International Conference on Machine Learning in 2015.