Statistical Efficiency of Score Matching: The View from Isoperimetry
Frederic Koehler · Alexander Heckett · Andrej Risteski
Event URL: https://openreview.net/forum?id=8JNghZ7LIO3
Deep generative models parametrized up to a normalizing constant (e.g., energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or its gradients cannot be written down explicitly or evaluated efficiently. Score matching is a training method in which, instead of fitting the log-likelihood $\log p(x)$ of the training data, one fits the score function $\nabla_x \log p(x)$, obviating the need to evaluate the partition function. Although this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood, which is known to be (asymptotically) optimal. We initiate this line of inquiry and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated, namely its Poincaré, log-Sobolev, and isoperimetric constants: quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant, even for simple families of distributions like exponential families with rich enough sufficient statistics, score matching will be substantially less efficient than maximum likelihood. We formalize these results both in the finite-sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
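For context, a brief sketch of the score matching objective referenced above, in notation of our own choosing (not taken verbatim from the paper): for an energy-based model $p_\theta(x) = \exp(f_\theta(x))/Z(\theta)$ with intractable normalizing constant $Z(\theta)$, the score $\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x)$ does not depend on $Z(\theta)$, and Hyvärinen's integration-by-parts identity turns the squared score-difference loss into an expectation that can be estimated from samples of $p$ alone,
$$ J(\theta) \;=\; \tfrac{1}{2}\,\mathbb{E}_{x \sim p}\!\left[\|\nabla_x \log p_\theta(x) - \nabla_x \log p(x)\|^2\right] \;=\; \mathbb{E}_{x \sim p}\!\left[\tfrac{1}{2}\|\nabla_x f_\theta(x)\|^2 + \Delta_x f_\theta(x)\right] + \mathrm{const}, $$
where $\Delta_x$ denotes the Laplacian in $x$ and the constant does not depend on $\theta$.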
Author Information
Frederic Koehler (MIT)
Alexander Heckett (Carnegie Mellon University)
Andrej Risteski (CMU)
Assistant Professor in the ML department at CMU. Prior to that, I was a Wiener Fellow at MIT, and before that I completed my PhD at Princeton University.
More from the Same Authors
- 2021 Spotlight: Parametric Complexity Bounds for Approximating PDEs with Neural Networks
  Tanya Marwah · Zachary Lipton · Andrej Risteski
- 2022: Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
  Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski
- 2023 Poster: Provable benefits of score matching
  Chirag Pabbaraju · Dhruv Rohatgi · Anish Prasad Sevekari · Holden Lee · Ankur Moitra · Andrej Risteski
- 2023 Poster: Deep Equilibrium Based Neural Operators for Steady-State PDEs
  Tanya Marwah · Ashwini Pokle · J. Zico Kolter · Zachary Lipton · Jianfeng Lu · Andrej Risteski
- 2023 Poster: Provable benefits of annealing for estimating normalizing constants
  Omar Chehab · Aapo Hyvarinen · Andrej Risteski
- 2023 Poster: Feature Adaptation for Sparse Linear Regression
  Jonathan Kelner · Frederic Koehler · Raghu Meka · Dhruv Rohatgi
- 2023 Poster: Uniform Convergence with Square-Root Lipschitz Loss
  Lijia Zhou · Zhen Dai · Frederic Koehler · Nati Srebro
- 2023 Poster: (Un)interpretability of Transformers: a case study with Dyck grammars
  Kaiyue Wen · Yuchen Li · Bingbin Liu · Andrej Risteski
- 2022 Panel: Panel 2C-7: Optimal Rates for… & Reconstruction on Trees…
  Frederic Koehler · Zhu Li
- 2022 Poster: A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models
  Lijia Zhou · Frederic Koehler · Pragya Sur · Danica J. Sutherland · Nati Srebro
- 2022 Poster: Reconstruction on Trees and Low-Degree Polynomials
  Frederic Koehler · Elchanan Mossel
- 2022 Poster: Lower Bounds on Randomly Preconditioned Lasso via Robust Sparse Designs
  Jonathan Kelner · Frederic Koehler · Raghu Meka · Dhruv Rohatgi
- 2022 Poster: Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments
  Yining Chen · Elan Rosenfeld · Mark Sellke · Tengyu Ma · Andrej Risteski
- 2022 Poster: Masked Prediction: A Parameter Identifiability View
  Bingbin Liu · Daniel Hsu · Pradeep Ravikumar · Andrej Risteski
- 2022 Poster: Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions
  Binghui Peng · Andrej Risteski
- 2021 Poster: Parametric Complexity Bounds for Approximating PDEs with Neural Networks
  Tanya Marwah · Zachary Lipton · Andrej Risteski
- 2021 Oral: Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting
  Frederic Koehler · Lijia Zhou · Danica J. Sutherland · Nathan Srebro
- 2021 Poster: Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting
  Frederic Koehler · Lijia Zhou · Danica J. Sutherland · Nathan Srebro
- 2021 Poster: Universal Approximation Using Well-Conditioned Normalizing Flows
  Holden Lee · Chirag Pabbaraju · Anish Prasad Sevekari · Andrej Risteski
- 2020 Poster: Learning Some Popular Gaussian Graphical Models without Condition Number Bounds
  Jonathan Kelner · Frederic Koehler · Raghu Meka · Ankur Moitra
- 2020 Poster: From Boltzmann Machines to Neural Networks and Back Again
  Surbhi Goel · Adam Klivans · Frederic Koehler
- 2020 Spotlight: Learning Some Popular Gaussian Graphical Models without Condition Number Bounds
  Jonathan Kelner · Frederic Koehler · Raghu Meka · Ankur Moitra
- 2020 Poster: Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Evolvability
  Sitan Chen · Frederic Koehler · Ankur Moitra · Morris Yau
- 2020 Spotlight: Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Evolvability
  Sitan Chen · Frederic Koehler · Ankur Moitra · Morris Yau
- 2019 Poster: Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay
  Frederic Koehler
- 2019 Spotlight: Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay
  Frederic Koehler
- 2017 Poster: Information Theoretic Properties of Markov Random Fields, and their Algorithmic Applications
  Linus Hamilton · Frederic Koehler · Ankur Moitra