Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
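The linear-growth behavior described above can be seen in a small simulation. The following is a minimal sketch, not the authors' code: it samples a partition from the Chinese restaurant process, the partition prior underlying Dirichlet process mixture models. The function name `crp_cluster_sizes`, the concentration value `alpha=1.0`, and the seed are illustrative assumptions. Under this prior the largest cluster's share of the data typically does not shrink toward zero as the data set grows, whereas a model with the microclustering property would drive that share toward zero.

```python
import random


def crp_cluster_sizes(n, alpha, seed=0):
    """Sample cluster sizes for n points from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha is the concentration parameter.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # i points are already assigned; total unnormalized weight is i + alpha.
        u = rng.random() * (i + alpha)
        if u < alpha:
            # Open a new cluster with probability alpha / (i + alpha).
            sizes.append(1)
        else:
            # Otherwise join an existing cluster with probability
            # proportional to its current size.
            u -= alpha
            for k, s in enumerate(sizes):
                if u < s:
                    sizes[k] += 1
                    break
                u -= s
            else:
                # Guard against floating-point round-off.
                sizes[-1] += 1
    return sizes


# Report the fraction of points in the largest cluster for growing n.
for n in (100, 1000, 10000):
    sizes = crp_cluster_sizes(n, alpha=1.0)
    print(n, len(sizes), max(sizes) / n)
```

Each printed row gives the data set size, the number of clusters, and the largest cluster's fraction of the data; the exact values depend on the seed.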
Author Information
Brenda Betancourt (Duke University)
Giacomo Zanella (The University of Warwick)
Jeffrey Miller (Duke University)
Hanna Wallach (Microsoft Research)
Abbas Zaidi (Duke University)
Beka Steorts (Duke University)
More from the Same Authors
- 2021 Panel: How Should a Machine Learning Researcher Think About AI Ethics?
  Amanda Askell · Abeba Birhane · Jesse Dodge · Casey Fiesler · Pascale N Fung · Hanna Wallach
- 2020: Panel & Closing
  Tamara Broderick · Laurent Dinh · Neil Lawrence · Kristian Lum · Hanna Wallach · Sinead Williamson
- 2020: Morning keynote
  Hanna Wallach · Rosie Campbell
- 2020 Workshop: I Can’t Believe It’s Not Better! Bridging the gap between theory and empiricism in probabilistic machine learning
  Jessica Forde · Francisco Ruiz · Melanie Fernandez Pradier · Aaron Schein · Finale Doshi-Velez · Isabel Valera · David Blei · Hanna Wallach
- 2019 Poster: Poisson-Randomized Gamma Dynamical Systems
  Aaron Schein · Scott Linderman · Mingyuan Zhou · David Blei · Hanna Wallach
- 2018: Research Panel
  Sinead Williamson · Barbara Engelhardt · Tom Griffiths · Neil Lawrence · Hanna Wallach
- 2018: Panel on research process
  Zachary Lipton · Charles Sutton · Finale Doshi-Velez · Hanna Wallach · Suchi Saria · Rich Caruana · Thomas Rainforth
- 2018: Hanna Wallach - Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?
  Hanna Wallach
- 2017: Poster spotlights
  Hiroshi Kuwajima · Masayuki Tanaka · Qingkai Liang · Matthieu Komorowski · Fanyu Que · Thalita F Drumond · Aniruddh Raghu · Leo Anthony Celi · Christina Göpfert · Andrew Ross · Sarah Tan · Rich Caruana · Yin Lou · Devinder Kumar · Graham Taylor · Forough Poursabzi-Sangdeh · Jennifer Wortman Vaughan · Hanna Wallach
- 2016 Workshop: Practical Bayesian Nonparametrics
  Nick Foti · Tamara Broderick · Trevor Campbell · Michael Hughes · Jeffrey Miller · Aaron Schein · Sinead Williamson · Yanxun Xu
- 2016 Poster: Poisson-Gamma dynamical systems
  Aaron Schein · Hanna Wallach · Mingyuan Zhou
- 2016 Oral: Poisson-Gamma dynamical systems
  Aaron Schein · Hanna Wallach · Mingyuan Zhou
- 2015: Non-standard approaches to nonparametric Bayes
  Jeffrey Miller
- 2015 Workshop: Bayesian Nonparametrics: The Next Generation
  Tamara Broderick · Nick Foti · Aaron Schein · Alex Tank · Hanna Wallach · Sinead Williamson
- 2013 Workshop: Topic Models: Computation, Application, and Evaluation
  David Mimno · Amr Ahmed · Jordan Boyd-Graber · Ankur Moitra · Hanna Wallach · Alexander Smola · David Blei · Anima Anandkumar
- 2012 Poster: Topic-Partitioned Multinetwork Embeddings
  Peter Krafft · Juston S Moore · Hanna Wallach · Bruce Desmarais
- 2011 Workshop: 2nd Workshop on Computational Social Science and the Wisdom of Crowds
  Winter Mason · Jennifer Wortman Vaughan · Hanna Wallach
- 2010 Workshop: Computational Social Science and the Wisdom of Crowds
  Jennifer Wortman Vaughan · Hanna Wallach
- 2009 Workshop: Applications for Topic Models: Text and Beyond
  David Blei · Jordan Boyd-Graber · Jonathan Chang · Katherine Heller · Hanna Wallach
- 2009 Poster: Rethinking LDA: Why Priors Matter
  Hanna Wallach · David Mimno · Andrew McCallum
- 2009 Spotlight: Rethinking LDA: Why Priors Matter
  Hanna Wallach · David Mimno · Andrew McCallum