Poster
A provable SVD-based algorithm for learning topics in dominant admixture corpus
Trapit Bansal · Chiranjib Bhattacharyya · Ravindran Kannan
Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from such a collection of documents is NP-hard. Making a strong assumption called separability, [4] gave the first provable algorithm for inference. For the widely used LDA model, [6] gave a provable algorithm using clever tensor methods. But [4, 6] do not learn topic vectors with bounded $l_1$ error (a natural measure for probability vectors). Our aim is to develop a model which makes intuitive and empirically supported assumptions, and to design an algorithm with natural, simple components, such as SVD, which provably solves the inference problem for the model with bounded $l_1$ error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords: a group of words which occur with strictly greater frequency in one topic than in any other, and which are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial preprocessing step of thresholding can provably recover the topics from a collection of documents drawn from dominant admixtures. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near-optimal dependence on $w_0$, the lowest probability that a topic is dominant, and is better than that of [4]. Empirical evidence shows that on several real-world corpora both the Catchwords and dominant-admixture assumptions hold, and that the proposed algorithm substantially outperforms the state of the art [5].
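The abstract's pipeline of thresholding followed by SVD can be illustrated with a minimal sketch. This is not the authors' exact algorithm or its provable parameter settings; the function name, the threshold `zeta`, and the k-means-style clustering step are illustrative assumptions about one plausible instantiation.

```python
# Hedged sketch: threshold a word-document frequency matrix, take a rank-k
# SVD, cluster documents by dominant topic, and average each cluster's
# columns into a topic vector. Parameters here are illustrative, not the
# paper's provable choices.
import numpy as np

def threshold_svd_topics(A, k, zeta):
    """A: words x documents column-stochastic frequency matrix.
    k: number of topics. zeta: per-entry threshold (illustrative)."""
    # 1. Thresholding: zero out low-frequency entries to suppress noise
    #    before the spectral step.
    B = np.where(A >= zeta, A, 0.0)
    # 2. Rank-k SVD of the thresholded matrix.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    proj = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k approximation
    # 3. Cluster documents (columns) in the projected space; each cluster
    #    is treated as one dominant topic (simple Lloyd iterations).
    rng = np.random.default_rng(0)
    centers = proj[:, rng.choice(A.shape[1], k, replace=False)].T
    for _ in range(20):
        dists = ((proj.T[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = proj[:, labels == j].mean(axis=1)
    # 4. Estimate each topic vector as the mean of its cluster's original
    #    columns, renormalized to a probability vector.
    m = A.shape[0]
    topics = np.stack(
        [A[:, labels == j].mean(axis=1) if np.any(labels == j)
         else np.ones(m) / m for j in range(k)], axis=1)
    return topics / topics.sum(axis=0, keepdims=True)
```

Each returned column is a nonnegative vector summing to one, matching the $l_1$ error criterion for probability vectors discussed in the abstract.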
Author Information
Trapit Bansal (University of Massachusetts Amherst)
Chiranjib Bhattacharyya (Indian Institute of Science)
Ravindran Kannan (Indian Institute of Science)
More from the Same Authors

2015 Poster: Weighted Theta Functions and Embeddings with Applications to Max-Cut, Clustering and Summarization »
Fredrik D Johansson · Ankani Chattoraj · Chiranjib Bhattacharyya · Devdatt Dubhashi 
2015 Poster: Spectral Norm Regularization of Orthonormal Representations for Graph Transduction »
Rakesh Shivanna · Bibaswan K Chatterjee · Raman Sankaran · Chiranjib Bhattacharyya · Francis Bach 
2014 Poster: Learning on graphs using Orthonormal Representation is Statistically Consistent »
Rakesh Shivanna · Chiranjib Bhattacharyya 
2012 Poster: The Lovasz $\theta$ function, SVMs and finding large dense subgraphs »
Vinay Jethava · Anders Martinsson · Chiranjib Bhattacharyya · Devdatt Dubhashi 
2010 Poster: Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions »
Achintya Kundu · Vikram Mukunda Rao Tankasali · Chiranjib Bhattacharyya · Aharon Ben-Tal 
2009 Poster: On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation »
Saketha Nath Jagarlapudi · Dinesh Govindaraj · Raman S · Chiranjib Bhattacharyya · Aharon Ben-Tal · K. R. Ramakrishnan 
2007 Poster: Kernels on Attributed Pointsets with Applications »
Mehul Parsana · Sourangshu Bhattacharya · Chiranjib Bhattacharyya · K. R. Ramakrishnan 
2007 Poster: A Randomized Algorithm for Large Scale Support Vector Learning »
Krishnan S Kumar · Chiranjib Bhattacharyya · Ramesh Hariharan