Distributed Inference for Latent Dirichlet Allocation
David Newman · Arthur Asuncion · Padhraic Smyth · Max Welling

Tue Dec 04 11:50 AM -- 12:00 PM (PST)

We investigate the problem of learning a widely used latent-variable model, Latent Dirichlet Allocation (LDA), also known as the topic model, using distributed computation in which each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using three real-world text corpora, we show that distributed learning works very well for LDA models: perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors and speedup experiments on learning topics in a 100-million-word corpus using 16 processors.
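As a rough illustration of the first scheme (local Gibbs sampling with periodic updates), the following Python sketch simulates P processors sequentially in a single process. The abstract does not spell out the update rule, so the merge step here (adding every processor's change in topic-word counts to the shared counts after each sweep) is an assumption, as are the function names, the hyperparameter values, and the choice to synchronize after every sweep.

```python
import numpy as np

def local_gibbs_sweep(docs, z, n_dk, n_wk, n_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over one processor's documents (counts updated in place)."""
    V, K = n_wk.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d, k] -= 1
            n_wk[w, k] -= 1
            n_k[k] -= 1
            # Conditional distribution over topics for this token.
            p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1
            n_wk[w, k] += 1
            n_k[k] += 1

def distributed_lda(shards, V, K, iters=20, alpha=0.1, beta=0.01, seed=0):
    """shards[p] is the list of documents (lists of word ids) held by processor p."""
    rng = np.random.default_rng(seed)
    # Random initial topic assignments and the matching count tables.
    z = [[rng.integers(K, size=len(doc)) for doc in shard] for shard in shards]
    n_dk = [np.zeros((len(shard), K), dtype=np.int64) for shard in shards]
    n_wk = np.zeros((V, K), dtype=np.int64)  # shared topic-word counts
    for p, shard in enumerate(shards):
        for d, doc in enumerate(shard):
            for w, k in zip(doc, z[p][d]):
                n_dk[p][d, k] += 1
                n_wk[w, k] += 1
    for _ in range(iters):
        deltas = []
        # Conceptually parallel: each processor samples against its own copy.
        for p, shard in enumerate(shards):
            local_wk = n_wk.copy()
            local_k = local_wk.sum(axis=0)
            local_gibbs_sweep(shard, z[p], n_dk[p], local_wk, local_k,
                              alpha, beta, rng)
            deltas.append(local_wk - n_wk)
        # Assumed periodic update: fold every processor's changes into the shared counts.
        n_wk = n_wk + sum(deltas)
    return n_wk, n_dk, z

# Toy usage: two processors, two short documents each, vocabulary of 5 words.
shards = [[[0, 1, 2, 0], [3, 3, 1]], [[4, 4, 0], [2, 1, 1, 3]]]
n_wk, n_dk, z = distributed_lda(shards, V=5, K=3, iters=5)
```

Because each processor samples against a copy of the counts that goes stale during the sweep, this is an approximation to single-processor Gibbs sampling rather than an exact sampler, which matches the way the abstract characterizes the first scheme.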

Author Information

David Newman (University of California, Irvine)
Arthur Asuncion (University of California, Irvine)
Padhraic Smyth (University of California, Irvine)
Max Welling (Microsoft Research AI4Science / University of Amsterdam)
