

Session: Theory

Tue 5 Dec. 14:50 - 15:05 PST

Oral
Safe and Nested Subgame Solving for Imperfect-Information Games

Noam Brown · Tuomas Sandholm

Unlike perfect-information games, imperfect-information games cannot be solved by decomposing the game into subgames that are solved independently. Thus more computationally intensive equilibrium-finding techniques are used, and all decisions must consider the strategy of the game as a whole. While it is not possible to solve an imperfect-information game exactly through decomposition, it is possible to approximate solutions, or improve existing solutions, by solving disjoint subgames. This process is referred to as subgame solving. We introduce subgame solving techniques that outperform prior methods both in theory and practice. We also show how to adapt them, and past subgame-solving techniques, to respond to opponent actions that are outside the original action abstraction; this significantly outperforms the prior state-of-the-art approach, action translation. Finally, we show that subgame solving can be repeated as the game progresses down the tree, leading to significantly lower exploitability. We applied these techniques to develop the first AI to defeat top humans in heads-up no-limit Texas hold'em poker.
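
To see why decomposition fails under imperfect information, consider a small illustration (ours, not the paper's): suppose the opponent can reach a subgame holding one of two possible private hands, and the best response within the subgame differs depending on which hand is held. The right way to mix between those responses depends on how likely each hand is to reach the subgame, and that likelihood is determined by the strategies played in the rest of the game. Safe subgame-solving techniques, roughly speaking, refine the strategy within a subgame while guaranteeing that the opponent cannot gain by changing how they steer play into it.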

Tue 5 Dec. 15:05 - 15:20 PST

Oral
A graph-theoretic approach to multitasking

Noga Alon · Daniel Reichman · Igor Shinkar · Tal Wagner · Sebastian Musslick · Jonathan D Cohen · Tom Griffiths · Biswadip Dey · Kayhan Ozcimder

A key feature of neural network architectures is their ability to support the simultaneous interaction among large numbers of units in the learning and processing of representations. However, how the richness of such interactions trades off against the ability of a network to simultaneously carry out multiple independent processes -- a salient limitation in many domains of human cognition -- remains largely unexplored. In this paper we use a graph-theoretic analysis of network architecture to address this question, where tasks are represented as edges in a bipartite graph $G=(A \cup B, E)$. We define a new measure of multitasking capacity of such networks, based on the assumptions that tasks that \emph{need} to be multitasked rely on independent resources, i.e., form a matching, and that tasks \emph{can} be performed without interference if they form an induced matching. Our main result is an inherent tradeoff between the multitasking capacity and the average degree of the network that holds \emph{regardless of the network architecture}. These results are also extended to networks of depth greater than $2$. On the positive side, we demonstrate that networks that are random-like (e.g., locally sparse) can have desirable multitasking properties. Our results shed light on the parallel-processing limitations of neural systems and provide insights that may be useful for the analysis and design of parallel architectures.
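
The two graph notions in this abstract are easy to make concrete. The following Python sketch (ours, purely illustrative and not the authors' code) checks whether a set of task edges in a bipartite graph forms a matching and an induced matching; the multitasking measure is built on these two notions.

```python
# Minimal sketch (not the paper's code): tasks are edges of a bipartite
# graph G = (A u B, E). Tasks that must be multitasked form a matching;
# they can be executed without interference if they form an induced matching.

def is_matching(tasks):
    """No two tasks share an input node or an output node."""
    inputs = [a for a, _ in tasks]
    outputs = [b for _, b in tasks]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)

def is_induced_matching(tasks, edges):
    """A matching with no extra edge of G between the endpoints of two distinct tasks."""
    if not is_matching(tasks):
        return False
    crosstalk = set(edges) - set(tasks)
    return not any((a, b2) in crosstalk
                   for (a, b) in tasks for (a2, b2) in tasks if (a, b) != (a2, b2))

# Toy network: inputs {a1, a2}, outputs {b1, b2}.
edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
print(is_induced_matching([("a1", "b1"), ("a2", "b2")], edges))  # False: edge a1-b2 creates interference
print(is_induced_matching([("a1", "b1")], edges))                # True: a single task trivially qualifies
```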

Tue 5 Dec. 15:20 - 15:25 PST

Spotlight
Information-theoretic analysis of generalization capability of learning algorithms

Aolin Xu · Maxim Raginsky

We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The upper bounds provide theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information of a learning algorithm. The results can also be used to analyze the generalization capability of learning algorithms under adaptive composition, and the bias-accuracy tradeoffs in adaptive data analytics. Our work extends and leads to nontrivial improvements on the recent results of Russo and Zou.
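
For orientation, here is one representative bound of this type, stated in our notation (a paraphrase, not quoted from the paper): assume the loss $\ell(w,Z)$ is $\sigma$-subgaussian under the data distribution for every hypothesis $w$, and let the algorithm map an i.i.d. sample $S$ of $n$ points to an output hypothesis $W$. The expected generalization gap then satisfies
$$\bigl|\mathbb{E}[L_\mu(W) - L_S(W)]\bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S;W)},$$
where $L_\mu$ is the population risk, $L_S$ the empirical risk, and $I(S;W)$ the input-output mutual information, so that limiting $I(S;W)$ directly limits how much the algorithm can overfit.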

Tue 5 Dec. 15:25 - 15:30 PST

Spotlight
Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee

Alireza Aghasi · Afshin Abdi · Nam Nguyen · Justin Romberg

We introduce and analyze a new technique for model reduction for deep neural networks. While large networks are theoretically capable of learning arbitrarily complex models, overfitting and model redundancy negatively affect the prediction accuracy and model variance. Our Net-Trim algorithm prunes (sparsifies) a trained network layer-wise, removing connections at each layer by solving a convex optimization program. This program seeks a sparse set of weights at each layer that keeps the layer inputs and outputs consistent with the originally trained model. The algorithms and associated analysis are applicable to neural networks operating with the rectified linear unit (ReLU) as the nonlinear activation. We present both parallel and cascade versions of the algorithm. While the latter can achieve slightly simpler models with the same generalization performance, the former can be computed in a distributed manner. In both cases, Net-Trim significantly reduces the number of connections in the network, while also providing enough regularization to slightly reduce the generalization error. We also provide a mathematical analysis of the consistency between the initial network and the retrained model. To analyze the model sample complexity, we derive the general sufficient conditions for the recovery of a sparse transform matrix. For a single layer taking independent Gaussian random vectors as inputs, we show that if the network response can be described using a maximum number of $s$ non-zero weights per node, these weights can be learned from $O(s\log N)$ samples.
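
As a concrete (and deliberately simplified) picture of the kind of convex program described, the sketch below prunes a single ReLU unit: it looks for sparse weights that keep the unit's positive outputs close to the original ones and keep the pre-activation non-positive wherever the original output was zero. This is our own illustration with cvxpy, not the authors' implementation, and the exact formulation in the paper may differ.

```python
# Illustrative per-unit pruning step in the spirit of the described convex
# program (not the authors' code). X: layer input, y = ReLU(X @ w0): the
# originally trained unit's output, eps: consistency tolerance.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w0 = np.where(rng.random(d) < 0.2, rng.standard_normal(d), 0.0)  # sparse reference weights
y = np.maximum(X @ w0, 0.0)

on = y > 0                     # samples where the unit fired
off = ~on                      # samples where the ReLU output was zero
eps = 1e-3 * np.linalg.norm(y)

w = cp.Variable(d)
constraints = [cp.norm(X[on] @ w - y[on], 2) <= eps,  # keep active outputs consistent
               X[off] @ w <= 0]                       # keep inactive outputs inactive
cp.Problem(cp.Minimize(cp.norm1(w)), constraints).solve()
print("nonzero weights kept:", int(np.sum(np.abs(w.value) > 1e-6)), "of", d)
```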

Tue 5 Dec. 15:30 - 15:35 PST

Spotlight
Clustering Billions of Reads for DNA Data Storage

Cyrus Rashtchian · Konstantin Makarychev · Miklos Racz · Siena Ang · Djordje Jevdjic · Sergey Yekhanin · Luis Ceze · Karin Strauss

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. We observe that datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
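
The clustering step itself is easy to picture on a toy scale. The sketch below (ours; it is not the paper's distributed algorithm and would not scale to billions of reads) greedily groups reads by edit distance, which works in this toy setting for the same reason the regime is favorable in practice: clusters are small and far apart in edit distance.

```python
# Toy illustration only: greedy clustering of short reads by edit distance.
# The paper's contribution is a distributed, outlier-robust algorithm for the
# billions-of-reads scale; this sketch just makes the clustering task concrete.

def edit_distance(s, t):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def greedy_cluster(reads, radius):
    """Assign each read to the first cluster whose representative is within `radius`."""
    reps, clusters = [], []
    for r in reads:
        for k, rep in enumerate(reps):
            if edit_distance(r, rep) <= radius:
                clusters[k].append(r)
                break
        else:
            reps.append(r)
            clusters.append([r])
    return clusters

reads = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTTTGGGG", "TTTTGGGC"]
print(greedy_cluster(reads, radius=2))  # two well-separated clusters
```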

Tue 5 Dec. 15:35 - 15:40 PST

Spotlight
On the Complexity of Learning Neural Networks

Le Song · Santosh Vempala · John Wilmes · Bo Xie

The stunning empirical successes of neural networks currently lack rigorous theoretical explanation. What form would such an explanation take, in the face of existing complexity-theoretic lower bounds? A first step might be to show that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently. We demonstrate here a comprehensive lower bound ruling out this possibility: for a wide class of activation functions (including all currently used), and inputs drawn from any logconcave distribution, there is a family of one-hidden-layer functions whose output is a sum gate that are hard to learn in a precise sense: any statistical query algorithm (which includes all known variants of stochastic gradient descent with any loss function) needs an exponential number of queries even using tolerance inversely proportional to the input dimensionality. Moreover, this hard family of functions is realizable with a small (sublinear in dimension) number of activation units in the single hidden layer. The lower bound is also robust to small perturbations of the true weights. Systematic experiments illustrate a phase transition in the training error as predicted by the analysis.
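
For concreteness, the sketch below (ours, illustrative only) constructs a target of the kind the lower bound concerns: a single hidden layer with a sublinear number of activation units feeding a sum output gate, evaluated on Gaussian (hence log-concave) inputs. The hardness claim is about learning such functions from statistical queries, not about evaluating them.

```python
# Illustrative construction (not from the paper): a one-hidden-layer target
# f(x) = sum_i act(w_i . x) with s sublinear in the input dimension d,
# evaluated on Gaussian inputs as an example of a log-concave distribution.
import numpy as np

rng = np.random.default_rng(1)
d = 400
s = int(np.sqrt(d))                      # sublinear number of hidden units (our choice for illustration)
W = rng.standard_normal((s, d)) / np.sqrt(d)

def target(x, act=np.tanh):
    """One hidden layer of s units, summed by the output gate."""
    return float(act(W @ x).sum())

X = rng.standard_normal((5, d))          # samples from a log-concave input distribution
print([round(target(x), 3) for x in X])
```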

Tue 5 Dec. 15:40 - 15:45 PST

Spotlight
Multiplicative Weights Update with Constant Step-Size in Congestion Games: Convergence, Limit Cycles and Chaos

Gerasimos Palaiopanos · Ioannis Panageas · Georgios Piliouras

The Multiplicative Weights Update (MWU) method is a ubiquitous meta-algorithm that works as follows: A distribution is maintained on a certain set, and at each step the probability assigned to action $\gamma$ is multiplied by $(1 -\epsilon C(\gamma))>0$ where $C(\gamma)$ is the ``cost'' of action $\gamma$ and then rescaled to ensure that the new values form a distribution. We analyze MWU in congestion games where agents use \textit{arbitrary admissible constants} as learning rates $\epsilon$ and prove convergence to \textit{exact Nash equilibria}. Interestingly, this convergence result does not carry over to the nearly homologous MWU variant where at each step the probability assigned to action $\gamma$ is multiplied by $(1 -\epsilon)^{C(\gamma)}$ even for the most innocuous case of two-agent, two-strategy load balancing games, where such dynamics can provably lead to limit cycles or even chaotic behavior.
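
The two update rules contrasted in the abstract are simple enough to simulate directly. The following toy script (ours, not the paper's analysis) runs both variants in a two-agent, two-machine load-balancing game with expected machine costs rescaled into $(0,1]$; comparing the resulting mixed strategies gives a feel for how differently the $(1-\epsilon C)$ and $(1-\epsilon)^{C}$ dynamics can behave.

```python
# Toy simulation (not the paper's analysis): linear MWU vs. the exponential
# variant in a two-agent, two-machine load-balancing game. The expected cost
# of a machine is its expected load, rescaled to lie in (0, 1].
import numpy as np

def expected_costs(p_other):
    # Cost of machine m: 1 (yourself) plus the chance the other agent picks m, halved.
    return np.array([(1.0 + p_other[0]) / 2.0, (1.0 + p_other[1]) / 2.0])

def run(update, steps=2000, eps=0.4):
    p = [np.array([0.7, 0.3]), np.array([0.6, 0.4])]   # initial mixed strategies
    for _ in range(steps):
        new = []
        for i in (0, 1):
            c = expected_costs(p[1 - i])
            w = p[i] * update(eps, c)                  # multiply weights, then renormalize
            new.append(w / w.sum())
        p = new
    return p

linear = run(lambda eps, c: 1.0 - eps * c)             # the (1 - eps*C(gamma)) update
exponential = run(lambda eps, c: (1.0 - eps) ** c)     # the (1 - eps)^C(gamma) variant
print("linear update     :", [np.round(q, 3) for q in linear])
print("exponential update:", [np.round(q, 3) for q in exponential])
```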

Tue 5 Dec. 15:45 - 15:50 PST

Spotlight
Estimating Mutual Information for Discrete-Continuous Mixtures

Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath

Estimation of mutual information from observed samples is a basic primitive in machine learning, useful in several learning tasks including correlation mining, information bottleneck, Chow-Liu tree, and conditional independence testing in (causal) graphical models. While mutual information is a quantity well-defined for general probability spaces, estimators have been developed only in the special case of discrete or continuous pairs of random variables. Most of these estimators operate using the 3H principle, i.e., by calculating the three (differential) entropies of X, Y and the pair (X,Y). However, in general mixture spaces, such individual entropies are not well defined, even though mutual information is. In this paper, we develop a novel estimator for estimating mutual information in discrete-continuous mixtures. We prove the consistency of this estimator theoretically as well as demonstrate its excellent empirical performance. This problem is relevant in a wide array of applications, where some variables are discrete, some continuous, and others are a mixture of continuous and discrete components.
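
The 3H principle mentioned above is easy to state in code for the purely discrete case, which also makes the difficulty clear: the decomposition needs the three individual entropies, and for discrete-continuous mixtures those are not well defined even though the mutual information is. The snippet below is our own minimal illustration of the plug-in 3H estimator, not the estimator proposed in the paper.

```python
# Minimal illustration (not the paper's estimator): the 3H principle
# I(X;Y) = H(X) + H(Y) - H(X,Y) with plug-in entropies for discrete data.
# For mixed discrete-continuous variables this decomposition breaks down.
from collections import Counter
from math import log2

def entropy(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mi_3h(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

xs = [0, 0, 1, 1, 0, 1, 0, 1]
ys = [0, 0, 1, 1, 1, 1, 0, 0]        # partially correlated with xs
print(round(mi_3h(xs, ys), 3))       # plug-in estimate of I(X;Y) in bits (about 0.19)
```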