Don’t just prune by magnitude! Your mask topology is a secret weapon
Duc Hoang · Souvik Kundu · Shiwei Liu · Zhangyang "Atlas" Wang

Wed Dec 13 08:45 AM -- 10:45 AM (PST) @ Great Hall & Hall B1+B2 #821

Recent years have witnessed significant progress in understanding the relationship between the connectivity of a deep network's architecture, viewed as a graph, and the network's performance. A few prior works connected deep architectures to expander graphs or Ramanujan graphs; in particular, [7] demonstrated the use of such graph connectivity measures for ranking the relative performance of various obtained sparse sub-networks (i.e., models with prune masks) without the need for training. However, no prior work explicitly explores the role of parameters in the graph's connectivity, leaving the graph-based understanding of prune masks and the practice of magnitude/gradient-based pruning isolated from one another. This paper strives to fill this gap by analyzing the Weighted Spectral Gap of Ramanujan structures in sparse neural networks and investigating its correlation with final performance. We specifically examine the evolution of sparse structures under a popular dynamic sparse-to-sparse network training scheme, and intriguingly find that the generated random topologies inherently maximize Ramanujan graphs. We also identify a strong correlation between masks, performance, and the weighted spectral gap. Leveraging this observation, we propose to construct a new "full-spectrum coordinate" aiming to comprehensively characterize a sparse neural network's promise. Concretely, it consists of the classical Ramanujan's gap (structure), our proposed weighted spectral gap (parameters), and the constituent nested regular graphs within. In this new coordinate system, a sparse subnetwork's L2-distance from its original initialization is found to be nearly linearly correlated with its performance. Finally, we apply this unified perspective to develop a new actionable pruning method, by sampling sparse masks to maximize the L2-coordinate distance. Our method can be augmented with the "pruning at initialization" (PaI) method, and significantly outperforms existing PaI methods.
With only a few iterations of training (e.g., 500 iterations), we can reach LTH-comparable performance, matching that yielded via "pruning after training", while significantly saving pre-training costs. Code can be found at: https://github.com/VITA-Group/FullSpectrum-PAI.
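
To make the two spectral quantities named in the abstract more concrete, here is a minimal sketch in Python. It treats one sparse layer as a bipartite graph whose biadjacency matrix is the prune mask. The abstract does not give the paper's exact definitions, so the formulas below are assumptions: the structural gap uses the standard biregular Ramanujan bound with average degrees substituted in, and the weighted gap is taken as the difference between the two largest singular values of the masked absolute weights. The function names `ramanujan_gap` and `weighted_spectral_gap` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def ramanujan_gap(mask):
    """Structural gap of a layer mask viewed as a bipartite graph.

    Assumption: the bound sqrt(d_left - 1) + sqrt(d_right - 1) for
    biregular bipartite graphs is applied with *average* degrees, so it
    is only a heuristic for irregular masks.
    """
    mask = np.asarray(mask, dtype=float)
    # Eigenvalues of the bipartite adjacency matrix are +/- the singular
    # values of the biadjacency matrix, returned in descending order.
    singular_values = np.linalg.svd(mask, compute_uv=False)
    d_left = mask.sum(axis=1).mean()   # average degree of output nodes
    d_right = mask.sum(axis=0).mean()  # average degree of input nodes
    bound = np.sqrt(max(d_left - 1.0, 0.0)) + np.sqrt(max(d_right - 1.0, 0.0))
    second = singular_values[1] if singular_values.size > 1 else 0.0
    return bound - second


def weighted_spectral_gap(weights, mask):
    """Gap between the two largest singular values of |W| * mask.

    Assumption: this is one plausible reading of a "weighted spectral
    gap", not necessarily the definition used in the paper.
    """
    weighted = np.abs(np.asarray(weights, dtype=float)) * np.asarray(mask, dtype=float)
    singular_values = np.linalg.svd(weighted, compute_uv=False)
    second = singular_values[1] if singular_values.size > 1 else 0.0
    return singular_values[0] - second


# Illustrative usage on a random 90%-sparse layer (sizes are arbitrary).
rng = np.random.default_rng(0)
demo_weights = rng.normal(size=(64, 128))
demo_mask = (rng.random((64, 128)) < 0.1).astype(float)
print("structural gap:", ramanujan_gap(demo_mask))
print("weighted gap:  ", weighted_spectral_gap(demo_weights, demo_mask))
```

Under these assumptions, a larger structural gap indicates a mask that is better connected for its sparsity level, while the weighted gap additionally reflects the parameter values that remain on the surviving edges.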

Author Information

Duc Hoang (University of Texas, Austin)
Souvik Kundu (Intel)

I am a Research Scientist at Intel Labs, USA. My research focuses on efficient, robust, and privacy-preserving AI systems.

Shiwei Liu (UT Austin)

I am a third-year Ph.D. student in the Data Mining Group, Department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e). My current research topics are dynamic sparse training, sparse neural networks, pruning, the generalization of neural networks, etc. I am looking for a postdoc position in machine learning.

Zhangyang "Atlas" Wang (University of Texas at Austin)

More from the Same Authors