
Diving into the shallows: a computational perspective on large-scale shallow learning
SIYUAN MA · Mikhail Belkin

Tue Dec 05 05:25 PM -- 05:30 PM (PST) @ Hall C

The remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle the relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods (e.g., kernel methods) have encountered obstacles in scaling to large data. Practical methods, such as the variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architecture. In this paper we first identify a basic limitation of gradient descent-based optimization methods when used in conjunction with smooth kernels. An analysis demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations. This drastically limits the approximating power of gradient descent for a fixed computational budget and leads to serious over-regularization. The issue is purely algorithmic, persisting even in the limit of infinite data. To address this shortcoming in practice, we introduce EigenPro iteration, based on a simple and direct preconditioning scheme using a small number of approximate eigenvectors. It can also be viewed as learning a new kernel optimized for gradient descent. It turns out that injecting this small amount of approximate second-order information leads to major improvements in convergence. For large data, this translates into a significant performance boost over the state of the art for kernel methods. In particular, we are able to match or improve the results recently reported in the literature at a small fraction of their computational budget. Finally, we feel that these results show a need for a broader computational perspective on modern large-scale learning to complement more traditional statistical and convergence analyses.
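
The preconditioning idea in the abstract can be illustrated with a small numerical sketch. The abstract does not give the update rule, so everything below is an assumption made for illustration: the preconditioner form P = I - sum_i (1 - lambda_{k+1}/lambda_i) e_i e_i^T, the use of a plain Richardson (gradient descent) iteration on a kernel least-squares system, the exact eigendecomposition (standing in for the approximate eigenvectors the abstract mentions), and all function names, data, and parameter values. It is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq_dists = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def top_eigen_preconditioner(H, k):
    """Return (P, lam_next), where P shrinks the top-k eigendirections of H
    down to the (k+1)-th eigenvalue lam_next. Assumed form, for illustration."""
    evals, evecs = np.linalg.eigh(H)              # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder to descending
    E = evecs[:, :k]                              # top-k eigenvectors
    damping = 1.0 - evals[k] / evals[:k]          # 1 - lambda_{k+1} / lambda_i
    P = np.eye(H.shape[0]) - (E * damping) @ E.T
    return P, evals[k]

def richardson(H, b, step, n_iter, P=None):
    """Gradient descent on 0.5 * a^T H a - b^T a, optionally preconditioned by P."""
    alpha = np.zeros_like(b)
    for _ in range(n_iter):
        residual = H @ alpha - b
        if P is not None:
            residual = P @ residual
        alpha = alpha - step * residual
    return alpha

# Synthetic data (hypothetical): 200 points in 5 dimensions, smooth target.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 5))
y = np.sin(X[:, 0])
H = gaussian_kernel(X, X, bandwidth=2.0) / n      # normalized kernel matrix

lam_top = np.linalg.eigvalsh(H)[-1]               # largest eigenvalue of H
P, lam_next = top_eigen_preconditioner(H, k=10)

# Same iteration budget; the preconditioned run can take a larger step.
alpha_plain = richardson(H, y, step=1.0 / lam_top, n_iter=100)
alpha_pre = richardson(H, y, step=1.0 / lam_next, n_iter=100, P=P)

res_plain = np.linalg.norm(H @ alpha_plain - y)
res_pre = np.linalg.norm(H @ alpha_pre - y)
print("residual, plain gradient descent:", res_plain)
print("residual, preconditioned:        ", res_pre)
```

In this sketch the preconditioner flattens the top of the spectrum, so the stable step size grows from 1/lambda_1 to 1/lambda_{k+1} and the remaining eigendirections are fitted faster within the same number of iterations. A practical large-scale variant would presumably estimate the eigenvectors from a subsample and use stochastic mini-batches rather than a full eigendecomposition.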

Author Information

SIYUAN MA (The Ohio State University)
Mikhail Belkin (Ohio State University)
