NeurIPS Poster CANITA: Faster Rates for Distributed Convex Optimization with Communication Compression


Zhize Li · Peter Richtarik

Virtual

Keywords: [ Optimization ] [ Federated Learning ]


Abstract: Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communication rounds (faster convergence), e.g., Nesterov's accelerated gradient descent [31, 32] and Adam [14]. In order to combine the benefits of communication compression and convergence acceleration, we propose a \emph{compressed and accelerated} gradient method based on ANITA [20] for distributed optimization, which we call CANITA. CANITA achieves the \emph{first accelerated rate} $O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} + \omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon the state-of-the-art non-accelerated rate $O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} + \frac{\omega^2+\omega}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA [12] for distributed general convex problems, where $\epsilon$ is the target error, $L$ is the smoothness parameter of the objective, $n$ is the number of machines/devices, and $\omega$ is the compression parameter (larger $\omega$ means more compression can be applied, and no compression implies $\omega=0$). Our results show that as long as the number of devices $n$ is large (often true in distributed/federated learning), or the compression parameter $\omega$ is not very large, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs. $O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (far fewer communication rounds).
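To get a feel for how the two bounds in the abstract compare, the following minimal sketch evaluates the expressions inside the $O(\cdot)$ terms for given $\omega$, $n$, $L$ and $\epsilon$. It is an illustration only: the hidden constants of the big-O notation are dropped (assumed to be 1), so the outputs indicate orders of magnitude, not actual round counts. The function names `canita_rounds` and `diana_rounds` are hypothetical and do not come from the paper.

```python
import math


def canita_rounds(omega: float, n: int, L: float, eps: float) -> float:
    """Expression inside the accelerated CANITA bound (constants assumed to be 1)."""
    accelerated_term = math.sqrt((1.0 + math.sqrt(omega**3 / n)) * L / eps)
    compression_term = omega * (1.0 / eps) ** (1.0 / 3.0)
    return accelerated_term + compression_term


def diana_rounds(omega: float, n: int, L: float, eps: float) -> float:
    """Expression inside the non-accelerated DIANA bound (constants assumed to be 1)."""
    smoothness_term = (1.0 + omega / n) * L / eps
    compression_term = (omega**2 + omega) / (omega + n) / eps
    return smoothness_term + compression_term


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper's experiments).
    L, eps, n = 1.0, 1e-6, 1000
    for omega in (0.0, 10.0, 100.0):
        c = canita_rounds(omega, n, L, eps)
        d = diana_rounds(omega, n, L, eps)
        print(f"omega={omega:6.1f}  CANITA ~ {c:12.1f}  DIANA ~ {d:12.1f}")
```

With $\omega = 0$ (no compression) the first expression reduces to $\sqrt{L/\epsilon}$ and the second to $L/\epsilon$, which matches the accelerated versus non-accelerated comparison stated in the abstract.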
