Timezone: »

Poster
Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning
Haibo Yang · Peiwen Qiu · Jia Liu

Tue Nov 29 09:00 AM -- 11:00 AM (PST) @ Hall J #821
In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients, while keeping data local and private. However, a key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called $\mathsf{FAT}$-$\mathsf{Clipping}~$ (\ul{f}ederated \ul{a}veraging with \ul{t}wo-sided learning rates and \ul{clipping}), which contains two variants: $\mathsf{FAT}$-$\mathsf{Clipping}~$ per-round ($\mathsf{FAT}$-$\mathsf{Clipping}$-$\mathsf{PR}$) and $\mathsf{FAT}$-$\mathsf{Clipping}~$ per-iteration ($\mathsf{FAT}$-$\mathsf{Clipping}$-$\mathsf{PI}$). Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that both variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations compared to $\mathsf{FAT}$-$\mathsf{Clipping}$-$\mathsf{PR}$, $\mathsf{FAT}$-$\mathsf{Clipping}$-$\mathsf{PI}~$ further enjoys a linear speedup effect with respect to the number of local updates at each client and being lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.

#### Author Information

##### Jia Liu (The Ohio State University)

Jia (Kevin) Liu is an Assistant Professor in the Dept. of Electrical and Computer Engineering at The Ohio State University and an Amazon Visiting Academics (AVA). He received his Ph.D. degree from the Dept. of Electrical and Computer Engineering at Virginia Tech in 2010. From Aug. 2017 to Aug. 2020, he was an Assistant Professor in the Dept. of Computer Science at Iowa State University. His research areas include theoretical machine learning, stochastic network optimization and control, and performance analysis for data analytics infrastructure and cyber-physical systems. Dr. Liu is a senior member of IEEE and a member of ACM. He has received numerous awards at top venues, including IEEE INFOCOM'19 Best Paper Award, IEEE INFOCOM'16 Best Paper Award, IEEE INFOCOM'13 Best Paper Runner-up Award, IEEE INFOCOM'11 Best Paper Runner-up Award, IEEE ICC'08 Best Paper Award, and honors of long/spotlight presentations at ICML, NeurIPS, and ICLR. He is an NSF CAREER Award recipient in 2020 and a winner of the Google Faculty Research Award in 2020. He received the LAS Award for Early Achievement in Research at Iowa State University in 2020, and the Bell Labs President Gold Award. His research is supported by NSF, AFOSR, AFRL, and ONR.