Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023 (FL@FM-NeurIPS'23)
Jinghui Chen · Lixin Fan · Gauri Joshi · Sai Praneeth Karimireddy · Stacy Patterson · Shiqiang Wang · Han Yu
Hall D2 (level 1)
An exciting forum for researchers to exchange recent developments in federated learning in the modern age of foundation models.
Please visit our workshop webpage for full details: https://federated-learning.org/fl@fm-neurips-2023/
Schedule
Sat 6:25 a.m. - 6:30 a.m. | Opening remarks (Presentation)
Sat 6:30 a.m. - 6:40 a.m. | Text-driven Prompt Generation for Vision-Language Models in Federated Learning (Oral)
Prompt learning for vision-language models, e.g., CoOp, has shown great success in adapting CLIP to different downstream tasks, making it a promising solution for federated learning due to computational reasons. Existing prompt learning techniques replace hand-crafted text prompts with learned vectors that offer improvements on seen classes, but struggle to generalize to unseen classes. Our work addresses this challenge by proposing Federated Text-driven Prompt Generation (FedTPG), which learns a unified prompt generation network across multiple remote clients in a scalable manner. The prompt generation network is conditioned on task-related text input and is thus context-aware, making it suitable for generalizing to both seen and unseen classes. Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods, achieving better overall generalization on both seen and unseen classes, and also generalizes to unseen datasets.
Chen Qiu · Xingyu Li · Chaithanya Kumar Mummadi · Madan Ganesh · Zhenzhen Li · Lu Peng · Wan-Yi Lin
Sat 6:40 a.m. - 6:50 a.m. | HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning (Oral)
In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting, which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data, which renders them unsuitable for real-world use due to privacy concerns. We study this problem in the context of Foundation Models and showcase their effectiveness in mitigating forgetting while minimizing overhead costs and without requiring access to any stored data. We achieve this by leveraging a prompting-based approach (such that only prompts and classifier heads have to be communicated) and proposing a novel and lightweight generation and distillation scheme to aggregate client models at the server. We formulate this problem for image classification, establish strong baselines for comparison, and conduct experiments on CIFAR-100 as well as challenging, large-scale datasets like ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by more than 7% while significantly reducing communication and client-level computation costs.
Shaunak Halbe · James S Smith · Junjiao Tian · Zsolt Kira
Sat 6:50 a.m. - 7:00 a.m. | Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning (Oral)
Federated learning (FL) emphasizes decentralized training by storing data locally and transmitting only model updates, underlining user privacy. However, a line of work on privacy attacks undermines user privacy by extracting sensitive data from large language models during FL. Yet these attack techniques face distinct hurdles: some work chiefly with limited batch sizes (e.g., a batch size of 1), and others can be easily defended against or are transparently detectable. This paper introduces an innovative approach that is challenging to detect and defend against, significantly enhancing the recovery rate of text in various batch-size settings. Building on fundamental gradient matching and domain prior knowledge, we enhance the recovery by tapping into the input of the Pooler layer of language models, offering additional feature-level guidance that effectively assists optimization-based attacks. We benchmark our method using text classification tasks on datasets such as CoLA, SST, and Rotten Tomatoes. Across different batch sizes and models, our approach consistently outperforms previous state-of-the-art results.
Jianwei Li · Sheng Liu · Qi Lei
Sat 7:00 a.m. - 7:10 a.m. | FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data (Poster)
Federated learning (FL) allows agents to jointly train a global model without sharing their local data, protecting the privacy of local agents. However, due to the heterogeneous nature of local data, existing definitions of fairness in the context of FL are prone to noisy agents in the network. For instance, existing work usually considers accuracy parity as the fairness metric for different agents, which is not robust in the heterogeneous setting: it forces agents with high-quality data to achieve accuracy similar to those who contribute low-quality data, and may discourage the agents with high-quality data from participating in FL. In this work, we propose a formal FL fairness definition, fairness via agent-awareness (FAA), which takes the heterogeneity of different agents into account by measuring data quality with the approximated Bayes optimal error. Under FAA, the performance of agents with high-quality data will not be sacrificed just due to the existence of large numbers of agents with low-quality data. In addition, we propose a fair FL training algorithm leveraging agent clustering (FOCUS) to achieve fairness in FL, as measured by FAA and other fairness metrics. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for both linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA compared with standard FedAvg under both linear and general convex loss functions. Empirically, we show that on four FL datasets, including synthetic data, images, and texts, FOCUS achieves significantly higher fairness in terms of FAA and other fairness metrics, while maintaining competitive prediction accuracy compared with FedAvg and four state-of-the-art fair FL algorithms.
Wenda Chu · Chulin Xie · Boxin Wang · Linyi Li · Lang Yin · Arash Nourian · Han Zhao · Bo Li
Sat 7:10 a.m. - 7:35 a.m. | Invited talk: Federated Learning by Dataset Distillation (Oral)
Cho-Jui Hsieh
Sat 7:35 a.m. - 8:00 a.m. | Invited talk: Federated Learning with Public and Private Data: From Small Models to Large, and Back (Oral)
Zheng Xu
Sat 8:00 a.m. - 8:30 a.m. | Break
Sat 8:30 a.m. - 8:55 a.m. | Invited talk: When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions (Oral)
Lingjuan Lyu
Sat 8:55 a.m. - 9:05 a.m. | FedSoL: Bridging Global Alignment and Local Generality in Federated Learning (Oral)
While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among the different local objectives of the clients. Yet it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines the concepts of global alignment and local generality. In FedSoL, local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter updates.
Gihun Lee · Minchan Jeong · SangMook Kim · Jaehoon Oh · Se-Young Yun
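As a rough illustration of the perturbation idea described above (our own reading of the abstract, not the authors' code), the local step can be sketched by evaluating the local gradient at a point perturbed along the proximal direction; the toy quadratic loss and the hyperparameters `rho` and `lr` here are hypothetical:

```python
import numpy as np

def fedsol_local_step(w, w_global, grad_fn, rho=0.05, lr=0.1):
    # Evaluate the local gradient at a point perturbed along the proximal
    # direction (w - w_global): the update becomes robust to proximal
    # perturbation while the loss being minimized remains the original
    # local objective (an implicit proximal restriction).
    prox_dir = w - w_global
    norm = np.linalg.norm(prox_dir)
    if norm > 0:
        prox_dir = prox_dir / norm
    g = grad_fn(w + rho * prox_dir)
    return w - lr * g

# toy quadratic local objective standing in for a client's loss
target = np.array([1.0, -2.0])
grad_fn = lambda w: w - target

w_global = np.zeros(2)
w = w_global.copy()
for _ in range(200):
    w = fedsol_local_step(w, w_global, grad_fn)
```

The iterate settles near the local optimum while the perturbation keeps it mildly biased toward the global model.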
Sat 9:05 a.m. - 9:15 a.m. | One-shot Empirical Privacy Estimation for Federated Learning (Oral)
Privacy estimation techniques for differentially private (DP) algorithms are useful for comparing against analytical bounds, or for empirically measuring privacy loss in settings where known analytical bounds are not tight. However, existing privacy auditing techniques usually make strong assumptions on the adversary (e.g., knowledge of intermediate model iterates or the training data distribution), are tailored to specific tasks, model architectures, or DP algorithms, and/or require retraining the model many times (typically on the order of thousands). These shortcomings make deploying such techniques at scale difficult in practice, especially in federated settings where model training can take days or weeks. In this work, we present a novel "one-shot" approach that can systematically address these challenges, allowing efficient auditing or estimation of the privacy loss of a model during the same, single training run used to fit model parameters, and without requiring any a priori knowledge about the model architecture, task, or DP training algorithm. We show that our method provides provably correct estimates for the privacy loss under the Gaussian mechanism, and we demonstrate its performance on well-established FL benchmark datasets under several adversarial threat models.
Galen Andrew · Peter Kairouz · Sewoong Oh · Alina Oprea · H. Brendan McMahan · Vinith Suriyakumar
Sat 9:15 a.m. - 10:00 a.m. | Panel Discussion (Panel)
Sai Praneeth Karimireddy
Sat 10:00 a.m. - 11:30 a.m. | Lunch (Break)
Sat 11:30 a.m. - 11:55 a.m. | Invited talk: Federated Learning in Medical Imaging (Oral)
Jayashree Kalpathy-Cramer
Sat 11:55 a.m. - 12:20 p.m. | Invited talk: Decentralized LLM Agent Cloud Platform (Oral)
Chaoyang He
Sat 12:20 p.m. - 12:30 p.m. | Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning (Oral)
In many applications of federated learning (FL), clients desire models that are personalized using their local data, yet are also robust in the sense that they retain general global knowledge. However, the presence of data heterogeneity across clients induces a fundamental trade-off between personalization (i.e., adaptation to a local distribution) and robustness (i.e., not forgetting previously learned general knowledge). It is critical to understand how to navigate this personalization vs. robustness trade-off when designing federated systems, which are increasingly moving towards a paradigm of fine-tuning large foundation models. Due to limited computational and communication capabilities in most federated settings, this foundation model fine-tuning must be done using parameter-efficient fine-tuning (PEFT) approaches. While some recent work has studied federated approaches to PEFT, the personalization vs. robustness trade-off of federated PEFT has been largely unexplored. In this work, we take a step towards bridging this gap by benchmarking fundamental FL algorithms (FedAvg and FedSGD, plus personalization via client local fine-tuning) applied to one of the most ubiquitous PEFT approaches for large language models (LLMs), prompt tuning, in a multitude of hyperparameter settings under varying levels of data heterogeneity. Our results show that federated-trained prompts can be surprisingly robust when using a small learning rate with many local epochs for personalization, especially when using an adaptive optimizer as the client optimizer during federated training. We also demonstrate that simple approaches such as adding regularization and interpolating two prompts are effective in improving the personalization vs. robustness trade-off in computation-limited settings with few local updates allowed for personalization.
Liam Collins · Shanshan Wu · Sewoong Oh · Khe Sim
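The prompt-interpolation idea mentioned at the end of the abstract is simple enough to sketch; the prompt shapes and the mixing weight `alpha` below are hypothetical:

```python
import numpy as np

def interpolate_prompts(p_global, p_local, alpha):
    # Convex combination of the federated (robust) prompt and the locally
    # fine-tuned (personalized) prompt; alpha trades personalization for
    # robustness (alpha=0 keeps the global prompt, alpha=1 the local one).
    return (1.0 - alpha) * p_global + alpha * p_local

# hypothetical soft-prompt embeddings: 8 prompt tokens x 16 embedding dims
rng = np.random.default_rng(0)
p_global = rng.normal(size=(8, 16))
p_local = p_global + 0.1 * rng.normal(size=(8, 16))  # lightly personalized

p_mixed = interpolate_prompts(p_global, p_local, alpha=0.5)
```

Sweeping `alpha` traces out the personalization vs. robustness trade-off for a given client.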
Sat 12:30 p.m. - 12:40 p.m. | SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models (Oral)
Fine-tuning pre-trained models has achieved significant success in delivering SOTA results across various NLP tasks. In the absence of centralized data, Federated Learning (FL) helps the model benefit from clients' private data for fine-tuning. However, due to the limited communication, computation, and storage capabilities of edge devices and the huge sizes of popular pre-trained models, efficient fine-tuning is crucial. This work explores the opportunities and challenges of applying parameter-efficient fine-tuning (PEFT) methods in FL for language tasks. Specifically, our investigations reveal that with increasing data heterogeneity across users, the gap between fully fine-tuning the model and employing PEFT methods widens. To bridge this performance gap, we propose a method, SLoRA, which overcomes the key limitations of LoRA in highly heterogeneous data scenarios through a novel data-driven initialization technique. Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning, with sparse updates of only ~1% density, while reducing training time by up to 90%.
Sara Babakniya · Ahmed Elkordy · Yahya Ezzeldin · Qingfeng Liu · Kee-Bong Song · Mostafa El-Khamy · Salman Avestimehr
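For readers unfamiliar with LoRA, the underlying adapter can be sketched as follows; this is a generic minimal sketch (the ranks, shapes, and initialization scales are hypothetical), and SLoRA's data-driven initialization of the adapter matrices is not shown:

```python
import numpy as np

class LoRALinear:
    # Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A.
    # In the federated setting only A and B are updated and communicated,
    # which is what makes the scheme parameter- and bandwidth-efficient.
    def __init__(self, W, r=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                     # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))
        self.B = np.zeros((d_out, r))                  # B = 0: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

    def trainable_fraction(self):
        # fraction of parameters that are actually trained/communicated
        n_lora = self.A.size + self.B.size
        return n_lora / (n_lora + self.W.size)

W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, r=4)
x = np.ones((2, 64))
y = layer.forward(x)   # identical to x @ W.T until B is trained
```

Even in this toy layer the adapter accounts for only about 11% of the parameters; at LLM scale the fraction is far smaller.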
Sat 12:40 p.m. - 12:50 p.m. | The Fair Value of Data Under Heterogeneous Privacy Constraints in Federated Learning (Oral)
Modern data aggregation often involves a platform collecting data from a network of users with various privacy options. Platforms must solve the problem of how to allocate incentives to users to convince them to share their data. This paper puts forth a notion of the fair amount to compensate users for their data at a given privacy level, based on an axiomatic definition of fairness along the lines of the celebrated Shapley value. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints. We also formulate a heterogeneous federated learning problem for the platform with privacy level options for users. By studying this problem, we investigate the amount of compensation users receive under fair allocations with different privacy levels, amounts of data, and degrees of heterogeneity. We also discuss what happens when the platform is forced to design fair incentives. Under certain conditions, we find that when privacy sensitivity is low, the platform will set incentives to ensure that it collects all the data with the lowest privacy options. When the privacy sensitivity is above a given threshold, the platform will provide no incentives to users. Between these two extremes, the platform will set the incentives so that some fraction of the users choose the higher privacy option and the rest choose the lower privacy option.
Justin Kang · Kannan Ramchandran · Ramtin Pedarsani
Sat 12:50 p.m. - 1:00 p.m. | Towards Building the FederatedGPT: Federated Instruction Tuning (Oral)
While "instruction-tuned" generative large language models (LLMs) have demonstrated an impressive ability to generalize to new tasks, the training phases heavily rely on large amounts of diverse and high-quality instruction data (such as ChatGPT and GPT-4). Unfortunately, acquiring high-quality data, especially when it comes to human-written data, can pose significant challenges both in terms of cost and accessibility. Moreover, concerns related to privacy can further limit access to such data, making the process of obtaining it a complex and nuanced undertaking. To tackle this issue, our study introduces a new approach called Federated Instruction Tuning (FedIT), which leverages federated learning (FL) as the learning framework for the instruction tuning of LLMs. This marks the first exploration of FL-based instruction tuning for LLMs. This is especially important since text data is predominantly generated by end users. For example, collecting extensive amounts of everyday user conversations can be a useful approach to improving the generalizability of LLMs, allowing them to generate authentic and natural responses. Therefore, it is imperative to design and adapt FL approaches to effectively leverage these users' diverse instructions stored on local devices while mitigating concerns related to data sensitivity and the cost of data transmission. In this study, we leverage extensive qualitative analysis, including the prevalent GPT-4 auto-evaluation, to illustrate how our FedIT framework enhances the performance of LLMs. Utilizing diverse instruction sets on the client side, FedIT outperforms centralized training with only limited local instructions.
Jianyi Zhang · Saeed Vahidian · Martin Kuo · Chunyuan Li · Ruiyi Zhang · Tong Yu · Guoyin Wang · Yiran Chen
Sat 1:00 p.m. - 1:30 p.m. | Break
Sat 1:30 p.m. - 1:55 p.m. | Invited talk: On the 5th Generation of Local Training Methods in Federated Learning (Oral)
Peter Richtarik
Sat 1:55 p.m. - 2:05 p.m. | Federated Learning for Speech Recognition: Revisiting Current Trends Towards Large-Scale ASR (Oral)
While automatic speech recognition (ASR) has witnessed remarkable achievements in recent years, it has not garnered widespread focus within the federated learning (FL) and differential privacy (DP) communities. Meanwhile, ASR is a well-suited benchmark for FL and DP, as there is (i) a natural data split across users by speaker information; (ii) heterogeneous data across speakers, close to practical settings; and (iii) a variety of sequence-to-sequence loss functions. Recent production-ready state-of-the-art models in ASR include large conformer and transformer models, whose optimization is known to pose challenges even for centralized training. While the main trends and benchmarks in FL and DP focus on small models, we show the necessity of disentangling optimization with small models from optimization with FL and DP, as optimization of large models in the context of FL and DP behaves differently. In this paper, we analyze the key FL parameters (optimizers, training from scratch or from a seed model pre-trained centrally, cohort size, data heterogeneity) and propose the first benchmark of FL with DP in the context of large models in ASR. We examine the applicability of prior results and present an overview of observed departures from the trends in prior works and from training different ASR models. Through this work, we provide researchers and practitioners in the fields of FL and DP with valuable insights into the fundamental differences that may arise when applying FL and DP research to large-scale ASR training.
Shams Azam · Martin Pelikan · Vitaly Feldman · Kunal Talwar · Jan Silovsky · Tatiana Likhomanenko
Sat 2:05 p.m. - 2:15 p.m. | LASER: Linear Compression in Wireless Distributed Optimization (Oral)
Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large-scale machine learning. Despite its merits, the communication bottleneck is one of its persistent issues. Most compression schemes that alleviate this either assume noiseless communication links or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineAr CompreSsion in WirEless DistRibuted Optimization. LASER capitalizes on the inherent low-rank structure of gradients and transmits them efficiently over noisy channels. While enjoying theoretical guarantees similar to those of classical SGD, LASER shows consistent gains over baselines on a variety of practical benchmarks. In particular, it outperforms state-of-the-art compression schemes on challenging computer vision and GPT language modeling tasks. On the latter, we obtain a 50-64% improvement in perplexity over our baselines for noisy channels.
Ashok Vardhan Makkuva · Marco Bondaschi · Thijs Vogels · Martin Jaggi · Hyeji Kim · Michael Gastpar
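The low-rank compression idea the abstract relies on can be sketched with a truncated SVD; this is an illustrative sketch only (the rank and matrix sizes are hypothetical), and LASER's handling of the noisy wireless channel is omitted:

```python
import numpy as np

def low_rank_compress(G, rank):
    # Truncated SVD keeps only the top singular directions of the gradient,
    # so a d_out x d_in matrix is sent as two thin factors plus a few scalars.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

def decompress(U, s, Vt):
    return (U * s) @ Vt

rng = np.random.default_rng(0)
# Gradients of large models are often approximately low rank; here we
# construct an exactly rank-2 matrix so the compression is lossless.
G = rng.normal(size=(128, 2)) @ rng.normal(size=(2, 64))

U, s, Vt = low_rank_compress(G, rank=2)
G_hat = decompress(U, s, Vt)
floats_sent = U.size + s.size + Vt.size   # 386 floats vs. 8192 for full G
```

When the gradient is only approximately low rank, the reconstruction error is governed by the discarded singular values.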
Sat 2:15 p.m. - 2:20 p.m. | Best Paper Award Ceremony (Announcement)
Shiqiang Wang
Sat 2:20 p.m. - 3:30 p.m. | Poster Session
Poster presentation of all contributed papers accepted to the workshop.
Beyond Parameter Averaging in Model Aggregation (Poster)
The success of foundation models is strongly linked to scale, which has reinforced the interest in federated learning. With the prohibitive cost of training a large language model (LLM) in mind, little attention has been placed on reusing pre-trained models in collaborative training settings. Self-supervision has also played an important role in this success, but its emphasis has been primarily on data. This paper leverages Bayesian principles to bring self-supervision into the model aggregation toolbox. It introduces self-supervised Fisher merging, a framework that successfully merges models in parameter space without revisiting data, opening a new door in model reusability. Experimental results establish our method on tractable linear models and highlight its potential for aggregating neural networks.
Pol Garcia Recasens · Jordi Torres · Josep Lluís Berral · Søren Hauberg · Pablo Moreno-Muñoz
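The Fisher-merging idea underlying this abstract can be sketched with diagonal Fisher weights and hypothetical toy vectors; the paper's self-supervised estimation of the Fisher terms is not shown:

```python
import numpy as np

def fisher_merge(params, fishers):
    # Per-coordinate weighted average of model parameters: coordinates where
    # a model has high (diagonal) Fisher information dominate the merge,
    # rather than each model contributing equally as in plain averaging.
    num = sum(F * w for F, w in zip(fishers, params))
    den = sum(fishers)
    return num / den

# hypothetical two-model example
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])
F1 = np.array([10.0, 1.0])   # model 1 is confident about coordinate 0
F2 = np.array([1.0, 10.0])   # model 2 is confident about coordinate 1
w_merged = fisher_merge([w1, w2], [F1, F2])
```

Unlike plain parameter averaging (which would give [0.5, 0.5] here), the merge preserves what each model is confident about.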
DPZero: Dimension-Independent and Differentially Private Zeroth-Order Optimization (Poster)
Today's widespread practice of fine-tuning pretrained large language models (LLMs) on domain-specific data faces two grand challenges in memory and privacy. First, as LLMs continue to expand, encompassing billions of parameters, the memory demands of gradient-based training methods via backpropagation become prohibitively high. Second, given the tendency of LLMs to memorize and disclose sensitive training data, the privacy of fine-tuning data must be respected. To this end, we explore the potential of zeroth-order methods in differentially private optimization for fine-tuning LLMs. Zeroth-order methods, which rely solely on forward passes, substantially reduce memory consumption during training. However, directly combining them with the standard differential privacy mechanism incurs dimension-dependent complexity. To bridge the gap, we introduce DPZero, a novel differentially private zeroth-order algorithm with nearly dimension-independent rates. Our theoretical analysis reveals that its complexity hinges primarily on the problem's intrinsic dimension and exhibits only a logarithmic dependence on the ambient dimension. This renders DPZero a highly practical option for real-world LLM deployments.
Liang Zhang · Kiran Thekumparampil · Sewoong Oh · Niao He
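A minimal sketch of a DPZero-style step, as we read the abstract (the toy loss and the hyperparameters `lr`, `mu`, `clip`, `sigma` are hypothetical, and the formal privacy accounting is omitted):

```python
import numpy as np

def dpzero_step(w, loss_fn, rng, lr=0.01, mu=1e-3, clip=1.0, sigma=0.1):
    # Estimate a directional derivative from two forward passes, clip it,
    # and privatize it with Gaussian noise. The noise is added to a single
    # scalar rather than a d-dimensional gradient, which is the intuition
    # behind the (near) dimension-independent rates.
    z = rng.normal(size=w.shape)                       # shared random direction
    d = (loss_fn(w + mu * z) - loss_fn(w - mu * z)) / (2 * mu)
    d = float(np.clip(d, -clip, clip))                 # bound the DP sensitivity
    d += sigma * clip * rng.normal()                   # Gaussian mechanism on a scalar
    return w - lr * d * z

# toy strongly convex objective standing in for a fine-tuning loss
loss_fn = lambda w: 0.5 * float(np.sum((w - 3.0) ** 2))

rng = np.random.default_rng(0)
w = np.zeros(4)
for _ in range(2000):
    w = dpzero_step(w, loss_fn, rng)
```

Only forward evaluations of the loss are used, so no backpropagation buffers are ever materialized.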
An Empirical Evaluation of Federated Contextual Bandit Algorithms (Poster)
Fine-tuning (foundation) models with user feedback can be important for improving task-specific performance, as fine-grained supervision is generally unavailable. While the adoption of federated learning increases for learning from sensitive data local to user devices, it is unclear if learning can be done using implicit signals generated as users interact with the applications. We approach such problems with the framework of federated contextual bandits, and develop variants of prominent contextual bandit algorithms from the centralized setting for the federated setting. We carefully evaluate these algorithms in a range of scenarios simulated using publicly available datasets. Our simulations model typical setups encountered in the real world, such as various misalignments between an initial pre-trained model and the subsequent user interactions due to non-stationarity in the data and/or heterogeneity across clients. Our experiments reveal the surprising effectiveness of the simple and commonly used softmax heuristic in balancing the well-known exploration-exploitation tradeoff across the breadth of our settings.
Alekh Agarwal · H. Brendan McMahan · Zheng Xu
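The softmax heuristic highlighted in the abstract is easy to state concretely; the scores and temperature below are hypothetical:

```python
import numpy as np

def softmax_policy(scores, temperature=1.0):
    # Sample arms with probability proportional to exp(score / temperature):
    # high-scoring arms are favored (exploitation) while every arm keeps
    # non-zero probability (exploration). Lower temperature -> greedier.
    z = scores / temperature
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([1.0, 2.0, 3.0])   # hypothetical per-arm value estimates
p = softmax_policy(scores, temperature=0.5)
rng = np.random.default_rng(0)
arm = int(rng.choice(len(scores), p=p))
```

As the temperature tends to zero the policy approaches greedy arm selection; as it grows the policy approaches uniform exploration.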
Parameter Averaging Laws for Multitask Language Models (Poster)
Parameter-averaging, a method for combining multiple models into a single one, has emerged as a promising approach to enhance performance without requiring additional space or retraining. Nonetheless, the conditions for successful parameter-averaging remain undefined, calling for further research to characterize them. In this study, we empirically investigate the influential factors for successful parameter-averaging and reveal positive correlations between representation power and the performance gain of parameter-averaging. Specifically, we evaluate how computational budget, data diversity, and vocabulary size contribute to representation power, and their influence on the success of parameter-averaging. Our results demonstrate that parameter-averaging improves the generalization ability for both in-domain and out-of-domain data. Additionally, to reduce the computational cost of parameter-averaging, we introduce partial averaging, which assumes arbitrary participation of a subset of contributors. We observe that partial averaging outperforms fine-tuning for models with sufficient representation power. Furthermore, we find that the impact of data heterogeneity, which arises from different data distributions of contributors, reduces as the representation power of the model increases. These findings provide valuable insights into the principles governing parameter-averaging and its potential for enhancing model performance.
Woojin Chung · Hyowon Cho · James Thorne · Se-Young Yun
Consensus Optimization at Representation: Improving Personalized Federated Learning via Data-Centric Regularization (Poster)
Federated learning is a large-scale machine learning training paradigm where data is distributed across clients and can be highly heterogeneous from one client to another. To ensure personalization in client models, and at the same time to ensure that the local models have enough commonality (i.e., to prevent "client drift"), it has recently been proposed to cast the federated learning problem as a consensus optimization problem, where local models are trained on local data but are forced to be similar via a regularization term. In this paper we propose an improved federated learning algorithm, where we ensure consensus optimization at the representation part of each local client, and not on whole local models. This algorithm naturally takes into account that today's deep networks are often partitioned into a feature extraction part (representation) and a prediction part. Our algorithm ensures greater flexibility compared to previous works on exact shared representation in highly heterogeneous settings, as it has been seen that the representation part can differ substantially with the data distribution. Our method is quite stable to noise, and can be made differentially private with strong privacy guarantees without much loss of accuracy. We validate its good performance experimentally on standard datasets.
Heng Zhu · Arya Mazumdar
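The representation-level consensus term can be sketched as a penalty added to the local task loss; this is our illustrative reading (the vectors, the task loss value, and the weight `lam` are hypothetical):

```python
import numpy as np

def local_loss_with_rep_consensus(rep_local, rep_global, task_loss, lam=0.1):
    # Consensus penalty applied only to the representation (feature-extractor)
    # part, leaving the prediction head free to personalize, in contrast to
    # proximal terms that constrain the whole local model.
    consensus = float(np.sum((rep_local - rep_global) ** 2))
    return task_loss + lam * consensus

# hypothetical representation vectors for one client vs. the global consensus
rep_local = np.array([0.5, 1.5])
rep_global = np.array([0.0, 1.0])
loss = local_loss_with_rep_consensus(rep_local, rep_global, task_loss=2.0)
```

Setting `lam=0` recovers fully personalized local training; large `lam` forces the representations toward exact sharing.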
Augmenting Federated Learning with Pretrained Transformers (Poster)
The explosive growth and diversity of machine learning applications motivate a fundamental rethinking of learning with mobile and edge devices. How can we address diverse/disparate client goals and learn with scarce heterogeneous data? While federated learning (FL) aims to address these issues, it has several bottlenecks and challenges hindering a unified solution. On the other hand, large transformer models have been shown to work across a variety of tasks, often achieving remarkable few-shot adaptation. This raises the question: can FL clients use a single general-purpose model, rather than custom models for each task, while obeying device and network constraints? In this work, we investigate pretrained transformers (PTF) to achieve these on-device learning goals and thoroughly explore the roles of model size and modularity, where the latter refers to adaptation through modules such as prompts or adapters. We demonstrate that: (1) Larger scale shrinks the accuracy gaps between alternative approaches and improves heterogeneity robustness. Crucially, scale allows clients to run more local SGD epochs, which substantially (4x) reduces the number of communication rounds. At the extreme, clients can achieve respectable accuracy fully locally, reducing the need for collaboration. (2) Modularity enables >100x less communication in bits. Surprisingly, it also boosts the generalization capability of local adaptation methods and the robustness of smaller PTFs. To explain these benefits, we show that scale and modularity can synergistically mitigate the representation shift during FL. Finally, to harness the multitasking capabilities of modern PTFs, we propose FedYolo: a new FL approach that assigns both dedicated and shared modules to FL tasks to manage their interference. Our extensive experiments demonstrate FedYolo's value and the power of scale and modularity for multitasking.
Xuechen Zhang · Mingchen Li · Xiangyu Chang · Jiasi Chen · Amit Roy-Chowdhury · Ananda Theertha Suresh · Samet Oymak
Exploring User-level Gradient Inversion with a Diffusion Prior (Poster)
We explore user-level gradient inversion as a new attack surface in distributed learning. We first investigate existing attacks on their ability to make inferences about private information beyond training data reconstruction. Motivated by the low reconstruction quality of existing methods, we propose a novel gradient inversion attack that applies a denoising diffusion model as a strong image prior in order to enhance recovery in the large-batch setting. Unlike traditional attacks, which aim to reconstruct individual samples and suffer at large batch and image sizes, our approach instead aims to recover a representative image that captures the sensitive shared semantic information corresponding to the underlying user. Our experiments with face images demonstrate the ability of our methods to recover realistic facial images along with private user attributes.
Zhuohang Li · Andrew Lowy · Jing Liu · Toshiaki Koike-Akino · Bradley Malin · Kieran Parsons · Ye Wang
Making Batch Normalization Great in Federated Deep Learning (Poster)
Batch Normalization (BN) is commonly used in modern deep foundation models to improve stability and speed up convergence in centralized training. In federated learning (FL) with non-IID decentralized data, previous works observed that training with BN could hinder performance due to the mismatch of the BN statistics between training and testing. Group Normalization (GN) is thus more often used in FL as an alternative to BN. In this paper, we identify a more fundamental issue of BN in FL that makes BN inferior even with high-frequency communication between clients and servers. We then propose a frustratingly simple treatment, which significantly improves BN and makes it outperform GN across a wide range of FL settings. Along with this study, we also reveal an unreasonable behavior of BN in FL. We find it quite robust in the low-frequency communication regime where FL is commonly believed to degrade drastically. We hope that our study could serve as a valuable reference for future practical usage and theoretical analysis in FL.
Jike Zhong · Hong-You Chen · Wei-Lun (Harry) Chao
-
|
Leveraging Foundation Models to Improve Lightweight Clients in Federated Learning
(
Poster
)
>
link
Federated Learning (FL) is a distributed training paradigm that enables clients scattered across the world to cooperatively learn a global model without divulging confidential data. However, FL faces a significant challenge in the form of heterogeneous data distributions among clients, which leads to a reduction in performance and robustness. A recent approach to mitigating the impact of heterogeneous data distributions is the use of foundation models, which offer better performance at the cost of larger computational overheads and slower inference speeds. We introduce foundation model distillation to assist in the federated training of lightweight client models and increase their performance under heterogeneous data settings while keeping inference costs low. Our results show improvement in the global model performance on a balanced testing set, which contains rarely observed samples, even under extreme non-IID client data distributions. We conduct a thorough evaluation of our framework with different foundation model backbones on CIFAR10, with varying degrees of heterogeneous data distributions ranging from class-specific data partitions across clients to Dirichlet data sampling, parameterized by concentration values between 0.01 and 1.0. |
Xidong Wu · Wan-Yi Lin · Devin Willmott · Filipe Condessa · Yufei Huang · Zhenzhen Li · Madan Ganesh 🔗 |
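The Dirichlet-based partitions used in evaluations like the one above can be generated with a short NumPy sketch (a generic recipe; the client count, class count, and concentration value below are illustrative, not the paper's settings):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.

    Smaller alpha -> more heterogeneous (class-skewed) client shards.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Fraction of class c assigned to each client.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Toy example: 1000 samples, 10 classes, 5 clients, strong skew (alpha = 0.1).
labels = np.repeat(np.arange(10), 100)
shards = dirichlet_partition(labels, num_clients=5, alpha=0.1)
```

Sweeping `alpha` from 0.01 toward 1.0 moves the shards from near class-exclusive to near-uniform, matching the range of heterogeneity described in the abstract.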
-
|
MARINA Meets Matrix Stepsizes: Variance Reduced Distributed Non-Convex Optimization
(
Poster
)
>
link
Matrix-stepsized gradient descent algorithms have been demonstrated to exhibit superior efficiency in non-convex optimization compared to their scalar counterparts. The det-CGD algorithm, introduced by [LKR23], leverages matrix stepsizes to perform compressed gradient descent for non-convex objectives and matrix-smooth problems in a federated manner. The authors establish the algorithm's convergence to a neighborhood of a weighted stationarity point under a convex condition on the symmetric, positive-definite stepsize matrix. In this paper, we propose a variance-reduced version of the det-CGD algorithm, incorporating the MARINA method. Notably, we establish, both theoretically and empirically, that det-MARINA outperforms both MARINA and the distributed det-CGD algorithms. |
Hanmin Li · Avetik Karagulyan · Peter Richtarik 🔗 |
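As a rough sketch of the matrix-stepsize idea described above (notation assumed for illustration, not taken verbatim from [LKR23]): the update replaces a scalar learning rate with a symmetric positive-definite matrix $D$ applied to a compressed gradient,

```latex
x^{k+1} = x^k - D\,\mathcal{C}^k\!\bigl(\nabla f(x^k)\bigr),
\qquad D = D^\top \succ 0,
```

where $\mathcal{C}^k$ is a (possibly randomized) compression operator; a MARINA-style variance reduction would replace $\mathcal{C}^k(\nabla f(x^k))$ with a recursively updated estimator built from compressed gradient differences.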
-
|
Private and Personalized Histogram Estimation in a Federated Setting
(
Poster
)
>
link
Personalized federated learning (PFL) aims at learning personalized models for users in a federated setup. We focus on the problem of privately estimating histograms (in the KL metric) for each user in the network. Conventionally, for more general problems, learning a global model jointly via federated averaging and then fine-tuning locally for each user has been a winning strategy. But this can be suboptimal if the user population contains diverse subpopulations, as one might expect with user vocabularies. To tackle this, we study an alternative PFL technique: clustering-based personalization, which first identifies diverse subpopulations when present, enabling users to collaborate more closely with others from the same subpopulation. We motivate our algorithm via a stylized generative process, a mixture of Dirichlets, and propose initialization/pre-processing techniques that reduce the iteration complexity of clustering. This enables the application of privacy mechanisms at each step of our iterative procedure, making the algorithm user-level differentially private without a severe drop in utility due to added noise. Finally, we present empirical results on Reddit user data, where we compare our method with other well-known PFL approaches applied to private histogram estimation. |
Amrith Setlur · Vitaly Feldman · Kunal Talwar 🔗 |
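A generic (non-private) sketch of the clustering-based personalization idea from the abstract: cluster users' empirical histograms under the KL metric, Lloyd-style, so that users in the same subpopulation share a center. This is an illustrative baseline, not the paper's algorithm, and the user histograms below are made up.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two histograms (eps avoids log(0))."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def cluster_histograms(hists, k, iters=20, seed=0):
    """Lloyd-style clustering of user histograms under the KL metric.

    The center minimizing the mean of KL(p_u || c) over a cluster is the
    renormalized average of its member histograms.
    """
    rng = np.random.default_rng(seed)
    hists = np.asarray(hists, dtype=float)
    centers = hists[rng.choice(len(hists), size=k, replace=False)]
    assign = np.zeros(len(hists), dtype=int)
    for _ in range(iters):
        assign = np.array([np.argmin([kl(h, c) for c in centers]) for h in hists])
        for j in range(k):
            members = hists[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
                centers[j] /= centers[j].sum()
    return assign, centers

# Two obvious subpopulations of users over a 4-symbol vocabulary.
users = [[0.7, 0.2, 0.05, 0.05]] * 5 + [[0.05, 0.05, 0.2, 0.7]] * 5
assign, centers = cluster_histograms(users, k=2)
```

In a private version, each step (assignment counts, center averaging) would be performed through a user-level DP mechanism, which is where the paper's initialization and pre-processing techniques matter.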
-
|
TAMUNA: Doubly Accelerated Federated Learning with Local Training, Compression, and Partial Participation
(
Poster
)
>
link
In federated learning, a large number of users collaborate to learn a global model. They alternate local computations and communication with a distant server. Communication, which can be slow and costly, is the main bottleneck in this setting. In addition to communication efficiency, a robust algorithm should allow for partial participation, the desirable feature that not all clients need to participate in every round of the training process. To reduce the communication load and thereby accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently, that is, perform several iterations of local computations between communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose TAMUNA, the first algorithm for distributed optimization and federated learning that jointly harnesses these two strategies and allows for partial participation. TAMUNA converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it provably benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the model dimension, respectively. |
Laurent Condat · Ivan Agarský · Grigory Malinovsky · Peter Richtarik 🔗 |
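The two communication-saving strategies named in the abstract, local steps and compressed updates, can be combined in a simple generic loop (this is a naive illustration of the two mechanisms, not TAMUNA itself, which adds correction terms to converge exactly; problem sizes below are made up):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates, zero the rest (a common compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def local_training_with_compression(x0, clients, rounds=300, local=5, lr=0.1, k=2):
    """Each client runs `local` gradient steps on its own quadratic, then sends
    a top-k-compressed model delta; the server averages the deltas."""
    x = x0.copy()
    for _ in range(rounds):
        deltas = []
        for A, b in clients:
            y = x.copy()
            for _ in range(local):
                y -= lr * (A @ y - b)        # strategy 1: local computations
            deltas.append(top_k(y - x, k))   # strategy 2: compressed communication
        x = x + np.mean(deltas, axis=0)      # server aggregation
    return x

# Two clients with well-conditioned quadratics f_i(x) = 0.5 x^T A_i x - b_i^T x.
rng = np.random.default_rng(1)
d = 4
clients = []
for _ in range(2):
    M = rng.standard_normal((d, d))
    clients.append((np.eye(d) + 0.1 * (M @ M.T), rng.standard_normal(d)))
x = local_training_with_compression(np.zeros(d), clients)
```

As written, the biased compressor and client drift from local steps leave a residual error; TAMUNA's contribution is precisely to remove that residual and converge linearly to the exact solution.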
-
|
FedML-HE: An Efficient Homomorphic-Encryption-Based Privacy-Preserving Federated Learning System
(
Poster
)
>
link
Federated Learning trains machine learning models on distributed devices by aggregating local model updates instead of local data. However, privacy concerns arise as the aggregated local models on the server may reveal sensitive personal information through inversion attacks. Privacy-preserving methods, such as Homomorphic Encryption (HE), thus become necessary for FL training. Despite HE's post-quantum security advantages, its applications suffer from impractical overheads, especially for foundation models. In this paper, we present FedML-HE, the first practical federated learning system with efficient HE-based secure model aggregation. FedML-HE proposes to selectively encrypt sensitive parameters, significantly reducing both computation and communication overheads during training while providing customizable privacy preservation. Our optimized system demonstrates considerable overhead reduction, particularly for large foundation models (e.g., ~10x reduction for HE-federated training of ResNet-50 and ~40x reduction for BERT), showing the potential for scalable HE-based FL deployment. |
Weizhao Jin · Yuhang Yao · Shanshan Han · Carlee Joe-Wong · Srivatsan Ravi · Salman Avestimehr · Chaoyang He 🔗 |
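The selective-encryption idea above can be sketched without any actual HE library: pick the fraction of parameters deemed most sensitive and mark them for encryption, sending the rest in plaintext. The gradient-magnitude proxy and the 10% fraction below are assumptions for illustration, not the paper's exact selection criterion:

```python
import numpy as np

def select_sensitive(grads, frac=0.1):
    """Boolean mask over parameters marking which entries would be HE-encrypted.

    Sensitivity is proxied here by gradient magnitude (an assumption);
    only the top `frac` fraction is selected, so HE cost scales with
    frac * num_params instead of num_params.
    """
    flat = np.abs(grads).ravel()
    k = max(1, int(frac * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.abs(grads) >= thresh

rng = np.random.default_rng(0)
grads = rng.standard_normal(100)
mask = select_sensitive(grads, frac=0.1)   # True = encrypt, False = plaintext
```
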
-
|
MOFL/D: A Federated Multi-objective Learning Framework with Decomposition
(
Poster
)
>
link
Multi-objective learning problems occur in all aspects of life and have been studied for decades, including in the field of machine learning. Many such problems also exist in distributed settings, where data cannot easily be shared. In recent years, joint machine learning has been made possible in such settings through the development of the Federated Learning (FL) paradigm. However, no general extension of the FL concept to multi-objective learning has been proposed yet, limiting such problems to non-cooperative individual learning. We address this gap by presenting a first general framework for multi-objective federated learning, based on decomposition (MOFL/D). Our framework addresses the a posteriori type of multi-objective problem, where user preferences are not known during the optimisation process, allowing multiple participants to jointly find a set of solutions, each optimised for some distribution of preferences. We present an instantiation of the framework and validate it through experiments on a set of multi-objective benchmarking problems that are extended from well-known single-objective benchmarks. |
Maria Hartmann · Grégoire Danoy · Mohammed Alswaitti · Pascal Bouvry 🔗 |
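Decomposition-based multi-objective optimization, on which the framework above is built, turns one multi-objective problem into a family of single-objective subproblems, one per preference vector. A minimal (non-federated) sketch using weighted-sum scalarization, one common decomposition choice; the objectives and preference grid are made up:

```python
import numpy as np

def weighted_sum_subproblem(weights, grads, x0, steps=200, lr=0.1):
    """Solve one decomposed subproblem: minimize sum_j w_j f_j(x) by gradient descent."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = sum(w * gf(x) for w, gf in zip(weights, grads))
        x -= lr * g
    return x

# Two objectives f1 = 0.5||x - a||^2, f2 = 0.5||x - b||^2; a grid of
# preference vectors yields a set of solutions covering the Pareto front,
# matching the a posteriori setting where preferences are unknown upfront.
a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
grads = [lambda x: x - a, lambda x: x - b]
prefs = [(w, 1 - w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)]
front = [weighted_sum_subproblem(p, grads, np.zeros(2)) for p in prefs]
```

In the federated version each subproblem's gradients would be aggregated across participants rather than computed centrally.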
-
|
Absolute Variation Distance: an Inversion Attack Evaluation Metric for Federated Learning
(
Poster
)
>
link
Federated Learning (FL) has emerged as a pivotal approach for training models on decentralized data sources by sharing only model gradients. However, the shared gradients in FL are susceptible to inversion attacks, which can expose sensitive information. While several defense and attack strategies have been proposed, their effectiveness is often evaluated using metrics that may not necessarily reflect the success rate of an attack or information retrieval, especially in the context of multidimensional data such as images. Traditional metrics like the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE) are typically used as lightweight metrics; they assume only pixel-wise comparison and fail to consider the semantic context of the recovered data. This paper introduces the Absolute Variation Distance (AVD), a lightweight metric derived from total variation, to assess data recovery and information leakage in FL. Unlike traditional metrics, AVD offers a continuous measure for extracting information in noisy images and aligns closely with human perception. Our results, combined with a user experience survey, demonstrate that AVD provides a more accurate and consistent measure of data recovery. It also matches the accuracy of the more costly and complex neural-network-based metric, the Learned Perceptual Image Patch Similarity (LPIPS). Hence, it offers an effective tool for the automatic evaluation of data security in federated learning and a reliable way of studying defense and inversion attack strategies in FL. |
Georgios Papadopoulos · Yash Satsangi · Shaltiel Eloul · Marco Pistoia 🔗 |
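The building block AVD is derived from is total variation, which is cheap to compute. The comparison below is only a plausible reading of an "absolute variation" score (the paper's exact AVD formula is not reproduced here, and the toy images are made up):

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation of a 2-D image: sum of absolute
    differences between horizontally and vertically adjacent pixels."""
    dx = np.abs(np.diff(img, axis=1)).sum()
    dy = np.abs(np.diff(img, axis=0)).sum()
    return float(dx + dy)

def avd_sketch(recovered, original):
    """Illustrative absolute-variation-style score (an assumption, not the
    paper's definition): absolute difference of total variations, per pixel."""
    return abs(total_variation(recovered) - total_variation(original)) / original.size

rng = np.random.default_rng(0)
clean = np.zeros((8, 8))
clean[:, 4:] = 1.0                                  # a simple edge image
noisy = clean + 0.5 * rng.standard_normal((8, 8))   # degraded "recovery"
```
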
-
|
Fed3R: Recursive Ridge Regression for Federated Learning with strong pre-trained models
(
Poster
)
>
link
Federated Learning offers a powerful solution for training models on data that cannot be centrally stored due to privacy concerns. However, the existing paradigm suffers from high statistical heterogeneity across clients' data, resulting in client drift due to biased local solutions. This issue is particularly pronounced in the final classifier layer, severely impeding convergence speed during aggregation. To overcome these challenges, we introduce Federated Recursive Ridge Regression (Fed3R). This approach replaces the gradient-based classifier with a ridge-regression-based classifier, computed in closed form, ensuring resilience to client drift and drastically reducing convergence time and communication costs. The incremental formulation of Fed3R is equivalent to the ideal centralized ridge regression solution, enabling the use of more complex architectures with pre-trained parameters and robust generalization capabilities that are incompatible with previous federated learning techniques. We propose Fed3R in three variants, with Fed3R-RF significantly enhancing performance to levels akin to centralized training while remaining competitive in terms of total communication costs. |
Eros Fanì · Raffaello Camoriano · Barbara Caputo · Marco Ciccone 🔗 |
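The drift-free property claimed above rests on the fact that the ridge-regression sufficient statistics are additive across clients. A small generic sketch (a textbook federated ridge construction under assumed shapes, not the paper's exact Fed3R implementation):

```python
import numpy as np

def client_stats(features, labels_onehot):
    """Each client sends only its Gram matrix and cross-covariance."""
    return features.T @ features, features.T @ labels_onehot

def fed_ridge(client_data, dim, num_classes, lam=1.0):
    """Sum the clients' sufficient statistics, then solve ridge in closed form.

    Because the statistics are additive, the result equals the centralized
    ridge solution no matter how the data is split (no client drift).
    """
    A = lam * np.eye(dim)
    B = np.zeros((dim, num_classes))
    for X, Y in client_data:
        dA, dB = client_stats(X, Y)
        A += dA
        B += dB
    return np.linalg.solve(A, B)  # classifier weights, shape (dim, num_classes)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))                     # pre-trained features
Y = np.eye(3)[rng.integers(0, 3, size=60)]           # one-hot labels
clients = [(X[i::3], Y[i::3]) for i in range(3)]     # same data, split 3 ways
W_fed = fed_ridge(clients, dim=5, num_classes=3)
W_central = np.linalg.solve(X.T @ X + np.eye(5), X.T @ Y)
```

The federated and centralized solutions coincide exactly, which is the equivalence the abstract refers to.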
-
|
Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages
(
Poster
)
>
link
Pretrained large language models (LLMs) have emerged as a cornerstone in modern natural language processing, with their utility expanding to various applications and languages. However, the fine-tuning of multilingual LLMs, particularly for low-resource languages, is fraught with challenges stemming from data-sharing restrictions (the physical border) and from inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, especially those in low-resource regions, from fully benefiting from the advantages of LLMs. To overcome these challenges, we propose the Federated Prompt Tuning Paradigm for Multilingual Scenarios, which leverages parameter-efficient fine-tuning in a manner that preserves user privacy. We have designed a comprehensive set of experiments and introduced the concept of "language distance" to highlight the strengths of this paradigm: even under computational constraints, our method not only bolsters data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9% higher accuracy, reduces the training parameters by over 99%, and demonstrates stronger cross-lingual generalization. Such findings underscore the potential of our approach to promote social equality, ensure user privacy, and champion linguistic diversity. |
Wanru Zhao · Yihong Chen · Royson Lee · Xinchi Qiu · Yan Gao · Hongxiang Fan · Nicholas Lane 🔗 |
-
|
Learning Optimizers for Local SGD
(
Poster
)
>
link
Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is, on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep-learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art optimizers for deep learning. In this work, we incorporate local optimizers that compute multiple updates into a learned optimization framework, allowing us to meta-learn potentially more efficient local SGD algorithms. Our results demonstrate that learned local optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. We show that the learned optimizers can generalize to new datasets and architectures, demonstrating the potential of learned optimizers for improving communication-efficient distributed learning. |
Charles-Étienne Joseph · Benjamin Thérien · Abhinav Moudgil · Boris Knyazev · Eugene Belilovsky 🔗 |
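The local SGD baseline that the learned optimizers above are compared against has a simple structure: several local gradient steps per worker, then one parameter average per communication round. A minimal sketch on toy quadratics (step counts and objectives are illustrative):

```python
import numpy as np

def local_sgd(x0, client_grads, rounds=50, local_steps=10, lr=0.1):
    """Baseline local SGD: each worker takes `local_steps` gradient steps,
    then the server averages the resulting parameters (one communication
    per round instead of one per step)."""
    x = x0.copy()
    for _ in range(rounds):
        local_models = []
        for grad in client_grads:
            y = x.copy()
            for _ in range(local_steps):
                y -= lr * grad(y)
            local_models.append(y)
        x = np.mean(local_models, axis=0)
    return x

# Two workers with quadratics f_i(x) = 0.5 ||x - c_i||^2; the global
# optimum is the midpoint of the two centers.
c1, c2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grads = [lambda x: x - c1, lambda x: x - c2]
x = local_sgd(np.zeros(2), grads)
```

A learned optimizer would replace the inner `y -= lr * grad(y)` rule with a meta-trained update while keeping the same communication pattern.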
-
|
RealFM: A Realistic Mechanism to Incentivize Data Contribution and Device Participation
(
Poster
)
>
link
Edge device participation in federated learning (FL) has typically been studied under the lens of device-server communication (e.g., device dropout) and assumes an undying desire from edge devices to participate in FL. As a result, current FL frameworks are flawed when implemented in real-world settings, with many encountering the free-rider problem. As a step toward pushing FL to realistic settings, we propose RealFM: the first truly federated mechanism which (1) realistically models device utility, (2) incentivizes data contribution and device participation, and (3) provably removes the free-rider phenomenon. RealFM does not require data sharing and allows for a non-linear relationship between device accuracy and utility, which improves the utility gained by the server and participating devices compared to non-participating devices as well as devices participating in other FL mechanisms. On real-world data, RealFM improves device and server utility, as well as data contribution, by up to 3 orders of magnitude and 7x, respectively, compared to baseline mechanisms. |
Marco Bornstein · Amrit Bedi · Anit Kumar Sahu · Furqan Khan · Furong Huang 🔗 |
-
|
Heterogeneous LoRA for Federated Fine-tuning of On-device Foundation Models
(
Poster
)
>
link
Foundation models (FMs) with massive parameter counts, pretrained on a large amount of (public) data, perform remarkably well on various downstream tasks with just a few samples for fine-tuning. However, direct fine-tuning of standard FMs often becomes difficult due to their massive size, especially in scenarios where FMs are adapted on private data distributed across resource-limited devices. As such, only FMs with relatively small parameter sizes may be capable of on-device fine-tuning. We call these smaller FMs on-device FMs (ODFMs). In our work, we investigate parameter-efficient federated fine-tuning of ODFMs (XXS PaLM2) on devices using low-rank adaptation (LoRA), where we investigate multi-session chat data from real clients as the downstream task of interest. We first examine federated fine-tuning with homogeneous LoRA ranks across clients, and show that higher ranks can lead to overfitting despite their faster learning speed, whilst lower ranks do not overfit but converge more slowly in training. Based on these observations, we propose heterogeneous LoRA, where we deploy heterogeneous ranks across clients, aggregate the heterogeneous LoRA modules through zero-padding, and redistribute the LoRA modules heterogeneously through truncation. Our proposed heterogeneous LoRA is simple yet effective: it achieves the best of both worlds by combining the advantages of high-rank and low-rank LoRAs, attaining the best performance with the fewest communication rounds while also avoiding overfitting. |
Yae Jee Cho · Luyang Liu · Zheng Xu · Aldi Fahrezi · Matt Barnes · Gauri Joshi 🔗 |
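The zero-padding aggregation and truncation-based redistribution described in the abstract can be sketched directly on LoRA factor matrices (shapes, rank values, and plain averaging are illustrative assumptions; the paper's aggregation may weight clients differently):

```python
import numpy as np

def zero_pad(lora_B, lora_A, r_max):
    """Pad a rank-r LoRA pair (B: d_out x r, A: r x d_in) up to rank r_max."""
    d_out, r = lora_B.shape
    d_in = lora_A.shape[1]
    B = np.zeros((d_out, r_max)); B[:, :r] = lora_B
    A = np.zeros((r_max, d_in)); A[:r, :] = lora_A
    return B, A

def aggregate_hetero_lora(client_loras):
    """Zero-pad each client's LoRA factors to the maximum rank, then average."""
    r_max = max(B.shape[1] for B, _ in client_loras)
    padded = [zero_pad(B, A, r_max) for B, A in client_loras]
    B_avg = np.mean([B for B, _ in padded], axis=0)
    A_avg = np.mean([A for _, A in padded], axis=0)
    return B_avg, A_avg

def truncate(B_avg, A_avg, r):
    """Redistribute to a rank-r client by keeping the first r columns/rows."""
    return B_avg[:, :r], A_avg[:r, :]

rng = np.random.default_rng(0)
d_out, d_in = 8, 6
clients = [(rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in)))
           for r in (2, 4, 8)]                 # heterogeneous ranks
B_avg, A_avg = aggregate_hetero_lora(clients)
B2, A2 = truncate(B_avg, A_avg, 2)             # back to the rank-2 client
```
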
-
|
FDAPT: Federated Domain-adaptive Pre-training for Language Models
(
Poster
)
>
link
Foundation models (FMs) have shown prominent success in a wide range of tasks [Bommasani et al., 2021]. Their applicability to specific domain-task pairings relies on the availability of both high-quality data and significant computational resources. These challenges are not new to the field and, indeed, Federated Learning (FL) has been shown to be a promising solution in similar setups [Yu et al., 2023, Zhuang et al., 2023]. This paper tackles the specific case of Domain-adaptive Pre-training (DAPT), a key step in the application of FMs. We conduct the first comprehensive empirical study to evaluate the performance of Federated Domain-adaptive Pre-training (FDAPT). We demonstrate that FDAPT can achieve downstream task performance competitive with the centralized baseline in both IID and non-IID situations. Finally, we propose a novel algorithm, Frozen Federated Domain-adaptive Pre-training (FFDAPT). FFDAPT improves computational efficiency by 12.1% on average and exhibits similar downstream task performance to vanilla FDAPT, with general performance fluctuations remaining below 1%. |
Lekang Jiang · Filip Svoboda · Nicholas Lane 🔗 |
-
|
Backdoor Threats from Compromised Foundation Models to Federated Learning
(
Poster
)
>
link
Federated learning (FL) represents a novel paradigm for machine learning, addressing critical issues related to data privacy and security, yet suffering from data insufficiency and imbalance. The emergence of foundation models (FMs) provides a promising solution to these problems: for instance, FMs could serve as teacher models or good starting points for FL. However, the integration of FMs in FL presents a new challenge, exposing FL systems to potential threats. This paper investigates the robustness of FL incorporating FMs by assessing their susceptibility to backdoor attacks. Contrary to classic backdoor attacks against FL, the proposed attack (1) does not require the attacker to be fully involved in the FL process; (2) poses a significant risk in practical FL scenarios; (3) is able to evade existing robust FL frameworks and FL backdoor defenses; and (4) underscores the need for research on the robustness of FL systems integrated with FMs. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models and benchmark datasets encompassing both text and image classification domains. |
Xi Li · Songhe Wang · Chen Wu · Hao Zhou · Jiaqi Wang 🔗 |
-
|
Correlated Noise Provably Beats Independent Noise for Differentially Private Learning
(
Poster
)
>
link
Differentially private learning algorithms inject noise into the learning process; the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration. Motivated by practical considerations in federated learning, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic objective suboptimality for any choice of the correlation function, giving precise analytical bounds for linear regression. Using these bounds, we show how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation in prior work. We validate these theoretical results with experiments on private deep learning in both centralized and federated settings. Our method matches or outperforms prior work while being efficient in terms of both computation and memory. |
Christopher A. Choquette-Choo · Krishnamurthy Dvijotham · Krishna Pillutla · Arun Ganesh · Thomas Steinke · Abhradeep Guha Thakurta 🔗 |
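The contrast between independent and correlated noise in DP-SGD can be sketched with the simplest correlation choice: each step's noise is an anti-correlated combination of i.i.d. Gaussians. This is an illustration of the mechanism class, not the paper's near-optimal correlation function, and the toy objective is made up (no privacy accounting is shown):

```python
import numpy as np

def dpsgd_correlated(grad, x0, steps=300, lr=0.1, sigma=0.5, beta=0.9, seed=0):
    """DP-SGD-style iteration with linearly correlated Gaussian noise.

    Noise at step t is z_t - beta * z_{t-1} for i.i.d. Gaussian z_t
    (an assumed, simple correlation; beta=0 recovers independent noise).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    z_prev = np.zeros_like(x)
    for _ in range(steps):
        z = sigma * rng.standard_normal(x.shape)
        noise = z - beta * z_prev   # correlated noise injection
        z_prev = z
        x -= lr * (grad(x) + noise)
    return x

# Mean estimation, f(x) = 0.5 ||x - mu||^2: compare the two noise choices.
mu = np.array([2.0, -1.0])
grad = lambda x: x - mu
x_corr = dpsgd_correlated(grad, np.zeros(2), beta=0.9)
x_indep = dpsgd_correlated(grad, np.zeros(2), beta=0.0)
```

With this quadratic and `beta = 1 - lr`, the correlated noise telescopes in the averaged iterate: all but the final step's noise cancels out of the limit, which is a toy version of why correlated noise can beat independent noise.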
-
|
User Inference Attacks on Large Language Models
(
Poster
)
>
link
We study the privacy implications of fine-tuning large language models (LLMs) on user-stratified (i.e., federated) data. We define a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. We implement attacks for this threat model that require only a small set of samples from a user (possibly different from the samples used for training) and black-box access to the fine-tuned LLM. We find that LLMs are susceptible to user inference attacks across a variety of fine-tuning datasets, with outlier users (i.e., those with data distributions sufficiently different from other users) and users who contribute large quantities of data being most susceptible. Finally, we find that mitigation interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping, fail to prevent user inference, while limiting the number of fine-tuning samples from a single user can reduce attack effectiveness (albeit at the cost of reducing the total amount of fine-tuning data). |
Nikhil Kandpal · Krishna Pillutla · Alina Oprea · Peter Kairouz · Christopher A. Choquette-Choo · Zheng Xu 🔗 |
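The threat model above can be sketched with a generic loss-comparison statistic: if a user's samples score systematically lower loss under the fine-tuned model than under a reference model, the user was likely in the fine-tuning set. This is an illustration of the threat model, not the paper's exact attack statistic; the toy "models" are stand-in loss functions:

```python
import numpy as np

def user_inference_score(user_samples, loss_target, loss_reference):
    """Mean loss gap over a user's samples: positive values suggest the
    fine-tuned (target) model fits this user better than the reference,
    i.e., evidence the user's data was used for fine-tuning."""
    return float(np.mean([loss_reference(s) - loss_target(s) for s in user_samples]))

# Toy stand-ins for black-box per-sample losses on 1-D "samples": the
# fine-tuned model fits training users (centered at 0) better than the
# reference model (centered at 1) does.
loss_target = lambda s: (s - 0.0) ** 2
loss_reference = lambda s: (s - 1.0) ** 2
member_samples = np.zeros(5)        # a user seen during fine-tuning
nonmember_samples = np.ones(5)      # an unseen user
score_member = user_inference_score(member_samples, loss_target, loss_reference)
score_nonmember = user_inference_score(nonmember_samples, loss_target, loss_reference)
```

Thresholding this score yields the membership decision; the paper's findings about outlier users correspond to cases where this gap is largest.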
-
|
FedFN: Feature Normalization for Alleviating Data Heterogeneity Problem in Federated Learning
(
Poster
)
>
link
Federated Learning (FL) is a collaborative method for training models while preserving data privacy in decentralized settings. However, FL encounters challenges related to data heterogeneity, which can result in performance degradation. In our study, we observe that as data heterogeneity increases, feature representation in the FedAvg model deteriorates more significantly than the classifier weights. Additionally, we observe that as data heterogeneity increases, the gap between the higher feature norms of observed classes, obtained from local models, and the feature norms of unobserved classes widens, in contrast to the behavior of the classifier weight norms. This widening gap extends to the feature norm disparities between the local and global models. To address these issues, we introduce Federated Averaging with Feature Normalization Update (FedFN), a straightforward learning method. We demonstrate the superior performance of FedFN through extensive experiments, even when applied to pretrained ResNet18. Subsequently, we confirm the applicability of FedFN to foundation models. |
SeongYoon Kim · Gihun Lee · Jaehoon Oh · Se-Young Yun 🔗 |
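The feature-norm disparity described above motivates normalizing features before the classifier. A minimal generic sketch of feature normalization (not necessarily FedFN's exact update rule; the feature scales below are made up to mimic observed vs. unobserved classes):

```python
import numpy as np

def normalize_features(feats, eps=1e-8):
    """Project each feature vector onto the unit sphere, removing the
    norm gap between observed- and unobserved-class features."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / (norms + eps)

rng = np.random.default_rng(0)
# Under heterogeneity, locally observed classes tend to get larger feature norms.
feats = np.vstack([5.0 * rng.standard_normal((4, 16)),    # "observed" classes
                   0.2 * rng.standard_normal((4, 16))])   # "unobserved" classes
normed = normalize_features(feats)
```

After normalization, all rows have unit norm, so the classifier sees directions only and the norm gap no longer biases its decisions.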
-
|
FedLDA: Personalized Federated Learning Through Collaborative Linear Discriminant Analysis
(
Poster
)
>
link
Data heterogeneity poses a significant challenge to federated learning. Observing the universality of neural networks in approximating the ground-truth, one emerging perspective is to train personalized models via learning a shared representation coupled with customized classifiers for each client. To the best of our knowledge, except for the concurrent work FedPAC, individual classifiers in most existing works only utilize local datasets, which may result in poor generalization. In this work, we propose FedLDA which enables federation in training classifiers by performing collaborative Linear Discriminant Analysis (LDA) on top of the latent shared representation. Our algorithm design is motivated by the observation that upon network initialization the extracted features are highly Gaussian, and client LDA models may benefit from distributed estimation of the Gaussian parameters. To support the high-dimension, low-sample scenario often encountered in PFL, we utilize a momentum update of the Gaussian parameters and employ $\ell_1$ regularization of local covariances. Our numerical results show that, surprisingly, in contrast to multiple state-of-the-art methods, our FedLDA is capable of maintaining the initial Gaussianity. More importantly, through empirical study, we demonstrate that our FedLDA method leads to faster convergence and improved generalization than state-of-the-art algorithms. Compared with FedPAC our method is communication-efficient and does not require the availability of a validation dataset.
|
Connor Mclaughlin · Lili Su 🔗 |
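The distributed estimation of Gaussian parameters that FedLDA builds on works because class counts, sums, and sums of squares are additive across clients. A plain sketch of that aggregation (no momentum, no $\ell_1$ regularization, no privacy; shapes and names are illustrative):

```python
import numpy as np

def client_moments(feats, labels, num_classes):
    """Per-class counts, sums, and sums of squares: additive across clients."""
    d = feats.shape[1]
    n = np.zeros(num_classes)
    s = np.zeros((num_classes, d))
    ss = np.zeros((num_classes, d, d))
    for c in range(num_classes):
        X = feats[labels == c]
        n[c] = len(X)
        s[c] = X.sum(axis=0)
        ss[c] = X.T @ X
    return n, s, ss

def federated_gaussian_params(client_stats):
    """Aggregate moments to recover the centralized class means and a
    pooled within-class covariance for LDA."""
    n = sum(st[0] for st in client_stats)
    s = sum(st[1] for st in client_stats)
    ss = sum(st[2] for st in client_stats)
    means = s / n[:, None]
    d = s.shape[1]
    cov = np.zeros((d, d))
    for c in range(len(n)):
        cov += ss[c] - n[c] * np.outer(means[c], means[c])
    cov /= n.sum()
    return means, cov

rng = np.random.default_rng(0)
feats = rng.standard_normal((90, 3)) + 2.0
labels = rng.integers(0, 3, size=90)
parts = [(feats[i::3], labels[i::3]) for i in range(3)]   # split across 3 clients
stats = [client_moments(X, y, 3) for X, y in parts]
means, cov = federated_gaussian_params(stats)
central_means = np.vstack([feats[labels == c].mean(axis=0) for c in range(3)])
```

The aggregated means match the centralized estimates exactly; FedLDA's contributions (momentum updates, $\ell_1$-regularized covariances) address the high-dimension, low-sample and privacy aspects on top of this basic exchange.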