Workshop on Advancing Neural Network Training (WANT): Computational Efficiency, Scalability, and Resource Optimization
Julia Gusak · Jean Kossaifi · Alena Shilova · Rocco Sedona · Cristiana Bentes · Animashree Anandkumar · Olivier Beaumont
Room 243 - 245
Unlock neural network training's potential for good and science! Enhance computational efficiency, scalability, and resource optimization. Join HPC and AI experts to tackle challenges in theory and applications.
Schedule
Sat 6:15 a.m. - 6:50 a.m. | Poster Placement (Break)

Sat 6:50 a.m. - 7:00 a.m. | Opening Remarks (Talk)
Julia Gusak

Sat 7:00 a.m. - 7:30 a.m. | A Data-Centric View on Workflows that Couple HPC with Large-Scale Models (Invited Talk)
Abstract: In recent years, scientific computing workloads at HPC facilities have been undergoing a significant shift. While traditionally dominated by numerical simulations, these facilities are increasingly handling AI/ML applications for training and inference, processing and producing ever-increasing amounts of scientific data. Despite the focus on optimizing the execution of new AI/HPC workflows, little attention has been paid to the I/O runtime challenges they present. This talk aims to address that gap by analyzing these emerging trends from an I/O perspective. We will explore the performance of multilayer high-performance I/O systems under the strain of these new workflows, which combine traditional HPC techniques with AI in new and challenging ways.
Speaker's Bio: Ana Gainaru is a computer scientist in the CSM division at Oak Ridge National Laboratory, working on data management and performance optimization for large-scale scientific workflows, with a focus on codes coupling traditional HPC with AI. She received her PhD from the University of Illinois at Urbana-Champaign, working on fault tolerance and scheduling for large-scale systems. In her current position she works with application developers in fusion, neutron scattering, and materials science to deploy digital twins and large models and to improve their performance at scale.
Ana Gainaru

Sat 7:30 a.m. - 8:00 a.m. | Rematerialization Algorithms for Memory-efficient Learning (Invited Talk)
Abstract: The training phase of Deep Neural Networks is often a very memory-intensive procedure, where large amounts of intermediate data have to be kept in memory during one iteration. One possible approach to reduce memory usage is rematerialization, aka gradient checkpointing, where some intermediate data are recomputed when needed rather than kept in memory. This provides a tradeoff between memory usage and recomputation time. In this talk I will present several approaches for the optimization problem, where one wants to minimize the recomputation time given a fixed memory budget. The corresponding algorithms have been implemented in easy-to-use libraries for the PyTorch framework, which can significantly reduce memory usage with reasonable overhead.
Speaker's Bio: Lionel Eyraud-Dubois received his PhD degree in computer science from the Université de Grenoble. He is currently a full-time researcher with Inria Bordeaux Sud-Ouest in the Topal team. His main research interests encompass combinatorial optimization and operations research techniques for scheduling and resource allocation problems in high-performance computing systems, including optimizing the training and inference processes of Deep Neural Networks.
Lionel Eyraud-Dubois

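The tradeoff described above is exposed in stock PyTorch through its gradient checkpointing utilities. A minimal sketch using the built-in uniform segmenting (the talk's algorithms instead choose which activations to rematerialize under an explicit memory budget, which this sketch does not do):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# 16-layer toy network; only activations at segment boundaries are stored.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])
x = torch.randn(32, 1024, requires_grad=True)

y = checkpoint_sequential(model, 4, x)  # 4 segments: keep roughly 1/4 of activations
loss = y.sum()
loss.backward()  # interior activations are recomputed (rematerialized) here
```
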
Sat 8:00 a.m. - 8:30 a.m. | Coffee Break (Break)

Sat 8:30 a.m. - 9:00 a.m. | Navigating the Landscape of Enormous AI Model Training (Invited Talk)
Abstract: The proliferation of large models based on Transformers has outpaced advances in hardware, resulting in an urgent need for the ability to distribute enormous models across multiple GPUs. Despite this increasing need, the absence of established best practices for selecting an optimal strategy persists, owing to the extensive expertise required in High-Performance Computing (HPC), Deep Learning (DL), and distributed systems. These challenges have motivated both AI and HPC developers to delve into pivotal questions: How can the training and inference efficiency of large models be enhanced to minimize costs? How can larger AI models be accommodated, even with limited resources? What measures can be taken to facilitate broader community access to large models and large-scale applications? In this talk, I will discuss potential solutions to these challenges by exploring hybrid parallelisms, heterogeneous memory management, and the design of user-friendly frameworks such as our open-source systemic solution: Colossal-AI (https://github.com/hpcaitech/ColossalAI).
Speaker's Bio: Yang You is a Presidential Young Professor at the National University of Singapore. He received his Ph.D. in Computer Science from UC Berkeley under Prof. James Demmel. Yang's research interests include Parallel/Distributed Algorithms, High Performance Computing, and Machine Learning. He is a winner of the IPDPS 2015 Best Paper Award (0.8%), the ICPP 2018 Best Paper Award (0.3%), and the ACM/IEEE George Michael HPC Fellowship. Yang is also a Siebel Scholar and a winner of the Lotfi A. Zadeh Prize. He also made the Forbes 30 Under 30 Asia list (2021) for young leaders and won the IEEE-CS TCHPC early career award.
Yang You

Sat 9:00 a.m. - 9:30 a.m. | Enabling Efficient Trillion Parameter Scale Training for Deep Learning Models (Invited Talk)
Abstract: Deep Learning (DL) is driving unprecedented progress in a wide range of Artificial Intelligence domains, including natural language processing, vision, speech, and multimodal applications. However, sustaining this AI revolution requires practical solutions to the extreme demands of model scaling on the compute, memory, communication and storage components of modern computing hardware. To address this challenge, we created a deep learning optimization library called DeepSpeed to make distributed model training and inference efficient, effective, and easy on commodity hardware. This talk will focus on DeepSpeed training optimizations, particularly on ZeRO and DeepSpeed-MoE, which help to address the memory and compute requirements of extreme model scaling.
Speaker's Bio: Olatunji (Tunji) Ruwase is a co-founder and Principal Research Sciences Manager of the DeepSpeed project at Microsoft. His broad industry and research background spans compilers, operating systems, and hardware accelerators. He is currently interested in building systems, convergence optimizations, and frameworks for distributed training and inference of deep learning models. His research results on DL training, inference, and hyperparameter search are used in multiple Microsoft systems and products, such as Bing, Ads, HyperDrive, and Catapult.
Olatunji Ruwase

Sat 9:30 a.m. - 10:00 a.m. | Contributed Talks (Talk)
Presentations of papers accepted as orals at WANT@NeurIPS; these papers will also be presented during the poster sessions.

Sat 9:31 a.m. - 9:36 a.m. | Training and inference of large language models using 8-bit floating point (Contributed Talk & Poster)
FP8 formats are gaining popularity to boost the computational efficiency of training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate large language models of the GPT and Llama 2 families using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
Sergio Perez · Yan Zhang · James Briggs · Charles Blake · Josh Levy-Kramer · Paul Balanca · Carlo Luschi · Stephen Barlow · Andrew Fitzgibbon

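For intuition, a toy sketch of dynamic per-tensor scaling (our reading of the abstract, not the paper's code): pick a scale that maps the tensor's absolute maximum near the top of FP8 E4M3's representable range, and carry the scale alongside the tensor for dequantization.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_per_tensor(t: torch.Tensor):
    amax = t.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                 # dynamically updated per-tensor scale
    t_scaled = (t * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    # clamp stands in for a real FP8 cast, which would also round the mantissa
    return t_scaled, scale

def dequantize(t_scaled: torch.Tensor, scale: torch.Tensor):
    return t_scaled / scale

w = torch.randn(4096, 4096) * 0.02              # e.g. a linear layer's weights
w_q, s = quantize_per_tensor(w)
w_back = dequantize(w_q, s)                     # close to w, up to rounding/clipping
```
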
Sat 9:37 a.m. - 9:42 a.m. | MatFormer: Nested Transformer for Elastic Inference (Contributed Talk & Poster)
Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios necessitate practitioners to train foundation models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting more fine-grained control over relevant tradeoffs, including latency, cost, and accuracy. This work introduces MatFormer, a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks. This training procedure allows for the Mix'n'Match of model granularities across layers -- i.e., a trained universal MatFormer model enables extraction of hundreds of accurate smaller models, which were never explicitly optimized. We empirically demonstrate MatFormer's effectiveness across different model classes (decoders & encoders), modalities (language & vision), and scales (up to 2.6B parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting comparable validation loss and one-shot downstream evaluations to their independently trained counterparts. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, we showcase that speculative decoding with the accurate and consistent submodels extracted from MatFormer can further reduce inference latency.
Fnu Devvrit · Sneha Kudugunta · Aditya Kusupati · Tim Dettmers · Kaifeng Chen · Inderjit Dhillon · Yulia Tsvetkov · Hannaneh Hajishirzi · Sham Kakade · Ali Farhadi · Prateek Jain

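A sketch of the nested-FFN idea as we read it from the abstract: sub-models reuse a prefix of the full FFN's hidden units, so one weight matrix serves several granularities. Shapes and the prefix scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, frac=1.0):
        # Use only the first `frac` fraction of hidden units (a nested sub-FFN).
        h = int(self.w_in.out_features * frac)
        z = torch.relu(x @ self.w_in.weight[:h].T + self.w_in.bias[:h])
        return z @ self.w_out.weight[:, :h].T + self.w_out.bias

ffn = NestedFFN()
x = torch.randn(8, 512)
full = ffn(x, frac=1.0)    # largest granularity
small = ffn(x, frac=0.25)  # "Mix'n'Match"-style extracted sub-model, no retraining
```
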
Sat 9:43 a.m. - 9:48 a.m. | Sparse Backpropagation for MoE Training (Contributed Talk & Poster)
One defining characteristic of Mixture-of-Experts (MoE) models is their capacity for conducting sparse computation via expert routing, leading to remarkable scalability. However, backpropagation, the cornerstone of deep learning, requires dense computation, thereby posing challenges in MoE gradient computations. Here, we introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing. Unlike typical MoE training, which strategically neglects certain gradient terms for the sake of sparse computation and scalability, SparseMixer provides scalable gradient approximations for these terms, enabling reliable gradient estimation in MoE training. Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations with negligible computational overhead. Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gains, accelerating training convergence by up to 2 times.
Liyuan Liu · Jianfeng Gao · Weizhu Chen

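The abstract grounds SparseMixer in a numerical-ODE view and names the mid-point method as its second-order ingredient. For reference, that ingredient in isolation (how SparseMixer turns it into a gradient estimator for expert routing is in the paper, not this sketch):

```python
def euler_step(f, x, h):
    return x + h * f(x)            # first-order: O(h^2) local error

def midpoint_step(f, x, h):
    x_mid = x + 0.5 * h * f(x)     # probe the midpoint first
    return x + h * f(x_mid)        # second-order: O(h^3) local error

f = lambda x: -x                   # test ODE dx/dt = -x, exact solution e^{-t}
x_e, x_m = 1.0, 1.0
for _ in range(10):                # integrate to t = 1 with step h = 0.1
    x_e = euler_step(f, x_e, 0.1)
    x_m = midpoint_step(f, x_m, 0.1)
# x_m (~0.3685) is much closer to exp(-1) ~ 0.3679 than x_e (~0.3487)
```
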
Sat 9:49 a.m. - 9:54 a.m. | Efficient Parallelization Layouts for Large-Scale Distributed Model Training (Contributed Talk & Poster)
Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a Llama-13B model.
Johannes Hagemann · Samuel Weinbach · Konstantin Dobler · Maximilian Schall · Gerard de Melo

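For context on the headline number, Model FLOPs Utilization (MFU) can be sanity-checked with the common ~6N FLOPs-per-token training approximation for an N-parameter decoder. All concrete numbers below are illustrative assumptions, not the paper's measurements.

```python
def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    model_flops_per_sec = 6 * n_params * tokens_per_sec  # fwd + bwd approximation
    return model_flops_per_sec / (n_gpus * peak_flops_per_gpu)

# Hypothetical: a 13B model at 22,500 tokens/s on 8 GPUs with 312 TFLOP/s peak (A100 BF16).
print(f"MFU = {mfu(13e9, 22_500, 8, 312e12):.1%}")  # ~70.3%
```
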
Sat 9:55 a.m. - 10:00 a.m. | CoTFormer: More Tokens With Attention Make Up For Less Depth (Contributed Talk & Poster)
The race to continually develop ever larger and deeper foundation models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this study, we establish an approximate parallel between using a chain of thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve performance comparable to that of a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, as they significantly outperform larger standard transformers.
Amirkeivan Mohtashami · Matteo Pagliardini · Martin Jaggi

Sat 10:00 a.m. - 11:30 a.m. | Lunch (Break)

Sat 11:30 a.m. - 12:00 p.m. | Poster Session

Sat 12:00 p.m. - 12:30 p.m. | Crafting Computational Efficiency for Large Models: Training Recipes, Scaling Strategies and Sparsity Sorcery with Specialized Hardware (Invited Talk)
Abstract: Large models are shifting “what’s possible” with AI. Brute-force scaling of model parameter count increases model capacity and, when presented with enough training data, has shown remarkable results. However, the advantages of large-scale models come at the price of a steep increase in system complexity and infrastructure cost. Training and serving these models is an engineering challenge and is very expensive. Even minor errors in model design or training procedure can result in significant waste of resources. At Cerebras we have trained our share of large language models and learned along the way how to train these models efficiently to get “the biggest bang for the buck”. In this talk we will share our experience and insights from training various LLMs. In addition to techniques for compute-efficient training of dense models, we will look into the benefits of sparse training and inference on Cerebras hardware, designed to take full advantage of all types of sparsity.
Speaker's Bio: Natalia Vassilieva is a Sr. Director of Product at Cerebras Systems, a computer systems company dedicated to accelerating deep learning. She leads the vision and strategy for Cerebras products, market, application, and algorithm analysis for machine learning use cases. Her focus is machine learning and artificial intelligence, analytics, and application-driven software-hardware optimization and co-design. Prior to joining Cerebras, Natalia was a Sr. Research Manager at Hewlett Packard Labs, where she led the Software and AI group and served as the head of HP Labs Russia from 2011 until 2015. Prior to Hewlett Packard, she was an Associate Professor at St. Petersburg State University in Russia and worked as a software engineer for several IT companies. Natalia holds a Ph.D. in computer science from St. Petersburg State University.
Natalia Vassilieva

Sat 12:30 p.m. - 1:00 p.m. | Invited Talk by Databricks (Invited Talk)

Sat 1:00 p.m. - 1:30 p.m. | Coffee Break (Break)

Sat 1:30 p.m. - 2:00 p.m. | Efficient LLM Training and Inference on GPUs (Invited Talk)
Abstract: Training and inference of large transformer models is one of the most important computational challenges of modern AI. Systems for training these models must be highly scalable and run at extreme efficiency, because the amount of work necessary to converge a model can be extraordinarily large. Inference needs to be fast and accommodate different query sizes. In this talk, I'll discuss the work we have been doing at NVIDIA to optimize systems for Large Language Model training and inference on GPUs. I will present the different parallelism techniques we use in our LLM framework, Megatron-LM, and discuss how these techniques can be combined to maximize the training throughput of large models while retaining strict optimizer semantics. I will also discuss optimization techniques for inference, including methods to accelerate inference and reduce memory fragmentation.
Speaker's Bio: Dr. Mohammad Shoeybi is the Director of Applied Research at NVIDIA. His team focuses on building large foundation models and adapting them to downstream applications. His team has built Megatron-LM, a framework for efficiently training LLMs, and used it to train several large-scale models such as Megatron-Turing NLG with 530 billion parameters. He received his PhD from Stanford University in 2010. Prior to NVIDIA, he worked at DeepMind and Baidu USA, leading efforts on bringing deep learning and reinforcement learning to applications.
Mohammad Shoeybi · Bryan Catanzaro

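As a concrete anchor for the parallelism discussion: in Megatron-LM, the total GPU count factors into tensor-, pipeline-, and data-parallel degrees. The flag names below match Megatron-LM's public CLI; the degrees themselves are an illustrative assumption, not a recommended layout.

```python
world_size = 512                     # total GPUs
tensor_parallel = 8                  # --tensor-model-parallel-size (splits matmuls, kept within a node)
pipeline_parallel = 8                # --pipeline-model-parallel-size (splits layers across nodes)
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # = 8 model replicas

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
```
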
Sat 2:00 p.m. - 2:50 p.m. | Panel Discussion (Panel)
Yang You · Olatunji Ruwase · Natalia Vassilieva · Mohammad Shoeybi · Ana Gainaru · Lionel Eyraud-Dubois · Jean Kossaifi

Sat 2:50 p.m. - 3:00 p.m. | Closing Remarks (Talk)
Jean Kossaifi

Sat 3:00 p.m. - 3:30 p.m. | Poster Session

AI4HPC: Library to Train AI Models on HPC Systems using CFD Datasets (Poster)
This paper introduces AI4HPC, an open-source library designed to integrate Artificial Intelligence (AI) models and workflows in High-Performance Computing (HPC) systems for Computational Fluid Dynamics (CFD)-based applications. Developed by CoE RAISE, AI4HPC addresses challenges in handling intricate CFD datasets, model complexity, and scalability, and also includes extensive code optimizations to improve performance. Furthermore, the library encompasses data manipulation, specialized ML architectures, distributed training, hyperparameter optimization, and performance monitoring. Integrating AI and CFD in AI4HPC empowers efficient analysis of extensive and large-scale datasets. This paper outlines the architecture, components, and potential of AI4HPC to accelerate and augment data-driven fluid dynamics simulations and beyond, demonstrated by showing the scaling results of this library up to 3,664 GPUs.
Eray Inanc · Rakesh Sarma · Marcel Aach · Rocco Sedona · Andreas Lintermann

Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale (Poster)
The gradient noise scale is valuable to compute because it suggests a compute-efficient batch size for training a deep learning model. However, computing it can be awkward or expensive, depending on the approach taken, due to the difficulty of obtaining small-batch gradient norm estimates. "Efficient" per-example gradient norms provide accurate small-batch gradient norms but are inefficient in transformer or convolutional models. By assuming activations are normally distributed, we compute an approximate per-example gradient norm that tracks the true per-example gradient norm in practical settings. Using this approximation, we construct a Scaled Output Gradient Noise Scale (SOGNS) that is generally applicable at negligible cost and provides additional feedback to the practitioner during training.
Gavia Gray · Anshul Samar · Joel Hestness

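For background, the exact "efficient" per-example norm for a single linear layer factorizes without materializing per-example gradients, because each example's weight gradient is an outer product. A sketch of that baseline (the paper's contribution, a cheap approximation for transformer and convolutional models, is not reproduced here):

```python
import torch

a = torch.randn(64, 512)   # per-example inputs to a linear layer
g = torch.randn(64, 256)   # per-example gradients w.r.t. the layer's output

# grad_W for example i is the outer product g_i (x) a_i, so
# ||grad_W_i||_F = ||g_i|| * ||a_i||, with no (64, 256, 512) tensor needed.
per_example_norms = g.norm(dim=1) * a.norm(dim=1)

# Check against the naive outer-product computation for one example:
naive = torch.einsum('o,i->oi', g[0], a[0]).norm()
assert torch.allclose(per_example_norms[0], naive, atol=1e-5)
```
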
Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators (Poster)
The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-accuracy core operations, most significantly the operation of accumulating products. This high-precision accumulation operation is gradually becoming the main computational bottleneck because, so far, the use of low-precision accumulators has led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs that allows, for the first time, the utilization of cheaper, 12-bit accumulators with no significant degradation in accuracy. Lastly, we show that as we decrease the accumulation precision further, using fine-grained gradient approximations can improve the DNN accuracy.
Yaniv Blumenfeld · Itay Hubara · Daniel Soudry

ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation (Poster)
In this research, we propose an innovative method to boost Knowledge Distillation efficiency without the need for resource-heavy teacher models. Knowledge Distillation trains a smaller "student" model with guidance from a larger "teacher" model, which is computationally costly. However, the main benefit comes from the soft labels provided by the teacher, helping the student grasp nuanced class similarities. In our work, we propose an efficient method for generating these soft labels, thereby eliminating the need for a large teacher model. We employ a compact autoencoder to extract essential features and calculate similarity scores between different classes. Afterward, we apply the softmax function to these similarity scores to obtain a soft probability vector. This vector serves as valuable guidance during the training of the student model. Our extensive experiments on various datasets, including CIFAR-100, Tiny Imagenet, and Fashion MNIST, demonstrate the superior resource efficiency of our approach compared to traditional knowledge distillation methods that rely on large teacher models. Importantly, our approach consistently achieves similar or even superior performance in terms of model accuracy. We also perform a comparative study with various techniques recently developed for knowledge distillation, showing that our approach achieves competitive performance while using significantly fewer resources. We also show that our approach can easily be added to any logit-based knowledge distillation method. This research contributes to making knowledge distillation more accessible and cost-effective for practical applications, making it a promising avenue for improving the efficiency of model training.
Divyang Doshi · Jung-Eun Kim

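A sketch of how soft labels could be derived from a compact encoder, as we read the abstract: cosine similarities between per-class feature prototypes, pushed through a softmax. The prototype construction and the temperature are placeholder assumptions, not the paper's exact recipe.

```python
import torch

def soft_labels_from_prototypes(class_prototypes: torch.Tensor, temperature=4.0):
    # class_prototypes: (num_classes, d) mean encoder features per class
    protos = torch.nn.functional.normalize(class_prototypes, dim=1)
    sim = protos @ protos.T                          # class-to-class cosine similarity
    return torch.softmax(sim / temperature, dim=1)   # one soft target row per class

protos = torch.randn(100, 64)                  # e.g. CIFAR-100 autoencoder features
targets = soft_labels_from_prototypes(protos)  # used in place of teacher logits
```
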
Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search (Poster)
Sequential recommender systems (SRS) have become a research hotspot due to their power in modeling user dynamic interests and sequential behavioral patterns. To maximize model expressive ability, a default choice is to apply a larger and deeper network architecture, which, however, often brings high network latency when generating online recommendations. Naturally, we argue that compressing the heavy recommendation models into middle- or light-weight neural networks that reduce inference latency while maintaining recommendation performance is of great importance for practical production systems. To realize such a goal, we propose AdaRec, a knowledge distillation (KD) framework which compresses knowledge of a teacher model into a student model adaptively according to its recommendation scene by using differentiable neural architecture search (NAS). Specifically, we introduce a target-oriented knowledge distillation loss to guide the network structure search process for finding the student network architecture, and a cost-sensitive loss as constraints for model size, which achieves a superior trade-off between recommendation effectiveness and efficiency. In addition, we leverage earth mover's distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn from other intermediate teacher layers adaptively. Extensive experiments on three real-world recommendation datasets demonstrate that our model achieves significantly better accuracy with notable inference speedup compared to strong counterparts, while discovering diverse architectures for sequential recommendation models under different recommendation scenes.
Lei Chen

Remaining-Useful-Life Prediction and Uncertainty Quantification using LSTM Ensembles for Aircraft Engines (Poster)
This paper proposes an "LSTM (Long Short-Term Memory) Ensemble" technique for building a regression model to predict the Remaining Useful Life (RUL) of aircraft engines along with uncertainty quantification, utilising the well-known run-to-failure turbo engine degradation dataset. It addresses the overlooked yet crucial aspect of uncertainty estimation in previous research by revamping the LSTM architecture to facilitate uncertainty estimates, employing the Negative Log-Likelihood (NLL) as the training criterion. Through a series of experiments, the model demonstrated self-awareness of its uncertainty levels, correlating high confidence with low prediction errors and vice versa. This initiative not only enhances predictive maintenance strategies but also significantly improves the safety and reliability of aviation assets by offering a more nuanced understanding of predictive uncertainties. To the best of our knowledge, this is pioneering work in this application domain from a non-Bayesian approach.
Oishi Deb · Emmanouil Benetos · Philip Torr

LightSeq: Sequence-Level Parallelism for Distributed Training of Long-Context Transformers (Poster)
Increasing the context length of large language models (LLMs) unlocks fundamentally new capabilities, but also significantly increases the memory footprint of training. Previous model-parallel systems such as Megatron-LM partition and compute different attention heads in parallel, resulting in large communication volumes; they also cannot scale beyond the number of attention heads, which hinders their adoption. In this paper, we introduce a new approach, LightSeq, for long-context LLM training. LightSeq has many notable advantages. First, LightSeq partitions over the sequence dimension, hence is agnostic to model architectures and readily applicable to models with varying numbers of attention heads, such as Multi-Head, Multi-Query and Grouped-Query attention. Second, LightSeq not only requires up to 4.7× less communication than Megatron-LM on popular LLMs but also overlaps the communication with computation. To further reduce the training time, LightSeq features a novel gradient checkpointing scheme to bypass a forward computation for memory-efficient attention. We evaluate LightSeq on Llama-7B and its variants with sequence lengths from 32K to 512K. Through comprehensive experiments on single- and cross-node training, we show that LightSeq achieves up to 1.24-2.01× end-to-end speedup, and a 2-8× longer sequence length on models with fewer heads, compared to Megatron-LM. Anonymized code is available at https://anonymous.4open.science/r/lightseq-anonymized.
Dacheng Li · Rulin Shao · Anze Xie · Eric Xing · Joseph Gonzalez · Ion Stoica · Xuezhe Ma · Hao Zhang

FlexTrain: A Dynamic Training Framework for Heterogeneous Devices Environments (Poster)
As deep learning models become increasingly large, they pose significant challenges in heterogeneous device environments. The size of deep learning models makes it difficult to deploy them on low-power or resource-constrained devices, leading to long inference times and high energy consumption. To address these challenges, we propose FlexTrain, a framework that accommodates the diverse storage and computational resources available on different devices during the training phase. FlexTrain enables efficient deployment of deep learning models, while respecting device constraints, minimizing communication costs, and ensuring seamless integration with diverse devices. We demonstrate the effectiveness of FlexTrain on the CIFAR-100 dataset, where a single global model trained with FlexTrain can be easily deployed on heterogeneous devices, saving training time and energy consumption. We also extend FlexTrain to the federated learning setting, showing that our approach outperforms standard federated learning benchmarks on both the CIFAR-10 and CIFAR-100 datasets.
Mert Unsal · Ali Maatouk · Antonio De Domenico · Nicola Piovesan · Fadhel Ayed

DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers (Poster)
We devise, implement and performance-assess DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers (nn.Linear() in PyTorch). These layers appear in common subcomponents, such as in the ff module of Transformers. DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a. DENSE. Our alternative near-sparse matrix structure is decomposable into a sum of 2 matrices permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow a faster execution of matrix multiplication with the mini-batched input matrix compared to DENSE (O(rows(W) × cols(W)) → O(rows(W) × cols(W) / (# of blocks))). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT arch and 1 size of the Pythia arch, including at different token scales of the babyLM benchmark. We find DYAD to be competitive with (≥ 90% of) DENSE performance on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being ≥ 7-15% faster to train on-GPU even at 125m scale, besides surfacing larger speedups at increasing scale and model width.
Sarin Chandy · Varun Prashant Gangal · Yi Yang · Gabriel Maggiotti

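A sketch of the cost intuition only: a block-structured weight stored as a 3D tensor multiplies each input slice by its own block, which is where the O(rows(W) × cols(W)) → O(rows(W) × cols(W) / #blocks) saving comes from. DYAD's actual structure (a sum of two matrices permutable to block-sparse form) is richer than this block-diagonal toy.

```python
import torch

batch, num_blocks, in_blk, out_blk = 32, 8, 64, 64
w = torch.randn(num_blocks, out_blk, in_blk)   # blocks stored as a 3D tensor
x = torch.randn(batch, num_blocks * in_blk)

xb = x.view(batch, num_blocks, in_blk)
y = torch.einsum('bni,noi->bno', xb, w).reshape(batch, -1)
# A dense layer would cost (num_blocks*in_blk) * (num_blocks*out_blk) multiply-adds
# per example; this costs num_blocks * in_blk * out_blk, i.e. num_blocks x fewer.
```
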
Improving Deep Ensembles without Communication (Poster)
Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised deep learning. We propose to improve deep ensembles by optimizing a tighter PAC-Bayesian bound than the most popular ones. Our approach has a number of benefits over previous methods: 1) it requires no communication between ensemble members during training to improve performance and is trivially parallelizable, 2) it results in a soft-thresholding gradient update that is much simpler than alternatives. Empirically, we outperform competing approaches that try to improve ensembles by encouraging diversity. We report test accuracy gains for MLP, LeNet, and WideResNet architectures, and for a variety of datasets.
Konstantinos Pitas · Michael Arbel · Julyan Arbel

ConcatPlexer: Additional Dim1 Batching for Faster ViTs (Contributed Talk & Poster)
Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also in the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers comes with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) to greatly improve throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, the Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets and achieved 23.5% fewer GFLOPs than ViT-B/16, with 69.5% and 83.4% validation accuracy, respectively.
Donghoon Han · Seunghyeon Seo · Donghyeon Jeon · Jiho Jang · Chaerin Kong · Nojun Kwak

InstaTune: Instantaneous Neural Architecture Search During Fine-Tuning (Poster)
One-Shot Neural Architecture Search (NAS) algorithms often rely on training a hardware-agnostic super-network for a domain-specific task. Optimal sub-networks are then extracted from the trained super-network for different hardware platforms. However, training super-networks from scratch can be extremely time-consuming and compute-intensive, especially for large models that rely on a two-stage training process of pre-training and fine-tuning. State-of-the-art pre-trained models are available for a wide range of tasks, but their large sizes significantly limit their applicability on various hardware platforms. We propose InstaTune, a method that leverages off-the-shelf pre-trained weights for large models and generates a super-network during the fine-tuning stage. InstaTune has multiple benefits. Firstly, since the process happens during fine-tuning, it minimizes the overall time and compute resources required for NAS. Secondly, the sub-networks extracted are optimized for the target task, unlike prior work that optimizes on the pre-training objective. Finally, InstaTune is easy to "plug and play" into existing frameworks. By using multi-objective evolutionary search algorithms along with lightly trained predictors, we find Pareto-optimal sub-networks that outperform their respective baselines across different performance objectives such as accuracy and MACs. Specifically, we demonstrate that our approach performs well across both unimodal (ViT and BERT) and multi-modal (BEiT-3) transformer-based architectures.
Sharath Nittur Sridhar · Souvik Kundu · Sairam Sundaresan · Maciej Szankin · Anthony Sarah

ReLoRA: High-Rank Training Through Low-Rank Updates (Poster)
Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate performance comparable to regular neural network training. ReLoRA saves up to 5.5 GB of memory per GPU and improves training speed by 9-40%, depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
Vladislav Lialin · Sherin Muckatira · Namrata Shivagunde · Anna Rumshisky

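One plausible reading of "high-rank training through low-rank updates", sketched under our own assumptions: train a LoRA pair for a while, merge it into the frozen base weight, re-initialize the pair, and repeat, so the accumulated change in W can exceed rank r. The paper's actual restart schedule and optimizer handling are not captured here.

```python
import torch

d, r = 512, 8
W = torch.randn(d, d) * 0.02                      # frozen base weight
A = (torch.randn(r, d) * 0.01).requires_grad_()   # trainable down-projection
B = torch.zeros(d, r, requires_grad=True)         # trainable, zero-init: update starts at 0

for restart in range(3):
    # ... train A, B for some steps with W frozen ...
    with torch.no_grad():
        W += B @ A                   # merge the rank-r update into the base
        A.normal_(mean=0.0, std=0.01)  # re-initialize the low-rank pair
        B.zero_()                    # each cycle contributes a fresh rank-r direction
```
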
Sparse Iso-FLOP Transformations for Maximizing Training Efficiency (Poster)
Recent works have explored the use of weight sparsity to improve the training efficiency (test accuracy w.r.t. training FLOPs) of deep neural networks (DNNs). These works aim to reduce training FLOPs, but training with sparse weights often leads to accuracy loss or requires longer training schedules, making the resulting training efficiency less clear. In contrast, we focus on using sparsity to increase accuracy while using the same FLOPs as the dense model, and show training efficiency gains through higher accuracy. In this work, we introduce Sparse-IFT, a family of Sparse Iso-FLOP Transformations which are used as drop-in replacements for dense layers to improve their representational capacity and FLOP efficiency. Each transformation is parameterized by a single hyperparameter (sparsity level) and provides a larger search space to find optimal sparse masks. Without changing any training hyperparameters, replacing dense layers with Sparse-IFT leads to significant improvements across computer vision and natural language processing tasks, including ResNet-18 on ImageNet (+3.5%) and GPT-3 Small on WikiText-103 (-0.4 PPL), both matching larger dense model variants that use 2x or more FLOPs. To our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models via a simple set of sparse transformations.
Vithursan Thangarasa · Shreyas Saxena · Abhay Gupta · Sean Lie

Embarrassingly Simple Dataset Distillation (Poster)
Training large-scale models generally requires enormous amounts of training data. Dataset distillation aims to extract a small set of synthetic training samples from a large dataset, with the goal of achieving competitive performance on test data when trained on this sample, thus reducing both dataset size and training time. In this work, we tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. Re-examining the foundational back-propagation through time method, we study the pronounced variance in the gradients, computational burden, and long-term dependencies. We introduce an improved method: Random Truncated Backpropagation Through Time (RaT-BPTT) to address them. RaT-BPTT incorporates a truncation coupled with a random window, effectively stabilizing the gradients and speeding up the optimization while covering long dependencies. This allows us to establish a new dataset distillation state-of-the-art for a variety of standard dataset benchmarks.
Yunzhen Feng · Shanmukha Ramakrishna Vedantam · Julia Kempe

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (Poster)
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, our method, FastGen, can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates substantial reductions in GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.
Suyu Ge · Yunan Zhang · Liyuan Liu · Minjia Zhang · Jiawei Han · Jianfeng Gao

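A sketch of just one of the eviction policies named in the abstract, with placeholder shapes: heads profiled as "local" keep only a recent window of the cache, while broadly-attending heads keep everything. The profiling step and the special-token policy are omitted.

```python
import torch

def compress_kv(k, v, head_is_local, window=512):
    # k, v: (num_heads, seq_len, head_dim); head_is_local: (num_heads,) bool mask
    out_k, out_v = [], []
    for h in range(k.shape[0]):
        if head_is_local[h]:
            out_k.append(k[h, -window:])   # evict long-range context on local heads
            out_v.append(v[h, -window:])
        else:
            out_k.append(k[h])             # standard full KV cache elsewhere
            out_v.append(v[h])
    return out_k, out_v                    # ragged: per-head cache lengths now differ

k = torch.randn(16, 4096, 128)
v = torch.randn(16, 4096, 128)
local_heads = torch.rand(16) < 0.5         # placeholder for the profiling decision
k_c, v_c = compress_kv(k, v, local_heads)
```
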
A Quadratic Synchronization Rule for Distributed Deep Learning (Poster)
In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared to standard data-parallel training, QSR enables Local AdamW to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy.
Xinran Gu · Kaifeng Lyu · Sanjeev Arora · Jingzhao Zhang · Longbo Huang

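The rule itself is a one-liner; the proportionality constant and floor below are placeholders, not the paper's values.

```python
def qsr_sync_interval(lr, alpha=0.01, h_min=2):
    # H proportional to 1 / lr^2, floored at a minimum interval of local steps
    return max(h_min, int(alpha / lr ** 2))

for lr in [0.01, 0.005, 0.001]:
    print(lr, qsr_sync_interval(lr))   # 100, 400, 10000 local steps between syncs
```
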
DAREL: Data Reduction with Losses for Training Acceleration of Real and Hypercomplex Neural Networks (Poster)
Neural network training requires a lot of resources, and there are situations where training time and memory usage are limited. In such instances, devising specialized algorithms for training neural networks under resource constraints becomes significant. Data Reduction with Losses (DAREL) is a novel training data reduction method that operates on training samples based on losses obtained from the currently trained model or a pre-trained one. The proposed method is applicable to training Deep Neural Networks for both Computer Vision and Natural Language Processing tasks. Applied to Large Language Model fine-tuning, Data Reduction with Losses can be combined with existing methods for parameter-efficient fine-tuning, such as LoRA. Computational experiments demonstrate the superiority of the proposed approach over existing methods for pre-training neural networks on Computer Vision tasks, and provide clear evidence of improved fine-tuning quality and time for Large Language Models. Training acceleration for ResNet18 is up to 2.03x, while DAREL achieves a 1.43x acceleration for GPT2-M fine-tuning with a corresponding increase in BLEU of 1.81 p.p.
Alexander Demidovskij · Aleksei Trutnev · Artyom Tugaryov · Igor Salnikov · Stanislav Pavlov

Accelerating Deep Learning using Ivy (Poster)
Today's machine learning (ML) ecosystem suffers from deep fragmentation due to the proliferation of numerous incompatible frameworks, compiler infrastructure and hardware. Each unique tool within this fragmented stack has its own set of benefits and drawbacks, making it better suited for certain use-cases. As a result, different areas of industry and academia use different tools for different use cases, which hinders collaboration and democratization, ultimately resulting in costly re-implementations and sub-optimal runtime efficiency when deploying, due to sparse and partial connections to the rest of the stack. In this paper, we present Ivy, a complementary, multi-backend ML framework, and its transpiler, which aims to bridge this gap and solve the fragmentation problem by enabling the integration of code from one framework into another to speed up research, development, and model inference.
Guillermo Sanchez-Brizuela · Ved Patwardhan · Matthew Barrett · Paul Anderson · Mustafa Hani · Daniel Lenton

Something for (almost) nothing: improving deep ensemble calibration using unlabeled data (Poster)
We present a method to improve the calibration of deep ensembles in the small training data regime in the presence of unlabeled data. Our approach is extremely simple to implement: given an unlabeled set, for each unlabeled data point, we simply fit a different randomly selected label with each ensemble member. We provide a theoretical analysis based on a PAC-Bayes bound which guarantees that if we fit such a labeling on unlabeled data, and the true labels on the training data, we obtain low negative log-likelihood and high ensemble diversity on testing samples. Crucially, each ensemble member can be trained independently from the rest (apart from the final validation/test step), making a parallel or distributed implementation extremely easy.
Konstantinos Pitas · Julyan Arbel

LeanFlex-GKP: Advancing Hassle-Free Structured Pruning with Simple Flexible Group Count (Poster)
Densely structured pruning methods, which generate pruned models in a fully dense format and thereby allow immediate compression benefits without additional demands, are evolving due to their practical significance. Traditional techniques in this domain mainly revolve around coarser granularities, such as filter pruning, and thereby limit performance due to restricted pruning freedom. Recent advancements in Grouped Kernel Pruning (GKP) have enabled the utilization of finer granularities while maintaining a densely structured format. We observe that existing GKP methods often introduce dynamic operations to different aspects of their procedures at the cost of adding complications and/or imposing limitations (e.g., requiring an expensive mixture of clustering schemes), or contain dynamic pruning rates and sizes among groups, which results in a reliance on custom architecture support for the pruned models. In this work, we argue that the best practice to introduce these dynamic operations to GKP is to make
Jiamu Zhang · Shaochen (Henry) Zhong · Andrew Ye · Zirui Liu · Kaixiong Zhou · Xia Hu · Shuai Xu · Vipin Chaudhary

Patch Gradient Descent: Training Neural Networks on Very Large Images (Poster)
Current deep learning models falter when faced with large-scale images, largely due to prohibitive computing and memory demands. Enter Patch Gradient Descent (PatchGD), a groundbreaking learning technique that seamlessly trains deep learning models on expansive images. This innovation takes inspiration from the standard feedforward-backpropagation paradigm. However, instead of processing an entire image simultaneously, PatchGD smartly segments and updates a core information-gathering element using portions of the image before the final evaluation. This ensures wide coverage across iterations, bringing in notable memory and computational efficiencies. When tested on the high-resolution PANDA and UltraMNIST datasets using ResNet50 and MobileNetV2 models, PatchGD clearly outstrips traditional gradient descent techniques, particularly under memory constraints. The future of handling vast image datasets effectively lies with PatchGD.
Deepak Gupta · Gowreesh Mago · Arnav Chavan · Dilip K. Prasad · Rajat Thomas

Batched Low-Rank Adaptation of Foundation Models (Poster)
Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To address this, we introduce FLORA (Fast LoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLORA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 6 languages.
Yeming Wen · Swarat Chaudhuri

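A sketch of the batching idea as described: each example in the minibatch carries its own low-rank adapter, applied with batched matmuls instead of a loop over adapters. Shapes are illustrative assumptions.

```python
import torch

batch, d_in, d_out, r = 16, 512, 512, 8
x = torch.randn(batch, d_in)
W = torch.randn(d_out, d_in)        # shared frozen base weight
A = torch.randn(batch, r, d_in)     # per-example adapter (down-projection)
B = torch.randn(batch, d_out, r)    # per-example adapter (up-projection)

base = x @ W.T                                                  # one shared matmul
low_rank = torch.bmm(B, torch.bmm(A, x.unsqueeze(-1))).squeeze(-1)  # batched adapters
y = base + low_rank                 # heterogeneous requests served in a single batch
```
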
Local LoRA: Memory-Efficient Fine-Tuning of Large Language Models (Poster)
We present Local LoRA, a memory-flexible fine-tuning approach that, in principle, can fine-tune an arbitrarily large model on fixed hardware, including consumer-grade GPUs. Our approach aims to decouple the size of the model from the memory required to fine-tune it by dividing the model into chunks and sequentially fine-tuning each chunk. Our results show that Local LoRA closes the gap between the un-tuned model and end-to-end LoRA on math reasoning tasks.
Oscar Key · Jean Kaddour · Pasquale Minervini

Early Weight Averaging meets High Learning Rates for LLM Pre-training (Poster)
Training Large Language Models (LLMs) incurs significant cost; hence, any strategy that accelerates model convergence is helpful. In this paper, we investigate the ability of a simple idea, checkpoint averaging along the trajectory of a training run, to improve both convergence and generalization quite early during training. Here we show that models trained with high learning rates observe higher gains due to checkpoint averaging. Furthermore, these gains are amplified when checkpoints are sampled with considerable spacing in training steps. Our training recipe outperforms conventional training and popular checkpoint averaging baselines such as the exponential moving average (EMA) and stochastic weight averaging (SWA). We evaluate our training recipe by pre-training LLMs, where high learning rates are inherently preferred due to extremely large batch sizes. Specifically, we pre-trained nanoGPT-2 models of varying sizes, small (125M), medium (335M), and large (770M), on the OpenWebText dataset, comprised of 9B tokens. Additionally, we present results for publicly available Pythia LLMs, ranging from 1B to 12B, which were trained on the PILE-deduped dataset containing 207B tokens.
Sunny Sanyal · Atula Neerkaje · Jean Kaddour · Abhishek Kumar · Sujay Sanghavi

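A minimal sketch of checkpoint averaging along a single run, with placeholder spacing and warmup values: sample checkpoints with wide spacing and keep a running mean of the parameters.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 10)            # stand-in for the LLM being trained
avg_state, num_ckpts = None, 0

for step in range(1, 20_001):
    # ... one optimizer step on `model` ...
    if step % 5_000 == 0:            # considerable spacing between sampled checkpoints
        if avg_state is None:
            avg_state = copy.deepcopy(model.state_dict())
        else:
            for k, v in model.state_dict().items():
                avg_state[k] += (v - avg_state[k]) / (num_ckpts + 1)  # running mean
        num_ckpts += 1

# evaluate with the averaged weights: model.load_state_dict(avg_state)
```
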
Bandit-Driven Batch Selection for Robust Learning under Label Noise (Poster)
We introduce a novel approach for batch selection in Stochastic Gradient Descent (SGD) training, leveraging combinatorial bandit algorithms. Our methodology focuses on optimizing the learning process in the presence of label noise, a prevalent issue in real-world datasets. Experimental evaluations on the CIFAR-10 dataset reveal that our approach consistently outperforms existing methods across various levels of label corruption. Importantly, we achieve this superior performance without incurring the computational overhead commonly associated with auxiliary neural network models. This work presents a balanced trade-off between computational efficiency and model efficacy, offering a scalable solution for complex machine learning applications.
Michal Lisicki · Graham Taylor · Mihai Nica

Maestro: Uncovering Low-Rank Structures via Trainable Decomposition (Poster)
Deep Neural Networks (DNNs) have been a large driver and enabler for AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, the training process of such large models is a costly and time-consuming process, which typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification or quantization of the model weights and updates. While able to achieve high compression rates, they often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression in the training process. Similarly, such techniques (e.g., SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of linear mapping on uniformly distributed data and PCA for linear autoencoders. We further apply our technique on DNNs and empirically illustrate that Maestro enables the extraction of lower footprint models that preserve model performance while allowing for graceful accuracy-latency tradeoffs for deployment to devices of different capabilities.
Samuel Horváth · Stefanos Laskaridis · Shashank Rajput · Hongyi Wang

Tiny Graph Convolutional Networks with Topologically Consistent Magnitude Pruning (Poster)
Magnitude pruning is one of the mainstream methods in lightweight architecture design whose goal is to extract subnetworks with the largest weight connections. This method is known to be successful, but under very high pruning regimes, it suffers from topological inconsistency which renders the extracted subnetworks disconnected, and this hinders their generalization ability. In this paper, we devise a novel end-to-end Topologically Consistent Magnitude Pruning (TCMP) method that allows extracting subnetworks while guaranteeing their topological consistency. The latter ensures that only accessible and co-accessible (impactful) connections are kept in the resulting lightweight architectures. Our solution is based on a novel reparametrization and two supervisory bi-directional networks which implement accessibility/co-accessibility and guarantee that only connected subnetworks will be selected during training. This solution allows enhancing generalization significantly, under very high pruning regimes, as corroborated through extensive experiments, involving graph convolutional networks, on the challenging task of skeleton-based action recognition.
Hichem SAHBI

DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing Learning Efficiency (Poster)
This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT. The DONUT model leverages a transformer architecture to overcome the challenges of separate optical character recognition (OCR) and visual semantic understanding (VSU) components. However, its deployment in production environments and edge devices is hindered by high memory and computational demands, particularly in large-scale request services. To overcome these challenges, we propose an optimization strategy based on knowledge distillation and model pruning. Our paradigm for producing DONUT-hole reduces the model density by 54% while preserving performance. We also achieve a global representational similarity index between DONUT and DONUT-hole of 0.79, based on the centered kernel alignment (CKA) metric. Moreover, we evaluate the effectiveness of DONUT-hole on the document image key information extraction (KIE) task, highlighting its potential for developing more efficient VDU systems for logistics companies.
Azhar Shaikh · Michael Cochez · Denis Diachkov · Michiel de Rijcke · Sahar Yousefi

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (Poster)
The popularity of LLaMA and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring less than 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs.
Mengzhou Xia · Tianyu Gao · Zhiyuan Zeng · Danqi Chen

-
|
A foundation for exact binarized morphological neural networks
(
Poster
)
>
link
Training and running deep neural networks (NNs) often demands substantial computation and energy-intensive specialized hardware (e.g., GPUs, TPUs). One way to reduce the computation and power cost is to use binary-weight NNs, but these are hard to train because the sign function has a non-smooth gradient. We present a model based on Mathematical Morphology (MM) that can binarize ConvNets without losing performance under certain conditions, but these conditions may not be easy to satisfy in real-world scenarios. To solve this, we propose two new approximation methods and develop a robust theoretical framework for binarizing ConvNets using MM. We also propose regularization losses to improve the optimization. We empirically show that our model can learn a complex morphological network, and we explore its performance on a classification task. |
Theodore Aouad · Hugues Talbot 🔗 |
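The non-smooth sign gradient mentioned above is commonly worked around with a straight-through estimator (STE). The generic STE below is a baseline sketch for context, not the paper's morphological construction.

```python
# Straight-through estimator for weight binarization: forward uses sign(w),
# backward passes the gradient through where |w| <= 1 and clips it elsewhere.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()

w = torch.randn(8, requires_grad=True)
loss = BinarizeSTE.apply(w).sum()
loss.backward()  # w.grad is well-defined despite the non-smooth forward
```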
-
|
Training Bayesian Neural Networks with Sparse Subspace Variational Inference
(
Poster
)
>
link
Bayesian neural networks (BNNs) offer uncertainty quantification but come with substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either gradually introducing sparsity during training or compressing dense BNNs after training. The dilemma of how to cut massive training costs remains, particularly given the requirement to learn the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. Since basis selection is a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight-distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, 10-20× compression in model size with comparable performance and up to 20× FLOPs reduction during training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs. |
Junbo Li · Zichen Miao · Qiang Qiu · Ruqi Zhang 🔗 |
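A toy version of the removal-and-addition strategy might drop the active weights with the lowest posterior signal-to-noise ratio and activate an equal number of inactive positions. Both the criterion and the random growth rule below are stand-ins; the paper's weight-distribution statistics may differ.

```python
# Toy removal-and-addition step over a sparse set of variational parameters.
import torch

def update_active_set(mu, log_sigma, active, n_swap):
    """Drop the n_swap active weights with the lowest |mu|/sigma and
    activate the same number of currently inactive positions at random."""
    snr = mu.abs() / log_sigma.exp()
    active_idx = active.nonzero().squeeze(-1)
    drop = active_idx[snr[active_idx].argsort()[:n_swap]]
    active[drop] = False
    inactive_idx = (~active).nonzero().squeeze(-1)
    grow = inactive_idx[torch.randperm(inactive_idx.numel())[:n_swap]]
    active[grow] = True
    return active

mu, log_sigma = torch.randn(100), torch.randn(100) * 0.1
active = torch.zeros(100, dtype=torch.bool)
active[:10] = True  # a 10%-sparse subspace to start from
active = update_active_set(mu, log_sigma, active, n_swap=2)
```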
-
|
Task Arithmetic with LoRA for Continual Learning
(
Poster
)
>
link
Continual learning refers to the setting where training data becomes available in sequential chunks, termed "tasks". Much of the progress in continual learning has been stunted by catastrophic forgetting, which is caused by sequentially training the model on streams of data. Moreover, it becomes computationally expensive to sequentially train large models multiple times. To mitigate both of these problems at once, we propose a novel method to continually train transformer-based vision models using low-rank adaptation and task arithmetic. Our method completely bypasses the problem of catastrophic forgetting and reduces the computational requirement for training models on each task. When aided by a small memory of 10 samples per class, our method achieves performance close to full-set fine-tuning. We present rigorous ablations to support the effectiveness of our method. |
Rajas Chitale · Ankit Vaidya · Aditya Kane · Archana Ghotkar 🔗 |
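Task arithmetic treats each task's weight update as a vector that can be added to the base model. With LoRA, that update is the low-rank product of the adapter matrices, so merging tasks reduces to summing these products into the frozen weight, as sketched below; shapes, scaling, and names are illustrative, not the authors' exact procedure.

```python
# Merging per-task LoRA adapters by summing their low-rank task vectors.
import torch

def merge_lora_tasks(base_weight, adapters, scale=1.0):
    """adapters is a list of (A, B) pairs with A: (r, in_dim), B: (out_dim, r)."""
    merged = base_weight.clone()
    for A, B in adapters:
        merged += scale * (B @ A)  # each task vector is a low-rank delta
    return merged

d_out, d_in, r = 64, 64, 4
base = torch.randn(d_out, d_in)  # frozen pre-trained weight
task_adapters = [(torch.randn(r, d_in) * 0.01, torch.randn(d_out, r) * 0.01)
                 for _ in range(3)]  # one adapter per task
merged = merge_lora_tasks(base, task_adapters)
```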
-
|
Dynamic Observation Policies in Observation Cost-Sensitive Reinforcement Learning
(
Poster
)
>
link
Reinforcement learning (RL) has been shown to learn sophisticated control policies for complex tasks including games, robotics, heating and cooling systems, and text generation. The action-perception cycle in RL, however, generally assumes that a measurement of the state of the environment is available at each time step without cost. In applications such as materials design, deep-sea and planetary robot exploration, and medicine, however, there can be a high cost associated with measuring, or even approximating, the state of the environment. In this paper, we survey the recently growing literature that adopts the perspective that an RL agent might not need, or even want, a costly measurement at each time step. Within this context, we propose the Deep Dynamic Multi-Step Observationless Agent (DMSOA), contrast it with the literature, and empirically evaluate it on OpenAI Gym and Atari Pong environments. Our results show that DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature. |
Colin Bellinger · Mark Crowley · Isaac Tamblyn 🔗 |
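The core loop of a multi-step observationless agent can be sketched as: pay for one observation, commit to an action plus a number of steps to repeat it blindly, then observe again. Everything below (the stub environment, the policy stand-in) is invented for illustration and is not the DMSOA implementation.

```python
# Decision loop that pays for one measurement per decision, then acts
# without observing for a policy-chosen number of steps.
import torch

class StubEnv:
    """Stand-in for the Gym/Atari environments used in the paper."""
    def reset(self):
        return torch.zeros(4)
    def step(self, action):
        return torch.randn(4), 1.0, False  # obs, reward, done

def act_and_skip(obs):
    """Placeholder policy head returning an action and a skip count."""
    logits_action, logits_skip = torch.randn(2), torch.randn(4)
    return int(logits_action.argmax()), int(logits_skip.argmax()) + 1

env = StubEnv()
obs = env.reset()
measurements = 0
for _ in range(10):
    action, skip = act_and_skip(obs)  # one costly measurement per decision
    measurements += 1
    for _ in range(skip):             # repeat the action without observing
        obs, reward, done = env.step(action)
        if done:
            obs = env.reset()
            break
```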
-
|
Cooperative Learning for Cost-Adaptive Inference
(
Poster
)
>
link
We propose a cooperative training framework for deep neural network architectures that enables the runtime network depth to change to satisfy dynamic computing-resource requirements. In our framework, the number of layers participating in computation can be chosen dynamically to meet performance-cost trade-offs at inference time. Our method trains two Teammate nets, a Leader net, and two sets of Teammate sub-networks of various depths through knowledge distillation. The Teammate nets derive sub-networks and transfer knowledge to them, and to each other, while the Leader net guides the Teammate nets to ensure accuracy. The framework is trained atomically, all at once, instead of individually training models of various sizes; in a sense, the various-sized networks are all trained together, in a "package deal." The proposed framework is not tied to any specific architecture and can incorporate existing models/architectures; it therefore maintains stable results and is insensitive to the size of a dataset's feature maps. Compared with related approaches, it provides accuracy comparable to its full network while making models of various sizes available. |
Xingli Fang · Richard Bradford · Jung-Eun Kim 🔗 |
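Depth-adaptive inference of the kind described above can be sketched with shared blocks and one classifier head per depth, so that any prefix of the network is a usable sub-network. The module below is an illustrative skeleton only; the Teammate/Leader distillation that would train it is omitted, and all names are invented.

```python
# A network whose runtime depth is chosen per inference call.
import torch
import torch.nn as nn

class DepthAdaptiveNet(nn.Module):
    def __init__(self, width=32, n_blocks=4, n_classes=10):
        super().__init__()
        self.stem = nn.Linear(16, width)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU())
            for _ in range(n_blocks))
        # One classifier head per depth, so any prefix of blocks is usable.
        self.heads = nn.ModuleList(
            nn.Linear(width, n_classes) for _ in range(n_blocks))

    def forward(self, x, depth):
        h = self.stem(x)
        for block in self.blocks[:depth]:
            h = block(h)
        return self.heads[depth - 1](h)

net = DepthAdaptiveNet()
x = torch.randn(2, 16)
cheap, full = net(x, depth=1), net(x, depth=4)  # pick depth per budget
```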
-
|
Generalisable Agents for Neural Network Optimisation
(
Poster
)
>
link
Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO): a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level, collectively improving global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO performs robustly across a wide variety of unseen initial conditions and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome. |
Kale-ab Tessera · Callum R. Tilbury · Sasha Abramowitz · Ruan John de Kock · Omayma Mahjoub · Benjamin Rosman · Sara Hooker · Arnu Pretorius 🔗 |
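Layerwise learning-rate control of the kind GANNO learns can be sketched with one stand-in "agent" per optimizer parameter group, each observing local gradient and weight norms and rescaling its group's learning rate. The hand-written rule below is only a placeholder for the trained MARL policies.

```python
# Per-layer learning-rate rescaling via optimizer parameter groups.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
optimizer = torch.optim.SGD(
    [{"params": layer.parameters()} for layer in model], lr=0.1)

def layer_agent(grad_norm, weight_norm):
    """Placeholder per-layer policy: shrink the lr when gradients dwarf weights."""
    return 0.5 if grad_norm > weight_norm else 1.1

x, y = torch.randn(4, 8), torch.randn(4, 2)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
for layer, group in zip(model, optimizer.param_groups):
    g = sum(p.grad.norm() for p in layer.parameters())
    w = sum(p.detach().norm() for p in layer.parameters())
    group["lr"] *= layer_agent(g, w)  # each "agent" acts on its own layer
optimizer.step()
```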