The second edition of the Efficient Natural Language and Speech Processing (ENLSP-II) workshop focuses on fundamental and challenging problems in making natural language and speech processing (especially pre-trained models) more efficient in terms of Data, Model, Training, and Inference. The workshop program offers an interactive platform for gathering experts and talent from academia and industry through invited talks, a panel discussion, paper submissions, reviews, interactive posters, oral presentations and a mentorship program. This is a unique opportunity to address the efficiency issues of current models, build connections, exchange ideas, brainstorm solutions, and foster future collaborations. The topics of this workshop are of interest to people working on general machine learning, deep learning, optimization, theory, and NLP & Speech applications.
Fri 5:30 a.m. - 5:50 a.m.
|
Breakfast
|
🔗 |
Fri 5:50 a.m. - 6:00 a.m.
|
Opening Remarks
(
Opening
)
|
🔗 |
Fri 6:00 a.m. - 6:30 a.m.
|
Fine-grained Interactive Vision Language Pre-training
(
KeyNote Talk
)
SlidesLive Video » Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods model the cross-modal interaction either via the similarity of each modality's global feature, which misses fine-grained information, or via finer-grained interactions using cross/self-attention over visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this talk, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training method that achieves finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. The resulting models, FILIP and Wukong, achieve good performance on multiple downstream vision-language tasks while maintaining the inference efficiency of dual-stream models. Visualization of the word-patch alignment further shows that FILIP learns meaningful fine-grained features with promising localization ability. Furthermore, we release a 100 million Chinese image-text pair dataset for pre-training. |
Lu Hou 🔗 |
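
The token-wise late interaction described in this abstract can be illustrated in a few lines. The sketch below is an assumed, simplified form of the mechanism (the shapes, names and symmetric averaging are illustrative, not FILIP's actual implementation): each image patch is matched against its most similar word token and vice versa, and the resulting score feeds the contrastive objective.

import numpy as np

def late_interaction_similarity(img_tokens, txt_tokens):
    # img_tokens: (n_patches, d) L2-normalized patch embeddings
    # txt_tokens: (n_words,   d) L2-normalized word embeddings
    sim = img_tokens @ txt_tokens.T            # (n_patches, n_words) cosine similarities
    i2t = sim.max(axis=1).mean()               # each patch keeps only its best-matching word
    t2i = sim.max(axis=0).mean()               # each word keeps only its best-matching patch
    return 0.5 * (i2t + t2i)                   # symmetric image-text score for the contrastive loss

# toy usage with random normalized embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=(49, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(12, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(late_interaction_similarity(img, txt))
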
Fri 6:30 a.m. - 7:05 a.m.
|
Efficiency Tradeoffs in the Design of Neural Search Systems
(
KeyNote Talk
)
SlidesLive Video » Information retrieval (IR) - the challenge of connecting users to previously stored relevant information - has received renewed attention of late due to the advent of pretrained transformer-based models. In recent years, we have seen the introduction of many new types of models (e.g., dense and sparse learned representations, cross-encoders, etc.) in the context of techniques that have been around for decades (e.g., BM25, multi-stage ranking, etc.). What does it mean for a search system to be efficient? In this talk, I'll try to sort through efficiency tradeoffs in the design and construction of end-to-end search systems, organized along the dimensions of time, space, and cost. |
Jimmy Lin 🔗 |
Fri 7:05 a.m. - 7:35 a.m.
|
Last Advances in End-to-End Speech Recognition
(
KeyNote Talk
)
In this talk, we will discuss a multi-year research effort with end-to-end models for speech recognition. We will also discuss how we translated these research findings into productionizable models that are used on our Pixel phones. |
Tara Sainath 🔗 |
Fri 7:35 a.m. - 7:45 a.m.
|
Collective Knowledge Graph Completion with Mutual Knowledge Distillation
(
Spotlight
)
SlidesLive Video » Knowledge graph completion (KGC), the task of predicting missing information based on the relational data already present in a knowledge graph (KG), has drawn significant attention in recent years. However, the predictive power of KGC methods is often limited by the completeness of the existing knowledge graphs. In monolingual and multilingual settings, KGs from different sources and languages are potentially complementary to each other. In this paper, we study the problem of multi-KG completion, where we focus on maximizing the collective knowledge from different KGs to alleviate the incompleteness of individual KGs. Specifically, we propose a novel method called CKGC-MKD that uses augmented CompGCN-based encoder models on both the individual KGs and a large connected KG in which seed alignments between KGs are treated as edges for message propagation. Additional mutual knowledge distillation is employed to maximize the knowledge transfer between the "global" connected KG and the "local" individual KGs. Experimental results on multilingual datasets show that our method outperforms all state-of-the-art models. |
Weihang Zhang · Ovidiu Serban · Jiahao Sun · Yike Guo 🔗 |
Fri 7:45 a.m. - 7:56 a.m.
|
Attribute Controlled Dialogue Prompting
(
Spotlight
)
SlidesLive Video » Prompt-tuning has become an increasingly popular parameter-efficient method for steering large pretrained language models to downstream tasks. However, both discrete prompting and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in some tasks such as open-domain dialogue generation. In this paper, we present a novel, instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control code, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated on both automated metrics and human evaluation, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning with only 5%-6% of total parameters. |
Runcheng Liu · Ahmad Rashid · Ivan Kobyzev · Mehdi Rezaghoizadeh · Pascal Poupart 🔗 |
Fri 7:56 a.m. - 8:05 a.m.
|
Fast DistilBERT on CPUs
(
Spotlight
)
SlidesLive Video » Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximizing throughput while complying with certain latency constraints, which prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform the existing state-of-the-art Neural Magic DeepSparse runtime by up to 50% and provide up to a 4.1x speedup over ONNX Runtime. |
Haihao Shen · Ofir Zafrir · Bo Dong · Hengyu Meng · Xinyu Ye · Zhe Wang · Yi Ding · Hanwen Chang · Guy Boudoukh · Moshe Wasserblat 🔗 |
Fri 8:00 a.m. - 8:30 a.m.
|
Morning Break and Poster Session 1
(
Break and Poster Session
)
|
🔗 |
Fri 8:30 a.m. - 9:05 a.m.
|
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
(
KeyNote Talk
)
SlidesLive Video » Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy due to outliers or do not run efficiently on hardware. I’ll present SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs, including OPT-175B, BLOOM-176B and GLM-130B, achieving faster inference speed with half the number of GPUs. We hope SmoothQuant can inspire economic deployment of LLMs in the future. |
Song Han 🔗 |
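
The smoothing idea behind SmoothQuant can be sketched compactly: a per-channel scale migrates part of the quantization difficulty from activation outliers into the weights while leaving the layer's output mathematically unchanged, so both sides can then be quantized to 8 bits. The numpy snippet below is a minimal sketch under assumed shapes and an assumed alpha of 0.5; it is not the paper's implementation.

import numpy as np

def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    # per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    return act_absmax ** alpha / weight_absmax ** (1 - alpha)

def smooth(X, W, alpha=0.5):
    # X: (tokens, in_features) activations, W: (in_features, out_features) weights
    s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1), alpha)
    return X / s, W * s[:, None]        # X @ W is unchanged, but per-channel ranges are balanced

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)); X[:, 2] *= 50.0    # channel 2 carries activation outliers
W = rng.normal(size=(4, 3))
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))              # True: the output is preserved exactly
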
Fri 9:05 a.m. - 9:35 a.m.
|
Building Language Models Based on Retrieval
(
KeyNote Talk
)
SlidesLive Video » Large language models (LLMs) have utterly transformed the field of natural language processing. However, training LLMs comes at a massive financial and environmental cost, making them out of reach of academic research labs. Meanwhile, these models are costly to update and prone to leaking private text data. In this talk, I will argue that retrieval-based language models are a promising way of scaling LMs and overcoming the above limitations. I will discuss recent developments of retrieval-based language models, compare their pros and cons, and show their benefits in interpretability, adaptability, and privacy. In particular, I will introduce a new training approach for retrieval-based language models called TRIME (TRaining with In-batch MEmories), which can train LMs to retrieve better from the text during inference. |
Danqi Chen 🔗 |
Fri 9:35 a.m. - 10:05 a.m.
|
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
(
KeyNote Talk
)
SlidesLive Video » The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Together with better performance come larger model sizes. This imposes challenges on the memory wall of current accelerator hardware such as GPUs. It is never ideal to train large models such as Vision Transformer, BERT, and GPT on a single GPU or a single machine. There is an urgent demand to train models in a distributed environment. However, distributed training, especially model parallelism, often requires domain expertise in computer systems and architecture. It remains a challenge for AI researchers to implement complex distributed training solutions for their models. To solve this problem, we introduce Colossal-AI, a unified parallel training system designed to seamlessly integrate different paradigms of parallelization techniques including data parallelism, pipeline parallelism, multiple tensor parallelism, and sequence parallelism. Colossal-AI aims to support the AI community in writing distributed models the same way they write models normally. This allows them to focus on developing the model architecture and separates the concerns of distributed training from the development process. Colossal-AI is able to achieve a 2x speedup over state-of-the-art distributed systems for GPT model training. The source code can be found at https://github.com/hpcaitech/ColossalAI |
Yang You 🔗 |
Fri 10:05 a.m. - 10:15 a.m.
|
Efficient Few-Shot Learning Without Prompts
(
Spotlight
)
SlidesLive Video » Recent few-shot learning methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are highly sensitive to handcrafted prompts, and typically require billion-parameter language models to achieve high accuracy. To address these shortcomings, we propose SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of labeled text pairs, in a contrastive Siamese manner. The resulting model is then used to generate rich text embeddings, which are used to train a classification head. This simple framework requires no prompts or verbalizers, and achieves high accuracy with orders of magnitude fewer parameters and less runtime than existing techniques. Our experiments show that SetFit achieves results competitive with PEFT and PET techniques, and outperforms them on a variety of classification tasks. |
Oren Pereg · Daniel Korat · Moshe Wasserblat · Lewis Tunstall · Unso Eun Seo Jo · Luke Bates · Nils Reimers 🔗 |
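
The two-stage recipe sketched in the abstract (contrastive fine-tuning of a Sentence Transformer, followed by a lightweight classification head) can be approximated with off-the-shelf libraries. The snippet below is a hedged sketch rather than the authors' released code; the checkpoint name, pair construction and hyperparameters are illustrative assumptions only.

import itertools
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Tiny labeled set (8 examples, 2 classes); texts and labels are made up for illustration.
texts  = ["great movie", "loved it", "what a gem", "fantastic acting",
          "terrible plot", "awful pacing", "hated it", "a total bore"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 1: contrastive pairs (same label -> similar, different label -> dissimilar)
pairs = [InputExample(texts=[a, b], label=float(la == lb))
         for (a, la), (b, lb) in itertools.combinations(zip(texts, labels), 2)]

st = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=8)
st.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(st))],
       epochs=1, show_progress_bar=False)

# Step 2: train a simple classification head on the adapted embeddings
head = LogisticRegression().fit(st.encode(texts), labels)
print(head.predict(st.encode(["an absolute delight", "painfully dull"])))
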
Fri 10:15 a.m. - 10:25 a.m.
|
PCFG-based Natural Language Interface Improves Generalization for Controlled Text Generation
(
Spotlight
)
SlidesLive Video » Existing work on controlled text generation (CTG) assumes a control interface of categorical attributes. In this work, we propose a natural language interface, where we craft a PCFG to embed the control attributes into natural language commands and propose variants of existing CTG models that take commands as input. We design tailored experiments to test the models' generalization abilities. The results show that our PCFG-based command generation approach is effective at handling unseen commands compared to fixed-set templates, and our proposed NL models can effectively generalize to unseen attributes. |
Jingyu Zhang · Jim Glass · Tianxing He 🔗 |
Fri 10:25 a.m. - 10:35 a.m.
|
PromptDA: Label-guided Data Augmentation for Prompt-based Few Shot Learners
(
Spotlight
)
SlidesLive Video » Recent advances in large pre-trained language models (PLMs) have led to impressive gains on natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs relies heavily on a large amount of labeled instances, which are usually hard to obtain. Prompt-based tuning of PLMs has proven valuable for various few-shot tasks. Existing work studying prompt-based tuning for few-shot NLU tasks mainly focuses on deriving proper label words with a verbalizer or generating prompt templates for eliciting semantics from PLMs. In addition, conventional data augmentation methods have also been verified as useful for few-shot tasks. However, there are currently few data augmentation methods designed for the prompt-based tuning paradigm. Therefore, we study the new problem of data augmentation for prompt-based few-shot learners. Since label semantics are essential in prompt-based tuning, we propose a novel label-guided data augmentation method, PromptDA, which exploits the enriched label semantic information for data augmentation. Extensive experimental results on few-shot text classification tasks show that our proposed framework achieves superior performance by effectively leveraging label semantics and data augmentation for natural language understanding. |
Canyu Chen · Kai Shu 🔗 |
Fri 10:30 a.m. - 11:30 a.m.
|
Lunch Break and Virtual Poster Session
link »
Join the gathertown link to meet our virtual poster presenters: [ protected link dropped ] |
🔗 |
Fri 11:30 a.m. - 12:00 p.m.
|
Efficient Identify Event Causality with Knowledge and Analogy
(
KeyNote Talk
)
SlidesLive Video » Event causality identification (ECI) is an important task in natural language processing (NLP) which aims to identify the causal relationships between events in text pieces, i.e., predict whether one event causes another to happen. Due to the diversity of real-world causal events and the difficulty of obtaining sufficient training data, existing ECI approaches have poor generalizability and struggle to identify the relation between seldom-seen events. We propose to utilize both external knowledge and internal analogy to improve ECI. By utilizing a commonsense knowledge graph to reveal the commonalities or associations between different events, and retrieving similar events as analogy examples to glean useful experience from such analogous neighbors, we can better identify the relationship between a new event pair. Extensive evaluations show that our approach significantly outperforms other baseline methods. |
Bang Liu 🔗 |
Fri 12:00 p.m. - 12:50 p.m.
|
Interactive Industrial Panel
(
Discussion Panel
)
SlidesLive Video » |
Jiahao Sun · Ahmed Ibrahim · Marjan Ghazvininejad · Yu Cheng · Boxing Chen · Mohammad Norouzi · Rahul Gupta 🔗 |
Fri 12:50 p.m. - 12:59 p.m.
|
Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement
(
Spotlight
)
SlidesLive Video » Self-supervised speech representation learning aims to extract meaningful factors from the speech signal that can later be used across different downstream tasks, such as speech and/or emotion recognition. Existing models, such as HuBERT, however, can be fairly large thus may not be suitable for edge speech applications. Moreover, realistic applications typically involve speech corrupted by noise and room reverberation, hence models need to provide representations that are robust to such environmental factors. In this study, we build on the so-called DistilHuBERT model, which distils HuBERT to a fraction of its original size, with three modifications, namely: (i) augment the training data with noise and reverberation, while the student model needs to distill the clean representations from the teacher model; (ii) introduce a curriculum learning approach where increasing levels of noise are introduced as the model trains, thus helping with convergence and with the creation of more robust representations; and (iii) introduce a multi-task learning approach where the model also reconstructs the clean waveform jointly with the distillation task, thus also acting as an enhancement step to ensure additional environment robustness to the representation. Experiments on three SUPERB tasks show the advantages of the proposed method not only relative to the original DistilHuBERT, but also to the original HuBERT, thus showing the advantages of the proposed method for "in the wild" edge speech applications. |
Heitor Guimarães · Arthur Pimentel · Anderson R. Avila · Mehdi Rezaghoizadeh · Tiago H Falk 🔗 |
Fri 12:59 p.m. - 1:05 p.m.
|
Gradient Knowledge Distillation for Pre-trained Language Models
(
Spotlight
)
SlidesLive Video » Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models transfer knowledge by aligning instance-wise outputs between the teacher and the student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in inputs, which we assume is beneficial for the student to better approximate the underlying mapping function of the teacher. Therefore, we propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods in the student's performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, improving the interpretability greatly. |
Lean Wang · Lei Li · Xu Sun 🔗 |
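
A gradient-alignment term of this kind can be sketched as follows. The code is an assumed form, not the authors' implementation: it adds, to a standard output-level KD loss, an MSE penalty between the student's and the teacher's gradients with respect to the shared input embeddings (the scalar summary used here, logsumexp of the logits, is an illustrative choice).

import torch
import torch.nn.functional as F

def gradient_kd_loss(student, teacher, embeddings, temperature=2.0):
    # `student` and `teacher` map input embeddings to logits; gradients are taken
    # w.r.t. the shared input so the student mimics how the teacher responds to
    # input perturbations, in addition to matching its outputs.
    emb = embeddings.detach().requires_grad_(True)
    t_logits = teacher(emb)
    s_logits = student(emb)

    # gradients of a scalar summary of each model's output w.r.t. the shared input
    g_t = torch.autograd.grad(t_logits.logsumexp(-1).sum(), emb, retain_graph=True)[0]
    g_s = torch.autograd.grad(s_logits.logsumexp(-1).sum(), emb,
                              create_graph=True, retain_graph=True)[0]

    # output-level KD plus the gradient-alignment term
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  F.softmax(t_logits.detach() / temperature, dim=-1),
                  reduction="batchmean")
    grad_align = F.mse_loss(g_s, g_t.detach())
    return kd + grad_align

# toy check with linear "models" standing in for teacher and student networks
teacher, student = torch.nn.Linear(16, 4), torch.nn.Linear(16, 4)
loss = gradient_kd_loss(student, teacher, torch.randn(8, 16))
loss.backward()
print(loss.item())
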
Fri 1:00 p.m. - 1:30 p.m.
|
Break and Poster Session II
(
Break and Poster Session
)
|
🔗 |
Fri 1:30 p.m. - 2:05 p.m.
|
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
(
KeyNote Talk
)
SlidesLive Video » Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over kNN-LM (Khandelwal et al., 2020) without hurting perplexity. |
Graham Neubig 🔗 |
Fri 2:05 p.m. - 2:35 p.m.
|
Do we still need inductive biases after Transformer language models?
(
KeyNote Talk
)
SlidesLive Video » In this talk, I will explore the role of inductive biases when fine-tuning large Transformer language models in three different scenarios: when output space is structured, for example, semantic parsing from language to code; when performing multi-task learning where tasks may share some latent structure, e.g., different semantic tasks like question answering and text entailment may share common reasoning skills; when the input involves a higher-order (latent) structure such as negation. It is not always the case that inductive biases help. Come with your wisest/wildest answers. |
Siva Reddy 🔗 |
Fri 2:35 p.m. - 3:05 p.m.
|
8-bit Methods for Efficient Deep Learning
(
KeyNote Talk
)
SlidesLive Video » Large language models are effective tools for many tasks but are difficult to train and run inference with due to their size. Moving from 32-bit models to 16-bit models resulted in considerable efficiency gains that made training and inference of large models easier. Can we train and run inference in 8-bit to make further gains? In this talk, I will show that 8-bit inference and training can be used without degrading performance while improving efficiency. To make 8-bit methods work, it is essential to understand how quantization precision affects model performance and training stability as we scale the model size. I will talk about how these factors change with scale and how we need to adjust 8-bit methods to make them work. In particular, I will speak about 8-bit optimizers for training and Int8 inference for large language models with up to 175B parameters. These methods make training and inference more efficient and make large models more accessible to researchers. |
Tim Dettmers 🔗 |
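
A minimal illustration of 8-bit inference is symmetric absmax quantization followed by an integer matmul and a floating-point rescale. The snippet below is only a toy sketch of that idea; the methods covered in the talk (8-bit optimizers, outlier-aware Int8 inference) involve considerably more machinery.

import numpy as np

def quantize_absmax_int8(x, axis=None):
    # symmetric absmax quantization: map the largest magnitude to 127
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    # per-row activation scales, per-column weight scales, integer matmul, rescale
    qx, sx = quantize_absmax_int8(x, axis=1)
    qw, sw = quantize_absmax_int8(w, axis=0)
    return (qx.astype(np.int32) @ qw.astype(np.int32)) * sx * sw

rng = np.random.default_rng(0)
x, w = rng.normal(size=(4, 64)), rng.normal(size=(64, 8))
err = np.abs(int8_matmul(x, w) - x @ w).max()
print(f"max abs error vs fp matmul: {err:.4f}")
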
Fri 3:05 p.m. - 3:35 p.m.
|
Efficient Controllable Generative Models for Music and Performance Synthesis
(
KeyNote Talk
)
SlidesLive Video » How can we design generative models with structure that both improve the efficiency of models and controllability for users? In this talk, I'll give two examples to illustrate how we could achieve this goal by taking inspiration from the nonlinear and hierarchical structure that underlies the human process of creating music. Generative models of music composition typically assume music is written in a single pass from beginning to end, constraining the user to also follow this unnatural chronological process. To enable a more nonlinear creative workflow, we introduced Coconet (Huang et al., 2017) an Orderless NADE (Uria et al., 2014) like generative model (similar to masked language and visual models) that models all permutations of orderings of breaking down the task of composition. This enables both the model to learn more efficiently from data sequences by traversing it from all directions, and users to put down notes in any order and have the model complete any partial score. Neural audio synthesizers typically synthesize musical performance audio from MIDI end-to-end, resulting in a blackbox that offers few mechanisms for control. To enable detailed user control, we introduced MIDI-DDSP (Wu et al., 2022), a hierarchical model of musical performance synthesis, that breaks down audio synthesis into a three-level hierarchy of notes, performance, and synthesis, analogous to how a creative process involves composers, performers and instruments. Not only does this interpretable hierarchy allow users to intervene at each level or utilize trained priors (performance given notes, synthesis given performance) for creative assistance, it also allows models to leverage these inductive biases to learn more efficiently from data, making it possible to train high-fidelity performance synthesis models from only a few hours of recordings. We hope these examples might encourage researchers to partner with creative practitioners to innovate in modeling, interaction, and human-ai co-creativity. We could see the goal as not only designing generative models that can model and generate creative artifacts well, but also working towards generative agents that we can coordinate and collaborate with in a creative setting. |
Cheng-Zhi Anna Huang 🔗 |
Fri 3:35 p.m. - 3:45 p.m.
|
Best Paper and Poster Awards
(
Closing remark
)
SlidesLive Video » |
🔗 |
-
|
Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning
(
Poster
)
SlidesLive Video » Dialogue state tracking (DST) is an important step in dialogue management to keep track of users' beliefs. Existing works fine-tune all language model (LM) parameters to tackle the DST task, which requires significant data and computing resources for training and hosting. The cost grows exponentially in real-world deployment, where dozens of fine-tuned LMs are used for different domains and tasks. To develop domain-specific models that better utilize slot-related information with less training data and fewer parameters, we propose to use soft prompt tokens to learn task properties, incorporate segment information and reiterate the task before predicting values. Without tuning LM parameters, our method drastically reduces the number of parameters needed to less than 0.5% of prior works while achieving better low-resource DST performance. |
Mingyu Derek Ma · Jiun-Yu Kao · Shuyang Gao · arpit gupta · Di Jin · Tagyoung Chung · Nanyun Peng 🔗 |
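
The parameter-efficiency argument rests on training only a handful of soft prompt vectors while the backbone LM stays frozen. The sketch below shows that pattern in a generic form; the tiny Transformer backbone, prompt length and initialization are illustrative assumptions, not the paper's model.

import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    # Only the prompt embeddings are trained; every backbone parameter is frozen.
    def __init__(self, lm, embed_dim, n_prompt_tokens=20):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):              # input_embeds: (batch, seq, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.lm(torch.cat([prompt, input_embeds], dim=1))

# toy backbone: a tiny transformer encoder standing in for the frozen LM
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True), num_layers=2)
model = SoftPromptWrapper(backbone, embed_dim=32)
out = model(torch.randn(2, 10, 32))               # (2, 30, 32): 20 prompt tokens + 10 inputs
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(out.shape, f"trainable parameters: {trainable}")   # only the 20 x 32 prompt matrix
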
-
|
BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
(
Poster
)
SlidesLive Video » Current pre-trained language models rely on large datasets for achieving state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training. In fact, it is sometimes possible to prune a considerable fraction of the training set while maintaining the test performance. Established on standard vision benchmarks, two gradient-based scoring metrics for finding important examples are GraNd and its estimated version, EL2N. In this work, we employ these two metrics for the first time in NLP. We demonstrate that these metrics need to be computed after at least one epoch of fine-tuning and they are not reliable in early steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also surpass it. This paper details adjustments and implementation choices which enable GraNd and EL2N to be applied to NLP. |
Mohsen Fayyaz · Ehsan Aghazadeh · Seyed MohammadAli Modarressi · Mohammad Taher Pilehvar · Yadollah Yaghoobzadeh · Samira Ebrahimi Kahou 🔗 |
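
The two scores can be written down directly: EL2N is the L2 norm of the error vector (softmax output minus the one-hot label), and GraNd is the norm of the per-example loss gradient. The sketch below is an illustrative implementation with a toy linear model; following the abstract, the usage example prunes the small fraction of examples with the highest scores after some fine-tuning.

import torch
import torch.nn.functional as F

def el2n_scores(logits, labels, num_classes):
    # EL2N: L2 norm of (softmax probabilities - one-hot label)
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(labels, num_classes).float()
    return (probs - onehot).norm(dim=-1)

def grand_scores(model, inputs, labels):
    # GraNd: per-example norm of the loss gradient w.r.t. model parameters (simple loop)
    scores = []
    params = [p for p in model.parameters() if p.requires_grad]
    for x, y in zip(inputs, labels):
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        scores.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)))
    return torch.stack(scores)

# usage sketch: score the training set after at least one epoch of fine-tuning,
# then drop the small fraction of examples with the highest scores before continuing.
model = torch.nn.Linear(16, 3)
x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))
scores = grand_scores(model, x, y)
keep = scores.argsort()[: int(0.9 * len(scores))]   # prune the top-10% highest-GraNd examples
print(el2n_scores(model(x), y, 3)[:5], keep.shape)
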
-
|
Pre-Training a Graph Recurrent Network for Language Representation
(
Poster
)
SlidesLive Video » Transformer-based models have advanced considerably in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism in the Transformer may not be necessary; both convolutional neural networks and multi-layer-perceptron-based models have been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation decoupled from other tokens. We find that such an architecture can give comparable results to Transformer-based ones on both English and Chinese language benchmarks. Moreover, instead of quadratic complexity, our model has linear complexity and performs more efficiently during inference. Our models and code will be released for further research. |
Yile Wang · Linyi Yang · Zhiyang Teng · Ming Zhou · Yue Zhang 🔗 |
-
|
An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
(
Poster
)
SlidesLive Video » Access to external knowledge is essential for many natural language processing tasks, such as question answering and dialogue. Existing methods often rely on a parametric model that stores knowledge in its parameters, or use a retrieval-augmented model that has access to an external knowledge source. Parametric and retrieval-augmented models have complementary strengths in terms of computational efficiency and predictive accuracy. To combine the strength of both approaches, we propose the Efficient Memory-Augmented Transformer (EMAT) – it encodes external knowledge into a key-value memory and exploits the fast maximum inner product search for memory querying. Experiments on various knowledge-intensive tasks such as question answering and dialogue datasets show that, simply augmenting parametric models (T5-base) using our method produces more accurate results while retaining a high throughput. Compared to retrieval-augmented models, EMAT runs substantially faster across the board and produces more accurate results on WoW and ELI5. |
Yuxiang Wu · Yu Zhao · Baotian Hu · Pasquale Minervini · Pontus Lars Erik Saito Stenetorp · Sebastian Riedel 🔗 |
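
The key-value memory with maximum inner product search that the abstract describes can be reduced to a very small sketch: keys are scored against a query by inner product and the top-scoring values are handed back to the model. The class below is a toy, exact-search illustration; EMAT itself uses learned key/value encodings and fast (approximate) MIPS.

import numpy as np

class KeyValueMemory:
    def __init__(self, keys, values):
        self.keys = keys          # (n_entries, d) dense key encodings of knowledge entries
        self.values = values      # (n_entries, d) value encodings returned to the model

    def query(self, q, top_k=4):
        scores = self.keys @ q                      # inner products against every key
        idx = np.argpartition(-scores, top_k)[:top_k]
        return self.values[idx], scores[idx]        # retrieved values feed back into the transformer

rng = np.random.default_rng(0)
memory = KeyValueMemory(rng.normal(size=(10_000, 64)), rng.normal(size=(10_000, 64)))
values, scores = memory.query(rng.normal(size=64))
print(values.shape, scores.shape)
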
-
|
QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
(
Poster
)
SlidesLive Video » Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, the performance of these models drops as we reduce the number of layers, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving an x3 speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more highly efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches for any computational budget on the SQuAD1.1 dataset (up to an x8.8 speedup with <1% accuracy loss). The code to reproduce this work will be publicly released on GitHub soon. |
Shira Guskin · Moshe Wasserblat · Haihao Shen · Chang Wang 🔗 |
-
|
Towards Data Efficient And Robust Speech Representation Model Distillation
(
Poster
)
SlidesLive Video » While knowledge distillation has been proven effective in learning student models of smaller size on various tasks, a large amount of distillation training data is required to keep the performance of the student model competitive with the teacher model. Our research aims to further improve the efficiency of task-agnostic speech representation model pre-training. By perturbing the training data distribution, we distil a more robust task-agnostic speech representation model with a lower training data requirement. By learning representations from both a) the teacher model, which is trained via self-supervised learning (SSL), and b) known effective hand-crafted features, we effectively regularize and compensate for the representation loss due to the distillation process. Our proposed methods are evaluated on a number of downstream tasks and are shown to be effective in certain aspects, which prompts future research that builds on our work to develop efficient task-agnostic speech representation model distillation approaches. |
Pheobe Sun · Ruibo Shi · Ahmad Emami · Sean Moran 🔗 |
-
|
On Spectral and Temporal Feature Encoding Behaviour in Stacked Architectures
(
Poster
)
SlidesLive Video » Acoustic models typically employ production- and perception-based short-term features. In the context of deep models, the acoustic information is hierarchically combined either 1) across frequency bands followed by temporal modelling, similar to cepstrum features; or 2) across temporal trajectories followed by combination across spectral bands, similar to relative spectra (RASTA) features. Such a processing pipeline is often implemented using low-rank methods to achieve a low footprint compared to SOTA models involving simultaneous spectral-temporal processing. However, very few attempts have been made to address the question of if and how such deep acoustic models flexibly integrate information from spectral or temporal features. In this work, with the help of a large-vocabulary continuous speech recognition (LVCSR) case study, the geometry of the loss landscape is used as a visualisation tool to understand the link between generalization error and spectral or temporal feature integration in learning task-specific information. |
Vaibhav Singh · Vinayak Abrol · Karan Nathwani 🔗 |
-
|
Few-Shot Aspect Extraction using Prompt Training
(
Poster
)
SlidesLive Video » A fundamental task of fine-grained sentiment analysis is aspect term extraction. Supervised-learning approaches have demonstrated state-of-the-art results for this task; however, they underperform in few-shot scenarios, where labeled training data is scarce. Prompt-based training has proven effective in few-shot sequence classification; however, it would not apply to token classification tasks. In this work we propose PATE (Prompt-based Aspect Term Extraction), a few-shot prompt-based method for the token classification task of aspect term extraction. We demonstrate that this method significantly outperforms the standard supervised training approach in few-shot setups and make our code publicly available. |
Oren Pereg · Daniel Korat · Moshe Wasserblat · Kfir Bar 🔗 |
-
|
Can we get smarter than majority vote? Efficient use of individual rater’s labels for content moderation
(
Poster
)
A large number of natural language processing (NLP) datasets contain crowdsourced labels. Most of the time, training set labels are generated using a majority vote over individual raters' labels, which discards a significant amount of information. This work focuses on improving data efficiency when training a model for "marginally abusive" Tweet classification. We compare majority vote to two families of alternative methods, changing the training process at two different steps: (1) aggregating individual labels using weak supervision to improve the quality of labels for model training, and (2) predicting individual labels using the multi-rater models proposed by Davani et al. [2022]. We find that majority vote is a strong baseline. Dawid-Skene and multi-rater models perform well, although the latter tend to be more susceptible to overfitting. Finally, we also identify a number of practical considerations for the practitioner, such as setting a minimum number of labels per rater, or preferring soft to hard labels. |
Changho Shin · Alice Schoenauer-Sebag 🔗 |
-
|
BudgetLongformer: Can we Cheaply Pretrain a SOTA Legal Language Model From Scratch?
(
Poster
)
SlidesLive Video » Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art (SOTA) Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized domains though (such as legal, scientific or biomedical), models often need to process very long text (sometimes well above 10000 tokens). Even though many efficient transformers have been proposed (such as Longformer, BigBird or FNet), so far, only very few such efficient models are available for specialized domains. Additionally, since the pretraining process is extremely costly in general, but even more so as the sequence length increases, it is often only in reach of large research labs. One way of making pretraining cheaper is the Replaced Token Detection (RTD) task, by providing more signal during training, since the loss can be computed over all tokens. In this work, we train Longformer models with the efficient RTD task on legal data to showcase that pretraining efficient LMs is possible using much less compute. We evaluate the trained models on challenging summarization tasks requiring the model to summarize long texts to show to what extent the models can achieve good performance on downstream tasks. We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks in their respective parameter range. We publish our code and models for research purposes. |
Joel Niklaus · Daniele Giofrè 🔗 |
-
|
Parameter-Efficient Finetuning of Transformers for Source Code
(
Poster
)
SlidesLive Video » Pretrained Transformers achieve state-of-the-art performance in various code-processing tasks but may be too large to be deployed. As software development tools often incorporate modules for various purposes which may potentially use a single instance of the pretrained model, it appears relevant to utilize parameter-efficient fine-tuning for the pretrained models of code. In this work, we test two widely used approaches, adapters and LoRA, which were initially tested on NLP tasks, on four code-processing tasks. We find that though the efficient fine-tuning approaches may achieve comparable or higher performance than the standard, full, fine-tuning in code understanding tasks, they underperform full fine-tuning in code-generative tasks. These results underline the importance of testing efficient fine-tuning approaches on other domains than NLP and motivate future research in efficient fine-tuning for source code. |
Shamil Ayupov · Nadezhda Chirkova 🔗 |
-
|
Graph Masking Pre-training for Graph-to-Text Generation
(
Poster
)
SlidesLive Video » Large-scale pre-trained language models (PLMs) have advanced Graph-to-Text (G2T) generation by processing the linearised version of a graph. However, the linearisation is known to ignore the structural information. Additionally, PLMs are typically pre-trained on free text which introduces domain mismatch between pre-training and downstream G2T generation tasks. To address these shortcomings, we propose efficient graph masking pre-training strategies that neither require supervision signals nor adjust the architecture of the underlying pre-trained encoder-decoder model. When used with a pre-trained T5, our approach achieves new state-of-the-art results on WebNLG+2020 and EventNarrative G2T generation datasets. Our method also shows to be very effective in the low-resource setting. Our code will be available with publication. |
Jiuzhou Han · Ehsan Shareghi 🔗 |
-
|
The Ineffectiveness of TKGE Models in Encoding Real-World Knowledge Graphs
(
Poster
)
SlidesLive Video » Temporal knowledge graphs (TKGs) have been rising in popularity in many industrial applications. However, for TKG-based applications to perform accurately, we need a reliable temporal knowledge graph embedding (TKGE) model to capture the semantic meanings of entities and the relationships between entities. This is possible when we have many standardised academic TKGs that are well connected with popular entities. However, in real-world settings, such well-connected TKGs are hardly available. Instead, real-world TKGs are usually sparser and filled with noisy and less popular entities, which makes them very challenging to use for training an accurate TKGE model. In this paper, we ran five different TKGE models on the TKGQA mergers and acquisitions (M&A) dataset to assess the effectiveness of TKGE models in encoding real-world TKGs. Specifically, we selected M&As because it is common for a well-known company to merge with or acquire a less popular/unknown company, and as such we can evaluate the effectiveness of TKGE models in encoding the less well-known companies. The results show that TKGE models are ineffective at encoding less popular/unknown entities in sparse KGs; given the lack of information on the entities, the TKGE models find distinguishing them in the embedding space challenging. |
Chuan Ming Ong · Jiahao Sun · Ovidiu Serban · Yike Guo 🔗 |
-
|
PEST: Combining Parameter-Efficient Fine-Tuning with Self-Training and Co-Training
(
Poster
)
SlidesLive Video » We demonstrate how to improve the zero-shot and few-shot performance of large language models (LLMs) by using the T-Few parameter-efficient fine-tuning method (Liu et al., 2022) with self-training or co-training. Our methods apply to settings where labeled data is very limited, but unlabeled data is plentiful. Specifically, we combine T-Few with (i) the co-training techniques of Lang et al. (2022a), and (ii) SETRED, a self-training algorithm that uses a very simple data selection criterion (Li and Zhou, 2005). By using the efficient T-Few method, we are able to scale co-training to larger models (from T0-3B to T0-11B) and cut down on wall-clock training time, improving the zero-shot co-training results of Lang et al. (2022a). By performing multiple iterations of self- or co-training, we significantly improve over the few-shot performance of T-Few reported by Liu et al. (2022) without using any additional labeled data. Our methods are relatively fast (2.5 hours to self-train T0-11B on a single A100 80GB) and allow T0-11B to match the few-shot performance of models with an order of magnitude more parameters. |
Hunter Lang · Monica Agrawal · Yoon Kim · David Sontag 🔗 |
-
|
ContextNER: Contextual Phrase Generation at Scale
(
Poster
)
SlidesLive Video » NLP research has focused on named entity recognition (NER) and how to efficiently extract entities from a sentence. However, generating the relevant context of entities in a sentence has remained under-explored. In this work, we introduce the task Context-NER, in which the relevant context of an entity has to be generated. The extracted context may not be found exactly as a substring in the sentence. We also introduce the EDGAR10-Q dataset for this task, a corpus covering 1,500 publicly traded companies. It is a manually created complex corpus and one of the largest in terms of number of sentences and entities (1M and 2.8M). We introduce a baseline approach that leverages phrase generation algorithms and uses the pre-trained BERT model to obtain a 33% ROUGE-L score. We also perform a one-shot evaluation with GPT-3, which obtains a 39% score, signifying the hardness and future scope of this task. We hope that the addition of this dataset and our study will pave the way for further research in this domain. |
Himanshu Gupta · Shreyas Verma · Tarun Kumar · Swaroop Mishra · Tamanna Agrawal · Amogh Badugu · Himanshu Bhatt 🔗 |
-
|
Efficient Speech Translation with Pre-trained models
(
Poster
)
SlidesLive Video » When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. As a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology is limited by the need for additional end-to-end training data. As a second step, we propose an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase data efficiency and improve translation quality by 6 BLEU points in scenarios with limited end-to-end training data. |
Zhaolin Li · Jan Niehues 🔗 |
-
|
Dynamic Query Representation for Extractive Question Answering
(
Poster
)
SlidesLive Video » Extractive question answering (ExQA) is an essential task for Natural Language Processing. The dominant approach to ExQA is one that represents the input sequence tokens (question and passage) with a pre-trained transformer, then uses two learned query vectors to compute distributions over the start and end answer span positions. These query vectors lack the context of the inputs, which can be a bottleneck for the model performance. To address this problem, we propose DyREx, a generalization of the vanilla approach where we dynamically compute query vectors given the input, using an attention mechanism through transformer layers. Empirical observations demonstrate that our approach consistently improves the performance over the standard one. The code and accompanying files for running the experiments are available in the supplementary materials. |
Urchade Zaratiana · Niama El Khbir · Dennis Núñez-Fernández · Pierre Holat · Nadi Tomeh · Thierry Charnois 🔗 |
-
|
Strategies for Applying Low Rank Decomposition to Transformer-Based Models
(
Poster
)
SlidesLive Video » Low rank decomposition decomposes each fully-connected layer of the transformer modules into two smaller layers using Singular Value Decomposition. State-of-the-art techniques usually apply LRD in a single shot, where all of the layers are decomposed simultaneously. In this paper, we propose and compare different strategies for applying low rank decomposition to compress pre-trained transformer-based models. These strategies include layer-by-layer and progressive decomposition. We observe that progressive low rank decomposition, in which the rank is decreased incrementally, results in higher accuracy after decomposition compared to single-shot and layer-by-layer low rank decomposition. Furthermore, in contrast with many state-of-the-art compression methods, where intensive pre-training of the compressed model is necessary, we show that progressive LRD can provide promising performance by compressing the model in the fine-tuning stage. |
Habib Hajimolahoseini · Walid Ahmed · Mehdi Rezaghoizadeh · Vahid Partovi Nia · Yang Liu 🔗 |
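
The basic operation behind all of these strategies is the same: replace one dense layer with two thinner ones obtained from a truncated SVD. The sketch below shows that step and loops over decreasing ranks to mimic the progressive schedule described above (the layer size and rank schedule are illustrative; in the paper the rank reduction is interleaved with fine-tuning).

import numpy as np

def decompose_linear(W, rank):
    # y = W x is approximated by y = A (B x), with A: (out x rank), B: (rank x in)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # absorb singular values into the first factor
    B = Vt[:rank, :]
    return A, B

W = np.random.default_rng(0).normal(size=(768, 3072))
for r in (512, 256, 64):              # progressive decomposition lowers the rank step by step
    A, B = decompose_linear(W, r)
    rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    params = (A.size + B.size) / W.size
    print(f"rank {r:>3}: params {params:.2f}x, relative error {rel_err:.3f}")
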
-
|
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low Rank Adaptation
(
Poster
)
SlidesLive Video »
With the ever-growing size of pre-trained models (PMs), fine-tuning has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pre-trained weights of the model frozen and introduce only some learnable truncated SVD modules (so-called LoRA blocks) into the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, we need to train them from scratch); second, optimizing their rank requires an exhaustive search. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) solution to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting out the representation learned at different ranks during training. We evaluate our solution on different tasks in the GLUE benchmark using the RoBERTa model. Our results show that we can train DyLoRA at least 7x faster than LoRA without compromising performance significantly. Moreover, our models can perform consistently well over a much larger range of ranks compared to LoRA.
|
Mojtaba Valipour · Mehdi Rezaghoizadeh · Ivan Kobyzev · Ali Ghodsi 🔗 |
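
The nested-rank idea can be sketched as a LoRA layer that samples a rank at every training step and uses only the leading rows/columns of its adapters, so a single trained block can later be truncated to any rank. The module below is an assumed form for illustration, not the authors' code; the frozen weight, scaling and rank-sampling rule are simplifications.

import torch
import torch.nn as nn

class DyLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, max_rank=8, alpha=16.0):
        super().__init__()
        # frozen pre-trained weight; only A and B are trainable
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        self.A = nn.Parameter(torch.randn(max_rank, in_features) * 0.02)   # down-projection
        self.B = nn.Parameter(torch.zeros(out_features, max_rank))         # up-projection
        self.max_rank, self.alpha = max_rank, alpha

    def forward(self, x, rank=None):
        # during training, sample a rank b <= max_rank; at inference, any rank can be chosen
        b = rank or (torch.randint(1, self.max_rank + 1, ()).item()
                     if self.training else self.max_rank)
        delta = self.B[:, :b] @ self.A[:b, :] * (self.alpha / b)            # truncated low-rank update
        return x @ (self.weight + delta).T

layer = DyLoRALinear(64, 64)
out = layer(torch.randn(4, 64), rank=2)     # the same trained block can be queried at rank 1..8
print(out.shape)
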
-
|
Pyramid Dynamic Inference: Encouraging Faster Inference via Early Exit Boosting
(
Poster
)
SlidesLive Video » Large transformer-based models have demonstrated state-of-the-art results on several Natural Language Understanding (NLU) tasks. However, their deployment comes at the cost of increased footprint and inference latency, limiting their adoption in real-time applications, especially on resource-constrained devices. In order to optimize the trade-off between model accuracy, footprint and inference latency, we propose Pyramid Dynamic Inference (PDI), a scheme that encourages fast inference by introducing early inference routes in a transformer model, with a focus on boosting the performance of early exit heads. Owing to the limited capacity of the earlier transformer layers to extract complex semantics, the exit heads for these layers typically display high confidence only over easy data samples. PDI aims to recover this by applying a pyramidal structure to the classification heads that allows for more confident early inference by injecting stronger classifiers at earlier layers. It also prevents a significant increase in the model footprint by gradually shrinking the classifiers as the semantic capacity of the deeper transformer layers increases. We validate the efficiency of the PDI scheme on the GLUE benchmark, where we show that PDI consistently outperforms FastBERT on both accuracy and latency. Compared to the original 6-layer DistilBERT, PDI achieves on average up to a 3.66x speedup with 29% fewer parameters and only 3.3% accuracy degradation. |
Ershad Banijamali · Pegah Kharazmi · Samridhi Choudhary · Sepehr Eghbali · Clement Chung 🔗 |
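
The inference-time behaviour described above amounts to a confidence-thresholded early-exit loop with per-layer classifiers whose width shrinks with depth. The sketch below illustrates that loop for a batch of one; the layer sizes, head widths and threshold are illustrative assumptions, not the PDI architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=64, n_layers=6, n_classes=2,
                 head_sizes=(256, 192, 128, 96, 64, 32)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        # "pyramid": earlier exits get wider heads, later exits shrink as the backbone gets stronger
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, n_classes))
            for h in head_sizes)

    @torch.no_grad()
    def forward(self, x, threshold=0.9):          # x: (1, seq, d_model), single example
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), start=1):
            x = layer(x)
            probs = F.softmax(head(x.mean(dim=1)), dim=-1)   # pooled sentence representation
            if probs.max() >= threshold or depth == len(self.layers):
                return probs, depth                          # exit as soon as the head is confident

model = EarlyExitEncoder().eval()
probs, depth = model(torch.randn(1, 16, 64))
print(f"exited at layer {depth} with confidence {probs.max().item():.2f}")
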
-
|
An efficient RNN Language Model using activity sparsity and sparse back-propagation through time
(
Poster
)
SlidesLive Video » Transformers have displaced recurrent neural networks (RNNs) for language modelling due to their effectiveness and their scalability on ubiquitous GPUs. But in resource-constrained systems, both training and inference with transformer language models are challenging due to their computational and memory requirements. RNN language models are a potential alternative, but there is still a need to bridge the gap between what RNNs are capable of in terms of efficiency and performance and the requirements of resource-constrained applications. The memory and computational requirements arising from propagating the activations of all the neurons at every time step to every connected neuron, together with the sequential dependence of activations, make RNNs harder to train efficiently. We propose a solution inspired by biological neuron dynamics, by making the communication between RNN units sparse and discrete along the forward direction. We show that this makes the backward pass with backpropagation through time (BPTT) computationally sparse and efficient as well. We base our model on the gated recurrent unit (GRU), extending it so that its units emit discrete events for communication triggered by a threshold, so that no information needs to be communicated to other units in the absence of events. Our model achieves efficiency without compromising task performance, demonstrating competitive performance compared to state-of-the-art recurrent network models in language modelling. The dynamic activity sparsity mechanism also makes our model well suited for energy-efficient neuromorphic hardware. |
Mark Schoene · Khaleelulla Khan Nazeer · David Kappel · Christian Mayr · Anand Subramoney 🔗 |
-
|
An Exploration of Methods for Zero-shot Transfer in Small Language Models
(
Poster
)
SlidesLive Video » Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the generalizability of large language models to new tasks. However, the benefits of such methods are less well-documented in smaller language models, with some studies finding contradictory results. In this work, we explore and isolate the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, and (iv) instruction tuning for models with fewer than 500 million parameters. Our experiments demonstrate that general purpose MTL improves performance by 31% on average, with further in-domain MTL improving performance by an additional 37.6% on average. We find that instruction tuning provides a modest 2% performance improvement for small models. |
Alon Albalak · Akshat Shrivastava · Chinnadhurai Sankar · Adithya Sagar · Mike Ross 🔗 |
-
|
On the impact of the quality of pseudo-labels on the self-supervised speaker verification task
(
Poster
)
SlidesLive Video » One of the most widely used self-supervised speaker verification system training methods is to optimize the speaker embedding network in a discriminative fashion using clustering algorithm-driven pseudo-labels. Although the pseudo-label-based self-supervised training scheme showed impressive performance, recent studies have shown that label noise can significantly impact the performance. In this paper, we have explored various pseudo-labels driven by different clustering algorithms and conducted a fine-grained analysis of the relationship between the quality of the pseudo-labels and the speaker verification performance. From our experimental results, we shed light on several previously unexplored and overlooked aspects of the pseudo-labels that can have an impact on the speaker verification performance. Moreover, we could observe that the self-supervised speaker verification performance is heavily dependent on multiple qualitative aspects of the clustering algorithm that was used for generating the pseudo-labels. Furthermore, we show that the speaker verification performance can be severely degraded from overfitting to the noisy pseudo-labels and that the mixup strategy can mitigate the memorization effects of label noise. |
Abderrahim Fathan · JAHANGIR ALAM · Woo Hyun Kang 🔗 |
-
|
INT8 Transformers for Inference Acceleration
(
Poster
)
SlidesLive Video » Given the general trend towards large models in the deep learning community (particularly in the space of Transformers), much work has been done with the goal of reducing the cost associated with inference. In this work, we reach a new low, quantizing all weights and activations of BERT to 8-bit integers. GELU and exp are implemented with integer lookup tables, achieving optimal INT8 quantization error. We introduce a generalized technique to compute operations frequently missing on integer-only hardware (e.g. divisions, roots) via an efficient instantiation of binary search. By applying it to intermediate computations in Softmax and LayerNorm, we obtain accurate implementations of these layers as well. We evaluate our approach on several GLUE tasks, demonstrating minimal accuracy degradation. |
Andy Rock · Omar Khalil · Ofer Shai · Paul Grouchy 🔗 |
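
The binary-search trick for integer-only hardware is easy to state: any monotone operation can be inverted by searching for the largest integer that satisfies an inequality. The two helpers below illustrate it for square roots and division in plain Python integers; the paper's fixed-point formulation inside Softmax and LayerNorm may differ.

def int_isqrt(n: int) -> int:
    # integer square root via binary search: the largest r with r*r <= n
    lo, hi = 0, n + 1
    while hi - lo > 1:                 # invariant: lo*lo <= n < hi*hi
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo

def int_div(a: int, b: int) -> int:
    # integer division a // b recovered the same way: the largest q with q*b <= a
    lo, hi = 0, a + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * b <= a:
            lo = mid
        else:
            hi = mid
    return lo

assert int_isqrt(170) == 13 and int_div(170, 13) == 13
print(int_isqrt(2**31 - 1), int_div(2**31 - 1, 12345))
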
-
|
Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic
(
Poster
)
SlidesLive Video » The use of multilingual language models for tasks in low- and high-resource languages has been a success story in deep learning. In recent times, Arabic has been receiving widespread attention on account of its dialectal variance. While prior research studies have tried to adapt these multilingual models for dialectal variants of Arabic, it still remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data for such dialectal variants. It remains an open problem whether the limited dialectal data can be used to improve the models trained in Arabic on its dialectal variants. First, we show that multilingual BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time, yields comparable accuracy to our custom monolingual Arabic model, and beats existing benchmarks (by an average gain of +6.41). We then explore two continual pre-training methods: (1) using small amounts of dialectal data for continual fine-tuning, and (2) using parallel Arabic-to-English data and a Translation Language Modeling loss function. We show that both approaches help improve performance on dialectal classification tasks (+4.64 average gain) when used on monolingual models. |
Soumajyoti Sarkar · Saab Mansour · Sailik Sengupta · Sheng Zha · Kaixiang Lin 🔗 |
-
|
Depth-Wise Attention (DWAtt): A Layer Fusion Method for Data-Efficient Classification
(
Poster
)
Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Usually, only the features from the last layer are used when adapting to new tasks or data. We argue that when using or fine-tuning deep pretrained models, intermediate-layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of the samples or steps needed. To test this, we propose a new layer fusion method, Depth-Wise Attention (DWAtt), to help re-surface signals from non-final model layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline, all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows a 3.68-9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints. |
Muhammad ElNokrashy · Badr AlKhamissi · Mona Diab 🔗 |
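A minimal sketch of a depth-wise attention fusion module, assuming PyTorch: each token's query (taken from the last layer) attends over that token's hidden states across all layers. This is an illustrative reading of the idea, not the paper's exact architecture.

```python
# Illustrative sketch: attend over encoder depth instead of using only
# the final layer's hidden states.
import torch
import torch.nn as nn

class DepthWiseAttention(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, all_layer_states):
        # all_layer_states: tuple of (batch, seq, hidden), one per layer
        h = torch.stack(all_layer_states, dim=2)      # (batch, seq, layers, hidden)
        q = self.query(h[:, :, -1:, :])               # query from the last layer
        k, v = self.key(h), self.value(h)
        scores = (q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5
        weights = scores.softmax(dim=2)               # attention over depth
        return (weights * v).sum(dim=2)               # fused (batch, seq, hidden)

# Usage with a model called with output_hidden_states=True:
# fused = DepthWiseAttention(768)(outputs.hidden_states)
```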
-
|
SymbolicGPT: A Generative Transformer Model for Symbolic Regression
(
Poster
)
SlidesLive Video » Symbolic regression is the task of identifying a mathematical expression that best fits a provided dataset of input and output values. Due to the richness of the space of mathematical expressions, symbolic regression is generally a challenging problem. While conventional approaches based on genetic evolution algorithms have been used for decades, deep learning-based methods are relatively new and an active research area. In this work, we present SymbolicGPT, a novel transformer-based language model for symbolic regression. This model exploits the advantages of probabilistic language models like GPT, including their strong performance, scalability, and flexibility. Through comprehensive experiments, we show that our model performs strongly compared to competing models. |
Mojtaba Valipour · Bowen You · Maysum H Panju · Ali Ghodsi 🔗 |
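One practical detail in transformer-based symbolic regression pipelines is fitting numeric constants after an expression skeleton has been decoded. The sketch below, which assumes NumPy and SciPy and uses a hypothetical decoded skeleton, illustrates only that post-processing step; it is not the SymbolicGPT codebase.

```python
# Illustrative sketch: fit the constants of a decoded expression skeleton
# to the data with an off-the-shelf optimizer.
import numpy as np
from scipy.optimize import minimize

skeleton = "c[0]*np.sin(c[1]*x) + c[2]"          # hypothetical decoded skeleton
x = np.linspace(-3, 3, 200)
y = 2.0 * np.sin(1.5 * x) + 0.5                  # toy target data

def mse(c):
    pred = eval(skeleton, {"np": np, "x": x, "c": c})
    return float(np.mean((pred - y) ** 2))

result = minimize(mse, x0=np.ones(3), method="BFGS")
print(skeleton, "with c =", np.round(result.x, 3))
```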
-
|
Using Informative Data Subsets for Efficient Training of Large Language Models: An Initial Study
(
Poster
)
SlidesLive Video » Language Models (LMs) are pretrained on large unlabeled corpora through self-supervision tasks and have become ubiquitous in NLP applications. Recent trends indicate that the generalization capability of Large LMs (LLMs) improves tremendously with increasing model capacity and pretraining dataset size. However, this also leads to inefficiencies in the form of longer training times, higher compute requirements, and greater environmental impact. Previous works have mostly addressed these inefficiency concerns by improving sample efficiency, architectures, and training objectives, with little focus on data optimization. In this work, we explore whether it is possible to train LLMs on only highly informative subsets of the training data while maintaining their performance. Building on prior work in informative data subset selection, we propose INGENIOUS, a framework that selects highly representative subsets of the training corpus by optimizing a submodular function. We show that INGENIOUS scales to LLM training and empirically demonstrate that it achieves ~99% of the original BERT performance in about 35% of the original training time. |
H S V N S Kowndinya Renduchintala · Krishnateja Killamsetty · Sumit Bhatia · Milan Aggarwal · Ganesh Ramakrishnan · Rishabh Iyer 🔗 |
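A minimal sketch of greedy submodular subset selection with a facility-location objective over (placeholder) sentence embeddings, assuming NumPy. It illustrates the kind of selection a submodular framework performs, not the actual INGENIOUS implementation.

```python
# Illustrative sketch: greedy facility-location selection of a
# representative subset from a pool of embeddings.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))                   # placeholder embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                                   # cosine similarity matrix

def facility_location_greedy(sim, budget):
    selected, best_cover = [], np.zeros(sim.shape[0])
    for _ in range(budget):
        # marginal gain of adding each candidate given current coverage
        gains = np.maximum(sim, best_cover[None, :]).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf                   # never re-pick
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[j])
    return selected

subset = facility_location_greedy(sim, budget=100)
```

At LLM scale the same greedy rule is typically applied with lazy evaluation and on partitioned similarity matrices, since a full N-by-N matrix would not fit in memory.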
-
|
Using Selective Masking as a Bridge between Pre-training and Fine-tuning
(
Poster
)
SlidesLive Video » Pre-training a language model and then fine-tuning it for downstream tasks has produced state-of-the-art results on various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that this pre-training alone might not be sufficient to capture task-specific nuances. We propose a way to tailor a pre-trained BERT model to the downstream task via task-specific masking before the standard supervised fine-tuning. For this, a word list specific to the task is first collected. For example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, each word's importance for the task, called its task score, is measured using the word list, and each word is assigned a masking probability based on its task score. We experiment with different masking functions that map the task score to a masking probability. The BERT model is further trained on the MLM objective with this masking strategy, followed by standard supervised fine-tuning on the downstream tasks. Results on these tasks show that the selective masking strategy outperforms random masking, indicating its effectiveness. |
Tanish Lad · Himanshu Maheshwari · Shreyas Kottukkal · Radhika Mamidi 🔗 |
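A minimal sketch of one possible masking function, assuming a hypothetical task-score dictionary: each word's masking probability grows linearly with its task score, and the resulting masks would feed a continued MLM phase. The exact masking functions studied in the paper may differ.

```python
# Illustrative sketch: task-score-driven selective masking.
import random

task_scores = {"excellent": 0.9, "terrible": 0.85, "movie": 0.1}  # hypothetical
BASE_P, MAX_P = 0.05, 0.5

def masking_probability(word: str) -> float:
    score = task_scores.get(word.lower(), 0.0)
    return BASE_P + (MAX_P - BASE_P) * score        # linear in the task score

def selectively_mask(tokens, mask_token="[MASK]"):
    return [mask_token if random.random() < masking_probability(t) else t
            for t in tokens]

print(selectively_mask("the movie was excellent".split()))
```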
-
|
Improved Knowledge Distillation by Utilizing Backward Pass Knowledge in Neural Networks
(
Poster
)
SlidesLive Video » Knowledge distillation (KD) is one of the prominent techniques for model compression. Although conventional KD is effective for matching the two networks over the given data points, there is no guarantee that these models will match in regions for which we do not have enough training samples. In this work, we address this problem by generating new auxiliary training samples based on knowledge extracted from the backward pass, identifying the areas where the student diverges most from the teacher. This is done by perturbing data samples in the direction of the gradient of the difference between the student and the teacher. We study the effect of the proposed method on various tasks in different domains, including image and NLP tasks with considerably smaller student networks. Our experiments show that the proposed method achieves superior results over other baselines. |
Aref Jafari · Mehdi Rezaghoizadeh · Ali Ghodsi 🔗 |
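A minimal sketch of the auxiliary-sample idea, assuming PyTorch and toy stand-in networks on continuous inputs (e.g. embeddings): perturb each input along the gradient of the teacher-student output discrepancy, then distill on the perturbed samples as well. This is a simplified reading, not the authors' code.

```python
# Illustrative sketch: auxiliary samples from backward-pass knowledge.
import torch
import torch.nn.functional as F

def auxiliary_samples(teacher, student, x, eps=0.05):
    x_adv = x.clone().detach().requires_grad_(True)
    divergence = F.mse_loss(student(x_adv), teacher(x_adv).detach())
    grad_x, = torch.autograd.grad(divergence, x_adv)
    # step toward where the student diverges most from the teacher
    return (x_adv + eps * grad_x.sign()).detach()

# Toy usage with stand-in networks:
teacher = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 4))
student = torch.nn.Sequential(torch.nn.Linear(16, 4))
x = torch.randn(8, 16)
x_aux = auxiliary_samples(teacher, student, x)   # distill on x and x_aux
```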
-
|
Topic Segmentation in the Wild: Towards Segmentation of Semi-structured & Unstructured Chats
(
Poster
)
SlidesLive Video » Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP that can assist many downstream tasks. However, current work on topic segmentation often focuses on the segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not transfer well to unstructured texts; and (b) training from scratch with only a relatively small dataset from the target unstructured domain improves segmentation results by a significant margin. |
Reshmi Ghosh · Sharanya Kamath · Soundararajan Srinivasan · Dhuri Shrivastava · Samyadeep Basu · Harjeet Kajal 🔗 |
-
|
A Theory of Unsupervised Translation for Understanding Animal Communication
(
Poster
)
SlidesLive Video » Unsupervised translation generally refers to the challenging task of translating between two languages without parallel translations, i.e., from two separate monolingual corpora. In this work, we propose an information-theoretic framework of unsupervised translation that is well suited even to the case where the source language is that of highly intelligent animals, such as whales, and the target language is a human language, such as English. We identify two conditions that together allow for unsupervised translation: (1) there is access to a prior distribution over the target language that estimates the likelihood that a sentence was translated from the source language, and (2) most alterations of translations are deemed implausible by the prior. We then give an (inefficient) algorithm which, given access to the prior and unlabeled source examples as input, outputs a provably accurate translation function, and we prove upper bounds on the number of samples it needs. Surprisingly, our analysis suggests that the amount of source data required for unsupervised translation is not significantly greater than that needed for supervised translation. To support the viability of our theory, we propose a simplified probabilistic language model, the random sub-tree language model, in which sentences correspond to paths in a randomly labeled tree. We prove that random sub-tree languages satisfy conditions (1)-(2) with high probability and are therefore translatable by our algorithm. Our theory is motivated by a recent initiative to translate whale communication using modern machine translation techniques; the recordings of whale communication being collected have no parallel human-language data. Our work seeks to inform this ambitious effort by modeling unsupervised translation. We are further motivated by recent empirical work, reported in the machine learning literature, demonstrating that unsupervised translation is possible in certain settings. |
Shafi Goldwasser · David Gruber · Adam Tauman Kalai · Orr Paradise 🔗 |
-
|
Collective Knowledge Graph Completion with Mutual Knowledge Distillation
(
Poster
)
SlidesLive Video » Knowledge graph completion (KGC), the task of predicting missing information from the relational data already present in a knowledge graph (KG), has drawn significant attention in recent years. However, the predictive power of KGC methods is often limited by the completeness of the existing knowledge graphs. In monolingual and multilingual settings, KGs from different sources and languages are potentially complementary to each other. In this paper, we study the problem of multi-KG completion, where we focus on maximizing the collective knowledge from different KGs to alleviate the incompleteness of individual KGs. Specifically, we propose a novel method called CKGC-MKD that uses augmented CompGCN-based encoder models on both the individual KGs and a large connected KG in which seed alignments between KGs are treated as edges for message propagation. Mutual knowledge distillation is additionally employed to maximize the knowledge transfer between the "global" connected KG and the "local" individual KGs. Experimental results on multilingual datasets show that our method outperforms all state-of-the-art models. |
Weihang Zhang · Ovidiu Serban · Jiahao Sun · Yike Guo 🔗 |
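A minimal sketch of a generic mutual distillation loss between a "local" and a "global" model's score distributions over candidate entities, assuming PyTorch. The exact CKGC-MKD objective may differ.

```python
# Illustrative sketch: symmetric (mutual) knowledge distillation between
# two models' candidate-entity score distributions.
import torch
import torch.nn.functional as F

def mutual_kd_loss(local_scores, global_scores, temperature=2.0):
    t = temperature
    local_log_p = F.log_softmax(local_scores / t, dim=-1)
    global_log_p = F.log_softmax(global_scores / t, dim=-1)
    # each model distills from a detached copy of the other
    loss_local = F.kl_div(local_log_p, global_log_p.detach().exp(),
                          reduction="batchmean") * t * t
    loss_global = F.kl_div(global_log_p, local_log_p.detach().exp(),
                           reduction="batchmean") * t * t
    return loss_local + loss_global
```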
-
|
Gradient Knowledge Distillation for Pre-trained Language Models
(
Poster
)
SlidesLive Video » Knowledge distillation (KD) is an effective framework for transferring knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models transfer knowledge by aligning instance-wise outputs between the teacher and the student, while neglecting an important knowledge source, namely the teacher's gradient. The gradient characterizes how the teacher responds to changes in its inputs, which we assume helps the student better approximate the teacher's underlying mapping function. We therefore propose Gradient Knowledge Distillation (GKD), which incorporates a gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods in student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, greatly improving interpretability. |
Lean Wang · Lei Li · Xu Sun 🔗 |
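A minimal sketch of one way to align gradients, assuming PyTorch: compare the teacher's and student's gradients with respect to a shared input embedding and penalize their distance. The paper's exact GKD formulation may differ.

```python
# Illustrative sketch: gradient-alignment term for distillation.
import torch
import torch.nn.functional as F

def gradient_alignment_loss(teacher, student, embeddings, labels):
    emb_t = embeddings.clone().detach().requires_grad_(True)
    emb_s = embeddings.clone().detach().requires_grad_(True)
    loss_t = F.cross_entropy(teacher(emb_t), labels)
    loss_s = F.cross_entropy(student(emb_s), labels)
    grad_t, = torch.autograd.grad(loss_t, emb_t)
    # keep the graph so the alignment term trains the student
    grad_s, = torch.autograd.grad(loss_s, emb_s, create_graph=True)
    return F.mse_loss(grad_s, grad_t.detach())
```

In practice this term would be added, with a weighting coefficient, to the usual task and output-distillation losses.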
-
|
Efficient Few-Shot Learning Without Prompts
(
Poster
)
SlidesLive Video » Recent few-shot learning methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are highly sensitive to handcrafted prompts and typically require billion-parameter language models to achieve high accuracy. To address these shortcomings, we propose SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of labeled text pairs in a contrastive, Siamese manner. The resulting model is then used to generate rich text embeddings, which are used to train a classification head. This simple framework requires no prompts or verbalizers and achieves high accuracy with orders of magnitude fewer parameters and less runtime than existing techniques. Our experiments show that SetFit achieves results competitive with PEFT and PET techniques, and outperforms them on a variety of classification tasks. |
Oren Pereg · Daniel Korat · Moshe Wasserblat · Lewis Tunstall · Unso Eun Seo Jo · Luke Bates · Nils Reimers 🔗 |
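A minimal sketch of the two-stage recipe using sentence-transformers and scikit-learn directly (rather than the released SetFit library, whose API may differ): contrastive fine-tuning on labeled pairs, then a lightweight classification head on the tuned embeddings. The toy texts and model name are placeholders.

```python
# Illustrative sketch: contrastive ST fine-tuning + simple classifier head.
from sentence_transformers import InputExample, SentenceTransformer, losses
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader

texts = ["great film", "loved it", "awful plot", "boring mess"]   # tiny toy set
labels = [1, 1, 0, 0]

# Stage 1: pairs are similar if labels match, dissimilar otherwise.
pairs = [InputExample(texts=[texts[i], texts[j]],
                      label=float(labels[i] == labels[j]))
         for i in range(len(texts)) for j in range(i + 1, len(texts))]
model = SentenceTransformer("all-MiniLM-L6-v2")
model.fit(train_objectives=[(DataLoader(pairs, shuffle=True, batch_size=8),
                             losses.CosineSimilarityLoss(model))],
          epochs=1, show_progress_bar=False)

# Stage 2: train a lightweight head on the tuned embeddings.
head = LogisticRegression().fit(model.encode(texts), labels)
print(head.predict(model.encode(["what a wonderful movie"])))
```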
-
|
Fast DistilBERT on CPUs
(
Spotlight
)
Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximizing throughput under strict latency constraints, which often prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running fast Transformer models on CPUs, combining hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model that shows minimal accuracy loss on the SQuADv1.1 question-answering benchmark, and report throughput results under typical production constraints and environments. Our pipeline outperforms the state-of-the-art Neural Magic DeepSparse runtime by up to 50% and achieves up to a 4.1x speedup over ONNX Runtime. |
Haihao Shen · Ofir Zafrir · Bo Dong · Hengyu Meng · Xinyu Ye · Zhe Wang · Yi Ding · Hanwen Chang · Guy Boudoukh · Moshe Wasserblat 🔗 |
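The pipeline above relies on a custom sparse/quantized runtime; as a much simpler stand-in, the sketch below shows generic INT8 dynamic quantization of DistilBERT's linear layers for CPU inference with stock PyTorch and transformers. It is not the authors' engine and will not reproduce their speedups.

```python
# Illustrative sketch: generic INT8 dynamic quantization of DistilBERT
# for CPU inference (a simpler stand-in for a specialized runtime).
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)   # INT8 weights for Linear layers

inputs = tokenizer("Who wrote Hamlet?",
                   "Hamlet was written by Shakespeare.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = quantized(**inputs)
```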
-
|
PCFG-based Natural Language Interface Improves Generalization for Controlled Text Generation
(
Spotlight
)
SlidesLive Video » Existing work on controlled text generation (CTG) assumes a control interface of categorical attributes. In this work, we propose a natural language interface: we craft a PCFG to embed the control attributes into natural language commands and propose variants of existing CTG models that take commands as input. We design tailored experiments to test the models' generalization abilities. The results show that our PCFG-based command generation approach handles unseen commands more effectively than fixed-set templates, and that our proposed NL models can effectively generalize to unseen attributes. |
Jingyu Zhang · Jim Glass · Tianxing He 🔗 |
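A minimal sketch of sampling natural-language commands from a toy PCFG of our own, assuming plain Python; the grammar in the paper is more elaborate, but the mechanism of expanding weighted productions into command strings is the same in spirit.

```python
# Illustrative sketch: a toy PCFG that turns control attributes into
# natural-language commands by sampling weighted production rules.
import random

PCFG = {
    "CMD":       [(0.5, ["write a", "SENTIMENT", "review", "LENGTH"]),
                  (0.5, ["generate", "LENGTH", "text with a", "SENTIMENT", "tone"])],
    "SENTIMENT": [(0.5, ["positive"]), (0.5, ["negative"])],
    "LENGTH":    [(0.5, ["in one sentence"]), (0.5, ["in a short paragraph"])],
}

def sample(symbol="CMD"):
    if symbol not in PCFG:                      # terminal string
        return symbol
    weights, expansions = zip(*PCFG[symbol])
    expansion = random.choices(expansions, weights=weights, k=1)[0]
    return " ".join(sample(s) for s in expansion)

print(sample())   # e.g. "write a negative review in one sentence"
```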
-
|
Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement
(
Poster
)
Self-supervised speech representation learning aims to extract meaningful factors from the speech signal that can later be used across different downstream tasks, such as speech and/or emotion recognition. Existing models, such as HuBERT, however, can be fairly large and thus may not be suitable for edge speech applications. Moreover, realistic applications typically involve speech corrupted by noise and room reverberation, so models need to provide representations that are robust to such environmental factors. In this study, we build on the DistilHuBERT model, which distills HuBERT to a fraction of its original size, and introduce three modifications: (i) we augment the training data with noise and reverberation, while the student model distills the clean representations from the teacher; (ii) we introduce a curriculum learning approach in which increasing levels of noise are introduced as the model trains, helping convergence and the creation of more robust representations; and (iii) we introduce a multi-task learning approach in which the model also reconstructs the clean waveform jointly with the distillation task, acting as an enhancement step that ensures additional environmental robustness of the representation. Experiments on three SUPERB tasks show the advantages of the proposed method not only over the original DistilHuBERT, but also over the original HuBERT, demonstrating its suitability for "in the wild" edge speech applications. |
Heitor Guimarães · Arthur Pimentel · Anderson R. Avila · Mehdi Rezaghoizadeh · Tiago H Falk 🔗 |
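A minimal sketch of a noise-augmentation curriculum, assuming NumPy and placeholder waveforms: the target SNR decreases linearly over epochs, so later training sees harder conditions. The authors' schedule and augmentation details may differ.

```python
# Illustrative sketch: mix noise into clean audio at a curriculum-scheduled SNR.
import numpy as np

def snr_schedule(epoch, total_epochs, start_db=20.0, end_db=0.0):
    frac = epoch / max(total_epochs - 1, 1)
    return start_db + frac * (end_db - start_db)    # linearly harder over time

def mix_at_snr(clean, noise, snr_db):
    noise = noise[: len(clean)]
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                      # 1 s of placeholder audio
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_schedule(epoch=3, total_epochs=10))
```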
-
|
Attribute Controlled Dialogue Prompting
(
Spotlight
)
Prompt-tuning has become an increasingly popular parameter-efficient method for steering large pretrained language models toward downstream tasks. However, both discrete and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in tasks such as open-domain dialogue generation. In this paper, we present a novel instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control codes, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated with both automated metrics and human judgments, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning while using only 5%-6% of the total parameters. |
Runcheng Liu · Ahmad Rashid · Ivan Kobyzev · Mehdi Rezaghoizadeh · Pascal Poupart 🔗 |
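A minimal sketch of an instance-specific prompt generator, assuming PyTorch: a control-code embedding is mapped to a set of soft prompt vectors that would be prepended to the frozen LM's input embeddings. This is a simplified stand-in, not the paper's exact architecture.

```python
# Illustrative sketch: soft prompts generated from instance-level control codes.
import torch
import torch.nn as nn

class ControlPromptGenerator(nn.Module):
    def __init__(self, num_codes: int, prompt_len: int, hidden: int):
        super().__init__()
        self.code_emb = nn.Embedding(num_codes, hidden)
        self.proj = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, prompt_len * hidden))
        self.prompt_len, self.hidden = prompt_len, hidden

    def forward(self, code_ids):                         # (batch,)
        p = self.proj(self.code_emb(code_ids))
        return p.view(-1, self.prompt_len, self.hidden)  # (batch, L, hidden)

# Usage (conceptual): prepend the generated prompts to the frozen LM's
# input embeddings, e.g.
# prompts = ControlPromptGenerator(8, 10, 768)(codes)
# inputs_embeds = torch.cat([prompts, word_embeddings], dim=1)
```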
-
|
PromptDA: Label-guided Data Augmentation for Prompt-based Few Shot Learners
(
Spotlight
)
SlidesLive Video » Recent advances in large pre-trained language models (PLMs) have led to impressive gains on natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs relies heavily on large amounts of labeled instances, which are usually hard to obtain. Prompt-based tuning of PLMs has proven valuable for various few-shot tasks. Existing work on prompt-based tuning for few-shot NLU mainly focuses on deriving proper label words with a verbalizer or generating prompt templates that elicit semantics from PLMs. Conventional data augmentation methods have also been verified to be useful for few-shot tasks, but there are currently few augmentation methods designed for the prompt-based tuning paradigm. We therefore study the new problem of data augmentation for prompt-based few-shot learners. Since label semantics are essential in prompt-based tuning, we propose PromptDA, a novel label-guided data augmentation method that exploits enriched label semantic information for augmentation. Extensive experimental results on few-shot text classification tasks show that our framework achieves superior performance by effectively leveraging label semantics and data augmentation for natural language understanding. |
Canyu Chen · Kai Shu 🔗 |
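A minimal sketch of label-guided augmentation with a hypothetical verbalizer: each instance is paired with several related label words, yielding multiple prompt-style training examples. The actual PromptDA procedure may differ in how label words are derived and used.

```python
# Illustrative sketch: expand each labeled instance into several
# prompt-style examples using related label words.
label_words = {                                     # hypothetical verbalizer
    "positive": ["great", "good", "wonderful"],
    "negative": ["terrible", "bad", "awful"],
}
template = "{text} It was [MASK]."

def augment(text: str, label: str):
    return [(template.format(text=text), word) for word in label_words[label]]

for prompt, target in augment("The acting kept me hooked.", "positive"):
    print(prompt, "->", target)
```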
-
|
TBD7
(
KeyNote Talk
)
|
Kenneth Heafield 🔗 |