Transfer learning from large pre-trained language models (PLMs) has become the de facto approach for a wide range of natural language processing tasks. Current transfer learning methods, combined with PLMs, have seen outstanding success in transferring knowledge to new tasks, domains, and even languages. However, existing methods, including fine-tuning, in-context learning, parameter-efficient tuning, and semi-parametric models with knowledge augmentation, still lack consistently good performance across different tasks, domains, varying sizes of data resources, and diverse textual inputs.
This workshop aims to invite researchers from different backgrounds to share their latest work on efficient and robust transfer learning methods, discuss the challenges and risks of transfer learning models when deployed in the wild, understand positive and negative transfer, and debate future directions.
Sat 6:50 a.m. - 7:00 a.m. | Opening Remarks (Intro)
See our website for the updated schedule: https://tl4nlp.github.io/Program/
Alon Albalak
Sat 7:00 a.m. - 7:45 a.m. | Modular and Composable Transfer Learning (Talk)
With pre-trained transformer-based models continuously increasing in size, there is a dire need for parameter-efficient and modular transfer learning strategies. In this talk, we will cover adapter-based fine-tuning, where instead of fine-tuning all weights of a model, small neural network components are introduced at every layer. While the pre-trained parameters are frozen, only the newly introduced adapter weights are fine-tuned, encapsulating the downstream task information in designated parts of the model. We will demonstrate that adapters are modular components that can be composed for improvements on a target task, and show how they can be used for out-of-distribution generalization, taking zero-shot cross-lingual transfer as an example. Finally, we will discuss how adding modularity during pre-training can mitigate catastrophic interference and consequently lift the curse of multilinguality.
Jonas Pfeiffer
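As a rough illustration of the adapter recipe described in this talk (not Pfeiffer's exact implementation), the sketch below inserts a small trainable bottleneck module after a frozen pre-trained layer; the bottleneck size and placement are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

class AdaptedLayer(nn.Module):
    """Wraps a pretrained (frozen) layer and adds a trainable adapter on top."""
    def __init__(self, pretrained_layer: nn.Module, hidden_size: int):
        super().__init__()
        self.layer = pretrained_layer
        for p in self.layer.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.adapter = Adapter(hidden_size)  # only these weights are fine-tuned

    def forward(self, x):
        return self.adapter(self.layer(x))
```

Because only the adapter parameters require gradients, the task-specific knowledge is encapsulated in a small, swappable component per layer, which is what makes composition across tasks and languages possible.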
Sat 7:45 a.m. - 8:30 a.m. | Automating Auxiliary Learning (Talk)
When faced with data-starved or highly complex end-tasks, it is commonplace for machine learning practitioners to introduce auxiliary objectives as supplementary learning signals. While much work has been done to formulate useful auxiliary objectives, their construction is still an art that proceeds by slow and tedious hand-design, and intuitions about how and when these objectives improve end-task performance have had limited theoretical backing. In this talk I will present two works. First, I will discuss the widely used pre-train and fine-tune paradigm and argue that when we know an end-task of interest beforehand, we should also consider joint multi-task learning as a credible alternative. I will present META-TARTAN, an algorithm we propose that automatically learns the weights for the multi-task objective. Second, I will present AANG, an approach for automatically generating a suite of auxiliary objectives. AANG deconstructs existing objectives within a novel unified taxonomy, identifies connections between them, and generates new ones based on the uncovered structure. This leads to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. We empirically verify that our automated auxiliary learning pipeline yields strong improvements over competitive baselines in continued-training experiments with a pre-trained model on five NLP end-tasks.
Graham Neubig
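To make the notion of a weighted multi-task objective concrete, here is a simplified stand-in (not META-TARTAN or AANG themselves): task weights are parameterized by learnable logits and combined with the end-task and auxiliary losses. In META-TARTAN the weights are updated with a meta-gradient on the end-task validation loss rather than jointly with this sum; the softmax parameterization is an assumption.

```python
import torch
import torch.nn as nn

class WeightedMultiTaskLoss(nn.Module):
    """Combine an end-task loss with auxiliary losses using learnable weights."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tasks))  # one logit per task

    def forward(self, task_losses):
        # task_losses: sequence of scalar losses, end-task first.
        # NOTE: in META-TARTAN these weights are meta-learned against the
        # end-task dev loss, not optimized jointly with this weighted sum.
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * l for w, l in zip(weights, task_losses))

# Usage sketch:
# mt_loss = WeightedMultiTaskLoss(num_tasks=3)
# total = mt_loss([end_task_loss, aux_loss_mlm, aux_loss_tapt])
```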
Sat 8:30 a.m. - 9:15 a.m. | Fine-Tuning without Distortion: Improving Robustness to Distribution Shifts (Talk)
Fine-tuning foundation models (such as BERT or CLIP) is one of the most successful ways to achieve high accuracy. But achieving high in-distribution accuracy is not enough: high-stakes applications such as self-driving cars, medical diagnosis, and poverty mapping also require models that generalize to circumstances not seen in the fine-tuning distribution. To examine this, we also evaluate models on out-of-distribution (OOD) test data. We show that standard full fine-tuning of all the model's parameters can distort pretrained information and underperform OOD. Instead, we explain why selectively tuning parts of the model (e.g., prefixes, linear probes, embedding layers) can preserve pretrained information and lead to better OOD performance. Our analysis suggests the simple two-step strategy of linear probing then full fine-tuning (LP-FT), which improves pretrained features without distortion and leads to even higher accuracies. These works underscore the importance of preserving pretrained knowledge when using powerful pretrained models.
Percy Liang · Ananya Kumar
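A minimal sketch of the two-step LP-FT strategy described above: first train only a linear head on frozen features, then unfreeze everything and fine-tune from the probed head. The optimizers, learning rates, and epoch counts are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

def lp_ft(backbone: nn.Module, head: nn.Linear, loader, epochs_lp=5, epochs_ft=5):
    ce = nn.CrossEntropyLoss()

    # Step 1: linear probing -- backbone frozen, only the head is trained.
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs_lp):
        for x, y in loader:
            loss = ce(head(backbone(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Step 2: full fine-tuning, initialized from the linear probe.
    for p in backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-5)
    for _ in range(epochs_ft):
        for x, y in loader:
            loss = ce(head(backbone(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone, head
```

Starting full fine-tuning from a good linear probe means the head does not need large gradient updates early on, which is the intuition for why pretrained features are distorted less.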
Sat 9:15 a.m. - 9:30 a.m. | Break
Sat 9:30 a.m. - 10:30 a.m. | Debate: Sara Hooker & Kyunghyun Cho (Debate/Discussion)
Sat 10:30 a.m. - 12:00 p.m. | Lunch
Sat 12:00 p.m. - 12:45 p.m. | Cross-lingual Transfer for Named Entity Recognition: A Study on African Languages (Talk)
Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resource and low-resource languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. Similarly, in limited labelled-data scenarios, cross-lingual transfer learning with PLMs provides an opportunity for fast adaptation to new languages in both zero- and few-shot settings. In this talk, we will discuss five components of effective cross-lingual transfer for the named entity recognition (NER) task: (1) availability of typologically diverse multilingual benchmark datasets for transfer; (2) development of highly effective and easy-to-adapt multilingual PLMs; (3) building effective and parameter-efficient cross-lingual transfer frameworks; (4) using the same domain for both source and target transfer languages; and (5) choosing the best source transfer language for adaptation. Our evaluation on MasakhaNER, a benchmark dataset for 21 African languages, shows that each of these components significantly improves transfer results.
David I Adelani
Sat 12:45 p.m. - 1:30 p.m. | Training Language Models to Negotiate in the Game of Diplomacy (Talk)
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. I will describe how we adapted language models to negotiate with people, reaching human-level performance in Diplomacy. A typical game involves generating hundreds of messages, which must be grounded in the game state, dialogue history, and the agent's intended actions - all in a domain far from the pre-training data. The core of our approach is a method for linking language models to a symbolic planning module. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Mike Lewis
Sat 1:30 p.m. - 3:00 p.m. | Poster Session
Alon Albalak
Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing (Poster)
Existing work on generalization in Text-to-SQL semantic parsing has been restricted to a zero-shot cross-domain setting. In this paper, we introduce Spider-Gen, a Text-to-SQL benchmark for developing a paradigm of transfer learning across distinct dimensions of generalization in Text-to-SQL semantic parsing. The Spider-Gen benchmark focuses on few-shot adaptation for cross-domain, lexical, and structural generalization of Text-to-SQL models. Through our experiments with the Spider-Gen dataset, we show that Seq2Seq language models struggle to generalize to changes in data distribution, lexical changes in database schemas, and changes in SQL query complexity. Our experiments also reveal that few-shot fine-tuning helps Text-to-SQL models generalize across these changes. However, such few-shot adaptation comes at a cost to the knowledge learnt during training. Hence, we also explore parameter-efficient fine-tuning methods to overcome the limitations of Seq2Seq Text-to-SQL models. We release the Spider-Gen dataset publicly to facilitate further research in generalization and transfer learning across various dimensions of Text-to-SQL semantic parsing.
Rajaswa Patil · Manasi Patwardhan · Shirish Karande · Lovekesh Vig · Gautam Shroff
SciRepEval: A Multi-Format Benchmark for Scientific Document Representations (Poster)
Learned representations of scientific documents can serve as valuable input features for downstream tasks without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks across four formats: classification, regression, ranking, and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different task format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state of the art by up to 1.5 points absolute.
Amanpreet Singh · Mike D'Arcy · Arman Cohan · Doug Downey · Sergey Feldman
PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales (Poster)
Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training/prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation, without any assurance that the generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, PINTO leverages the rationales more faithfully than competitive baselines do.
Peifeng Wang · Aaron Chan · Filip Ilievski · Muhao Chen · Xiang Ren
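A rough sketch of the counterfactual-regularization idea in this abstract (not the authors' exact objective): the task loss is computed with the real rationale in context, and a regularizer penalizes confident predictions when the rationale is perturbed by pushing the output distribution toward uniform. The KL-to-uniform regularizer and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def counterfactual_regularized_loss(model, input_with_rationale,
                                    input_with_perturbed_rationale,
                                    labels, reg_weight=1.0):
    # Standard task loss when the correct rationale is in context.
    logits = model(input_with_rationale)                    # (batch, num_labels)
    task_loss = F.cross_entropy(logits, labels)

    # Regularizer: with a corrupted rationale, the prediction should be
    # uninformative. KL(uniform || p_perturbed) is minimized when the
    # perturbed prediction approaches the uniform distribution.
    perturbed_logits = model(input_with_perturbed_rationale)
    log_probs = F.log_softmax(perturbed_logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    reg = F.kl_div(log_probs, uniform, reduction="batchmean")

    return task_loss + reg_weight * reg
```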
Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer (Poster)
In this work, we analyze a pre-trained mT5 model (Xue et al., 2020) to discover the attributes of cross-lingual connections learned by this model. Through a statistical interpretation framework over 90 language pairs across three tasks, we show that transfer performance can be well modeled by a few linguistic and data-derived features. These observations enable us to interpret the cross-lingual understanding of the mT5 model. Through these observations, one can favorably choose the best source language for a task and anticipate its training data demands. A key finding of this work is that similarity of syntax, morphology, and phonology are good predictors of cross-lingual transfer, significantly more so than the mere lexical similarity of languages. For a given language, we are able to predict zero-shot performance, which increases on a logarithmic scale with the number of few-shot target-language data points.
Benjamin Muller · Deepanshu Gupta · Jean-Philippe Fauconnier · Siddharth Patwardhan · David Vandyke · Sachin Agarwal
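To illustrate the kind of statistical modeling described above (a generic sketch, not the paper's framework), one can regress observed transfer scores on per-pair features; the feature names mentioned in the docstring are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_transfer_model(features: np.ndarray, transfer_scores: np.ndarray):
    """features: one row per (source, target) language pair, with hypothetical
    columns such as syntactic similarity, phonological similarity, lexical
    overlap, and log amount of target-language data.
    transfer_scores: observed downstream performance for each pair."""
    model = LinearRegression().fit(features, transfer_scores)
    # Coefficients indicate which features best explain transfer; R^2 measures fit.
    return model.coef_, model.score(features, transfer_scores)
```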
Downstream Datasets Make Surprisingly Good Pretraining Corpora (Poster)
For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus despite using far less data, outperforming the latter on a number of the datasets for each model. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.
Kundan Krishna · Saurabh Garg · Jeffrey Bigham · Zachary Lipton
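A heavily hedged sketch of the self-pretraining recipe (pretrain a masked language model on the downstream task's own text, then fine-tune on the labels), using the Hugging Face transformers library; the toy dataset, architecture choice, and hyperparameters are placeholders, and the paper's exact setup may differ.

```python
from transformers import (AutoConfig, AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")

# Toy stand-in for the downstream task's text (labels are dropped for MLM).
texts = ["an example downstream training sentence.", "another unlabeled sentence."]
enc = tok(texts, truncation=True)
downstream_text_dataset = [{"input_ids": i, "attention_mask": m}
                           for i, m in zip(enc["input_ids"], enc["attention_mask"])]

# Step 1: self-pretraining -- a randomly initialized encoder trained with the
# MLM objective on the downstream text only.
mlm_model = AutoModelForMaskedLM.from_config(AutoConfig.from_pretrained("roberta-base"))
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="self-pretrained", num_train_epochs=10),
    train_dataset=downstream_text_dataset,
    data_collator=collator,
).train()
mlm_model.save_pretrained("self-pretrained")

# Step 2: fine-tune the self-pretrained encoder on the labeled downstream task.
clf = AutoModelForSequenceClassification.from_pretrained("self-pretrained", num_labels=2)
```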
MetaXCR: Reinforcement-Based Meta-Transfer Learning for Cross-Lingual Commonsense Reasoning (Poster)
Commonsense reasoning (CR) has been studied in many domains and has achieved great progress with the aid of large datasets. Unfortunately, most existing CR datasets are built in English, so most previous work focuses on English. Furthermore, as the annotation of commonsense reasoning is costly, it is impossible to build a large dataset for every novel task. Therefore, there are growing appeals for cross-lingual low-resource commonsense reasoning, which aims to leverage diverse existing English datasets to help the model adapt to new cross-lingual target datasets with limited labeled data. In this paper, we propose a multi-source adapter for cross-lingual low-resource commonsense reasoning (MetaXCR). In this framework, we first extend meta learning by incorporating multiple training datasets to learn a generalized task adapter across different tasks. Then, we further introduce a reinforcement-based sampling strategy to help the model sample the source task that is most helpful to the target task. Finally, we introduce two types of cross-lingual meta-adaptation methods to enhance the performance of models on target languages. Extensive experiments demonstrate that MetaXCR is superior to state-of-the-art methods, while being trained with fewer parameters than other work.
Jie He · Yu Fu
Language Modelling with Pixels (Poster)
Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.
Phillip Rust · Jonas Lotz · Emanuele Bugliarello · Elizabeth Salesky · Miryam de Lhoneux · Desmond Elliott
Learning Cross-Database Transfer of Text-queries for Adapting Text-to-SQL Parsers (Poster)
Modern Text-to-SQL semantic parsers struggle when tested on database schemas unseen at training time. Further, model adaptation to a new database is challenging owing to the zero availability of text queries for the target database until the initial deployment of the parser in the real world. We present ReFill, a framework for transferring text queries from existing databases to a target database. ReFill retrieves diverse existing text queries and masks their source-schema tokens, followed by editing and refilling with target-schema tokens, to transfer text queries to the target schema. We show that this process leads to significantly more diverse text than is achievable by using an SQL-to-Text generation model trained to directly translate SQL queries into natural text. Experiments across multiple relational databases establish that finetuning a semantic parser on the text synthesized by ReFill offers consistent performance gains over prior data-augmentation methods.
Abhijeet Awasthi · Ashutosh Sathe · Sunita Sarawagi
Poly-S: Analyzing and Improving Polytropon for Data-Efficient Multi-Task Learning (Poster)
Polytropon learns a set of modular skills, which can be re-combined and fine-tuned on novel tasks with limited data. In this paper, we first investigate what makes this method successful. Specifically, we extend the evaluation benchmark to include more datasets and design a series of controlled experiments to isolate the impact of different components. We then propose a new method, Poly-S, which allows for more fine-grained control over the combination of skills, with no additional compute cost at inference time. We evaluate Poly-S on three multi-task NLP benchmarks and observe improvements over strong baselines.
Lucas Page-Caccia · Edoardo Maria Ponti · Liyuan Liu · Matheus Pereira · Nicolas Le Roux · Alessandro Sordoni
Evaluating the Robustness of Biomedical Concept Normalization (Poster)
Biomedical concept normalization involves linking entity mentions in text to standard concepts in knowledge bases. It helps resolve the challenges of standardising ambiguous, variable terms in text and of handling missing links. It is therefore one of the essential tasks of text mining, enabling effective information access and supporting biomedical decision-making. Pre-trained language models (e.g., BERT) achieve impressive performance on this task. It has been observed that such models are insensitive to word-order permutations and vulnerable to adversarial attacks on tasks like text classification and natural language inference. However, the effect of such attacks is unknown for the normalization task, especially in the biomedical domain. In this paper, we propose heuristics-based input transformations (word-level modifications and word-order variations) and adversarial attacks to study the robustness of BERT-based normalization models across various datasets consisting of different biomedical entity types. We conduct experiments across three datasets: NCBI Disease, BC5CDR Disease, and BC5CDR Chemical. We observe that for input transformations, pre-trained models often fail to detect invalid input. On the other hand, our proposed adversarial attacks, which add imperceptible perturbations, affect the ranking of a concept list for a given mention (or vice versa). We also generate natural adversarial examples that lead to a performance degradation of around 30% in the F1-score. Additionally, we explore existing mitigation strategies to help a model recognize invalid inputs.
Sinchani Chakraborty · Harsh Raj · Srishti Gureja · Tanmay Jain · Atif Hassan · Sayantan Basu
Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning (Poster)
When transferring a pretrained language model, common approaches conventionally attach a task-specific classifier to the top layer and adapt all the pretrained layers. We investigate whether one could make a task-specific selection of which subset of the layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g. fine-tuning or adapter-tuning) without sacrificing performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute, needs no training or hyperparameter tuning, and is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
Shuo Xie · Jiahao Qiu · Ankita Pasad · Li Du · Qing Qu · Hongyuan Mei
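A minimal sketch of a within-class vs. between-class variability ratio over one layer's hidden states, as an illustrative approximation of the metric described above (not the authors' exact formula):

```python
import torch

def variability_ratio(hidden_states: torch.Tensor, labels: torch.Tensor) -> float:
    """hidden_states: (num_examples, hidden_dim) pooled representations from one layer.
    labels: (num_examples,) class ids.
    A low ratio suggests the layer is already well-specialized for the task."""
    overall_mean = hidden_states.mean(dim=0)
    within, between = 0.0, 0.0
    for c in labels.unique():
        h_c = hidden_states[labels == c]
        mu_c = h_c.mean(dim=0)
        within += ((h_c - mu_c) ** 2).sum().item()            # scatter around class mean
        between += len(h_c) * ((mu_c - overall_mean) ** 2).sum().item()
    return within / (between + 1e-8)
```

Computing this ratio per layer on a small task-specific corpus requires only forward passes, which is why the metric is cheap enough to guide which layers to adapt.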
DistillEmb: Distilling Word Embeddings via Contrastive Learning (Poster)
Word embeddings powered the early days of neural network-based NLP research. Their effectiveness in small-data regimes keeps them relevant in low-resource environments. However, they are limited in two critical ways: linearly increasing memory requirements and out-of-vocabulary token handling. In this work, we present a technique for distilling word embeddings into a CNN using contrastive learning. This method allows embeddings to be regressed from the characters of a token. The CNN is then used as a pretrained layer, replacing word embeddings. Low-resource languages are the primary beneficiaries of this method, and hence we show its effectiveness on two morphology-rich Semitic languages and on a multilingual NER task comprising 10 African languages. Apart from improving performance and lowering memory usage, the model is data efficient and is capable of transferring word representations to a similar language.
Amanuel Mersha · Stephen Wu
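An illustrative sketch of distilling pretrained word vectors into a character-level CNN with a contrastive (InfoNCE-style) objective; the architecture and loss are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNN(nn.Module):
    """Maps a token's character ids to an embedding of size emb_dim."""
    def __init__(self, num_chars=256, char_dim=32, emb_dim=300):
        super().__init__()
        self.chars = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, emb_dim, kernel_size=3, padding=1)

    def forward(self, char_ids):                      # (batch, max_chars)
        x = self.chars(char_ids).transpose(1, 2)      # (batch, char_dim, max_chars)
        return self.conv(x).max(dim=2).values         # (batch, emb_dim)

def contrastive_distill_loss(pred, target, temperature=0.1):
    """pred: CNN outputs; target: pretrained word embeddings for the same batch.
    Each token's prediction should match its own embedding, not the others'."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)
```

Once trained, the CharCNN replaces the embedding table, so memory no longer grows with vocabulary size and out-of-vocabulary tokens still receive representations from their characters.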
Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts (Poster)
Previous work has shown that there exists a scaling law between the size of Language Models (LMs) and their zero-shot performance on different downstream NLP tasks. In this work, we show that this phenomenon does not hold when evaluating large LMs on tasks with negated prompts; instead, they exhibit an inverse scaling law. We evaluate 9 different tasks with negated prompts on (1) pretrained LMs (OPT & GPT-3) of varying sizes (125M - 175B), (2) LMs further pretrained to generalize to novel prompts (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated prompts. All LM types perform worse on negated prompts as they scale, and show a large gap relative to human performance when comparing the average score on the original and negated prompts. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches to building LMs that actually follow the given instructions. We provide the code and the datasets to explore negated prompts at http://www.omitted.link/.
Joel Jang · Seonghyeon Ye · Minjoon Seo
Extractive Question Answering with Dynamic Query Representation for Free (Poster)
Extractive QA is an important NLP task with numerous real-world applications. The most common method for extractive QA is to encode the input sequence with a pretrained Transformer such as BERT, and then compute the probability of the start and end positions of span answers using two learned query vectors. This method has been shown to be effective and hard to outperform. However, the query vectors are static, meaning they are the same regardless of the input, which limits the model's performance. To address this problem, we propose DyReF (Dynamic Representation for Free), a model that dynamically learns query vectors for free, i.e. without adding any parameters, by concatenating the query vectors with the embeddings of the input tokens of the Transformer layers. In this way, the query vectors can aggregate information from the source sentence and adapt to the question, while the representations of the input tokens also depend on the queries, allowing for better task specialization. We demonstrate empirically that our simple approach outperforms strong baselines on a variety of extractive question answering benchmark datasets. Our code will be made publicly available.
Urchade Zaratiana · Niama El Khbir · Pierre Holat · Nadi Tomeh · Thierry Charnois
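A simplified sketch of the general idea of prepending learned query vectors to the token embeddings so the encoder contextualizes them with the input (not the authors' exact implementation); scoring start/end positions by dot product is an assumption.

```python
import torch
import torch.nn as nn

class DynamicSpanHead(nn.Module):
    """Prepends two learnable query vectors (start, end) to the token embeddings,
    lets the encoder update them together with the input, then scores spans by
    dot product between the contextualized queries and the token states."""
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder
        self.queries = nn.Parameter(torch.randn(2, hidden) * 0.02)  # start, end

    def forward(self, token_embeddings):               # (batch, seq, hidden)
        b = token_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        states = self.encoder(torch.cat([q, token_embeddings], dim=1))
        q_out, tokens = states[:, :2], states[:, 2:]
        start_logits = torch.einsum("bh,bsh->bs", q_out[:, 0], tokens)
        end_logits = torch.einsum("bh,bsh->bs", q_out[:, 1], tokens)
        return start_logits, end_logits
```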
Multi-Task Learning Framework for Extracting Emotion Cause Span and Entailment in Conversations (Poster)
Predicting emotions expressed in text is a well-studied problem in the NLP community. Recently there has been active research in extracting the cause of an emotion expressed in text. Most of the previous work has addressed causal emotion entailment in documents. In this work, we propose neural models to extract emotion cause spans and entailment in conversations. For learning such models, we use the RECCON dataset, which is annotated with cause spans at the utterance level. In particular, we propose MuTEC, an end-to-end multi-task learning framework for extracting emotions, emotion causes, and entailment in conversations. This is in contrast to existing baseline models that use ground-truth emotions to extract the cause. MuTEC performs better than the baselines for most of the data folds provided in the dataset.
Ashwani Bhat · Ashutosh Modi
Do Current Multi-Task Optimization Methods in Deep Learning Even Help? (Poster)
Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empirical validity of these claims. We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches. We highlight alternative strategies that consistently yield improvements to the performance profile and point out common training pitfalls that might cause suboptimal results. Finally, we outline challenges in reliably evaluating the performance of MTO algorithms and discuss potential solutions.
Derrick Xin · Behrooz Ghorbani · Dami Choi · Ankush Garg · Orhan Firat · Justin Gilmer
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning (Poster)
Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g. only 2% of parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by a more substantial amount. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from the backbone network and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but only through the side network and ladder connections. We evaluate our method with various models (T5 and CLIP-T5) on both natural language processing (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory costs of fine-tuning the whole network, while other methods only save 26% with similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The trend also holds in the experiments on vision-and-language tasks, where LST achieves similar accuracy to other PETL methods when training a similar number of parameters while also having 2.7x more memory savings.
Yi-Lin Sung · Jaemin Cho · Mohit Bansal
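A schematic sketch of a ladder side network: a small trainable side path consumes intermediate activations of a frozen backbone through shortcut projections, so gradients never flow through the backbone itself. The dimensions, fusion rule, and pooling are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LadderSideNetwork(nn.Module):
    """Frozen backbone blocks feed their activations, via down-projections
    ("ladders"), into a small side network that makes the prediction."""
    def __init__(self, backbone_blocks, hidden: int, side_hidden: int, num_labels: int):
        super().__init__()
        self.blocks = nn.ModuleList(backbone_blocks)
        for p in self.blocks.parameters():
            p.requires_grad = False                      # backbone is never updated
        self.ladders = nn.ModuleList([nn.Linear(hidden, side_hidden) for _ in self.blocks])
        self.side = nn.ModuleList([nn.Linear(side_hidden, side_hidden) for _ in self.blocks])
        self.head = nn.Linear(side_hidden, num_labels)

    def forward(self, x):
        side_state = 0.0
        for block, ladder, side_layer in zip(self.blocks, self.ladders, self.side):
            with torch.no_grad():                        # no backprop through the backbone
                x = block(x)
            side_state = torch.relu(side_layer(side_state + ladder(x)))
        return self.head(side_state.mean(dim=1))         # pool over the sequence
```

Because the backbone runs under torch.no_grad(), activation memory for backpropagation is only needed for the much smaller side network and ladder connections, which is the source of the memory savings described in the abstract.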
Can you label less by using out-of-domain data? Active & Transfer Learning with Few-shot Instructions (Poster)
Labeling social-media data for custom dimensions of toxicity and social bias is challenging and labor-intensive. Existing transfer and active learning approaches meant to reduce annotation effort require fine-tuning, which suffers from overfitting to noise and can cause domain shift with small sample sizes. In this work, we propose a novel Active Transfer Few-shot Instructions (ATF) approach which requires no fine-tuning. ATF leverages the internal linguistic knowledge of pre-trained language models (PLMs) to facilitate the transfer of information from existing pre-labeled datasets (source-domain task) with minimal labeling effort on unlabeled target data (target-domain task). We demonstrate that our strategy can yield positive transfer, achieving a mean AUC gain of 13.20% compared to no transfer with a large 22B-parameter PLM. We further show that the impact of transfer from the pre-labeled source-domain task decreases with more annotation effort on the target-domain task (a 26% drop in gain between 100 and 2000 annotated examples). Finally, we find that not all transfer scenarios yield a positive gain, which seems related to the PLM's initial performance on the target-domain task.
Rafal Kocielnik · Sara Kangaslahti · Shrimai Prabhumoye · Meena Hari · Michael Alvarez · Anima Anandkumar
Zero-shot Video Moment Retrieval With Off-the-Shelf Models (Poster)
For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching, and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve the performance of previous zero-shot approaches, by at least 2.5x on all metrics, and reduce the gap between zero-shot and state-of-the-art supervised approaches by over 74%. Further, we show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics, and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
Anuj Diwan · Puyuan Peng · Raymond Mooney
This joke is [MASK]: Recognizing Humor and Offense with Prompting (Poster)
Humor is a magnetic component in everyday human interactions and communications. Computationally modeling humor enables NLP systems to entertain and engage with users. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition. We show that prompting performs similarly to finetuning when numerous annotations are available, but gives stellar performance in low-resource humor recognition. The relationship between humor and offense is also inspected by applying influence functions to prompting; we show that models could rely on offense to determine humor during transfer.
Junze Li · Mengjie Zhao · Yubo Xie · Antonis Maronikolakis · Pearl Pu · Hinrich Schuetze
Classifiers are Better Experts for Controllable Text Generation (Poster)
This paper proposes a simple method for controllable text generation based on weighting logits with a free-form classifier, namely CAIF sampling. Using an arbitrary text classifier, we adjust a small part of a language model's logits and guide text generation towards or away from the classifier's prediction. We experimented with toxicity avoidance and sentiment control tasks and showed that the proposed method significantly outperforms the recent PPLM, GeDi, and DExperts approaches on perplexity and on task accuracy measured with an external classifier of generated texts. In addition, compared to other approaches, it is easier to implement and tune, and it has significantly fewer restrictions and requirements.
Askhat Sitdikov · Nikita Balagansky · Daniil Gavrilov · Alexander Markov
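A rough sketch of classifier-guided sampling in the spirit of the method above (reweighting a language model's next-token logits with an attribute classifier's scores); the top-k restriction and the weighting scheme are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def classifier_guided_logits(lm_logits, classifier_scores, alpha=5.0, top_k=100):
    """lm_logits: (vocab,) next-token logits from the language model.
    classifier_scores: (vocab,) log-probability of the desired attribute if the
    corresponding token were appended (scored by an external classifier).
    Only the top-k LM candidates are adjusted, keeping generation fluent while
    steering it towards (alpha > 0) or away from (alpha < 0) the attribute."""
    adjusted = lm_logits.clone()
    top = torch.topk(lm_logits, top_k).indices
    adjusted[top] = lm_logits[top] + alpha * classifier_scores[top]
    return F.log_softmax(adjusted, dim=-1)   # sample the next token from this
```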