Workshop
Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)
Ananth Balashankar · Saurabh Garg · Jindong Gu · Amrith Setlur · Yao Qin · Aditi Raghunathan · Ahmad Beirami
La Nouvelle Orleans Ballroom A+B (level 2)
Recent advances in the capabilities of large foundation models have been catalyzed by repurposing pretrained models to domain specific use cases through few-shot learning methods like prompt-tuning, in-context-learning; and zero-shot learning based on task descriptions. Given a few labeled examples that outline a new task [T5, GPT2, T0, DALL-E, CLIP], these large foundation models have demonstrably improved upon previous few-shot learning benchmarks [T-few, LAION]. We are closer than ever to learn from very few examples; and recent works [Frozen, Flamingo] have proposed methods to use large language and vision transformer models directly on these few examples, instead of human annotation to create large datasets for fine-tuning. The lessons learned from past-work in counterfactual reasoning, domain adaptation, meta-learning, continual learning, and adversarial training have to be revisited with a new lens towards improving robustness of few-shot learning methods or learning from no supervision (i.e., unlabeled data) that scale to multiple tasks in a safe and responsible manner. In addition to leveraging few-shot learning methods with labeled examples, there is also significant potential in harnessing the power of unlabeled data. When labeled and unlabeled data are from the same distribution, semi-supervised learning methods can be modified to now utilize large foundation models that can further improve boost performance over purely few-shot algorithms. Furthermore, similar ideas need to be explored for unsupervised domain adaptation, to improve robustness of fine-tuned methods to distribution shifts when the unlabeled data distribution is much broader than the distribution from which the labeled examples are collected.
Schedule
Fri 6:50 a.m. - 7:00 a.m.
|
Opening Remarks
(
Introduction
)
>
SlidesLive Video |
🔗 |
Fri 7:00 a.m. - 7:30 a.m.
|
Invited Talk Partha Talukdar
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 7:30 a.m. - 8:00 a.m.
|
Invited Talk Anima Anandkumar
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 8:00 a.m. - 8:30 a.m.
|
Coffee Break
(
Break
)
>
|
🔗 |
Fri 8:30 a.m. - 9:00 a.m.
|
Invited Talk Alex Beutel
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 9:00 a.m. - 9:30 a.m.
|
Invited Talk Tian Li
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 9:30 a.m. - 10:00 a.m.
|
Invited Talk Srijan Kumar
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 10:00 a.m. - 11:30 a.m.
|
Lunch Break
(
Break
)
>
|
🔗 |
Fri 11:30 a.m. - 12:30 p.m.
|
Poster Session
(
Poster Session
)
>
|
🔗 |
Fri 12:30 p.m. - 1:00 p.m.
|
Invited Talk Yair Carmon
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 1:00 p.m. - 1:30 p.m.
|
Coffee Break
(
Break
)
>
|
🔗 |
Fri 1:30 p.m. - 2:00 p.m.
|
Invited Talk Aditya Grover
(
Talk
)
>
SlidesLive Video |
🔗 |
Fri 2:00 p.m. - 2:10 p.m.
|
Mindstorms in Natural Language-Based Societies of Mind
(
Oral
)
>
SlidesLive Video Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents---all communicating through the same universal symbolic language---are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents—some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions. |
Mingchen Zhuge · Haozhe Liu · Francesco Faccio · Dylan R. Ashley · Róbert Csordás · Anand Gopalakrishnan · Abdullah Hamdi · Hasan Hammoud · Vincent Herrmann · Kazuki Irie · Louis Kirsch · Bing Li · Guohao Li · Shuming Liu · Jinjie Mai · Piotr Piękos · Aditya Ramesh · Imanol Schlag · Weimin Shi · Aleksandar Stanić · Wenyi Wang · Yuhui Wang · Mengmeng Xu · Deng-Ping Fan · Bernard Ghanem · Jürgen Schmidhuber
|
Fri 2:10 p.m. - 2:20 p.m.
|
Foundation Models Can Robustify Themselves, For Free
(
Oral
)
>
link
SlidesLive Video Zero-shot inference is a powerful paradigm that enables the use of large pretrained models for downstream classification tasks without further training. However, these models are vulnerable to inherited biases that can impact their performance. The traditional solution is fine-tuning, but this undermines the key advantage of pretrained models, which is their ability to be used out-of-the-box. We propose RoboShot, a method that improves the robustness of pretrained model embeddings in a fully zero-shot fashion. First, we use language models (LMs) to obtain useful insights from task descriptions. These insights are embedded and used to remove harmful and boost useful components in embeddings---without any supervision. Theoretically, we provide a simple and tractable model for biases in zero-shot embeddings and give a result characterizing under what conditions our approach can boost performance. Empirically, we evaluate RoboShot on nine image and NLP classification tasks and show an average improvement of 15.98% over several zero-shot baselines. Additionally, we demonstrate that RoboShot is compatible with a variety of pretrained and language models. |
Dyah Adila · Changho Shin · Linrong Cai · Frederic Sala 🔗 |
Fri 2:20 p.m. - 2:30 p.m.
|
Teaching language models with canonical examples
(
Oral
)
>
SlidesLive Video It is easy to write a desirable or undesirable language model behavior (e.g., knowledge---The capital of Mauritius is Port Louis---or undesirable stereotypes---Researchers are always coldhearted) but it is difficult to make the model robustly generalize from these canonical examples. We formalize this task: a learning method takes a model and simple canonical examples and must produce a model that (1) generalizes to naturalistic examples, (2) stays within a bound of the original model's loss, and (3) performs well on a ``hard negative'' distribution to test overgeneralization. We build on the Backpack language model; its predictions take the form of a sparse weighted sum over a very large sense vector bank. We select and finetune a few Backpack senses per canonical example and find that this substantially outperforms other training methods. The Backpack we work with is only 170m parameters; yet, we find that it can improve much larger models: a product-of-experts ensemble between the 35x larger GPT-J-6B and the ratio of finetuned to pretrained Backpack outperforms finetuning GPT-J itself. |
John Hewitt · Sarah Chen · Percy Liang · Christopher D Manning 🔗 |
Fri 2:30 p.m. - 2:40 p.m.
|
The Consensus Game: Language Model Generation via Equilibrium Search
(
Oral
)
>
SlidesLive Video When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate answers). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game—which we term the concensus game—in which a generator seeks to communicate an abstract correctness parameter using natural language sentences to a discriminator. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call equilibrium-ranking. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and assistive dialog), equilibrium-ranking consistently improves performance over existing LM decoding procedures. These improvements are sometimes substantial—on multiple benchmarks, we observe that applying equilibrium-ranking to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. |
Athul Jacob · Yikang Shen · Gabriele Farina · Jacob Andreas 🔗 |
Fri 2:45 p.m. - 3:25 p.m.
|
Panel Discussion
(
Panel Discussion
)
>
SlidesLive Video |
🔗 |
Fri 3:25 p.m. - 3:30 p.m.
|
Closing Remarks
(
Closing Remarks
)
>
SlidesLive Video |
🔗 |
-
|
Mindstorms in Natural Language-Based Societies of Mind
(
Poster
)
>
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents---all communicating through the same universal symbolic language---are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents—some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions. |
Mingchen Zhuge · Haozhe Liu · Francesco Faccio · Dylan R. Ashley · Róbert Csordás · Anand Gopalakrishnan · Abdullah Hamdi · Hasan Hammoud · Vincent Herrmann · Kazuki Irie · Louis Kirsch · Bing Li · Guohao Li · Shuming Liu · Jinjie Mai · Piotr Piękos · Aditya Ramesh · Imanol Schlag · Weimin Shi · Aleksandar Stanić · Wenyi Wang · Yuhui Wang · Mengmeng Xu · Deng-Ping Fan · Bernard Ghanem · Jürgen Schmidhuber
|
-
|
Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing
(
Poster
)
>
Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are essentially searching all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of a same LLM: one as a reviewer (LLM-Reviewer), it scores the current prompt; the other as an author (LLM-Author), it edits the prompt by considering the edit history and the reviewer's feedback. Such an author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further aggregate a data selection approach to Evoke, where only the hard samples are exposed to the LLM. The hard samples are more important because the LLM can develop deeper understanding of the tasks out of them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods across a diverse range of tasks. |
Xinyu Hu · Pengfei Tang · Simiao Zuo · Zihan Wang · Bowen Song · Qiang Lou · Jian Jiao · Denis Charles 🔗 |
-
|
Dissecting In-Context Learning of Translations
(
Poster
)
>
Most of the recent work in leveraging Large Language Models (LLMs) such as GPT-3 for Machine Translation (MT) through in-context learning of translations has focused on selecting the few-shot demonstration samples. In this work, we characterize the robustness of LLMs from the GPT family to certain perturbations on few-shot translation demonstrations as a means to dissect the in-context learning of translations. In particular, we try to better understand the role of demonstration attributes for the in-context learning of translations through perturbations of high-quality, in-domain demonstrations. We find that asymmetric perturbation of the source-target mappings yield vastly different results. Further, we show that the perturbation of the source side has surprisingly little impact, while target perturbation can drastically reduce translation quality, suggesting that it is the output text distribution that provides the most important learning signal during in-context learning of translations. Based on our findings, we propose a method named Zero-Shot-Context to add this signal automatically in Zero-Shot prompting. Our proposed method greatly improves upon the zero-shot translation performance of GPT-3, thereby making it competitive with few-shot prompted translations. |
Vikas Raunak · Arul Menezes · Hany Awadalla 🔗 |
-
|
FedJETs: Efficient Just-In-Time Personalization with Federated Mixture of Experts
(
Poster
)
>
One of the goals in Federated Learning (FL) is to create personalized models that can adapt to the context of each participating client, while utilizing knowledge from a shared global model. Yet, often, personalization requires a fine-tuning step using clients' labeled data in order to achieve good performance. This may not be feasible in scenarios where incoming clients are fresh and/or have privacy concerns. It, then, remains open how one can achieve just-in-time personalization in these scenarios. We propose FedJETs, a novel solution by using a Mixture-of-Experts (MoE) framework within a FL setup. Our method leverages the diversity of the clients to train specialized experts on different subsets of classes, and a gating function to route the input to the most relevant expert(s). Our gating function harnesses the knowledge of a pretrained model (common expert) to enhance its routing decisions on-the-fly. As a highlight, our approach can improve accuracy up to 18% in state of the art FL settings, while maintaining competitive zero-shot performance. In practice, our method can handle non-homogeneous data distributions, scale more efficiently, and improve the state-of-the-art performance on common FL benchmarks. |
Chen Dun · Mirian Hipolito Garcia · Guoqing Zheng · Ahmed Awadallah · Robert Sim · Anastasios Kyrillidis · Dimitrios Dimitriadis 🔗 |
-
|
Your CLIP Model Might Be Undertrained
(
Poster
)
>
Contrastive Language-Image Pretraining (CLIP) models exhibit good performance on a range of vision tasks. To improve the performance of this class of models even further, several works have proposed to modify the CLIP training procedure. In this work, we show that it is possible to achieve substantial gains using a much simpler strategy. Specifically, existing CLIP models---especially those trained on smaller datasets---tend to be undertrained. Indeed, we show that extending the training procedure according to a simple heuristic can significantly improve the performance of CLIP models. |
Alaa Khaddaj · Hadi Salman · Andrew Ilyas · Guillaume Leclerc · Aleksander Madry 🔗 |
-
|
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
(
Poster
)
>
While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. We present a dataset of over 126,808 prompt injection attacks and 46,457 anti-injection "defense'' prompts to elucidate this problem, created by players of an online game called Tensor Trust. To the best of our knowledge, this is the largest dataset of human-generated adversarial examples for instruction-following LLMs. We demonstrate that these attacks often have a simple structure that sheds light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, our small-scale experiments on deployed LLM-based applications show that attack strategies in the dataset generalize beyond the setting of the game. We release all data and source code. |
Sam Toyer · Olivia Watkins · Ethan Mendes · Justin Svegliato · Luke Bailey · Tiffany Wang · Isaac Ong · Karim Elmaaroufi · Pieter Abbeel · Trevor Darrell · Alan Ritter · Stuart J Russell
|
-
|
Provable Robust Watermarking for AI-Generated Text
(
Poster
)
>
We study the problem of watermarking large language models (LLMs) generated text — one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. |
Xuandong Zhao · Prabhanjan Ananth · Lei Li · Yu-Xiang Wang 🔗 |
-
|
Neural Sandbox Framework for Classification: A Concept Based Method of Leveraging LLMs for Text Classification
(
Poster
)
>
We introduce a neural sandbox framework for text classification via self-referencing defined label concepts from an Large Language Model(LLM). The framework draws inspiration from the define-optimize alignment problem, in which the motivations of a model are described initially and then the model is optimized to align with these predefined objectives. In our case, we focus on text classification where we use a pre-trained LLM to convert text into vectors and provide it with specific concept words based on labels and input text. We then optimize an operator to classify text based on how relevant it is to these concept words (cop-words). Experiments with multiple text classification datasets and LLM models reveal that incorporating our sandbox network generally improves the accuracy and macro f1 when compared to a baseline. The framework, not only improves classification but also provides insights into the model's decision making based on the provided cop-words. We also demonstrated the framework's ability to understand learned concepts and identify potential biases. However, we found that the model's incentives may not always align with human decisions. |
Mostafa Mushsharat · Nabeel Mohammed · Mohammad Ruhul Amin 🔗 |
-
|
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions
(
Poster
)
>
In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set. |
Satwik Bhattamishra · Arkil Patel · Phil Blunsom · Varun Kanade 🔗 |
-
|
Can LLM-Generated Misinformation Be Detected?
(
Poster
)
>
The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures. |
Canyu Chen · Kai Shu 🔗 |
-
|
Jailbreaking Black Box Large Language Models in Twenty Queries
(
Poster
)
>
There is growing research interest in ensuring that large language models align with human safety and ethical guidelines. Adversarial attacks known as 'jailbreaks' pose a significant threat as they coax models into overriding alignment safeguards. Identifying these vulnerabilities through attacking a language model (red teaming) is instrumental in understanding inherent weaknesses and preventing misuse. We present Prompt Automatic Iterative Refinement (PAIR), which generates semantic jailbreaks with only black-box access to a language model.Empirically, PAIR often requires fewer than 20 queries, orders of magnitude fewer than prior jailbreak attacks. PAIR draws inspiration from the human process of social engineering, and employs an attacker language model to automatically generate adversarial prompts in place of a human. The attacker model uses the target model's response as additional context to iteratively refine the adversarial prompt. PAIR achieves competitive jailbreaking success rates and transferability on open and closed-source language models, including GPT-3.5/4, Vicuna, and PaLM. |
Patrick Chao · Alexander Robey · Edgar Dobriban · Hamed Hassani · George J. Pappas · Eric Wong 🔗 |
-
|
Learning Through Consistency for Prompt Tuning
(
Poster
)
>
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models that addresses the challenge of improving the generalization capability of large foundation models while fine-tuning them on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input further regularizes the consistency constraint, effectively improving generalization, while tuning additional parameters with prompting and adapters improves the performance on downstream tasks. Extensive experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation tasks. On the generalization task, CoPrompt improves the state-of-the-art by 2.09\% on the zero-shot task and 1.93\% on the harmonic mean over 11 recognition datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt. |
Shuvendu Roy · Ali Etemad 🔗 |
-
|
How does fine-tuning affect your model? Mechanistic analysis on procedural tasks
(
Poster
)
>
Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival'' of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task. |
Samyak Jain · Robert Kirk · Ekdeep S Lubana · Robert Dick · Hidenori Tanaka · Tim Rocktäschel · Edward Grefenstette · David Krueger 🔗 |
-
|
How Robust is Google's Bard to Adversarial Image Attacks?
(
Poster
)
>
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks by introducing the vision inputs. In this paper, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that released its multimodal capability recently, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard to output wrong image descriptions with a 22% success rate based solely on the transferability. We show that the adversarial examples can also attack other MLLMs, e.g., a 26% attack success rate against Bing Chat and a 86% attack success rate against ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding on robustness of MLLMs and facilitate future research on defenses. |
Yinpeng Dong · Huanran Chen · Jiawei Chen · Zhengwei Fang · Xiao Yang · Yichi Zhang · Yu Tian · Hang Su · Jun Zhu 🔗 |
-
|
Effective Data Augmentation With Diffusion Models
(
Poster
)
>
SlidesLive Video Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains. |
Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov 🔗 |
-
|
Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
(
Poster
)
>
CLIP showcases exceptional cross-modal matching capabilities due to its training on text-image matching tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image feature representation, adversely affecting CLIP's effectiveness in targeted tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in how to create a vast amount of high-quality text to match with images. We introduce the Auto Prompt Generator (APG) to autonomously produce the required text in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and architectures confirm CODER's effectiveness. |
Chao Yi · Lu Ren · De-Chuan Zhan · Han-Jia Ye 🔗 |
-
|
Group Preference Optimization: Few-Shot Alignment of Large Language Models
(
Poster
)
>
Applications of large language models (LLMs) often demand nuanced judgments that vary among different groups. Existing alignment algorithms can be costly, requiring extensive group-specific data and computation. We present Group Preference Optimization (GPO), a framework that efficiently aligns LLMs to group preferences using a few-shot approach.In GPO, we augment the base LLM with an independent transformer module to predict the preferences of a group for the LLM generations.For few-shot learning, this module acts as an in-context autoregressive transformer and is trained via meta-learning on several groups. Through empirical validation on opinion adaptation tasks involving US demographic groups, global countries, and individuals, GPO demonstrates superior alignment performance, requiring fewer group-specific preferences and reduced training and computational resources, surpassing existing strategies like in-context steering and fine-tuning. |
Siyan Zhao · John Dang · Aditya Grover 🔗 |
-
|
HART: Efficient Adaptation via Regularized Autoregressive Parameter Generation
(
Poster
)
>
Fine-tuning is an effective approach for adapting a pre-trained language model to downstream tasks, but it incurs a high computational cost. To achieve an extremely efficient task adaptation, \citet{phang2022hypertuning} have proposed to use an auxiliary hypernetwork to generate task-specific weights without any backpropagation. A hypernetwork can generate weights for parameter-efficient fine-tuning (PEFT) modules, such as prefixes \citep{li2021prefix} and LoRAs \citep{hu2021lora}, for any unseen task based on a few task-specific demonstration examples, at the cost of a single forward pass. However, hypernetwork training is challenging. Firstly, it is sample inefficient due to the under-exploitation of the dependencies between PEFT weights across layers. Secondly, it exhibits training instability due to the high diversity of few-shot demonstration inputs. To address these limitations, we propose a novel hypernetwork training approach, named HART. It exploits layerwise dependencies by autoregressively generating weights for individual layers, and stabilizes the training by regularizing the consistency between weights generated based on different demonstrations. We train the hypernetwork on a diverse collection of tasks \citep{wang2022super,sanh2021multitask} and evaluate its performance on unseen tasks. HART notably outperforms \citet{phang2022hypertuning} on both T5-Large and T5-XL models. |
Chen Liang · Nikos Karampatziakis · Tuo Zhao · Weizhu Chen 🔗 |
-
|
SELF-EXPLAIN: Teaching Large Language Models to Reason Complex Questions by Themselves
(
Poster
)
>
Large language models (LLMs) can generate intermediate reasoning steps. To elicit the reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs inspired by ``encoding specificity'' in human memory retrieval. We find using self-explanations makes LLMs more confident, more calibrated and less biased when answering complex questions. Moreover, we find prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question-answering datasets. |
Jiachen Zhao · Zonghai Yao · Zhichao Yang · Hong Yu 🔗 |
-
|
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
(
Poster
)
>
Despite efforts to align large language models (LLMs), widely-used LLMs such as GPT and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM. |
Alexander Robey · Eric Wong · Hamed Hassani · George J. Pappas 🔗 |
-
|
Evaluating Adversarial Defense in the Era of Large Language Models
(
Poster
)
>
Large language models (LLMs) have demonstrated superior performance in many natural language processing tasks. Existing works have shown that LLMs are not robust to adversarial attacks, questioning the applicability of these models in scenarios with safety concerns. However, one key aspect that has been overlooked is evaluating and developing defense mechanisms against adversarial attacks.In this work, we systematically study how LLMs react to different adversarial defense strategies. We also propose defenses tailored for LLMs that can significantly improve their robustness: First, we develop prompting methods to alert the LLM about potential adversarial contents; Second, we use neural models such as the LLM itself for typo correction; Third, we propose an effective fine-tuning scheme to improve robustness against corrupted inputs.Extensive experiments are conducted to evaluate the adversarial defense approaches. We show that by using the proposed defenses, robustness of LLMs can increase by up to 20\%. |
Joachim Studnia · Simiao Zuo · Xiaodong Liu · Qiang Lou · Jian Jiao · Denis Charles 🔗 |
-
|
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
(
Poster
)
>
Low-rank adaptations (LoRA) are often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a strategic framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples from a novel task, LoraHub enables the fluid combination of multiple LoRA modules, eradicating the need for human expertise. Notably, the composition requires neither additional model parameters nor gradients. Our empirical results, derived from the Big-Bench Hard (BBH) benchmark, suggest that LoraHub can effectively mimic the performance of in-context learning in few-shot scenarios, excluding the necessity of in-context examples alongside each inference input. A significant contribution of our research is the fostering of a community for LoRA, where users can share their trained LoRA modules, thereby facilitating their application to new tasks. We anticipate this resource will widen access to and spur advancements in general intelligence as well as LLMs in production. |
Chengsong Huang · Qian Liu · Bill Yuchen Lin · Chao Du · Tianyu Pang · Min Lin 🔗 |
-
|
ICL-Markup: Structuring In-Context Learning using Soft-Token Tags
(
Poster
)
>
Large pretrained language models (PLMs) can be rapidly adapted to a wide variety of tasksvia a text-to-text approach, where the instruction and input are fed to the model in natural language.Combined with in-context learning (ICL), this paradigm is impressively flexible and powerful.However, it also burdens engineers with an overwhelming amount of choices,many of them arbitrary.Inspired by markup languages like HTML, we contribute a method of using soft-token (a.k.a tunable token)tags to compose prompt templates.This approach reduces arbitrary decisionsand streamlines the application of ICL.Our method is a form of meta-learning for ICL;it learns these tags in advance during a parameter-efficient fine-tuning ``warm-up'' process.The tags can subsequently be used in templates for ICL on new,unseen tasks without any additional fine-tuning. Our experiments with this approach yield promising initial results.Improving PLM performance in important enterprise applications such as few-shot and open-world intent detection, as well as text classification in a legal domain. |
Marc-Etienne Brunet · Ashton Anderson · Richard Zemel 🔗 |
-
|
Lag-Llama: Towards Time-Series Foundation Models
(
Poster
)
>
SlidesLive Video Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen “out-of-distribution” time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. |
Kashif Rasul · Arjun Ashok · Marin Biloš · Andrew Williams · Arian Khorasani · George Adamopoulos · Rishika Bhagwatkar · Hena Ghonia · Nadhir Hassen · Anderson Schneider · Sahil Garg · Alexandre Drouin · Nicolas Chapados · Yuriy Nevmyvaka · Irina Rish
|
-
|
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
(
Poster
)
>
Foundational multimodal models pre-trained on large scale image-text pairs or video-text pairs or both have shown strong generalization abilities on downstream tasks. However unlike image-text models, pretraining video-text models is always not feasible due to the difficulty in collecting large-scale clean and aligned data, and exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: Can image-text models be adapted to video tasks and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by proposing a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed a light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step. |
Avinash Madasu · Anahita Bhiwandiwalla · VASUDEV LAL 🔗 |
-
|
TART: A plug-and-play Transformer module for task-agnostic reasoning
(
Poster
)
>
Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and, as a proof of concept, propose TART which generically improves an LLM's reasoning abilities using a synthetically trained reasoning module. TART trains this Transformer-based reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, Bloom), model sizes (100M - 6B), tasks (14 NLP classification tasks), and even across different modalities (audio and vision). On the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms Bloom (176B), and is within $4$% of GPT-3.
|
Kush Bhatia · Avanika Narayan · Christopher De Sa · Christopher Ré 🔗 |
-
|
READ: Recurrent Adaptation of Large Transformers
(
Poster
)
>
SlidesLive Video In the realm of Natural Language Processing (NLP), large-scale transformers have established themselves as pivotal, achieving unparalleled results across numerous tasks. The conventional approach involves pre-training these models on extensive web-scale data, followed by fine-tuning them for specific downstream tasks. However, the burgeoning size of these models, which has surged almost two orders of magnitude faster than GPU memory since 2018, has rendered their fine-tuning financially and computationally exorbitant, limiting this capability to a select few well-funded institutions. Parameter-efficient transfer learning (PETL) has emerged as a potential solution, aiming to efficiently adapt pre-trained model parameters to target tasks using smaller, task-specific models. Nonetheless, existing PETL methods either introduce additional inference latency or marginally reduce memory requirements during training, thus not fully addressing the primary motivation behind PETL. This paper introduces REcurrent ADaption (READ), a novel, lightweight, and memory-efficient fine-tuning method that incorporates a small RNN network alongside the backbone model. READ not only achieves comparable model quality to traditional fine-tuning, saving over 84\% in energy consumption, but also demonstrates scalability and independence from the backbone model size. Through extensive experiments on various NLP benchmarks, including the GLUE benchmark, READ showcases robust performance and high efficiency, reducing model training memory consumption by 56\% and GPU energy usage by 84\% relative to full-tuning, without significantly impacting inference latency and memory. We provide a theoretically justified, scalable solution for fine-tuning large transformers. |
Sid Wang · John Nguyen · Ke Li · Carole-Jean Wu 🔗 |
-
|
Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models
(
Poster
)
>
Characterizing materials with electron micrographs is a crucial task in fields such as semiconductors and quantum materials. The complex hierarchical structure of micrographs often poses challenges for traditional classification methods. In this study, we propose an innovative backbone architecture for analyzing electron micrographs. We create multi-modal representations of the micrographs by tok-enizing them into patch sequences and, additionally, representing them as vision graphs, commonly referred to as patch attributed graphs. We introduce the Hierarchical Network Fusion (HNF), a multi-layered network structure architecture that facilitates information exchange between the multi-modal representations and knowledge integration across different patch resolutions. Furthermore, we leverage large language models (LLMs) to generate detailed technical descriptions of nano-materials as auxiliary information to assist in the downstream task. We utilize a cross-modal attention mechanism for knowledge fusion across cross-domain representations(both image-based and linguistic insights) to predict the nanomaterial category. This multi-faceted approach promises a more comprehensive and accurate representation and classification of micrographs for nanomaterial identification. Our framework outperforms traditional methods, overcoming challenges posed by distributional shifts, and facilitating high-throughput screening. |
Sagar Srinivas Sakhinana · Sannidhi G N K Geethan · Venkataramana Runkana 🔗 |
-
|
SAD: Segment Any RGBD
(
Poster
)
>
SlidesLive Video The Segment Anything Model (SAM) has demonstrated its effectiveness in segmenting any part of 2D RGB images. A lot of SAM-based applications have shown amazing performance. However, SAM exhibits a stronger emphasis on texture information while paying less attention to geometry information when segmenting RGB images. To address this limitation, we propose the Segment Any RGBD (SAD) model, which is specifically designed to extract geometry information directly from images. Inspired by the natural ability of humans to identify objects through the visualization of depth maps, SAD utilizes SAM to segment the rendered depth map, thus providing cues with enhanced geometry information and mitigating the issue of over-segmentation. Compared to other SAM-based projects, we are the first to use SAM to segment non-RGB images. We further include the open-vocabulary semantic segmentation in our framework to provide the semantic labels of each segment. |
Jun CEN · Yizheng Wu · Kewei Wang · Xingyi Li · Jingkang Yang · Yixuan Pei · Lingdong Kong · Ziwei Liu · Qifeng Chen 🔗 |
-
|
Estimating Uncertainty in Multimodal Foundation Models using Public Internet Data
(
Poster
)
>
Foundation models are trained on vast amounts of data at scale using self-supervised learning, enabling adaptation to a wide range of downstream tasks. At test time, these models exhibit zero-shot capabilities through which they can classify previously unseen (user-specified) categories. In this paper, we address the problem of quantifying uncertainty in these zero-shot predictions. We propose a heuristic approach for uncertainty estimation in zero-shot settings using conformal prediction with web data. Given a set of classes at test time, we conduct zero-shot classification with CLIP-style models using a prompt template, e.g., ``an image of a |
Shiladitya Dutta · Hongbo Wei · Lars van der Laan · Ahmed Alaa 🔗 |
-
|
Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design
(
Poster
)
>
Molecule design is a multifaceted approach that leverages computational methods and experiments to optimize molecular properties, fast-tracking new drug discoveries, innovative material development, and more efficient chemical processes. Recently, text-based molecule design has emerged, inspired by next-generation AI tasks analogous to foundational vision-language models. Our study explores theuse of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task. Our approach uses task-specific instructions and a few demonstrations to address distributional shift challenges when constructing augmented prompts for querying LLMs to generate molecules consistent with technical descriptions. Our framework proves effective,outperforming state-of-the-art (SOTA) baseline models on benchmark datasets. |
Sagar Srinivas Sakhinana · Venkataramana Runkana 🔗 |
-
|
Investigating Hiring Bias in Large Language Models
(
Poster
)
>
Large Language Models (LLMs) such as GPT-3.5, Bard, and Claude exhibit applicability across numerous tasks. One domain of interest is their use in algorithmic hiring, specifically in matching resumes with job categories. Yet, this introduces issues of bias on protected attributes like gender, race and maternity status. The seminal work of Bertrand and Mullainathan (2003) set the gold-standard for identifying hiring bias via field experiments where the response rate for identical resumes that differ only in protected attributes, e.g., racially suggestive names such as Emily or Lakisha, is compared. We replicate this experiment on state-of-art LLMs to evaluate bias (or lack thereof) on gender, race, maternity status, pregnancy status, and political affiliation. We evaluate LLMs on two tasks: (1) matching resumes to job categories; and (2) summarizing resumes with employment relevant information. Overall, LLMs are robust across race and gender. They differ in their performance on pregnancy status and political affiliation. We use contrastive input decoding on open-source LLMs to uncover potential sources of bias. |
Akshaj Kumar Veldanda · Fabian Grob · Shailja Thakur · Hammond Pearce · Benjamin Tan · Ramesh Karri · Siddharth Garg 🔗 |
-
|
LOWA: Localize Objects in the Wild with Attributes
(
Poster
)
>
SlidesLive Video Existing open-vocabulary object detectors can struggle with uncommon or fine-grained classes, as the model and users may have different understandings of object names. Incorporating attributes such as color, shape, and size can help to reduce this inconsistency and make interactive detection more convenient and flexible. Motivated by this, we present LOWA, a new method for localizing objects with attributes effectively in the wild. To train LOWA, we propose a multi-step vision-language training strategy to learn object detection and recognition with class names as well as attribute information, which empowers users to flexibly customize text queries and extend to fine-grained detection with attribute and object information for a wider range of applications. LOWA is built on top of a two-tower vision-language architecture and consists of a standard vision transformer as the image encoder and a similar transformer as the text encoder. To learn the alignment between visual and text inputs at the instance level, we train LOWA with three training steps: object-level training, attribute-aware learning, and free-text joint training of objects and attributes. This training strategy first ensures correct object detection, then incorporates instance-level attribute information, and finally balances the object class and attribute sensitivity. We evaluate our model performance of attribute classification and attribute localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and the Visual Attributes in the Wild (VAW) dataset, and experiments indicate strong zero-shot performance. Ablation studies additionally demonstrate the effectiveness of each training step of our approach. |
Xiaoyuan Guo · Kezhen Chen · Jinmeng Rao · Yawen Zhang · Baochen Sun · Jie Yang 🔗 |
-
|
Understanding the Vulnerability of CLIP to Image Compression
(
Poster
)
>
SlidesLive Video CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method-Integrated Gradients. Using this attribution method, we are able to better understand both quantitatively and qualitatively exactly the nature in which the compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models. |
Cangxiong Chen · Vinay Namboodiri · Julian Padget 🔗 |
-
|
Towards General-Purpose In-Context Learning Agents
(
Poster
)
>
Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution. |
Louis Kirsch · James Harrison · Daniel Freeman · Jascha Sohl-Dickstein · Jürgen Schmidhuber 🔗 |
-
|
HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning
(
Poster
)
>
In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data which renders them unsuitable for real-world use due to privacy. We study this problem in the context of Foundation Models and showcase their effectiveness in mitigating forgetting while minimizing overhead costs and without requiring access to any stored data. We achieve this by leveraging a prompting based approach (such that only prompts and classifier heads have to be communicated) and proposing a novel and lightweight generation and distillation scheme to aggregate client models at the server. We formulate this problem for image classification and establish strong baselines for comparison, conduct experiments on CIFAR-100 as well as challenging, large-scale datasets like ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by more than 7% while significantly reducing communication and client-level computation costs. |
Shaunak Halbe · James S Smith · Junjiao Tian · Zsolt Kira 🔗 |
-
|
Image Clustering Conditioned on Text Criteria
(
Poster
)
>
Classical clustering methods do not provide users with direct control of the clustering results, and the clustering results may not be consistent with the relevant criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified criteria in the form of text by leveraging modern Vision-Language Models and Large Language Models. We call our method Image Clustering Conditioned on Text Criteria (IC$|$TC), and it represents a different paradigm of image clustering. IC$|$TC requires a minimal and practical degree of human intervention and grants the user significant control over the clustering results in return. Our experiments show that IC$|$TC can effectively cluster images with various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.
|
Sehyun Kwon · Jaeseung Park · Minkyu Kim · Jaewoong Cho · Ernest Ryu · Kangwook Lee 🔗 |
-
|
On the Relationship between Skill Neurons and Robustness in Prompt Tuning
(
Poster
)
>
Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Recently, based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks, that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data, with higher robustness for T5 than RoBERTa. At the same time, we replicate the existence of skill neurons in RoBERTa and further show that skill neurons also seem to exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to activate the relevant skill neurons on adversarial data. |
Leon Ackermann · Xenia Ohmer 🔗 |
-
|
Latent Skill Discovery for Chain-of-Thought Reasoning
(
Poster
)
>
Recent advances in Large Language Models (LLMs) have led to an emergent ability of chain-of-thought (CoT) prompting, a prompt reasoning strategy that adds intermediate rationale steps between questions and answers to construct prompts. Conditioned on these prompts, LLMs can effectively learn in context to generate rationales that lead to more accurate answers than when answering the same question directly. To design LLM prompts, one important setting, called demonstration selection, considers selecting demonstrations from an example bank. Existing methods use various heuristics for this selection, but for CoT prompting, which involves unique rationales, it is essential to base the selection upon the intrinsic skills that CoT rationales need, for instance, the skills of addition or subtraction for math word problems.To address this requirement, we introduce a novel approach named Reasoning Skill Discovery (RSD) that uses unsupervised learning to create a latent space representation of rationales, called a reasoning skill. Simultaneously, RSD learns a reasoning policy to determine the required reasoning skill for a given question. This can then guide the selection of examples that demonstrate the required reasoning skills. Our approach offers several desirable properties: it is (1) theoretically grounded, (2) sample-efficient, requiring no LLM inference or manual prompt design, and (3) LLM-agnostic. Empirically, RSD outperforms existing methods by up to 6% in terms of the answer accuracy across multiple reasoning tasks. |
Zifan Xu · Haozhu Wang · Dmitriy Bespalov · Peter Stone · Yanjun Qi 🔗 |
-
|
Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models
(
Poster
)
>
SlidesLive Video
Multimodal Large Language Models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to $46\%$ with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance.
|
jiarui zhang · Mahyar Khayatkhoei · Prateek Chhikara · Filip Ilievski 🔗 |
-
|
Uncertainty In Natural Language Explanations Of Large Language Models
(
Poster
)
>
Large Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent works on prompting claim to elicit intermediate reasoning steps and important tokens in LLMs to serve as proxy explanations for its predictions. However, there is no guarantee or certainty whether these explanations are reliable and reflect the LLM's true behavior. In this work, we introduce the first definitions of uncertainty in natural language explanations of LLMs, where we propose a novel approach $\textit{Probing Uncertainty}$ --- to quantify the confidence of the generated explanations. Our approach probes a neighbourhood of explanations of the LLM to estimate the uncertainty. While verbalized uncertainty involves prompting the LLM to express its confidence level in generated explanations, we show that it is not a reliable estimate of explanation confidence. Our empirical analysis reveals two key insights about uncertainty in generated natural language explanations: i) Verbalized uncertainty estimation using LLMs often exhibits high overconfidence, raising questions on the trustworthiness of its explanation, and ii) Explanation confidence calculated from the proposed metric is correlated with the faithfulness of an explanation, where lower explanation confidence pertains to explanations with lower faithfulness. Our study provides insights into the challenges and opportunities in quantifying uncertainty in explanations of LLMs, contributing to the broader discussion of explainability and trustworthiness in machine learning applications.
|
Sree Harsha Tanneru · Chirag Agarwal · Himabindu Lakkaraju 🔗 |
-
|
Benchmarking Robustness of Text-Image Composed Retrieval
(
Poster
)
>
SlidesLive Video Text-image composed retrieval aims to retrieve the target image through the composed query, which is specified in the form of an image plus some text that describes desired modifications to the input image. It has recently attracted attention due to its ability to leverage both information-rich images and concise language to precisely express the requirements for target images. However, the robustness of these approaches against real-world corruptions or further text understanding has never been studied. In this paper, we perform the first robustness study and establish three new diversified benchmarks for systematically analysis of text-image composed retrieval against natural corruptions in both vision and text and further probe textural understanding. For natural corruption analysis, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C for testing in open domain and fashion domain respectively, both of which apply 15 visual corruptions and 7 textural corruptions. For textural understanding analysis, we introduce a new diagnostic dataset CIRR-D by expanding the original raw data with synthetic data, which contains modified text so to better probe textual understanding ability including numerical variation, attribute variation, object removal, background variation, and fine-grained evaluation. |
Shitong Sun · Jindong Gu · Shaogang Gong 🔗 |
-
|
Foundation Models Can Robustify Themselves, For Free
(
Poster
)
>
Zero-shot inference is a powerful paradigm that enables the use of large pretrained models for downstream classification tasks without further training. However, these models are vulnerable to inherited biases that can impact their performance. The traditional solution is fine-tuning, but this undermines the key advantage of pretrained models, which is their ability to be used out-of-the-box. We propose RoboShot, a method that improves the robustness of pretrained model embeddings in a fully zero-shot fashion. First, we use language models (LMs) to obtain useful insights from task descriptions. These insights are embedded and used to remove harmful and boost useful components in embeddings---without any supervision. Theoretically, we provide a simple and tractable model for biases in zero-shot embeddings and give a result characterizing under what conditions our approach can boost performance. Empirically, we evaluate RoboShot on nine image and NLP classification tasks and show an average improvement of 15.98% over several zero-shot baselines. Additionally, we demonstrate that RoboShot is compatible with a variety of pretrained and language models. |
Dyah Adila · Changho Shin · Linrong Cai · Frederic Sala 🔗 |
-
|
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
(
Poster
)
>
Few-shot learning is valuable in many real-world applications, but learning a generalizable model without overfitting to the few labeled datapoints is challenging. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Previous works have proposed automated methods for mixing auxiliary and target data, but these methods typically scale linearly (or worse) with the number of auxiliary datasets, limiting their practicality. In this work we relate FLAD to the explore-exploit dilemma that is central to the multi-armed bandit setting and derive algorithms whose computational complexity is independent of the number of auxiliary datasets, allowing us to scale to 100x more auxiliary datasets than prior methods. We propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with prior FLAD methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4\% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3. |
Alon Albalak · Colin Raffel · William Yang Wang 🔗 |
-
|
Predicting the Performance of Foundation Models via Agreement-on-the-line
(
Poster
)
>
Estimating out-of-distribution (OOD) performance is critical to safely deploying machine learning models. Recently, Baek et al showed that the phenomenon ``agreement-on-the-line'' can be a reliable method for predicting OOD accuracy of models in an ensemble consisting largely of CNNs trained from scratch. However, it is now increasingly common to lightly fine-tune foundation models, and it is unclear whether such fine-tuning is sufficient to produce enough diversity in models for such agreement-based methods to work properly. In this paper, we develop methods for reliably applying agreement-on-the-line-based performance estimation to fine-tuned foundation models. In particular, we first study the case of fine-tuning a single foundation model, where we extensively study how different types of randomness (linear head initialization, hyperparameter selection, data subsetting, and data shuffling) contribute to the agreement-on-the-line of the resulting model sets; we find, somewhat surprisingly, that it is typically possible to obtain strong agreement via random initialization of the linear head alone. Next, we study how multiple foundation models, pretrained on different data sets but fine-tuned on the same task, may or may not produce agreement; we show, again rather surprisingly, that the diversity of such models is already sufficient and not too disparate for them to all lie on the same agreement line. In total, these methods enable reliable and efficient estimation of OOD accuracy for fine-tuned foundation models, without leveraging any labeled OOD data. |
Rahul Saxena · Aman Mehra · Taeyoun Kim · Christina Baek · J. Zico Kolter · Aditi Raghunathan 🔗 |
-
|
Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks
(
Poster
)
>
Although remarkable progress has been achieved preventing LLMs hallucinations, using instruction tuning and retrieval augmentation, it is currently difficult to measure the reliability of LLMs using available static data that is often not challenging enough and could suffer from data leakage. Inspired by adversarial machine learning, this paper aims to develop an automatic method for generating new evaluation data by appropriately modifying existing data on which LLMs behave faithfully. Specifically, this paper presents AutoDebug, an LLM-based framework for using prompt chaining to generate transferable adversarial attacks (in the form of question-answering examples). We seek to understand the extent to which these trigger hallucination behavior in LLMs. We first implement our framework using ChatGPT and evaluate the resulting two variants of a popular open-domain question-answering dataset, Natural Questions (NQ) on a collection of open-source and proprietary LLMs under various prompting settings. Our generated evaluation data is human-readable and, as we show, humans can answer these modified questions well. Nevertheless, we observe pronounced accuracy drops across multiple LLMs including GPT-4. Our experimental results confirm that LLMs are likely to hallucinate in two categories of question-answering scenarios where (1) there are conflicts between knowledge given in the prompt and their parametric knowledge, or (2) the knowledge expressed in the prompt is complex. Finally, the adversarial examples generated by the proposed method are transferrable across all considered LLMs, making our approach viable for LLM-based debugging using more cost-effective LLMs. |
Xiaodong Yu · Hao Cheng · Xiaodong Liu · Dan Roth · Jianfeng Gao 🔗 |
-
|
Task Arithmetic with LoRA for Continual Learning
(
Poster
)
>
Continual learning refers to the problem where the training data is available in sequential chunks, termed "tasks". The majority of progress in continual learning has been stunted by the problem of catastrophic forgetting, which is caused by sequential training of the model on streams of data. Moreover, it becomes computationally expensive to sequentially train large models multiple times. To mitigate both of these problems at once, we propose a novel method to continually train transformer-based vision models using low-rank adaptation and task arithmetic. Our method completely bypasses the problem of catastrophic forgetting, as well as reducing the computational requirement for training models on each task. When aided with a small memory of 10 samples per class, our method achieves performance close to full-set finetuning. We present rigorous ablations to support the prowess of our method. |
Rajas Chitale · Ankit Vaidya · Aditya Kane · Archana Ghotkar 🔗 |
-
|
Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
(
Poster
)
>
SlidesLive Video Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding tasks. |
Han Zhou · Xingchen Wan · Lev Proleev · Diana Mincu · Jilin Chen · Katherine Heller · Subhrajit Roy 🔗 |
-
|
Teaching language models with canonical examples
(
Poster
)
>
It is easy to write a desirable or undesirable language model behavior (e.g., knowledge---The capital of Mauritius is Port Louis---or undesirable stereotypes---Researchers are always coldhearted) but it is difficult to make the model robustly generalize from these canonical examples. We formalize this task: a learning method takes a model and simple canonical examples and must produce a model that (1) generalizes to naturalistic examples, (2) stays within a bound of the original model's loss, and (3) performs well on a ``hard negative'' distribution to test overgeneralization. We build on the Backpack language model; its predictions take the form of a sparse weighted sum over a very large sense vector bank. We select and finetune a few Backpack senses per canonical example and find that this substantially outperforms other training methods. The Backpack we work with is only 170m parameters; yet, we find that it can improve much larger models: a product-of-experts ensemble between the 35x larger GPT-J-6B and the ratio of finetuned to pretrained Backpack outperforms finetuning GPT-J itself. |
John Hewitt · Sarah Chen · Percy Liang · Christopher D Manning 🔗 |
-
|
How Do Large Multimodal Models Really Fare in Classical Vision Few-Shot Challenges? A Deep Dive
(
Poster
)
>
SlidesLive Video Recent advances in multimodal foundational models have demonstrated marvelous in-context learning capabilities for diverse vision-language tasks. However, existing literature has mainly focused on few-shot learning tasks similar to their NLP counterparts. It is unclear whether these foundation models can also address classical vision challenges such as few-shot classification, which in some settings (e.g., 5-way 5-shot) necessitates sophisticated reasoning over several dozens of images -- a challenging task for learning systems. In this work, we take a deep dive to probe the potential and limitations of existing multimodal models on this problem. Our investigation reveals that while these models under careful calibration can outperform dedicated visual models in complex narratable scenes, they can falter with more abstract visual inputs. Moreover, we also investigate curriculum learning and find out how it can mitigate the performance gap via smoothly bridging verbal and nonverbal reasoning for vision language tasks. |
Qing Guo · Prashan Wanigasekara · Jian Zheng · Jacob Fang · Xinwei Deng · Chenyang Tao 🔗 |
-
|
Think before you speak: Training Language Models With Pause Tokens
(
Poster
)
>
Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{\rm th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{\rm th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on eight tasks, most prominently, a gain of $18\\%$ EM score on the QA task of SQuAD, $8\\%$ on CommonSenseQA and $1\\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
|
Sachin Goyal · Ziwei Ji · Ankit Rawat · Aditya Menon · Sanjiv Kumar · Vaishnavh Nagarajan 🔗 |
-
|
Stepwise Inference in Transformers: Exploring a Synthetic Graph Navigation Task
(
Poster
)
>
Taking correct steps through elementary logical operations is the essence of logical reasoning, culminating in precise planning outcomes. While such \emph{stepwise inference} approaches have demonstrated benefits in Large Language Models (LLMs), conducting an accurate quantitative evaluation is challenging, given their extensive scale, complexity, and lack of accessibility.We introduce a minimal synthetic setup, where an autoregressive language model solves a navigation task on directed acyclic graphs (DAGs), taking inspiration from computational graphs and execution traces.By implementing training with sample paths from start to goal node in a 'step-by-step' manner, we perform systematic experiments and develop novel analyses illustrating that stepwise navigation proves advantageous when the underlying graph is hierarchical and generalization necessitates the stitching of subpaths observed during pretraining. Further, we observe a diversity-accuracy tradeoff while varying sampling temperature and a bias towards generating shorter paths.We next elucidate how in-context chain-of-thought exemplars can steer the model's navigation. Importantly, these exemplars can guide the model to follow a path of reasoning we provide, instead of relying on its potentially biased priors. Together, this work showcases the utility and adaptability of this paradigm in exploring the complexities of logical reasoning and planning in LLMs. |
Mikail Khona · Maya Okawa · Rahul Ramesh · Kento Nishi · Robert Dick · Ekdeep S Lubana · Hidenori Tanaka 🔗 |
-
|
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
(
Poster
)
>
The ML community is rapidly exploring techniques for prompting language models (LMs), but existing LM pipelines often rely on hard-coded “prompt templates” discovered via trial and error. We introduce DSPy, a programming model that abstracts LM pipelines as imperative computation graphs where LMs are invoked through declarative modules. DSPy modules are parameterized so they can learn to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies and show that a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting and pipelines with expert-created demonstrations. |
Omar Khattab · Arnav Singhvi · Paridhi Maheshwari · Zhiyuan Zhang · Keshav Santhanam · Sri Vardhamanan A · Saiful Haq · Ashutosh Sharma · Thomas Joshi · Hanna Moazam · Heather Miller · Matei A Zaharia · Christopher Potts
|
-
|
Fooling GPT with adversarial in-context examples for text classification
(
Poster
)
>
Deep learning-based methods helped solve NLP tasks more efficiently than traditional methods, and adversarial attacks for these methods have been extensively explored. However, Large Language Models (LLMs) have set up a new paradigm of few-shot prompting, which opens up the possibility for novel attacks. In this study, we show that LLMs can be vulnerable to adversarial prompts. We develop the first method to attack the few-shot examples in the text classification setup. We can degrade the model performance significantly during the test time by only slightly perturbing the examples based on optimization. Our method achieves a performance degradation of up to 50% without distorting the semantic meaning. |
Sudhanshu Ranjan · Chung-En Sun · Linbo Liu · Lily Weng 🔗 |
-
|
Dr.ICL: Demonstration-Retrieved In-context Learning
(
Poster
)
>
In-context learning (ICL), which teaches a large language model (LLM) to perform a task with few-shot demonstrations rather than adjusting the model parameters, has emerged as a strong paradigm for using LLMs. While early studies primarily used a fixed or random set of demonstrations for all test queries, recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance. This work expands the applicability of retrieval-based ICL approaches along several dimensions. We extend the success of retrieval-based ICL to instruction-finetuned LLMs as well as Chain-of-Thought (CoT) prompting. While the prior work utilizes general Large Language Models (LLMs), such as GPT-3, we find that retrieved demonstrations also enhance instruction-finetuned LLMs. This insight implies that training data, despite being exposed during the fine-tuning phase, can still be effectively used through retrieval and in-context demonstrations during testing, resulting in superior outcomes when compared to utilizing no demonstrations or selecting them at random. For CoT, when the demonstrations contain reasoning chains, we get improvements by retrieving based on such chains. Finally, we train a task-specific demonstration retriever that outperforms off-the-shelf retrievers. |
Man Luo · Xin Xu · Zhuyun Dai · Panupong Pasupat · Mehran Kazemi · Chitta Baral · Vaiva Imbrasaite · Vincent Zhao 🔗 |
-
|
Trained Transformers Learn Linear Models In-Context
(
Poster
)
>
Attention-based neural network sequence models such as transformers have the capacity to act as supervised learning algorithms: They can take as input a sequence of labeled examples and output predictions for unlabeled test examples. Indeed, recent work by Garg et al. has shown that when training GPT2 architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of in-context learning of linear predictors for a transformer with a single linear self-attention layer trained by gradient flow. We show that despite the non-convexity of the underlying optimization problem, gradient flow with a random initialization finds a global minimum of the objective function. Moreover, when given a prompt of labeled examples from a new linear prediction task, the trained transformer achieves small prediction error on unlabeled test examples. We further characterize the behavior of the trained transformer under distribution shifts. |
Ruiqi Zhang · Spencer Frei · Peter Bartlett 🔗 |
-
|
Zero-shot Conversational Summarization Evaluations with small Large Language Models
(
Poster
)
>
Large Language Models (LLMs) exhibit powerful summarization abilities. However, their capabilities on conversational summarization remains under explored. In this work we evaluate LLMs (~10 billion parameters) on conversational summarization and showcase their performance on various prompts. We show that the summaries generated by models depend on the instructions and the performance of LLMs vary with different instructions sometimes resulting steep drop in ROUGE scores if prompts are not selected carefully. We also evaluate the models with human evaluations and discuss the limitations of the models on conversational summarization. |
Ramesh Manuvinakurike · Saurav Sahay · Sangeeta Manepalli · Lama Nachman 🔗 |
-
|
In-Context Learning and Bayesian Inference
(
Poster
)
>
SlidesLive Video
In-context learning (ICL) is one of the surprising and useful features of large language models and subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs $(x, f(x))$ using the language modeling loss. The function $f$ comes from a function class and generalization is checked by evaluation on sequences for unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, it is unclear if transformers trained on multiple function classes (a setup closer to that of real-world LLMs) also exhibit this generalization. Moreover, the inductive biases of these models resulting in this generalization are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. In this paper we address these issues and empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to hierarchical meta-ICL setup which involve unions of multiple task families. We instantiate this setup on a diverse range of linear and nonlinear function families and find that transformers can do ICL in this setting as well. Where Bayesian inference is tractable, we find evidence that high-capacity transformers mimic the Bayesian predictor. Via the example of learning Fourier series, we also study the inductive bias for in-context learning. We find that in-context learning may or may not have simplicity bias depending on the pretraining data distribution. The Bayesian perspective provides insights into these inductive biases and how transformers perform a particular task when they are trained on multiple tasks.
|
Madhur Panwar · Kabir Ahuja · Navin Goyal 🔗 |
-
|
AutoVP: An Automated Visual Prompting Framework and Benchmark
(
Poster
)
>
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, having up to 6.7% improvement in accuracy; and attains a maximum performance increase of 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: serving both as an efficient tool for hyperparameter tuning on VP design choices, and as a comprehensive benchmark that can reasonably be expected to accelerate VP’s development. |
Hsi-Ai Tsao · Lei Hsiung · Pin-Yu Chen · Sijia Liu · Tsung-Yi Ho 🔗 |
-
|
How Capable Can a Transformer Become? A Study on Synthetic, Interpretable Tasks
(
Poster
)
>
Transformers trained on huge text corpora exhibit a remarkable set of capabilities. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we aim to assess in this paper "how capable can a transformer become?". In this work, we train Transformer models on a data-generating process that involves compositions of a set of well-defined monolithic capabilities and show that: (1) Transformers generalize to exponentially or even combinatorially many functions not seen in the training data; (2) Transformers that generate the intermediate outputs of the composition are more effective at generalizing to unseen compositions; (3) The training data has a significant impact on the model's ability to compose functions (4) Attention layers in the latter half of the model seem critical to compositionality. |
Rahul Ramesh · Mikail Khona · Robert Dick · Hidenori Tanaka · Ekdeep S Lubana 🔗 |
-
|
What’s important here?: Opportunities and Challenges of LLM in retrieving information from Web Interface
(
Poster
)
>
Large language models (LLMs) that have been trained on large corpus of codes exhibit a remarkable ability to understand HTML code [1]. As web interfaces are mainly constructed using HTML, we designed an in-depth study to see how the code understanding ability of LLMs can be used to retrieve and locate important elements for a user given query (i.e. task description) in web interface. In contrast with prior works, which primarily focused on autonomous web navigation, we decompose the problem as an even atomic operation - Can LLMs find out the important information in the web page for a user given query? This decomposition enables us to scrutinize the current capabilities of LLMs and uncover the opportunities and challenges they present. Our empirical experiments show that the LLMs exhibit a reasonable level of competence, there is still a substantial room for improvement. We hope our investigation will inspire follow-up works in overcoming the current challenges in this domain. |
Faria Huq · Jeffrey Bigham · Nikolas Martelaro 🔗 |
-
|
One shot localization and segmentation of medical images with Foundation Models
(
Poster
)
>
SlidesLive Video Recent advances in Vision Transformers (ViT) and Stable Diffusion (SD) models with their ability to capture rich semantic features of the image have been used for image correspondence tasks on natural images. In this paper, we examine the ability of a variety of pre-trained ViT (DINO, DINOv2, SAM, CLIP) and SD models, trained exclusively on natural images, for solving the correspondence problems on medical images. While many works have made a case for in-domain training, we show that the models trained on natural images can offer good performance on medical images across different modalities (CT,MR,Ultrasound) sourced from various manufacturers, over multiple anatomical regions (brain, thorax, abdomen, extremities), and on wide variety of tasks. Further, we leverage the correspondence with respect to a template image to prompt a Segment Anything (SAM) model to arrive at single shot segmentation, achieving dice range of 62%-90% across tasks, using just one image as reference. We also show that our single-shot method outperforms the recently proposed few-shot segmentation method - UniverSeg (Dice range 47%-80%) on most of the semantic segmentation tasks(six out of seven) across medical imaging modalities. |
Deepa Anand · Gurunath Reddy Madhumani · Vanika Singhal · Dattesh Shanbhag · Shriram KS · Uday Patil · Chitresh Bhushan · Kavitha Manickam · Dawei Gui · Rakesh Mullick · Avinash Gopal · Parminder Bhatia · Taha Kass-Hout
|
-
|
A Universal Prompt Generator for Large Language Models
(
Poster
)
>
LLMs are primarily reliant on high-quality and task-specific prompts. However, the prompt engineering process relies on clever heuristics and requires multiple iterations. Some recent works attempt to automate this process by improving upon human written prompts. However, creating high-quality prompts from scratch is still an unresolved challenge owing to its inherent complexity. In this work, we propose UniPrompt, a novel technique for generating high-quality human-like prompts from scratch. To do so, we identify characteristic features of human-generated prompts such as being detailed and consisting of multiple sections. Our proposed method, UniPrompt, takes as input a single sentence description of the task and generates human-like sectioned prompts using an auxiliary language model. We train the model in two stages. First, the model is finetuned on multiple tasks using a novel dataset curated using GPT-4 across over 500 tasks. Second, we align the auxiliary model to generate task-relevant (high accuracy) prompts by collecting a prompt preference dataset and optimizing the model using the Direct Preference Optimization method. Importantly, UniPrompt is task-agnostic: once trained, it can be used to generate prompts for any task. We find that UniPrompt outperforms human-generated prompts, GPT-generated prompts, and other prompt optimization techniques across diverse tasks on medicine, causality, and hate speech by up to 5.1 %, 7.2 %, and 11.1 % respectively. |
Gurusha Juneja · Amit Sharma 🔗 |
-
|
AutoMix: Mixing Models with Few-shot Self and Meta Verification
(
Poster
)
>
link
Large language models (LLMs) are now available in various sizes and configurations from cloud API providers. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to AutoMix is a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring training. Given that verifications can be noisy, we employ a meta verifier in \ours to refine the accuracy of these assessments. Our experiments using LLAMA2-13B and LLAMA2-70B, on five context-grounded reasoning datasets demonstrate that AutoMix surpasses established baselines, improving the incremental benefit per cost by up to 57%. |
Aman Madaan · Pranjal Aggarwal · Ankit Anand · Srividya Pranavi Potharaju · Swaroop Mishra · Pei Zhou · Aditya Gupta · Dheeraj Rajagopal · Yiming Yang · Shyam Upadhyay · - Mausam · Manaal Faruqui
|
-
|
Coded Prompts for Large Language Models
(
Poster
)
>
While Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks and various prompting techniques have been proposed, there remains room for performance enhancement. In this work, we introduce a novel dimension to prompt design -- coded prompts for LLM inference. Drawing inspiration from coding theory, where coded symbols communicate or store functions of multiple information symbols, we design coded prompts to process multiple inputs simultaneously. We validate this approach through experiments on two distinct tasks: identifying the maximum prime number within a range and sentence toxicity prediction. Our results indicate that coded prompts can indeed improve task performance. We believe that coded prompts will pave a new way for innovative strategies to enhance the efficiency and effectiveness of LLMs. |
Ziqian Lin · Yicong Chen · Yuchen Zeng · Kangwook Lee 🔗 |
-
|
Deep Embedded Clustering in Few-shot Representations (DECiFR)
(
Poster
)
>
Few-shot Learning has been the center of attention in the deep learning community as it can potentially address the problem of data inaccessibility. Several approaches have been proposed to learn from a few samples efficiently, nevertheless, the majority of them use a large dataset to generalize the feature representation obtained from a single or pre-defined set of backbones before adapting to novel classes. In this paper, different from prior works that use a single best-performing backbone, we present a model-agnostic framework that does not require to "decipher" which backbone is more suitable for the specific FSL task. We propose the Deep Embedded Clustering in Few-shot Representations (DECiFR) algorithm that leverages Deep Embedded Clustering (DEC) to abstract discriminative information from the best combination of features from different backbones, by simultaneously mapping and clustering feature representations using deep neural networks. Subsequently, we propose a contrastive variant of KNN to enhance the cluster separation by propagating through the samples that minimize the inter-class distance and maximize the intra-class distance. Empirical results show that our approach not only enhances the feature embeddings but also boosts the classification accuracy, approaching or surpassing state-of-the-art performance on numerous datasets. |
Yasaman Esfandiari · Rodolfo Valiente Romero · Amir Rahimi 🔗 |
-
|
Divide and Conquer: Two-Level Problem Remodeling for Large-Scale Few-Shot Learning
(
Poster
)
>
Few-shot learning methods have achieved notable performance in recent years. However, few-shot learning in large-scale settings with hundreds of classes is still challenging.In this paper, we tackle the problems of large-scale few-shot learning by taking advantage of pre-trained foundation models. We recast the original problem in two levels with different granularity. At the coarse-grained level, we introduce a novel object recognition approach with robustness to sub-population shifts. At the fine-grained level, generative experts are designed for few-shot learning, specialized for different superclasses.A Bayesian schema is considered to combine coarse-grained information with fine-grained predictions in a winner-takes-all fashion.Extensive experiments on large-scale datasets and different architectures show that the proposed method is both effective and efficient besides its simplicity and natural problem remodeling. The code is publicly available at https://github.com/divnconquer/divideandconquer. |
Mohamadreza Fereydooni · Hosein Hasani · Ali Razghandi · Mahdieh Soleymani 🔗 |
-
|
JAB: Joint Adversarial Prompting and Belief Augmentation
(
Poster
)
>
SlidesLive Video With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model. |
Ninareh Mehrabi · Palash Goyal · Anil Ramakrishna · Jwala Dhamala · Shalini Ghosh · Richard Zemel · Kai-Wei Chang · Aram Galstyan · Rahul Gupta 🔗 |
-
|
Function Constrained Program Synthesis
(
Poster
)
>
This work introduces: (1) a technique that allows pre-trained large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method to iteratively generate modular sub-functions that can aid future code generation attempts when the initial code generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python poses a challenge for LLMs when restricted to using only code provided in the prompt. A naive approach is to present a chat-based LLM (e.g. GPT-4, Claude) with relevant code snippets and prompt the model to synthesize the target algorithm using the provided code. Alternatively, code-specific LLMs (e.g. GitHub Copilot, CodeLlama2) can generate code completions in real-time by drawing on all code available in the integrated development environment. However, restricting code-specific LLMs to use only in-context code is not straightforward, as the model is not explicitly instructed to use the user-generated code and users cannot highlight precisely which snippets of code the model should incorporate into its context for subsequent code-generations. Moreover, chat and code LLMs lack effective recovery methods, forcing users to iteratively re-prompt the model with modified prompts until a sufficient solution is reached.Our method differs from traditional LLM-powered code-generation by constraining code-generation to an explicit function set and enabling recovery from failed attempts through automatically generated sub-functions. When the LLM cannot produce working code, we generate modular sub-functions to aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks (imitating a software team where efficiency scales with experience). We also introduce a new “half-shot” evaluation paradigm that provides tighter estimates of LLMs' coding abilities compared to traditional zero-shot evaluation. Our proposed method encourages models to output solutions in a structured format, decreasing syntax errors that can be mistaken for poor coding ability. |
Patrick A. Hajali · Ignas Budvytis 🔗 |
-
|
On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation
(
Poster
)
>
Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The vision-language foundation model has recently emerged as a promising paradigm, demonstrating impressive learning capabilities across various tasks while requiring a small amount of finetuning samples. While numerous approaches have focused on developing better fine-tuning strategies for specific domains, we instead examine the robustness of such foundation models to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being finetuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness compared to other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, which can be extremely useful for real-world applications. Our experiments show the shortcomings of existing indicators used in natural image applications and the promising results of the proposed Bayesian uncertainty. |
Duy M. H. Nguyen · Tan Ngoc Pham · Nghiem Diep · Nghi Phan · Quang Pham · Vinh Tong · Binh Nguyen · Ngan Le · Nhat Ho · Pengtao Xie · Daniel Sonntag · Mathias Niepert
|
-
|
Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task Adaptation
(
Poster
)
>
Large Language Models (LLMs) have the ability to solve a variety of tasks, such as text summarization and mathematical questions, just out of the box, but they are often trained with a single task in mind.Due to high computational costs, the current trend is to use prompt instruction tuning to better adjust monolithic, pretrained LLMs for new --but often individual-- downstream tasks. Thus, how one would expand prompt tuning to handle --concomitantly-- heterogeneous tasks and data distributions is a widely open question. To address this gap, we suggest the use of Mixture of Prompts, or MoPs, associated with smart gating functionality: the latter --whose design is one of the contributions of this paper-- can identify relevant skills embedded in different groups of prompts and dynamically assign combined experts (i.e., collection of prompts), based on the target task.Additionally, MoPs are empirically agnostic to any model compression technique applied --for efficiency reasons-- as well as instruction data source and task composition.In practice, MoPs can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios (e.g., task and data heterogeneity across sources), as well as possible implications from model approximations. As a highlight, MoPs manage to decrease final perplexity from $\sim20\%$ up to $\sim70\%$, as compared to baselines, in the federated scenario, and from $\sim 3\%$ up to $\sim30\%$ in the centralized scenario.
|
Chen Dun · Mirian Hipolito Garcia · Guoqing Zheng · Ahmed Awadallah · Anastasios Kyrillidis · Robert Sim 🔗 |
-
|
Zero-shot Improvement of Object Counting with CLIP
(
Poster
)
>
link
We focus on the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-Training (CLIP) models. We assess the counting performance of CLIP using a custom dataset, which uncovers significant variations across diverse objects. To address this, we introduce a zero-shot, training-free method aimed at improving counting accuracy by manipulating the text embedding space of CLIP. Through comprehensive experiments, we demonstrate that our method not only enhances the counting capabilities of CLIP but also boosts the performance of text-to-image generative models like Stable Diffusion, particularly in generating images with precise object counts. |
Ruisu Zhang · Yicong Chen · Kangwook Lee 🔗 |
-
|
Efficient Online Data Mixing For Language Model Pre-Training
(
Poster
)
>
The data used to pretrain large language models has a decisive impact on a model’s downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining. |
Alon Albalak · Liangming Pan · Colin Raffel · William Yang Wang 🔗 |
-
|
The Consensus Game: Language Model Generation via Equilibrium Search
(
Poster
)
>
When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate answers). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game—which we term the concensus game—in which a generator seeks to communicate an abstract correctness parameter using natural language sentences to a discriminator. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call equilibrium-ranking. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and assistive dialog), equilibrium-ranking consistently improves performance over existing LM decoding procedures. These improvements are sometimes substantial—on multiple benchmarks, we observe that applying equilibrium-ranking to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. |
Athul Jacob · Yikang Shen · Gabriele Farina · Jacob Andreas 🔗 |
-
|
Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning
(
Poster
)
>
In the field of chemistry, the objective is to create novel molecules with desired properties, facilitating accurate property predictions for applications such as material design and drug screening. However, existing graph deep learning methods face limitations that curb their expressive power. To address this, we explore the integration of vast molecular domain knowledge from Large Language Models(LLMs) with the complementary strengths of Graph Neural Networks (GNNs) to enhance performance in property prediction tasks. We introduce a Multi-Modal Fusion (MMF) framework that synergistically harnesses the analytical prowess of GNNs and the linguistic generative and predictive abilities of LLMs, thereby improving accuracy and robustness in predicting molecular properties. Our frameworkcombines the effectiveness of GNNs in modeling graph-structured data with the zero-shot and few-shot learning capabilities of LLMs, enabling improved predictions while reducing the risk of overfitting. Furthermore, our approach effectively addresses distributional shifts, a common challenge in real-world applications, and showcases the efficacy of learning cross-modal representations, surpassingstate-of-the-art baselines on benchmark datasets for property prediction tasks. |
Sagar Srinivas Sakhinana · Venkataramana Runkana 🔗 |
-
|
Trainable Transformer in Transformer
(
Poster
)
>
SlidesLive Video Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose a new efficient construction, Transformer in Transformer (in short, TINT), that allows a transformer to simulate and fine-tune more complex models during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TINT model with less than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TINT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TINT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe TINT for a OPT-125M model improves performance by 4 − 16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TINT will be open-sourced. |
Abhishek Panigrahi · Sadhika Malladi · Mengzhou Xia · Sanjeev Arora 🔗 |
-
|
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning
(
Poster
)
>
SlidesLive Video The remarkable performance of pre-trained large language models has revolutionised various natural language processing applications. Due to huge parametersizes and extensive running costs, companies or organisations tend to transfer the models to the target task by zero-shot prompting techniques. However, the prohibitive costs of tokens and time have hindered their adoption in applications. We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs, thereby reducing token and time costs. This approach could potentially improve task performance during API queries due to better conditional distribution mapping. Evaluated across diverse classification datasets, our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance, and in some cases, even improving it. An ablation study conducted on various LLMs, along with an investigation into the robustness of our prompting strategy to different input ordering, offers valuable insights into the broader applicability of our method across diverse tasks. These findings also suggest a more seamless integration of our method with LLMs through an API. |
Jiazheng Li · Runcong Zhao · Yongxin Yang · Yulan He · Lin Gui 🔗 |
-
|
Fewshot learning on global multimodal embeddings for earth observation tasks
(
Poster
)
>
In this work we pretrain a CLIP/ViT based model using three different modalities of satellite imagery across five AOIs covering over ~10\% of the earth total landmass, namely Sentinel 2 RGB optical imagery, Sentinel 1 SAR amplitude and Sentinel 1 SAR interferometric coherence. This model uses $\sim 250$ M parameters. Then, we use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation related to vegetation, built up surface, croplands and permanent water. We consistently show how we reduce the need for labeled data by 99\%, so that with ~200-500 randomly selected labeled examples (around 4K-10K km$^2$) we reach performance levels analogous to those achieved with the full labeled datasets (about 150K image chips or 3M km$^2$ in each AOI) on all modalities, AOIs and downstream tasks. This leads us to think that the model has captured significant earth features useful in a wide variety of scenarios. To enhance our model's usability in practice, its architecture allows inference in contexts with missing modalities and even missing channels within each modality. Additionally, we visually show that this embedding space, obtained with no labels, is sensible to the different earth features represented by the labelled datasets we selected.
|
Matthew Allen · Francisco Dorr · Joseph Alejandro Gallego Mejia · Laura Martínez-Ferrer · Anna Jungbluth · Freddie Kalaitzis · Raul Ramos-Pollán 🔗 |
-
|
Selective Prediction For Open-Ended Question Answering in Black-Box Vision-Language Models
(
Poster
)
>
When mistakes have serious consequences, reliable use of a model requires understanding when the predictions of the model are trustworthy. One approach is selective prediction, in which a model is allowed to abstain if it is uncertain. Existing methods for selective prediction require access to model internals, retraining, or large number of model evaluations, and cannot be used for black box models available only through an API. This is a barrier to the use of powerful commercial foundation models in risk-sensitive applications. Furthermore, existing work has largely focused on unimodal foundation models. We propose a method to improve selective prediction in a black box vision-language model by measuring consistency over the neighbors of a visual question. Although direct sampling of the neighborhood is not possible, we propose using a probing model as a proxy. We describe experiments testing the proposed method on in-distribution, out-of-distribution and adversarial questions. We find that the consistency of a vision-language model across rephrasings of a visual question can be used to identify and reject high-risk visual questions, even in out-of-distribution and adversarial settings, constituting a step towards safe use of black-box vision-language models. |
Zaid Khan · Yun Fu 🔗 |
-
|
LOVM: Language-Only Vision Model Selection
(
Poster
)
>
Pre-trained multi-modal vision-language models (VLMs) excel in downstream applications, especially in the few- and zero-shot settings. However, choosing the optimal VLM for some downstream applications is challenging due to task and dataset dependencies. Exhaustive evaluation of all VLMs is impractical and requires the collection of a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. To address this, we introduce a novel task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We also present an extensive LOVM benchmark consisting of ground-truth evaluations of 23 pre-trained VLMs and 35 datasets, enabling effective ranking and performance prediction of VLMs. Code and dataset will be publicly available upon publication. |
Orr Zohar · Shih-Cheng Huang · Kuan-Chieh Wang · Serena Yeung 🔗 |
-
|
Context is Environment
(
Poster
)
>
Two lines of work are taking center stage in AI research. On the one hand, increasing efforts are being made to build models that generalize out-of-distribution (OOD). Unfortunately, a hard lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to the eclectic contextual circumstances. We argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context$\unicode{x2013}\unicode{x2013}$unlabeled examples as they arrive$\unicode{x2013}\unicode{x2013}$allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant OOD performance improvements. From all of this, two messages are worth taking home: researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization.
|
Sharut Gupta · David Lopez-Paz · Stefanie Jegelka · Kartik Ahuja 🔗 |
-
|
InstructEval: Systematic Evaluation of Instruction Selection Methods
(
Poster
)
>
In-context learning (ICL) performs tasks by prompting a large language model (LLM) using an instruction and a small set of annotated examples called demonstrations. Recent work has shown that precise details of the inputs used in the ICL prompt significantly impact performance, which has incentivized instruction selection algorithms. The effect of instruction-choice however is severely underexplored, with existing analyses restricted to shallow subsets of models and tasks, limiting the generalizability of their insights. We develop InstructEval, an ICL evaluation suite to conduct a thorough assessment of these techniques. The suite includes 13 open-sourced LLMs of varying scales from four model families, and covers nine tasks across three categories. Using the suite, we evaluate the relative performance of seven popular instruction selection methods over five metrics relevant to ICL. Our experiments reveal that using curated manually-written instructions or simple instructions without any task-specific descriptions often elicits superior ICL performance overall than that of automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite for benchmarking instruction selection approaches and enabling more generalizable methods in this space. |
Anirudh Ajith · Mengzhou Xia · Ameet Deshpande · Karthik Narasimhan 🔗 |
-
|
PATHFINDER: Guided Search over Multi-Step Reasoning Paths
(
Poster
)
>
With recent advancements in large language models, methods like chain-of-thought prompting to elicit reasoning chains have been shown to improve results on reasoning tasks. However, tasks that require multiple steps of reasoning still pose significant challenges to state-of-the-art models. Drawing inspiration from the beam search algorithm, we propose PATHFINDER, a tree-search-based reasoning path generation approach. It enhances diverse branching and multi-hop reasoning through the integration of dynamic decoding, enabled by varying sampling methods and parameters. Using constrained reasoning, PATHFINDER integrates novel quality constraints, pruning, and exploration methods to enhance the efficiency and the quality of generation. Moreover, it includes scoring and ranking features to improve candidate selection. Our approach outperforms competitive baselines on three complex arithmetic and commonsense reasoning tasks by 6% on average. Our model generalizes well to longer, unseen reasoning chains, reflecting similar complexities to beam search with large branching factors. |
Olga Golovneva · Sean O'Brien · Ramakanth Pasunuru · Tianlu Wang · Luke Zettlemoyer · Maryam Fazel-Zarandi · Asli Celikyilmaz 🔗 |
-
|
Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination
(
Poster
)
>
SlidesLive Video We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors. |
Sajad Mousavi · Ricardo Luna Gutierrez · Desik Rengarajan · Vineet Gundecha · Ashwin Ramesh Babu · Avisek Naug · Antonio Guillen-Perez · Soumyendu Sarkar 🔗 |
-
|
Meta- (out-of-context) learning in neural networks
(
Poster
)
>
Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. |
Dmitrii Krasheninnikov · Egor Krasheninnikov · Bruno Mlodozeniec · David Krueger 🔗 |
-
|
Zero-shot Clustering of Embeddings with Pretrained and Self-Supervised Learnt Encoders
(
Poster
)
>
We explore whether large pretrained models can provide a useful representation space for datasets they were not trained on, and whether these representations can be used to group novel unlabelled data into meaningful clusters. To this end, we conduct experiments using image encoders pretrained on ImageNet using either supervised or self-supervised training techniques. These encoders are deployed on image datasets that were not seen during training, and we investigate whether their embeddings can be clustered with conventional clustering algorithms. We find that it is possible to create well-defined clusters using self-supervised feature encoders, especially when using the Agglomerative Clustering method, and that it is possible to do so even for very fine-grained datasets such as NABirds. We also find indications that the Silhouette score is a good proxy of cluster quality for self-supervised feature encoders when no ground-truth is available. |
Scott Lowe · Joakim Bruslund Haurum · Sageev Oore · Thomas Moeslund · Graham Taylor 🔗 |
-
|
Flexible visual prompts for in context learning in computer vision
(
Poster
)
>
In this work, we address in-context learning (ICL) for computer vision, introducing a novel approach that adapts a modern Video Object Segmentation (VOS) technique for visual ICL. This adaptation is inspired by the VOS methods' ability to efficiently and flexibly learn objects from a few examples. Through evaluations across a range of support set sizes and on diverse segmentation datasets, our method consistently surpasses existing techniques. Notably, it excels with data containing classes not encountered during training. Additionally, we propose a technique for support set selection that enhances the performance of all tested ICL methods. We plan to release all code for this study prior to publication. |
Thomas Foster · Ioana Croitoru · Robert Dorfman · Christoffer Edlund · Thomas Varsavsky · Jon Almazan 🔗 |
-
|
Why Larger Language Models Do In-context Learning Differently?
(
Poster
)
>
Large language models (LLM) have emerged as a powerful tool for many AI problems and are deeply involved in many aspects of human activity. One important emergent ability is in-context learning (ICL), where LLM can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model's parameters. Many works trying to study ICL and one recent interesting counter-intuitive observation is that different scale language models may have different ICL behaviors. Despite the tremendous success made by ICL, why different ICL behaviors remains a mystery. In this work, we are trying to answer this question. As a limited understanding of the ICL mechanism, we study a simplified setting, one-layer single-head linear self-attention network pretrained on linear regression in-context task. We characterize language model scale as the rank of key and query matrix in attention. We show that smaller language models are more robust to noise, while larger language models are easily distracted, leading to different ICL behaviors. We also conduct ICL experiments utilizing the LLaMA model families. The results are consistent with previous work and our analysis. |
Zhenmei Shi · Junyi Wei · Zhuoyan Xu · Yingyu Liang 🔗 |
-
|
Analyzing ChatGPT’s Behavior Shifts Over Time
(
Poster
)
>
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on two tasks: 1) solving math problems, and 2) generating code. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers ($84\%$ accuracy) but GPT-4 (June 2023) was poor on these same questions ($51\%$ accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. We provide evidence that GPT-4's ability to follow user instructions has decreased over time, which is one common factor behind the many behavior drifts. Overall, our findings show that the behavior of the ``same'' LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
|
Lingjiao Chen · Matei A Zaharia · James Zou 🔗 |
-
|
Are Large Language Models Post Hoc Explainers?
(
Poster
)
>
Large Language Models (LLMs) are increasingly used as powerful tools for a plethora of natural language processing (NLP) applications. A recent innovation, in-context learning (ICL), enables LLMs to learn new tasks by supplying a few examples in the prompt during inference time, thereby eliminating the need for model fine-tuning. While LLMs have been utilized in several applications, their applicability in explaining the behavior of other models remains relatively unexplored. Despite the growing number of new explanation techniques, many require white-box access to the model and/or are computationally expensive, highlighting a need for next-generation post hoc explainers. In this work, we present the first framework to study the effectiveness of LLMs in explaining other predictive models. More specifically, we propose a novel framework encompassing multiple prompting strategies: i) Perturbation-based ICL, ii) Prediction-based ICL, iii) Instruction-based ICL, and iv) Explanation-based ICL, with varying levels of information about the underlying ML model and the local neighborhood of the test sample. We conduct extensive experiments with real-world benchmark datasets to demonstrate that LLM generated explanations perform on par with state-of-the-art post hoc explainers using their ability to leverage ICL examples and their internal knowledge in generating model explanations. On average, across four datasets and two ML models, we observe that LLMs identify the most important feature with 72.19% accuracy, opening up new frontiers in explainable artificial intelligence (XAI) to explore LLM-based explanation frameworks. |
Nicholas Kroeger · Dan Ley · Satyapriya Krishna · Chirag Agarwal · Himabindu Lakkaraju 🔗 |
-
|
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget
(
Poster
)
>
The recent work CLIPA presents an inverse scaling law for CLIP training --- whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with approximately 13B image-text pairs seen during training. Our results are exciting --- by only allocating a budget of $\textdollar$10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0\% and meanwhile reducing the computational cost by approximately $39\times$. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. By upscaling a G/14 model, we've achieved an impressive state-of-the-art zero-shot ImageNet accuracy of 83.0%, relying solely on open-source data.
|
Xianhang Li · Zeyu Wang · Cihang Xie 🔗 |
-
|
Inferring Latent Class Statistics from Text for Robust Visual Few-Shot Learning
(
Poster
)
>
In the realm of few-shot learning, foundation models like CLIP have proven effective but exhibit limitations in cross-domain robustness especially in few-shot settings. Recent works add text as an extra modality to enhance the performance of these models. Most of these approaches treat text as an auxiliary modality without fully exploring its potential to elucidate the underlying class visual features distribution. In this paper, we present a novel approach that leverages text-derived statistics to predict the mean and covariance of the visual feature distribution for each class. This predictive framework enriches the latent space, yielding more robust and generalizable few-shot learning models. We demonstrate the efficacy of incorporating both mean and covariance statistics in improving few-shot classification performance across various datasets. Our method shows that we can use text to predict the mean and covariance of the distribution offering promising improvements in few-shot learning scenarios. |
Yassir BENDOU · Bastien Pasdeloup · Giulia Lioi · Vincent Gripon · Fabien Cardinaux · Ghouthi BOUKLI HACENE · Lukas Mauch 🔗 |
-
|
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
(
Spotlight
)
>
While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. We present a dataset of over 126,808 prompt injection attacks and 46,457 anti-injection "defense'' prompts to elucidate this problem, created by players of an online game called Tensor Trust. To the best of our knowledge, this is the largest dataset of human-generated adversarial examples for instruction-following LLMs. We demonstrate that these attacks often have a simple structure that sheds light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, our small-scale experiments on deployed LLM-based applications show that attack strategies in the dataset generalize beyond the setting of the game. We release all data and source code. |
Sam Toyer · Olivia Watkins · Ethan Mendes · Justin Svegliato · Luke Bailey · Tiffany Wang · Isaac Ong · Karim Elmaaroufi · Pieter Abbeel · Trevor Darrell · Alan Ritter · Stuart J Russell
|
-
|
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions
(
Spotlight
)
>
In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set. |
Satwik Bhattamishra · Arkil Patel · Phil Blunsom · Varun Kanade 🔗 |
-
|
Learning Through Consistency for Prompt Tuning
(
Spotlight
)
>
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models that addresses the challenge of improving the generalization capability of large foundation models while fine-tuning them on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input further regularizes the consistency constraint, effectively improving generalization, while tuning additional parameters with prompting and adapters improves the performance on downstream tasks. Extensive experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation tasks. On the generalization task, CoPrompt improves the state-of-the-art by 2.09\% on the zero-shot task and 1.93\% on the harmonic mean over 11 recognition datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt. |
Shuvendu Roy · Ali Etemad 🔗 |
-
|
Effective Data Augmentation With Diffusion Models
(
Spotlight
)
>
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains. |
Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov 🔗 |
-
|
Evaluating Adversarial Defense in the Era of Large Language Models
(
Spotlight
)
>
Large language models (LLMs) have demonstrated superior performance in many natural language processing tasks. Existing works have shown that LLMs are not robust to adversarial attacks, questioning the applicability of these models in scenarios with safety concerns. However, one key aspect that has been overlooked is evaluating and developing defense mechanisms against adversarial attacks.In this work, we systematically study how LLMs react to different adversarial defense strategies. We also propose defenses tailored for LLMs that can significantly improve their robustness: First, we develop prompting methods to alert the LLM about potential adversarial contents; Second, we use neural models such as the LLM itself for typo correction; Third, we propose an effective fine-tuning scheme to improve robustness against corrupted inputs.Extensive experiments are conducted to evaluate the adversarial defense approaches. We show that by using the proposed defenses, robustness of LLMs can increase by up to 20\%. |
Joachim Studnia · Simiao Zuo · Xiaodong Liu · Qiang Lou · Jian Jiao · Denis Charles 🔗 |
-
|
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
(
Spotlight
)
>
Low-rank adaptations (LoRA) are often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a strategic framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples from a novel task, LoraHub enables the fluid combination of multiple LoRA modules, eradicating the need for human expertise. Notably, the composition requires neither additional model parameters nor gradients. Our empirical results, derived from the Big-Bench Hard (BBH) benchmark, suggest that LoraHub can effectively mimic the performance of in-context learning in few-shot scenarios, excluding the necessity of in-context examples alongside each inference input. A significant contribution of our research is the fostering of a community for LoRA, where users can share their trained LoRA modules, thereby facilitating their application to new tasks. We anticipate this resource will widen access to and spur advancements in general intelligence as well as LLMs in production. |
Chengsong Huang · Qian Liu · Bill Yuchen Lin · Chao Du · Tianyu Pang · Min Lin 🔗 |
-
|
TART: A plug-and-play Transformer module for task-agnostic reasoning
(
Spotlight
)
>
Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and, as a proof of concept, propose TART which generically improves an LLM's reasoning abilities using a synthetically trained reasoning module. TART trains this Transformer-based reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, Bloom), model sizes (100M - 6B), tasks (14 NLP classification tasks), and even across different modalities (audio and vision). On the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms Bloom (176B), and is within $4$% of GPT-3.
|
Kush Bhatia · Avanika Narayan · Christopher De Sa · Christopher Ré 🔗 |
-
|
Estimating Uncertainty in Multimodal Foundation Models using Public Internet Data
(
Spotlight
)
>
SlidesLive Video Foundation models are trained on vast amounts of data at scale using self-supervised learning, enabling adaptation to a wide range of downstream tasks. At test time, these models exhibit zero-shot capabilities through which they can classify previously unseen (user-specified) categories. In this paper, we address the problem of quantifying uncertainty in these zero-shot predictions. We propose a heuristic approach for uncertainty estimation in zero-shot settings using conformal prediction with web data. Given a set of classes at test time, we conduct zero-shot classification with CLIP-style models using a prompt template, e.g., ``an image of a |
Shiladitya Dutta · Hongbo Wei · Lars van der Laan · Ahmed Alaa 🔗 |
-
|
Towards General-Purpose In-Context Learning Agents
(
Spotlight
)
>
Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution. |
Louis Kirsch · James Harrison · Daniel Freeman · Jascha Sohl-Dickstein · Jürgen Schmidhuber 🔗 |
-
|
Uncertainty In Natural Language Explanations Of Large Language Models
(
Spotlight
)
>
Large Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent works on prompting claim to elicit intermediate reasoning steps and important tokens in LLMs to serve as proxy explanations for its predictions. However, there is no guarantee or certainty whether these explanations are reliable and reflect the LLM's true behavior. In this work, we introduce the first definitions of uncertainty in natural language explanations of LLMs, where we propose a novel approach $\textit{Probing Uncertainty}$ --- to quantify the confidence of the generated explanations. Our approach probes a neighbourhood of explanations of the LLM to estimate the uncertainty. While verbalized uncertainty involves prompting the LLM to express its confidence level in generated explanations, we show that it is not a reliable estimate of explanation confidence. Our empirical analysis reveals two key insights about uncertainty in generated natural language explanations: i) Verbalized uncertainty estimation using LLMs often exhibits high overconfidence, raising questions on the trustworthiness of its explanation, and ii) Explanation confidence calculated from the proposed metric is correlated with the faithfulness of an explanation, where lower explanation confidence pertains to explanations with lower faithfulness. Our study provides insights into the challenges and opportunities in quantifying uncertainty in explanations of LLMs, contributing to the broader discussion of explainability and trustworthiness in machine learning applications.
|
Sree Harsha Tanneru · Chirag Agarwal · Himabindu Lakkaraju 🔗 |
-
|
Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
(
Spotlight
)
>
Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding tasks. |
Han Zhou · Xingchen Wan · Lev Proleev · Diana Mincu · Jilin Chen · Katherine Heller · Subhrajit Roy 🔗 |
-
|
Trained Transformers Learn Linear Models In-Context
(
Spotlight
)
>
Attention-based neural network sequence models such as transformers have the capacity to act as supervised learning algorithms: They can take as input a sequence of labeled examples and output predictions for unlabeled test examples. Indeed, recent work by Garg et al. has shown that when training GPT2 architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of in-context learning of linear predictors for a transformer with a single linear self-attention layer trained by gradient flow. We show that despite the non-convexity of the underlying optimization problem, gradient flow with a random initialization finds a global minimum of the objective function. Moreover, when given a prompt of labeled examples from a new linear prediction task, the trained transformer achieves small prediction error on unlabeled test examples. We further characterize the behavior of the trained transformer under distribution shifts. |
Ruiqi Zhang · Spencer Frei · Peter Bartlett 🔗 |
-
|
A Universal Prompt Generator for Large Language Models
(
Spotlight
)
>
SlidesLive Video LLMs are primarily reliant on high-quality and task-specific prompts. However, the prompt engineering process relies on clever heuristics and requires multiple iterations. Some recent works attempt to automate this process by improving upon human written prompts. However, creating high-quality prompts from scratch is still an unresolved challenge owing to its inherent complexity. In this work, we propose UniPrompt, a novel technique for generating high-quality human-like prompts from scratch. To do so, we identify characteristic features of human-generated prompts such as being detailed and consisting of multiple sections. Our proposed method, UniPrompt, takes as input a single sentence description of the task and generates human-like sectioned prompts using an auxiliary language model. We train the model in two stages. First, the model is finetuned on multiple tasks using a novel dataset curated using GPT-4 across over 500 tasks. Second, we align the auxiliary model to generate task-relevant (high accuracy) prompts by collecting a prompt preference dataset and optimizing the model using the Direct Preference Optimization method. Importantly, UniPrompt is task-agnostic: once trained, it can be used to generate prompts for any task. We find that UniPrompt outperforms human-generated prompts, GPT-generated prompts, and other prompt optimization techniques across diverse tasks on medicine, causality, and hate speech by up to 5.1 %, 7.2 %, and 11.1 % respectively. |
Gurusha Juneja · Amit Sharma 🔗 |
-
|
Efficient Online Data Mixing For Language Model Pre-Training
(
Spotlight
)
>
The data used to pretrain large language models has a decisive impact on a model’s downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining. |
Alon Albalak · Liangming Pan · Colin Raffel · William Yang Wang 🔗 |
-
|
InstructEval: Systematic Evaluation of Instruction Selection Methods
(
Spotlight
)
>
In-context learning (ICL) performs tasks by prompting a large language model (LLM) using an instruction and a small set of annotated examples called demonstrations. Recent work has shown that precise details of the inputs used in the ICL prompt significantly impact performance, which has incentivized instruction selection algorithms. The effect of instruction-choice however is severely underexplored, with existing analyses restricted to shallow subsets of models and tasks, limiting the generalizability of their insights. We develop InstructEval, an ICL evaluation suite to conduct a thorough assessment of these techniques. The suite includes 13 open-sourced LLMs of varying scales from four model families, and covers nine tasks across three categories. Using the suite, we evaluate the relative performance of seven popular instruction selection methods over five metrics relevant to ICL. Our experiments reveal that using curated manually-written instructions or simple instructions without any task-specific descriptions often elicits superior ICL performance overall than that of automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite for benchmarking instruction selection approaches and enabling more generalizable methods in this space. |
Anirudh Ajith · Mengzhou Xia · Ameet Deshpande · Karthik Narasimhan 🔗 |