

Poster

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Jae Sung Park · Jack Hessel · Khyathi Chandu · Paul Pu Liang · Ximing Lu · Peter West · Youngjae Yu · Qiuyuan Huang · Jianfeng Gao · Ali Farhadi · Yejin Choi

Great Hall & Hall B1+B2 (level 1) #404
Wed 13 Dec 3 p.m. PST — 5 p.m. PST

Abstract:

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model which allows users to specify (multiple) regions-as-input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
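To make the described pipeline concrete, below is a minimal sketch of the corpus-generation and critic-filtering step. It is an illustrative outline only, not the authors' implementation: the names `build_prompt`, `distill_example`, `llm_generate`, `critic_score`, and the 0.5 threshold are assumptions introduced for this example.

```python
# Hypothetical sketch of localized commonsense corpus generation:
# a literal global caption plus literal per-region captions are fed to an
# LLM, and a separately trained critic keeps only high-quality samples.
# All names and the threshold here are placeholders, not the paper's code.

def build_prompt(global_caption: str, region_captions: dict) -> str:
    """Combine a whole-image description with per-region descriptions into
    a prompt asking the LLM for region-grounded commonsense inferences."""
    region_lines = "\n".join(
        f"Region [{tag}]: {desc}" for tag, desc in region_captions.items()
    )
    return (
        f"Image description: {global_caption}\n"
        f"{region_lines}\n"
        "Generate commonsense question-answer-rationale triples that refer "
        "to the regions by their tags (e.g., [0], [1])."
    )


def distill_example(global_caption, region_captions,
                    llm_generate, critic_score, threshold=0.5):
    """Sample localized commonsense knowledge from an LLM; keep it only if
    the critic model scores it above the threshold."""
    prompt = build_prompt(global_caption, region_captions)
    candidate = llm_generate(prompt)  # any text-only LLM sampling call
    if critic_score(global_caption, region_captions, candidate) >= threshold:
        return candidate              # retained as training data for the VL model
    return None                       # discarded by the critic filter
```

In this sketch, `llm_generate` and `critic_score` stand in for whatever LLM sampling call and learned quality classifier one has available; the retained examples would then serve as supervision for fine-tuning a VL model to accept regions as input.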
