Generating fluent and relevant language to describe visual content is critical for the video captioning task. Many existing methods generate captions using sequence models that predict words in a left-to-right order. In this paper, we investigate a graph-structured model for caption generation by explicitly modeling the hierarchical structure in the sentences to further improve the fluency and relevance of sentences. To this end, we propose a novel video captioning method that generates a sentence by first constructing a multi-modal dependency tree and then traversing the constructed tree, where the syntactic structure and semantic relationship in the sentence are represented by the tree topology. To take full advantage of the information from both vision and language, both the visual and textual representation features are encoded into each tree node. Different from existing dependency parsing methods that generate uni-modal dependency trees for language understanding, our method constructs multi-modal dependency trees for language generation from images and videos. We also propose a tree-structured reinforcement learning algorithm to effectively optimize the captioning model, where a novel reward is designed by evaluating the semantic consistency between the generated sub-tree and the ground-truth tree. Extensive experiments on several video captioning datasets demonstrate the effectiveness of the proposed method.
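The "construct a tree, then traverse it" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the `Node` class, the `linearize` function, and the hand-built example tree are hypothetical, and the visual features that the paper encodes into each node are omitted. The sketch only assumes that each node keeps its dependents split into those that precede and those that follow its word, so that an in-order traversal recovers the sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical dependency-tree node holding a word and its dependents."""
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents before the word
    right: List["Node"] = field(default_factory=list)  # dependents after the word


def linearize(node: Node) -> List[str]:
    """In-order traversal: left dependents, the head word, then right dependents."""
    words: List[str] = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.word)
    for child in node.right:
        words.extend(linearize(child))
    return words


# Hand-built tree for illustration; in the paper the tree is predicted from video.
tree = Node(
    "playing",
    left=[Node("man", left=[Node("a")]), Node("is")],
    right=[Node("guitar", left=[Node("a")])],
)
caption = " ".join(linearize(tree))
print(caption)
```

The head verb sits at the root, so the tree topology carries the syntactic structure, while the traversal order alone determines the final word sequence.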
Author Information
Wentian Zhao (Beijing Institute of Technology)
Xinxiao Wu (Beijing Institute of Technology)
Jiebo Luo (U. Rochester)
More from the Same Authors
- 2023 Poster: Wyze Rule: Federated Rule Dataset for Rule Recommendation Benchmarking
  Mohammad Mahdi Kamani · Yuhang Yao · Hanjia Lyu · Zhongwei Cheng · Lin Chen · Liangju Li · Carlee Joe-Wong · Jiebo Luo
- 2021 Poster: Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training
  Hongwei Xue · Yupan Huang · Bei Liu · Houwen Peng · Jianlong Fu · Houqiang Li · Jiebo Luo
- 2020 Poster: Learning Semantic-aware Normalization for Generative Adversarial Networks
  Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha
- 2020 Spotlight: Learning Semantic-aware Normalization for Generative Adversarial Networks
  Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha
- 2019 Poster: Learning Deep Bilinear Transformation for Fine-grained Image Representation
  Heliang Zheng · Jianlong Fu · Zheng-Jun Zha · Jiebo Luo