Reframe Anything: LLM Agent for Open World Video Reframing
Abstract
The rapid proliferation of mobile devices and social media has fundamentally transformed content dissemination, with short-form video emerging as a dominant medium. To adapt original video content to this format, manual reframing is often required to meet constraints on duration and device screen size. This process is not only labor-intensive and time-consuming but also demands significant professional expertise. While machine learning techniques such as video salient object detection offer promising avenues for automation, existing approaches typically lack human-in-the-loop interaction, making it difficult to accommodate personalized user preferences. To address these limitations, AI systems must be capable of fully understanding user intent and dynamically tailoring video reframing strategies in response to evolving requirements. The powerful capabilities of large language models (LLMs) make them particularly well-suited for handling such complex multimodal interaction scenarios. Building on this insight, we introduce \textbf{R}eframe \textbf{A}ny \textbf{V}ideo \textbf{A}gent (RAVA), an LLM-based agent that integrates visual foundation models with human instructions to intelligently restructure visual content for video reframing. RAVA operates in three stages: \textit{perception}, where it interprets user instructions and video content; \textit{planning}, where it determines suitable aspect ratios and reframing strategies; and \textit{execution}, where it invokes editing tools to produce the final video. Our experiments demonstrate the effectiveness of RAVA in both video salient object detection and real-world reframing tasks, showcasing its potential as a powerful tool for AI-powered video editing.
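
To make the perception--planning--execution pipeline described above concrete, the following is a minimal, purely illustrative sketch of such an agent loop. It is not the RAVA implementation: every name in it (\texttt{call\_llm}, \texttt{detect\_salient\_regions}, \texttt{crop\_and\_render}) is a hypothetical placeholder standing in for an LLM client, a visual foundation model, and a video editing tool, respectively.

\begin{verbatim}
# Illustrative sketch of a perception-planning-execution agent loop for
# video reframing. All functions below are hypothetical stubs, not the
# actual RAVA components.
from dataclasses import dataclass

@dataclass
class ReframePlan:
    aspect_ratio: str   # e.g. "9:16" for vertical short-form video
    strategy: str       # e.g. "track salient object", "center crop"

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real client."""
    return "9:16|track salient object"

def detect_salient_regions(video_path: str) -> list[tuple[int, int, int, int]]:
    """Placeholder for a visual foundation model (salient object detection)."""
    return [(0, 0, 640, 360)]  # one dummy bounding box

def crop_and_render(video_path: str, boxes, plan: ReframePlan) -> str:
    """Placeholder editing tool: crop frames to the planned aspect ratio."""
    return video_path.replace(".mp4",
                              f"_{plan.aspect_ratio.replace(':', 'x')}.mp4")

def reframe(video_path: str, instruction: str) -> str:
    # Perception: interpret the user instruction and video content.
    regions = detect_salient_regions(video_path)
    # Planning: ask the LLM for an aspect ratio and reframing strategy.
    raw = call_llm(f"Instruction: {instruction}\n"
                   f"Salient regions: {regions}\n"
                   "Reply as '<aspect_ratio>|<strategy>'.")
    ratio, strategy = raw.split("|", 1)
    plan = ReframePlan(aspect_ratio=ratio.strip(), strategy=strategy.strip())
    # Execution: invoke editing tools to produce the final video.
    return crop_and_render(video_path, regions, plan)

if __name__ == "__main__":
    print(reframe("demo.mp4", "Make a vertical clip focused on the speaker"))
\end{verbatim}

In this hedged sketch, the LLM output format (an aspect ratio and a strategy separated by a delimiter) is an assumption chosen for brevity; a real agent would use structured tool-calling and richer video metadata.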