Confidence Scores for Temporal Properties over Sequences of Predictions
Abstract
Detection models such as YOLO and Grounding DINO can detect local properties, such as objects ("car"), in individual frames of a video. The likelihood of each detection is indicated by a confidence score in [0, 1]. How can these confidence scores be lifted to temporal properties over sequences (e.g., "a car is detected at all times in the video" or "a traffic light is eventually detected")? In this work, we study the problem of assigning Confidence Scores for TempOral Properties (CSTOPs) over sequences, given confidence score predictors for local properties. These scores are useful for several downstream applications, such as query matching, ranking, and fine-tuning. We argue that Temporal Logic provides a natural language for expressing diverse temporal properties of interest. Inspired by quantitative semantics for Temporal Logic, we propose a scoring function called LogCSTOP for defining CSTOPs that can be computed in time polynomial in the lengths of the sequence and the temporal property, whereas existing methods require exponential time and space. We introduce the CSTOP-RealTLV-video benchmark, comprising video sequences from the RealTLV dataset and ground-truth labels for temporal properties drawn from 15 diverse templates. Finally, we show that LogCSTOP outperforms Large Vision Language Models, such as LongVA-7B, and an existing neurosymbolic baseline on query matching over videos by more than 16% in balanced accuracy.
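To illustrate the lifting problem the abstract describes, here is a minimal sketch of the classical min/max quantitative semantics for temporal operators, which inspires this line of work. The frame-level scores below are hypothetical, and this sketch is not the paper's LogCSTOP definition; it only shows how per-frame confidences can be aggregated into a sequence-level score for "always" and "eventually" properties.

```python
import math

# Hypothetical per-frame confidence scores for the local property
# "a car is detected" (values in [0, 1]).
car_scores = [0.9, 0.8, 0.95, 0.7]

def always(scores):
    """Classical quantitative semantics: 'always phi' = min over frames."""
    return min(scores)

def eventually(scores):
    """Classical quantitative semantics: 'eventually phi' = max over frames."""
    return max(scores)

def always_log(scores):
    """Illustrative log-domain alternative: treat frame scores as independent
    probabilities, so 'always phi' multiplies them (sums log-scores).
    This is an assumption for illustration, not the paper's definition."""
    return math.exp(sum(math.log(s) for s in scores))

print(always(car_scores))      # 0.7
print(eventually(car_scores))  # 0.95
print(always_log(car_scores))
```

Note that the min/max semantics keeps scores in [0, 1] but is insensitive to all frames except the extremal one, while the product-style aggregation uses every frame; trade-offs of this kind motivate studying alternative scoring functions such as LogCSTOP.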