With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization.We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning.Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
翻译:暂无翻译