Text-guided Fine-Grained Video Anomaly Understanding¶
People¶
1University College London, 2United Arab Emirates University
CVPR 2026 SVC Workshop
Resources¶
Bibtex
@inproceedings{gu2026tvau,
author = {Gu, Jihao and Li, Kun and Wang, He and Ak{\c{s}}it, Kaan},
title = {Text-guided Fine-Grained Video Anomaly Understanding},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2nd Workshop on Subtle Visual Computing (SVC)},
year = {2026},
address = {Denver, CO, USA}
}
Abstract¶
Subtle abnormal events in videos often manifest as weak spatio-temporal cues that are easily overlooked by conventional anomaly detection systems. Existing video anomaly detection approaches typically provide coarse binary anomaly decisions without interpretable evidence, while large vision-language models (LVLMs) can produce textual judgments but lack precise localization of subtle visual signals. To address this gap, we propose Text-guided Fine-Grained Video Anomaly Understanding (T-VAU), a framework that grounds subtle anomaly evidence into multimodal reasoning. Specifically, we introduce an Anomaly Heatmap Decoder (AHD) that performs visual-textual feature alignment to extract pixel-level spatio-temporal anomaly heatmaps from intermediate visual representations. We further design a Region-aware Anomaly Encoder (RAE) that converts these heatmaps into structured prompt embeddings, enabling the LVLM to perform anomaly detection, localization, and semantic explanation in a unified reasoning pipeline. To support fine-grained supervision, we construct a target-level fine-grained video-text anomaly dataset derived from ShanghaiTech and UBnormal with detailed annotations of object appearance, localization, and motion trajectories. Extensive experiments demonstrate that T-VAU significantly improves anomaly localization and textual reasoning performance on both benchmarks, achieving strong BLEU-4 scores and Yes/No decision accuracy while providing interpretable pixel-level spatio-temporal evidence for anomaly understanding.
Proposed Method¶
Model Design¶
The proposed T-VAU model. The framework consists of three modules:
- Text Encoder (\(E_t\)) that generates class-specific text embeddings \(\mathbf{S}_c\) from binary prompts
- Anomaly Heatmap Decoder (AHD) that fuses \(\mathbf{S}_c\) with visual features \(\mathcal{V}\) to produce spatio-temporal pixel-level anomaly heatmaps \(\mathbf{H}\)
- Region-aware Anomaly Encoder (RAE) that projects \(\mathbf{H}\) into the LoRA-tuned LVLM semantic space and integrates it with the video \(\mathcal{V}\) and a sequence of incrementally refined questions \(\mathbf{Q}_{\leq t}\) to yield the final anomaly understanding response \(\mathbf{A}_t\)
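The AHD's visual-textual alignment can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: we model the fusion of \(\mathbf{S}_c\) with \(\mathcal{V}\) as a cosine-similarity map between a class text embedding and per-pixel visual features, whereas the actual decoder is a learned module; the function name `anomaly_heatmap` and the tensor shapes are hypothetical.

```python
import numpy as np

def anomaly_heatmap(visual_feats, text_embed):
    """Sketch of text-visual alignment for anomaly heatmaps.

    visual_feats: (T, H, W, D) per-pixel visual features (hypothetical shape)
    text_embed:   (D,) class-specific text embedding S_c
    Returns a (T, H, W) heatmap with values in [0, 1].
    """
    # L2-normalize both sides so the dot product is a cosine similarity
    v = visual_feats / np.linalg.norm(visual_feats, axis=-1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sim = v @ t                 # per-pixel cosine similarity in [-1, 1]
    return (sim + 1.0) / 2.0    # rescale to a [0, 1] confidence map
```

Mapping the similarity to \([0, 1]\) yields a threshold-free confidence map, matching the role the heatmap \(\mathbf{H}\) plays in the downstream pipeline.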
Fine-grained Anomaly Understanding¶
Fine-grained anomaly dataset construction pipeline. Starting from existing datasets with only pixel-level anomaly labels (e.g., ShanghaiTech and UBnormal), we build a structured video–text dataset through three stages:
- Frame-level structured prompting to extract target attributes and spatial information and aggregate them into target timelines
- Anomaly-focused refinement using anomaly masks and background suppression to emphasize abnormal evidence
- Cross-modal consistency verification between appearance and motion cues
The resulting dataset provides aligned annotations of appearance, spatial localization, and motion trajectory for fine-grained anomaly understanding.
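The timeline-aggregation step of the first stage can be sketched as follows. The record schema (`frame`, `target_id`, `bbox`, `attrs`) is hypothetical, chosen only to illustrate grouping per-frame extractions into per-target timelines; the paper's actual annotation format may differ.

```python
from collections import defaultdict

def aggregate_timelines(frame_records):
    """Group per-frame target records into time-ordered per-target timelines.

    frame_records: iterable of dicts with hypothetical keys
                   "frame", "target_id", "bbox", "attrs".
    Returns {target_id: [(frame, bbox, attrs), ...]} sorted by frame.
    """
    timelines = defaultdict(list)
    for rec in frame_records:
        timelines[rec["target_id"]].append(
            (rec["frame"], rec["bbox"], rec["attrs"])
        )
    # Sort each target's entries chronologically to form its timeline
    for tid in timelines:
        timelines[tid].sort(key=lambda entry: entry[0])
    return dict(timelines)
```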
Experimental Results¶
Anomaly Detection Results. Experimental results on UBnormal. We report micro-/macro-averaged frame-level AUC, RBDC, and TBDC (%). PT, FT, and 1S denote pre-trained, fine-tuned, and one-shot; SR denotes frame sampling rate. Best results are highlighted in bold.
| Method | Micro AUC ↑ | Macro AUC ↑ | RBDC ↑ | TBDC ↑ |
|---|---|---|---|---|
| Georgescu et al. | 58.5 | 94.4 | 18.580 | 48.213 |
| Georgescu et al. (FT) | 68.2 | 95.3 | 28.654 | 58.097 |
| Sultani et al. (PT) | 61.1 | 89.4 | 0.001 | 0.012 |
| Sultani et al. (FT) | 51.8 | 88.0 | 0.001 | 0.001 |
| Bertasius et al. (FT, SR=1/32) | 86.1 | 89.2 | 0.008 | 0.021 |
| Bertasius et al. (FT, SR=1/8) | 83.4 | 90.6 | 0.009 | 0.023 |
| Bertasius et al. (FT, SR=1/4) | 78.5 | 89.2 | 0.006 | 0.018 |
| AHD (1S) | 94.5 | 85.2 | 64.300 | 74.400 |
| AHD (FT) | 94.8 | 87.8 | 67.800 | 76.700 |
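For readers reproducing the frame-level AUC columns: micro-AUC pools frames from all videos before computing a single ROC AUC, while macro-AUC averages per-video AUCs. A self-contained sketch using the Mann-Whitney formulation of AUC; the input format (a list of per-video score/label arrays) is our assumption, not the paper's evaluation code.

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (ties count 0.5)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def micro_macro_auc(videos):
    """videos: list of (frame_scores, frame_labels) pairs, one per video."""
    # Micro: pool all frames across videos, then compute one AUC
    all_s = np.concatenate([s for s, _ in videos])
    all_l = np.concatenate([l for _, l in videos])
    micro = auc(all_s, all_l)
    # Macro: average the per-video AUCs (NaN-safe for all-normal videos)
    macro = float(np.nanmean([auc(s, l) for s, l in videos]))
    return micro, macro
```

The micro/macro gap in the table reflects this distinction: a method can rank frames well within each video (high macro) yet score videos on incompatible scales when pooled (lower micro), or vice versa.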
Multi-turn Dialogue Results. One-shot evaluation results of representative LVLMs on our constructed dataset. “Size” denotes model parameters (billions). BLEU-4 is reported for Target and Trajectory, and Accuracy for Yes/No.
| Method | Size ↓ | Target (ST) ↑ | Trajectory (ST) ↑ | Yes/No (ST) ↑ | Target (UB) ↑ | Trajectory (UB) ↑ | Yes/No (UB) ↑ |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (zero-shot) | 7B | 18.74 | 27.33 | 61.03% | 16.20 | 24.18 | 65.62% |
| Qwen2.5-VL (one-shot) | 7B | 50.42 | 78.91 | 92.36% | 44.35 | 70.82 | 87.24% |
| LLaVA-1.6 (one-shot) | 7B | 47.68 | 75.42 | 91.07% | 42.11 | 68.07 | 85.91% |
| MiniCPM-V 2.6 (one-shot) | 7B | 52.34 | 80.41 | 93.11% | 46.70 | 72.88 | 86.94% |
| Idefics2 (one-shot) | 8B | 44.29 | 73.84 | 90.12% | 39.51 | 65.92 | 84.03% |
| InternVL (one-shot) | 8B | 55.73 | 82.65 | 94.28% | 49.84 | 71.63 | 88.65% |
| RAE (Ours) | 7B | 62.67 | 88.84 | 97.67% | 50.32 | 78.10 | 89.73% |
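The BLEU-4 columns above can be understood with a minimal sentence-level implementation. This is a sketch under simplifying assumptions (single reference, whitespace tokenization, no smoothing); real evaluations typically use corpus-level BLEU with smoothing, so scores from this function will not match the table exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Minimal sentence-level BLEU-4: geometric mean of modified 1- to
    4-gram precisions times a brevity penalty. Candidates shorter than
    four tokens score 0 here (no smoothing)."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(c, n), ngrams(r, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(sum(cand.values()), 1))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```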
Ablation Studies. Results of different T-VAU variants. Heatmap metrics include RBDC and TBDC. BLEU-4 is reported for Target and Trajectory. Accuracy is Yes/No classification.
| Method | Params ↓ | RBDC ↑ | TBDC ↑ | Target ↑ | Trajectory ↑ | Yes/No ↑ |
|---|---|---|---|---|---|---|
| T-VAU w/o AHD | 8299M | -- | -- | 61.82 | 85.47 | 95.38% |
| T-VAU w/o RAE | 8317M | 67.8 | 76.7 | -- | -- | -- |
| T-VAU w/o AHD & RAE | 8274M | -- | -- | 61.82 | 85.47 | 95.38% |
| T-VAU | 8324M | 67.8 | 76.7 | 62.67 | 88.84 | 97.67% |
Examples of interpretable anomaly detection and multi-turn QA across scenes. Each group shows the raw frame, pixel-level anomaly heatmaps produced by AHD, and T-VAU's dialogue outputs (anomaly yes/no, appearance/action details, and motion trajectory).
- Left: a cyclist (with umbrella and backpack) is localized as the anomalous target, with the trajectory “enter from right \(\rightarrow\) turn toward the upper-right corner \(\rightarrow\) exit.”
- Right: a silver SUV suddenly appears from the left and moves rapidly; AHD consistently highlights the vehicle, and the QA module explains the abrupt appearance and fast motion.
T-VAU first detects the anomaly, then describes the appearance (white top, grey shorts, green schoolbag) and the change from walking to running. Red arrows indicate the main motion directions, and heatmap intensity reflects anomaly confidence. RAE encodes the heatmaps into region-aware text prompts that guide the LVLM to produce consistent decisions and descriptions, closing the loop from pixel-level evidence to readable narratives.
Trajectory visualization by accumulating frame-level outputs. The first row shows multi-frame overlays of the original video with green bounding boxes for GT and red bounding boxes for predictions. The second row overlays GT pixel-level masks to form fine-grained trajectories, while the third row overlays predicted pixel-level masks. Both bounding-box and pixel-level trajectories show strong spatial alignment with GT, indicating that our model accurately captures motion paths over time.
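Accumulating frame-level outputs into a trajectory footprint, as in the visualization above, amounts to a temporal union of the per-frame binary masks. A minimal sketch (the function name is ours; the actual visualization additionally overlays colors and bounding boxes):

```python
import numpy as np

def accumulate_trajectory(frame_masks):
    """Union per-frame binary masks into one spatio-temporal footprint.

    frame_masks: sequence of (H, W) binary arrays, one per frame.
    Returns an (H, W) boolean mask covering every pixel the target touched.
    """
    traj = np.zeros_like(frame_masks[0], dtype=bool)
    for m in frame_masks:
        traj |= m.astype(bool)  # logical OR accumulates the motion path
    return traj
```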
Conclusions¶
We present T-VAU, a closed-loop framework that unifies pixel-level anomaly grounding and high-level semantic reasoning by coupling an Anomaly Heatmap Decoder (AHD) with a Region-aware Anomaly Encoder (RAE). By aligning visual features with textual prompts, T-VAU achieves precise, threshold-free spatio-temporal anomaly localization, while its region- and motion-aware prompt design enables LVLMs to perform faithful, structured, and multi-turn anomaly reasoning. This unified formulation goes beyond conventional score-based paradigms, jointly supporting detection, localization, target identification, and explanation within a single framework. Extensive experiments on UBnormal and ShanghaiTech demonstrate consistent improvements over prior methods across localization accuracy, reasoning quality, and dialogue-based evaluation, while ablations confirm the strong complementarity between AHD and RAE.
Outreach¶
We host a Slack group with more than 250 members, focused on the topics of rendering, perception, displays, and cameras. The group is open to the public, and you can become a member by following this link.
Contact Us¶
Warning
Please reach out to us via email to share your feedback and comments.
Acknowledgements¶
We would like to thank Alex Chapiro for insightful discussions and constructive feedback on earlier versions of this manuscript.