We introduce Mistake Attribution (MATT), a new task for fine-grained understanding of human mistakes in egocentric videos. While prior work detects whether a mistake occurs, MATT attributes the mistake to what part of the instruction is violated (semantic role), when in the video the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame.
We develop MisEngine, a data engine that automatically constructs mistake samples from existing datasets with attribution-rich annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M—two datasets up to two orders of magnitude larger than prior mistake datasets.
We then present MisFormer, a unified attention-based model for mistake attribution across semantic, temporal, and spatial dimensions, trained with MisEngine supervision. A human study demonstrates the ecological validity of our MisEngine-constructed mistake samples, confirming that EPIC-KITCHENS-M and Ego4D-M can serve as reliable benchmarks for mistake understanding.
Experiments on both our datasets and prior benchmarks show that MisFormer, as a single unified model, outperforms task-specific SOTA methods by at least 6.66%, 21.81%, 18.7%, and 3.00% in video-language understanding, temporal localization, hand-object interaction, and mistake detection, respectively.
A significant challenge in mistake video analysis is the paucity of available data relative to the massive diversity of possible mistakes. MisEngine overcomes this by automatically constructing new mistake understanding datasets from source corpora through a careful series of sampling and cross-matching steps. It applies Semantic Role Labeling to the instruction text and then systematically cross-matches individual roles (e.g., Object or Predicate) between an instruction and the descriptions of other action videos, producing misaligned instruction-video pairs with a known violated role. Videos naturally inherit temporal and spatial annotations that we process into the corresponding attribution targets. Applied to Ego4D and EPIC-KITCHENS, MisEngine yields Ego4D-M (257K samples) and EPIC-KITCHENS-M (221K samples)—at least two orders of magnitude larger than any existing mistake dataset.
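The cross-matching idea can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the role names, parse format, and `cross_match` helper are assumptions made for exposition. Swapping one semantic role between two SRL parses yields a misaligned instruction whose violated role is known by construction, giving the semantic attribution label for free.

```python
# Hypothetical sketch of MisEngine-style cross-matching (interface assumed,
# not from the paper). An SRL parse is modeled as {role: filler}.

def cross_match(parse_a, parse_b, role):
    """Return parse_a with `role` replaced by parse_b's filler for that role.

    The result describes an instruction that no longer matches parse_a's
    video; `role` is the known semantic attribution target.
    """
    misaligned = dict(parse_a)
    misaligned[role] = parse_b[role]
    return misaligned

# Toy parses of two action descriptions
a = {"Predicate": "cut", "Object": "tomato", "Tool": "knife"}
b = {"Predicate": "pour", "Object": "milk", "Tool": "jug"}

pair = cross_match(a, b, "Object")
# pair reads "cut milk with knife": misaligned with the video of `a`,
# and the violated role is "Object" by construction.
```

Because the mismatch is injected deliberately, each generated sample carries exact supervision for which role is violated, which is what makes the pipeline scale without manual annotation.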
MisFormer jointly processes the instruction text and an attempt video, extracting shared multimodal features. Three specialized transformer heads perform semantic attribution (detecting misaligned roles), temporal localization (pinpointing the Point-of-No-Return frame), and spatial localization (predicting mistake regions via attention-driven bounding boxes), enabling comprehensive and interpretable mistake analysis across semantic, temporal, and spatial dimensions. At inference, the temporal and spatial modules are gated—invoked only when any role is predicted as Mistake.
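The gated inference described above can be sketched in a few lines. This is a minimal illustration under assumed interfaces (the head callables, feature object, and output keys are placeholders, not the authors' API): the temporal and spatial heads run only when the semantic head flags at least one role as Mistake.

```python
# Minimal sketch of MisFormer's gated inference (assumed interface, not the
# authors' code). `temporal_head` and `spatial_head` are placeholder callables.

MISTAKE = "Mistake"

def gated_inference(role_preds, temporal_head, spatial_head, feats):
    """role_preds: {role: label}; heads are invoked only if a mistake is found."""
    out = {"roles": role_preds, "pnr": None, "box": None}
    if any(label == MISTAKE for label in role_preds.values()):  # the gate
        out["pnr"] = temporal_head(feats)   # Point-of-No-Return frame index
        out["box"] = spatial_head(feats)    # mistake region in the PNR frame
    return out

# Toy usage with stand-in heads and dummy features
result = gated_inference(
    {"Object": MISTAKE, "Predicate": "Correct"},
    temporal_head=lambda f: 42,
    spatial_head=lambda f: (10, 20, 50, 60),
    feats=None,
)
```

The gate keeps temporal and spatial attribution conditional on the semantic verdict, so correct attempts skip the localization heads entirely.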
Experiments show that existing methods—even when specialized for individual subtasks—struggle to address MATT, underscoring its challenge. MisFormer, as a single unified model, consistently outperforms strong baselines, including Video-Language Models, Temporal Localization Models, Hand-Object Interaction detectors, and mistake detection methods, on both our datasets and prior benchmarks, surpassing task-specific SOTA methods by at least 6.66%, 21.81%, 18.7%, and 3.00% in video-language understanding, temporal localization, hand-object interaction, and mistake detection, respectively.
@inproceedings{li2026mistakeattribution,
title = {Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos},
author = {Li, Yayuan and Jain, Aadit and Bellos, Filippos and Corso, Jason J.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}