Differences That Matter:
Auditing Models for Capability Gap Discovery and Rectification

 
 

Qihao Liu 1,2 Chengzhi Mao 1 Yaojie Liu 1 Alan Yuille 2 Wen-Sheng Chu 1

 1 Google  2 Johns Hopkins University
 

LLM/MLLM leaderboards tell us who wins, but they rarely explain what changed, where a model fails, or what is still missing.

Therefore, we propose AuditDM, a framework that finds capability gaps and turns them into concrete fixes.

1. Systematically discovers capability gaps between models and produces interpretable weakness summaries, enabling a comprehensive understanding of model behavior.

2. Delivers actionable feedback that guides fixes and model improvement.

 

 

Abstract

 

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

 

 

 

Method

 

 

AuditDM fine-tunes an MLLM into an auditor that generates challenging probing questions and counterfactual images, the latter via captions for image regeneration or via editing commands. The resulting question–image pairs are those on which the target model fails while the MLLM ensemble agrees, which exposes capability gaps and failure modes. The auditor is trained to maximize the prediction discrepancy between the target and the ensemble. Once trained, it identifies weaknesses and failure cases in a single inference pass.
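As a rough illustration of the discrepancy signal described above, the sketch below scores a probe as successful when the ensemble reaches a consensus and the target model disagrees with it. This is not the authors' reward or training code: the function names (`ensemble_consensus`, `discrepancy_reward`, `collect_failures`), the exact-match comparison of normalized answers, and the `min_agreement` threshold are all assumptions made for the example, and models are stood in for by plain Python callables.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface: (image_id, question) -> answer string.
Model = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Crude answer normalization (assumption: exact match after cleanup)."""
    return answer.strip().lower()


def ensemble_consensus(answers: List[str], min_agreement: float = 1.0) -> Optional[str]:
    """Return the most common answer if enough ensemble members give it, else None."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    top_answer, votes = counts.most_common(1)[0]
    return top_answer if votes / len(answers) >= min_agreement else None


def discrepancy_reward(target_answer: str, ensemble_answers: List[str],
                       min_agreement: float = 1.0) -> float:
    """1.0 when the ensemble agrees and the target differs, otherwise 0.0."""
    consensus = ensemble_consensus(ensemble_answers, min_agreement)
    if consensus is None:
        return 0.0  # ensemble is split: the probe is ambiguous, so no reward
    return 1.0 if normalize(target_answer) != consensus else 0.0


def collect_failures(probes: List[Tuple[str, str]], target: Model,
                     ensemble: List[Model]) -> List[Tuple[str, str, str]]:
    """Keep probes the target gets 'wrong' relative to the ensemble consensus.

    Each kept item carries the consensus answer, which could serve as an
    annotation-free label for later fine-tuning.
    """
    failures = []
    for image_id, question in probes:
        answers = [model(image_id, question) for model in ensemble]
        if discrepancy_reward(target(image_id, question), answers) == 1.0:
            failures.append((image_id, question, ensemble_consensus(answers)))
    return failures


# Toy usage with stand-in models (no real MLLM is called here).
toy_target: Model = lambda image_id, question: "two"
toy_ensemble: List[Model] = [lambda i, q: "three", lambda i, q: "Three "]
toy_failures = collect_failures([("img_0", "How many birds are there?")],
                                toy_target, toy_ensemble)
```

In an actual reinforcement-learning setup, a score of this kind would be computed on the auditor's generated question–image pairs and fed back as the reward. The binary exact-match form used here is only the simplest choice; a softer measure of disagreement would fit the same interface.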

 


 

AuditDM for Model Failure Detection

   
 
   
 
   

 


 

AuditDM for Model Improvement

   
[Figure: Improving PaliGemma2-3B.]
[Figure: Improving Gemma3-4B.]

 


 

BibTeX

@article{liu2025differences,
  title={Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification},
  author={Liu, Qihao and Mao, Chengzhi and Liu, Yaojie and Yuille, Alan and Chu, Wen-Sheng},
  journal={arXiv preprint arXiv:2512.16921},
  year={2025}
}