The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis
The Collapse of AI Coding Agents When Images Enter the Picture
AI coding agents now write real PRs and merge code at unprecedented scale, but they collapse when faced with visual content. In SWE-bench Multimodal, we observe a 73.2 percentage-point drop in solve rates when images are involved, a result that calls the field's current trajectory into question.
The Rise of SWE-Agents in Industry
We are witnessing unprecedented adoption of SWE-agents across the industry, clearly visible in the surge of pull requests being opened and merged by autonomous coding agents. PR Arena tracks PRs opened and merged by the top SWE coding agents: GitHub Copilot, OpenAI Codex, Devin, Cursor Agents, and Codegen. The numbers tell a compelling story of AI agents increasingly participating in real-world software development workflows.
Reality Check: A Tale of Two Benchmarks - The Performance Cliff
However, when we examine the benchmarks used to evaluate these SWE-agents, a different pattern emerges. The top performers on the SWE-bench Verified leaderboard achieve ~70-75% across all instances, yet the same models on SWE-bench Multimodal drop to ~30-35%. This raises a natural question: why do SWE agents using identical models excel on Verified but struggle so significantly on the Multimodal dataset?
We know from our previous analysis that the SWE-bench Verified benchmark suffers from a saturation problem - while state-of-the-art SWE-agents achieve impressive overall scores, they struggle significantly on discriminative subsets that isolate truly challenging problems, e.g., problems that require complex reasoning across multiple files. This article investigates whether a similar pattern exists for the multimodal benchmark - and what factors drive the performance cliff we observe.
This discrepancy raises a critical question: Is this truly a benchmark problem, or does it reveal fundamental limitations in current AI systems?
Enter SWE-bench Multimodal: A New Perspective
To answer this question, we turn to SWE-bench Multimodal, a recently introduced benchmark that offers a fresh lens on AI agent capabilities. Unlike SWE-bench Verified, which focuses primarily on textual code understanding and generation, SWE-bench Multimodal targets JavaScript-based, user-facing applications such as UI design systems, web apps, interactive mapping libraries, and syntax highlighters.
This multimodal approach better reflects the reality of software engineering, where developers must interpret visual information alongside code. Issues on GitHub frequently include screenshots, error dialogs, and visual demonstrations of problems that need solving.
The dataset's test split contains 510 instances, and our performance analysis uses this split. Only 48 instances (9.4%) are purely text-based[^1], while 462 instances (90.6%) include visual elements, making this a predominantly multimodal challenge.
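As a rough illustration of how this split can be derived, here is a minimal sketch that counts instances with and without visual content. It assumes the dataset is published on Hugging Face as princeton-nlp/SWE-bench_Multimodal and that the `image_assets` field is empty for purely text-based issues; the exact field format may differ.

```python
# Minimal sketch: count text-only vs. image-containing instances in the test split.
# Assumes the dataset id "princeton-nlp/SWE-bench_Multimodal" and that `image_assets`
# is empty for purely text-based issues; adjust if the field format differs.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Multimodal", split="test")

def has_images(instance) -> bool:
    assets = instance.get("image_assets") or ""
    # The field may be a JSON-encoded string; fall back to a plain truthiness check.
    if isinstance(assets, str):
        try:
            assets = json.loads(assets)
        except json.JSONDecodeError:
            pass
    return any(assets.values()) if isinstance(assets, dict) else bool(assets)

with_images = sum(has_images(inst) for inst in ds)
print(f"total={len(ds)}  text_only={len(ds) - with_images}  with_images={with_images}")
# Expected, per the split described above: total=510, text_only=48, with_images=462
```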
The Visual Complexity Penalty
We uncover a steep 73.2 percentage-point drop in solve rates when images are introduced to coding tasks.
This dramatic drop highlights a critical limitation in multimodal AI for software engineering and suggests that the gaps we saw in SWE-bench Verified are symptoms of deeper architectural challenges. Our analysis examines the performance of 11 SWE agents (from 2025) on the SWE-bench Multimodal leaderboard.[^2]
Before diving into the detailed analysis, let’s establish the key differences between these benchmarks:
Metric | SWE-bench Verified | SWE-bench Multimodal |
---|---|---|
Top performance | ~75% | ~36% |
Single-file tasks | 85.8% | 40% |
Tasks with images | 0% | 90.6% |
Task focus | Text/code only | Visual + code |
Domain | General repositories | JavaScript UI/web apps |
We categorize performance into three key areas:
- Overall Performance - Complete benchmark results across all 510 instances
- Text-Only Performance - Instances without visual elements (48 instances)
- Images Present Performance - Instances containing visual elements (462 instances)
Our analysis reveals a consistent and striking pattern: visual content severely degrades performance across all models.
Overall Performance Comparison
The following table illustrates the stark contrast between text-only and image-containing performance across all evaluated agents. Notice how nearly every agent achieves near-perfect scores on text-only instances (97-100%, with a single outlier at 81.2%) but experiences severe degradation when visual elements are present:
Rank | Agent Name | Date | Overall % (510 instances) | Text-Only % (48 instances) | Images Present % (462 instances) |
---|---|---|---|---|---|
★ | Combined Systems | 2025-07-26 | 47.8% | 100.0% | 42.4% |
1 | GUIRepair + o3 (2025-04-16) | 2025-07-01 | 36.5% | 100.0% | 29.9% |
2 | Refact.ai Agent | 2025-06-11 | 36.1% | 100.0% | 29.4% |
3 | OpenHands-Versa (Claude-Sonnet 4) | 2025-05-28 | 34.9% | 100.0% | 28.1% |
4 | GUIRepair + o4-mini (2025-04-16) | 2025-05-31 | 34.3% | 100.0% | 27.5% |
5 | OpenHands-Versa (Claude-3.7 Sonnet) | 2025-05-09 | 31.8% | 100.0% | 24.7% |
6 | GUIRepair + GPT 4.1 (2025-04-14) | 2025-05-31 | 31.6% | 100.0% | 24.5% |
7 | Zencoder (2025-04-01) | 2025-04-01 | 31.0% | 97.9% | 24.0% |
8 | GUIRepair + GPT 4o (2024-08-06) | 2025-05-31 | 30.8% | 100.0% | 23.6% |
9 | Globant Code Fixer Agent | 2025-03-25 | 30.0% | 100.0% | 22.7% |
10 | Zencoder (2025-03-10) | 2025-03-11 | 27.5% | 81.2% | 21.9% |
11 | Agentless Lite + Claude-3.5 Sonnet | 2025-02-26 | 25.7% | 100.0% | 18.0% |
The top-performing agents achieve near-perfect performance on text-only instances (97-100%) but drop sharply when visual elements are introduced: an average gap of 73.2 percentage points (99.1% average text-only performance minus 25.9% average image-present performance).
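The aggregation behind this headline number is simply the difference of the two subset means. Below is a minimal sketch using the 11 per-agent rows from the table above; the exact component averages depend on how the rows are weighted, but the gap comes out to roughly 73 points either way.

```python
# Per-agent resolve rates (%) from the leaderboard table above (ranks 1-11).
text_only_rates = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 97.9, 100.0, 100.0, 81.2, 100.0]
images_present_rates = [29.9, 29.4, 28.1, 27.5, 24.7, 24.5, 24.0, 23.6, 22.7, 21.9, 18.0]

mean_text = sum(text_only_rates) / len(text_only_rates)
mean_images = sum(images_present_rates) / len(images_present_rates)

# The drop is an absolute difference in percentage points, not a relative decrease.
print(f"average drop: {mean_text - mean_images:.1f} percentage points")  # -> ~73.2
```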
Is Image Quantity the Key Factor?
The analysis above suggests that images are the primary factor, but is there more to the story? Let’s examine the relationship between the number of images and solve rates across different subsets.
Number of Images | Instance Count | Average Solve Rate (%) |
---|---|---|
0 | 48 | 51.4% |
1 | 281 | 14.1% |
2 | 87 | 16.5% |
3 | 58 | 16.4% |
≥4 | 36 | 19.4% |
One might assume that more images mean more difficulty, but the data does not support this. The cliff appears as soon as a single image is present, and solve rates remain low whether an instance contains 1 image or 4 or more.
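For readers who want to reproduce this breakdown, here is a minimal sketch of the bucketing. The mappings image_count and n_solvers are hypothetical names for data built from the dataset's `image_assets` metadata and the agents' instance-level results; they are not part of any published API.

```python
from collections import defaultdict

def solve_rate_by_image_count(image_count, n_solvers, n_agents=11, cap=4):
    """Group instances by image count and average the per-instance solve rate per bucket.

    image_count: instance_id -> number of images attached to the issue (hypothetical mapping)
    n_solvers:   instance_id -> how many of the n_agents resolved the instance (hypothetical mapping)
    """
    buckets = defaultdict(list)
    for iid, n_images in image_count.items():
        bucket = min(n_images, cap)  # fold 4, 5, ... into a single ">=4" bucket
        buckets[bucket].append(n_solvers.get(iid, 0) / n_agents)
    return {
        b: (len(rates), round(100 * sum(rates) / len(rates), 1))
        for b, rates in sorted(buckets.items())
    }

# Returns {bucket: (instance_count, average_solve_rate_percent)}, matching the
# layout of the table above, e.g. {0: (48, ...), 1: (281, ...), 2: (87, ...), ...}.
```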
The Multi-File Editing Connection?
While we cannot directly analyze patch complexity due to the absence of gold patches in SWE-bench Multimodal, the original paper provides crucial insights about the underlying task complexity:
> In SWE-bench, 83% of task instances edit one file, and 65% edit one function. In SWE-bench M (Multimodal), just 40% of task instances change one file, and 32.5% change one function. On average, the changes by reference solutions in SWE-bench M are larger than those in SWE-bench, with multi-file edits being more commonplace.
This finding aligns with our previous analyses on patch complexity and multi-file patterns and single-file saturation effects, which showed that state-of-the-art SWE agents struggle significantly with multi-file editing tasks.
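Although the reference patches for SWE-bench Multimodal are not public, the single-file vs. multi-file classification itself is straightforward when patches are available, as they are for the original SWE-bench. A minimal sketch, assuming patches in unified-diff format; the function names here are illustrative, not from any benchmark tooling.

```python
import re

def files_touched(patch: str) -> int:
    """Count the distinct files modified by a unified diff."""
    # Every modified file contributes a "diff --git a/<path> b/<path>" header line.
    paths = re.findall(r"^diff --git a/(\S+) b/\S+", patch, flags=re.MULTILINE)
    return len(set(paths))

def single_file_fraction(patches) -> float:
    """Fraction of patches that modify exactly one file."""
    counts = [files_touched(p) for p in patches]
    return sum(c == 1 for c in counts) / len(counts)

# The original paper reports ~83% single-file edits for SWE-bench gold patches,
# versus only ~40% for SWE-bench M, the statistic quoted above.
```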
The 73.2-point drop therefore reflects intersecting difficulties: not just visual complexity, but the compound challenge of multimodal reasoning combined with the multi-file edits that are prevalent in image-containing instances.
Conclusion
Our analysis of SWE-bench Multimodal provides a definitive answer to the central question we posed: the performance gaps we observe in discriminative subsets aren’t just benchmark artifacts - they reflect genuine limitations in current AI architectures when handling complex, real-world software engineering tasks.
The 73.2-point drop reveals a double burden that extends far beyond simple visual reasoning difficulties. Our analysis uncovered that image-containing instances systematically involve more complex multi-file editing tasks (only 40% single-file in SWE-bench Multimodal vs. 83% in traditional SWE-bench), creating a perfect storm of difficulties that current AI systems struggle to handle.
This finding directly connects to our previous research on discriminative subsets, where we demonstrated that state-of-the-art agents struggle with multi-file reasoning even in purely text-based scenarios.
The consistency of this penalty across all top-performing models - from GUIRepair variants to OpenHands-Versa to Agentless Lite - suggests that the challenge runs deeper than individual model architectures.
These agents achieve near-perfect performance on text-only instances (97-100% for all but one) but uniformly collapse to 18-30% when visual elements and multi-file complexity are introduced. This pattern indicates fundamental gaps in how current AI systems integrate visual and textual information in technical contexts, particularly when structural complexity increases.
Key Takeaways
Our findings point to three critical conclusions:
- Current AI models collapse under multimodal + multi-file complexity - The 73.2-point drop demonstrates that existing architectures cannot handle the intersecting difficulties of visual reasoning and structural code analysis.
- This impacts the real-world reliability of SWE agents - Many GitHub issues include screenshots and visual context, and if these follow patterns similar to SWE-bench Multimodal, current deployments may face significant limitations in complex scenarios.
- Fundamentally new multimodal reasoning architectures are needed - The path forward requires not just better visual understanding, but entirely new approaches to managing complexity when visual reasoning intersects with multi-file code modifications - the hallmark of challenging software engineering work.
Citation
If you find this analysis useful for your research, please cite it as:
@misc{ganhotra2025visual,
title={The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis},
author={Ganhotra, Jatin},
year={2025},
month={July},
url={https://jatinganhotra.dev/blog/swe-agents/2025/07/26/swe-bench-multimodal-visual-complexity/},
note={Blog post}
}
[^1]: The 48 text-based instances are identified using the `image_assets` field in the dataset metadata, which systematically flags whether each instance contains visual content based on analysis of the GitHub issue attachments and descriptions.

[^2]: Due to data collection format improvements between October 2024 and early 2025, our detailed analysis focuses on 11 agents from 2025 with complete instance-level results. While 21 total SWE agents have been submitted to the benchmark, 10 agents from October 2024 provided only aggregate performance numbers without specifying which instances were resolved, preventing detailed comparative analysis.