-
The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis
Analyzing how visual content degrades AI agents' performance on software engineering tasks
-
From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
Uncovering the real capabilities of SWE-Agents by analyzing discriminative subsets of SWE-Bench Verified, showing how aggregate scores can mask large performance variations across task types.
-
Cracking the Code: How Difficult Are SWE-Bench Verified Tasks Really?
An analysis of the task difficulty distribution in SWE-Bench Verified using human annotations, revealing the true complexity spectrum and what it means for evaluating AI coding performance.
-
The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
A deep analysis of why SWE-Bench Verified's focus on single-file changes fails to capture real-world programming work, which typically involves multi-file modifications and complex codebase interactions.
-
Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified
Exploring how SWE-Agents handle multi-file software engineering tasks compared to human developers, with a detailed analysis of patterns and performance on the SWE-Bench Verified benchmark.