-
Hidden Naming Contracts in SWE-Agent Benchmarks
A programmatic scan across six SWE-bench-style benchmarks finds that tests sometimes encode hidden naming requirements, penalizing behaviorally correct fixes that choose different identifiers.
-
The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis
An analysis of how strongly visual content degrades AI agents' performance on SWE-bench Multimodal tasks.
-
From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
An analysis of discriminative subsets of SWE-Bench Verified that uncovers the real performance of SWE-Agents and shows how aggregate scores can mask large performance variations across task types.
-
Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
An analysis of the task difficulty distribution in SWE-Bench-Verified, based on human annotations, showing the full range of task complexity and what it means for evaluating AI coding performance.
-
The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
A deep analysis of why SWE-Bench Verified's focus on single-file changes fails to represent real-world programming, which typically involves multi-file modifications and complex codebase interactions.