-
Hidden Naming Contracts in SWE-Agent Benchmarks
A programmatic scan of six SWE-bench-style benchmarks, including SWE-bench Verified, SWE-bench Pro and SWE-PolyBench, finds tests that encode hidden naming contracts, penalizing behaviorally correct fixes that choose different identifiers.
-
The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis
How visual complexity penalizes SWE-agents on SWE-bench Multimodal — testing SWE-agent, Agentless and OpenHands with Claude 3.7 Sonnet and OpenAI o3 on visually rich GitHub issues.
-
From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
Discriminative subsets of SWE-bench Verified reveal true SWE-agent capability — how aggregate scores hide wide variation across SWE-agent, OpenHands, Claude 4 Opus and the L* agent (from 73% to 11%).
-
Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
Task-difficulty distribution in SWE-bench Verified from human annotations — what easy, medium and hard mean for SWE-agents like SWE-agent and Agentless running Claude and OpenAI o1.
-
The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
Why SWE-bench Verified's focus on single-file changes misses real-world multi-file programming — analyzed across SWE-agent, Agentless, Claude 3 Opus, Claude 3.5 Sonnet, OpenAI o1 and Amazon Q.