SWE-Bench

an archive of posts with this tag

Apr 05, 2026 Hidden Naming Contracts in SWE-Agent Benchmarks
A programmatic scan of six SWE-bench-style benchmarks, including SWE-bench Verified, SWE-bench Pro and SWE-PolyBench, finds tests that encode hidden naming contracts, penalizing behaviorally correct fixes that choose different identifiers.
Dec 31, 2024 OpenHands CodeAct v2.1 vs. Tools + Claude 3.5 Sonnet
Head-to-head comparison of OpenHands CodeAct v2.1 and Anthropic Claude 3.5 Sonnet on SWE-bench Verified, analyzing the performance differences and capabilities of these leading SWE-agent approaches.
Dec 26, 2024 SWE-Bench Verified ⊊ real-world SWE tasks
Why SWE-bench Verified is only a subset of real-world software engineering tasks — comparing SWE-agents such as OpenHands CodeAct v2.1, Amazon Q, SWE-agent, Agentless and AutoCodeRover, with Claude 3.5 Sonnet.