SWE-Bench

an archive of posts with this tag

Apr 05, 2026 Hidden Naming Contracts in SWE-Agent Benchmarks
A programmatic scan across six SWE-bench-style benchmarks finds that tests sometimes encode hidden naming requirements, penalizing behaviorally correct fixes that choose different identifiers.
Dec 31, 2024 OpenHands CodeAct v2.1 v/s Tools + Claude 3.5 Sonnet
A comparison of OpenHands CodeAct v2.1 and the Tools + Claude 3.5 Sonnet setup on SWE-Bench tasks, analyzing how the two SWE-agent approaches differ in performance and capabilities.
Dec 26, 2024 SWE-Bench Verified ⊊ real-world SWE tasks
An analysis of how SWE-Bench Verified relates to real-world software engineering tasks, treating the benchmark as a strict subset of practical development challenges.