SWE-Bench
an archive of posts with this tag
| Date | Post | Summary |
|---|---|---|
| Apr 05, 2026 | Hidden Naming Contracts in SWE-Agent Benchmarks | A programmatic scan across six SWE-bench-style benchmarks finds that tests sometimes encode hidden naming requirements, penalizing behaviorally correct fixes that choose different identifiers. |
| Dec 31, 2024 | OpenHands CodeAct v2.1 v/s Tools + Claude 3.5 Sonnet | Comprehensive comparison between OpenHands CodeAct v2.1 and Claude 3.5 Sonnet on SWE-Bench tasks, analyzing the performance differences and capabilities of these leading SWE-Agent approaches. |
| Dec 26, 2024 | SWE-Bench Verified ⊊ real-world SWE tasks | Analysis of how SWE-Bench Verified relates to real-world software engineering tasks, exploring the subset relationship between benchmark evaluation and practical development challenges. |