SWE-Bench_Verified

An archive of posts with this tag.

Jun 05, 2025 From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
Uncovering the real performance of SWE-Agents by analyzing discriminative subsets of SWE-Bench Verified, showing how aggregate scores can mask significant performance variations across task types.
Apr 15, 2025 Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
Analysis of the task difficulty distribution in SWE-Bench Verified using human annotations, revealing the true complexity spectrum and what it means for evaluating AI coding performance.
Mar 30, 2025 The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
Deep analysis of why SWE-Bench Verified's focus on single-file changes fails to represent real-world programming work, which typically involves multi-file modifications and complex codebase interactions.
Jan 05, 2025 Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified
Exploring how SWE-Agents handle multi-file software engineering tasks compared to human developers, with detailed analysis of patterns and performance on the SWE-Bench Verified benchmark.
Dec 31, 2024 OpenHands CodeAct v2.1 vs. Tools + Claude 3.5 Sonnet
Comprehensive comparison of OpenHands CodeAct v2.1 and the Tools + Claude 3.5 Sonnet agent on SWE-Bench tasks, analyzing the performance differences and capabilities of these leading SWE-Agent approaches.
Dec 26, 2024 SWE-Bench Verified ⊊ real-world SWE tasks
Analysis of how SWE-Bench Verified relates to real-world software engineering tasks, arguing that the benchmark covers only a proper subset of practical development challenges.