SWE-Bench_Verified

An archive of posts with this tag.

Jun 05, 2025 From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
Uncovering the real performance of SWE-Agents by analyzing discriminative subsets of SWE-Bench Verified, showing how aggregate scores can mask significant performance variations across task types.
Apr 15, 2025 Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
Analysis of the task difficulty distribution in SWE-Bench Verified using human annotations, revealing the true complexity spectrum and what it means for evaluating AI coding performance.
Mar 30, 2025 The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
Deep analysis of why SWE-Bench Verified's focus on single-file changes fails to represent real-world programming work, which typically involves multi-file modifications and complex codebase interactions.
Jan 05, 2025 Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified
Exploring how SWE-Agents handle multi-file software engineering tasks compared to human developers, with detailed analysis of patterns and performance on the SWE-Bench Verified benchmark.
Dec 31, 2024 OpenHands CodeAct v2.1 vs. Tools + Claude 3.5 Sonnet
Comprehensive comparison of OpenHands CodeAct v2.1 and the Tools + Claude 3.5 Sonnet agent on SWE-Bench tasks, analyzing the performance differences and capabilities of these leading SWE-Agent approaches.
Dec 26, 2024 SWE-Bench Verified ⊊ real-world SWE tasks
Analysis of how SWE-Bench Verified relates to real-world software engineering tasks, arguing that the benchmark covers only a proper subset of practical development challenges.