evaluation
An archive of posts with this tag.
| Date | Post |
|---|---|
| Jul 26, 2025 | **The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis** Analyzing how visual content dramatically impacts AI agents' performance on SWE tasks. |
| Jun 05, 2025 | **From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets** Uncovering the real performance of SWE-Agents by analyzing discriminative subsets of SWE-Bench Verified, showing how aggregate scores can mask significant performance variation across task types. |
| Apr 15, 2025 | **Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?** Analysis of the task difficulty distribution in SWE-Bench-Verified using human annotations, revealing the true complexity spectrum and what it means for evaluating AI coding performance. |
| Mar 30, 2025 | **The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges** Deep analysis of how SWE-Bench Verified's focus on single-file changes falls short of real-world programming work, which typically involves multi-file modifications and complex codebase interactions. |
| Jan 05, 2025 | **Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified** Exploring how SWE-Agents handle multi-file software engineering tasks compared to human developers, with a detailed analysis of patterns and performance on the SWE-Bench Verified benchmark. |
| Dec 26, 2024 | **SWE-Bench Verified ⊊ real-world SWE tasks** Analysis of how SWE-Bench Verified relates to real-world software engineering tasks, exploring the subset relationship between benchmark evaluation and practical development challenges. |