evaluation
An archive of posts with this tag.
| Date | Post |
|---|---|
| Jul 26, 2025 | **The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis** Analyzing how visual content dramatically impacts AI agents' performance on SWE tasks. |
| Jun 05, 2025 | **From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets** Uncovering the real performance of SWE-Agents by analyzing discriminative subsets of SWE-Bench Verified, showing how aggregate scores can mask significant performance variation across task types. |
| Apr 15, 2025 | **Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?** Analysis of the task difficulty distribution in SWE-Bench-Verified using human annotations, revealing the true complexity spectrum and what it means for evaluating AI coding performance. |
| Mar 30, 2025 | **The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges** Deep analysis of how SWE-Bench Verified's focus on single-file changes falls short of real-world programming work, which typically involves multi-file modifications and complex codebase interactions. |
| Jan 05, 2025 | **Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified** Exploring how SWE-Agents handle multi-file software engineering tasks compared to human developers, with a detailed analysis of patterns and performance on the SWE-Bench Verified benchmark. |
| Dec 26, 2024 | **SWE-Bench Verified ⊊ real-world SWE tasks** Analysis of how SWE-Bench Verified relates to real-world software engineering tasks, exploring the subset relationship between benchmark evaluation and practical development challenges. |