Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
When OpenAI released SWE-Bench-Verified, they included human annotations estimating how long each coding task would take an experienced software engineer to solve. This gives us a unique lens through which to analyze AI coding performance. As I’ve discussed in previous analyses, understanding the true complexity distribution of these tasks is critical to properly interpreting benchmark results.
How Human Experts Judge Task Difficulty
OpenAI asked human annotators to estimate completion times for each task, assuming the engineer had “a few hours to familiarize themselves with the codebase.” They used four time-based categories:
- < 15 minutes: Trivial changes like adding assertions to a function
- 15 minutes – 1 hour: Small changes requiring some thought
- 1 – 4 hours: Substantial rewrites affecting functions or multiple files
- > 4 hours: Esoteric issues requiring significant research and changing 100+ lines of code
While these estimates aren’t used for dataset filtering, they provide valuable insight into the perceived difficulty distribution.
Breaking Down SWE-Bench-Verified by Difficulty
When we analyze the 500 issues in SWE-Bench-Verified using these time-based difficulty metrics, the distribution is revealing:
| Difficulty Category | Count | Percentage |
|---|---|---|
| < 15 minutes | 194 | 38.80% |
| 15 minutes – 1 hour | 261 | 52.20% |
| 1 – 4 hours | 42 | 8.40% |
| > 4 hours | 3 | 0.60% |
Key Insight: The vast majority (91%) of issues are estimated to take less than an hour for a human expert to solve, with only a tiny fraction (0.60%) requiring more than 4 hours.
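Reproducing this distribution is a simple tally over OpenAI's released annotations. Here is a minimal sketch in Python, assuming a CSV export with a `difficulty` column holding the four time buckets; the file name and column name are my assumptions, not the official schema:

```python
import collections
import csv

# Sketch: tally the annotated difficulty buckets for SWE-Bench-Verified.
# Assumes a CSV export of the annotations with a "difficulty" column; the
# file name and column name are placeholders to adapt to the actual release.
counts = collections.Counter()
with open("swebench_verified_annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["difficulty"]] += 1

total = sum(counts.values())
for bucket, n in counts.most_common():
    print(f"{bucket:>20}  {n:4d}  ({100 * n / total:.2f}%)")
```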
Standardizing Difficulty Terminology
More recently, the “Multi-SWE-bench” paper (Zhu et al., 2024) simplified these four categories into three difficulty levels:
- Easy: < 15 minutes (194 issues, 38.80%)
- Medium: 15 minutes – 1 hour (261 issues, 52.20%)
- Hard: > 1 hour (45 issues, 9.00%)
This classification provides a clearer framework for evaluating model performance across difficulty levels.
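In code, the collapse is nothing more than a lookup from the four annotated buckets to the three labels. A sketch follows; the bucket strings are written the way I assume the annotations spell them, so check them against the actual data:

```python
# Sketch: map the four annotated time buckets onto the three
# Multi-SWE-bench style difficulty levels. The bucket strings below are
# assumptions about the annotation spelling, not a confirmed schema.
BUCKET_TO_LEVEL = {
    "<15 min fix": "Easy",
    "15 min - 1 hour": "Medium",
    "1-4 hours": "Hard",
    ">4 hours": "Hard",
}

def difficulty_level(bucket: str) -> str:
    """Return Easy/Medium/Hard for an annotated time bucket."""
    return BUCKET_TO_LEVEL[bucket]
```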
Quantifying Difficulty: Beyond Time Estimates
To gain deeper insights into what makes tasks difficult, we can examine objective metrics across difficulty levels:
| Difficulty | Count | Avg. #Files | Avg. #Hunks | Avg. #Lines |
|---|---|---|---|---|
| Easy | 194 | 1.03 | 1.37 | 5.04 |
| Medium | 261 | 1.28 | 2.48 | 14.10 |
| Hard | 45 | 2.00 | 6.82 | 55.78 |
| Overall | 500 | 1.25 | 2.44 | 14.33 |
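All three metrics can be read straight off each instance's gold patch. Here is a minimal sketch of how one might compute them from a unified diff string; counting changed lines as added plus removed lines is my reading of the metric, not a confirmed definition:

```python
def patch_stats(patch: str) -> dict:
    """Count files, hunks, and changed lines in a unified (git) diff.

    Sketch only: "lines" means added plus removed lines here, which is one
    reasonable reading of the metric; the original analysis may differ.
    """
    files = hunks = lines = 0
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            files += 1   # one header per modified file
        elif line.startswith("@@"):
            hunks += 1   # one hunk header per contiguous change block
        elif line.startswith("+") and not line.startswith("+++"):
            lines += 1   # added line
        elif line.startswith("-") and not line.startswith("---"):
            lines += 1   # removed line
    return {"files": files, "hunks": hunks, "lines": lines}
```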
Key Observations
- Scaling Relationship: All metrics (files, hunks, lines) increase with difficulty level, but at different rates.
- Lines Changed: Shows the most dramatic growth from Easy to Hard (an 11x increase), highlighting that hard patches involve significantly more code changes.
- Files Modified: Shows a more modest increase (2x from Easy to Hard), suggesting that difficulty isn’t just about the number of files.
- Hunks: Increases 5x from Easy to Hard, indicating more separate code blocks need modification in harder tasks.
The Complexity of Single vs. Multi-File Issues
When we combine difficulty levels with file count, we see striking patterns:
| Difficulty | Total Issues | Single-file | Multi-file | Best Single System % | Combined % |
|---|---|---|---|---|---|
| Easy | 194 | 188 (96.91%) | 6 (3.09%) | 81.44% | 95.36% |
| Medium | 261 | 221 (84.67%) | 40 (15.33%) | 62.07% | 84.29% |
| Hard | 45 | 20 (44.44%) | 25 (55.56%) | 26.67% | 42.22% |
| Total | 500 | 429 (85.80%) | 71 (14.20%) | 65.40% | 84.80% |
Critical Observation: As difficulty increases, the proportion of multi-file issues rises dramatically, from just 3.09% of easy issues to 55.56% of hard issues. This suggests that multi-file complexity is a significant factor in what makes these tasks difficult.
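The breakdown above is a straightforward cross-tabulation once each instance carries a difficulty level and a file count. A sketch with pandas, assuming a per-instance DataFrame with `level` and `files` columns (for example, built from `patch_stats` above; both column names are my own):

```python
import pandas as pd

# Sketch: cross-tabulate difficulty level against single- vs. multi-file
# patches. Assumes a per-instance DataFrame with "level" (Easy/Medium/Hard)
# and "files" columns; both names are assumptions about how the frame is built.
def single_multi_breakdown(df: pd.DataFrame) -> pd.DataFrame:
    scope = df["files"].map(lambda n: "Single" if n == 1 else "Multi")
    counts = pd.crosstab(df["level"], scope, margins=True, margins_name="Total")
    # Row-wise percentages, e.g. the share of Hard issues that are multi-file.
    percents = counts.div(counts["Total"], axis=0).mul(100).round(2)
    return counts.join(percents, rsuffix=" %")
```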
Looking deeper at the metrics for single vs. multi-file tasks:
| File Count | Count | Avg. #Files | Avg. #Hunks | Avg. #Lines |
|---|---|---|---|---|
| Single | 429 | 1.00 | 1.78 | 10.05 |
| Multi | 71 | 2.73 | 6.42 | 40.23 |
| Overall | 500 | 1.25 | 2.44 | 14.33 |
This table reveals that multi-file tasks require, on average:
- Nearly 4x as many code hunks (separate code blocks)
- 4x as many lines of code changed
- More complex edits across multiple files
These metrics substantiate what the performance data shows: multi-file tasks represent a significant complexity jump.
Performance Across the Spectrum
| Model | Overall (% Resolved) | Easy (194: 188 single, 6 multi) | Medium (261: 221 single, 40 multi) | Hard (45: 20 single, 25 multi) |
|---|---|---|---|---|
| Combined Systems | 84.8% (424/500) | 95.36% (185/194): 180/188 single, 5/6 multi | 84.29% (220/261): 190/221 single, 30/40 multi | 42.22% (19/45): 10/20 single, 9/25 multi |
| Top Performers | | | | |
| Augment Agent v0 | 65.4% (327/500) | 80.4% (156/194): 155/188 single, 1/6 multi | 62.1% (162/261): 145/221 single, 17/40 multi | 20.0% (9/45): 7/20 single, 2/25 multi |
| W&B Programmer O1 crosscheck5 | 64.6% (323/500) | 77.3% (150/194): 149/188 single, 1/6 multi | 62.1% (162/261): 144/221 single, 18/40 multi | 24.4% (11/45): 9/20 single, 2/25 multi |
| AgentScope | 63.4% (317/500) | 81.4% (158/194): 157/188 single, 1/6 multi | 56.7% (148/261): 134/221 single, 14/40 multi | 24.4% (11/45): 9/20 single, 2/25 multi |
| Mid-Range Systems | | | | |
| Emergent E1 (v2024-12-23) | 57.2% (286/500) | 74.7% (145/194): 144/188 single, 1/6 multi | 50.6% (132/261): 116/221 single, 16/40 multi | 20.0% (9/45): 7/20 single, 2/25 multi |
| Amazon Q Developer Agent (v20241202-dev) | 55.0% (275/500) | 72.2% (140/194): 139/188 single, 1/6 multi | 49.4% (129/261): 115/221 single, 14/40 multi | 13.3% (6/45): 6/20 single, 0/25 multi |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 50.8% (254/500) | 70.6% (137/194): 137/188 single, 0/6 multi | 42.5% (111/261): 101/221 single, 10/40 multi | 13.3% (6/45): 3/20 single, 3/25 multi |
| Earlier Systems | | | | |
| SWE-agent + Claude 3.5 Sonnet | 33.6% (168/500) | 47.9% (93/194): 92/188 single, 1/6 multi | 28.0% (73/261): 67/221 single, 6/40 multi | 4.4% (2/45): 1/20 single, 1/25 multi |
| SWE-agent + GPT 4o (2024-05-13) | 23.2% (116/500) | 36.6% (71/194): 71/188 single, 0/6 multi | 16.9% (44/261): 39/221 single, 5/40 multi | 2.2% (1/45): 1/20 single, 0/25 multi |
| SWE-agent + Claude 3 Opus | 18.2% (91/500) | 27.3% (53/194): 53/188 single, 0/6 multi | 10.0% (26/261): 26/221 single, 0/40 multi | 0.0% (0/45): 0/20 single, 0/25 multi |
Examining representative systems across the performance spectrum reveals consistent patterns:
- The Easy Category Is Largely Solved
  - Combined resolution rate: 95.36%
  - Even top individual systems solve ~80% of easy tasks
  - The remaining gap is closing with each new LLM release
- The Medium Category Shows Progress
  - Combined resolution rate: 84.29%
  - Top systems solve 56-62% individually
  - Significant improvement from earlier systems (<30%)
- The Hard Category Remains Challenging
  - Combined resolution rate: only 42.22%
  - Best individual systems solve just 20-25%
  - Multi-file hard issues are particularly difficult (only 9/25 solved by any system)
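The Combined Systems numbers are best read as a pooled upper bound: an instance counts as solved if any system resolves it. A sketch of how that union could be computed, assuming one JSON file of resolved instance IDs per system and a prebuilt instance-to-level mapping (both are assumptions about file layout, not the leaderboard's actual format):

```python
import json
from pathlib import Path

# Sketch: pool resolved instance IDs across systems to get the "Combined"
# resolution rate per difficulty level. Assumes one JSON list of resolved
# instance IDs per system and a dict mapping instance_id -> level.
def combined_resolution(results_dir: str, level_by_id: dict[str, str]) -> dict[str, float]:
    resolved: set[str] = set()
    for path in Path(results_dir).glob("*.json"):
        resolved |= set(json.loads(path.read_text()))

    rates = {}
    for level in ("Easy", "Medium", "Hard"):
        ids = [i for i, lvl in level_by_id.items() if lvl == level]
        rates[level] = 100 * sum(i in resolved for i in ids) / len(ids)
    return rates
```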
Key Takeaways
This analysis reveals several important insights:
- The Easy-Hard Gap Persists: Even top systems show a dramatic performance drop from easy (80%+) to hard tasks (20-25%).
- Multi-File Issues Present a Frontier: The correlation between task difficulty and multi-file complexity is striking. As complexity increases along all metrics (files, hunks, lines), performance drops precipitously.
- Lines of Code as a Key Indicator: The 11x increase in average lines changed from Easy to Hard tasks (5.04 → 55.78) appears to be the strongest predictor of task difficulty, far outpacing the increase in file count (2x) or hunks (5x).
- Combined Performance Ceiling: The gap between individual and combined system performance suggests that different approaches excel at different tasks — no single system can yet solve all problem types.
- Hard Multi-File Issues Remain Unsolved: With only 9/25 hard multi-file issues solved by any system, this represents a clear frontier for improvement.
- Scale of Changes Matters: Multi-file tasks require 4x more lines of code and 4x more hunks than single-file tasks, highlighting that the scope of required changes significantly impacts task difficulty.
Relating to the Reality Gap
As I argued in my December 2024 post, the distribution in SWE-Bench-Verified significantly underrepresents the complexity of real-world programming tasks. Comparing datasets reveals this stark contrast:
| Dataset | % issues >1 file |
|---|---|
| SWE-Bench train set | 50.27% |
| SWE-Bench test set | 24.89% |
| SWE-Bench-Verified test set | 14.20% |
This discrepancy is significant and highlights a critical area for improvement in benchmark design. If we use file count as a complexity proxy, SWE-Bench-Verified presents a dramatically simplified view compared to real-world codebases, where approximately half of all issues require multi-file changes.
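These multi-file shares are easy to re-derive from the public datasets. Below is a sketch using the Hugging Face `datasets` library together with the `patch_stats` helper sketched earlier; the dataset identifiers are the public ones, but treat the `patch` field name as an assumption worth double-checking:

```python
from datasets import load_dataset

# Sketch: estimate the share of issues whose gold patch touches more than one
# file, per SWE-bench split. Relies on patch_stats() defined earlier; the
# "patch" field name is an assumption to verify against the dataset schema.
SPLITS = {
    "SWE-Bench train set": ("princeton-nlp/SWE-bench", "train"),
    "SWE-Bench test set": ("princeton-nlp/SWE-bench", "test"),
    "SWE-Bench-Verified test set": ("princeton-nlp/SWE-bench_Verified", "test"),
}

for name, (repo, split) in SPLITS.items():
    ds = load_dataset(repo, split=split)
    multi = sum(1 for ex in ds if patch_stats(ex["patch"])["files"] > 1)
    print(f"{name}: {100 * multi / len(ds):.2f}% issues >1 file")
```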
The data clearly shows that as we move from simple, localized fixes (Easy) to complex, multi-file, multi-hunk patches (Hard), AI performance drops dramatically. Future research should focus on improving coordination across multiple files and handling larger, more complex code changes that span multiple distinct code blocks.
This finding reinforces my previous analysis in The Multi-File Frontier, where I emphasized that truly robust AI programming systems must be capable of coordinating changes across multiple files in an interconnected codebase.
Conclusion
While impressive progress has been made in solving isolated, single-file coding challenges, the frontier of multi-file, complex issues remains largely unexplored. For AI programming systems to truly match human capabilities in real-world software engineering, they must evolve beyond generating localized patches to understand the rich, interconnected nature of modern codebases.
Until benchmarks like SWE-Bench-Verified more accurately reflect the distribution of tasks in real-world development—particularly the proportion of multi-file changes—we should interpret leaderboard results with appropriate caution, recognizing they represent an optimistic view of AI’s current programming capabilities.