From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets
Since my last analysis of SWE-Bench Verified on April 15, there has been significant progress on the leaderboard. The best-performing SWE-agent now scores 73.20% (Tools + Claude 4 Opus), whereas only ~45 days ago the best system was Augment Agent v0 at 65.40% - a remarkable 7.8 percentage point improvement in less than two months.
The Saturation Problem Revisited
In my previous analyses, I identified that Easy problems are effectively saturated, with top SWE-agents achieving 84-86% success rates. My single-file saturation study further revealed structural limitations in current evaluation approaches.
This saturation creates a differentiation problem: when most competitive SWE-agents solve 160+ of the same 194 Easy problems, the category no longer provides a meaningful signal for distinguishing top-tier systems. The real competition has shifted to unsolved and sparsely solved problems across all difficulty categories.
A Data-Driven Solution: Discriminative Subsets
Rather than arbitrarily choosing “hard” problems, I developed a systematic methodology by analyzing how many SWE-agents solve each instance across all 500 problems in SWE-Bench Verified.
Instance Solve Distribution Analysis
Each of the 500 instances was checked against evaluation results from 83 distinct SWE-agents (submitted between October 2023 and May 2025) to record solve counts. “Solved” means the agent’s fix passed the verification test suite - the same standard used in the original SWE-bench evaluation.
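As a rough sketch, the counting step looks like this (the input format - a mapping from each agent's name to the set of instance IDs it resolved - is an illustrative assumption, not the leaderboard's actual schema):

```python
from collections import Counter

def solve_counts(resolved_by_agent: dict[str, set[str]],
                 all_instance_ids: list[str]) -> dict[str, int]:
    """For every benchmark instance, count how many agents resolved it.

    `resolved_by_agent` maps each agent's name to the set of instance IDs it
    solved (an illustrative structure); instances no agent solved get 0.
    """
    counts = Counter()
    for resolved in resolved_by_agent.values():
        counts.update(resolved)
    return {iid: counts[iid] for iid in all_instance_ids}
```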
I categorized all instances based on how many of the 83 evaluated SWE-agents successfully solve them:
Instance Solve Count Distribution
How many SWE-Bench Verified instances are solved by different numbers of agents (out of 83 evaluated)
Bucket | Solve Count | Instances | Percentage | Easy | Medium | Hard | Single | Multi |
---|---|---|---|---|---|---|---|---|
Unsolved | 0 agents | 52 | 10.4% | 5 | 26 | 21 | 27 | 25 |
Ultra Rare | 1-2 agents | 26 | 5.2% | 6 | 16 | 4 | 17 | 9 |
Very Rare | 3-5 agents | 17 | 3.4% | 3 | 10 | 4 | 14 | 3 |
Rare | 6-10 agents | 22 | 4.4% | 1 | 19 | 2 | 19 | 3 |
Uncommon | 11-20 agents | 38 | 7.6% | 13 | 22 | 3 | 28 | 10 |
Common | 21-40 agents | 96 | 19.2% | 27 | 62 | 7 | 82 | 14 |
Very Common | 41-60 agents | 93 | 18.6% | 38 | 52 | 3 | 88 | 5 |
Solved | 61+ agents | 156 | 31.2% | 101 | 53 | 2 | 154 | 2 |
Key insights:
- 69% of problems are solved by 21+ agents, providing limited discrimination during evaluation
- The competitive frontier lies in the first five buckets (155 instances total), each solved by ≤20 agents (high discrimination potential)
- 52 problems remain completely unsolved (maximum discrimination)
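For reference, the bucket edges above can be expressed as a small helper (an illustrative sketch; the thresholds mirror the Solve Count column):

```python
BUCKET_EDGES = [
    (0, "Unsolved"),
    (2, "Ultra Rare"),
    (5, "Very Rare"),
    (10, "Rare"),
    (20, "Uncommon"),
    (40, "Common"),
    (60, "Very Common"),
]

def bucket(solve_count: int) -> str:
    """Map a per-instance solve count (0-83) to the buckets in the table above."""
    for upper, name in BUCKET_EDGES:
        if solve_count <= upper:
            return name
    return "Solved"  # 61+ agents

# e.g. Counter(bucket(c) for c in counts.values()) over the 500 solve counts
```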
Four Discriminative Subsets
Rather than continuing to measure incremental improvements on largely solved Easy problems, I designed targeted subsets to focus on:
- Completely unsolved problems (52 instances) - true frontier challenges
- Sparsely solved problems - instances resolved by only a handful of agents
- Problems with high solution variance - where top SWE-agents show meaningful differences
This approach yields a more discriminative and high-resolution instrument for measuring real-world SWE-agent capability differences, similar to how other AI benchmarks have evolved when existing evaluations became saturated.
Based on this analysis, I created four targeted evaluation subsets. The “Solve Range” column shows how many agents successfully solve the problems within each subset - for example, Frontier subset problems are solved by 0-5 agents, making them the most evaluatively sensitive.
Subset | Description | Total | Easy | Medium | Hard | Single | Multi | Solve Range | Top Agent % |
---|---|---|---|---|---|---|---|---|---|
Frontier | Solved by ≤5 agents | 95 | 14 | 52 | 29 | 58 | 37 | 0–5 | 11.6% |
Challenging | Solved by ≤20 agents | 155 | 28 | 93 | 34 | 105 | 50 | 0–20 | 31.6% |
Hard | All Hard problems | 45 | 0 | 0 | 45 | 20 | 25 | 0–61 | 42.2% |
MultiFile | Multi-file + ≤10 solves | 40 | 3 | 17 | 20 | 0 | 40 | 0–7 | 10.0% |
Subset Relationships:
- Frontier ⊆ Challenging (all Frontier problems are included in Challenging)
- Hard and MultiFile subsets partially overlap with both Frontier and Challenging subsets
- Single-file problems involve changes to one source file, while multi-file problems require coordinated modifications across multiple files.
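In code, the selection criteria behind these four subsets look roughly like the following sketch (the per-instance fields `solve_count`, `difficulty`, and `num_files` are illustrative names, not official dataset columns):

```python
def build_subsets(instances: list[dict]) -> dict[str, list[dict]]:
    """Select the four discriminative subsets from annotated instance records.

    Each record is assumed to carry `solve_count` (0-83), `difficulty`
    ("Easy"/"Medium"/"Hard"), and `num_files` (files touched by the gold
    patch); these field names are placeholders, not official dataset columns.
    """
    return {
        "frontier":    [i for i in instances if i["solve_count"] <= 5],
        "challenging": [i for i in instances if i["solve_count"] <= 20],
        "hard":        [i for i in instances if i["difficulty"] == "Hard"],
        "multifile":   [i for i in instances
                        if i["num_files"] > 1 and i["solve_count"] <= 10],
    }
```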
1. Frontier Subset (95 instances)
Problems solved by ≤5 agents - maximum evaluative sensitivity
This subset combines completely unsolved problems with ultra-rare and very-rare solves. It’s composed of 14 Easy, 52 Medium, and 29 Hard problems. Notably, even the top-performing Claude 4 Opus scores just 11.6% on this subset, compared to its 73.2% on the full benchmark. This provides extraordinary differentiation power between cutting-edge systems.
- Composition: 58 single-file, 37 multi-file problems
- Top Performance: Claude 4 Opus 11.6% (vs 73.2% on full benchmark)
- Purpose: Maximum resolution for cutting-edge agent comparison
Top-10 performing SWE-agents on the Frontier subset:
Rank | SWE-Agent | Resolved | Percentage |
---|---|---|---|
1 | Tools + Claude 4 Opus (2025-05-22) | 11/95 | 11.6% |
2 | Tools + Claude 4 Sonnet (2025-05-22) | 8/95 | 8.4% |
3 | OpenHands + Claude 4 Sonnet | 7/95 | 7.4% |
4 | Zencoder (2025-04-30) | 6/95 | 6.3% |
5 | Nemotron-CORTEXA (2025-05-16) | 5/95 | 5.3% |
6 | Learn-by-interact | 4/95 | 4.2% |
7 | TRAE | 3/95 | 3.2% |
8 | Refact.ai Agent | 3/95 | 3.2% |
9 | SWE-agent + Claude 4 Sonnet | 3/95 | 3.2% |
10 | Blackbox AI Agent | 3/95 | 3.2% |
2. Challenging Subset (155 instances)
Problems solved by ≤20 agents - strong evaluative power
This subset expands the Frontier set with rare and uncommon problems, maintaining robust differentiating ability while providing more instances for statistical significance. Claude 4 Opus reaches 31.6% here - still far better separation than the 84%+ scores on Easy problems.
- Composition: 28 Easy, 93 Medium, 34 Hard; 105 single-file, 50 multi-file
- Top Performance: Claude 4 Opus 31.6%
- Purpose: Balance of resolution and statistical significance
Top-10 performing SWE-agents on the Challenging subset:
Rank | SWE-Agent | Resolved | Percentage |
---|---|---|---|
1 | Tools + Claude 4 Opus (2025-05-22) | 49/155 | 31.6% |
2 | Tools + Claude 4 Sonnet (2025-05-22) | 42/155 | 27.1% |
3 | OpenHands + Claude 4 Sonnet | 36/155 | 23.2% |
4 | Zencoder (2025-04-30) | 36/155 | 23.2% |
5 | TRAE | 34/155 | 21.9% |
6 | Nemotron-CORTEXA (2025-05-16) | 32/155 | 20.6% |
7 | devlo (2025-05-19) | 30/155 | 19.4% |
8 | Refact.ai Agent | 27/155 | 17.4% |
9 | Blackbox AI Agent | 27/155 | 17.4% |
10 | Learn-by-interact | 26/155 | 16.8% |
3. Hard Subset (45 instances)
All Hard difficulty problems regardless of solve rate
This subset focuses specifically on the 45 Hard problems, with Claude 4 Opus achieving 42.2%. Interestingly, this includes some problems solved by many agents, showing that difficulty level and solve count don’t perfectly correlate.
- Composition: 0 Easy, 0 Medium, 45 Hard; 20 single-file, 25 multi-file
- Top Performance: Claude 4 Opus 42.2%
- Purpose: Focused evaluation on most difficult problem category
Top-10 performing SWE-agents on the Hard subset:
Rank | SWE-Agent | Resolved | Percentage |
---|---|---|---|
1 | Tools + Claude 4 Opus (2025-05-22) | 19/45 | 42.2% |
2 | Tools + Claude 4 Sonnet (2025-05-22) | 15/45 | 33.3% |
3 | OpenHands + Claude 4 Sonnet | 15/45 | 33.3% |
4 | TRAE | 14/45 | 31.1% |
5 | devlo (2025-05-19) | 13/45 | 28.9% |
6 | OpenHands (2025-04-15) | 13/45 | 28.9% |
7 | Zencoder (2025-04-30) | 12/45 | 26.7% |
8 | Nemotron-CORTEXA (2025-05-16) | 12/45 | 26.7% |
9 | SWE-agent + Claude 4 Sonnet | 12/45 | 26.7% |
10 | OpenHands + 4x Scaled (2024-02-03) | 12/45 | 26.7% |
4. MultiFile Subset (40 instances)
Multi-file problems solved by ≤10 agents
This subset targets the intersection of multi-file problems (which tend to be harder) with low solve counts. These problems require coordinated edits across multiple source files, making them more complex for current SWE-agents. It is the most challenging subset - even the best performer, Tools + Claude 4 Sonnet, achieves only 10.0%. The composition (3 Easy, 17 Medium, 20 Hard) further suggests that multi-file problems skew toward higher difficulty.
- Composition: 3 Easy, 17 Medium, 20 Hard; 0 single-file, 40 multi-file
- Top Performance: Claude 4 Sonnet 10.0%
- Purpose: Target intersection of multi-file complexity and low solve rates
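The single-file vs. multi-file split can be derived from the gold patch that ships with each instance. A minimal sketch, assuming the standard SWE-bench `patch` field containing a unified diff:

```python
import re

def files_touched(patch: str) -> set[str]:
    """File paths modified by a unified diff (e.g. the SWE-bench gold `patch` field)."""
    return set(re.findall(r"^diff --git a/(\S+) b/\S+", patch, flags=re.MULTILINE))

def is_multi_file(patch: str) -> bool:
    # Depending on the exact definition, test-only files may need to be excluded.
    return len(files_touched(patch)) > 1
```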
Top-10 performing SWE-agents on the MultiFile subset:
Rank | SWE-Agent | Resolved | Percentage |
---|---|---|---|
1 | Tools + Claude 4 Sonnet (2025-05-22) | 4/40 | 10.0% |
2 | Tools + Claude 4 Opus (2025-05-22) | 3/40 | 7.5% |
3 | OpenHands + Claude 4 Sonnet | 3/40 | 7.5% |
4 | SWE-agent + Claude 4 Sonnet | 3/40 | 7.5% |
5 | TRAE | 2/40 | 5.0% |
6 | Zencoder (2025-04-30) | 2/40 | 5.0% |
7 | OpenHands (2025-04-15) | 2/40 | 5.0% |
8 | Blackbox AI Agent | 2/40 | 5.0% |
9 | Learn-by-interact | 2/40 | 5.0% |
10 | Amazon Q Developer Agent (v20240719-dev) | 2/40 | 5.0% |
Summary Comparison
Subset | Instances | Top Agent % | Focus |
---|---|---|---|
Frontier | 95 | 11.6% | Maximum sensitivity |
Challenging | 155 | 31.6% | Broad + sensitive |
Hard | 45 | 42.2% | Traditional difficulty |
MultiFile | 40 | 10.0% | Real-world complexity |
Key Insights and Patterns
1. Medium Problems Dominate the Frontier
The Frontier subset contains 52 Medium vs 29 Hard problems, revealing that traditional difficulty categories don’t capture all sources of complexity. Some Medium problems remain more challenging than many Hard problems. This suggests that ‘Medium’ in SWE-Bench often reflects problem scope or context rather than true agent difficulty.
2. Multi-file Problems Are Genuinely Harder
Multi-file problems dominate the low-solve buckets:
- 40/40 MultiFile problems are multi-file (by design)
- 37/95 Frontier problems are multi-file (39%)
- Only 2/156 “Solved” problems are multi-file (1.3%)
3. Massive Performance Differentiation Gains
The performance gaps between the full benchmark and the targeted subsets are dramatic. While the top two SWE-agents (Tools + Claude 4 Opus and Tools + Claude 4 Sonnet) are separated by less than 1 percentage point on the full 500-instance SWE-Bench Verified benchmark (73.2% vs 72.4%), the specialized subsets create substantial separation between the same systems - on the Frontier subset the gap widens to 3.2 points (11.6% vs 8.4%), a much larger relative difference. These clear performance gaps demonstrate the value of focusing evaluation on truly challenging problems.
While the specialized subsets have smaller sample sizes, the large performance differences (drops of roughly 30-65 percentage points from full-benchmark performance) clearly indicate meaningful capability distinctions.
Performance Comparison Across Subsets
Top SWE-agents performance on full benchmark vs. discriminative subsets
Implications for the Research Community
Immediate Benefits
- Enhanced Signal: Researchers can immediately use these subsets for more sensitive evaluation
- Research Focus: Identifying specific problem types guides targeted improvement efforts
- Curriculum Design: Subset definitions can guide agent fine-tuning curricula targeting real-world bottlenecks
- Clearer Progress: Performance improvements on frontier problems represent genuine capability advances
Methodological Innovation
This solve-distribution approach could be applied to other saturated benchmarks:
- Analyze how many systems solve each problem
- Identify low-solve instances for targeted subsets
- Create sensitive evaluation instruments
- Maintain evaluative power as capabilities advance
Future Evolution
As SWE-agents continue improving, this methodology enables dynamic subset creation:
- Current “Ultra Rare” problems may become “Common”
- New targeted subsets can be generated using the same framework
- Evaluation maintains sensitivity to genuine progress
Technical Implementation Notes
Data Availability
All subset definitions and solve matrices are available as structured JSON data, enabling:
- Immediate subset evaluation using existing SWE-Bench infrastructure
- Integration with current evaluation pipelines
- Reproducible research and fair comparisons
Evaluation Guidelines
For consistent evaluation across the research community:
- Report both full benchmark and subset performance
- Use Frontier subset for cutting-edge system comparison
- Use Challenging subset for statistical significance
- Use specialized subsets (Hard, MultiFile) for targeted research
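A small helper along these lines can produce both numbers from an agent's set of resolved instance IDs (the subset-to-IDs mapping is whatever structure you load the splits into; illustrative only):

```python
def subset_scores(resolved: set[str],
                  full_benchmark: list[str],
                  subsets: dict[str, list[str]]) -> dict[str, float]:
    """Resolution rate on the full benchmark and on each discriminative subset.

    `resolved` holds the instance IDs an agent fixed; `subsets` maps each
    subset name to its instance IDs (illustrative structure).
    """
    scores = {"full": len(resolved & set(full_benchmark)) / len(full_benchmark)}
    for name, ids in subsets.items():
        scores[name] = len(resolved & set(ids)) / len(ids)
    return scores
```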
Try the Discriminative Subsets
The targeted subsets are available on HuggingFace for immediate use:
- Dataset: jatinganhotra/SWE-bench_Verified-discriminative
- Four splits: frontier, challenging, hard, multifile
```python
from datasets import load_dataset

# Load all splits
dataset = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")

# Load a specific split for targeted evaluation
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
```
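As a quick sanity check, the split sizes and instances can be inspected directly (assuming the splits preserve standard SWE-bench fields such as `instance_id`):

```python
# Expected split sizes: 95, 155, 45, 40
for name in ["frontier", "challenging", "hard", "multifile"]:
    split = dataset[name]
    print(f"{name}: {len(split)} instances")
    print(split[0]["instance_id"])  # standard SWE-bench field, assumed preserved
```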
I encourage researchers to benchmark their SWE-agents on these discriminative subsets and share results publicly. Tracking progress here, rather than only on the full benchmark, will better reflect true capability gains in autonomous software engineering than incremental improvements on saturated problem sets.
Conclusion
This analysis reveals that SWE-Bench Verified’s evaluative power has become concentrated in a subset of challenging problems. Systematically identifying these problems through solve-distribution analysis enables high-resolution benchmarks like the Frontier Subset, which reveal capability distinctions that broad benchmarks obscure.
The Challenging Subset balances sensitivity with statistical power, while the Hard and MultiFile subsets offer targeted evaluation for specific research directions.
Most importantly, this methodology - analyzing solve distribution to identify evaluatively sensitive problems - provides a data-driven framework for benchmark evolution. As AI capabilities continue advancing, this approach ensures evaluation maintains sensitivity to genuine progress rather than incremental improvements on saturated problem sets.
The targeted subsets are immediately usable with existing SWE-Bench infrastructure, enabling the research community to adopt more sensitive evaluation practices while pushing toward the true frontiers of automated software engineering.
Citation
If you find this analysis useful for your research, please cite it as:
@misc{ganhotra2025discriminative,
title={From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets},
author={Ganhotra, Jatin},
year={2025},
month={June},
url={https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets/},
note={Blog post}
}