From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets

Since my last analysis of SWE-Bench Verified on April 15, there has been significant progress on the leaderboard. The best-performing SWE-agent now scores 73.20% (Tools + Claude 4 Opus), whereas only ~45 days ago the best system was Augment Agent v0 at 65.40% - a remarkable 7.8-percentage-point improvement in less than two months.

The Saturation Problem Revisited

In my previous analyses, I identified that Easy problems are effectively saturated, with top SWE-agents achieving 84-86% success rates. My single-file saturation study further revealed structural limitations in current evaluation approaches.

This saturation creates a differentiation problem: when most competitive SWE-agents solve 160+ of the same 194 Easy problems, these categories no longer provide meaningful signal for distinguishing between top-tier systems. The real competition has shifted to unsolved and sparsely solved problems across all difficulty categories.


A Data-Driven Solution: Discriminative Subsets

Rather than arbitrarily choosing “hard” problems, I developed a systematic methodology by analyzing how many SWE-agents solve each instance across all 500 problems in SWE-Bench Verified.

Instance Solve Distribution Analysis

Each of the 500 instances was checked against evaluation results from 83 distinct SWE-agents (submitted between October 2023 and May 2025) to record solve counts. “Solved” means the agent’s fix passed the verification test suite - the same standard used in the original SWE-bench evaluation.

I categorized all instances based on how many of the 83 evaluated SWE-agents successfully solve them:

Instance Solve Count Distribution

How many SWE-Bench Verified instances are solved by different numbers of agents (out of 83 evaluated)

Bucket | Solve Count | Instances | Percentage | Easy | Medium | Hard | Single | Multi
Unsolved | 0 agents | 52 | 10.4% | 5 | 26 | 21 | 27 | 25
Ultra Rare | 1-2 agents | 26 | 5.2% | 6 | 16 | 4 | 17 | 9
Very Rare | 3-5 agents | 17 | 3.4% | 3 | 10 | 4 | 14 | 3
Rare | 6-10 agents | 22 | 4.4% | 1 | 19 | 2 | 19 | 3
Uncommon | 11-20 agents | 38 | 7.6% | 13 | 22 | 3 | 28 | 10
Common | 21-40 agents | 96 | 19.2% | 27 | 62 | 7 | 82 | 14
Very Common | 41-60 agents | 93 | 18.6% | 38 | 52 | 3 | 88 | 5
Solved | 61+ agents | 156 | 31.2% | 101 | 53 | 2 | 154 | 2

Key insights:

  • 69% of problems are solved by 21+ agents, resulting in limited discrimination during evaluation
  • The competitive frontier exists in the first five categories (155 instances total), solved by ≤20 agents (High discrimination potential)
  • 52 completely unsolved problems (Maximum discrimination)
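To make the bucketing concrete, here is a minimal sketch of how the distribution above could be recomputed from per-agent results; the input format (a mapping from instance ID to solve count) and the names are hypothetical, not the exact pipeline I used.

from collections import Counter

# Hypothetical input: solve_counts maps each instance_id to the number of the
# 83 evaluated agents whose fix passed the verification tests (e.g. aggregated
# from per-agent resolved-instance lists published on the leaderboard).
BUCKETS = [
    ("Unsolved",    lambda n: n == 0),
    ("Ultra Rare",  lambda n: 1 <= n <= 2),
    ("Very Rare",   lambda n: 3 <= n <= 5),
    ("Rare",        lambda n: 6 <= n <= 10),
    ("Uncommon",    lambda n: 11 <= n <= 20),
    ("Common",      lambda n: 21 <= n <= 40),
    ("Very Common", lambda n: 41 <= n <= 60),
    ("Solved",      lambda n: n >= 61),
]

def bucket_for(count):
    # Map a solve count to its distribution bucket.
    for name, test in BUCKETS:
        if test(count):
            return name
    raise ValueError(f"unexpected solve count: {count}")

def solve_distribution(solve_counts):
    # Count how many of the 500 instances fall into each bucket.
    return Counter(bucket_for(n) for n in solve_counts.values())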


Four Discriminative Subsets

Rather than continuing to measure incremental improvements on largely solved Easy problems, I designed targeted subsets to focus on:

  1. Completely unsolved problems (52 instances) - true frontier challenges
  2. Sparsely solved problems - instances resolved by only a handful of agents
  3. Problems with high solution variance - where top SWE-agents show meaningful differences

This approach yields a more discriminative and high-resolution instrument for measuring real-world SWE-agent capability differences, similar to how other AI benchmarks have evolved when existing evaluations became saturated.

Based on this analysis, I created four targeted evaluation subsets. The “Solve Range” column shows how many agents successfully solve the problems within each subset - for example, Frontier subset problems are solved by 0-5 agents, making them the most evaluatively sensitive.

Subset | Description | Total | Easy | Medium | Hard | Single | Multi | Solve Range | Top Agent %
Frontier | Solved by ≤5 agents | 95 | 14 | 52 | 29 | 58 | 37 | 0–5 | 11.6%
Challenging | Solved by ≤20 agents | 155 | 28 | 93 | 34 | 105 | 50 | 0–20 | 31.6%
Hard | All Hard problems | 45 | 0 | 0 | 45 | 20 | 25 | 0–61 | 42.2%
MultiFile | Multi-file + ≤10 solves | 40 | 3 | 17 | 20 | 0 | 40 | 0–7 | 10.0%

Subset Relationships:

  • Frontier ⊆ Challenging (all Frontier problems are included in Challenging)
  • Hard and MultiFile subsets partially overlap with both Frontier and Challenging subsets
  • Single-file problems involve changes to one source file, while multi-file problems require coordinated modifications across multiple files.
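To make the four definitions concrete, here is a minimal sketch of the subset filters. The per-instance fields (solve_count, difficulty, files_edited) are hypothetical names standing in for the solve-count analysis above, the SWE-Bench Verified difficulty annotations, and the gold-patch file counts.

# Hypothetical per-instance record; field names are illustrative:
#   solve_count   - number of the 83 agents that resolved the instance
#   difficulty    - "easy" | "medium" | "hard" (annotator label)
#   files_edited  - number of files touched by the gold patch
def build_subsets(instances):
    # Partition SWE-Bench Verified instances into the four targeted subsets.
    frontier    = [i for i in instances if i["solve_count"] <= 5]
    challenging = [i for i in instances if i["solve_count"] <= 20]
    hard        = [i for i in instances if i["difficulty"] == "hard"]
    multifile   = [i for i in instances
                   if i["files_edited"] > 1 and i["solve_count"] <= 10]
    return {"frontier": frontier, "challenging": challenging,
            "hard": hard, "multifile": multifile}

Note that Frontier ⊆ Challenging falls directly out of the two solve-count thresholds.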

1. Frontier Subset (95 instances)

Problems solved by ≤5 agents - maximum evaluative sensitivity

This subset combines completely unsolved problems with ultra-rare and very-rare solves. It’s composed of 14 Easy, 52 Medium, and 29 Hard problems. Notably, even the top-performing Claude 4 Opus scores just 11.6% on this subset, compared to its 73.2% on the full benchmark. This provides extraordinary differentiation power between cutting-edge systems.

  • Composition: 58 single-file, 37 multi-file problems
  • Top Performance: Claude 4 Opus 11.6% (vs 73.2% on full benchmark)
  • Purpose: Maximum resolution for cutting-edge agent comparison

Top-10 performing SWE-agents on the Frontier subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 11/95 | 11.6%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 8/95 | 8.4%
3 | OpenHands + Claude 4 Sonnet | 7/95 | 7.4%
4 | Zencoder (2025-04-30) | 6/95 | 6.3%
5 | Nemotron-CORTEXA (2025-05-16) | 5/95 | 5.3%
6 | Learn-by-interact | 4/95 | 4.2%
7 | TRAE | 3/95 | 3.2%
8 | Refact.ai Agent | 3/95 | 3.2%
9 | SWE-agent + Claude 4 Sonnet | 3/95 | 3.2%
10 | Blackbox AI Agent | 3/95 | 3.2%

2. Challenging Subset (155 instances)

Problems solved by ≤20 agents - strong evaluative power

Expanding to include rare and uncommon problems, this subset maintains robust differentiating ability while providing more instances for statistical significance. Claude 4 Opus reaches 31.6% here - still providing much better separation than the 84%+ scores on Easy problems.

  • Composition: 28 Easy, 93 Medium, 34 Hard; 105 single-file, 50 multi-file
  • Top Performance: Claude 4 Opus 31.6%
  • Purpose: Balance of resolution and statistical significance

Top-10 performing SWE-agents on the Challenging subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 49/155 | 31.6%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 42/155 | 27.1%
3 | OpenHands + Claude 4 Sonnet | 36/155 | 23.2%
4 | Zencoder (2025-04-30) | 36/155 | 23.2%
5 | TRAE | 34/155 | 21.9%
6 | Nemotron-CORTEXA (2025-05-16) | 32/155 | 20.6%
7 | devlo (2025-05-19) | 30/155 | 19.4%
8 | Refact.ai Agent | 27/155 | 17.4%
9 | Blackbox AI Agent | 27/155 | 17.4%
10 | Learn-by-interact | 26/155 | 16.8%

3. Hard (45 instances)

All Hard difficulty problems regardless of solve rate

This subset focuses specifically on the 45 Hard problems, with Claude 4 Opus achieving 42.2%. Interestingly, this includes some problems solved by many agents, showing that difficulty level and solve count don’t perfectly correlate.

  • Composition: 0 Easy, 0 Medium, 45 Hard; 20 single-file, 25 multi-file
  • Top Performance: Claude 4 Opus 42.2%
  • Purpose: Focused evaluation on most difficult problem category

Top-10 performing SWE-agents on the Hard subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 19/45 | 42.2%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 15/45 | 33.3%
3 | OpenHands + Claude 4 Sonnet | 15/45 | 33.3%
4 | TRAE | 14/45 | 31.1%
5 | devlo (2025-05-19) | 13/45 | 28.9%
6 | OpenHands (2025-04-15) | 13/45 | 28.9%
7 | Zencoder (2025-04-30) | 12/45 | 26.7%
8 | Nemotron-CORTEXA (2025-05-16) | 12/45 | 26.7%
9 | SWE-agent + Claude 4 Sonnet | 12/45 | 26.7%
10 | OpenHands + 4x Scaled (2024-02-03) | 12/45 | 26.7%

4. MultiFile (40 instances)

Multi-file problems solved by ≤10 agents

This subset targets the intersection of multi-file problems (which tend to be harder) with low solve counts. These problems require coordinated edits across multiple source files, making them more complex for current SWE-agents. It’s the most challenging subset - the best performer here, Tools + Claude 4 Sonnet, achieves only 10.0%. The composition (3 Easy, 17 Medium, 20 Hard) skews heavily toward Hard, reinforcing that multi-file problems are inherently more difficult.

  • Composition: 3 Easy, 17 Medium, 20 Hard; 0 single-file, 40 multi-file
  • Top Performance: Claude 4 Sonnet 10.0%
  • Purpose: Target intersection of multi-file complexity and low solve rates

Top-10 performing SWE-agents on the MultiFile subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Sonnet (2025-05-22) | 4/40 | 10.0%
2 | Tools + Claude 4 Opus (2025-05-22) | 3/40 | 7.5%
3 | OpenHands + Claude 4 Sonnet | 3/40 | 7.5%
4 | SWE-agent + Claude 4 Sonnet | 3/40 | 7.5%
5 | TRAE | 2/40 | 5.0%
6 | Zencoder (2025-04-30) | 2/40 | 5.0%
7 | OpenHands (2025-04-15) | 2/40 | 5.0%
8 | Blackbox AI Agent | 2/40 | 5.0%
9 | Learn-by-interact | 2/40 | 5.0%
10 | Amazon Q Developer Agent (v20240719-dev) | 2/40 | 5.0%

Summary Comparison

Subset | Instances | Top Agent % | Focus
Frontier | 95 | 11.6% | Maximum sensitivity
Challenging | 155 | 31.6% | Broad + sensitive
Hard | 45 | 42.2% | Traditional difficulty
MultiFile | 40 | 10.0% | Real-world complexity

Key Insights and Patterns

1. Medium Problems Drive Frontier Resolution

The Frontier subset contains 52 Medium vs 29 Hard problems, revealing that traditional difficulty categories don’t capture all sources of complexity. Some Medium problems remain more challenging than many Hard problems. This suggests that ‘Medium’ in SWE-Bench often reflects problem scope or context rather than true agent difficulty.

2. Multi-file Problems Are Genuinely Harder

Multi-file problems dominate the low-solve buckets:

  • 40/40 MultiFile problems are multi-file (by design)
  • 37/95 Frontier problems are multi-file (39%)
  • Only 2/156 “Solved” problems are multi-file (1.3%)

3. Massive Performance Differentiation Gains

The performance gaps between the full benchmark and the targeted subsets are dramatic. While the top two SWE-agents (Tools + Claude 4 Opus and Tools + Claude 4 Sonnet) are separated by less than 1 percentage point on the entire 500-instance SWE-Bench Verified benchmark (73.2% vs 72.4%), the specialized subsets create substantial separation between these same systems - for example, 11.6% vs 8.4% on the Frontier subset. These clear performance gaps demonstrate the power of focusing evaluation on truly challenging problems.

While the specialized subsets have smaller sample sizes, the large performance differences (drops of roughly 30-65 percentage points from full-benchmark performance) clearly indicate meaningful capability distinctions.

Performance Comparison Across Subsets

Top SWE-agents performance on full benchmark vs. discriminative subsets

[Chart: per-agent scores on the full benchmark vs. the Frontier, Challenging, Hard, and MultiFile subsets]

Implications for the Research Community

Immediate Benefits

  1. Enhanced Signal: Researchers can immediately use these subsets for more sensitive evaluation
  2. Research Focus: Identifying specific problem types guides targeted improvement efforts
  3. Curriculum Design: Facilitates agent fine-tuning based on real-world bottlenecks
  4. Clearer Progress: Performance improvements on frontier problems represent genuine capability advances

Methodological Innovation

This solve-distribution approach could be applied to other saturated benchmarks:

  1. Analyze how many systems solve each problem
  2. Identify low-solve instances for targeted subsets
  3. Create sensitive evaluation instruments
  4. Maintain evaluative power as capabilities advance

Future Evolution

As SWE-agents continue improving, this methodology enables dynamic subset creation:

  • Current “Ultra Rare” problems may become “Common”
  • New targeted subsets can be generated using the same framework
  • Evaluation maintains sensitivity to genuine progress

Technical Implementation Notes

Data Availability

All subset definitions and solve matrices are available as structured JSON data, enabling:

  • Immediate subset evaluation using existing SWE-Bench infrastructure
  • Integration with current evaluation pipelines
  • Reproducible research and fair comparisons

Evaluation Guidelines

For consistent evaluation across the research community:

  1. Report both full benchmark and subset performance (a scoring sketch follows this list)
  2. Use Frontier subset for cutting-edge system comparison
  3. Use Challenging subset for statistical significance
  4. Use specialized subsets (Hard, MultiFile) for targeted research
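To support the first guideline, here is a small reporting helper - a sketch with hypothetical argument names, not part of any official tooling - that prints full-benchmark and per-subset resolve rates from a single set of resolved instance IDs:

def report_scores(resolved_ids, subset_ids, total_instances=500):
    # resolved_ids: set of instance IDs an agent resolved on the full benchmark
    # subset_ids:   dict mapping subset name -> list of that subset's instance IDs
    full = 100.0 * len(resolved_ids) / total_instances
    print(f"Full benchmark: {full:.1f}%")
    for name, ids in subset_ids.items():
        hits = sum(1 for iid in ids if iid in resolved_ids)
        print(f"{name:>11}: {hits}/{len(ids)} ({100.0 * hits / len(ids):.1f}%)")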

Try the Discriminative Subsets

The targeted subsets are available on HuggingFace for immediate use:

Dataset: jatinganhotra/SWE-bench_Verified-discriminative

Four splits: frontier, challenging, hard, multifile

from datasets import load_dataset

# Load all splits
dataset = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")

# Load specific split for targeted evaluation
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
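As a follow-up, the loaded splits can feed a reporting helper like the one sketched under Evaluation Guidelines. This assumes the splits preserve SWE-bench's standard instance_id column; resolved_ids is whatever your own evaluation harness produces.

# Collect each split's instance IDs (assumes the standard SWE-bench
# "instance_id" column is preserved in this derived dataset).
subset_ids = {name: dataset[name]["instance_id"]
              for name in ("frontier", "challenging", "hard", "multifile")}

# resolved_ids comes from your own evaluation run: the set of instance IDs
# whose patches passed the verification tests.
resolved_ids = set()  # placeholder
report_scores(resolved_ids, subset_ids)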

I encourage researchers to benchmark their SWE-agents on these discriminative subsets and share results publicly. Tracking progress here, not just on the full leaderboard, will better reflect true capability gains in autonomous software engineering than incremental improvements on saturated problem sets.


Conclusion

This analysis reveals that SWE-Bench Verified’s evaluative power has become concentrated in a subset of challenging problems. Systematically identifying these problems through solve-distribution analysis enables high-resolution benchmarks like the Frontier Subset, which reveal capability distinctions that broad benchmarks obscure.

The Challenging Subset balances sensitivity with statistical power, while the Hard and MultiFile subsets offer targeted evaluation for specific research directions.

Most importantly, this methodology - analyzing solve distribution to identify evaluatively sensitive problems - provides a data-driven framework for benchmark evolution. As AI capabilities continue advancing, this approach ensures evaluation maintains sensitivity to genuine progress rather than incremental improvements on saturated problem sets.

The targeted subsets are immediately usable with existing SWE-Bench infrastructure, enabling the research community to adopt more sensitive evaluation practices while pushing toward the true frontiers of automated software engineering.


Citation

If you find this analysis useful for your research, please cite it as:

@misc{ganhotra2025discriminative,
  title={From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets},
  author={Ganhotra, Jatin},
  year={2025},
  month={June},
  url={https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets/},
  note={Blog post}
}



