From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets

Since my last analysis of SWE-Bench Verified on April 15, there has been significant progress on the leaderboard. The best-performing SWE-agent now scores 73.20% (Tools + Claude 4 Opus), whereas only ~45 days ago the best system was Augment Agent v0 at 65.40% - a remarkable 7.8-percentage-point improvement in less than two months.

The Saturation Problem Revisited

In my previous analyses, I identified that Easy problems are effectively saturated, with top SWE-agents achieving 84-86% success rates. My single-file saturation study further revealed structural limitations in current evaluation approaches.

This saturation creates a differentiation problem: when most competitive SWE-agents solve 160+ of the same 194 Easy problems, these categories no longer provide meaningful signal for distinguishing between top-tier systems. The real competition has shifted to unsolved and sparsely solved problems across all difficulty categories.


A Data-Driven Solution: Discriminative Subsets

Rather than arbitrarily choosing “hard” problems, I developed a systematic methodology by analyzing how many SWE-agents solve each instance across all 500 problems in SWE-Bench Verified.

Instance Solve Distribution Analysis

Each of the 500 instances was checked against evaluation results from 83 distinct SWE-agents (submitted between October 2023 and May 2025) to record solve counts. “Solved” means the agent’s fix passed the verification test suite - the same standard used in the original SWE-bench evaluation.

I categorized all instances based on how many of the 83 evaluated SWE-agents successfully solve them:

Instance Solve Count Distribution

How many SWE-Bench Verified instances are solved by different numbers of agents (out of 83 evaluated)

Bucket | Solve Count | Instances | Percentage | Easy | Medium | Hard | Single | Multi
Unsolved | 0 agents | 52 | 10.4% | 5 | 26 | 21 | 27 | 25
Ultra Rare | 1-2 agents | 26 | 5.2% | 6 | 16 | 4 | 17 | 9
Very Rare | 3-5 agents | 17 | 3.4% | 3 | 10 | 4 | 14 | 3
Rare | 6-10 agents | 22 | 4.4% | 1 | 19 | 2 | 19 | 3
Uncommon | 11-20 agents | 38 | 7.6% | 13 | 22 | 3 | 28 | 10
Common | 21-40 agents | 96 | 19.2% | 27 | 62 | 7 | 82 | 14
Very Common | 41-60 agents | 93 | 18.6% | 38 | 52 | 3 | 88 | 5
Solved | 61+ agents | 156 | 31.2% | 101 | 53 | 2 | 154 | 2

Key insights:

  • 69% of problems are solved by 21+ agents, resulting in limited discrimination during evaluation
  • The competitive frontier exists in the first five categories (155 instances total), solved by ≤20 agents (High discrimination potential)
  • 52 completely unsolved problems (Maximum discrimination)
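To make the bucketing concrete, here is a minimal sketch of how the distribution above could be recomputed from per-agent results; the input format (a mapping from instance ID to solve count) and the names are hypothetical, not the exact pipeline I used.

from collections import Counter

# Hypothetical input: solve_counts maps each instance_id to the number of the
# 83 evaluated agents whose fix passed the verification tests (e.g. aggregated
# from per-agent resolved-instance lists published on the leaderboard).
BUCKETS = [
    ("Unsolved",    lambda n: n == 0),
    ("Ultra Rare",  lambda n: 1 <= n <= 2),
    ("Very Rare",   lambda n: 3 <= n <= 5),
    ("Rare",        lambda n: 6 <= n <= 10),
    ("Uncommon",    lambda n: 11 <= n <= 20),
    ("Common",      lambda n: 21 <= n <= 40),
    ("Very Common", lambda n: 41 <= n <= 60),
    ("Solved",      lambda n: n >= 61),
]

def bucket_for(count):
    # Map a solve count to its distribution bucket.
    for name, test in BUCKETS:
        if test(count):
            return name
    raise ValueError(f"unexpected solve count: {count}")

def solve_distribution(solve_counts):
    # Count how many of the 500 instances fall into each bucket.
    return Counter(bucket_for(n) for n in solve_counts.values())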


Four Discriminative Subsets

Rather than continuing to measure incremental improvements on largely solved Easy problems, I designed targeted subsets to focus on:

  1. Completely unsolved problems (52 instances) - true frontier challenges
  2. Sparsely solved problems - instances resolved by only a handful of agents
  3. Problems with high solution variance - where top SWE-agents show meaningful differences

This approach yields a more discriminative and high-resolution instrument for measuring real-world SWE-agent capability differences, similar to how other AI benchmarks have evolved when existing evaluations became saturated.

Based on this analysis, I created four targeted evaluation subsets. The “Solve Range” column shows how many agents successfully solve the problems within each subset - for example, Frontier subset problems are solved by 0-5 agents, making them the most evaluatively sensitive.

Subset | Description | Total | Easy | Medium | Hard | Single | Multi | Solve Range | Top Agent %
Frontier | Solved by ≤5 agents | 95 | 14 | 52 | 29 | 58 | 37 | 0–5 | 11.6%
Challenging | Solved by ≤20 agents | 155 | 28 | 93 | 34 | 105 | 50 | 0–20 | 31.6%
Hard | All Hard problems | 45 | 0 | 0 | 45 | 20 | 25 | 0–61 | 42.2%
MultiFile | Multi-file + ≤10 solves | 40 | 3 | 17 | 20 | 0 | 40 | 0–7 | 10.0%

Subset Relationships:

  • Frontier ⊆ Challenging (all Frontier problems are included in Challenging)
  • Hard and MultiFile subsets partially overlap with both Frontier and Challenging subsets
  • Single-file problems involve changes to one source file, while multi-file problems require coordinated modifications across multiple files.
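To make the four definitions concrete, here is a minimal sketch of the subset filters. The per-instance fields (solve_count, difficulty, files_edited) are hypothetical names standing in for the solve-count analysis above, the SWE-Bench Verified difficulty annotations, and the gold-patch file counts.

# Hypothetical per-instance record; field names are illustrative:
#   solve_count   - number of the 83 agents that resolved the instance
#   difficulty    - "easy" | "medium" | "hard" (annotator label)
#   files_edited  - number of files touched by the gold patch
def build_subsets(instances):
    # Partition SWE-Bench Verified instances into the four targeted subsets.
    frontier    = [i for i in instances if i["solve_count"] <= 5]
    challenging = [i for i in instances if i["solve_count"] <= 20]
    hard        = [i for i in instances if i["difficulty"] == "hard"]
    multifile   = [i for i in instances
                   if i["files_edited"] > 1 and i["solve_count"] <= 10]
    return {"frontier": frontier, "challenging": challenging,
            "hard": hard, "multifile": multifile}

Note that Frontier ⊆ Challenging falls directly out of the two solve-count thresholds.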

1. Frontier Subset (95 instances)

Problems solved by ≤5 agents - maximum evaluative sensitivity

This subset combines completely unsolved problems with ultra-rare and very-rare solves. It’s composed of 14 Easy, 52 Medium, and 29 Hard problems. Notably, even the top-performing Claude 4 Opus scores just 11.6% on this subset, compared to its 73.2% on the full benchmark. This provides extraordinary differentiation power between cutting-edge systems.

  • Composition: 58 single-file, 37 multi-file problems
  • Top Performance: Claude 4 Opus 11.6% (vs 73.2% on full benchmark)
  • Purpose: Maximum resolution for cutting-edge agent comparison

Top-10 performing SWE-agents on the Frontier subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 11/95 | 11.6%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 8/95 | 8.4%
3 | OpenHands + Claude 4 Sonnet | 7/95 | 7.4%
4 | Zencoder (2025-04-30) | 6/95 | 6.3%
5 | Nemotron-CORTEXA (2025-05-16) | 5/95 | 5.3%
6 | Learn-by-interact | 4/95 | 4.2%
7 | TRAE | 3/95 | 3.2%
8 | Refact.ai Agent | 3/95 | 3.2%
9 | SWE-agent + Claude 4 Sonnet | 3/95 | 3.2%
10 | Blackbox AI Agent | 3/95 | 3.2%

2. Challenging Subset (155 instances)

Problems solved by ≤20 agents - strong evaluative power

Expanding to include rare and uncommon problems, this subset maintains robust differentiating ability while providing more instances for statistical significance. Claude 4 Opus reaches 31.6% here - still providing much better separation than the 84%+ scores on Easy problems.

  • Composition: 28 Easy, 93 Medium, 34 Hard; 105 single-file, 50 multi-file
  • Top Performance: Claude 4 Opus 31.6%
  • Purpose: Balance of resolution and statistical significance

Top-10 performing SWE-agents on the Challenging subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 49/155 | 31.6%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 42/155 | 27.1%
3 | OpenHands + Claude 4 Sonnet | 36/155 | 23.2%
4 | Zencoder (2025-04-30) | 36/155 | 23.2%
5 | TRAE | 34/155 | 21.9%
6 | Nemotron-CORTEXA (2025-05-16) | 32/155 | 20.6%
7 | devlo (2025-05-19) | 30/155 | 19.4%
8 | Refact.ai Agent | 27/155 | 17.4%
9 | Blackbox AI Agent | 27/155 | 17.4%
10 | Learn-by-interact | 26/155 | 16.8%

3. Hard (45 instances)

All Hard difficulty problems regardless of solve rate

This subset focuses specifically on the 45 Hard problems, with Claude 4 Opus achieving 42.2%. Interestingly, this includes some problems solved by many agents, showing that difficulty level and solve count don’t perfectly correlate.

  • Composition: 0 Easy, 0 Medium, 45 Hard; 20 single-file, 25 multi-file
  • Top Performance: Claude 4 Opus 42.2%
  • Purpose: Focused evaluation on most difficult problem category

Top-10 performing SWE-agents on the Hard subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Opus (2025-05-22) | 19/45 | 42.2%
2 | Tools + Claude 4 Sonnet (2025-05-22) | 15/45 | 33.3%
3 | OpenHands + Claude 4 Sonnet | 15/45 | 33.3%
4 | TRAE | 14/45 | 31.1%
5 | devlo (2025-05-19) | 13/45 | 28.9%
6 | OpenHands (2025-04-15) | 13/45 | 28.9%
7 | Zencoder (2025-04-30) | 12/45 | 26.7%
8 | Nemotron-CORTEXA (2025-05-16) | 12/45 | 26.7%
9 | SWE-agent + Claude 4 Sonnet | 12/45 | 26.7%
10 | OpenHands + 4x Scaled (2024-02-03) | 12/45 | 26.7%

4. MultiFile (40 instances)

Multi-file problems solved by ≤10 agents

This subset targets the intersection of multi-file problems (which tend to be harder) with low solve counts. These problems require coordinated edits across multiple source files, making them more complex for current SWE-agents. It’s the most challenging subset - the best performer here, Tools + Claude 4 Sonnet, achieves only 10.0%. The composition (3 Easy, 17 Medium, 20 Hard) skews heavily toward Hard, reinforcing that multi-file problems are inherently more difficult.

  • Composition: 3 Easy, 17 Medium, 20 Hard; 0 single-file, 40 multi-file
  • Top Performance: Claude 4 Sonnet 10.0%
  • Purpose: Target intersection of multi-file complexity and low solve rates

Top-10 performing SWE-agents on the MultiFile subset:

Rank | SWE-Agent | Resolved | Percentage
1 | Tools + Claude 4 Sonnet (2025-05-22) | 4/40 | 10.0%
2 | Tools + Claude 4 Opus (2025-05-22) | 3/40 | 7.5%
3 | OpenHands + Claude 4 Sonnet | 3/40 | 7.5%
4 | SWE-agent + Claude 4 Sonnet | 3/40 | 7.5%
5 | TRAE | 2/40 | 5.0%
6 | Zencoder (2025-04-30) | 2/40 | 5.0%
7 | OpenHands (2025-04-15) | 2/40 | 5.0%
8 | Blackbox AI Agent | 2/40 | 5.0%
9 | Learn-by-interact | 2/40 | 5.0%
10 | Amazon Q Developer Agent (v20240719-dev) | 2/40 | 5.0%

Summary Comparison

Subset | Instances | Top Agent % | Focus
Frontier | 95 | 11.6% | Maximum sensitivity
Challenging | 155 | 31.6% | Broad + sensitive
Hard | 45 | 42.2% | Traditional difficulty
MultiFile | 40 | 10.0% | Real-world complexity

Key Insights and Patterns

1. Medium Problems Drive Frontier Resolution

The Frontier subset contains 52 Medium vs 29 Hard problems, revealing that traditional difficulty categories don’t capture all sources of complexity. Some Medium problems remain more challenging than many Hard problems. This suggests that ‘Medium’ in SWE-Bench often reflects problem scope or context rather than true agent difficulty.

2. Multi-file Problems Are Genuinely Harder

Multi-file problems dominate the low-solve buckets:

  • 40/40 MultiFile problems are multi-file (by design)
  • 37/95 Frontier problems are multi-file (39%)
  • Only 2/156 “Solved” problems are multi-file (1.3%)

3. Massive Performance Differentiation Gains

The performance gaps between the full benchmark and the targeted subsets are dramatic. While the top two SWE-agents (Tools + Claude 4 Opus and Tools + Claude 4 Sonnet) are separated by less than 1 percentage point on the entire 500-instance SWE-Bench Verified benchmark (73.2% vs 72.4%), the specialized subsets create substantial separation between these same systems - for example, 11.6% vs 8.4% on the Frontier subset. These clear performance gaps demonstrate the power of focusing evaluation on truly challenging problems.

While the specialized subsets have smaller sample sizes, the large performance differences (drops of roughly 30-65 percentage points from full-benchmark performance) clearly indicate meaningful capability distinctions.

Performance Comparison Across Subsets

Top SWE-agents performance on full benchmark vs. discriminative subsets

[Chart: per-agent scores on the full benchmark vs. the Frontier, Challenging, Hard, and MultiFile subsets]

Implications for the Research Community

Immediate Benefits

  1. Enhanced Signal: Researchers can immediately use these subsets for more sensitive evaluation
  2. Research Focus: Identifying specific problem types guides targeted improvement efforts
  3. Curriculum Design: Facilitates agent fine-tuning based on real-world bottlenecks
  4. Clearer Progress: Performance improvements on frontier problems represent genuine capability advances

Methodological Innovation

This solve-distribution approach could be applied to other saturated benchmarks:

  1. Analyze how many systems solve each problem
  2. Identify low-solve instances for targeted subsets
  3. Create sensitive evaluation instruments
  4. Maintain evaluative power as capabilities advance

Future Evolution

As SWE-agents continue improving, this methodology enables dynamic subset creation:

  • Current “Ultra Rare” problems may become “Common”
  • New targeted subsets can be generated using the same framework
  • Evaluation maintains sensitivity to genuine progress

Technical Implementation Notes

Data Availability

All subset definitions and solve matrices are available as structured JSON data, enabling:

  • Immediate subset evaluation using existing SWE-Bench infrastructure
  • Integration with current evaluation pipelines
  • Reproducible research and fair comparisons

Evaluation Guidelines

For consistent evaluation across the research community:

  1. Report both full benchmark and subset performance (a scoring sketch follows this list)
  2. Use Frontier subset for cutting-edge system comparison
  3. Use Challenging subset for statistical significance
  4. Use specialized subsets (Hard, MultiFile) for targeted research
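To support the first guideline, here is a small reporting helper - a sketch with hypothetical argument names, not part of any official tooling - that prints full-benchmark and per-subset resolve rates from a single set of resolved instance IDs:

def report_scores(resolved_ids, subset_ids, total_instances=500):
    # resolved_ids: set of instance IDs an agent resolved on the full benchmark
    # subset_ids:   dict mapping subset name -> list of that subset's instance IDs
    full = 100.0 * len(resolved_ids) / total_instances
    print(f"Full benchmark: {full:.1f}%")
    for name, ids in subset_ids.items():
        hits = sum(1 for iid in ids if iid in resolved_ids)
        print(f"{name:>11}: {hits}/{len(ids)} ({100.0 * hits / len(ids):.1f}%)")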

Try the Discriminative Subsets

The targeted subsets are available on HuggingFace for immediate use:

Dataset: jatinganhotra/SWE-bench_Verified-discriminative

Four splits: frontier, challenging, hard, multifile

from datasets import load_dataset

# Load all splits
dataset = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")

# Load specific split for targeted evaluation
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
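As a follow-up, the loaded splits can feed a reporting helper like the one sketched under Evaluation Guidelines. This assumes the splits preserve SWE-bench's standard instance_id column; resolved_ids is whatever your own evaluation harness produces.

# Collect each split's instance IDs (assumes the standard SWE-bench
# "instance_id" column is preserved in this derived dataset).
subset_ids = {name: dataset[name]["instance_id"]
              for name in ("frontier", "challenging", "hard", "multifile")}

# resolved_ids comes from your own evaluation run: the set of instance IDs
# whose patches passed the verification tests.
resolved_ids = set()  # placeholder
report_scores(resolved_ids, subset_ids)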

I encourage researchers to benchmark their SWE-agents on these discriminative subsets and share results publicly. Tracking progress here, not just on the full leaderboard, will better reflect true capability gains in autonomous software engineering than incremental improvements on saturated problem sets.


Conclusion

This analysis reveals that SWE-Bench Verified’s evaluative power has become concentrated in a subset of challenging problems. Systematically identifying these problems through solve-distribution analysis enables high-resolution benchmarks like the Frontier Subset, which reveal capability distinctions that broad benchmarks obscure.

The Challenging Subset balances sensitivity with statistical power, while the Hard and MultiFile subsets offer targeted evaluation for specific research directions.

Most importantly, this methodology - analyzing solve distribution to identify evaluatively sensitive problems - provides a data-driven framework for benchmark evolution. As AI capabilities continue advancing, this approach ensures evaluation maintains sensitivity to genuine progress rather than incremental improvements on saturated problem sets.

The targeted subsets are immediately usable with existing SWE-Bench infrastructure, enabling the research community to adopt more sensitive evaluation practices while pushing toward the true frontiers of automated software engineering.


Citation

If you find this analysis useful for your research, please cite it as:

@misc{ganhotra2025discriminative,
  title={From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets},
  author={Ganhotra, Jatin},
  year={2025},
  month={June},
  url={https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets/},
  note={Blog post}
}



