blog | Jatin Ganhotra

A New Chapter in Blogging: Exploring the World of Agents

After a decade-long hiatus, I am thrilled to announce my return to blogging! This new journey will center around the fascinating and ever-evolving domain of Agents, with a particular focus on Software Engineering Agents (SWE-Agents).

Through this blog, I aim to share ideas and developments in this field, spark discussion, and publish analysis that readers find useful. I welcome your feedback and perspectives, so please share your thoughts in the comments and join the conversation.

Hidden Naming Contracts in SWE-Agent Benchmarks

A programmatic scan of six SWE-bench-style benchmarks, including SWE-bench Verified, SWE-bench Pro and SWE-PolyBench, finds tests that encode hidden naming contracts, penalizing behaviorally correct fixes that choose different identifiers.

23 min read · April 05, 2026

2026 · evaluation benchmarks SWE-Bench SWE-Bench_Verified SWE-Bench_Pro SWE-PolyBench SWE-agent · blog swe-agents

The Visual Complexity Penalty in Code Understanding - SWE-bench Multimodal Analysis

How visual complexity penalizes SWE-agents on SWE-bench Multimodal — testing SWE-agent, Agentless and OpenHands with Claude 3.7 Sonnet and OpenAI o3 on visually rich GitHub issues.

10 min read · July 26, 2025

2025 · evaluation benchmarks SWE-Bench_Multimodal SWE-agent Agentless OpenHands Claude 3.7 Sonnet · blog swe-agents

From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets

Discriminative subsets of SWE-bench Verified reveal true SWE-agent capability — how aggregate scores hide wide variation across SWE-agent, OpenHands, Claude 4 Opus and the L* agent (from 73% to 11%).

14 min read · June 05, 2025

2025 · evaluation benchmarks SWE-Bench_Verified SWE-agent OpenHands Claude 4 Opus L* agent · blog swe-agents

Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?

Task-difficulty distribution in SWE-bench Verified from human annotations — what easy, medium and hard mean for SWE-agents like SWE-agent and Agentless running Claude and OpenAI o1.

10 min read · April 15, 2025

2025 · evaluation benchmarks SWE-Bench_Verified SWE-agent Agentless Claude 3.5 Sonnet OpenAI o1 · blog swe-agents

The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges

Why SWE-bench Verified's focus on single-file changes misses real-world multi-file programming — analyzed across SWE-agent, Agentless, Claude 3 Opus, Claude 3.5 Sonnet, OpenAI o1 and Amazon Q.

6 min read · March 30, 2025

2025 · evaluation benchmarks SWE-Bench_Verified SWE-agent Agentless Amazon Q Claude 3 Opus OpenAI o1 · blog swe-agents
