Claude 3.5 Sonnet
an archive of posts with this tag
| Date | Post | Summary |
|---|---|---|
| Apr 15, 2025 | Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really? | Task-difficulty distribution in SWE-bench Verified from human annotations: what easy, medium and hard mean for SWE-agents like SWE-agent and Agentless running Claude and OpenAI o1. |
| Jan 05, 2025 | Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified | How SWE-agents (OpenHands, SWE-agent, Agentless) handle multi-file software engineering tasks compared to human developers on SWE-bench Verified, with Claude 3.5 Sonnet and OpenAI models. |
| Dec 31, 2024 | OpenHands CodeAct v2.1 v/s Tools + Claude 3.5 Sonnet | Head-to-head comparison of OpenHands CodeAct v2.1 and Anthropic's Tools + Claude 3.5 Sonnet on SWE-bench Verified, analyzing the performance differences and capabilities of these leading SWE-agent approaches. |
| Dec 26, 2024 | SWE-Bench Verified ⊊ real-world SWE tasks | Why SWE-bench Verified is only a subset of real-world software engineering tasks, comparing SWE-agents such as OpenHands CodeAct v2.1, Amazon Q, SWE-agent, Agentless and AutoCodeRover, with Claude 3.5 Sonnet. |