OpenAI o1

an archive of posts with this tag

Apr 15, 2025 Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?
The task-difficulty distribution in SWE-bench Verified, based on human annotations, and what easy, medium and hard mean for software engineering agents such as SWE-agent and Agentless running Claude and OpenAI o1.
Mar 30, 2025 The Multi-File Frontier: Why SWE-Bench Verified Doesn't Reflect Real-World Programming Challenges
Why SWE-bench Verified's focus on single-file changes misses real-world multi-file programming — analyzed across SWE-agent, Agentless, Claude 3 Opus, Claude 3.5 Sonnet, OpenAI o1 and Amazon Q.