We frequently monitor the leaderboard for SWE-bench, a benchmark that measures how well LLMs can edit real-world Python codebases to resolve GitHub issues. Although this is slightly different from what we're doing at Engine Labs, SWE-bench still provides a reasonable way to track performance and keep an eye on the state of the art.
A new version of Anthropic's Claude 3.5 Sonnet was released in October 2024 - with it came a jump in the success rates of submissions to the SWE-bench leaderboard. At the same time, Anthropic also released tools that allow Claude to use a computer directly.
Examining submission logs, we saw that two new high-scoring leaderboard entrants were both using tools that looked like they were based on code from Anthropic's computer use demo quickstart.
The EditTool from the quickstart in particular looked like a novel way of providing an LLM with the ability to view, create, and edit files. There have been a few different approaches to providing LLMs with these abilities over time - we had been successfully using a set of tools based on the SWE-agent paper to achieve the same things.
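To give a concrete sense of the approach, here is a rough sketch of an EditTool-style file editor. The command names (view, create, str_replace, insert) mirror what the quickstart's tool appears to accept, but the implementation below is our own approximation rather than Anthropic's code:

```python
# Hypothetical sketch of an EditTool-style file editor exposed to the model.
# The command names mirror the quickstart's EditTool; the behaviour here is
# an approximation for illustration only.
from pathlib import Path


def edit_tool(command: str, path: str, **kwargs) -> str:
    p = Path(path)
    if command == "view":
        # Return the file with line numbers so the model can reference them.
        lines = p.read_text().splitlines()
        return "\n".join(f"{i + 1}\t{line}" for i, line in enumerate(lines))
    if command == "create":
        p.write_text(kwargs["file_text"])
        return f"Created {path}"
    if command == "str_replace":
        # Replace a snippet only if it appears exactly once, to avoid ambiguity.
        text = p.read_text()
        old, new = kwargs["old_str"], kwargs.get("new_str", "")
        if text.count(old) != 1:
            return f"Error: old_str must appear exactly once in {path}"
        p.write_text(text.replace(old, new, 1))
        return f"Edited {path}"
    if command == "insert":
        # Insert new text after the given (1-indexed) line number.
        lines = p.read_text().splitlines(keepends=True)
        idx = kwargs["insert_line"]
        lines[idx:idx] = [kwargs["new_str"] + "\n"]
        p.write_text("".join(lines))
        return f"Inserted into {path}"
    return f"Unknown command: {command}"
```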
We wanted to see if EditTool was at all responsible for the increased benchmark success rates. We used the Aider project's code editing benchmark to run two evaluations - one for the new tool and one for our existing toolset - using a basic agent loop with a maximum of 5 iterations and the newest Claude model, claude-3-5-sonnet-20241022.
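The harness itself was nothing exotic. The sketch below shows the general shape of a bounded tool-use loop against the Anthropic Messages API; it is not our exact evaluation code, and it assumes a TOOLS list of tool schemas and a run_tool dispatcher (like the edit_tool sketch above) that are not shown here:

```python
# Minimal sketch of a bounded agent loop over the Anthropic Messages API.
# TOOLS (tool schemas) and run_tool (a dispatcher) are assumed to exist;
# this illustrates the loop structure, not our production harness.
import anthropic

client = anthropic.Anthropic()


def solve(task_prompt: str, max_iterations: int = 5) -> str:
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model has stopped calling tools; treat its text as the answer.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every tool call in the turn and feed the results back.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, **block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Stopped after reaching the iteration limit."
```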
{
"successRate": "55.6%",
"successes": 74,
"failures": 59,
"errors": 0
}
{
"successRate": "54.9%",
"successes": 73,
"failures": 60,
"errors": 0
}
Overall performance is nearly identical between the two toolsets, suggesting that the new toolset is unlikely to have made a difference to benchmark success rates.
An interesting observation from the tests is that each toolset succeeded in some cases where the other failed. We did not repeat any of these tests, but we would expect some variation in outcomes for the same inputs due to inherent LLM non-determinism, which depends in part on temperature settings (we used the default Anthropic API temperature here).
If you would like to see more detailed results, please reach out to us at contact[at]enginelabs.ai!
(We also recently submitted results for our agent to the Verified leaderboard of SWE-bench, with a 51.8% success rate - keep an eye out!)