GPT-4o vs Claude 3.5 Sonnet for Code

Written by Simon Spurrier | Nov 14, 2024

OpenAI vs Anthropic

Until now, OpenAI and Anthropic have traded the code generation crown. Google’s Gemini models have led on context length and xAI is amassing enormous GPU capacity, but neither has challenged the duopoly yet.

Currently, Anthropic is in the lead. Their latest flagship model, Claude 3.5 Sonnet 20241022, tops most benchmarks and, anecdotally, is widely considered ‘better’. It extends the lead established by the previous release of Claude 3.5 Sonnet.

In a random sample of 49,923 generations from Engine’s users, Claude 3.5 was by far the most preferred, as shown in the chart below.

OpenAI’s latest models that focus on improved reasoning, the o1 family, have yet to be fully released and are missing some important API features like system messages and tool use. The full release of o1 is likely to happen before the end of 2024 and could challenge Claude 3.5.

Additionally, OpenAI’s next generation model, known as Orion or GPT-5, is rumoured to be months away. A jump from GPT-4 to 5 similar to that from 3.5 to 4 would reset the landscape and put OpenAI far ahead.

It is reasonable to expect these models to improve with time in a fairly linear fashion for at least the next couple of years. Code generation will benefit.

Other Options

There are well-known names in this category, including Meta. However, at the margin, closed-source models with the most funding and compute capacity appear to have a commanding lead. For high-value use cases, like production code generation, the extra capability is worth the cost and the closed-source trade-offs.

There are also several notable fine-tunes and foundation models optimised specifically for code generation. They often claim or predict state-of-the-art benchmark performance, but this cannot be verified since their logs have not been published and the models are not publicly available. It is our view that these models are unlikely to achieve generalised capabilities beyond those of the market leaders. In the long term, foundational models will outperform, especially in code generation, where they have demonstrated exceptional generalisability.

Exciting progress is being made in areas other than raw code generation ability. OpenAI’s mini models and Google’s Flash models are around 30 times cheaper than their full-size counterparts. Groq runs open-source models on custom hardware, delivering extremely fast inference. However, for code generation in a work setting, value exists mostly at the frontier.
