OpenAI o1 Benchmarks - and Streamlining Coding with o1-preview for Maximum Efficiency
Checking the performance of the new o1 models from OpenAI. Bonus: a practical tip on efficient coding with OpenAI o1-preview.
Benchmarking the new models from OpenAI: as expected, o1 tops the ranking - but with a catch
Here is the leaderboard of my LLM benchmarks with the new o1 family.
o1-preview leads my LLM product benchmark, surpassing GPT-4o.
o1-mini ranks third: its performance would improve if it just paid attention to the instructions every time, rather than only every other time (you can see this in the integrate column).
Ultimately, o1 ends up far superior to the others at reasoning. But there is a catch: the models are very expensive for what they deliver. Moreover, I couldn't even run them out of the box: all of my tests had max_tokens
set high enough for a normal response. But since these models now also generate hidden tokens while reasoning, that budget wasn't enough, and I ended up with empty responses.
OpenAI now recommends setting max_completion_tokens to at least 25,000 when experimenting with these models.
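For reference, here is a minimal sketch of what such a call looks like with the official openai Python SDK; the prompt is just a placeholder, and the token budget follows the recommendation above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    # o1 models reject the old max_tokens parameter; reasoning tokens are
    # hidden but still count against (and are billed within) this budget.
    max_completion_tokens=25000,
    messages=[{"role": "user", "content": "Rewrite this function ..."}],
)

print(response.choices[0].message.content)
```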
In the end, the models turn out to be slightly more capable but significantly more expensive than comparable ones. This is due to the high per-token price and the large number of hidden reasoning tokens, which are billed as well.
In my opinion, these models won’t become mainstream - they are too expensive and unusual. But other providers can copy the approach and try to apply it to cheaper models - this might get interesting.
This is my biggest takeaway from o1-preview
There is one category of tasks where the new models have already become indispensable for me, especially o1-preview: coding. If previous models like Sonnet 3.5 or GPT-4 perform at the level of a well-read junior developer, then o1-preview is already a solid mid-level one.
Here is the mental model I use when coding with o1-preview.
I assign coding tasks as if to a highly experienced developer (trained on the full spectrum of knowledge), who’s picked up some bad habits along the way (trained on many posts where people overcomplicated things), but understands me instantly, without needing me to over-explain (o1 requires much less prompt engineering).
If the scope of the assignment is precise, a short prompt suffices:
Rewrite this course template in golang to follow the style of my own website. You can reuse all of my styles and drop the external css (as used by the course).
<golang template to rewrite>
<full html source of my website, as copied from browser>
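Since the whole point is pasting complete files rather than snippets, assembling such a prompt can be as simple as concatenating the files into a single user message. A rough sketch, with made-up file names standing in for the real ones:

```python
from pathlib import Path

# Placeholder file names: the golang template to rewrite and the saved
# HTML source of the website whose style should be reused.
template = Path("course_template.gohtml").read_text()
site_html = Path("my_site.html").read_text()

prompt = (
    "Rewrite this course template in golang to follow the style of my own website. "
    "You can reuse all of my styles and drop the external css (as used by the course).\n\n"
    "Golang template to rewrite:\n" + template + "\n\n"
    "Full html source of my website:\n" + site_html
)
```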
If I'm asking it to rewrite code or redesign the architecture for a new feature - a more abstract assignment - the model can come up with several different solutions. Here I break the task into two steps: Explore + Implement.
In the first step I ask it to suggest solution options, with priorities indicated - simple code, no overcomplication - and I paste the sources in as whole files: for example, code using vue.js + pinia + tailwind css + axios + vite + lucide + a custom icon resolver + python FastAPI. o1-preview will handle sorting it all out.
Take a look at this code from my multi-mode (a la vim or old terminal apps) block-based content editor.
I want to build on the keyboard interface and introduce a simple way to have simple commands with a small popup. E.g. after pressing "A" in "view" mode, show the user a popup that expects H, T, I, or V.
Or, after pressing "P" in view mode - show a small popup that has a text input waiting for the permission role for the page.
Don't implement the changes, just think through how to extend existing code to make logic like that simple.
Remember, I like simple code, I don't like spaghetti code and many small classes/files.
Then in the second step I say: I like solutions 2-5, 10, and 12-16. Integrate them into working code so that I only have to copy and paste it.
Write for me the files that incorporate your suggestions: 2-5, 10, 12-16
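Scripted against the API, the whole Explore + Implement loop is just two turns of a single conversation. A rough sketch, assuming the chat completions endpoint, with the sources and prompts reduced to placeholders:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def ask(messages):
    # One o1-preview turn, with headroom for the hidden reasoning tokens.
    response = client.chat.completions.create(
        model="o1-preview",
        messages=messages,
        max_completion_tokens=25000,
    )
    return response.choices[0].message.content

# Placeholder: paste the relevant sources as whole files.
sources = Path("ContentEditor.vue").read_text()

# Step 1: Explore - ask for prioritized, simple solution options.
messages = [{
    "role": "user",
    "content": "Suggest options for extending the keyboard interface with "
               "popup commands. Keep the code simple.\n\n" + sources,
}]
options = ask(messages)

# Step 2: Implement - pick the options I like and ask for copy-paste-ready files.
messages += [
    {"role": "assistant", "content": options},
    {"role": "user", "content": "Write for me the files that incorporate "
                                "your suggestions: 2-5, 10, 12-16"},
]
print(ask(messages))
```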
In 95% of cases, the code works right away! This is a huge time saver compared to manually prompting with Sonnet 3.5 or even top-tier GPT-4.
The ability to trust the model with complex tasks, while keeping the code simple and maintainable, is a tangible advantage.
Instead of fine-tuning prompts and handling the tedious details, I can now focus more on solving the real problems at hand - and I’m really curious to see how far these models can push productivity in real-world development.