On ChatGPT Dumbness, Trustbit Benchmarks and ML Product Labs
The reports of my dumbness are greatly exaggerated. (c) ChatGPT
There are a few things I would like to discuss with you today:
Is ChatGPT getting dumber? No
September LLM Benchmarks from Trustbit - 70B catches up with GPT-3.5.
Two new guides on building with LLMs at ML Product Labs
Is ChatGPT getting dumber? No
Have you heard stories that ChatGPT is getting dumber and dumber with each passing day? They are mostly based on the Stanford paper How Is ChatGPT’s Behavior Changing over Time? This paper quizzes ChatGPT on math problems and code generation, and concludes that things are getting worse.
This research is valuable, since it teaches us a few important lessons. If you want ChatGPT (or any other LLM) to return unpredictable results, then:
Don't change the system prompt; use whatever the default one is.
Don't use few-shot samples to pin down the output format.
Never set the temperature to 0, even if you need stable results.
I believe that is exactly what the paper in question did, achieving the desired outcome: unpredictable responses from ChatGPT that change between versions.
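For contrast, here is what the stable setup looks like. This is a minimal sketch using the pre-1.0 OpenAI Python SDK (the openai.ChatCompletion.create call that was current at the time), not the paper's actual harness; the calculator prompt is an invented example. Note that even temperature 0 only makes responses mostly repeatable, not perfectly deterministic.

```python
import openai  # pip install openai (pre-1.0 SDK); reads OPENAI_API_KEY from the environment

# Pin down everything the paper left to chance: a fixed system prompt,
# an explicit output format, and temperature 0 for greedy decoding.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # the most stable responses the API offers
    messages=[
        {
            "role": "system",
            "content": "You are a calculator. Reply with a single integer and nothing else.",
        },
        {"role": "user", "content": "What is 17 * 23?"},
    ],
)
print(response.choices[0].message.content)  # "391"
```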
Out of these points, few-shot prompting is the most important one. While benchmarking a wide variety of LLMs, I have learned that providing good few-shot samples is the most universal way to improve response quality.
Never trust any LLM to respond in a specific format based on your instructions alone. Just as in a human conversation, instructions can be misinterpreted. Different versions of an LLM can misinterpret the same instructions to different degrees, creating the illusion that the model is getting worse.
By providing concrete examples of what we want to achieve, we anchor our expectations for the LLM in a tangible form. Requirements always help.
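Here is a sketch of few-shot format anchoring in practice, under the same SDK assumption as above. The city-extraction task and the example pairs are invented for illustration; the point is that the assistant turns demonstrate the exact output shape instead of merely describing it.

```python
import openai  # pre-1.0 SDK, as in the sketch above

# Few-shot anchoring: instead of only describing the output format, demonstrate it.
messages = [
    {"role": "system", "content": "Extract the city from the sentence. Reply as JSON."},
    # Shot 1: shows the exact JSON shape we expect back.
    {"role": "user", "content": "I flew to Berlin last Monday."},
    {"role": "assistant", "content": '{"city": "Berlin"}'},
    # Shot 2: a second example narrows the interpretation further.
    {"role": "user", "content": "The conference takes place in Vienna."},
    {"role": "assistant", "content": '{"city": "Vienna"}'},
    # The real input comes last.
    {"role": "user", "content": "We are opening an office in Lisbon."},
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=messages,
)
print(response.choices[0].message.content)  # expected: {"city": "Lisbon"}
```

The assistant messages act as ground-truth completions, so the model simply continues the established pattern for the final user message.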
And no, ChatGPT quality isn’t generally getting worse. There are a few minor regressions, but overall large language models just keep getting better, faster and cheaper.
LLM Benchmarks - 70B catches up with GPT-3.5
Speaking of benchmarks: the September edition of the Trustbit LLM Benchmarks on Enterprise Workloads will be published soon. Here is a preview of the numbers:
Long story short, the Llama 2 70B fine-tune called Nous Hermes has already beaten the old version of GPT-3.5 on Enterprise Workloads. It is a bit pricey to run on your own hardware, but things are just getting started.
You’ll be able to find more insights in Trustbit’s September LLM benchmarks next week. Keep an eye on the insights page or follow @trustbittech on Twitter. An explanation of the cost and speed columns is coming there, too.
ML Product Labs
As a subscriber of this newsletter, you get access to two new guides on building products with LLMs under the hood:
A systematic approach to applying AI/ML in organisations
Principles for capturing feedback and user interactions to build better ML-driven products
You can find them at ML Product Labs: https://labs.abdullin.com.
Just make sure to use the exact email address that you used to subscribe to this newsletter. If this is your first login, you’ll also be prompted to set up a new password for the labs.
New subscribers will also get access to the labs, but they will need to wait a day after subscribing to the newsletter (yeah, I’m still running that sync script manually :] )
Have a great week!