Schema-Guided Reasoning: What Changed in One Year
New learnings, industry adoption, and unexpected turns
I haven’t written to you in a while, so there is a lot to catch up on. Here are the most important highlights.
Schema-Guided Reasoning really took off
SGR is a distilled explanation of what the community has been doing all along in diverse projects over the past years: using predefined schemas (Structured Outputs) to force LLMs to think through predefined steps.
I used to call the approach Custom Chain of Thought or SO COT, but it didn’t really stick due to ambiguity. It took some rethinking, a family vacation in Thailand, and reframing to finally distill the concept of SGR.
The methodology is documented in a series of knowledge base articles on SGR. You can read them at your leisure. I’ll just highlight the three most important points.
SGR is a methodology for improving existing prompts by forcing predefined reasoning, making them transparent and predictable (a tiny illustration follows the list below).
SGR makes LLM workflows more testable - testability is the goal. What we can test, we can improve.
Testability lets us make systems more accurate, reduce hallucinations, or run them on small local models.
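To make that concrete, here is a minimal sketch of the idea, not taken from the knowledge base: a small Pydantic schema with predefined reasoning fields that the model must fill in order, passed to the API as a Structured Output. The schema, field names and model name are my own illustration, not a prescribed SGR design.

from typing import List

from openai import OpenAI
from pydantic import BaseModel


class ComplaintTriage(BaseModel):
    # predefined reasoning steps the model is forced to walk through, in order
    summary_of_complaint: str
    relevant_policy_points: List[str]
    severity_rationale: str
    # the final, testable outputs
    severity: int  # e.g. 1 (low) to 5 (critical)
    suggested_response: str


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Triage the customer complaint."},
        {"role": "user", "content": "The invoice you sent has the wrong VAT number again."},
    ],
    response_format=ComplaintTriage,
)
triage = completion.choices[0].message.parsed
# every intermediate step is now a typed field we can assert on in tests
assert 1 <= triage.severity <= 5

Because every step is a typed field, an eval can check not only the final answer but also the reasoning that produced it.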
And this is where the interesting things start happening.
SGR “Framework”
SGR is not a framework. But to demonstrate the principles, I wrote a self-contained sample of a business agent that can reason, use tools, memorise facts and adjust its plans to new circumstances. The codebase is just ~160 lines of Python code, including prompts (explained here).
The core of all these behaviours is a single data structure. While the LLM fills it in, it is forced to evaluate the current state, plan the remaining steps, and pick the next immediate action:
from typing import Annotated, List, Union

from annotated_types import MaxLen, MinLen
from pydantic import BaseModel, Field

# the tool schemas (ReportTaskCompletion, SendEmail, GetCustomerData, IssueInvoice,
# VoidInvoice, CreateRule) are small Pydantic models defined elsewhere in the sample

class NextStep(BaseModel):
    # we'll give the model some thinking space here
    current_state: str
    # a cycle to think about what remains to be done: at least 1, at most 5 steps;
    # we'll use only the first step and discard the rest
    plan_remaining_steps_brief: Annotated[List[str], MinLen(1), MaxLen(5)]
    # let's continue the cascade and check with the LLM whether the task is done
    task_completed: bool
    # routing to one of the tools to execute the first remaining step;
    # if the task is completed, the model will pick ReportTaskCompletion
    function: Union[
        ReportTaskCompletion,
        SendEmail,
        GetCustomerData,
        IssueInvoice,
        VoidInvoice,
        CreateRule,
    ] = Field(..., description="execute first remaining step")
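For context, here is a rough sketch of how such a schema could drive the agent loop. This is my own simplified illustration, not the exact code of the sample: the model fills in NextStep via Structured Outputs, the chosen tool is executed, and the result is fed back into the conversation. The run_agent and execute_tool names and the model choice are assumptions.

from openai import OpenAI

client = OpenAI()


def execute_tool(tool) -> str:
    # placeholder: the real sample dispatches on the concrete tool type here
    return f"executed {type(tool).__name__}"


def run_agent(task: str, max_steps: int = 20) -> None:
    messages = [
        {"role": "system", "content": "You are a business assistant. Plan, then act step by step."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        # Structured Outputs: the model has no choice but to fill in NextStep
        completion = client.beta.chat.completions.parse(
            model="gpt-5-mini",
            messages=messages,
            response_format=NextStep,
        )
        step = completion.choices[0].message.parsed
        if step.task_completed:
            break
        result = execute_tool(step.function)
        # feed the reasoning and the tool result back into the conversation
        messages.append({"role": "assistant", "content": step.model_dump_json()})
        messages.append({"role": "user", "content": f"Tool result: {result}"})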
Here is what happened after this open research was published:
A team from a bank took the demo and ported it to run on the tiny Qwen3-4B model, just because they were curious whether it could even work (it kind of does, but I would be more comfortable using a larger model). Source code.
The community took the SGR core and started building an open-source web chat that is capable of tool use and independent deep research. It is similar to ChatGPT DeepResearch in spirit, but works even with local models. Source code.
The source code and principles of SGR DeepResearch were taken by banks and integrated into several products: AI R&D teams study them, take them apart, and fold them into their own services and products.
The core SGR methodology has already been used by two banks, a few MedTech companies, a CRM system, and a whole bunch of startups. Engineers like it for making LLM pipelines more predictable and for reducing hallucinations.
I also heard of a project that uses the SGR DeepResearch core to drive internal knowledge mining (Confluence RAG and knowledge graph building) at a company, using just the Qwen3-4B model.
This sounds too good to be true, right? So here is a downside for you.
OpenAI thinks that SGR is a dead-end
During the last TED AI in Vienna I had a rare opportunity to chat with Łukasz Kaiser - mathematician and researcher at OpenAI, and one of the authors of the seminal “Attention Is All You Need” paper. He shared his thoughts on the progress of reasoning LLMs and on what will happen next (hint: researcher models are the next step).
When I explained SGR to Łukasz and asked for his thoughts on the future of this direction of hardcoded reasoning, his immediate reaction was that it is a dead-end. It is good that it enables companies to reduce hallucinations, build products and deploy to smaller local models, but in his view it remains a dead end.
Why?
Because it limits reasoning. For instance, an SGR-constrained model will probably never be capable of folding proteins (a feat worthy of a Nobel prize). In his opinion, training is the better solution.
So this gives us two major paths for AI in business:
If you have the resources to train and tune your own LLMs for specific business tasks (like OpenAI), then do it.
Otherwise - just take a capable small model and “tune” it to specific steps in the business process with SGR.
And here is one fun story about the latter option for you.
Case of SGR in Industrial Data Extraction
The project took place at TimeToAct Austria. The goal was to extract specifications of industrial components from a collection of data sheets across multiple vendors. Each specification has ~60 properties, and the team was looking at ~20k entities to be extracted from ~600 PDFs.
Each PDF is different. PDFs can contain tables, charts, schematics and complex text. PDFs for different component types can have different document structures, covering anywhere from a few components to hundreds. And PDFs from different vendors are guaranteed to differ.
The team managed to extract all of that in roughly a week, with 88% accuracy on a hard eval dataset and ~99% accuracy as measured by the customer.
Under the hood, the system used two SGR-optimised prompts running on gpt-5-mini, and the whole development and extraction cost less than 30 EUR in OpenAI tokens.
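I won’t reproduce the project’s actual schemas here, but to give a feel for the shape of such a prompt, here is a hypothetical sketch of an SGR-style extraction schema. The property names are invented for illustration.

from typing import List, Optional

from pydantic import BaseModel, Field


class ComponentSpec(BaseModel):
    # reasoning fields come first, forcing the model to ground each value
    evidence_quotes: List[str] = Field(..., description="verbatim datasheet fragments used")
    unit_conversion_notes: str
    # extracted properties follow (the real schema has ~60 of these)
    part_number: str
    rated_voltage_v: Optional[float] = None
    rated_current_a: Optional[float] = None
    operating_temp_min_c: Optional[float] = None
    operating_temp_max_c: Optional[float] = None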
That wasn’t the fun part, though. The fun part is that the system used an SGR workflow to drive code generation. Instead of “manually” extracting each of the components, the pipeline coded a specialised tool for each PDF.
You can think of it as an agent that writes the code for its own next step (a rough sketch of such a step follows the list below). This process generated 687 tools and 109,922 lines of code. No human has ever seen that code or cared about it. Why? Because:
the resulting accuracy of the pipeline is the only thing that matters, and it was beyond expectations
the code will never be maintained; if a change is needed, the team will just throw another 20 EUR at it and rewrite everything from scratch.
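To illustrate the code-writing step, here is a rough, hypothetical sketch of how it could be expressed as yet another SGR schema; the field names are my assumptions, not the project’s actual prompt. The model reasons about the document first and then emits the source of a specialised extraction tool, which the pipeline would execute against the parsed PDF.

from typing import List

from pydantic import BaseModel, Field


class GenerateExtractionTool(BaseModel):
    # force the model to study the document before writing any code
    document_layout_notes: str
    columns_and_units_observed: List[str]
    extraction_plan: List[str] = Field(..., description="short, ordered steps")
    # the specialised tool itself: Python source the pipeline runs in a sandbox
    python_source: str = Field(
        ..., description="def extract(parsed_pdf: dict) -> list[dict]: ..."
    )

The output of each generated tool can then be scored against the eval dataset, which is what made the approach safe despite nobody reading the code.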
How did the team make this work?
They spent most of the project time setting up an eval dataset that measured accuracy via an error map and drove the development of the pipeline (a rough sketch of such an error map follows the list below)
Focused on fast development iterations, which enabled them to run experiments in tight loops (10-30 minute iterations)
Used insights from the error maps to prioritise work and make changes to the SGR schema.
Relied on the concept of “just pay attention to the domain model and the language” that comes from Domain-Driven Design.
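As a rough illustration of that eval loop (my own sketch with invented names, not the project’s harness): compare the pipeline output against labelled entities property by property, and aggregate the mistakes into an error map that shows which SGR schema fields to work on next.

from collections import Counter
from typing import Dict, List


def error_map(expected: List[Dict[str, object]], actual: List[Dict[str, object]]) -> Counter:
    # assumes the records are already matched up, e.g. by part number
    errors: Counter = Counter()
    for exp, act in zip(expected, actual):
        for prop, value in exp.items():
            if act.get(prop) != value:
                errors[prop] += 1
    return errors


# the most frequent offenders show where to change the schema or the prompt next
# print(error_map(labelled_entities, extracted_entities).most_common(10))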
Here is the overall illustration of the pipeline and the process:
Why did this project use gpt-5-mini?
Well, in truth, I wanted to use gpt-4o, because it would make such a cool story. But with 4o we could only get ~65% accuracy on this pipeline. gpt-5-nano also didn’t work out. So the next safe choice was gpt-5-mini, which is a really nice model according to our benchmarks.
By the way, speaking of benchmarks: starting from January 2025, all of our LLM Benchmarks allow models to use Schema-Guided Reasoning. Here is how the TOP20 leaderboard looks at the moment of writing:
I’ll guide you through the major highlights.
GPT-5 is currently the best model for business workloads (albeit expensive)
Grok-4 managed to improve a lot and got to second place.
Qwen3-vl-235B-a22B is a mixture-of-experts model. In thinking mode it captured third place, which is the highest position ever taken by an open-source model.
gpt-5-mini is a relatively inexpensive and capable model. It takes 6th place. It has a downloadable equivalent called gpt-oss-120b, which can’t work with images (needed for documents) but works just as well otherwise. Companies are adopting gpt-oss-120b frequently these days (or its smaller equivalent gpt-oss-20b).
The dense Qwen3-32B model is quite old, but it still holds up pretty well against the competition - 13th place.
By default, reasoning models can be slow. Teams that want smart real-time responses usually use something like gpt-oss-120B (with reasoning disabled) from Cerebras or Groq (both providers use specialised processing units to achieve crazy fast response times).
You can read LLM benchmarks in more detail in this summer 2025 report from TimeToAct. September benchmark report will come soon, too.
Wrapping things up - ERC3 and Agents
It is a great time to be building business systems powered by AI and large models.
Yes, there’s hype, along with mismatched expectations and projects that sometimes fall short. However, once you begin exploring deeply, tracking patterns across projects, and converting insights into actionable steps with the community, really cool things emerge.
Like empowering teams worldwide to launch reasoning business agents using tiny local models, even when experts consider this path a dead-end or impractical.
By the way, the next step in this journey will happen soon. We are planning to run Enterprise RAG Challenge 3 in November. This time, we’ll be benchmarking reasoning business agents built by different teams. Top-performing teams from previous competitions have already signed up, along with the SGR DeepResearch team. They are planning to push SGR concepts to the limits, pitting them against alternative agentic architectures.
Interesting things tend to happen when you unite hundreds of talented teams from around the world and engage them in a friendly challenge that pushes the state of the art.
As always with ERCs, we’ll share all resources and publish new insights. Stay tuned!


