Forge: guardrails that make small models reliable

The claim circulating from the Hacker News thread (464 upvotes, 170 comments) was that forge takes an 8B local model from 53% to 99% on agentic tasks. The real numbers from the repo's own benchmarks are different, and that's actually more interesting.

Forge is a Python framework for self-hosted LLM tool-calling. You give it tools, a backend (Ollama, llama.cpp, Llamafile, or Anthropic), and it manages the loop: retrying failed tool calls, rescuing malformed JSON, injecting nudges when the model gets stuck, enforcing step ordering when you need it.

The actual eval numbers come from forge's own 26-scenario test suite, documented in the repo. For Ministral-3-8B-Instruct at Q8 with llama-server native function calling: bare (no guardrails) scores 32.5%, reforged scores 81.4%. That's not 53% to 99%, but a jump from 32% to 81% on a well-defined benchmark is still substantial. For a different config, the 8B Instruct model with prompt-injected FC goes from 50.2% to 84.4%. The 14B model in reasoning mode goes from 43.3% to 84.5%.

Where does 99% come from? The README says forge lifts Sonnet 4.6 from 85% to 98% on the same eval workload. That's a frontier API model, not an 8B local model.

So: the claim as widely repeated is wrong on both the baseline and the ceiling for local models. The actual data is still worth looking at.

What forge actually does is pull on a specific problem: small models that nominally support function calling are unreliable in practice. They call tools that don't exist. They return malformed JSON. They emit tool calls in Mistral's [TOOL_CALLS] format when you needed OpenAI's tool_calls schema. They respond with plain text when they were supposed to call a tool.

Forge's guardrail stack addresses each of these in sequence: response validation (does the tool call reference a real tool?), rescue parsing (can we extract a valid call from the malformed output?), retry nudges (send a corrective message and try again), and a synthetic respond tool that forces the model to be explicit about when it's done versus when it's calling something.

The proxy mode is the practical entry point. You run python -m forge.proxy between your client (aider, Continue, opencode, anything that speaks the OpenAI schema) and your local model server, and forge applies its guardrails transparently. The client thinks it's talking to a smarter model. That's a clean architecture: you don't rewrite anything, you just add a layer.

There's a published paper behind this: Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling, published at ACM (doi:10.1145/3786335.3813193). The ablation study there is presumably more rigorous than the README numbers, but the paper's DOI may not resolve yet as of today.

The honest framing here is: forge doesn't make your model smarter. It makes your model's tool-calling behavior predictable. For a workflow that needs to run reliably overnight on a local GPU, predictable is often more important than slightly better benchmark scores on a cloud model.

Related: Local AI in 2026: on the box