The Model is not the problem: Fixing my broken local AI assistant

What started as mostly engineering curiosity quickly shifted once I saw the very real business utility and productivity benefits of an AI assistant. This wasn't just hype. When it works, it really can be a game changer.

Discussions in the AI space can be very abstract. I wanted to write something grounded and practical, based on my recent experience deploying my own on-premise AI agent.

My on-prem installation worked pretty well. I had switched over to a cloud model from Google to see what it would be like, since I had some extra credits, and just as I was getting into the flow of working with it, it keeled over and died. I had burnt through USD $40.00 in credit in about a week of very mild use.

No problem, just switch back to a local model, right? It didn't quite work out that way.

My journey back to sanity

Now my local models were failing. Everything, including other AI services, kept suggesting that the smaller models were the issue: they just couldn't cut it. Wrong, or at least not the entire picture. Testing showed that direct calls to the same models worked perfectly.

The surprise

The problem, after A LOT of digging, was primarily context history poisoning — the agent framework was stuffing every request with the full conversation history, system prompts, and tool definitions, drowning the model in noise.

The fix

The primary fix was to install a small compaction model (phi4-mini) to efficiently summarize conversation history. Then we trimmed unused tools from the agent config, and separately we built an automated test suite that hit both the agent and the model directly, so I could isolate where failures actually came from.

The result

I had an AI assistant running 100% on a machine at home, with 95% of the features I need and less hardware than I started with. No cloud API bills. No token limits mid-meeting.

Practical takeaways

  1. Test through the agent AND directly to the model. If the model works fine on its own but fails through the agent, the model isn't your problem — the plumbing is.

  2. Context management matters more than model size. A small model with clean context will outperform a large model choking on a bloated prompt.

  3. Use a compaction model. If your agent framework supports it, configure one. It's not optional for long conversations.

  4. Trim your tool definitions. Every tool you register is context the model has to process on every single call. Remove what you're not using.

  5. Build a repeatable test suite. Define your expectations clearly, write automated smoke tests, and run them every time you change a model or config. This eliminates the "fix one thing, break another" cycle.

  6. Don't let AI tell you AI can't do it. The model kept blaming itself. The disciplined testing proved otherwise. Trust your methodology.

  7. It's totally possible to run local personal or business AI assistants. The models are more capable than we think, and there are now studies that prove it.
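The first takeaway, testing both paths, can be sketched roughly like this. It's a minimal sketch: Ollama's `/api/chat` endpoint is real, but the agent endpoint URL and its request shape are placeholders you'd swap for whatever your agent platform exposes.

```python
import json
import urllib.request

def call_json(url, payload, timeout=60):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def diagnose(direct_ok: bool, agent_ok: bool) -> str:
    """Locate the failing layer from the two outcomes."""
    if direct_ok and not agent_ok:
        return "agent layer (plumbing)"
    if not direct_ok:
        return "model or model server"
    return "no failure observed"

def run_both(prompt, model="qwen2.5:14b",
             ollama_url="http://localhost:11434/api/chat",
             agent_url="http://localhost:8080/chat"):  # agent URL is hypothetical
    # Same prompt, two paths: straight to Ollama, and through the agent.
    direct = call_json(ollama_url, {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })
    agent = call_json(agent_url, {"message": prompt})
    return direct, agent
```

If the direct call keeps succeeding while the agent path keeps failing, `diagnose` points you at the plumbing, which is exactly what happened here.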

At one point through all of this I looked at the time I was spending on tech support versus business productivity, and really started to weigh my time against just paying for a frontier model and forgetting about it.

But, like I said in my first OpenClaw post: if you believe in bringing things back home, getting things done locally and not becoming too dependent on external services, it's worth it.


If you made it all the way through the post, thanks for reading. Be sure to check out the "Longer" tab on this post. I added it to my blog engine to potentially have longer versions of my posts with more context.

In this particular case though, I literally asked Claude Code to summarize our captured session logs from working through this issue and produce a companion post to mine. When I read it, I thought it was perfect.

Deep Dive: How we actually found and fixed the problem

This is the companion piece to "The Model is not the problem: Fixing my broken local AI assistant." In the main post I talked about context history poisoning as the root cause of my local AI agent's instability. Here's what it actually looked like to work through that problem — step by step, with a coding AI (Claude Code) as my debugging partner.

The setup

My agent is called Caspian, running on OpenClaw (an open-source, self-hosted AI agent platform). He talks to me through Telegram, uses local models running on Ollama, and connects to tools like Gmail, Trello, web search, document processing, and more.

The hardware: an i9-10850K with an RTX 3070 (8GB VRAM) — a decent machine, but not a monster. I also have a second machine (Bullshark) with more VRAM that can serve larger models over the local network.

The symptom: "JSON bleed"

The first thing that looked obviously broken was something we started calling JSON bleed. Instead of the model making structured tool calls (the way AI agents are supposed to invoke tools), it was printing raw JSON into the chat — literally typing {"name": "trello", "arguments": {...}} as a text message instead of actually executing anything.

My instinct (and Claude's repeated suggestion) was: "the model is too small." An 8-billion parameter model can't handle 22 tools and a 32,000-character system prompt. Makes sense, right?

Except it didn't explain everything.

The isolation test that changed everything

Instead of guessing, we built something called an isolation test — a script that progressively adds system prompt components and tests tool calling at each stage, hitting Ollama directly (bypassing OpenClaw entirely).

We started with a bare prompt: "What's the weather?" with one tool defined. Worked. Then we added all 22 tools. Worked. Then the full 32K system prompt. Worked. Streaming enabled. Worked. Every single time, the model returned proper structured tool calls.
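The staging logic can be sketched like this. It's a rough outline, not the actual script we ran: the tool definition and test question are illustrative, and each payload would be POSTed to Ollama's `/api/chat` endpoint.

```python
# Progressive isolation test: each stage layers one more component onto
# the request, so the first stage that breaks names the culprit.
WEATHER_TOOL = {  # minimal single tool for the bare stage (illustrative)
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_stages(all_tools, full_system_prompt):
    """Each stage adds one component on top of the previous one."""
    return [
        ("bare prompt, one tool", {"tools": [WEATHER_TOOL]}),
        ("all tools",             {"tools": all_tools}),
        ("full system prompt",    {"tools": all_tools,
                                   "system": full_system_prompt}),
        ("streaming enabled",     {"tools": all_tools,
                                   "system": full_system_prompt,
                                   "stream": True}),
    ]

def stage_payload(model, extras):
    """Build the /api/chat request body for one stage."""
    messages = [{"role": "user", "content": "What's the weather in Perth?"}]
    if "system" in extras:
        messages.insert(0, {"role": "system", "content": extras["system"]})
    return {
        "model": model,
        "messages": messages,
        "tools": extras["tools"],
        "stream": extras.get("stream", False),
    }

def made_tool_call(response):
    """True if the model returned a structured tool call, not JSON-as-text."""
    return bool(response.get("message", {}).get("tool_calls"))
```

Run the stages in order and check `made_tool_call` on each response; in our case every stage passed, which is what shifted suspicion from the model to the agent.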

No JSON bleed when calling the model directly. Not once.

So if the model could handle it, what was OpenClaw doing differently?

Building the proxy trap

We set up a proxy between OpenClaw and Ollama to intercept the actual requests OpenClaw was sending. What we found was alarming:

  • OpenClaw was sending 165 messages per request — totaling over 103,000 characters
  • This included the complete conversation history from the current session
  • Buried in that history were previous responses where the model had output JSON as text (from earlier, when conditions were worse)
  • The model was seeing its own bad examples and copying them

It was a self-reinforcing loop: bad response gets saved to history → next request includes the bad example → model pattern-matches against it → produces another bad response → that gets saved too.
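A proxy trap like the one above can be sketched in a few dozen lines of stdlib Python. This is a minimal illustration, not what we actually ran: the listen port is arbitrary, and a real version would also forward headers and handle streaming responses.

```python
# Minimal logging proxy: sits between the agent and Ollama and prints
# how big each forwarded request really is.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OLLAMA = "http://localhost:11434"   # the real model server
LISTEN = ("localhost", 11435)       # point the agent here instead (port is arbitrary)

def summarize_request(body: bytes) -> dict:
    """Pure helper: measure the payload before forwarding it."""
    payload = json.loads(body)
    messages = payload.get("messages", [])
    return {
        "messages": len(messages),
        "chars": sum(len(m.get("content") or "") for m in messages),
    }

class TapHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        print("intercepted:", summarize_request(body))
        # Forward the untouched body upstream and relay the reply.
        req = urllib.request.Request(OLLAMA + self.path, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as upstream:
            data = upstream.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

def run():
    # Repoint the agent's model URL at http://localhost:11435 and watch stdout.
    HTTPServer(LISTEN, TapHandler).serve_forever()
```

One intercepted request was enough to see the 165-message, 103K-character payloads for ourselves.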

The fix was embarrassingly simple

Once we understood this, the fix had three parts:

1. Clear the poisoned session history

We backed up and emptied the session file. That's it. After clearing it, the tool-calling regression tests passed 6 out of 6. Zero JSON bleed. Same model, same config, same hardware.

2. Configure compaction properly

OpenClaw has a compaction system — it can use a smaller model to summarize conversation history so it doesn't grow unbounded. I installed phi4-mini as the compaction model and set limits:

  • Max history share: 30% — conversation history can use at most 30% of the model's context budget
  • Recent turns preserved: 6 — only the last 6 exchanges are kept verbatim; everything else gets summarized

This prevents the history from ever growing large enough to drown out the actual request again.
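The policy those two settings describe looks roughly like this. It's a sketch of the idea, not OpenClaw's actual implementation: `summarize` stands in for a call to the phi4-mini compaction model, and the character-based budget is a simplification of a real token budget.

```python
# Compaction sketch: keep the last N turns verbatim, summarize the rest,
# and cap history at a share of the model's context budget.
def compact_history(messages, summarize, keep_recent=6,
                    context_budget=32_000, max_share=0.30):
    """Return a history using at most max_share of the context budget."""
    budget = int(context_budget * max_share)
    recent = messages[-keep_recent:] if keep_recent else []
    older = messages[:-keep_recent] if keep_recent else list(messages)
    if not older:
        return recent
    summary = summarize(older)  # stand-in for one phi4-mini call
    compacted = [{"role": "system",
                  "content": f"Summary of earlier conversation: {summary}"}]
    compacted += recent
    # If still over budget, drop the oldest verbatim turns until it fits.
    while (sum(len(m["content"]) for m in compacted) > budget
           and len(compacted) > 1):
        compacted.pop(1)
    return compacted
```

The key property: no matter how long the session runs, the model only ever sees one summary message plus the last few verbatim turns.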

3. Set temperature to zero

With temperature at 0, the model chose structured tool calls 100% of the time in our tests (vs. about 60% with the default temperature). When you need deterministic, structured output, creativity is your enemy.

What didn't fix it

Just as important — here's what we tried that didn't solve the problem:

  • Upgrading OpenClaw to the latest version (v2026.4.10). The new version had better tool-call normalization, but the upgrade alone did NOT fix the JSON bleed. The session history was the root cause.
  • Switching to bigger models — we went from 4B to 8B to 14B. Bigger models were more resilient to the poisoned history, but they still degraded over time as the history accumulated.
  • Disabling skills and tools — we trimmed the prompt from 27K to 15K characters, which helped response time (45 seconds down to 17 seconds) but didn't solve tool calling.

These were all reasonable moves, but they were treating symptoms, not the disease.

The other things that mattered

Along the way, we fixed a lot of things that weren't the root cause but were real problems:

  • Fallback chain had a broken model: gemma3 doesn't support Ollama's tool-calling API at all — it returns a 400 error. It was sitting in our fallback chain, so when the primary model timed out, OpenClaw would try gemma3, fail, then try the next fallback, fail again, and cascade into total failure.

  • System prompt bloat: 22 tools and 13 skills were being injected into every single request. We denied 4 tools (browser, cron, canvas, nodes) and disabled 6 skills we weren't using. This freed up context space for what actually mattered.

  • Model compatibility is a minefield: We evaluated over 20 models. Many had broken tool-calling support in Ollama — wrong formats, streaming bugs, thinking-mode overhead that ate the entire timeout window. The model that worked best for us was qwen2.5:14b: strong tool-calling benchmarks, no known Ollama bugs, fits comfortably in 12GB VRAM.

The test suite that kept us sane

We built a smoke test suite with around 90 tests across 16 sections — everything from basic chat to Trello CRUD to email to document conversion. Two modes:

  • Quick (~3 minutes): hits Ollama directly, tests infrastructure, no agent calls
  • Full (~15 minutes): runs everything through the agent platform too

The critical insight was testing the same operations both ways — through the agent and directly to the model. When a direct call succeeds but the agent call fails, you know the problem is in the agent layer, not the model. This one principle saved us hours of chasing the wrong thing.
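The skeleton of such a suite is simple: tag every check with the layer it exercises, so quick mode can skip the agent-level ones. This is a stripped-down sketch, assuming the real checks are plugged in as callables.

```python
# Two-mode smoke suite skeleton: "quick" runs only direct-to-model
# checks, "full" runs everything through the agent as well.
def run_suite(checks, mode="quick"):
    """checks: list of (name, layer, fn), layer is 'model' or 'agent'.
    Returns a {name: outcome} report."""
    results = {}
    for name, layer, fn in checks:
        if mode == "quick" and layer == "agent":
            continue
        try:
            results[name] = "pass" if fn() else "fail"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results
```

Registering the same operation twice, once per layer, is what makes the agent-vs-model comparison fall out of the report for free.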

What I want you to take from this

If you're running a local AI agent and hitting unpredictable failures:

  1. Build an isolation test. Strip everything back to a bare model call with one tool. Add components one at a time until it breaks. This tells you exactly where the problem lives.

  2. Check your session history. If your agent has been running for a while, the accumulated history might be poisoning the model's behavior. Clear it and see if things improve.

  3. Configure compaction aggressively. Don't let history grow unbounded. A 30% cap with 6 recent turns preserved worked well for us.

  4. Audit your fallback chain. Every model in the chain needs to actually support the features you're using. One broken fallback can cascade into total failure.

  5. Trim what you're not using. Every tool and skill definition is context the model has to process on every single call. Less is genuinely more.

  6. Don't trust the AI's self-diagnosis. Claude kept telling me the smaller models couldn't handle it. The models were fine. The plumbing was the problem. Trust your tests over the model's opinion about itself.
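Point 4, auditing the fallback chain, can be automated with a probe like this. A minimal sketch: the no-op probe tool is made up for illustration, and the probe function is injected so you can point it at Ollama's `/api/chat` (where a model like gemma3 would come back with a 400).

```python
# Fallback-chain audit: send each model a minimal tool-calling request
# and flag any that reject it or answer without a structured tool call.
PROBE_TOOL = {  # hypothetical no-op tool, just to exercise the API
    "type": "function",
    "function": {
        "name": "ping",
        "description": "No-op probe tool",
        "parameters": {"type": "object", "properties": {}},
    },
}

def probe_payload(model):
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Call the ping tool."}],
        "tools": [PROBE_TOOL],
        "stream": False,
    }

def audit_chain(models, probe):
    """probe(payload) -> (status_code, response_dict). Returns the
    models that can't actually do structured tool calls."""
    broken = []
    for model in models:
        status, response = probe(probe_payload(model))
        if status != 200 or not response.get("message", {}).get("tool_calls"):
            broken.append(model)
    return broken
```

Run it once after any config change and a silently broken fallback (like gemma3 was for us) shows up immediately instead of mid-cascade.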