I have been running local LLMs on dual AMD MI60 GPUs for over a year, and recently pointed Claude Code at them to see whether frontier-quality agentic coding is possible on my own hardware. The question I wanted to answer: how close can local models actually get to Anthropic’s Claude for building real software?

This is the first post in a series documenting that experiment. Future posts will cover specific agentic coding sessions, hardware and model setup, and the tools I am building to close the gap.

My setup: dual AMD MI60 GPUs with 64GB of total VRAM, running models through vLLM and LiteLLM. If that sounds unfamiliar, I have written about the hardware, the MI60 setup, and getting Claude Code working with local models. The short version: about $1,000 in used enterprise GPUs (two MI60s at $500 each on eBay) buys enough VRAM to run 70-80B-parameter models.
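As a quick sanity check that the stack is wired together, here is roughly what a request through the LiteLLM proxy looks like. This is a sketch, not my exact configuration: the port and the model alias are placeholders for whatever your LiteLLM config defines.

```python
# Minimal smoke test for the local stack: vLLM serving the model,
# LiteLLM proxying it behind an OpenAI-compatible endpoint.
# The port (4000) and model alias ("qwen3-coder-next") are assumptions
# for illustration; substitute whatever your LiteLLM config uses.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # LiteLLM proxy, not the real OpenAI API
    api_key="sk-local",                   # any value works unless the proxy enforces auth
)

resp = client.chat.completions.create(
    model="qwen3-coder-next",
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```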

Small Tasks Work. Complex Ones Do Not.

I got Claude Code working against a local Qwen3-Coder-Next:80B MoE model. It successfully scaffolded a complete Flask application with SQLite, modern CSS, and responsive design. The agentic workflow created files, ran commands, iterated on errors. For small, well-scoped tasks it is genuinely usable.

The key phrase there is “small, well-scoped.”

Building a large codebase with interdependent modules, maintaining consistency across dozens of files, recovering from errors mid-task: this is where local models break down for me. The individual completions are often fine. But the agentic loop, where the model plans, executes, evaluates, and iterates, requires a level of sustained coherence that I have not been able to get from local models yet.
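To make "sustained coherence" concrete, here is a schematic of that loop. The stubs are stand-ins, not Claude Code internals; the point is only that the accumulated state has to survive every pass around the loop, and one incoherent plan or misread result derails everything downstream.

```python
# Schematic of the plan / execute / evaluate / iterate loop.
# call_model and execute are placeholders, not Claude Code internals.
def call_model(prompt: str) -> str:
    return "PLACEHOLDER"   # in practice: a request to the local model endpoint

def execute(plan: str) -> str:
    return "PLACEHOLDER"   # in practice: file edits, shell commands, test runs

def agentic_loop(goal: str, max_iters: int = 10) -> bool:
    history: list[tuple[str, str, str]] = []
    for _ in range(max_iters):
        plan = call_model(f"Goal: {goal}\nHistory: {history}\nNext step?")        # plan
        result = execute(plan)                                                    # execute
        verdict = call_model(f"Step: {plan}\nResult: {result}\nDone or revise?")  # evaluate
        history.append((plan, result, verdict))                                   # iterate
        if verdict.strip().lower().startswith("done"):
            return True
    return False   # ran out of iterations without converging
```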

What surprised me is that scaling up parameters does not straightforwardly fix this. A 70B model produces fewer bugs than a 7B model, certainly. But the jump from “can scaffold a small app” to “can architect and build a complex multi-module project” has not materialized for me yet, even with the largest models I can run locally. The gap between local and frontier for agentic coding is not incremental. It is qualitative.

Why the Gap Exists

The gap stems from several compounding technical factors, none of which explain it alone.

That 80B model is a Mixture of Experts architecture. It has 80 billion total parameters, but only about 3 billion are active for any given token; the rest sit idle. It is 80B of storage, not 80B of thinking. I have a dense Llama 3.3 70B running on the same hardware at about 26 tokens per second, compared to about 35 for the MoE. The MoE is faster because it is doing less work per token, while the dense model uses all 70 billion parameters on every forward pass, roughly 23 times more compute per token.

I have not yet run the dense 70B through Claude Code on the same agentic tasks; that experiment is next on the list. If the results are dramatically better, it confirms that active parameter count matters more than headline size. If they are not, it tells me the gap is more about training and alignment than raw compute. I will cover the hardware setup and benchmarks in detail in an upcoming post.
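Before that head-to-head, here is the back-of-the-envelope arithmetic behind the 23x figure, using active parameters per forward pass as a rough proxy for compute per token.

```python
# Back-of-the-envelope: active parameters per forward pass as a rough proxy
# for compute per token. Throughput numbers are the ones measured above.
moe_total_params  = 80e9   # Qwen3-Coder-Next: 80B parameters stored
moe_active_params = 3e9    # ~3B routed/active per token
dense_params      = 70e9   # Llama 3.3 70B: every parameter used on every token

print(f"MoE touches {moe_active_params / moe_total_params:.1%} of its parameters per token")  # ~3.8%
print(f"Dense 70B does roughly {dense_params / moe_active_params:.0f}x more work per token")  # ~23x

# The MoE's speed advantage is the flip side of doing less work per token.
moe_tps, dense_tps = 35, 26
print(f"Measured throughput ratio (MoE / dense): {moe_tps / dense_tps:.2f}x")  # ~1.35x
```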

Training investment widens the gap further. Frontier labs pour orders of magnitude more compute into training than open-source model creators. More compute means more passes over more data, which translates directly into better generalization, fewer blind spots, and stronger reasoning chains. And the data matters as much as the compute: frontier labs invest heavily in data curation, filtering, decontamination, and synthetic data generation, with proprietary pipelines that smaller labs cannot replicate.

Post-training alignment is where reasoning quality really separates. Frontier labs do extensive reinforcement learning from human feedback, constitutional AI training, and iterative red-teaming. This is not just safety work. It is what teaches the model to reason carefully, follow complex multi-step instructions, and maintain coherence over long outputs. Open-source models get some of this, but not at the same depth or investment level.

Then there is quantization. My local MoE runs at Q4_0 precision via llama.cpp; my dense Llama 3.3 70B uses AWQ 4-bit via vLLM. Both are lossy compressions of the original weights. On simple tasks the loss is negligible. On complex reasoning where subtle weight differences compound across layers, quantization degrades output quality in ways that are difficult to measure but easy to feel. The model becomes slightly less precise at every step, and over a long chain of reasoning those small losses accumulate.
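As a toy illustration of what 4-bit compression does to individual weights, here is a naive symmetric int4 round-trip in NumPy. Real Q4_0 and AWQ use small groups and, for AWQ, activation-aware scaling, so their error is lower than this; the sketch only shows the shape of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # a stand-in weight row

# Naive symmetric 4-bit quantization with a single scale for the whole row.
# Real Q4_0 / AWQ quantize in small groups with per-group scales, so their
# error is lower; this only illustrates that the round-trip is lossy.
scale = np.abs(w).max() / 7            # int4 range is [-8, 7]
q = np.clip(np.round(w / scale), -8, 7)
w_hat = q * scale

err = np.abs(w - w_hat)
print(f"mean |error| = {err.mean():.5f}, max |error| = {err.max():.5f}")
print(f"relative RMS error = {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.2%}")
```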

Context window is another factor, and a moving target. The Qwen3-Coder-Next supports 256K tokens natively; my dense Llama 3.3 70B supports 128K. But the ceiling keeps rising, with the latest generation of models jumping to 1M tokens. Longer context means the model can hold more of a codebase in memory at once, which is exactly what maintaining consistency across dozens of files requires.
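Rough numbers make the point. Assuming a rule-of-thumb ~10 tokens per line of code (the real figure depends on language and tokenizer), a codebase outgrows those windows quickly.

```python
# Rough context budgeting, assuming ~10 tokens per line of code
# (a rule of thumb; the real figure depends on language and tokenizer).
TOKENS_PER_LINE = 10
WINDOWS = {"128K": 128_000, "256K": 256_000, "1M": 1_000_000}

for loc in (5_000, 50_000, 500_000):
    tokens = loc * TOKENS_PER_LINE
    fits = [name for name, size in WINDOWS.items() if tokens <= size] or ["none"]
    print(f"{loc:>7} lines of code ~ {tokens // 1000:>5}K tokens -> fits in: {', '.join(fits)}")
```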

No single one of these factors explains the gap on its own. Less active compute per token, less training investment, less alignment work, lossy weight compression, and smaller context windows all stack on top of each other. Each one costs a few percentage points of quality. Together they produce that qualitative difference between a local model that works 70% of the time and a frontier model that works 95% of the time.
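The compounding is easy to see with illustrative numbers (these are not measurements): treat an agentic task as a chain of steps the model must get right in sequence.

```python
# Illustrative probabilities, not measurements: model an agentic task as a
# chain of steps that all have to go right, and compare per-step quality levels.
for per_step in (0.99, 0.97, 0.95):
    rates = {steps: per_step ** steps for steps in (10, 30, 50)}
    summary = ", ".join(f"{steps} steps: {rate:.0%}" for steps, rate in rates.items())
    print(f"per-step success {per_step:.0%} -> {summary}")
```

A chain of ten steps at 95% per step completes only about 60% of the time, while the same chain at 99% per step completes about 90% of the time. Small per-step losses are where the qualitative gap comes from.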

Closing the Gap

I am not done experimenting. The most immediate test is running my dense Llama 3.3 70B through Claude Code on a real agentic task, the same kind of multi-module project that the MoE struggled with.

Beyond that, Claude Code recently introduced agent teams, where a lead agent decomposes a task and delegates sub-tasks to specialist agents that work in parallel. If teams decompose complex projects into smaller, focused units, each individual agent stays within what a 70B model handles well. The lead agent coordinates, but each worker has a narrower scope and a shorter context. I have not tested this with local models yet, but the architecture is promising because it plays to local model strengths rather than fighting their weaknesses.
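For illustration only, this is the shape of that pattern, not Claude Code’s actual agent-teams API: a hard-coded plan stands in for the lead agent, and a placeholder call stands in for the local model.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stand-in for a request to the local model endpoint."""
    return f"[model output for: {prompt.splitlines()[0][:50]}]"

def lead_agent(project_spec: str) -> list[str]:
    # In the real pattern the lead model writes this plan; it is hard-coded
    # here so the sketch runs on its own.
    return [
        "Define the SQLite schema and a migration script.",
        "Implement the Flask routes for CRUD operations.",
        "Write the base templates and responsive CSS.",
    ]

def worker_agent(subtask: str) -> str:
    # Each worker gets a narrow brief and a short context of its own.
    return call_model(f"Complete this focused task:\n{subtask}")

def run_team(project_spec: str) -> list[str]:
    subtasks = lead_agent(project_spec)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(worker_agent, subtasks))

for result in run_team("Flask + SQLite note-taking app"):
    print(result)
```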

There is also the Ralph Wiggum technique: running an AI coding agent in an autonomous loop against a specification until all tasks pass. A model that succeeds 70% of the time on a given task might seem unreliable. But a model that gets three attempts at the same task, with the ability to see its own errors, has a much higher effective success rate. The loop compensates for inconsistency with persistence. Combined with agent teams, the approach becomes: decompose the project into small tasks, let agents loop on each one until it passes, and coordinate the results. I plan to write about this experiment in a future post.
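The arithmetic behind that, assuming independent attempts and a reliable pass/fail signal such as a test suite (illustrative numbers, not measurements):

```python
# Illustrative: a task the model solves 70% of the time per attempt,
# retried until it passes, assuming independent attempts and a reliable
# pass/fail check (e.g., a test suite) that stops the loop.
per_attempt = 0.70
for attempts in (1, 2, 3, 5):
    effective = 1 - (1 - per_attempt) ** attempts
    print(f"{attempts} attempt(s): effective success rate {effective:.1%}")
```

Three attempts at 70% per attempt is already above 97%, which is the whole argument for persistence over raw capability.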

The third angle is giving local models better memory. Right now, Claude Code starts every session cold. It reads the codebase but has no record of past decisions, failed approaches, or architectural reasoning from previous sessions. An agent that remembers “we tried approach X and it failed because of Y” is dramatically more useful than one that suggests approach X again. I have started building a tool called project-brain that gives coding agents persistent project context: searchable documentation from codebases using local embeddings, session memory, decision logs, and pre-session context assembly. A local model that can search its own project history is working with more information than one that starts fresh every time. Memory does not make the model smarter, but it might make it effective enough to close part of the gap.
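project-brain is still taking shape, so the sketch below is only a minimal illustration of the decision-log idea, with SQLite and naive keyword overlap standing in for local embeddings. It is not the tool’s actual schema or API.

```python
import sqlite3

# Minimal illustration of a persistent decision log, not project-brain's
# actual schema or API. Keyword overlap stands in for local embedding search.
conn = sqlite3.connect("project_memory.db")
conn.execute("""CREATE TABLE IF NOT EXISTS decisions (
    id INTEGER PRIMARY KEY, topic TEXT, decision TEXT, reason TEXT)""")

def log_decision(topic: str, decision: str, reason: str) -> None:
    conn.execute("INSERT INTO decisions (topic, decision, reason) VALUES (?, ?, ?)",
                 (topic, decision, reason))
    conn.commit()

def recall(query: str, limit: int = 3) -> list[tuple[str, str, str]]:
    # Naive scoring: count words shared between the query and each row.
    words = set(query.lower().split())
    rows = conn.execute("SELECT topic, decision, reason FROM decisions").fetchall()
    scored = sorted(rows, key=lambda r: -len(words & set(" ".join(r).lower().split())))
    return scored[:limit]

log_decision("database", "use SQLite, not Postgres",
             "single-user app; approach X with Postgres failed because of Y")
print(recall("which database did we pick and why"))
```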

These three ideas (teams, loops, and memory) are all attempts to solve the same underlying problem from different angles. I do not know yet which combination will work, or whether any of them will close the gap enough to matter. But that is the experiment I am running.

Where I Am Now

For agentic coding, I still use frontier Claude for anything complex. Local models handle small, well-scoped tasks. That is the honest answer today.

But every month the local options get a little better. New models drop, quantization methods improve, context windows grow, and the scaffolding tools get more sophisticated. I plan to be there when the gap closes, because closing it is the whole point of this series.