Running local agentic AI coding workflows without cloud subscriptions
Cloud coding assistants fail agentic workloads in a specific way: they hit quota walls mid-task, not mid-conversation. A chat session sends one prompt and waits. An agentic loop sends a planning prompt, reads a file, writes a diff, runs a linter, reads the error, and tries again, all before you’ve touched the keyboard. Each of those steps burns tokens, and the cumulative usage bears no resemblance to what quota calculators assume.
Google’s Gemini free tier is a clear example. The headline rate looks generous until you account for context re-injection. Agentic tools like Cline re-send the full conversation and file context with each step. A 100,000-token context window, cycled ten times through a single bug fix, consumes a million input tokens before a line ships. Free tier weekly caps collapse in hours, not days. Paid tiers are not immune; metered consumption models mean a busy day of agentic coding produces a bill that scales with the number of reasoning steps taken, not the number of features completed.
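The arithmetic behind that collapse is simple to sketch. A back-of-envelope check, with the context size and step count as illustrative assumptions rather than measured values:

```shell
# Back-of-envelope input-token cost of one agentic task.
# Both numbers are illustrative assumptions, not measured values.
CONTEXT_TOKENS=100000   # full context re-sent on every agent step
STEPS=10                # plan/act iterations for a single bug fix
TOTAL_INPUT=$((CONTEXT_TOKENS * STEPS))
echo "input tokens consumed by one task: $TOTAL_INPUT"
```

Double the step count or the context and the bill doubles with it; nothing in the calculation depends on how much code actually shipped.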
The calculation changes when the model runs locally. No rate resets, no quota emails, no API key revocations mid-refactor.
Why agentic token usage breaks cloud quota assumptions
Standard quota estimates assume conversational use: a prompt, a response, occasionally some context. Agentic tools work differently. Cline, for example, operates in a plan-then-act loop. It sends the task, the file tree, relevant source files, and prior conversation state on every single step. A task involving five files and twelve agent steps re-sends those five files twelve times. At an average file size of 200 lines of Python, and roughly ten tokens per line, that is on the order of 100,000 repeated input tokens that never appear in any back-of-envelope estimate.
Codestral and GPT-4o pricing pages show per-million-token rates that look reasonable at chat scale. At agentic scale, with context windows filling and re-filling on each loop iteration, the per-task cost is closer to running a small batch job than sending a message.
The shift vendors are making from per-seat to credit-based pricing reflects this reality. Credits deplete faster under agentic use, and the consumption is opaque until you check your dashboard and find the credits gone.
Choosing a local model for coding tasks
Three models are worth considering for local agentic AI coding: Qwen2.5-Coder, Codestral, and DeepSeek Coder. The right choice depends on available VRAM and the context window you need.
Qwen2.5-Coder 7B fits comfortably on 8 GB of VRAM at Q4 quantisation. Context window is 128K tokens, which covers most single-file agentic tasks without truncation. It performs well on Python, TypeScript, and shell scripting. Use the 32B variant if you have a 24 GB card and want stronger multi-file reasoning.
Codestral 22B needs at least 16 GB of VRAM at Q4. Mistral’s fill-in-the-middle training makes it strong for inline completions, but it also handles instruction-following well enough for Cline’s planning steps. Context window is 32K tokens; tight for large codebases but manageable for focused tasks.
DeepSeek Coder V2 Lite runs on 8–10 GB of VRAM and handles multi-language tasks cleanly. The 16B instruct variant works with Cline’s system prompt format without modification.
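Before committing to a model size, check what the hardware actually offers. A small sketch, assuming an NVIDIA card with the stock nvidia-smi tool; on machines without one it falls back to a CPU-only note:

```shell
# Report total GPU memory, or note that inference will run on CPU.
VRAM_INFO=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null \
  || echo "no NVIDIA GPU detected; plan for CPU inference")
echo "$VRAM_INFO"
```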
Pull the model before building the stack:
```bash
ollama pull qwen2.5-coder:7b
```
Setting up the Docker Compose stack
The stack has two named services: ollama for model serving and open-webui for a browser interface. Cline talks to the Ollama API directly; Open WebUI is optional but useful for testing prompts before running the agent.
```yaml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    restart: unless-stopped

volumes:
  ollama_data:
  open-webui_data:
```
If you are running an AMD GPU, replace runtime: nvidia with a devices list pointing at the ROCm device paths. For CPU-only, remove the runtime and environment keys entirely; expect slower inference but full functionality.
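For reference, a sketch of what the AMD variant of the ollama service might look like; the ollama/ollama:rocm image tag and the /dev/kfd and /dev/dri device paths are the usual ROCm setup, but verify both against your distribution before relying on them:

```yaml
  ollama:
    image: ollama/ollama:rocm      # ROCm build instead of the CUDA default
    container_name: ollama
    devices:
      - /dev/kfd                   # ROCm compute interface
      - /dev/dri                   # GPU render nodes
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
```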
Bring the stack up:
```bash
docker compose up -d
```
Confirm Ollama is responding before pointing any agent at it:
```bash
curl http://localhost:11434/api/tags
```
A JSON list of pulled models confirms the service is ready.
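In scripted setups, a one-shot curl can race Ollama's startup. A polling helper avoids that; the attempt cap and sleep interval below are arbitrary choices, not Ollama requirements:

```shell
# Poll the Ollama API until it answers, up to a fixed number of attempts.
wait_for_ollama() {
  for _attempt in 1 2 3 4 5; do
    if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
      echo "ollama ready"
      return 0
    fi
    sleep 2
  done
  echo "ollama not reachable" >&2
  return 1
}
# Usage: wait_for_ollama && echo "safe to start the agent"
```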
Pointing Cline at the local Ollama endpoint
Open VS Code with the Cline extension installed. Go to Cline settings and set the provider to Ollama. Set the base URL to http://localhost:11434. Select the model name that matches what you pulled, for example qwen2.5-coder:7b.
Cline sends requests to /api/chat on that base URL. No API key is required; leave the field blank, and Cline ignores it anyway when the Ollama provider is selected.
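The endpoint can be exercised by hand before involving the extension. A minimal request against the native chat API; the model name assumes the qwen2.5-coder:7b pull from earlier, and the || guard just keeps the command quiet when the stack is down:

```shell
# One-shot request to the same endpoint Cline talks to.
BODY='{"model": "qwen2.5-coder:7b", "messages": [{"role": "user", "content": "Write a one-line hello world in Python."}], "stream": false}'
curl -s http://localhost:11434/api/chat -d "$BODY" || echo "ollama is not running"
```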
For Aider, set the model and base URL from the command line:
```bash
aider --model ollama/qwen2.5-coder:7b \
  --openai-api-base http://localhost:11434/v1 \
  --openai-api-key ollama
```
The --openai-api-key ollama flag is a placeholder. Ollama’s OpenAI-compatible endpoint at /v1 requires a non-empty key string, but the value is irrelevant.
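That behaviour is easy to confirm with a direct call to the /v1 endpoint; any non-empty bearer token is accepted:

```shell
# Hit the OpenAI-compatible endpoint with a throwaway key.
RESP=$(curl -s http://localhost:11434/v1/chat/completions \
  -H "Authorization: Bearer ollama" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:7b", "messages": [{"role": "user", "content": "hi"}]}' \
  2>/dev/null || echo "ollama is not running")
echo "$RESP"
```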
Restricting filesystem access to the project directory
Running an agent with access to your full home directory is an unnecessary risk. Scope a bind mount to the project directory, whether the agent runs in its own container or you launch Aider from a container rather than directly on the host.
For a containerised Aider setup, add a service to the Compose file:
```yaml
  aider:
    image: paulgauthier/aider:latest
    container_name: aider
    volumes:
      - /home/youruser/projects/myproject:/app:rw
    working_dir: /app
    environment:
      - OPENAI_API_BASE=http://ollama:11434/v1
      - OPENAI_API_KEY=ollama
    network_mode: service:ollama
    stdin_open: true
    tty: true
```
The bind mount /home/youruser/projects/myproject:/app:rw gives the container read-write access to that directory only. Nothing outside /app is reachable from inside the container.
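A quick probe confirms the boundary from inside the container. A sketch, assuming the aider service above is running; when it is not, the fallback messages fire instead:

```shell
# Show what the agent container can see: the project, and nothing else.
VISIBLE=$(docker compose exec aider ls /app 2>/dev/null \
  || echo "aider container not running")
echo "$VISIBLE"
# The host home directory is not mounted, so this should fail:
docker compose exec aider ls /home/youruser 2>/dev/null \
  || echo "host home directory is not visible (as intended)"
```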
If you run Cline on the host rather than in a container, restrict its working directory in VS Code’s workspace settings to the project root. Cline respects the workspace boundary and will not traverse above it without explicit instruction.
Confirming the stack runs with no outbound connections
Pull all images and models before going offline. Once pulled, the stack needs no internet access to function.
Block outbound connections at the Compose level using an isolated network with internal: true:
```yaml
networks:
  ai_local:
    driver: bridge
    internal: true
```
Attach both services to this network. Remove the ports mapping if you want to lock the stack entirely to local inter-container communication, or keep it if you need host access from VS Code or Aider running on the host.
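Putting that together, a sketch of the relevant Compose additions with both services joined to the internal network:

```yaml
services:
  ollama:
    # ...existing config unchanged...
    networks:
      - ai_local
  open-webui:
    # ...existing config unchanged...
    networks:
      - ai_local

networks:
  ai_local:
    driver: bridge
    internal: true
```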
Verify no outbound traffic is leaving once the stack is running:
```bash
sudo ss -tunp | grep -E 'ollama|webui'
```
You should see only loopback (127.0.0.1) and local network addresses. No connections to external IP ranges. If you want a harder check, run the stack with --network none on the Ollama container and confirm model inference still works; if it does, the model is fully loaded into memory and serving from local storage with no network dependency.
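The --network none check can be scripted end to end. A sketch; the ollama_data volume name and the model tag are assumptions carried over from earlier in this setup, so adjust them to what docker volume ls and ollama list report:

```shell
# Launch a throwaway Ollama container with no network stack at all,
# then run inference over loopback inside it.
STATUS="skipped: docker not available on this machine"
if command -v docker > /dev/null 2>&1; then
  STATUS="checked"
  docker run -d --name ollama-offline --network none \
    -v ollama_data:/root/.ollama ollama/ollama || echo "container failed to start"
  # --network none still provides loopback, so the bundled CLI can
  # reach the server process it shares the container with.
  docker exec ollama-offline ollama run qwen2.5-coder:7b "say hello" \
    && echo "inference works fully offline" \
    || echo "offline inference failed"
  docker rm -f ollama-offline > /dev/null 2>&1 || true
fi
echo "$STATUS"
```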
The stack costs nothing per token, resets nothing weekly, and runs as long as the machine does.
