RES FUTURAS

Exploring AI Coding Assistants

After playing with AI for a while, I did what every sensible person would do: I decided to design my own AI Coding Agent, because… Why the fuck not?

Designing a Coding Assistant

In my adventures in AI I’ve talked about the complexity of designing a good general-purpose AI assistant like ChatGPT. One simple question can trigger multiple model calls, tool calls (web search etc.) and verification stages to increase the quality.

Since I’ve learned my lesson, I started designing my coding agent with that kind of complex pipeline in mind. My hardware is an RTX 5090 and 48GB of RAM.

With this hardware there are strict limits on local inference. For reasonable performance:

  • The model needs to be 32B parameters or fewer
  • I can’t reasonably load more than two models at once (one big, one small 2-4B)
  • I might still squeeze in tiny models if needed, or load and unload them on the fly
  • Obviously, the models need to be open weight

My poison of choice for the task:

  • Qwen3:30b (qwen3:30b-a3b-instruct-2507-q4_K_M to be precise)
  • Ollama backend
  • Various small models for small tasks such as qwen3-embedding:4b and qwen3:4b-instruct

Tool Calling

The first step of having an agentic AI is being able to tool call and do it well!

For the sake of sanity, tool and function calling are practically the same thing. MCPs are tool calling with an extra step.

Tool calling is effectively the interface between chatting with your LLM and having your LLM actually do stuff. It’s baked in during the model’s training and, surprisingly, can differ from model to model, but there are generic approaches.
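To make the "generic approaches" concrete, here is what a tool description typically looks like. This is the OpenAI-style function schema, which (to my knowledge) Ollama's chat API also accepts in its `tools` parameter; the `web_search` tool itself is just the illustrative example from the transcript below, not a real built-in.

```python
# OpenAI-style function schema for a hypothetical web_search tool.
# The model never sees your code -- only this description -- and
# decides from it when and how to emit a call.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query.",
                },
            },
            "required": ["query"],
        },
    },
}
```

The model emits a structured call (`web_search` plus arguments matching this schema); your harness executes it and feeds the result back.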

A model without tool calling and a knowledge cutoff date of 2024:

User: Who is the president of USA?
Model (without tool call): Biden

A model with tool calling and the same knowledge cutoff:

User: Who is the president of USA?
Tool instructions hidden from the user: <web_search(query), fetch(url)>

Model <think>: User is asking about the president of USA? I know we are in 2026 and from my training I don't have this data. I've been given a tool that can web search.

Tool Call: web_search('current president of USA 2026')
Tool Call: fetch('https://first-result-from-search')

Model <think>: I now have the latest information and respond to the user.
Model: It's Donald Trump 

As you can see, tool calling is essential to any meaningful AI agent.
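The transcript above is really a loop: send messages, execute any tool calls the model emits, append the results, repeat until the model answers in plain text. Here is a minimal sketch of that loop. The stub model stands in for a real chat endpoint (e.g. Ollama's /api/chat); the message shapes and the `web_search` tool are illustrative, not any particular API.

```python
# Stub model: stands in for a real chat endpoint. On the first turn it
# emits a tool call; once it sees a tool result, it answers in text.
def fake_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "It's Donald Trump."}
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {
            "name": "web_search",
            "arguments": {"query": "current president of USA 2026"},
        }}],
    }

# Hypothetical tool implementation (a real one would hit a search API).
def web_search(query):
    return "Top result: Donald Trump is the current US president."

TOOLS = {"web_search": web_search}

def agent_loop(user_prompt, model=fake_model, max_turns=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]  # no tool call -> final answer
        for call in calls:
            fn = TOOLS[call["function"]["name"]]
            result = fn(**call["function"]["arguments"])
            messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_turns")
```

Every coding agent, from my toy to Claude Code, is some elaboration of this loop.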

MCPs are all the rage

Working with AI models, my biggest realization is they are much smarter than I thought and also much more isolated than I thought.

They call tools like it’s nobody’s business. Just give them the tool description and they’ll figure out how to use it. MCPs are quite popular in agent design because they rapidly add capabilities to your AI Agent.

However, they are mostly over-engineered, and a simple CLI is as good as an MCP, if not better for other reasons (such as context pollution and context limits).

Smarter people than me already noticed and wrote about this; read Mario’s article about why you don’t need MCPs. As a result, my first learning agent can call MCPs, but only when I’m too lazy to replicate the functionality as an internal tool.

If we were to 80/20 this thing, we would just add a bash(command) tool and call it a day. Just watch how good LLMs are at using standard unix tools to get things done.
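That 80/20 `bash(command)` tool can be as small as this sketch: a subprocess call with a timeout, returning exit code and output in a shape the model can read. No sandboxing here, which is exactly the problem the next section is about.

```python
import subprocess

def bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return its outcome, formatted for the model."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout}s"
    # Return stderr too -- models debug surprisingly well from error output.
    return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"
```

Hand the model this one tool and it will `grep`, `sed`, `find`, and `git` its way through most tasks.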

AI Agents are dogs with a bone

LLMs are trained to get to the results, and they will aim to get there no matter what unless you stop them. For example, if you have a unit test to check functionality and tell your agent to pass that test, the agent will literally change the unit test instead of fixing the bug. Generally they get there after a couple of tries, but it’s a very common problem.

This means you need very good guardrails and strict controls before handing them something like bash(). Orrrrrr… you just put them in a container and let them wreak havoc as they please.
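To show what a (weak) guardrail even looks like, here is a naive deny-list filter you could put in front of `bash()`. The patterns are illustrative, and the approach is trivially bypassable, which is exactly why the container option is the saner one.

```python
import re

# Naive deny-list guardrail for a bash tool. Illustrative patterns only --
# a determined agent (or prompt injection) can route around string matching.
DENY_PATTERNS = [
    r"\brm\s+-rf\s+/",     # wiping paths from the filesystem root
    r"\bmkfs\b",           # formatting disks
    r"\bcurl\b.*\|\s*sh",  # piping the internet straight into a shell
]

def is_allowed(command: str) -> bool:
    return not any(re.search(p, command) for p in DENY_PATTERNS)
```

A container flips the model: instead of enumerating what’s forbidden, you bound what’s reachable.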

Local models are fast

This is my simple coding agent, AoC. Look at it go with a local model on a spec prompt to build a static Linklog MVP using Vite + TypeScript. Execution takes about 3 minutes, which includes installing packages etc. This was one-shot and it worked well.

Full Prompt

The prompt was written by ChatGPT and pasted in verbatim.

Build a static Linklog MVP using Vite + TypeScript (React optional)
“Linklog” (Hugo-friendly link sharing site)

Paste this as-is to your coding agent.

---

Build a production-quality project named **Linklog**: a minimal, self-hosted link-sharing site designed to work well with a Hugo blog workflow. It must support both **static export** and a **local interactive mode**.

### Goals

* Store links as **files in a repo** (no SaaS). Use a simple content format that diffs well.
* Provide a clean UI to browse, search, and filter links.
* Support a **static build** that outputs plain HTML (and RSS) that can be hosted anywhere.
* Provide deterministic tests and a repeatable build.

---

## Tech constraints

* Language: **TypeScript**
* Runtime: **Node.js 20+**
* Package manager: **pnpm**
* UI: **Next.js (App Router)** + minimal styling (plain CSS modules or Tailwind; your choice)
* Local DB for interactive mode: **SQLite** (via Prisma or better-sqlite3; your choice)
* Static export: must work with `next export` (or equivalent) and produce a static `out/` directory
* No external services required. No auth required.

---

## Data model and storage

### Source-of-truth format (repo files)

Links are stored as individual files under:

`content/links/YYYY/MM/DD/<slug>.md`

Each file contains YAML front matter plus a Markdown body note:

Front matter fields:

* `title: string`
* `url: string`
* `tags: string[]` (optional)
* `date: ISO string`
* `source: string` (optional) (e.g., "twitter", "friend", "paper")
* `status: "active" | "archived"` (default "active")

Body:

* freeform Markdown note (optional)

### Derived data (local interactive mode)

On `pnpm dev`, the app should:

* parse all `content/links/**` files
* populate SQLite tables (id/slug/date/title/url/tags/status/note)
* allow fast search and analytics (clicks)

---

## Features

### Browsing

* Home page: latest links, paginated
* Link detail page: shows title/url/tags/date/source/note
* Tag index page: list tags with counts
* Tag page: paginated links for a tag
* Status filter: active vs archived

### Search

* Full-text search over `title` + `note`
* Tokenized search is fine; must support partial matches
* Query param: `?q=...`

### Click tracking (local-only)

* When a user clicks an external link, increment a local counter.
* For static export, click tracking can be disabled or implemented as a no-op.
* Provide a “Most clicked” sort in interactive mode.

### Feeds / Export

* Generate RSS feed at `/rss.xml` containing last 50 active links
* Generate JSON export at `/export.json` containing all active links (stable ordering)

---

## CLI tools (required)

Create a CLI in `packages/cli` (or similar) with commands:

1. `linklog add --url <url> --title <title> --tags a,b,c --note "..." --source "..."`

* Creates the markdown file in the correct path
* Auto-slugifies title
* Uses today’s date unless `--date` provided

2. `linklog build-index`

* Parses `content/links/**`
* Outputs a derived index file for static mode: `public/index.json` (or `app/data/index.json`)
* This file is used by static export to avoid needing SQLite

Both commands must be tested.

---

## Static export behavior

Static export must NOT require SQLite.

* In static mode, the app reads from the derived JSON index file.
* All pages must still work: browse, tag pages, search (client-side search acceptable in static mode).
* RSS and export should still generate in static mode (can be generated at build time from JSON index).

---

## Testing & quality

### Required

* Unit tests for:

  * markdown parsing (front matter + body)

The same prompt takes Codex/Claude Code many more minutes, though that mostly comes down to whether the agent decides to over-complicate or to simplify things.

Using Local & Smaller models

There is a lot to unpack here:

  • Local & Small models are fast
  • Coding Agents without guardrails are scary! Just watch how they install a bunch of packages; I’ve seen agents install packages that had only been downloaded about 100 times! It’s very real that someone can publish a package that claims to solve a problem, and an agent will search for it, then just download and run it. (This happened with Opus 4.5.) People also report common issues like an agent deleting something like /data outside of the project directory, grabbing API keys from .env files and using them in random places, or doing some other destructive shit.
  • 30-32B parameter models are enough for small-scope projects, or for projects where good architecture lets you keep the working scope small.

Funnily enough, today OpenAI released and announced a new fast model, GPT5.3 Codex Spark. I quickly ran their demo prompt (build a simple HTML snake game) and my local model delivered it faster than their demo, working one-shot. Just to give you some perspective on complexity, cost, speed and privacy.

Benchmarks

Small models are still very limited compared to closed-weight frontier models. From aider’s benchmarks:

Model          Percent Correct
Qwen3:32b      40.0%
Qwen3:235B     59.6%
gpt-5 (high)   88.0%

It’s clear that open-weight models, and low-param/quantized models especially, pay a serious capability penalty.

Harness vs. Model

Model Argument

After getting this working, I started playing around with the “harness”: Claude Code, Codex, Open Code, Pi Agent etc.

While I was trying to benchmark my very primitive agent, I came across Terminal Bench. This is where everything suddenly got more interesting.

The Terminal Bench folks have a “coding agent” called Terminus. It’s effectively just a tmux terminal going back and forth with the LLM. You’d think that such an extremely primitive approach would tank the benchmarks, but no. Ironically, in a bunch of tests Terminus scores close to, or even above, the original harness for that model.

Harness      Model     Accuracy
Claude Code  Opus 4.5  57.8%
Terminus     Opus 4.5  52.1%
Codex        GPT 5.2   64.9%
Terminus     GPT 5.2   54.0%

One thing is clear: the model matters more than the harness.

Model & Harness Integration

An overlooked aspect of these benchmarks is that certain models often work better with certain kinds of system prompts or tool-calling approaches. Some results are therefore weak simply because the harness isn’t optimized for the model being used.

Just look at the Best Practices for Qwen3, written by the Qwen team themselves.

Harness Argument

This is all new; we don’t really know what’s happening. A very recent blog post making the rounds is Improving the performance of Coding Agents with the Harness Change. The author notices that the way coding assistants modify files is very complicated and causes a lot of unnecessary failures; he fixes that very cleverly and re-runs the benchmarks with the new method. A good number of models got better, some by 2%, and some got worse. Go ahead and look at the benchmark in the post. It’s fascinating how much low-hanging fruit is out there.
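The blog post’s exact method is its own; but to make the idea concrete, here is one common, simple shape for a file-edit tool: replace a single exact string rather than asking the model to emit a diff. Requiring the match to be unique catches hallucinated or ambiguous edits before they touch the file. This is a sketch of that general technique, not the author’s implementation.

```python
from pathlib import Path

def edit_file(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in the file.

    Errors are returned as strings so the agent can read them and retry,
    instead of crashing the loop."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return "error: string not found -- re-read the file and retry"
    if count > 1:
        return f"error: string occurs {count} times -- include more context"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"
```

The failure messages double as instructions: the model learns mid-session to quote more surrounding context until the match is unique.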

This is all new, and no one knows the best practice, including OpenAI and Anthropic. Read their blogs and they constantly talk about what they tried, what seemed to work and what didn’t work at all. If you are looking for the best practice for doing this, it just doesn’t exist.

Lessons Learned

  • Models matter more than you think
  • Models are limited by their context and knowledge cutoff (more than you think)
  • Model peculiarities matter!
  • Tool Calling is critical and core to agentic AI