How I Use AI to Code Effectively, Part 1

A few weeks ago I shipped a feature on a Friday afternoon. Bank statement parser, deterministic tax engine, full chat tier with conversation threading, the works. Backend service layer, frontend screens, contract tests at the seam, QA pass against a live server, design system components added to the preview as each one was built. The PR cleared review with two comments, both nits. The deploy hit production at 6pm. Monday morning, zero bug reports.

I didn't write most of the code. AI agents did. I planned the work with Opus, executed the implementation with Sonnet, ran QA passes with two more agents driving a real browser and a live API. The output was indistinguishable from what I'd write by hand, except faster, and with better test coverage than I usually have the patience for.

This works because the AI is loaded with context before it writes a single line: every bug my team has ever shipped to production, every API convention, every banned pattern, every "we tried this in March and it didn't work." The agent walks into the session already knowing what the senior engineer in the room would have told it.

I've evolved this system across the products I ship — TaxLens, TrustRail, Medcord, Solon, Pracket, WorkSight, Ohlify, and others. It's not a framework or a tool. It's a collection of markdown files and a discipline for using them. It treats the AI agent as a recurring contractor on a maturing codebase, not a one-shot magic box. Each session inherits the accumulated rule set. Each session contributes new rules back when it finds new failure modes.

The result, after about 18 months of iteration:

Velocity — I ship features end-to-end in hours that used to take days. The bottleneck has moved from typing speed to thinking speed, which is where it should have always been.
Consistency — code from the AI looks like code I'd write, because the AI is reading the same conventions I follow. Reviewers can't usually tell which lines were mine and which were the agent's.
Reduced regression rate — recurring bug classes (the 204 .json() parsing bug, API drift, money-as-float, useEffect + fetch races) don't recur. The rules that prevent them are loaded on every session.
Agentic QA — every feature gets an agent-driven QA pass before I touch it. Browser-driven for the frontend, live API for the backend, with a PASS/FAIL/SKIP/BLOCKED report I read in two minutes.
Audit trail — every architectural decision is in a markdown file. When the next engineer (human or AI) asks why something is the way it is, the answer is one grep away.

This is Part 1 of two articles on the system. Today: how I plan with AI, how I spec-drive development, how I manage context across long sessions, and how I run code review at scale. Part 2 covers the QA agents, the design system pipeline, and the tools I've built around all of it.

The full system is open source at github.com/spiderocious/agentic-workflow. Every persona, skill, and example I'll reference below is in that repo. Open it in a tab and follow along.

Let's dig in.

Why This Works (and Why the Default Doesn't)

Before showing the mechanics, the foundational claim:

The default AI coding workflow is simple: the engineer types a task description, the agent generates code, and the engineer reviews and ships. It works for trivial tasks. For anything non-trivial in a codebase with history, it produces plausible-looking code that violates conventions, repeats fixed bugs, and forces the engineer to spend more time reviewing than they saved by delegating.

The agent isn't wrong because it's bad at coding. It's wrong because it doesn't know what your team knows. It doesn't know that money is stored as bigint kobo, never float. It doesn't know that 204 responses break .json() parsing. It doesn't know that the auth middleware must come before the role-check middleware. It doesn't know that two engineers ago, someone added a route at /api/v1/me that was silently shadowed by /api/v1/:userId and the team spent three days debugging it.

All of that knowledge exists somewhere — in the heads of the engineers, in PR comments, in postmortems, in Slack threads, in git blame. None of it is loaded into the agent's context at the start of a session.

My system is built on a single inversion of that default: encode the lessons in files the agent reads on every session. Make the institutional memory available to the contractor on day one, every day. Everything else in this article is mechanics for doing that well.

The Three-Layer Model

The system has three layers. Each does one thing.

┌──────────────────────────────────────────────────────────────┐
│  LAYER 1: PERSONA                                            │
│  Who the AI is. Identity + invariants + which skills to      │
│  load. Small files (3-30KB).                                 │
└──────────────────────────────────────────────────────────────┘
                          │
            "Before writing code, load these:"
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  LAYER 2: SKILL                                              │
│  HOW to do specific work. Detailed playbooks. Tool           │
│  references, code patterns, lint rules. Reusable across      │
│  multiple personas. Files range from 10KB to 40KB.           │
└──────────────────────────────────────────────────────────────┘
                          │
                "Reference codebase docs:"
                          ▼
┌──────────────────────────────────────────────────────────────┐
│  LAYER 3: CODEBASE                                           │
│  Your repo. The personas point to your project's docs/       │
│  folder as the source of truth for "what good looks like     │
│  here."                                                      │
└──────────────────────────────────────────────────────────────┘

A persona is who the AI is — a senior backend engineer, a QA engineer specialising in frontend, a mobile-aware API designer.
A skill is how the AI does specific work — write a service method, drive a real browser for QA, design an API endpoint with mobile consumption in mind.
The codebase is what the AI is working on, with project-specific docs that describe local conventions.
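
One way to picture the three layers is as plain data. The sketch below is illustrative only: the real personas and skills are markdown files, and the `Persona` and `Skill` types, the `filesToLoad` helper, the `skills/` prefix on `backend-service-patterns.md`, and the exact skill lists are assumptions made for the example.

```typescript
// Hypothetical model of the three layers; the real system uses markdown files.
interface Skill {
  path: string; // Layer 2 playbook, e.g. "skills/hard-lessons.md"
}

interface Persona {
  path: string;           // Layer 1 file, e.g. "personas/backend.md"
  skills: Skill[];        // Layer 2 files the persona tells the agent to load
  codebaseDocs: string[]; // Layer 3 project docs the persona points at
}

// Ordered, de-duplicated list of files an agent would read for one persona.
function filesToLoad(persona: Persona): string[] {
  const ordered = [
    persona.path,
    ...persona.skills.map((skill) => skill.path),
    ...persona.codebaseDocs,
  ];
  return ordered.filter((file, index) => ordered.indexOf(file) === index);
}

const hardLessons: Skill = { path: "skills/hard-lessons.md" };
const servicePatterns: Skill = { path: "skills/backend-service-patterns.md" };

const backend: Persona = {
  path: "personas/backend.md",
  skills: [hardLessons, servicePatterns],
  codebaseDocs: ["docs/rules-lessons.md"],
};

const qaBackend: Persona = {
  path: "personas/qa-backend.md",
  skills: [hardLessons, servicePatterns], // same rulebook as the dev persona
  codebaseDocs: ["docs/rules-lessons.md"],
};
```

In this model the dev and QA personas differ only in their own persona file; everything else they read is shared, which is the point of keeping skills out of persona files.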

The separation is load-bearing.
Five reasons it matters:

1. Skills are reusable, personas are roles. hard-lessons.md is loaded by 6 different personas — backend, frontend, fullstack, mobile, qa-backend, qa-frontend. If a hard lesson lived inside any single persona file, the other five would drift away from it. Separation enforces single source of truth.

2. Personas can be tiny. The smallest persona in the repo is frontend.md. It's basically: "you are a senior frontend engineer, load these five skills, here are your guardrails." The persona's job is to select and orient, not to redefine.

3. Skills can be large without bloating context. backend-qa-agent.md is 40KB. If that lived inside the QA backend persona file, every "audit my API" invocation would burn 40KB of context. Instead, the persona is small and references the big skill — the agent only loads the heavy file when it actually needs it.

4. QA personas can audit dev personas' work using the same rulebook. qa-backend.md and backend.md both load backend-service-patterns.md. The QA agent greps the codebase for violations of the same rules the dev agent was told to follow. No translation step. No divergence.

5. Personas encode order and tone; skills encode correctness. A persona says "you think in this order: data model → service contract → HTTP surface." That's a working-style claim. A skill says "services return ServiceResult<T>, never throw." That's a rule. Different kinds of content; different files.
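
To make the rule in reason 5 concrete, here is a minimal sketch of what "services return ServiceResult<T>, never throw" could look like. The article does not show the actual type, so the `ok` / `data` / `error` fields, the `Statement` shape, and both functions are assumptions for illustration.

```typescript
// Assumed shape: a discriminated union instead of thrown exceptions.
type ServiceResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: { code: string; message: string } };

interface Statement {
  id: string;
  transactionCount: number;
}

// Hypothetical in-memory store standing in for a database.
const statements: Statement[] = [{ id: "stmt_1", transactionCount: 42 }];

// The service reports failure as a value; it never throws.
function getStatement(id: string): ServiceResult<Statement> {
  const found = statements.find((s) => s.id === id);
  if (!found) {
    return { ok: false, error: { code: "NOT_FOUND", message: `No statement ${id}` } };
  }
  return { ok: true, data: found };
}

// The HTTP layer decides the status code; the service knows nothing about HTTP.
function toHttpStatus(result: ServiceResult<unknown>): number {
  if (result.ok === true) return 200;
  return result.error.code === "NOT_FOUND" ? 404 : 500;
}
```

Because failure is part of the return type, a caller cannot read `data` without first checking `ok`, which is what makes the rule easy to audit.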

The full mental model is written up in docs/how-personas-work.md in the repo.

Specs Are Not Documentation. They're Working Memory.

The single most impactful change in how I work with AI is treating specs as the source of truth that AI agents read, execute against, and update, not as documentation written after the fact.

The operating model:

Prose lives in a docs folder
Code lives in the project repo
AI agents shuttle between them

The docs aren't a description of what the code does. They're a manual addressed to the next AI agent that will work on this project. They include verbatim instructions like "Pattern-match this file before writing any code. Every API path, response unwrap, request body field, icon name, and meemaw pattern is pre-verified here against the actual backend source. Do not guess."

There are five kinds of spec documents, each with a different audience:

Document	Addressed to	What it captures
`prd.md` / `mvp.md`	The product mind (me)	Scope, user stories, deferrals
`*-build-plan.md`	The next AI implementer	Per-module file structure, exact API endpoints
`phase-N-spec.md`	The next AI implementer mid-build	Pre-verified gotchas, banned wrong-guesses
`*-handoff.md`	The next AI agent on a fresh session	Repo layout, must-read order
`rules-lessons.md`	Every future agent	"Every rule here was learned by breaking it"

There's a real example of each in the repo's examples/ folder — including a sample PRD at the repo root that shows the bare-minimum format.

The Spec

Every MVP spec is a bulleted list of user-story sentences in the exact form "A user can / a user must / a user will be able to ...". Each one is testable, demoable, and pointable-to in a QA report.

From the examples/mvp-spec.md file:

## Module 1 — Upload

- A user can choose between two input paths: Upload a CSV statement or Try with sample data.
- A user must accept the privacy notice before upload becomes available.
- A user can upload a CSV file up to 10MB. Larger files must be rejected with a clear message.
- A user must see a real-time progress indicator while the file uploads and parses.
- A user will be able to see how many transactions were detected immediately after parsing.

Each bullet is one unit of work. The pattern (A user can / must / will be able to) is itself the test contract.

Here's the trick — and this is the part most spec-driven workflows miss:

The MVP feature becomes the QA test case, which then becomes the test assertion. Same thing, three lifecycles.

In the MVP:

A user can upload a CSV file up to 10MB.

In the QA handoff (see examples/frontend-qa-handoff.md):

On this screen, the user must be able to upload a CSV file up to 10MB.

In the test script:

await test('A-UP-03', 'Upload rejects files over 10MB', async () => {
  const res = await postFile('/statements/upload', oversizedFile);
  assertStatus(res, 413);
});

One sentence, no translation loss. This is the cheapest way to keep specs and tests in sync. If the MVP says it, the QA tests it, end of debate.
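
The story-to-test mapping can be mechanised. The following is a rough sketch, not the repo's tooling: `storiesToCases`, the `QaCase` shape, and the ID scheme (a prefix plus the bullet's position) are all hypothetical.

```typescript
// Hypothetical helper: turns MVP bullets into QA case stubs with stable IDs.
interface QaCase {
  id: string;
  story: string;
  modality: "can" | "must" | "will be able to";
}

const STORY = /^- A user (can|must|will be able to) (.+)$/;

function storiesToCases(specMarkdown: string, prefix: string): QaCase[] {
  const cases: QaCase[] = [];
  for (const rawLine of specMarkdown.split("\n")) {
    const line = rawLine.trim();
    const match = STORY.exec(line);
    if (!match) continue; // headings and prose are not test cases
    const id = `${prefix}-${String(cases.length + 1).padStart(2, "0")}`;
    cases.push({ id, story: line.slice(2), modality: match[1] as QaCase["modality"] });
  }
  return cases;
}

// First three bullets of the Upload module shown earlier.
const sampleSpec = [
  "- A user can choose between two input paths: Upload a CSV statement or Try with sample data.",
  "- A user must accept the privacy notice before upload becomes available.",
  "- A user can upload a CSV file up to 10MB. Larger files must be rejected with a clear message.",
].join("\n");

const uploadCases = storiesToCases(sampleSpec, "A-UP");
```

Each resulting case carries the original sentence verbatim, so a QA report can quote the MVP line it is checking.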

The Pre-Flight Checklist Pattern

MVP specs describe scope. Phase specs front-load gotchas.

A phase spec is a pre-verified execution manual that covers every corner case and pattern the agent will hit. The opener tells the agent what kind of document it's reading:

Pattern-match this file before writing any code for any feature in phases 4, 5, or 6. Every API path, response unwrap, request body field, icon name, and meemaw pattern is pre-verified here against the actual backend source. Do not guess.

The structure of a phase spec:

Mandatory checklist before every feature — 10–15 boxes the agent must tick
Verified primitives — icon names, EP constants, response shapes — explicitly marked WRONG vs CORRECT
Cheat tables for every endpoint mapping URL → service return → frontend unwrap
Banned patterns with wrong/right code pairs

The example phase spec at examples/phase-spec.md shows an API drift table like this:

Action	Correct path	EP constant	Match?
List transactions	`GET /api/v1/statements/${id}/transactions`	`EP.STATEMENT_TRANSACTIONS(id)`	OK
Reclassify	`PATCH /api/v1/statements/${id}/transactions/${txnId}`	`EP.STATEMENT_RECLASSIFY` → `/reclassify`	WRONG (no `/reclassify` suffix)
Bulk reclassify	`POST /api/v1/statements/${id}/transactions/bulk-reclassify`	`EP.STATEMENT_BULK_RECLASSIFY(id)`	OK

This is the spec doing AI-prep work: instead of letting the agent re-derive every endpoint, I front-load the drift map so the agent burns zero cycles on solved problems.

It feels excessive when you're writing it. It pays back massively the first time the next agent reads it and avoids three obvious bugs.
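
A drift table like the one above could also be generated rather than written by hand. This is a sketch under assumptions: the `RouteSpec` shape and `driftTable` are hypothetical, and the frontend paths are my guess at what the `EP` constants expand to, since the article only shows the constant names.

```typescript
// Hypothetical drift check: compare frontend endpoint constants to backend routes.
interface RouteSpec {
  action: string;
  backendPath: string;  // path as declared by the backend route
  frontendPath: string; // path the frontend constant expands to (assumed)
}

interface DriftRow extends RouteSpec {
  match: "OK" | "WRONG";
}

// Treat ":id" and "${id}" style params as the same placeholder.
function normalize(path: string): string {
  return path.replace(/\$\{\w+\}|:\w+/g, "{}").replace(/\/+$/, "");
}

function driftTable(routes: RouteSpec[]): DriftRow[] {
  return routes.map((route): DriftRow => ({
    ...route,
    match: normalize(route.backendPath) === normalize(route.frontendPath) ? "OK" : "WRONG",
  }));
}

const rows = driftTable([
  {
    action: "List transactions",
    backendPath: "/api/v1/statements/:id/transactions",
    frontendPath: "/api/v1/statements/${id}/transactions",
  },
  {
    action: "Reclassify",
    backendPath: "/api/v1/statements/:id/transactions/:txnId",
    frontendPath: "/api/v1/statements/${id}/transactions/${txnId}/reclassify",
  },
]);
```

The output is only as good as the two path lists fed into it; extracting those from real route files and constants is the part that stays project-specific.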

Bugs Become Rules

The atomic operation of the whole system is: every bug shipped to production becomes a rule the next agent must follow.

From a sanitized rules-lessons doc:

### 1. Trace the full response chain before writing any hook

Before writing a hook, trace this exact path:

1. Read the route handler: what does ResponseUtil.ok(res, X) pass?
2. Read the service: what does the method return?
3. Know the envelope: ResponseUtil.ok(res, data) → { data }.

Examples from real mistakes:

// Route: ResponseUtil.ok(res, result) where result = { items, total, page, limit, totalPages }
// WRONG: r.data.data / r.data.meta.total
// RIGHT: r.data.items / r.data.total

Never assume. Read it.

That // Examples from real mistakes: line is the literal moment a bug becomes a permanent agent-facing rule.
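
Typing the envelope is one way to make the wrong guess fail at compile time instead of in production. A minimal sketch, assuming `r` is the parsed response body and that the envelope is exactly `{ data }` as the rule states; the `Envelope`, `Page`, and `unwrapPage` names are invented for the example.

```typescript
// Assumed envelope: ResponseUtil.ok(res, data) sends { data } as the body.
interface Envelope<T> {
  data: T;
}

// Shape of `result` in the rule above.
interface Page<T> {
  items: T[];
  total: number;
  page: number;
  limit: number;
  totalPages: number;
}

interface Txn {
  id: string;
}

// RIGHT: r.data.items / r.data.total
// WRONG: r.data.data / r.data.meta.total (neither exists on Page<T>,
// so the compiler rejects it once the envelope is typed)
function unwrapPage<T>(r: Envelope<Page<T>>): { items: T[]; total: number } {
  return { items: r.data.items, total: r.data.total };
}

const body: Envelope<Page<Txn>> = {
  data: { items: [{ id: "txn_1" }], total: 1, page: 1, limit: 20, totalPages: 1 },
};

const firstPage = unwrapPage(body);
```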

The graduation path:

1. Bug ships
2. Engineer fixes it, writes a one-paragraph entry in the project's rules-lessons.md
3. Every future AI session on that project reads rules-lessons.md before writing code
4. If the same bug pattern shows up in a second project, it graduates to the workspace-level skills/hard-lessons.md
5. Every persona that loads hard-lessons.md (six of them) now knows the pattern

This is the loop that makes the system get harder to break with each cycle. The bugs you've already had become the bugs you don't have anymore.
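
The graduation step (a pattern seen in a second project moves to the workspace level) is simple enough to sketch. The author does this by hand today; the `Lesson` shape, the pattern slugs, and the project names below are made-up sample data, not real entries.

```typescript
// Hypothetical record: one rules-lessons entry tagged with a pattern slug.
interface Lesson {
  project: string;
  pattern: string;
}

// Returns patterns seen in at least `minProjects` distinct projects.
function lessonsToPromote(lessons: Lesson[], minProjects = 2): string[] {
  const projectsByPattern = new Map<string, Set<string>>();
  for (const { project, pattern } of lessons) {
    const seen = projectsByPattern.get(pattern) ?? new Set<string>();
    seen.add(project);
    projectsByPattern.set(pattern, seen);
  }
  return Array.from(projectsByPattern.entries())
    .filter(([, projects]) => projects.size >= minProjects)
    .map(([pattern]) => pattern);
}

const candidates = lessonsToPromote([
  { project: "project-a", pattern: "204-json-parse" },
  { project: "project-b", pattern: "204-json-parse" },
  { project: "project-a", pattern: "money-as-float" },
  { project: "project-a", pattern: "money-as-float" }, // same project twice does not count
]);
```

The hard part this sketch skips is tagging: two projects rarely describe the same bug in the same words, so a human still has to decide that two entries are one pattern.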

A full example of a project's rules-lessons file is at examples/rules-lessons.md. The workspace-level version is at skills/hard-lessons.md.

Planning with Opus, Executing with Sonnet

Now to the part most people get wrong: context management.

Long sessions degrade. The model gets confused. Earlier decisions get forgotten. Code drifts from the spec. The agent that started crisp ends up making mistakes that the same model in a fresh session would never make.

My pattern, after a lot of trial and error:

Use the most capable model (Opus, currently 4.7 / 4.8) for planning. Use the fastest competent model (latest Sonnet) for execution.

This isn't a cost-saving move. It's a context-management move.

Planning sessions (Opus)

When I start a new feature or a new phase, I open a fresh session with Opus and ask for three deliverables in sequence:

1. High-level plan: "Here's the MVP spec. What are the modules, the order, the dependencies between them?"
2. Detailed plan: "Let's drill into Module 2. What are the API endpoints, the frontend screens, the data model changes? Where are the gotchas?"
3. Tech docs: "Write the phase spec. Include the API drift table, the icon registry, the banned patterns."

Opus is genuinely better at this kind of work — holding many constraints in mind, sensing where the design will break, asking clarifying questions before producing the plan. The output of these three steps is a set of markdown files (high-level-plan.md, phase-2-spec.md, migration-notes.md) that go into the project's docs folder.

Then I close the session. The plan is the deliverable, not the chat history.

Execution sessions (Sonnet)

For implementation, I open a fresh session with the latest Sonnet and point it at the plan:

"Load personas/backend.md and follow it. Read the skills it lists. Then implement Module 2 per the spec in docs/phase-2-spec.md. The relevant rules-lessons are in docs/rules-lessons.md."

Sonnet executes faster, has plenty of context window for the actual implementation work, and produces clean code when the plan is good. The plan does the heavy thinking; Sonnet does the typing.

Clear context between tasks

The single best practice I've adopted: open a new chat for every task. Not "every feature" — every task.

Re-using the same chat across tasks pollutes the context. The agent remembers the previous task's decisions and applies them to the new one even when they don't transfer. Worse, the agent develops a kind of momentum — it stops re-reading the spec because "we already discussed it" — and starts inventing.

A fresh chat per task feels wasteful. It isn't. The cost of re-loading personas and skills (maybe 30 seconds of agent reading) is trivial next to the cost of the agent making a confused decision in turn 47 of a 60-turn chat because it conflated something from turn 12 with the current task.

Run the quality gates between every workflow

Before I hand off to a fresh session, I run the project's quality gates myself in the outgoing one:

pnpm typecheck
pnpm lint
pnpm test --changed
pnpm build

All four must pass. If they don't, the new session starts with a broken baseline, and the agent will spend the first 10 minutes trying to figure out why things are red instead of doing the work I'm asking for.

This is the rule the quality-standards.md skill enforces, and it applies to humans too. Don't hand a broken state to the next agent.
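
The "all four must pass" rule amounts to running the gates in order and stopping at the first failure. A small sketch, with the command runner injected so the logic stays independent of any process API; `runGates`, `RunCommand`, and `GateReport` are hypothetical names.

```typescript
// `run` executes one command and returns its exit code (0 means success).
type RunCommand = (command: string) => number;

// The four gates listed above, in the same order.
const GATES = ["pnpm typecheck", "pnpm lint", "pnpm test --changed", "pnpm build"];

interface GateReport {
  passed: string[];
  failed: string | null; // first failing gate, or null if all passed
}

// Runs gates in order and stops at the first non-zero exit code.
function runGates(run: RunCommand, gates: string[] = GATES): GateReport {
  const passed: string[] = [];
  for (const gate of gates) {
    if (run(gate) !== 0) return { passed, failed: gate };
    passed.push(gate);
  }
  return { passed, failed: null };
}

// Example with a fake runner in which lint fails.
const report = runGates((command) => (command === "pnpm lint" ? 1 : 0));
```

In a real script the runner would likely wrap something like Node's `child_process.spawnSync` with `shell: true` and return its `status`; that wiring is left out here on purpose.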

Claude Settings That Actually Matter

A few things in my .claude/settings.json that genuinely move the needle:

Permissions allowlist. By default, Claude Code prompts you to approve every Bash command. After the 200th git status, you stop reading the prompts. I add the read-only commands I run constantly to an allowlist so they don't prompt: ls, find, grep, git status, git log, git diff, cat, pnpm typecheck. The prompts that remain are the ones that actually need attention.

Hooks for automation. Claude Code lets you register hooks that run on specific events — SessionStart, PreToolUse, Stop. I use a Stop hook to play a sound when a long-running task finishes so I can context-switch productively. I use SessionStart hooks for project-specific setup ("if you're in this repo, load this persona automatically").

The hard "never" list in CLAUDE.md. Every project has a CLAUDE.md at the root with project-specific rules the agent reads on every session start. Mine includes things like "never bypass git hooks," "never --force push to main," "always run typecheck before claiming done."

The actual settings.json is project-specific, but the doctrine is in the skills/ and commands/ folders — particularly rules.md which captures the universal workspace rules (pnpm only, workspace:* for internal deps, 7-day npm release age, Node 20+, path aliases only).
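
For orientation, here is roughly what the allowlist and the Stop hook might look like, written as a typed object that serialises to a settings.json body. Treat it as an assumption-heavy sketch rather than the author's file: the rule strings follow the `Bash(command:*)` form as I understand it, the `afplay` sound command is macOS-only and chosen arbitrarily, and the key names should be checked against the current Claude Code settings documentation.

```typescript
// Minimal types for the parts of settings.json used here (not the full schema).
interface HookCommand {
  type: "command";
  command: string;
}

interface HookEntry {
  matcher?: string;
  hooks: HookCommand[];
}

interface ClaudeSettings {
  permissions: { allow: string[] };
  hooks?: { [event: string]: HookEntry[] };
}

const settings: ClaudeSettings = {
  permissions: {
    // Commands the author describes as read-only and safe to skip prompts for.
    allow: [
      "Bash(ls:*)",
      "Bash(find:*)",
      "Bash(grep:*)",
      "Bash(cat:*)",
      "Bash(git status)",
      "Bash(git log:*)",
      "Bash(git diff:*)",
      "Bash(pnpm typecheck)",
    ],
  },
  hooks: {
    // Play a sound when the agent finishes (assumed macOS command).
    Stop: [
      { hooks: [{ type: "command", command: "afplay /System/Library/Sounds/Glass.aiff" }] },
    ],
  },
};

// The string that would be written to .claude/settings.json.
const settingsJson = JSON.stringify(settings, null, 2);
```

One caution on the allowlist: `find` can delete or execute via its flags, so a stricter setup might leave it prompting.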

Why Grep Beats LLM-Only Review at Scale

Now to the second half of the system: code review.

A typical AI code review pattern is "ask the LLM to read every changed file and find issues." This works for small diffs. It does not scale. At any real codebase size, a reviewer that reads every file end-to-end is slow, expensive, and inconsistent — the same code reviewed on Monday and Friday produces different findings.

My reviewer is multi-axis and runs across three persona lenses (backend, frontend, fullstack QA). Each axis has concrete grep recipes and verbatim violation patterns.

The four axes

Logic & correctness — strict TS, no any, no useEffect + fetch, no as casts, money as bigint only.

Security — every route in asyncHandler, auth middleware on protected routes, no z.any(), no real credentials in CI, refresh token reuse detection tested.

Performance — bundle optimization checks, server/client component boundary checks (Next.js), database index audits, N+1 query patterns, append-only ledger violations.

Consistency — pnpm only, workspace:* for internal deps, 7-day release age on deps, path aliases only, no cross-app imports, no Redux/Zustand, icon proxy enforced.
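
The "money as bigint only" rule under logic and correctness is easiest to justify with arithmetic. The sketch below assumes amounts are stored in kobo (100 kobo to the naira) and handles non-negative amounts only; `nairaStringToKobo` and `formatKobo` are hypothetical helpers, not functions from the repo.

```typescript
// Binary floats cannot represent most decimal fractions exactly.
const floatTotal = 0.1 + 0.2; // not equal to 0.3

// Integer kobo stored as bigint keeps arithmetic exact.
const KOBO_PER_NAIRA = 100n;

// Parses "12.5" or "12.50" into kobo without going through a float.
function nairaStringToKobo(amount: string): bigint {
  const match = /^(\d+)(?:\.(\d{1,2}))?$/.exec(amount);
  if (!match) throw new Error(`Invalid amount: ${amount}`);
  const whole = BigInt(match[1]);
  const fraction = BigInt((match[2] ?? "").padEnd(2, "0"));
  return whole * KOBO_PER_NAIRA + fraction;
}

// Formats kobo back into a two-decimal naira string for display.
function formatKobo(kobo: bigint): string {
  const whole = kobo / KOBO_PER_NAIRA;
  const fraction = (kobo % KOBO_PER_NAIRA).toString().padStart(2, "0");
  return `${whole}.${fraction}`;
}

const exactTotal = nairaStringToKobo("0.10") + nairaStringToKobo("0.20");
```

Conversion to a decimal string happens only at the display edge; everything in between stays integer.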

The grep recipes

For each axis, the QA personas have grep recipes that catch the common violations in seconds. From personas/qa-backend.md:

# Services that throw (must return ServiceResult<T>)
grep -rn "throw new\|throw new Error" src/features/ --include="*.service.ts" \
  | grep -v "//\|AppError\|ValidationError"

# req object passed into service calls (HTTP leaking into business logic)
grep -rn "service\.\(.*\)(req\|\.service\.\(.*\)(.*req" src/features/

# res.json() called directly (bypasses ResponseUtil envelope)
grep -rn "res\.json\|res\.send(" src/features/ | grep -v "ResponseUtil"

# Async route handlers without asyncHandler wrapper
grep -rn "router\.\(get\|post\|put\|patch\|delete\)(.*async" src/features/ \
  | grep -v "asyncHandler"

# any type in Zod schemas
grep -rn "z\.any()" src/

# Money stored as number/float (must be bigint)
grep -rn "amount.*: number\|balance.*: number\|price.*: number" src/

From personas/qa-frontend.md:

# Raw && in JSX (must be <Show when={...}>)
grep -rn "{.*&&" src/features/ --include="*.tsx"

# Raw .map() in JSX (must be <Repeat>)
grep -rn "\.map(" src/features/ --include="*.tsx"

# Direct lucide-react imports (must go through icon proxy)
grep -rn "from 'lucide-react'" src/features/

# Raw hex in Tailwind classNames (must use token classes)
grep -rn "bg-\[#\|text-\[#\|border-\[#" src/features/

Each recipe catches a known class of bug in milliseconds. The LLM is reserved for the remaining 20% or so that pattern matching cannot see: semantic review, cross-file architectural sniff tests, "does this approach make sense given the rest of the codebase?"

The pattern: grep catches the patterns; the LLM catches the meaning. Both, never just one.
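
The same recipes can be expressed as data and run by a small scanner, which is handy when grep is not available or when findings need to be structured. This is a sketch over in-memory file contents; the `Rule` and `Finding` shapes and the `scan` function are assumptions, and only three of the recipes above are translated.

```typescript
interface Rule {
  id: string;
  pattern: RegExp;   // what counts as a violation
  appliesTo: RegExp; // which file paths to check
  allow?: RegExp;    // lines to skip, mirroring the `grep -v` step
}

interface Finding {
  rule: string;
  file: string;
  line: number;
  text: string;
}

const RULES: Rule[] = [
  { id: "service-throws", pattern: /throw new/, appliesTo: /\.service\.ts$/, allow: /\/\/|AppError|ValidationError/ },
  { id: "zod-any", pattern: /z\.any\(\)/, appliesTo: /\.ts$/ },
  { id: "direct-lucide-import", pattern: /from 'lucide-react'/, appliesTo: /\.tsx?$/ },
];

// `files` maps a path to its contents; a real runner would read from disk.
function scan(files: Record<string, string>, rules: Rule[] = RULES): Finding[] {
  const findings: Finding[] = [];
  for (const file of Object.keys(files)) {
    const lines = files[file].split("\n");
    for (const rule of rules) {
      if (!rule.appliesTo.test(file)) continue;
      lines.forEach((text, index) => {
        const allowed = rule.allow !== undefined && rule.allow.test(text);
        if (rule.pattern.test(text) && !allowed) {
          findings.push({ rule: rule.id, file, line: index + 1, text: text.trim() });
        }
      });
    }
  }
  return findings;
}
```

Like the grep versions, these are line-based heuristics: they will miss multi-line constructs and occasionally flag harmless code, which is why the LLM pass still follows.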

The slash commands

Built-in Claude Code commands I use constantly:

/code-review with an effort dial (low / medium / high / max / ultra). Low and medium produce a small number of high-confidence findings. Ultra runs a multi-agent review in the cloud.
/code-review --comment posts findings as inline PR comments.
/simplify — equivalent to /code-review --fix. Applies the findings to the working tree.
/review — review a pull request.
/security-review — security review of pending changes on the current branch.
/verify — verify a change actually works by running the app and observing behavior.

Pre-commit Husky hooks enforce the bare minimum (typecheck, lint, format, test on staged files) before commits land. CI fails if coverage of changed files drops below 80%. The reviewer fires both at PR creation and in the local diff.

MCP: Extending the Agent's Reach

Beyond the personas and skills, I use MCP servers (Model Context Protocol) to give the agent access to systems outside the local filesystem.

The four I rely on:

GitHub / GitLab MCP

Lets the agent read PRs, issues, comments, and commit history directly — without me copy-pasting. The pattern: "Read PR #142 and tell me what's still unresolved in the review thread." The agent fetches the PR, reads every comment, identifies which threads are open vs resolved, and produces a summary. The same thing for issues — "What did we decide about pagination in issue #88?"

This sounds small. It compounds. When the agent can read the surrounding context (the PR description, the linked issue, the review comments) before writing code, the code it writes accounts for the discussion.

Atlassian MCP (Jira, Confluence)

The same pattern, applied to ticket systems. "Read JIRA ticket PROJ-2417 and produce the implementation plan." The agent fetches the ticket, the acceptance criteria, any linked design docs in Confluence, and the comment thread. Then it writes the plan against all of that, not just my one-line summary.

I'm cautious about giving the agent write access to Jira — I prefer to keep the human in the loop on ticket transitions. But read access has been transformative.

Figma MCP

For design-driven work. The agent reads the Figma file, extracts the component structure, the design tokens, the spacing, the variants. Then it implements against the actual design, not against my verbal description of it.

This is especially good for design system work (more in Part 2). Combined with the /ship-design-system slash command, the agent can take a Figma file and translate it into a real React component library with high fidelity. The Figma MCP is what makes "the AI implements the design directly" actually feasible.

Video Watcher MCP

Less famous but genuinely useful. Lets the agent watch a video file (a recording, a tutorial, a Loom walkthrough) and produce a transcript-plus-screenshots summary. The pattern: "Here's a 15-minute Loom walkthrough of the bug. Watch it and tell me what's happening." The agent watches, extracts the relevant frames, transcribes the narration, and produces a description that includes the visual context.

This works for design reviews too — "Watch this Figma prototype recording and tell me the user flow." Massively faster than me writing it up.

What I don't use MCP for

I don't use MCP for things that need to happen at scale or in CI. The MCP servers are for interactive work — augmenting the agent's context with information from external systems. The actual code execution, the actual tests, the actual deploys still happen through the standard local tooling. MCP is a context-extension layer, not an execution layer.

Stitch Design and Claude Design

Two more tools that have changed the design-implementation loop:

Stitch Design (AI design generation)

Stitch is an AI design tool by Google that produces production-grade UI mockups from prompts. The trick is that AI design tools work best when given:

Screen-by-screen breakdowns — not full page descriptions
Extremely detailed specifications — every element, size, color, position
Light mode specifications — always specify light/dark mode
Exact measurements — pixel values, percentages, spacing units

The Golden Rule: if you haven't explicitly stated it, the AI will guess — and it will probably guess wrong.

I use Stitch as the front of a chain: prompt Stitch with a detailed brief → get a mockup → feed the mockup into Figma → use the Figma MCP to have Claude implement it. The chain converts "I have an idea for a screen" into shipped React code in an afternoon, with quality that would have taken a week the old way.

Claude Design (the design system commands)

The two slash commands in my repo — /design-system-agent and /ship-design-system — are a clean designer/implementer split:

/design-system-agent (the designer) — picks a stance from a catalog of 25, runs discovery, builds the HTML spec (scenes + foundation CSS + variations)
/ship-design-system (the shipper) — translates a finished Studio project into a real React component library inside a target repo

The shipper's hard rule, stated in four places in its system prompt: never invent design, only translate. If the HTML doesn't show a state, the shipper doesn't add one. A second rule follows from the first: don't fight the repo. If existing components use default exports, the shipper uses default exports too.

The full workflow is documented in docs/how-to-use-design-system.md. The second part of this article series goes deeper into the design system pipeline — how the two-agent split keeps design and implementation honest.

The Vercel React Skills and the UI Skills

Two more layers that load on every frontend task:

Vercel React skills

A set of three skills from Vercel that I treat as universal frontend canon:

vercel-react-best-practices — performance optimization guidelines, ~70 rules across categories like async-, bundle-, server-, client-
vercel-composition-patterns — React composition patterns that scale (compound components, render props, context providers, React 19 API changes)
vercel-react-view-transitions — guide for native View Transition API implementations

These are not my skills. Vercel built them. They install via Claude Code's skill system (or symlink into ~/.claude/skills/). Together they give the agent a strong default for "what good React looks like in 2026" without me having to encode it.

UI / Web Design Guidelines skill

A general-purpose UI review skill. Useful for catching accessibility regressions, contrast issues, keyboard navigation gaps, the things that should not ship but routinely do because the developer was looking at the happy path on a 32-inch monitor.

These don't replace my frontend-fsd.md and frontend-guide.md — those are the project-specific rules. The Vercel skills are universal patterns; mine are project-specific conventions. Both load.

What This Architecture Gives Up

Honest accounting, because the system isn't free:

Onboarding overhead. A new engineer copying this scaffolding without my scar tissue will encode the wrong things. The lesson is the practice (bug → rule → spec → next agent inherits it), not the specific files. Copying my files without the discipline is cargo-culting.

Doc drift. Specs go stale. The agent reading a stale spec doesn't know the spec is wrong. I don't have a drift detector yet — a weekly job that re-greps the codebase and flags rule drift would keep the specs honest. It's on my "what I'd build next" list.

Grep recipes are codebase-specific. The recipes in qa-backend.md assume Express + service layer + ResponseUtil envelope. If your stack is different (NestJS, Fastify, Go, Rust), the recipes don't transfer directly — you need to rewrite them for your patterns. The methodology transfers; the specific commands don't.

Cross-project promotion of lessons is manual. When a bug pattern appears in two projects, I have to notice it and lift the rule from project-specific rules-lessons.md to workspace-level hard-lessons.md. There's no automation for this. A monthly "lift recurring lessons" review is on the backlog.

It only works because I've been bitten enough times to know what to encode. The system is the externalised memory of an engineer who has seen the failure modes. A junior engineer running this exact scaffolding will encode their first month's mistakes — which are not the same mistakes the system was designed to catch.
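
On the doc drift point above, a first version of a detector does not need to be clever. The sketch below covers one narrow slice of drift, file paths mentioned in a spec that no longer exist in the repo; it is not the author's planned weekly job, and `stalePathRefs` plus the list of top-level folders in the regex are assumptions.

```typescript
// Matches path-like references under a few assumed top-level folders.
const PATH_REF = /\b(?:src|docs|skills|personas)\/[\w./-]+\.\w+/g;

// Returns referenced paths that are missing from the repo's file list.
function stalePathRefs(specText: string, repoFiles: string[]): string[] {
  const existing = new Set(repoFiles);
  const referenced: string[] = specText.match(PATH_REF) ?? [];
  const unique = referenced.filter((path, index) => referenced.indexOf(path) === index);
  return unique.filter((path) => !existing.has(path));
}

const stale = stalePathRefs(
  "Read docs/phase-2-spec.md, then src/features/upload/upload.service.ts. See docs/phase-2-spec.md again.",
  ["docs/phase-2-spec.md"],
);
```

Stale endpoints, renamed constants, and rules that no longer match the code would each need their own check; this only shows the shape of the job.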

Evaluation

After running this system across the products I ship, three measurable properties:

New project starts faster. The universal skills (hard-lessons, quality-standards, rules, frontend-fsd, backend-service-patterns) are already loaded. The agent doesn't need to be told that money is bigint kobo or that services return ServiceResult<T>. It already knows. First-week velocity on a new project is meaningfully higher.

Recurring bug classes don't recur. The 204 .json() parsing bug, the API drift bug, the req-in-service bug, the optimistic update without rollback bug — these used to appear in every project. They don't anymore. They're in the skill files. The agent reads the skills. The bugs don't get written.

QA agents and dev agents grep against the same rulebook. Handoffs don't require translation. The dev agent was told "services return ServiceResult<T>." The QA agent greps for throw new in service files and files violations. Same rule, different lens.

The Closing Frame

The AI is a recurring contractor on a maturing codebase. Each session inherits the accumulated rule set. Each session contributes new rules back when it finds new failure modes.

The persona is who the contractor is. The skill is how the contractor works. The codebase docs are what the contractor needs to know about this specific project.

When you treat AI like a stranger you have to re-explain everything to, you get vacuum-problem output. When you treat AI like a contractor with a reference library of your team's lessons, you get something different.

The reference library is at github.com/spiderocious/agentic-workflow. Fork it. Edit it. Ignore the parts that don't apply. The files are deliberately small and modular so you can take what fits.

The spec is not documentation. It's the executable working memory of the team.

In Part 2: the QA agents, the design system pipeline, the multi-agent orchestration patterns, the self-built skills like agent-browser and the Demo Director persona, and what I'd build next.
