I built an AI that argues with my other AI

One model in one context window is not a critic

The first version of every design I ship is wrong. Not badly wrong. Wrong in a way I can't see from inside it, which is the only kind that survives to production.

Design crits used to catch this. A senior designer, a PM, an engineer. Different blind spots, so between them the gap gets found.

AI quietly deleted that loop. I spent months running one model for everything: build, evaluate, refine, ship. Then someone would flag the obvious problem in thirty seconds.

Here's what took me too long to see. A model can't critique what it just built, and not because it's sycophantic. Because critique needs distance, and a single context window has none. It pattern-matches to its own output. Asking it to review itself is asking the designer who made the thing whether the thing is good. Of course they like it. They made it.

So I built a relay that refuses to let that happen.

Claude builds. ChatGPT critiques. Each critic role has its own system prompt, knowledge base, and toolset. Disagreement is the product.

How the brain does critique

Real critique isn't one region checking itself. It's adversarial.

The prefrontal cortex generates a plan. The anterior cingulate cortex flags when the plan contradicts what you actually want. The insula produces that something is off feeling you can't name yet.

Three systems. Each one has its own job. Each one is looking at the same thing from a different angle. The PFC can't overrule the others. It has to listen.

Single-model AI flattens all that into one voice. The relay splits it apart.

The architecture

Builder (Claude). Writes code. Builds Figma. Drafts copy. Owns the artifact.
Critic Panel (ChatGPT). Four roles defined in YAML. Critic shoots holes in the direction. Researcher checks what the literature and competitors actually say. PM answers "does this ship, or is it scope creep?" Engineer checks whether it's actually buildable. Each role has its own system prompt, its own file search, its own web access. None of them share context with the builder.
Relay (Python bridge). Ferries artifacts between them. Claude produces. The relay hands the output to each critic in turn. I read the critiques and decide what lands.

The asymmetry is the whole point. Claude doesn't know what ChatGPT will say. ChatGPT only sees what Claude shipped, not what Claude was thinking. That's the design crit. That's the gap.

The useful unit isn't "model." It's "role." A role is a model plus a system prompt plus a knowledge base plus a toolset. Treat roles as team members.

A real example

GiveCampus Volunteer Management alpha. Claude built a task card for a calling workflow. Clean layout. Good typography. It passed my eye.

Sent it through the critic panel. ChatGPT came back with one line:

You're burying the action. The card exists to get the volunteer into the call. The call button is tertiary. You've made metadata louder than the thing that matters.

That was correct. I'd copied the hierarchy from an older pattern and never asked if the pattern was right. Claude missed it because Claude built it. I missed it because it was late and I was running on empty. The critic caught it in one turn for the least flattering reason available: it had no stake, and no stake turns out to be the whole qualification.

Happens every time. Not because ChatGPT is smarter than Claude (it isn't, most days). Because it's separate.

What this taught me about designing with AI

Models aren't fungible. Claude has its own defaults, failure modes, blind spots. Claude is generous, over-complete, a little too eager. ChatGPT is terse, skeptical, cuts scope too fast. Gemini (when I reach for it) is a quiet validator. Useful for catching when the other two are both wrong.

You don't build an AI team by picking the smartest model. You pick, for each seat, the failure mode you can most afford to live with. The builder over-produces, so you can cut. The critic under-validates, so it stays mean. Smart isn't the variable. Which way it breaks is.

Builder needs a model that over-produces. You can always cut. Critic needs a model that under-validates. You want skeptical. Researcher needs strong tool use and citation discipline.

And they stay separate. Share memory once, they start agreeing. Agree once, you're back to one voice in your head.

The distinction

One AI model is your own inner voice bounced back louder. It agrees. It remembers what you said three turns ago and builds on it. It cannot review anything. It can only applaud.

Two models, wired so they never share context, never see each other's work, never align on style, now you have the beginning of a team.

Not "AI-assisted design." Something closer to running a small, specialized, adversarial team that happens to live in two chat windows.

The relay config and role prompts are on the tools page. Adapt them. Make them yours. The architecture matters more than the specific wording. Once you see the pattern, you can build it with whatever models are strongest this quarter.