AI Powerhouse · Testing

Meeting Select the workshop · testing


AI writes more and more of our code, so we need a way to be sure it works: both the code and the product. Today we make the agent prove it.

The rule of the day: short talks, long labs. Every section ends at your keyboard.
Get ready · have these ready

What to have ready.

  • Claude Code installed & working: Signed in, and it opens on your machine.
  • The Claude in Chrome plugin installed: It lets the agent open and click a real browser. We use it this afternoon. See the next slide to install it.
  • Your app running on your machine: So the agent can open it in the browser.
  • A small feature or bug, ready to build: Real, not a toy. Small enough to finish today.
Get ready · install this now

Install Claude in Chrome.

This is a Chrome plugin that lets the agent drive a real browser. It can open your app, click, type, and read the screen, the console, and the network. We use it this afternoon to test the product the way a user sees it.

Install it from code.claude.com/docs/en/chrome

Install the Chrome extension, then connect it to Claude Code. Do this before the afternoon lab.

Section 02 / 5 · why we're here

You are here: Specify · Generate · Comprehend
The workflow · what we've built so far

Where we are in the workflow.

Document (a later session): decisions become context for the next task
Specify: you · the front
Generate: the agent
Comprehend: you · the back
01 Align: grill until you agree (✓ grill-me)
02 Plan: the spec it follows (✓ grill-with-docs)
03 Execute: the agent builds (✓ the loop)
04 Review: prove it works, then merge (← today)
We built the front in session one. Today we make the back provable: tests for the code, the browser for the product.
The loop · how an agent works

An agent works in a loop.

↻ repeat
01 Decide: pick the next step
02 Act: run a tool (edit, run, search)
03 Observe: look at what happened
Observe is where the agent checks its own work. Take it away and the agent just guesses it is done. Testing is how we make “observe” real.
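
The decide, act, observe loop can be sketched in a few lines. This is a toy illustration only, not how any particular agent is implemented: `agent_loop`, its three callbacks, and `max_steps` are hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple

# One recorded turn of the loop: (decision, result, done).
Step = Tuple[int, int, bool]


def agent_loop(
    decide: Callable[[List[Step]], int],
    act: Callable[[int], int],
    observe: Callable[[int], bool],
    max_steps: int = 10,
) -> List[Step]:
    """Repeat decide -> act -> observe until observe reports done (hypothetical sketch)."""
    history: List[Step] = []
    for _ in range(max_steps):
        decision = decide(history)  # Decide: pick the next step
        result = act(decision)      # Act: run a tool
        done = observe(result)      # Observe: look at what happened
        history.append((decision, result, done))
        if done:
            break
    return history


# Toy usage: keep doubling a counter until the observed result reaches 6.
history = agent_loop(
    decide=lambda past: len(past) + 1,
    act=lambda n: n * 2,
    observe=lambda result: result >= 6,
)
```

If `observe` can never report success, the loop only stops at `max_steps`. That is the sketch's version of an agent that guesses it is done.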
The problem
The agent writes the code.
Who checks it works?
Writing code by hand was slow. That slowness was the check. You read every line as you wrote it. Now the code arrives faster than anyone can read it. The old check is gone.
More code: It lands in minutes, not days.
Less reading: No one can review it all by eye.
The fix: Make the agent prove its work.
Two tracks · today's shape

Two kinds of quality.

Track one · Code quality
(diagram: parts A, B, C)
Tests prove each part does what we asked.
Assemble & ship.
Track two · Product quality
(diagram: parts A, B, C)
The browser walks the whole flow and hits the broken join.

Every part can pass. The product can still break where the parts join. So we check both: the parts and the whole.

Section 03 / 5 · track one · make it prove its own work

You are here: Specify · Generate · Comprehend (TDD)
Why it fits · back to the agent's loop

Why test-first fits the loop.

↻ until it’s green
01 Decide: plan the next slice
02 Act: write the code
03 Observe: run the test (red or green)
It is the agent's observe: The loop ends in observe. A test makes that a fact: it passes, or it fails.
Write it first: “Done” is fixed before the agent writes a line. The test is the target it aims at.
It scales past reading: More code lands than anyone can read. The test keeps you sure without reading every line.
Test-first · red · green · refactor

Write the test first.

Red: Write one failing test. It says what “done” means.
Green: Write the least code to make it pass.
Refactor: Clean it up. Keep the test green.
Red proves the test can fail before the code exists. Green is the least code that makes it pass. Refactor tidies up while the test holds the behavior in place.
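
One slice of red, green, refactor might look like the sketch below. It uses Python's standard `unittest` module. The `Booking` class and its fields are hypothetical, loosely modeled on the booking example later in this deck.

```python
import unittest
from dataclasses import dataclass


class NewBookingTest(unittest.TestCase):
    """Red: written first. Before Booking exists, this test fails with a NameError."""

    def test_new_booking_starts_pending(self):
        self.assertEqual("pending", Booking(room="A").status)


# Green: the least code that makes the test above pass (hypothetical class).
@dataclass
class Booking:
    room: str
    status: str = "pending"


def run_tests() -> unittest.TestResult:
    """Run the single test case and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NewBookingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Refactoring would come next: rename, tidy, or restructure `Booking` while `run_tests()` keeps reporting success.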
How we do it

How we run TDD.

Agree on done: The agent grills you first. What should it build? What does “done” look like? You approve the list before any code. This friction is on purpose. It is how the team learns.
One test, then one bit of code: One failing test. The least code to pass it. Then the next. Each test builds on what the last one taught you.
Never all the tests up front: Writing every test first is the mistake agents make by default. Those tests check imagined behavior, not real behavior.
The skill you'll use
/tdd
Grills you: It asks what “done” means and which behaviors matter, before any code.
One slice: One failing test, then the least code to pass. Repeat, never in big batches.
Red · green · refactor: It runs the cycle for you and never refactors while a test is red.
Browser too: For screens, it drives the real app to prove the flow. That is track two.
The catch · the agent can cheat its own test

The agent can fake a passing test.

booking_tests.cs · the test went red, so the agent “fixed” the test
  // a new booking should still start as "pending"
- Assert.Equal("pending", booking.Status);
+ Assert.Equal("cancelled", booking.Status);
  // now the test agrees with the broken code
✓ 214 passed · but the bug is still there
The rule: Read the test changes more carefully than the code. When one agent writes both the code and the test, it can change or delete the test until it passes. A green check over many edited tests proves nothing until you read the edits.
The counter: Break the code on purpose. A real test fails when the code is wrong. Coverage only says a line ran. Mutation testing says the test would notice if that line were wrong.
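
The difference between coverage and mutation testing can be shown with a hand-made mutant. This is a minimal sketch, not a real mutation-testing tool: every function name here is hypothetical, and the "mutant" is simply a deliberately broken copy of the code under test.

```python
from typing import Callable

StatusFn = Callable[[bool], str]


def initial_status(is_paid: bool) -> str:
    """Code under test (hypothetical): a booking is 'pending' until it is paid."""
    return "confirmed" if is_paid else "pending"


def mutant_initial_status(is_paid: bool) -> str:
    """Hand-made mutant: the unpaid branch is broken on purpose."""
    return "confirmed" if is_paid else "cancelled"


def weak_test(status_fn: StatusFn) -> bool:
    """Runs both branches (full coverage) but only checks that a string comes back."""
    return isinstance(status_fn(True), str) and isinstance(status_fn(False), str)


def strong_test(status_fn: StatusFn) -> bool:
    """Checks the actual behavior of both branches."""
    return status_fn(True) == "confirmed" and status_fn(False) == "pending"


def kills_mutant(test: Callable[[StatusFn], bool], original: StatusFn, mutant: StatusFn) -> bool:
    """A test kills a mutant if it passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)
```

Both tests execute every branch, so a coverage report would rate them the same. Only `strong_test` notices when the code is wrong.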
Exercise A · hands on · 30 min

Build it test-first.

Take your small feature. Load the skill. Let it grill you, then watch it work one slice at a time.

01 Load the skill: run /tdd on your feature
02 Let it grill you: answer until “done” is clear, then approve the list
03 Run the cycle: one failing test, least code to pass, repeat
04 Read the tests back: do they describe behavior you understand?
You're done when
Your feature has passing tests you understand, and you saw them go red before green.
Section 04 / 5 · track two · test it like a user

You are here: Specify · Generate · Comprehend (browser)
Track two · why unit tests aren't enough

Why unit tests aren’t enough.

A unit test proves a part: It checks one small piece on its own. The agent wrote both the code and the test, so they agree by design.
The browser proves the product: Only running the real app shows what the user sees when the pieces come together.

Code review tells you what changed. It cannot tell you what the page does when a real user loads it.

How it works · Claude in Chrome

The agent drives a real browser.

Reproduce: Open the page. Do the action a user would.
Inspect: Read the screen, the console, the network calls.
Diagnose: Compare what you see to what you expect.
Fix & verify: Change the code. Run the flow again. Confirm it's clean.

First give it a rubric: which flow to try, what “passing” looks like, what counts as broken. Without that, the agent says “all fine” because it has no way to know what “not fine” is.
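
A rubric can be as simple as three lists turned into prompt text. The sketch below is one hypothetical way to write one down. The `Rubric` class is an illustration, not part of Claude Code or Claude in Chrome, and the example entries are assumptions based on the booking page shown later in this deck.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rubric:
    """Hypothetical rubric: which flow to try, what passing looks like, what counts as broken."""

    flow: List[str]
    passing: List[str]
    broken: List[str]

    def is_complete(self) -> bool:
        """A rubric is usable only when all three parts are filled in."""
        return all([self.flow, self.passing, self.broken])

    def to_prompt(self) -> str:
        """Render the rubric as plain text to paste into the agent's instructions."""
        sections = [
            ("Flow to try", self.flow),
            ("Passing looks like", self.passing),
            ("Counts as broken", self.broken),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Example entries are assumptions for illustration only.
rubric = Rubric(
    flow=["Open /planner/bookings", "Read the status column of every row"],
    passing=["Every row shows a readable status label"],
    broken=["A status shows null or a raw number", "Errors in the console"],
)
prompt = rubric.to_prompt()
```

The `is_complete` check mirrors the point above: with an empty "broken" list, the agent has no way to know what "not fine" looks like.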

What it catches · a real example

A bug only the browser catches.

/planner/bookings
Booking                  Status
Amsterdam · room A       Confirmed
Utrecht · room B         Confirmed
The Hague · room C       null
Rotterdam · room D       7
Browser agent · blocking: Status shows null. A diff that refactors the booking status codes can look fine when you read it. But the agent opens the page and sees the status column show null or a raw number, not “Confirmed.” It files that as a finding that blocks the merge.
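
The check the browser agent applies here can be written down as a small rule: every status cell must show a known label. This is a sketch under stated assumptions. `find_bad_statuses` is a hypothetical helper, and the set of known labels is assumed (only "Confirmed" appears on the page above).

```python
from typing import Dict, List

# Assumption: the labels a user should ever see in the status column.
KNOWN_LABELS = {"Confirmed", "Pending", "Cancelled"}


def find_bad_statuses(rows: Dict[str, str]) -> List[str]:
    """Return one finding per row whose status is not a known label."""
    return [
        f"{booking}: status shows {status!r}"
        for booking, status in rows.items()
        if status not in KNOWN_LABELS
    ]


# What the agent read off the rendered page in the example above.
rows_seen = {
    "Amsterdam · room A": "Confirmed",
    "Utrecht · room B": "Confirmed",
    "The Hague · room C": "null",
    "Rotterdam · room D": "7",
}
findings = find_bad_statuses(rows_seen)
```

Any non-empty `findings` list would be reported as blocking, matching the two broken rows on the slide.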
Exercise B · hands on · 20 min

Test it like a user.

Give an agent a browser. Let it click through the feature you just built, the way a real user would.

01 Point it at your app: the running app, through Claude in Chrome
02 Give it a rubric: the flow to try, what passing looks like, what counts as broken
03 Have it click through: let it report what it saw, with screenshots
Debrief
  • Did it find anything that broke in the real UI?
  • What could it not test on its own?
You're done when
An agent clicked through your feature in a real browser and reported back.
Section 05 / 5 · the last word

You are here: Specify · Generate · Comprehend
Where we are · the pieces, snapping in

What we added today.

Document (a later session): this loop compounds next
Specify: grill-me · grill-with-docs
Generate: the agent
Comprehend: tests · browser
01 Align: grill-me
02 Plan: grill-with-docs
03 Execute: the loop
04 Review: /tdd + browser (✓ today)
Four steps in. The next sessions add the rest: many agents at once, agents that run on their own, and the loop that compounds.
Close · what you own now


The team now owns two habits: the /tdd skill that makes the agent prove its code, and the browser check that proves the product. Both run inside the agent's loop, on every task.

The agent does the checking now. You still own the judgment: what to build, and whether it's right.
