AI Powerhouse · Testing

Meeting Select the workshop · testing

AI writes more and more of our code. So we need a way to be sure it works. The code and the product. Today we make the agent prove it.

The rule of the day: short talks, long labs. Every section ends at your keyboard.
Get ready · have these ready

What to have ready.

  • Claude Code installed & working: signed in, and it opens on your machine.
  • The Claude in Chrome plugin installed: it lets the agent open and click a real browser. We use it this afternoon. See the next slide to install it.
  • Your app running on your machine: so the agent can open it in the browser.
  • A small feature or bug, ready to build: real, not a toy. Small enough to finish today.
Get ready · install this now

Install Claude in Chrome.

This is a Chrome plugin that lets the agent drive a real browser. It can open your app, click, type, and read the screen, the console, and the network. We use it this afternoon to test the product the way a user sees it.

Install it from code.claude.com/docs/en/chrome

Install the Chrome extension, then connect it to Claude Code. Do this before the afternoon lab.

The workflow · what we've built so far

Where we are in the workflow.

Document: decisions become context for the next task (a later session)
Specify: you · the front
Generate: the agent
Comprehend: you · the back
01 Align: grill until you agree ✓ grill-me
02 Plan: the spec it follows ✓ grill-with-docs
03 Execute: the agent builds ✓ the loop
04 Review: prove it works, then merge ← today
We built the front in session one. Today we make the back provable: tests for the code, the browser for the product.
The loop · how an agent works

An agent works in a loop.

↻ repeat
01 Decide: pick the next step
02 Act: run a tool: edit, run, search
03 Observe: look at what happened
Observe is where the agent checks its own work. Take it away and the agent just guesses it is done. Testing is how we make “observe” real.
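
The loop above can be sketched in a few lines. This is a toy, not how Claude Code or any real agent is implemented: `decide`, `act`, `observe`, and `run_loop` are hypothetical names, and the "task" is only reaching a target number. The point it shows is that the loop stops on a checked fact, not on a guess.

```python
# Toy decide / act / observe loop. Every name here is hypothetical;
# this is an illustration, not a real agent implementation.

def decide(state: int, target: int) -> int:
    """Decide: pick the next step (move one unit toward the target)."""
    return 1 if state < target else -1

def act(state: int, step: int) -> int:
    """Act: run the 'tool' by applying the step."""
    return state + step

def observe(state: int, target: int) -> bool:
    """Observe: check the result against a fact."""
    return state == target

def run_loop(start: int, target: int, max_steps: int = 100) -> tuple[int, int]:
    """Repeat until observe() reports done, or the step budget runs out."""
    state, steps = start, 0
    while not observe(state, target) and steps < max_steps:
        state = act(state, decide(state, target))
        steps += 1
    return state, steps

final_state, steps_taken = run_loop(start=0, target=3)
```

Remove the `observe` call from the `while` condition and the loop has no way to know it is finished, which is the guessing the slide warns about.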
The problem
The agent writes the code.
Who checks it works?
Writing code by hand was slow. That slowness was the check. You read every line as you wrote it. Now the code arrives faster than anyone can read it. The old check is gone.
  • More code: it lands in minutes, not days.
  • Less reading: no one can review it all by eye.
  • The fix: make the agent prove its work.
Two tracks · today's shape

Two kinds of quality.

Track one · Code quality: parts A, B, C
Tests prove each part does what we asked.
Assemble & ship
Track two · Product quality: parts A, B, C
The browser walks the whole flow and hits the broken join.

Every part can pass. The product can still break where the parts join. So we check both: the parts and the whole.
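
A small sketch of "every part passes, the join breaks". The functions and the price example are invented for illustration and do not come from the workshop app. Part A returns cents, part B expects euros, and each passes its own test.

```python
# Hypothetical example: two parts that each pass their own test,
# and a whole-flow check that exposes the broken join.

def price_in_cents(euros: float) -> int:
    """Part A: returns the price in cents."""
    return round(euros * 100)

def format_price(euros: float) -> str:
    """Part B: expects euros, returns a display string."""
    return f"EUR {euros:.2f}"

# Each part passes its own test.
assert price_in_cents(12.5) == 1250
assert format_price(12.5) == "EUR 12.50"

def checkout_label(euros: float) -> str:
    """The join: hands cents to a function that expects euros."""
    return format_price(price_in_cents(euros))

label = checkout_label(12.5)
flow_is_correct = label == "EUR 12.50"
```

Only a check on the assembled flow sees that the label is a hundred times too large.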

Section 03 / 5 · track one · make it prove its own work

You are here: Specify → Generate → Comprehend (TDD)
Why it fits · back to the agent's loop

Why test-first fits the loop.

↻ until it’s green
01 Decide: plan the next slice
02 Act: write the code
03 Observe: run the test: red or green
  • It is the agent's observe: the loop ends in observe. A test makes that a fact: it passes, or it fails.
  • Write it first: “done” is fixed before the agent writes a line. The test is the target it aims at.
  • It scales past reading: more code lands than anyone can read. The test keeps you sure without reading every line.
Test-first · red · green · refactor

Write the test first.

  • Red: write one failing test. It says what “done” means.
  • Green: write the least code to make it pass.
  • Refactor: clean it up. Keep the test green.
Red proves the test can fail before the code exists. Green is the least code that makes it pass. Refactor tidies up while the test holds the behavior in place.
How we do it

How we run TDD.

Our way: one slice at a time
01 Grill first: agree what “done” means before any code
02 One test, then one bit of code: the least code to pass it, then repeat
Each test checks real behavior, and builds on the last.
The default: all the tests up front
test · test · test · test · test · test
The mistake agents make. These check imagined behavior.
The skill you'll use
/tdd
  • Grills you: it asks what “done” means and which behaviors matter, before any code.
  • One slice: one failing test, then the least code to pass. Repeat, never in big batches.
  • Red · green · refactor: it runs the cycle for you and never refactors while a test is red.
  • Code only: it tests the logic, not the screen. The screen gets proven in track two, with /browser-testing.
▶ Live demo · presenter runs this · audience is next

Spec the feature, then build it test-first.

01 / Now · Watch
Presenter: build “edit an existing to-do” live, in two steps, in this order.
  • Start from the working to-do app: it already runs and passes its tests
  • First, grill-with-docs the feature: spec “let users edit a to-do” before any code
  • Then run /tdd from that spec: red, then green, one slice at a time
  • Read the test, not just the check: does it match the spec we just wrote?
02 / Next · You
Then you do the same two steps on your own feature: grill-with-docs, then /tdd.
Exercise A → hands on · 30 min
The catch · the agent can cheat its own test

The agent can fake a passing test.

booking_tests.cs · the test went red, so the agent “fixed” the test
// a new booking should still start as "pending"
-Assert.Equal("pending", booking.Status);
+Assert.Equal("cancelled", booking.Status);
// now the test agrees with the broken code
✓ 214 passed · but the bug is still there
The rule: Read the test changes more carefully than the code. When one agent writes both the code and the test, it can change or delete the test until it passes. A green check over many edited tests proves nothing until you read the edits.
The counter: Break the code on purpose. A real test fails when the code is wrong. Coverage only says a line ran. Mutation testing says the test would notice if that line were wrong.
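
"Break the code on purpose" can be tried by hand before reaching for a tool. The sketch below is a hand-rolled version of the idea with hypothetical names; real mutation testing tools generate and run the broken variants for you across a whole codebase.

```python
# Hand-rolled mutation check. All names are hypothetical.

def new_booking_status() -> str:
    """Correct code: a new booking starts as pending."""
    return "pending"

def mutated_new_booking_status() -> str:
    """The same code, broken on purpose (the mutant)."""
    return "cancelled"

def strong_test(status_fn) -> bool:
    """Asserts the actual behavior."""
    return status_fn() == "pending"

def weak_test(status_fn) -> bool:
    """Runs the line, so coverage counts it, but checks almost nothing."""
    return isinstance(status_fn(), str)

def kills_mutant(test) -> bool:
    """A useful test passes on the good code and fails on the mutant."""
    return test(new_booking_status) and not test(mutated_new_booking_status)

strong_kills = kills_mutant(strong_test)
weak_kills = kills_mutant(weak_test)
```

Both tests give the same coverage number. Only the strong one notices the break.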
Exercise A · hands on · 30 min

Spec it, then build it test-first.

Two steps, in order. First write the spec with grill-with-docs. Then let /tdd build it, one slice at a time.

01 Grill-with-docs your feature: answer its questions until the spec is written
02 Read the spec back: this is what /tdd will build, so make it right
03 Run /tdd from the spec: one failing test, least code to pass, repeat
04 Read the tests back: do they match the spec you wrote?
You're done when
Your feature is built from a spec you wrote, with passing tests you saw go red before green.
Section 04 / 5 · track two · test it like a user

You are here: Specify → Generate → Comprehend (browser)
Track two · why unit tests aren't enough

Why unit tests aren’t enough.

A unit test · checks one part
Code · Test
The agent wrote both. They agree by design.
vs
The browser · checks the product
The running app
A real user, on the real screen.

Code review tells you what changed. Only the browser shows what the page does when a real user loads it.

How it works · Claude in Chrome

The agent drives a real browser.

  • Reproduce: open the page. Do the action a user would.
  • Inspect: read the screen, the console, the network calls.
  • Diagnose: compare what you see to what you expect.
  • Fix & verify: change the code. Run the flow again. Confirm it's clean.
What it catches · a real example

A bug only the browser catches.

/planner/bookings
Booking               Status
Amsterdam · room A    Confirmed
Utrecht · room B      Confirmed
The Hague · room C    null
Rotterdam · room D    7
Browser agent · blocking: Status shows null. A diff that refactors the booking status codes can look fine when you read it. But the agent opens the page and sees the status column show null or a raw number, not “Confirmed.” It files that as a finding that blocks the merge.
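
One way such a bug could arise, as a sketch. The status codes, labels, and function names below are invented for illustration; the deck does not show the real code. A refactor adds a new code but forgets its label, so the renderer falls back to the raw value.

```python
# Hypothetical reconstruction of the bug; codes and labels are invented.
from typing import Optional

# Code 7 was introduced by the refactor, but its label was never added.
STATUS_LABELS = {1: "Confirmed", 2: "Pending"}

def render_status(code: Optional[int]) -> str:
    """What the page ends up showing for a booking's status."""
    if code is None:
        return "null"
    return STATUS_LABELS.get(code, str(code))  # falls back to the raw number

rows = [("Amsterdam", 1), ("The Hague", None), ("Rotterdam", 7)]
rendered = [(name, render_status(code)) for name, code in rows]

# The kind of check a browser pass makes on the visible text.
KNOWN_LABELS = set(STATUS_LABELS.values())
findings = [name for name, text in rendered if text not in KNOWN_LABELS]
```

Each line of the diff reads as reasonable. The problem only shows in what the user sees.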
The skill you'll use
/browser-testing
  • Rubric first: tell it the flow, what “passing” looks like, and what counts as broken. Without that, it just says “all fine.”
  • Drives the app: it opens the page, clicks, types, and reads the screen, the console, and the network.
  • Reproduce, then verify: reproduce, inspect, diagnose, fix, then run the flow again and confirm it is clean.
  • Then save it: once the flow is right, it writes a Playwright test that re-runs on every change.
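
A rubric is easier to write when you think of it as data: the flow, what must be visible, and what must never be. The sketch below is one possible shape. The deck does not show the input format /browser-testing actually accepts, so every class and field name here is an assumption.

```python
# Hypothetical rubric shape. Field names are assumptions, not the
# real /browser-testing format.
from dataclasses import dataclass, field

@dataclass
class Rubric:
    flow: list[str]              # the steps a user would take
    passing: list[str]           # text that must be visible at the end
    broken: list[str]            # text that must never be visible
    allow_console_errors: bool = False

@dataclass
class Observation:
    visible_text: str
    console_errors: list[str] = field(default_factory=list)

def judge(rubric: Rubric, seen: Observation) -> list[str]:
    """Return findings; an empty list means the flow passed the rubric."""
    findings = []
    for text in rubric.passing:
        if text not in seen.visible_text:
            findings.append(f"missing: {text}")
    for text in rubric.broken:
        if text in seen.visible_text:
            findings.append(f"broken: {text}")
    if seen.console_errors and not rubric.allow_console_errors:
        findings.append(f"console errors: {len(seen.console_errors)}")
    return findings

rubric = Rubric(
    flow=["click the pencil", "change the title", "save"],
    passing=["Buy oat milk"],
    broken=["null", "undefined"],
)
clean = judge(rubric, Observation(visible_text="Buy oat milk"))
dirty = judge(rubric, Observation(visible_text="null", console_errors=["TypeError"]))
```

With no `passing` or `broken` entries, `judge` returns an empty list for any page, which is the “all fine” answer the slide warns about.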
▶ Live demo · presenter runs this · audience is next

Now prove the edit screen in the browser.

01 / Now · Watch
Presenter: the same edit feature as track one. Now the screen.
  • Build the edit screen: the form for the rules track one already proved
  • Give it a rubric: click the pencil, change the title, save, and watch the row update
  • Let it drive the real app: through Claude in Chrome, the way a user would
  • Save the passing flow as a test: a Playwright run that re-checks it
02 / Next · You
Then you prove your own feature’s screen the same way, with /browser-testing.
Exercise B · hands on · 20 min

Prove your feature in the browser.

Take the feature whose rules you built in Exercise A. Build a simple screen for it, then run /browser-testing to prove it in a real browser.

01 Build a simple screen: the UI for the rules you proved in Exercise A
02 Give /browser-testing a rubric: the flow to try, what passing looks like, what counts as broken
03 Let it drive and report: through Claude in Chrome, like a user, with screenshots
Debrief
  • Did it find anything that broke in the real UI?
  • What could it not test on its own?
You're done when
An agent drove your screen in a real browser and reported back. Bonus: save the passing flow as a Playwright test.
Section 05 / 5 · the last word

You are here: Specify → Generate → Comprehend
Where we are · the pieces, snapping in

What we added today.

Document: this loop compounds next (a later session)
Specify: grill-me · grill-with-docs
Generate: the agent
Comprehend: tests · browser
01 Align: grill-me
02 Plan: grill-with-docs
03 Execute: the loop
04 Review: /tdd + browser ✓ today
Four steps in. The next sessions add the rest: many agents at once, agents that run on their own, and the loop that compounds.
Close · what you own now


The team now owns two skills: /tdd, which makes the agent prove its code, and /browser-testing, which proves the product. Both run inside the agent's loop, on every task.

The agent does the checking now. You still own the judgment: what to build, and whether it's right.

Decide Act Observe