Meeting Select the workshop · testing

AI writes more and more of our code. So we need a way to be sure it works. The code and the product. Today we make the agent prove it.

The rule of the day: short talks, long labs. Every section ends at your keyboard.

Get ready · have these ready

What to have ready.

Claude Code installed & working: Signed in, and it opens on your machine.

The Claude in Chrome plugin installed: It lets the agent open and click a real browser. We use it this afternoon. See the next slide to install it.

Your app running on your machine: So the agent can open it in the browser.

A small feature or bug, ready to build: Real, not a toy. Small enough to finish today.

Get ready · install this now

Install Claude in Chrome.

This is a Chrome plugin that lets the agent drive a real browser. It can open your app, click, type, and read the screen, the console, and the network. We use it this afternoon to test the product the way a user sees it.

Install it from code.claude.com/docs/en/chrome

Install the Chrome extension, then connect it to Claude Code. Do this before the afternoon lab.

The workflow · what we've built so far

Where we are in the workflow.

↻ Document: decisions become context for the next task (a later session)

Specify: you · the front

Generate: the agent

Comprehend: you · the back

01 Align: grill until you agree ✓ grill-me

02 Plan: the spec it follows ✓ grill-with-docs

03 Execute: the agent builds ✓ the loop

04 Review: prove it works, then merge ← today

We built the front in session one. Today we make the back provable: tests for the code, the browser for the product.

The loop · how an agent works

An agent works in a loop.

↻ repeat

01 Decide: pick the next step

→

02 Act: run a tool (edit, run, search)

→

03 Observe: look at what happened

Observe is where the agent checks its own work. Take it away and the agent just guesses it is done. Testing is how we make “observe” real.
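
A minimal sketch of that loop, in Python for illustration only. The names `run_loop`, `decide`, `act`, and `observe` are hypothetical and are not part of Claude Code; the point is just that the loop stops on what observe reports, not on what the agent believes.

```python
from dataclasses import dataclass, field


@dataclass
class LoopResult:
    done: bool
    steps: int
    log: list = field(default_factory=list)


def run_loop(decide, act, observe, max_steps=5):
    """Repeat decide -> act -> observe until observe reports success."""
    log = []
    for step in range(1, max_steps + 1):
        action = decide(log)      # 01 Decide: pick the next step
        outcome = act(action)     # 02 Act: run a tool
        ok = observe(outcome)     # 03 Observe: look at what happened
        log.append(f"{action} -> {outcome} ({'green' if ok else 'red'})")
        if ok:
            return LoopResult(True, step, log)
    return LoopResult(False, max_steps, log)


# Toy stand-ins (assumptions): each "edit" bumps a counter, and the
# "test" only passes once the third edit has landed.
attempts = {"n": 0}


def decide(log):
    return f"edit #{len(log) + 1}"


def act(action):
    attempts["n"] += 1
    return attempts["n"]


def observe(outcome):
    return outcome >= 3  # stands in for "run the test"


result = run_loop(decide, act, observe)

# Take observe away (always say "fine") and the loop stops after one step.
guess = run_loop(decide, act, lambda _outcome: True)
```

With a real observe step the loop needs three passes here. With observe stubbed out it declares itself done after the first edit, which is the guessing described above.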

The problem

The agent writes the code.
Who checks it works?

By hand: slow, you read every line

all read

With agents: faster than anyone reads

Part is read. No one checks the rest.

The fix

Make the agent prove its own work.

Two tracks · today's shape

Two kinds of quality.

Track one: code quality

A ✓  B ✓  C ✓

Tests prove each part does what we asked.

→ assemble & ship

Track two: product quality

A ✓  B ✓  C ✓  ✗

The browser walks the whole flow and hits the broken join.

Every part can pass. The product can still break where the parts join. So we check both: the parts and the whole.

Section 03 / 5: track one · make it prove its own work

You are here: Specify › Generate › Comprehend (TDD)

Why it fits · back to the agent's loop

Why test-first fits the loop.

↻ until it’s green

01 Decide: plan the next slice

→

02 Act: write the code

→

03 Observe: run the test, red or green

It is the agent's observe: The loop ends in observe. A test makes that a fact: it passes, or it fails.

Write it first: “Done” is fixed before the agent writes a line. The test is the target it aims at.

It scales past reading: More code lands than anyone can read. The test keeps you sure without reading every line.

Test-first · red · green · refactor

Write the test first.

Red: Write one failing test. It says what “done” means.

→

Green: Write the least code to make it pass.

→

Refactor: Clean it up. Keep the test green.

Red proves the test can fail before the code exists. Green is the least code that makes it pass. Refactor tidies up while the test holds the behavior in place.

How we do it

How we run TDD.

✓ Our way: one slice at a time

01 Grill first: agree what “done” means before any code

02 One test, then one bit of code: the least code to pass it, then repeat ↻

Each test checks real behavior, and builds on the last.

✗ The default: all the tests up front

test test test test test test

The mistake agents make. Tests written in one batch check imagined behavior.

The skill you'll use

/tdd

Grills you: It asks what “done” means and which behaviors matter, before any code.

One slice: One failing test, then the least code to pass. Repeat, never in big batches.

Red · green · refactor: It runs the cycle for you and never refactors while a test is red.

Code only: It tests the logic, not the screen. The screen gets proven in track two, with /browser-testing.

▶ Live demo: presenter runs this · audience is next

Spec the feature, then build it test-first.

01 / Now: Watch

Presenter: build “edit an existing to-do” live, in two steps, in this order.

Start from the working to-do app: it already runs and passes its tests
First, grill-with-docs the feature: spec “let users edit a to-do” before any code
Then run /tdd from that spec: red, then green, one slice at a time
Read the test, not just the check: does it match the spec we just wrote?

→

02 / Next: You

Then you do the same two steps on your own feature: grill-with-docs, then /tdd.

Exercise A →

hands on · 30 min

The catch · the agent can cheat its own test

The agent can fake a passing test.

booking_tests.cs · the test went red, so the agent “fixed” the test

// a new booking should still start as "pending"

−Assert.Equal("pending", booking.Status);

+Assert.Equal("cancelled", booking.Status);

// now the test agrees with the broken code

✓ 214 passed, but the bug is still there

The rule: Read the test changes more carefully than the code. When one agent writes both the code and the test, it can change or delete the test until it passes. A green check over many edited tests proves nothing until you read the edits.

The counter: Break the code on purpose. A real test fails when the code is wrong. Coverage only says a line ran. Mutation testing says the test would notice if that line were wrong.
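
A hand-rolled sketch of that difference, in Python for illustration. Real mutation-testing tools generate the broken variants automatically; here the mutant is written by hand, and every name (`new_booking_status`, `mutant_status`, `weak_test`, `strong_test`, `survives`) is hypothetical.

```python
def new_booking_status():
    """The real code: a new booking starts as "pending"."""
    return "pending"


def mutant_status():
    """The code broken on purpose: same shape, wrong value."""
    return "cancelled"


def weak_test(status_fn):
    # The line under test runs, so coverage counts it,
    # but nothing checks the value it returned.
    status = status_fn()
    assert isinstance(status, str)


def strong_test(status_fn):
    # Pins the behavior, so a wrong value turns the test red.
    assert status_fn() == "pending"


def survives(test, mutant):
    """True when the mutant still passes, so the test failed to notice."""
    try:
        test(mutant)
        return True
    except AssertionError:
        return False


# Both tests are green on the real code.
weak_test(new_booking_status)
strong_test(new_booking_status)

weak_misses_bug = survives(weak_test, mutant_status)
strong_misses_bug = survives(strong_test, mutant_status)
```

Both tests give the same coverage of `new_booking_status`. Only the strong one goes red when the code is broken on purpose.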

Exercise A: hands on · 30 min

Spec it, then build it test-first.

Two steps, in order. First write the spec with grill-with-docs. Then let /tdd build it, one slice at a time.

01 Grill-with-docs your feature: answer its questions until the spec is written

02 Read the spec back: this is what /tdd will build, so make it right

03 Run /tdd from the spec: one failing test, least code to pass, repeat

04 Read the tests back: do they match the spec you wrote?


You're done when

Your feature is built from a spec you wrote, with passing tests you saw go red before green.

Section 04 / 5: track two · test it like a user

You are here: Specify › Generate › Comprehend (browser)

Track two · why unit tests aren't enough

Why unit tests aren’t enough.

A unit test: checks one part

↻ Code ↔ Test

The agent wrote both. They agree by design.

vs

The browser: checks the product

→ the running app

A real user, on the real screen.

Code review tells you what changed. Only the browser shows what the page does when a real user loads it.

How it works · Claude in Chrome

The agent drives a real browser.

Reproduce: Open the page. Do the action a user would.

Inspect: Read the screen, the console, the network calls.

Diagnose: Compare what you see to what you expect.

Fix & verify: Change the code. Run the flow again. Confirm it's clean.

What it catches · a real example

A bug only the browser catches.

/planner/bookings

Amsterdam · room A: Confirmed

Utrecht · room B: Confirmed

The Hague · room C: null

Rotterdam · room D: 7

Browser agent · blocking: Status shows null. A diff that refactors the booking status codes can look fine when you read it. But the agent opens the page and sees the status column showing null or a raw number instead of “Confirmed.” It files that as a finding that blocks the merge.
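
A minimal sketch of how a bug like this can slip past unit tests, in Python for illustration. The status codes, the label table, and the function names are all assumptions made up for the example; they are not the real planner code.

```python
# Part A (hypothetical): booking logic after the status-code refactor.
CONFIRMED = 7  # assumed new code; in this sketch the old code was 1
PENDING = 0


def status_code(booking):
    return CONFIRMED if booking["confirmed"] else PENDING


# Part B (hypothetical): the label lookup, still keyed on the old codes.
LABELS = {1: "Confirmed", 0: "Pending"}


def status_label(code):
    return LABELS.get(code)  # unknown codes come back as None


# The join: what the status column actually renders.
def render_row(booking):
    return f"{booking['name']}: {status_label(status_code(booking))}"


# Each part passes its own unit test.
assert status_code({"name": "x", "confirmed": True}) == CONFIRMED
assert status_label(1) == "Confirmed"

# The whole flow does not.
row = render_row({"name": "The Hague · room C", "confirmed": True})
```

Here `row` ends in `None`, Python's counterpart of the null on the page. Neither unit test is wrong; they just never exercise the join between the two parts.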

The skill you'll use

/browser-testing

Rubric first: Tell it the flow, what “passing” looks like, and what counts as broken. Without that, it just says “all fine.”

Drives the app: It opens the page, clicks, types, and reads the screen, the console, and the network.

Reproduce, then verify: Reproduce, inspect, diagnose, fix, then run the flow again and confirm it is clean.

Then save it: Once the flow is right, it writes a Playwright test that re-runs on every change.
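
To make “rubric first” concrete, here is a hedged sketch in Python of a rubric as data, checked against what was read from the screen, the console, and the network. This is not the format /browser-testing uses; `Rubric`, `Observation`, and `diagnose` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    must_show: list      # what "passing" looks like on the screen
    must_not_show: list  # what counts as broken


@dataclass
class Observation:
    screen_text: str
    console_errors: list = field(default_factory=list)
    failed_requests: list = field(default_factory=list)


def diagnose(obs, rubric):
    """Compare what was seen to what was expected; return the findings."""
    findings = []
    for text in rubric.must_show:
        if text not in obs.screen_text:
            findings.append(f"missing on screen: {text!r}")
    for text in rubric.must_not_show:
        if text in obs.screen_text:
            findings.append(f"broken value on screen: {text!r}")
    findings += [f"console error: {e}" for e in obs.console_errors]
    findings += [f"failed request: {r}" for r in obs.failed_requests]
    return findings


rubric = Rubric(must_show=["Buy oat milk"], must_not_show=["null", "undefined"])

clean = diagnose(Observation(screen_text="Buy oat milk"), rubric)
broken = diagnose(
    Observation(
        screen_text="Title: null",
        console_errors=["TypeError: title is undefined"],
    ),
    rubric,
)
```

With an empty rubric, `diagnose` would return no findings for the broken page either, apart from the console error. That is the “all fine” failure mode the rubric exists to prevent.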

▶ Live demo: presenter runs this · audience is next

Now prove the edit screen in the browser.

01 / Now: Watch

Presenter: the same edit feature as track one. Now the screen.

Build the edit screen: the form for the rules track one already proved
Give it a rubric: click the pencil, change the title, save, and watch the row update
Let it drive the real app: through Claude in Chrome, the way a user would
Save the passing flow as a test: a Playwright run that re-checks it

→

02 / Next: You

Then you prove your own feature’s screen the same way, with /browser-testing.

Exercise B →

hands on · 20 min

Exercise B: hands on · 20 min

Prove your feature in the browser.

Take the feature whose rules you built in Exercise A. Build a simple screen for it, then run /browser-testing to prove it in a real browser.

01 Build a simple screen: the UI for the rules you proved in Exercise A

02 Give /browser-testing a rubric: the flow to try, what passing looks like, what counts as broken

03 Let it drive and report: through Claude in Chrome, like a user, with screenshots

Debrief

Did it find anything that broke in the real UI?
What could it not test on its own?


You're done when

An agent drove your screen in a real browser and reported back. Bonus: save the passing flow as a Playwright test.

Section 05 / 5: the last word

You are here: Specify › Generate › Comprehend ✓

Where we are · the pieces, snapping in

What we added today.

↻ Document: this loop compounds next (a later session)

Specify: grill-me · grill-with-docs

Generate: the agent

Comprehend: tests · browser

01 Align: grill-me ✓

02 Plan: grill-with-docs ✓

03 Execute: the loop ✓

04 Review: /tdd + browser ✓ today

Four steps in. The next sessions add the rest: many agents at once, agents that run on their own, and the loop that compounds.

Close · what you own now

The team now owns two skills: /tdd, which makes the agent prove its code, and /browser-testing, which proves the product. Both run inside the agent's loop, on every task.

The agent does the checking now. You still own the judgment: what to build, and whether it's right.

Decide Act Observe ■

TIME.
