Q-Learning: A Tiny Agent Finds the Goal

Episode 1 · Step 0 · Position (0,0) · Episode reward 0 · Goals reached 0 / Traps hit 0

Legend: bad (Q<0) · unknown · good (Q>0) · robot

Play

Drive it yourself

Pick a direction. The robot still learns from whatever happens.

Learning settings

α 0.50

Learning rate — how strongly each try overwrites the old number. Big α = changes mind fast.

γ 0.90

Future discount: how much a reward is worth when it arrives one step later instead of now. With γ = 0.90, €10 now is worth €9 if it comes one step later and €8.10 if it comes two steps later, and so on.

ε 0.20

Curiosity — chance to ignore the best known action and try something random. Needed at the start; lower it later.

What just happened?

Click "Take 1 step" or one of the arrow buttons to start. Each step, we'll show the exact math the robot does to its memory.

(no steps yet)

The update rule, in one line: Q(S,A) ← Q(S,A) + α × [ r + γ × max Q(S') − Q(S,A) ]. Here max Q(S') is the biggest of the four numbers stored for the new square S'.
Important: r is the immediate reward from the step you just took (−1 normal step, −5 trap, +10 goal); it is not the curiosity ε. These values come from the Reward rules at the top, and you can change them live.
The part inside the brackets is the surprise: how much better (or worse) things went than the robot expected.
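As a worked example of the rule above, here is a minimal Python sketch using the default settings shown on this page (α = 0.50, γ = 0.90). The function name `q_update` and its parameter names are made up for illustration; they are not taken from the page's own code.

```python
def q_update(q_old, reward, best_next, alpha=0.5, gamma=0.9):
    """Return the new Q(S,A) after one application of the update rule.

    q_old     -- the old number Q(S,A)
    reward    -- r, the immediate reward for the step just taken
    best_next -- max Q(S'), the biggest number in the new square
    """
    target = reward + gamma * best_next   # what now seems true
    surprise = target - q_old             # the part inside the brackets
    return q_old + alpha * surprise

# A normal step (r = -1) out of an unknown square (Q = 0) into a square
# whose best arrow is currently worth 10:
# 0 + 0.5 * (-1 + 0.9 * 10 - 0) = 4.0
normal_step = q_update(q_old=0.0, reward=-1.0, best_next=10.0)

# Stepping onto the goal (r = +10); assuming nothing follows the goal,
# best_next is 0: 0 + 0.5 * (10 + 0 - 0) = 5.0
goal_step = q_update(q_old=0.0, reward=10.0, best_next=0.0)

print(normal_step, goal_step)
```

With α = 0.50 the number moves exactly halfway from the old guess to the new target on every update.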

Episode story

One line per finished try. Green = reached the Goal. Red = fell in the Trap. The path is the exact sequence of arrows the robot took from Start. A ★ marks a new best (fewest steps to the Goal).

No episodes finished yet. Press "Run whole episode" (or keep pressing "Take 1 step") and the first story will appear here.

The robot's memory (Q-table): one number per (square, action)

Green = "this move looks good." Red = "this move leads somewhere bad." White = "I don't know yet." The square with a blue outline is the one just updated.

State	↑ Up	↓ Down	← Left	→ Right

Learning progress: is it getting better?

Reward per episode (higher is better)

Steps per episode (lower is better)

How does this actually work?

The robot keeps a memory table with one number per (square, arrow). That number is its guess for "how many points will I end up with if I take this arrow from this square and then keep playing well?"
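One way to picture the memory table is as a dictionary with one entry per square and four numbers per entry. This is only a sketch: the grid size `SIZE`, the action names, and the helper `best_action` are assumptions made for illustration, not the page's actual data structure.

```python
ACTIONS = ["up", "down", "left", "right"]
SIZE = 4  # hypothetical grid size, chosen only for this example

# Every (square, arrow) pair starts at 0, meaning "I don't know yet".
q_table = {
    (row, col): {action: 0.0 for action in ACTIONS}
    for row in range(SIZE)
    for col in range(SIZE)
}

def best_action(q_table, square):
    """Return the arrow with the biggest number in this square."""
    values = q_table[square]
    return max(values, key=values.get)

# Pretend the robot has learned that going right from the start is good.
q_table[(0, 0)]["right"] = 1.5
print(best_action(q_table, (0, 0)))  # right
```

A nested dictionary keeps the lookup readable; a 3-D array indexed by row, column and action would hold the same numbers.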

Every step, two things happen:

1) It picks an arrow. Most of the time it picks the arrow with the biggest number (greedy). Sometimes — with chance ε — it picks a random arrow instead (curiosity). Without curiosity, the robot never discovers paths its table currently thinks are bad.

2) It updates one number. After moving and seeing what happened, it compares:

What I expected: the old number Q(S,A).
What actually seems true now: the reward I just got, plus a discounted peek at the best number in the new square, r + γ × max Q(S').

The gap between those two is the surprise. The robot nudges its old number a little bit (a fraction α) toward the better estimate. Small α = cautious learner. Big α = jumpy learner.
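Step 1 above (mostly greedy, sometimes curious) is commonly called ε-greedy selection. The sketch below shows one way to write it in Python. The function name `choose_action`, the dictionary layout, and the random tie-breaking are assumptions for illustration; the page does not say how it breaks ties.

```python
import random

def choose_action(values, epsilon=0.2, rng=random):
    """Pick an arrow from one row of the memory table.

    values maps each arrow name to its current number, for example
    {"up": 0.0, "down": -1.0, "left": 0.0, "right": 2.5}.
    """
    if rng.random() < epsilon:
        # Curiosity: ignore the table and pick any arrow at random.
        return rng.choice(list(values))
    # Greedy: pick the biggest number, breaking ties at random so an
    # all-zero table does not always send the robot the same way.
    best = max(values.values())
    ties = [arrow for arrow, value in values.items() if value == best]
    return rng.choice(ties)

row = {"up": 0.0, "down": -1.0, "left": 0.0, "right": 2.5}
print(choose_action(row, epsilon=0.2, rng=random.Random(0)))
```

Because the curious pick can also land on the best arrow, ε = 0.20 with four arrows means the best arrow is chosen 0.80 + 0.20 / 4 = 85% of the time in this sketch.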

Why does this work? At first every number is zero and the robot wanders randomly. The moment it touches the goal, one number gets a big positive surprise (the Goal reward): the number for the move that entered the goal. The next time the robot steps into the square next to the goal, the number for that step gets updated using the now-positive value there. Episode after episode, the good news propagates backward, like a wave in reverse, from the goal all the way to the start. Eventually even the first move has a positive number, and just "follow the biggest arrow" solves the maze.
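The backward wave can be reproduced in a few lines of Python. The sketch below uses a deliberately simplified world, not the grid on this page: a hypothetical corridor of five squares with only "left" and "right", no trap, the start at square 0 and the goal at square 4. The rewards (−1 per step, +10 at the goal) and the settings α, γ, ε match the defaults shown above; everything else is an assumption.

```python
import random

N = 5                                   # squares 0..4; square 4 is the goal
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
STEP_REWARD, GOAL_REWARD = -1.0, 10.0
MOVES = {"left": -1, "right": +1}

def train(episodes, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N) for a in MOVES}
    for _ in range(episodes):
        s = 0
        while s != N - 1:
            # Step 1: epsilon-greedy choice, ties broken at random.
            if rng.random() < EPSILON:
                a = rng.choice(list(MOVES))
            else:
                best = max(q[(s, m)] for m in MOVES)
                a = rng.choice([m for m in MOVES if q[(s, m)] == best])
            s_next = min(max(s + MOVES[a], 0), N - 1)   # walls block the move
            done = s_next == N - 1
            r = GOAL_REWARD if done else STEP_REWARD
            # The goal ends the episode, so nothing follows it.
            best_next = 0.0 if done else max(q[(s_next, m)] for m in MOVES)
            # Step 2: nudge the old number toward the new estimate.
            q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
            s = s_next
    return q

for episodes in (1, 3, 200):
    q = train(episodes)
    print(episodes, [round(q[(s, "right")], 2) for s in range(N - 1)])
```

After a single episode only the "right" number next to the goal is positive (it is exactly 5.0, half of the goal reward). After 200 episodes the "right" numbers settle near 4.58, 6.2, 8.0 and 10.0 from start to goal, each one being −1 + 0.9 × the next.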

Try this: press "Forget all", then "Auto-train". Watch the squares near the goal turn green first, and the green slowly creep up toward the top-left.