Pick a direction. The robot still learns from whatever happens.
Learning rate (α): how strongly each try overwrites the old number. Big α = changes its mind fast.
Future discount (γ): how much a reward is worth if it arrives later. For example, with γ = 0.9, €10 now is worth €9 one step later and €8.10 two steps later.
Curiosity (ε): chance to ignore the best known action and try something random. Needed at the start; lower it later.
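The €10 / €9 / €8.10 figures in the future-discount note correspond to multiplying by 0.9 for every step of delay. A minimal sketch of that arithmetic, assuming γ = 0.9 (the function name `discounted_value` is made up for illustration):

```python
# Assumption: gamma = 0.9, the discount implied by the 10 / 9 / 8.10 example.
GAMMA = 0.9

def discounted_value(reward, steps_away, gamma=GAMMA):
    """Value right now of a reward that arrives `steps_away` steps later."""
    return reward * gamma ** steps_away

for steps in range(4):
    print(f"{steps} step(s) away: €{discounted_value(10, steps):.2f}")
```

A smaller γ makes the robot care less about rewards that are far away; γ close to 1 makes it plan further ahead.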
Important: r is the immediate reward from the step you just took (−1 normal step, −5 trap, +10 goal) — it is not curiosity ε. These values come from the Reward rules at the top and you can change them live.
The part inside the brackets is the surprise: how much better (or worse) things went than the robot expected.
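The update formula these notes refer to is not reproduced in the text here. Reconstructed from the explanation under "How does this actually work?", and assuming it is the standard Q-learning update, it would read:

$$
Q(S,A) \leftarrow Q(S,A) + \alpha \,\bigl[\, r + \gamma \max_{a'} Q(S',a') - Q(S,A) \,\bigr]
$$

Here $S'$ is the square the robot lands in, and the bracketed term is the surprise.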
One line per finished try. Green = reached the Goal. Red = fell in the Trap. The path is the exact sequence of arrows the robot took from Start. A ★ marks a new best (fewest steps to the Goal).
The robot's memory (Q-table): one number per (square, action)
Green = "this move looks good." Red = "this move leads somewhere bad." White = "I don't know yet." The square with a blue outline is the one just updated.
| State | ↑ Up | ↓ Down | ← Left | → Right |
|---|---|---|---|---|
Learning progress: is it getting better?
How does this actually work?
The robot keeps a memory table with one number per (square, arrow). That number is its guess for "how many points will I end up with if I take this arrow from this square and then keep playing well?"
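As a rough sketch, the memory table can be pictured as a dictionary with one entry per square, each holding one number per arrow. The grid size (4×4), the action names, and the helper `make_q_table` are assumptions for illustration, not the page's actual code:

```python
ACTIONS = ["up", "down", "left", "right"]

def make_q_table(n_rows, n_cols):
    """One number per (square, action), all starting at zero ("I don't know yet")."""
    return {
        (row, col): {action: 0.0 for action in ACTIONS}
        for row in range(n_rows)
        for col in range(n_cols)
    }

q = make_q_table(4, 4)        # hypothetical 4x4 grid
q[(0, 0)]["right"] = 1.5      # the robot's current guess for "go right from the top-left square"
print(q[(0, 0)])
```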
Every step, two things happen:
1) It picks an arrow. Most of the time it picks the arrow with the biggest number (greedy). Sometimes — with chance ε — it picks a random arrow instead (curiosity). Without curiosity, the robot never discovers paths its table currently thinks are bad.
2) It updates one number. After moving and seeing what happened, it compares:
What I expected: the old number Q(S,A).
What actually seems true now: the reward I just got, plus a discounted peek at the best number in the new square, r + γ × max Q(S').
The gap between those two is the surprise. The robot nudges its old number a little bit (a fraction α) toward the better estimate. Small α = cautious learner. Big α = jumpy learner.
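The two steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the page's own code: the function names, the random tie-breaking, and the `done` flag (which treats the square after the Goal or Trap as worth zero) are all choices made here.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def choose_action(q_row, epsilon, rng=random):
    """Step 1: random arrow with chance epsilon, otherwise the biggest number."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row.values())
    # Assumption: break ties randomly so an all-zero row does not always pick "up".
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def update(q, state, action, reward, next_state, alpha, gamma, done):
    """Step 2: nudge Q(S,A) a fraction alpha toward r + gamma * max Q(S')."""
    # Assumption: once the episode has ended there is no future to peek at.
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next      # what actually seems true now
    surprise = target - q[state][action]     # gap to what was expected
    q[state][action] += alpha * surprise
    return surprise

# Usage: square "B" already knows that going right is worth 10.
q = {"A": dict.fromkeys(ACTIONS, 0.0), "B": dict.fromkeys(ACTIONS, 0.0)}
q["B"]["right"] = 10.0
surprise = update(q, "A", "right", reward=-1, next_state="B",
                  alpha=0.5, gamma=0.9, done=False)
print(surprise, q["A"]["right"])  # 8.0 4.0
```

With α = 0.5 the old number moves halfway toward the new estimate of 8, which is why it ends at 4.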
Why does this work? At first every number is zero and the robot wanders randomly. The moment it steps onto the goal, the number for that last move gets a big positive surprise (the Goal reward). The next time the robot is one square further back, the move leading toward the goal is updated using the now-positive neighbour. Episode after episode, the good news propagates backward, like a wave in reverse, from the goal all the way to the start. Eventually even the first move has a positive number, and just "follow the biggest arrow" solves the maze.
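The backward wave can be watched in a stripped-down setting. The sketch below assumes a one-dimensional corridor of 5 squares where the robot can only move right, so each square has a single number. The reward values are the ones quoted in this text (−1 per step, +10 for the Goal); α = 0.5 and γ = 0.9 are hypothetical settings.

```python
# Simplified assumption: a 1-D corridor, squares 0..4, square 4 is the Goal,
# and the only action is "move right".
N_SQUARES = 5
STEP_REWARD, GOAL_REWARD = -1.0, 10.0   # reward values quoted in the text
ALPHA, GAMMA = 0.5, 0.9                 # hypothetical settings

def run_episode(q):
    """Walk from square 0 to the Goal once, updating q in place."""
    for square in range(N_SQUARES - 1):
        next_square = square + 1
        reached_goal = next_square == N_SQUARES - 1
        reward = GOAL_REWARD if reached_goal else STEP_REWARD
        best_next = 0.0 if reached_goal else q[next_square]
        q[square] += ALPHA * (reward + GAMMA * best_next - q[square])

q = [0.0] * (N_SQUARES - 1)   # one number per non-goal square
for episode in range(1, 6):
    run_episode(q)
    print(episode, [round(value, 2) for value in q])
```

After the first episode only the square next to the Goal has a positive number; over the following episodes the positive values spread back toward square 0.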
Try this: press Forget all, then Auto-train. Watch the squares near the goal turn green first, and the green slowly creep up toward the top-left.