This interactive simulation demonstrates Q-Learning, one of the most fundamental algorithms in Reinforcement Learning (RL). Watch an agent learn to navigate a gridworld environment, discovering optimal paths to goals while avoiding obstacles and pits through trial-and-error learning.

What is Reinforcement Learning?

Reinforcement Learning is a paradigm in which an agent learns by interacting with an environment:

The Q-Learning Algorithm

Q-Learning is an off-policy, model-free algorithm that learns the value of state-action pairs directly:
The Q-Function Q(s, a):
Q(s, a) = Expected total reward starting from state s, taking action a, and following optimal policy thereafter
Higher Q-value = better action to take in that state.
The Update Rule (Temporal Difference Learning):
Q(s, a) ← Q(s, a) + α · [R + γ · max_a' Q(s', a') - Q(s, a)]
Update current estimate based on reward received + best future value.
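In code, the update rule can be written as a small helper. This is a minimal sketch; the function and variable names below are illustrative, not taken from the simulation itself:

```python
# Sketch of one tabular Q-learning (TD) update.
# Names (q_update, td_target, td_error) are illustrative assumptions.

def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update to a single Q-value."""
    td_target = reward + gamma * max_q_next   # R + γ · max_a' Q(s', a')
    td_error = td_target - q_sa               # the "surprise"
    return q_sa + alpha * td_error

# First-step example from this page: old Q = 0, R = -1,
# all next-state Q-values are still zero.
print(q_update(0.0, -1.0, 0.0))  # -0.1
```

Running it reproduces the first-step calculation worked out later on this page: the very first update turns a zero entry into -0.1.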
Key Parameters
The Exploration vs Exploitation Dilemma

This is the central challenge in RL:

ε-Greedy Strategy: With probability ε, take a random action; otherwise, take the best-known action. We decay ε over time so the agent explores early and exploits later.

The Bellman Equation

Q-Learning is based on the Bellman Optimality Equation:
Q*(s, a) = E[R_t+1 + γ · max_a' Q*(s', a') | s_t = s, a_t = a]
The optimal Q-value equals the expected immediate reward plus the discounted optimal future value. This recursive relationship allows us to bootstrap - update estimates based on other estimates!

Understanding s, a, s', a' Notation

The Q-Learning formula uses prime (') notation to indicate "next" values:

max Q(s', a') means: "From the new state s', look at ALL possible actions (up, down, left, right) and pick the Q-value of the BEST one." The agent doesn't actually take a' — it just looks ahead to estimate future value.

Q-Values: Per Arrow, Not Per Cell

A common misconception is that Q-values are assigned per cell. In fact, Q is assigned for each arrow (state-action pair) within a cell:

This is why Q-Learning can tell you not just "where" is good, but "which direction to go" from any position. Each arrow in the simulation represents one Q(s, a) value!

Q-Table Initialization

The Q-table stores the learned value of each state-action pair. All Q-values start at 0:
Q-Table (5×5 grid = 25 states × 4 actions = 100 Q-values):
State (0,0): { up: 0, down: 0, left: 0, right: 0 }
State (0,1): { up: 0, down: 0, left: 0, right: 0 }
State (1,0): { up: 0, down: 0, left: 0, right: 0 }
... all 25 states initialized to zeros ...

First Step Calculation Example
Setup: Agent at s=(0,0), takes action a="right", lands at s'=(1,0), Reward R=-1
Q-Learning Update:
Q(s,a) ← Q(s,a) + α · [R + γ · max Q(s',a') - Q(s,a)]
Q(0,0, right) ← 0 + 0.1 × [-1 + 0.9 × max(0, 0, 0, 0) - 0]   (R = -1; all next-state Q-values are still zero!)
Q(0,0, right) ← 0 + 0.1 × [-1 + 0 - 0]
Q(0,0, right) ← -0.1   ✓ First non-zero value!

How Values Propagate Over Time

Initially all Q-values are 0 (dim cyan arrows). As learning progresses:
The Agent-Environment Interaction Loop
The Agent-Environment Loop:
Each Step:
1. Agent observes current state
2. Agent chooses action (ε-greedy)
3. Environment returns new state + reward
4. Agent updates Q-table (learns!)
5. Repeat until terminal state

Rewards at Every Step

Rewards are given at every step, not just at terminal states:
The step cost is crucial: without it, the agent might wander forever. With -1 per step, shorter paths yield higher total reward! Gridworld as a Learning Environment
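The whole interaction loop can be sketched with a toy one-dimensional corridor standing in for the gridworld. Everything below (the environment, reward values, and names) is an illustrative assumption, not the simulation's own code:

```python
# One episode of the agent-environment loop on a toy corridor of states 0..4,
# with the goal at state 4. Actions are -1 (left) and +1 (right).
# STEP_COST and GOAL_REWARD mirror the -1 / +100 scheme described on this page.
import random

STEP_COST, GOAL_REWARD, GOAL = -1.0, 100.0, 4

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, min(GOAL, state + action))
    if next_state == GOAL:
        return next_state, GOAL_REWARD, True
    return next_state, STEP_COST, False

q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (-1, 1)}
state, done = 0, False
epsilon, alpha, gamma = 0.3, 0.1, 0.9
while not done:
    # 1. observe state  2. choose action (ε-greedy)
    if random.random() < epsilon:
        action = random.choice((-1, 1))
    else:
        action = max((-1, 1), key=lambda a: q[(state, a)])
    # 3. environment returns new state + reward
    next_state, reward, done = step(state, action)
    # 4. update Q-table (learns!)
    best_next = max(q[(next_state, a)] for a in (-1, 1))
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    # 5. repeat until terminal state
    state = next_state
```

Note that a reward arrives on every transition, not only at the goal: wandering left at state 0 still costs -1, which is exactly what pushes the Q-values for dead-end moves below zero.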
Why Step Cost (-1)?

The small negative reward for each step is crucial:
🗺️ Environment
Edit Cell Type
⚙️ Learning Parameters
Learning Rate (α)
0.10
Discount Factor (γ)
0.90
Exploration (ε)
0.30
⚡ Simulation
Speed
200ms
📊 Statistics
Episode
0
Step
0
Ep. Reward
0
Avg Reward
0
🎮 Gridworld Environment
Click cells to edit
Q(s, a) ← Q(s, a) + α · [R + γ · max Q(s', a') - Q(s, a)]
Arrows: Cyan = positive Q, Magenta = negative Q. Background: Green/Red heatmap.
Action determination
Run or Step to see how the action is chosen (explore vs. exploit).
🧮 Q-Update Calculation
State (s):
-
Action (a):
-
Next State (s'):
-
Reward (R):
-
Old Q(s,a):
-
max Q(s',a'):
-
TD Target [R + γ·maxQ]:
-
TD Error [Target - Q]:
-
New Q(s,a):
-
Waiting for action...
📝 Action Log
Waiting to start...
Start 🚀
Empty
Wall 🧱
Pit 💀
Goal 🏆
Agent 🤖
Usage Instructions
Understanding the Q-Value Visualization
The "Glass Box" Approach:
Each cell displays four triangles pointing in the direction of each possible action (↑↓←→):
Cell background colors (separate from arrows):
As learning progresses, you'll see cyan triangles "pointing toward the goal" - this is the learned policy emerging!

Experiments to Try
The Learning Process
Episode 1: Agent explores randomly, falls in pits, eventually finds goal by chance
Episode 10: Q-values near the goal become positive; the agent starts "gravitating" toward it
Episode 50: A clear policy emerges - triangles point along the optimal path
Episode 100+: Agent consistently finds an efficient path; ε has decayed

Mathematical Details

TD Error (Temporal Difference Error):

δ = R + γ · max_a' Q(s', a') - Q(s, a)

The "surprise" - the difference between expected and actual value.
Positive δ = outcome better than expected → increase Q
Negative δ = outcome worse than expected → decrease Q

Convergence Guarantee: Q-Learning is proven to converge to the optimal Q* if:
Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!
Key Insight: The agent cannot see rewards before acting — rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not direct reward lookup.
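This distinction is easy to see in code. In the sketch below (the Q-values and names are made up for illustration), the agent consults only its Q-table to decide; no reward appears anywhere in the decision:

```python
# The agent picks an action by looking up learned Q-values ("study notes"),
# never by peeking at rewards. q_row here is a hypothetical table entry.

def choose_action(q_row):
    """Greedy choice: the action with the highest Q-value in this state."""
    return max(q_row, key=q_row.get)

q_row = {"up": -0.3, "down": -0.5, "left": -0.1, "right": 2.4}
print(choose_action(q_row))  # right
```

The reward R only enters the picture afterwards, inside the Q-update, where it revises the "study notes" for next time.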
Timeline of Events:
1. Agent is at state s
2. Choose action based on Q(s, a) ← uses Q-values!
3. Execute action a
4. Environment gives reward R ← reward revealed AFTER
5. Agent arrives at state s'
6. Update Q-value using R ← learning happens

Analogy: Reward is like a teacher's grade — you only get it after submitting your answer. Q-value is like your study notes — you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!); you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Gridworld simulation, rewards ARE known and deterministic (we defined them!). So why use Q-Learning?
Q-Learning shines when:
Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:
Q(s,a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...
(R is known; R', R'', R''', ... are unknown future rewards)

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution — Bootstrapping: We use our current estimate of future rewards!
The Recursive Trick:
Q(s,a) = R + γ · [R' + γ·R'' + γ²·R''' + ...]
(R is known; the bracketed term IS Q(s', a')!)

So: Q(s,a) = R + γ · max Q(s', a')
(R is known; max Q(s', a') is estimated from the Q-table!)

Why Does This Work? Iteration! Each update makes the estimate slightly better:
Episode 1: Q(s,a) = 0 (wrong, but it's a start)
Episode 2: Q(s,a) = -0.5 (a bit better)
Episode 3: Q(s,a) = -0.9 (getting there)
...
Episode N: Q(s,a) = 97.2 (converged to the true value!)

The "Ripple Effect": When the agent reaches the goal, the +100 reward updates nearby Q-values. Those updated Q-values then help update states farther away. The value "ripples" backward through the state space!
The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.
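The ripple effect can be demonstrated end to end on a toy one-dimensional corridor (goal at one end, -1 per step, +100 at the goal). This is a self-contained illustrative sketch under those assumptions, not the simulation's code:

```python
# Bootstrapping demo: repeated TD updates let the +100 goal reward
# "ripple" backward through states 0..3 of a corridor with goal at 4.
import random

GOAL, ALPHA, GAMMA = 4, 0.1, 0.9
q = {(s, a): 0.0 for s in range(GOAL) for a in (-1, 1)}
random.seed(0)  # for a reproducible run

for episode in range(200):
    state = 0
    while state != GOAL:
        action = random.choice((-1, 1))          # pure exploration, for clarity
        nxt = max(0, min(GOAL, state + action))
        reward = 100.0 if nxt == GOAL else -1.0
        # Terminal state has value 0; otherwise bootstrap from the Q-table.
        best_next = 0.0 if nxt == GOAL else max(q[(nxt, a)] for a in (-1, 1))
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = nxt

# Values nearer the goal converge first and stay larger:
print(q[(3, 1)] > q[(2, 1)] > q[(1, 1)] > 0)  # True
```

Even though every estimate starts out wrong (all zeros), each update leans on slightly-less-wrong neighbors, and the rightward Q-values settle toward their true discounted returns (100, then -1 + 0.9·100 = 89, then 79.1, ...).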
Real-World Applications
Extensions Beyond Basic Q-Learning
Key Insights