
Reinforcement Learning: Q-Learning in Gridworld 

This interactive simulation demonstrates Q-Learning, one of the most fundamental algorithms in Reinforcement Learning (RL). Watch an agent learn to navigate a gridworld environment, discovering optimal paths to goals while avoiding obstacles and pits through trial-and-error learning.

What is Reinforcement Learning?

Reinforcement Learning is a paradigm where an agent learns by interacting with an environment:

  • Agent: The learner/decision-maker (🤖 in our simulation)
  • Environment: The world the agent operates in (the grid)
  • State (s): Current position/situation of the agent
  • Action (a): What the agent can do (Up, Down, Left, Right)
  • Reward (R): Feedback signal (+100 for goal, -100 for pit, -1 per step)
  • Policy (π): Strategy for choosing actions (what we're learning!)

The Q-Learning Algorithm

Q-Learning is an off-policy, model-free algorithm that learns the value of state-action pairs directly:

The Q-Function Q(s, a):

Q(s, a) = expected total reward starting from state s, taking action a, and following the optimal policy thereafter.

Higher Q-value = better action to take in that state.

The Update Rule (Temporal Difference Learning):

Q(s, a) ← Q(s, a) + α · [R + γ · max_a' Q(s', a') - Q(s, a)]

Update the current estimate using the reward received plus the best estimated future value.
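As a concrete sketch, the update rule fits in a few lines of Python (a minimal illustration with a dict-based Q-table; the function and variable names are ours, not the simulation's):

```python
def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update; q maps (state, action) -> value."""
    old = q.get((s, a), 0.0)
    # max over a': best estimated value achievable from the next state
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = reward + gamma * best_next       # R + γ · max Q(s', a')
    q[(s, a)] = old + alpha * (td_target - old)  # nudge Q(s, a) toward the target
    return q[(s, a)]

ACTIONS = ("up", "down", "left", "right")
q = {}
# First move from (0,0): step cost -1, all future estimates still zero,
# so the new value is 0 + 0.1 · (-1 + 0.9·0 - 0) = -0.1
q_update(q, (0, 0), "right", -1, (1, 0), ACTIONS)
```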

Key Parameters

Parameter | Symbol | Range | Effect
Learning Rate | α (alpha) | 0 to 1 | How much new information overrides old. High = fast but unstable; low = slow but stable
Discount Factor | γ (gamma) | 0 to 1 | How much to value future rewards. High = farsighted; low = shortsighted
Exploration Rate | ε (epsilon) | 0 to 1 | Probability of taking a random action. Balances exploration vs. exploitation

The Exploration vs Exploitation Dilemma

This is the central challenge in RL:

  • Exploitation: Use current knowledge to maximize reward (go where Q-values are highest)
  • Exploration: Try new actions to discover potentially better paths (random actions)

ε-Greedy Strategy: With probability ε, take a random action; otherwise, take the best known action. We decay ε over time so the agent explores early but exploits later.
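A minimal Python sketch of ε-greedy selection (the names and the decay constant are illustrative, not taken from the simulation's code):

```python
import random

def epsilon_greedy(q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)    # explore
    # exploit: the action with the highest current Q-value
    return max(actions, key=lambda a: q.get((s, a), 0.0))

ACTIONS = ("up", "down", "left", "right")
q = {((0, 0), "right"): 1.0}             # "right" is the best-known action here
epsilon_greedy(q, (0, 0), ACTIONS, epsilon=0.0)   # ε = 0: always exploits

# One possible decay schedule, applied after each episode:
# epsilon = max(0.01, epsilon * 0.995)
```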

The Bellman Equation

Q-Learning is based on the Bellman Optimality Equation:

Q*(s, a) = E[ R_{t+1} + γ · max_a' Q*(s', a') | s_t = s, a_t = a ]

The optimal Q-value equals the expected immediate reward plus the discounted optimal future value.

This recursive relationship allows us to bootstrap - update estimates based on other estimates!

Understanding s, a, s', a' Notation

The Q-Learning formula uses prime (') notation to indicate "next" values:

Symbol | Meaning | Example
s | Current state | Agent is at position (2, 1)
a | Action taken | Agent chooses "right"
s' | Next state (after action) | Agent moves to (3, 1)
a' | Possible next actions | What could the agent do from s'?

max Q(s', a') means: "From the new state s', look at ALL possible actions (up, down, left, right) and pick the Q-value of the BEST one." The agent doesn't actually take a' — it just looks ahead to estimate future value.

Q-Values: Per Arrow, Not Per Cell

A common misconception is that Q-values are assigned per cell. In fact, Q is assigned for each arrow (state-action pair) within a cell:

For example, cell (2,1) holds 4 separate Q-values, one per arrow:

Q((2,1), up)    = 5.2
Q((2,1), down)  = -1.3
Q((2,1), left)  = 2.0
Q((2,1), right) = 12.7   ← best action: RIGHT

This is why Q-Learning can tell you not just "where" is good, but "which direction to go" from any position. Each arrow in the simulation represents one Q(s, a) value!

Q-Table Initialization

The Q-table stores the learned value of each state-action pair. All Q-values start at 0:

Q-Table (5×5 grid = 25 states × 4 actions = 100 Q-values):

State (0,0): { up: 0, down: 0, left: 0, right: 0 }
State (0,1): { up: 0, down: 0, left: 0, right: 0 }
State (1,0): { up: 0, down: 0, left: 0, right: 0 }
... all 25 states initialized to zeros ...
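Building such a table in Python is a one-liner (a sketch; the simulation's internal representation may differ):

```python
GRID_W, GRID_H = 5, 5
ACTIONS = ("up", "down", "left", "right")

# One entry per state-action pair, all initialized to zero
q_table = {((x, y), a): 0.0
           for x in range(GRID_W)
           for y in range(GRID_H)
           for a in ACTIONS}

len(q_table)   # 25 states × 4 actions = 100 Q-values
```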

First Step Calculation Example

Setup: Agent at s=(0,0), takes action a="right", lands at s'=(1,0), Reward R=-1

Q-Learning Update:
Q(s,a) ← Q(s,a) + α · [R + γ · max Q(s',a') - Q(s,a)]

Q(0,0, right) ← 0 + 0.1 × [-1 + 0.9 × max(0,0,0,0) - 0]
               ↑                    ↑
               R               all zeros initially!

Q(0,0, right) ← 0 + 0.1 × [-1 + 0 - 0]
Q(0,0, right) ← -0.1 ✓ First non-zero value!

How Values Propagate Over Time

Initially all Q-values are 0 (dim cyan arrows). As learning progresses:

Episode | What Happens
1-10 | Random exploration; small negative values accumulate from step costs (-0.1, -0.2, ...)
10-50 | Agent occasionally reaches the goal; cells near the goal get positive Q-values
50-100 | Positive values propagate backward through Bellman updates
100+ | A clear gradient emerges: arrows point toward the goal, away from pits

The Agent-Environment Interaction Loop

The loop works as follows: the agent sends an action (a_t) to the environment (the 5×5 grid); the environment returns the next state (s_{t+1}) and a reward (r_{t+1}); the agent uses them to update its Q-table.

Each step:

1. Agent observes the current state
2. Agent chooses an action (ε-greedy)
3. Environment returns the new state + reward
4. Agent updates its Q-table (learns!)
5. Repeat until a terminal state
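The whole loop can be sketched end-to-end on a simplified environment, a 1-D corridor instead of the full 5×5 grid (all names and hyperparameters here are illustrative):

```python
import random

random.seed(0)

STATES = range(5)                    # a 1-D corridor; state 4 is the goal
ACTIONS = ("left", "right")
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.3

def step(s, a):
    """Environment: return (next_state, reward, done)."""
    s_next = min(4, s + 1) if a == "right" else max(0, s - 1)
    if s_next == 4:
        return s_next, 100, True     # goal reached
    return s_next, -1, False         # ordinary step cost

q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for episode in range(500):
    s, done = 0, False
    while not done:
        # 1-2. observe state, choose action (ε-greedy)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: q[(s, x)])
        # 3. environment returns new state + reward
        s_next, r, done = step(s, a)
        # 4. update Q-table (no future value past a terminal state)
        best_next = 0.0 if done else max(q[(s_next, x)] for x in ACTIONS)
        q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
        s = s_next                   # 5. repeat until terminal

policy = [max(ACTIONS, key=lambda x: q[(s, x)]) for s in range(4)]
```

After training, the greedy policy in every non-terminal state should be "right", the shortest path to the goal.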

Rewards at Every Step

Rewards are given at every step, not just terminal states:

  • -1 (step cost): Every move to an empty cell — encourages efficiency
  • -5 (wall penalty): Trying to move into wall/boundary
  • -100 (pit): Terminal state — episode ends
  • +100 (goal): Terminal state — success!

The step cost is crucial: without it, the agent might wander forever. With -1 per step, shorter paths yield higher total reward!

Gridworld as a Learning Environment

Cell Type | Icon | Reward | Terminal?
Start | 🚀 | -1 (step cost) | No
Empty | - | -1 (step cost) | No
Wall | 🧱 | -5 (blocked) | No (can't enter)
Pit | 💀 | -100 | Yes (episode ends)
Goal | 🏆 | +100 | Yes (success!)
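The reward table above can be captured as a small config (illustrative names; in the simulation this role is played by CONFIG.rewards):

```python
# Reward for (attempting to) enter each cell type. A wall move yields -5
# and the agent stays put; pit and goal end the episode.
REWARDS = {"start": -1, "empty": -1, "wall": -5, "pit": -100, "goal": 100}
TERMINAL = {"pit", "goal"}

def reward_for(cell):
    """Return (reward, episode_ends) for a move into the given cell type."""
    return REWARDS[cell], cell in TERMINAL

reward_for("goal")   # (100, True)
reward_for("empty")  # (-1, False)
```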

Why Step Cost (-1)?

The small negative reward for each step is crucial:

  • Without it, the agent might wander forever
  • It encourages efficiency - finding the shortest path
  • Creates urgency to reach the goal quickly

[Interactive simulation: a 5×5 gridworld with editable cell types (Start 🚀, Empty, Wall 🧱, Pit 💀, Goal 🏆), sliders for Learning Rate (α), Discount Factor (γ), Exploration (ε), and simulation speed, live statistics (episode, step, episode reward, average reward), a Q-Update Calculation panel showing s, a, s', R, the TD target and TD error, and an action log. Arrows: cyan = positive Q, magenta = negative Q; cell background: green/red heatmap.]

Usage Instructions

  1. Select an Environment: Use the dropdown to choose a preset grid layout (Simple, Walls, Cliff, Maze, Dangerous).
  2. Edit the Grid (Optional):
    • Select a cell type (Wall, Pit, Goal) from the buttons
    • Click on grid cells to toggle them
    • Note: Start position cannot be changed
  3. Adjust Parameters:
    • Learning Rate (α): Higher = faster learning but less stable
    • Discount Factor (γ): Higher = values future rewards more
    • Exploration (ε): Higher = more random exploration
    • Speed: Lower ms = faster simulation
  4. Run the Simulation:
    • ▶ Run: Start continuous learning
    • Step: Execute one step at a time
    • ↺ Agent: Reset agent position only (keep Q-table)
    • Reset All: Clear everything including learned values
  5. Observe the Learning:
    • Triangles: Show Q-values for each direction (brighter = higher value)
    • Cell Colors: Green = positive max Q-value, Red = negative
    • Agent Path: Watch how the agent learns to navigate
    • Hover on Cells: See exact Q-values for each action

Understanding the Q-Value Visualization

The "Glass Box" Approach:

Each cell displays four triangles pointing in the direction of each possible action (↑↓←→):

  • Cyan arrows: Positive Q-value (good action to take)
  • Magenta arrows: Negative Q-value (leads to bad outcomes)
  • Brighter = higher magnitude Q-value

Cell background colors (separate from arrows):

  • Green background: High max Q-value (good to be here)
  • Red background: Negative max Q-value (dangerous area)

As learning progresses, you'll see cyan triangles "pointing toward the goal" - this is the learned policy emerging!

Experiments to Try

Experiment | Setup | What to Observe
Basic Learning | Simple Path preset, default parameters | Watch triangles gradually point toward the goal
High Exploration | Set ε = 0.8 | Agent wanders randomly, slow to converge
Low Exploration | Set ε = 0.05 | Agent commits early, might miss better paths
Shortsighted | Set γ = 0.1 | Agent only cares about immediate rewards
Cliff Walking | Cliff preset, watch carefully | Agent learns to avoid the "cliff" of pits
Fast Learning | Set α = 0.5, speed = 50 ms | Rapid Q-value changes, might be unstable

The Learning Process

Episode 1: Agent explores randomly, falls in pits, eventually finds goal by chance
Episode 10: Q-values near goal become positive, agent starts "gravitating" toward it
Episode 50: Clear policy emerges - triangles point toward optimal path
Episode 100+: Agent consistently finds efficient path, ε has decayed

Mathematical Details

TD Error (Temporal Difference Error):

δ = R + γ · max_a' Q(s', a') - Q(s, a)

The "surprise" - difference between expected and actual value.
Positive δ = outcome better than expected → increase Q
Negative δ = outcome worse than expected → decrease Q

Convergence Guarantee: Q-Learning is proven to converge to optimal Q* if:

  • All state-action pairs are visited infinitely often
  • Learning rate decreases appropriately (∑α = ∞, ∑α² < ∞)
  • In practice: ε-greedy exploration ensures sufficient exploration
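One schedule that meets these conditions is a per-pair harmonic decay, sketched here (the simulation itself uses a fixed α):

```python
visits = {}

def alpha_for(s, a):
    """Harmonic schedule α_t = 1/t, tracked per state-action pair.
    Satisfies ∑ α = ∞ and ∑ α² < ∞ (the conditions above)."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    return 1.0 / visits[(s, a)]

alpha_for((0, 0), "right")   # 1.0 on the first visit
alpha_for((0, 0), "right")   # 0.5 on the second
```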

Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!

Common Saying | Precise Statement
"Agent moves toward reward" | Agent moves toward high Q-value
"Agent avoids negative reward" | Agent avoids low Q-value (learned from past negative rewards)
"Reward guides behavior" | Reward trains Q; Q guides behavior

Key Insight: The agent cannot see rewards before acting; rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not a direct reward lookup.
Timeline of Events:

1. Agent is at state s
2. Choose action based on Q(s, a) ← Uses Q-values!
3. Execute action a
4. Environment gives reward R ← Reward revealed AFTER
5. Agent arrives at state s'
6. Update Q-value using R ← Learning happens

Analogy: Reward is like a teacher's grade — you only get it after submitting your answer. Q-value is like your study notes — you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!), you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Gridworld simulation, rewards ARE known and deterministic (we defined them!). So why use Q-Learning?

Entity | Knows Rewards? | Notes
We (designers) | ✓ Yes | We defined CONFIG.rewards
Agent (true RL) | ✗ No | Must discover through experience
Agent (our code) | Could cheat... | But we don't let it!

Q-Learning shines when:

  • Stochastic transitions: Same action → different outcomes
  • Unknown environment: Robot exploring new building
  • Delayed rewards: Chess — win/lose only at end
  • Complex reward shaping: Cumulative, discounted rewards

Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:

Q(s,a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...
     ↑     ↑       ↑         ↑
   known  unknown  unknown  unknown

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution — Bootstrapping: We use our current estimate of future rewards!

The Recursive Trick:

Q(s,a) = R + γ · [R' + γ·R'' + γ²·R''' + ...]
         ↑        └───────────┬───────────┘
      known!              This IS Q(s', a')!

So: Q(s,a) = R + γ · max Q(s', a')
               ↑              ↑
            known         estimated from Q-table!

Why Does This Work? Iteration! Each update makes the estimate slightly better:

Episode 1:   Q(s,a) = 0      (wrong, but it's a start)
Episode 2:   Q(s,a) = -0.5   (a bit better)
Episode 3:   Q(s,a) = -0.9   (getting there)
...
Episode N:   Q(s,a) = 97.2   (converged to true value!)

The "Ripple Effect": When the agent reaches the goal, the +100 reward updates nearby Q-values. Those updated Q-values then help update states farther away. The value "ripples" backward through the state space!

The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.
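The ripple effect can be seen numerically on a 3-cell corridor: updating front-to-back with α = 1 (a simplification for clarity), the goal's +100 reaches one additional cell per sweep:

```python
GAMMA = 0.9
# Toy corridor: cell 0 -> cell 1 -> cell 2 -> goal. Moving right costs -1,
# except the final move into the goal, which pays +100.
q_right = [0.0, 0.0, 0.0]        # Q(cell, right) for cells 0, 1, 2

history = []
for sweep in range(3):
    for s in (0, 1, 2):          # front-to-back, so the ripple is visible
        r = 100 if s == 2 else -1
        future = 0.0 if s == 2 else q_right[s + 1]
        q_right[s] = r + GAMMA * future   # α = 1: full replacement
    history.append(list(q_right))

# Mathematically, the values after each sweep are:
#   sweep 1: [-1,    -1, 100]   the goal's value appears
#   sweep 2: [-1.9,  89, 100]   ...and reaches cell 1
#   sweep 3: [79.1,  89, 100]   ...and finally cell 0
```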

Real-World Applications

Domain | Application
Robotics | Navigation, manipulation, walking robots
Games | Game AI (Atari, Go, Chess), NPC behavior
Finance | Portfolio optimization, trading strategies
Healthcare | Treatment planning, drug dosing
Autonomous Vehicles | Decision making, path planning
Recommendation Systems | Content personalization, ad placement

Extensions Beyond Basic Q-Learning

  • SARSA: On-policy variant (uses actual next action, not max)
  • Deep Q-Networks (DQN): Neural network approximates Q-function
  • Double DQN: Addresses overestimation bias
  • Policy Gradient: Directly optimize policy instead of values
  • Actor-Critic: Combines value and policy learning
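The SARSA difference is essentially one line: its TD target uses the Q-value of the action actually taken next, not the max over actions (a sketch with illustrative names):

```python
def q_learning_target(q, s_next, actions, reward, gamma):
    # Off-policy: assumes the best-known action will be taken from s'
    return reward + gamma * max(q.get((s_next, a), 0.0) for a in actions)

def sarsa_target(q, s_next, a_next, reward, gamma):
    # On-policy: uses the action a' the agent actually chose from s'
    return reward + gamma * q.get((s_next, a_next), 0.0)

q = {((1, 0), "right"): 10.0, ((1, 0), "up"): 2.0}
q_learning_target(q, (1, 0), ("up", "right"), -1, 0.9)   # -1 + 0.9·10 = 8.0
sarsa_target(q, (1, 0), "up", -1, 0.9)                   # -1 + 0.9·2  = 0.8
```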

Key Insights

  • Credit Assignment: How does the agent know which actions led to reward? Q-Learning propagates value backwards through the Bellman equation.
  • Delayed Rewards: The discount factor γ handles rewards that come many steps later.
  • Bootstrapping: We update estimates using other estimates - powerful but can propagate errors.
  • Tabular vs Function Approximation: This demo uses a table; real-world uses neural networks.