MDP Visualization Demo

An interactive tool for exploring Markov Decision Processes (MDPs), Value Iteration, and Reinforcement Learning concepts.

Open standalone app

About This Demo

This tool lets you build and visualize Markov Decision Processes and solve them using Dynamic Programming algorithms. Use it to explore:

  • Bellman Equations: How values propagate through state-action-state transitions
  • Value Iteration: Iteratively improving state-value estimates with Bellman backups
  • Q-Values: Action-value functions Q(s,a) for state-action pairs
  • Stochastic Transitions: Actions with probabilistic outcomes
  • Optimal Policies: Computing the best action at each state

The Bellman Equation

The Bellman optimality equation for an MDP is:

V(s) = max_a Q(s,a)
Q(s,a) = Σ_s' P(s'|s,a) · [r + γ·V(s')]

Where:

  • V(s) = value of state s
  • Q(s,a) = action-value for taking action a in state s
  • P(s'|s,a) = probability of transitioning to s' after taking action a in state s
  • r = immediate reward
  • γ (gamma) = discount factor (0 to 1)
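
To make the equations concrete, here is a minimal Python sketch for a hypothetical one-action, two-state MDP (the states, values, and outcomes below are invented for illustration):

```python
# Hypothetical MDP: from state "s0", one action reaches "s1" with
# probability 0.8 (reward +1) or stays in "s0" with probability 0.2 (reward 0).
gamma = 0.9
V = {"s0": 0.0, "s1": 10.0}                      # current value estimates
outcomes = [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]  # (s', P(s'|s,a), r) triples

# Q(s,a) = sum over outcomes of P(s'|s,a) * (r + gamma * V(s'))
q = sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
# V(s) = max over actions; with a single action, V("s0") = Q(s0, a)
print(round(q, 3))
```

Here Q = 0.8·(1 + 0.9·10) + 0.2·(0 + 0.9·0) = 8.0.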

How to Use

Building an MDP

  1. Add States: Double-click empty canvas to create a new state node (circle)
  2. Add Actions: Double-click a state to create an action from that state (square)
  3. Create Transitions: Shift+drag from a state to another state to add a transition
  4. Edit Properties: Use the properties panel to modify:
    • State rewards and labels
    • Mark states as Initial (blue arrow) or Terminal (double circle)
    • Action labels
    • Transition probabilities and rewards

Selection & Movement

  • Click a node to select it
  • Shift+Click to add to selection
  • Cmd/Ctrl+Click to toggle selection
  • Drag box for area selection
  • Drag selected nodes to reposition them
  • Cmd/Ctrl+A to select all

Running Algorithms

  1. Value Iteration: Click "Value Iteration (5)" to run 5 iterations

    • Watch animated Bellman backups
    • See values converge toward the optimal values
    • Particles show how Q(s,a) is computed from successor states
  2. Test Lookahead: Demonstrates a single Bellman backup animation

    • Particles travel forward to next states (weighted by probability)
    • Return backward with backup values
    • Shows how Q-values are computed
  3. Backup Selected: Run Bellman backup only on selected states

    • Useful for step-by-step exploration

Display Options

Toggle these to show/hide information:

  • Values: Show V(s) below each state
  • Q-Values: Show Q(s,a) next to each action node
  • Probs: Show transition probabilities on edges
  • Labels: Show node labels
  • Optimal Policy: Highlight the best action at each state

Keyboard Shortcuts

  • Up/Down Arrows: Adjust reward of selected state by ±1
  • Delete/Backspace: Remove selected nodes/edges
  • Cmd/Ctrl+Z: Undo last action
  • Cmd/Ctrl+C/V: Copy and paste nodes
  • Right-click state: Context menu to mark as Initial/Terminal/Regular
  • Escape: Cancel connection or clear selection

Example Templates

Load pre-built MDPs from the dropdown:

  • Stochastic: Decision tree with stochastic outcomes
  • Binary: Pure decision tree (deterministic)
  • Cyclical: MDP with cycles (non-tree structure)
  • Grid World: Navigation grid with 4-directional movement
  • Cliff Walk: Classic cliff walking problem

Key Concepts

States (Circles)

  • Regular: gray circles
  • Initial: marked with blue arrow
  • Terminal: double circle
  • Reward: green (+) or red (−) number inside

Actions (Squares)

  • Belong to a specific state
  • Can have multiple stochastic outcomes
  • Q-values shown when toggled on

Transitions (Arrows)

  • From action to next state
  • Show probability and reward
  • Multiple outcomes = stochastic action

Value Iteration Animation

  • Forward phase: Particles travel to successor states
  • Backward phase: Values return to update Q(s,a)
  • Green particles = positive value, Red = negative, Blue = zero

Optimal Policy

  • Greedy with respect to Q-values: π*(s) = argmax_a Q(s,a)
  • Highlighted in green when toggled on
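
The greedy extraction rule can be sketched in a few lines of Python (the Q-table below is a made-up example, not the demo's internal format):

```python
# Hypothetical Q-table mapping (state, action) -> Q-value.
Q = {("s0", "left"): 1.5, ("s0", "right"): 2.0, ("s1", "stay"): 0.0}

def greedy_policy(Q, state):
    """pi*(s) = argmax_a Q(s, a), taken over the actions available in `state`."""
    actions = [a for (s, a) in Q if s == state]
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy(Q, "s0"))  # -> right
```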

MDP Structure

The demo uses a state-action-state representation:

State → Action → Outcomes
  s      a       [(s₁, p₁, r₁), (s₂, p₂, r₂), ...]
  • Each action belongs to exactly one state
  • Actions can have multiple probabilistic outcomes
  • Probabilities must sum to 1.0
  • Rewards can be on states or transitions
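
One plausible way to encode this structure in code (illustrative only; the demo's actual data model may differ) is a nested dictionary mapping each state to its actions, and each action to its outcome list:

```python
# mdp[state][action] -> list of (next_state, probability, reward) outcomes.
mdp = {
    "s0": {
        "a": [("s1", 0.7, 1.0), ("s2", 0.3, -1.0)],  # stochastic action
        "b": [("s2", 1.0, 0.0)],                      # deterministic action
    },
    "s1": {},  # terminal state: no actions
    "s2": {},
}

# Sanity check: each action's outcome probabilities must sum to 1.0.
for state, actions in mdp.items():
    for action, outcomes in actions.items():
        assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```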

Algorithm Details

Value Iteration

Iteratively updates state values until convergence:

For each iteration:
  For each state s:
    For each action a:
      Compute Q(s,a) = Σ P(s'|s,a)·[r + γ·V(s')]
    Update V(s) = max_a Q(s,a)
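
The pseudocode above translates directly into a short Python function, assuming the state → action → outcomes dictionary described under "MDP Structure" (a sketch; terminal states have no actions and keep value 0):

```python
def value_iteration(mdp, gamma=0.9, iterations=5):
    """Run a fixed number of synchronous Bellman backups.

    `mdp` maps state -> action -> list of (next_state, prob, reward).
    """
    V = {s: 0.0 for s in mdp}
    for _ in range(iterations):
        new_V = {}
        for s, actions in mdp.items():
            if not actions:  # terminal state
                new_V[s] = 0.0
                continue
            # V(s) = max_a Q(s,a), with Q(s,a) the probability-weighted backup
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for outcomes in actions.values()
            )
        V = new_V  # synchronous update: all states use the previous V
    return V

mdp = {"s0": {"a": [("goal", 1.0, 1.0)]}, "goal": {}}
print(value_iteration(mdp)["s0"])  # -> 1.0
```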

Bellman Backup (Lookahead)

Single backup computation for Q(s,a):

  1. Send particles forward to successor states (weighted by probability)
  2. Collect values V(s') at each successor
  3. Return particles backward with backup values
  4. Compute Q(s,a) = Σ P(s'|s,a)·[r + γ·V(s')]
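
Steps 1–3 drive the animation; numerically, the backup reduces to the weighted sum in step 4, which might look like this (a sketch using the same illustrative outcome-list format as above):

```python
def bellman_backup(outcomes, V, gamma=0.9):
    """Compute Q(s,a) from an action's (next_state, prob, reward) outcomes."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)

V = {"s1": 2.0, "s2": 0.0}
q = bellman_backup([("s1", 0.5, 1.0), ("s2", 0.5, 0.0)], V)
print(q)  # 0.5*(1 + 0.9*2) + 0.5*(0 + 0) = 1.4
```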

Things to Try

  • Create your own MDPs and predict the optimal policy before running Value Iteration
  • Compare stochastic vs deterministic transitions — how does randomness affect values?
  • Experiment with different reward structures and discount factors (γ)
  • Use "Backup Selected" to step through the algorithm one state at a time
  • Try the Grid World template and observe how values propagate from the goal

References

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  • Bellman, R. (1957). Dynamic Programming. Princeton University Press.