MDP Visualization Demo

An interactive tool for exploring Markov Decision Processes (MDPs), Value Iteration, and Reinforcement Learning concepts.

Open standalone app

About This Demo

This tool lets you build and visualize Markov Decision Processes and solve them using Dynamic Programming algorithms. Use it to explore:

  • Bellman Equations: How values propagate through state-action-state transitions
  • Value Iteration: Iteratively improving state-value estimates with Bellman backups
  • Q-Values: Action-value functions Q(s,a) for state-action pairs
  • Stochastic Transitions: Actions with probabilistic outcomes
  • Optimal Policies: Computing the best action at each state

The Bellman Equation

The Bellman optimality equation for an MDP is:

V(s) = max_a Q(s,a)
Q(s,a) = Σ_s' P(s'|s,a) · [r + γ·V(s')]

Where:

  • V(s) = value of state s
  • Q(s,a) = action-value for taking action a in state s
  • P(s'|s,a) = probability of transitioning to s' after taking action a in state s
  • r = immediate reward
  • γ (gamma) = discount factor (0 to 1)
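
To make the equations concrete, here is a minimal Python sketch for a hypothetical one-action, two-state MDP (the states, values, and outcomes below are invented for illustration):

```python
# Hypothetical MDP: from state "s0", one action reaches "s1" with
# probability 0.8 (reward +1) or stays in "s0" with probability 0.2 (reward 0).
gamma = 0.9
V = {"s0": 0.0, "s1": 10.0}                      # current value estimates
outcomes = [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]  # (s', P(s'|s,a), r) triples

# Q(s,a) = sum over outcomes of P(s'|s,a) * (r + gamma * V(s'))
q = sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
# V(s) = max over actions; with a single action, V("s0") = Q(s0, a)
print(round(q, 3))
```

Here Q = 0.8·(1 + 0.9·10) + 0.2·(0 + 0.9·0) = 8.0.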

How to Use

Building an MDP

  1. Add States: Double-click empty canvas to create a new state node (circle)
  2. Add Actions: Double-click a state to create an action from that state (square)
  3. Create Transitions: Shift+drag from a state to another state to add a transition
  4. Edit Properties: Use the properties panel to modify:
    • State rewards and labels
    • Mark states as Initial (blue arrow) or Terminal (double circle)
    • Action labels
    • Transition probabilities and rewards

Selection & Movement

  • Click a node to select it
  • Shift+Click to add to selection
  • Cmd/Ctrl+Click to toggle selection
  • Drag box for area selection
  • Drag selected nodes to reposition them
  • Cmd/Ctrl+A to select all

Running Algorithms

  1. Value Iteration: Click "Value Iteration (5)" to run 5 iterations

    • Watch animated Bellman backups
    • See values converge toward the optimal values
    • Particles show how Q(s,a) is computed from successor states
  2. Test Lookahead: Demonstrates a single Bellman backup animation

    • Particles travel forward to next states (weighted by probability)
    • Return backward with backup values
    • Shows how Q-values are computed
  3. Backup Selected: Run Bellman backup only on selected states

    • Useful for step-by-step exploration

Display Options

Toggle these to show/hide information:

  • Values: Show V(s) below each state
  • Q-Values: Show Q(s,a) next to each action node
  • Probs: Show transition probabilities on edges
  • Labels: Show node labels
  • Optimal Policy: Highlight the best action at each state

Keyboard Shortcuts

  • Up/Down Arrows: Adjust reward of selected state by ±1
  • Delete/Backspace: Remove selected nodes/edges
  • Cmd/Ctrl+Z: Undo last action
  • Cmd/Ctrl+C/V: Copy and paste nodes
  • Right-click state: Context menu to mark as Initial/Terminal/Regular
  • Escape: Cancel connection or clear selection

Example Templates

Load pre-built MDPs from the dropdown:

  • Stochastic: Decision tree with stochastic outcomes
  • Binary: Pure decision tree (deterministic)
  • Cyclical: MDP with cycles (non-tree structure)
  • Grid World: Navigation grid with 4-directional movement
  • Cliff Walk: Classic cliff walking problem

Key Concepts

States (Circles)

  • Regular: gray circles
  • Initial: marked with blue arrow
  • Terminal: double circle
  • Reward: green (+) or red (−) number inside

Actions (Squares)

  • Belong to a specific state
  • Can have multiple stochastic outcomes
  • Q-values shown when toggled on

Transitions (Arrows)

  • From action to next state
  • Show probability and reward
  • Multiple outcomes = stochastic action

Value Iteration Animation

  • Forward phase: Particles travel to successor states
  • Backward phase: Values return to update Q(s,a)
  • Green particles = positive value, Red = negative, Blue = zero

Optimal Policy

  • Greedy with respect to Q-values: π*(s) = argmax_a Q(s,a)
  • Highlighted in green when toggled on
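
The greedy extraction rule can be sketched in a few lines of Python (the Q-table below is a made-up example, not the demo's internal format):

```python
# Hypothetical Q-table mapping (state, action) -> Q-value.
Q = {("s0", "left"): 1.5, ("s0", "right"): 2.0, ("s1", "stay"): 0.0}

def greedy_policy(Q, state):
    """pi*(s) = argmax_a Q(s, a), taken over the actions available in `state`."""
    actions = [a for (s, a) in Q if s == state]
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy(Q, "s0"))  # -> right
```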

MDP Structure

The demo uses a state-action-state representation:

State → Action → Outcomes
  s      a       [(s₁, p₁, r₁), (s₂, p₂, r₂), ...]
  • Each action belongs to exactly one state
  • Actions can have multiple probabilistic outcomes
  • Probabilities must sum to 1.0
  • Rewards can be on states or transitions
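
One plausible way to encode this structure in code (illustrative only; the demo's actual data model may differ) is a nested dictionary mapping each state to its actions, and each action to its outcome list:

```python
# mdp[state][action] -> list of (next_state, probability, reward) outcomes.
mdp = {
    "s0": {
        "a": [("s1", 0.7, 1.0), ("s2", 0.3, -1.0)],  # stochastic action
        "b": [("s2", 1.0, 0.0)],                      # deterministic action
    },
    "s1": {},  # terminal state: no actions
    "s2": {},
}

# Sanity check: each action's outcome probabilities must sum to 1.0.
for state, actions in mdp.items():
    for action, outcomes in actions.items():
        assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```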

Algorithm Details

Value Iteration

Iteratively updates state values until convergence:

For each iteration:
  For each state s:
    For each action a:
      Compute Q(s,a) = Σ P(s'|s,a)·[r + γ·V(s')]
    Update V(s) = max_a Q(s,a)
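
The pseudocode above translates directly into a short Python function, assuming the state → action → outcomes dictionary described under "MDP Structure" (a sketch; terminal states have no actions and keep value 0):

```python
def value_iteration(mdp, gamma=0.9, iterations=5):
    """Run a fixed number of synchronous Bellman backups.

    `mdp` maps state -> action -> list of (next_state, prob, reward).
    """
    V = {s: 0.0 for s in mdp}
    for _ in range(iterations):
        new_V = {}
        for s, actions in mdp.items():
            if not actions:  # terminal state
                new_V[s] = 0.0
                continue
            # V(s) = max_a Q(s,a), with Q(s,a) the probability-weighted backup
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for outcomes in actions.values()
            )
        V = new_V  # synchronous update: all states use the previous V
    return V

mdp = {"s0": {"a": [("goal", 1.0, 1.0)]}, "goal": {}}
print(value_iteration(mdp)["s0"])  # -> 1.0
```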

Bellman Backup (Lookahead)

Single backup computation for Q(s,a):

  1. Send particles forward to successor states (weighted by probability)
  2. Collect values V(s') at each successor
  3. Return particles backward with backup values
  4. Compute Q(s,a) = Σ P(s'|s,a)·[r + γ·V(s')]
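
Steps 1–3 drive the animation; numerically, the backup reduces to the weighted sum in step 4, which might look like this (a sketch using the same illustrative outcome-list format as above):

```python
def bellman_backup(outcomes, V, gamma=0.9):
    """Compute Q(s,a) from an action's (next_state, prob, reward) outcomes."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)

V = {"s1": 2.0, "s2": 0.0}
q = bellman_backup([("s1", 0.5, 1.0), ("s2", 0.5, 0.0)], V)
print(q)  # 0.5*(1 + 0.9*2) + 0.5*(0 + 0) = 1.4
```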

Things to Try

  • Create your own MDPs and predict the optimal policy before running Value Iteration
  • Compare stochastic vs deterministic transitions — how does randomness affect values?
  • Experiment with different reward structures and discount factors (γ)
  • Use "Backup Selected" to step through the algorithm one state at a time
  • Try the Grid World template and observe how values propagate from the goal

References

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  • Bellman, R. (1957). Dynamic Programming. Princeton University Press.