Markov Decision Processes Interactive App
Controls & Display
Examples:
Default
Deterministic Tree
Stochastic Graph
Daw Two-Step
Cyclical MDP
Grid World
Cliff Walk
Risky Chain
Discount rate (γ): 0.9
Speed: 1.0x
Reset
Rewards
Probabilities
State Values
Q-Values
Greedy Policy
Style
Algorithms
Expected Backup
Sample Backup
Prediction (Random Policy)
Policy Evaluation
Monte Carlo Estimation
TD(λ)
Control
Value Iteration
Monte Carlo Control
SARSA
Q-Learning
Dyna-Q
Building the MDP
Add states:
Double-click empty space
Add actions:
Double-click a state
Connect:
Shift+drag from an action to a state
Set type:
Right-click a state to mark as Initial, Terminal, or Regular
Edit rewards:
Select a state, then Up/Down arrows (±1)
Edit probabilities:
Select an edge, then Up/Down arrows (±0.05)
Delete:
Select items, then Delete/Backspace
Selection & Editing
Select:
Click (single), Shift+Click (add), Cmd/Ctrl+Click (toggle)
Box select:
Click and drag on empty space
Select all:
Cmd/Ctrl+A
Move:
Drag selected state nodes
Undo:
Cmd/Ctrl+Z
Copy/Paste:
Cmd/Ctrl+C (selection or full graph JSON) / Cmd/Ctrl+V
Algorithms
Expected Backup:
Select states, then compute Q(s,a) = Σ P(s'|s,a)[R + γV(s')] over the full transition distribution
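A minimal sketch of the expected backup in Python; the state names, values, and transition list below are made up for illustration:

```python
GAMMA = 0.9  # matches the app's default discount rate

def expected_backup(transitions, V, gamma=GAMMA):
    # Full-distribution backup: Q(s,a) = sum over s' of P(s'|s,a) * (R + gamma * V(s')).
    # transitions is a list of (prob, reward, next_state) triples for one action.
    return sum(p * (r + gamma * V[s2]) for (p, r, s2) in transitions)

# Hypothetical action: 70% chance of reaching s1 (reward 1), 30% of s2 (reward 0).
V = {"s1": 2.0, "s2": 0.5}
q = expected_backup([(0.7, 1.0, "s1"), (0.3, 0.0, "s2")], V)
# q = 0.7*(1 + 0.9*2.0) + 0.3*(0 + 0.9*0.5) = 2.095
```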
Sample Backup:
Select states, sample one successor, and apply the TD update Q ← Q + α[R + γV(s') − Q]
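A sketch of the sample backup, assuming the same (prob, reward, next_state) transition format as above and a made-up step size α:

```python
import random

def sample_backup(Q, key, transitions, V, alpha=0.5, gamma=0.9, rng=random):
    # Draw ONE successor from P(.|s,a), then nudge Q(s,a) toward the
    # TD target R + gamma * V(s') with step size alpha.
    weights = [p for (p, _, _) in transitions]
    (_, r, s2), = rng.choices(transitions, weights=weights, k=1)
    Q[key] += alpha * (r + gamma * V[s2] - Q[key])

# With a single deterministic outcome the sampled target is R + gamma*V(s').
Q = {("s0", "a"): 0.0}
sample_backup(Q, ("s0", "a"), [(1.0, 1.0, "s1")], {"s1": 2.0})
# Q(s0,a) moves halfway toward 1 + 0.9*2.0 = 2.8, giving 1.4
```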
Value Iteration:
Full dynamic-programming sweeps (Bellman optimality backups over all states) until convergence
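A compact value-iteration sketch; the MDP encoding (a dict of state → {action: outcome list}, with empty dicts for terminal states) is an assumption for illustration:

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    # P maps state -> {action: [(prob, reward, next_state), ...]}.
    # Terminal states map to an empty dict and keep value 0.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, acts in P.items():
            if not acts:
                continue  # terminal state
            best = max(
                sum(p * (r + gamma * V[s2]) for p, r, s2 in outcomes)
                for outcomes in acts.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop when the largest update is tiny
            return V

# Two-state chain: s0 -> s1 (reward 0), s1 -> terminal T (reward 1).
P = {"s0": {"go": [(1.0, 0.0, "s1")]},
     "s1": {"go": [(1.0, 1.0, "T")]},
     "T": {}}
V = value_iteration(P)
# V(s1) = 1.0, V(s0) = gamma * 1.0 = 0.9
```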
TD(λ):
Temporal difference with eligibility traces (λ=0 is TD(0), λ=1 is MC-like)
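One step of TD(λ) with accumulating eligibility traces might look like this sketch (α, γ, λ values are arbitrary defaults, not the app's):

```python
def td_lambda_step(V, traces, s, r, s2, alpha=0.1, gamma=0.9, lam=0.8):
    # Every recently visited state shares in the current TD error,
    # scaled by its eligibility trace; traces then decay by gamma*lambda.
    delta = r + gamma * V.get(s2, 0.0) - V[s]  # missing s2 = terminal, value 0
    traces[s] = traces.get(s, 0.0) + 1.0       # accumulating trace
    for state in traces:
        V[state] += alpha * delta * traces[state]
        traces[state] *= gamma * lam

V, traces = {"a": 0.0, "b": 0.0}, {}
td_lambda_step(V, traces, "a", 1.0, "b")  # TD error 1.0 credits only 'a'
td_lambda_step(V, traces, "b", 1.0, "T")  # 'a' still gets credit via its trace
```

With λ = 0 the trace dies immediately after each step (TD(0)); with λ = 1 credit propagates all the way back, behaving like Monte Carlo.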
Monte Carlo:
Learn values from the returns of complete sampled episodes (estimation or control)
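A first-visit Monte Carlo estimation sketch; the episode format (a list of (state, reward) pairs) is assumed for illustration:

```python
def mc_first_visit(V, counts, episode, gamma=0.9):
    # episode: [(state, reward received after leaving it), ...].
    # Walk backwards to accumulate returns; overwriting the dict entry
    # keeps each state's FIRST-visit return, which is averaged into V.
    G, returns = 0.0, {}
    for s, r in reversed(episode):
        G = r + gamma * G
        returns[s] = G
    for s, g in returns.items():
        counts[s] = counts.get(s, 0) + 1
        V[s] = V.get(s, 0.0) + (g - V.get(s, 0.0)) / counts[s]  # running mean

V, counts = {}, {}
mc_first_visit(V, counts, [("a", 0.0), ("b", 1.0)])
# Return from b is 1.0; return from a is 0 + 0.9*1.0 = 0.9
```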
SARSA:
On-policy TD control (bootstraps from the action actually taken)
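The SARSA update in one line of Python (α and γ are arbitrary illustrative defaults; unseen state–action pairs are assumed to start at 0):

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    # On-policy: the bootstrap target uses the action a2 the agent
    # actually takes in s2, not the greedy one.
    Q[(s, a)] += alpha * (r + gamma * Q.get((s2, a2), 0.0) - Q[(s, a)])

Q = {("s0", "left"): 0.0}
sarsa_update(Q, "s0", "left", 1.0, "s1", "right")  # Q(s1,right) defaults to 0
# Q(s0,left) moves halfway toward 1.0, giving 0.5
```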
Q-Learning:
Off-policy TD control (bootstraps from the max over next actions)
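The Q-learning update differs from SARSA only in the bootstrap target; this sketch assumes an `actions` dict listing each state's available actions:

```python
def q_learning_update(Q, actions, s, a, r, s2, alpha=0.5, gamma=0.9):
    # Off-policy: bootstrap from max over next actions, regardless of
    # which action the behavior policy actually takes in s2.
    nxt = max((Q.get((s2, b), 0.0) for b in actions.get(s2, [])), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * nxt - Q[(s, a)])

Q = {("s0", "a"): 0.0, ("s1", "x"): 2.0, ("s1", "y"): 1.0}
actions = {"s1": ["x", "y"]}
q_learning_update(Q, actions, "s0", "a", 1.0, "s1")
# Target is 1 + 0.9 * max(2.0, 1.0) = 2.8, so Q(s0,a) becomes 1.4
```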
Dyna-Q:
Model-based RL combining Q-learning with n simulated planning steps per real step
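A minimal Dyna-Q sketch, assuming a deterministic learned model (each (s,a) stores its last observed reward and successor); parameter values are illustrative:

```python
import random

def dyna_q_step(Q, model, actions, s, a, r, s2,
                n_planning=5, alpha=0.5, gamma=0.9, rng=random):
    def q_update(s, a, r, s2):
        # Ordinary Q-learning backup; terminal/unknown states bootstrap 0.
        nxt = max((Q.get((s2, b), 0.0) for b in actions.get(s2, [])), default=0.0)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * nxt - old)

    q_update(s, a, r, s2)          # 1. learn from the real transition
    model[(s, a)] = (r, s2)        # 2. record it in the model
    for _ in range(n_planning):    # 3. replay n simulated transitions
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        q_update(ps, pa, pr, ps2)

Q, model = {("s0", "a"): 0.0}, {}
dyna_q_step(Q, model, {}, "s0", "a", 1.0, "T")
# 1 real + 5 planning updates toward target 1.0: Q = 1 - 0.5**6 = 0.984375
```

The planning loop is why Dyna-Q converges in far fewer real steps than plain Q-learning on the same graph.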
Stop:
Click a running algorithm's button to stop it