What is a Markov Decision Process?
Posted on 2024-12-28 00:38
Markov Decision Process (MDP) Tutorial
What is an MDP?
A Markov Decision Process (MDP) is a framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is the foundational mathematical model underlying Reinforcement Learning (RL), one of the major branches of Artificial Intelligence. An MDP is defined by:
- States (S): All possible states of the system.
- Actions (A): Possible actions the decision-maker can take.
- Transition Probabilities (P): The probability of moving from one state to another, given an action.
- Rewards (R): Immediate reward received after transitioning from one state to another via an action.
- Policy (π): A strategy mapping states to actions.
The goal is to find an optimal policy (π*) that maximizes the cumulative reward over time.
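To make these five components concrete, here is a minimal sketch of how they might be written down in Python for a small made-up maintenance problem (all state names, actions, probabilities, and rewards below are invented purely for illustration; this is separate from the grid example used later in the post):

# A toy MDP written as plain Python dictionaries. All names and numbers here
# are invented for illustration; this is not the grid example used later.
states = ["healthy", "broken"]                       # S
actions = ["run", "repair"]                          # A
# P[(s, a)] maps each possible next state to its probability.
P = {
    ("healthy", "run"):    {"healthy": 0.9, "broken": 0.1},
    ("healthy", "repair"): {"healthy": 1.0},
    ("broken",  "run"):    {"broken": 1.0},
    ("broken",  "repair"): {"healthy": 0.6, "broken": 0.4},
}
# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("healthy", "run"): 5,  ("healthy", "repair"): -1,
    ("broken",  "run"): -3, ("broken",  "repair"): -2,
}
# A policy maps each state to an action.
policy = {"healthy": "run", "broken": "repair"}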
Key Properties
- Markov Property: The next state depends only on the current state and action, not on the history of previous states.
- Time Horizon: Can be finite or infinite.
- Discount Factor (γ): A value between 0 and 1 that controls how much future rewards count relative to immediate ones; the closer γ is to 1, the more far-sighted the decision-maker (see the short sketch after this list).
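As a quick illustration (with made-up reward numbers), the snippet below computes the discounted return of the same reward stream under three different values of γ: a reward arriving t steps in the future is weighted by γ^t.

# How the discount factor changes the value of the same reward stream:
# a reward of 10 arriving three steps in the future (illustrative numbers).
rewards = [0, 0, 0, 10]
for gamma in (0.5, 0.9, 0.99):
    G = sum(gamma ** t * r for t, r in enumerate(rewards))
    print(gamma, round(G, 3))   # 1.25, 7.29, 9.703 -- larger gamma values the future more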
Steps in Solving an MDP
- Define the states, actions, rewards, and transition probabilities.
- Compute the optimal value function using either:
- Value Iteration: repeatedly apply the Bellman optimality update to the value estimates until they converge.
- Policy Iteration: alternate between evaluating the current policy and improving it greedily with respect to that evaluation (a small sketch follows this list).
- Extract the optimal policy (π*) by acting greedily with respect to the converged value function.
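The implementation further down uses Value Iteration, so for completeness here is a minimal Policy Iteration sketch. It reuses the toy `states`, `actions`, `P`, and `R` dictionaries from the sketch near the top of the post, so the numbers are again purely illustrative:

# A minimal Policy Iteration sketch. It assumes the toy `states`, `actions`,
# `P`, and `R` dictionaries from the earlier sketch are already defined.
gamma = 0.9
policy = {s: "run" for s in states}        # start from an arbitrary policy
V = {s: 0.0 for s in states}

while True:
    # 1. Policy evaluation: sweep the Bellman expectation update until V settles.
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-6:
            break
    # 2. Policy improvement: act greedily with respect to the evaluated V.
    stable = True
    for s in states:
        q = {a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in actions}
        best_action = max(q, key=q.get)
        if best_action != policy[s]:
            policy[s] = best_action
            stable = False
    if stable:
        break                              # greedy policy no longer changes: it is optimal

print(policy)
print({s: round(v, 1) for s, v in V.items()})

Value Iteration effectively folds these two phases into a single update, which is why the grid-world code below needs only one loop.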
Simple Example
Imagine a robot in a 3x3 grid trying to reach a goal state:
- States (S): Each grid cell (e.g., (0, 0), (2, 2)).
- Actions (A): {Up, Down, Left, Right}.
- Transition Probabilities (P): E.g., if the robot moves "Right" from (0, 0), it lands in (0, 1) with probability 1; a move that would leave the grid keeps the robot in place. (Transitions in this example are deterministic.)
- Rewards (R): +10 for reaching the goal state (2, 2). -1 for each move to encourage efficiency.
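Before writing any code, it is worth checking by hand what the optimal return from the start cell should be. Assuming the discount factor γ = 0.9 used in the implementation below, the shortest path from (0, 0) to (2, 2) takes four moves, and its discounted return is roughly what value iteration should assign to the start state:

gamma = 0.9
# Three -1 moves, then +10 on entering the goal, discounted step by step:
print(-1 + gamma * -1 + gamma ** 2 * -1 + gamma ** 3 * 10)   # -1 - 0.9 - 0.81 + 7.29 = 4.58 (approx.)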
Python Implementation
grid_size = 3
states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
actions = ['Up', 'Down', 'Left', 'Right']
goal_state = (2, 2)
gamma = 0.9        # discount factor
reward_step = -1   # cost of every move
reward_goal = 10   # reward for reaching the goal
theta = 1e-4       # convergence threshold for value iteration
def transition(state, action):
    """Deterministic transition: moves that would leave the grid keep the robot in place."""
    i, j = state
    if state == goal_state:
        return state  # the goal is absorbing
    if action == 'Up':
        return (max(i - 1, 0), j)
    elif action == 'Down':
        return (min(i + 1, grid_size - 1), j)
    elif action == 'Left':
        return (i, max(j - 1, 0))
    elif action == 'Right':
        return (i, min(j + 1, grid_size - 1))
    return state

def reward(state, next_state):
    """-1 per move, +10 when the move reaches the goal."""
    if next_state == goal_state:
        return reward_goal
    return reward_step
# Initialise all state values to 0.
values = {state: 0 for state in states}

def value_iteration():
    """Apply the Bellman optimality update to every state until the values converge."""
    global values
    while True:
        delta = 0
        new_values = values.copy()
        for state in states:
            if state == goal_state:
                continue  # the goal is absorbing; its value stays 0
            v = values[state]
            # Transitions are deterministic, so each action leads to exactly one next state.
            new_values[state] = max(
                reward(state, next_state) + gamma * values[next_state]
                for next_state in (transition(state, action) for action in actions)
            )
            delta = max(delta, abs(v - new_values[state]))
        values = new_values
        if delta < theta:
            break
def extract_policy():
    """Read off the greedy policy with respect to the converged value function."""
    policy = {}
    for state in states:
        if state == goal_state:
            policy[state] = 'Goal'
            continue
        best_action = None
        best_value = float('-inf')
        for action in actions:
            next_state = transition(state, action)
            action_value = reward(state, next_state) + gamma * values[next_state]
            if action_value > best_value:
                best_value = action_value
                best_action = action
        policy[state] = best_action
    return policy
value_iteration()
policy = extract_policy()

print("Value Function:")
for i in range(grid_size):
    print([round(values[(i, j)], 2) for j in range(grid_size)])

print("\nOptimal Policy:")
for i in range(grid_size):
    print([policy[(i, j)] for j in range(grid_size)])
This Python code demonstrates Value Iteration for finding the optimal policy in the 3x3 grid world. Once it converges, the value function rises toward the goal (roughly 4.58 at (0, 0), matching the hand calculation above, up to 10 in the two cells next to (2, 2)), and the extracted policy steers every cell toward (2, 2).
Here's a "real world" example: Autonomous Warehouse Robot.