What is a Markov Decision Process?
Posted on 2024-12-28 00:38
Markov Decision Process (MDP) Tutorial
What is an MDP?
A Markov Decision Process (MDP) is a framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is the foundational mathematical model underlying Reinforcement Learning (RL), one of the major branches of Artificial Intelligence. An MDP is defined by:
- States (S): All possible states of the system.
- Actions (A): Possible actions the decision-maker can take.
- Transition Probabilities (P): The probability of moving from one state to another, given an action.
- Rewards (R): Immediate reward received after transitioning from one state to another via an action.
- Policy (π): A strategy mapping states to actions.
The goal is to find an optimal policy (π*) that maximizes the cumulative reward over time.
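To make these five components concrete, here is a minimal sketch of how they might be written down in Python for a small made-up maintenance problem (all state names, actions, probabilities, and rewards below are invented purely for illustration; this is separate from the grid example used later in the post):

# A toy MDP written as plain Python dictionaries. All names and numbers here
# are invented for illustration; this is not the grid example used later.
states = ["healthy", "broken"]                       # S
actions = ["run", "repair"]                          # A
# P[(s, a)] maps each possible next state to its probability.
P = {
    ("healthy", "run"):    {"healthy": 0.9, "broken": 0.1},
    ("healthy", "repair"): {"healthy": 1.0},
    ("broken",  "run"):    {"broken": 1.0},
    ("broken",  "repair"): {"healthy": 0.6, "broken": 0.4},
}
# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("healthy", "run"): 5,  ("healthy", "repair"): -1,
    ("broken",  "run"): -3, ("broken",  "repair"): -2,
}
# A policy maps each state to an action.
policy = {"healthy": "run", "broken": "repair"}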
Key Properties
- Markov Property: The next state depends only on the current state and action, not on the history of previous states.
- Time Horizon: Can be finite or infinite.
- Discount Factor (γ): A value between 0 and 1 that controls how much future rewards count relative to immediate ones; the closer γ is to 1, the more far-sighted the decision-maker (see the short sketch after this list).
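As a quick illustration (with made-up reward numbers), the snippet below computes the discounted return of the same reward stream under three different values of γ: a reward arriving t steps in the future is weighted by γ^t.

# How the discount factor changes the value of the same reward stream:
# a reward of 10 arriving three steps in the future (illustrative numbers).
rewards = [0, 0, 0, 10]
for gamma in (0.5, 0.9, 0.99):
    G = sum(gamma ** t * r for t, r in enumerate(rewards))
    print(gamma, round(G, 3))   # 1.25, 7.29, 9.703 -- larger gamma values the future more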
Steps in Solving an MDP
- Define the states, actions, rewards, and transition probabilities.
- Compute the optimal value function using either:
- Value Iteration: repeatedly apply the Bellman optimality update to the value estimates until they converge.
- Policy Iteration: alternate between evaluating the current policy and improving it greedily with respect to that evaluation (a small sketch follows this list).
- Extract the optimal policy (π*) by acting greedily with respect to the converged value function.
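The implementation further down uses Value Iteration, so for completeness here is a minimal Policy Iteration sketch. It reuses the toy `states`, `actions`, `P`, and `R` dictionaries from the sketch near the top of the post, so the numbers are again purely illustrative:

# A minimal Policy Iteration sketch. It assumes the toy `states`, `actions`,
# `P`, and `R` dictionaries from the earlier sketch are already defined.
gamma = 0.9
policy = {s: "run" for s in states}        # start from an arbitrary policy
V = {s: 0.0 for s in states}

while True:
    # 1. Policy evaluation: sweep the Bellman expectation update until V settles.
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-6:
            break
    # 2. Policy improvement: act greedily with respect to the evaluated V.
    stable = True
    for s in states:
        q = {a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in actions}
        best_action = max(q, key=q.get)
        if best_action != policy[s]:
            policy[s] = best_action
            stable = False
    if stable:
        break                              # greedy policy no longer changes: it is optimal

print(policy)
print({s: round(v, 1) for s, v in V.items()})

Value Iteration effectively folds these two phases into a single update, which is why the grid-world code below needs only one loop.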
Simple Example
Imagine a robot in a 3x3 grid trying to reach a goal state:
- States (S): Each grid cell (e.g., (0, 0), (2, 2)).
- Actions (A): {Up, Down, Left, Right}.
- Transition Probabilities (P): E.g., if the robot moves "Right" from (0, 0), it lands in (0, 1) with probability 1; a move that would leave the grid keeps the robot in place. (Transitions in this example are deterministic.)
- Rewards (R): +10 for reaching the goal state (2, 2). -1 for each move to encourage efficiency.
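Before writing any code, it is worth checking by hand what the optimal return from the start cell should be. Assuming the discount factor γ = 0.9 used in the implementation below, the shortest path from (0, 0) to (2, 2) takes four moves, and its discounted return is roughly what value iteration should assign to the start state:

gamma = 0.9
# Three -1 moves, then +10 on entering the goal, discounted step by step:
print(-1 + gamma * -1 + gamma ** 2 * -1 + gamma ** 3 * 10)   # -1 - 0.9 - 0.81 + 7.29 = 4.58 (approx.)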
Python Implementation
grid_size = 3
states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
actions = ['Up', 'Down', 'Left', 'Right']
goal_state = (2, 2)
gamma = 0.9        # discount factor
reward_step = -1   # cost of every move
reward_goal = 10   # reward for reaching the goal
theta = 1e-4       # convergence threshold for value iteration
def transition(state, action):
    """Deterministic transition: moves that would leave the grid keep the robot in place."""
    i, j = state
    if state == goal_state:
        return state  # the goal is absorbing
    if action == 'Up':
        return (max(i - 1, 0), j)
    elif action == 'Down':
        return (min(i + 1, grid_size - 1), j)
    elif action == 'Left':
        return (i, max(j - 1, 0))
    elif action == 'Right':
        return (i, min(j + 1, grid_size - 1))
    return state

def reward(state, next_state):
    """-1 per move, +10 when the move reaches the goal."""
    if next_state == goal_state:
        return reward_goal
    return reward_step
# Initialise all state values to 0.
values = {state: 0 for state in states}

def value_iteration():
    """Apply the Bellman optimality update to every state until the values converge."""
    global values
    while True:
        delta = 0
        new_values = values.copy()
        for state in states:
            if state == goal_state:
                continue  # the goal is absorbing; its value stays 0
            v = values[state]
            # Transitions are deterministic, so each action leads to exactly one next state.
            new_values[state] = max(
                reward(state, next_state) + gamma * values[next_state]
                for next_state in (transition(state, action) for action in actions)
            )
            delta = max(delta, abs(v - new_values[state]))
        values = new_values
        if delta < theta:
            break
def extract_policy():
    """Read off the greedy policy with respect to the converged value function."""
    policy = {}
    for state in states:
        if state == goal_state:
            policy[state] = 'Goal'
            continue
        best_action = None
        best_value = float('-inf')
        for action in actions:
            next_state = transition(state, action)
            action_value = reward(state, next_state) + gamma * values[next_state]
            if action_value > best_value:
                best_value = action_value
                best_action = action
        policy[state] = best_action
    return policy
value_iteration()
policy = extract_policy()

print("Value Function:")
for i in range(grid_size):
    print([round(values[(i, j)], 2) for j in range(grid_size)])

print("\nOptimal Policy:")
for i in range(grid_size):
    print([policy[(i, j)] for j in range(grid_size)])
This Python code demonstrates Value Iteration for finding the optimal policy in the 3x3 grid world. Once it converges, the value function rises toward the goal (roughly 4.58 at (0, 0), matching the hand calculation above, up to 10 in the two cells next to (2, 2)), and the extracted policy steers every cell toward (2, 2).
Here's a "real world" example: Autonomous Warehouse Robot.