What is a Markov Decision Process?

Posted on 2024-12-28 00:38


Markov Decision Process (MDP) Tutorial

What is an MDP?

A Markov Decision Process (MDP) is a framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is the mathematical foundation of Reinforcement Learning (RL), one of the major branches of Artificial Intelligence. An MDP is defined by:

  • States (S): All possible states of the system.
  • Actions (A): Possible actions the decision-maker can take.
  • Transition Probabilities (P): The probability of moving from one state to another, given an action.
  • Rewards (R): Immediate reward received after transitioning from one state to another via an action.
  • Policy (π): A strategy mapping states to actions.

The goal is to find an optimal policy (π*) that maximizes the cumulative reward over time.
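
To make the five components concrete, here is a minimal sketch of how a tiny two-state MDP could be written down as plain Python dictionaries. The state names, action names, probabilities, and rewards below are made-up illustrations, not part of the grid-world example that follows.

# A toy MDP with two states and two actions, written out explicitly.
# All names here ('idle', 'working', 'rest', 'work') are illustrative assumptions.
S = ['idle', 'working']
A = ['rest', 'work']

# P[(s, a)] maps each possible next state s' to its probability P(s' | s, a).
P = {
    ('idle', 'rest'):    {'idle': 1.0},
    ('idle', 'work'):    {'working': 0.9, 'idle': 0.1},
    ('working', 'rest'): {'idle': 1.0},
    ('working', 'work'): {'working': 1.0},
}

# R[(s, a, s')] is the immediate reward for that transition.
R = {
    ('idle', 'rest', 'idle'):       0,
    ('idle', 'work', 'working'):    5,
    ('idle', 'work', 'idle'):       0,
    ('working', 'rest', 'idle'):    1,
    ('working', 'work', 'working'): 5,
}

# A policy maps each state to an action.
policy = {'idle': 'work', 'working': 'work'}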

Key Properties

  • Markov Property: The next state depends only on the current state and action, not on the history of previous states.
  • Time Horizon: Can be finite or infinite.
  • Discount Factor (γ): A value between 0 and 1 that weights future rewards relative to immediate ones; the closer γ is to 1, the more the agent cares about long-term reward (see the short example after this list).
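
As a quick illustration of the discount factor, the snippet below computes the discounted return for an arbitrary, made-up reward sequence; the numbers are assumptions chosen only to show the effect of γ.

# Discounted return: G = r_0 + γ*r_1 + γ^2*r_2 + ...
# The reward sequence here is an arbitrary example, not taken from the grid world.
gamma = 0.9
rewards = [-1, -1, -1, 10]   # e.g., three steps costing -1 each, then a goal reward of +10

discounted_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(discounted_return)     # -1 - 0.9 - 0.81 + 7.29 ≈ 4.58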

Steps in Solving an MDP

  1. Define the states, actions, rewards, and transition probabilities.
  2. Evaluate policies using:
    • Value Iteration: Iteratively update value estimates until convergence (the update rule is sketched just after this list).
    • Policy Iteration: Iteratively improve the policy based on the value function.
  3. Find the optimal policy (π*).
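
The Value Iteration step repeatedly applies the Bellman optimality update, V(s) ← max over actions a of Σ over next states s' of P(s' | s, a) · [ R(s, a, s') + γ · V(s') ], until the values stop changing. The sketch below applies one such sweep to the dictionary-style tables from the toy MDP above; it is an illustrative helper under those assumptions, not the grid-world implementation shown later.

def bellman_sweep(V, S, A, P, R, gamma):
    # One sweep of the Bellman optimality backup over every state.
    # P[(s, a)] maps next states to probabilities; R[(s, a, s_next)] is the
    # immediate reward (0 if the transition is not listed).
    new_V = {}
    for s in S:
        new_V[s] = max(
            sum(prob * (R.get((s, a, s_next), 0) + gamma * V[s_next])
                for s_next, prob in P[(s, a)].items())
            for a in A
        )
    return new_V

# Typical use: start from all-zero values and sweep until the values settle, e.g.
# V = {s: 0 for s in S}
# for _ in range(200):
#     V = bellman_sweep(V, S, A, P, R, gamma)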

Simple Example

Imagine a robot in a 3x3 grid trying to reach a goal state:

  • States (S): Each grid cell (e.g., (0, 0), (2, 2)).
  • Actions (A): {Up, Down, Left, Right}.
  • Transition Probabilities (P): E.g., if the robot moves "Right" from (0, 0), it ends in (0, 1) with probability 1. Transitions here are deterministic, and a move that would leave the grid keeps the robot where it is.
  • Rewards (R): +10 for reaching the goal state (2, 2). -1 for each move to encourage efficiency.

Python Implementation

grid_size = 3
states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
actions = ['Up', 'Down', 'Left', 'Right']
goal_state = (2, 2)
gamma = 0.9          # discount factor
reward_step = -1     # cost of every ordinary move
reward_goal = 10     # reward for reaching the goal
theta = 1e-4         # convergence threshold for value iteration


def transition(state, action):
    """Deterministic transition: moves that would leave the grid keep the robot in place."""
    i, j = state
    if state == goal_state:
        return state
    if action == 'Up':
        return (max(i - 1, 0), j)
    elif action == 'Down':
        return (min(i + 1, grid_size - 1), j)
    elif action == 'Left':
        return (i, max(j - 1, 0))
    elif action == 'Right':
        return (i, min(j + 1, grid_size - 1))
    return state


def reward(state, next_state):
    """+10 for entering the goal state, -1 for every other move."""
    if next_state == goal_state:
        return reward_goal
    return reward_step


values = {state: 0 for state in states}


def value_iteration():
    """Repeatedly apply the Bellman optimality backup until the values stop changing."""
    global values
    while True:
        delta = 0
        new_values = values.copy()
        for state in states:
            if state == goal_state:
                continue
            v = values[state]
            # Transitions are deterministic, so the expectation over next states
            # reduces to a single term per action.
            new_values[state] = max(
                reward(state, next_state) + gamma * values[next_state]
                for next_state in (transition(state, action) for action in actions)
            )
            delta = max(delta, abs(v - new_values[state]))
        values = new_values
        if delta < theta:
            break


def extract_policy():
    """Choose the greedy action with respect to the converged value function."""
    policy = {}
    for state in states:
        if state == goal_state:
            policy[state] = 'Goal'
            continue
        best_action = None
        best_value = float('-inf')
        for action in actions:
            next_state = transition(state, action)
            action_value = reward(state, next_state) + gamma * values[next_state]
            if action_value > best_value:
                best_value = action_value
                best_action = action
        policy[state] = best_action
    return policy


value_iteration()
policy = extract_policy()

print("Value Function:")
for i in range(grid_size):
    print([values[(i, j)] for j in range(grid_size)])

print("\nOptimal Policy:")
for i in range(grid_size):
    print([policy[(i, j)] for j in range(grid_size)])

This Python code demonstrates Value Iteration for finding the optimal policy in a 3x3 grid world.
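
The code above implements Value Iteration only. As a rough companion, here is a sketch of how Policy Iteration (the alternative mentioned earlier) could be written for the same grid world, reusing the transition and reward functions and the constants defined above; the alternation of policy evaluation and greedy policy improvement is the standard scheme, but the exact code is an illustrative assumption rather than part of the original post.

def policy_iteration():
    # Start from an arbitrary policy and all-zero values.
    policy = {s: 'Up' for s in states}
    V = {s: 0 for s in states}
    while True:
        # 1. Policy evaluation: compute V for the current policy.
        while True:
            delta = 0
            for s in states:
                if s == goal_state:
                    continue
                s_next = transition(s, policy[s])
                v_new = reward(s, s_next) + gamma * V[s_next]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            if s == goal_state:
                continue
            best = max(actions,
                       key=lambda a: reward(s, transition(s, a)) + gamma * V[transition(s, a)])
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            policy[goal_state] = 'Goal'
            return policy, V

For this small grid, policy_iteration() converges to an optimal policy equivalent to the one produced by extract_policy(), up to ties between equally good actions.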

Here's a "real world" example: Autonomous Warehouse Robot
