Markov Decision Process (MDP)

At the heart of many important problems is not a single isolated choice, but a sequence of decisions connected over time. A doctor planning treatment for a patient, or a robot navigating through a warehouse, has to consider how today’s action shapes tomorrow’s possibilities. This process of making a series of decisions over time is called sequential decision-making.
The real world is full of noise and uncertainty. Actions do not always produce the result we expected. A robot may be told to move forward but end up slightly off course. A carefully designed treatment plan may still produce unexpected side effects. This is what we mean by a stochastic environment: outcomes are partly random and partly under the control of the decision-maker. That is the core challenge an agent faces in a complex environment: in a world where results are not fully controllable, how do you balance short-term reward and long-term goals well enough to make the best possible choice?
The arrival of MDPs marked a fundamental shift in how AI thought about decision-making. Early planning systems in AI usually assumed the world was deterministic, meaning one action always led to one predictable result. Programmers naturally liked that kind of certainty because it made the world feel like an exact state machine. But that model is fragile in the real world. MDPs admit something deeper: intelligent behavior cannot depend on perfect foresight. A truly intelligent decision-maker must be able to reason in terms of probability and expected value.
At the center of this framework is a powerful abstraction: divide the world into two parts, the agent and the environment. The agent is the decision-maker; the environment is the outside world it lives in and interacts with. This separation is surprisingly general. A robot moving goods around a factory is an agent in a physical environment. A chess engine is an agent acting in a digital board environment. A high-frequency trading system is an agent operating in the environment of a financial market. By modeling all of them the same way, MDPs let researchers develop general algorithms that are not tied to a single domain. In principle, the same algorithm can help a robot find the best path or a game character find the best strategy, as long as the problem can be expressed as an MDP.
A Simple Navigation Game
To make the abstraction concrete, let us use a simple running example: Grid World. It is a toy problem, but it captures the core challenges of real sequential decision-making surprisingly well.
Imagine a 3x4 grid board: 4 columns wide and 3 rows tall, with each cell addressed by its (column, row) coordinates. This world contains the following elements:
- Agent: a simple robot, our decision-maker. It starts in the lower-left cell at (1,1).
- Goal: the robot’s task is to reach the upper-right cell (4,3), where a treasure is waiting. Reaching it gives a large reward.
- Hazard: the cell (4,2) contains a pit of fire. Falling into it leads to a serious penalty.
- Obstacle: the cell (2,2) contains a wall, which the robot cannot enter or pass through.
- Actions: in every non-terminal cell, the robot may choose one of four actions: up, down, left, or right.
- Uncertainty: this is the key part. The robot’s movement is stochastic rather than fully controlled. We use a common model: when the robot intends to move in one direction, it succeeds with probability 80%; with probability 10% it slips 90 degrees to the left of the intended direction; and with probability 10% it slips 90 degrees to the right. If any move would bump into the wall or the edge of the board, the robot stays where it is.
Squares on the Board (States, S)
Every cell on the board represents a distinct situation the robot might be in. In MDP language, this is a state. The set of all available cells, excluding the wall, forms the state space. A state must contain all the information needed to make future decisions. In this game, the robot’s position (x, y) is enough to describe its current situation.
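For readers who like to see things written down explicitly, here is one minimal way the state space might be expressed in Python; the constant names are illustrative choices, not taken from any particular library.

```python
# A small sketch under the stated assumptions: the Grid World state space as
# explicit (x, y) tuples, excluding the wall cell at (2, 2).

WALL = (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
print(len(STATES))  # 11 usable cells on the 3x4 board
```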
Moves We Can Take (Actions, A)
From any given state, the set of possible moves the robot can take, up, down, left, or right, forms the action space. Notice that the available actions can sometimes depend on the current state. For example, once the robot reaches the treasure or falls into the pit, which are terminal states, the game ends and no further action can be taken.
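A tiny sketch makes the state-dependence of the action set concrete; the terminal cells and function name below are illustrative assumptions.

```python
# A small sketch: the action space, with terminal states offering no actions.

TERMINALS = {(4, 3), (4, 2)}  # treasure and fire pit

def available_actions(state):
    return [] if state in TERMINALS else ["up", "down", "left", "right"]

print(available_actions((1, 1)))  # ['up', 'down', 'left', 'right']
print(available_actions((4, 3)))  # [] -- the game is over here
```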
A Slippery Floor (Transition Probabilities, P)
This is the core idea in an MDP, and the part that most clearly captures stochasticity. The 80/10/10 movement model can be understood with the metaphor of a slippery floor. Formally, this is called the transition model or transition function, usually written as T(s, a, s') or P(s' | s, a). It answers a simple question: if I am currently in state s and take action a, what is the probability that I end up in state s'?
The transition function is part of the environment’s built-in “laws of physics.” The agent cannot change it. The robot can choose its intended action, such as “move up,” but where it finally lands is determined by the environment according to the transition probabilities. This creates a clean separation of responsibilities: the agent is the decision-maker, while the environment, defined by transition probabilities and rewards, is the system it operates inside. The agent’s job is not to rewrite the rules of the world, but to learn how to act as well as possible under those rules.
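To make the slippery floor concrete, here is a minimal sketch of the transition function for this grid. The layout constants and function names are my own illustrative choices, not a reference implementation.

```python
# A minimal sketch of P(s' | s, a) for the Grid World, assuming the layout
# described above: a 4-by-3 board, a wall at (2, 2), and the 80/10/10 slip model.

WIDTH, HEIGHT = 4, 3
WALL = (2, 2)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
LEFT_OF  = {"up": "left",  "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}

def attempt(state, direction):
    """Apply one directional move; bumping the wall or the edge means staying put."""
    x, y = state[0] + MOVES[direction][0], state[1] + MOVES[direction][1]
    if (x, y) == WALL or not (1 <= x <= WIDTH and 1 <= y <= HEIGHT):
        return state
    return (x, y)

def transition(state, action):
    """Return {s': P(s' | s, a)} under the 80/10/10 slip model."""
    result = {}
    for direction, prob in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        landed = attempt(state, direction)
        result[landed] = result.get(landed, 0.0) + prob
    return result

print(transition((3, 1), "up"))  # {(3, 2): 0.8, (2, 1): 0.1, (4, 1): 0.1}
```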
Rewards and Penalties (Reward, R)
To guide the robot, we need to provide feedback. In this setting, that feedback comes as rewards.
In our game:
- Reaching the treasure gives a large positive reward, for example +1.
- Falling into the pit gives a large negative reward, for example -1.
- To encourage the robot to finish quickly, every ordinary step carries a small negative reward, for example -0.04, which can be understood as a time cost.
The reward function, usually written as R(s, a, s'), defines the immediate value received when the agent takes action a in state s and transitions to state s'.
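As a sketch, the reward function for this game can be written so that the reward depends only on the state being entered, using the example values above; the names are illustrative.

```python
# A small sketch of R(s, a, s') for this Grid World. Here the reward depends
# only on the state being entered, matching the example values in the text.

TREASURE, PIT = (4, 3), (4, 2)

def reward(s, a, s_next):
    if s_next == TREASURE:
        return 1.0       # reaching the treasure
    if s_next == PIT:
        return -1.0      # falling into the fire pit
    return -0.04         # small time cost for every ordinary step
```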
This contains one of the central philosophical ideas of reinforcement learning: the reward hypothesis. The agent is never explicitly told, “Go to cell (4,3).” The only thing it is told is: maximize your future cumulative reward. All the goal-directed behavior we care about, moving toward the treasure, avoiding the fire, taking a shorter path, emerges from this simple scalar reward signal. The designer does not directly write the behavior. The designer writes the incentives. In that sense, a large part of reinforcement learning is really the art of reward engineering: designing a reward function that causes the behavior you want to emerge.
The Golden Rule: The Future Depends Only on the Present
The Markov Property
Before discussing how to solve an MDP, we need to understand the core assumption that makes the whole framework possible: the Markov property. This is the foundation everything else stands on, and it is named after the Russian mathematician Andrey Markov.
The Core Idea: The Future Is Independent of the Past, Given the Present
The Markov property is often described as “memorylessness.” If a stochastic process has the Markov property, it means the probability distribution over its future depends only on its current state, not on the path it took to get there.
In our grid world, that means when the robot is at (3,1), all the information needed to decide the best next move is contained in the fact that it is currently at (3,1). It does not matter whether it arrived there from (2,1) by moving right, or from (3,2) by moving down. That history is irrelevant to what happens next. No matter how the robot arrived, if it chooses “up” from (3,1), the resulting distribution stays the same: 80% to (3,2), 10% to (2,1), and 10% to (4,1).
A Non-Markov Counterexample
To see this more clearly, imagine a modified rule: the robot is not allowed to take the same action twice in a row. Now, if the robot is at (3,1), knowing only its current position is not enough. It must also remember what action it took in the previous step. If the previous move was “up,” then “up” is no longer available right now. In this case, the history directly changes the future possibilities, so the process is no longer Markov. The original state definition (3,1) is no longer sufficient. A state that restores the Markov property would need to include more information, for example ((current_position), (previous_action)).
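One way to picture that fix is to fold the missing history directly into the state. The sketch below assumes the augmented state is a pair of (current position, previous action); the names are illustrative.

```python
# A minimal sketch of the augmented state for the "no repeated action" variant:
# position alone is no longer Markov, so the previous action becomes part of the state.

def available_actions(state):
    position, previous_action = state  # augmented state: ((x, y), last action taken)
    return [a for a in ["up", "down", "left", "right"] if a != previous_action]

print(available_actions(((3, 1), "up")))  # ['down', 'left', 'right'] -- "up" is barred
```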
Why This Assumption Matters So Much
The Markov property is an extremely powerful simplifying assumption. It prevents the state space from exploding. If an agent had to consider its full history in order to make a decision, then the number of possible states, really the number of possible histories, would grow exponentially over time, making the problem computationally intractable. The Markov property keeps the problem manageable by ensuring that the current state s already summarizes all the relevant historical information.
It is important to emphasize that the Markov property is not a physical law built into the world. It is a modeling choice. A problem that seems non-Markov can often be turned into a Markov one by redefining the state. For example, a robot’s future position may depend not only on its current location but also on its current velocity. If position alone is used as the state, the system is non-Markov. But if the state is defined as (position, velocity), then the next state may depend only on the current state and the chosen action, such as acceleration or braking, which restores the Markov property. That is why a key step in applying the MDP framework is designing a state representation that is compact but still Markov.
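As a tiny illustration of this kind of state design, consider a hypothetical one-dimensional vehicle: with position alone the next state cannot be predicted, but with (position, velocity) the update depends only on the current state and the chosen acceleration. The example below is an assumption for illustration, not part of the grid world.

```python
# A minimal sketch (hypothetical 1-D vehicle): bundling velocity into the state
# makes the next state a function of the current state and the action alone.

def step(state, acceleration, dt=1.0):
    position, velocity = state
    new_position = position + velocity * dt      # driven by the current velocity
    new_velocity = velocity + acceleration * dt  # driven by the chosen action
    return (new_position, new_velocity)          # depends only on (state, action)

print(step((0.0, 2.0), acceleration=1.0))        # (2.0, 3.0)
```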
Policy
Once the world and the rules of the MDP are defined, the final goal is to find a way to solve it. The solution to an MDP is not a single path. It is a complete decision rule.
That complete decision rule is called a policy, usually written with the Greek letter pi. A policy is a mapping from states to actions. It tells the agent what to do in every possible state. In our grid world, a policy is like drawing an arrow inside every cell, up, down, left, or right, so the robot always knows which way to go.
A policy is fundamentally different from a traditional plan. A traditional plan might be a fixed sequence like “up, up, right, right.” But in our stochastic world, that kind of plan is fragile. If the first “up” action slips, then the whole rest of the sequence is broken. A policy is much more robust because it tells the agent what to do in any state it might actually end up in. It is reactive and general, not brittle and fixed.
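In code, a deterministic policy for Grid World can be as simple as a lookup table from cells to arrows. The particular arrows below are made up for illustration and are not the optimal policy.

```python
# A minimal sketch: a deterministic policy as a plain mapping from states to actions.
# The entries shown are illustrative, not the result of solving the MDP.

policy = {
    (1, 1): "up",
    (1, 2): "up",
    (1, 3): "right",
    (2, 3): "right",
    (3, 3): "right",
    # ... one entry for every remaining non-terminal cell
}

def act(state):
    """The agent consults the policy in whatever state it actually finds itself in."""
    return policy[state]

print(act((1, 1)))  # 'up'
```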
Goal: Maximize Cumulative Reward
Our goal is to find the optimal policy, usually written as pi*. “Optimal” means that following this policy gives the agent the maximum expected total reward over the long run. This long-run total reward is often called the return or expected utility.
Infinite Horizons and Farsightedness (Discount Factor, gamma)
If our robot can keep moving forever, for example in an environment with no terminal state, then the total reward it collects might be infinite, which makes it impossible to compare strategies meaningfully. To solve this mathematical problem, and also to model a preference for sooner rewards over later ones, we introduce the discount factor, written as gamma, whose value lies between 0 and 1.
A useful analogy is this: would you rather get 100 dollars today, or 100 dollars a year from now? Most people prefer the first option, because future reward is uncertain and delayed satisfaction feels less valuable. The discount factor models exactly that time preference.
When calculating total return, a reward received one step in the future is multiplied by gamma, a reward received two steps in the future is multiplied by gamma^2, and so on. So the total discounted return becomes:
R0 + gamma R1 + gamma^2 R2 + ...
The discount factor plays a double role. Mathematically, it ensures that an infinite sum of future rewards converges to a finite number, which means the problem has a well-defined solution. Behaviorally, it shapes the kind of agent we get. A gamma close to 0 creates a shortsighted agent that cares almost only about immediate reward. A gamma close to 1 creates a farsighted agent willing to give up small short-term gains for larger long-term ones. In other words, gamma controls the planning horizon.
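A short sketch of the discounted return makes gamma's role visible: with a large gamma the delayed treasure dominates, while with a small gamma the step costs do. The reward sequence below is an illustrative example.

```python
# A small sketch: computing the discounted return R0 + gamma*R1 + gamma^2*R2 + ...

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]  # three ordinary steps, then the treasure
print(discounted_return(rewards, gamma=0.9))  # farsighted: about 0.62, the treasure dominates
print(discounted_return(rewards, gamma=0.1))  # shortsighted: about -0.04, the step costs dominate
```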
Formal Definition of an MDP
A Markov Decision Process is formally defined as a five-tuple (S, A, P, R, gamma).
- S: a finite set of states, the state space.
- A: a finite set of actions, the action space.
- P: the state transition probability function, P(s' | s, a), which gives the probability of ending up in state s' after taking action a in state s.
- R: the reward function, R(s, a, s'), which gives the immediate reward obtained when taking action a in state s and transitioning to s'.
- gamma: the discount factor, which determines how much future rewards matter.
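The five-tuple translates almost directly into a data structure. The sketch below stores P and R as dictionaries keyed by (s, a, s'); the field names and layout are illustrative assumptions.

```python
# A minimal sketch of the five-tuple (S, A, P, R, gamma) as a plain container.

from dataclasses import dataclass

@dataclass
class MDP:
    states: list    # S: the state space
    actions: list   # A: the action space
    P: dict         # P[(s, a, s_next)] -> transition probability
    R: dict         # R[(s, a, s_next)] -> immediate reward
    gamma: float    # discount factor between 0 and 1
```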
How to Find the Best Policy: Unlocking the Bellman Equation
Scoring Each State
To evaluate a policy, a natural idea is to evaluate the “value” of each state under that policy. The state-value function, written V^pi(s), is defined as the expected total discounted return the agent will receive if it starts in state s and continues following policy pi. Intuitively, a state with high value is a state from which high future reward is expected.
The Core Logic of Value (The Bellman Equation)
The Bellman equation does not come out of nowhere. It follows from a very intuitive statement:
The long-term value of being in a state s equals the expected immediate reward you receive when leaving that state, plus the discounted long-term value of the state you end up in next.
This breaks a complicated long-term problem into a local relationship involving only the present and the next step. That is the essence of dynamic programming.
For a given policy pi, the value function V^pi(s) satisfies the Bellman expectation equation:
V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V^pi(s')]
This equation is recursive: the value of one state depends on the values of successor states. Across all states, these relationships form a system of linear equations that can, at least in principle, be solved.
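For a fixed policy, those relationships can also be solved iteratively by sweeping the Bellman expectation update until it stabilizes. The sketch below uses a two-state toy MDP and data layouts of my own choosing, not the grid world itself.

```python
# A sketch of iterative policy evaluation for the Bellman expectation equation.
# Conventions assumed here: P[s][a] is a list of (prob, next_state, reward) triples,
# and pi[s][a] is the probability of choosing action a in state s.

def policy_evaluation(states, P, pi, gamma, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Toy MDP: from "A", action "go" reaches "B" with prob 0.9 (reward 1) or stays in "A"
# with prob 0.1 (reward 0); "B" simply loops on itself with no reward.
states = ["A", "B"]
P = {"A": {"go": [(0.9, "B", 1.0), (0.1, "A", 0.0)]},
     "B": {"stay": [(1.0, "B", 0.0)]}}
pi = {"A": {"go": 1.0}, "B": {"stay": 1.0}}
print(policy_evaluation(states, P, pi, gamma=0.9))  # V(A) is about 0.99, V(B) is 0
```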
The Bellman Optimality Equation
But our goal is not just to evaluate a known policy. We want the best policy. For that, we need the Bellman optimality equation, which describes the condition satisfied by the optimal value function V*(s). The logic is:
The optimal value of a state is the expected total return obtained by taking the best possible action in that state.
That “best action” is the one that maximizes the combination of immediate reward and discounted future value. In the equation, this appears as a max operator:
V*(s) = max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V*(s')]
The beauty of the Bellman optimality equation is that it turns a global problem, finding the best strategy over an unbounded future, into a set of local recursive calculations. It tells us that if we somehow know the optimal values of all successor states, then we can compute the optimal value of the current one. That may sound circular, but it actually gives us a practical path: repeated local updates can gradually reveal the globally optimal long-term values.
Solving Grid World Step by Step
Now let us apply the Bellman optimality equation to our grid world through a classical algorithm called value iteration, and use it to compute the optimal policy step by step.
The core idea of value iteration is simple: repeatedly use the Bellman optimality equation as an update rule to refine the estimated value of each state until those values stop changing significantly.
Steps of Value Iteration
- Initialization: Start from a state of total ignorance and set the initial value of every state to 0. That is, for all states s, let V0(s) = 0.
- First iteration: Use the Bellman optimality equation to compute V1(s). For every non-terminal state in the grid, compute which of the four actions, up, down, left, right, gives the largest expected return.
  - For example, consider the state (3,3), next to the treasure at (4,3). If the robot chooses "right," there is an 80% chance it enters the treasure and receives +1, a 10% chance it slips upward, bumps the edge of the board, and stays in place, and a 10% chance it slips downward into the ordinary cell (3,2). The algorithm evaluates all four actions and selects the action whose expected value makes V1(3,3) largest.
  - For states farther from both the treasure and the pit, the first-iteration value is mostly driven by the -0.04 step cost.
- Later iterations: In the second iteration, use V1 to compute V2. At this point, something interesting happens: value begins to propagate. In the first iteration, only states immediately adjacent to the treasure get large positive values. In the second iteration, their neighbors start to gain value too, because they can move into a high-value state in one step. In the same way, the negative value near the pit also spreads outward.
- Convergence: After enough iterations, the value of each state stops changing much. At that point, the algorithm is considered converged, and we have an approximation of the optimal value function V*(s).
- Extract the optimal policy: Once we have V*(s), finding the optimal policy pi*(s) becomes easy. For each state s, just look one step ahead, compute the expected return of taking each action, immediate reward plus discounted next-state value, and choose the action with the highest return. A runnable sketch of the whole procedure follows below.
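Here is a runnable sketch of value iteration on this Grid World, under stated assumptions: rewards are received on entering a cell, terminal states have value 0, and gamma is set to 0.9 (the text does not fix a particular discount). All names are illustrative.

```python
# A runnable sketch of value iteration for the Grid World described above.

WIDTH, HEIGHT = 4, 3
WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}   # reward for entering the treasure / the pit
STEP_COST = -0.04
GAMMA = 0.9                               # assumed discount factor

STATES = [(x, y) for x in range(1, WIDTH + 1) for y in range(1, HEIGHT + 1) if (x, y) != WALL]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
LEFT_OF  = {"up": "left",  "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}

def attempt(s, d):
    """Apply one directional move; bumping the wall or the edge means staying put."""
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    blocked = nxt == WALL or not (1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT)
    return s if blocked else nxt

def backup(V, s, a):
    """One Bellman optimality backup: expected reward plus discounted next-state value."""
    total = 0.0
    for d, p in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        s2 = attempt(s, d)
        r = TERMINALS.get(s2, STEP_COST)
        total += p * (r + GAMMA * (0.0 if s2 in TERMINALS else V[s2]))
    return total

# Value iteration: sweep the Bellman optimality update until values stop changing.
V = {s: 0.0 for s in STATES}
while True:
    V_new = {s: 0.0 if s in TERMINALS else max(backup(V, s, a) for a in MOVES) for s in STATES}
    delta = max(abs(V_new[s] - V[s]) for s in STATES)
    V = V_new
    if delta < 1e-6:
        break

# Extract the optimal policy with a one-step lookahead over the converged values.
policy = {s: max(MOVES, key=lambda a: backup(V, s, a)) for s in STATES if s not in TERMINALS}
print(round(V[(3, 3)], 3), policy[(3, 3)])  # the cell next to the treasure prefers "right"
```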
The Core Logic Behind Intelligent Action
An MDP is not just a mathematical tool. It is a formal description of what goal-directed behavior looks like under uncertainty. It provides a framework in which an agent can act inside a stochastic environment, learn an optimal policy, and pursue a goal defined by reward. At the center of the entire solution process is the Markov property, which lets us connect the long-term value of one state to the value of its neighboring states through iterative computation.
To understand MDPs is to understand one of the basic ideas behind modern AI and reinforcement learning. It teaches us that complex intelligent behavior can emerge from simple rules and incentives, and it gives us a theoretical foundation for building systems that can learn and make decisions on their own in complex, dynamic, and uncertain worlds.