## Modeling Shortest Path Problem by MDP with Bellman Equation

Single source shortest path problem is a well-know problem, which can be solved with Dijkstra or Bellman-Ford algorithm. When I started to learn MDP for reinforcement learning, bellman equation comes into my eyes and I feel it is very similar to what I have learnt for solving the shortest path problem. I am wondering that are there any relations between the two kind of ways of thinking.

MDP (Markov Decision Process) is the fundamental model for reinforcement learning. There are six essential building blocks, namely the states $\mathcal{S}$, actions $\mathcal{A}$, transitions $\Pr(s^\prime\mid s, a)$, reward function $r(s,a,s^\prime)$, start state $s_0$ and goals $\mathcal{G}$, and occasionally we will have a discount factor $\gamma$ in case that the MDP can be infinitely long.

No matter you know MDP before or not, let us try to model the shortest path problem by MDP and you will understand what MDP is.

The $\mathcal S$ is the the set of vertex in a graph, $\mathcal A$ is the edges, in a graph once you pick up an edge, the transition is deterministic hence $\Pr(s^\prime\mid s, a)=1 \iff a=(s,s^\prime)\subset \mathcal E$, and the reward function $r(s, a, s^\prime)$ can be modeled as the negation of the cost of an edge $a=(s,s^\prime)$. The goals is just the set containing the terminal vertex and the start state is the source vertex. Suppose there is no negative cycle in our graph, no discount is needed because the traveling will stop in finite steps. MDP is indeed suit for the shortest path problem.

The objective of MDP is to find a policy $\pi:\mathcal S\rightarrow \mathcal A$ which maximize the expected reward. In our setting, due to that the transition function is deterministic, the expected reward is exactly the reward of a certain path that is determined by the policy function. What we call path is called horizon in MDP, which is way more fancy but exactly the same concept. The policy can be non-deterministic under reinforcement learning setting, there is a trade-off called exploration and exploitation for an agent to understand its circumstances. For the shortest path problem the optimal choice at any vertex is always deterministic.

To estimate the goodness of the policy we have a value function called $V^\pi(s)$, which tells you how much reward you can have when you start at state $s$ and follow the policy $pi$ all the way to the goal. The value function is indeed the analogy of the negation of the distance at each vertex (remember we want the shortest path). Due to the stochasticity of the transition function of MDP, we further have a state-action function $Q^\pi(s, a)$ which tells you the expectation of reward you can have when you start at state $s$ and take the action $a$. Apparently $Q^\pi(s, a)=\mathbb E_{s^\prime\sim\Pr(s^\prime\mid s,a)}[r(s, a, s^\prime) + V^\pi(s^\prime)]$. In the shortest path setting, the $Q^\pi(s,a)$ is just $V^\pi(s^\prime)$ given $a=(s,s^\prime)$ as an edge in a graph.

Now look at the value iteration algorithm for solving the MDP problem,

Input S, A, Pr, r
initialize: for any s in S, V(s) = 0
repeat
for any s, V(s) = max expect[r(s, a, s') + V(s')] where a is among all the actions you can take at state s, take expectation due to the stochasticity of state transition
until for any s, V(s) converges (won't be updated again)
calculate the Q(s, a) using is definition
for any s, pi(s) = arg max_a Q(s, a)


When it comes to our shortest path problem, the algorithm should be modified as follow

Input S, A, Pr, r
initialize: for any s in S, V(s) = 0
repeat
for any s, V(s) = max {r(s, a, s') + V(s')} where a is among all the actions you can take at state s, no stochasticity anymore
until for any s, V(s) converges
calculate the Q(s, a) using is definition, here Q(s, a) is the maximum of V(s') where s' is the state reached from s by a single action a
for any s, pi(s) = arg max_a Q(s, a)


Notice the repeat term is exactly the Bellman-Ford algorithm, the converge here actually means no vertex can be relaxed anymore.

Hence we bridge the MDP with the Bellman-Ford algorithm for solving the shortest path problem. We can tell now that MDP is like a generalization with stochasticity of a graph (let’s call it the probabilistic graph).