Abstract — Partially observable Markov decision processes (POMDPs) have been widely applied in fields such as robot navigation, machine maintenance, marketing, and medical diagnosis [1]. However, their exact solution is inefficient in both space and time. This paper analyzes the smooth partially observable value approximation (SPOVA) algorithm [2], which approximates the value function over belief states by a differentiable function and then uses gradient descent to refine that approximation. This POMDP approximation algorithm is applied to a tunable pole-balancing problem. The simulation results demonstrate that the approach can estimate state transition probabilities and improve its policy simultaneously.

Keywords — POMDP; SPOVA; pole balancing.

Introduction

The Markov decision process (MDP) has proven to be a useful framework for solving a variety of problems in areas including robotics, economics, and manufacturing. Unfortunately, many real-world problems cannot be modeled as MDPs, especially when the problem states are only partially observable. Partially observable Markov decision processes (POMDPs) extend the MDP framework to states that are not fully observable. With this extension we can model more practical problems, but existing MDP solution methods are no longer applicable. The computational cost of POMDP algorithms is much greater than that of MDP algorithms. This added complexity is due to uncertainty about the actual state, which is represented as a probability distribution over states: POMDP algorithms operate on these probability distributions (belief states), while MDP algorithms operate on a finite number of discrete states. This difference transforms an optimization problem over a discrete space into one defined over a continuous space of probability distributions.

[... middle of the paper ...]

... the expected total reward discounted over an infinite horizon. The expected reward for policy $\pi$ starting from belief $b_0$ is defined as

J^{\pi}(b_0) = E\!\left[\sum_{t=0}^{\infty} \beta^{t}\, r(s_t, a_t) \,\middle|\, b_0, \pi \right] \qquad (3)

where $0 < \beta < 1$ is the discount factor.
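To make the belief-state machinery described in the introduction concrete, the following sketch shows one Bayesian belief update for a discrete POMDP: the current belief is pushed through the transition model and then reweighted by the likelihood of the received observation. The function name belief_update, the arrays T and Z, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One Bayesian belief update for a discrete POMDP.

    b : current belief over states, shape (S,)
    a : index of the action just taken
    o : index of the observation just received
    T : transition model, T[s, a, s2] = P(s2 | s, a), shape (S, A, S)
    Z : observation model, Z[s2, a, o] = P(o | s2, a), shape (S, A, O)
    """
    predicted = b @ T[:, a, :]                # predict: P(s2 | b, a), shape (S,)
    unnormalized = Z[:, a, o] * predicted     # correct: weight by observation likelihood
    return unnormalized / unnormalized.sum()  # renormalize to a probability distribution

# Hypothetical example: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])    # shape (2, 1, 2)
Z = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])    # shape (2, 1, 2)
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, Z=Z))   # updated belief, sums to 1
```

The updated belief, rather than any single discrete state, is what a POMDP policy conditions on, which is why the optimization problem lives on the continuous belief simplex.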
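The abstract characterizes SPOVA as a differentiable approximation of belief-state values that is refined by gradient descent. The sketch below illustrates one common form of such a smooth approximation, a soft maximum over linear value vectors; the exponent k, the learning rate, and the exact parameterization are assumptions for illustration and may differ in detail from the formulation in [2].

```python
import numpy as np

def spova_value(b, vectors, k=10.0):
    """Differentiable (soft-max) approximation of the value of belief b.

    vectors : linear value vectors, shape (N, S); the dot products
              vectors @ b are assumed positive so the k-th root is defined.
    As k grows this approaches max_i (vectors[i] . b), the piecewise-linear
    form used by exact value-function representations.
    """
    dots = vectors @ b
    return np.sum(dots ** k) ** (1.0 / k)

def spova_gradient_step(b, target, vectors, k=10.0, lr=0.01):
    """Adjust the vectors by gradient descent so the smooth value at belief b
    moves toward a target value (e.g. a Bellman backup)."""
    dots = vectors @ b
    value = np.sum(dots ** k) ** (1.0 / k)
    # dV/d(vectors[i]) = value**(1 - k) * dots[i]**(k - 1) * b
    grad = (value ** (1.0 - k)) * (dots ** (k - 1.0))[:, None] * b[None, :]
    return vectors + lr * (target - value) * grad
```

Because the soft maximum is differentiable everywhere, the error (target - value) can be distributed over all vectors in proportion to their contribution at belief b, which is what makes gradient-based refinement of the value approximation possible.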