Abstract:
The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems, where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \geq 0$. We propose a novel recursion-based algorithm and show that if $\delta = \widetilde{O}\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $\widetilde{O}\left(\dim_E\right)$ trajectories, where $\rho$ is the gap between the optimal $Q$-value of the best actions and that of the second-best actions, and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications:
\begin{enumerate}
\item In conjunction with the lower bound in [Du et al., 2020], our upper bound suggests that the condition $\delta = \widetilde{\Theta}\left(\rho/\sqrt{\dim_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity.
\item In conjunction with the obvious lower bound in the tabular case, our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\dim_E\right)$ is tight in the agnostic setting.
\end{enumerate}
Therefore, we help address the open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.
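For concreteness, the gap $\rho$ admits the following standard formalization; this is a sketch using assumed notation ($H$ for the horizon, $Q^*_h$ for the optimal $Q$-function at step $h$, $\pi^*_h$ for the optimal policy), not the paper's verbatim definition:
\[
\rho \;=\; \min_{h \in [H],\, s} \Bigl( Q^*_h\bigl(s, \pi^*_h(s)\bigr) \;-\; \max_{a \neq \pi^*_h(s)} Q^*_h(s, a) \Bigr), \qquad \pi^*_h(s) \in \operatorname*{arg\,max}_{a} Q^*_h(s, a),
\]
i.e., the smallest margin, over all steps and states, by which the best action's optimal $Q$-value exceeds that of the second-best action.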