Poster
Efficient Algorithm and Improved Regret for RL with Multinomial Logit Function Approximation
Long-Fei Li · Yu-Jie Zhang · Peng Zhao · Zhi-Hua Zhou
West Ballroom A-D #6403
[
Abstract
]
Fri 13 Dec 11 a.m. PST
— 2 p.m. PST
Abstract:
We study a new class of MDPs that employs multinomial logit (MNL) function approximation (FA) to ensure valid probability distributions over the state space. Despite its benefits, introducing non-linear FA raises significant challenges in both *computational* and *statistical* efficiency. The best-known method has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature space dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as the linear cases, the method requires storing all historical data and suffers from an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small, leading to a significant gap for the regret compared to the linear cases. In this work, we first address the computational concerns by proposing an online algorithm that achieves the same regret with only $\mathcal{O}(1)$ computation cost. Then, we design two algorithms that leverage local information to enhance statistical efficiency. They not only maintain an $\mathcal{O}(1)$ computation cost per episode but achieve improved regrets of $\widetilde{\mathcal{O}}(\kappa^{-1/2}dH^2\sqrt{K})$ and $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + d^2H^2\kappa^{-1})$ respectively. Finally, we establish a lower bound, justifying the optimality of our results in $d$ and $K$. To our knowledge, this is the **first** work that achieves almost the **same** *computational* and *statistical* efficiency as linear FA while employing non-linear FA for MDPs.
Live content is unavailable. Log in and register to view live content