Poster
Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation
Long-Fei Li · Yu-Jie Zhang · Peng Zhao · Zhi-Hua Zhou
West Ballroom A-D #6403
Fri 13 Dec 11 a.m. PST — 2 p.m. PST
Abstract:
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its benefits, introducing this non-linear function raises significant challenges to both *computational* and *statistical* efficiency. The best-known result of Hwang and Oh [2023] achieves an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as the linear case, the method requires storing all historical data and incurs an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small in the worst case, leaving a significant gap in regret compared to linear function approximation. In this work, we first address the computation and storage issues by proposing an algorithm that achieves the same regret with only $\mathcal{O}(1)$ cost. Then, we design an enhanced algorithm that leverages local information to improve statistical efficiency. It not only maintains an $\mathcal{O}(1)$ computation and storage cost per episode but also achieves an improved regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + d^2H^2\kappa^{-1})$, nearly closing the gap with linear function approximation. Finally, we establish the first lower bound for MNL function approximation, justifying the optimality of our results in $d$ and $K$.
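For concreteness, a sketch of the MNL transition parameterization typically studied in this line of work (the feature map $\varphi$ and the reachable-state set $\mathcal{S}_{s,a}$ below are standard notation in the MNL bandit/RL literature, stated here as an assumption rather than taken from the paper itself):
$$
\mathbb{P}_{\theta}(s' \mid s, a) \;=\; \frac{\exp\big(\varphi(s,a,s')^{\top}\theta\big)}{\sum_{\tilde{s} \in \mathcal{S}_{s,a}} \exp\big(\varphi(s,a,\tilde{s})^{\top}\theta\big)}, \qquad s' \in \mathcal{S}_{s,a}.
$$
The softmax form guarantees that the transition probabilities are non-negative and sum to one over $\mathcal{S}_{s,a}$, which is the "valid probability distribution" property highlighted in the abstract and which linear function approximation does not ensure; the price is the non-linearity whose curvature enters the analysis through the problem-dependent quantity $\kappa$.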