
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Tiancheng Jin · Tal Lancewicki · Haipeng Luo · Yishay Mansour · Aviv Rosenberg

Wed Nov 30 02:00 PM -- 04:00 PM (PST) @ Hall J #817
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. In practice, however, feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can vary across episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
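To illustrate the gap between the two regret rates, here is a minimal numeric sketch (not part of the paper's algorithms): it assumes a hypothetical run with $K$ episodes and a constant delay $d^k = 10$, computes the total delay $D = \sum_k d^k$, and compares the improved $\sqrt{K + D}$ scaling against the previous $(K + D)^{2/3}$ bound, ignoring constants and logarithmic factors.

```python
import math

# Hypothetical setting: K episodes, each with a constant delay d^k = 10.
K = 100_000
delays = [10] * K          # the adversary may choose any delay sequence;
D = sum(delays)            # total delay D = sum over k of d^k

# Compare the regret scalings (constants and log factors omitted).
old_bound = (K + D) ** (2 / 3)   # previous best known rate: (K + D)^{2/3}
new_bound = math.sqrt(K + D)     # near-optimal rate from this paper: sqrt(K + D)

print(f"(K+D)^(2/3) ~ {old_bound:,.0f}")
print(f"sqrt(K+D)   ~ {new_bound:,.0f}")
```

With these values the square-root rate is roughly an order of magnitude smaller, and the gap widens as $K + D$ grows.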

Author Information

Tiancheng Jin (University of Southern California)
Tal Lancewicki (Tel Aviv University)
Haipeng Luo (University of Southern California)
Yishay Mansour (Tel Aviv University & Google)
Aviv Rosenberg (Amazon)
Aviv Rosenberg

I am an Applied Scientist at Amazon Alexa Shopping, Tel Aviv. Previously, I obtained my PhD from the department of computer science at Tel Aviv University, where I was fortunate to have Prof. Yishay Mansour as my advisor. Prior to that, I received my Bachelor's degree in Mathematics and Computer Science from Tel Aviv University. My primary research interest lies in theoretical and applied machine learning. More specifically, my PhD focused on data-driven sequential decision making, such as reinforcement learning, online learning, and multi-armed bandits.
