📄 Publications

* indicates equal contribution

Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

Jiuqi Wang, Shangtong Zhang

arXiv, 2024

Abstract: Temporal difference (TD) learning with linear function approximation, abbreviated as linear TD, is a classic and powerful prediction algorithm in reinforcement learning. While it is well understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. In fact, we do not make any assumptions on the features. We prove that the approximated value function converges to a unique point and that the weight iterates converge to a set. We also establish a notion of local stability of the weight iterates. Importantly, we do not introduce any additional assumptions and do not modify the linear TD algorithm. Key to our analysis is a novel characterization of the bounded invariant sets of the mean ODE of linear TD.
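
To make the setting concrete, here is a minimal sketch of linear TD(0) on a toy three-state Markov reward process whose feature matrix is deliberately rank-deficient, so the features are not linearly independent; the chain, rewards, features, and step-size schedule are illustrative placeholders, not taken from the paper.

```python
import numpy as np

# Minimal sketch (not the paper's code): linear TD(0) with deliberately
# linearly dependent features (column 2 = column 0 + column 1).
np.random.seed(0)

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])      # transition probabilities
r = np.array([1.0, 0.0, -1.0])       # reward for leaving each state
gamma = 0.9
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [1.0, 1.0, 2.0]])    # rank-deficient feature matrix

w = np.zeros(3)
s = 0
for t in range(100_000):
    alpha = 0.1 / (1 + t / 1000)                             # decaying step size
    s_next = np.random.choice(3, p=P[s])
    delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w      # TD error
    w += alpha * delta * Phi[s]                              # semi-gradient TD update
    s = s_next

# The predicted values Phi @ w settle to a unique point even though w itself is
# only determined up to directions in the null space of Phi.
print("estimated values:", Phi @ w)
```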

Link to Paper

Transformers Learn Temporal Difference Methods for In-Context Reinforcement Learning

Jiuqi Wang*, Ethan Blaser*, Hadi Daneshmand, Shangtong Zhang

arXiv, 2024

Contributed talk at the RLC Workshop on Training Agents with Foundation Models, 2024.
Spotlight Award at the ICML Workshop on In-Context Learning, 2024.

Abstract: In-context learning refers to the learning ability of a model during inference time without adapting its parameters. The input (i.e., prompt) to the model (e.g., transformers) consists of both a context (i.e., instance-label pairs) and a query instance. The model is then able to output a label for the query instance according to the context during inference. A possible explanation for in-context learning is that the forward pass of (linear) transformers implements iterations of gradient descent on the instance-label pairs in the context. In this paper, we prove by construction that transformers can also implement temporal difference (TD) learning in the forward pass, a phenomenon we refer to as in-context TD. We demonstrate the emergence of in-context TD after training the transformer with a multi-task TD algorithm, accompanied by theoretical analysis. Furthermore, we prove that transformers are expressive enough to implement many other policy evaluation algorithms in the forward pass, including residual gradient, TD with eligibility trace, and average-reward TD.
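
As a rough illustration of the idea, the sketch below runs a "forward pass" over a prompt of transitions in which each layer applies one aggregate TD(0) step. The dimensions, learning rate, and random data are placeholders, and it mirrors the construction only in spirit rather than reproducing the paper's exact transformer parameterization.

```python
import numpy as np

# Illustrative sketch only: each "layer" of the forward pass performs one
# in-context TD(0) step over the transitions supplied in the prompt.
np.random.seed(1)

d, n = 4, 32                              # feature dimension, context length (placeholders)
phi      = np.random.randn(n, d)          # features of the visited states
phi_next = np.random.randn(n, d)          # features of the successor states
rewards  = np.random.randn(n)             # observed rewards
gamma, lr = 0.9, 0.1

w = np.zeros(d)                           # implicit weight vector carried across layers
for layer in range(10):                   # each layer = one in-context TD step
    td_err = rewards + gamma * phi_next @ w - phi @ w
    w = w + lr * (phi.T @ td_err) / n     # TD update aggregated over the context

query = np.random.randn(d)                # features of the query state
print("in-context value prediction:", query @ w)
```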

Link to Paper

Deep Dive on Checkers Endgame Data

Jiuqi Wang, Martin Müller, Jonathan Schaeffer

IEEE Conference on Games, 2023

Abstract: For games such as checkers and chess, large endgame databases/tablebases have been constructed to capture the perfect win/loss/draw value for positions near the end of the game. Such databases/tablebases can be used to enhance game-playing performance. However, this approach quickly runs into computational and storage resource limitations. An enticing alternative is to learn from such data and apply the learned evaluation to even larger data sets through transfer learning. This paper reports on research that uses deep learning to a) correctly learn a high percentage of checkers endgame positions; b) learn patterns that can be used for transfer learning; c) demonstrate that learning from a small sample of a large data set is an efficient way to compute a neural net evaluation that achieves most of the benefits; and d) show that dynamically choosing between the network's direct prediction and using it inside a one-ply search yields about 96% prediction accuracy.
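
The last point, dynamically choosing between the network's direct prediction and a one-ply search, can be pictured with the hedged sketch below; `model`, `legal_moves`, and the confidence threshold are hypothetical stand-ins rather than the paper's actual engine or code.

```python
import random

# Hedged sketch, not the paper's implementation: trust the learned win/draw/loss
# classifier when it is confident, otherwise apply it inside a one-ply search.
LABELS = ("win", "draw", "loss")

def model(position):
    """Stand-in for the trained net: returns a win/draw/loss probability vector."""
    rng = random.Random(position)                 # deterministic per position
    p = [rng.random() for _ in LABELS]
    return [x / sum(p) for x in p]

def legal_moves(position):
    """Stand-in move generator returning dummy successor positions."""
    return [f"{position}/m{i}" for i in range(3)]

def one_ply_value(position):
    """Back up the best outcome for the side to move using the net on successors."""
    order = {"loss": 0, "draw": 1, "win": 2}
    flip = {"win": "loss", "draw": "draw", "loss": "win"}   # successor is opponent's view
    best = "loss"
    for nxt in legal_moves(position):
        probs = model(nxt)
        mine = flip[LABELS[probs.index(max(probs))]]
        if order[mine] > order[best]:
            best = mine
    return best

def evaluate(position, threshold=0.9):
    probs = model(position)
    if max(probs) >= threshold:                   # confident: use the direct prediction
        return LABELS[probs.index(max(probs))]
    return one_ply_value(position)                # otherwise fall back to one-ply search

print(evaluate("example-position"))
```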

Link to Paper