Hacker News (Chinese edition)



I think of RL as a method that produces training data from the model's own predictions. This directly pushes the model to extend its output range, because the data become more diverse. Fundamentally, however, RL relies on bootstrapping and suffers from a moving-target problem, and both contribute to its poor stability. One of the most tractable ways to approximate a value function is temporal-difference (TD) learning, but TD brings sample noise, function-approximation error, and moving targets with it. I argue that we need to extend pure RL theory at the level of the Bellman equation to achieve more stable RL. Consequently, we need both a better mathematical foundation for value functions and a tractable approximation method, aligned with each other and free from these problems.
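To make the bootstrapping and moving-target points concrete, here is a minimal sketch of tabular TD(0). The environment (a 5-state random walk), the function name `td0_random_walk`, and all hyperparameters are my own assumptions for illustration, not something from a specific library. The update target `reward + gamma * V(next_state)` is built from the current estimate of the next state's value, so the target itself shifts every time that estimate is updated, and each update uses a single sampled transition, which is where the sample noise comes from.

```python
import random


def td0_random_walk(num_states=5, episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) on a small random walk (hypothetical toy problem).

    States are 0..num_states-1. Stepping left of state 0 terminates with
    reward 0; stepping right of the last state terminates with reward 1.
    """
    rng = random.Random(seed)
    values = [0.5] * num_states  # arbitrary initial value estimates

    for _ in range(episodes):
        state = num_states // 2  # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))  # one sampled transition

            if next_state < 0:
                reward, bootstrap, done = 0.0, 0.0, True
            elif next_state >= num_states:
                reward, bootstrap, done = 1.0, 0.0, True
            else:
                # Bootstrapping: the target reuses our own current estimate.
                reward, bootstrap, done = 0.0, values[next_state], False

            # Moving target: `bootstrap` changes whenever values[next_state]
            # is updated, so this target is not a fixed regression label.
            target = reward + gamma * bootstrap
            values[state] += alpha * (target - values[state])

            if done:
                break
            state = next_state

    return values


if __name__ == "__main__":
    estimates = td0_random_walk()
    # For this symmetric walk the true value of state i is (i + 1) / 6.
    for i, v in enumerate(estimates):
        print(f"state {i}: estimate {v:.3f}, true {(i + 1) / 6:.3f}")
```

In this tabular case the estimates settle near the true values and only the sampling noise remains. The instability I am describing shows up more strongly once the table is replaced by a function approximator, because an update to one state's value then also shifts the targets of other states.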

My view on reinforcement learning