Robbie had actions, a in A.
Robbie had observations, o in O.
Robbie acted in states s in S.
Robbie wanted to optimize a reward r.
How can Robbie construct a policy pi that maps observations to actions and achieves maximal (expected) accumulated reward?
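To make the question concrete, here is a minimal sketch of Robbie's setup in Python. Everything specific in it is an assumption made for illustration and is not from the original: the corridor world, the names `observe`, `step`, `rollout`, and `expected_return`, the `slip` probability, and the choice to let Robbie observe his state exactly. It shows a policy as a plain table from observations to actions, and estimates its expected accumulated reward by averaging over sampled episodes.

```python
import random

# Hypothetical toy world: a corridor with positions 0..3, goal at 3.
ACTIONS = ["left", "right"]   # A
STATES = [0, 1, 2, 3]         # S
GOAL = 3


def observe(state):
    # O: Robbie sees his position exactly (a simplifying assumption).
    return state


def step(state, action, rng, slip=0.2):
    """Return (next_state, reward); with probability `slip` the action is reversed."""
    if rng.random() < slip:
        action = "left" if action == "right" else "right"
    move = 1 if action == "right" else -1
    next_state = min(max(state + move, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def rollout(policy, rng, horizon=10, slip=0.2):
    """Accumulated reward of one episode that starts at state 0."""
    state, total = 0, 0.0
    for _ in range(horizon):
        action = policy[observe(state)]   # pi: observation -> action
        state, reward = step(state, action, rng, slip)
        total += reward
        if state == GOAL:
            break
    return total


def expected_return(policy, episodes=2000, seed=0, **kwargs):
    """Monte Carlo estimate of the expected accumulated reward of `policy`."""
    rng = random.Random(seed)
    return sum(rollout(policy, rng, **kwargs) for _ in range(episodes)) / episodes


go_right = {s: "right" for s in STATES}
go_left = {s: "left" for s in STATES}

print(expected_return(go_right), expected_return(go_left))
```

In this sketch the better policy is simply the one with the higher estimate. Finding such a policy without trying every table is the actual problem the question poses.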
[Figure labels: direct experience; reset model; generative model; precise description; full table]