Basic Setting

There once was a robot named Robbie.

Robbie had actions, a in A.

Robbie had observations, o in O.

Robbie acted in states s in S.

Robbie wanted to optimize a reward r.

How can Robbie make a policy pi mapping observations to actions and achieving maximal (expected) accumulated reward?
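To make the question concrete, here is a minimal sketch of the setting in Python. The corridor world, the slip probability, the fixed horizon, and every name in it (`observe`, `step`, `rollout`, `expected_return`) are assumptions invented for illustration, not part of the original setting. It shows a policy as a function from observations to actions, and estimates its expected accumulated reward by averaging over simulated episodes.

```python
import random
from typing import Callable, Tuple

# Hypothetical toy world (an assumption for illustration): a 5-cell corridor.
GOAL = 4                     # states s in S are the cell indices 0..4
ACTIONS = ("left", "right")  # actions a in A


def observe(s: int) -> bool:
    """Observation o in O: Robbie only sees whether it stands on the goal."""
    return s == GOAL


def step(s: int, a: str, rng: random.Random) -> Tuple[int, float]:
    """Return (next state, reward r). With probability 0.2 the move slips."""
    if rng.random() < 0.2:
        nxt = s  # slipped: Robbie stays where it is
    elif a == "right":
        nxt = min(s + 1, GOAL)
    else:
        nxt = max(s - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)


# A policy pi maps an observation to an action.
Policy = Callable[[bool], str]


def rollout(pi: Policy, horizon: int, rng: random.Random) -> float:
    """Run one episode from state 0 and return the accumulated reward."""
    s, total = 0, 0.0
    for _ in range(horizon):
        a = pi(observe(s))
        s, r = step(s, a, rng)
        total += r
    return total


def expected_return(pi: Policy, horizon: int = 10,
                    episodes: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected accumulated reward of pi."""
    rng = random.Random(seed)
    return sum(rollout(pi, horizon, rng) for _ in range(episodes)) / episodes


def always_right(o: bool) -> str:
    return "right"


def always_left(o: bool) -> str:
    return "left"
```

Comparing `expected_return(always_right)` with `expected_return(always_left)` ranks the two policies. Robbie's problem is to find the best such policy without being handed the internals of `step`.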

[Figure labels: direct experience, reset model, generative model, precise description, full table]