In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.
The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).
So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.
Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.