Benchmarks for RL

A couple of years ago, Drew Bagnell and I started the RLBench project to set up a suite of reinforcement learning benchmark problems. We haven’t been able to touch it (due to lack of time) for a year, so the project is on hold. Luckily, there are several other projects, such as CLSquare and RL-Glue, with a similar goal, and we strongly endorse their continued development.

I would like to explain why, especially in the context of criticisms of other learning benchmarks. For example, the UCI Machine Learning Repository is sometimes criticized. There are two criticisms I know of:

  1. Learning algorithms have overfit to the problems in the repository. It is easy to imagine a mechanism for this happening unintentionally. Strong evidence of this would be provided by learning algorithms which perform great on the UCI machine learning repository but very badly (relative to other learning algorithms) on non-UCI learning problems. I have seen little evidence of this but it remains a point of concern. There is a natural order to how much we trust performance results:
    1. Competitions. A well run prediction competition provides some of the best evidence available about which learning algorithms work well. There are still some mechanisms for being misled (too many entries, not enough testing data, unlucky choice of learning problem that your algorithm just has the wrong bias for), but they are more controlled than anywhere else.
    2. Benchmarks. Benchmarks provide a “reasonable sanity check” that the algorithm will do something well in practice. Compared to competitions, there are additional ways to be misled and (unfortunately) to mislead. Overfitting becomes a serious issue.
    3. Simulated data. Simulated data can be very misleading. Typically, there is little reason to believe that simulated data reflects the structure of real-world problems.

    This ordering should always be kept in mind when considering empirical results.

  2. The repository enforced a certain interface which many natural learning problems did not fit into. Some interface must be enforced in order for a benchmark/repository to be useful. This implies that problems fitting the interface are preferred, both in the submission of new problems and in the choice of algorithms to solve them. This is a reasonable complaint, but it’s basically a complaint that the UCI machine learning repository was too successful. The good news here is that the internet has made it so anyone can set up a repository: just put up a webserver and announce the dataset, as sketched below.
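
As a concrete illustration of how little infrastructure this requires, here is a minimal sketch that serves a directory of datasets using Python's built-in http.server module; the directory name and port are placeholders, not references to any existing repository.

```python
# Minimal sketch: publish a directory of datasets over HTTP using only the
# Python standard library. "datasets" and port 8000 are placeholder choices.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = functools.partial(SimpleHTTPRequestHandler, directory="datasets")
HTTPServer(("", 8000), handler).serve_forever()

# Equivalent one-liner from a shell:
#   python3 -m http.server 8000 --directory datasets
```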

It is important to consider the benefits of a benchmark suite as well. The standard for experimental studies of reinforcement learning algorithms has been to show that some new algorithm does something reasonable on one or two reinforcement learning problems. In practice, this means the algorithms were hand-tuned to work on those one or two problems, with relatively little emphasis on building a general purpose reinforcement learning algorithm.

This is strongly at odds with the end goal of the reinforcement learning community! The way to avoid this is by setting up a benchmark suite (the more problems, the better). With a large set of problems that people can easily test on, the natural emphasis in empirical algorithm development shifts towards general purpose algorithms.

Another drawback of the “one or two problems” approach is incomparability of results. When people reimplement simulators for reinforcement learning problems, it turns out these reimplementations typically differ in the details, and these details can radically alter the difficulty of the problem. For example, there are many ways to implement the “car on the hill” problem, and some of them are much more difficult to solve than others. Again, a large set of problems people can easily test on solves this problem because the precise definition of the problem is shared.
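
To make the incomparability point concrete, here is a minimal sketch of one common parameterization of the "car on the hill" (mountain car) dynamics; the particular force, gravity, bounds, and reward below are just one choice among many reimplementations, and small changes to any of them can make the problem substantially easier or harder.

```python
import math

# One common "car on the hill" (mountain car) parameterization. The constants
# below are a single choice; altering FORCE, GRAVITY, the goal position, or
# the reward changes how difficult the problem is to solve.
FORCE = 0.001        # thrust per unit action
GRAVITY = 0.0025     # pull of the hill
MIN_POS, MAX_POS = -1.2, 0.6
GOAL_POS = 0.5
MAX_SPEED = 0.07

def step(position, velocity, action):
    """Advance one time step; action is -1 (reverse), 0 (coast), or +1 (forward)."""
    velocity += action * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0               # inelastic collision with the left wall
    done = position >= GOAL_POS
    reward = -1.0                    # cost per step until the goal is reached
    return position, velocity, reward, done
```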

One sometimes reasonable view of the world is that if you define a problem publicly, it is as good as solved since there are plenty of smart people out there eager to prove themselves. According to this view of the world, setting up a benchmark is a vital task.

One Reply to “Benchmarks for RL”

  1. Can you please solve this problem?
    Problem Description:
    Write your own benchmark program to obtain the peak MFLOPS rating of a computer. The benchmark must contain all eight types of operation listed in Fig. 1, using the statistics in the same figure as the ratio between the numbers of these operations. Use the second figure as a model to find the normalized floating-point operations for each real floating-point operation. Use random numbers as data for these operations and ignore overflowed operations.

    Input:
    The following required parameters will be provided from the command prompt.
    Number of operations: the total number of floating-point operations your program has to execute.
    Random operations: whether to execute the operations in random order or serially.

    Output:
    The output of the benchmark should be:
    1. Total execution time (excluding loop overhead, random number generation time, etc.)
    2. Peak MFLOPS rating of the computer.

    What to submit:
    • All the required source code for successful compilation and testing.
    • A short report.
    • The executable file.
    Reference:

    The following figure shows the number of floating-point operations in SPEC CFP2000 171.swim.

    Floating-point operation    Times executed
    Load                        77,033,084,546
    Store                        2,823,523,329
    Copy                         4,274,605,803
    Add                         41,324,938,303
    Sub                        221,443,753,876
    Mul                         31,487,066,317
    Div                          1,428,275,916
    Convert                         11,760,563
    Total                      199,827,008,653

    Fig 1

    The following figure shows the number of normalized floating-point operations per real operation in a program, as used by the authors of the Livermore FORTRAN Kernels to calculate MFLOPS.
    Real FP operations             Normalized FP operations
    Add, Sub, Compare, Multiply    1
    Divide, Square root            4
    Functions                      8

    Also assume that the latency of load, store, copy, and convert is the same as that of add.
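
For illustration only, here is a sketch of the normalization arithmetic the assignment above relies on: it weights the Fig. 1 operation counts by the Livermore-style normalized-FLOP table, treating load, store, copy, and convert like add per the stated assumption. The execution time passed to the function is hypothetical; this is not the requested timing benchmark itself.

```python
# Sketch of the MFLOPS normalization: weight each raw operation count (Fig. 1)
# by its normalized-FLOP value, then divide by execution time in microseconds.
RAW_COUNTS = {
    "Load": 77_033_084_546,
    "Store": 2_823_523_329,
    "Copy": 4_274_605_803,
    "Add": 41_324_938_303,
    "Sub": 221_443_753_876,
    "Mul": 31_487_066_317,
    "Div": 1_428_275_916,
    "Convert": 11_760_563,
}
NORMALIZED_WEIGHT = {
    "Load": 1, "Store": 1, "Copy": 1, "Convert": 1,  # assumed same as add
    "Add": 1, "Sub": 1, "Mul": 1,                    # weight 1 per the table
    "Div": 4,                                        # divide counts as 4
}

def peak_mflops(execution_time_seconds):
    """Peak MFLOPS = normalized floating-point operations / (time * 10**6)."""
    normalized_ops = sum(RAW_COUNTS[op] * NORMALIZED_WEIGHT[op] for op in RAW_COUNTS)
    return normalized_ops / (execution_time_seconds * 1e6)
```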
