Usage information for vw

The program "vw" implements all the algorithms; the flags select which one runs.
 12:45PM wifetries-741: vw

Allowed options:
  -b [ --bit_precision ] arg (=18)       number of bits in the feature table
  -c [ --cache ]                         Use a cache.  The default is 
                                         <data>.cache
  --cache_file arg                       The location of a cache_file.
  -d [ --data ] arg                      Example Set
  --decay_learning_rate arg (=0.7071068) Set Decay Rate of Learning Rate
  -f [ --final_regressor ] arg           Final regressor
  -h [ --help ]                          Output Arguments
  -i [ --initial_regressor ] arg         Initial regressor
  --initial_t arg (=1)                   initial t value
  --power_t arg (=0)                     t power value
  -l [ --learning_rate ] arg (=0.1)      Set Learning Rate
  --passes arg (=1)                      Number of Training Passes
  -p [ --predictions ] arg               File to output predictions to
  -q [ --quadratic ] arg                 Create and use quadratic features
  -r [ --raw_predictions ] arg           File to output unnormalized 
                                         predictions to
  -s [ --seg ]                           Use SEG algorithm
  -t [ --testonly ]                      Ignore label information and just test
  --threads arg (=1)                     Number of threads
  1. -b [ --bit_precision ] arg The learning algorithm's internal representation is a large array of floats indexed by a hash of the feature. This flag sets log2 of the array size, so the default of 18 gives 2^18 entries. By the birthday paradox, avoiding collisions entirely requires about 2*log2(number of features) bits. On very large datasets where we can't easily represent all the features, we found this mechanism for sparsity to be more effective than the sparsification technique used in version 1. Note that speed may depend heavily on this parameter: if the weight vector fits in the L2 cache, the program can be extremely efficient.
  2. -c [ --cache ] Whether or not to use a cache. For linear representations, this typically results in an order of magnitude speedup. The cache file contents depend on -b, and this dependence is checked automatically. If a valid cache is not found, the program starts creating one.
  3. --cache_file arg The location of the cache file. By default it is data_file.cache.
  4. -d [ --data ] arg The training or testing file. See below for the format. The "-d" flag isn't necessary, because an unflagged argument is the datafile by default.
  5. --decay_learning_rate arg The learning rate is multiplied by this quantity after every pass over the data.
  6. -f [ --final_regressor ] arg Which file to output the final regressor into.
  7. -h [ --help ] Output the set of flags. Using no arguments has the same effect.
  8. -i [ --initial_regressor ] arg Start by loading an initial regressor. The regressor file records the -b, -s, and -q flag arguments used when it was produced, and these override any values you give on the command line.
  9. --initial_t arg (=1) An offset to the initial count. This only impacts learning if the learning rate decays with t.
  10. --power_t arg (=0) The exponent applied to 1/(initial_t + t), which controls the learning rate. With the default of 0, the learning rate does not depend on t.
  11. --passes arg The number of times the learning algorithm passes over the data. We found that multiplying the learning rate by 1/2^0.5 (about 0.7071) after each pass was effective, so this is the default decay. You can change the decay rate via --decay_learning_rate or create your own multipass algorithm via use of --initial_regressor.
  12. -p [ --predictions ] arg File to output predictions to. This can be used during either training or testing. Note that if order matters then you should set --threads 1.
  13. -q [ --quadratic ] arg Whether to create quadratic features, and which ones. The argument is two characters, each the first character of a namespace; quadratic features are created from that pair of namespaces.
  14. -r [ --raw_predictions ] arg A file to output raw (unnormalized) predictions to. This is sometimes helpful if you are using the score for ordering rather than probabilistic prediction.
  15. -s [ --seg ] Use the specialist exponentiated gradient algorithm.
  16. -t [ --testonly ] Ignore any available label information and don't train. You probably want to use -p and --threads 1 also.
  17. --threads arg The number of threads to use. Performance can vary substantially depending on whether the datafile is cached in RAM and whether the weight vector fits in your L2 cache. The default is 1, as shown in the option listing above.
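
The flags -l, --initial_t, --power_t, and --decay_learning_rate together shape the step size over time. The Python sketch below shows one way they might combine. The flag descriptions do not spell out the exact formula used inside vw, so the product form here is an assumption, and the function name effective_learning_rate and its parameter names are hypothetical.

```python
def effective_learning_rate(t, pass_index, base_rate=0.1, initial_t=1.0,
                            power_t=0.0, decay=0.7071068):
    """Assumed schedule, not taken from the vw source:
    base_rate * decay^pass_index * (1 / (initial_t + t))^power_t.

    t          -- number of examples seen so far
    pass_index -- 0 for the first pass over the data, 1 for the second, ...
    Defaults mirror the values shown in the option listing.
    """
    per_pass = decay ** pass_index                    # --decay_learning_rate
    per_example = (1.0 / (initial_t + t)) ** power_t  # --initial_t, --power_t
    return base_rate * per_pass * per_example


if __name__ == "__main__":
    # With power_t = 0 the rate changes only between passes.
    for k in range(3):
        print("pass", k, "rate", effective_learning_rate(t=0, pass_index=k))
```

Under this assumed form, the default --power_t of 0 makes --initial_t irrelevant, which matches the note that initial_t only matters when the learning rate decays with t.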

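As a rough aid for choosing -b, the Python sketch below works through the birthday-paradox arithmetic. It assumes the hash spreads features uniformly over the table and that each weight is a 4-byte float; neither is stated above, and all function names are hypothetical.

```python
import math


def bits_for_no_collisions(num_features):
    """Rule of thumb from the -b description: about 2*log2(number of features)."""
    return math.ceil(2 * math.log2(num_features))


def expected_colliding_pairs(num_features, bits):
    """Birthday-paradox estimate under a uniform hash (an assumption):
    n*(n-1)/2 feature pairs, each colliding with probability 1/2^bits."""
    table_size = 2 ** bits
    return num_features * (num_features - 1) / (2 * table_size)


def table_bytes(bits, bytes_per_weight=4):
    """Memory for the weight array, assuming 4-byte floats."""
    return (2 ** bits) * bytes_per_weight


if __name__ == "__main__":
    n = 1024
    b = bits_for_no_collisions(n)
    print("features:", n, "suggested bits:", b)
    print("expected colliding pairs:", expected_colliding_pairs(n, b))
    print("bytes at the default -b 18:", table_bytes(18))
```

The memory figure is the one to compare against your L2 cache size when tuning for speed.
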
Data file format

The training set is a text file with one example per line, of the form:

  <label> <weight> |<namespace> <feature> <feature> ... |<namespace> <feature> <feature> ...

The semantics: a feature name that appears in two different namespaces denotes two different features. See the datasets above for examples. To specify a value for a feature, append :<float> to the feature, or to the namespace to apply it to every feature in that namespace. For example, "|txt:-1 foo bar baz" says that the features "foo", "bar", and "baz" each have value -1 (rather than the default of 1). If you don't specify a label, the learning algorithm doesn't try to learn from the example (but it does test on it). If you don't specify a weight, it defaults to 1.
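
To make the format concrete, here is a minimal Python sketch of a reader for one line. It is not the parser vw uses. The names split_value and parse_example are hypothetical, as are the "meta" namespace and "size" feature in the usage lines. One point is not settled by the description and is assumed here: when both a namespace and a feature carry a value, the two are multiplied.

```python
def split_value(token, default=1.0):
    """Split 'name:value' into (name, value); the value defaults to 1."""
    name, sep, value = token.partition(":")
    return name, (float(value) if sep else default)


def parse_example(line):
    """Return (label, weight, {namespace: {feature: value}}) for one line."""
    header, *sections = line.strip().split("|")
    head = header.split()
    label = float(head[0]) if len(head) >= 1 else None  # no label: test only
    weight = float(head[1]) if len(head) >= 2 else 1.0  # weight defaults to 1
    namespaces = {}
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        # The first token after '|' is the namespace, optionally with a value.
        namespace, scale = split_value(tokens[0])
        features = namespaces.setdefault(namespace, {})
        for token in tokens[1:]:
            name, value = split_value(token)
            # Assumption: namespace and feature values multiply.
            features[name] = scale * value
    return label, weight, namespaces


if __name__ == "__main__":
    print(parse_example("1 2 |txt:-1 foo bar baz |meta size:3"))
    print(parse_example("|txt foo"))
```

Keying the result by namespace first keeps same-named features in different namespaces distinct, as the semantics above require.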