You can get up to 40 points by doing any of the following projects. (No more than 40 points will be given regardless of how many projects you complete. Exception: If you are the best performer in the prototype selection problem, you will get 5 extra bonus points.) For example, you can do one theory problem (10 points) and both programming problems (30 points). Alternatively, you can do all theory problems (30 points) and a reading assignment (10 points). The choice is yours.
All answers must be typed (preferably in LaTeX) and emailed to Alina by May 4 (midnight). Late projects will not be accepted.
Your goal is to design, implement, and evaluate such a k-NN prototype selection algorithm. Here is what you need to submit:
Write a function that takes a labeled dataset (X, y), a value k, and a total number of prototypes p, and returns the p prototype vectors and their labels (a minimal interface sketch appears after this list).
Apply your function to the MNIST training data (follow the link to learn more about the dataset). You cannot use the test data in selecting your prototypes (doing so will automatically result in 0 points for the problem), but if you like, you can set aside some of the training data as a holdout validation set.
Evaluate the resulting prototypes on the MNIST test data using k-nearest neighbors (you can use Weka's implementation of the algorithm), presenting both an overall error rate and a confusion matrix M, in which M_ij is the number of test digits with true label i that are classified as j (perfect performance gives a diagonal matrix). Use cross-validation to choose k and p.
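As a concrete starting point, here is a minimal sketch of the interface described above, written in Python with NumPy and scikit-learn assumed. The stratified random selection inside select_prototypes is only a placeholder baseline, not a proposed solution; your own selection strategy should replace it, and all names are illustrative.

```python
# Minimal sketch of the required interface (NumPy / scikit-learn assumed).
# The random stratified selection is only a placeholder; replace the body of
# select_prototypes with your own strategy. Prototypes must come from training data only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def select_prototypes(X, y, k, p, random_state=0):
    """Return (roughly) p prototype vectors and their labels chosen from (X, y)."""
    rng = np.random.RandomState(random_state)
    classes = np.unique(y)
    per_class = p // len(classes)              # balance prototypes across the digit classes
    idx = []
    for c in classes:
        candidates = np.flatnonzero(y == c)
        idx.extend(rng.choice(candidates, size=per_class, replace=False))
    idx = np.asarray(idx)
    return X[idx], y[idx]

def evaluate(prototypes, proto_labels, X_test, y_test, k):
    """Fit k-NN on the prototypes only; report test error and confusion matrix."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(prototypes, proto_labels)
    y_pred = knn.predict(X_test)
    error = np.mean(y_pred != y_test)
    M = confusion_matrix(y_test, y_pred)       # M[i, j] = # of digits i classified as j
    return error, M
```

The values of k and p would then be chosen by cross-validation (or a holdout split) on the training data only, as required above.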
Baseline (Passive learning): Query all points in the stream.
Fixed margin: Query all points within a fixed margin g (experiment with different g, including g=0.3).
Shrinking: Query all points within a margin starting at g and dropping by 1/2 after every k correct predictions (experiment with different k, including k=4). A rough sketch of these margin-based rules appears after the reference below.
The DKM algorithm (Dasgupta, Kalai, and Monteleoni) and the CBGZ algorithm (Cesa-Bianchi, Gentile, and Zaniboni). Both algorithms are described in the following paper (Figures 1 and 2). Use the same experimental methodology (see Section 4.2 of the paper):
C. Monteleoni and M. Kääriäinen, Practical Online Active Learning for Classification, CVPR 2007.
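For the first three querying rules, one possible structure is sketched below in Python, assuming a perceptron-style linear learner on +1/-1 labels; all function and variable names are illustrative. This sketch is not the DKM or CBGZ algorithm; those should follow Figures 1 and 2 of the paper.

```python
# Rough sketch of the baseline / fixed-margin / shrinking-margin querying rules
# (perceptron-style learner assumed; names are illustrative only).
import numpy as np

def run_stream(X, y, rule="fixed", g=0.3, k=4, lr=1.0):
    """Process (X, y) as a stream, querying a label only when the rule fires.

    rule = "baseline"  -> query every point (passive learning)
    rule = "fixed"     -> query when the margin |w.x| is at most g
    rule = "shrinking" -> as above, but drop g by 1/2 after every k correct
                          predictions on queried points (one reading of the rule)
    """
    w = np.zeros(X.shape[1])
    margin, correct_count = g, 0
    n_queries = n_updates = 0
    for x_t, y_t in zip(X, y):                 # y_t assumed to be +1 / -1
        score = np.dot(w, x_t)
        query = (rule == "baseline") or (abs(score) <= margin)
        if not query:
            continue
        n_queries += 1
        if y_t * score <= 0:                   # mistake: perceptron-style update
            w += lr * y_t * x_t
            n_updates += 1
        else:
            correct_count += 1
            if rule == "shrinking" and correct_count == k:
                margin /= 2                    # shrink the querying margin
                correct_count = 0
    return w, n_queries, n_updates
```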
Here is what you need to submit:
Plot the test error as a function of the number of points queried, for all the methods (see the plotting sketch after this list).
Plot the number of updates performed as a function of the number of points queried. Explain the differences between the methods.
A report of what you have learned.
Your code (you can use any programming language as long as you make it easy for anyone to run the code and verify your results). As in problem 1, using test examples in tuning the algorithms will result in 0 credit.
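For the two plots, something along these lines may be a convenient starting point (Python with matplotlib assumed; `results` is a hypothetical dictionary mapping each method name to its recorded curves).

```python
# Possible way to produce the two required plots (matplotlib assumed).
# `results` is a hypothetical dict: method name -> (queries, test_errors, updates).
import matplotlib.pyplot as plt

def plot_curves(results, out_path="active_learning_curves.png"):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for name, (queries, errors, updates) in results.items():
        ax1.plot(queries, errors, label=name)    # test error vs. points queried
        ax2.plot(queries, updates, label=name)   # updates vs. points queried
    ax1.set_xlabel("number of points queried")
    ax1.set_ylabel("test error")
    ax2.set_xlabel("number of points queried")
    ax2.set_ylabel("number of updates")
    ax1.legend()
    ax2.legend()
    plt.tight_layout()
    plt.savefig(out_path)
```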