You can get up to 40 points by doing any of the following projects. (No more than 40 points will be given regardless of how many projects you complete. Exception: If you are the best performer in the prototype selection problem, you will get 5 extra bonus points.) For example, you can do one theory problem (10 points) and both programming problems (30 points). Alternatively, you can do all theory problems (30 points) and a reading assignment (10 points). The choice is yours.
All answers must be typed (preferably in LaTeX) and emailed to Alina by May 4 (midnight). Late projects will not be accepted.
Your goal is to design, implement, and evaluate such a k-NN prototype selection algorithm. Here is what you need to submit:
Write a function that takes a labeled dataset (X, y), a value k, and a total number of prototypes p, and returns the p prototype vectors and their labels (a minimal interface sketch appears after this list).
Apply your function to the MNIST training data (follow the link to learn more about the dataset). You cannot use the test data in selecting your prototypes (doing so will automatically result in 0 points for the problem), but if you like, you can set aside some of the training data as a holdout validation set.
Evaluate the resulting prototypes on the MNIST test data using k-nearest neighbors (you can use Weka's implementation of the algorithm), presenting both an overall error rate and a confusion matrix M, in which M_ij is the number of test digits with true label i that are classified as j (perfect performance gives a diagonal matrix). Use cross-validation to choose k and p.
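As a concrete starting point, here is a minimal sketch of the interface described above, written in Python with NumPy and scikit-learn assumed. The stratified random selection inside select_prototypes is only a placeholder baseline, not a proposed solution; your own selection strategy should replace it, and all names are illustrative.

```python
# Minimal sketch of the required interface (NumPy / scikit-learn assumed).
# The random stratified selection is only a placeholder; replace the body of
# select_prototypes with your own strategy. Prototypes must come from training data only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def select_prototypes(X, y, k, p, random_state=0):
    """Return (roughly) p prototype vectors and their labels chosen from (X, y)."""
    rng = np.random.RandomState(random_state)
    classes = np.unique(y)
    per_class = p // len(classes)              # balance prototypes across the digit classes
    idx = []
    for c in classes:
        candidates = np.flatnonzero(y == c)
        idx.extend(rng.choice(candidates, size=per_class, replace=False))
    idx = np.asarray(idx)
    return X[idx], y[idx]

def evaluate(prototypes, proto_labels, X_test, y_test, k):
    """Fit k-NN on the prototypes only; report test error and confusion matrix."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(prototypes, proto_labels)
    y_pred = knn.predict(X_test)
    error = np.mean(y_pred != y_test)
    M = confusion_matrix(y_test, y_pred)       # M[i, j] = # of digits i classified as j
    return error, M
```

The values of k and p would then be chosen by cross-validation (or a holdout split) on the training data only, as required above.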
Baseline (Passive learning): Query all points in the stream.
Fixed margin: Query all points within a fixed margin g (experiment with different g, including g=0.3).
Shrinking: Query all points within a margin starting at g and dropping by 1/2 after every k correct predictions (experiment with different k, including k=4). A rough sketch of these margin-based rules appears after the reference below.
The DKM algorithm (Dasgupta, Kalai, and Monteleoni) and the CBGZ algorithm (Cesa-Bianchi, Gentile, and Zaniboni). Both algorithms are described in the following paper (Figures 1 and 2). Use the same experimental methodology (see Section 4.2 of the paper):
C. Monteleoni and M. Kääriäinen, Practical Online Active Learning for Classification, CVPR 2007.
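For the first three querying rules, one possible structure is sketched below in Python, assuming a perceptron-style linear learner on +1/-1 labels; all function and variable names are illustrative. This sketch is not the DKM or CBGZ algorithm; those should follow Figures 1 and 2 of the paper.

```python
# Rough sketch of the baseline / fixed-margin / shrinking-margin querying rules
# (perceptron-style learner assumed; names are illustrative only).
import numpy as np

def run_stream(X, y, rule="fixed", g=0.3, k=4, lr=1.0):
    """Process (X, y) as a stream, querying a label only when the rule fires.

    rule = "baseline"  -> query every point (passive learning)
    rule = "fixed"     -> query when the margin |w.x| is at most g
    rule = "shrinking" -> as above, but drop g by 1/2 after every k correct
                          predictions on queried points (one reading of the rule)
    """
    w = np.zeros(X.shape[1])
    margin, correct_count = g, 0
    n_queries = n_updates = 0
    for x_t, y_t in zip(X, y):                 # y_t assumed to be +1 / -1
        score = np.dot(w, x_t)
        query = (rule == "baseline") or (abs(score) <= margin)
        if not query:
            continue
        n_queries += 1
        if y_t * score <= 0:                   # mistake: perceptron-style update
            w += lr * y_t * x_t
            n_updates += 1
        else:
            correct_count += 1
            if rule == "shrinking" and correct_count == k:
                margin /= 2                    # shrink the querying margin
                correct_count = 0
    return w, n_queries, n_updates
```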
Here is what you need to submit:
Plot the test error as a function of the number of points queried, for all the methods (see the plotting sketch after this list).
Plot the number of updates performed as a function of the number of points queried. Explain the differences between the methods.
A report of what you have learned.
Your code (you can use any programming language as long as you make it easy for anyone to run the code and verify your results). As in problem 1, using test examples in tuning the algorithms will result in 0 credit.
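For the two plots, something along these lines may be a convenient starting point (Python with matplotlib assumed; `results` is a hypothetical dictionary mapping each method name to its recorded curves).

```python
# Possible way to produce the two required plots (matplotlib assumed).
# `results` is a hypothetical dict: method name -> (queries, test_errors, updates).
import matplotlib.pyplot as plt

def plot_curves(results, out_path="active_learning_curves.png"):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for name, (queries, errors, updates) in results.items():
        ax1.plot(queries, errors, label=name)    # test error vs. points queried
        ax2.plot(queries, updates, label=name)   # updates vs. points queried
    ax1.set_xlabel("number of points queried")
    ax1.set_ylabel("test error")
    ax2.set_xlabel("number of points queried")
    ax2.set_ylabel("number of updates")
    ax1.legend()
    ax2.legend()
    plt.tight_layout()
    plt.savefig(out_path)
```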