VU FDA

Assignment 3: Classification analysis

Due date: December 6

For all questions, please use Python. Submit your code as a single Jupyter notebook called l2.ipynb, and comment it clearly so we can tell which parts of the code belong to which question. For the writeup, also submit a PDF called l2.pdf; the notebook should produce this PDF. It is your responsibility to ensure that the notebook runs on the grader's computer. We will use a Python distribution set up with scikit-learn, numpy, and matplotlib.

Question 1 (SVM) (40 points)

For this question we will use the SVM module of scikit-learn. Installation instructions are available on the scikit-learn website.

The data in voices.data consists of digitized frequencies of phonemes from five different classes. Each sample has 256 features, each of which can take values 0–255. The classes refer to what phoneme the person is speaking. There are more than five phonemes in all human languages I know, but five is complex enough for the assignment :)

a) Produce a separate plot of the phoneme curves for each class (i.e., 5 plots). This should help you understand which frequencies are likely to be important in classification. Describe which features you think are most important for classification based on these graphs (a short paragraph is sufficient). Also, before doing part b, write down whether you think a linear classifier is sufficient for this problem, and why or why not (again, a paragraph is sufficient).
b) Use the support vector machine classifier to classify the dataset. Use a 50/50 split into training and test data, making sure that no speaker appears in both the training and test sets. Use a linear, a quadratic, and a Gaussian kernel for your classification and compare the results. Discuss the model performance for each type of kernel (1-2 sentences). For each of these kernels, feel free to experiment with the other parameters and report what worked better or worse, and why (1-2 sentences).
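
As a rough illustration of the speaker-disjoint split and the three kernels, here is a minimal sketch. It is not a reference solution. The layout of voices.data is not described above, so the sketch runs on synthetic stand-in data, and the names X, y and speakers are assumptions. The quadratic kernel is taken to mean a degree-2 polynomial kernel, and the feature standardization is an extra choice made for the sketch.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for voices.data (assumption: the real file layout and
# the way speakers are identified are not specified in the assignment text).
rng = np.random.default_rng(0)
n_samples, n_features, n_classes, n_speakers = 400, 256, 5, 20
y = rng.integers(0, n_classes, size=n_samples)          # class labels
speakers = rng.integers(0, n_speakers, size=n_samples)  # hypothetical speaker ids
class_means = rng.uniform(0, 255, size=(n_classes, n_features))
X = np.clip(class_means[y] + rng.normal(0, 20, size=(n_samples, n_features)), 0, 255)

# Split by speaker: half of the speakers go to the test set, so no speaker
# contributes samples to both sides. The sample counts are then only
# approximately 50/50.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=speakers))

# One classifier per kernel; all other parameters are left at their defaults.
classifiers = {
    "linear": SVC(kernel="linear"),
    "quadratic": SVC(kernel="poly", degree=2),
    "gaussian": SVC(kernel="rbf"),
}

accuracy = {}
for name, clf in classifiers.items():
    # Standardize features before the SVM (a choice made for this sketch).
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X[train_idx], y[train_idx])
    accuracy[name] = model.score(X[test_idx], y[test_idx])
    print(f"{name}: test accuracy = {accuracy[name]:.3f}")
```

Parameters such as C, gamma and coef0 of SVC are natural candidates for the "other parameters" mentioned in part b.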

Question 2 (Neural network) (60 points)

For this part of the assignment we will be implementing our own nascent neural network library! Soon we will be unstoppable :) You can use the same dataset from question 1.

a) Write a program to fit a neural network with a single hidden layer (10 hidden nodes) that is trained using backpropagation. For the stopping condition you can use either a variable number of iterations or a minimum change in error, but report which one you used.
b) Apply your neural network program to the voices.data dataset. Use a cross-validation procedure with a 50/50 split of test and training data. Generate graphs of the test and training error of the neural network as a function of the number of training iterations.
c) Vary the number of hidden units in the network from 1 to 10, and report how performance changes with the number of hidden units. Justify your conclusions (2-3 paragraphs). Again, you will need the graphs of the training and test performance for each network architecture to justify your conclusions.
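
The overall shape of such a program might look like the sketch below. It is only one possible design and makes several assumptions the assignment does not state: tanh hidden units, a softmax output with cross-entropy loss, full-batch gradient descent, a configurable iteration count as the stopping condition, and synthetic stand-in data in place of voices.data. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, params):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)   # hidden activations
    P = softmax(H @ W2 + b2)   # class probabilities
    return H, P

def error_rate(X, y, params):
    # Fraction of misclassified samples.
    return float(np.mean(forward(X, params)[1].argmax(axis=1) != y))

def train(X_tr, y_tr, X_te, y_te, n_classes, n_hidden=10, n_iter=200, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X_tr.shape[1]
    params = [rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden),
              rng.normal(0, 0.1, (n_hidden, n_classes)), np.zeros(n_classes)]
    T = np.zeros((len(y_tr), n_classes))
    T[np.arange(len(y_tr)), y_tr] = 1.0  # one-hot targets
    history = []
    for _ in range(n_iter):
        H, P = forward(X_tr, params)
        # Backpropagation of the mean cross-entropy loss.
        d_out = (P - T) / len(X_tr)                     # gradient at output pre-activations
        d_hid = (d_out @ params[2].T) * (1.0 - H ** 2)  # through tanh: derivative is 1 - tanh^2
        grads = [X_tr.T @ d_hid, d_hid.sum(axis=0), H.T @ d_out, d_out.sum(axis=0)]
        for p, g in zip(params, grads):
            p -= lr * g                                 # in-place gradient descent step
        history.append((error_rate(X_tr, y_tr, params), error_rate(X_te, y_te, params)))
    return params, np.array(history)  # history columns: training error, test error

# Synthetic stand-in data (assumption), scaled from 0-255 to [-0.5, 0.5].
rng = np.random.default_rng(1)
n_samples, n_features, n_classes = 300, 256, 5
y = rng.integers(0, n_classes, size=n_samples)
means = rng.uniform(0, 255, size=(n_classes, n_features))
X = np.clip(means[y] + rng.normal(0, 20, size=(n_samples, n_features)), 0, 255) / 255.0 - 0.5

# Simple random 50/50 split for the sketch.
perm = rng.permutation(n_samples)
tr, te = perm[: n_samples // 2], perm[n_samples // 2:]

params, history = train(X[tr], y[tr], X[te], y[te], n_classes)

# Final test error for 1 to 10 hidden units.
final_test_error = {
    h: train(X[tr], y[tr], X[te], y[te], n_classes, n_hidden=h)[1][-1, 1]
    for h in range(1, 11)
}
```

The two columns of history can be plotted against the iteration index with matplotlib to obtain the training and test error curves asked for in part b.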

Procedure and Submission

Please submit a ZIP file with your answers to Moodle. Use the following naming scheme for your submission: “lastname_matrikelnumber_A3.zip”. The naming of files is important: if you do not follow the submission instructions, you will receive a grade of 0 for the lab.

Late submission

Late submissions are NOT possible. Any assignment submitted late will receive zero points.

Academic Honesty