Classification Lab

Lecturer: Prof. Dr. Moritz Grosse-Wentrup
Assistant (lab author and grader): Alex Markham

Due: 11:00 AM, 06 May 2019

0  Description and Instructions

0.1  Description

In this lab, you will start by proving some basic properties about the information theoretic concept of entropy. You will then use entropy in the loss function as you implement the ID3 algorithm for learning a binary decision tree classifier. Next, you will derive the complexity bounds of this classifier as a function of its depth. Finally, you will plot learning curves to explore the interaction between the complexity of the decision tree model, the amount of training data required, and the accuracy and generalizability of the model.

In addition to learning about decision tree classifiers, please view this lab as an opportunity to practice good software development and presentation skills, including the reporting of computational results.

0.2  Submission Procedure

You must submit your assignment via [TBD]. Your submission must contain exactly two files:

0.3  Late Submission Policy

The main purpose of this lab is that you learn something. In some cases (e.g., becoming ill, having a traumatic experience), hard deadlines can hinder the learning process. In acknowledgment of this, extensions may be granted (within reason), but they require explicit communication and permission as soon as possible, preferably before the due date. Nevertheless, be aware that this is a long assignment, requiring approximately 20 hours of work, so it is imperative that you start early and do not procrastinate.

0.4  Academic Honesty

Again, the main purpose of this lab is that you learn something. Academic dishonesty undermines your chance to learn from this lab, so please follow the academic honesty policy.

1  Entropy properties (40 Points)

Consider a discrete random variable P that takes on possible values 1, …, n with probabilities p = (p_1, …, p_n). The entropy H(P) of this random variable is defined as

H(P) = −∑_{i=1}^{n} p_i log p_i,     (1)

and the joint entropy H(P,Q) of the pair of random variables P and Q is defined as

H(P,Q) = −∑_{i=1}^{n} ∑_{j=1}^{m} ℙ(P=i, Q=j) log ℙ(P=i, Q=j).     (2)

Let us also define the concept of conditional entropy for a discrete random variable P with distribution p = (p_1, …, p_n), given some other discrete random variable Q with distribution q = (q_1, …, q_m):

H(P|Q) = ∑_{j=1}^{m} q_j H(P|Q=j),     (3)

where, without loss of generality, we have assumed that our random variables take values 1, …, n and 1, …, m, respectively.
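The definitions above can also be sanity-checked numerically, which will come in handy later when entropy is needed for the ID3 implementation. The following is a minimal Python sketch, not part of the proofs asked for below; it assumes the distributions are estimated from sequences of observed values, and the names `entropy` and `conditional_entropy` are illustrative rather than prescribed by the assignment.

```python
from collections import Counter
from math import log

def entropy(values):
    """H(P) as in Eq. (1), estimated from a sequence of observed values.
    The base of the logarithm only rescales the result; natural log is used here."""
    total = len(values)
    return -sum((c / total) * log(c / total) for c in Counter(values).values())

def conditional_entropy(p_values, q_values):
    """H(P|Q) as in Eq. (3): the average of H(P | Q = j), weighted by q_j."""
    total = len(q_values)
    result = 0.0
    for j, count in Counter(q_values).items():
        subset = [p for p, q in zip(p_values, q_values) if q == j]
        result += (count / total) * entropy(subset)
    return result

# A fair coin has entropy log 2 ≈ 0.693 nats.
print(entropy(["heads", "tails"]))
# If Q tells us nothing about P, then H(P|Q) = H(P) (compare property 2 below).
print(conditional_entropy([0, 1, 0, 1], ["a", "a", "b", "b"]))
```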

Prove the following properties:

  1. H(P,Q) = H(P|Q) + H(Q) = H(Q|P) + H(P)
  2. If P and Q are independent, then H(P|Q) = H(P)
  3. max{H(P), H(Q)} ≤ H(P,Q) ≤ H(P) + H(Q). Hint: You can assume that the inequality log x ≤ x − 1 holds for all x > 0, with equality if and only if x = 1.

2  ID3 algorithm

Recall the pseudocode for the ID3 algorithm from lecture and the textbook [1, p. 253]:

[ID3 pseudocode omitted here; see [1, p. 253]]

Your task is to implement ID3 using the Information Gain:

Gain(S, i) = H(S) − H(S ∣ i).

Notice that the information gain makes use of entropy and conditional entropy as defined in Problem 1. Intuitively, the information gain is a measure of the change in entropy before and after a prospective split. The greedy strategy of ID3 is to choose the split that maximizes the information gain in each iteration.
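As a concrete (but not prescribed) illustration, the gain of splitting a set S of labelled samples on an attribute i can be computed as follows. The snippet repeats a compact entropy helper so that it runs on its own, and it assumes samples are represented as dicts mapping attribute names to values, which is just one possible representation.

```python
from collections import Counter
from math import log

def entropy(values):
    total = len(values)
    return -sum((c / total) * log(c / total) for c in Counter(values).values())

def information_gain(samples, labels, attribute):
    """Gain(S, i) = H(S) - H(S | i) for splitting the samples on one attribute."""
    total = len(labels)
    remainder = 0.0  # this is H(S | i), the conditional entropy after the split
    for value in set(s[attribute] for s in samples):
        subset = [y for s, y in zip(samples, labels) if s[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: the attribute predicts the label perfectly, so the gain equals H(S) = log 2.
samples = [{"vote": "y"}, {"vote": "y"}, {"vote": "n"}, {"vote": "n"}]
labels = ["democrat", "democrat", "republican", "republican"]
print(information_gain(samples, labels, "vote"))  # ≈ 0.693
```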

In addition to the usual termination conditions for the recursive calls of the ID3 algorithm, make your implementation take an additional input parameter, n, the maximum number of nodes in the decision tree, allowing you to better control the complexity of the returned decision tree. Furthermore, add another input parameter, d, the maximum number of features/attributes that the decision tree will use.
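Below is one possible skeleton for this task, offered as a hedged sketch rather than a reference solution. It repeats the helpers from the previous snippet so it runs on its own, enforces the node budget n by reserving budget for all children before recursing, and reads "at most d attributes" as keeping only the d attributes with the highest information gain on the full training set as split candidates; other readings of both parameters are equally defensible, and all function names here are illustrative.

```python
from collections import Counter
from math import log

def entropy(values):
    total = len(values)
    return -sum((c / total) * log(c / total) for c in Counter(values).values())

def information_gain(samples, labels, attribute):
    total = len(labels)
    remainder = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [y for s, y in zip(samples, labels) if s[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(samples, labels, attributes, budget):
    """Recursive ID3. `budget` is a one-element list holding the number of nodes
    that may still be added; it is shared by all recursive calls so the whole
    tree stays within the limit."""
    default = majority_label(labels)
    # Usual termination conditions: all labels identical, or no attributes left.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": default}
    best = max(attributes, key=lambda a: information_gain(samples, labels, a))
    values = sorted(set(s[best] for s in samples))
    # Additional termination condition: not enough node budget left for the children.
    if budget[0] < len(values):
        return {"leaf": default}
    budget[0] -= len(values)  # reserve one node per child before recursing
    children = {}
    for v in values:
        idx = [k for k, s in enumerate(samples) if s[best] == v]
        children[v] = id3([samples[k] for k in idx],
                          [labels[k] for k in idx],
                          [a for a in attributes if a != best],
                          budget)
    return {"attribute": best, "children": children, "default": default}

def train_id3(samples, labels, n, d):
    """Wrapper exposing the two complexity parameters: n = maximum number of nodes,
    d = maximum number of attributes the tree may use."""
    ranked = sorted(samples[0].keys(),
                    key=lambda a: information_gain(samples, labels, a),
                    reverse=True)
    return id3(samples, labels, ranked[:d], [n - 1])  # the root already counts as one node

def predict(tree, sample):
    """Follow the tree to a leaf; fall back to the majority label for unseen values."""
    while "leaf" not in tree:
        child = tree["children"].get(sample[tree["attribute"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree["leaf"]
```

Because the budget is consumed depth-first, early branches may use up most of the allowed nodes; growing the tree breadth-first is an alternative if you prefer a more balanced use of the budget.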

3  Complexity

Recall from the lecture and from the textbook [1, pp. 251–252] that decision trees are prone to overfitting and poor generalization if we do not limit their size and therefore complexity. In particular, recall that the difference between the true risk and the empirical risk can be bounded by a function of the number of training samples m, the number of features/attributes d of the samples, and the number of nodes n in the decision tree.

The relationship between m, d, n, the true risk, the empirical risk, and the generalizability is complicated: the empirical risk can be lowered by increasing the complexity (number of nodes and features), but this can increase the difference between the true and empirical risk (i.e., lower generalizability) if the more complex model is not also trained on more samples. To get a feeling for this, you will experiment with different values of these parameters and visualize your experiments with learning curves. (Note: there is nothing to submit for Section 3; its purpose is just to mention the complexity of ID3 and to segue to the next section.)

4  Learning Curves (60 Points)

You will be using the Congressional Voting Records Data Set from the UCI repository. Review the data description to get a sense of the classification task and the data. Notice that the attributes are generally binary (yea or nay), but there are cases where neither of these applies, in which case the attribute takes the value ?. You must deal with this missing value somehow. One option would be to randomly (from the uniform distribution) assign yea or nay to every ? as soon as you load the data set, before training on it. Another option would be to make your implementation able to handle the missing values itself, for example, by randomly assigning yea or nay any time it encounters a ?. There are also other ways of handling missing values. Pick whichever you like (one of the two proposed above, or some other method that you devise), to ensure there are no problems running your implementation on the given data.
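As one concrete way to implement the first option (random imputation at load time), here is a hedged sketch. It assumes the raw UCI file `house-votes-84.data` in its usual comma-separated layout, with the class label (democrat or republican) in the first column followed by 16 votes coded y, n, or ?; adjust the path and the parsing to whatever form of the data you actually use.

```python
import csv
import random

def load_votes(path, seed=0):
    """Load the Congressional Voting Records data, replacing every '?' with a
    uniformly random 'y' or 'n'. Assumes one sample per line: class label first,
    then 16 comma-separated votes."""
    rng = random.Random(seed)
    samples, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            labels.append(row[0])
            votes = [v if v != "?" else rng.choice(["y", "n"]) for v in row[1:]]
            # represent each sample as a dict: attribute index -> vote
            samples.append({i: v for i, v in enumerate(votes)})
    return samples, labels

# samples, labels = load_votes("house-votes-84.data")
```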

The train and test split involves separating a data set into two parts: a training set, used to fit the model, and a test set, used to evaluate the fitted model.

The samples or data points assigned to each set are selected randomly. This is an attempt to ensure that the training and evaluation of a model are not biased. If multiple algorithms, or multiple configurations of the same algorithm, are compared, the same train and test split of the data set should be used, so that the comparison of performance is consistent. We can achieve this by seeding the random number generator the same way before splitting the data, or by saving the split and reusing it for all algorithms/configurations.

Implement a function split_data() that splits the data into a training set and test set: the function should take two arguments, the data set to be split and the percentage of the data to be put into the training set, and return the training and test sets.

Now, make your training (70%) and test (30%) sets, which should be fixed for the rest of the lab.
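One way this could look, as a sketch: the data set is assumed to be a list of (sample, label) pairs, and an optional seed argument (not required by the assignment) is used to make the split reproducible.

```python
import random

def split_data(data, train_percent, seed=0):
    """Split `data` (a list of (sample, label) pairs) into training and test sets.
    `train_percent` is the percentage of samples placed in the training set."""
    rng = random.Random(seed)   # fixed seed so the same split is reproducible
    shuffled = data[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_percent / 100))
    return shuffled[:cut], shuffled[cut:]

# Usage: a 70% / 30% split, kept fixed for the rest of the lab.
# data = list(zip(samples, labels))
# train_set, test_set = split_data(data, 70)
```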

In order to visualize the complex relationship between the model’s complexity and generalizability, you will plot learning curves. These are plots of the model’s performance on the training set and the validation set as a function of the training set size.
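One possible shape for the learning-curve code is sketched below. It assumes the `train_id3` and `predict` functions from the Section 2 sketch and a `train_set` and `test_set` of (sample, label) pairs as returned by `split_data`; here the fixed test set plays the role of the held-out set, and the error rate is used as the performance measure. All names and the exact plotting choices are illustrative.

```python
import matplotlib.pyplot as plt

def error_rate(tree, dataset):
    """Fraction of misclassified samples in a list of (sample, label) pairs."""
    return sum(1 for x, y in dataset if predict(tree, x) != y) / len(dataset)

def learning_curve(train_set, test_set, increments, d, n):
    """Train on increasingly large prefixes of train_set and plot the training
    and test error against the number of training samples used."""
    sizes, train_err, test_err = [], [], []
    for k in range(1, increments + 1):
        m = max(1, k * len(train_set) // increments)
        subset = train_set[:m]
        samples = [x for x, _ in subset]
        labels = [y for _, y in subset]
        tree = train_id3(samples, labels, n=n, d=d)
        sizes.append(m)
        train_err.append(error_rate(tree, subset))
        test_err.append(error_rate(tree, test_set))
    plt.plot(sizes, train_err, label="training error")
    plt.plot(sizes, test_err, label="test error")
    plt.xlabel("number of training samples")
    plt.ylabel("error rate")
    plt.title(f"Learning curve (d={d}, n={n})")
    plt.legend()
    plt.show()
```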

Now, use your learning-curve function to generate five learning curves, each with the same training set, test set, and number of increments, but with the following varying complexities (a loop over all five configurations is sketched after this list):

  1. d = 7 and n = 40; this is your baseline: it is not obviously complex enough to overfit, but it is still reasonably complex given the limited number of samples
  2. d = 4 and n = 40;
  3. d = 16 and n = 40;
  4. d = 7 and n = 20;
  5. d = 7 and n = 130;
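To produce the five curves with a fixed split, one could simply loop over the configurations above; the choice of ten increments here is only an example, not part of the assignment.

```python
configurations = [(7, 40), (4, 40), (16, 40), (7, 20), (7, 130)]  # (d, n) pairs 1.-5.
for d, n in configurations:
    learning_curve(train_set, test_set, increments=10, d=d, n=n)
```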

Describe the five learning curves by answering these questions for 1.–5. (if applicable):

Finally, experiment with your own values of d and n until you find what you think is a good model (and better than 1.–5.). Explain in what ways you think it is good, and explain how you think its performance relates to your choice of d and n, as well as to the data set.

References

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. Can be accessed via http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/ by clicking Download in the left margin.