Machine Learning for dummies (just kidding)
Another semester ended, another to be added the countless number of semesters that we have seen since the beginning of our academic career as professional students. Still, I’m taking the time to share with you what has made this last one particularly significant for me. I was registered in the course Machine Learning (COMP-652) and I had such a great time that I didn’t see the semester flying by, or was it because I was overwhelmed by work, it’s hard to tell. Never mind, this course taught me, and the other students who were registered, about a bunch of tools, not far from being qualified of statistics, that I’m sure you too could benefits from knowing. Unfortunately, there is only one way for you to get the fine details, and it is to register yourself, yet, I’m going to give you an appreciation of two topics covered during this class and maybe you’ll find yourself interested in learning more on the subject.
Machine learning refers to the ensemble of algorithms that aims at extracting information from a data set. This description is very vague and this is reflected in the eclectic set of algorithms that are studied thorough the course.
Code available here.
The first concept taught is the one of regression. Whenever you parameterize some data, you are performing a form of regression. The most basic form of regression is linear regression (Fig. 1), which consist in mapping linearly, with more or less success, a set of input data, usually denoted x, to a set of output data, t as in target.
Linear regression gives the false impression that the fit will always be a linear function, which is true, but only in the relationship between the input data and the output data. Therefore, we usually use this equation to define linear regression:
Where the function ϕ(xi) represent the input data xi, an element of x a vector of input data, in the feature space. Let’s be honest, this notion of feature space kept me confused for a while, so I don’t expect you to fully grasp the concept from such a short description, but let me just push in another example to help you understand a little bit more of what I’m trying to explain. When basic linear regression doesn’t work, one might give it a try at polynomial regression (Fig. 2). Polynomial regression is still a linear regression, but we are now dealing with a linear combination of various degrees of x. Let see how we move on from the polynomial expression to the equation of linear regression. In a polynomial expression of degree n=3, this is the equation we get.
Notice that this is not a linear equation.
By concatenating all capital letters, the parameters of the model, into a vector W and defining the function:
The problem is then stated as:
Regression becomes a method to find the parameters W that best map the input on the output data. Often, we will try to minimize the mean squared error (MSE) between the data and the fit. In this case a closed solution exist and the weights minimizing the squared error can be computed with (I am skipping the details, but you can find an extensive description here):
But such an elegant resolution is not always available and when it is not, we will often use method such as gradient descent to find a satisfying solution to the error minimization problem.
Linear regression is presented because it represents de essence of machine learning. It is used to introduce concepts such as error measurement, problems associated with overfitting, cross-validation and the bias/variance trade off. Then, the Bayesian view to regression is introduced, this view is the foundation to several machine learning algorithm, such as logistic regression. Eventually, we move on to more complex algorithms that not only regress and extract the trend from a set of data but also perform classification.
Principal component analysis (PCA)
Several of you must already have heard of PCA. The main application of PCA is to perform dimensionality reduction, but as we will see, it can also be used to transform the input data representation space as to make categories linearly separable, even though they weren’t in their original reference frame.
Basically, PCA is a method that applies linear transformations on the set of data as to map it in a different frame of reference. Each orthogonal component (axis) of the new reference frame is computed as to collect as much of the variance of the data as possible. It is then possible to select a subset of the components, those that capture most of the variance, and drop the others, to obtain a dataset with a lower dimensionality that retain most of the information found in the data. (notice that this is a form of information compression)
To illustrate the PCA, I’m using a dataset I picked up at simafore. It is a dataset giving the nutrition facts for various brands of cereal. For the purpose of this example, I’m going to use 8 of the measurements reported (i.e. calories, protein, fat, sodium, fiber, carbohydrate, sugars and potassium) and the goal is to relate cereal brands that are similar to each other.
This might sound easy at first, after all, all high sugars content on one side and all low sugars content on the other! But that doesn’t take into account the content of fibers or protein… What if two different brands are very similar for their sugars content, but very different for on all other account? We are looking for a measurement of similarity that will take the 8 nutrition facts into account at the same time and if possible, report them in a frame of reference that can be visualized. Well, that’s precisely what PCA can do. (It might have occurred to some people that we could have normalized the values and computed the Euclidean distance between the cereal brands. Unfortunately I cannot explain the reasons here, but this becomes rapidly unpractical as the dimensionality of the dataset increase. See for a description of the curse of dimensionality.
Computing the PCA is a fairly simple operation (use Matlab: princomp, to let Matlab handle it). The first operation is to compute the covariance matrix. Let me remind you that each element of the NxN (N=8) covariance matrix is given by:
where E(x) is the expectation or the mean operator.
Then, we need to find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors and eigenvalues are a set of vectors and magnitude that respect the following equality:
Where A is the matrix, v is a eigenvector and λ a eigenvalue. It is a very important concept that finds application in many fields related to linear algebra, unfortunately, again, I cannot go in depth here, but I refer you to wiki! if you are interested in knowing more about eigenvectors and eigenvalues. For now, you can assume that the resulting set of eigenvectors indicates the direction in which the energy of the covariance matrix is oriented, that all eigenvectors are orthogonal (not always the case in general, but always true in PCA) and that the eigenvalues tell us how much of the energy goes in each directions.
The set of eigenvectors is our reference frame. There are as many eigenvectors as there are dimensions in our dataset, the thing is however, that by looking at the eigenvalues we can decide to drop dimensions that don’t contain a significant amount of energy. The following figure shows us the ordered magnitude of the eigenvalues obtained from our dataset and the cumulative energy as a function of the number of components used.
It is clear from this illustration that even though we are provided with 8 measurements, it is possible to represent the information in a 3 dimensional space, and this, without losing any information. Warning though, the 3 components aren’t the three most important measurements! each component is a combination of the 8 measurements, this means that even if we are using only three components, all measurements have a pondered contribution in the new frame of reference. The next figure now take advantage of the reduction of dimensionality to plot in a 2 dimensional space (we are still capturing more than 95% of the energy in the variance) the points corresponding to the various cereal brands. For the sake of the demonstration, I used another algorithm (K-means) taught in the Machine Learning course to form clusters to group the most similar cereal brands in 3 classes (red, green , blue) in the new frame of reference.
It is very apparent it this example that the PCA algorithm generates a reference frame in which it becomes much more easier to perform analysis, such as clustering, than it would be in a space of higher dimension. In this example it even makes is possible to represent the significant portion of information, hidden in a 8 dimensions space, in a 2 or 3 dimensions space. I won’t go any further in PCA, but from here it would be easy to determine which nutrition facts are contributing the most to each component, if there was a need to interpret the results or to build a classifier that classify any new cereal brand in one of the three classes.
I’m taking the opportunity to tell you that this is the most basic form of PCA, there is another form called kernel PCA that perform the crazy operation of projecting the data in a higher dimension space (greater than 8 ) before performing the same dimensionality reduction. It is then possible to extract relationships that aren’t readily accessible in the dataset. This is another, more complex and very interesting, topic covered in COMP-652.
And the rest
I have introduced linear regression, polynomial regression and principal component analysis, but this is only the tip of the iceberg. The machine learning course cover many other algorithms, some even more powerful, such as Bayesian decoder, support-vector machines (SVM), neural networks, Kernel machines, K-means/hierarchical and other clustering methods, Hidden Markov Models (HMM), Bayesian networks and this is not an exhaustive list.
The course also emphasize a deep understanding of the practical problems associated with machine learning. What makes a fit better than the other? Given a set of data and a goal to be attained how do we pick the right machine learning approach? And even if we have the data and the algorithm, how do we implement it?
The background required for the course: probability and statistics, linear algebra (+ matrices), calculus and Matlab. In every case, a background is recommended, but with hard work, you can overcome most of the problems, plan to put in a lot of time on this class.
Finally, I recommend this course to anyone who is interested in data mining, data analysis, artificial intelligence, modeling, algorithms and so on. Let me know if you have any questions, it will be a pleasure for me to answer them.
PS: I’m planning on organizing a workshop on the subject of data analysis in neuroscience, several machine learning tools will be covered, stay tuned!