Support vector machines (SVMs) are widely regarded as among the best general-purpose classification algorithms in use today. Their strength lies in how the separating hyperplane is chosen to “best-separate” two classes of data: of all the hyperplanes that separate the classes, the SVM fits the one that maximizes the distance between the plane and the nearest elements of each class on either side of it.
In cases where the data is not linearly separable, SVMs rely on projecting the data into a higher-dimensional space where the classes might be separable. Depending on the situation, this approach may or may not work well. Another method, boosting, has been shown to perform well in a variety of scenarios and is worth considering.
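To make the "project into higher dimensions" idea concrete, here is a minimal sketch. It is not an SVM: the data, the explicit feature map `lift`, and the hand-picked plane are all assumptions chosen for illustration. A kernel SVM would perform a lifting of this kind implicitly and would learn the plane rather than have it given.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up toy data: class -1 inside a disc, class +1 on a surrounding ring.
angles = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([-np.ones(n), np.ones(n)])

def lift(points):
    """Explicit feature map (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([points, np.sum(points ** 2, axis=1)])

Z = lift(X)

# No line separates the classes in 2-D, but in the lifted space the
# squared radius is a coordinate: the disc has z3 <= 1 and the ring z3 >= 4,
# so the plane z3 = 2.5 (hand-picked, halfway between) separates them.
w, b = np.array([0.0, 0.0, 1.0]), -2.5
pred = np.where(Z @ w + b >= 0, 1.0, -1.0)
accuracy = np.mean(pred == y)
```

Whether such a lifting exists, and how easy it is to find, depends on the data, which is the sense in which the approach may or may not work well.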
Boosting builds a strong classifier from an ensemble of weak classifiers (Freund & Schapire, 1995; Friedman, Hastie & Tibshirani, 1998). Consider the toy problem of building a classifier where one class has a circular distribution and the other is distributed in a ring concentric with the first. The idea is to start with a simple linear classifier that has a better-than-random chance at classification. Next follows an iterative procedure in which the examples misclassified at the previous stage are “boosted” in weight, and a new hyperplane is fitted to best separate the newly weighted dataset. This results in a sequence of hyperplanes fitted to the data such that their aggregate, when duly weighted, can separate the classes well.
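The procedure above can be sketched as a small discrete AdaBoost loop. This is only an illustration under assumptions of my own: the weak classifiers are axis-aligned decision stumps (the simplest kind of hyperplane), the ring data is synthetic, and all function names are made up for the example.

```python
import numpy as np

def make_rings(n=200, seed=0):
    """Synthetic data: class -1 inside a disc, class +1 on a concentric ring."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
    radii = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    y = np.concatenate([-np.ones(n), np.ones(n)])
    return X, y

def stump_predict(X, feature, threshold, sign):
    """Weak classifier: which side of an axis-aligned line a point falls on."""
    return sign * np.where(X[:, feature] > threshold, 1.0, -1.0)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1.0, -1.0):
                pred = stump_predict(X, feature, threshold, sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def adaboost(X, y, rounds=50):
    w = np.full(len(y), 1.0 / len(y))       # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # vote of this stump
        pred = stump_predict(X, feature, threshold, sign)
        w = w * np.exp(-alpha * y * pred)   # boost the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, feature, threshold, sign))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(X, f, t, s) for a, f, t, s in ensemble)
    return np.where(score >= 0, 1.0, -1.0)

X, y = make_rings()
ensemble = adaboost(X, y)
single_acc = np.mean(predict(ensemble[:1], X) == y)   # first stump alone
boosted_acc = np.mean(predict(ensemble, X) == y)      # full weighted ensemble
```

Comparing `single_acc` with `boosted_acc` on the training set shows how much the weighted aggregate gains over any one line; no single stump can enclose the inner class, but several weighted together can box it in.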
Boosting has been used to great effect in the Viola-Jones face detection algorithm (coming soon… )
One of the beginner topics in a computer vision class is face recognition. Given some assumptions, it is now considered a solved problem in the AI community, or at least one that has received enough attention that numerous methods work well at scale in many scenarios.
As part of an effort at maintaining and growing my knowledge-base, I am in the process of spending some amount of time in reviewing course materials that I’d once wisely stored away for posterity. I’ve re-begun with CS231a – the Stanford University course on Computer Vision.
The first lecture discussed two popular methods for face recognition. The first is EigenFaces, a seminal method developed at MIT: http://www.face-rec.org/algorithms/PCA/jcn.pdf. The gist of it is to find the eigenvectors of the covariance matrix of the vectorized training image patches, which significantly reduces the dimensionality of the problem. Given a new image patch vector, find its projection in the lower-dimensional subspace and run an off-the-shelf classifier on it to tell which of the trained image patch classes (or faces) it is closest to. Nearest neighbor or similar should be sufficient. The eigenvector formulation ensures dimensionality reduction while preserving the directions of maximum variance so that, in a sense, an image patch of, say, 90×90 (a vector of length 8100) is compressed to, say, 20 dimensions that represent most of the variance that exists in image patches of faces. “PCA projection is optimal for reconstruction from a low-dimensional basis but may not be optimal for discrimination”.
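As a rough sketch of that pipeline (PCA on vectorized patches, then nearest neighbor in the reduced space), something like the following would do. The "faces" here are random templates plus noise, purely a stand-in of my own for real images, and the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
side, n_ids, per_id, k = 90, 5, 10, 20   # 90x90 patches -> 8100-vectors, keep 20 dims

# Stand-in data: one random template per identity, plus per-pixel noise.
templates = rng.normal(size=(n_ids, side * side))
labels = np.repeat(np.arange(n_ids), per_id)
train = templates[labels] + 0.5 * rng.normal(size=(n_ids * per_id, side * side))

# PCA via SVD of the mean-centered data: the rows of vt are the eigenvectors
# of the covariance matrix, ordered by decreasing variance.
mean_patch = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean_patch, full_matrices=False)
components = vt[:k]                      # shape (20, 8100)

def project(patches):
    """Map 8100-dimensional patch vectors into the 20-dimensional subspace."""
    return (patches - mean_patch) @ components.T

train_low = project(train)               # shape (50, 20)

def classify(patches):
    """Nearest neighbor in the reduced space."""
    low = project(patches)
    dists = np.linalg.norm(low[:, None, :] - train_low[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

# New noisy views of each identity, never seen during training.
queries = templates + 0.5 * rng.normal(size=templates.shape)
predicted = classify(queries)
```

Going through the SVD avoids ever forming the 8100×8100 covariance matrix, which matters once the patches are of realistic size.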
The second method discussed was FisherFaces, which improved face recognition significantly over EigenFaces: http://www.cs.columbia.edu/~belhumeur/journal/fisherface-pami97.pdf. The motivation for the method is that EigenFaces, while doing well on dimensionality reduction, does not preserve between-class separation, because PCA ignores class labels. This is nicely illustrated in a diagram on page 4 of the paper, which shows through a toy example how projecting onto the principal component causes a complete loss of separability between classes.
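A toy example in the same spirit is easy to construct. The sketch below uses made-up 2-D Gaussian data (not the paper's): two classes that are stretched along x but offset only along y. It compares the leading PCA direction with the two-class Fisher direction, computed as the within-class scatter inverse applied to the difference of class means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Both classes vary a lot along x and very little along y,
# but the class means differ only in y.
cov = np.array([[9.0, 0.0], [0.0, 0.05]])
class_a = rng.multivariate_normal([0.0, 0.5], cov, n)
class_b = rng.multivariate_normal([0.0, -0.5], cov, n)
X = np.vstack([class_a, class_b])

# PCA: leading eigenvector of the total covariance (class labels ignored).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, -1]                 # eigh sorts eigenvalues in ascending order

# Fisher's linear discriminant for two classes: w = Sw^{-1} (mean_a - mean_b).
Sw = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)
fisher_dir = np.linalg.solve(Sw, class_a.mean(axis=0) - class_b.mean(axis=0))
fisher_dir /= np.linalg.norm(fisher_dir)

def separation(direction):
    """Gap between projected class means, in units of the projected spread."""
    a, b = class_a @ direction, class_b @ direction
    return abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))

pca_sep = separation(pca_dir)
fisher_sep = separation(fisher_dir)
```

Here `pca_dir` comes out nearly parallel to the x axis, where the classes overlap almost completely, while `fisher_dir` is nearly parallel to the y axis, where they are well apart, so `fisher_sep` is far larger than `pca_sep`.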
A lot of paper-reading lies ahead…
One can get reasonable accuracy with a Kalman Filter even with a poorly hacked-together model. As a naive dabbler in these matters, I learned recently that although the Kalman Filter states you observe might be filtered suitably, the hidden, or rather unmeasured, states may not necessarily yield sensible values. Also, increasing the model complexity doesn’t necessarily improve matters if your fundamental assumptions about the model are flawed to begin with.
The measurement-noise and model (process-noise) covariance matrices don’t change over time; they remain at their initialized values. However, the prior and posterior error covariances change from cycle to cycle. The variances of the states start at fairly low values. As time progresses, they tend to increase and then stabilize at fairly high values. This is probably an artifact of the model selection being poor.
It was also observed that the corrected state at the end of a cycle, if propagated forward in time, yields the predicted state for the next cycle. (In retrospect this should have been obvious, but one has a lot to learn.)
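These observations can be reproduced with a minimal filter. The sketch below is not the model I was working with: it assumes a hypothetical constant-velocity model in which only position is measured, with arbitrary values for `Q`, `R`, and the initial covariance.

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])    # state transition: position, velocity
H = np.array([[1.0, 0.0]])               # only position is measured
Q = 0.01 * np.eye(2)                     # process-noise covariance, never updated
R = np.array([[1.0]])                    # measurement-noise covariance, never updated

def predict(x, P):
    """Propagate the corrected state and its covariance forward one step."""
    return F @ x, F @ P @ F.T + Q

def correct(x_pred, P_pred, z):
    """Blend the prediction with measurement z using the Kalman gain."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(2) - K @ H) @ P_pred
    return x, P

rng = np.random.default_rng(0)
x = np.zeros(2)
P = 0.001 * np.eye(2)                    # deliberately low initial variances
predicted, corrected, variances = [], [], []
for step in range(200):
    x_pred, P_pred = predict(x, P)
    z = np.array([0.5 * step + rng.normal(0.0, 1.0)])   # noisy position reading
    x, P = correct(x_pred, P_pred, z)
    predicted.append(x_pred)
    corrected.append(x)
    variances.append(np.diag(P).copy())
```

`Q` and `R` are never touched inside the loop, while `P` is rewritten twice per cycle. With the low starting value used here, the entries of `variances` rise over the first cycles and then settle. Each entry of `predicted` is exactly `F` applied to the previous entry of `corrected`, because the predict step is defined that way.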