I’ve always had a bit of trouble recognizing faces, but humans are generally quite good at the task. We’re not quite sure how the process works in our brain, but some studies on macaques suggest two interesting things: neurons fire in clusters to recognize ‘feature patterns’, and primates seem to learn the skill by socializing early in life, rather than possessing it innately.
Humans, I would imagine, are even better than other primates: bigger brains and evolutionary pressure toward community development might select for the skill. In classic human fashion, I set out to design a tool to compensate for my biological shortcomings.
Convolutional Neural Networks
CNNs power interesting computer vision tech such as surveillance cameras and identity verification. They’re fantastic for object recognition because they solve some longstanding issues (object clutter, deformation, lighting, etc.) and can ‘generalize’ better than earlier nets. They do this mostly through convolutional filters.
Since pixels are represented by numbers, we can define a size (3x3, 4x4, etc.) for a filter to slide over the base image. At each position, we multiply the filter elementwise with the patch of the image beneath it and sum the products to get one scalar in the output.
There are plenty of filters to choose from, representing things like line/edge detection, curve recognition, or any pattern you’d like. Stride length and padding offer more control over how the convolution layers generalize the image’s features rather than getting lost in strict absolutes.
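To make the sliding-and-summing concrete, here’s a minimal NumPy sketch of a 2D convolution with stride and padding. The `conv2d` helper and the edge-detection kernel are purely illustrative, not code from my project:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`; each output entry is the sum of the
    elementwise product of the kernel and the patch beneath it.
    Illustrative sketch, not an optimized implementation."""
    if padding:
        image = np.pad(image, padding)  # zero-pad all sides
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A vertical-edge-detecting kernel on a tiny image with a hard edge
image = np.array([[0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d(image, kernel))  # every patch straddles the edge, so every output is -30
```

Note how padding keeps the output the same size as the input (`padding=1` here gives a 4x4 output), while a larger stride shrinks it.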
The classic example of a CNN is to classify an image: is this a cat or dog? Feed the net enough photos and it’ll learn to differentiate between the two (whether or not it knows what a ‘cat’ is in the abstract sense is another story).
But that’s not quite what ‘recognizing faces’ is. There are two tasks in that phrase: identity matching (“which person does this face belong to?”) and differentiation (“are these two faces the same person?”). ID matching has great use when you’ve got a large database of established image identities, like Facebook or China’s social credit system, but it relies on differentiation to function. So I decided to ask my net: “Are these two faces the same person or different people?”
Fortunately, there’s a net architecture designed for such comparative tasks. Siamese nets are ‘twinned’ neural networks with identical structure and weights. The idea is to pass each one a different image and compare the outputs.
This comparative structure lets us analyze the difference between two images. For face differentiation, we pass each face through the convolutional layers, flatten the result into a feature vector, and then compare the two vectors through contrastive loss or other functions. We end up with a ‘similarity score’, a decimal between 0 and 1 that measures the distance between the feature vectors, i.e. how different the faces are. By setting a threshold on that score, the net can decide whether the image pair shows the same identity or two different ones.
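Here’s a minimal NumPy sketch of that comparison step. The vectors, threshold, and margin are made-up values for illustration; in a real net the distance is computed on learned feature vectors and the threshold is tuned on validation data:

```python
import numpy as np

def euclidean_distance(a, b):
    # Distance between the twin branches' feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def contrastive_loss(dist, same, margin=1.0):
    # same=1 (matching identities): penalize large distances.
    # same=0 (different identities): penalize distances under the margin.
    return same * dist ** 2 + (1 - same) * max(margin - dist, 0) ** 2

def same_person(a, b, threshold=0.5):
    # Thresholding the distance turns it into a same/different decision
    return euclidean_distance(a, b) < threshold

va = np.array([0.1, 0.9, 0.3])
vb = np.array([0.1, 0.8, 0.3])  # close to va: likely the same face
vc = np.array([0.9, 0.1, 0.7])  # far from va: likely a different face
print(same_person(va, vb), same_person(va, vc))  # True False
```

Raw Euclidean distance isn’t bounded to [0, 1]; squashing it (e.g. with a sigmoid) or normalizing the feature vectors is what turns it into the decimal similarity score described above.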
For my net, I used Keras and Jupyter, a combination that makes it easy to test different architectures and tune hyperparameters quickly. I decided on UTKFace, a large collection of faces in the wild with plenty of feature variation. Keras is pleasant to use; the tricky part was setting up the data: ~2,000 RGB 200x200 faces.
To answer the question, ‘same person at different angles or different people?’, I needed to give my net examples of each case. However, UTKFace has only one face per identity. I used an image-augmentation generator to randomly warp each face (stretch, shear, zoom, flip, rotate) to simulate a face turning to different angles, approximating a 3D rotation with 2D transformations.
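The pairing logic can be sketched like this. I’m using a horizontal flip plus noise as a stand-in for the real augmentation generator (which also sheared, zoomed, and rotated), and `make_pairs` is a hypothetical helper, not my actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def warp(img):
    # Stand-in for the augmentation generator: flip plus a little noise
    return np.fliplr(img) + rng.normal(0, 0.01, img.shape)

def make_pairs(faces):
    """Given one face per identity, build (img_a, img_b, label) pairs:
    label 1 = same person (original + warped copy),
    label 0 = different people (two distinct identities)."""
    pairs = []
    n = len(faces)
    for i, face in enumerate(faces):
        pairs.append((face, warp(face), 1))     # positive pair
        j = (i + rng.integers(1, n)) % n        # index of any other identity
        pairs.append((face, faces[j], 0))       # negative pair
    return pairs

faces = [rng.random((200, 200, 3)) for _ in range(4)]
pairs = make_pairs(faces)
print(len(pairs))  # 8: one positive and one negative pair per identity
```

The important property is balance: the net sees as many same-identity pairs as different-identity pairs, so it can’t score 50% by always guessing one label.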
I tested several structures with varying results; the best performance, around 68%, came from 5 convolutional layers sized 64 to 256 with 7x7 and 4x4 filters, and one flattened 4096-dimensional vector compared via Euclidean distance. Regularization and ReLU activations yielded significant boosts in performance.
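In Keras, that architecture looks roughly like the sketch below. The exact widths between 64 and 256, the pooling, and the regularization strength are my reconstruction from the description, not the project’s actual code:

```python
import tensorflow as tf
from tensorflow.keras import Input, Model, layers, regularizers

def build_branch(input_shape=(200, 200, 3)):
    # One Siamese branch: 5 conv layers from 64 to 256 filters,
    # mixing 7x7 and 4x4 kernels, ending in a 4096-d feature vector.
    inp = Input(shape=input_shape)
    x = inp
    for filters, k in [(64, 7), (96, 7), (128, 4), (192, 4), (256, 4)]:
        x = layers.Conv2D(filters, k, activation="relu", padding="same",
                          kernel_regularizer=regularizers.l2(1e-4))(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    return Model(inp, x)

branch = build_branch()  # built once, so both inputs share its weights
img_a, img_b = Input((200, 200, 3)), Input((200, 200, 3))
feat_a, feat_b = branch(img_a), branch(img_b)

# Euclidean distance between the twin feature vectors
distance = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]),
                                    axis=1, keepdims=True)))([feat_a, feat_b])
siamese = Model([img_a, img_b], distance)
```

Calling `branch` on both inputs (rather than building two branches) is what makes the net Siamese: one set of weights, two forward passes.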
Early stopping generally hit the best performance at around 40 epochs (100 steps per epoch, batch size 16). The model is definitely prone to overfitting, which I would chalk up to the small dataset and the augmentation. That being said, a 68% success rate isn’t great, and it didn’t yield any usable similarity scores.
The two constraints on my initial model are data size and net complexity. I believe augmentation is ultimately a poor substitute for genuine same-person pairs at different angles; Siamese differentiation nets can learn quite well from a relatively small number of identities, as long as each identity has more than a few examples, so I plan to test on different data next. Runtime on a laptop also has practical limits; I could train a much more complex net using a remote GPU.
I’m confident my net has room for drastic improvement, because:
- the macaque studies indicate that we learn face differentiation through clusters of neurons recognizing smaller feature patterns (CNN filters) and repeated exposure over long periods at young ages (training a network).
- it’s already been done! Facebook’s DeepFace achieved about 97% accuracy at face verification using a 9-layer architecture trained on millions of user-uploaded photos.
My pseudoscientific instinct of the possible has been happily validated by the efforts of brilliant data scientists who now outperform humans at face differentiation. Man has developed tools that will develop man.
You can follow my project on GitHub. More to come in the following weeks.