Facial Keypoint Detection with Neural Networks

Nose Tip Detection

First, I used a very small portion of the dataset to predict only the nose tip. The images were downscaled to 60x80 pixels and converted to greyscale.
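A rough sketch of this preprocessing follows; load_image is a hypothetical helper, and the shift to roughly [-0.5, 0.5] is my assumption about the normalization.

    import skimage.io as skio
    from skimage.color import rgb2gray
    from skimage.transform import resize

    # Convert to greyscale, downscale to 60x80, and shift the pixel range
    # to roughly [-0.5, 0.5] (the normalization is an assumption).
    def load_image(path):
        img = rgb2gray(skio.imread(path))  # float values in [0, 1]
        img = resize(img, (60, 80))        # (rows, cols)
        return img.astype('float32') - 0.5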

Groundtruth Keypoints

Hyperparameters and Architecture

I used three convolutional layers: the first with a kernel size of 5, the last two with a kernel size of 3. The number of channels went from 1 (since the input was greyscale) to 16, then 32, then 64. These were followed by two fully connected layers, the first with 64 outputs and the second with 2 outputs, since we wanted a single (x, y) point for the nose. Each convolutional layer was followed by ReLU and max pooling, and the first fully connected layer was followed by ReLU. I used the Adam optimizer and tried initial learning rates of 1e-2, 1e-3, and 1e-4, finding that 1e-3 was the best of the three. I trained for 15 epochs.
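A minimal PyTorch sketch of this architecture. The same-size padding and the layer names are my assumptions; with a 60x80 input, three rounds of pooling leave a 7x10 feature map.

    import torch
    import torch.nn as nn

    class NoseNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            # 60x80 -> 30x40 -> 15x20 -> 7x10 after the three pooling steps
            self.fc = nn.Sequential(
                nn.Linear(64 * 7 * 10, 64), nn.ReLU(),
                nn.Linear(64, 2),  # a single (x, y) prediction for the nose tip
            )

        def forward(self, x):
            return self.fc(self.features(x).flatten(1))

    model = NoseNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()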

MSE Loss

Sampled Outputs

Best and Worst Cases

These images were picked based on the highest and lowest MSE loss. I believe the network fails on the third picture because the mustache confused it, and on the fourth picture because her head is at a somewhat unusual angle compared to the other faces looking straight ahead, like the photo that performs best.

Full Facial Keypoints Detection

Next, I repeated the process with the same training and validation data, this time predicting all of the facial keypoints. The data was augmented with random rotations and crops, sketched below.
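A sketch of this augmentation, assuming keypoints are stored as an (N, 2) array of (x, y) pixel coordinates; applying the same affine matrix to the image and the points keeps the two consistent by construction. The rotation range and crop margin here are my assumptions.

    import cv2
    import numpy as np

    def random_rotate(image, keypoints, max_deg=15):
        h, w = image.shape[:2]
        angle = np.random.uniform(-max_deg, max_deg)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(image, M, (w, h))
        pts = np.hstack([keypoints, np.ones((len(keypoints), 1))])
        return rotated, pts @ M.T  # apply the same 2x3 matrix to the points

    def random_crop(image, keypoints, margin=4):
        h, w = image.shape[:2]
        x0 = np.random.randint(0, margin + 1)
        y0 = np.random.randint(0, margin + 1)
        cropped = image[y0:h - margin + y0, x0:w - margin + x0]
        return cropped, keypoints - np.array([x0, y0])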

Groundtruth Keypoints

Hyperparameters and Architecture

I used five convolutional layers, the first with a kernel size of 5 and the rest with a kernel size of 3. The number of channels again started at 1, went to 16, and doubled with each layer after that. These were followed by two fully connected layers, the first with 256 outputs this time (following the powers of two, since there were more convolutional layers) and the second with 58 * 2 outputs, since we wanted 58 (x, y) points this time. Again, each convolutional layer was followed by ReLU and max pooling, and the first fully connected layer was followed by ReLU. As before, I used the Adam optimizer and tried initial learning rates of 1e-2, 1e-3, and 1e-4, finding that 1e-3 was the best of the three. I trained for 20 epochs this time since there were more points, though I could have run for longer.
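A sketch of this deeper network; nn.LazyLinear infers the flattened size on the first forward pass, so the exact input resolution is left open. The padding scheme is again my assumption.

    import torch.nn as nn

    class FaceNet(nn.Module):
        def __init__(self, n_points=58):
            super().__init__()
            chans = [1, 16, 32, 64, 128, 256]  # doubling channel counts
            layers = []
            for i, (cin, cout) in enumerate(zip(chans, chans[1:])):
                k = 5 if i == 0 else 3  # kernel size 5 first, then 3
                layers += [nn.Conv2d(cin, cout, k, padding=k // 2),
                           nn.ReLU(), nn.MaxPool2d(2)]
            self.features = nn.Sequential(*layers)
            self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
                                    nn.Linear(256, n_points * 2))

        def forward(self, x):
            return self.fc(self.features(x))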

MSE Loss

Sampled Outputs

Best and Worst Cases

Very interestingly, the second-best face here was also the second-worst face for nose detection. I suspect the failures come from the person looking downwards rather than straight ahead; the model seems to do well on faces looking straight on or to the side. For reference, one of the samples shown above corresponds to the worst case for nose detection, which also did not do very well here.

Learned Filters

These are the learned filters of the first layer. In some of them, you can see a rounded edge shape being picked up.

Train With Larger Dataset

The process was repeated with a much larger dataset (6666 images instead of 240). This dataset came with bounding boxes, so random crops did not make sense as augmentation; instead, I only augmented the data with rotations.
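A minimal sketch of using the boxes, assuming each face is cropped to its box before resizing and that boxes are stored as (x, y, w, h); crop_to_box is a hypothetical helper.

    import numpy as np

    def crop_to_box(image, keypoints, box):
        # box = (x, y, w, h) in pixels -- the format is an assumption
        x, y, w, h = [int(round(v)) for v in box]
        return image[y:y + h, x:x + w], keypoints - np.array([x, y])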

Hyperparameters and Architecture

I used a resnet18 model, changing the first convolutional layer to take 1 input channel instead of 3, since the images were greyscale, and the final fully connected layer to output 68 * 2 values, since there are now 68 labelled points. I tried a few hyperparameter settings, running for 80 epochs, and found that a batch size of 32 with a learning rate of 1e-2 worked best. Early on, before I started recording graphs, I also experimented with 1e-2 and larger batch sizes, but the loss did not do much better than 0.0003.
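A sketch of the two modifications on torchvision's ResNet-18; the stem hyperparameters here simply mirror the stock model.

    import torch
    import torch.nn as nn
    import torchvision

    model = torchvision.models.resnet18()
    # 1 input channel for greyscale instead of the stock 3-channel stem
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # 68 (x, y) points -> 136 outputs
    model.fc = nn.Linear(model.fc.in_features, 68 * 2)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.MSELoss()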

Test Set Predictions

Initially, my models with a learning rate of 1e-3 seemed to predict heart-shaped point clouds rather than faces, despite the low error of around 0.0003.

With a learning rate of 1e-2 and a batch size of 32, these were my first good results on the beginning of the test set.

Anti-aliased max pool

I repeated this with an anti-aliased resnet18 instead of a regular resnet18. As can be seen in the graphs below, there was an improvement, but not by much.
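The same stem and head modifications carry over; this sketch assumes the antialiased_cnns package from adobe/antialiased-cnns, and copies the stem's settings rather than hard-coding them, to avoid assumptions about the BlurPool variant's internals.

    import torch.nn as nn
    import antialiased_cnns  # pip install antialiased-cnns

    model = antialiased_cnns.resnet18()
    old = model.conv1  # clone the stem's settings, but with 1 input channel
    model.conv1 = nn.Conv2d(1, old.out_channels, kernel_size=old.kernel_size,
                            stride=old.stride, padding=old.padding, bias=False)
    model.fc = nn.Linear(model.fc.in_features, 68 * 2)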

Tests outside of Test Set

I tried the model on my own personal photos, but the vast majority failed.

I tested on more photos of myself (not shown) and realized the model kept identifying my eyebrows as eyes, my nose as a mouth, my hair as eyebrows, and my mouth line as the bottom of my chin. The same happened for my sister and dad. I then tried the model on the FEI database, wondering whether the model simply did not handle Asian faces well.

Automatic Morphing

Lastly, since the model was able to predict the keypoints of the FEI database reasonably well, I used those keypoints to automatically morph multiple faces together by incorporating my previous project.
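A minimal sketch of the morph, warping every face onto the average shape with a piecewise-affine transform and averaging the results; morph_to_mean is a hypothetical helper, and the appended corner points (so the warp covers the whole frame) are my assumption.

    import numpy as np
    from skimage.transform import PiecewiseAffineTransform, warp

    def morph_to_mean(images, keypoints):
        h, w = images[0].shape[:2]
        corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
        pts = [np.vstack([k, corners]) for k in keypoints]
        mean_shape = np.mean(pts, axis=0)  # the average face shape
        acc = np.zeros(images[0].shape, dtype=float)
        for img, p in zip(images, pts):
            tform = PiecewiseAffineTransform()
            tform.estimate(mean_shape, p)  # maps mean-shape coords -> source coords
            acc += warp(img, tform)        # resample the face onto the mean shape
        return acc / len(images)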