Changing gender and race in a selfie with neural networks

Hello, Habr! Today I want to tell you how to change your face in a photo using a fairly elaborate pipeline of several generative neural networks, and not only them. The recently fashionable apps that turn you into a lady or a grandfather work more simply, because neural networks are slow and the quality achievable with classical computer vision methods is already good. Nevertheless, the proposed approach seems very promising to me. Under the cut there is a little code, but plenty of pictures, links, and personal experience of working with GANs.
The task can be broken down into the following steps:

Find and crop the face in the photo
Transform the face in the desired way (turn it into a woman's face, a black person's face, etc.)
Improve / upscale the resulting image
Insert the transformed face back into the original photo

Each of these steps can be implemented by a separate neural network, but you can do without them. Let's start in order.
Face Detection
This is the easiest part. There is no need to invent anything: just take dlib.get_frontal_face_detector() (example). By default, dlib's detector uses a linear classifier trained on HOG features.

As we can see, the rectangle it returns does not contain the entire face, so it is better to enlarge it. Suitable scaling factors can be tuned by hand. The result is a method like the one below, with a corresponding result:
import dlib

def detect_single_face_dlib(img_rgb, rescale=(1.1, 1.5, 1.1, 1.3)):
    fd_front_dlib = dlib.get_frontal_face_detector()
    # The second argument upsamples the image once to catch smaller faces.
    face = fd_front_dlib(img_rgb, 1)
    if len(face) > 0:
        # Keep the largest detection as an (x, y, w, h) tuple.
        face = sorted([(t.width() * t.height(), (t.left(), t.top(), t.width(), t.height()))
                       for t in face],
                      key=lambda t: t[0], reverse=True)[0][1]
    else:
        return None

    if rescale is not None and face is not None:
        if type(rescale) != tuple:
            rescale = (rescale, rescale, rescale, rescale)
        (x, y, w, h) = face

        # Grow the box to the right and downwards, clipped to the image.
        w = min(img_rgb.shape[1] - x, int(w / 2 + rescale[2] * w / 2))
        h = min(img_rgb.shape[0] - y, int(h / 2 + rescale[3] * h / 2))

        # Grow the box to the left and upwards, clipped to the image.
        fx = max(0, int(x + w / 2 * (1 - rescale[0])))
        fy = max(0, int(y + h / 2 * (1 - rescale[1])))
        fw = min(img_rgb.shape[1] - fx, int(w - w / 2 * (1 - rescale[0])))
        fh = min(img_rgb.shape[0] - fy, int(h - h / 2 * (1 - rescale[1])))

        face = (fx, fy, fw, fh)
    return face

If the "old" methods do not suit you for some reason, you can try deep learning. Any object detection network, for example YOLOv2 or Faster R-CNN, is suitable for face detection. If you try it, be sure to share what you get.
Face Transformation
This is the most interesting part. As you probably already know, you can transform a face or apply a mask without neural networks, and it will work well. But generative networks are a much more promising tool for image processing. There is already a huge number of models of the <your prefix>GAN kind that can perform a wide variety of transformations. The task of converting images from one set (domain) to another is called Domain Transfer. You can get acquainted with some Domain Transfer architectures in our recent review of GANs.
Why Cycle-GAN? Because it works. Visit the project site and see what this model can do. For a dataset, two sets of images are enough: DomainA and DomainB. Say you have folders with photos of men and women, white and Asian people, apples and peaches. That is enough! You clone the authors' repository with the PyTorch implementation of Cycle-GAN and start training.
How it works
This figure from the original article describes the operating principle of the model completely and concisely. In my opinion, it is a very simple and elegant solution that gives good results.

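In essence, we train two generator functions. One, $G: X \to Y$, learns to generate an image from domain $Y$ given an input image from domain $X$. The other, $F: Y \to X$, does the opposite, from $Y$ to $X$. The corresponding discriminators $D_X$ and $D_Y$ help them in this, as is usual for GANs. The usual Adversarial Loss (or GAN Loss) looks like this:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]$$
Additionally, the authors introduce the so-called Cycle Consistency Loss:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]$$
Its essence is that an image from domain $X$, after passing through generator $G$ and then through generator $F$, should be as similar as possible to the original. In short, $F(G(x)) \approx x$.
Thus, the objective function takes the form:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F)$$
and we solve the following optimization problem:
$$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y)$$
Here $\lambda$ is a hyperparameter that controls the weight of the additional loss.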
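To make the objective more tangible, here is a minimal NumPy sketch that evaluates the adversarial and cycle-consistency terms on toy data. Everything in it is an assumption made for illustration: G and F are stand-in functions rather than real generators, D_X and D_Y are hand-made "discriminators", the norms are replaced by mean absolute error, and lam = 10 is the value I recall from the original paper.

```python
import numpy as np

def gan_loss(d_real, d_fake, eps=1e-8):
    """Adversarial term: E[log D(real)] + E[log(1 - D(fake))]."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def cycle_loss(G, F, x, y):
    """Mean absolute error between inputs and their round-trip reconstructions."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

def full_objective(G, F, D_X, D_Y, x, y, lam=10.0):
    """Two adversarial terms plus the weighted cycle-consistency term."""
    return (gan_loss(D_Y(y), D_Y(G(x)))
            + gan_loss(D_X(x), D_X(F(y)))
            + lam * cycle_loss(G, F, x, y))

# Toy stand-ins (assumptions): domain Y is simply domain X shifted by +1.
G = lambda x: x + 1.0          # "generator" X -> Y
F = lambda y: y - 1.0          # "generator" Y -> X
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
D_X = lambda img: sigmoid(0.5 - img.mean(axis=(1, 2, 3)))  # high score for X-like images
D_Y = lambda img: sigmoid(img.mean(axis=(1, 2, 3)) - 0.5)  # high score for Y-like images

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.1, size=(4, 3, 8, 8))   # batch from domain X
y = rng.normal(1.0, 0.1, size=(4, 3, 8, 8))   # batch from domain Y

print("cycle loss:", cycle_loss(G, F, x, y))
print("objective :", full_objective(G, F, D_X, D_Y, x, y))
```

Because the toy G and F are exact inverses of each other, the cycle term is practically zero; replacing F with an identity function makes it jump, which is exactly the behaviour this term is meant to penalize.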
But that is not all. It was noticed that the generators change the color palette of the original image considerably.
To fix this, the authors added one more loss, the Identity Loss. This is a kind of regularizer that requires the generator to act as an identity mapping on images from its target domain. That is, if a zebra is fed to the zebra generator, the picture should not be changed.
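Written out, with $G$ denoting the generator from domain $X$ to domain $Y$ and $F$ the generator in the opposite direction, the identity term from the original paper has, as far as I recall, the following form. Its weight relative to the other terms is a separate hyperparameter that is not specified here.

```latex
\mathcal{L}_{identity}(G, F) =
    \mathbb{E}_{y \sim p_{data}(y)}\big[\|G(y) - y\|_1\big]
  + \mathbb{E}_{x \sim p_{data}(x)}\big[\|F(x) - x\|_1\big]
```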
To my surprise, this helped to preserve the color palette. Here are examples from the authors of the article (the model tries to turn Monet paintings into real photographs):
Architectures of the networks used
In order to describe the architectures used, we introduce some conventions.
c7s1-k is a 7x7 convolutional layer followed by batch normalization and ReLU, with stride 1, padding 3 and k filters. Such layers do not change the spatial size of the image. Example in PyTorch:
[nn.Conv2d(n_channels, inplanes, kernel_size=7, padding=3),
 nn.BatchNorm2d(inplanes, affine=True),
 nn.ReLU(True)]
dk is a 3x3 convolutional layer with stride 2 and k filters. Such convolutions halve the spatial size of the input image. Again, an example in PyTorch:
[nn.Conv2d(inplanes, inplanes * 2, kernel_size=3, stride=2, padding=1),
 nn.BatchNorm2d(inplanes * 2, affine=True),
 nn.ReLU(True)]
Rk is a residual block with two 3x3 convolutions with the same number of filters. The authors build it as follows:
resnet_block = []
resnet_block += [nn.Conv2d(inplanes, planes, kernel_size=3, stride=1, padding=1),
                 nn.BatchNorm2d(planes, affine=True),
                 nn.ReLU(True)]
resnet_block += [nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1),
                 nn.BatchNorm2d(planes, affine=True)]
uk is a 3x3 up-convolution (transposed convolution) layer with BatchNorm and ReLU, used to increase the spatial size of the image. Again, an example:
[nn.ConvTranspose2d(inplanes * 4, inplanes * 2, kernel_size=3, stride=2,
                    padding=1, output_padding=1),
 nn.BatchNorm2d(inplanes * 2, affine=True),
 nn.ReLU(True)]
With this notation, the generator with 9 residual blocks looks like this:
c7s1-32,d64,d128,R128,R128,R128,R128,R128,R128,R128,R128,R128,u64,u32,c7s1-3
We apply an initial convolution with 32 filters, then downsample the image twice while increasing the number of filters, then apply 9 residual blocks, then upsample the image twice while reducing the number of filters, and with the final convolution produce the 3-channel output of the network.
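To check that the sizes in this specification add up, here is a small pure-Python sketch that walks through the spec string and tracks the number of channels and the spatial size after each layer. The helper names are hypothetical, and the kernel, stride and padding values are taken from the layer examples above; the 128x128 input size is just an example.

```python
def conv_out(n, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel, stride, padding, output_padding):
    """Output size of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

def trace_generator(spec, size):
    """Return [(layer, channels, size), ...] for a spec like 'c7s1-32,d64,R128,u32'."""
    shapes = []
    for token in spec.split(","):
        token = token.strip()
        if token.startswith("c7s1-"):
            channels = int(token[len("c7s1-"):])
            size = conv_out(size, kernel=7, stride=1, padding=3)
        elif token.startswith("d"):
            channels = int(token[1:])
            size = conv_out(size, kernel=3, stride=2, padding=1)
        elif token.startswith("R"):
            channels = int(token[1:])
            # Two 3x3 convolutions with stride 1 and padding 1 keep the size.
            size = conv_out(conv_out(size, 3, 1, 1), 3, 1, 1)
        elif token.startswith("u"):
            channels = int(token[1:])
            size = deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1)
        else:
            raise ValueError("unknown layer: " + token)
        shapes.append((token, channels, size))
    return shapes

SPEC = "c7s1-32,d64,d128," + ",".join(["R128"] * 9) + ",u64,u32,c7s1-3"

for layer, channels, size in trace_generator(SPEC, size=128):
    print(f"{layer:8s} -> {channels:3d} x {size} x {size}")
```

For a 128x128 input the trace goes down to 32x32 in the residual part and returns to 128x128 with 3 channels at the output, so the generator preserves the image size.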
For the discriminator we use the notation Ck: a 4x4 convolution followed by batch normalization and LeakyReLU with slope 0.2. The architecture of the discriminator is as follows:
C64-C128-C256-C512
However, for the first layer, C64, we do not apply BatchNorm.
At the end we add a single neuron with a sigmoid activation, which says whether its input is a fake or not.
A couple of words about the discriminator
Such a discriminator is a so-called fully convolutional network: it has no fully connected layers, only convolutions. Such networks can accept input images of any size, and the number of layers determines the receptive field of the network. This architecture was first presented in the article Fully Convolutional Networks for Semantic Segmentation.
In our case, the discriminator maps the input image not to a single scalar, as usual, but to 512 output maps of reduced size, from which the "real or fake" verdict is then drawn. This can be interpreted as a weighted vote over 512 patches of the input image. The size of a patch (the receptive field) can be estimated by tracing the activations back to the input. But kind people have written a utility that calculates everything for us. Such networks are also called PatchGAN.
In our case, the 3-layer PatchGAN with a 256x256 input image has a 70x70 receptive field, which is equivalent to cutting random 70x70 patches out of the input and judging from them whether the picture is real or generated. By controlling the depth of the network, we control the size of the patches. For example, a 1-layer PatchGAN has a 16x16 receptive field, and in that case we look at low-level features, while a 5-layer PatchGAN already looks at almost the whole picture. Here Phillip Isola explained this magic to me in plain terms. Read it, and it should become clearer to you too. The main point: fully convolutional networks work better than ordinary ones, and they should be used.
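The receptive-field numbers quoted above can be reproduced by hand. Below is a minimal sketch that propagates the receptive field from the output back to the input. The layer layout in patchgan_layers is an assumption: n_layers 4x4 convolutions with stride 2, then one 4x4 convolution with stride 1, then a 4x4 stride-1 output convolution, which is how I understand the reference PyTorch implementation to be organized. The function names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions.

    layers: list of (kernel, stride) pairs, ordered from input to output.
    """
    rf = 1
    # Walk from the last layer back to the input.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

def patchgan_layers(n_layers):
    """Assumed PatchGAN layout: n_layers stride-2 convs, one stride-1 conv, output conv."""
    return [(4, 2)] * n_layers + [(4, 1), (4, 1)]

for n in (1, 3, 5):
    size = receptive_field(patchgan_layers(n))
    print(f"{n}-layer PatchGAN: {size}x{size}")
```

Under this assumed layout the 1-, 3- and 5-layer variants give 16x16, 70x70 and 286x286, which matches the figures in the text: the last one is larger than a 256x256 input, so it effectively sees the whole picture.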
Specifics of training Cycle-GAN
To begin with, we tried to solve the problem of turning men's faces into women's and vice versa. Fortunately, there are datasets for this. For example, CelebA, which contains 200,000 photos of celebrity faces with binary attributes such as Gender, Eyeglasses, Beard, etc.

Splitting this dataset by the required attribute gives about 90k pictures of men and 110k pictures of women. These are our DomainX and DomainY.
The average size of the faces in these photos, however, is not very large (about 150x150), so instead of resizing all the pictures to 256x256 we brought them to 128x128. Also, to preserve the aspect ratio, the pictures were not stretched but fitted into a black 128x128 square. A typical generator input might look like this:
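As for the preprocessing itself, a minimal NumPy sketch of fitting a picture into a black square without stretching could look like this. The function name is hypothetical, and nearest-neighbour resizing is used only to keep the example dependency-free; in practice a proper resize from an image library would be the better choice.

```python
import numpy as np

def fit_into_black_square(img, size=128):
    """Resize img (H x W x C) to fit a size x size square, keeping the aspect ratio.

    The unused area is left black (zeros).
    """
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))

    # Nearest-neighbour resize: pick a source row/column for every target one.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = img[rows][:, cols]

    # Paste the resized picture into the centre of a black canvas.
    canvas = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# A white 150x100 "photo": the result has black bars on the left and right.
photo = np.full((150, 100, 3), 255, dtype=np.uint8)
square = fit_into_black_square(photo, size=128)
print(square.shape)
```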
Perceptual Loss
Intuition and experience prompted us to compute the identity loss not in pixel space but in the feature space of a pretrained VGG-16 network. This trick was first introduced in the article Perceptual Losses for Real-Time Style Transfer and Super-Resolution and is widely used in Style Transfer tasks. The logic is simple: if we want the generators to be invariant to the style of images from the target domain, why compute the error on pixels when there are features that carry information about style? You will find out a little later what effect this had.
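The difference between a pixel-space loss and a feature-space loss can be shown with a toy example. In the sketch below the feature extractor is NOT VGG-16: it is a hypothetical stand-in (average pooling at two scales) chosen only so that the example runs without pretrained weights. The structure of the loss, a distance between the features of two images under a fixed extractor, is the same idea.

```python
import numpy as np

def avg_pool_features(img, levels=(2, 4)):
    """Toy stand-in for a pretrained network: average pooling at several scales."""
    feats = []
    for k in levels:
        h, w = img.shape[0] // k * k, img.shape[1] // k * k
        pooled = img[:h, :w].reshape(h // k, k, w // k, k, -1).mean(axis=(1, 3))
        feats.append(pooled)
    return feats

def pixel_l1(a, b):
    """Ordinary per-pixel mean absolute error."""
    return np.mean(np.abs(a - b))

def perceptual_l1(a, b, extract=avg_pool_features):
    """Mean absolute error between feature maps, summed over feature levels."""
    return sum(np.mean(np.abs(fa - fb)) for fa, fb in zip(extract(a), extract(b)))

rng = np.random.default_rng(0)
a = rng.random((8, 8, 3))

# Fine checkerboard noise: changes every pixel but not the local averages.
i, j = np.indices((8, 8))
b = a + (0.1 * (-1.0) ** (i + j))[..., None]

print("pixel loss     :", pixel_l1(a, b))
print("perceptual loss:", perceptual_l1(a, b))
```

The pixel loss penalizes the noise, while the feature-space loss ignores it, because the toy features are insensitive to it. With real VGG features the invariances are of course different, but the mechanism is the same: the loss only sees what the extractor sees.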
The training procedure
In general, the model turned out to be rather cumbersome: 4 networks are trained at once. Images have to be passed back and forth through them several times to compute all the losses, and then all the gradients have to be propagated back as well. One epoch of training on CelebA (200k 128x128 images) on a GeForce 1080 takes about 5 hours, so there is not much room for experimenting.
I will only say that our configuration differed from the authors' only in replacing the Identity Loss with a Perceptual one. PatchGANs with more or fewer layers did not work out, so we kept the 3-layer one. The optimizer for all networks is Adam with betas = (0.5, 0.999). The learning rate is 0.0002 by default and was decreased every epoch. The batch size was 1, and in all networks BatchNorm was replaced (by the authors) with InstanceNorm. An interesting detail is that the discriminator was fed not the latest output of the generator but a random picture from a buffer of 50 images. Thus, an image generated by a previous version of the generator could reach the discriminator.
This trick and many others that the authors used are listed in the article by Soumith Chintala (one of the authors of PyTorch), How to Train a GAN? Tips and tricks to make GANs work. I recommend printing this list and hanging it near your desk. We have not got around to trying everything on it, for example LeakyReLU and alternative upsampling methods in the generator. But we did tinker with the discriminator and with the training schedule for the generator / discriminator pair. This really adds stability.
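The buffer of 50 generated images mentioned above can be sketched in a few lines of plain Python. The class name is hypothetical, and the 50/50 choice between returning the fresh image and swapping it for a stored one is how I remember the reference implementation behaving, so treat that detail as an assumption.

```python
import random

class ImageBuffer:
    """History of generated images that is shown to the discriminator."""

    def __init__(self, capacity=50, seed=None):
        self.capacity = capacity
        self.images = []
        self.rng = random.Random(seed)

    def query(self, image):
        """Store a freshly generated image and return one for the discriminator."""
        # While the buffer is filling up, store the image and return it unchanged.
        if len(self.images) < self.capacity:
            self.images.append(image)
            return image
        # Afterwards, with probability 0.5 return an older image
        # and put the new one in its place.
        if self.rng.random() < 0.5:
            idx = self.rng.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image
            return old
        # Otherwise return the fresh image and leave the buffer untouched.
        return image

# Integers stand in for image tensors here.
buffer = ImageBuffer(capacity=3, seed=0)
for step in range(8):
    print(step, "->", buffer.query(step))
```

In a training loop, the discriminator would be fed buffer.query(fake) instead of fake itself, so it keeps seeing outputs of earlier versions of the generator.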
Now come more pictures, which you have probably been waiting for.
In general, training generative networks differs somewhat from other deep learning tasks. Here you will often not see the familiar picture of a decreasing loss and a growing quality metric. You judge how good your model is mostly by looking at the outputs of the generators. The typical picture that we observed looked something like this:

The generator losses gradually diverge and the other losses decrease only slightly, but nevertheless the network produces decent pictures and has clearly learned something. By the way, to visualize the learning curves of the model we used visdom, a convenient and fairly easy to configure tool from Facebook Research. Every few iterations, we looked at the following 8 pictures:

real_A - input image from domain A
fake_B - real_A transformed by the generator A->B
rec_A - the reconstructed picture: fake_B transformed by the generator B->A
idty_B - real_A transformed by the generator B->A
and 4 similar images in the reverse direction

On average, good results can be seen already after the 5th epoch of training. Look, the generators' error does not decrease at all, but that did not stop the network from turning a man who looks like Hinton into a woman. Nothing is sacred!

Sometimes things can go very badly.

In this case, you can press Ctrl+C and call the journalists to tell them how you stopped an artificial intelligence.
In general, despite some artifacts and the low resolution, Cycle-GAN copes with the task.
See for yourself:
Men <-> Women

White <-> Asians

White <-> Black

Have you noticed the interesting effect produced by the identity mapping and the perceptual loss? Look at idty_A and idty_B. The woman becomes more feminine (more makeup, smooth light skin), the man gets more facial hair, white people become whiter, and black people, correspondingly, blacker. Thanks to the perceptual loss, the generators learn the average style of the entire domain. This suggests an application for "beautifying" your photos. Shut up and give me your money!
Here's Leo:

And a few more celebrities:




Personally, I found this male Jolie frightening.
And now, attention, a very complicated case.
Increasing the resolution
CycleGAN coped well with the task. But the resulting images are small and contain some artifacts. The task of increasing resolution is called Image Super-Resolution, and neural networks have already learned to solve it. I want to mention two state-of-the-art models: SR-ResNet and EDSR.
In the article Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, the authors propose a generative network architecture for the super-resolution task (SRGAN), which is based on ResNet. In addition to the per-pixel MSE, the authors add a Perceptual loss that uses the activations of a pretrained VGG (see, everyone does it!) and, naturally, a discriminator.
The generator uses residual blocks with 3x3 convolutions, 64 filters, batch normalization and ParametricReLU. To increase the resolution, two SubPixel layers are used.
The discriminator has 8 convolutional layers with 3x3 kernels and a number of channels growing from 64 to 512, with LeakyReLU activations everywhere; each time the number of features doubles, the image resolution is reduced by strided convolutions. There are no pooling layers; at the end there are two fully connected layers and a sigmoid for classification. The generator and discriminator are shown schematically below.
The Enhanced Deep Super-Resolution network (EDSR) is the same SRResNet, but with three modifications:

No batch normalization. This saves up to 40% of the memory used during training and allows the number of filters and layers to be increased
No ReLU outside the residual blocks
The output of every residual block is multiplied by a factor of 0.1 before it is added to the previous activations. This stabilizes training.
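The three modifications listed above fit into one small function. The NumPy sketch below is a simplification made for illustration: it uses 1x1 "convolutions" (plain channel mixing) instead of the 3x3 convolutions of the real network, and the function name and weights are hypothetical.

```python
import numpy as np

def edsr_block(x, w1, w2, res_scale=0.1):
    """Toy EDSR-style residual block.

    x: feature map of shape (C, H, W); w1, w2: (C, C) weights of 1x1 convolutions.
    """
    h = np.einsum("oc,chw->ohw", w1, x)   # first convolution, no BatchNorm after it
    h = np.maximum(h, 0.0)                # ReLU is used only inside the block
    h = np.einsum("oc,chw->ohw", w2, h)   # second convolution
    return x + res_scale * h              # scaled residual, no ReLU after the sum

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 5))
w1 = rng.normal(size=(4, 4))
w2 = rng.normal(size=(4, 4))

out = edsr_block(x, w1, w2)
print(out.shape)
print("mean change of activations:", np.mean(np.abs(out - x)))
```

With res_scale = 0.1 each block can only nudge the activations slightly, which is one way to see why a very deep stack of such blocks stays stable during training.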
To train an SR network, you need a dataset of high-resolution images. Some effort and time had to be spent on scraping several thousand photos with the #face hashtag from Instagram. But there is no way around it: we all know that data collection and processing is 80+% of our work.
An SR network is usually trained not on full images but on small patches cut out of them. This lets the generator learn to handle fine details. At inference time, images of any size can be fed to the network, because it is a fully convolutional network.
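Preparing such training patches might look roughly like the NumPy sketch below. Several things here are assumptions made for illustration: the function name, the 4x scale factor, the reading of "patch size" as the size of the high-resolution patch, and the use of average pooling to produce the low-resolution input.

```python
import numpy as np

def random_training_pair(hr_image, patch=64, scale=4, rng=None):
    """Cut a random high-resolution patch and build its low-resolution counterpart."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = hr_image.shape[:2]

    # Random top-left corner such that the patch fits inside the image.
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    hr = hr_image[top:top + patch, left:left + patch]

    # Downscale by averaging every scale x scale block of pixels.
    lr = hr.reshape(patch // scale, scale, patch // scale, scale, -1).mean(axis=(1, 3))
    return lr, hr

rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)  # stand-in for a real photo
lr, hr = random_training_pair(photo, rng=rng)
print(lr.shape, hr.shape)
```

The network is then trained to map lr back to hr; since it is fully convolutional, nothing ties it to this patch size at inference time.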
In practice, EDSR, which supposedly should work better and faster than its predecessor SRResNet, did not show better results and trained much more slowly.
As a result, for our pipeline we chose SRResNet trained on 64x64 patches; for the Perceptual loss we used layers 2 and 5 of VGG-16, and we removed the discriminator altogether. Below are a few examples from the training set.



And this is how this model works on our artificial images. Not ideal, but not bad.

Inserting an image into the original
Even this problem can be solved with neural networks. I found one interesting work on image blending: GP-GAN: Towards Realistic High-Resolution Image Blending. I will not go into the details and will show only a picture from the article.

We did not get around to implementing this and made do with a simple solution. We paste the square with the transformed face back into the original, gradually increasing the transparency of the pasted picture towards its edges. It turns out like this:
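The pasting step itself can be sketched in NumPy as follows. This is a minimal illustration and not our exact code: the function names, the linear fall-off of opacity and the border width of 8 pixels are all assumptions.

```python
import numpy as np

def feather_mask(h, w, border):
    """Opacity mask: 0 at the edges, rising linearly to 1 at `border` pixels inside."""
    rows = np.minimum(np.arange(h), np.arange(h)[::-1])   # distance to nearest horizontal edge
    cols = np.minimum(np.arange(w), np.arange(w)[::-1])   # distance to nearest vertical edge
    a_rows = np.clip(rows / border, 0.0, 1.0)
    a_cols = np.clip(cols / border, 0.0, 1.0)
    return np.minimum.outer(a_rows, a_cols)

def paste_face(original, face, x, y, border=8):
    """Blend `face` into `original` with its top-left corner at column x, row y."""
    h, w = face.shape[:2]
    alpha = feather_mask(h, w, border)[..., None]
    out = original.astype(np.float64)
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * face + (1.0 - alpha) * region
    return out.astype(original.dtype)

# Black "photo" and white "face" make the transition easy to inspect.
original = np.zeros((100, 100, 3))
face = np.ones((40, 40, 3))
result = paste_face(original, face, x=30, y=30, border=8)
print(result[50, 28:40, 0])
```

The printed row shows the opacity ramp at the left edge of the pasted square: the values grow from the background towards the fully opaque centre of the face.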

Again, not ideal, but good enough for a quick job.
Many things can still be tried to improve the current result. I have a whole list; all it takes is time and more video cards.
And in general, I would really like to get together at some hackathon and turn the resulting prototype into a proper web application.
And then we can think about moving GANs to mobile devices. To do that, we need to try different techniques for speeding up and shrinking neural networks: factorization, knowledge distillation, and so on. We will soon have a separate post about this, stay tuned. Until next time!
Skull 1 november 2017, 9:21