MASK FACE INPAINTING BASED ON IMPROVED GENERATIVE ADVERSARIAL NETWORK

Face recognition technology is widely used in many aspects of daily life. However, its accuracy drops sharply when faces are occluded by objects such as masks and sunglasses. Wearing masks in public has been a crucial means of preventing illness, especially since the COVID-19 outbreak, and this poses challenges to applications such as face recognition. The removal of masks via image inpainting has therefore become a hot topic in computer vision. Deep learning-based image inpainting techniques have achieved notable results, but the restored images still suffer from problems such as blurring and inconsistency. To address these problems, this paper proposes an improved inpainting model based on the generative adversarial network: the model adds attention mechanisms to the sampling modules of the pix2pix network, and the residual module is improved by adding convolutional branches. The improved model can not only effectively restore faces obscured by face masks, but also inpaint randomly occluded face images. To further validate the generality of the model, tests are conducted on the CelebA, Paris Street and Place2 datasets, and the experimental results show that both SSIM and PSNR improve significantly.


INTRODUCTION
Image inpainting is the filling of missing areas of an image with pixels to make it as close to the original image as possible. Image inpainting or image completion has been one of the research focuses in the field of computer vision. Face image inpainting differs from general image inpainting in that the human face has distinctive features and complex geometric structures, especially the mouth, nose, and eyes, which contain rich and high-level semantic information (Jiang et al., 2022). Face image inpainting pays more attention to the rationality of image structures and semantics, and preserves the unique identity attributes of individuals.
Image inpainting methods fall into two primary categories: traditional inpainting algorithms (Jia & Tang, 2003; Simakov et al., 2008) and deep learning-based inpainting algorithms (Shao et al., 2022; Zeng et al., 2019; Zheng et al., 2019). Traditional inpainting algorithms are mainly based on the diffusion principle, which propagates neighbourhood pixels from intact areas into the missing areas and uses pixel similarity to complete the image. Such methods can restore small missing areas, but they are less effective for large missing areas or images with meaningful texture structures. In recent years, with the advent of convolutional neural networks and deep learning, many new inpainting models have emerged. Deep learning-based inpainting algorithms can effectively extract information from unobscured regions and generate semantically rich pixels to fill the missing regions. The emergence of generative adversarial networks has pushed image inpainting to a new level (Goodfellow, 2014). The Context Encoder (CE) algorithm was the first to train a generator adversarially against a discriminator and then use that generator for image inpainting (Pathak et al., 2016), but the images restored by CE are obviously blurred. To improve the inpainting quality, algorithms such as GLCIC (Globally and Locally Consistent Image Completion) (Iizuka et al., 2017), GMCNN (Generative Multi-column Convolutional Neural Networks) and PEPSI (Parallel Extended-decoder Path for Semantic Inpainting) (Sagong et al., 2019) appeared one after another on the basis of CE, but they still did not eliminate blurring and artifacts.
In this paper, a new face inpainting model based on the generative adversarial network is proposed by improving the original pix2pix model (Isola et al., 2017). A spatial attention mechanism (Jaderberg et al., 2016) and a channel attention mechanism (Zhang et al., 2018) are introduced in the encoder and decoder of the model, while multi-branch residual modules are added in the down-sampling path to improve the model's capacity for feature extraction and thus its inpainting quality. The improved model is tested on the face mask dataset and gives better inpainting results. To confirm the model's efficacy, we also run experiments on the CelebA, Paris Street and Place2 datasets, and the results show that both regular and random mask regions can be inpainted with better visual quality. Figure 1 shows some examples of the inpainting results of our method. The contributions of this paper are as follows:
(1). Construction of a face mask dataset based on the public CelebA dataset.
(2). Improvement of the UNet structure and proposal of a face image inpainting model that can effectively remove face masks.
(3). Improvement of the single-branch residual module by adding multiple branches with different convolution kernel sizes, which enhances the feature extraction capability of the model. More branches can be added if needed.
(4). Integration of spatial attention and channel attention into one model. A color image contains both spatial and channel information, so combining the two is crucial for face image inpainting.

Image Inpainting Based on GAN
Goodfellow et al. introduced a novel framework for building generative models through adversarial training. The framework simultaneously trains a discriminator network D and a generator network G. The generator produces new images with the aim of "fooling" the discriminator as much as possible, while the discriminator tries to distinguish generated images from real ones. Once trained, the generator can be used to restore images.
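For reference, the adversarial game described above is usually written as the following minimax objective from the original GAN formulation. The symbols p_data (the distribution of real images) and p_z (the noise prior) follow the common notation and are not used elsewhere in this paper.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

In conditional settings such as pix2pix, the generator receives the corrupted image as input rather than pure noise, but the structure of the objective stays the same.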
Many image inpainting methods have emerged on the basis of GAN. Yu et al. (2018) proposed a deep generative model based on a contextual attention mechanism, which "borrows" features and pixels from relatively distant regions. During training, the surrounding image features are explicitly used as a reference, and the long-range correlation between the occluded region and other regions is established by dilated convolution with a large receptive field. Liu et al. (2018) proposed partial convolution to ensure that the pixel responses in the masked region are synthesized only from the unmasked region. The occlusion mask is updated automatically as the valid region grows during the forward pass, which improves the model's inpainting performance for irregular occlusions. Gated convolution (Yu et al., 2019) further generalizes the idea of partial convolution by extending the feature selection mechanism of each layer to a learnable, location-dependent one, and hand-drawn sketches can be used to guide the inpainting process.
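The partial convolution idea mentioned above can be illustrated for a single sliding window. The following is a minimal sketch in plain Python, not the implementation of any cited work: the function name `partial_conv_window` and the flattened-window representation are assumptions made for illustration. Only valid pixels contribute to the response, the response is rescaled by the fraction of valid pixels, and the output mask entry becomes valid as soon as the window contains at least one valid pixel.

```python
def partial_conv_window(pixels, mask, weights, bias=0.0):
    """Partial-convolution response for ONE window (illustrative sketch).

    pixels, mask, weights: flat lists of equal length (the flattened window).
    mask entries are 1 for valid (unmasked) pixels and 0 for holes.
    Returns (output_value, updated_mask_value).
    """
    valid = sum(mask)
    if valid == 0:
        # No valid pixel in the window: output is zero and stays masked.
        return 0.0, 0
    # Rescale to compensate for the missing inputs: sum(1) / sum(M).
    scale = len(mask) / valid
    response = sum(w * x * m for w, x, m in zip(weights, pixels, mask))
    # At least one valid pixel: the output location becomes valid.
    return response * scale + bias, 1


# Example: a 2x2 window (flattened) whose right half is a hole.
value, new_mask = partial_conv_window(
    pixels=[1.0, 2.0, 3.0, 4.0],
    mask=[1, 1, 0, 0],
    weights=[0.25, 0.25, 0.25, 0.25],
)
```

Applying this rule layer after layer shrinks the hole in the mask, which is the automatic mask update referred to in the text.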
The UNet structure is useful in GAN models because its symmetry can effectively reduce the number of model parameters. Ronneberger et al. (2015) first proposed the UNet network model, which combines a down-sampling path and an up-sampling path joined by symmetric connections, and replaces sliding-window convolution and fully connected layers with better results. Zhou et al. (2018) proposed the UNet++ framework based on UNet, replacing the original skip connections with dense connections, which improved the segmentation performance of the model. Mou et al. (2022) proposed a new model, DGUNet (Deep Generalized Unfolding Network), based on the UNet structure, but it does not handle images with large corrupted regions well.

Attention
The attention mechanism is a data processing technique in machine learning and deep learning that is widely used in many learning tasks, including image processing and natural language processing. In essence, it resembles how people observe their environment. In image processing, the most commonly used attention mechanisms are spatial attention (SA) and channel attention (CA), which allocate resources at different levels. SA locates the spatial regions of interest and assigns weights to them, while CA allocates resources among the convolutional channels. The two focus on different parts, but both emphasize important information and suppress unimportant information. A deep generative model with contextual attention for images with large missing regions was proposed in (Wu et al., 2020). It uses convolution to calculate the matching scores between the foreground and the background and then applies softmax to obtain the attention score of each pixel. Xie et al. (2019) observed that existing methods restore missing regions incoherently, and therefore proposed a refinement-based deep generative model. Their method uses a new semantic attention layer that not only preserves the contextual structure but also models the information around the hole.
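The difference between the two kinds of attention described above can be made concrete with a small sketch. It is written in plain Python on nested lists of shape [C][H][W] and is only an illustration under stated assumptions: the channel attention follows the common squeeze, reduce, expand and gate pattern, with hypothetical weight matrices `w_down` and `w_up`, and the spatial attention uses the sigmoid of the channel-wise mean as a stand-in for a learned convolution. Neither function is claimed to match the modules used in this paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, w_down, w_up):
    """Reweight each channel by one scalar gate (illustrative sketch).

    feat:   feature map as nested lists, shape [C][H][W]
    w_down: hypothetical reduction weights, shape [C_r][C]
    w_up:   hypothetical expansion weights, shape [C][C_r]
    """
    # Squeeze: global average pooling gives one value per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Reduce with ReLU, then expand and gate with a sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_up]
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gates)]


def spatial_attention(feat):
    """Reweight each pixel location by one scalar weight (illustrative sketch).

    The weight is the sigmoid of the channel-wise mean; a trained model
    would typically compute it with a learned convolution instead.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    weights = [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]
    return [
        [[feat[k][i][j] * weights[i][j] for j in range(w)] for i in range(h)]
        for k in range(c)
    ]
```

Channel attention produces C gates that are shared across all pixel positions, while spatial attention produces H*W weights that are shared across all channels, which is why the two are complementary.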

AUTHORS' METHOD
To effectively recover the face area obscured by the mask, this paper proposes an improved generative adversarial network model based on the UNet structure. The whole model contains two sub-networks, the generator and the discriminator. The down-sampling modules of the generator incorporate the spatial attention mechanism, while the up-sampling modules use both the channel attention mechanism and the spatial attention mechanism. In addition, the authors improve the common residual module by adding convolutional branches with kernels of different sizes, which increases the feature extraction capability of the down-sampling module. Figure 2 displays the model's overall framework.

Fig. 2. Overall framework
The masked face images are input to the generator network, which produces mask-free faces of the same size through convolution operations. The output images of the generator network and the real face images are both fed into the discriminator network, which judges whether each is real or fake. Through this iterative process, the generator network and the discriminator network are trained against each other and gradually improve their respective performance. Once training has stabilized, we use the generator for face image inpainting. Figure 3 illustrates the general structure of the authors' generator network, where all images are uniformly resized to 128*128. The generator network consists of 7 down-sampling modules, 6 up-sampling modules and 1 up-sampling convolutional layer. The down-sampling and up-sampling modules perform the 'Concatenate' operation at the same size level to better combine the features extracted from the bottom to the top layers. The resolution of the feature maps becomes smaller during down-sampling and, consequently, some of the effective features are lost. Therefore, we add a multi-branch residual module and a spatial attention mechanism to each down-sampling module to compensate for the lost information.
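The module counts given above can be checked with simple arithmetic. The sketch below assumes that every down-sampling module halves the resolution and that every up-sampling step (the 6 modules plus the final up-sampling convolutional layer) doubles it; the function name and the factor of 2 for up-sampling are assumptions, since the text only states a stride of 2 for down-sampling.

```python
def trace_generator_resolutions(input_size=128, n_down=7, n_up_modules=6):
    """List the feature-map side length after each generator stage (sketch)."""
    sizes = [input_size]
    for _ in range(n_down):
        # Assumed: each down-sampling module halves the resolution.
        sizes.append(sizes[-1] // 2)
    for _ in range(n_up_modules + 1):  # +1 for the final up-sampling conv layer
        # Assumed: each up-sampling step doubles the resolution.
        sizes.append(sizes[-1] * 2)
    return sizes


sizes = trace_generator_resolutions()
print(sizes)
```

Under these assumptions a 128*128 input is reduced to 1*1 at the bottleneck and restored to 128*128 at the output, so every decoder level has an encoder level of the same size to concatenate with.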

Fig. 3. Generator network
The original residual module has only one convolution branch, and the convolution result is fused directly with the input. Considering that convolution kernels of different sizes can extract features at different levels, we transform the residual module into three branches with convolution kernel sizes of 1*1, 3*3 and 5*5. More branches can be added depending on the specific situation, as shown in figure 4. The stride of each convolution branch controls the scaling of the feature map.
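For the three branches to be fused, their outputs must have the same spatial size. The following sketch checks this with the standard convolution output-size formula, assuming a padding of kernel // 2 for each branch; the padding values and the function names are assumptions, as the paper does not state them.

```python
def conv_out_size(n, kernel, stride, padding):
    """Standard output side length of a convolution on an n*n input."""
    return (n + 2 * padding - kernel) // stride + 1


def branch_sizes(n, stride, kernels=(1, 3, 5)):
    """Output size of each branch, assuming padding = kernel // 2 (sketch)."""
    return {k: conv_out_size(n, k, stride, k // 2) for k in kernels}


print(branch_sizes(128, stride=1))  # all branches keep the input size
print(branch_sizes(128, stride=2))  # all branches halve the input size
```

With this padding choice the 1*1, 3*3 and 5*5 branches stay aligned for both stride 1 and stride 2. When the stride is 2, the shortcut from the input would also have to be down-sampled before it can be fused with the branch outputs.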

Fig. 4. Improved ResNet block
Two improved residual modules are used consecutively in the down-sampling module, one of which has its stride set to 2 to reduce the feature map resolution. The spatial attention module is connected after the residual modules to focus on the more important pixel information in space. Figure 5 displays the structure of the down-sampling module. Figure 6 shows the up-sampling module, which reconstructs the feature map to the size of the original input by successive up-sampling. Each up-sampling module contains a residual module, spatial attention and channel attention, in that order. The channel attention must be added after the 'Concatenate' operation in order to better extract useful information between channels.

Discriminator network
The discriminator network's primary job is to judge whether the input images are fake or real, so its input size is the same as the size of the generator's output images. As Figure 7 shows, the output of the discriminator network is a 1*16*16 matrix, where each element represents the discriminant value of a small region of the input image. With this structure, the discriminator judges each small region as a whole, which gives better results.
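To see how a 128*128 input can yield a 16*16 output in which each element covers a small region, the sketch below traces the output size and receptive field of a stack of strided convolutions. The layer configuration (three 4*4 convolutions with stride 2 and padding 1) is an assumption chosen because it reproduces the 16*16 output; the actual discriminator may contain further stride-1 layers, which would enlarge the receptive field.

```python
def patch_discriminator_stats(input_size=128, layers=((4, 2, 1),) * 3):
    """Return (output side length, receptive field) for a conv stack (sketch).

    layers: tuple of (kernel, stride, padding), an assumed configuration.
    """
    size, receptive_field, jump = input_size, 1, 1
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        # Each layer widens the receptive field by (kernel - 1) input steps,
        # measured in units of the current jump between neighbouring outputs.
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return size, receptive_field


out_size, rf = patch_discriminator_stats()
print(out_size, rf)
```

Under this assumed configuration each of the 16*16 output elements depends on a 22*22 patch of the input, so the discriminator scores local patches rather than the whole image at once.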

Loss function
Loss functions are essential in training a generative adversarial network. The total loss of the generator network includes the L1 loss, the adversarial loss and TVLoss (Rudin et al., 1992), while the discriminator network uses only the adversarial loss. The L1 loss compares the restored images with the original images at the pixel level, which plays a key role in the outcome of image inpainting. The adversarial loss drives the alternating, adversarial training of the generator network and the discriminator network, thus improving the performance of both sub-networks. During image inpainting, any noise has a relatively large impact on the result; in particular, the boundary of the occluded area can become incoherent. TVLoss, as a regularization term, keeps the image boundary smooth after inpainting.
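The two non-adversarial terms can be sketched on a single-channel image stored as a nested list. This is a minimal illustration, not the training code of this paper: the TV term uses the anisotropic form (sum of absolute differences between neighbouring pixels, normalized by the number of pixels), which is one of several common variants and is an assumption here.

```python
def l1_loss(restored, target):
    """Mean absolute pixel difference between two H*W images."""
    h, w = len(target), len(target[0])
    total = sum(
        abs(restored[i][j] - target[i][j]) for i in range(h) for j in range(w)
    )
    return total / (h * w)


def tv_loss(img):
    """Anisotropic total variation of an H*W image (assumed variant)."""
    h, w = len(img), len(img[0])
    vertical = sum(
        abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w)
    )
    horizontal = sum(
        abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1)
    )
    return (vertical + horizontal) / (h * w)


flat = [[1, 1], [1, 1]]   # perfectly smooth image
edge = [[0, 1], [0, 1]]   # image with one vertical edge
print(tv_loss(flat), tv_loss(edge), l1_loss(flat, edge))
```

A smooth image has zero total variation, while sharp transitions such as a visible seam around the restored region increase it, which is why the term discourages incoherent boundaries.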

EXPERIMENTAL RESULTS AND ANALYSIS
Based on the CelebA face dataset, we select images of frontal faces and manually add the three most common types of masks to form the mask face dataset. To confirm the generality of the inpainting model, we also conduct tests on the CelebA, Place2 and Paris Street datasets. The test results are compared with two classical methods, pix2pix and SRGan (Ledig et al., 2017). The optimizer used for model training is Adam, with β1 = 0.5 and β2 = 0.999. The learning rates of the discriminator network and the generator network are both set to 2×10^-4. The weights of the generator loss function are γ_L1 = 100, γ_adv = 100 and γ_TV = 1.
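Assuming the three generator loss terms are combined as a weighted sum, which is the usual reading of such weights, the total generator loss with the values stated above can be written as follows. The symbols L_G, L_{L1}, L_{adv} and L_{TV} are notation introduced here for illustration.

```latex
L_G = \gamma_{L1}\, L_{L1} + \gamma_{adv}\, L_{adv} + \gamma_{TV}\, L_{TV},
\qquad \gamma_{L1} = 100,\quad \gamma_{adv} = 100,\quad \gamma_{TV} = 1
```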

Datasets
The Chinese University of Hong Kong created the CelebFaces Attributes Dataset (CelebA), a large-scale face attribute dataset that includes 202,599 images of 10,177 celebrity identities. CelebA is one of the most commonly used datasets for training or testing models related to face image processing. The dataset is very comprehensive, covering European and Asian faces, men and women, older and middle-aged people, laughing and serious expressions, and stationary and athletic poses, among others. For our face inpainting experiment, 140,000 images with frontal faces are randomly selected from the whole dataset, with a ratio of men to women of about 1:1. In addition, the 3 types of masks that people most commonly use in daily life are chosen for constructing the dataset. Photoshop is used to overlay the masks, producing paired images of masked and unmasked faces as shown in figure 8. In each pair, the left part is the original face and the right part is the masked face; the two parts are placed side by side for comparison.

Fig. 8. Mask face dataset
In the experiment, in addition to the CelebA dataset, we also use the Place2 and Paris Street datasets, and the distribution of each dataset is shown in table 1.

Qualitative Evaluation
Both the pix2pix and SRGan algorithms produce some blurring in their inpainting results and cannot generate semantically coherent pixels in the mask-obscured region. However, the 'Concatenate' connections in the pix2pix model allow it to generate complex facial features such as the nose and mouth, so the overall result of pix2pix is better than that of SRGan. In our model, the improved residual module extracts features at different scales, while spatial attention and channel attention selectively focus on the key information in space and across channels, which is very important for face image inpainting. Figure 9 shows that the face image inpainting model proposed in this paper generates semantically reasonable content as well as the best local texture details.

Fig. 9. Comparison results on mask occlusion faces
In Figure 10, 64*64 rectangular masks are used in place of the face masks. The experimental results show that the algorithm in this paper produces more reasonable and coherent details and restores the masked area with high fidelity. The other two inpainting algorithms give poorer results, with significant blurring and distortion.

Fig. 10. Comparison results on rectangle occlusion faces
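A 64*64 occlusion on a 128*128 image, as used in this comparison, covers a quarter of the image area. The sketch below builds such a mask and applies it to an image stored as a nested list. Placing the hole at the image center and filling it with zeros are assumptions made for illustration, and the function names are hypothetical.

```python
def rectangle_mask(size=128, hole=64, top=None, left=None):
    """Binary mask with a hole*hole block of ones (1 = occluded pixel).

    The block is centered by default, which is an assumption.
    """
    if top is None:
        top = (size - hole) // 2
    if left is None:
        left = (size - hole) // 2
    return [
        [1 if top <= i < top + hole and left <= j < left + hole else 0
         for j in range(size)]
        for i in range(size)
    ]


def apply_mask(img, mask, fill=0):
    """Replace occluded pixels of an H*W image with a fill value."""
    return [
        [fill if m else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(img, mask)
    ]


mask = rectangle_mask()
coverage = sum(map(sum, mask)) / (128 * 128)
print(coverage)
```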
To further verify the inpainting effect of the model on face images, we apply random masks to face images. Figure 11 shows the comparison for random masks of around 10%, and figure 12 shows the comparison for random masks of around 30%. Pix2pix and SRGan give poor inpainting results, with clear blurring and blind areas. The comparison shows the obvious superiority of the face inpainting algorithm in this paper.
The proposed inpainting model not only accomplishes the inpainting task well on face images, but also works well on other datasets, so we also conduct comparison experiments on the Place2 and Paris Street datasets. The experimental results demonstrate that the inpainting results of the pix2pix and SRGan algorithms have artifacts and blind areas, while the inpainting model in this paper restores the occluded areas with high fidelity, as shown in figure 13 and figure 14.
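Random masks of roughly 10% and 30%, like those used for the face comparisons in figures 11 and 12, could be generated as in the sketch below. It assumes that the percentage refers to the fraction of occluded image area and builds the mask from randomly placed rectangles; the paper does not describe its mask generator, so the procedure, the function name and the parameters `max_block` and `seed` are all hypothetical.

```python
import random


def random_mask(size=128, target_ratio=0.10, max_block=16, seed=0):
    """Binary mask (1 = occluded) covering at least target_ratio of the area.

    Random rectangles of side 1..max_block are added until the target
    coverage is reached, so the overshoot is at most max_block**2 pixels.
    """
    rng = random.Random(seed)
    mask = [[0] * size for _ in range(size)]
    covered = 0
    while covered < target_ratio * size * size:
        h = rng.randint(1, max_block)
        w = rng.randint(1, max_block)
        top = rng.randint(0, size - h)
        left = rng.randint(0, size - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if mask[i][j] == 0:  # count each pixel only once
                    mask[i][j] = 1
                    covered += 1
    return mask


for ratio in (0.10, 0.30):
    m = random_mask(target_ratio=ratio)
    print(ratio, sum(map(sum, m)) / (128 * 128))
```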

Quantitative Evaluation
To better demonstrate the superiority of the inpainting method in this paper, we also perform a quantitative analysis of the inpainting results. PSNR and SSIM (Hore & Ziou, 2010) are used as evaluation metrics for the restored images. PSNR measures the ratio between the maximum possible signal power and the power of the noise that distorts it; a larger PSNR indicates less distortion in the restored image. SSIM measures the structural similarity of two input images, and a larger SSIM indicates better inpainting quality. Tables 2 to 5 show the results of the quantitative analysis: the inpainting results of our proposed algorithm are better than those of pix2pix and SRGan.
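Both metrics can be sketched in a few lines of plain Python on images given as flat lists of pixel values. PSNR follows its standard definition. The SSIM function below is a simplified, single-window version computed over the whole image with the commonly used constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2; standard SSIM averages the same expression over local windows, so this sketch should not be expected to reproduce the values reported in the tables.

```python
import math


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(a, b, max_val=255.0):
    """Single-window SSIM over the whole image (simplified sketch)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


original = [0.0, 255.0, 0.0, 255.0]
restored = [10.0, 245.0, 5.0, 250.0]
print(psnr(original, restored), global_ssim(original, restored))
```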

Loss function analysis
In a generative adversarial network, the generator and the discriminator are trained against each other, so the adversarial loss is necessary. In addition to the adversarial loss, the total loss function of the generator includes the L1 loss and TVLoss. As figure 15 shows, the final face inpainting results are very poor if the L1 loss, which enables pixel-level processing, is not used. TVLoss makes the restored image smoother, especially at the boundary of the occluded area.

Fig. 15. Qualitative comparison of loss function
The quantitative analysis is presented in table 6: using both the L1 loss and TVLoss improves both PSNR and SSIM. Both losses therefore play a useful role in the inpainting process.

Ablation Experiment
In addition to the quantitative and qualitative analyses, we also design ablation experiments to demonstrate the role of each module in the overall model. The experiments are set up without channel attention, without spatial attention and without the improved residual module, respectively. The ablation experiments are again run on the constructed mask face dataset, as shown in figure 16. Table 7 clearly shows the roles of the channel attention module, the spatial attention module and the improved residual module in the overall model. Among them, the improved residual module is particularly effective in extracting features and reconstructing pixels in the occluded region.

DISCUSSION
In this paper, the authors propose an improved face inpainting model based on the generative adversarial network, which can remove face masks. Face images differ from general images in that they have complex structures and contain rich semantic information, so face image inpainting is more difficult than general image inpainting. Face image inpainting has gained a lot of attention and some results have been achieved, but the restored face images often have problems such as blurring and inconsistent boundaries. The authors add improved residual modules to the generator network based on the UNet structure, and employ two attention mechanisms to improve image feature extraction for efficient mask removal. The experimental results show that the proposed method is effective not only for face image inpainting but also for general image inpainting. Therefore, if a person wears sunglasses, or if a fringe of hair covers the forehead, it should also be possible to remove the occlusion. This could help in applications such as the identification of suspects by the police.
Considering that human faces have distinct geometric contours, and that the eyes, ears and nose in particular are symmetric, we plan to add such prior knowledge to the face inpainting model to further enhance the inpainting of face images. In addition, since different faces share similarities, the training dataset can be planned more systematically. For example, the dataset should contain images of various skin tones, age groups and expressions. A more complete dataset may result in a better-trained inpainting model.

CONCLUSIONS
To effectively restore the area blocked by a face mask and to address the problems of blurring and artifacts in the restored face images, this paper proposes an improved face image inpainting model based on the pix2pix generative adversarial network. Compared with the original method, the authors' model introduces attention mechanisms, improves the residual module, and uses skip connections to link each down-sampling module with the corresponding up-sampling module. The experimental results show that the algorithm can handle face mask occlusion, rectangular occlusion and random occlusion. Compared with the pix2pix and SRGan algorithms, the improved model gives significantly better inpainting results. In addition, good inpainting results are obtained on the Paris Street and Place2 datasets, which indicates that the model generalizes well.
Author Contributions: Qingyu Liu participated in all experiments, coordinated the data analysis, contributed to the writing of the manuscript, designed the research plan and organized the study. Roben A. Juanatas designed the research plan and organized the study. He also coordinated the data analysis and contributed to the writing of the manuscript.
Funding: This study received support from the following sources: the University Natural Science
Acknowledgments: I would like to acknowledge and give my sincere thanks to Dr. Roben Juanatas and Dr. Mideth Abisado, who made this paper possible. Your advice and guidance carried me through all the stages of writing my paper. I would also like to thank my other teachers, Dr. Rex Bringula, Dr. Eric Blancaflor, Dr. Vladimir Mariano and Dr. Rodolfo Raga, for making my studies enjoyable and for your brilliant encouragement and suggestions. Thanks to all of you.
Conflicts of Interest: The authors declare no conflict of interest.