StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN

Bilkent University

StyleRes edits images with both high reconstruction quality and editing accuracy

Abstract

We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between reconstruction fidelity and editing quality remains an open challenge. Low-rate latent spaces lack the expressive power needed for high-fidelity reconstruction, while high-rate latent spaces degrade editing quality.

In this work, to achieve high-fidelity inversion, we learn residual features in higher-rate latent codes that the lower-rate latent codes were not able to encode. This preserves image details in the reconstruction. To achieve high-quality editing, we learn how to transform the residual features so that they adapt to manipulations of the latent codes. We train the framework to extract and transform these residual features via a novel architecture pipeline and cycle-consistency losses.
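To make the cycle-consistency idea concrete, here is a minimal PyTorch-style sketch of one possible formulation, assuming hypothetical invert, generate, apply_edit, and undo_edit callables (these names are illustrative, not the released code): edit the latent codes, generate the edited image, re-invert it, reverse the edit, and require the result to match the original image.

```python
# Hedged sketch of a cycle-consistency objective: since an edited image has no
# ground truth, we undo the edit after re-inversion and compare to the input.
# All function names below are illustrative assumptions.
import torch.nn.functional as F

def cycle_consistency_loss(x, invert, generate, apply_edit, undo_edit):
    latents, residual = invert(x)                       # W+ codes + residual features
    edited = generate(apply_edit(latents), residual)    # edited image (no ground truth)
    latents_e, residual_e = invert(edited)              # re-invert the edited image
    x_cycle = generate(undo_edit(latents_e), residual_e)
    return F.l1_loss(x_cycle, x)                        # should reproduce the input
```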

We run extensive experiments and compare our method with state-of-the-art inversion methods. Quantitative metrics and visual comparisons show significant improvements.

Method


StyleRes encodes the missing features needed for high-fidelity reconstruction of a given input via the first encoder, E1. These are the features that could not be encoded into the low-rate W+ space by E0 due to its information bottleneck. Through the second encoder, E2, StyleRes learns to transform the residual features based on the manipulated features. During training, latent codes are edited by interpolating the encoded W+ codes with ones randomly generated by StyleGAN's mapping network. During inference, they are edited with semantically meaningful directions discovered by methods such as InterfaceGAN and GANSpace. Note that the StyleGAN generator is drawn in two parts only for ease of visualization: the first part contains the layers that generate features up to 64x64 resolution, and the second part generates the higher-resolution features and the final image.
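Below is a minimal PyTorch-style sketch of the flow described above. The module and function names (encoder_e0, encoder_e1, encoder_e2, g_low, g_high, mapping_network), the tensor shapes, and the additive residual merge are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of the StyleRes forward pass and the training-time edit.
# Module names, shapes, and the additive residual merge are assumptions.
import torch

def styleres_forward(x, encoder_e0, encoder_e1, encoder_e2, g_low, g_high,
                     edit_fn=None):
    """Invert x, optionally edit its latent codes, and decode with residuals."""
    w_plus = encoder_e0(x)                       # low-rate W+ codes, e.g. (B, 18, 512)
    feats = g_low(w_plus)                        # StyleGAN features up to 64x64
    residual = encoder_e1(x, feats)              # details E0 could not encode

    if edit_fn is not None:                      # semantic edit on the latent codes
        w_plus = edit_fn(w_plus)
        feats = g_low(w_plus)                    # regenerate the edited features
        residual = encoder_e2(residual, feats)   # adapt residuals to the edit

    # remaining layers produce higher-resolution features and the final image
    return g_high(feats + residual, w_plus)


def training_edit(w_plus, mapping_network, alpha=0.5):
    """Training-time edit: interpolate encoded W+ codes with random ones."""
    z = torch.randn(w_plus.shape[0], 512, device=w_plus.device)
    w_rand = mapping_network(z).unsqueeze(1).expand_as(w_plus)  # (B, 18, 512)
    return torch.lerp(w_plus, w_rand, alpha)
```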

Results

StyleRes edits achieved using different methods

[Image grid: input photographs and StyleRes editing results for Bangs (+), Glasses (+), Bobcut (+), Close Eyes, Lipstick (+), Beard (+), Smile (+), Smile (-), Age (+), Age (-), and Pose edits.]

Related Links

In the paper, we compare our results mainly with the following methods:

e4e learns an encoder designed specifically for image editing. We also use it as our basic encoder. Basic encoders usually cannot preserve fine details.

HFGI learns an encoder that adds high-frequency details to the results of a basic encoder. Conceptually, our work is most similar to this one.

HyperStyle estimates new weights for a pretrained StyleGAN that may better fit a specific image. Their network also learns the high-frequency details.

The StyleGAN2 generator is inverted with our approach.

We utilize editing directions discovered by InterfaceGAN, GANSpace, and StyleCLIP.
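As an illustration of how such directions are typically applied at inference time (a generic sketch, not the exact code of any of the above methods): the W+ codes are shifted along a learned unit direction with a user-chosen strength, optionally on a subset of layers.

```python
# Generic sketch of applying a semantic latent direction in W+ space.
# The function name, default strength, and layer selection are assumptions.
import torch

def apply_direction(w_plus, direction, strength=3.0, layers=None):
    """Shift W+ codes (B, 18, 512) along a unit direction (512,)."""
    direction = direction / direction.norm()
    shift = torch.zeros_like(w_plus)
    idx = slice(None) if layers is None else layers   # optionally restrict to layers
    shift[:, idx, :] = strength * direction
    return w_plus + shift
```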

Acknowledgement

This work was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under Grant No 121E09.

A. Dundar was supported by a Marie Skłodowska-Curie Individual Fellowship.