The Gram matrix is just a way to represent style mathematically. In the NST paper, the authors define style as the correlations between the different feature maps of a given layer, where the layer's position in the network (i.e. a deeper layer or a shallower layer) determines the local scale on which the style is matched.
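As a minimal NumPy sketch, the Gram matrix of one layer is just the inner products between its flattened feature maps (the normalization by map size here is one common choice; the paper instead folds a 1/(4N²M²) factor into the style loss):

```python
import numpy as np

def gram_matrix(features):
    """Correlations between the feature maps of one layer.

    features: array of shape (C, H, W) -- C feature maps.
    Returns a (C, C) matrix G where G[i, j] is the inner product
    of the flattened i-th and j-th feature maps.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # flatten each feature map
    return flat @ flat.T / (h * w)      # normalize by map size (one common choice)
```

Note that the spatial layout is discarded by the flattening: only which features co-occur is kept, which is exactly why this captures style rather than content.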

They found that "matching the styles up to higher layers in the network preserves local image structures... leading to a smoother and more continuous visual experience."
So the idea here is to take the style at different scales from the style image and blend it into the generated image (which is initialized as white noise).
The content image is used to constrain the style transfer, so that the result does not diverge too much from the original image. There is no need for a Gram matrix here; a simple squared distance between activations will do.
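The content term can be sketched as the plain squared distance the paper uses (½ the sum of squared differences between the activations of the generated image and the content image at the chosen layer):

```python
import numpy as np

def content_loss(gen_feats, content_feats):
    """Squared distance between the activations of the generated image
    and the content image at the chosen layer -- no Gram matrix needed."""
    return 0.5 * np.sum((gen_feats - content_feats) ** 2)
```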
Content reconstruction degrades as you go deeper in the network: reconstructing the original input image from the activations of layer x is almost perfect for the first few layers, but becomes less precise and less detailed in the final layers.
So the idea is to choose a layer deep enough to allow effective style transfer (i.e. to allow meaningful, deep changes when reconstructing the image via backpropagation), but not so deep that too much of the image's detail is lost.
They tried different layers and eventually settled on the Conv4 layer for the content. They also tried taking the content from Conv2, with the results shown in figure 5:
"When matching the content on a lower layer of the network, the algorithm matches [too] much of the detailed pixel information in the photograph and the generated image appears as if the texture of the artwork is merely blended over the photograph."
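Putting the pieces together, the full objective is a weighted sum of one content term and Gram-matrix style terms from several layers. This is a sketch under assumptions: the layer names, the uniform layer weights, and the alpha/beta defaults are illustrative, not the paper's exact configuration.

```python
import numpy as np

def gram(feats):
    """Gram matrix of one layer's feature maps: (C, H, W) -> (C, C)."""
    c = feats.shape[0]
    flat = feats.reshape(c, -1)
    return flat @ flat.T

def total_loss(gen_acts, content_act, style_acts, alpha=1.0, beta=1e3):
    """Combined NST objective (a sketch; layer names and weights are illustrative).

    gen_acts:    dict layer-name -> (C, H, W) activations of the generated image
    content_act: (layer-name, activations) pair for the content layer
    style_acts:  dict layer-name -> activations of the style image
    alpha, beta: content vs. style weights, as in the paper
    """
    c_layer, c_feats = content_act
    # Content term: plain squared distance, no Gram matrix.
    loss_c = 0.5 * np.sum((gen_acts[c_layer] - c_feats) ** 2)
    # Style term: squared Gram-matrix distance, averaged over the style layers,
    # with the paper's 1/(4 N^2 M^2) normalization per layer.
    loss_s = 0.0
    layer_weight = 1.0 / len(style_acts)
    for layer, s_feats in style_acts.items():
        n, h, w = s_feats.shape
        m = h * w
        diff = gram(gen_acts[layer]) - gram(s_feats)
        loss_s += layer_weight * np.sum(diff ** 2) / (4 * n**2 * m**2)
    return alpha * loss_c + beta * loss_s
```

In the actual algorithm this loss is minimized with respect to the generated image's pixels by backpropagating through the fixed network.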
