The Gram matrix is just a way to represent style mathematically. In the NST paper, the authors define style as the correlations between the different feature maps of a given layer, where the layer's position in the network (i.e. a deeper layer or a shallower layer) determines the local scale on which the style is matched.
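As a minimal NumPy sketch, the Gram matrix of one layer is just the inner products between its flattened feature maps (the normalization by map size here is one common choice; the paper instead folds a 1/(4N²M²) factor into the style loss):

```python
import numpy as np

def gram_matrix(features):
    """Correlations between the feature maps of one layer.

    features: array of shape (C, H, W) -- C feature maps.
    Returns a (C, C) matrix G where G[i, j] is the inner product
    of the flattened i-th and j-th feature maps.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # flatten each feature map
    return flat @ flat.T / (h * w)      # normalize by map size (one common choice)
```

Note that the spatial layout is discarded by the flattening: only which features co-occur is kept, which is exactly why this captures style rather than content.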

They found that "matching the styles up to higher layers in the network preserves local image structures... leading to a smoother and more continuous visual experience."
So the idea here is to take the style at different scales from the style image and blend it into the generated image (which is initialized as white noise).
The content image is used to constrain the style transfer, so that the result does not diverge too much from the original image. There is no need for a Gram matrix here; a simple squared distance between activations will do.
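The content term can be sketched as the plain squared distance the paper uses (½ the sum of squared differences between the activations of the generated image and the content image at the chosen layer):

```python
import numpy as np

def content_loss(gen_feats, content_feats):
    """Squared distance between the activations of the generated image
    and the content image at the chosen layer -- no Gram matrix needed."""
    return 0.5 * np.sum((gen_feats - content_feats) ** 2)
```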
Content reconstruction degrades as you go deeper in the network: reconstructing the original input image from the activations of layer x is almost perfect for the first few layers, but becomes less precise and less detailed in the final layers.
So the idea is to choose a layer deep enough to allow effective style transfer (i.e. to allow meaningful, deep changes when reconstructing the image via backpropagation), but not so deep that too much of the image's detail is lost.
They tried different layers and eventually settled on the Conv4 layer for the content. They also tried taking the content from Conv2, with the results shown in figure 5:
"When matching the content on a lower layer of the network, the algorithm matches [too] much of the detailed pixel information in the photograph and the generated image appears as if the texture of the artwork is merely blended over the photograph."
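Putting the pieces together, the full objective is a weighted sum of one content term and Gram-matrix style terms from several layers. This is a sketch under assumptions: the layer names, the uniform layer weights, and the alpha/beta defaults are illustrative, not the paper's exact configuration.

```python
import numpy as np

def gram(feats):
    """Gram matrix of one layer's feature maps: (C, H, W) -> (C, C)."""
    c = feats.shape[0]
    flat = feats.reshape(c, -1)
    return flat @ flat.T

def total_loss(gen_acts, content_act, style_acts, alpha=1.0, beta=1e3):
    """Combined NST objective (a sketch; layer names and weights are illustrative).

    gen_acts:    dict layer-name -> (C, H, W) activations of the generated image
    content_act: (layer-name, activations) pair for the content layer
    style_acts:  dict layer-name -> activations of the style image
    alpha, beta: content vs. style weights, as in the paper
    """
    c_layer, c_feats = content_act
    # Content term: plain squared distance, no Gram matrix.
    loss_c = 0.5 * np.sum((gen_acts[c_layer] - c_feats) ** 2)
    # Style term: squared Gram-matrix distance, averaged over the style layers,
    # with the paper's 1/(4 N^2 M^2) normalization per layer.
    loss_s = 0.0
    layer_weight = 1.0 / len(style_acts)
    for layer, s_feats in style_acts.items():
        n, h, w = s_feats.shape
        m = h * w
        diff = gram(gen_acts[layer]) - gram(s_feats)
        loss_s += layer_weight * np.sum(diff ** 2) / (4 * n**2 * m**2)
    return alpha * loss_c + beta * loss_s
```

In the actual algorithm this loss is minimized with respect to the generated image's pixels by backpropagating through the fixed network.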
