Knowing the basic rules of the non-linear processing mechanics from Munteanu, on which extensions such as transformer neural networks, GPT-1/2/3, DALL-E, Imagen, and others are based, it is easy to describe its main components and functions.
This document describes the most important components of generative neural networks.
As an example, suppose the neural network must memorize the representation of a ball in an image. The most primitive parts of the image, starting with the first pixels, join together to form the first patterns; then, in the following layers, the patterns from the layer below are composed into new structures and joined into more complex patterns, until the whole image is compressed into a complex tree that represents the details of the image.
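A minimal sketch of this bottom-up composition, under the assumption that "parts" and "patterns" can be modelled as nested tuples (the names and data here are hypothetical, not the author's implementation):

```python
def compose_hierarchy(parts):
    """Builds a hierarchy by joining neighbouring patterns layer by layer."""
    layers = [list(parts)]                 # layer 0: the most primitive parts
    while len(layers[-1]) > 1:
        below = layers[-1]
        above = []
        for i in range(0, len(below), 2):  # join each pair of adjacent patterns
            above.append(tuple(below[i:i + 2]))
        layers.append(above)
    return layers                          # layers[-1][0] represents the whole input


if __name__ == "__main__":
    # Toy "pixel-level" parts of an image of a ball (purely illustrative).
    pixel_parts = ["edge", "curve", "shade", "highlight"]
    for level, layer in enumerate(compose_hierarchy(pixel_parts)):
        print(f"layer {level}: {layer}")
```

Running it prints one row per layer, ending with a single nested pattern that stands for the whole image, which is the "complex tree" mentioned above.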
Feedforward neural networks could not generate content; they could only describe what was in an image: given an image, the network identified it and returned the associated textual description. The new mechanism for associating an image with its description allows not only detection and description through text, but also uses the non-linear stimulation mechanism in the opposite direction: the description entered by the user moves the stimulus toward the hierarchy with the associated image and, starting from the highest level that describes the image, the stimulus moves down through all the patterns to the lowest level, diffusing the associated details down to the pixels.
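A hedged sketch of this top-down diffusion: the dictionary below is a hypothetical learned association between a text description and the root of an image hierarchy, and "diffusing" the root simply unfolds every nested pattern down to its leaf-level details.

```python
# Hypothetical memory: description -> nested pattern (tuples of tuples),
# with strings standing in for the leaf-level "pixel" details.
memory = {
    "a ball": (("edge", "curve"), ("shade", "highlight")),
}


def diffuse(pattern):
    """Walks a pattern top-down and collects its leaf-level details."""
    if isinstance(pattern, tuple):
        leaves = []
        for sub in pattern:
            leaves.extend(diffuse(sub))
        return leaves
    return [pattern]


print(diffuse(memory["a ball"]))   # -> ['edge', 'curve', 'shade', 'highlight']
```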
The non-linear stimulation mechanism also makes it possible to reconstruct details that are missing from the image, in other words, horizontal reflection. As a simpler example, consider learning the sentence "A chair on the beach". Knowing that the network learns by joining nearby patterns (parts), the first layer of the hierarchy will contain the patterns [A chair] [on the] [the beach]; in the next layer of the hierarchy these patterns are joined together, forming the pattern [[A chair] [on the] [the beach]].
Thus, if the network is presented with the text "A chair on the", it will infer the pattern [[A chair] [on the]]; this pattern will stimulate the higher-level neuron [[A chair] [on the] [the beach]], which serves as the context for the association with the lower-level neuron [the beach] and diffuses it, thus completing the missing details.
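A minimal sketch of this completion step, assuming the learned sentence is stored as a flat top-level pattern of its parts (a simplification of the nested representation above):

```python
# The top-level pattern [[A chair][on the][the beach]] kept in memory.
learned = ("A chair", "on the", "the beach")


def complete(partial_parts, pattern):
    """Returns the parts of `pattern` not covered by the partial input."""
    if list(pattern[:len(partial_parts)]) == list(partial_parts):
        return list(pattern[len(partial_parts):])   # the diffused missing details
    return []                                       # nothing recognised


print(complete(["A chair", "on the"], learned))     # -> ['the beach']
```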
The hierarchical representation of text structures is easier to imagine: the network can be viewed as a 2D form with width and height, where at the bottom level the words stand in a row and, with each level above, these patterns unite into patterns that describe more complex structures.
In the case of images, this hierarchy is a little more difficult to imagine, as it has not only width and height but also depth.
If only part of an image of a chair, for example, is presented to the network, and the network already has this representation in memory, it first infers the context of that part and passes it up to the highest representation in the hierarchy; then this representation stimulates the hierarchy from top to bottom, in other words, it diffuses the rest of the parts of the image, forming the whole picture.
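The same completion idea sketched for images: here a remembered picture is a hypothetical grid of patches, a partial view is matched against memory, and the missing patches are filled in from the stored representation (illustrative data only, not the author's code).

```python
# Hypothetical stored image: grid position -> patch content.
remembered_chair = {
    (0, 0): "back", (0, 1): "sky",
    (1, 0): "seat", (1, 1): "legs",
}


def complete_image(partial, memory_image):
    """If any patch is recognised, diffuse the remaining stored patches."""
    matches = sum(partial.get(pos) == patch for pos, patch in memory_image.items())
    if matches == 0:
        return partial                      # nothing recognised, nothing to diffuse
    completed = dict(memory_image)          # diffuse every stored patch...
    completed.update(partial)               # ...while keeping what was actually shown
    return completed


print(complete_image({(1, 0): "seat"}, remembered_chair))
```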
Image generation using text description.
If only a few text details are provided to generate the image, for example if the network has learned the images of "a chair in the kitchen", "a chair on the beach", and "a ball on the beach", and the user then injects a text description asking for a chair, the network will generate the chair; but when it tries to autocomplete the remaining details there are two options in this case, (1) the visual association with the beach and (2) the association with the kitchen, so the result is not concrete and ends up mixed, or at best the network chooses the first option to diffuse.
If you specify concretely that it should generate the chair on the beach, the properly learned image takes priority and a chair on the beach is generated.
If you ask it to generate a ball on the beach, the correct image likewise takes priority and is generated.
And if you ask it to generate a ball and a chair, two variants have priority and the network will try to generate everything together.
Thus, if you ask it to generate the image of a chair next to a ball, the diffusion mechanism, having the chair and the ball as context and knowing that both contexts are associated with the beach, will generate the chair, then sand around it, then the ball and the sand around it, located in the same positions as in the original images. How exactly is this connection formed if the network has not been explicitly trained to generate "a chair and a ball on the beach"? The answer is simple: in the hierarchical representation, relational structures are formed. Simply put, [chair next to sand] and [sand next to ball] are learned, and these patterns share the common connecting factor "sand". The scene will be generated correctly if the network has learned the ball and the chair in different positions; if the learned images place both subjects in the same place, the image will be generated incorrectly. This image generation mechanism is similar to text generation, only it works with more details. A sketch of this relational linking follows below.
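A hedged sketch of that relational composition, assuming the network has learned only pairwise relations such as [chair next to sand] and [sand next to ball]; the shared element "sand" is the common context that links the two subjects when a new scene has to be generated (the relation list is hypothetical).

```python
# Hypothetical learned relations: pairs of elements that appeared next to each other.
learned_relations = [("chair", "sand"), ("sand", "ball")]


def shared_context(subject_a, subject_b, relations):
    """Finds elements related to both subjects in the learned relations."""
    neighbours = {}
    for left, right in relations:
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)
    return neighbours.get(subject_a, set()) & neighbours.get(subject_b, set())


# The common factor linking the two subjects, to be diffused around both in the new scene.
print(shared_context("chair", "ball", learned_relations))   # -> {'sand'}
```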