There has been intense debate about image generation in recent weeks, after the technology was made publicly available. Several proprietary programs had offered limited free tiers for some time; then, with Stable Diffusion, an open-source model was published this August. One popular discussion has been whether the created images can really be compared to images created by human creativity. I don’t want to get into this discussion here. Rather, my question is what would happen in a next step: at the moment, we talk about AI-produced images as the output of these programs and deep learning models. But what happens if they become part of the very image databases the models are trained on? What if AI-generated images are fed back into the models for image generation? My hypothesis is that this will produce entropy in the image generation process, which, in the long run, will tend towards ever more noise.
Why? When trying out Stable Diffusion, my experience was that most images are a bit off. Take, for instance, an image created with the prompt that Stable Diffusion suggests as a sample: “a photograph of an astronaut riding a horse”.
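For anyone who wants to reproduce this, here is a minimal sketch of how such an image can be generated. It assumes Hugging Face’s diffusers library, the CompVis/stable-diffusion-v1-4 checkpoint, and a CUDA-capable GPU; this is just one possible setup, not the only way to run the model.

```python
# Minimal sketch: generate one image for the sample prompt with Stable Diffusion.
# Assumes the diffusers library and the CompVis/stable-diffusion-v1-4 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # use "cpu" (and drop the float16 dtype) without a GPU

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```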
While this creates astonishing images, many of which have been shared within the last weeks, can you really tell what is happening with this horse-ish structure at the bottom? There is certainly a general appearance of “horse” in it, but as a whole it does not make much sense. What happens if this image is again used to train a model? I would suspect that, on a larger scale, it will contribute to a decay of the concepts the model has been trained on.
Of course, one obvious objection is that images created by humans are by no means adequate representations of the things they depict. Take, as one random example, Paul Klee’s “Fire in the Evening”, which requires some intellectual work to relate it to the object named in the title.
But, however far off from a naturalistic depiction of the object this may be, there is a fundamental difference between the abstraction and estrangement carried out in modern art and the distortions produced within AI imagery. Art brings concepts into images. As such, concepts in images are always related to concepts outside the realm of the imagery. In contrast, AI imagery always relates back to images; here, concepts always stay within the realm of images themselves. Within the purely algorithmic process there is a lack of external resources that could provide the images with sense and thus (re-)sharpen the given concepts.
If these assumptions are correct, then the world of algorithmic image generation depends on a constant struggle with entropy. For now, this struggle mainly seems to take place on two levels where humans still play a crucial role: when we “teach” the model new concepts, and when we “curate” the images produced by the AI.
If entropy really is a thing, teaching concepts is to be understood not as a one-time activity but as a continuous accomplishment, in which concepts have to be actively and continually respecified. For now, however, the process of curating the images is more prominent. Most current programs produce multiple images per prompt. Thus, they already anticipate a human decision process that will discard most of the images created. This process can assure a certain coherence in a recursive database. At the same time, many images contain artefacts that will be dismissed as some kind of stylized deviation. Take, for instance, the bubble-perm ears in this image of an “alligator mouse”.
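This overproduction can be made explicit. As a sketch, reusing the pipeline set up in the earlier example (the batch size of four is an arbitrary choice), one can request several candidates for a single prompt and save them all, leaving the actual selection to a human:

```python
# Sketch: produce several candidates per prompt, to be curated by hand afterwards.
# Assumes the StableDiffusionPipeline `pipe` set up as in the earlier sketch.
prompt = "a photograph of an astronaut riding a horse"
candidates = pipe(prompt, num_images_per_prompt=4).images

for i, img in enumerate(candidates):
    img.save(f"candidate_{i}.png")  # a human reviewer later keeps only the convincing ones
```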
So, there is still a chance that, despite the necessary step of curation, there will be a subcutaneous tendency towards entropy when we generate images from databases that contain images in turn generated by deep learning models.
But, of course, up till now all of this is speculation. What steps could be taken to put this hypothesis to the test?
As it is well established and many examples already circulate, the prompt “a photograph of an astronaut riding a horse” could be a good starting point. It would be possible to focus on its three elements: “astronaut”, “horse”, and “riding”. Entropy would show itself if these concepts became increasingly blurred. We can look at a few measures for this:
- What percentage of people recognize all three items in a given image?
- What is the mean of this percentage across a set of x images?
This mean percentage can be compared across sets of images created by models trained on different image databases. A decreasing percentage can indicate entropy within the model.
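To make these two measures concrete, here is a small sketch of how such ratings could be aggregated. The data structure is hypothetical: for each image, a list of per-rater dictionaries marking which of the three concepts were recognized.

```python
# Sketch: aggregate hypothetical rater judgements into the two measures above.
# ratings[image_id] is a list of per-rater dicts marking which concepts were recognized.
from statistics import mean

CONCEPTS = ("astronaut", "horse", "riding")

def pct_all_recognized(raters: list[dict]) -> float:
    """Percentage of raters who recognized all three concepts in one image."""
    hits = sum(all(r.get(c, False) for c in CONCEPTS) for r in raters)
    return 100 * hits / len(raters)

def mean_pct(ratings: dict[str, list[dict]]) -> float:
    """Mean of the per-image percentages across a set of images."""
    return mean(pct_all_recognized(raters) for raters in ratings.values())

# Hypothetical example: two images, three raters each.
ratings = {
    "img_001": [{"astronaut": True, "horse": True, "riding": True},
                {"astronaut": True, "horse": True, "riding": False},
                {"astronaut": True, "horse": True, "riding": True}],
    "img_002": [{"astronaut": True, "horse": False, "riding": False},
                {"astronaut": True, "horse": True, "riding": True},
                {"astronaut": False, "horse": False, "riding": False}],
}
print(mean_pct(ratings))  # mean recognition percentage for this set of images
```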
Now, images must be fed back into the database over the course of multiple steps. Across these steps, a change rate can give an impression of whether entropy occurs. The change rate c is given by (V2 – V1) / V1, where V1 is the mean percentage before a (next) feedback step and V2 the mean percentage after it. When the change rate is negative, there is entropy. When the change rate is zero or above, no entropy occurs.
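As a worked example (the percentages below are invented, purely to illustrate the calculation), the change rate for a sequence of feedback steps could be computed like this:

```python
# Sketch: change rate of the mean recognition percentage across feedback steps.
# The percentages below are invented, purely to illustrate the calculation.

def change_rate(v1: float, v2: float) -> float:
    """Relative change (V2 - V1) / V1; negative values indicate entropy."""
    return (v2 - v1) / v1

# Hypothetical mean percentages after 0, 1, 2, 3 feedback steps.
means = [82.0, 76.5, 70.0, 61.0]

for step, (v1, v2) in enumerate(zip(means, means[1:]), start=1):
    c = change_rate(v1, v2)
    verdict = "entropy" if c < 0 else "no entropy"
    print(f"feedback step {step}: c = {c:+.3f} ({verdict})")
```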
Of course, this is only a very rough outline of a test for the entropy hypothesis. Many more aspects would have to be considered when implementing it. For instance, it could make sense to additionally compare different degrees of curation of the output/input. This can be done by only feeding a predetermined number of “best” images back into the database, thereby acknowledging that curation is an important step that determines which images become publicly available and thus have a chance to become part of new image databases for training models. Probably, I have missed important aspects, or I have fundamentally misunderstood how image generation based on deep learning works. Don’t hesitate to let me know.