Week of April 28 – May 5, 2024
With the Docker issues out of the way, and with a much better understanding of how Docker works, I was able to turn my attention to how the StyleGAN2 model should process text and images. For the model to handle a user's input, a text encoder or image encoder is needed to break the written prompt or supplied photo down into latent vectors, which are then fed into the generator. The generator creates an image from that latent vector, with noise injected at every layer to add fine stochastic detail. At first I tried to do the calculations myself, breaking the text or images down and feeding them into the generator. Needless to say, that didn't work. After a few days, one of my teammates suggested that we incorporate CLIP into the application, using a pre-trained CLIP model to guide our generator as it trains and creates images. After that meeting I found the paper "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery" by Patashnik et al., which outlines the approach that team took to combine CLIP with StyleGAN.
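To make that concrete, here is a rough sketch of how a pre-trained CLIP model turns a prompt or a photo into an embedding that can be compared against generated images. This is only an illustration of the idea: the ViT-B/32 checkpoint, the prompt, and the image path are placeholders, not our final setup.

```python
import torch
import clip  # OpenAI's pre-trained CLIP (pip install git+https://github.com/openai/CLIP.git)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained text/image encoder pair

# Encode a text prompt into an embedding vector
tokens = clip.tokenize(["a smiling person with curly hair"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)

# Encode a reference photo the same way (path is a placeholder)
image = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embedding = model.encode_image(image)

# Cosine similarity measures how well the image matches the prompt
sim = torch.cosine_similarity(text_embedding, image_embedding)
print(sim.item())
```

Because both encoders map into the same embedding space, that similarity score is exactly the signal CLIP can feed back to the generator during training.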
The idea behind the training is a progressive schedule: start with the 128×128 FFHQ dataset of 70,000 images and apply a horizontal-flip augmentation, effectively doubling the training set to 140,000 images. After an as-yet-undetermined amount of training, probably in the ballpark of 50 epochs (roughly 7,000,000 training images seen), I will switch to the 512×512 dataset with the same horizontal-flip augmentation. The result should be a model that can process text and image latent vectors and, with the guidance of the CLIP model, produce somewhat recognizable images for the user.
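For the data side, something along these lines is what I have in mind: a loader that resizes FFHQ to the current resolution and includes a mirrored copy of every image, so 70,000 originals become 140,000 training images. The folder path and ImageFolder layout here are assumptions; the real FFHQ archive may need its own Dataset class.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import ConcatDataset, DataLoader

def make_ffhq_loader(root, resolution, batch_size=32):
    """Build a loader at the given resolution with mirrored copies included."""
    base_tf = T.Compose([T.Resize((resolution, resolution)), T.ToTensor()])
    flip_tf = T.Compose([T.Resize((resolution, resolution)),
                         T.RandomHorizontalFlip(p=1.0),  # always mirror this copy
                         T.ToTensor()])
    originals = ImageFolder(root, transform=base_tf)
    mirrored = ImageFolder(root, transform=flip_tf)
    # 70,000 originals + 70,000 mirrored copies = 140,000 training images
    return DataLoader(ConcatDataset([originals, mirrored]),
                      batch_size=batch_size, shuffle=True)

# Start at 128x128, then rebuild the loader at 512x512 later in training
loader_128 = make_ffhq_loader("data/ffhq", resolution=128)
# loader_512 = make_ffhq_loader("data/ffhq", resolution=512)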
The implementation of CLIP for text and image prompts is underway. It is a new model to me, but I am getting the hang of it, and thanks to wonderful papers like the one mentioned earlier and "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis" by Sauer et al., the work of fusing these two architectures has continued to progress.
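As a first step in that fusion, I am experimenting with CLIP-guided latent optimization in the spirit of StyleCLIP: keep the generator fixed and nudge a latent code so that the generated image's CLIP embedding moves toward the prompt's embedding. The sketch below assumes a `generator` callable that maps a latent to an image in [-1, 1]; that name and the hyperparameters are placeholders, not our final code.

```python
import torch
import torch.nn.functional as F
import clip

def clip_guided_edit(generator, w_init, prompt, steps=200, lr=0.05, device="cuda"):
    """Nudge a StyleGAN latent so the generated image better matches a text prompt.

    `generator` is assumed to map a latent tensor to an image in [-1, 1];
    the step count and learning rate are guesses, not tuned values.
    """
    clip_model, _ = clip.load("ViT-B/32", device=device)
    clip_model = clip_model.float()  # keep everything in float32 for the backward pass

    with torch.no_grad():
        text_emb = clip_model.encode_text(clip.tokenize([prompt]).to(device))
        text_emb = F.normalize(text_emb, dim=-1)

    w = w_init.clone().detach().to(device).requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        img = generator(w)                                         # synthesize at the current latent
        img = F.interpolate((img + 1) / 2, size=224,               # CLIP expects 224x224 inputs;
                            mode="bilinear", align_corners=False)  # its mean/std normalization is skipped for brevity
        img_emb = F.normalize(clip_model.encode_image(img), dim=-1)

        loss = 1.0 - (img_emb * text_emb).sum()                    # CLIP loss: pull the image toward the prompt
        opt.zero_grad()
        loss.backward()
        opt.step()

    return w.detach()
```

StyleCLIP also describes a latent mapper and global style directions; the optimization loop above is just the simplest of the paper's three approaches and a convenient place to start.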
Thank you for checking back in
Will Hoover
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., & Lischinski, D. (2021). StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv preprint arXiv:2103.17249.
Sauer, A., Min, J., & Geiger, A. (2023). StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv preprint arXiv:2301.09515.


