DeepFloyd IF

Stability AI has announced the release of DeepFloyd IF, a text-to-image model that uses cascaded pixel diffusion and can render legible text inside the images it generates. Developed by DeepFloyd, Stability AI's research lab, the model looks set to make a strong impression on the field of text-to-image generation.

IF delivers image quality and language understanding on par with Google's Imagen. It was trained on around 1.2 billion images from the LAION-5B dataset. In tests it even outperforms Imagen, achieving a zero-shot FID score of 6.66 on the COCO dataset (lower is better), ahead of other available models such as Stable Diffusion.

IF also supports image-to-image translation and inpainting. Like Imagen, DeepFloyd IF relies on two super-resolution models that bring the images up to 1,024 x 1,024 pixels, and it is offered in different model sizes with up to 4.3 billion parameters. For the largest model with an upscale to 1,024 pixels, the team recommends 24 gigabytes of VRAM; the largest model with a 256-pixel upscale still requires 16 gigabytes.
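To make the cascade and the memory figures concrete, here is a minimal sketch in Python. The 256- and 1,024-pixel targets and the VRAM numbers come from the text; the 64-pixel base resolution is an assumption drawn from the image-to-image description later in this article, and the names Stage, CASCADE, describe and fits are hypothetical helpers, not part of any DeepFloyd IF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    out_size: int  # output edge length in pixels (square images assumed)


# Assumed layout: a 64 px base model followed by the two super-resolution models.
CASCADE = [
    Stage("base", 64),
    Stage("super-resolution 1", 256),
    Stage("super-resolution 2", 1024),
]

# VRAM recommendations quoted in the text, valid for the largest model only.
VRAM_GB_LARGEST = {256: 16, 1024: 24}


def describe(cascade):
    """Return (name, edge, pixel count, upscale factor vs. previous stage) per stage."""
    rows, prev = [], None
    for stage in cascade:
        factor = stage.out_size // prev.out_size if prev else None
        rows.append((stage.name, stage.out_size, stage.out_size ** 2, factor))
        prev = stage
    return rows


def fits(vram_gb, target_size):
    """True if the given VRAM meets the quoted recommendation for that upscale target."""
    return vram_gb >= VRAM_GB_LARGEST[target_size]


rows = describe(CASCADE)
for name, edge, pixels, factor in rows:
    step = f"{factor}x upscale" if factor else "generated from noise"
    print(f"{name}: {edge} x {edge} ({pixels} pixels, {step})")
```

Under these assumptions each super-resolution model enlarges the edge length fourfold, and a 16-gigabyte card would meet the recommendation for the 256-pixel output but not for the 1,024-pixel one.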

DeepFloyd IF is a state-of-the-art text-to-image model released under a non-commercial license that permits research use. This allows research labs to examine and experiment with advanced text-to-image generation approaches. Stability AI intends to release a fully open-source version of DeepFloyd IF in the future, in line with its commitment to open-source development.

Description and Features

Deep text prompt understanding:

The model uses the T5-XXL-1.1 language model as its text encoder, with many text-image cross-attention layers that improve the alignment between prompt and image.

Rendering text descriptions in images:

Drawing on the language understanding of the T5 model, DeepFloyd IF generates coherent, legible text alongside objects with different properties in various spatial relations, a challenging task for most text-to-image models.

Source: https://stability.ai/blog/deepfloyd-if-text-to-image-model

A high degree of photorealism:

DeepFloyd IF also boasts a high degree of photorealism, with an impressive zero-shot FID score of 6.66 on the COCO dataset. This makes it a leading contender in the world of text-to-image models.
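For readers unfamiliar with the metric, the Fréchet Inception Distance (FID) is commonly defined as shown below. This is the standard definition from the literature rather than anything specific to DeepFloyd IF, and the symbols are notation introduced here: the subscript r refers to real images and g to generated ones, while mu and Sigma are the mean and covariance of the Inception network features extracted from each set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

A lower value means the feature statistics of the generated images are closer to those of real images. "Zero-shot" here means the model was evaluated on COCO without having been trained on it.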

Aspect Ratio Shift:

DeepFloyd IF can generate images with non-standard aspect ratios, both vertical and horizontal.

Zero-shot image-to-image translations:

With this capability, users can modify the style, patterns, and details of an output image while maintaining the basic form of the source image, without the need for fine-tuning. To achieve this, the model resizes the original image to 64 pixels, adds noise through forward diffusion, and then runs backward diffusion with a new prompt to denoise the image. In inpainting mode, the same process is applied only to a local region of the image. The super-resolution modules can further change the style of the output image based on a text prompt. This gives users a high degree of flexibility and creative control over image generation.
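The noising step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DeepFloyd IF's actual code: it uses the closed-form forward diffusion of a standard DDPM with a linear beta schedule, a hypothetical "strength" parameter to choose how much noise to add, and a flat list of pixel values scaled to [-1, 1] as a stand-in for the resized 64 x 64 source image. The schedule and parameters that IF really uses may differ.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Closed-form forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


def start_step(strength, num_steps):
    """Map a hypothetical strength in [0, 1] to the timestep where denoising starts."""
    return max(0, min(num_steps - 1, round(strength * num_steps) - 1))


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()

# Stand-in for a source image already resized to 64 x 64 RGB and scaled to [-1, 1].
source = [0.5] * (64 * 64 * 3)

# Higher strength means more noise, so less of the source image survives.
t = start_step(0.7, len(alpha_bars))
noisy = add_noise(source, t, alpha_bars, rng)

# A trained denoiser conditioned on the new prompt would now run the backward
# diffusion from step t down to 0; that model is outside the scope of this sketch.
```

The design point is that the source image is only partially destroyed: the more noise is added, the more freedom the new prompt has to change the result, and the less of the original structure is preserved.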