

Transformers are a popular neural-network architecture for natural language processing tasks such as text classification, question answering, and machine translation. However, they can also be applied to other tasks, such as text-to-image and text-to-video generation.

In text-to-image generation, transformers are used to generate images from text descriptions. This is a challenging task because the descriptions may be incomplete, ambiguous, or even contradictory. However, these models can learn to generate realistic images by training on a large number of text-image pairs. In the last few posts, we have seen how text can be used to generate realistic images.

A number of AI companies have recently launched text-to-image systems, including OpenAI's DALL-E, which has been open to all users for the past month, and Stability AI's open-source Stable Diffusion, which was released a month ago.
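If you want to try the text-to-image side yourself, here is a minimal sketch using the open-source Stable Diffusion weights through Hugging Face's diffusers library. The prompt and output filename are just examples, and you may need to accept the model licence on Hugging Face and have a CUDA GPU available.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the publicly released Stable Diffusion v1.4 weights.
# Half precision keeps memory usage manageable on consumer GPUs.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it to disk.
prompt = "a ballerina dancing on the roof of a skyscraper at night, city lights below"
image = pipe(prompt).images[0]
image.save("ballerina.png")
```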

Today, Meta announced Make-A-Video, an AI-powered video generator that can create video content from text or image prompts, similar to existing image-synthesis tools such as DALL-E and Stable Diffusion. In addition to producing new videos, it can also generate variations of existing videos. However, the software is not yet publicly available.

Make-A-Video by Meta AI: a state-of-the-art AI system that generates high-quality videos from text prompts.

Example prompt: "A ballerina performs a beautiful and difficult dance on the roof of a very tall skyscraper; the city is lit up and glowing behind her." Source: https://makeavideo.studio (see the site for many other examples).

Challenges with Text-to-Video Generation

There are some serious challenges facing text-to-video generation:

  1. These models require an enormous amount of computing power. Putting together even a single short video takes more compute than large text-to-image models, which already need millions of images for training.
  2. Training data is hard to come by: large collections of videos tagged or labelled with matching text prompts are not easily available.
  3. Just as with DALL-E and similar tools, text descriptions may be incomplete, ambiguous, or even contradictory. These tools will need to evolve to interpret such prompts and generate the required video.

How is Meta solving this?

To train its model, Meta combined data from three open-source image and video datasets. From standard datasets of labelled still images, the model learned what objects are called and what they look like; from a database of unlabelled videos, it learned how those objects move through the world. Combining the two approaches is what allowed Make-A-Video to succeed.
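As a rough illustration of that two-stage idea (this is a toy sketch, not Meta's actual architecture or code), here is a small PyTorch example: a stand-in text-to-image backbone provides appearance from text embeddings, and a lightweight 3D convolution stands in for the temporal layers that would be trained on unlabelled video. All module names, shapes, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyTextToImage(nn.Module):
    """Stand-in for a text-to-image backbone that learns appearance
    from labelled (text, image) pairs."""
    def __init__(self, text_dim=64, img_channels=3, size=32):
        super().__init__()
        self.size = size
        self.decode = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(),
            nn.Linear(128, img_channels * size * size),
        )

    def forward(self, text_emb):                       # (B, text_dim)
        img = self.decode(text_emb)
        return img.view(-1, 3, self.size, self.size)   # (B, 3, H, W)

class TemporalExpansion(nn.Module):
    """Stand-in for temporal layers that learn motion from unlabelled video:
    expands a single frame into T frames and mixes them over time."""
    def __init__(self, frames=8):
        super().__init__()
        self.frames = frames
        # 3D convolution over (time, height, width) to model motion.
        self.temporal = nn.Conv3d(3, 3, kernel_size=(3, 3, 3), padding=1)

    def forward(self, frame):                                        # (B, 3, H, W)
        video = frame.unsqueeze(2).repeat(1, 1, self.frames, 1, 1)   # (B, 3, T, H, W)
        return self.temporal(video)

# Stage 1 (appearance) would be trained on (text, image) pairs;
# stage 2 (motion) on unlabelled video clips, e.g. with a reconstruction loss.
t2i = ToyTextToImage()
motion = TemporalExpansion()
text_emb = torch.randn(2, 64)      # pretend text embeddings for two prompts
frames = motion(t2i(text_emb))     # toy "video": (2, 3, 8, 32, 32)
print(frames.shape)
```

The design choice this mirrors is that the expensive text-to-appearance mapping is learned once from still images, while the video data only has to teach the model how things move, which is why Meta could get away without large labelled text-video datasets.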

Read the paper to learn more: https://makeavideo.studio/Make-A-Video.pdf

This opens new vistas for creativity but also intensifies the debate on ethics and bias in using these models. Leave a comment on what you think the challenges or opportunities ahead might be.

We will keep an eye on when Meta makes these models public and cover it in our everyday series. For now, they have chosen to keep Make-A-Video as a research project.

Stay tuned!

