This talk introduces you to the world of generative AI, with a focus on text-to-image and text-to-video models for creating images and short videos. We will explain how neural networks use diffusion models and transformer architectures to generate multimodal output from short text inputs.
We focus on state-of-the-art tools such as Sora and Midjourney. The underlying techniques, such as latent diffusion models, allow us to generate and process images and videos by combining text understanding, provided by attention mechanisms and transformers, with image denoising processes.
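To make the denoising idea concrete, here is a minimal sketch of the noising step that diffusion models learn to reverse. It assumes the commonly used linear noise schedule and uses a flat list of numbers as a stand-in for an image latent; the function names and schedule values are illustrative assumptions, not the method of any particular tool.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * v + sigma * n for v, n in zip(x0, noise)]
    return x_t, noise


def estimate_x0(x_t, predicted_noise, t, alpha_bars):
    """Invert the noising formula, given a prediction of the added noise."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(v - sigma * n) / signal for v, n in zip(x_t, predicted_noise)]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
latent = [0.5, -1.0, 0.25, 2.0]  # hypothetical stand-in for an image latent
noisy, noise = add_noise(latent, 500, alpha_bars, rng)
# Here the true noise plays the role of a perfect network prediction.
restored = estimate_x0(noisy, noise, 500, alpha_bars)
```

In a real model, a neural network conditioned on the text prompt predicts the noise, and the clean result is approached over many small steps rather than in a single inversion.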
An in-depth look at the video generation process in Sora shows how visual data is compressed, split into patches and then reconstructed into the final video. In addition to Sora, we also discuss alternative tools such as Runway to cover a wide range of options for image and video generation.
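The splitting and reconstruction steps can be illustrated with a small sketch. It treats a video as a nested list indexed by frame, row and column, cuts it into non-overlapping spacetime patches, and reassembles it. The learned compression step is omitted, and the patch sizes and function names are hypothetical; this shows the general idea only and is not Sora's actual implementation.

```python
def patchify(video, pt, ph, pw):
    """Split video[T][H][W] into flattened patches of size pt x ph x pw."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


def unpatchify(patches, T, H, W, pt, ph, pw):
    """Rebuild the video from patches produced by patchify (same ordering)."""
    video = [[[0] * W for _ in range(H)] for _ in range(T)]
    remaining = iter(patches)
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                values = iter(next(remaining))
                for t in range(t0, t0 + pt):
                    for y in range(y0, y0 + ph):
                        for x in range(x0, x0 + pw):
                            video[t][y][x] = next(values)
    return video


# A toy 4-frame, 4x4 "video" whose pixel value encodes its own position.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)]
         for t in range(4)]
patches = patchify(video, 2, 2, 2)
rebuilt = unpatchify(patches, 4, 4, 4, 2, 2, 2)
```

Each flattened patch plays the role of one token in the sequence a transformer processes. Production code would typically do this with an array library instead of nested lists.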
By the end of this talk, you will have gained a basic understanding of diffusion models, an overview of image and video generation tools, and a deeper understanding of how one selected tool works. The talk will be complemented by practical examples and demos.