Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative models are revolutionizing several sectors.
  • Generative AI can transform texts to images, 3D images, audio, code, scientific texts, and create algorithms.

Paper Content

Introduction

  • Generative AI is a type of artificial intelligence that can generate novel content.
  • Expert systems used an if-else rule database to generate content.
  • Generative AI models use a discriminator or transformer model trained on a corpus or dataset.
  • Generative AI models are different from predictive machine learning systems.
  • Generative AI models can generate multimedia content from different input formats.

A taxonomy of generative ai models

  • Organized current generative artificial models into a taxonomy
  • Discovered 9 categories
  • Models described in detail in following section
  • Models published recently
  • Only 6 organizations behind deployment of models
  • Need computation power and skilled team to estimate parameters

Generative ai models categories

  • Nine categories described in Figure 1 of the previous section
  • Details of the models shown in Figure 1 are illustrated

Text-to-image models

  • DALL•E 2 is a model that generates images from text prompts
  • It uses the CLIP neural network to combine concepts, attributes and styles
  • CLIP embeddings are robust to image distribution shift and have impressive zero-shot capabilities
  • Imagen is a text-to-image diffusion model created by Google
  • It uses large transformer language models to encode text for image synthesis
  • Stable Diffusion is a latent-diffusion model developed by the CompVis group at LMU Munich
  • Muse is a Text-to-image transformer model that is more efficient than diffusion or autoregressive models

Text-to-3d models

  • Dreamfusion is a text-to-3D model developed by Google Research
  • Dreamfusion uses a pretrained 2D text-to-image diffusion model
  • Magic3D is a text-to-3D model made by NVIDIA Corporation
  • Magic3D uses a two-stage optimization framework
  • Magic3D achieves better results than Dreamfusion

Image-to-text models

  • Flamingo is a Visual Language Model created by Deepmind
  • VisualGPT is an image captioning model made by OpenAI
  • Both models can generate text from images

Text-to-video models

  • It is now possible to generate images and videos from text
  • Phenaki is a model developed by Google Research that can generate videos from open domain time variable prompts
  • Phenaki has three parts: C-ViViT encoder, training transformer and video generator
  • Soundify is a system developed by Runway that matches sound effects to video
  • Soundify has three parts: classification, synchronization and mix

Text-to-audio models

  • AudioLM is a model developed by Google for audio generation with long-term consistency
  • AudioLM is trained on large corpora of raw audio waveforms
  • AudioLM can generate coherent piano music continuations
  • Jukebox is a model developed by OpenAI for music with singing in the raw audio domain
  • Jukebox uses a hierarchical VQ-VAE architecture to compress audio into a discrete space
  • Jukebox is trained on 1.2 million songs from LyricWiki
  • Whisper is an Audio-to-Text converter developed by OpenAI
  • Whisper is trained on 680,000 hours of labeled audio data
  • Whisper uses an encoder-decoder transformer architecture

Text-to-text models

  • ChatGPT is a model by OpenAI which interacts in a conversational way
  • LaMDA is a language model for dialog applications which is pre-trained on 1.56T words of public dialog data and web text
  • PEER is a collaborative language model developed by Meta AI research trained on edit histories
  • Meta AI Speech from Brain is a model developed by Meta AI to help people unable to communicate through speech, typing or gestures
  • Training data for Meta AI Speech from Brain comes from four opensource datasets
  • EEG and MEG recordings are inserted into a brain model for Meta AI Speech from Brain
  • Results show that several components of the algorithm were beneficial to decoding performance

Text-to-code models

  • Text-to-text models do not cover all types of texts, such as code
  • Codex and Alphacode models help convert ideas into code
  • Codex is a general-purpose programming model
  • Programming can be broken down into two parts
  • Alphacode is a system for code generation for complex problems

Text-to-science models

  • Galactica is a large model for automatically organizing science
  • Minerva is a language model capable of solving mathematical and scientific questions
  • Galactica uses a transformer architecture and GeLU activation
  • Minerva solves problems by generating solutions step-by-step

Other models

  • Alphatensor is a revolutionary model for its ability to discover new algorithms
  • Alphatensor uses deep reinforcement learning to find tensor decompositions
  • GATO is a single generalist agent made by Deepmind that can be used for many tasks
  • Generative AI models can generate human motion and slides

Conclusions and further work

  • Generative Artificial Intelligence has capabilities such as creativity and personalization
  • Accurate in text-to-science and text-to-code tasks
  • Can help optimize creative and non-creative tasks
  • Difficult to find data for some models
  • Datasets and parameters have to be enormous
  • Models have trouble solving problems outside of the dataset
  • Takes a lot of time and computation capacity to run
  • Models face bias from the data
  • Accuracy is still an issue
  • Models need to be constrained due to lack of understanding of ethics
  • Still discovering what the purpose of this intelligence will be