Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generative models are revolutionizing several sectors.
Generative AI can transform text into images, 3D images, audio, code, and scientific text, and can even create algorithms.

Paper Content

Introduction

Generative AI is a type of artificial intelligence that can generate novel content.
Earlier expert systems relied on an if-else rule database to generate content.
Generative AI models instead use a discriminator or transformer model trained on a corpus or dataset.
Generative AI models are different from predictive machine learning systems.
Generative AI models can generate multimedia content from different input formats.

A taxonomy of generative AI models

The authors organize current generative AI models into a taxonomy
The taxonomy has 9 categories
Each model is described in detail in the following section
All of the models were published recently
Only 6 organizations are behind the deployment of these models
Estimating the parameters of these models requires large computation power and a skilled team

Generative AI models categories

Nine categories described in Figure 1 of the previous section
Details of the models shown in Figure 1 are illustrated
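As a compact view of the nine categories, the sketch below records the taxonomy as a plain Python dictionary. The category assignments follow only the sections of this summary; the `TAXONOMY` and `category_of` names are hypothetical helpers introduced for illustration, not something defined in the paper.

```python
# Category -> models, as listed in the sections of this summary.
# TAXONOMY and category_of are illustrative names, not from the paper.
TAXONOMY = {
    "text-to-image": ["DALL·E 2", "Imagen", "Stable Diffusion", "Muse"],
    "text-to-3D": ["DreamFusion", "Magic3D"],
    "image-to-text": ["Flamingo", "VisualGPT"],
    "text-to-video": ["Phenaki", "Soundify"],
    "text-to-audio": ["AudioLM", "Jukebox", "Whisper"],
    "text-to-text": ["ChatGPT", "LaMDA", "PEER", "Meta AI Speech from Brain"],
    "text-to-code": ["Codex", "AlphaCode"],
    "text-to-science": ["Galactica", "Minerva"],
    "other": ["AlphaTensor", "Gato"],
}


def category_of(model: str) -> str:
    """Return the category a model is listed under (case-insensitive)."""
    wanted = model.lower()
    for category, models in TAXONOMY.items():
        if any(name.lower() == wanted for name in models):
            return category
    raise KeyError(model)


if __name__ == "__main__":
    print(len(TAXONOMY), "categories")
    print("Whisper ->", category_of("Whisper"))
```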

Text-to-image models

DALL·E 2 is a model that generates images from text prompts
It uses the CLIP neural network to combine concepts, attributes and styles
CLIP embeddings are robust to image distribution shift and have impressive zero-shot capabilities
Imagen is a text-to-image diffusion model created by Google
It uses large transformer language models to encode text for image synthesis
Stable Diffusion is a latent-diffusion model developed by the CompVis group at LMU Munich
Muse is a text-to-image transformer model that is more efficient than diffusion or autoregressive models
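The zero-shot capability attributed to CLIP embeddings above can be illustrated with a toy sketch: an image embedding is compared with the embeddings of several candidate captions, and the most similar caption wins. This is only a minimal illustration of the idea. The three-dimensional vectors, the captions, and the `zero_shot_classify` helper are made up for the example; a real CLIP model produces the embeddings with learned image and text encoders, and the temperature value here is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Score each caption against the image and return (best caption, probabilities)."""
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    # Numerically stable softmax over the scaled similarities.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Hypothetical embeddings; a real model would compute these from pixels and text.
text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs)

if __name__ == "__main__":
    print(label)
```

No classifier is trained on the candidate labels; new classes are added simply by embedding new captions, which is what makes the approach zero-shot.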

Text-to-3D models

DreamFusion is a text-to-3D model developed by Google Research
DreamFusion uses a pretrained 2D text-to-image diffusion model
Magic3D is a text-to-3D model made by NVIDIA Corporation
Magic3D uses a two-stage optimization framework
Magic3D achieves better results than DreamFusion

Image-to-text models

Flamingo is a visual language model created by DeepMind
VisualGPT is an image captioning model built on top of OpenAI's pretrained GPT-2 language model
Both models can generate text from images

Text-to-video models

It is now possible to generate images and videos from text
Phenaki is a model developed by Google Research that can generate videos from a sequence of open-domain text prompts that change over time
Phenaki has three parts: C-ViViT encoder, training transformer and video generator
Soundify is a system developed by Runway that matches sound effects to video
Soundify has three parts: classification, synchronization and mix

Text-to-audio models

AudioLM is a model developed by Google for audio generation with long-term consistency
AudioLM is trained on large corpora of raw audio waveforms
AudioLM can generate coherent piano music continuations
Jukebox is a model developed by OpenAI for music with singing in the raw audio domain
Jukebox uses a hierarchical VQ-VAE architecture to compress audio into a discrete space
Jukebox is trained on 1.2 million songs paired with lyrics and metadata from LyricWiki
Whisper is an audio-to-text (speech recognition) model developed by OpenAI
Whisper is trained on 680,000 hours of labeled audio data
Whisper uses an encoder-decoder transformer architecture
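The idea of compressing audio into a discrete space, as described for Jukebox's VQ-VAE above, rests on vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that lookup step under simplifying assumptions. The two-dimensional codebook and latent values are invented for the example, and the encoder, decoder, and hierarchy of a real VQ-VAE are omitted.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(latents, codebook):
    """Replace each latent vector with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(codes, codebook):
    """Map discrete codes back to their codebook vectors (lossy reconstruction)."""
    return [codebook[k] for k in codes]


# Hypothetical 4-entry codebook and 4 latent frames, chosen for illustration only.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.2], [0.8, 0.9], [0.2, 0.7]]

codes = quantize(latents, codebook)
reconstruction = dequantize(codes, codebook)

if __name__ == "__main__":
    print(codes)
```

The resulting sequence of integer codes is what a downstream sequence model can be trained on, instead of the raw waveform.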

Text-to-text models

ChatGPT is a model by OpenAI that interacts in a conversational way
LaMDA is a language model for dialog applications that is pre-trained on 1.56T words of public dialog data and web text
PEER is a collaborative language model developed by Meta AI Research and trained on edit histories
Meta AI Speech from Brain is a model developed by Meta AI to help people unable to communicate through speech, typing or gestures
Training data for Meta AI Speech from Brain comes from four open-source datasets
EEG and MEG recordings are fed into a brain model for Meta AI Speech from Brain
Results show that several components of the algorithm were beneficial to decoding performance

Text-to-code models

Text-to-text models do not cover all types of texts, such as code
Codex and AlphaCode models help convert ideas into code
Codex is a general-purpose programming model
Programming can be broken down into two parts: splitting a problem into simpler problems, and mapping those simpler problems onto existing code
AlphaCode is a system for code generation for complex problems

Text-to-science models

Galactica is a large model for automatically organizing science
Minerva is a language model capable of solving mathematical and scientific questions
Galactica uses a transformer architecture and GeLU activation
Minerva solves problems by generating solutions step-by-step
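For reference, the GeLU activation mentioned above has the standard closed form GELU(x) = x · Φ(x), where Φ is the standard normal cumulative distribution function. The sketch below implements that exact form with the standard library. Whether Galactica uses the exact form or an approximation of it is not stated in this summary, so treat this as the textbook definition only.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(value, gelu(value))
```

Unlike ReLU, the function is smooth and lets small negative inputs pass through with a small negative output instead of clipping them to zero.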

Other models

AlphaTensor is notable for its ability to discover new algorithms
AlphaTensor uses deep reinforcement learning to find tensor decompositions
Gato is a single generalist agent made by DeepMind that can be used for many tasks
Generative AI models can generate human motion and slides
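To make the link between tensor decompositions and new algorithms concrete: a decomposition of the matrix multiplication tensor into fewer rank-one terms corresponds to a multiplication algorithm that uses fewer scalar multiplications. The classic hand-found example is Strassen's algorithm, which multiplies two 2×2 matrices with 7 multiplications instead of 8. The sketch below implements Strassen's scheme, not an algorithm discovered by AlphaTensor, and checks it against the naive product.

```python
def naive_2x2(A, B):
    """Standard 2x2 matrix product using 8 scalar multiplications."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def strassen_2x2(A, B):
    """Strassen's 2x2 matrix product using 7 scalar multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Each m_i is one scalar multiplication, i.e. one rank-one term.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = strassen_2x2(A, B)

if __name__ == "__main__":
    print(product)
```

Applied recursively to matrix blocks, saving one multiplication per 2×2 step is what lowers the asymptotic cost below cubic; searching for decompositions with even fewer terms is the kind of problem the reinforcement learning approach targets.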

Conclusions and further work

Generative Artificial Intelligence has capabilities such as creativity and personalization
Accurate in text-to-science and text-to-code tasks
Can help optimize creative and non-creative tasks
Difficult to find data for some models
Datasets and parameters have to be enormous
Models have trouble solving problems outside of the dataset
Takes a lot of time and computation capacity to run
Models face bias from the data
Accuracy is still an issue
Models need to be constrained due to lack of understanding of ethics
Still discovering what the purpose of this intelligence will be
