Abstract
- Generative models are revolutionizing several sectors.
- Generative AI can transform text into images, 3D images, audio, code, and scientific text, and can even create algorithms.
Paper Content
Introduction
- Generative AI is a type of artificial intelligence that can generate novel content.
- Earlier expert systems, by contrast, relied on an if-else rule database to generate content.
- Generative AI models use a discriminator or transformer model trained on a corpus or dataset.
- Generative AI models are different from predictive machine learning systems.
- Generative AI models can generate multimedia content from different input formats.
A taxonomy of generative AI models
- The paper organizes current generative AI models into a taxonomy
- The taxonomy has nine categories
- Each model is described in detail in the following section
- All of the models were published recently
- Only six organizations are behind the deployment of these models
- Estimating the parameters requires large computation power and a highly skilled team
Generative AI model categories
- Nine categories described in Figure 1 of the previous section
- Details of the models shown in Figure 1 are illustrated
Text-to-image models
- DALL·E 2 is a model that generates images from text prompts
- It uses the CLIP neural network to combine concepts, attributes and styles
- CLIP embeddings are robust to image distribution shift and have impressive zero-shot capabilities
- Imagen is a text-to-image diffusion model created by Google
- It uses large transformer language models to encode text for image synthesis
- Stable Diffusion is a latent-diffusion model developed by the CompVis group at LMU Munich
- Muse is a text-to-image transformer model that is more efficient than diffusion or autoregressive models
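The zero-shot behaviour attributed to CLIP above can be sketched as a nearest-caption lookup in a shared embedding space: an image embedding is compared with the embeddings of candidate captions, and the most similar caption is chosen. This is only an illustration of the idea. The vectors are made-up toy values rather than real CLIP outputs, and `zero_shot_label` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_label(image_emb, text_embs):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda caption: cosine(image_emb, text_embs[caption]))

# Toy embeddings (assumed values for illustration, not produced by CLIP).
text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

print(zero_shot_label(image_emb, text_embs))
```

No classifier is trained for the three labels here. Changing the label set only means embedding a different list of captions, which is what makes the approach zero-shot.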
Text-to-3D models
- DreamFusion is a text-to-3D model developed by Google Research
- DreamFusion uses a pretrained 2D text-to-image diffusion model
- Magic3D is a text-to-3D model made by NVIDIA Corporation
- Magic3D uses a two-stage optimization framework
- Magic3D achieves better results than DreamFusion
Image-to-text models
- Flamingo is a visual language model created by DeepMind
- VisualGPT is an image captioning model that adapts OpenAI's pretrained GPT-2 language model
- Both models can generate text from images
Text-to-video models
- It is now possible to generate images and videos from text
- Phenaki is a model developed by Google Research that can generate videos from open-domain, time-variable text prompts
- Phenaki has three parts: C-ViViT encoder, training transformer and video generator
- Soundify is a system developed by Runway that matches sound effects to video
- Soundify has three parts: classification, synchronization and mix
Text-to-audio models
- AudioLM is a model developed by Google for audio generation with long-term consistency
- AudioLM is trained on large corpora of raw audio waveforms
- AudioLM can generate coherent piano music continuations
- Jukebox is a model developed by OpenAI for music with singing in the raw audio domain
- Jukebox uses a hierarchical VQ-VAE architecture to compress audio into a discrete space
- Jukebox is trained on 1.2 million songs paired with lyrics and metadata from LyricWiki
- Whisper is an audio-to-text (speech recognition) model developed by OpenAI
- Whisper is trained on 680,000 hours of labeled audio data
- Whisper uses an encoder-decoder transformer architecture
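The phrase "compress audio into a discrete space" in the Jukebox bullets refers to vector quantization: each continuous latent vector is replaced by the index of its nearest entry in a codebook. The sketch below shows only that lookup step, not Jukebox's hierarchical architecture or its training. The codebook and latent values are assumptions chosen for illustration (a real codebook is learned), and `quantize` and `dequantize` are hypothetical helper names.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]

def dequantize(codes, codebook):
    """Recover the (lossy) continuous approximation from the discrete codes."""
    return [codebook[k] for k in codes]

# Toy codebook and encoder outputs (assumed values; real ones are learned).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

codes = quantize(latents, codebook)
print(codes)                        # discrete token sequence
print(dequantize(codes, codebook))  # what a decoder would receive
```

Once audio is a sequence of integer codes, it can be modelled with the same kind of autoregressive transformer used for text tokens.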
Text-to-text models
- ChatGPT is a model by OpenAI which interacts in a conversational way
- LaMDA is a language model for dialog applications which is pre-trained on 1.56T words of public dialog data and web text
- PEER is a collaborative language model developed by Meta AI research trained on edit histories
- Meta AI Speech from Brain is a model developed by Meta AI to help people unable to communicate through speech, typing or gestures
- Training data for Meta AI Speech from Brain comes from four open-source datasets
- EEG and MEG recordings are fed into a brain model for Meta AI Speech from Brain
- Results show that several components of the algorithm were beneficial to decoding performance
Text-to-code models
- Text-to-text models do not cover all types of texts, such as code
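- Codex and AlphaCode models help convert ideas into code
- Codex is a general-purpose programming model
- Programming can be broken down into two parts: splitting a problem into simpler subproblems, and mapping those subproblems to existing code (libraries, APIs, or functions)
- AlphaCode is a system for code generation for complex problems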
Text-to-science models
- Galactica is a large model for automatically organizing science
- Minerva is a language model capable of solving mathematical and scientific questions
- Galactica uses a transformer architecture and GeLU activation
- Minerva solves problems by generating solutions step-by-step
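For readers unfamiliar with the GeLU activation mentioned for Galactica, its standard definition is GELU(x) = x · Φ(x), where Φ is the standard normal cumulative distribution function. The sketch below implements that exact form using the error function. Whether Galactica uses this exact form or the common tanh approximation is not stated in this summary, so treat the choice here as an assumption.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"gelu({x:+.1f}) = {gelu(x):+.4f}")
```

Unlike ReLU, the function is smooth and lets small negative inputs through with a small negative output, while behaving almost like the identity for large positive inputs.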
Other models
- AlphaTensor is a model notable for its ability to discover new algorithms
- AlphaTensor uses deep reinforcement learning to find tensor decompositions
- Gato is a single generalist agent made by DeepMind that can be used for many tasks
- Generative AI models can generate human motion and slides
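To make the link between "tensor decompositions" and "new algorithms" concrete: AlphaTensor targets matrix multiplication, where a decomposition of the multiplication tensor into R rank-one terms corresponds to an algorithm that uses R scalar multiplications. The sketch below is Strassen's classical algorithm from 1969, not a result found by AlphaTensor. It multiplies two 2×2 matrices with 7 multiplications instead of the naive 8, which is the kind of saving such decompositions encode. The function names are illustrative.

```python
def naive_2x2(A, B):
    """Schoolbook 2x2 matrix product: 8 scalar multiplications."""
    return [
        [A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]],
        [A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]],
    ]

def strassen_2x2(A, B):
    """Strassen's 2x2 matrix product: 7 scalar multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(naive_2x2(A, B))
print(strassen_2x2(A, B))
```

Applied recursively to matrix blocks, saving one multiplication per 2×2 step lowers the asymptotic cost of large matrix products, which is why searching for lower-rank decompositions is worthwhile.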
Conclusions and further work
- Generative Artificial Intelligence has capabilities such as creativity and personalization
- Accurate in text-to-science and text-to-code tasks
- Can help optimize creative and non-creative tasks
- Difficult to find data for some models
- Datasets and parameters have to be enormous
- Models have trouble solving problems outside of the dataset
- Takes a lot of time and computation capacity to run
- Models face bias from the data
- Accuracy is still an issue
- Models need to be constrained due to lack of understanding of ethics
- Still discovering what the purpose of this intelligence will be