Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent work on text-conditional 3D object generation has shown promising results, but requires multiple GPU-hours to produce a single sample.
Generative image models produce samples in seconds or minutes.
This paper explores an alternative method for 3D object generation which produces 3D models in 1-2 minutes on a single GPU.
The method generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model.
Sample quality is still lower than state-of-the-art, but it is one to two orders of magnitude faster to sample from.
Evaluation code and models are available at https://github.com/openai/point-e.

Paper Content

Introduction

Text-to-image generative models can generate and modify high-quality images from natural language descriptions
Recent works have explored text-conditional generation in other modalities, such as video and 3D objects
This paper focuses on the problem of text-to-3D generation
Methods exist which train generative models directly on paired (text, 3D) data or unlabeled 3D data
Methods also exist which leverage pre-trained text-image models to optimize differentiable 3D representations
This paper combines the benefits of both categories by pairing a text-to-image model with an image-to-3D model
The text-to-image model leverages a large corpus of (text, image) pairs
The image-to-3D model is trained on a smaller dataset of (image, 3D) pairs
The two-stage generation process can be performed in a number of seconds
The generative stack is based on diffusion
The system can produce colored 3D point clouds that match both simple and complex text prompts

Background

Diffusion-based models were first proposed by Sohl-Dickstein et al. (2015) and popularized more recently
Gaussian diffusion setup of Ho et al. (2020) is used
Samples are produced by starting at random Gaussian noise and gradually reversing the noising process
Neural network is used to approximate q(x t−1 |x t )
Variance of p θ (x t−1 |x t ) is predicted as well as the mean
Differential equations can be used to sample from these models
Classifier guidance and classifier-free guidance can be used to trade off sample diversity for fidelity

Generative models explored for point clouds
GANs and GMMs used to fit latent representations
GANs trained from 2D images to generate 3D models
GANs used to generate 3D meshes
NeRFs used to generate complete 3D scenes
Text-conditional 3D generation optimized with 3D representations
3D models reconstructed from single or few images
VAEs used to predict point clouds from single views
Flow predictors and GANs used to generate novel views from few images
Image-to-image diffusion model used to synthesize novel views of an object

Method

Point•E does not optimize every view to match the text prompt
Point clouds must be preprocessed before rendering, which can sometimes lose information and result in lower performance than state-of-the-art techniques
Point•E produces samples in a small fraction of the time, making it more practical for certain applications

Dataset

Trained models on 3D models
Data formats and quality varied
Developed post-processing steps to ensure higher data quality
Converted data into generic format using Blender
Constructed point clouds from renders to sidestep issues with 3D meshes
Used heuristics to reduce frequency of low-quality models

View synthesis glide model

Point cloud models are conditioned on rendered views from a dataset.
Aim to generate 3D renders that match the dataset distribution.
Fine-tune GLIDE with a mixture of original and 3D datasets.
3D dataset is small, so only sampled 5% of the time.
Fine-tune for 100K iterations, making several epochs over 3D dataset.
Add special token to 3D renders’ text prompts to sample in-distribution renders.

Point cloud diffusion

Represent point cloud as tensor of shape K x 6
Use Transformer-based model to predict coordinates and colors
Input context is K x D tensor and 256 x D tensor from pre-trained ViT-L/14 CLIP model
Model is permutation-invariant to input point clouds

Point cloud upsampler

Image diffusion models use a hierarchy to achieve the best quality.
Point cloud generation uses a large base model to generate 1K points, then upsamples to 4K points using a smaller model.
Compute requirements scale with the number of points.
Upsampler uses the same architecture as the base model, with extra conditioning tokens for the low-resolution point cloud.
Model uses a separate linear embedding layer to distinguish conditioning information from new points.

Producing meshes

Convert point clouds into textured meshes using Blender
Producing meshes from point clouds is a difficult problem
Pre-trained SAP models sometimes lost important details of the shape
Use regression-based model to predict signed distance field of an object
Apply marching cubes to the SDF to extract a mesh
Assign colors to each vertex of the mesh using the color of the nearest point from the original point cloud

Results

Evaluated text-to-3D methods using CLIP R-Precision metric
Introduced new pair of metrics called P-IS and P-FID
Used PointNet++ model to extract features and predict class probabilities for point clouds

Model scaling and ablations

Trained a variety of base diffusion models
Evaluated models throughout training
Used same upsampler model and 306 pre-generated synthetic views
Text conditioning without text-to-image step results in worse CLIP R-Precision
Using single CLIP embedding to condition on images is worse than using a grid of embeddings
Scaling model improves speed of P-FID convergence and increases final CLIP R-Precision

Qualitative results

PointE can produce consistent and high-quality 3D shapes for complex prompts.
Point cloud diffusion model can fail to understand or extrapolate the conditioning image, resulting in a shape that does not match the original prompt.

Comparison to other methods

There is no standard set of benchmarks for text-conditional 3D synthesis.
Table 1 compares Point•E to other 3D generative techniques using CLIP R-Precision.
Point•E performs worse than the current state-of-the-art.
Two subtleties of the evaluation may explain some of the discrepancy.

Limitations and future work

Requires synthetic renderings, but could be trained to condition on real-world images
Produces colored 3D shapes at low resolution in point cloud format, could be extended to higher quality 3D representations
Could be used to initialize optimization-based techniques to speed up initial convergence
Has potential bias inherited from dataset
Ability to create point clouds that can be used to fabricate products in the real world

Conclusion

PointE is a system for text-conditional synthesis of 3D point clouds
PointE is capable of efficiently producing diverse and complex 3D shapes
PointE is trained with batch size 64 for 1,300,000 iterations
PointE uses 1024 diffusion timesteps
PointE uses a linear noise schedule for the upsampler model and a cosine noise schedule for the base models
PointE produces 10K samples using stochastic DDPM with the full noise schedule
PointE uses 64 steps of the Heun sampler from Karras et al. (2022) for both the base and upsampler models
PointE uses 150 diffusion steps for the base model and 50 diffusion steps for the upsampling model
PointE uses a PointNet++ model on ModelNet40 to compute P-FID and P-IS
PointE uses an encoder-decoder Transformer to convert point clouds into meshes
PointE uses a text-to-image model to produce in-distribution conditioning images
PointE also trains a pure text-conditional point cloud model without an additional image generation step

Link to paper#

Abstract#

Paper Content#

Introduction#

Background#

Related work#

Method#

Dataset#

View synthesis glide model#

Point cloud diffusion#

Point cloud upsampler#

Producing meshes#

Results#

Model scaling and ablations#

Qualitative results#

Comparison to other methods#

Limitations and future work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Background

Related work

Method

Dataset

View synthesis glide model

Point cloud diffusion

Point cloud upsampler

Producing meshes

Results

Model scaling and ablations

Qualitative results

Comparison to other methods

Limitations and future work

Conclusion