Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent work on text-conditional 3D object generation has shown promising results, but requires multiple GPU-hours to produce a single sample.
- Generative image models produce samples in seconds or minutes.
- This paper explores an alternative method for 3D object generation which produces 3D models in 1-2 minutes on a single GPU.
- The method generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model.
- Sample quality is still lower than state-of-the-art, but it is one to two orders of magnitude faster to sample from.
- Evaluation code and models are available at https://github.com/openai/point-e.
Paper Content
Introduction
- Text-to-image generative models can generate and modify high-quality images from natural language descriptions
- Recent works have explored text-conditional generation in other modalities, such as video and 3D objects
- This paper focuses on the problem of text-to-3D generation
- Methods exist which train generative models directly on paired (text, 3D) data or unlabeled 3D data
- Methods also exist which leverage pre-trained text-image models to optimize differentiable 3D representations
- This paper combines the benefits of both categories by pairing a text-to-image model with an image-to-3D model
- The text-to-image model leverages a large corpus of (text, image) pairs
- The image-to-3D model is trained on a smaller dataset of (image, 3D) pairs
- The two-stage generation process can be performed in a number of seconds
- The generative stack is based on diffusion
- The system can produce colored 3D point clouds that match both simple and complex text prompts
Background
- Diffusion-based models were first proposed by Sohl-Dickstein et al. (2015) and popularized more recently
- Gaussian diffusion setup of Ho et al. (2020) is used
- Samples are produced by starting at random Gaussian noise and gradually reversing the noising process
- Neural network is used to approximate q(x t−1 |x t )
- Variance of p θ (x t−1 |x t ) is predicted as well as the mean
- Differential equations can be used to sample from these models
- Classifier guidance and classifier-free guidance can be used to trade off sample diversity for fidelity
Related work
- Generative models explored for point clouds
- GANs and GMMs used to fit latent representations
- GANs trained from 2D images to generate 3D models
- GANs used to generate 3D meshes
- NeRFs used to generate complete 3D scenes
- Text-conditional 3D generation optimized with 3D representations
- 3D models reconstructed from single or few images
- VAEs used to predict point clouds from single views
- Flow predictors and GANs used to generate novel views from few images
- Image-to-image diffusion model used to synthesize novel views of an object
Method
- Point•E does not optimize every view to match the text prompt
- Point clouds must be preprocessed before rendering, which can sometimes lose information and result in lower performance than state-of-the-art techniques
- Point•E produces samples in a small fraction of the time, making it more practical for certain applications
Dataset
- Trained models on 3D models
- Data formats and quality varied
- Developed post-processing steps to ensure higher data quality
- Converted data into generic format using Blender
- Constructed point clouds from renders to sidestep issues with 3D meshes
- Used heuristics to reduce frequency of low-quality models
View synthesis glide model
- Point cloud models are conditioned on rendered views from a dataset.
- Aim to generate 3D renders that match the dataset distribution.
- Fine-tune GLIDE with a mixture of original and 3D datasets.
- 3D dataset is small, so only sampled 5% of the time.
- Fine-tune for 100K iterations, making several epochs over 3D dataset.
- Add special token to 3D renders’ text prompts to sample in-distribution renders.
Point cloud diffusion
- Represent point cloud as tensor of shape K x 6
- Use Transformer-based model to predict coordinates and colors
- Input context is K x D tensor and 256 x D tensor from pre-trained ViT-L/14 CLIP model
- Model is permutation-invariant to input point clouds
Point cloud upsampler
- Image diffusion models use a hierarchy to achieve the best quality.
- Point cloud generation uses a large base model to generate 1K points, then upsamples to 4K points using a smaller model.
- Compute requirements scale with the number of points.
- Upsampler uses the same architecture as the base model, with extra conditioning tokens for the low-resolution point cloud.
- Model uses a separate linear embedding layer to distinguish conditioning information from new points.
Producing meshes
- Convert point clouds into textured meshes using Blender
- Producing meshes from point clouds is a difficult problem
- Pre-trained SAP models sometimes lost important details of the shape
- Use regression-based model to predict signed distance field of an object
- Apply marching cubes to the SDF to extract a mesh
- Assign colors to each vertex of the mesh using the color of the nearest point from the original point cloud
Results
- Evaluated text-to-3D methods using CLIP R-Precision metric
- Introduced new pair of metrics called P-IS and P-FID
- Used PointNet++ model to extract features and predict class probabilities for point clouds
Model scaling and ablations
- Trained a variety of base diffusion models
- Evaluated models throughout training
- Used same upsampler model and 306 pre-generated synthetic views
- Text conditioning without text-to-image step results in worse CLIP R-Precision
- Using single CLIP embedding to condition on images is worse than using a grid of embeddings
- Scaling model improves speed of P-FID convergence and increases final CLIP R-Precision
Qualitative results
- PointE can produce consistent and high-quality 3D shapes for complex prompts.
- Point cloud diffusion model can fail to understand or extrapolate the conditioning image, resulting in a shape that does not match the original prompt.
Comparison to other methods
- There is no standard set of benchmarks for text-conditional 3D synthesis.
- Table 1 compares Point•E to other 3D generative techniques using CLIP R-Precision.
- Point•E performs worse than the current state-of-the-art.
- Two subtleties of the evaluation may explain some of the discrepancy.
Limitations and future work
- Requires synthetic renderings, but could be trained to condition on real-world images
- Produces colored 3D shapes at low resolution in point cloud format, could be extended to higher quality 3D representations
- Could be used to initialize optimization-based techniques to speed up initial convergence
- Has potential bias inherited from dataset
- Ability to create point clouds that can be used to fabricate products in the real world
Conclusion
- PointE is a system for text-conditional synthesis of 3D point clouds
- PointE is capable of efficiently producing diverse and complex 3D shapes
- PointE is trained with batch size 64 for 1,300,000 iterations
- PointE uses 1024 diffusion timesteps
- PointE uses a linear noise schedule for the upsampler model and a cosine noise schedule for the base models
- PointE produces 10K samples using stochastic DDPM with the full noise schedule
- PointE uses 64 steps of the Heun sampler from Karras et al. (2022) for both the base and upsampler models
- PointE uses 150 diffusion steps for the base model and 50 diffusion steps for the upsampling model
- PointE uses a PointNet++ model on ModelNet40 to compute P-FID and P-IS
- PointE uses an encoder-decoder Transformer to convert point clouds into meshes
- PointE uses a text-to-image model to produce in-distribution conditioning images
- PointE also trains a pure text-conditional point cloud model without an additional image generation step