Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Industries are moving towards modeling massive 3D virtual worlds
Need for content creation tools that can scale in terms of quantity, quality, and diversity of 3D content
Aim to train performant 3D generative models that synthesize textured meshes
Prior works lack geometric details, are limited in mesh topology, do not support textures, or rely on neural renderers
Introduce GET3D, a Generative model that directly generates Explicit Textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures
GET3D is able to generate high-quality 3D textured meshes across categories such as cars, chairs, motorbikes, animals, human characters, and buildings

3D content is important for many industries
Manual creation of 3D assets is time-consuming and requires technical and artistic skills
Creating many 3D models is difficult
Generative 3D networks can produce high-quality and diverse 3D assets
Requirements for 3D generative models: detailed geometry, arbitrary topology, textured mesh, 2D image supervision
Prior work has focused on subsets of the requirements
GET3D is a novel approach that fulfills all requirements
GET3D can generate high-quality geometric and texture details
GET3D can be adapted to other tasks, such as modeling material and lighting effects and text-guided 3D shape generation

2D generative models have been developed to generate photorealistic images
3D generative models focus mainly on generating geometry and disregard appearance
GET3D is able to generate diverse shapes with arbitrary topology, high-quality geometry, and texture
3D-aware generative image synthesis is a related line of work whose output is images rather than 3D assets
In contrast, GET3D directly outputs textured 3D meshes that can be readily used in standard graphics engines

GET3D framework synthesizes textured 3D shapes
Generation process is split into two parts: geometry branch and texture branch
Training uses an efficient differentiable rasterizer to render the textured mesh into 2D images
The whole pipeline is differentiable, allowing adversarial training from images
The generator is introduced in Sec. 3.1 of the paper, and the rendering and loss functions in Sec. 3.2

Aim to learn a 3D generator to map a sample from a Gaussian distribution to a mesh with texture
Sample two random input vectors, one for geometry and one for texture
Use non-linear mapping networks to map input vectors to intermediate latent vectors
Utilize DMTet to extract 3D surface mesh from SDF
Train with adversarial losses defined on 2D images
Use a rasterization-based differentiable renderer to obtain RGB images and silhouettes
Utilize two 2D discriminators, one on RGB images and one on silhouettes, to classify inputs as real or fake
Model is end-to-end trainable
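
The first two steps above (sampling two input vectors and mapping each to an intermediate latent) can be sketched as follows. This is a minimal illustration, not the paper's architecture: the latent size, the two-layer depth, the leaky-ReLU activation, and the names `make_mapping_network`, `f_geo`, and `f_tex` are all assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np

def make_mapping_network(dim, num_layers, rng):
    """Build a small MLP with random (untrained) weights that maps z -> w."""
    weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
               for _ in range(num_layers)]
    biases = [np.zeros(dim) for _ in range(num_layers)]

    def mapping(z):
        h = z
        for W, b in zip(weights, biases):
            h = h @ W + b
            h = np.where(h > 0, h, 0.2 * h)  # leaky ReLU (assumed activation)
        return h

    return mapping

rng = np.random.default_rng(0)
latent_dim = 8  # illustrative size only

# Hypothetical names: one mapping network per branch.
f_geo = make_mapping_network(latent_dim, num_layers=2, rng=rng)
f_tex = make_mapping_network(latent_dim, num_layers=2, rng=rng)

z1 = rng.standard_normal(latent_dim)  # Gaussian input for geometry
z2 = rng.standard_normal(latent_dim)  # Gaussian input for texture
w1 = f_geo(z1)  # intermediate latent for the geometry branch
w2 = f_tex(z2)  # intermediate latent for the texture branch
```

Keeping the two latents separate is what allows geometry and texture to be varied independently at generation time.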

DMTet is a differentiable surface representation
It represents geometry as a signed distance field (SDF) defined on a deformable tetrahedral grid
Deforming the grid makes better use of its fixed resolution
It allows for explicit meshes with arbitrary topology and genus
SDF values and deformations at each grid vertex are predicted from the geometry latent code using 3D convolutional and fully connected layers
Differentiable marching tetrahedra is used to extract the explicit mesh
Mesh topology is determined by the signs of SDF values
Shapes with arbitrary topology can be generated by predicting different signs of SDF values
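
To make the role of the SDF signs concrete, the sketch below handles a single edge of one tetrahedron. A surface vertex is placed on the edge only when the SDF changes sign between the two endpoints, and its position is the linear interpolation to the zero crossing. The function name `edge_zero_crossing` and the sample values are illustrative assumptions, not code from DMTet or GET3D.

```python
def edge_zero_crossing(va, vb, sa, sb):
    """Return the point on edge (va, vb) where the linearly interpolated
    SDF equals zero, or None if both endpoints lie on the same side."""
    if (sa < 0) == (sb < 0):
        return None  # no sign change, so the surface does not cross this edge
    denom = sb - sa  # non-zero whenever the signs differ
    # Linear interpolation: v = (va * sb - vb * sa) / (sb - sa)
    return tuple((a * sb - b * sa) / denom for a, b in zip(va, vb))

# SDF is -1 at the origin (inside) and +3 at (1, 0, 0) (outside).
p = edge_zero_crossing((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), -1.0, 3.0)
# p is (0.25, 0.0, 0.0): the crossing sits a quarter of the way along the edge.
```

Because the vertex position is a smooth function of the SDF values wherever a sign change exists, gradients can flow from the extracted mesh back to the predicted SDF, while flipping a sign adds or removes surface and so changes the topology.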

Directly generating a texture map consistent with the output mesh is difficult because the generated shape can have arbitrary topology.
Texture is modeled as a function that maps 3D location to RGB color.
Texture field is conditioned on the geometry latent code in addition to the texture latent code, since texture depends on geometry.
Texture field is represented using a tri-plane representation.
Feature vector of a surface point is recovered using projection and interpolation.
Texture field is only sampled at surface points.
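
The projection-and-interpolation step can be sketched as below. A surface point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are summed. The plane resolution, the channel count, the [-1, 1] coordinate range, and the function names `bilinear_sample` and `triplane_feature` are assumptions for illustration; the small network that decodes the aggregated feature to an RGB color is omitted.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at (u, v) in [-1, 1]^2.
    u indexes columns and v indexes rows."""
    h, w, _ = plane.shape
    x = (u + 1.0) * 0.5 * (w - 1)  # continuous column index
    y = (v + 1.0) * 0.5 * (h - 1)  # continuous row index
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bottom = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom

def triplane_feature(planes, p):
    """Project point p onto the xy, xz, and yz planes and sum the samples."""
    x, y, z = p
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))

rng = np.random.default_rng(0)
# Random stand-ins for generated feature planes: 4x4 resolution, 3 channels.
planes = {name: rng.standard_normal((4, 4, 3)) for name in ("xy", "xz", "yz")}
feature = triplane_feature(planes, (0.1, -0.3, 0.8))  # one surface point
```

Since the field is queried only at surface points, the number of lookups scales with the number of visible surface samples rather than with a dense 3D volume.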

We draw inspiration from Nvdiffrec to supervise our model during training.
We render the 3D mesh and texture field into 2D images using a differentiable renderer.
We use a 2D discriminator to distinguish images of real objects from images rendered from generated objects.
We use an adversarial objective with two separate discriminators for RGB images and silhouettes.
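
A two-discriminator objective of this kind can be sketched on raw discriminator logits as follows. The sketch assumes the common non-saturating logistic GAN loss and leaves out any regularization term; the function names are hypothetical, and the logits would in practice come from the RGB and silhouette discriminators applied to rendered and real images.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discriminator_loss(real_logits, fake_logits):
    """Push real logits up and fake logits down."""
    real_term = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake_term = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term

def generator_loss(fake_logits):
    """Non-saturating loss: the generator wants fake logits to be high."""
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)

def total_generator_loss(fake_rgb_logits, fake_mask_logits):
    """Sum the generator loss over the RGB and silhouette discriminators."""
    return generator_loss(fake_rgb_logits) + generator_loss(fake_mask_logits)

# Example: the RGB discriminator is fooled, the silhouette one is not.
loss = total_generator_loss([4.0, 3.5], [-2.0, -1.0])
```

Using a separate silhouette discriminator gives the geometry branch a training signal that does not depend on texture quality.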

ShapeNet categories used for evaluation: Car, Chair, and Motorbike
Training, validation, and test sets randomly split from each category
Camera poses randomly sampled from the upper hemisphere of each shape
Animal dataset from TurboSquid used for textures
House dataset from TurboSquid and Human Body dataset from Renderpeople used for qualitative results
Baselines: PointFlow, OccNet, GRAF, PiGAN, and EG3D
Metrics used to evaluate quality of synthesis: Coverage score and Minimum Matching Distance for geometry, each computed with Chamfer Distance and Light Field Distance; FID for rendered images
Quantitative and qualitative results provided
Ablation study conducted
Text-guided 3D content synthesis supported
Limitations of GET3D: relies on 2D silhouettes and a known camera distribution during training; a separate model is trained per category
Broader Impact: potential misuse or harmful applications, need to de-bias datasets
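
Of the evaluation metrics listed above, Chamfer Distance is the simplest to state in code. The sketch below uses one common convention (mean squared nearest-neighbor distance, summed over both directions); conventions differ between papers, so this is an assumption rather than the exact definition used in the GET3D evaluation. The function name `chamfer_distance` is illustrative.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]   # (N, M, 3) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)   # (N, M) squared distances
    a_to_b = sq_dist.min(axis=1).mean()    # each point in a to its nearest in b
    b_to_a = sq_dist.min(axis=0).mean()    # each point in b to its nearest in a
    return a_to_b + b_to_a

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0]])
d = chamfer_distance(a, b)  # 0.5: only the second point of a is unmatched
```

Coverage and Minimum Matching Distance are then built on top of such a pairwise distance between generated and reference shapes.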