Abstract

  • Massive data corpora have enabled progress in AI
  • 3D data is a notable omission in large-scale datasets
  • Objaverse 1.0 is a large dataset of 3D models with descriptive captions, tags, and animations
  • Objaverse has potential applications in training generative 3D models, improving tail category segmentation, training open-vocabulary object-navigation models, and creating a new benchmark for robustness analysis of vision models

Paper Content

Introduction

  • Massive datasets have enabled and driven rapid progress in AI
  • Language corpora on the web led to large language models
  • Paired image and text datasets led to vision-and-language pretrained models
  • YouTube video datasets led to video-capable models
  • Massive multimodal datasets led to models like CLIP and Stable Diffusion
  • Datasets moved from manually curated to harnessing the power of the web
  • 3D computer vision has no datasets comparable in scale to those used to train deep learning models in other areas of research
  • 3D assets used to train generative 3D models number at most in the thousands
  • OBJAVERSE 1.0 is a large-scale corpus of high-quality, richly annotated 3D objects
  • OBJAVERSE contains over 800K 3D assets designed by over 100K artists
  • OBJAVERSE can support 3D generative modeling
  • OBJAVERSE can improve the performance of long tail instance segmentation models
  • OBJAVERSE can be used to build a benchmark for evaluating the robustness of state-of-the-art visual classification models
  • OBJAVERSE can be used to train embodied AI agents
  • OBJAVERSE can enable fast and exciting progress in 2D and 3D computer vision applications
  • Scaling the size and scope of training datasets has been shown to improve model performance
  • Early large-scale datasets such as ImageNet and MS-COCO accelerated progress in computer vision
  • YFCC100M is a dataset of 99.2M images and 800K videos
  • OpenImages is a large-scale dataset of 9M images with labeled subsets
  • Web-scraped datasets of image-text pairs have been used to train models for vision-language representation learning
  • Current large-scale 2D image datasets offer scale, diversity, and realism
  • By comparison, 3D datasets lack scale, diversity, and realism
  • Text-to-3D models rely on 2D image-text supervision

Objaverse

  • OBJAVERSE is a 3D dataset for computer vision research
  • Objects are sourced from Sketchfab under Creative Commons licenses
  • Metadata includes name, categories, tags, and description
  • The OBJAVERSE-LVIS subset assigns objects to 1156 LVIS categories
  • Includes 44K animated objects and 63K characters
  • Also includes articulated objects, exteriors, and interiors
  • Visual styles include 3D scans, 3D modeled objects, point clouds, and physically based rendering (PBR)
  • 818K objects, 160K artists, 2.35M tags, 21K WordNet entities
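
As a rough illustration of how the per-object metadata listed above (name, categories, tags, description) might be queried, the sketch below filters a few in-memory records by tag. The records, uids, and the helper names `filter_by_tag` and `tag_histogram` are hypothetical and only illustrate the idea; they are not the dataset's actual schema or API.

```python
from collections import Counter

# Hypothetical records. The field names follow the metadata listed above
# (name, categories, tags, description); the uids and values are invented.
records = [
    {"uid": "a1", "name": "Old boot", "categories": ["fashion"],
     "tags": ["Shoe", "scan"], "description": "Scanned leather boot"},
    {"uid": "b2", "name": "Apple", "categories": ["food"],
     "tags": ["fruit", "scan"], "description": "Photogrammetry apple"},
    {"uid": "c3", "name": "Sneaker", "categories": ["fashion"],
     "tags": ["shoe", "lowpoly"], "description": "Stylised running shoe"},
]


def filter_by_tag(items, tag):
    """Return the records whose tag list contains `tag`, ignoring case."""
    wanted = tag.lower()
    return [r for r in items if wanted in {t.lower() for t in r["tags"]}]


def tag_histogram(items):
    """Count how many records carry each lower-cased tag (once per record)."""
    return Counter(
        tag for r in items for tag in {t.lower() for t in r["tags"]}
    )


shoes = filter_by_tag(records, "shoe")
print([r["uid"] for r in shoes])
print(tag_histogram(records))
```

Tags are free-form artist input, so normalising case before matching is a reasonable first step; a real pipeline would likely need further cleaning.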

Applications

  • 3D generative modeling
  • Instance segmentation with CP3D
  • Open-vocabulary ObjectNav
  • Analyzing robustness in computer vision models

3D generative modeling

  • 3D generative modeling has improved recently
  • GET3D produces high quality 3D objects
  • OBJAVERSE contains diverse and realistic objects
  • 3 categories of objects (Shoe, Bag, Fruit&Veg) were chosen from OBJAVERSE
  • GET3D models trained on OBJAVERSE produce high-quality and diverse 3D meshes
  • Crowdworkers rated the OBJAVERSE-trained model's generations as more diverse 91% of the time
  • The Fruit&Veg category produced the highest-quality output

Instance segmentation with CP3D

  • Simulated data is cheaper to obtain for computer vision
  • Annotated OBJAVERSE objects can be rendered into images to enhance model performance
  • Used segmented data from OBJAVERSE objects as auxiliary labels for training models on the LVIS dataset
  • Recognition is challenging due to the long tail of the object category distribution
  • Introduced 3DCP: an enhancement to the copy-and-paste technique
  • Rendered 5 distinct views of each object and cached them for use throughout training
  • Fine-tuned a pretrained ResNet-50 Mask R-CNN, yielding performance gains
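
The copy-and-paste idea above can be sketched as alpha-compositing a cached object render onto a training image and deriving the instance mask from the render's alpha channel. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `paste_object` is hypothetical, the render is a placeholder array rather than a real rendered view, and the render is assumed to fit entirely inside the background.

```python
import numpy as np


def paste_object(background, rgba, top, left):
    """Alpha-composite an RGBA render onto an RGB background.

    Returns the composited uint8 image and a boolean instance mask for the
    pasted object in background coordinates. Assumes the render fits
    entirely inside the background at the given offset.
    """
    out = background.astype(np.float32)  # astype returns a fresh copy
    h, w = rgba.shape[:2]
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0
    region = out[top:top + h, left:left + w]  # view into `out`
    region[:] = alpha * rgba[..., :3] + (1.0 - alpha) * region
    mask = np.zeros(background.shape[:2], dtype=bool)
    mask[top:top + h, left:left + w] = rgba[..., 3] > 0
    return out.astype(np.uint8), mask


# Placeholder data: a black background and a 2x2 render with one opaque pixel.
background = np.zeros((8, 8, 3), dtype=np.uint8)
render = np.zeros((2, 2, 4), dtype=np.uint8)
render[..., :3] = 200
render[0, 0, 3] = 255

image, mask = paste_object(background, render, top=3, left=4)
```

Because the mask comes for free from the alpha channel, each pasted object yields a pixel-accurate segmentation label without manual annotation.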

Open-vocabulary ObjectNav

  • Introduces an open-vocabulary ObjectNav task
  • The task requires an agent to navigate to a target object given a text description
  • 10K new homes procedurally generated in ProcTHOR
  • Prior ObjectNav tasks focused on training agents to navigate to 20 or so target objects
  • Existing interactive embodied AI simulations include around 2K total objects across around 100 object types
  • The ObjectNav task is scaled up to an open vocabulary covering 36K objects and 1.1K object types
  • Object placement in homes made more natural using OBJAVERSE-LVIS
  • Object sizes are corrected so that objects look natural in a house
  • Objects are preprocessed so that AI2-THOR can load them on the fly
  • The agent observes an egocentric RGB view of the environment
  • The agent is trained with DD-PPO and evaluated on unseen houses
  • Agent achieves 19.9% success rate

Analyzing robustness

  • ImageNet has a bias towards forward-facing, canonical orientations
  • Alcorn et al. studied the impact of this bias and found that computer vision systems are highly susceptible to deviations from canonical poses
  • OBJAVERSE assets were used to design a benchmark for evaluating the robustness of computer vision classification models to orientation shifts
  • 12 images of each object were rendered from random orientations and evaluated with 4 metrics
  • Models were found to be strongly overfit to standard views of objects
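
The summary does not say how the random orientations were drawn. One common choice, shown here purely as an assumption, is to sample rotations uniformly over SO(3) by normalising 4D Gaussian vectors into unit quaternions and converting them to rotation matrices. The function name `random_rotations` is hypothetical.

```python
import numpy as np


def random_rotations(n, seed=0):
    """Sample n rotation matrices uniformly over SO(3).

    A normalised 4D standard Gaussian vector is a uniformly distributed
    unit quaternion, which is then converted to a 3x3 rotation matrix.
    """
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    rows = [
        1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w),
        2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w),
        2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y),
    ]
    return np.stack(rows, axis=1).reshape(n, 3, 3)


# 12 orientations per object, matching the number of renders described above.
orientations = random_rotations(12)
```

Each matrix could then be applied to an object's vertices (or to the camera pose) before rendering; naive sampling of Euler angles would instead over-represent some orientations.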

Conclusion

  • OBJAVERSE is a 3D asset library with 818K high-quality models
  • 4 experimental studies show how OBJAVERSE can be used for various applications
  • Human subjects data collection was approved by an Institutional Review Board
  • 30 testing target categories are listed in Table 6
  • The agent is trained for 18 million simulation steps
  • A CLIP ViT-B/32 model is used to estimate the categorical coverage of the objects in OBJAVERSE
  • OBJAVERSE spans 18 high-level categories, with objects split evenly across them
  • Each object has a 3D model, thumbnail image, name, description, tags, category, and stats
  • Examples of applications include generative 3D models, instance segmentation, Open Vocabulary Object Navigation, and rotational robustness of vision models
  • Human annotators were used to provide the category labels and rate the relative diversity of 3D objects
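
The CLIP-based coverage estimate mentioned above can be approximated by zero-shot labelling: each object's image embedding is assigned to the category whose text embedding has the highest cosine similarity. The sketch below assumes the embeddings have already been computed and uses small made-up vectors in their place; `zero_shot_labels` and `coverage` are hypothetical helper names, and the exact procedure used in the paper may differ.

```python
import numpy as np


def zero_shot_labels(image_emb, text_emb):
    """Assign each image embedding to its most similar text embedding.

    Both inputs are 2D arrays with one embedding per row. Rows are
    L2-normalised so the dot product equals cosine similarity.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (image_emb @ text_emb.T).argmax(axis=1)


def coverage(labels, num_categories):
    """Fraction of categories that received at least one object."""
    return np.unique(labels).size / num_categories


# Made-up 3D embeddings standing in for real CLIP features.
text_emb = np.eye(3)  # three hypothetical category prompts
image_emb = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [1.0, 0.0, 0.2],
])

labels = zero_shot_labels(image_emb, text_emb)
share = coverage(labels, num_categories=3)
```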