Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Massive data corpora have enabled progress in AI
3D data is a notable omission in large-scale datasets
Objaverse 1.0 is a large dataset of 3D models with descriptive captions, tags, and animations
Objaverse has potential applications in training generative 3D models, improving tail category segmentation, training open-vocabulary object-navigation models, and creating a new benchmark for robustness analysis of vision models

Massive datasets have enabled and driven rapid progress in AI
Language corpora on the web led to large language models
Paired image and text datasets led to vision-and-language pretrained models
YouTube video datasets led to video capable models
Massive multimodal datasets led to models like CLIP and StableDiffusion
Datasets moved from manually curated to harnessing the power of the web
Datasets used to train deep learning models in other areas of research are not comparable
3D assets used in training generative 3D models are maximally on the order of thousands
OBJAVERSE 1.0 is a large scale corpus of high-quality, richly annotated, 3D objects
OBJAVERSE contains over 800K 3D assets designed by over 100K artists
OBJAVERSE can support 3D generative modeling
OBJAVERSE can improve the performance of long tail instance segmentation models
OBJAVERSE can be used to build a benchmark for evaluating the robustness of state-of-the-art visual classification models
OBJAVERSE can be used to train embodied AI agents
OBJAVERSE can enable fast and exciting progress in 2D and 3D computer vision applications

Scaling the size and scope of training datasets has been shown to improve model performance
Early large scale datasets such as Imagenet and MS-COCO have accelerated progress in computer vision
YFCC100M is a dataset of 99.2M images and 800K videos
OpenImages is a large scale dataset of 9M images with labeled subsets
Web-scraped datasets of image-text pairs have been used to train models for vision-language representation learning
Current large-scale 2D image datasets offer scale, diversity, and realism
3D datasets lack in scale, diversity, and realism
Text-to-3D models rely on 2D image-text supervision

Simulated data is cheaper to obtain for computer vision
Annotated OBJAVERSE objects can be rendered into images to enhance model performance
Used segmented data from OBJAVERSE objects as auxiliary labels for training models on the LVIS dataset
Recognition is challenging due to the long tail of the object category distribution
Introduced 3DCP: an enhancement to the copy-and-paste technique
Rendered 5 distinct views of each object and cached them for use throughout training
Finetuned the pretrained ResNet-50 Mask-RCNN, yielding performance gains

Introduce open-vocabulary Object-Nav task
Task requires agent to navigate to target object based on text description
10K new homes procedurally generated in ProcTHOR
ObjectNav tasks have focused on training agents to navigate to 20 or so target objects
Existing interactive embodied AI simulations include around 2K total objects across around 100 object types
ObjectNav task massively scaled to open-vocabulary, 36K objects, and 1.1K object types
Object placement in homes made more natural using OBJAVERSE-LVIS
Object size correction to make objects look natural in a house
Preprocessing for AI2-THOR to load objects on the fly
Agent observes RGB egocentric view of environment
Agent trained with DD-PPO and evaluated on unseen houses
Agent achieves 19.9% success rate

ImageNet has a bias towards forward-facing, canonical orientations.
Alcorn et al. studied the impact of this bias and found that computer vision systems are highly susceptible to deviations from canonical poses.
OBJAVERSE assets were used to design a benchmark for evaluating the robustness of computer vision classification models to orientation shifts.
12 images of each object were rendered from random orientations and evaluated with 4 metrics.
Models were found to be strongly overfit to standard views of objects.

OBJAVERSE is a 3D asset library with 818K high-quality models
4 experimental studies show how OBJAVERSE can be used for various applications
Human subjects data was collected and approved by an Institutional Review Board
30 testing target categories are listed in Table 6
18 million simulation steps are used to train the agent
CLIP ViT-B/32 model is used to estimate the categorical coverage of the objects in OBJAVERSE
OBJAVERSE has 18 high-level categories with evenly split relative share
Each object has a 3D model, thumbnail image, name, description, tags, category, and stats
Examples of applications include generative 3D models, instance segmentation, Open Vocabulary Object Navigation, and rotational robustness of vision models
Human annotators were used to provide the category labels and rate the relative diversity of 3D objects