Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Massive data corpora have enabled progress in AI
- 3D data is a notable omission in large-scale datasets
- Objaverse 1.0 is a large dataset of 3D models with descriptive captions, tags, and animations
- Objaverse has potential applications in training generative 3D models, improving tail category segmentation, training open-vocabulary object-navigation models, and creating a new benchmark for robustness analysis of vision models
Paper Content
Introduction
- Massive datasets have enabled and driven rapid progress in AI
- Language corpora on the web led to large language models
- Paired image and text datasets led to vision-and-language pretrained models
- YouTube video datasets led to video capable models
- Massive multimodal datasets led to models like CLIP and StableDiffusion
- Datasets moved from manually curated to harnessing the power of the web
- Datasets used to train deep learning models in other areas of research are not comparable
- 3D assets used in training generative 3D models are maximally on the order of thousands
- OBJAVERSE 1.0 is a large scale corpus of high-quality, richly annotated, 3D objects
- OBJAVERSE contains over 800K 3D assets designed by over 100K artists
- OBJAVERSE can support 3D generative modeling
- OBJAVERSE can improve the performance of long tail instance segmentation models
- OBJAVERSE can be used to build a benchmark for evaluating the robustness of state-of-the-art visual classification models
- OBJAVERSE can be used to train embodied AI agents
- OBJAVERSE can enable fast and exciting progress in 2D and 3D computer vision applications
Related work
- Scaling the size and scope of training datasets has been shown to improve model performance
- Early large scale datasets such as Imagenet and MS-COCO have accelerated progress in computer vision
- YFCC100M is a dataset of 99.2M images and 800K videos
- OpenImages is a large scale dataset of 9M images with labeled subsets
- Web-scraped datasets of image-text pairs have been used to train models for vision-language representation learning
- Current large-scale 2D image datasets offer scale, diversity, and realism
- 3D datasets lack in scale, diversity, and realism
- Text-to-3D models rely on 2D image-text supervision
Objaverse
- OBJAVERSE is a 3D dataset for computer vision research
- Objects are sourced from Sketchfab, with Creative Commons license
- Metadata includes name, categories, tags, and description
- OBJAVERSE-LVIS subset has objects assigned to 1156 LVIS categories
- 44K animated objects and 63K characters
- Articulated objects, exteriors, and interiors
- Visual styles include 3D scans, 3D modeled objects, point clouds, and PBR
- 818K objects, 160K artists, 2.35M tags, 21K WordNet entities
Applications
- 3D generative modeling
- Instance segmentation with CP3D
- Open-vocabulary ObjectNav
- Analyzing robustness in computer vision models
3d generative modeling
- 3D generative modeling has improved recently
- GET3D produces high quality 3D objects
- OBJAVERSE contains diverse and realistic objects
- 3 categories of objects (Shoe, Bag, Fruit&Veg) were chosen from OBJAVERSE
- GET3D models trained on OBJAVERSE produce high-quality and diverse 3D meshes
- Crowdworkers rated OBJAVERSE-trained model as more diverse 91% of the time
- Fruits and vegetables produced highest quality output
Instance segmentation with cp3d
- Simulated data is cheaper to obtain for computer vision
- Annotated OBJAVERSE objects can be rendered into images to enhance model performance
- Used segmented data from OBJAVERSE objects as auxiliary labels for training models on the LVIS dataset
- Recognition is challenging due to the long tail of the object category distribution
- Introduced 3DCP: an enhancement to the copy-and-paste technique
- Rendered 5 distinct views of each object and cached them for use throughout training
- Finetuned the pretrained ResNet-50 Mask-RCNN, yielding performance gains
Open-vocabulary objectnav
- Introduce open-vocabulary Object-Nav task
- Task requires agent to navigate to target object based on text description
- 10K new homes procedurally generated in ProcTHOR
- ObjectNav tasks have focused on training agents to navigate to 20 or so target objects
- Existing interactive embodied AI simulations include around 2K total objects across around 100 object types
- ObjectNav task massively scaled to open-vocabulary, 36K objects, and 1.1K object types
- Object placement in homes made more natural using OBJAVERSE-LVIS
- Object size correction to make objects look natural in a house
- Preprocessing for AI2-THOR to load objects on the fly
- Agent observes RGB egocentric view of environment
- Agent trained with DD-PPO and evaluated on unseen houses
- Agent achieves 19.9% success rate
Analyzing robustness
- ImageNet has a bias towards forward-facing, canonical orientations.
- Alcorn et al. studied the impact of this bias and found that computer vision systems are highly susceptible to deviations from canonical poses.
- OBJAVERSE assets were used to design a benchmark for evaluating the robustness of computer vision classification models to orientation shifts.
- 12 images of each object were rendered from random orientations and evaluated with 4 metrics.
- Models were found to be strongly overfit to standard views of objects.
Conclusion
- OBJAVERSE is a 3D asset library with 818K high-quality models
- 4 experimental studies show how OBJAVERSE can be used for various applications
- Human subjects data was collected and approved by an Institutional Review Board
- 30 testing target categories are listed in Table 6
- 18 million simulation steps are used to train the agent
- CLIP ViT-B/32 model is used to estimate the categorical coverage of the objects in OBJAVERSE
- OBJAVERSE has 18 high-level categories with evenly split relative share
- Each object has a 3D model, thumbnail image, name, description, tags, category, and stats
- Examples of applications include generative 3D models, instance segmentation, Open Vocabulary Object Navigation, and rotational robustness of vision models
- Human annotators were used to provide the category labels and rate the relative diversity of 3D objects