Hi there 👋

Welcome to arxiv-summary, your one-stop destination for GPT-3 generated summaries of the latest machine learning and AI papers on arxiv.org. The goal is to make these papers more understandable and human-parsable, by providing clear and concise bullet points. I hope you find this site useful and come back often. Any feedback via the social buttons below is really appreciated. Thank you for visiting!

Learning and Verification of Task Structure in Instructional Videos

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos
- Pre-train VideoTaskformer by predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video
- Learn step representations globally, leveraging video of the entire surrounding task as context
- Introduce two new benchmarks for detecting mistakes in instructional videos
- Introduce a long-term forecasting benchmark
- Outperform previous baselines on these tasks
- Evaluate VideoTaskformer on 3 existing benchmarks and achieve new state-of-the-art performance

Paper Content

Introduction
- Trying to build a bookshelf using a YouTube video
- Need to repeatedly hit pause on the video
- Interactive assistant can guide user through task
- Composite task involves multiple fine-grained activities
- Ideal assistant has high-level and low-level understanding
- Prior work models step representations from single short video clips
- VideoTaskformer learns step representations for masked video steps
- Mistake detection task and dataset for verifying video representations
- VideoTaskformer learns step representations for whole video
- Network learns to predict labels for masked steps
- Representations improve performance on downstream tasks
- VideoTaskformer capable of detecting mistake types

Related works
- Large-scale narrated instructional video datasets enable learning joint video-language representations and task structure from videos
- Assembly-101 dataset and Ikea ASM provide videos of people assembling and disassembling toys and furniture
- Existing benchmarks for evaluating representations learned on instructional video datasets include step localization, step classification, procedural activity recognition, and step forecasting
- Recent works attempt to learn procedures from instructional videos
- Video action recognition models have improved over the last few years
- Works learn representations for longer video clips containing semantically more complex actions

Learning task structure through masked modeling of steps
- Goal is to learn task-aware step representations from instructional videos
- Developed VideoTaskformer, a video model pre-trained using a BERT style masked modeling loss
- Masking is done at the step level
- Framework consists of two steps: pre-training and fine-tuning
- Pre-training is done on weakly labeled data
- During fine-tuning, a subset of the parameters is adjusted using labeled data from the downstream tasks
- Pre-training approach uses masked step modeling loss
- Step modeling extends masked language modeling techniques used in BERT and VideoBERT
- Evaluated on 6 downstream tasks
- Step representations are learned from entire video with all steps as input
- Step classification and distribution matching are used as training objectives

Downstream tasks
- Mistake step detection: Identify which step in a video is incorrect
- Mistake ordering detection: Verify if the steps in a video are in the correct temporal order
- Short-term forecasting: Predict the step label given the previous n segments
- Long-term step forecasting: Predict the step labels for the next 5 steps given a single step
- Procedural activity recognition: Recognize the procedural activity (i....
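
The step-level masking described in this summary can be illustrated with a small sketch. This is not the paper's implementation: the function name `mask_steps`, the `MASK` placeholder, the masking probability, and the toy clip and label strings are all assumptions made for illustration. It only shows the data-preparation idea, where whole steps are hidden and their weak text labels become the prediction targets while the unmasked steps remain as context.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked step, not from the paper


def mask_steps(clips, labels, mask_prob=0.3, seed=0):
    """Randomly hide whole steps and collect their labels as targets.

    clips  : per-step inputs (stand-ins for video clip features)
    labels : weak text label for each step
    Returns (masked_clips, targets) where targets maps index -> label.
    """
    if len(clips) != len(labels):
        raise ValueError("clips and labels must have the same length")
    rng = random.Random(seed)
    masked_clips = list(clips)
    targets = {}
    for i, label in enumerate(labels):
        if rng.random() < mask_prob:
            masked_clips[i] = MASK   # the model no longer sees this step
            targets[i] = label       # ...and would be trained to predict its label
    return masked_clips, targets


# Toy example: five steps of a made-up task.
clips = ["clip_0", "clip_1", "clip_2", "clip_3", "clip_4"]
labels = ["measure board", "cut board", "sand edges", "attach shelf", "paint"]
masked_clips, targets = mask_steps(clips, labels, mask_prob=0.5, seed=1)
```

Which steps end up masked depends on the seed; a model trained this way would receive `masked_clips` in full and be scored only on the entries listed in `targets`.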

DreamBooth3D: Subject-Driven Text-to-3D Generation

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Presents DreamBooth3D, an approach to personalize 3D models from 3-6 images
- Combines text-to-image and text-to-3D models
- Naively combining these methods fails to yield satisfactory 3D assets
- Uses 3-stage optimization strategy to leverage 3D consistency and personalization
- Produces high-quality, subject-specific 3D assets with text-driven modifications

Paper Content

Introduction
- Text-to-Image (T2I) generative models can create and edit visual content
- Recent works have demonstrated high-quality Text-to-3D generation
- Applications in graphics, VR, movies, and gaming
- Text prompts allow for some degree of control, but difficult to precisely control identity, geometry, and appearance
- Recent success in personalizing T2I models for subject-specific 2D image generation
- DreamBooth3D proposed for subject-driven Text-to-3D generation
- Given a few casual image captures of a subject, generate subject-specific 3D assets
- Draws inspiration from recent works
- 3-stage optimization framework proposed
- Synergistic optimization of NeRF and T2I models
- Results indicate realistic 3D assets with high likeness to given subject

Related works
- Text-to-Image Generation uses GANs, autoregressive models, and masked image models to generate images
- Denoising diffusion models can generate high-quality images and be conditioned on various inputs
- 3D Generation uses 3D reconstruction from images or generative models from image collections
- Text-to-3D methods generate 3D assets from text prompts
- Subject-driven Generation enables users to personalize image generation for specific subjects
- Textual Inversion optimizes for a new “word” in the embedding space of a pre-trained text-to-image model

Approach
- Input consists of k casual subject captures with n pixels and a text prompt
- Aim is to generate a 3D asset that captures the identity and is faithful to the text prompt
- 3D assets are optimized in the form of Neural Radiance Fields
- Problem is more challenging than typical 3D reconstruction setting
- Technique is based on advances in Text-to-3D optimization and personalization

Preliminaries
- DreamBooth is a method to personalize a text-to-image (T2I) diffusion model....

The Quantization Model of Neural Scaling

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Proposed Quantization Model explains power law dropoff of loss with model and data size
- Quantization Hypothesis states that learned network capabilities are quantized into discrete chunks
- Power law in use frequencies explains observed power law scaling of loss
- Validated prediction on toy datasets and studied scaling curves for large language models

Paper Content

Introduction
- Larger neural networks trained on more data perform better than smaller neural networks trained on less data
- Mean test loss decreases as a power law in both the number of network parameters and the number of training samples
- Larger models often have emergent abilities, i....
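
To make the power-law claim in this summary concrete, here is a minimal sketch of how a scaling exponent can be estimated from (size, loss) pairs. It assumes the simple form loss = a * size^(-alpha), which becomes a straight line in log-log space. The function name `fit_power_law` and the synthetic data are illustrative assumptions; none of the numbers come from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha) for the assumed model loss = a * size ** (-alpha).
    """
    log_x = [math.log(x) for x in sizes]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope and intercept in log-log space.
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known power law (not measurements).
param_counts = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [5.0 * n ** -0.1 for n in param_counts]
a, alpha = fit_power_law(param_counts, synthetic_losses)
```

On noise-free synthetic data the fit recovers the generating constants up to floating-point error; real loss curves typically follow a power law only over a limited range, so the fitted exponent depends on which points are included.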

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Detection algorithms have been proposed to identify AI-generated text
- An 11B-parameter paraphrase generation model (DIPPER) was trained to paraphrase paragraphs
- Several detectors were tested and found to be evaded by DIPPER
- A defense was introduced to increase robustness of AI-generated text detection to paraphrase attacks
- Code, model and data will be open sourced for future research

Paper Content

Introduction
- LLMs can write coherent and relevant longform text
- Fears of malicious applications such as fake news and homework answers
- Algorithms proposed to detect machine-generated text
- Unclear how robust these algorithms are to paraphrase attacks
- Demonstrate vulnerability of existing detectors to paraphrase attacks
- Train 11B-parameter paraphrase generation model called DIPPER
- DIPPER can paraphrase paragraph-length texts
- DIPPER has two features to help evade AI-generated text detectors
- Attack several recently proposed AI-generated text detection algorithms
- Experiments show all detection algorithms misclassify AI-generated texts
- Propose to use retrieval methods to detect AI-generated text
- 97....
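
A toy sketch of the retrieval idea named in this summary, under the assumption that whoever serves the language model keeps a store of its past outputs and flags a candidate text when a sufficiently similar stored output exists. The class `GenerationStore`, the Jaccard token-overlap score, and the 0.5 threshold are hypothetical stand-ins chosen to keep the example self-contained. They are not the paper's method, and a defense against real paraphrasing would need a semantic similarity measure, since paraphrases deliberately change surface wording.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a crude stand-in for a real retriever)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap in [0, 1]; returns 0.0 if either text is empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class GenerationStore:
    """Hypothetical store of previously generated texts."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cutoff, not a value from the paper
        self._generations = []

    def add(self, text):
        self._generations.append(text)

    def is_ai_generated(self, candidate):
        # Flag the candidate if any stored generation is similar enough.
        return any(
            jaccard(candidate, stored) >= self.threshold
            for stored in self._generations
        )


store = GenerationStore(threshold=0.5)
store.add("the cat sat on the mat near the door")
near_copy = store.is_ai_generated("the cat sat on the mat near the window")
unrelated = store.is_ai_generated("quantum computers factor large integers")
```

Here `near_copy` is flagged because the two sentences share six of eight distinct tokens, while `unrelated` shares none with the stored text.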

3D-POP -- An automated annotation approach to facilitate markerless 2D-3D tracking of freely moving birds with marker-based motion capture

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Recent advances in machine learning and computer vision are revolutionizing the field of animal behavior.
- Large datasets of annotated images of animals for markerless pose tracking are still scarce.
- A method is proposed that uses a motion capture system to obtain a large amount of annotated data on animal movement and posture....

Reinforcement Learning with Exogenous States and Rewards

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Exogenous state variables and rewards can slow reinforcement learning.
- Reward function decomposes additively into endogenous and exogenous components.
- Decomposition of state space into exogenous and endogenous state spaces must be discovered.
- Algorithms introduced to discover exogenous and endogenous subspaces of state space.
- Experiments show that these methods produce speedups in reinforcement learning....

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Proposed method for editing NeRF scenes with text-instructions
- Uses an image-conditioned diffusion model (InstructPix2Pix)
- Iteratively edits input images while optimizing underlying scene
- Results in optimized 3D scene that respects edit instruction
- Able to edit large-scale, real-world scenes
- More realistic, targeted edits than prior work

Paper Content

Introduction
- Capturing a realistic digital representation of a real-world 3D scene is easy
- Captured 3D content is replacing traditional processes of manually-generated assets
- Tools for editing 3D assets are underdeveloped
- Text instructions can be used to edit 3D scenes
- 2D diffusion model is used to extract shape and appearance priors

Related work
- NeRFs are a popular approach for generating photorealistic novel views of a scene
- Editing NeRFs is a challenge
- Physics-based inductive biases can be used to enable changes in materials or scene lighting
- Bounding boxes can be used to allow easy compositing of different objects and spatial manipulations
- ClimateNeRF extracts rough geometry from a NeRF and uses physical simulation to apply weather changes
- Most physically-based edits revolve around changing physical properties of the reconstructed scene
- Recent works have explored artistic 3D stylization of NeRFs
- EditNeRF explores editing NeRFs by manipulating latent codes learned from object categories
- ClipNeRF and NeRF-Art extend this line of work by encouraging similarity between CLIP embeddings of the scene and a short text prompt
- Recent progress in pre-trained large-scale models has enabled rapid progress in the domain of generating 3D content from scratch
- Instruction-based 2D image-conditioned diffusion model enables purely language-based interface for 3D editing

Method
- Takes as input a reconstructed NeRF scene, source data, and a natural-language editing instruction
- Outputs an edited version of the NeRF and input images using a diffusion model and NeRF training

Background
- Neural radiance fields (NeRFs) are a way to represent and render a 3D scene....

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Recent works have shown promising results on novel view synthesis from single or few images.
- Models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing.
- Proposed framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models.
- FeatureNeRF maps 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks....

LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Finding correspondences between images of the same object is important for understanding its geometry.
- Recent years have seen progress in this area due to deep learning based local image features and learnable matchers.
- Learnable matchers often underperform when there are only small regions of co-visibility between image pairs.
- We propose a Learnable Feature Matching framework that uses models based on graph neural networks and integrates 3D signals to boost correspondence estimation....

Can we trust the evaluation on ChatGPT?

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- ChatGPT is a large language model with mass adoption
- Evaluating ChatGPT’s performance is challenging due to its closed nature and continuous updates
- Data contamination is an issue when evaluating ChatGPT
- Stance detection is used as a case study to highlight the issue of data contamination
- Fair model evaluation is a challenge in the age of closed and continuously trained models

Paper Content

Introduction

Methods
- Zhang et al....