arxiv-summary: AI-summarized AI papers

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Abstract: Transformer-based language models (LMs) create hidden representations of their inputs at every layer. The paper proposes casting a hidden representation from an intermediate layer as a final-layer representation using a linear transformation, bypassing the transformer computation in between. This produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers. The method allows “peeking” into early-layer representations of GPT-2 and BERT....
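The idea of mapping an intermediate hidden representation to the final layer with a linear transformation can be illustrated with a toy least-squares fit. The sketch below is not the paper's implementation: it uses synthetic vectors in place of real GPT-2 or BERT hidden states, and the helper names `fit_linear_map` and `apply_map` are hypothetical.

```python
import random

def apply_map(A, x):
    """Compute the matrix-vector product A @ x for a list-of-rows matrix A."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def fit_linear_map(X, Y, ridge=1e-8):
    """Least-squares fit of A (d_out x d_in) so that A @ x approximates y."""
    d_in, d_out = len(X[0]), len(Y[0])
    # Normal equations: (X^T X + ridge * I) W = X^T Y, with W of shape d_in x d_out.
    G = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
          for j in range(d_in)] for i in range(d_in)]
    B = [[sum(x[i] * y[k] for x, y in zip(X, Y)) for k in range(d_out)]
         for i in range(d_in)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix [G | B].
    M = [g + b for g, b in zip(G, B)]
    for c in range(d_in):
        p = max(range(c, d_in), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(d_in):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    W = [row[d_in:] for row in M]
    # Return A = W^T so that apply_map(A, x) computes A @ x.
    return [[W[i][k] for i in range(d_in)] for k in range(d_out)]

# Synthetic stand-ins (assumption): "hidden" plays the role of an early-layer
# representation and "final" the role of the last-layer representation.
random.seed(0)
true_A = [[1.0, 2.0, 0.0], [0.0, -1.0, 0.5], [3.0, 0.0, 1.0]]
hidden = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
final = [apply_map(true_A, h) for h in hidden]

A = fit_linear_map(hidden, final)
approx = apply_map(A, hidden[0])  # shortcut estimate of final[0]
```

In this toy setting the relationship is exactly linear, so the fitted map recovers it almost perfectly; with real model activations a linear map would only approximate the skipped layers.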

NeRFMeshing: Distilling Neural Radiance Fields into Geometrically-Accurate 3D Meshes

Abstract: Neural Radiance Fields (NeRFs) enable novel view synthesis. NeRFs represent 3D scenes for computing image rendering. 3D meshes are the main scene representation supported by most computer graphics and simulation pipelines. Obtaining 3D meshes from NeRFs is an open challenge. The proposed architecture enables easy 3D surface reconstruction from any NeRF-driven approach. The final 3D mesh is physically accurate and can be rendered in real time.
Paper Content
Introduction: Accurate 3D scene and object reconstruction is important in robotics, photogrammetry, and AR/VR. Novel view synthesis (NVS) has made advances in recent years. A neural radiance field (NeRF) is a 3D representation that emits radiance. Related work has focused on improving NeRF in terms of image quality, robustness, training speed and rendering speed. It is unclear how to obtain accurate 3D meshes from radiance fields. NeRFs cannot be integrated with most computer graphics pipelines. We introduce NeRFMeshing, an end-to-end pipeline for extracting accurate meshes from trained NeRF-based networks. Our method produces meshes with neural colors and accurate geometry that can be rendered in real time. Our method can be used with any NeRF, so new advances can be incorporated. Our model preserves the high fidelity of neural radiance fields and can be used for real-time novel view synthesis.
Related work: The Neural Radiance Field (NeRF) formulation was introduced in [13]. Subsequent works have addressed limitations of the original approach. The original formulation lacks accurate underlying geometry. Our work relies on NeRF networks trained from images. An alternative to radiance fields is to learn a Signed Distance Function (SDF). Our method does not rely on a fixed grid template during training. We exploit the adaptive power of NeRFs to robustly represent 3D scenes. Recent approaches advance the speed and geometric accuracy of NeRFs.
Method: An overview of NeRF is presented in Fig....

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

Abstract: Language-guided image generation has been successful using diffusion models. Texts can be too vague to accurately describe specific subjects. UMM-Diffusion takes joint texts and images as input and generates customized images. Input images are projected to a pseudo word embedding and combined with the text to guide image generation. A sampling technique for diffusion models is used to eliminate irrelevant parts of the input images....

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Abstract: Animating virtual avatars with co-speech gestures can be used in human-machine interaction. Existing methods rely on GANs, which have issues with mode collapse and unstable training. A novel diffusion-based framework, DiffGesture, is proposed to capture audio-to-gesture associations and preserve temporal coherence.
Paper Content
Introduction: Making co-speech gestures is an innate human behavior that helps people express and understand thoughts. Animating virtual avatars to gesticulate co-speech movements is important in embodied AI. Recent research focuses on audio-driven co-speech gesture generation. Early attempts treated this task as a searching-and-connecting problem. Deep neural networks have been used to learn the mapping from speech audio to human skeletons. GAN-based methods have been used to guarantee realism. Diffusion probabilistic models provide a new perspective for realistic generation. It is difficult to adapt existing diffusion models for co-speech gesture generation. The proposed DiffGesture framework captures audio-gesture associations while maintaining temporal coherence. A Diffusion Audio-Gesture Transformer models long-term audio-gesture temporal dependency. A Diffusion Gesture Stabilizer eliminates temporal inconsistency. Results outperform state-of-the-art methods.
Related work: Co-speech gesture generation is important for various applications....

GLEN: General-Purpose Event Detection for Thousands of Types

Abstract: Development of event extraction systems is hindered by the lack of large-scale datasets. The GLEN dataset was created to make event extraction systems more accessible. GLEN covers 3,465 different event types, 20x larger than current datasets. GLEN was created using DWD Overlay and PropBank annotation. A new multi-stage event detection model is proposed. The model exhibits a 10% F1 gain compared to classification baselines and definition-based models. Label noise is still the largest challenge for improving performance.
Paper Content
Introduction: ACE 2005 is the current standard benchmark for event extraction, but it has limited ontology and domain....

Secret-Keeping in Question Answering

Abstract: Existing research focuses on providing an answer whenever a system can. In some cases, however, a system should not answer a question, in order to protect sensitive users or information. Models can expose sensitive information under interrogation. The research seeks to determine whether it is possible to teach a system to keep a fact secret. A proof-of-concept architecture is designed and implemented. The evaluation determines that, while this is possible, there are directions for future research.
Paper Content
Introduction: QA systems seek to provide a direct answer to an information need posed by a user in natural language. Input questions can be information seeking or probing. Outputs of a QA system can be extractive, generative, multi-choice or categorical. Current QA evaluation focuses on measuring the ‘accuracy’ of returned answers. Most evaluations follow the gold-standard pattern with variations to the source of questions and context. Work exploring answerability measures whether or not a QA system is capable of answering a given question. The research question is how to implement a secret-keeping system capable of protecting secret information from disclosure. System design, experiments and results are outlined in sections 2, 3 and 4. Ethics, related work and future work are outlined in sections 5, 6 and 7.
Contributions....

Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential

Abstract: ChatGPT is a large language model with human-like expression and reasoning abilities. This study investigates the feasibility of using ChatGPT to translate radiology reports into plain language. Radiology reports from 62 low-dose chest CT lung cancer screening scans and 76 brain MRI metastases screening scans were collected. ChatGPT can successfully translate radiology reports with an average score of 4....

ART: Automatic multi-step reasoning and tool-use for large language models

Abstract: LLMs can perform complex reasoning in few- and zero-shot settings. LLMs can generate intermediate chain-of-thought reasoning steps. LLMs can use external tools to support computation. Prior work requires hand-crafting task-specific demonstrations. ART uses frozen LLMs to automatically generate intermediate reasoning steps. ART selects demonstrations from a task library. ART integrates external tool output before resuming generation. ART improves performance on unseen tasks in the BigBench and MMLU benchmarks. ART is extensible and can be improved with minimal human intervention.
Paper Content
Introduction: In-context learning allows LLMs to quickly adapt to new tasks with natural language instructions and a few demonstrations. LLMs can be used without annotating large datasets or hosting the LLM. There are limitations around multi-step reasoning. Recent work proposes prompting LLMs to mimic a chain of thought or providing them with access to tools. Existing methods for chained reasoning with tool use are difficult to extend to new tasks and tools. ART is a framework that automatically generates decompositions for instances of new tasks and selects and uses the most appropriate available tools. ART retrieves demonstrations of related tasks from a task library to enable few-shot decomposition and tool use. ART provides the LLM with demonstrations of how to decompose instances of several related tasks and how to select and use tools. ART matches or outperforms automatically generated CoT reasoning chains on tasks. Tool use improves performance on test tasks. ART improves over direct few-shot prompting. ART enables human intervention and improvement of the reasoning process. ART with additional human feedback surpasses the best-known results for GPT-3.
Related work: Finetuning LLMs on public NLP datasets is an effective technique for cross-task generalization. Aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback further improves in-context learning performance. Finetuning on an aggregated mixture of tasks while scaling models to 540B parameters achieves state-of-the-art in-context learning performance. Chain-of-thought prompting is a popular gradient-free technique that encourages LLMs to generate intermediate reasoning steps. LLMs can generate CoT-style multi-step reasoning in a zero-shot manner. LLMs can automatically generate CoT-style prompts. ART uses API access to InstructGPT and Codex to leverage their emergent in-context learning abilities. ART introduces a common language that enables cross-task demonstrations and flexible and extensible tool use.
Overview: ART is presented with a new task description and input instance. ART retrieves similar tasks from a task library. The task library is written in a specific format defined by a custom parsing expression grammar. The grammar decomposes each task instance into a sequence of sub-steps. Sub-steps contain symbols corresponding to tools in a tool library. The LLM writes its own program at generation time. ART pauses generation whenever a tool call is encountered. Humans can add new decomposition demonstrations or tools to improve performance.
Task library: The authors constructed a library of programs for a small seed set of tasks from Big-Bench. They identified five skills useful across more than half of the tasks in Big-Bench. They grouped tasks into clusters such as Arithmetic and Algebra. They wrote programs for a few instances of each task, including calls to external tools. They defined a query language to represent decomposed reasoning steps and incorporate function calls to external tools.
Tool library: Whenever a sub-task query name matches a tool name, generation is stopped and resumed after the tool is called and its output is incorporated. The tool library is seeded with tools that have demonstrations in the task library. Search is done using SerpAPI. The Codex model is used for code generation. Python code is run in a virtual environment with pre-installed packages.
Human feedback: A pilot use of task-specific feedback is shown in Table 6. Five random instances of model-generated programs that resulted in errors were edited. Corrections include sub-steps, adding missing sub-steps, and defining new tools. The comparison is against human feedback applied to CoT-style reasoning. Corrections include 35% of tokens in the baseline and 15....
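The pause-and-resume behaviour described for the tool library can be sketched as a small loop. This is a minimal illustration under stated assumptions, not ART's actual code: the `Q<n>: [name] argument` line format, the `arithmetic` tool, and the scripted stand-in for the LLM are all hypothetical.

```python
import re

# Hypothetical tool library: the tool name and behaviour are illustrative only.
def tool_arithmetic(arg):
    """Evaluate 'a op b' for op in +, -, * and return the result as text."""
    a, op, b = arg.split()
    a, b = float(a), float(b)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])

TOOLS = {"arithmetic": tool_arithmetic}

# Assumed sub-step format: "Q<n>: [<query name>] <argument>".
STEP = re.compile(r"^Q(\d+): \[(\w+)\] (.*)$")

def run_with_tools(next_line, tools, max_steps=10):
    """Ask next_line(transcript) for one line at a time. When the query name
    of a line matches a tool name, pause, call the tool, and append its
    output to the transcript before asking for the next line."""
    transcript = []
    for _ in range(max_steps):
        line = next_line(transcript)
        if line is None:  # the generator has nothing more to say
            break
        transcript.append(line)
        match = STEP.match(line)
        if match and match.group(2) in tools:
            output = tools[match.group(2)](match.group(3))
            transcript.append(f"#{match.group(1)}: {output}")
    return transcript

def scripted_model(transcript):
    """Stand-in for an LLM: replays a fixed two-step program."""
    script = ["Q1: [arithmetic] 6 * 7", "Q2: [answer] The result is in #1"]
    asked = sum(1 for t in transcript if t.startswith("Q"))
    return script[asked] if asked < len(script) else None

transcript = run_with_tools(scripted_model, TOOLS)
```

Here the first step matches the `arithmetic` tool, so its output is spliced in as `#1` before the second step is requested; the second step's query name matches no tool and is left as plain generated text.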

UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation

Abstract: LLMs are popular for their impressive abilities, but require fine-tuning or prompt engineering to generalize. UPRISE is a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. UPRISE is universal in a cross-task and cross-model scenario. UPRISE mitigates the hallucination problem in experiments with ChatGPT.
Paper Content
Introduction: Large Language Models (LLMs) have shown impressive capabilities across a range of tasks....
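Retrieving a prompt for a zero-shot input can be pictured as ranking a pool of candidate prompts by similarity to the input and prepending the best match. The sketch below swaps in plain bag-of-words cosine similarity for the trained retriever the summary describes, so it only illustrates the retrieve-then-prepend flow; the prompt pool and the function names are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase whitespace-token counts (not a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_prompts(task_input, prompt_pool, k=1):
    """Return the k prompts from the pool most similar to the task input."""
    query = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical prompt pool covering three different task types.
pool = [
    "Question: What is the capital of France? Answer: Paris",
    "Review: The movie was wonderful. Sentiment: positive",
    "Premise: A man sleeps. Hypothesis: A man is awake. Label: contradiction",
]
query = "Review: The food was terrible. Sentiment:"
best = retrieve_prompts(query, pool, k=1)

# The retrieved prompt is prepended to the input before it is sent to a frozen LLM.
llm_input = best[0] + "\n" + query
```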

Rotation-Invariant Transformer for Point Cloud Matching

Abstract: Handcrafted descriptors are rotation invariant, but deep matchers are not. Deep matchers use data augmentation to obtain rotation invariance, but this is not always effective. RoITr is a Rotation-Invariant Transformer to cope with pose variations in point cloud matching. RoITr uses an attention mechanism with point pair feature (PPF)-based coordinates to create a pose-invariant geometry....
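To see why PPF-based coordinates are pose invariant, it helps to compute a point pair feature before and after a rotation. The sketch below uses the classic four-component PPF (distance plus three angles between the normals and the connecting vector); whether RoITr uses exactly this form is an assumption, and the points, normals, and helper names are invented for the example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angle(u, v):
    """Angle in radians between two non-zero vectors, clamped for safety."""
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))

def ppf(p1, n1, p2, n2):
    """Classic point pair feature: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2))
    with d = p2 - p1. Every component depends only on lengths and angles."""
    d = sub(p2, p1)
    return (norm(d), angle(n1, d), angle(n2, d), angle(n1, n2))

def rotate_z(v, theta):
    """Rotate a 3D vector about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

# Two made-up oriented points (position, unit normal).
p1, n1 = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
p2, n2 = (1.0, 2.0, 0.5), (0.0, 1.0, 0.0)

theta = 0.7
before = ppf(p1, n1, p2, n2)
after = ppf(rotate_z(p1, theta), rotate_z(n1, theta),
            rotate_z(p2, theta), rotate_z(n2, theta))
```

Because rotations preserve lengths and angles, `before` and `after` agree up to floating-point error, which is the property a rotation-invariant matcher can build on.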