diffusion

Title: Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach. (arXiv:2309.07944v1 [cs.CV])

Title: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models. (arXiv:2309.07986v1 [cs.CV])

ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts.

Title: Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions. (arXiv:2309.08097v1 [cs.CV])

Title: Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models. (arXiv:2309.08251v1 [cs.CV])

Title: Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models. (arXiv:2309.08273v1 [cs.CV])

Title: Large Intestine 3D Shape Refinement Using Point Diffusion Models for Digital Phantom Generation. (arXiv:2309.08289v1 [cs.CV])

self-supervised

Title: DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions. (arXiv:2309.08152v1 [cs.CV])

Title: Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens. (arXiv:2309.08531v1 [cs.CV])

Title: Structural Self-Supervised Objectives for Transformers. (arXiv:2309.08272v1 [cs.CL])

In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives involve token swapping instead of masking, with RTS and C-RTS aiming to predict token originality and SLM predicting the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks despite using the same computational budget.

In the second part, we proposes self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. We use large corpora like Wikipedia and CC-News to train models to recognize if text spans originate from the same paragraph or document in several ways. By doing continuous pre-training, starting from existing models like RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements in tasks like Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when limited annotation data is available. The proposed objectives also achieve state-of-the-art results on various benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, as well as enhancing the quality of summaries. Importantly, these techniques can be easily integrated with other methods without altering the internal structure of Transformer models, making them versatile for various NLP applications.

Title: Headless Language Models: Learning without Predicting with Contrastive Weight Tying. (arXiv:2309.08351v1 [cs.CL])

Title: Supervised Stochastic Neighbor Embedding Using Contrastive Learning. (arXiv:2309.08077v1 [cs.LG])

Title: Understanding the limitations of self-supervised learning for tabular anomaly detection. (arXiv:2309.08374v1 [cs.LG])

foundation model

Title: Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer. (arXiv:2309.07929v1 [cs.CV])

Title: BROW: Better featuRes fOr Whole slide image based on self-distillation. (arXiv:2309.08259v1 [cs.CV])

Title: Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding. (arXiv:2309.08585v1 [cs.CV])

Title: Scaling Laws for Sparsely-Connected Foundation Models. (arXiv:2309.08520v1 [cs.LG])

Title: Compositional Foundation Models for Hierarchical Planning. (arXiv:2309.08587v1 [cs.LG])

generative

Title: Breathing New Life into 3D Assets with Generative Repainting. (arXiv:2309.08523v1 [cs.CV])

Title: An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing. (arXiv:2309.08008v1 [cs.CL])

Title: Reward Engineering for Generating Semi-structured Explanation. (arXiv:2309.08347v1 [cs.CL])

Title: Masked Generative Modeling with Enhanced Sampling Scheme. (arXiv:2309.07945v1 [cs.LG])

Title: An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone. (arXiv:2309.07992v1 [cs.LG])

anomaly

in-context

Title: LASER: LLM Agent with State-Space Exploration for Web Navigation. (arXiv:2309.08172v1 [cs.CL])

Title: Bridging Topic, Domain, and Language Shifts: An Evaluation of Comprehensive Out-of-Distribution Scenarios. (arXiv:2309.08316v1 [cs.CL])

Unlike prior studies focusing on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization. We define three metrics to pinpoint generalization flaws and propose eleven classification tasks covering topic, domain, and language shifts. Overall, we find superior performance of prompt-based fine-tuning, notably when train and test splits primarily differ semantically. Simultaneously, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks when training data embodies heavy discrepancies in label distribution compared to testing data. This reveals a crucial drawback of gradient-based learning: it biases LMs regarding such structural obstacles.

Title: ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer. (arXiv:2309.08583v1 [cs.CL])

memory

Title: VERSE: Virtual-Gradient Aware Streaming Lifelong Learning with Anytime Inference. (arXiv:2309.08227v1 [cs.LG])

Title: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. (arXiv:2309.08168v1 [cs.CL])

Title: VulnSense: Efficient Vulnerability Detection in Ethereum Smart Contracts by Multimodal Learning with Graph Neural Network and Language Model. (arXiv:2309.08474v1 [cs.CR])

Title: A Data Source for Reasoning Embodied Agents. (arXiv:2309.07974v1 [cs.LG])

Title: Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition. (arXiv:2309.07988v1 [cs.LG])

To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and corresponding memory consumption) by up to 24% and power consumption by up to 23%, all without compromising model accuracy or computation overhead.

Title: Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization. (arXiv:2309.08546v1 [cs.LG])

few-shot

Title: ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection. (arXiv:2309.08196v1 [cs.CV])

Title: SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels. (arXiv:2309.08513v1 [cs.CV])

Title: Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level. (arXiv:2309.08182v1 [cs.CL])

Title: Neural Machine Translation Models Can Learn to be Few-shot Learners. (arXiv:2309.08590v1 [cs.CL])