diffusion

Title: DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI. (arXiv:2309.16205v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16205
Code URL: null
Copy Paste: [[2309.16205] DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI](http://arxiv.org/abs/2309.16205) #diffusion
Summary:
Mapping from functional connectivity (FC) to structural connectivity (SC) can facilitate multimodal brain network fusion and discover potential biomarkers for clinical implications. However, it is challenging to directly bridge the reliable non-linear mapping relations between SC and functional magnetic resonance imaging (fMRI). In this paper, a novel diffusision generative adversarial network-based fMRI-to-SC (DiffGAN-F2S) model is proposed to predict SC from brain fMRI in an end-to-end manner. To be specific, the proposed DiffGAN-F2S leverages denoising diffusion probabilistic models (DDPMs) and adversarial learning to efficiently generate high-fidelity SC through a few steps from fMRI. By designing the dual-channel multi-head spatial attention (DMSA) and graph convolutional modules, the symmetric graph generator first captures global relations among direct and indirect connected brain regions, then models the local brain region interactions. It can uncover the complex mapping relations between fMRI and structural connectivity. Furthermore, the spatially connected consistency loss is devised to constrain the generator to preserve global-local topological information for accurate intrinsic SC prediction. Testing on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the proposed model can effectively generate empirical SC-preserved connectivity from four-dimensional imaging data and shows superior performance in SC prediction compared with other related models. Furthermore, the proposed model can identify the vast majority of important brain regions and connections derived from the empirical method, providing an alternative way to fuse multimodal brain networks and analyze clinical disease.

Title: Object Motion Guided Human Motion Synthesis. (arXiv:2309.16237v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16237
Code URL: null
Copy Paste: [[2309.16237] Object Motion Guided Human Motion Synthesis](http://arxiv.org/abs/2309.16237) #diffusion
Summary:
Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.

Title: Distilling ODE Solvers of Diffusion Models into Smaller Steps. (arXiv:2309.16421v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16421
Code URL: null
Copy Paste: [[2309.16421] Distilling ODE Solvers of Diffusion Models into Smaller Steps](http://arxiv.org/abs/2309.16421) #diffusion
Summary:
Distillation techniques have substantially improved the sampling speed of diffusion models, allowing of the generation within only one step or a few steps. However, these distillation methods require extensive training for each dataset, sampler, and network, which limits their practical applicability. To address this limitation, we propose a straightforward distillation approach, Distilled-ODE solvers (D-ODE solvers), that optimizes the ODE solver rather than training the denoising network. D-ODE solvers are formulated by simply applying a single parameter adjustment to existing ODE solvers. Subsequently, D-ODE solvers with smaller steps are optimized by ODE solvers with larger steps through distillation over a batch of samples. Our comprehensive experiments indicate that D-ODE solvers outperform existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, especially when generating samples with fewer steps. Our method incur negligible computational overhead compared to previous distillation techniques, enabling simple and rapid integration with previous samplers. Qualitative analysis further shows that D-ODE solvers enhance image quality while preserving the sampling trajectory of ODE solvers.

Title: CCEdit: Creative and Controllable Video Editing via Diffusion Models. (arXiv:2309.16496v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16496
Code URL: null
Copy Paste: [[2309.16496] CCEdit: Creative and Controllable Video Editing via Diffusion Models](http://arxiv.org/abs/2309.16496) #diffusion
Summary:
In this work, we present CCEdit, a versatile framework designed to address the challenges of creative and controllable video editing. CCEdit accommodates a wide spectrum of user editing requirements and enables enhanced creative control through an innovative approach that decouples video structure and appearance. We leverage the foundational ControlNet architecture to preserve structural integrity, while seamlessly integrating adaptable temporal modules compatible with state-of-the-art personalization techniques for text-to-image generation, such as DreamBooth and LoRA.Furthermore, we introduce reference-conditioned video editing, empowering users to exercise precise creative control over video editing through the more manageable process of editing key frames. Our extensive experimental evaluations confirm the exceptional functionality and editing capabilities of the proposed CCEdit framework. Demo video is available at https://www.youtube.com/watch?v=UQw4jq-igN4.

Title: KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing. (arXiv:2309.16608v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16608
Code URL: null
Copy Paste: [[2309.16608] KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing](http://arxiv.org/abs/2309.16608) #diffusion
Summary:
Text-conditioned image editing is a recently emerged and highly practical task, and its potential is immeasurable. However, most of the concurrent methods are unable to perform action editing, i.e. they can not produce results that conform to the action semantics of the editing prompt and preserve the content of the original image. To solve the problem of action editing, we propose KV Inversion, a method that can achieve satisfactory reconstruction performance and action editing, which can solve two major problems: 1) the edited result can match the corresponding action, and 2) the edited object can retain the texture and identity of the original real image. In addition, our method does not require training the Stable Diffusion model itself, nor does it require scanning a large-scale dataset to perform time-consuming training.

Title: DeepPCR: Parallelizing Sequential Operations in Neural Networks. (arXiv:2309.16318v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16318
Code URL: null
Copy Paste: [[2309.16318] DeepPCR: Parallelizing Sequential Operations in Neural Networks](http://arxiv.org/abs/2309.16318) #diffusion
Summary:
Parallelization techniques have become ubiquitous for accelerating inference and training of deep neural networks. Despite this, several operations are still performed in a sequential manner. For instance, the forward and backward passes are executed layer-by-layer, and the output of diffusion models is produced by applying a sequence of denoising steps. This sequential approach results in a computational cost proportional to the number of steps involved, presenting a potential bottleneck as the number of steps increases. In this work, we introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations used in inference and training of neural networks. DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm. This reduces the complexity of computing the sequential operations from $\mathcal{O}(L)$ to $\mathcal{O}(\log_2L)$, thus yielding a speedup for large $L$. To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons, and reach speedups of up to $30\times$ for forward and $200\times$ for backward pass. We additionally showcase the flexibility of DeepPCR by parallelizing training of ResNets with as many as 1024 layers, and generation in diffusion models, enabling up to $7\times$ faster training and $11\times$ faster generation, respectively, when compared to the sequential approach.

self-supervised

Title: GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes. (arXiv:2309.16019v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16019
Code URL: null
Copy Paste: [[2309.16019] GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes](http://arxiv.org/abs/2309.16019) #self-supervised
Summary:
This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that limited by the scale ambiguity across different scenes in the training dataset, a na\"ive introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono

Title: Masked autoencoders are scalable learners of cellular morphology. (arXiv:2309.16064v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16064
Code URL: null
Copy Paste: [[2309.16064] Masked autoencoders are scalable learners of cellular morphology](http://arxiv.org/abs/2309.16064) #self-supervised
Summary:
Inferring biological relationships from cellular phenotypes in high-content microscopy screens provides significant opportunity and challenge in biological research. Prior results have shown that deep vision models can capture biological signal better than hand-crafted features. This work explores how weakly supervised and self-supervised deep learning approaches scale when training larger models on larger datasets. Our results show that both CNN- and ViT-based masked autoencoders significantly outperform weakly supervised models. At the high-end of our scale, a ViT-L/8 trained on over 3.5-billion unique crops sampled from 95-million microscopy images achieves relative improvements as high as 28% over our best weakly supervised models at inferring known biological relationships curated from public databases.

Title: Self-supervised Cross-view Representation Reconstruction for Change Captioning. (arXiv:2309.16283v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16283
Code URL: null
Copy Paste: [[2309.16283] Self-supervised Cross-view Representation Reconstruction for Change Captioning](http://arxiv.org/abs/2309.16283) #self-supervised
Summary:
Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of caption. This module reversely models a hallucination'' representation with the caption andbefore'' representation. By pushing it closer to the ``after'' representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves the state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.

Title: Vision Transformers Need Registers. (arXiv:2309.16588v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16588
Code URL: null
Copy Paste: [[2309.16588] Vision Transformers Need Registers](http://arxiv.org/abs/2309.16588) #self-supervised
Summary:
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.

Title: Unsupervised Fact Verification by Language Model Distillation. (arXiv:2309.16540v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16540
Code URL: null
Copy Paste: [[2309.16540] Unsupervised Fact Verification by Language Model Distillation](http://arxiv.org/abs/2309.16540) #self-supervised
Summary:
Unsupervised fact verification aims to verify a claim using evidence from a trustworthy knowledge base without any kind of data annotation. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful, and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on the standard FEVER fact verification benchmark (+8% accuracy) with linear evaluation.

Title: Graph-level Representation Learning with Joint-Embedding Predictive Architectures. (arXiv:2309.16014v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16014
Code URL: null
Copy Paste: [[2309.16014] Graph-level Representation Learning with Joint-Embedding Predictive Architectures](http://arxiv.org/abs/2309.16014) #self-supervised
Summary:
Joint-Embedding Predictive Architectures (JEPAs) have recently emerged as a novel and powerful technique for self-supervised representation learning. They aim to learn an energy-based model by predicting the latent representation of a target signal $y$ from a context signal $x$. JEPAs bypass the need for data augmentation and negative samples, which are typically required by contrastive learning, while avoiding the overfitting issues associated with generative-based pretraining. In this paper, we show that graph-level representations can be effectively modeled using this paradigm and propose Graph-JEPA, the first JEPA for the graph domain. In particular, we employ masked modeling to learn embeddings for different subgraphs of the input graph. To endow the representations with the implicit hierarchy that is often present in graph-level concepts, we devise an alternative training objective that consists of predicting the coordinates of the encoded subgraphs on the unit hyperbola in the 2D plane. Extensive validation shows that Graph-JEPA can learn representations that are expressive and competitive in both graph classification and regression problems.

Title: Feature Normalization Prevents Collapse of Non-contrastive Learning Dynamics. (arXiv:2309.16109v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16109
Code URL: null
Copy Paste: [[2309.16109] Feature Normalization Prevents Collapse of Non-contrastive Learning Dynamics](http://arxiv.org/abs/2309.16109) #self-supervised
Summary:
Contrastive learning is a self-supervised representation learning framework, where two positive views generated through data augmentation are made similar by an attraction force in a data representation space, while a repulsive force makes them far from negative examples. Non-contrastive learning, represented by BYOL and SimSiam, further gets rid of negative examples and improves computational efficiency. While learned representations may collapse into a single point due to the lack of the repulsive force at first sight, Tian et al. (2021) revealed through the learning dynamics analysis that the representations can avoid collapse if data augmentation is sufficiently stronger than regularization. However, their analysis does not take into account commonly-used feature normalization, a normalizer before measuring the similarity of representations, and hence excessively strong regularization may collapse the dynamics, which is an unnatural behavior under the presence of feature normalization. Therefore, we extend the previous theory based on the L2 loss by considering the cosine loss, which involves feature normalization. We show that the cosine loss induces sixth-order dynamics (while the L2 loss induces a third-order one), in which a stable equilibrium dynamically emerges even if there are only collapsed solutions with given initial parameters. Thus, we offer a new understanding that feature normalization plays an important role in robustly preventing the dynamics collapse.

foundation model

Title: The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering. (arXiv:2309.15954v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.15954
Code URL: null
Copy Paste: [[2309.15954] The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering](http://arxiv.org/abs/2309.15954) #foundation model
Summary:
The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.

Title: Effective Long-Context Scaling of Foundation Models. (arXiv:2309.16039v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16039
Code URL: null
Copy Paste: [[2309.16039] Effective Long-Context Scaling of Foundation Models](http://arxiv.org/abs/2309.16039) #foundation model
Summary:
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

Title: MHG-GNN: Combination of Molecular Hypergraph Grammar with Graph Neural Network. (arXiv:2309.16374v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16374
Code URL: null
Copy Paste: [[2309.16374] MHG-GNN: Combination of Molecular Hypergraph Grammar with Graph Neural Network](http://arxiv.org/abs/2309.16374) #foundation model
Summary:
Property prediction plays an important role in material discovery. As an initial step to eventually develop a foundation model for material science, we introduce a new autoencoder called the MHG-GNN, which combines graph neural network (GNN) with Molecular Hypergraph Grammar (MHG). Results on a variety of property prediction tasks with diverse materials show that MHG-GNN is promising.

generative

Title: AutoEncoding Tree for City Generation and Applications. (arXiv:2309.15941v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.15941
Code URL: null
Copy Paste: [[2309.15941] AutoEncoding Tree for City Generation and Applications](http://arxiv.org/abs/2309.15941) #generative
Summary:
City modeling and generation have attracted an increased interest in various applications, including gaming, urban planning, and autonomous driving. Unlike previous works focused on the generation of single objects or indoor scenes, the huge volumes of spatial data in cities pose a challenge to the generative models. Furthermore, few publicly available 3D real-world city datasets also hinder the development of methods for city generation. In this paper, we first collect over 3,000,000 geo-referenced objects for the city of New York, Zurich, Tokyo, Berlin, Boston and several other large cities. Based on this dataset, we propose AETree, a tree-structured auto-encoder neural network, for city generation. Specifically, we first propose a novel Spatial-Geometric Distance (SGD) metric to measure the similarity between building layouts and then construct a binary tree over the raw geometric data of building based on the SGD metric. Next, we present a tree-structured network whose encoder learns to extract and merge spatial information from bottom-up iteratively. The resulting global representation is reversely decoded for reconstruction or generation. To address the issue of long-dependency as the level of the tree increases, a Long Short-Term Memory (LSTM) Cell is employed as a basic network element of the proposed AETree. Moreover, we introduce a novel metric, Overlapping Area Ratio (OAR), to quantitatively evaluate the generation results. Experiments on the collected dataset demonstrate the effectiveness of the proposed model on 2D and 3D city generation. Furthermore, the latent features learned by AETree can serve downstream urban planning applications.

Title: Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness. (arXiv:2309.15991v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.15991
Code URL: null
Copy Paste: [[2309.15991] Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness](http://arxiv.org/abs/2309.15991) #generative
Summary:
Artificial neural networks typically struggle in generalizing to out-of-context examples. One reason for this limitation is caused by having datasets that incorporate only partial information regarding the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image in order to match the novel caption (e.g., uniquely changing a woman to a man while maintaining the context identical). Based on the Flickr30K benchmark, we show that, compared with the original data set, a TIDA-enhanced dataset related to gender, color, and counting abilities induces better performance in several image captioning metrics. Furthermore, on top of relying on the classical BLEU metric, we conduct a fine-grained analysis of the improvements of our models against the baseline in different ways. We compared text-to-image generative models and found different behaviors of the image captioning models in terms of encoding visual encoding and textual decoding.

Title: Learning Effective NeRFs and SDFs Representations with 3D Generative Adversarial Networks for 3D Object Generation: Technical Report for ICCV 2023 OmniObject3D Challenge. (arXiv:2309.16110v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16110
Code URL: null
Copy Paste: [[2309.16110] Learning Effective NeRFs and SDFs Representations with 3D Generative Adversarial Networks for 3D Object Generation: Technical Report for ICCV 2023 OmniObject3D Challenge](http://arxiv.org/abs/2309.16110) #generative
Summary:
In this technical report, we present a solution for 3D object generation of ICCV 2023 OmniObject3D Challenge. In recent years, 3D object generation has made great process and achieved promising results, but it remains a challenging task due to the difficulty of generating complex, textured and high-fidelity results. To resolve this problem, we study learning effective NeRFs and SDFs representations with 3D Generative Adversarial Networks (GANs) for 3D object generation. Specifically, inspired by recent works, we use the efficient geometry-aware 3D GANs as the backbone incorporating with label embedding and color mapping, which enables to train the model on different taxonomies simultaneously. Then, through a decoder, we aggregate the resulting features to generate Neural Radiance Fields (NeRFs) based representations for rendering high-fidelity synthetic images. Meanwhile, we optimize Signed Distance Functions (SDFs) to effectively represent objects with 3D meshes. Besides, we observe that this model can be effectively trained with only a few images of each object from a variety of classes, instead of using a great number of images per object or training one model per class. With this pipeline, we can optimize an effective model for 3D object generation. This solution is one of the final top-3-place solutions in the ICCV 2023 OmniObject3D Challenge.

Title: Generative Semi-supervised Learning with Meta-Optimized Synthetic Samples. (arXiv:2309.16143v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16143
Code URL: null
Copy Paste: [[2309.16143] Generative Semi-supervised Learning with Meta-Optimized Synthetic Samples](http://arxiv.org/abs/2309.16143) #generative
Summary:
Semi-supervised learning (SSL) is a promising approach for training deep classification models using labeled and unlabeled datasets. However, existing SSL methods rely on a large unlabeled dataset, which may not always be available in many real-world applications due to legal constraints (e.g., GDPR). In this paper, we investigate the research question: Can we train SSL models without real unlabeled datasets? Instead of using real unlabeled datasets, we propose an SSL method using synthetic datasets generated from generative foundation models trained on datasets containing millions of samples in diverse domains (e.g., ImageNet). Our main concepts are identifying synthetic samples that emulate unlabeled samples from generative foundation models and training classifiers using these synthetic samples. To achieve this, our method is formulated as an alternating optimization problem: (i) meta-learning of generative foundation models and (ii) SSL of classifiers using real labeled and synthetic unlabeled samples. For (i), we propose a meta-learning objective that optimizes latent variables to generate samples that resemble real labeled samples and minimize the validation loss. For (ii), we propose a simple unsupervised loss function that regularizes the feature extractors of classifiers to maximize the performance improvement obtained from synthetic samples. We confirm that our method outperforms baselines using generative foundation models on SSL. We also demonstrate that our methods outperform SSL using real unlabeled datasets in scenarios with extremely small amounts of labeled datasets. This suggests that synthetic samples have the potential to provide improvement gains more efficiently than real unlabeled data.

Title: FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation. (arXiv:2309.16364v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16364
Code URL: null
Copy Paste: [[2309.16364] FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation](http://arxiv.org/abs/2309.16364) #generative
Summary:
Neural radiance fields with stochasticity have garnered significant interest by enabling the sampling of plausible radiance fields and quantifying uncertainty for downstream tasks. Existing works rely on the independence assumption of points in the radiance field or the pixels in input views to obtain tractable forms of the probability density function. However, this assumption inadvertently impacts performance when dealing with intricate geometry and texture. In this work, we propose an independence-assumption-free probabilistic neural radiance field based on Flow-GAN. By combining the generative capability of adversarial learning and the powerful expressivity of normalizing flow, our method explicitly models the density-radiance distribution of the whole scene. We represent our probabilistic NeRF as a mean-shifted probabilistic residual neural model. Our model is trained without an explicit likelihood function, thereby avoiding the independence assumption. Specifically, We downsample the training images with different strides and centers to form fixed-size patches which are used to train the generator with patch-based adversarial learning. Through extensive experiments, our method demonstrates state-of-the-art performance by predicting lower rendering errors and more reliable uncertainty on both synthetic and real-world datasets.

Title: DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. (arXiv:2309.16653v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16653
Code URL: null
Copy Paste: [[2309.16653] DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation](http://arxiv.org/abs/2309.16653) #generative
Summary:
Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods.

Title: RealFill: Reference-Driven Generation for Authentic Image Completion. (arXiv:2309.16668v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16668
Code URL: null
Copy Paste: [[2309.16668] RealFill: Reference-Driven Generation for Authentic Image Completion](http://arxiv.org/abs/2309.16668) #generative
Summary:
Recent advances in generative imagery have brought forth outpainting and inpainting models that can produce high-quality, plausible image content in unknown regions, but the content these models hallucinate is necessarily inauthentic, since the models lack sufficient context about the true scene. In this work, we propose RealFill, a novel generative approach for image completion that fills in missing regions of an image with the content that should have been there. RealFill is a generative inpainting model that is personalized using only a few reference images of a scene. These reference images do not have to be aligned with the target image, and can be taken with drastically varying viewpoints, lighting conditions, camera apertures, or image styles. Once personalized, RealFill is able to complete a target image with visually compelling contents that are faithful to the original scene. We evaluate RealFill on a new image completion benchmark that covers a set of diverse and challenging scenarios, and find that it outperforms existing approaches by a large margin. See more results on our project page: https://realfill.github.io

Title: Demystifying CLIP Data. (arXiv:2309.16671v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16671
Code URL: https://github.com/facebookresearch/metaclip
Copy Paste: [[2309.16671] Demystifying CLIP Data](http://arxiv.org/abs/2309.16671) #generative
Summary:
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.

Title: Social Media Fashion Knowledge Extraction as Captioning. (arXiv:2309.16270v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16270
Code URL: null
Copy Paste: [[2309.16270] Social Media Fashion Knowledge Extraction as Captioning](http://arxiv.org/abs/2309.16270) #generative
Summary:
Social media plays a significant role in boosting the fashion industry, where a massive amount of fashion-related posts are generated every day. In order to obtain the rich fashion information from the posts, we study the task of social media fashion knowledge extraction. Fashion knowledge, which typically consists of the occasion, person attributes, and fashion item information, can be effectively represented as a set of tuples. Most previous studies on fashion knowledge extraction are based on the fashion product images without considering the rich text information in social media posts. Existing work on fashion knowledge extraction in social media is classification-based and requires to manually determine a set of fashion knowledge categories in advance. In our work, we propose to cast the task as a captioning problem to capture the interplay of the multimodal post information. Specifically, we transform the fashion knowledge tuples into a natural language caption with a sentence transformation method. Our framework then aims to generate the sentence-based fashion knowledge directly from the social media post. Inspired by the big success of pre-trained models, we build our model based on a multimodal pre-trained generative model and design several auxiliary tasks for enhancing the knowledge extraction. Since there is no existing dataset which can be directly borrowed to our task, we introduce a dataset consisting of social media posts with manual fashion knowledge annotation. Extensive experiments are conducted to demonstrate the effectiveness of our model.

Title: Compositional Sculpting of Iterative Generative Processes. (arXiv:2309.16115v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16115
Code URL: https://github.com/timgaripov/compositional-sculpting
Copy Paste: [[2309.16115] Compositional Sculpting of Iterative Generative Processes](http://arxiv.org/abs/2309.16115) #generative
Summary:
High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition. A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated, and satisfy delicate balance conditions. In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance. We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models. We highlight two binary operations $\unicode{x2014}$ the harmonic mean ($p_1 \otimes p_2$) and the contrast ($p_1 \unicode{x25D1}\,p_2$) between pairs, and the generalization of these operations to multiple component distributions. We offer empirical results on image and molecular generation tasks.

Title: Max-Sliced Mutual Information. (arXiv:2309.16200v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16200
Code URL: null
Copy Paste: [[2309.16200] Max-Sliced Mutual Information](http://arxiv.org/abs/2309.16200) #generative
Summary:
Quantifying the dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.

Title: Uncertainty-Aware Decision Transformer for Stochastic Driving Environments. (arXiv:2309.16397v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16397
Code URL: null
Copy Paste: [[2309.16397] Uncertainty-Aware Decision Transformer for Stochastic Driving Environments](http://arxiv.org/abs/2309.16397) #generative
Summary:
Offline Reinforcement Learning (RL) has emerged as a promising framework for learning policies without active interactions, making it especially appealing for autonomous driving tasks. Recent successes of Transformers inspire casting offline RL as sequence modeling, which performs well in long-horizon tasks. However, they are overly optimistic in stochastic environments with incorrect assumptions that the same goal can be consistently achieved by identical actions. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition or complex generative models. Specifically, UNREST estimates state uncertainties by the conditional mutual information between transitions and returns, and segments sequences accordingly. Discovering the uncertainty accumulation' andtemporal locality' properties of driving environments, UNREST replaces the global returns in decision transformers with less uncertain truncated returns, to learn from true outcomes of agent actions rather than environment transitions. We also dynamically evaluate environmental uncertainty during inference for cautious planning. Extensive experimental results demonstrate UNREST's superior performance in various driving scenarios and the power of our uncertainty estimation strategy.

Title: Recent Advances of Differential Privacy in Centralized Deep Learning: A Systematic Survey. (arXiv:2309.16398v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16398
Code URL: null
Copy Paste: [[2309.16398] Recent Advances of Differential Privacy in Centralized Deep Learning: A Systematic Survey](http://arxiv.org/abs/2309.16398) #generative
Summary:
Differential Privacy has become a widely popular method for data protection in machine learning, especially since it allows formulating strict mathematical privacy guarantees. This survey provides an overview of the state-of-the-art of differentially private centralized deep learning, thorough analyses of recent advances and open problems, as well as a discussion of potential future developments in the field. Based on a systematic literature review, the following topics are addressed: auditing and evaluation methods for private models, improvements of privacy-utility trade-offs, protection against a broad range of threats and attacks, differentially private generative models, and emerging application domains.

anomaly

Title: Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention. (arXiv:2309.16309v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16309
Code URL: null
Copy Paste: [[2309.16309] Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention](http://arxiv.org/abs/2309.16309) #anomaly
Summary:
With a focus on abnormal events contained within untrimmed videos, there is increasing interest among researchers in video anomaly detection. Among different video anomaly detection scenarios, weakly-supervised video anomaly detection poses a significant challenge as it lacks frame-wise labels during the training stage, only relying on video-level labels as coarse supervision. Previous methods have made attempts to either learn discriminative features in an end-to-end manner or employ a twostage self-training strategy to generate snippet-level pseudo labels. However, both approaches have certain limitations. The former tends to overlook informative features at the snippet level, while the latter can be susceptible to noises. In this paper, we propose an Anomalous Attention mechanism for weakly-supervised anomaly detection to tackle the aforementioned problems. Our approach takes into account snippet-level encoded features without the supervision of pseudo labels. Specifically, our approach first generates snippet-level anomalous attention and then feeds it together with original anomaly scores into a Multi-branch Supervision Module. The module learns different areas of the video, including areas that are challenging to detect, and also assists the attention optimization. Experiments on benchmark datasets XDViolence and UCF-Crime verify the effectiveness of our method. Besides, thanks to the proposed snippet-level attention, we obtain a more precise anomaly localization.

Title: Digital Twin-based Anomaly Detection with Curriculum Learning in Cyber-physical Systems. (arXiv:2309.15995v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.15995
Code URL: https://github.com/xuqinghua-china/tosem
Copy Paste: [[2309.15995] Digital Twin-based Anomaly Detection with Curriculum Learning in Cyber-physical Systems](http://arxiv.org/abs/2309.15995) #anomaly
Summary:
Anomaly detection is critical to ensure the security of cyber-physical systems (CPS). However, due to the increasing complexity of attacks and CPS themselves, anomaly detection in CPS is becoming more and more challenging. In our previous work, we proposed a digital twin-based anomaly detection method, called ATTAIN, which takes advantage of both historical and real-time data of CPS. However, such data vary significantly in terms of difficulty. Therefore, similar to human learning processes, deep learning models (e.g., ATTAIN) can benefit from an easy-to-difficult curriculum. To this end, in this paper, we present a novel approach, named digitaL twin-based Anomaly deTecTion wIth Curriculum lEarning (LATTICE), which extends ATTAIN by introducing curriculum learning to optimize its learning paradigm. LATTICE attributes each sample with a difficulty score, before being fed into a training scheduler. The training scheduler samples batches of training data based on these difficulty scores such that learning from easy to difficult data can be performed. To evaluate LATTICE, we use five publicly available datasets collected from five real-world CPS testbeds. We compare LATTICE with ATTAIN and two other state-of-the-art anomaly detectors. Evaluation results show that LATTICE outperforms the three baselines and ATTAIN by 0.906%-2.367% in terms of the F1 score. LATTICE also, on average, reduces the training time of ATTAIN by 4.2% on the five datasets and is on par with the baselines in terms of detection delay time.

Title: HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs). (arXiv:2309.16021v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.16021
Code URL: null
Copy Paste: [[2309.16021] HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)](http://arxiv.org/abs/2309.16021) #anomaly
Summary:
Machine learning (ML) is crucial in network anomaly detection for proactive threat hunting, reducing detection and response times significantly. However, challenges in model training, maintenance, and frequent false positives impact its acceptance and reliability. Explainable AI (XAI) attempts to mitigate these issues, allowing cybersecurity teams to assess AI-generated alerts with confidence, but has seen limited acceptance from incident responders. Large Language Models (LLMs) present a solution through discerning patterns in extensive information and adapting to different functional requirements. We present HuntGPT, a specialized intrusion detection dashboard applying a Random Forest classifier using the KDD99 dataset, integrating XAI frameworks like SHAP and Lime for user-friendly and intuitive model interaction, and combined with a GPT-3.5 Turbo, it delivers threats in an understandable format. The paper delves into the system's architecture, components, and technical accuracy, assessed through Certified Information Security Manager (CISM) Practice Exams, evaluating response quality across six metrics. The results demonstrate that conversational agents, supported by LLM and integrated with XAI, provide robust, explainable, and actionable AI solutions in intrusion detection, enhancing user understanding and interactive experience.

in-context

Title: Visual In-Context Learning for Few-Shot Eczema Segmentation. (arXiv:2309.16656v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16656
Code URL: null
Copy Paste: [[2309.16656] Visual In-Context Learning for Few-Shot Eczema Segmentation](http://arxiv.org/abs/2309.16656) #in-context
Summary:
Automated diagnosis of eczema from digital camera images is crucial for developing applications that allow patients to self-monitor their recovery. An important component of this is the segmentation of eczema region from such images. Current methods for eczema segmentation rely on deep neural networks such as convolutional (CNN)-based U-Net or transformer-based Swin U-Net. While effective, these methods require high volume of annotated data, which can be difficult to obtain. Here, we investigate the capabilities of visual in-context learning that can perform few-shot eczema segmentation with just a handful of examples and without any need for retraining models. Specifically, we propose a strategy for applying in-context learning for eczema segmentation with a generalist vision model called SegGPT. When benchmarked on a dataset of annotated eczema images, we show that SegGPT with just 2 representative example images from the training dataset performs better (mIoU: 36.69) than a CNN U-Net trained on 428 images (mIoU: 32.60). We also discover that using more number of examples for SegGPT may in fact be harmful to its performance. Our result highlights the importance of visual in-context learning in developing faster and better solutions to skin imaging tasks. Our result also paves the way for developing inclusive solutions that can cater to minorities in the demographics who are typically heavily under-represented in the training data.

Title: MedEdit: Model Editing for Medical Question Answering with External Knowledge Bases. (arXiv:2309.16035v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16035
Code URL: null
Copy Paste: [[2309.16035] MedEdit: Model Editing for Medical Question Answering with External Knowledge Bases](http://arxiv.org/abs/2309.16035) #in-context
Summary:
Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks like medical question answering (QA). Moreover, they tend to function as "black-boxes," making it challenging to modify their behavior. Addressing this, our study delves into model editing utilizing in-context learning, aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then we incorporate them into the query prompt for the LLM. Focusing on medical QA using the MedQA-SMILE dataset, we evaluate the impact of different retrieval models and the number of facts provided to the LLM. Notably, our edited Vicuna model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of model editing to enhance LLM performance, offering a practical approach to mitigate the challenges of black-box LLMs.

Title: A Benchmark for Learning to Translate a New Language from One Grammar Book. (arXiv:2309.16575v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16575
Code URL: null
Copy Paste: [[2309.16575] A Benchmark for Learning to Translate a New Language from One Grammar Book](http://arxiv.org/abs/2309.16575) #in-context
Summary:
Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang -- a language with less than 200 speakers and therefore virtually no presence on the web -- using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 learning than L1 acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation.

memory

Title: Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation. (arXiv:2309.16127v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16127
Code URL: null
Copy Paste: [[2309.16127] Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation](http://arxiv.org/abs/2309.16127) #memory
Summary:
Many methods of semantic image segmentation have borrowed the success of open compound domain adaptation. They minimize the style gap between the images of source and target domains, more easily predicting the accurate pseudo annotations for target domain's images that train segmentation network. The existing methods globally adapt the scene style of the images, whereas the object styles of different categories or instances are adapted improperly. This paper proposes the Object Style Compensation, where we construct the Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from target to source domains. We learn the discrepancy features from the images of source and target domains, storing the discrepancy features in memory. With this memory, we select appropriate discrepancy features for compensating the style information of the object instances of various categories, adapting the object styles to a unified style of source domain. Our method enables a more accurate computation of the pseudo annotations for target domain's images, thus yielding state-of-the-art results on different datasets.

Title: Accurate and lightweight dehazing via multi-receptive-field non-local network and novel contrastive regularization. (arXiv:2309.16494v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16494
Code URL: null
Copy Paste: [[2309.16494] Accurate and lightweight dehazing via multi-receptive-field non-local network and novel contrastive regularization](http://arxiv.org/abs/2309.16494) #memory
Summary:
Recently, deep learning-based methods have dominated image dehazing domain. Although very competitive dehazing performance has been achieved with sophisticated models, effective solutions for extracting useful features are still under-explored. In addition, non-local network, which has made a breakthrough in many vision tasks, has not been appropriately applied to image dehazing. Thus, a multi-receptive-field non-local network (MRFNLN) consisting of the multi-stream feature attention block (MSFAB) and cross non-local block (CNLB) is presented in this paper. We start with extracting richer features for dehazing. Specifically, we design a multi-stream feature extraction (MSFE) sub-block, which contains three parallel convolutions with different receptive fields (i.e., $1\times 1$, $3\times 3$, $5\times 5$) for extracting multi-scale features. Following MSFE, we employ an attention sub-block to make the model adaptively focus on important channels/regions. The MSFE and attention sub-blocks constitute our MSFAB. Then, we design a cross non-local block (CNLB), which can capture long-range dependencies beyond the query. Instead of the same input source of query branch, the key and value branches are enhanced by fusing more preceding features. CNLB is computation-friendly by leveraging a spatial pyramid down-sampling (SPDS) strategy to reduce the computation and memory consumption without sacrificing the performance. Last but not least, a novel detail-focused contrastive regularization (DFCR) is presented by emphasizing the low-level details and ignoring the high-level semantic information in the representation space. Comprehensive experimental results demonstrate that the proposed MRFNLN model outperforms recent state-of-the-art dehazing methods with less than 1.5 Million parameters.

Title: Controllable Text Generation with Residual Memory Transformer. (arXiv:2309.16231v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16231
Code URL: https://github.com/littlehacker26/residual_memory_transformer
Copy Paste: [[2309.16231] Controllable Text Generation with Residual Memory Transformer](http://arxiv.org/abs/2309.16231) #memory
Summary:
Large-scale Causal Language Models (CLMs), e.g., GPT3 and ChatGPT, have brought great success in text generation. However, it is still an open challenge to control the generation process of CLM while balancing flexibility, control granularity, and generation efficiency. In this paper, we provide a new alternative for controllable text generation (CTG), by designing a non-intrusive, lightweight control plugin to accompany the generation of CLM at arbitrary time steps. The proposed control plugin, namely Residual Memory Transformer (RMT), has an encoder-decoder setup, which can accept any types of control conditions and cooperate with CLM through a residual learning paradigm, to achieve a more flexible, general, and efficient CTG. Extensive experiments are carried out on various control tasks, in the form of both automatic and human evaluations. The results show the superiority of RMT over a range of state-of-the-art approaches, proving the effectiveness and versatility of our approach.

Title: Augmenting LLMs with Knowledge: A survey on hallucination prevention. (arXiv:2309.16459v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16459
Code URL: null
Copy Paste: [[2309.16459] Augmenting LLMs with Knowledge: A survey on hallucination prevention](http://arxiv.org/abs/2309.16459) #memory
Summary:
Large pre-trained language models have demonstrated their proficiency in storing factual knowledge within their parameters and achieving remarkable results when fine-tuned for downstream natural language processing tasks. Nonetheless, their capacity to access and manipulate knowledge with precision remains constrained, resulting in performance disparities on knowledge-intensive tasks when compared to task-specific architectures. Additionally, the challenges of providing provenance for model decisions and maintaining up-to-date world knowledge persist as open research frontiers. To address these limitations, the integration of pre-trained models with differentiable access mechanisms to explicit non-parametric memory emerges as a promising solution. This survey delves into the realm of language models (LMs) augmented with the ability to tap into external knowledge sources, including external knowledge bases and search engines. While adhering to the standard objective of predicting missing tokens, these augmented LMs leverage diverse, possibly non-parametric external modules to augment their contextual processing capabilities, departing from the conventional language modeling paradigm. Through an exploration of current advancements in augmenting large language models with knowledge, this work concludes that this emerging research direction holds the potential to address prevalent issues in traditional LMs, such as hallucinations, un-grounded responses, and scalability challenges.

Title: Random and Safe Cache Architecture to Defeat Cache Timing Attacks. (arXiv:2309.16172v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.16172
Code URL: null
Copy Paste: [[2309.16172] Random and Safe Cache Architecture to Defeat Cache Timing Attacks](http://arxiv.org/abs/2309.16172) #memory
Summary:
Caches have been exploited to leak secret information due to the different times they take to handle memory accesses. Cache timing attacks include non-speculative cache side and covert channel attacks and cache-based speculative execution attacks. We first present a systematic view of the attack and defense space and show that no existing defense has addressed both speculative and non-speculative cache timing attack families, which we do in this paper. We propose Random and Safe (RaS) cache architectures to decorrelate the cache state changes from memory requests. RaS fills the cache with ``safe'' cache lines that are likely to be used in the future, rather than with demand-fetched, security-sensitive lines. RaS captures a group of safe addresses during runtime and fetches addresses randomly displaced from these addresses. Our proposed RaS architecture is flexible to allow security-performance trade-offs. We show different designs of RaS architectures that can defeat cache side-channel attacks and cache-based speculative execution attacks. The RaS variant against cache-based speculative execution attacks has 4.2% average performance overhead and other RaS variants against both attack families have 7.9% to 45.2% average overhead. For some benchmarks, RaS defenses improve the performance while providing security.

Title: ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers. (arXiv:2309.16119v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16119
Code URL: https://github.com/kuleshov-group/modulora-experiment
Copy Paste: [[2309.16119] ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers](http://arxiv.org/abs/2309.16119) #memory
Summary:
We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 3-bit or 4-bit precision on as little as one 48GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 3-bit LLMs for the first time--leveraging state-of-the-art 3-bit OPTQ quantization often outperforms finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language infernece, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models--including the first family of 3-bit instruction following Alpaca LLMs--as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.

Title: Universal Sleep Decoder: Aligning awake and sleep neural representation across subjects. (arXiv:2309.16457v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16457
Code URL: null
Copy Paste: [[2309.16457] Universal Sleep Decoder: Aligning awake and sleep neural representation across subjects](http://arxiv.org/abs/2309.16457) #memory
Summary:
Decoding memory content from brain activity during sleep has long been a goal in neuroscience. While spontaneous reactivation of memories during sleep in rodents is known to support memory consolidation and offline learning, capturing memory replay in humans is challenging due to the absence of well-annotated sleep datasets and the substantial differences in neural patterns between wakefulness and sleep. To address these challenges, we designed a novel cognitive neuroscience experiment and collected a comprehensive, well-annotated electroencephalography (EEG) dataset from 52 subjects during both wakefulness and sleep. Leveraging this benchmark dataset, we developed the Universal Sleep Decoder (USD) to align neural representations between wakefulness and sleep across subjects. Our model achieves up to 16.6% top-1 zero-shot accuracy on unseen subjects, comparable to decoding performances using individual sleep data. Furthermore, fine-tuning USD on test subjects enhances decoding accuracy to 25.9% top-1 accuracy, a substantial improvement over the baseline chance of 6.7%. Model comparison and ablation analyses reveal that our design choices, including the use of (i) an additional contrastive objective to integrate awake and sleep neural signals and (ii) the pretrain-finetune paradigm to incorporate different subjects, significantly contribute to these performances. Collectively, our findings and methodologies represent a significant advancement in the field of sleep decoding.

few-shot

Title: Reflection Invariance Learning for Few-shot Semantic Segmentation. (arXiv:2309.15850v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.15850
Code URL: null
Copy Paste: [[2309.15850] Reflection Invariance Learning for Few-shot Semantic Segmentation](http://arxiv.org/abs/2309.15850) #few-shot
Summary:
Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images. Existing FSS algorithms typically focus on mining category representations from the single-view support to match semantic objects of the single-view query. However, the limited annotated samples render the single-view matching struggle to perceive the reflection invariance of novel objects, which results in a restricted learning space for novel categories and further induces a biased segmentation with demoted parsing performance. To address this challenge, this paper proposes a fresh few-shot segmentation framework to mine the reflection invariance in a multi-view matching manner. Specifically, original and reflection support features from different perspectives with the same semantics are learnable fused to obtain the reflection invariance prototype with a stronger category representation ability. Simultaneously, aiming at providing better prior guidance, the Reflection Invariance Prior Mask Generation (RIPMG) module is proposed to integrate prior knowledge from different perspectives. Finally, segmentation predictions from varying views are complementarily merged in the Reflection Invariance Semantic Prediction (RISP) module to yield precise segmentation predictions. Extensive experiments on both PASCAL-$5^\textit{i}$ and COCO-$20^\textit{i}$ datasets demonstrate the effectiveness of our approach and show that our method could achieve state-of-the-art performance. Code is available at \url{https://anonymous.4open.science/r/RILFS-A4D1}

Title: Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts. (arXiv:2309.15915v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.15915
Code URL: null
Copy Paste: [[2309.15915] Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts](http://arxiv.org/abs/2309.15915) #few-shot
Summary:
Recent vision-language models are driven by large-scale pretrained models. However, adapting pretrained models on limited data presents challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language. We introduce a parameter-efficient method to address these challenges, combining multimodal prompt learning and a transformer-based mapping network, while keeping the pretrained models frozen. Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency on both zero-shot and few-shot settings. Our code is available at https://engindeniz.github.io/vitis.

Title: Logarithm-transform aided Gaussian Sampling for Few-Shot Learning. (arXiv:2309.16337v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.16337
Code URL: https://github.com/ganatra-v/gaussian-sampling-fsl
Copy Paste: [[2309.16337] Logarithm-transform aided Gaussian Sampling for Few-Shot Learning](http://arxiv.org/abs/2309.16337) #few-shot
Summary:
Few-shot image classification has recently witnessed the rise of representation learning being utilised for models to adapt to new classes using only a few training examples. Therefore, the properties of the representations, such as their underlying probability distributions, assume vital importance. Representations sampled from Gaussian distributions have been used in recent works, [19] to train classifiers for few-shot classification. These methods rely on transforming the distributions of experimental data to approximate Gaussian distributions for their functioning. In this paper, I propose a novel Gaussian transform, that outperforms existing methods on transforming experimental data into Gaussian-like distributions. I then utilise this novel transformation for few-shot image classification and show significant gains in performance, while sampling lesser data.

Title: Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection. (arXiv:2309.16424v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.16424
Code URL: https://github.com/jiayingwu19/prompt-and-align
Copy Paste: [[2309.16424] Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection](http://arxiv.org/abs/2309.16424) #few-shot
Summary:
Despite considerable advances in automated fake news detection, due to the timely nature of news, it remains a critical open question how to effectively predict the veracity of news articles based on limited fact-checks. Existing approaches typically follow a "Train-from-Scratch" paradigm, which is fundamentally bounded by the availability of large-scale annotated data. While expressive pre-trained language models (PLMs) have been adapted in a "Pre-Train-and-Fine-Tune" manner, the inconsistency between pre-training and downstream objectives also requires costly task-specific supervision. In this paper, we propose "Prompt-and-Align" (P&A), a novel prompt-based paradigm for few-shot fake news detection that jointly leverages the pre-trained knowledge in PLMs and the social context topology. Our approach mitigates label scarcity by wrapping the news article in a task-related textual prompt, which is then processed by the PLM to directly elicit task-specific knowledge. To supplement the PLM with social context without inducing additional training overheads, motivated by empirical observation on user veracity consistency (i.e., social users tend to consume news of the same veracity type), we further construct a news proximity graph among news articles to capture the veracity-consistent signals in shared readerships, and align the prompting predictions along the graph edges in a confidence-informed manner. Extensive experiments on three real-world benchmarks demonstrate that P&A sets new states-of-the-art for few-shot fake news detection performance by significant margins.

Title: Compositional Program Generation for Systematic Generalization. (arXiv:2309.16467v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.16467
Code URL: null
Copy Paste: [[2309.16467] Compositional Program Generation for Systematic Generalization](http://arxiv.org/abs/2309.16467) #few-shot
Summary:
Compositional generalization is a key ability of humans that enables us to learn new concepts from only a handful examples. Machine learning models, including the now ubiquitous transformers, struggle to generalize in this way, and typically require thousands of examples of a concept during training in order to generalize meaningfully. This difference in ability between humans and artificial neural architectures, motivates this study on a neuro-symbolic architecture called the Compositional Program Generator (CPG). CPG has three key features: modularity, type abstraction, and recursive composition, that enable it to generalize both systematically to new concepts in a few-shot manner, as well as productively by length on various sequence-to-sequence language tasks. For each input, CPG uses a grammar of the input domain and a parser to generate a type hierarchy in which each grammar rule is assigned its own unique semantic module, a probabilistic copy or substitution program. Instances with the same hierarchy are processed with the same composed program, while those with different hierarchies may be processed with different programs. CPG learns parameters for the semantic modules and is able to learn the semantics for new types incrementally. Given a context-free grammar of the input language and a dictionary mapping each word in the source language to its interpretation in the output language, CPG can achieve perfect generalization on the SCAN and COGS benchmarks, in both standard and extreme few-shot settings.