[[2305.07710] Zero-shot racially balanced dataset generation using an existing biased StyleGAN2](http://arxiv.org/abs/2305.07710) #security
Facial recognition systems have made significant strides thanks to data-heavy deep learning models, but these models rely on large privacy-sensitive datasets. Unfortunately, many of these datasets lack diversity in terms of ethnicity and demographics, which can lead to biased models that can have serious societal and security implications. To address these issues, we propose a methodology that leverages the biased generative model StyleGAN2 to create demographically diverse images of synthetic individuals. The synthetic dataset is created using a novel evolutionary search algorithm that targets specific demographic groups. By training face recognition models with the resulting balanced dataset containing 50,000 identities per race (13.5 million images in total), we can improve their performance and minimize biases that might have been present in a model trained on a real dataset.
[[2305.08076] Improving Defensive Distillation using Teacher Assistant](http://arxiv.org/abs/2305.08076) #security
Adversarial attacks pose a significant threat to the security and safety of deep neural networks being applied to modern applications. More specifically, in computer vision-based tasks, experts can use the knowledge of model architecture to create adversarial samples imperceptible to the human eye. These attacks can lead to security problems in popular applications such as self-driving cars, face recognition, etc. Hence, building networks which are robust to such attacks is highly desirable and essential. Among the various methods present in literature, defensive distillation has shown promise in recent years. Using knowledge distillation, researchers have been able to create models robust against some of those attacks. However, more attacks have been developed exposing weakness in defensive distillation. In this project, we derive inspiration from teacher assistant knowledge distillation and propose that introducing an assistant network can improve the robustness of the distilled model. Through a series of experiments, we evaluate the distilled models for different distillation temperatures in terms of accuracy, sensitivity, and robustness. Our experiments demonstrate that the proposed hypothesis can improve robustness in most cases. Additionally, we show that multi-step distillation can further improve robustness with very little impact on model accuracy.
[[2305.08005] Beyond the Safeguards: Exploring the Security Risks of ChatGPT](http://arxiv.org/abs/2305.08005) #security
The increasing popularity of large language models (LLMs) such as ChatGPT has led to growing concerns about their safety, security risks, and ethical implications. This paper aims to provide an overview of the different types of security risks associated with ChatGPT, including malicious text and code generation, private data disclosure, fraudulent services, information gathering, and producing unethical content. We present an empirical study examining the effectiveness of ChatGPT's content filters and explore potential ways to bypass these safeguards, demonstrating the ethical implications and security risks that persist in LLMs even when protections are in place. Based on a qualitative analysis of the security implications, we discuss potential strategies to mitigate these risks and inform researchers, policymakers, and industry professionals about the complex security challenges posed by LLMs like ChatGPT. This study contributes to the ongoing discussion on the ethical and security implications of LLMs, underscoring the need for continued research in this area.
[[2305.07830] Interchain Timestamping for Mesh Security](http://arxiv.org/abs/2305.07830) #security
Fourteen years after the invention of Bitcoin, there has been a proliferation of many permissionless blockchains. Each such chain provides a public ledger that can be written to and read from by anyone. In this multi-chain world, a natural question arises: what is the optimal security an existing blockchain, a consumer chain, can extract by only reading and writing to k other existing blockchains, the provider chains? We design a protocol, called interchain timestamping, and show that it extracts the maximum economic security from the provider chains, as quantified by the slashable safety resilience. We observe that interchain timestamps are already provided by light-client based bridges, so interchain timestamping can be readily implemented for Cosmos chains connected by the Inter-Blockchain Communication (IBC) protocol. We compare interchain timestamping with cross-staking, the original solution to mesh security, as well as with Trustboost, another recent security sharing protocol.
[[2305.08039] Systematic Meets Unintended: Prior Knowledge Adaptive 5G Vulnerability Detection via Multi-Fuzzing](http://arxiv.org/abs/2305.08039) #security
The virtualization and softwarization of 5G and NextG are critical enablers of the shift to flexibility, but they also present a potential attack surface for threats. However, current security research in communication systems focuses on specific aspects of security challenges and lacks a holistic perspective. To address this challenge, a novel systematic fuzzing approach is proposed to reveal, detect, and predict vulnerabilities with and without prior knowledge assumptions from attackers. It also serves as a digital twin platform for system testing and defense simulation pipeline. Three fuzzing strategies are proposed: Listen-and-Learn (LAL), Synchronize-and-Learn (SyAL), and Source-and-Learn (SoAL). The LAL strategy is a black-box fuzzing strategy used to discover vulnerabilities without prior protocol knowledge, while the SyAL strategy, also a black-box fuzzing method, targets vulnerabilities more accurately with attacker-accessible user information and a novel probability-based fuzzing approach. The white-box fuzzing strategy, SoAL, is then employed to identify and explain vulnerabilities through fuzzing of significant bits. Using the srsRAN 5G platform, the LAL strategy identifies 129 RRC connection vulnerabilities with an average detection duration of 0.072s. Leveraging the probability-based fuzzing algorithm, the SyAL strategy outperforms existing models in precision and recall, using significantly fewer fuzzing cases. SoAL detects three man-in-the-middle vulnerabilities stemming from 5G protocol vulnerabilities. The proposed solution is scalable to other open-source and commercial 5G platforms and protocols beyond RRC. Extensive experimental results demonstrate that the proposed solution is an efficient and efficient approach to validate 5G security; meanwhile, it serves as real-time vulnerability detection and proactive defense.
[[2305.08226] NLP-based Cross-Layer 5G Vulnerabilities Detection via Fuzzing Generated Run-Time Profiling](http://arxiv.org/abs/2305.08226) #security
The effectiveness and efficiency of 5G software stack vulnerability and unintended behavior detection are essential for 5G assurance, especially for its applications in critical infrastructures. Scalability and automation are the main challenges in testing approaches and cybersecurity research. In this paper, we propose an innovative approach for automatically detecting vulnerabilities, unintended emergent behaviors, and performance degradation in 5G stacks via run-time profiling documents corresponding to fuzz testing in code repositories. Piloting on srsRAN, we map the run-time profiling via Logging Information (LogInfo) generated by fuzzing test to a high dimensional metric space first and then construct feature spaces based on their timestamp information. Lastly, we further leverage machine learning-based classification algorithms, including Logistic Regression, K-Nearest Neighbors, and Random Forest to categorize the impacts on performance and security attributes. The performance of the proposed approach has high accuracy, ranging from $ 93.4 \% $ to $ 95.9 \% $, in detecting the fuzzing impacts. In addition, the proof of concept could identify and prioritize real-time vulnerabilities on 5G infrastructures and critical applications in various verticals.
[[2305.07973] On the Computational Cost of Stochastic Security](http://arxiv.org/abs/2305.07973) #security
We investigate whether long-run persistent chain Monte Carlo simulation of Langevin dynamics improves the quality of the representations achieved by energy-based models (EBM). We consider a scheme wherein Monte Carlo simulation of a diffusion process using a trained EBM is used to improve the adversarial robustness and the calibration score of an independent classifier network. Our results show that increasing the computational budget of Gibbs sampling in persistent contrastive divergence improves the calibration and adversarial robustness of the model, elucidating the practical merit of realizing new quantum and classical hardware and software for efficient Gibbs sampling from continuous energy potentials.
[[2305.07815] MetaMorphosis: Task-oriented Privacy Cognizant Feature Generation for Multi-task Learning](http://arxiv.org/abs/2305.07815) #privacy
With the growth of computer vision applications, deep learning, and edge computing contribute to ensuring practical collaborative intelligence (CI) by distributing the workload among edge devices and the cloud. However, running separate single-task models on edge devices is inefficient regarding the required computational resource and time. In this context, multi-task learning allows leveraging a single deep learning model for performing multiple tasks, such as semantic segmentation and depth estimation on incoming video frames. This single processing pipeline generates common deep features that are shared among multi-task modules. However, in a collaborative intelligence scenario, generating common deep features has two major issues. First, the deep features may inadvertently contain input information exposed to the downstream modules (violating input privacy). Second, the generated universal features expose a piece of collective information than what is intended for a certain task, in which features for one task can be utilized to perform another task (violating task privacy). This paper proposes a novel deep learning-based privacy-cognizant feature generation process called MetaMorphosis that limits inference capability to specific tasks at hand. To achieve this, we propose a channel squeeze-excitation based feature metamorphosis module, Cross-SEC, to achieve distinct attention of all tasks and a de-correlation loss function with differential-privacy to train a deep learning model that produces distinct privacy-aware features as an output for the respective tasks. With extensive experimentation on four datasets consisting of diverse images related to scene understanding and facial attributes, we show that MetaMorphosis outperforms recent adversarial learning and universal feature generation methods by guaranteeing privacy requirements in an efficient way for image and video analytics.
[[2305.07881] Black-box Source-free Domain Adaptation via Two-stage Knowledge Distillation](http://arxiv.org/abs/2305.07881) #privacy
Source-free domain adaptation aims to adapt deep neural networks using only pre-trained source models and target data. However, accessing the source model still has a potential concern about leaking the source data, which reveals the patient's privacy. In this paper, we study the challenging but practical problem: black-box source-free domain adaptation where only the outputs of the source model and target data are available. We propose a simple but effective two-stage knowledge distillation method. In Stage \uppercase\expandafter{\romannumeral1}, we train the target model from scratch with soft pseudo-labels generated by the source model in a knowledge distillation manner. In Stage \uppercase\expandafter{\romannumeral2}, we initialize another model as the new student model to avoid the error accumulation caused by noisy pseudo-labels. We feed the images with weak augmentation to the teacher model to guide the learning of the student model. Our method is simple and flexible, and achieves surprising results on three cross-domain segmentation tasks.
[[2305.07751] Private and Communication-Efficient Algorithms for Entropy Estimation](http://arxiv.org/abs/2305.07751) #privacy
Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples, and also with minimizing the amount of data they must transmit to the server. We give improved private and communication-efficient algorithms for estimating several popular measures of the entropy of a distribution. All of our algorithms have constant communication cost and satisfy local differential privacy. For a joint distribution over many variables whose conditional independence is given by a tree, we describe algorithms for estimating Shannon entropy that require a number of samples that is linear in the number of variables, compared to the quadratic sample complexity of prior work. We also describe an algorithm for estimating Gini entropy whose sample complexity has no dependence on the support size of the distribution and can be implemented using a single round of concurrent communication between the users and the server. In contrast, the previously best-known algorithm has high communication cost and requires the server to facilitate interaction between the users. Finally, we describe an algorithm for estimating collision entropy that generalizes the best known algorithm to the private and communication-efficient setting.
[[2305.07803] The Case for the Anonymization of Offloaded Computation](http://arxiv.org/abs/2305.07803) #privacy
Computation offloading (often to external computing resources over a network) has become a necessity for modern applications. At the same time, the proliferation of machine learning techniques has empowered malicious actors to use such techniques in order to breach the privacy of the execution process for offloaded computations. This can enable malicious actors to identify offloaded computations and infer their nature based on computation characteristics that they may have access to even if they do not have direct access to the computation code. In this paper, we first demonstrate that even non-sophisticated machine learning algorithms can accurately identify offloaded computations. We then explore the design space of anonymizing offloaded computations through the realization of a framework, called Camouflage. Camouflage features practical mechanisms to conceal characteristics related to the execution of computations, which can be used by malicious actors to identify computations and orchestrate further attacks based on identified computations. Our evaluation demonstrated that Camouflage can impede the ability of malicious actors to identify executed computations by up to 60%, while incurring modest overheads for the anonymization of computations.
[[2305.08107] Balancing Privacy and Utility of Spatio-Temporal Data for Taxi-Demand Prediction](http://arxiv.org/abs/2305.08107) #privacy
Taxi-demand prediction is an important application of machine learning that enables taxi-providing facilities to optimize their operations and city planners to improve transportation infrastructure and services. However, the use of sensitive data in these systems raises concerns about privacy and security. In this paper, we propose the use of federated learning for taxi-demand prediction that allows multiple parties to train a machine learning model on their own data while keeping the data private and secure. This can enable organizations to build models on data they otherwise would not be able to access. Despite its potential benefits, federated learning for taxi-demand prediction poses several technical challenges, such as class imbalance, data scarcity among some parties, and the need to ensure model generalization to accommodate diverse facilities and geographic regions. To effectively address these challenges, we propose a system that utilizes region-independent encoding for geographic lat-long coordinates. By doing so, the proposed model is not limited to a specific region, enabling it to perform optimally in any area. Furthermore, we employ cost-sensitive learning and various regularization techniques to mitigate issues related to data scarcity and overfitting, respectively. Evaluation with real-world data collected from 16 taxi service providers in Japan over a period of six months showed the proposed system predicted demand level accurately within 1\% error compared to a single model trained with integrated data. The system also effectively defended against membership inference attacks on passenger data.
[[2305.08138] Traceable mixnets](http://arxiv.org/abs/2305.08138) #privacy
We introduce the notion of \emph{traceable mixnets}. In a traditional mixnet, multiple mix-servers jointly permute and decrypt a list of ciphertexts to produce a list of plaintexts, along with a proof of correctness, such that the association between individual ciphertexts and plaintexts remains completely hidden. However, in many applications, the privacy-utility tradeoff requires answering some specific queries about this association, without revealing any information beyond the query result. We consider queries of the following type: a) given a ciphertext in the mixnet input list, whether it encrypts one of a given subset of plaintexts in the output list, and b) given a plaintext in the mixnet output list, whether it is a decryption of one of a given subset of ciphertexts in the input list. Traceable mixnets allow the mix-servers to jointly prove answers to the above queries to a querier such that neither the querier nor a threshold number of mix-servers learn any information beyond the query result. If the querier is not corrupted, the corrupted mix-servers do not even learn the query result. We propose a construction of a traceable mixnet using novel distributed zero-knowledge proofs of \emph{set membership} and a related primitive we introduce called \emph{reverse set membership}. Although the set membership problem has been studied in the single-prover setting, the main challenge in our distributed setting lies in making sure that none of the mix-servers learn the association between ciphertexts and plaintexts during the proof. Our construction is faster than existing techniques by at least one order of magnitude.
[[2305.08034] DNN-Defender: An in-DRAM Deep Neural Network Defense Mechanism for Adversarial Weight Attack](http://arxiv.org/abs/2305.08034) #defense
With deep learning deployed in many security-sensitive areas, machine learning security is becoming progressively important. Recent studies demonstrate attackers can exploit system-level techniques exploiting the RowHammer vulnerability of DRAM to deterministically and precisely flip bits in Deep Neural Networks (DNN) model weights to affect inference accuracy. The existing defense mechanisms are software-based, such as weight reconstruction requiring expensive training overhead or performance degradation. On the other hand, generic hardware-based victim-/aggressor-focused mechanisms impose expensive hardware overheads and preserve the spatial connection between victim and aggressor rows. In this paper, we present the first DRAM-based victim-focused defense mechanism tailored for quantized DNNs, named DNN-Defender that leverages the potential of in-DRAM swapping to withstand the targeted bit-flip attacks. Our results indicate that DNN-Defender can deliver a high level of protection downgrading the performance of targeted RowHammer attacks to a random attack level. In addition, the proposed defense has no accuracy drop on CIFAR-10 and ImageNet datasets without requiring any software training or incurring additional hardware overhead.
[[2305.08192] Diffusion Models for Imperceptible and Transferable Adversarial Attack](http://arxiv.org/abs/2305.08192) #attack
Many existing adversarial attacks generate $L_p$-norm perturbations on image RGB space. Despite some achievements in transferability and attack success rate, the crafted adversarial examples are easily perceived by human eyes. Towards visual imperceptibility, some recent works explore unrestricted attacks without $L_p$-norm constraints, yet lacking transferability of attacking black-box models. In this work, we propose a novel imperceptible and transferable attack by leveraging both the generative and discriminative power of diffusion models. Specifically, instead of direct manipulation in pixel space, we craft perturbations in latent space of diffusion models. Combined with well-designed content-preserving structures, we can generate human-insensitive perturbations embedded with semantic clues. For better transferability, we further "deceive" the diffusion model which can be viewed as an additional recognition surrogate, by distracting its attention away from the target regions. To our knowledge, our proposed method, DiffAttack, is the first that introduces diffusion models into adversarial attack field. Extensive experiments on various model structures (including CNNs, Transformers, MLPs) and defense methods have demonstrated our superiority over other attack methods.
[[2305.08037] ChargeX: Exploring State Switching Attack on Electric Vehicle Charging Systems](http://arxiv.org/abs/2305.08037) #attack
Electric Vehicle (EV) has become one of the promising solutions to the
ever-evolving environmental and energy crisis. The key to the wide adoption of
EVs is a pervasive charging infrastructure, composed of both private/home
chargers and public/commercial charging stations. The security of EV charging,
however, has not been thoroughly investigated. This paper investigates the
communication mechanisms between the chargers and EVs, and exposes the lack of
protection on the authenticity in the SAE J1772 charging control protocol. To
showcase our discoveries, we propose a new class of attacks, ChargeX, which
aims to manipulate the charging states or charging rates of EV chargers with
the goal of disrupting the charging schedules, causing a denial of service
(DoS), or degrading the battery performance. ChargeX inserts a hardware attack
circuit to strategically modify the charging control signals. We design and
implement multiple attack systems, and evaluate the attacks on a public
charging station and two home chargers using a simulated vehicle load in the
lab environment. Extensive experiments on different types of chargers
demonstrate the effectiveness and generalization of ChargeX. Specifically, we
demonstrate that ChargeX can force the switching of an EV's charging state from
stand by" to
charging", even when the vehicle is not in the charging state.
We further validate the attacks on a Tesla Model 3 vehicle to demonstrate the
disruptive impacts of ChargeX. If deployed, ChargeX may significantly demolish
people's trust in the EV charging infrastructure.
[[2305.07687] Mastering Percolation-like Games with Deep Learning](http://arxiv.org/abs/2305.07687) #attack
Though robustness of networks to random attacks has been widely studied, intentional destruction by an intelligent agent is not tractable with previous methods. Here we devise a single-player game on a lattice that mimics the logic of an attacker attempting to destroy a network. The objective of the game is to disable all nodes in the fewest number of steps. We develop a reinforcement learning approach using deep Q-learning that is capable of learning to play this game successfully, and in so doing, to optimally attack a network. Because the learning algorithm is universal, we train agents on different definitions of robustness and compare the learned strategies. We find that superficially similar definitions of robustness induce different strategies in the trained agent, implying that optimally attacking or defending a network is sensitive the particular objective. Our method provides a new approach to understand network robustness, with potential applications to other discrete processes in disordered systems.
[[2305.07812] Lightweight Delivery Detection on Doorbell Cameras](http://arxiv.org/abs/2305.07812) #robust
Despite recent advances in video-based action recognition and robust spatio-temporal modeling, most of the proposed approaches rely on the abundance of computational resources to afford running huge and computation-intensive convolutional or transformer-based neural networks to obtain satisfactory results. This limits the deployment of such models on edge devices with limited power and computing resources. In this work we investigate an important smart home application, video based delivery detection, and present a simple and lightweight pipeline for this task that can run on resource-constrained doorbell cameras. Our proposed pipeline relies on motion cues to generate a set of coarse activity proposals followed by their classification with a mobile-friendly 3DCNN network. For training we design a novel semi-supervised attention module that helps the network to learn robust spatio-temporal features and adopt an evidence-based optimization objective that allows for quantifying the uncertainty of predictions made by the network. Experimental results on our curated delivery dataset shows the significant effectiveness of our pipeline compared to alternatives and highlights the benefits of our training phase novelties to achieve free and considerable inference-time performance gains.
[[2305.08031] On enhancing the robustness of Vision Transformers: Defensive Diffusion](http://arxiv.org/abs/2305.08031) #robust
Privacy and confidentiality of medical data are of utmost importance in healthcare settings. ViTs, the SOTA vision model, rely on large amounts of patient data for training, which raises concerns about data security and the potential for unauthorized access. Adversaries may exploit vulnerabilities in ViTs to extract sensitive patient information and compromising patient privacy. This work address these vulnerabilities to ensure the trustworthiness and reliability of ViTs in medical applications. In this work, we introduced a defensive diffusion technique as an adversarial purifier to eliminate adversarial noise introduced by attackers in the original image. By utilizing the denoising capabilities of the diffusion model, we employ a reverse diffusion process to effectively eliminate the adversarial noise from the attack sample, resulting in a cleaner image that is then fed into the ViT blocks. Our findings demonstrate the effectiveness of the diffusion model in eliminating attack-agnostic adversarial noise from images. Additionally, we propose combining knowledge distillation with our framework to obtain a lightweight student model that is both computationally efficient and robust against gray box attacks. Comparison of our method with a SOTA baseline method, SEViT, shows that our work is able to outperform the baseline. Extensive experiments conducted on a publicly available Tuberculosis X-ray dataset validate the computational efficiency and improved robustness achieved by our proposed architecture.
[[2305.07969] GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content](http://arxiv.org/abs/2305.07969) #robust
This paper presents a novel approach for detecting ChatGPT-generated vs. human-written text using language models. To this end, we first collected and released a pre-processed dataset named OpenGPTText, which consists of rephrased content generated using ChatGPT. We then designed, implemented, and trained two different models for text classification, using Robustly Optimized BERT Pretraining Approach (RoBERTa) and Text-to-Text Transfer Transformer (T5), respectively. Our models achieved remarkable results, with an accuracy of over 97% on the test dataset, as evaluated through various metrics. Furthermore, we conducted an interpretability study to showcase our model's ability to extract and differentiate key features between human-written and ChatGPT-generated text. Our findings provide important insights into the effective use of language models to detect generated text.
[[2305.07731] Predicting COVID-19 pandemic by spatio-temporal graph neural networks: A New Zealand's study](http://arxiv.org/abs/2305.07731) #robust
Modeling and simulations of pandemic dynamics play an essential role in understanding and addressing the spreading of highly infectious diseases such as COVID-19. In this work, we propose a novel deep learning architecture named Attention-based Multiresolution Graph Neural Networks (ATMGNN) that learns to combine the spatial graph information, i.e. geographical data, with the temporal information, i.e. timeseries data of number of COVID-19 cases, to predict the future dynamics of the pandemic. The key innovation is that our method can capture the multiscale structures of the spatial graph via a learning to cluster algorithm in a data-driven manner. This allows our architecture to learn to pick up either local or global signals of a pandemic, and model both the long-range spatial and temporal dependencies. Importantly, we collected and assembled a new dataset for New Zealand. We established a comprehensive benchmark of statistical methods, temporal architectures, graph neural networks along with our spatio-temporal model. We also incorporated socioeconomic cross-sectional data to further enhance our prediction. Our proposed model have shown highly robust predictions and outperformed all other baselines in various metrics for our new dataset of New Zealand along with existing datasets of England, France, Italy and Spain. For a future work, we plan to extend our work for real-time prediction and global scale. Our data and source code are publicly available at https://github.com/HySonLab/pandemic_tgnn
[[2305.07872] SPP-CNN: An Efficient Framework for Network Robustness Prediction](http://arxiv.org/abs/2305.07872) #robust
This paper addresses the robustness of a network to sustain its connectivity and controllability against malicious attacks. This kind of network robustness is typically measured by the time-consuming attack simulation, which returns a sequence of values that record the remaining connectivity and controllability after a sequence of node- or edge-removal attacks. For improvement, this paper develops an efficient framework for network robustness prediction, the spatial pyramid pooling convolutional neural network (SPP-CNN). The new framework installs a spatial pyramid pooling layer between the convolutional and fully-connected layers, overcoming the common mismatch issue in the CNN-based prediction approaches and extending its generalizability. Extensive experiments are carried out by comparing SPP-CNN with three state-of-the-art robustness predictors, namely a CNN-based and two graph neural networks-based frameworks. Synthetic and real-world networks, both directed and undirected, are investigated. Experimental results demonstrate that the proposed SPP-CNN achieves better prediction performances and better generalizability to unknown datasets, with significantly lower time-consumption, than its counterparts.
[[2305.07895] On the Hidden Mystery of OCR in Large Multimodal Models](http://arxiv.org/abs/2305.07895) #extraction
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. It remains less explored about their efficacy in text-related visual tasks. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition, text-based visual question answering, and key information extraction. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They also display indifference towards text length and have limited capabilities in detecting fine-grained features in images. Consequently, these results demonstrate that even the current most powerful large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. Evaluation pipeline will be available at https://github.com/Yuliang-Liu/MultimodalOCR.
[[2305.08135] Distinguish Before Answer: Generating Contrastive Explanation as Knowledge for Commonsense Question Answering](http://arxiv.org/abs/2305.08135) #extraction
Existing knowledge-enhanced methods have achieved remarkable results in certain QA tasks via obtaining diverse knowledge from different knowledge bases. However, limited by the properties of retrieved knowledge, they still have trouble benefiting from both the knowledge relevance and distinguishment simultaneously. To address the challenge, we propose CPACE, a Concept-centric Prompt-bAsed Contrastive Explanation Generation model, which aims to convert obtained symbolic knowledge into a contrastive explanation for better distinguishing the differences among given candidates. Firstly, following previous works, we retrieve different types of symbolic knowledge with a concept-centric knowledge extraction module. After that, we generate corresponding contrastive explanations using acquired symbolic knowledge and explanation prompts as guidance for better modeling the knowledge distinguishment and interpretability. Finally, we regard the generated contrastive explanation as external knowledge for downstream task enhancement. We conduct a series of experiments on three widely-used question-answering datasets: CSQA, QASC, and OBQA. Experimental results demonstrate that with the help of generated contrastive explanation, our CPACE model achieves new SOTA on CSQA (89.8% on the testing set, 0.9% higher than human performance), and gains impressive improvement on QASC and OBQA (4.2% and 3.5%, respectively).
[[2305.07854] A Federated Learning-based Industrial Health Prognostics for Heterogeneous Edge Devices using Matched Feature Extraction](http://arxiv.org/abs/2305.07854) #extraction
Data-driven industrial health prognostics require rich training data to develop accurate and reliable predictive models. However, stringent data privacy laws and the abundance of edge industrial data necessitate decentralized data utilization. Thus, the industrial health prognostics field is well suited to significantly benefit from federated learning (FL), a decentralized and privacy-preserving learning technique. However, FL-based health prognostics tasks have hardly been investigated due to the complexities of meaningfully aggregating model parameters trained from heterogeneous data to form a high performing federated model. Specifically, data heterogeneity among edge devices, stemming from dissimilar degradation mechanisms and unequal dataset sizes, poses a critical statistical challenge for developing accurate federated models. We propose a pioneering FL-based health prognostic model with a feature similarity-matched parameter aggregation algorithm to discriminatingly learn from heterogeneous edge data. The algorithm searches across the heterogeneous locally trained models and matches neurons with probabilistically similar feature extraction functions first, before selectively averaging them to form the federated model parameters. As the algorithm only averages similar neurons, as opposed to conventional naive averaging of coordinate-wise neurons, the distinct feature extractors of local models are carried over with less dilution to the resultant federated model. Using both cyclic degradation data of Li-ion batteries and non-cyclic data of turbofan engines, we demonstrate that the proposed method yields accuracy improvements as high as 44.5\% and 39.3\% for state-of-health estimation and remaining useful life estimation, respectively.
[[2305.07845] Understanding Model Averaging in Federated Learning on Heterogeneous Data](http://arxiv.org/abs/2305.07845) #federate
Model averaging, a widely adopted technique in federated learning (FL), aggregates multiple client models trained on heterogeneous data to obtain a well-performed global model. However, the rationale behind its success is not well understood. To shed light on this issue, we investigate the geometric properties of model averaging by visualizing the loss/error landscape. The geometrical visualization shows that the client models surround the global model within a common basin, and the global model may deviate from the bottom of the basin even though it performs better than the client models. To further understand this phenomenon, we decompose the expected prediction error of the global model into five factors related to client models. Specifically, we find that the global-model error after early training mainly comes from i) the client-model error on non-overlapping data between client datasets and the global dataset and ii) the maximal distance between the global and client models. Inspired by these findings, we propose adopting iterative moving averaging (IMA) on global models to reduce the prediction error and limiting client exploration to control the maximal distance at the late training. Our experiments demonstrate that IMA significantly improves the accuracy and training speed of existing FL methods on benchmark datasets with various data heterogeneity.
[[2305.08070] A Survey of Federated Evaluation in Federated Learning](http://arxiv.org/abs/2305.08070) #federate
In traditional machine learning, it is trivial to conduct model evaluation since all data samples are managed centrally by a server. However, model evaluation becomes a challenging problem in federated learning (FL), which is called federated evaluation in this work. This is because clients do not expose their original data to preserve data privacy. Federated evaluation plays a vital role in client selection, incentive mechanism design, malicious attack detection, etc. In this paper, we provide the first comprehensive survey of existing federated evaluation methods. Moreover, we explore various applications of federated evaluation for enhancing FL performance and finally present future research directions by envisioning some challenges.
[[2305.08104] Federated TD Learning over Finite-Rate Erasure Channels: Linear Speedup under Markovian Sampling](http://arxiv.org/abs/2305.08104) #federate
Federated learning (FL) has recently gained much attention due to its effectiveness in speeding up supervised learning tasks under communication and privacy constraints. However, whether similar speedups can be established for reinforcement learning remains much less understood theoretically. Towards this direction, we study a federated policy evaluation problem where agents communicate via a central aggregator to expedite the evaluation of a common policy. To capture typical communication constraints in FL, we consider finite capacity up-link channels that can drop packets based on a Bernoulli erasure model. Given this setting, we propose and analyze QFedTD - a quantized federated temporal difference learning algorithm with linear function approximation. Our main technical contribution is to provide a finite-sample analysis of QFedTD that (i) highlights the effect of quantization and erasures on the convergence rate; and (ii) establishes a linear speedup w.r.t. the number of agents under Markovian sampling. Notably, while different quantization mechanisms and packet drop models have been extensively studied in the federated learning, distributed optimization, and networked control systems literature, our work is the first to provide a non-asymptotic analysis of their effects in multi-agent and federated reinforcement learning.
[[2305.07892] DAC-MR: Data Augmentation Consistency Based Meta-Regularization for Meta-Learning](http://arxiv.org/abs/2305.07892) #fair
Meta learning recently has been heavily researched and helped advance the contemporary machine learning. However, achieving well-performing meta-learning model requires a large amount of training tasks with high-quality meta-data representing the underlying task generalization goal, which is sometimes difficult and expensive to obtain for real applications. Current meta-data-driven meta-learning approaches, however, are fairly hard to train satisfactory meta-models with imperfect training tasks. To address this issue, we suggest a meta-knowledge informed meta-learning (MKIML) framework to improve meta-learning by additionally integrating compensated meta-knowledge into meta-learning process. We preliminarily integrate meta-knowledge into meta-objective via using an appropriate meta-regularization (MR) objective to regularize capacity complexity of the meta-model function class to facilitate better generalization on unseen tasks. As a practical implementation, we introduce data augmentation consistency to encode invariance as meta-knowledge for instantiating MR objective, denoted by DAC-MR. The proposed DAC-MR is hopeful to learn well-performing meta-models from training tasks with noisy, sparse or unavailable meta-data. We theoretically demonstrate that DAC-MR can be treated as a proxy meta-objective used to evaluate meta-model without high-quality meta-data. Besides, meta-data-driven meta-loss objective combined with DAC-MR is capable of achieving better meta-level generalization. 10 meta-learning tasks with different network architectures and benchmarks substantiate the capability of our DAC-MR on aiding meta-model learning. Fine performance of DAC-MR are obtained across all settings, and are well-aligned with our theoretical insights. This implies that our DAC-MR is problem-agnostic, and hopeful to be readily applied to extensive meta-learning problems and tasks.
[[2305.07789] Answering Complex Questions over Text by Hybrid Question Parsing and Execution](http://arxiv.org/abs/2305.07789) #interpretability
The dominant paradigm of textual question answering systems is based on end-to-end neural networks, which excels at answering natural language questions but falls short on complex ones. This stands in contrast to the broad adaptation of semantic parsing approaches over structured data sources (e.g., relational database, knowledge graphs), that convert natural language questions to logical forms and execute them with query engines. Towards combining the strengths of neural and symbolic methods, we propose a framework of question parsing and execution on textual QA. It comprises two central pillars: (1) We parse the question of varying complexity into an intermediate representation, named H-expression, which is composed of simple questions as the primitives and symbolic operations representing the relationships among them; (2) To execute the resulting H-expressions, we design a hybrid executor, which integrates the deterministic rules to translate the symbolic operations with a drop-in neural reader network to answer each decomposed simple question. Hence, the proposed framework can be viewed as a top-down question parsing followed by a bottom-up answer backtracking. The resulting H-expressions closely guide the execution process, offering higher precision besides better interpretability while still preserving the advantages of the neural readers for resolving its primitive elements. Our extensive experiments on MuSiQue, 2WikiQA, HotpotQA, and NQ show that the proposed parsing and hybrid execution framework outperforms existing approaches in supervised, few-shot, and zero-shot settings, while also effectively exposing its underlying reasoning process.
[[2305.07982] Zero-shot Faithful Factual Error Correction](http://arxiv.org/abs/2305.07982) #interpretability
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' ability to identify and correct factual errors, we present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence. Our zero-shot framework outperforms fully-supervised approaches, as demonstrated by experiments on the FEVER and SciFact datasets, where our outputs are shown to be more faithful. More importantly, the decomposability nature of our framework inherently provides interpretability. Additionally, to reveal the most suitable metrics for evaluating factual error corrections, we analyze the correlation between commonly used metrics with human judgments in terms of three different dimensions regarding intelligibility and faithfulness.
[[2305.08010] ProKnow: Process Knowledge for Safety Constrained and Explainable Question Generation for Mental Health Diagnostic Assistance](http://arxiv.org/abs/2305.08010) #explainability
Current Virtual Mental Health Assistants (VMHAs) provide counseling and suggestive care. They refrain from patient diagnostic assistance because they lack training in safety-constrained and specialized clinical process knowledge. In this work, we define Proknow as an ordered set of information that maps to evidence-based guidelines or categories of conceptual understanding to experts in a domain. We also introduce a new dataset of diagnostic conversations guided by safety constraints and Proknow that healthcare professionals use. We develop a method for natural language question generation (NLG) that collects diagnostic information from the patient interactively. We demonstrate the limitations of using state-of-the-art large-scale language models (LMs) on this dataset. Our algorithm models the process knowledge through explicitly modeling safety, knowledge capture, and explainability. LMs augmented with ProKnow guided method generated 89% safer questions in the depression and anxiety domain. The Explainability of the generated question is assessed by computing similarity with concepts in depression and anxiety knowledge bases. Overall, irrespective of the type of LMs augmented with our ProKnow, we achieved an average 82% improvement over simple pre-trained LMs on safety, explainability, and process-guided question generation. We qualitatively and quantitatively evaluate the efficacy of the proposed ProKnow-guided methods by introducing three new evaluation metrics for safety, explainability, and process knowledge adherence.
[[2305.08092] Meta-DM: Applications of Diffusion Models on Few-Shot Learning](http://arxiv.org/abs/2305.08092) #diffusion
In the field of few-shot learning (FSL), extensive research has focused on improving network structures and training strategies. However, the role of data processing modules has not been fully explored. Therefore, in this paper, we propose Meta-DM, a generalized data processing module for FSL problems based on diffusion models. Meta-DM is a simple yet effective module that can be easily integrated with existing FSL methods, leading to significant performance improvements in both supervised and unsupervised settings. We provide a theoretical analysis of Meta-DM and evaluate its performance on several algorithms. Our experiments show that combining Meta-DM with certain methods achieves state-of-the-art results.
[[2305.07783] ROI-based Deep Image Compression with Swin Transformers](http://arxiv.org/abs/2305.07783) #transformer
Encoding the Region Of Interest (ROI) with better quality than the background has many applications including video conferencing systems, video surveillance and object-oriented vision tasks. In this paper, we propose a ROI-based image compression framework with Swin transformers as main building blocks for the autoencoder network. The binary ROI mask is integrated into different layers of the network to provide spatial information guidance. Based on the ROI mask, we can control the relative importance of the ROI and non-ROI by modifying the corresponding Lagrange multiplier $ \lambda $ for different regions. Experimental results show our model achieves higher ROI PSNR than other methods and modest average PSNR for human evaluation. When tested on models pre-trained with original images, it has superior object detection and instance segmentation performance on the COCO validation dataset.
[[2305.07840] CEMFormer: Learning to Predict Driver Intentions from In-Cabin and External Cameras via Spatial-Temporal Transformers](http://arxiv.org/abs/2305.07840) #transformer
Driver intention prediction seeks to anticipate drivers' actions by analyzing their behaviors with respect to surrounding traffic environments. Existing approaches primarily focus on late-fusion techniques, and neglect the importance of maintaining consistency between predictions and prevailing driving contexts. In this paper, we introduce a new framework called Cross-View Episodic Memory Transformer (CEMFormer), which employs spatio-temporal transformers to learn unified memory representations for an improved driver intention prediction. Specifically, we develop a spatial-temporal encoder to integrate information from both in-cabin and external camera views, along with episodic memory representations to continuously fuse historical data. Furthermore, we propose a novel context-consistency loss that incorporates driving context as an auxiliary supervision signal to improve prediction performance. Comprehensive experiments on the Brain4Cars dataset demonstrate that CEMFormer consistently outperforms existing state-of-the-art methods in driver intention prediction.
[[2305.07931] GSB: Group Superposition Binarization for Vision Transformer with Limited Training Samples](http://arxiv.org/abs/2305.07931) #transformer
Affected by the massive amount of parameters, ViT usually suffers from serious overfitting problems with a relatively limited number of training samples. In addition, ViT generally demands heavy computing resources, which limit its deployment on resource-constrained devices. As a type of model-compression method,model binarization is potentially a good choice to solve the above problems. Compared with the full-precision one, the model with the binarization method replaces complex tensor multiplication with simple bit-wise binary operations and represents full-precision model parameters and activations with only 1-bit ones, which potentially solves the problem of model size and computational complexity, respectively. In this paper, we find that the decline of the accuracy of the binary ViT model is mainly due to the information loss of the Attention module and the Value vector. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. Furthermore, in order to further improve the performance of the binarization model, we have investigated the gradient calculation procedure in the binarization process and derived more proper gradient calculation equations for GSB to reduce the influence of gradient mismatch. Then, the knowledge distillation technique is introduced to alleviate the performance degradation caused by model binarization. Experiments on three datasets with limited numbers of training samples demonstrate that the proposed GSB model achieves state-of-the-art performance among the binary quantization schemes and exceeds its full-precision counterpart on some indicators.
[[2305.07979] A Two-Stage Real Image Deraining Method for GT-RAIN Challenge CVPR 2023 Workshop UG$^{\textbf{2}}$+ Track 3](http://arxiv.org/abs/2305.07979) #transformer
In this technical report, we briefly introduce the solution of our team HUST\li VIE for GT-Rain Challenge in CVPR 2023 UG$^{2}$+ Track 3. In this task, we propose an efficient two-stage framework to reconstruct a clear image from rainy frames. Firstly, a low-rank based video deraining method is utilized to generate pseudo GT, which fully takes the advantage of multi and aligned rainy frames. Secondly, a transformer-based single image deraining network Uformer is implemented to pre-train on large real rain dataset and then fine-tuned on pseudo GT to further improve image restoration. Moreover, in terms of visual pleasing effect, a comprehensive image processor module is utilized at the end of pipeline. Our overall framework is elaborately designed and able to handle both heavy rainy and foggy sequences provided in the final testing phase. Finally, we rank 1st on the average structural similarity (SSIM) and rank 2nd on the average peak signal-to-noise ratio (PSNR). Our code is available at https://github.com/yunguo224/UG2_Deraining.
[[2305.08190] TSGN: Temporal Scene Graph Neural Networks with Projected Vectorized Representation for Multi-Agent Motion Prediction](http://arxiv.org/abs/2305.08190) #transformer
Predicting future motions of nearby agents is essential for an autonomous vehicle to take safe and effective actions. In this paper, we propose TSGN, a framework using Temporal Scene Graph Neural Networks with projected vectorized representations for multi-agent trajectory prediction. Projected vectorized representation models the traffic scene as a graph which is constructed by a set of vectors. These vectors represent agents, road network, and their spatial relative relationships. All relative features under this representation are both translationand rotation-invariant. Based on this representation, TSGN captures the spatial-temporal features across agents, road network, interactions among them, and temporal dependencies of temporal traffic scenes. TSGN can predict multimodal future trajectories for all agents simultaneously, plausibly, and accurately. Meanwhile, we propose a Hierarchical Lane Transformer for capturing interactions between agents and road network, which filters the surrounding road network and only keeps the most probable lane segments which could have an impact on the future behavior of the target agent. Without sacrificing the prediction performance, this greatly reduces the computational burden. Experiments show TSGN achieves state-of-the-art performance on the Argoverse motion forecasting benchmar.
[[2305.07759] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](http://arxiv.org/abs/2305.07759) #transformer
Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).
In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.
We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often requires the model's output to be very structures, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency.
We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
[[2305.07893] PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity](http://arxiv.org/abs/2305.07893) #transformer
One of the components of natural language processing that has received a lot of investigation recently is semantic textual similarity. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Calculating the degree of semantic resemblance between two textual pieces, paragraphs, or phrases provided in both monolingual and cross-lingual versions is known as semantic similarity. Cross lingual semantic similarity requires corpora in which there are sentence pairs in both the source and target languages with a degree of semantic similarity between them. Many existing cross lingual semantic similarity models use a machine translation due to the unavailability of cross lingual semantic similarity dataset, which the propagation of the machine translation error reduces the accuracy of the model. On the other hand, when we want to use semantic similarity features for machine translation the same machine translations should not be used for semantic similarity. For Persian, which is one of the low resource languages, no effort has been made in this regard and the need for a model that can understand the context of two languages is felt more than ever. In this article, the corpus of semantic textual similarity between sentences in Persian and English languages has been produced for the first time by using linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375 sentence pairs. Also, different models based on transformers have been fine-tuned using this dataset. The results show that using the PESTS dataset, the Pearson correlation of the XLM ROBERTa model increases from 85.87% to 95.62%.
[[2305.08096] Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation](http://arxiv.org/abs/2305.08096) #transformer
Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge hides in KD is still not clear, which may hinder the development of KD. In this work, we first unravel this mystery from an empirical perspective and show that the knowledge comes from the top-1 predictions of teachers, which also helps us build a potential connection between word- and sequence-level KD. Further, we point out two inherent issues in vanilla word-level KD based on this finding. Firstly, the current objective of KD spreads its focus to whole distributions to learn the knowledge, yet lacks special treatment on the most crucial top-1 information. Secondly, the knowledge is largely covered by the golden information due to the fact that most top-1 predictions of teachers overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a novel method named \textbf{T}op-1 \textbf{I}nformation \textbf{E}nhanced \textbf{K}nowledge \textbf{D}istillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce the learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure to infuse more additional knowledge by distilling on the data without ground-truth targets. Experiments on WMT'14 English-German, WMT'14 English-French and WMT'16 English-Romanian demonstrate that our method can respectively boost Transformer$_{base}$ students by +1.04, +0.60 and +1.11 BLEU scores and significantly outperform the vanilla word-level KD baseline. Besides, our method shows higher generalizability on different teacher-student capacity gaps than existing KD techniques.
[[2305.08173] Croatian Film Review Dataset (Cro-FiReDa): A Sentiment Annotated Dataset of Film Reviews](http://arxiv.org/abs/2305.08173) #transformer
This paper introduces Cro-FiReDa, a sentiment-annotated dataset for Croatian in the domain of movie reviews. The dataset, which contains over 10,000 sentences, has been annotated at the sentence level. In addition to presenting the overall annotation process, we also present benchmark results based on the transformer-based fine-tuning approach
[[2305.08018] DRew: Dynamically Rewired Message Passing with Delay](http://arxiv.org/abs/2305.08018) #transformer
Message passing neural networks (MPNNs) have been shown to suffer from the phenomenon of over-squashing that causes poor performance for tasks relying on long-range interactions. This can be largely attributed to message passing only occurring locally, over a node's immediate neighbours. Rewiring approaches attempting to make graphs `more connected', and supposedly better suited to long-range tasks, often lose the inductive bias provided by distance on the graph since they make distant nodes communicate instantly at every layer. In this paper we propose a framework, applicable to any MPNN architecture, that performs a layer-dependent rewiring to ensure gradual densification of the graph. We also propose a delay mechanism that permits skip connections between nodes depending on the layer and their mutual distance. We validate our approach on several long-range tasks and show that it outperforms graph Transformers and multi-hop MPNNs.
[[2305.08073] HiPerformer: Hierarchically Permutation-Equivariant Transformer for Time Series Forecasting](http://arxiv.org/abs/2305.08073) #transformer
It is imperative to discern the relationships between multiple time series for accurate forecasting. In particular, for stock prices, components are often divided into groups with the same characteristics, and a model that extracts relationships consistent with this group structure should be effective. Thus, we propose the concept of hierarchical permutation-equivariance, focusing on index swapping of components within and among groups, to design a model that considers this group structure. When the prediction model has hierarchical permutation-equivariance, the prediction is consistent with the group relationships of the components. Therefore, we propose a hierarchically permutation-equivariant model that considers both the relationship among components in the same group and the relationship among groups. The experiments conducted on real-world data demonstrate that the proposed method outperforms existing state-of-the-art methods.
[[2305.07804] Dr](http://arxiv.org/abs/2305.07804) #generative
Large Language Models (LLMs) have made significant strides in natural language processing but face challenges in terms of computational expense and inefficiency as they grow in size, especially in domain-specific tasks. Small Language Models (SLMs), on the other hand, often struggle in these tasks due to limited capacity and training data. In this paper, we introduce Dr. LLaMA, a method for improving SLMs through generative data augmentation using LLMs, focusing on medical question-answering tasks and the PubMedQA dataset. Our findings indicate that LLMs effectively refine and diversify existing question-answer pairs, resulting in improved performance of a much smaller model on domain-specific QA datasets after fine-tuning. This study highlights the challenges of using LLMs for domain-specific question answering and suggests potential research directions to address these limitations, ultimately aiming to create more efficient and capable models for specialized applications. We have also made our code available for interested researchers
[[2305.08208] Learning to Generalize for Cross-domain QA](http://arxiv.org/abs/2305.08208) #generative
There have been growing concerns regarding the out-of-domain generalization ability of natural language processing (NLP) models, particularly in question-answering (QA) tasks. Current synthesized data augmentation methods for QA are hampered by increased training costs. To address this issue, we propose a novel approach that combines prompting methods and linear probing then fine-tuning strategy, which does not entail additional cost. Our method has been theoretically and empirically shown to be effective in enhancing the generalization ability of both generative and discriminative models. Our approach outperforms state-of-the-art baselines, with an average increase in F1 score of 4.5%-7.9%. Furthermore, our method can be easily integrated into any pre-trained models and offers a promising solution to the under-explored cross-domain QA task. We release our source code at GitHub*.
[[2305.07671] LatentPINNs: Generative physics-informed neural networks via a latent representation learning](http://arxiv.org/abs/2305.07671) #generative
Physics-informed neural networks (PINNs) are promising to replace conventional partial differential equation (PDE) solvers by offering more accurate and flexible PDE solutions. However, they are hampered by the relatively slow convergence and the need to perform additional, potentially expensive, training for different PDE parameters. To solve this limitation, we introduce latentPINN, a framework that utilizes latent representations of the PDE parameters as additional (to the coordinates) inputs into PINNs and allows for training over the distribution of these parameters. Motivated by the recent progress on generative models, we promote the use of latent diffusion models to learn compressed latent representations of the PDE parameters distribution and act as input parameters to NN functional solutions. We use a two-stage training scheme in which the first stage, we learn the latent representations for the distribution of PDE parameters. In the second stage, we train a physics-informed neural network over inputs given by randomly drawn samples from the coordinate space within the solution domain and samples from the learned latent representation of the PDE parameters. We test the approach on a class of level set equations given by the nonlinear Eikonal equation. We specifically share results corresponding to three different sets of Eikonal parameters (velocity models). The proposed method performs well on new phase velocity models without the need for any additional training.
[[2305.07733] Measuring Surprise in the Wild](http://arxiv.org/abs/2305.07733) #generative
The quantitative measurement of how and when we experience surprise has mostly remained limited to laboratory studies, and its extension to naturalistic settings has been challenging. Here we demonstrate, for the first time, how computational models of surprise rooted in cognitive science and neuroscience combined with state-of-the-art machine learned generative models can be used to detect surprising human behavior in complex, dynamic environments like road traffic. In traffic safety, such models can support the identification of traffic conflicts, modeling of road user response time, and driving behavior evaluation for both human and autonomous drivers. We also present novel approaches to quantify surprise and use naturalistic driving scenarios to demonstrate a number of advantages over existing surprise measures from the literature. Modeling surprising behavior using learned generative models is a novel concept that can be generalized beyond traffic safety to any dynamic real-world environment.
[[2305.08164] Latent Processes Identification From Multi-View Time Series](http://arxiv.org/abs/2305.08164) #generative
Understanding the dynamics of time series data typically requires identifying the unique latent factors for data generation, \textit{a.k.a.}, latent processes identification. Driven by the independent assumption, existing works have made great progress in handling single-view data. However, it is a non-trivial problem that extends them to multi-view time series data because of two main challenges: (i) the complex data structure, such as temporal dependency, can result in violation of the independent assumption; (ii) the factors from different views are generally overlapped and are hard to be aggregated to a complete set. In this work, we propose a novel framework MuLTI that employs the contrastive learning technique to invert the data generative process for enhanced identifiability. Additionally, MuLTI integrates a permutation mechanism that merges corresponding overlapped variables by the establishment of an optimal transport formula. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of our method in recovering identifiable latent variables on multi-view time series.
[[2305.07766] NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models](http://arxiv.org/abs/2305.07766) #large language model
Temporal Logic (TL) can be used to rigorously specify complex high-level specification for systems in many engineering applications. The translation between natural language (NL) and TL has been under-explored due to the lack of dataset and generalizable model across different application domains. In this paper, we propose an accurate and generalizable transformation framework of English instructions from NL to TL, exploring the use of Large Language Models (LLMs) at multiple stages. Our contributions are twofold. First, we develop a framework to create a dataset of NL-TL pairs combining LLMs and human annotation. We publish a dataset with 28K NL-TL pairs. Then, we finetune T5 models on the lifted versions (i.e., the specific Atomic Propositions (AP) are hidden) of the NL and TL. The enhanced generalizability originates from two aspects: 1) Usage of lifted NL-TL characterizes common logical structures, without constraints of specific domains. 2) Application of LLMs in dataset creation largely enhances corpus richness. We test the generalization of trained models on five varied domains. To achieve full NL-TL transformation, we either combine the lifted model with AP recognition task or do the further finetuning on each specific domain. During the further finetuning, our model achieves higher accuracy (>95%) using only <10% training data, compared with the baseline sequence to sequence (Seq2Seq) model.
[[2305.07868] Bridging History with AI A Comparative Evaluation of GPT 3](http://arxiv.org/abs/2305.07868) #large language model
The rapid proliferation of information in the digital era underscores the importance of accurate historical representation and interpretation. While artificial intelligence has shown promise in various fields, its potential for historical fact-checking and gap-filling remains largely untapped. This study evaluates the performance of three large language models LLMs GPT 3.5, GPT 4, and GoogleBARD in the context of predicting and verifying historical events based on given data. A novel metric, Distance to Reality (DTR), is introduced to assess the models' outputs against established historical facts. The results reveal a substantial potential for AI in historical studies, with GPT 4 demonstrating superior performance. This paper underscores the need for further research into AI's role in enriching our understanding of the past and bridging historical knowledge gaps.
[[2305.07922] CodeT5+: Open Code Large Language Models for Code Understanding and Generation](http://arxiv.org/abs/2305.07922) #large language model
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.
[[2305.08088] Make Prompt-based Black-Box Tuning Colorful: Boosting Model Generalization from Three Orthogonal Perspectives](http://arxiv.org/abs/2305.08088) #large language model
Large language models (LLMs) have shown increasing power on various natural language processing (NLP) tasks. However, tuning these models for downstream tasks usually needs exorbitant costs or is unavailable due to commercial considerations. Recently, black-box tuning has been proposed to address this problem by optimizing task-specific prompts without accessing the gradients and hidden representations. However, most existing works have yet fully exploited the potential of gradient-free optimization under the scenario of few-shot learning. In this paper, we describe BBT-RGB, a suite of straightforward and complementary techniques for enhancing the efficiency and performance of black-box optimization. Specifically, our method includes three plug-and-play components: (1) Two-stage derivative-free optimization strategy that facilitates fast convergence and mitigates overfitting; (2) Automatic verbalizer construction with its novel usage under few-shot settings; (3) Better prompt initialization policy based on instruction search and auto-selected demonstration. Extensive experiments across various tasks on natural language understanding and inference demonstrate the effectiveness of our method. Our codes are publicly available at https://github.com/QiushiSun/BBT-RGB.
[[2305.07857] AURA : Automatic Mask Generator using Randomized Input Sampling for Object Removal](http://arxiv.org/abs/2305.07857) #segmentation
The objective of the image inpainting task is to fill missing regions of an image in a visually plausible way. Recently, deep-learning-based image inpainting networks have generated outstanding results, and some utilize their models as object removers by masking unwanted objects in an image. However, while trying to better remove objects using their networks, the previous works pay less attention to the importance of the input mask. In this paper, we focus on generating the input mask to better remove objects using the off-the-shelf image inpainting network. We propose an automatic mask generator inspired by the explainable AI (XAI) method, whose output can better remove objects than a semantic segmentation mask. The proposed method generates an importance map using randomly sampled input masks and quantitatively estimated scores of the completed images obtained from the random masks. The output mask is selected by a judge module among the candidate masks which are generated from the importance map. We design the judge module to quantitatively estimate the quality of the object removal results. In addition, we empirically find that the evaluation methods used in the previous works reporting object removal results are not appropriate for estimating the performance of an object remover. Therefore, we propose new evaluation metrics (FID$^$ and U-IDS$^$) to properly evaluate the quality of object removers. Experiments confirm that our method shows better performance in removing target class objects than the masks generated from the semantic segmentation maps, and the two proposed metrics make judgments consistent with humans.
[[2305.07943] Illumination-insensitive Binary Descriptor for Visual Measurement Based on Local Inter-patch Invariance](http://arxiv.org/abs/2305.07943) #segmentation
Binary feature descriptors have been widely used in various visual measurement tasks, particularly those with limited computing resources and storage capacities. Existing binary descriptors may not perform well for long-term visual measurement tasks due to their sensitivity to illumination variations. It can be observed that when image illumination changes dramatically, the relative relationship among local patches mostly remains intact. Based on the observation, consequently, this study presents an illumination-insensitive binary (IIB) descriptor by leveraging the local inter-patch invariance exhibited in multiple spatial granularities to deal with unfavorable illumination variations. By taking advantage of integral images for local patch feature computation, a highly efficient IIB descriptor is achieved. It can encode scalable features in multiple spatial granularities, thus facilitating a computationally efficient hierarchical matching from coarse to fine. Moreover, the IIB descriptor can also apply to other types of image data, such as depth maps and semantic segmentation results, when available in some applications. Numerical experiments on both natural and synthetic datasets reveal that the proposed IIB descriptor outperforms state-of-the-art binary descriptors and some testing float descriptors. The proposed IIB descriptor has also been successfully employed in a demo system for long-term visual localization. The code of the IIB descriptor will be publicly available.
[[2305.07954] Image Segmentation via Probabilistic Graph Matching](http://arxiv.org/abs/2305.07954) #segmentation
This work presents an unsupervised and semi-automatic image segmentation approach where we formulate the segmentation as a inference problem based on unary and pairwise assignment probabilities computed using low-level image cues. The inference is solved via a probabilistic graph matching scheme, which allows rigorous incorporation of low level image cues and automatic tuning of parameters. The proposed scheme is experimentally shown to compare favorably with contemporary semi-supervised and unsupervised image segmentation schemes, when applied to contemporary state-of-the-art image sets.
[[2305.08053] SCRNet: a Retinex Structure-based Low-light Enhancement Model Guided by Spatial Consistency](http://arxiv.org/abs/2305.08053) #segmentation
Images captured under low-light conditions are often plagued by several challenges, including diminished contrast, increased noise, loss of fine details, and unnatural color reproduction. These factors can significantly hinder the performance of computer vision tasks such as object detection and image segmentation. As a result, improving the quality of low-light images is of paramount importance for practical applications in the computer vision domain.To effectively address these challenges, we present a novel low-light image enhancement model, termed Spatial Consistency Retinex Network (SCRNet), which leverages the Retinex-based structure and is guided by the principle of spatial consistency.Specifically, our proposed model incorporates three levels of consistency: channel level, semantic level, and texture level, inspired by the principle of spatial consistency.These levels of consistency enable our model to adaptively enhance image features, ensuring more accurate and visually pleasing results.Extensive experimental evaluations on various low-light image datasets demonstrate that our proposed SCRNet outshines existing state-of-the-art methods, highlighting the potential of SCRNet as an effective solution for enhancing low-light images.
[[2305.08196] A Comprehensive Survey on Segment Anything Model for Vision and Beyond](http://arxiv.org/abs/2305.08196) #segmentation
Artificial intelligence (AI) is evolving towards artificial general intelligence, which refers to the ability of an AI system to perform a wide range of tasks and exhibit a level of intelligence similar to that of a human being. This is in contrast to narrow or specialized AI, which is designed to perform specific tasks with a high degree of efficiency. Therefore, it is urgent to design a general class of models, which we term foundation models, trained on broad data that can be adapted to various downstream tasks. The recently proposed segment anything model (SAM) has made significant progress in breaking the boundaries of segmentation, greatly promoting the development of foundation models for computer vision. To fully comprehend SAM, we conduct a survey study. As the first to comprehensively review the progress of segmenting anything task for vision and beyond based on the foundation model of SAM, this work focuses on its applications to various tasks and data types by discussing its historical development, recent progress, and profound impact on broad applications. We first introduce the background and terminology for foundation models including SAM, as well as state-of-the-art methods contemporaneous with SAM that are significant for segmenting anything task. Then, we analyze and summarize the advantages and limitations of SAM across various image processing applications, including software scenes, real-world scenes, and complex scenes. Importantly, some insights are drawn to guide future research to develop more versatile foundation models and improve the architecture of SAM. We also summarize massive other amazing applications of SAM in vision and beyond.