[[2212.02974] CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain](http://arxiv.org/abs/2212.02974) #security
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the texutual domain, pre-trained language models like BERT have shown to be helpful, by providing a good baseline for further fine-tuning. However, due to the domain-knowledge and many technical terms in cybersecurity general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available
[[2212.02626] A Generic Methodology for the Modular Verification of Security Protocol Implementations](http://arxiv.org/abs/2212.02626) #security
Security protocols are essential building blocks of modern IT systems. Subtle flaws in their design or implementation may compromise the security of entire systems. It is, thus, important to prove the absence of such flaws through formal verification. Much existing work focuses on the verification of protocol models, which is not sufficient to show that their implementations are actually secure. Verification techniques for protocol implementations (e.g., via code generation or model extraction) typically impose severe restrictions on the used programming language and code design, which may lead to sub-optimal implementations. In this paper, we present a methodology for the modular verification of strong security properties directly on the level of the protocol implementations. Our methodology leverages state-of-the-art verification logics and tools to support a wide range of implementations and programming languages. We demonstrate its effectiveness by verifying memory safety and security of Go implementations of the Needham-Schroeder-Lowe and WireGuard protocols, including forward secrecy and injective agreement for WireGuard. We also show that our methodology is agnostic to a particular language or program verifier with a prototype implementation for C.
[[2212.02740] Stealthy Peers: Understanding Security Risks of WebRTC-Based Peer-Assisted Video Streaming](http://arxiv.org/abs/2212.02740) #security
As an emerging service for in-browser content delivery, peer-assisted delivery network (PDN) is reported to offload up to 95\% of bandwidth consumption for video streaming, significantly reducing the cost incurred by traditional CDN services. With such benefits, PDN services significantly impact today's video streaming and content delivery model. However, their security implications have never been investigated. In this paper, we report the first effort to address this issue, which is made possible by a suite of methodologies, e.g., an automatic pipeline to discover PDN services and their customers, and a PDN analysis framework to test the potential security and privacy risks of these services. Our study has led to the discovery of 3 representative PDN providers, along with 134 websites and 38 mobile apps as their customers. Most of these PDN customers are prominent video streaming services with millions of monthly visits or app downloads (from Google Play). Also found in our study are another 9 top video/live streaming websites with each equipped with a proprietary PDN solution. Most importantly, our analysis on these PDN services has brought to light a series of security risks, which have never been reported before, including free riding of the public PDN services, video segment pollution, exposure of video viewers' IPs to other peers, and resource squatting. All such risks have been studied through controlled experiments and measurements, under the guidance of our institution's IRB. We have responsibly disclosed these security risks to relevant PDN providers, who have acknowledged our findings, and also discussed the avenues to mitigate these risks.
[[2212.02859] Non-interactive Multi-client Searchable Symmetric Encryption with Small Client Storage](http://arxiv.org/abs/2212.02859) #security
Considerable attention has been paid to dynamic searchable symmetric encryption (DSSE) which allows users to search on dynamically updated encrypted databases. To improve the performance of real-world applications, recent non-interactive multi-client DSSE schemes are targeted at avoiding per-query interaction between data owners and data users. However, existing non-interactive multi-client DSSE schemes do not consider forward privacy or backward privacy, making them exposed to leakage abuse attacks. Besides, most existing DSSE schemes with forward and backward privacy rely on keeping a keyword operation counter or an inverted index, resulting in a heavy storage burden on the data owner side. To address these issues, we propose a non-interactive multi-client DSSE scheme with small client storage, and our proposed scheme can provide both forward privacy and backward privacy. Specifically, we first design a lightweight storage chain structure that binds all keywords to a single state to reduce the storage cost. Then, we present a Hidden Key technique, which preserves non-interactive forward privacy through time range queries, ensuring that data with newer timestamps cannot match earlier time ranges. We conduct extensive experiments to validate our methods, which demonstrate computational efficiency. Moreover, security analysis proves the privacy-preserving property of our methods.
[[2212.02507] FEMa-FS: Finite Element Machines for Feature Selection](http://arxiv.org/abs/2212.02507) #security
Identifying anomalies has become one of the primary strategies towards security and protection procedures in computer networks. In this context, machine learning-based methods emerge as an elegant solution to identify such scenarios and learn irrelevant information so that a reduction in the identification time and possible gain in accuracy can be obtained. This paper proposes a novel feature selection approach called Finite Element Machines for Feature Selection (FEMa-FS), which uses the framework of finite elements to identify the most relevant information from a given dataset. Although FEMa-FS can be applied to any application domain, it has been evaluated in the context of anomaly detection in computer networks. The outcomes over two datasets showed promising results.
[[2212.02611] StyleGAN as a Utility-Preserving Face De-identification Method](http://arxiv.org/abs/2212.02611) #privacy
Several face de-identification methods have been proposed to preserve users' privacy by obscuring their faces. These methods, however, can degrade the quality of photos, and they usually do not preserve the utility of faces, e.g., their age, gender, pose, and facial expression. Recently, advanced generative adversarial network models, such as StyleGAN, have been proposed, which generate realistic, high-quality imaginary faces. In this paper, we investigate the use of StyleGAN in generating de-identified faces through style mixing, where the styles or features of the target face and an auxiliary face get mixed to generate a de-identified face that carries the utilities of the target face. We examined this de-identification method with respect to preserving utility and privacy, by implementing several face detection, verification, and identification attacks. Through extensive experiments and also comparing with two state-of-the-art face de-identification methods, we show that StyleGAN preserves the quality and utility of the faces much better than the other approaches and also by choosing the style mixing levels correctly, it can preserve the privacy of the faces much better than other methods.
[[2212.03177] Privacy-Preserving Visual Localization with Event Cameras](http://arxiv.org/abs/2212.03177) #privacy
We present a robust, privacy-preserving visual localization algorithm using event cameras. While event cameras can potentially make robust localization due to high dynamic range and small motion blur, the sensors exhibit large domain gaps making it difficult to directly apply conventional image-based localization algorithms. To mitigate the gap, we propose applying event-to-image conversion prior to localization which leads to stable localization. In the privacy perspective, event cameras capture only a fraction of visual information compared to normal cameras, and thus can naturally hide sensitive visual details. To further enhance the privacy protection in our event-based pipeline, we introduce privacy protection at two levels, namely sensor and network level. Sensor level protection aims at hiding facial details with lightweight filtering while network level protection targets hiding the entire user's view in private scene applications using a novel neural network inference pipeline. Both levels of protection involve light-weight computation and incur only a small performance loss. We thus project our method to serve as a building block for practical location-based services using event cameras. The code and dataset will be made public through the following link: https://github.com/82magnolia/event_localization.
[[2212.03080] Straggler-Resilient Differentially-Private Decentralized Learning](http://arxiv.org/abs/2212.03080) #privacy
We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for both a skipping scheme (which ignores the stragglers after a timeout) and a baseline scheme that waits for each node to finish before the training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skipping scheme, is identified and empirically validated for logistic regression on a real-world dataset.
[[2212.02620] Benchmarking Offline Reinforcement Learning Algorithms for E-Commerce Order Fraud Evaluation](http://arxiv.org/abs/2212.02620) #protect
Amazon and other e-commerce sites must employ mechanisms to protect their millions of customers from fraud, such as unauthorized use of credit cards. One such mechanism is order fraud evaluation, where systems evaluate orders for fraud risk, and either "pass" the order, or take an action to mitigate high risk. Order fraud evaluation systems typically use binary classification models that distinguish fraudulent and legitimate orders, to assess risk and take action. We seek to devise a system that considers both financial losses of fraud and long-term customer satisfaction, which may be impaired when incorrect actions are applied to legitimate customers. We propose that taking actions to optimize long-term impact can be formulated as a Reinforcement Learning (RL) problem. Standard RL methods require online interaction with an environment to learn, but this is not desirable in high-stakes applications like order fraud evaluation. Offline RL algorithms learn from logged data collected from the environment, without the need for online interaction, making them suitable for our use case. We show that offline RL methods outperform traditional binary classification solutions in SimStore, a simplified e-commerce simulation that incorporates order fraud risk. We also propose a novel approach to training offline RL policies that adds a new loss term during training, to better align policy exploration with taking correct actions.
[[2212.03069] Multiple Perturbation Attack: Attack Pixelwise Under Different $\ell_p$-norms For Better Adversarial Performance](http://arxiv.org/abs/2212.03069) #attack
Adversarial machine learning has been both a major concern and a hot topic recently, especially with the ubiquitous use of deep neural networks in the current landscape. Adversarial attacks and defenses are usually likened to a cat-and-mouse game in which defenders and attackers evolve over the time. On one hand, the goal is to develop strong and robust deep networks that are resistant to malicious actors. On the other hand, in order to achieve that, we need to devise even stronger adversarial attacks to challenge these defense models. Most of existing attacks employs a single $\ell_p$ distance (commonly, $p\in{1,2,\infty}$) to define the concept of closeness and performs steepest gradient ascent w.r.t. this $p$-norm to update all pixels in an adversarial example in the same way. These $\ell_p$ attacks each has its own pros and cons; and there is no single attack that can successfully break through defense models that are robust against multiple $\ell_p$ norms simultaneously. Motivated by these observations, we come up with a natural approach: combining various $\ell_p$ gradient projections on a pixel level to achieve a joint adversarial perturbation. Specifically, we learn how to perturb each pixel to maximize the attack performance, while maintaining the overall visual imperceptibility of adversarial examples. Finally, through various experiments with standardized benchmarks, we show that our method outperforms most current strong attacks across state-of-the-art defense mechanisms, while retaining its ability to remain clean visually.
[[2212.02563] A Large-Scale Analysis of Phishing Websites Hosted on Free Web Hosting Domains](http://arxiv.org/abs/2212.02563) #attack
While phishing attacks have evolved to utilize several obfuscation tactics to evade prevalent detection measures, implementing said measures often requires significant technical competence and logistical overhead from the attacker's perspective. In this work, we identify a family of phishing attacks hosted over Free web-Hosting Domains (FHDs), which can be created and maintained at scale with very little effort while also effectively evading prevalent anti-phishing detection and resisting website takedown. We observed over 8.8k such phishing URLs shared on Twitter and Facebook from February to August 2022 using 24 unique FHDs. Our large-scale analysis of these attacks shows that phishing websites hosted on FHDs remain active on Twitter and Facebook for at least 1.5 times longer than regular phishing URLs. In addition, on average, they have 1.7 times lower coverage from anti-phishing blocklists than regular phishing attacks, with a coverage time also being 3.8 times slower while only having half the number of detections from anti-phishing tools. Moreover, only 23.6% of FHD URLs were removed by the hosting domain a week after their first appearance, with a median removal time of 12.2 hours. We also identified several gaps in the prevalent anti-phishing ecosystem in detecting these threats. Based on our findings, we developed FreePhish, an ML-aided framework that acts as an effective countermeasure to detect and mitigate these URLs automatically and more effectively. By regularly reporting phishing URLs found by FreePhish to FHDs and hosting registrars over a period of two weeks, we note a significant decrease in the time taken to remove these websites. Finally, we also provide FreePhish as a free Chromium web extension that can be utilized to prevent end-users from accessing potential FHD-based phishing attacks.
[[2212.02582] Rethinking Backdoor Data Poisoning Attacks in the Context of Semi-Supervised Learning](http://arxiv.org/abs/2212.02582) #attack
Semi-supervised learning methods can train high-accuracy machine learning models with a fraction of the labeled training samples required for traditional supervised learning. Such methods do not typically involve close review of the unlabeled training samples, making them tempting targets for data poisoning attacks. In this paper we investigate the vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks on the unlabeled samples. We show that simple poisoning attacks that influence the distribution of the poisoned samples' predicted labels are highly effective - achieving an average attack success rate as high as 96.9%. We introduce a generalized attack framework targeting semi-supervised learning methods to better understand and exploit their limitations and to motivate future defense strategies.
[[2212.02701] On the Discredibility of Membership Inference Attacks](http://arxiv.org/abs/2212.02701) #attack
With the wide-spread application of machine learning models, it has become critical to study the potential data leakage of models trained on sensitive data. Recently, various membership inference (MI) attacks are proposed that determines if a sample was part of the training set or not. Although the first generation of MI attacks has been proven to be ineffective in practice, a few recent studies proposed practical MI attacks that achieve reasonable true positive rate at low false positive rate. The question is whether these attacks can be reliably used in practice. We showcase a practical application of membership inference attacks where it is used by an auditor (investigator) to prove to a judge/jury that an auditee unlawfully used sensitive data during training. Then, we show that the auditee can provide a dataset (with potentially unlimited number of samples) to a judge where MI attacks catastrophically fail. Hence, the auditee challenges the credibility of the auditor and can get the case dismissed. More importantly, we show that the auditee does not need to know anything about the MI attack neither a query access to it. In other words, all currently SOTA MI attacks in literature suffer from the same issue. Through comprehensive experimental evaluation, we show that our algorithms can increase the false positive rate from ten to thousands times larger than what auditor claim to the judge. Lastly, we argue that the implication of our algorithms is beyond discredibility: Current membership inference attacks can identify the memorized subpopulations, but they cannot reliably identify which exact sample in the subpopulation was used during training.
[[2212.02648] Spuriosity Rankings: Sorting Data for Spurious Correlation Robustness](http://arxiv.org/abs/2212.02648) #robust
We present a framework for ranking images within their class based on the strength of spurious cues present. By measuring the gap in accuracy on the highest and lowest ranked images (we call this spurious gap), we assess spurious feature reliance for $89$ diverse ImageNet models, finding that even the best models underperform in images with weak spurious presence. However, the effect of spurious cues varies far more dramatically across classes, emphasizing the crucial, often overlooked, class-dependence of the spurious correlation problem. While most spurious features we observe are clarifying (i.e. improving test-time accuracy when present, as is typically expected), we surprisingly find many cases of confusing spurious features, where models perform better when they are absent. We then close the spurious gap by training new classification heads on lowly ranked (i.e. without common spurious cues) images, resulting in improved effective robustness to distribution shifts (ObjectNet, ImageNet-R, ImageNet-Sketch). We also propose a second metric to assess feature reliability, finding that spurious features are generally less reliable than non-spurious (core) ones, though again, spurious features can be more reliable for certain classes. To enable our analysis, we annotated $5,000$ feature-class dependencies over {\it all} of ImageNet as core or spurious using minimal human supervision. Finally, we show the feature discovery and spuriosity ranking framework can be extended to other datasets like CelebA and WaterBirds in a lightweight fashion with only linear layer training, leading to discovering a previously unknown racial bias in the Celeb-A hair classification.
[[2212.02757] Attention-Enhanced Cross-modal Localization Between 360 Images and Point Clouds](http://arxiv.org/abs/2212.02757) #robust
Visual localization plays an important role for intelligent robots and autonomous driving, especially when the accuracy of GNSS is unreliable. Recently, camera localization in LiDAR maps has attracted more and more attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow Field-of-View, thus leading to limited information compared with the omni-directional LiDAR data. To overcome this limitation, we focus on correlating the information of 360 equirectangular images to point clouds, proposing an end-to-end learnable network to conduct cross-modal visual localization by establishing similarity in high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture the salient feature for comparing images and point clouds. We construct several sequences containing 360 equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.
[[2212.02968] Domain Generalization Strategy to Train Classifiers Robust to Spatial-Temporal Shift](http://arxiv.org/abs/2212.02968) #robust
Deep learning-based weather prediction models have advanced significantly in recent years. However, data-driven models based on deep learning are difficult to apply to real-world applications because they are vulnerable to spatial-temporal shifts. A weather prediction task is especially susceptible to spatial-temporal shifts when the model is overfitted to locality and seasonality. In this paper, we propose a training strategy to make the weather prediction model robust to spatial-temporal shifts. We first analyze the effect of hyperparameters and augmentations of the existing training strategy on the spatial-temporal shift robustness of the model. Next, we propose an optimal combination of hyperparameters and augmentation based on the analysis results and a test-time augmentation. We performed all experiments on the W4C22 Transfer dataset and achieved the 1st performance.
[[2212.02992] Sparse Message Passing Network with Feature Integration for Online Multiple Object Tracking](http://arxiv.org/abs/2212.02992) #robust
Existing Multiple Object Tracking (MOT) methods design complex architectures for better tracking performance. However, without a proper organization of input information, they still fail to perform tracking robustly and suffer from frequent identity switches. In this paper, we propose two novel methods together with a simple online Message Passing Network (MPN) to address these limitations. First, we explore different integration methods for the graph node and edge embeddings and put forward a new IoU (Intersection over Union) guided function, which improves long term tracking and handles identity switches. Second, we introduce a hierarchical sampling strategy to construct sparser graphs which allows to focus the training on more difficult samples. Experimental results demonstrate that a simple online MPN with these two contributions can perform better than many state-of-the-art methods. In addition, our association method generalizes well and can also improve the results of private detection based methods.
[[2212.02997] Weakly-Supervised Gaze Estimation from Synthetic Views](http://arxiv.org/abs/2212.02997) #robust
3D gaze estimation is most often tackled as learning a direct mapping between input images and the gaze vector or its spherical coordinates. Recently, it has been shown that pose estimation of the face, body and hands benefits from revising the learning target from few pose parameters to dense 3D coordinates. In this work, we leverage this observation and propose to tackle 3D gaze estimation as regression of 3D eye meshes. We overcome the absence of compatible ground truth by fitting a rigid 3D eyeball template on existing gaze datasets and propose to improve generalization by making use of widely available in-the-wild face images. To this end, we propose an automatic pipeline to retrieve robust gaze pseudo-labels from arbitrary face images and design a multi-view supervision framework to balance their effect during training. In our experiments, our method achieves improvement of 30% compared to state-of-the-art in cross-dataset gaze estimation, when no ground truth data are available for training, and 7% when they are. We make our project publicly available at https://github.com/Vagver/dense3Deyes.
[[2212.03010] GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds](http://arxiv.org/abs/2212.03010) #robust
Despite the tremendous progress of Masked Autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to the inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a \textbf{G}enerative \textbf{D}ecoder for MAE (GD-MAE) to automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach is free from introducing the heuristic design of decoders and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than \textbf{12\%} latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Not only our method reveals state-of-the-art results, but remarkably, we achieve comparable accuracy even with \textbf{20\%} of the labeled data on the Waymo dataset. The code will be released at \url{https://github.com/Nightmare-n/GD-MAE}.
[[2212.03189] Towards Energy Efficient Mobile Eye Tracking for AR Glasses through Optical Sensor Technology](http://arxiv.org/abs/2212.03189) #robust
After the introduction of smartphones and smartwatches, AR glasses are considered the next breakthrough in the field of wearables. While the transition from smartphones to smartwatches was based mainly on established display technologies, the display technology of AR glasses presents a technological challenge. Many display technologies, such as retina projectors, are based on continuous adaptive control of the display based on the user's pupil position. Furthermore, head-mounted systems require an adaptation and extension of established interaction concepts to provide the user with an immersive experience. Eye-tracking is a crucial technology to help AR glasses achieve a breakthrough through optimized display technology and gaze-based interaction concepts. Available eye-tracking technologies, such as VOG, do not meet the requirements of AR glasses, especially regarding power consumption, robustness, and integrability. To further overcome these limitations and push mobile eye-tracking for AR glasses forward, novel laser-based eye-tracking sensor technologies are researched in this thesis. The thesis contributes to a significant scientific advancement towards energy-efficient mobile eye-tracking for AR glasses.
[[2212.03237] RANA: Relightable Articulated Neural Avatars](http://arxiv.org/abs/2212.03237) #robust
We propose RANA, a relightable and articulated neural avatar for the photorealistic synthesis of humans under arbitrary viewpoints, body poses, and lighting. We only require a short video clip of the person to create the avatar and assume no knowledge about the lighting environment. We present a novel framework to model humans while disentangling their geometry, texture, and also lighting environment from monocular RGB videos. To simplify this otherwise ill-posed task we first estimate the coarse geometry and texture of the person via SMPL+D model fitting and then learn an articulated neural representation for photorealistic image generation. RANA first generates the normal and albedo maps of the person in any given target body pose and then uses spherical harmonics lighting to generate the shaded image in the target lighting environment. We also propose to pretrain RANA using synthetic images and demonstrate that it leads to better disentanglement between geometry and texture while also improving robustness to novel body poses. Finally, we also present a new photorealistic synthetic dataset, Relighting Humans, to quantitatively evaluate the performance of the proposed approach.
[[2212.03239] Perspective Fields for Single Image Camera Calibration](http://arxiv.org/abs/2212.03239) #robust
Geometric camera calibration is often required for applications that understand the perspective of the image. We propose perspective fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages as it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields and the predicted Perspective Fields can be converted to calibration parameters easily. We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing.
[[2212.03242] Robust Point Cloud Segmentation with Noisy Annotations](http://arxiv.org/abs/2212.03242) #robust
Point cloud segmentation is a fundamental task in 3D. Despite recent progress on point cloud segmentation with the power of deep networks, current learning methods based on the clean label assumptions may fail with noisy labels. Yet, class labels are often mislabeled at both instance-level and boundary-level in real-world datasets. In this work, we take the lead in solving the instance-level label noise by proposing a Point Noise-Adaptive Learning (PNAL) framework. Compared to noise-robust methods on image tasks, our framework is noise-rate blind, to cope with the spatially variant noise rate specific to point clouds. Specifically, we propose a point-wise confidence selection to obtain reliable labels from the historical predictions of each point. A cluster-wise label correction is proposed with a voting strategy to generate the best possible label by considering the neighbor correlations. To handle boundary-level label noise, we also propose a variant ``PNAL-boundary " with a progressive boundary label cleaning strategy. Extensive experiments demonstrate its effectiveness on both synthetic and real-world noisy datasets. Even with $60\%$ symmetric noise and high-level boundary noise, our framework significantly outperforms its baselines, and is comparable to the upper bound trained on completely clean data. Moreover, we cleaned the popular real-world dataset ScanNetV2 for rigorous experiment. Our code and data is available at https://github.com/pleaseconnectwifi/PNAL.
[[2212.02745] Sources of Noise in Dialogue and How to Deal with Them](http://arxiv.org/abs/2212.02745) #robust
Training dialogue systems often entails dealing with noisy training examples and unexpected user inputs. Despite their prevalence, there currently lacks an accurate survey of dialogue noise, nor is there a clear sense of the impact of each noise type on task performance. This paper addresses this gap by first constructing a taxonomy of noise encountered by dialogue systems. In addition, we run a series of experiments to show how different models behave when subjected to varying levels of noise and types of noise. Our results reveal that models are quite robust to label errors commonly tackled by existing denoising algorithms, but that performance suffers from dialogue-specific noise. Driven by these observations, we design a data cleaning algorithm specialized for conversational settings and apply it as a proof-of-concept for targeted dialogue denoising.
[[2212.02663] Efficient Malware Analysis Using Metric Embeddings](http://arxiv.org/abs/2212.02663) #robust
In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich labeling on malicious and benign PE files using computationally expensive, disassembly-based malicious capabilities. Using these capabilities, we derive several different types of metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation, and combinations thereof. We then examine performance on a variety of transfer tasks performed on the EMBER and SOREL datasets, demonstrating that for several tasks, low-dimensional, computationally efficient metric embeddings maintain performance with little decay, which offers the potential to quickly retrain for a variety of transfer tasks at significantly reduced storage overhead. We conclude with an examination of practical considerations for the use of our proposed embedding approach, such as robustness to adversarial evasion and introduction of task-specific auxiliary objectives to improve performance on mission critical tasks.
[[2212.02895] Loss Adapted Plasticity in Deep Neural Networks to Learn from Data with Unreliable Sources](http://arxiv.org/abs/2212.02895) #robust
When data is streaming from multiple sources, conventional training methods update model weights often assuming the same level of reliability for each source; that is: a model does not consider data quality of each source during training. In many applications, sources can have varied levels of noise or corruption that has negative effects on the learning of a robust deep learning model. A key issue is that the quality of data or labels for individual sources is often not available during training and could vary over time. Our solution to this problem is to consider the mistakes made while training on data originating from sources and utilise this to create a perceived data quality for each source. This paper demonstrates a straight-forward and novel technique that can be applied to any gradient descent optimiser: Update model weights as a function of the perceived reliability of data sources within a wider data set. The algorithm controls the plasticity of a given model to weight updates based on the history of losses from individual data sources. We show that applying this technique can significantly improve model performance when trained on a mixture of reliable and unreliable data sources, and maintain performance when models are trained on data sources that are all considered reliable. All code to reproduce this work's experiments and implement the algorithm in the reader's own models is made available.
[[2212.03130] Deep Learning Methods for Partial Differential Equations and Related Parameter Identification Problems](http://arxiv.org/abs/2212.03130) #robust
Recent years have witnessed a growth in mathematics for deep learning--which seeks a deeper understanding of the concepts of deep learning with mathematics, and explores how to make it more robust--and deep learning for mathematics, where deep learning algorithms are used to solve problems in mathematics. The latter has popularised the field of scientific machine learning where deep learning is applied to problems in scientific computing. Specifically, more and more neural network architectures have been developed to solve specific classes of partial differential equations (PDEs). Such methods exploit properties that are inherent to PDEs and thus solve the PDEs better than classical feed-forward neural networks, recurrent neural networks, and convolutional neural networks. This has had a great impact in the area of mathematical modeling where parametric PDEs are widely used to model most natural and physical processes arising in science and engineering, In this work, we review such methods and extend them for parametric studies as well as for solving the related inverse problems. We equally proceed to show their relevance in some industrial applications.
[[2212.03201] Misspecification in Inverse Reinforcement Learning](http://arxiv.org/abs/2212.03201) #robust
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
[[2212.03228] ISAACS: Iterative Soft Adversarial Actor-Critic for Safety](http://arxiv.org/abs/2212.03228) #robust
The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
[[2212.02755] Objects as Spatio-Temporal 2](http://arxiv.org/abs/2212.02755) #extraction
Determining accurate bird's eye view (BEV) positions of objects and tracks in a scene is vital for various perception tasks including object interactions mapping, scenario extraction etc., however, the level of supervision required to accomplish that is extremely challenging to procure. We propose a light-weight, weakly supervised method to estimate 3D position of objects by jointly learning to regress the 2D object detections and scene's depth prediction in a single feed-forward pass of a network. Our proposed method extends a center-point based single-shot object detector \cite{zhou2019objects}, and introduces a novel object representation where each object is modeled as a BEV point spatio-temporally, without the need of any 3D or BEV annotations for training and LiDAR data at query time. The approach leverages readily available 2D object supervision along with LiDAR point clouds (used only during training) to jointly train a single network, that learns to predict 2D object detection alongside the whole scene's depth, to spatio-temporally model object tracks as points in BEV. The proposed method is computationally over $\sim$10x efficient compared to recent SOTA approaches [1, 38] while achieving comparable accuracies on KITTI tracking benchmark.
[[2212.02794] Hybrid Model using Feature Extraction and Non-linear SVM for Brain Tumor Classification](http://arxiv.org/abs/2212.02794) #extraction
It is essential to classify brain tumors from magnetic resonance imaging (MRI) accurately for better and timely treatment of the patients. In this paper, we propose a hybrid model, using VGG along with Nonlinear-SVM (Soft and Hard) to classify the brain tumors: glioma and pituitary and tumorous and non-tumorous. The VGG-SVM model is trained for two different datasets of two classes; thus, we perform binary classification. The VGG models are trained via the PyTorch python library to obtain the highest testing accuracy of tumor classification. The method is threefold, in the first step, we normalize and resize the images, and the second step consists of feature extraction through variants of the VGG model. The third step classified brain tumors using non-linear SVM (soft and hard). We have obtained 98.18% accuracy for the first dataset and 99.78% for the second dataset using VGG19. The classification accuracies for non-linear SVM are 95.50% and 97.98% with linear and rbf kernel and 97.95% for soft SVM with RBF kernel with D1, and 96.75% and 98.60% with linear and RBF kernel and 98.38% for soft SVM with RBF kernel with D2. Results indicate that the hybrid VGG-SVM model, especially VGG 19 with SVM, is able to outperform existing techniques and achieve high accuracy.
[[2212.02896] Multimodal Tree Decoder for Table of Contents Extraction in Document Images](http://arxiv.org/abs/2212.02896) #extraction
Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents, which can be widely used for document understanding and information retrieval. Existing works often use hand-crafted features and predefined rule-based functions to detect headings and resolve the hierarchical relationship between headings. Both the benchmark and research based on deep learning are still limited. Accordingly, in this paper, we first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels. Then we propose a novel end-to-end model by using the multimodal tree decoder (MTD) for ToC as a benchmark for HierDoc. The MTD model is mainly composed of three parts, namely encoder, classifier, and decoder. The encoder fuses the multimodality features of vision, text, and layout information for each entity of the document. Then the classifier recognizes and selects the heading entities. Next, to parse the hierarchical relationship between the heading entities, a tree-structured decoder is designed. To evaluate the performance, both the metric of tree-edit-distance similarity (TEDS) and F1-Measure are adopted. Finally, our MTD approach achieves an average TEDS of 87.2% and an average F1-Measure of 88.1% on the test set of HierDoc. The code and dataset will be released at: https://github.com/Pengfei-Hu/MTD.
[[2212.03029] AbHE: All Attention-based Homography Estimation](http://arxiv.org/abs/2212.03029) #extraction
Homography estimation is a basic computer vision task, which aims to obtain the transformation from multi-view images for image alignment. Unsupervised learning homography estimation trains a convolution neural network for feature extraction and transformation matrix regression. While the state-of-the-art homography method is based on convolution neural networks, few work focuses on transformer which shows superiority in high-level vision tasks. In this paper, we propose a strong-baseline model based on the Swin Transformer, which combines convolution neural network for local features and transformer module for global features. Moreover, a cross non-local layer is introduced to search the matched features within the feature maps coarsely.In the homography regression stage, we adopts an attention layer for the channels of correlation volume, which can drop out some weak correlation feature points. The experiment shows that in 8 Degree-of-Freedoms(DOFs) homography estimation our methods overperform the state-of-the-art method.
[[2212.02560] Cross-Domain Few-Shot Relation Extraction via Representation Learning and Domain Adaptation](http://arxiv.org/abs/2212.02560) #extraction
Cross-domain few-shot relation extraction poses a great challenge for the existing few-shot learning methods and domain adaptation methods when the source domain and target domain have large discrepancies. This paper proposes a method by combining the idea of few-shot learning and domain adaptation to deal with this problem. In the proposed method, an encoder, learned by optimizing a representation loss and an adversarial loss, is used to extract the relation of sentences in the source and target domain. The representation loss, including a cross-entropy loss and a contrastive loss, makes the encoder extract the relation of the source domain and keep the geometric structure of the classes in the source domain. And the adversarial loss is used to merge the source domain and target domain. The experimental results on the benchmark FewRel dataset demonstrate that the proposed method can outperform some state-of-the-art methods.
[[2212.03000] SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies](http://arxiv.org/abs/2212.03000) #extraction
Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH using cancer populations.
Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models to extract SDoH, examined the generalizability of NLP models to a cohort of patients prescribed with opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts.
Results and Conclusion: We developed a corpus of 629 cancer patients notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH concept extraction, 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among 19 categories of SDoH varied greatly, where 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
[[2212.02567] cs-net: structural approach to time-series forecasting for high-dimensional feature space data with limited observations](http://arxiv.org/abs/2212.02567) #extraction
In recent years, deep-learning-based approaches have been introduced to solving time-series forecasting-related problems. These novel methods have demonstrated impressive performance in univariate and low-dimensional multivariate time-series forecasting tasks. However, when these novel methods are used to handle high-dimensional multivariate forecasting problems, their performance is highly restricted by a practical training time and a reasonable GPU memory configuration. In this paper, inspired by a change of basis in the Hilbert space, we propose a flexible data feature extraction technique that excels in high-dimensional multivariate forecasting tasks. Our approach was originally developed for the National Science Foundation (NSF) Algorithms for Threat Detection (ATD) 2022 Challenge. Implemented using the attention mechanism and Convolutional Neural Networks (CNN) architecture, our method demonstrates great performance and compatibility. Our models trained on the GDELT Dataset finished 1st and 2nd places in the ATD sprint series and hold promise for other datasets for time series forecasting.
[[2212.02758] Tackling Data Heterogeneity in Federated Learning with Class Prototypes](http://arxiv.org/abs/2212.02758) #federate
Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients' tasks. In PFL, a common strategy is to develop local and global models jointly - the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models' performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.
[[2212.02985] Multi-Layer Personalized Federated Learning for Mitigating Biases in Student Predictive Analytics](http://arxiv.org/abs/2212.02985) #federate
Traditional learning-based approaches to student modeling (e.g., predicting grades based on measured activities) generalize poorly to underrepresented/minority student groups due to biases in data availability. In this paper, we propose a Multi-Layer Personalized Federated Learning (MLPFL) methodology which optimizes inference accuracy over different layers of student grouping criteria, such as by course and by demographic subgroups within each course. In our approach, personalized models for individual student subgroups are derived from a global model, which is trained in a distributed fashion via meta-gradient updates that account for subgroup heterogeneity while preserving modeling commonalities that exist across the full dataset. To evaluate our methodology, we consider case studies of two popular downstream student modeling tasks, knowledge tracing and outcome prediction, which leverage multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums) in model training. Experiments on three real-world datasets from online courses demonstrate that our approach obtains substantial improvements over existing student modeling baselines in terms of increasing the average and decreasing the variance of prediction quality across different student subgroups. Visual analysis of the resulting students' knowledge state embeddings confirm that our personalization methodology extracts activity patterns which cluster into different student subgroups, consistent with the performance enhancements we obtain over the baselines.
[[2212.02687] Enabling and Accelerating Dynamic Vision Transformer Inference for Real-Time Applications](http://arxiv.org/abs/2212.02687) #fair
Many state-of-the-art deep learning models for computer vision tasks are based on the transformer architecture. Such models can be computationally expensive and are typically statically set to meet the deployment scenario. However, in real-time applications, the resources available for every inference can vary considerably and be smaller than what state-of-the-art models use. We can use dynamic models to adapt the model execution to meet real-time application resource constraints. While prior dynamic work has primarily minimized resource utilization for less complex input images while maintaining accuracy and focused on CNNs and early transformer models such as BERT, we adapt vision transformers to meet system dynamic resource constraints, independent of the input image. We find that unlike early transformer models, recent state-of-the-art vision transformers heavily rely on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, enabling us to create a low-overhead system for dynamic real-time inference without additional training. Finally, we create a optimized accelerator for these dynamic vision transformers in a 5nm technology. The PE array occupies 2.26mm$^2$ and is 17 times faster than a NVIDIA TITAN V GPU for state-of-the-art transformer-based models for semantic segmentation.
[[2212.02614] Can Ensembling Pre-processing Algorithms Lead to Better Machine Learning Fairness?](http://arxiv.org/abs/2212.02614) #fair
As machine learning (ML) systems get adopted in more critical areas, it has become increasingly crucial to address the bias that could occur in these systems. Several fairness pre-processing algorithms are available to alleviate implicit biases during model training. These algorithms employ different concepts of fairness, often leading to conflicting strategies with consequential trade-offs between fairness and accuracy. In this work, we evaluate three popular fairness pre-processing algorithms and investigate the potential for combining all algorithms into a more robust pre-processing ensemble. We report on lessons learned that can help practitioners better select fairness algorithms for their models.
[[2212.02924] Controlled Text Generation using T5 based Encoder-Decoder Soft Prompt Tuning and Analysis of the Utility of Generated Text in AI](http://arxiv.org/abs/2212.02924) #interpretability
Controlled text generation is a very important task in the arena of natural language processing due to its promising applications. In order to achieve this task we mainly introduce the novel soft prompt tuning method of using soft prompts at both encoder and decoder levels together in a T5 model and investigate the performance as the behaviour of an additional soft prompt related to the decoder of a T5 model in controlled text generation remained unexplored. Then we also investigate the feasibility of steering the output of this extended soft prompted T5 model at decoder level and finally analyse the utility of generated text to be used in AI related tasks such as training AI models with an interpretability analysis of the classifier trained with synthetic text, as there is a lack of proper analysis of methodologies in generating properly labelled data to be utilized in AI tasks. Through the performed in-depth intrinsic and extrinsic evaluations of this generation model along with the artificially generated data, we found that this model produced better results compared to the T5 model with a single soft prompt at encoder level and the sentiment classifier trained using this artificially generated data can produce comparable classification results to the results of a classifier trained with real labelled data and also the classifier decision is interpretable with respect to the input text content.
[[2212.03044] On the Importance of Clinical Notes in Multi-modal Learning for EHR Data](http://arxiv.org/abs/2212.03044) #interpretability
Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than clinician notes. We believe such findings highlight deep learning models for EHR data to be more limited by partially-descriptive data than by modeling choice, motivating a more data-centric approach in the field.
[[2212.03131] Explainability as statistical inference](http://arxiv.org/abs/2212.03131) #explainability
A wide variety of model explanation approaches have been proposed in recent years, all guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is a case of amortized interpretability models, where a neural network is used as a selector to allow for fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularised maximum likelihood for our general model. We propose new datasets with ground truth selection which allow for the evaluation of the features importance map. Using these datasets, we show experimentally that using multiple imputation provides more reasonable interpretations.
[[2212.02814] Mixer: DNN Watermarking using Image Mixup](http://arxiv.org/abs/2212.02814) #watermark
It is crucial to protect the intellectual property rights of DNN models prior to their deployment. The DNN should perform two main tasks: its primary task and watermarking task. This paper proposes a lightweight, reliable, and secure DNN watermarking that attempts to establish strong ties between these two tasks. The samples triggering the watermarking task are generated using image Mixup either from training or testing samples. This means that there is an infinity of triggers not limited to the samples used to embed the watermark in the model at training. The extensive experiments on image classification models for different datasets as well as exposing them to a variety of attacks, show that the proposed watermarking provides protection with an adequate level of security and robustness.
[[2212.02773] DiffusionInst: Diffusion Model for Instance Segmentation](http://arxiv.org/abs/2212.02773) #diffusion
Recently, diffusion frameworks have achieved comparable performance with previous state-of-the-art image generation models. Researchers are curious about its variants in discriminative tasks because of its powerful noise-to-image denoising pipeline. This paper proposes DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process. The model is trained to reverse the noisy groundtruth without any inductive bias from RPN. During inference, it takes a randomly generated filter as input and outputs mask in one-step or multi-step denoising. Extensive experimental results on COCO and LVIS show that DiffusionInst achieves competitive performance compared to existing instance segmentation models. We hope our work could serve as a simple yet effective baseline, which could inspire designing more efficient diffusion frameworks for challenging discriminative tasks. Our code is available in https://github.com/chenhaoxing/DiffusionInst.
[[2212.02796] DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model](http://arxiv.org/abs/2212.02796) #diffusion
Thanks to the development of 2D keypoint detectors, monocular 3D human pose estimation (HPE) via 2D-to-3D uplifting approaches have achieved remarkable improvements. Still, monocular 3D HPE is a challenging problem due to the inherent depth ambiguities and occlusions. To handle this problem, many previous works exploit temporal information to mitigate such difficulties. However, there are many real-world applications where frame sequences are not accessible. This paper focuses on reconstructing a 3D pose from a single 2D keypoint detection. Rather than exploiting temporal information, we alleviate the depth ambiguity by generating multiple 3D pose candidates which can be mapped to an identical 2D keypoint. We build a novel diffusion-based framework to effectively sample diverse 3D poses from an off-the-shelf 2D detector. By considering the correlation between human joints by replacing the conventional denoising U-Net with graph convolutional network, our approach accomplishes further performance improvements. We evaluate our method on the widely adopted Human3.6M and HumanEva-I datasets. Comprehensive experiments are conducted to prove the efficacy of the proposed method, and they confirm that our model outperforms state-of-the-art multi-hypothesis 3D HPE methods.
[[2212.02802] Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding](http://arxiv.org/abs/2212.02802) #diffusion
Inspired by the impressive performance of recent face image editing methods, several studies have been naturally proposed to extend these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract the decomposed features - for the first time as a face video editing model - of identity and motion from a given video. This modeling allows us to edit the video by simply manipulating the temporally invariant feature to the desired direction for the consistency. Another unique strength of our model is that, since our model is based on diffusion models, it can satisfy both reconstruction and edit capabilities at the same time, and is robust to corner cases in wild face videos (e.g. occluded faces) unlike the existing GAN-based methods.
[[2212.02837] Pretrained Diffusion Models for Unified Human Motion Synthesis](http://arxiv.org/abs/2212.02837) #diffusion
Generative modeling of human motion has broad applications in computer animation, virtual reality, and robotics. Conventional approaches develop separate models for different motion synthesis tasks, and typically use a model of a small size to avoid overfitting the scarce data available in each setting. It remains an open question whether developing a single unified model is feasible, which may 1) benefit the acquirement of novel skills by combining skills learned from multiple tasks, and 2) help in increasing the model capacity without overfitting by combining multiple data sources. Unification is challenging because 1) it involves diverse control signals as well as targets of varying granularity, and 2) motion datasets may use different skeletons and default poses. In this paper, we present MoFusion, a framework for unified motion synthesis. MoFusion employs a Transformer backbone to ease the inclusion of diverse control signals via cross attention, and pretrains the backbone as a diffusion model to support multi-granularity synthesis ranging from motion completion of a body part to whole-body motion generation. It uses a learnable adapter to accommodate the differences between the default skeletons used by the pretraining and the fine-tuning data. Empirical results show that pretraining is vital for scaling the model size without overfitting, and demonstrate MoFusion's potential in various tasks, e.g., text-to-motion, motion completion, and zero-shot mixing of multiple control signals. Project page: \url{https://ofa-sys.github.io/MoFusion/}.
[[2212.02936] M-VADER: A Model for Diffusion with Multimodal Context](http://arxiv.org/abs/2212.02936) #diffusion
We introduce M-VADER: a diffusion model (DM) for image generation where the output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of image and text, and combinations of multiple images. Previously, a number of successful DM image generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion parameter multimodal decoder combining components from an autoregressive vision-language model MAGMA and biases finetuned for semantic search.
[[2212.02963] SDM: Spatial Diffusion Model for Large Hole Image Inpainting](http://arxiv.org/abs/2212.02963) #diffusion
Generative adversarial networks (GANs) have made great success in image inpainting yet still have difficulties tackling large missing regions. In contrast, iterative algorithms, such as autoregressive and denoising diffusion models, have to be deployed with massive computing resources for decent effect. To overcome the respective limitations, we present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image, largely enhancing the inference efficiency. Also, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion. On multiple benchmarks, we achieve new state-of-the-art performance. Code is released at https://github.com/fenglinglwb/SDM.
[[2212.03099] Semantic-Conditional Diffusion Networks for Image Captioning](http://arxiv.org/abs/2212.03099) #diffusion
Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}.
[[2212.03185] Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis](http://arxiv.org/abs/2212.03185) #diffusion
Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper, we surprisingly find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead, learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that only pursues better details preservation, we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase, we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase, we fix the encoder and codebook, but enhance and finetune the decoder to achieve better details preservation. The proposed SeQ-GAN greatly improves VQ-based generative models and surpasses the GAN and Diffusion Models on both unconditional and conditional image generation. Our SeQ-GAN (364M) achieves Frechet Inception Distance (FID) of 6.25 and Inception Score (IS) of 140.9 on 256x256 ImageNet generation, a remarkable improvement over VIT-VQGAN (714M), which obtains 11.2 FID and 97.2 IS.
[[2212.03221] ADIR: Adaptive Diffusion for Image Reconstruction](http://arxiv.org/abs/2212.03221) #diffusion
In recent years, denoising diffusion models have demonstrated outstanding
image generation performance. The information on natural images captured by
these models is useful for many image reconstruction applications, where the
task is to restore a clean image from its degraded observations. In this work,
we propose a conditional sampling scheme that exploits the prior learned by
diffusion models while retaining agreement with the observations. We then
combine it with a novel approach for adapting pretrained diffusion denoising
networks to their input. We examine two adaption strategies: the first uses
only the degraded image, while the second, which we advocate, is performed
using images that are `nearest neighbors'' of the degraded image, retrieved
from a diverse dataset using an off-the-shelf visual-language model. To
evaluate our method, we test it on two state-of-the-art publicly available
diffusion models, Stable Diffusion and Guided Diffusion. We show that our
proposed
adaptive diffusion for image reconstruction' (ADIR) approach achieves
a significant improvement in the super-resolution, deblurring, and text-based
editing tasks.
[[2212.02977] Denoising diffusion probabilistic models for probabilistic energy forecasting](http://arxiv.org/abs/2212.02977) #diffusion
Scenario-based probabilistic forecasts have become a vital tool to equip decision-makers to address the uncertain nature of renewable energies. This paper presents a recent promising deep learning generative approach: denoising diffusion probabilistic models. It is a class of latent variable models that have recently demonstrated impressive results in the computer vision community. However, to the best of our knowledge, there has yet to be a demonstration that they can generate high-quality samples of load, PV, or wind power time series that are crucial to face the new challenges in power systems applications. Thus, we propose the first implementation of this model for energy forecasting using the open data of the Global Energy Forecasting Competition 2014. The results demonstrate that this approach is competitive with other state-of-the-art deep learning generative models: generative adversarial networks, variational autoencoders, and normalizing flows.