[[2211.16987] Secure Software Development Methodologies: A Multivocal Literature Review](http://arxiv.org/abs/2211.16987) #secure
In recent years, the number of cyber attacks has grown rapidly. An effective way to reduce the attack surface and protect software is adoption of methodologies that apply security at each step of the software development lifecycle. While different methodologies have been proposed to address software security, recent research shows an increase in the number of vulnerabilities in software and data breaches. Therefore, the security practices incorporated in secure software development methodologies require investigation. This paper provides an overview of security practices involved in 28 secure software development methodologies from industry, government, and academia. To achieve this goal, we distributed the security practices among the software development lifecycle stages. We also investigated auxiliary (non-technical) practices, such as organizational, behavioral, legal, policy, and governance aspects that are incorporated in the secure software development methodologies. Furthermore, we explored methods used to provide evidence of the effectiveness of the methodologies. Finally, we present the gaps that require attention in the scientific community. The results of our survey may help researchers and organizations better understand the existing security practices integrated into the secure software development methodologies. In addition, our bridge between "technical" and "non-technical" worlds may be useful for non-technical specialists who investigate software security. Moreover, exploring the gaps that we found in current research may help improve security in software development and produce software with fewer vulnerabilities.
[[2211.17019] Real time QKD Post Processing based on Reconfigurable Hardware Acceleration](http://arxiv.org/abs/2211.17019) #secure
Key distillation is an essential component of every Quantum Key Distribution (QKD) system because it compensates for the inherent transmission errors of the quantum channel. However, the throughput and interoperability aspects of post-processing engine design are often neglected, and existing solutions do not provide any guarantees. In this paper, we propose a high-throughput key distillation framework with multi-protocol support, implemented in a Field Programmable Gate Array (FPGA) using High-Level Synthesis (HLS). The proposed design uses a Hadoop framework with a map-reduce programming model to efficiently process large chunks of raw data across the limited computing resources of an FPGA. We present a novel hardware-efficient integrated post-processing architecture that offers dynamic error correction, a side-channel resistant authentication scheme, and an inbuilt high-speed encryption application, which uses the key for secure communication. We develop a semi-automated high-level synthesis framework capable of handling different QKD protocols with promising speedup. Overall, the experimental results show a significant improvement in performance, and the design is compatible with any discrete-variable QKD system.
[[2211.16938] Evaluating Digital Agriculture Recommendations with Causal Inference](http://arxiv.org/abs/2211.16938) #secure
In contrast to the rapid digitalization of several industries, agriculture suffers from low adoption of smart farming tools. While AI-driven digital agriculture tools can offer high-performing predictive functionalities, they lack tangible quantitative evidence on their benefits to the farmers. Field experiments can derive such evidence, but are often costly, time consuming and hence limited in scope and scale of application. To this end, we propose an observational causal inference framework for the empirical evaluation of the impact of digital tools on target farm performance indicators (e.g., yield in this case). This way, we can increase farmers' trust via enhancing the transparency of the digital agriculture market and accelerate the adoption of technologies that aim to secure farmer income resilience and global agricultural sustainability. As a case study, we designed and implemented a recommendation system for the optimal sowing time of cotton based on numerical weather predictions, which was used by a farmers' cooperative during the growing season of 2021. We then leverage agricultural knowledge, collected yield data, and environmental information to develop a causal graph of the farm system. Using the back-door criterion, we identify the impact of sowing recommendations on the yield and subsequently estimate it using linear regression, matching, inverse propensity score weighting and meta-learners. The results reveal that a field sown according to our recommendations exhibited a statistically significant yield increase that ranged from 12% to 17%, depending on the method. The effect estimates were robust, as indicated by the agreement among the estimation methods and four successful refutation tests. We argue that this approach can be implemented for decision support systems of other fields, extending their evaluation beyond a performance assessment of internal functionalities.
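As a rough illustration of one estimator named in the abstract, the sketch below computes an inverse-propensity-weighted estimate of an average treatment effect. The synthetic rows, the single binary confounder, and the function name `ipw_ate` are assumptions made for illustration only; they do not come from the paper, whose causal graph and data are richer.

```python
from collections import defaultdict


def ipw_ate(rows):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    rows: list of (confounder, treated, outcome), with a discrete confounder
    and treated in {0, 1}. Propensities are estimated as the treated fraction
    within each confounder stratum, so every stratum must contain both
    treated and untreated units.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n, n_treated]
    for z, t, _ in rows:
        counts[z][0] += 1
        counts[z][1] += t
    propensity = {z: n_t / n for z, (n, n_t) in counts.items()}

    treated_term = sum(t * y / propensity[z] for z, t, y in rows)
    control_term = sum((1 - t) * y / (1 - propensity[z]) for z, t, y in rows)
    return (treated_term - control_term) / len(rows)


# Hypothetical data: outcome = 10 + 5 * confounder + 2 * treated,
# and units with confounder = 1 are treated more often.
rows = [
    (0, 1, 12.0), (0, 0, 10.0), (0, 0, 10.0), (0, 0, 10.0),
    (1, 1, 17.0), (1, 1, 17.0), (1, 1, 17.0), (1, 0, 15.0),
]
print(ipw_ate(rows))
```

On this toy data the naive difference of group means is 4.5, while the weighted estimate recovers the effect of 2 that was built into the data, because the weighting removes the imbalance in the confounder.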
[[2211.16785] SafeSpace MFNet: Precise and Efficient MultiFeature Drone Detection Network](http://arxiv.org/abs/2211.16785) #security
The popularity of unmanned aerial vehicles (UAVs) is on the rise, as they enable services like traffic monitoring, emergency communications, deliveries, and surveillance. However, the unauthorized usage of UAVs (a.k.a. drones) may violate security and privacy protocols for security-sensitive national and international institutions. The presented challenges require fast, efficient, and precise detection of UAVs irrespective of harsh weather conditions, the presence of different objects, and their size to enable SafeSpace. Recently, there has been significant progress in using the latest deep learning models, but those models have shortcomings in terms of computational complexity, precision, and non-scalability. To overcome these limitations, we propose a precise and efficient multiscale and multifeature UAV detection network for SafeSpace, i.e., *MultiFeatureNet* (*MFNet*), an improved version of the popular object detection algorithm YOLOv5s. In *MFNet*, we perform multiple changes in the backbone and neck of the YOLOv5s network to focus on the various small and ignored features required for accurate and fast UAV detection. To further improve the accuracy and focus on the specific situation and multiscale UAVs, we classify *MFNet* into small (S), medium (M), and large (L) variants: these are combinations of various filter sizes in the convolution and bottleneckCSP layers, which reside in the backbone and neck of the architecture. This classification helps to reduce the computational cost by training the model on a specific feature map rather than all the features. The dataset and code are available as open source: github.com/ZeeshanKaleem/MultiFeatureNet.
[[2211.16735] Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities](http://arxiv.org/abs/2211.16735) #security
File name confusion attacks, such as malicious symbolic links and file squatting, have long been studied as sources of security vulnerabilities. However, a recently emerged type, i.e., case-sensitivity-induced name collisions, has not been scrutinized. These collisions are introduced by differences in name resolution under case-sensitive and case-insensitive file systems or directories. A prominent example is the recent Git vulnerability (CVE-2021-21300) which can lead to code execution on a victim client when it clones a maliciously crafted repository onto a case-insensitive file system. With trends including ext4 adding support for per-directory case-insensitivity and the broad deployment of the Windows Subsystem for Linux, the prerequisites for such vulnerabilities are increasingly likely to exist even in a single system.
In this paper, we make a first effort to investigate how and where the lack of any uniform approach to handling name collisions leads to a diffusion of responsibility and resultant vulnerabilities. Interestingly, we demonstrate the existence of a range of novel security challenges arising from name collisions and their inconsistent handling by low-level utilities and applications. Specifically, our experiments show that utilities handle many name collision scenarios unsafely, leaving the responsibility to applications whose developers are unfortunately not yet aware of the threats. We examine three case studies as a first step towards systematically understanding the emerging type of name collision vulnerability.
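A minimal sketch of the underlying hazard: names that are distinct on a case-sensitive source can map to a single entry on a case-insensitive target. The helper name `find_case_collisions` is hypothetical, and `str.casefold` is only an approximation; real file systems apply their own folding and normalization rules.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group names that would collide if case were ignored.

    Returns a list of groups, each holding two or more names that fold to
    the same key. Folding uses str.casefold, which approximates but does
    not reproduce any particular file system's rules.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical directory listing from a case-sensitive source.
print(find_case_collisions(["Makefile", "makefile", "README", "src"]))
```

A copy or archive tool could run a check like this before writing to a target directory and refuse, rename, or warn rather than silently overwriting one entry with another.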
[[2211.16942] ALARM: Active LeArning of Rowhammer Mitigations](http://arxiv.org/abs/2211.16942) #security
Rowhammer is a serious security problem of contemporary dynamic random-access memory (DRAM) where reads or writes of bits can flip other bits. DRAM manufacturers add mitigations, but don't disclose details, making it difficult for customers to evaluate their efficacy. We present a tool, based on active learning, that automatically infers parameters of Rowhammer mitigations against synthetic models of modern DRAM.
[[2211.17132] Targets in Reinforcement Learning to solve Stackelberg Security Games](http://arxiv.org/abs/2211.17132) #security
Reinforcement Learning (RL) algorithms have been successfully applied to real world situations like illegal smuggling, poaching, deforestation, climate change, airport security, etc. These scenarios can be framed as Stackelberg security games (SSGs) where defenders and attackers compete to control target resources. The algorithm's competency is assessed by which agent is controlling the targets. This review investigates modeling of SSGs in RL with a focus on possible improvements of target representations in RL algorithms.
[[2211.17230] The Bounded Gaussian Mechanism for Differential Privacy](http://arxiv.org/abs/2211.17230) #privacy
The Gaussian mechanism is one differential privacy mechanism commonly used to protect numerical data. However, it may be ill-suited to some applications because it has unbounded support and thus can produce invalid numerical answers to queries, such as negative ages or human heights in the tens of meters. One can project such private values onto valid ranges of data, though such projections lead to the accumulation of private query responses at the boundaries of such ranges, thereby harming accuracy. Motivated by the need for both privacy and accuracy over bounded domains, we present a bounded Gaussian mechanism for differential privacy, which has support only on a given region. We present both univariate and multivariate versions of this mechanism and illustrate a significant reduction in variance relative to comparable existing work.
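The contrast between projecting noisy answers onto a valid range and sampling from a distribution supported only on that range can be sketched as follows. This is an illustration under stated assumptions, not the paper's mechanism: the function names are hypothetical, rejection sampling is just one simple way to draw from a truncated Gaussian, and `sigma` is an arbitrary value that is not calibrated to any privacy budget.

```python
import random


def clamp_gaussian(value, sigma, lo, hi, rng):
    """Baseline: add Gaussian noise, then project the result onto [lo, hi]."""
    return min(hi, max(lo, rng.gauss(value, sigma)))


def truncated_gaussian(value, sigma, lo, hi, rng):
    """Draw from a Gaussian centred on value, restricted to [lo, hi].

    Uses rejection sampling; sigma is NOT calibrated for differential privacy.
    """
    while True:
        draw = rng.gauss(value, sigma)
        if lo <= draw <= hi:
            return draw


rng = random.Random(0)
true_age, sigma, lo, hi = 1.0, 5.0, 0.0, 120.0  # hypothetical query and range
clamped = [clamp_gaussian(true_age, sigma, lo, hi, rng) for _ in range(1000)]
bounded = [truncated_gaussian(true_age, sigma, lo, hi, rng) for _ in range(1000)]
print(sum(x == lo for x in clamped), sum(x == lo for x in bounded))
```

The clamped answers pile up exactly at the lower boundary, which is the accumulation effect the abstract describes, while the truncated draws stay spread over the valid range.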
[[2211.17067] Fair Ranking with Noisy Protected Attributes](http://arxiv.org/abs/2211.17067) #protect
The fair-ranking problem, which asks to rank a given set of items to maximize utility subject to group fairness constraints, has received attention in the fairness, information retrieval, and machine learning literature. Recent works, however, observe that errors in socially-salient (including protected) attributes of items can significantly undermine fairness guarantees of existing fair-ranking algorithms and raise the problem of mitigating the effect of such errors. We study the fair-ranking problem under a model where socially-salient attributes of items are randomly and independently perturbed. We present a fair-ranking framework that incorporates group fairness requirements along with probabilistic information about perturbations in socially-salient attributes. We provide provable guarantees on the fairness and utility attainable by our framework and show that it is information-theoretically impossible to significantly beat these guarantees. Our framework works for multiple non-disjoint attributes and a general class of fairness constraints that includes proportional and equal representation. Empirically, we observe that, compared to baselines, our algorithm outputs rankings with higher fairness, and has a similar or better fairness-utility trade-off compared to baselines.
[[2211.16786] Two-branch Multi-scale Deep Neural Network for Generalized Document Recapture Attack Detection](http://arxiv.org/abs/2211.16786) #attack
The image recapture attack is an effective image manipulation method to erase certain forensic traces, and when it targets personal document images, it poses a great threat to the security of e-commerce and other web applications. Considering that current learning-based methods suffer from a serious overfitting problem, in this paper, we propose a novel two-branch deep neural network that mines better generalized recapture artifacts with a designed frequency filter bank and a multi-scale cross-attention fusion module. In extensive experiments, we show that our method achieves better generalization capability than state-of-the-art techniques in different scenarios.
[[2211.16791] Adaptive adversarial training method for improving multi-scale GAN based on generalization bound theory](http://arxiv.org/abs/2211.16791) #attack
In recent years, multi-scale generative adversarial networks (GANs) have been proposed to build generalized image processing models based on a single sample. Constrained by the sample size, multi-scale GANs have much difficulty converging to the global optimum, which ultimately limits their capabilities. In this paper, we introduce PAC-Bayes generalization bound theory into the training analysis of specific models under different adversarial training methods, which yields a non-vacuous upper bound on the generalization error for the specified multi-scale GAN structure. Based on the drastic changes we found in the generalization error bound under different adversarial attacks and different training states, we propose an adaptive training method which can greatly improve the image manipulation ability of multi-scale GANs. The final experimental results show that our adaptive training method greatly improves the quality of the images generated by multi-scale GANs on several image manipulation tasks. In particular, for the image super-resolution restoration task, the multi-scale GAN model trained by the proposed method achieves a 100% reduction in natural image quality evaluator (NIQE) and a 60% reduction in root mean squared error (RMSE), which is better than many models trained on large-scale datasets.
[[2211.16626] Sludge for Good: Slowing and Imposing Costs on Cyber Attackers](http://arxiv.org/abs/2211.16626) #attack
Choice architecture describes the design by which choices are presented to people. Nudges are an aspect intended to make "good" outcomes easy, such as using password meters to encourage strong passwords. Sludge, on the contrary, is friction that raises the transaction cost and is often seen as a negative to users. Turning this concept around, we propose applying sludge for positive cybersecurity outcomes by using it offensively to consume attackers' time and other resources.
To date, most cyber defenses have been designed to be optimally strong and effective and to prohibit or eliminate attackers as quickly as possible. Our complementary approach is to also deploy defenses that seek to maximize the consumption of the attackers' time and other resources while causing as little damage as possible to the victim. This is consistent with zero trust and similar mindsets which assume breach. The Sludge Strategy introduces cost-imposing cyber defense by strategically deploying friction for attackers before, during, and after an attack using deception and authentic design features. We present the characteristics of effective sludge, and show a continuum from light to heavy sludge. We describe the quantitative and qualitative costs to attackers and offer practical considerations for deploying sludge. Finally, we examine real-world examples of U.S. government operations to frustrate and impose cost on cyber adversaries.
[[2211.16891] Quantitative Information Flow for Hardware: Advancing the Attack Landscape](http://arxiv.org/abs/2211.16891) #attack
Security still remains an afterthought in modern Electronic Design Automation (EDA) tools, which solely focus on enhancing performance and reducing the chip size. Typically, the security analysis is conducted by hand, leading to vulnerabilities in the design remaining unnoticed. Security-aware EDA tools assist the designer in the identification and removal of security threats while keeping performance and area in mind. State-of-the-art approaches utilize information flow analysis to spot unintended information leakages in design structures. However, the classification of such threats is binary, resulting in negligible leakages being listed as well. A novel quantitative analysis allows the application of a metric to determine a numeric value for a leakage. Nonetheless, current approximations to quantify the leakage are still prone to overlooking leakages. The mathematical model 2D-QModel introduced in this work aims to overcome this shortcoming. Additionally, as previous work only includes a limited threat model, multiple threat models can be applied using the provided approach. Open-source benchmarks are used to show the capabilities of 2D-QModel to identify hardware Trojans in the design while ignoring insignificant leakages.
[[2211.17244] Overcoming the Convex Relaxation Barrier for Neural Network Verification via Nonconvex Low-Rank Semidefinite Relaxations](http://arxiv.org/abs/2211.17244) #attack
To rigorously certify the robustness of neural networks to adversarial perturbations, most state-of-the-art techniques rely on a triangle-shaped linear programming (LP) relaxation of the ReLU activation. While the LP relaxation is exact for a single neuron, recent results suggest that it faces an inherent "convex relaxation barrier" as additional activations are added, and as the attack budget is increased. In this paper, we propose a nonconvex relaxation for the ReLU relaxation, based on a low-rank restriction of a semidefinite programming (SDP) relaxation. We show that the nonconvex relaxation has a similar complexity to the LP relaxation, but enjoys improved tightness that is comparable to the much more expensive SDP relaxation. Despite nonconvexity, we prove that the verification problem satisfies constraint qualification, and therefore a Riemannian staircase approach is guaranteed to compute a near-globally optimal solution in polynomial time. Our experiments provide evidence that our nonconvex relaxation almost completely overcomes the "convex relaxation barrier" faced by the LP relaxation.
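For readers unfamiliar with the baseline, the triangle relaxation of y = ReLU(x) for a neuron whose pre-activation is known to lie in [lower, upper] with lower < 0 < upper is the set y ≥ 0, y ≥ x, y ≤ upper · (x − lower) / (upper − lower). The sketch below evaluates those bounds at a point; the function name `triangle_bounds` is hypothetical, and it illustrates only the standard LP relaxation, not the low-rank SDP relaxation proposed in the paper.

```python
def triangle_bounds(x, lower, upper):
    """Return (lo, hi): the range of y allowed at x by the triangle relaxation.

    lower and upper are pre-activation bounds with lower < 0 < upper.
    The true value ReLU(x) always equals lo and never exceeds hi.
    """
    if not lower < 0 < upper:
        raise ValueError("the relaxation is only needed when lower < 0 < upper")
    if not lower <= x <= upper:
        raise ValueError("x must lie in [lower, upper]")
    lo = max(0.0, x)                           # y >= 0 and y >= x
    hi = upper * (x - lower) / (upper - lower)  # chord from (lower, 0) to (upper, upper)
    return lo, hi


print(triangle_bounds(0.0, -1.0, 1.0))
```

The gap between `lo` and `hi` is largest near x = 0 and vanishes at the two endpoints, which is where the looseness of the relaxation comes from once many neurons are composed.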
[[2211.16564] Testing GLOM's ability to infer wholes from ambiguous parts](http://arxiv.org/abs/2211.16564) #robust
The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.
[[2211.16841] Linking Sketch Patches by Learning Synonymous Proximity for Graphic Sketch Representation](http://arxiv.org/abs/2211.16841) #robust
Graphic sketch representations are effective for representing sketches. Existing methods take the patches cropped from sketches as the graph nodes, and construct the edges based on sketch's drawing order or Euclidean distances on the canvas. However, the drawing order of a sketch may not be unique, while the patches from semantically related parts of a sketch may be far away from each other on the canvas. In this paper, we propose an order-invariant, semantics-aware method for graphic sketch representations. The cropped sketch patches are linked according to their global semantics or local geometric shapes, namely the synonymous proximity, by computing the cosine similarity between the captured patch embeddings. Such constructed edges are learnable to adapt to the variation of sketch drawings, which enable the message passing among synonymous patches. Aggregating the messages from synonymous patches by graph convolutional networks plays a role of denoising, which is beneficial to produce robust patch embeddings and accurate sketch representations. Furthermore, we enforce a clustering constraint over the embeddings jointly with the network learning. The synonymous patches are self-organized as compact clusters, and their embeddings are guided to move towards their assigned cluster centroids. It raises the accuracy of the computed synonymous proximity. Experimental results show that our method significantly improves the performance on both controllable sketch synthesis and sketch healing.
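The edge-construction idea, linking patches by the cosine similarity of their embeddings rather than by drawing order or canvas distance, can be sketched as below. The fixed threshold, the hand-written embeddings, and the function names are assumptions for illustration; in the paper the edges are learnable rather than thresholded.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def link_patches(embeddings, threshold):
    """Return index pairs (i, j), i < j, whose similarity is at least threshold."""
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                edges.append((i, j))
    return edges


# Hypothetical 2-D patch embeddings: patches 0 and 1 point in nearly the same direction.
patches = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]]
print(link_patches(patches, threshold=0.9))
```

Because the similarity depends only on the embeddings, the resulting graph is unchanged if the patches are listed in a different stroke order, up to relabelling of the indices.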
[[2211.16869] NeAF: Learning Neural Angle Fields for Point Normal Estimation](http://arxiv.org/abs/2211.16869) #robust
Normal estimation for unstructured point clouds is an important task in 3D computer vision. Current methods achieve encouraging results by mapping local patches to normal vectors or learning local surface fitting using neural networks. However, these methods do not generalize well to unseen scenarios and are sensitive to parameter settings. To resolve these issues, we propose an implicit function to learn an angle field around the normal of each point in the spherical coordinate system, which we call Neural Angle Fields (NeAF). Instead of directly predicting the normal of an input point, we predict the angle offset between the ground truth normal and a randomly sampled query normal. This strategy pushes the network to observe more diverse samples, which leads to higher prediction accuracy in a more robust manner. To predict normals from the learned angle fields at inference time, we randomly sample query vectors in a unit spherical space and take the vectors with minimal angle values as the predicted normals. To further leverage the prior learned by NeAF, we propose to refine the predicted normal vectors by minimizing the angle offsets. The experimental results with synthetic data and real scans show significant improvements over the state-of-the-art under widely used benchmarks.
[[2211.16905] Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity](http://arxiv.org/abs/2211.16905) #robust
Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose a disparity-based MVS method based on the epipolar disparity flow (E-flow), called DispMVS, which infers the depth information from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line between each pair (between the reference image and several source images) for pixel matching and fuse uncountable depths triangulated from each pair by multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines the depth map with the help of the coarse-to-fine strategy. Experiments on DTUMVS and Tanks&Temple datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory.
[[2211.16922] Learning Motion-Robust Remote Photoplethysmography through Arbitrary Resolution Videos](http://arxiv.org/abs/2211.16922) #robust
Remote photoplethysmography (rPPG) enables non-contact heart rate (HR) estimation from facial videos, which is significantly more convenient than traditional contact-based measurements. In real-world long-term health monitoring scenarios, the distance of the participants and their head movements usually vary over time, resulting in inaccurate rPPG measurement due to the varying face resolution and complex motion artifacts. Unlike previous rPPG models designed for a constant distance between camera and participants, in this paper we propose two plug-and-play blocks (i.e., a physiological signal feature extraction block (PFE) and a temporal face alignment block (TFA)) to alleviate the degradation caused by changing distance and head motion. On the one hand, guided by representative-area information, PFE adaptively encodes arbitrary-resolution facial frames into fixed-resolution facial structure features. On the other hand, leveraging the estimated optical flow, TFA is able to counteract the rPPG signal confusion caused by head movement and thus benefits motion-robust rPPG signal recovery. In addition, we train the model with a cross-resolution constraint using a two-stream dual-resolution framework, which further helps PFE learn resolution-robust facial rPPG features. Extensive experiments on three benchmark datasets (UBFC-rPPG, COHFACE and PURE) demonstrate the superior performance of the proposed method. One highlight is that with PFE and TFA, off-the-shelf spatio-temporal rPPG models can predict more robust rPPG signals under both varying face resolution and severe head movement scenarios. The code is available at https://github.com/LJW-GIT/Arbitrary_Resolution_rPPG.
[[2211.16927] 3D GAN Inversion with Facial Symmetry Prior](http://arxiv.org/abs/2211.16927) #robust
Recently, a surge of high-quality 3D-aware GANs have been proposed, which leverage the generative power of neural rendering. It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion. Even with the facial prior preserved in pre-trained 3D GANs, reconstructing a 3D portrait from only one monocular image is still an ill-posed problem. The straightforward application of 2D GAN inversion methods focuses on texture similarity only while ignoring the correctness of 3D geometry shapes. This may cause geometry collapse effects, especially when reconstructing a side face under an extreme pose. In addition, the synthetic results in novel views are prone to be blurry. In this work, we propose a novel method to improve 3D GAN inversion by introducing a facial symmetry prior. We design a pipeline and constraints to make full use of the pseudo auxiliary view obtained via image flipping, which helps obtain a robust and reasonable geometry shape during the inversion process. To enhance texture fidelity in unobserved viewpoints, pseudo labels from depth-guided 3D warping can provide extra supervision. We design constraints aimed at filtering out conflict areas for optimization in asymmetric situations. Comprehensive quantitative and qualitative evaluations on image reconstruction and editing demonstrate the superiority of our method.
[[2211.16991] SparsePose: Sparse-View Camera Pose Regression and Refinement](http://arxiv.org/abs/2211.16991) #robust
Camera pose estimation is a key step in standard 3D reconstruction pipelines that operate on a dense set of images of a single object or scene. However, methods for pose estimation often fail when only a few images are available because they rely on the ability to robustly identify and match visual features between image pairs. While these methods can work robustly with dense camera views, capturing a large set of images can be time-consuming or impractical. We propose SparsePose for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refines them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations. We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object.
[[2211.17170] How to Train an Accurate and Efficient Object Detection Model on Any Dataset](http://arxiv.org/abs/2211.17170) #robust
The rapidly evolving industry demands high accuracy of the models without the need for time-consuming and computationally expensive experiments required for fine-tuning. Moreover, a model and training pipeline, which was once carefully optimized for a specific dataset, rarely generalizes well to training on a different dataset. This makes it unrealistic to have carefully fine-tuned models for each use case. To solve this, we propose an alternative approach that also forms a backbone of the Intel Geti platform: a dataset-agnostic template for object detection training, consisting of carefully chosen and pre-trained models together with a robust training pipeline for further training. Our solution works out-of-the-box and provides a strong baseline on a wide range of datasets. It can be used on its own or as a starting point for further fine-tuning for specific use cases when needed. We obtained dataset-agnostic templates by performing parallel training on a corpus of datasets and optimizing the choice of architectures and training tricks with respect to the average results on the whole corpus. We examined a number of architectures, taking into account the performance-accuracy trade-off. Consequently, we propose 3 finalists, VFNet, ATSS, and SSD, that can be deployed on CPU using the OpenVINO toolkit. The source code is available as a part of the OpenVINO Training Extensions (https://github.com/openvinotoolkit/training_extensions).
[[2211.16550] Soft Alignment Objectives for Robust Adaptation in Machine Translation](http://arxiv.org/abs/2211.16550) #robust
Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon a semantic similarity of the predicted tokens to the reference.
Our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from tokens' semantic similarity can mitigate catastrophic forgetting during domain adaptation, while (2) preserving the quality of the adaptation, (3) with negligible additions to compute costs. In the broader perspective, the objectives grounded in a soft token alignment pioneer the exploration of the middle ground between the efficient but naive exact-match token-level objectives and expressive but computationally- and resource-intensive sequential objectives.
[[2211.16944] AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning](http://arxiv.org/abs/2211.16944) #robust
Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g., gene or disease). We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to improve generalization. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g., the entire PubMed data).
[[2211.17107] Handling and extracting key entities from customer conversations using Speech recognition and Named Entity recognition](http://arxiv.org/abs/2211.17107) #robust
In this modern era of technology, with e-commerce developing at a rapid pace, it is important to understand customer requirements and details from a business conversation, as this is crucial for customer retention and satisfaction. Extracting key insights from these conversations is important when it comes to developing a product or solving a customer's issue. Understanding customer feedback, responses, and important product details is essential, and it can be done using Named Entity Recognition (NER). To extract the entities, we convert the conversations to text using the optimal speech-to-text model. The model is a two-stage network: first the conversation is converted to text, then suitable entities are extracted using a NER BERT transformer model. This will help enrich the customer experience when customers face an issue. A customer who faces a problem will call and register a complaint. The model will then extract from this conversation the key features needed to look into the problem. These features include details such as the order number and the exact problem. All of these are extracted directly from the conversation, which reduces the effort of going through the conversation again.
[[2211.16808] Efficient Adversarial Input Generation via Neural Net Patching](http://arxiv.org/abs/2211.16808) #robust
The adversarial input generation problem has become central in establishing the robustness and trustworthiness of deep neural nets, especially when they are used in safety-critical application domains such as autonomous vehicles and precision medicine. This is also practically challenging for multiple reasons: scalability is a common issue owing to large-sized networks, and the generated adversarial inputs often lack important qualities such as naturalness and output-impartiality. We relate this problem to the task of patching neural nets, i.e. applying small changes in some of the network's weights so that the modified net satisfies a given property. Intuitively, a patch can be used to produce an adversarial input because the effect of changing the weights can also be brought about by changing the inputs instead. This work presents a novel technique to patch neural networks and an innovative approach of using it to produce perturbations of inputs which are adversarial for the original net. We note that the proposed solution is significantly more effective than the prior state-of-the-art techniques.
[[2211.16784] Robust and Fast Measure of Information via Low-rank Representation](http://arxiv.org/abs/2211.16784) #robust
The matrix-based R\'enyi's entropy allows us to directly quantify information measures from given data, without explicit estimation of the underlying probability distribution. This intriguing property makes it widely applied in statistical inference and machine learning tasks. However, this information theoretical quantity is not robust against noise in the data, and is computationally prohibitive in large-scale applications. To address these issues, we propose a novel measure of information, termed low-rank matrix-based R\'enyi's entropy, based on low-rank representations of infinitely divisible kernel matrices. The proposed entropy functional inherits the ability of the original definition to directly quantify information from data, but enjoys additional advantages including robustness and efficient calculation. Specifically, our low-rank variant is more sensitive to informative perturbations induced by changes in underlying distributions, while being insensitive to uninformative ones caused by noise. Moreover, low-rank R\'enyi's entropy can be efficiently approximated by random projection and Lanczos iteration techniques, reducing the overall complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2 s)$ or even $\mathcal{O}(ns^2)$, where $n$ is the number of data samples and $s \ll n$. We conduct large-scale experiments to evaluate the effectiveness of this new information measure, demonstrating superior results compared to matrix-based R\'enyi's entropy in terms of both performance and computational efficiency.
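For orientation, the standard matrix-based R\'enyi's $\alpha$-order entropy that this abstract builds on is commonly written as below. The symbols $K$, $\kappa$, $A$ and $\lambda_i$ are notation assumed here for illustration and do not appear in the abstract; the paper's exact low-rank definition is not reproduced.

$$
K_{ij} = \kappa(x_i, x_j), \qquad
A_{ij} = \frac{1}{n}\,\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}, \qquad
S_\alpha(A) = \frac{1}{1-\alpha}\,\log_2 \operatorname{tr}\!\left(A^\alpha\right)
            = \frac{1}{1-\alpha}\,\log_2 \sum_{i=1}^{n} \lambda_i(A)^\alpha
$$

Here $A$ has unit trace, so its eigenvalues $\lambda_i(A)$ behave like a probability vector. The $\mathcal{O}(n^3)$ cost quoted above corresponds to the full eigendecomposition needed for $\lambda_1, \dots, \lambda_n$; a variant that depends only on a small number $s$ of leading eigenvalues would plausibly admit the cheaper random projection and Lanczos approximations the abstract mentions.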
[[2211.17142] Learning Label Modular Prompts for Text Classification in the Wild](http://arxiv.org/abs/2211.17142) #robust
Machine learning models usually assume i.i.d. data during training and testing, but data and tasks in the real world often change over time. To emulate the transient nature of the real world, we propose a challenging but practical task: text classification in-the-wild, which introduces different non-stationary training/testing stages. Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environments. However, current modular approaches in NLP do not take advantage of recent advances in parameter efficient tuning of pretrained language models. To close this gap, we propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks. In MODULARPROMPT, the input prompt consists of a sequence of soft label prompts, each encoding modular knowledge related to the corresponding class label. In two of the most formidable settings, MODULARPROMPT outperforms relevant baselines by a large margin, demonstrating strong generalisation ability. We also conduct comprehensive analysis to validate whether the learned prompts satisfy properties of a modular representation.
[[2211.17179] Investigation of Proper Orthogonal Decomposition for Echo State Networks](http://arxiv.org/abs/2211.17179) #robust
Echo State Networks (ESNs) are a type of Recurrent Neural Network that yields promising results in representing time series and nonlinear dynamic systems. Although they are equipped with a very efficient training procedure, Reservoir Computing strategies, such as the ESN, require the use of high order networks, i.e. a large number of layers, resulting in a number of states that is orders of magnitude higher than the number of model inputs and outputs. This not only makes the computation of a time step more costly, but also may pose robustness issues when applying ESNs to problems such as Model Predictive Control (MPC) and other optimal control problems. One way to circumvent this is through Model Order Reduction strategies such as the Proper Orthogonal Decomposition (POD) and its variants (POD-DEIM), whereby we find an equivalent lower order representation of an already trained high dimension ESN. The objective of this work is to investigate and analyze the performance of POD methods in Echo State Networks, evaluating their effectiveness. To this end, we evaluate the Memory Capacity (MC) of the POD-reduced network in comparison to the original (full order) ESN. We also perform experiments on two different numerical case studies: a NARMA10 difference equation and an oil platform containing two wells and one riser. The results show that there is little loss of performance comparing the original ESN to a POD-reduced counterpart, and also that the performance of a POD-reduced ESN tends to be superior to a normal ESN of the same size. We also attain speedups of around $80\%$ in comparison to the original ESN.
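As a rough sketch of what POD reduction of a trained ESN typically looks like, a textbook Galerkin projection can be written as follows. The symbols ($x_k$, $u_k$, $W$, $W_{\mathrm{in}}$, $b$, $W_{\mathrm{out}}$, $T$, $z_k$, $r$) and the leak-free $\tanh$ state update are assumptions made here for illustration; the abstract does not specify the paper's exact formulation.

$$
\begin{aligned}
\text{full-order ESN:}\quad & x_{k+1} = \tanh\!\left(W x_k + W_{\mathrm{in}} u_{k+1} + b\right), \qquad y_k = W_{\mathrm{out}} x_k, \qquad x_k \in \mathbb{R}^{n} \\
\text{POD basis:}\quad & X = [\,x_1 \;\cdots\; x_N\,] = U \Sigma V^{\top}, \qquad T = U_{[:,\,1:r]}, \qquad r \ll n \\
\text{reduced ESN:}\quad & z_{k+1} = T^{\top} \tanh\!\left(W T z_k + W_{\mathrm{in}} u_{k+1} + b\right), \qquad y_k \approx W_{\mathrm{out}} T z_k, \qquad z_k \in \mathbb{R}^{r}
\end{aligned}
$$

In this form the nonlinearity is still evaluated in all $n$ dimensions, which is the usual motivation for DEIM-style variants such as the POD-DEIM mentioned above.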
[[2211.17204] Semisoft Task Clustering for Multi-Task Learning](http://arxiv.org/abs/2211.17204) #robust
Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them. Due to their flexibility and ability to reduce unknown coefficients substantially, the task-clustering-based MTL approaches have attracted considerable attention. Motivated by the idea of semisoft clustering of data, we propose a semisoft task clustering approach, which can simultaneously reveal the task cluster structure for both pure and mixed tasks as well as select the relevant features. The main assumption behind our approach is that each cluster has some pure tasks, and each mixed task can be represented by a linear combination of pure tasks in different clusters. To solve the resulting non-convex constrained optimization problem, we design an efficient three-step algorithm. The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach. Finally, we extend the proposed approach to a robust task clustering problem.
[[2211.16746] ClaRet -- A CNN Architecture for Optical Coherence Tomography](http://arxiv.org/abs/2211.16746) #extraction
Optical Coherence Tomography is a technique used to scan the Retina of the eye and check for tears. In this paper, we develop a Convolutional Neural Network Architecture for OCT scan classification. The model is trained to detect Retinal tears from an OCT scan and classify the type of tear. We designed a block-based approach to accompany a pre-trained VGG-19 using Transfer Learning by writing customised layers in blocks for better feature extraction. The approach achieved substantially better results than the baseline we initially started out with.
[[2211.16506] Where did you tweet from? Inferring the origin locations of tweets based on contextual information](http://arxiv.org/abs/2211.16506) #extraction
Public conversations on Twitter comprise many pertinent topics including disasters, protests, politics, propaganda, sports, climate change, epidemics/pandemic outbreaks, etc., that can have both regional and global aspects. Spatial discourse analysis relies on geographical data. However, today less than 1% of tweets are geotagged, whether with a point location or with bounding place information. A major issue with tweets is that Twitter users can be at location A and exchange conversations specific to location B, which we call the Location A/B problem. The problem is considered solved if location entities can be classified as either origin locations (Location As) or non-origin locations (Location Bs). In this work, we propose a simple yet effective framework--the True Origin Model--to address the problem that uses machine-level natural language understanding to identify tweets that conceivably contain their origin location information. The model achieves promising accuracy at country (80%), state (67%), city (58%), county (56%) and district (64%) levels with support from a Location Extraction Model as basic as the CoNLL-2003-based RoBERTa. We employ a tweet contextualizer (locBERT), which is one of the core components of the proposed model, to investigate multiple tweets' distributions for understanding Twitter users' tweeting behavior in terms of mentioning origin and non-origin locations. We also highlight a major concern with the currently regarded gold standard test set (ground truth) methodology, introduce a new data set, and identify further research avenues for advancing the area.
[[2211.16971] A Pipeline for Generating, Annotating and Employing Synthetic Data for Real World Question Answering](http://arxiv.org/abs/2211.16971) #extraction
Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents. State-of-the-art QA models are usually pre-trained on domain-general corpora like Wikipedia and thus tend to struggle on out-of-domain documents without fine-tuning. We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models, while still providing significant improvements to QA performance. We present two new tools for this task: A flexible pipeline for validating the synthetic QA data and training downstream models on it, and an online interface to facilitate human annotation of this generated data. Using this interface, crowdworkers labelled 1117 synthetic QA pairs, which we then used to fine-tune downstream models and improve domain-specific QA performance by 8.75 F1.
[[2211.16759] General policy mapping: online continual reinforcement learning inspired on the insect brain](http://arxiv.org/abs/2211.16759) #extraction
We have developed a model for online continual or lifelong reinforcement learning (RL) inspired by the insect brain. Our model leverages the offline training of a feature extraction and a common general policy layer to enable the convergence of RL algorithms in online settings. Sharing a common policy layer across tasks leads to positive backward transfer, where the agent continuously improves in older tasks sharing the same underlying general policy. Biologically inspired restrictions to the agent's network are key for the convergence of RL algorithms. This provides a pathway towards efficient online RL in resource-constrained scenarios.
[[2211.16669] FedGPO: Heterogeneity-Aware Global Parameter Optimization for Efficient Federated Learning](http://arxiv.org/abs/2211.16669) #federate
Federated learning (FL) has emerged as a solution to deal with the risk of privacy leaks in machine learning training. This approach allows a variety of mobile devices to collaboratively train a machine learning model without sharing the raw on-device training data with the cloud. However, efficient edge deployment of FL is challenging because of the system/data heterogeneity and runtime variance. This paper optimizes the energy-efficiency of FL use cases while guaranteeing model convergence, by accounting for the aforementioned challenges. We propose FedGPO, based on reinforcement learning, which learns how to identify optimal global parameters (B, E, K) for each FL aggregation round, adapting to the system/data heterogeneity and stochastic runtime variance. In our experiments, FedGPO improves the model convergence time by 2.4 times and achieves 3.6 times higher energy efficiency over the baseline settings.
[[2211.17073] Risks to Zero Trust in a Federated Mission Partner Environment](http://arxiv.org/abs/2211.17073) #federate
Recent cybersecurity events have prompted the federal government to begin investigating strategies to transition to Zero Trust Architectures (ZTA) for federal information systems. Within federated mission networks, ZTA provides measures to minimize the potential for unauthorized release and disclosure of information outside bilateral and multilateral agreements. When federating with mission partners, there are potential risks that may undermine the benefits of Zero Trust. This paper explores risks associated with integrating multiple identity models and proposes two potential avenues to investigate in order to mitigate these risks.
[[2211.16952] On the Design of Communication-Efficient Federated Learning for Health Monitoring](http://arxiv.org/abs/2211.16952) #federate
With the booming deployment of the Internet of Things, health monitoring applications have gradually prospered. During the recent COVID-19 pandemic, interest in permanent remote health monitoring solutions has risen, with the aim of reducing contact and preserving limited medical resources. Among the technological methods to realize efficient remote health monitoring, federated learning (FL) has drawn particular attention due to its robustness in preserving data privacy. However, FL can incur high communication costs, due to frequent transmissions between the FL server and clients. To tackle this problem, we propose in this paper a communication-efficient federated learning (CEFL) framework that involves client clustering and transfer learning. First, we propose to group clients through the calculation of similarity factors, based on the neural networks' characteristics. Then, a representative client in each cluster is selected to be the leader of the cluster. Unlike conventional FL, our method performs FL training only among the cluster leaders. Subsequently, transfer learning is adopted by the leader to update its cluster members with the trained FL model. Finally, each member fine-tunes the received model with its own data. To further reduce the communication costs, we opt for a partial-layer FL aggregation approach. This method suggests partially updating the neural network model rather than fully. Through experiments, we show that CEFL can save up to 98.45% in communication costs while conceding less than 3% in accuracy loss, when compared to conventional FL. Finally, CEFL demonstrates a high accuracy for clients with small or unbalanced datasets.
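The partial-layer aggregation among cluster leaders described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the dictionary-of-lists model representation, the choice of which layers are shared, and the unweighted average are all assumptions made here.

```python
def aggregate_partial(leader_models, shared_layers):
    """Average only the layers named in shared_layers across cluster leaders.

    leader_models: list of dicts mapping layer name -> list of float weights
    (a hypothetical, simplified model format). Unweighted averaging is an
    assumption; FedAvg-style schemes often weight by local dataset size.
    """
    n = len(leader_models)
    averaged = {}
    for name in shared_layers:
        # zip(*...) walks the same weight position across all leaders.
        columns = zip(*(model[name] for model in leader_models))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


def apply_update(model, averaged):
    """Return a copy of model with the aggregated layers overwritten.

    Layers that were not aggregated stay local to the client.
    """
    updated = {name: list(weights) for name, weights in model.items()}
    for name, weights in averaged.items():
        updated[name] = list(weights)
    return updated


# Two hypothetical cluster leaders sharing only the "conv" layer.
leaders = [
    {"conv": [1.0, 3.0], "head": [0.0]},
    {"conv": [3.0, 5.0], "head": [10.0]},
]
shared = aggregate_partial(leaders, ["conv"])
new_leader_0 = apply_update(leaders[0], shared)
print(shared, new_leader_0)
```

Only the shared layers would need to be transmitted in each round, which is where a partial-layer scheme saves communication; each leader would then pass the updated model on to its cluster members for local fine-tuning.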
[[2211.16965] Federated deep clustering with GAN-based data synthesis](http://arxiv.org/abs/2211.16965) #federate
Clustering has been extensively studied in centralized settings, but remains relatively unexplored in federated ones, where data are distributed among multiple clients and can only be kept local at the clients. The necessity to invest more resources in improving federated clustering methods is twofold: 1) The performance of supervised federated learning models can benefit from clustering. 2) It is non-trivial to extend centralized methods to perform federated clustering tasks. In centralized settings, various deep clustering methods that perform dimensionality reduction and clustering jointly have achieved great success. To obtain high-quality cluster information, it is natural but non-trivial to extend these methods to federated settings. For this purpose, we propose a simple but effective federated deep clustering method. It requires only one communication round between the central server and clients, can run asynchronously, and can handle device failures. Moreover, although most studies have highlighted adverse effects of the non-independent and identically distributed (non-IID) data across clients, experimental results indicate that the proposed method can significantly benefit from this scenario.
[[2211.17196] EURO: ESPnet Unsupervised ASR Open-source Toolkit](http://arxiv.org/abs/2211.17196) #fair
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by the Wav2vec-U, originally implemented at FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extends the functionality and promotes reproducibility for UASR tasks by integrating S3PRL and k2, resulting in flexible frontends from 27 self-supervised models and various graph-based decoding strategies. EURO is implemented in ESPnet and follows its unified pipeline to provide UASR recipes with a complete setup. This improves the pipeline's efficiency and allows EURO to be easily applied to existing datasets in ESPnet. Extensive experiments on three mainstream self-supervised models demonstrate the toolkit's effectiveness and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech datasets. EURO will be publicly available at https://github.com/espnet/espnet, aiming to promote this exciting and emerging research area based on UASR through open-source activity.
[[2211.16578] ButterflyNet2D: Bridging Classical Methods and Neural Network Methods in Image Processing](http://arxiv.org/abs/2211.16578) #interpretability
Both classical Fourier transform-based methods and neural network methods are widely used in image processing tasks. The former have better interpretability, whereas the latter often achieve better performance in practice. This paper introduces ButterflyNet2D, a regular CNN with sparse cross-channel connections. A Fourier initialization strategy for ButterflyNet2D is proposed to approximate Fourier transforms. Numerical experiments validate the accuracy of ButterflyNet2D approximating both the Fourier and the inverse Fourier transforms. Moreover, through four image processing tasks and image datasets, we show that training ButterflyNet2D from Fourier initialization does achieve better performance than randomly initialized neural networks.
[[2211.16699] Interpretability and accessibility of machine learning in selected food processing, agriculture and health applications](http://arxiv.org/abs/2211.16699) #interpretability
Artificial Intelligence (AI) and its data-centric branch of machine learning (ML) have greatly evolved over the last few decades. However, as AI is used increasingly in real world use cases, the importance of the interpretability of and accessibility to AI systems have become major research areas. The lack of interpretability of ML based systems is a major hindrance to widespread adoption of these powerful algorithms. This is due to many reasons including ethical and regulatory concerns, which have resulted in poorer adoption of ML in some areas. The recent past has seen a surge in research on interpretable ML. Generally, designing a ML system requires good domain understanding combined with expert knowledge. New techniques are emerging to improve ML accessibility through automated model design. This paper provides a review of the work done to improve interpretability and accessibility of machine learning in the context of global problems while also being relevant to developing countries. We review work under multiple levels of interpretability including scientific and mathematical interpretation, statistical interpretation and partial semantic interpretation. This review includes applications in three areas, namely food processing, agriculture and health.
[[2211.16736] Understanding transit ridership in an equity context through a comparison of statistical and machine learning algorithms](http://arxiv.org/abs/2211.16736) #interpretability
Building an accurate model of travel behaviour based on individuals' characteristics and built environment attributes is of importance for policy-making and transportation planning. Recent experiments with big data and Machine Learning (ML) algorithms toward a better travel behaviour analysis have mainly overlooked socially disadvantaged groups. Accordingly, in this study, we explore the travel behaviour responses of low-income individuals to transit investments in the Greater Toronto and Hamilton Area, Canada, using statistical and ML models. We first investigate how the model choice affects the prediction of transit use by the low-income group. This step includes comparing the predictive performance of traditional and ML algorithms and then evaluating a transit investment policy by contrasting the predicted activities and the spatial distribution of transit trips generated by vulnerable households after improving accessibility. We also empirically investigate the proposed transit investment by each algorithm and compare it with the city of Brampton's future transportation plan. While, unsurprisingly, the ML algorithms outperform classical models, there are still doubts about using them due to interpretability concerns. Hence, we adopt recent local and global model-agnostic interpretation tools to interpret how the model arrives at its predictions. Our findings reveal the great potential of ML algorithms for enhanced travel behaviour predictions for low-income strata without considerably sacrificing interpretability.
[[2211.17014] An Interpretable Hybrid Predictive Model of COVID-19 Cases using Autoregressive Model and LSTM](http://arxiv.org/abs/2211.17014) #interpretability
The Coronavirus Disease 2019 (COVID-19) has posed a severe threat to global human health and the economy. It is an urgent task to build reliable data-driven prediction models for COVID-19 cases to improve public policy making. However, COVID-19 data shows special transmission characteristics such as significant fluctuations and non-stationarity, which may be difficult for a single predictive model to capture and pose grand challenges for effective forecasting. In this paper, we propose a novel Hybrid data-driven model combining an Autoregressive model (AR) and long short-term memory neural networks (LSTM). It can be viewed as a new neural network model, and the contributions of AR and LSTM are auto-tuned in the training procedure. We conduct extensive numerical experiments on data collected from 8 counties of California that display various trends. The numerical results show the Hybrid model's advantages over AR and LSTM in predictive power. We show that the Hybrid model achieved 4.195\% MAPE, outperforming AR (5.629\%) and LSTM (5.070\%) on average, and provide a discussion on interpretability.
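To make the reported metric concrete, the sketch below computes MAPE with its usual definition and the relative reduction implied by the three figures quoted above. The helper names and the toy input series are hypothetical; only the three MAPE values come from the abstract, and the paper's exact averaging procedure across counties is not specified there.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Uses the usual definition: mean of |a - p| / |a| times 100.
    Every actual value must be nonzero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy series (made up) to show the metric itself.
print(f"toy MAPE: {mape([100.0, 200.0], [110.0, 180.0]):.2f}%")

# MAPE values quoted in the abstract.
hybrid, ar, lstm = 4.195, 5.629, 5.070
print(f"vs AR:   {relative_reduction(ar, hybrid):.1f}% lower MAPE")
print(f"vs LSTM: {relative_reduction(lstm, hybrid):.1f}% lower MAPE")
```

Reading the gap as a relative reduction instead of a difference in percentage points makes the comparison between the two baselines easier to interpret.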
[[2211.17264] Interpretability with full complexity by constraining feature information](http://arxiv.org/abs/2211.17264) #interpretability
Interpretability is a pressing issue for machine learning. Common approaches to interpretable machine learning constrain interactions between features of the input, rendering the effects of those features on a model's output comprehensible but at the expense of model complexity. We approach interpretability from a new angle: constrain the information about the features without restricting the complexity of the model. Borrowing from information theory, we use the Distributed Information Bottleneck to find optimal compressions of each feature that maximally preserve information about the output. The learned information allocation, by feature and by feature value, provides rich opportunities for interpretation, particularly in problems with many features and complex feature interactions. The central object of analysis is not a single trained model, but rather a spectrum of models serving as approximations that leverage variable amounts of information about the inputs. Information is allocated to features by their relevance to the output, thereby solving the problem of feature selection by constructing a learned continuum of feature inclusion-to-exclusion. The optimal compression of each feature -- at every stage of approximation -- allows fine-grained inspection of the distinctions among feature values that are most impactful for prediction. We develop a framework for extracting insight from the spectrum of approximate models and demonstrate its utility on a range of tabular datasets.
[[2211.17015] Explaining automated gender classification of human gait](http://arxiv.org/abs/2211.17015) #explainability
State-of-the-art machine learning (ML) models are highly effective in classifying gait analysis data; however, they do not provide explanations for their predictions. This "black-box" characteristic makes it impossible to understand which input patterns ML models base their predictions on. The present study investigates whether Explainable Artificial Intelligence methods, i.e., Layer-wise Relevance Propagation (LRP), can be useful to enhance the explainability of ML predictions in gait classification. The research question was: Which input patterns are most relevant for an automated gender classification model and do they correspond to characteristics identified in the literature? We utilized a subset of the GAITREC dataset containing five bilateral ground reaction force (GRF) recordings per person during barefoot walking of 62 healthy participants: 34 females and 28 males. Each input signal (right and left side) was min-max normalized before concatenation and fed into a multi-layer Convolutional Neural Network (CNN). The classification accuracy was obtained over a stratified ten-fold cross-validation. To identify gender-specific patterns, the input relevance scores were derived using LRP. The mean classification accuracy of the CNN, 83.3%, showed a clear superiority over the zero-rule baseline of 54.8%.
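The zero-rule baseline and the min-max normalization mentioned above are simple enough to check by hand. The sketch below assumes the zero-rule classifier always predicts the majority class, so its accuracy is the majority-class share (34 of 62 participants); the function names and the toy signal are hypothetical and not taken from the paper.

```python
from collections import Counter


def min_max_normalize(signal):
    """Rescale a signal to the range [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]
    return [(value - lo) / (hi - lo) for value in signal]


def zero_rule_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class balance taken from the abstract: 34 females and 28 males.
labels = ["female"] * 34 + ["male"] * 28
print(f"zero-rule baseline: {zero_rule_accuracy(labels):.1%}")

# Toy force values (made up) to show the normalization.
print(min_max_normalize([2.0, 4.0, 6.0]))
```

Because every participant contributes the same number of recordings, the majority-class share is the same whether it is counted per person or per recording.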
[[2211.16582] SinDDM: A Single Image Denoising Diffusion Model](http://arxiv.org/abs/2211.16582) #diffusion
Denoising diffusion models (DDMs) have led to staggering performance leaps in image generation, editing and restoration. However, existing DDMs use very large datasets for training. Here, we introduce a framework for training a DDM on a single image. Our method, which we coin SinDDM, learns the internal statistics of the training image by using a multi-scale diffusion process. To drive the reverse diffusion process, we use a fully-convolutional light-weight denoiser, which is conditioned on both the noise level and the scale. This architecture allows generating samples of arbitrary dimensions, in a coarse-to-fine manner. As we illustrate, SinDDM generates diverse high-quality samples, and is applicable in a wide array of tasks, including style transfer and harmonization. Furthermore, it can be easily guided by external supervision. Particularly, we demonstrate text-guided generation from a single image using a pre-trained CLIP model.
[[2211.16677] 3D Neural Field Generation using Triplane Diffusion](http://arxiv.org/abs/2211.16677) #diffusion
Diffusion models have emerged as the state-of-the-art for image generation, among other tasks. Here, we present an efficient diffusion-based model for 3D-aware generation of neural fields. Our approach pre-processes training data, such as ShapeNet meshes, by converting them to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Thus, our 3D training scenes are all represented by 2D feature planes, and we can directly train existing 2D diffusion models on these representations to generate 3D neural fields with high quality and diversity, outperforming alternative approaches to 3D-aware generation. Our approach requires essential modifications to existing triplane factorization pipelines to make the resulting features easy to learn for the diffusion model. We demonstrate state-of-the-art results on 3D generation on several object classes from ShapeNet.
[[2211.16940] DiffPose: Toward More Reliable 3D Pose Estimation](http://arxiv.org/abs/2211.16940) #diffusion
Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose that facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP.
[[2211.17084] High-Fidelity Guided Image Synthesis with Latent Diffusion Models](http://arxiv.org/abs/2211.17084) #diffusion
Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.
[[2211.17091] Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models](http://arxiv.org/abs/2211.17091) #diffusion
While the success of diffusion models has been witnessed in various domains, only a few works have investigated the variation of the generative process. In this paper, we introduce a new generative process that is closer to the reverse process than the original generative process, given the identical score checkpoint. Specifically, we adjust the generative process with the auxiliary discriminator between the real data and the generated data. Consequently, the adjusted generative process with the discriminator generates more realistic samples than the original process. In experiments, we achieve new SOTA FIDs of 1.74 on CIFAR-10, 1.33 on CelebA, and 1.88 on FFHQ in the unconditional generation.
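A minimal sketch of how a discriminator can adjust a score, assuming the correction is the gradient of the discriminator's log-odds (the abstract does not spell out the formula). For an optimal discriminator d = p_data / (p_data + p_model), the log-odds equal log p_data - log p_model, so adding their gradient to the model score recovers the data score. The 1D Gaussians and all function names below are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian."""
    return -(x - mean) / std ** 2

def optimal_discriminator(x, data=(0.0, 1.0), model=(0.5, 1.0)):
    """Bayes-optimal discriminator between 'data' and 'model' densities (toy assumption)."""
    p = np.exp(gaussian_logpdf(x, *data))
    q = np.exp(gaussian_logpdf(x, *model))
    return p / (p + q)

def discriminator_correction(x, eps=1e-4):
    """Central-difference gradient of the discriminator log-odds log(d / (1 - d))."""
    def log_odds(z):
        d = optimal_discriminator(z)
        return np.log(d) - np.log1p(-d)
    return (log_odds(x + eps) - log_odds(x - eps)) / (2.0 * eps)

x = np.linspace(-2.0, 2.0, 5)
model_score = gaussian_score(x, 0.5, 1.0)                    # imperfect score
adjusted_score = model_score + discriminator_correction(x)   # score plus correction
data_score = gaussian_score(x, 0.0, 1.0)                     # target score
```

In this toy setting `adjusted_score` matches `data_score` up to finite-difference error; with a learned, imperfect discriminator the correction would only be approximate.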
[[2211.17106] Diffusion Probabilistic Model Made Slim](http://arxiv.org/abs/2211.17106) #diffusion
Despite the recent visually-pleasing results achieved, the massive computational cost has been a long-standing flaw for diffusion probabilistic models (DPMs), which, in turn, greatly limits their applications on resource-limited platforms. Prior methods towards efficient DPMs, however, have largely focused on accelerating inference while overlooking the models' huge complexity and size. In this paper, we make a dedicated attempt to lighten DPMs while striving to preserve their favourable performance. We start by training a small-sized latent diffusion model (LDM) from scratch, but observe a significant fidelity drop in the synthetic images. Through a thorough assessment, we find that DPMs are intrinsically biased against high-frequency generation and learn to recover different frequency components at different time-steps. These properties make compact networks unable to represent frequency dynamics with accurate high-frequency estimation. To this end, we introduce a customized design for slim DPMs, which we term Spectral Diffusion (SD), for light-weight image synthesis. SD incorporates wavelet gating in its architecture to enable frequency-dynamic feature extraction at every reverse step, and conducts spectrum-aware distillation to promote high-frequency recovery by inversely weighting the objective based on spectrum magnitudes. Experimental results demonstrate that SD achieves an 8-18x reduction in computational complexity compared to latent diffusion models on a series of conditional and unconditional image generation tasks while retaining competitive image fidelity.
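The idea of inversely weighting an objective by spectrum magnitude can be illustrated with a rough sketch. It assumes a plain 2D Fourier transform as a stand-in for the wavelet-based design mentioned above, and the weighting formula, the damping constant, and all function names are hypothetical rather than taken from the paper. Frequencies where the target has little energy (typically high frequencies in natural images) receive larger weights, so errors there are penalised more.

```python
import numpy as np

def inverse_spectrum_weights(target, damping=1e-3):
    """Per-frequency weights inversely proportional to the target's spectrum magnitude."""
    magnitude = np.abs(np.fft.fft2(target))
    eps = damping * magnitude.max()          # avoids division by (near) zero
    weights = 1.0 / (magnitude + eps)
    return weights / weights.mean()          # normalise to mean 1

def spectrum_weighted_loss(prediction, target):
    """Squared error in the frequency domain, reweighted toward weak frequencies."""
    weights = inverse_spectrum_weights(target)
    error = np.abs(np.fft.fft2(prediction) - np.fft.fft2(target)) ** 2
    return float(np.mean(weights * error))

# Toy image: strong low-frequency content, weak high-frequency content.
n = 32
cols = np.arange(n)
low = np.tile(np.cos(2.0 * np.pi * 1 * cols / n), (n, 1))
high = np.tile(np.cos(2.0 * np.pi * 12 * cols / n), (n, 1))
target = 1.0 * low + 0.1 * high

# Two predictions with the same pixel-space error energy.
pred_low_error = target + 0.05 * low
pred_high_error = target + 0.05 * high

mse_low = float(np.mean((pred_low_error - target) ** 2))
mse_high = float(np.mean((pred_high_error - target) ** 2))
loss_low = spectrum_weighted_loss(pred_low_error, target)
loss_high = spectrum_weighted_loss(pred_high_error, target)
```

Plain mean squared error cannot tell the two predictions apart, whereas the weighted loss penalises the high-frequency mistake more heavily.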
[[2211.16750] Score-based Continuous-time Discrete Diffusion Models](http://arxiv.org/abs/2211.16750) #diffusion
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
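To make the continuous-time Markov chain concrete, here is a small sketch of a forward corruption process on categorical data. It assumes a uniform rate matrix Q = beta * (J - N * I), where J is the all-ones matrix, which is one simple choice and not necessarily the one used in the paper. For this Q the transition matrix exp(tQ) has the closed form exp(-N * beta * t) * I + (1 - exp(-N * beta * t)) / N * J. All names and parameter values are hypothetical.

```python
import numpy as np

def uniform_rate_matrix(num_states, beta):
    """Rate matrix with equal jump rate beta between every pair of distinct states."""
    ones = np.ones((num_states, num_states))
    return beta * (ones - num_states * np.eye(num_states))

def uniform_transition_matrix(num_states, beta, t):
    """Closed-form exp(t * Q) for the uniform rate matrix."""
    decay = np.exp(-num_states * beta * t)
    ones = np.ones((num_states, num_states))
    return decay * np.eye(num_states) + (1.0 - decay) / num_states * ones

def expm_taylor(matrix, terms=40):
    """Truncated Taylor series of the matrix exponential, used only as a check."""
    result = np.eye(matrix.shape[0])
    term = np.eye(matrix.shape[0])
    for k in range(1, terms):
        term = term @ matrix / k
        result = result + term
    return result

def corrupt(tokens, num_states, beta, t, rng):
    """Sample the state at time t for each token, given its state at time 0."""
    transition = uniform_transition_matrix(num_states, beta, t)
    return np.array([rng.choice(num_states, p=transition[tok]) for tok in tokens])

num_states, beta, t = 4, 0.5, 0.7
Q = uniform_rate_matrix(num_states, beta)
P = uniform_transition_matrix(num_states, beta, t)
rng = np.random.default_rng(0)
noisy = corrupt(np.array([0, 1, 2, 3, 0, 1]), num_states, beta, t, rng)
```

As t grows, every row of `P` approaches the uniform distribution, which plays the role that Gaussian noise plays in continuous diffusion. The reverse (denoising) chain described in the abstract is not sketched here.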