[[2212.14230] Exploring Depth Information for Face Manipulation Detection](http://arxiv.org/abs/2212.14230) #security
Face manipulation detection has been receiving a lot of attention for the reliability and security of the face images. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which are shown to be promising. As one of the important face features, the face depth map, which has shown to be effective in other areas such as the face recognition or face detection, is unfortunately paid little attention to in literature for detecting the manipulated face images. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information to tackle the problem of face manipulation detection in real world applications. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from a RGB face image, which is able to capture the local depth anomaly created due to manipulation. The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features using a Multi-head Depth Attention (MDA) mechanism that is newly designed. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.
[[2212.14629] Hierarchical Forgery Classifier On Multi-modality Face Forgery Clues](http://arxiv.org/abs/2212.14629) #security
Face forgery detection plays an important role in personal privacy and social security. With the development of adversarial generative models, high-quality forgery images become more and more indistinguishable from real to humans. Existing methods always regard as forgery detection task as the common binary or multi-label classification, and ignore exploring diverse multi-modality forgery image types, e.g. visible light spectrum and near-infrared scenarios. In this paper, we propose a novel Hierarchical Forgery Classifier for Multi-modality Face Forgery Detection (HFC-MFFD), which could effectively learn robust patches-based hybrid domain representation to enhance forgery authentication in multiple-modality scenarios. The local spatial hybrid domain feature module is designed to explore strong discriminative forgery clues both in the image and frequency domain in local distinct face regions. Furthermore, the specific hierarchical face forgery classifier is proposed to alleviate the class imbalance problem and further boost detection performance. Experimental results on representative multi-modality face forgery datasets demonstrate the superior performance of the proposed HFC-MFFD compared with state-of-the-art algorithms. The source code and models are publicly available at https://github.com/EdWhites/HFC-MFFD.
[[2212.14296] Towards Comprehensively Understanding the Run-time Security of Programmable Logic Controllers: A 3-year Empirical Study](http://arxiv.org/abs/2212.14296) #security
Programmable Logic Controllers (PLCs) are the core control devices in Industrial Control Systems (ICSs), which control and monitor the underlying physical plants such as power grids. PLCs were initially designed to work in a trusted industrial network, which however can be brittle once deployed in an Internet-facing (or penetrated) network. Yet, there is a lack of systematic empirical analysis of the run-time security of modern real-world PLCs. To close this gap, we present the first large-scale measurement on 23 off-the-shelf PLCs across 13 leading vendors. We find many common security issues and unexplored implications that should be more carefully addressed in the design and implementation. To sum up, the unsupervised logic applications can cause system resource/privilege abuse, which gives adversaries new means to hijack the control flow of a runtime system remotely (without exploiting memory vulnerabilities); 2) the improper access control mechanisms bring many unauthorized access implications; 3) the proprietary or semi-proprietary protocols are fragile regarding confidentiality and integrity protection of run-time data. We empirically evaluated the corresponding attack vectors on multiple PLCs, which demonstrates that the security implications are severe and broad. Our findings were reported to the related parties responsibly, and 20 bugs have been confirmed with 7 assigned CVEs.
[[2212.14422] Security, Privacy and Challenges in Microservices Architecture and Cloud Computing- Survey](http://arxiv.org/abs/2212.14422) #security
Security issues in processor architectures remain really critical since users and devices continue to share computing as well as networking resources. So, preserving data privacy in such an environment is really a critical concern. We know that there is a continuous growth in security and privacy issues that need to be addressed. Here, we have chosen a microservice architecture, which is a small or even an independent microprocess that interacts, acts, and responds to messages via lightweight technologies such as Thrift, HTTP, or REST API.
[[2212.14760] Deep Hierarchy Quantization Compression algorithm based on Dynamic Sampling](http://arxiv.org/abs/2212.14760) #security
Unlike traditional distributed machine learning, federated learning stores data locally for training and then aggregates the models on the server, which solves the data security problem that may arise in traditional distributed machine learning. However, during the training process, the transmission of model parameters can impose a significant load on the network bandwidth. It has been pointed out that the vast majority of model parameters are redundant during model parameter transmission. In this paper, we explore the data distribution law of selected partial model parameters on this basis, and propose a deep hierarchical quantization compression algorithm, which further compresses the model and reduces the network load brought by data transmission through the hierarchical quantization of model parameters. And we adopt a dynamic sampling strategy for the selection of clients to accelerate the convergence of the model. Experimental results on different public datasets demonstrate the effectiveness of our algorithm.
[[2212.14141] $\pi$QLB: A Privacy-preserving with Integrity-assuring Query Language for Blockchain](http://arxiv.org/abs/2212.14141) #privacy
The increase in the adoption of blockchain technology in different application domains e.g., healthcare systems, supplychain management, has raised the demand for a data query mechanism on blockchain. Since current blockchain systems lack the support for querying data with embedded security and privacy guarantees, there exists inherent security and privacy concerns on those systems. In particular, existing systems require users to submit queries to blockchain operators (e.g., a node validator) in plaintext. This directly jeopardizes users' privacy as the submitted queries may contain sensitive information, e.g., location or gender preferences, that the users may not be comfortable sharing. On the other hand, currently, the only way for users to ensure integrity of the query result is to maintain the entire blockchain database and perform the queries locally. Doing so incurs high storage and computational costs on the users, precluding this approach to be practically deployable on common light-weight devices (e.g., smartphones). To this end, this paper proposes $\pi$QLB, a query language for blockchain systems that ensures both confidentiality of query inputs and integrity of query results. Additionally, $\pi$QLB enables SQL-like queries over the blockchain data by introducing relational data semantics into the existing blockchain database. $\pi$QLB has applied the recent cryptography primitive, i.e., function secret sharing (FSS), to achieve confidentiality. To support integrity, we extend the traditional FSS setting in such a way that integrity of FSS results can be efficiently verified. Successful verification indicates absence of malicious behaviors on the servers, allowing the user to establish trust from the result. To the best of our knowledge, $\pi$QLB is the first query model designed for blockchain databases with support for confidentiality, integrity, and SQL-like queries.
[[2212.14527] Estimating Latent Population Flows from Aggregated Data via Inversing Multi-Marginal Optimal Transport](http://arxiv.org/abs/2212.14527) #privacy
We study the problem of estimating latent population flows from aggregated count data. This problem arises when individual trajectories are not available due to privacy issues or measurement fidelity. Instead, the aggregated observations are measured over discrete-time points, for estimating the population flows among states. Most related studies tackle the problems by learning the transition parameters of a time-homogeneous Markov process. Nonetheless, most real-world population flows can be influenced by various uncertainties such as traffic jam and weather conditions. Thus, in many cases, a time-homogeneous Markov model is a poor approximation of the much more complex population flows. To circumvent this difficulty, we resort to a multi-marginal optimal transport (MOT) formulation that can naturally represent aggregated observations with constrained marginals, and encode time-dependent transition matrices by the cost functions. In particular, we propose to estimate the transition flows from aggregated data by learning the cost functions of the MOT framework, which enables us to capture time-varying dynamic patterns. The experiments demonstrate the improved accuracy of the proposed algorithms than the related methods in estimating several real-world transition flows.
[[2212.14720] Learning from Data Streams: An Overview and Update](http://arxiv.org/abs/2212.14720) #privacy
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
[[2212.14110] Learning Representations for Masked Facial Recovery](http://arxiv.org/abs/2212.14110) #protect
The pandemic of these very recent years has led to a dramatic increase in people wearing protective masks in public venues. This poses obvious challenges to the pervasive use of face recognition technology that now is suffering a decline in performance. One way to address the problem is to revert to face recovery methods as a preprocessing step. Current approaches to face reconstruction and manipulation leverage the ability to model the face manifold, but tend to be generic. We introduce a method that is specific for the recovery of the face image from an image of the same individual wearing a mask. We do so by designing a specialized GAN inversion method, based on an appropriate set of losses for learning an unmasking encoder. With extensive experiments, we show that the approach is effective at unmasking face images. In addition, we also show that the identity information is preserved sufficiently well to improve face verification performance based on several face recognition benchmark datasets.
[[2212.14709] A Learning-Based Optimal Uncertainty Quantification Method and Its Application to Ballistic Impact Problems](http://arxiv.org/abs/2212.14709) #protect
This paper concerns the study of optimal (supremum and infimum) uncertainty bounds for systems where the input (or prior) probability measure is only partially/imperfectly known (e.g., with only statistical moments and/or on a coarse topology) rather than fully specified. Such partial knowledge provides constraints on the input probability measures. The theory of Optimal Uncertainty Quantification allows us to convert the task into a constraint optimization problem where one seeks to compute the least upper/greatest lower bound of the system's output uncertainties by finding the extremal probability measure of the input. Such optimization requires repeated evaluation of the system's performance indicator (input to performance map) and is high-dimensional and non-convex by nature. Therefore, it is difficult to find the optimal uncertainty bounds in practice. In this paper, we examine the use of machine learning, especially deep neural networks, to address the challenge. We achieve this by introducing a neural network classifier to approximate the performance indicator combined with the stochastic gradient descent method to solve the optimization problem. We demonstrate the learning based framework on the uncertainty quantification of the impact of magnesium alloys, which are promising light-weight structural and protective materials. Finally, we show that the approach can be used to construct maps for the performance certificate and safety design in engineering practice.
[[2212.14647] RL and Fingerprinting to Select Moving Target Defense Mechanisms for Zero-day Attacks in IoT](http://arxiv.org/abs/2212.14647) #defense
Cybercriminals are moving towards zero-day attacks affecting resource-constrained devices such as single-board computers (SBC). Assuming that perfect security is unrealistic, Moving Target Defense (MTD) is a promising approach to mitigate attacks by dynamically altering target attack surfaces. Still, selecting suitable MTD techniques for zero-day attacks is an open challenge. Reinforcement Learning (RL) could be an effective approach to optimize the MTD selection through trial and error, but the literature fails when i) evaluating the performance of RL and MTD solutions in real-world scenarios, ii) studying whether behavioral fingerprinting is suitable for representing SBC's states, and iii) calculating the consumption of resources in SBC. To improve these limitations, the work at hand proposes an online RL-based framework to learn the correct MTD mechanisms mitigating heterogeneous zero-day attacks in SBC. The framework considers behavioral fingerprinting to represent SBCs' states and RL to learn MTD techniques that mitigate each malicious state. It has been deployed on a real IoT crowdsensing scenario with a Raspberry Pi acting as a spectrum sensor. More in detail, the Raspberry Pi has been infected with different samples of command and control malware, rootkits, and ransomware to later select between four existing MTD techniques. A set of experiments demonstrated the suitability of the framework to learn proper MTD techniques mitigating all attacks (except a harmfulness rootkit) while consuming <1 MB of storage and utilizing <55% CPU and <80% RAM.
[[2212.14677] Adversarial attacks and defenses on ML- and hardware-based IoT device fingerprinting and identification](http://arxiv.org/abs/2212.14677) #defense
In the last years, the number of IoT devices deployed has suffered an undoubted explosion, reaching the scale of billions. However, some new cybersecurity issues have appeared together with this development. Some of these issues are the deployment of unauthorized devices, malicious code modification, malware deployment, or vulnerability exploitation. This fact has motivated the requirement for new device identification mechanisms based on behavior monitoring. Besides, these solutions have recently leveraged Machine and Deep Learning techniques due to the advances in this field and the increase in processing capabilities. In contrast, attackers do not stay stalled and have developed adversarial attacks focused on context modification and ML/DL evaluation evasion applied to IoT device identification solutions. This work explores the performance of hardware behavior-based individual device identification, how it is affected by possible context- and ML/DL-focused attacks, and how its resilience can be improved using defense techniques. In this sense, it proposes an LSTM-CNN architecture based on hardware performance behavior for individual device identification. Then, previous techniques have been compared with the proposed architecture using a hardware performance dataset collected from 45 Raspberry Pi devices running identical software. The LSTM-CNN improves previous solutions achieving a +0.96 average F1-Score and 0.8 minimum TPR for all devices. Afterward, context- and ML/DL-focused adversarial attacks were applied against the previous model to test its robustness. A temperature-based context attack was not able to disrupt the identification. However, some ML/DL state-of-the-art evasion attacks were successful. Finally, adversarial training and model distillation defense techniques are selected to improve the model resilience to evasion attacks, without degrading its performance.
[[2212.14875] Guidance Through Surrogate: Towards a Generic Diagnostic Attack](http://arxiv.org/abs/2212.14875) #attack
Adversarial training is an effective approach to make deep neural networks
robust against adversarial attacks. Recently, different adversarial training
defenses are proposed that not only maintain a high clean accuracy but also
show significant robustness against popular and well studied adversarial
attacks such as PGD. High adversarial robustness can also arise if an attack
fails to find adversarial gradient directions, a phenomenon known as gradient
masking'. In this work, we analyse the effect of label smoothing on adversarial
training as one of the potential causes of gradient masking. We then develop a
guided mechanism to avoid local minima during attack optimization, leading to a
novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack
approach is based on a
match and deceive' loss that finds optimal adversarial
directions through guidance from a surrogate model. Our modified attack does
not require random restarts, large number of attack iterations or search for an
optimal step-size. Furthermore, our proposed G-PGA is generic, thus it can be
combined with an ensemble attack strategy as we demonstrate for the case of
Auto-Attack, leading to efficiency and convergence speed improvements. More
than an effective attack, G-PGA can be used as a diagnostic tool to reveal
elusive robustness due to gradient masking in adversarial defenses.
[[2212.14109] Synthesis of Adversarial DDOS Attacks Using Tabular Generative Adversarial Networks](http://arxiv.org/abs/2212.14109) #attack
Network Intrusion Detection Systems (NIDS) are tools or software that are widely used to maintain the computer networks and information systems keeping them secure and preventing malicious traffics from penetrating into them, as they flag when somebody is trying to break into the system. Best effort has been set up on these systems, and the results achieved so far are quite satisfying, however, new types of attacks stand out as the technology of attacks keep evolving, one of these attacks are the attacks based on Generative Adversarial Networks (GAN) that can evade machine learning IDS leaving them vulnerable. This project investigates the impact of the Adversarial Attacks synthesized using real DDoS attacks generated using GANs on the IDS. The objective is to discover how will these systems react towards synthesized attacks. marking the vulnerability and weakness points of these systems so we could fix them.
[[2212.14315] "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice](http://arxiv.org/abs/2212.14315) #attack
Recent years have seen a proliferation of research on adversarial machine learning. Numerous papers demonstrate powerful algorithmic attacks against a wide variety of machine learning (ML) models, and numerous other papers propose defenses that can withstand most attacks. However, abundant real-world evidence suggests that actual attackers use simple tactics to subvert ML-driven systems, and as a result security practitioners have not prioritized adversarial ML defenses.
Motivated by the apparent gap between researchers and practitioners, this position paper aims to bridge the two domains. We first present three real-world case studies from which we can glean practical insights unknown or neglected in research. Next we analyze all adversarial ML papers recently published in top security conferences, highlighting positive trends and blind spots. Finally, we state positions on precise and cost-driven threat modeling, collaboration between industry and academia, and reproducible research. We believe that our positions, if adopted, will increase the real-world impact of future endeavours in adversarial ML, bringing both researchers and practitioners closer to their shared goal of improving the security of ML systems.
[[2212.14435] Identification and Verification of Attack-Tree Threat Models in Connected Vehicles](http://arxiv.org/abs/2212.14435) #attack
As a result of the ever-increasing application of cyber-physical components in the automotive industry, cybersecurity has become an urgent topic. Adapting technologies and communication protocols like Ethernet and WiFi in connected vehicles yields many attack scenarios. Consequently, ISO/SAE 21434 and UN R155 (2021) define a standard and regulatory framework for automotive cybersecurity. Both documents follow a risk management-based approach and require a threat modeling methodology for risk analysis and identification. Such a threat modeling methodology must conform to the Threat Analysis and Risk Assessment (TARA) framework of ISO/SAE 21434. Conversely, existing threat modeling methods enumerate isolated threats disregarding the vehicle's design and connections. Consequently, they neglect the role of attack paths from a vehicle's interfaces to its assets. In other words, they are missing the TARA work products, e.g., attack paths compromising assets or feasibility and impact ratings. We propose a threat modeling methodology to construct attack paths by identifying, sequencing, and connecting vulnerabilities from a valid attack surface to an asset. Initially, we transform cybersecurity guidelines to attack trees, and then we use their formal interpretations to assess the vehicle's design. This workflow yields compositional construction of attack paths along with the required TARA work products (e.g., attack paths, feasibility, and impact). More importantly, we can apply the workflow iteratively in the context of connected vehicles to ensure design conformity, privacy, and cybersecurity. Finally, to show the complexity and the importance of preemptive threat identification and risk analysis in the automotive industry, we evaluate the presented model-based approach in a connected vehicle testing platform, SPIDER.
[[2212.14115] Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks](http://arxiv.org/abs/2212.14115) #attack
Function approximation has enabled remarkable advances in applying reinforcement learning (RL) techniques in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, these have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward that combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency and adversarial safety, we situate RL in deterministic partially-observable Markov decision processes (POMDPs) with the goal of maximizing cumulative reward subject to safety constraints. We then propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time. We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL. Our experiments demonstrate both the efficacy of the proposed approach for certifying safety in adversarial environments, and the value of the PSRL framework coupled with adversarial training in improving certified safety while preserving high nominal reward and high-quality predictions of true state.
[[2212.14131] TAToo: Vision-based Joint Tracking of Anatomy and Tool for Skull-base Surgery](http://arxiv.org/abs/2212.14131) #robust
Purpose: Tracking the 3D motion of the surgical tool and the patient anatomy is a fundamental requirement for computer-assisted skull-base surgery. The estimated motion can be used both for intra-operative guidance and for downstream skill analysis. Recovering such motion solely from surgical videos is desirable, as it is compliant with current clinical workflows and instrumentation. Methods: We present Tracker of Anatomy and Tool (TAToo). TAToo jointly tracks the rigid 3D motion of patient skull and surgical drill from stereo microscopic videos. TAToo estimates motion via an iterative optimization process in an end-to-end differentiable form. For robust tracking performance, TAToo adopts a probabilistic formulation and enforces geometric constraints on the object level. Results: We validate TAToo on both simulation data, where ground truth motion is available, as well as on anthropomorphic phantom data, where optical tracking provides a strong baseline. We report sub-millimeter and millimeter inter-frame tracking accuracy for skull and drill, respectively, with rotation errors below 1{\deg}. We further illustrate how TAToo may be used in a surgical navigation setting. Conclusion: We present TAToo, which simultaneously tracks the surgical tool and the patient anatomy in skull-base surgery. TAToo directly predicts the motion from surgical videos, without the need of any markers. Our results show that the performance of TAToo compares favorably to competing approaches. Future work will include fine-tuning of our depth network to reach a 1 mm clinical accuracy goal desired for surgical applications in the skull base.
[[2212.14245] Practical Exposure Correction: Great Truths Are Always Simple](http://arxiv.org/abs/2212.14245) #robust
Improving the visual quality of the given degraded observation by correcting exposure level is a fundamental task in the computer vision community. Existing works commonly lack adaptability towards unknown scenes because of the data-driven patterns (deep networks) and limited regularization (traditional optimization), and they usually need time-consuming inference. These two points heavily limit their practicability. In this paper, we establish a Practical Exposure Corrector (PEC) that assembles the characteristics of efficiency and performance. To be concrete, we rethink the exposure correction to provide a linear solution with exposure-sensitive compensation. Around generating the compensation, we introduce an exposure adversarial function as the key engine to fully extract valuable information from the observation. By applying the defined function, we construct a segmented shrinkage iterative scheme to generate the desired compensation. Its shrinkage nature supplies powerful support for algorithmic stability and robustness. Extensive experimental evaluations fully reveal the superiority of our proposed PEC. The code is available at https://rsliu.tech/PEC.
[[2212.14532] Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning](http://arxiv.org/abs/2212.14532) #robust
Remote sensing imagery provides comprehensive views of the Earth, where different sensors collect complementary data at different spatial scales. Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image determines the scale of the ViT positional encoding, not the image resolution. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average of a $5.0\%$ non-parametric kNN classification improvement across eight remote sensing datasets compared to current state-of-the-art and obtains a $0.9$ mIoU to $3.8$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
[[2212.14615] DRG-Net: Interactive Joint Learning of Multi-lesion Segmentation and Classification for Diabetic Retinopathy Grading](http://arxiv.org/abs/2212.14615) #robust
Diabetic Retinopathy (DR) is a leading cause of vision loss in the world, and early DR detection is necessary to prevent vision loss and support an appropriate treatment. In this work, we leverage interactive machine learning and introduce a joint learning framework, termed DRG-Net, to effectively learn both disease grading and multi-lesion segmentation. Our DRG-Net consists of two modules: (i) DRG-AI-System to classify DR Grading, localize lesion areas, and provide visual explanations; (ii) DRG-Expert-Interaction to receive feedback from user-expert and improve the DRG-AI-System. To deal with sparse data, we utilize transfer learning mechanisms to extract invariant feature representations by using Wasserstein distance and adversarial learning-based entropy minimization. Besides, we propose a novel attention strategy at both low- and high-level features to automatically select the most significant lesion information and provide explainable properties. In terms of human interaction, we further develop DRG-Net as a tool that enables expert users to correct the system's predictions, which may then be used to update the system as a whole. Moreover, thanks to the attention mechanism and loss functions constraint between lesion features and classification features, our approach can be robust given a certain level of noise in the feedback of users. We have benchmarked DRG-Net on the two largest DR datasets, i.e., IDRID and FGADR, and compared it to various state-of-the-art deep learning networks. In addition to outperforming other SOTA approaches, DRG-Net is effectively updated using user feedback, even in a weakly-supervised manner.
[[2212.14742] DGFont++: Robust Deformable Generative Networks for Unsupervised Font Generation](http://arxiv.org/abs/2212.14742) #robust
Automatic font generation without human experts is a practical and significant problem, especially for some languages that consist of a large number of characters. Existing methods for font generation are often in supervised learning. They require a large number of paired data, which are labor-intensive and expensive to collect. In contrast, common unsupervised image-to-image translation methods are not applicable to font generation, as they often define style as the set of textures and colors. In this work, we propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++). We introduce a feature deformation skip connection (FDSC) to learn local patterns and geometric transformations between fonts. The FDSC predicts pairs of displacement maps and employs the predicted maps to apply deformable convolution to the low-level content feature maps. The outputs of FDSC are fed into a mixer to generate final results. Moreover, we introduce contrastive self-supervised learning to learn a robust style representation for fonts by understanding the similarity and dissimilarities of fonts. To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently. In addition to adversarial loss, another two reconstruction losses are adopted to constrain the domain-invariant characteristics between generated images and content images. Taking advantage of FDSC and the adopted loss functions, our model is able to maintain spatial information and generates high-quality character images in an unsupervised manner. Experiments demonstrate that our model is able to generate character images of higher quality than state-of-the-art methods.
[[2212.14871] SE(3)-Equivariant Reconstruction from Light Field](http://arxiv.org/abs/2212.14871) #robust
Recent progress in geometric computer vision has shown significant advances in reconstruction and novel view rendering from multiple views by capturing the scene as a neural radiance field. Such approaches have changed the paradigm of reconstruction but need a plethora of views and do not make use of object shape priors. On the other hand, deep learning has shown how to use priors in order to infer shape from single images. Such approaches, though, require that the object is reconstructed in a canonical pose or assume that object pose is known during training. In this paper, we address the problem of how to compute equivariant priors for reconstruction from a few images, given the relative poses of the cameras. Our proposed reconstruction is $SE(3)$-gauge equivariant, meaning that it is equivariant to the choice of world frame. To achieve this, we make two novel contributions to light field processing: we define light field convolution and we show how it can be approximated by intra-view $SE(2)$ convolutions because the original light field convolution is computationally and memory-wise intractable; we design a map from the light field to $\mathbb{R}^3$ that is equivariant to the transformation of the world frame and to the rotation of the views. We demonstrate equivariance by obtaining robust results in roto-translated datasets without performing transformation augmentation.
[[2212.14049] Differentiable Search of Accurate and Robust Architectures](http://arxiv.org/abs/2212.14049) #robust
Deep neural networks (DNNs) are found to be vulnerable to adversarial attacks, and various methods have been proposed for the defense. Among these methods, adversarial training has been drawing increasing attention because of its simplicity and effectiveness. However, the performance of the adversarial training is greatly limited by the architectures of target DNNs, which often makes the resulting DNNs with poor accuracy and unsatisfactory robustness. To address this problem, we propose DSARA to automatically search for the neural architectures that are accurate and robust after adversarial training. In particular, we design a novel cell-based search space specially for adversarial training, which improves the accuracy and the robustness upper bound of the searched architectures by carefully designing the placement of the cells and the proportional relationship of the filter numbers. Then we propose a two-stage search strategy to search for both accurate and robust neural architectures. At the first stage, the architecture parameters are optimized to minimize the adversarial loss, which makes full use of the effectiveness of the adversarial training in enhancing the robustness. At the second stage, the architecture parameters are optimized to minimize both the natural loss and the adversarial loss utilizing the proposed multi-objective adversarial training method, so that the searched neural architectures are both accurate and robust. We evaluate the proposed algorithm under natural data and various adversarial attacks, which reveals the superiority of the proposed method in terms of both accurate and robust architectures. We also conclude that accurate and robust neural architectures tend to deploy very different structures near the input and the output, which has great practical significance on both hand-crafting and automatically designing of accurate and robust neural architectures.
[[2212.14364] Testbed for Functional Safety-Relevant Wireless Communication Based on IO-Link Wireless and 5G](http://arxiv.org/abs/2212.14364) #robust
In the field of industrial production automation, wireless networks support highly flexible manufacturing processes and enable technologies to set-up new production chains and future software businesses. The IO-Link Wireless (IOLW) protocol is an already established energy-efficient and cost-effective communication standard for smart sensor devices on the industrial shop floor, whereas the mobile communication standard 5G will be mainly applied for medium and long-range wireless communication applications promising low latency times and high reliability. Therefore, 5G with the coming enhancement of deterministic ultra-Reliable Low-Latency Communication (uRLLC) is combined with the robustness and low-latency performance characteristics of IO-Link Wireless. Features of both technologies are highly beneficial to realize even highly demanding safety-related applications. The presented testbed shall qualify wireless functional safety communication with respect to its Residual Error Probability (REP) and quantify the Probability of Failure per Hour (PFH).
[[2212.14106] Robust Ranking Explanations](http://arxiv.org/abs/2212.14106) #robust
Gradient-based explanation is the cornerstone of explainable deep networks, but it has been shown to be vulnerable to adversarial attacks. However, existing works measure the explanation robustness based on $\ell_p$-norm, which can be counter-intuitive to humans, who only pay attention to the top few salient features. We propose explanation ranking thickness as a more suitable explanation robustness metric. We then present a new practical adversarial attacking goal for manipulating explanation rankings. To mitigate the ranking-based attacks while maintaining computational feasibility, we derive surrogate bounds of the thickness that involve expensive sampling and integration. We use a multi-objective approach to analyze the convergence of a gradient-based attack to confirm that the explanation robustness can be measured by the thickness metric. We conduct experiments on various network architectures and diverse datasets to prove the superiority of the proposed methods, while the widely accepted Hessian-based curvature smoothing approaches are not as robust as our method.
[[2212.14133] Investigating Sindy As a Tool For Causal Discovery In Time Series Signals](http://arxiv.org/abs/2212.14133) #robust
The SINDy algorithm has been successfully used to identify the governing equations of dynamical systems from time series data. In this paper, we argue that this makes SINDy a potentially useful tool for causal discovery and that existing tools for causal discovery can be used to dramatically improve the performance of SINDy as tool for robust sparse modeling and system identification. We then demonstrate empirically that augmenting the SINDy algorithm with tools from causal discovery can provides engineers with a tool for learning causally robust governing equations.
[[2212.14599] ComplAI: Theory of A Unified Framework for Multi-factor Assessment of Black-Box Supervised Machine Learning Models](http://arxiv.org/abs/2212.14599) #robust
The advances in Artificial Intelligence are creating new opportunities to improve lives of people around the world, from business to healthcare, from lifestyle to education. For example, some systems profile the users using their demographic and behavioral characteristics to make certain domain-specific predictions. Often, such predictions impact the life of the user directly or indirectly (e.g., loan disbursement, determining insurance coverage, shortlisting applications, etc.). As a result, the concerns over such AI-enabled systems are also increasing. To address these concerns, such systems are mandated to be responsible i.e., transparent, fair, and explainable to developers and end-users. In this paper, we present ComplAI, a unique framework to enable, observe, analyze and quantify explainability, robustness, performance, fairness, and model behavior in drift scenarios, and to provide a single Trust Factor that evaluates different supervised Machine Learning models not just from their ability to make correct predictions but from overall responsibility perspective. The framework helps users to (a) connect their models and enable explanations, (b) assess and visualize different aspects of the model, such as robustness, drift susceptibility, and fairness, and (c) compare different models (from different model families or obtained through different hyperparameter settings) from an overall perspective thereby facilitating actionable recourse for improvement of the models. It is model agnostic and works with different supervised machine learning scenarios (i.e., Binary Classification, Multi-class Classification, and Regression) and frameworks. It can be seamlessly integrated with any ML life-cycle framework. Thus, this already deployed framework aims to unify critical aspects of Responsible AI systems for regulating the development process of such real systems.
[[2212.14772] A Combined Approach Toward Consistent Reconstructions of Indoor Spaces Based on 6D RGB-D Odometry and KinectFusion](http://arxiv.org/abs/2212.14772) #extraction
We propose a 6D RGB-D odometry approach that finds the relative camera pose between consecutive RGB-D frames by keypoint extraction and feature matching both on the RGB and depth image planes. Furthermore, we feed the estimated pose to the highly accurate KinectFusion algorithm, which uses a fast ICP (Iterative Closest Point) to fine-tune the frame-to-frame relative pose and fuse the depth data into a global implicit surface. We evaluate our method on a publicly available RGB-D SLAM benchmark dataset by Sturm et al. The experimental results show that our proposed reconstruction method solely based on visual odometry and KinectFusion outperforms the state-of-the-art RGB-D SLAM system accuracy. Moreover, our algorithm outputs a ready-to-use polygon mesh (highly suitable for creating 3D virtual worlds) without any postprocessing steps.
[[2212.14266] Sequence Generation with Label Augmentation for Relation Extraction](http://arxiv.org/abs/2212.14266) #extraction
Sequence generation demonstrates promising performance in recent information extraction efforts, by incorporating large-scale pre-trained Seq2Seq models. This paper investigates the merits of employing sequence generation in relation extraction, finding that with relation names or synonyms as generation targets, their textual semantics and the correlation (in terms of word sequence pattern) among them affect model performance. We then propose Relation Extraction with Label Augmentation (RELA), a Seq2Seq model with automatic label augmentation for RE. By saying label augmentation, we mean prod semantically synonyms for each relation name as the generation target. Besides, we present an in-depth analysis of the Seq2Seq model's behavior when dealing with RE. Experimental results show that RELA achieves competitive results compared with previous methods on four RE datasets.
[[2212.14270] Reviewing Labels: Label Graph Network with Top-k Prediction Set for Relation Extraction](http://arxiv.org/abs/2212.14270) #extraction
The typical way for relation extraction is fine-tuning large pre-trained language models on task-specific datasets, then selecting the label with the highest probability of the output distribution as the final prediction. However, the usage of the Top-k prediction set for a given sample is commonly overlooked. In this paper, we first reveal that the Top-k prediction set of a given sample contains useful information for predicting the correct label. To effectively utilizes the Top-k prediction set, we propose Label Graph Network with Top-k Prediction Set, termed as KLG. Specifically, for a given sample, we build a label graph to review candidate labels in the Top-k prediction set and learn the connections between them. We also design a dynamic $k$-selection mechanism to learn more powerful and discriminative relation representation. Our experiments show that KLG achieves the best performances on three relation extraction datasets. Moreover, we observe that KLG is more effective in dealing with long-tailed classes.
[[2212.14050] Proof of Swarm Based Ensemble Learning for Federated Learning Applications](http://arxiv.org/abs/2212.14050) #federate
Ensemble learning combines results from multiple machine learning models in order to provide a better and optimised predictive model with reduced bias, variance and improved predictions. However, in federated learning it is not feasible to apply centralised ensemble learning directly due to privacy concerns. Hence, a mechanism is required to combine results of local models to produce a global model. Most distributed consensus algorithms, such as Byzantine fault tolerance (BFT), do not normally perform well in such applications. This is because, in such methods predictions of some of the peers are disregarded, so a majority of peers can win without even considering other peers' decisions. Additionally, the confidence score of the result of each peer is not normally taken into account, although it is an important feature to consider for ensemble learning. Moreover, the problem of a tie event is often left un-addressed by methods such as BFT. To fill these research gaps, we propose PoSw (Proof of Swarm), a novel distributed consensus algorithm for ensemble learning in a federated setting, which was inspired by particle swarm based algorithms for solving optimisation problems. The proposed algorithm is theoretically proved to always converge in a relatively small number of steps and has mechanisms to resolve tie events while trying to achieve sub-optimum solutions. We experimentally validated the performance of the proposed algorithm using ECG classification as an example application in healthcare, showing that the ensemble learning model outperformed all local models and even the FL-based global model. To the best of our knowledge, the proposed algorithm is the first attempt to make consensus over the output results of distributed models trained using federated learning.
[[2212.14395] Graph Federated Learning for CIoT Devices in Smart Home Applications](http://arxiv.org/abs/2212.14395) #federate
This paper deals with the problem of statistical and system heterogeneity in a cross-silo Federated Learning (FL) framework where there exist a limited number of Consumer Internet of Things (CIoT) devices in a smart building. We propose a novel Graph Signal Processing (GSP)-inspired aggregation rule based on graph filtering dubbed ``G-Fedfilt''. The proposed aggregator enables a structured flow of information based on the graph's topology. This behavior allows capturing the interconnection of CIoT devices and training domain-specific models. The embedded graph filter is equipped with a tunable parameter which enables a continuous trade-off between domain-agnostic and domain-specific FL. In the case of domain-agnostic, it forces G-Fedfilt to act similar to the conventional Federated Averaging (FedAvg) aggregation rule. The proposed G-Fedfilt also enables an intrinsic smooth clustering based on the graph connectivity without explicitly specified which further boosts the personalization of the models in the framework. In addition, the proposed scheme enjoys a communication-efficient time-scheduling to alleviate the system heterogeneity. This is accomplished by adaptively adjusting the amount of training data samples and sparsity of the models' gradients to reduce communication desynchronization and latency. Simulation results show that the proposed G-Fedfilt achieves up to $3.99\% $ better classification accuracy than the conventional FedAvg when concerning model personalization on the statistically heterogeneous local datasets, while it is capable of yielding up to $2.41\%$ higher accuracy than FedAvg in the case of testing the generalization of the models.
[[2212.14284] Resolving Task Confusion in Dynamic Expansion Architectures for Class Incremental Learning](http://arxiv.org/abs/2212.14284) #fair
The dynamic expansion architecture is becoming popular in class incremental learning, mainly due to its advantages in alleviating catastrophic forgetting. However, task confusion is not well assessed within this framework, e.g., the discrepancy between classes of different tasks is not well learned (i.e., inter-task confusion, ITC), and certain priority is still given to the latest class batch (i.e., old-new confusion, ONC). We empirically validate the side effects of the two types of confusion. Meanwhile, a novel solution called Task Correlated Incremental Learning (TCIL) is proposed to encourage discriminative and fair feature utilization across tasks. TCIL performs a multi-level knowledge distillation to propagate knowledge learned from old tasks to the new one. It establishes information flow paths at both feature and logit levels, enabling the learning to be aware of old classes. Besides, attention mechanism and classifier re-scoring are applied to generate more fair classification scores. We conduct extensive experiments on CIFAR100 and ImageNet100 datasets. The results demonstrate that TCIL consistently achieves state-of-the-art accuracy. It mitigates both ITC and ONC, while showing advantages in battle with catastrophic forgetting even no rehearsal memory is reserved.
[[2212.14111] Are Deep Image Embedding Clustering Methods Effective for Heterogeneous Tabular Data?](http://arxiv.org/abs/2212.14111) #fair
Deep learning methods in the literature are invariably benchmarked on image data sets and then assumed to work on all data problems. Unfortunately, architectures designed for image learning are often not ready or optimal for non-image data without considering data-specific learning requirements. In this paper, we take a data-centric view to argue that deep image embedding clustering methods are not equally effective on heterogeneous tabular data sets. This paper performs one of the first studies on deep embedding clustering of seven tabular data sets using six state-of-the-art baseline methods proposed for image data sets. Our results reveal that the traditional clustering of tabular data ranks second out of eight methods and is superior to most deep embedding clustering baselines. Our observation is in line with the recent literature that traditional machine learning of tabular data is still a competitive approach against deep learning. Although surprising to many deep learning researchers, traditional clustering methods can be competitive baselines for tabular data, and outperforming these baselines remains a challenge for deep embedding clustering. Therefore, deep learning methods for image learning may not be fair or suitable baselines for tabular data without considering data-specific contrasts and learning requirements.
[[2212.14351] Properties of Group Fairness Metrics for Rankings](http://arxiv.org/abs/2212.14351) #fair
In recent years, several metrics have been developed for evaluating group fairness of rankings. Given that these metrics were developed with different application contexts and ranking algorithms in mind, it is not straightforward which metric to choose for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. By virtue of their diverse application contexts, we argue that such a comparative analysis is not straightforward. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties on eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics only satisfy a small subset of the proposed properties. These findings highlight limitations of existing metrics, and provide insights into how to evaluate and interpret different fairness metrics in practical deployment. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.
[[2212.14467] Cluster-level Group Representativity Fairness in $k$-means Clustering](http://arxiv.org/abs/2212.14467) #fair
There has been much interest recently in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and gender. We observe that clustering algorithms could generate clusters such that different groups are disadvantaged within different clusters. We develop a clustering algorithm, building upon the centroid clustering paradigm pioneered by classical algorithms such as $k$-means, where we focus on mitigating the unfairness experienced by the most-disadvantaged group within each cluster. Our method uses an iterative optimisation paradigm whereby an initial cluster assignment is modified by reassigning objects to clusters such that the worst-off sensitive group within each cluster is benefitted. We demonstrate the effectiveness of our method through extensive empirical evaluations over a novel evaluation metric on real-world datasets. Specifically, we show that our method is effective in enhancing cluster-level group representativity fairness significantly at low impact on cluster coherence.
[[2212.14128] Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction](http://arxiv.org/abs/2212.14128) #interpretability
Affect understanding capability is essential for social robots to autonomously interact with a group of users in an intuitive and reciprocal way. However, the challenge of multi-person affect understanding comes from not only the accurate perception of each user's affective state (e.g., engagement) but also the recognition of the affect interplay between the members (e.g., joint engagement) that presents as complex, but subtle, nonverbal exchanges between them. Here we present a novel hybrid framework for identifying a parent-child dyad's joint engagement by combining a deep learning framework with various video augmentation techniques. Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed) applied datasets to improve joint engagement classification performance. Second, we demonstrate experimental results on the use of trained models in the robot-parent-child interaction context. Third, we introduce a behavior-based metric for evaluating the learned representation of the models to investigate the model interpretability when recognizing joint engagement. This work serves as the first step toward fully unlocking the potential of end-to-end video understanding models pre-trained on large public datasets and augmented with data augmentation and visualization techniques for affect recognition in the multi-person human-robot interaction in the wild.
[[2212.14206] Maximizing Use-Case Specificity through Precision Model Tuning](http://arxiv.org/abs/2212.14206) #interpretability
Language models have become increasingly popular in recent years for tasks like information retrieval. As use-cases become oriented toward specific domains, fine-tuning becomes default for standard performance. To fine-tune these models for specific tasks and datasets, it is necessary to carefully tune the model's hyperparameters and training techniques. In this paper, we present an in-depth analysis of the performance of four transformer-based language models on the task of biomedical information retrieval. The models we consider are DeepMind's RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B parameters), and BLOOM (176B parameters). We compare their performance on the basis of relevance, accuracy, and interpretability, using a large corpus of 480000 research papers on protein structure/function prediction as our dataset. Our findings suggest that smaller models, with <10B parameters and fine-tuned on domain-specific datasets, tend to outperform larger language models on highly specific questions in terms of accuracy, relevancy, and interpretability by a significant margin (+50% on average). However, larger models do provide generally better results on broader prompts.
[[2212.14591] Mixture of von Mises-Fisher distribution with sparse prototypes](http://arxiv.org/abs/2212.14591) #interpretability
Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly adapted for high-dimensional directional data such as texts. We propose in this article to estimate a von Mises mixture using a l 1 penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood one with a path following algorithm. The model's behaviour is studied on simulated data and, we show the advantages of the approach on real data benchmark. We also introduce a new data set on financial reports and exhibit the benefits of our method for exploratory analysis.
[[2212.14743] Risk-Sensitive Policy with Distributional Reinforcement Learning](http://arxiv.org/abs/2212.14743) #interpretability
Classical reinforcement learning (RL) techniques are generally concerned with the design of decision-making policies driven by the maximisation of the expected outcome. Nevertheless, this approach does not take into consideration the potential risk associated with the actions taken, which may be critical in certain applications. To address that issue, the present research work introduces a novel methodology based on distributional RL to derive sequential decision-making policies that are sensitive to the risk, the latter being modelled by the tail of the return probability distribution. The core idea is to replace the $Q$ function generally standing at the core of learning schemes in RL by another function taking into account both the expected return and the risk. Named the risk-based utility function $U$, it can be extracted from the random return distribution $Z$ naturally learnt by any distributional RL algorithm. This enables to span the complete potential trade-off between risk minimisation and expected return maximisation, in contrast to fully risk-averse methodologies. Fundamentally, this research yields a truly practical and accessible solution for learning risk-sensitive policies with minimal modification to the distributional RL algorithm, and with an emphasis on the interpretability of the resulting decision-making process.
[[2212.14776] On the Interpretability of Attention Networks](http://arxiv.org/abs/2212.14776) #interpretability
Attention mechanisms form a core component of several successful deep
learning architectures, and are based on one key idea: ''The output depends
only on a small (but unknown) segment of the input.'' In several practical
applications like image captioning and language translation, this is mostly
true. In trained models with an attention mechanism, the outputs of an
intermediate module that encodes the segment of input responsible for the
output is often used as a way to peek into the reasoning
of the network. We
make such a notion more precise for a variant of the classification problem
that we term selective dependence classification (SDC) when used with attention
model architectures. Under such a setting, we demonstrate various error modes
where an attention model can be accurate but fail to be interpretable, and show
that such models do occur as a result of training. We illustrate various
situations that can accentuate and mitigate this behaviour. Finally, we use our
objective definition of interpretability for SDC tasks to evaluate a few
attention model learning algorithms designed to encourage sparsity and
demonstrate that these algorithms help improve interpretability.
[[2212.14815] Black-box language model explanation by context length probing](http://arxiv.org/abs/2212.14815) #explainability
The increasingly widespread adoption of large language models has highlighted the need for improving their explainability. We present context length probing, a novel explanation technique for causal language models, based on tracking the predictions of a model as a function of the length of available context, and allowing to assign differential importance scores to different contexts. The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities. We apply context length probing to large pre-trained language models and offer some initial analyses and insights, including the potential for studying long-range dependencies. The source code and a demo of the method are available.
[[2212.14306] Zero-Shot Object Segmentation through Concept Distillation from Generative Image Foundation Models](http://arxiv.org/abs/2212.14306) #diffusion
Curating datasets for object segmentation is a difficult task. With the advent of large-scale pre-trained generative models, conditional image generation has been given a significant boost in result quality and ease of use. In this paper, we present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions, without requiring segmentation labels. We leverage and explore pre-trained latent diffusion models, to automatically generate weak segmentation masks for concepts and objects. The masks are then used to fine-tune the diffusion model on an inpainting task, which enables fine-grained removal of the object, while at the same time providing a synthetic foreground and background dataset. We demonstrate that using this method beats previous methods in both discriminative and generative performance and closes the gap with fully supervised training while requiring no pixel-wise object labels. We show results on the task of segmenting four different objects (humans, dogs, cars, birds).
[[2212.14678] Exploring Transformer Backbones for Image Diffusion Models](http://arxiv.org/abs/2212.14678) #diffusion
We present an end-to-end Transformer based Latent Diffusion model for image synthesis. On the ImageNet class conditioned generation task we show that a Transformer based Latent Diffusion model achieves a 14.1FID which is comparable to the 13.1FID score of a UNet based architecture. In addition to showing the application of Transformer models for Diffusion based image synthesis this simplification in architecture allows easy fusion and modeling of text and image data. The multi-head attention mechanism of Transformers enables simplified interaction between the image and text features which removes the requirement for crossattention mechanism in UNet based Diffusion models.
[[2212.14704] Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models](http://arxiv.org/abs/2212.14704) #diffusion
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF achieve great success in zero-shot text-guided 3D synthesis. However, due to the scratch training and random initialization without any prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce the explicit 3D shape prior to CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from input texts in the text-to-shape stage as the 3D shape prior. We then utilize it as the initialization of a neural radiance field and then optimize it with the full prompt. For the text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, namely, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.