[[2304.06179] SePEnTra: A secure and privacy-preserving energy trading mechanisms in transactive energy market](http://arxiv.org/abs/2304.06179) #secure
In this paper, we design and present a novel model called SePEnTra to ensure the security and privacy of energy data while sharing it with other entities during energy trading to determine optimal price signals. Furthermore, the market operator can use this data to detect malicious activities of users at a later stage without violating privacy (e.g., deviation of actual energy generation/consumption from forecast beyond a threshold). We use two cryptographic primitives, additive secret sharing and Pedersen commitment, in SePEnTra. The performance of our model is evaluated theoretically and numerically. We compare the performance of SePEnTra with the same transactive energy market (TEM) framework without security mechanisms. The results show that, even though it uses advanced cryptographic primitives in a large market framework, SePEnTra has very low computational complexity and communication overhead. Moreover, it is storage efficient for all parties.
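As an illustration of the first primitive named above, here is a minimal sketch of additive secret sharing. It is not SePEnTra's protocol: the modulus `PRIME`, the function names, and the two-prosumer example are assumptions made for this sketch. It shows only that shares can be summed locally, so an aggregate is revealed without exposing individual values.

```python
import secrets

# Illustrative modulus (assumption; the paper's parameters are not given here).
PRIME = 2**61 - 1

def share(value, n_parties, modulus=PRIME):
    """Split value into n_parties random shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]

def reconstruct(shares, modulus=PRIME):
    """Recombine all shares into the original value."""
    return sum(shares) % modulus

def add_shares(a, b, modulus=PRIME):
    """Each party adds the two shares it holds; no party learns either input."""
    return [(x + y) % modulus for x, y in zip(a, b)]

# Two hypothetical prosumers share their energy demand among three parties.
demand_a = share(120, 3)
demand_b = share(80, 3)
total = reconstruct(add_shares(demand_a, demand_b))
print(total)
```

Any subset of fewer than all shares is uniformly random, which is why a single party holding one share learns nothing about the underlying value.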
[[2304.06582] Cryptanalysis of Random Affine Transformations for Encrypted Control](http://arxiv.org/abs/2304.06582) #secure
Cloud-based and distributed computations are of growing interest in modern control systems. However, these technologies require performing computations on not necessarily trustworthy platforms and, thus, put the confidentiality of sensitive control-related data at risk. Encrypted control has dealt with this issue by utilizing modern cryptosystems with homomorphic properties, which allow a secure evaluation at the cost of an increased computation or communication effort (among others). Recently, a cipher based on a random affine transformation gained attention in the encrypted control community. Its appeal stems from the possibility to construct security-providing homomorphisms that do not suffer from the restrictions of "conventional" approaches.
This paper provides a cryptanalysis of random affine transformations in the context of encrypted control. To this end, a deterministic and a probabilistic variant of the cipher over real numbers are analyzed in a generalized setup, where we use cryptographic definitions for security and attacker models. It is shown that the deterministic cipher breaks under a known-plaintext attack and unavoidably leaks information about the closed loop, which opens another angle of attack. For the probabilistic variant, statistical indistinguishability of ciphertexts can be achieved, which makes successful attacks unlikely. We complete our analysis by investigating a floating-point realization of the probabilistic random affine transformation cipher, which unfortunately suggests the impracticality of the scheme if a security guarantee is needed.
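To make the known-plaintext weakness concrete, the following sketch uses a scalar simplification, c = a·x + b, of a deterministic affine cipher. The paper analyzes a more general setup; the scalar form, the key values, and the function names here are assumptions for illustration. Two plaintext-ciphertext pairs with distinct plaintexts are enough to solve for the key.

```python
from fractions import Fraction

def encrypt(x, key):
    """Deterministic scalar affine cipher: c = a * x + b (illustrative only)."""
    a, b = key
    return a * x + b

def decrypt(c, key):
    """Invert the affine map with a known (or recovered) key."""
    a, b = key
    return (c - b) / a

def recover_key(pairs):
    """Solve for (a, b) from two known plaintext/ciphertext pairs."""
    (x1, c1), (x2, c2) = pairs
    if x1 == x2:
        raise ValueError("need two distinct plaintexts")
    a = (c1 - c2) / (x1 - x2)
    b = c1 - a * x1
    return a, b

# Hypothetical secret key; exact rationals avoid floating-point noise.
secret_key = (Fraction(7, 3), Fraction(-5, 2))
known = [(Fraction(x), encrypt(Fraction(x), secret_key)) for x in (1, 4)]
recovered = recover_key(known)
print(recovered == secret_key)
```

In the vector case the same idea becomes a linear system, so the number of pairs needed grows with the dimension, but the attack remains plain linear algebra.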
[[2304.06167] CoVE: Towards Confidential Computing on RISC-V Platforms](http://arxiv.org/abs/2304.06167) #security
Multi-tenant computing platforms are typically comprised of several software and hardware components including platform firmware, host operating system kernel, virtualization monitor, and the actual tenant payloads that run on them (typically in a virtual machine, container, or application). This model is well established in large-scale commercial deployments, but the downside is that all platform components and operators are in the Trusted Computing Base (TCB) of the tenant. This aspect is ill-suited for privacy-oriented workloads that aim to minimize the TCB footprint. Confidential computing presents a good stepping-stone towards providing a quantifiable TCB for computing. Confidential computing [1] requires the use of HW-attested Trusted Execution Environments for data-in-use protection. The RISC-V architecture presents a strong foundation for meeting the requirements for Confidential Computing and other security paradigms in a clean-slate manner. This paper describes a reference architecture and discusses ISA, non-ISA, and system-on-chip (SoC) requirements for confidential computing on RISC-V platforms. It discusses the proposed ISA and non-ISA extensions for Confidential Virtual Machines on RISC-V platforms, referred to as CoVE.
[[2304.06341] EF/CF: High Performance Smart Contract Fuzzing for Exploit Generation](http://arxiv.org/abs/2304.06341) #security
Smart contracts are increasingly being used to manage large numbers of high-value cryptocurrency accounts. There is a strong demand for automated, efficient, and comprehensive methods to detect security vulnerabilities in a given contract. While the literature features a plethora of analysis methods for smart contracts, the existing proposals do not address the increasing complexity of contracts. Existing analysis tools suffer from false alarms and missed bugs in today's smart contracts that are increasingly defined by complexity and interdependencies. To scale accurate analysis to modern smart contracts, we introduce EF/CF, a high-performance fuzzer for Ethereum smart contracts. In contrast to previous work, EF/CF efficiently and accurately models complex smart contract interactions, such as reentrancy and cross-contract interactions, at a very high fuzzing throughput rate. To achieve this, EF/CF transpiles smart contract bytecode into native C++ code, thereby enabling the reuse of existing, optimized fuzzing toolchains. Furthermore, EF/CF increases fuzzing efficiency by employing a structure-aware mutation engine for smart contract transaction sequences and using a contract's ABI to generate valid transaction inputs. In a comprehensive evaluation, we show that EF/CF scales better -- without compromising accuracy -- to complex contracts compared to state-of-the-art approaches, including other fuzzers, symbolic/concolic execution, and hybrid approaches. Moreover, we show that EF/CF can automatically generate transaction sequences that exploit reentrancy bugs to steal Ether.
[[2304.06059] Efficient Deep Learning Models for Privacy-preserving People Counting on Low-resolution Infrared Arrays](http://arxiv.org/abs/2304.06059) #privacy
Ultra-low-resolution Infrared (IR) array sensors offer a low-cost, energy-efficient, and privacy-preserving solution for people counting, with applications such as occupancy monitoring. Previous work has shown that Deep Learning (DL) can yield superior performance on this task. However, the literature was missing an extensive comparative analysis of various efficient DL architectures for IR array-based people counting, one that considers not only their accuracy but also the cost of deploying them on memory- and energy-constrained Internet of Things (IoT) edge nodes. In this work, we address this need by comparing 6 different DL architectures on a novel dataset composed of IR images collected from a commercial 8x8 array, which we made openly available. With a wide architectural exploration of each model type, we obtain a rich set of Pareto-optimal solutions, spanning cross-validated balanced accuracy scores in the 55.70-82.70% range. When deployed on a commercial Microcontroller (MCU) by STMicroelectronics, the STM32L4A6ZG, these models occupy 0.41-9.28kB of memory, and require 1.10-7.74ms per inference, while consuming 17.18-120.43 $\mu$J of energy. Our models are significantly more accurate than a previous deterministic method (up to +39.9%), while being up to 3.53x faster and more energy efficient. Further, our models' accuracy is comparable to state-of-the-art DL solutions on similar resolution sensors, despite a much lower complexity. All our models enable continuous, real-time inference on an MCU-based IoT node, with years of autonomous operation without battery recharging.
[[2304.06373] You are here! Finding position and orientation on a 2D map from a single image: The Flatlandia localization problem and dataset](http://arxiv.org/abs/2304.06373) #privacy
We introduce Flatlandia, a novel problem for visual localization of an image from object detections composed of two specific tasks: i) Coarse Map Localization: localizing a single image observing a set of objects with respect to a 2D map of object landmarks; ii) Fine-grained 3DoF Localization: estimating latitude, longitude, and orientation of the image within a 2D map. Solutions for these new tasks exploit the wide availability of open urban maps annotated with GPS locations of common objects (e.g., via surveying or crowd-sourcing). Such maps are also more storage-friendly than standard large-scale 3D models often used in visual localization while additionally being privacy-preserving. As existing datasets are unsuited for the proposed problem, we provide the Flatlandia dataset, designed for 3DoF visual localization in multiple urban settings and based on crowd-sourced data from five European cities. We use the Flatlandia dataset to validate the complexity of the proposed tasks.
[[2304.06627] CoSDA: Continual Source-Free Domain Adaptation](http://arxiv.org/abs/2304.06627) #privacy
Without access to the source data, source-free domain adaptation (SFDA) transfers knowledge from a source-domain trained model to target domains. Recently, SFDA has gained popularity due to the need to protect the data privacy of the source domain, but it suffers from catastrophic forgetting on the source domain due to the lack of data. To systematically investigate the mechanism of catastrophic forgetting, we first reimplement previous SFDA approaches within a unified framework and evaluate them on four benchmarks. We observe that there is a trade-off between adaptation gain and forgetting loss, which motivates us to design a consistency regularization to mitigate forgetting. In particular, we propose a continual source-free domain adaptation approach named CoSDA, which employs a dual-speed optimized teacher-student model pair and is equipped with consistency learning capability. Our experiments demonstrate that CoSDA outperforms state-of-the-art approaches in continuous adaptation. Notably, our CoSDA can also be integrated with other SFDA methods to alleviate forgetting.
[[2304.06469] Analysing Fairness of Privacy-Utility Mobility Models](http://arxiv.org/abs/2304.06469) #privacy
Preserving individuals' privacy when sharing spatial-temporal datasets is critical to prevent re-identification attacks based on unique trajectories. Existing privacy techniques tend to propose ideal privacy-utility tradeoffs but largely ignore the fairness implications of mobility models and whether such techniques perform equally for different groups of users. The relationship between fairness and privacy-aware models has not yet been quantified, and there are barely any defined sets of metrics for measuring fairness in the spatial-temporal context. In this work, we define a set of fairness metrics designed explicitly for human mobility, based on structural similarity and entropy of the trajectories. Under these definitions, we examine the fairness of two state-of-the-art privacy-preserving models that rely on GAN and representation learning to reduce the re-identification rate of users for data sharing. Our results show that while both models guarantee group fairness in terms of demographic parity, they violate individual fairness criteria, indicating that users with highly similar trajectories receive disparate privacy gain. We conclude that the tension between the re-identification task and individual fairness needs to be considered for future spatial-temporal data analysis and modelling to achieve a privacy-preserving fairness-aware setting.
[[2304.06607] False Claims against Model Ownership Resolution](http://arxiv.org/abs/2304.06607) #protect
Deep neural network (DNN) models are valuable intellectual property of model owners, constituting a competitive advantage. Therefore, it is crucial to develop techniques to protect against model theft. Model ownership resolution (MOR) is a class of techniques that can deter model theft. A MOR scheme enables an accuser to assert an ownership claim for a suspect model by presenting evidence, such as a watermark or fingerprint, to show that the suspect model was stolen or derived from a source model owned by the accuser. Most of the existing MOR schemes prioritize robustness against malicious suspects, ensuring that the accuser will win if the suspect model is indeed a stolen model.
In this paper, we show that common MOR schemes in the literature are vulnerable to a different, equally important but insufficiently explored, robustness concern: a malicious accuser. We show how malicious accusers can successfully make false claims against independent suspect models that were not stolen. Our core idea is that a malicious accuser can deviate (without detection) from the specified MOR process by finding (transferable) adversarial examples that successfully serve as evidence against independent suspect models. To this end, we first generalize the procedures of common MOR schemes and show that, under this generalization, defending against false claims is as challenging as preventing (transferable) adversarial examples. Via systematic empirical evaluation we demonstrate that our false claim attacks always succeed in all prominent MOR schemes with realistic configurations, including against a real-world model: Amazon's Rekognition API.
[[2304.06430] Certified Zeroth-order Black-Box Defense with Robust UNet Denoiser](http://arxiv.org/abs/2304.06430) #defense
Certified defense methods against adversarial perturbations have been recently investigated in the black-box setting with a zeroth-order (ZO) perspective. However, these methods suffer from high model variance with low performance on high-dimensional datasets due to the ineffective design of the denoiser and are limited in their utilization of ZO techniques. To this end, we propose a certified ZO preprocessing technique for removing adversarial perturbations from the attacked image in the black-box setting using only model queries. We propose a robust UNet denoiser (RDUNet) that ensures the robustness of black-box models trained on high-dimensional datasets. We propose a novel black-box denoised smoothing (DS) defense mechanism, ZO-RUDS, by prepending our RDUNet to the black-box model, ensuring black-box defense. We further propose ZO-AE-RUDS in which RDUNet followed by autoencoder (AE) is prepended to the black-box model. We perform extensive experiments on four classification datasets, CIFAR-10, CIFAR-10, Tiny Imagenet, STL-10, and the MNIST dataset for image reconstruction tasks. Our proposed defense methods ZO-RUDS and ZO-AE-RUDS beat SOTA with a huge margin of $35\%$ and $9\%$, for low dimensional (CIFAR-10) and with a margin of $20.61\%$ and $23.51\%$ for high-dimensional (STL-10) datasets, respectively.
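The prediction side of denoised smoothing, in which a denoiser is prepended to a query-only model and the label is chosen by majority vote under Gaussian noise, can be sketched as follows. This is a toy illustration under stated assumptions: `toy_denoiser` and `toy_black_box` are hypothetical stand-ins rather than RDUNet or any real classifier, and neither the zeroth-order training of the denoiser nor the certification step is covered.

```python
import random
from collections import Counter

def toy_denoiser(x):
    """Hypothetical stand-in for a learned denoiser: clip values to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in x]

def toy_black_box(x):
    """Hypothetical query-only classifier: threshold the mean intensity."""
    return 1 if sum(x) / len(x) > 0.5 else 0

def smoothed_predict(x, denoiser, model, sigma=0.25, n_samples=100, seed=0):
    """Majority vote of model(denoiser(x + noise)) over Gaussian noise draws."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        votes[model(denoiser(noisy))] += 1
    return votes.most_common(1)[0][0]

bright = [0.9] * 16  # toy "image" that the classifier labels 1
dark = [0.1] * 16    # toy "image" that the classifier labels 0
print(smoothed_predict(bright, toy_denoiser, toy_black_box))
print(smoothed_predict(dark, toy_denoiser, toy_black_box))
```

Only forward queries to `model` are used, which is what makes the construction applicable in a black-box setting.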
[[2304.06222] A Comprehensive Survey on the Implementations, Attacks, and Countermeasures of the Current NIST Lightweight Cryptography Standard](http://arxiv.org/abs/2304.06222) #attack
This survey is the first work on the current standard for lightweight cryptography, standardized in 2023. Lightweight cryptography plays a vital role in securing resource-constrained embedded systems such as deeply-embedded systems (implantable and wearable medical devices, smart fabrics, smart homes, and the like), radio frequency identification (RFID) tags, sensor networks, and privacy-constrained usage models. The National Institute of Standards and Technology (NIST) initiated a standardization process for lightweight cryptography and, after a relatively long multi-year effort, the competition ended in Feb. 2023 with ASCON as the winner. This lightweight cryptographic standard will be used in deeply-embedded architectures to provide security through confidentiality and integrity/authentication (the dual of the legacy AES-GCM block cipher which is the NIST standard for symmetric key cryptography). ASCON's lightweight design utilizes a 320-bit permutation which is bit-sliced into five 64-bit register words, providing 128-bit level security. This work summarizes the different implementations of ASCON on field-programmable gate array (FPGA) and ASIC hardware platforms on the basis of area, power, throughput, energy, and efficiency overheads. The presented work also reviews various differential and side-channel analysis attacks (SCAs) performed across variants of the ASCON cipher suite in terms of algebraic, cube/cube-like, forgery, fault injection, and power analysis attacks as well as the countermeasures for these attacks. We also provide our insights and visions throughout this survey to provide new future directions in different domains. This survey is the first of its kind and a step forward towards scrutinizing the advantages and future directions of the NIST lightweight cryptography standard introduced in 2023.
[[2304.06313] Majority is not Needed: A Counterstrategy to Selfish Mining](http://arxiv.org/abs/2304.06313) #attack
In the last few years several papers investigated selfish mining attacks, most of which assumed that every miner that is not part of the selfish mining pool will continue to mine honestly. However, in reality, remaining honest is not always incentivized, particularly when another pool is employing selfish mining or other deviant strategies. In this work we explore the scenario in which a large enough pool capitalises on another selfish pool to gain 100\% of the profit and commit double spending attacks. We show that this counterstrategy can effectively counter any deviant strategy, and that even the possibility of it discourages other pools from implementing deviant strategies.
[[2304.06369] An attack resilient policy on the tip pool for DAG-based distributed ledgers](http://arxiv.org/abs/2304.06369) #attack
This paper discusses congestion control and inconsistency problems in DAG-based distributed ledgers and proposes an additional filter to mitigate these issues. Unlike traditional blockchains, DAG-based DLTs use a directed acyclic graph structure to organize transactions, allowing higher scalability and efficiency. However, this also introduces challenges in controlling the rate at which blocks are added to the network and preventing the influence of spam attacks. To address these challenges, we propose a filter to limit the tip pool size and to avoid referencing old blocks. Furthermore, we present experimental results to demonstrate the effectiveness of this filter in reducing the negative impacts of various attacks. Our approach offers a lightweight and efficient solution for managing the flow of blocks in DAG-based DLTs, which can enhance the consistency and reliability of these systems.
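One way to picture a filter that bounds the tip pool and avoids old blocks is the sketch below. The abstract does not specify the actual policy, so the `TipPool` class, the age threshold, and the evict-oldest rule are all assumptions chosen for illustration.

```python
class TipPool:
    """Hypothetical tip pool with a size cap and a maximum tip age."""

    def __init__(self, max_size, max_age):
        self.max_size = max_size
        self.max_age = max_age
        self.tips = {}  # block_id -> issue time

    def prune(self, now):
        """Drop tips that have become too old to reference."""
        self.tips = {b: t for b, t in self.tips.items()
                     if now - t <= self.max_age}

    def add(self, block_id, issued_at, now):
        """Admit a block as a tip unless it is too old; evict oldest on overflow."""
        self.prune(now)
        if now - issued_at > self.max_age:
            return False
        self.tips[block_id] = issued_at
        while len(self.tips) > self.max_size:
            oldest = min(self.tips, key=self.tips.get)
            del self.tips[oldest]
        return block_id in self.tips

    def select_parents(self, now, k=2):
        """Pick up to k of the most recent tips for a new block to reference."""
        self.prune(now)
        newest_first = sorted(self.tips, key=self.tips.get, reverse=True)
        return newest_first[:k]

pool = TipPool(max_size=3, max_age=10)
for name, issued, now in [("a", 0, 1), ("b", 2, 3), ("c", 4, 5), ("d", 6, 7)]:
    pool.add(name, issued, now)
rejected = pool.add("stale", issued_at=0, now=12)
parents = pool.select_parents(now=13)
print(rejected, parents)
```

In this toy run, block `a` is evicted when the pool overflows, the stale block is refused outright, and block `b` ages out before parent selection.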
[[2304.06125] Assessment Framework for Deepfake Detection in Real-world Situations](http://arxiv.org/abs/2304.06125) #robust
Detecting digital face manipulation in images and video has attracted extensive attention due to the potential risk to public trust. To counteract the malicious usage of such techniques, deep learning-based deepfake detection methods have been employed and have exhibited remarkable performance. However, the performance of such detectors is often assessed on related benchmarks that hardly reflect real-world situations. For example, the impact of various image and video processing operations and typical workflow distortions on detection accuracy has not been systematically measured. In this paper, a more reliable assessment framework is proposed to evaluate the performance of learning-based deepfake detectors in more realistic settings. To the best of our knowledge, it is the first systematic assessment approach for deepfake detectors that not only reports the general performance under real-world conditions but also quantitatively measures their robustness toward different processing operations. To demonstrate the effectiveness and usage of the framework, extensive experiments and detailed analysis of three popular deepfake detection methods are further presented in this paper. In addition, a stochastic degradation-based data augmentation method driven by realistic processing operations is designed, which significantly improves the robustness of deepfake detectors.
[[2304.06211] Boosting Video Object Segmentation via Space-time Correspondence Learning](http://arxiv.org/abs/2304.06211) #robust
Current top-leading solutions for video object segmentation (VOS) typically follow a matching-based regime: for each query frame, the segmentation mask is inferred according to its correspondence to previously processed frames and the first annotated frame. They simply exploit the supervisory signals from the groundtruth masks for learning mask prediction only, without posing any constraint on the space-time correspondence matching, which, however, is the fundamental building block of such a regime. To alleviate this crucial yet commonly ignored issue, we devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning. Through comprehensively exploring the intrinsic coherence in videos on pixel and object levels, our algorithm reinforces the standard, fully supervised training of mask segmentation with label-free, contrastive correspondence learning. Without requiring extra annotation cost during training, causing speed delay during deployment, or incurring architectural modification, our algorithm provides solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017, and YouTube-VOS2018&2019, on top of well-known matching-based VOS solutions.
[[2304.06287] NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds](http://arxiv.org/abs/2304.06287) #robust
We present NeRFVS, a novel neural radiance fields (NeRF) based method to enable free navigation in a room. NeRF achieves impressive performance in rendering images for novel views similar to the input views while suffering for novel views that are significantly different from the training views. To address this issue, we utilize the holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) A robust depth loss that can tolerate the error of the pseudo depth map to guide the geometry learning of NeRF; 2) A variance loss to regularize the variance of implicit neural representations to reduce the geometry and color ambiguity in the learning procedure. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence brought by the view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.
[[2304.06345] ASR: Attention-alike Structural Re-parameterization](http://arxiv.org/abs/2304.06345) #robust
The structural re-parameterization (SRP) technique is a novel deep learning technique that achieves interconversion between different network architectures through equivalent parameter transformations. This technique enables the mitigation of the extra costs for performance improvement during training, such as parameter size and inference time, through these transformations during inference, and therefore SRP has great potential for industrial and practical applications. The existing SRP methods have successfully considered many commonly used architectures, such as normalizations, pooling methods, and multi-branch convolution. However, the widely used self-attention modules cannot be directly implemented by SRP because these modules usually act on the backbone network in a multiplicative manner and the modules' output is input-dependent during inference, which limits the application scenarios of SRP. In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, the Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training. This observation inspires us to propose a simple-yet-effective attention-alike structural re-parameterization (ASR) that allows us to achieve SRP for a given network while enjoying the effectiveness of the self-attention mechanism. Extensive experiments conducted on several standard benchmarks demonstrate the effectiveness of ASR in generally improving the performance of existing backbone networks, self-attention modules, and SRP methods without any elaborated model crafting. We also analyze the limitations and provide experimental or theoretical evidence for the strong robustness of the proposed ASR.
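An "equivalent parameter transformation" is easiest to see in the classic case of folding a batch-normalization layer into the preceding linear layer. The sketch below shows that textbook case only, not the ASR method; the weights and statistics are made-up values. Because the normalization applies a fixed scale and shift at inference time, both can be absorbed into the weights and bias.

```python
import math

def linear(x, w, b):
    """Single-output linear layer: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def batchnorm(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta

def fold(w, b, gamma, beta, mean, var, eps=1e-5):
    """Absorb the normalization's scale and shift into the linear layer."""
    scale = gamma / math.sqrt(var + eps)
    return [wi * scale for wi in w], (b - mean) * scale + beta

# Made-up parameters for illustration.
w, b = [0.5, -1.0, 2.0], 0.3
bn = dict(gamma=1.5, beta=-0.2, mean=0.1, var=4.0)
x = [1.0, 2.0, 3.0]

two_step = batchnorm(linear(x, w, b), **bn)  # training-time structure
w_f, b_f = fold(w, b, **bn)
one_step = linear(x, w_f, b_f)               # re-parameterized structure
print(math.isclose(two_step, one_step, rel_tol=1e-9))
```

The folding works only because the scale is a constant. An input-dependent multiplier cannot be absorbed this way, which is the obstacle the abstract describes for self-attention modules.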
[[2304.06370] Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention](http://arxiv.org/abs/2304.06370) #robust
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework SuMoCo to learn better representations. Furthermore, we annotate the test split of the DAD dataset at a finer granularity to enable the multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0\%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.
[[2304.06547] RadarGNN: Transformation Invariant Graph Neural Network for Radar-based Perception](http://arxiv.org/abs/2304.06547) #robust
A reliable perception has to be robust against challenging environmental conditions. Therefore, recent efforts focused on the use of radar sensors in addition to camera and lidar sensors for perception applications. However, the sparsity of radar point clouds and the poor data availability remain challenging for current perception methods. To address these challenges, a novel graph neural network is proposed that does not just use the information of the points themselves but also the relationships between the points. The model is designed to consider both point features and point-pair features, embedded in the edges of the graph. Furthermore, a general approach for achieving transformation invariance is proposed which is robust against unseen scenarios and also counteracts the limited data availability. The transformation invariance is achieved by an invariant data representation rather than an invariant model architecture, making it applicable to other methods. The proposed RadarGNN model outperforms all previous methods on the RadarScenes dataset. In addition, the effects of different invariances on the object detection and semantic segmentation quality are investigated. The code is made available as open-source software under https://github.com/TUMFTM/RadarGNN.
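The idea of obtaining invariance from the data representation rather than from the model can be illustrated with pairwise distances, which do not change under rotation or translation of a point cloud. This is a minimal sketch: the abstract does not list the exact point-pair features, so Euclidean distance is an assumed example, and the points are made up.

```python
import math

def pair_distances(points):
    """Euclidean distance for every unordered pair of 2D points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]

def rigid_transform(points, angle, shift):
    """Rotate by angle (radians) around the origin, then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = shift
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]

# Made-up radar detections in the sensor frame.
cloud = [(1.0, 0.0), (4.0, 4.0), (-2.0, 3.5), (0.5, -1.5)]
moved = rigid_transform(cloud, angle=0.7, shift=(10.0, -3.0))

before = pair_distances(cloud)
after = pair_distances(moved)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)))
```

Features like these can be attached to graph edges and fed to any downstream model, which matches the abstract's point that the approach is not tied to one architecture.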
[[2304.06672] LSFSL: Leveraging Shape Information in Few-shot Learning](http://arxiv.org/abs/2304.06672) #robust
Few-shot learning (FSL) techniques seek to learn the underlying patterns in data using fewer samples, analogous to how humans learn from limited experience. In this limited-data scenario, the challenges associated with deep neural networks, such as shortcut learning and texture bias behaviors, are further exacerbated. Moreover, the significance of addressing shortcut learning is not yet fully explored in the few-shot setup. To address these issues, we propose LSFSL, which enforces the model to learn more generalizable features utilizing the implicit prior information present in the data. Through comprehensive analyses, we demonstrate that LSFSL-trained models are less vulnerable to alteration in color schemes, statistical correlations, and adversarial perturbations leveraging the global semantics in the data. Our findings highlight the potential of incorporating relevant priors in few-shot approaches to increase robustness and generalization.
[[2304.06703] Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement](http://arxiv.org/abs/2304.06703) #robust
Burst image processing has become increasingly popular in recent years. However, it is a challenging task since individual burst images undergo multiple degradations and often have mutual misalignments resulting in ghosting and zipper artifacts. Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which tends to limit these approaches in challenging cases. Another key challenge lies in the robust up-sampling of burst frames. The existing up-sampling methods cannot effectively utilize the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers at the same time. To address these challenges, we propose a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality raw images. GMTNet consists of three modules optimized for burst processing tasks: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct a high-quality output image. Detailed experimental analysis on five datasets validates our approach and sets a state-of-the-art for burst super-resolution, burst denoising, and low-light burst enhancement.
[[2304.06719] RoboBEV: Towards Robust Bird's Eye View Perception under Corruptions](http://arxiv.org/abs/2304.06719) #robust
The recent advances in camera-based bird's eye view (BEV) representation exhibit great potential for in-vehicle 3D perception. Despite the substantial progress achieved on standard benchmarks, the robustness of BEV algorithms has not been thoroughly examined, which is critical for safe operations. To bridge this gap, we introduce RoboBEV, a comprehensive benchmark suite that encompasses eight distinct corruptions, including Bright, Dark, Fog, Snow, Motion Blur, Color Quant, Camera Crash, and Frame Lost. Based on it, we undertake extensive evaluations across a wide range of BEV-based models to understand their resilience and reliability. Our findings indicate a strong correlation between absolute performance on in-distribution and out-of-distribution datasets. Nonetheless, there are considerable variations in relative performance across different approaches. Our experiments further demonstrate that pre-training and depth-free BEV transformation have the potential to enhance out-of-distribution robustness. Additionally, utilizing long and rich temporal information largely helps with robustness. Our findings provide valuable insights for designing future BEV models that can achieve both accuracy and robustness in real-world deployments.
[[2304.06364] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](http://arxiv.org/abs/2304.06364) #robust
Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released at https://github.com/microsoft/AGIEval.
[[2304.06048] RELS-DQN: A Robust and Efficient Local Search Framework for Combinatorial Optimization](http://arxiv.org/abs/2304.06048) #robust
Combinatorial optimization (CO) aims to efficiently find the best solution to NP-hard problems ranging from statistical physics to social media marketing. A wide range of CO applications can benefit from local search methods because they allow reversible action over greedy policies. Deep Q-learning (DQN) using message-passing neural networks (MPNN) has shown promise in replicating the local search behavior and obtaining comparable results to the local search algorithms. However, the over-smoothing and the information loss during the iterations of message passing limit its robustness across applications, and the large message vectors result in memory inefficiency. Our paper introduces RELS-DQN, a lightweight DQN framework that exhibits the local search behavior while providing practical scalability. Using the RELS-DQN model trained on one application, it can generalize to various applications by providing solution values higher than or equal to both the local search algorithms and the existing DQN models while remaining efficient in runtime and memory.
[[2304.06054] Landslide Susceptibility Prediction Modeling Based on Self-Screening Deep Learning Model](http://arxiv.org/abs/2304.06054) #robust
Landslide susceptibility prediction has always been an important and challenging task. However, there are some uncertain problems to be solved in susceptibility modeling, such as the error of landslide samples and the complex nonlinear relationship between environmental factors. A self-screening graph convolutional network and long short-term memory network (SGCN-LSTM) is proposed in this paper to overcome the above problems in landslide susceptibility prediction. The SGCN-LSTM model has the advantages of wide width and good learning ability. The landslide samples with large errors outside the set threshold interval are eliminated by the self-screening network, and the nonlinear relationship between environmental factors can be extracted from both spatial nodes and time series, so as to better simulate the nonlinear relationship between environmental factors. The SGCN-LSTM model was applied to landslide susceptibility prediction in Anyuan County, Jiangxi Province, China, and compared with Cascade-parallel Long Short-Term Memory and Conditional Random Fields (CPLSTM-CRF), Random Forest (RF), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD) and Logistic Regression (LR) models. The landslide prediction experiment in Anyuan County showed that the total accuracy and AUC of the SGCN-LSTM model were the highest among the six models, and the total accuracy reached 92.38%, which was 5.88%, 12.44%, 19.65%, 19.92% and 20.34% higher than those of the CPLSTM-CRF, RF, SVM, SGD and LR models, respectively. The AUC value reached 0.9782, which was 0.0305, 0.0532, 0.1875, 0.1909 and 0.1829 higher than the other five models, respectively. In conclusion, compared with some existing traditional machine learning models, the SGCN-LSTM model proposed in this paper has higher landslide prediction accuracy and better robustness, and has a good application prospect in the LSP field.
[[2304.06344] Streamlined Framework for Agile Forecasting Model Development towards Efficient Inventory Management](http://arxiv.org/abs/2304.06344) #robust
This paper proposes a framework for developing forecasting models by streamlining the connections between core components of the developmental process. The proposed framework enables swift and robust integration of new datasets, experimentation on different algorithms, and selection of the best models. We start with the datasets of different issues and apply pre-processing steps to clean and engineer meaningful representations of time-series data. To identify robust training configurations, we introduce a novel mechanism of multiple cross-validation strategies. We apply different evaluation metrics to find the best-suited models for varying applications. One of the referent applications is our participation in the intelligent forecasting competition held by the United States Agency of International Development (USAID). Finally, we leverage the flexibility of the framework by applying different evaluation metrics to assess the performance of the models in inventory management settings.
[[2304.06715] Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance](http://arxiv.org/abs/2304.06715) #robust
Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property. We formalize this intuition through the notion of explanation invariance and equivariance by leveraging the formalism from geometric deep learning. Through this rigorous formalism, we derive (1) two metrics to measure the robustness of any interpretability method with respect to the model symmetry group; (2) theoretical robustness guarantees for some popular interpretability methods and (3) a systematic approach to increase the invariance of any interpretability method with respect to a symmetry group. By empirically measuring our metrics for explanations of models associated with various modalities and symmetry groups, we derive a set of 5 guidelines to allow users and developers of interpretability methods to produce robust explanations.
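As a rough illustration of the two kinds of score this abstract describes, the sketch below evaluates an explanation method on a toy model that is invariant under cyclic shifts. The toy model, the gradient-times-input explanation, and the use of cosine similarity averaged over the group are all assumptions made for illustration; they are not taken from the paper and may differ from its exact definitions.

```python
import math

def model(x):
    # Toy model: sum of squares, invariant under any permutation of x.
    return sum(v * v for v in x)

def explain(x):
    # Gradient-times-input attribution for the toy model: (2 * v) * v per feature.
    return [2.0 * v * v for v in x]

def shift(x, k):
    # Group action: cyclic shift of the feature vector by k positions.
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def invariance_score(x):
    # Mean similarity between e(g.x) and e(x) over all shifts g.
    n = len(x)
    return sum(cosine(explain(shift(x, k)), explain(x)) for k in range(n)) / n

def equivariance_score(x):
    # Mean similarity between e(g.x) and g.e(x) over all shifts g.
    n = len(x)
    return sum(
        cosine(explain(shift(x, k)), shift(explain(x), k)) for k in range(n)
    ) / n

x = [1.0, 2.0, 3.0, 4.0]
inv = invariance_score(x)
equiv = equivariance_score(x)
```

In this toy setting the explanation moves together with the input, so the equivariance score is essentially 1 while the invariance score is lower. This matches the intuition that a feature attribution for a shift-invariant model would be expected to shift with the input rather than stay fixed.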
[[2304.06274] EWT: Efficient Wavelet-Transformer for Single Image Denoising](http://arxiv.org/abs/2304.06274) #extraction
Transformer-based image denoising methods have achieved encouraging results in the past year. However, they must use linear operations to model long-range dependencies, which greatly increases model inference time and consumes GPU storage space. Compared with convolutional neural network-based methods, current Transformer-based image denoising methods cannot achieve a balance between performance improvement and resource consumption. In this paper, we propose an Efficient Wavelet Transformer (EWT) for image denoising. Specifically, we use Discrete Wavelet Transform (DWT) and Inverse Wavelet Transform (IWT) for downsampling and upsampling, respectively. This method can fully preserve the image features while reducing the image resolution, thereby greatly reducing the device resource consumption of the Transformer model. Furthermore, we propose a novel Dual-stream Feature Extraction Block (DFEB) to extract image features at different levels, which can further reduce model inference time and GPU memory usage. Experiments show that our method speeds up the original Transformer by more than 80%, reduces GPU memory usage by more than 60%, and achieves excellent denoising results. All code will be public.
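To illustrate why wavelet-based downsampling can halve the resolution without discarding information, here is a minimal sketch of a single-level 2D Haar transform and its inverse in plain Python. The choice of the Haar wavelet (in its averaging form) and the function names `haar_dwt2` and `haar_idwt2` are assumptions for illustration; the abstract does not say which wavelet EWT uses.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform (averaging form) of an even-sized image.

    Returns four sub-bands, each with half the height and half the width.
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 4  # local average (the downsampled image)
            lh[r][s] = (a - b + c - d) / 4  # difference between the two columns
            hl[r][s] = (a + b - c - d) / 4  # difference between the two rows
            hh[r][s] = (a - b - c + d) / 4  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image from the sub-bands."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = p + q + u + v
            img[2 * r][2 * s + 1] = p - q + u - v
            img[2 * r + 1][2 * s] = p + q - u - v
            img[2 * r + 1][2 * s + 1] = p - q - u + v
    return img

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
```

The four half-resolution sub-bands together hold as many values as the original image, which is why the inverse transform can restore it exactly.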
[[2304.06385] TransHP: Image Classification with Hierarchical Prompting](http://arxiv.org/abs/2304.06385) #extraction
This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits the descendant-class discrimination. We think it well imitates human visual recognition, i.e., humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We model this prompting mechanism into a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) on-the-fly predicting the coarse class of the input image at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on relatively subtle differences among the descendant classes. Extensive experiments show that TransHP improves image classification on accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP well exploits the hierarchical information.
[[2304.06447] PDF-VQA: A New Dataset for Real-World VQA on PDF Documents](http://arxiv.org/abs/2304.06447) #extraction
Document-based Visual Question Answering examines the understanding of document images given natural language questions. We propose a new document-based VQA dataset, PDF-VQA, to comprehensively examine the document understanding from various aspects, including document element recognition, document layout structural understanding as well as contextual understanding and key information extraction. Our PDF-VQA dataset extends the current scale of document understanding, which is limited to a single document page, to a new scale that asks questions over the full document of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchically structural relationships between different document elements to boost the document structural understanding. The performances are compared with several baselines over different question types and tasks. The full dataset will be released after paper acceptance.
[[2304.06203] LeafAI: query generator for clinical cohort discovery rivaling a human programmer](http://arxiv.org/abs/2304.06203) #extraction
Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria.
Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these, as well as a knowledge base of the Unified Medical Language System (UMLS) and linked ontologies. To enable data-model agnostic query creation, we introduce a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared the capability of LeafAI to a human database programmer to identify patients who had been enrolled in 8 clinical trials conducted at our institution. We measured performance by the number of actual enrolled patients matched by generated queries.
Results: LeafAI matched a mean 43% of enrolled patients with 27,225 eligible across 8 clinical trials, compared to 27% matched and 14,587 eligible in queries by a human database programmer. The human programmer spent 26 total hours crafting queries compared to several minutes by LeafAI.
Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival a human programmer in finding patients eligible for clinical trials.
[[2304.06248] LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model](http://arxiv.org/abs/2304.06248) #extraction
Universally modeling all typical information extraction tasks (UIE) with one generative language model (GLM) has shown great potential in recent work, where various IE predictions are unified into a linearized hierarchical expression under a GLM. Syntactic structure information, a type of effective feature which has been extensively utilized in the IE community, should also be beneficial to UIE. In this work, we propose a novel structure-aware GLM, fully unleashing the power of syntactic knowledge for UIE. A heterogeneous structure inductor is explored to unsupervisedly induce rich heterogeneous structural representations by post-training an existing GLM. In particular, a structural broadcaster is devised to compact various latent trees into explicit high-order forests, helping to guide a better generation during decoding. We finally introduce a task-oriented structure fine-tuning mechanism, further adjusting the learned structures to best match the end task's needs. Over 12 IE benchmarks across 7 tasks, our system shows significant improvements over the baseline UIE system. Further in-depth analyses show that our GLM learns rich task-adaptive structural bias that greatly resolves the UIE crux, the long-range dependence issue and boundary identifying. Source codes are open at https://github.com/ChocoWu/LasUIE.
[[2304.06634] PGTask: Introducing the Task of Profile Generation from Dialogues](http://arxiv.org/abs/2304.06634) #extraction
Recent approaches have attempted to personalize dialogue systems by leveraging profile information into models. However, this knowledge is scarce and difficult to obtain, which makes the extraction/generation of profile information from dialogues a fundamental asset. To surpass this limitation, we introduce the Profile Generation Task (PGTask). We contribute with a new dataset for this problem, comprising profile sentences aligned with related utterances, extracted from a corpus of dialogues. Furthermore, using state-of-the-art methods, we provide a benchmark for profile generation on this novel dataset. Our experiments disclose the challenges of profile generation, and we hope that this introduces a new research direction.
[[2304.06551] Decentralized federated learning methods for reducing communication cost and energy consumption in UAV networks](http://arxiv.org/abs/2304.06551) #federate
Unmanned aerial vehicles (UAV) or drones play many roles in a modern smart city such as the delivery of goods, mapping real-time road traffic and monitoring pollution. The ability of drones to perform these functions often requires the support of machine learning technology. However, traditional machine learning models for drones encounter data privacy problems, communication costs and energy limitations. Federated Learning, an emerging distributed machine learning approach, is an excellent solution to address these issues. Federated learning (FL) allows drones to train local models without transmitting raw data. However, existing FL requires a central server to aggregate the trained model parameters of the UAV. A failure of the central server can significantly impact the overall training. In this paper, we propose two aggregation methods: Commutative FL and Alternate FL, based on the existing architecture of decentralised Federated Learning for UAV Networks (DFL-UN) by adding a unique aggregation method of decentralised FL. Those two methods can effectively control energy consumption and communication cost by controlling the number of local training epochs, local communication, and global communication. The simulation results of the proposed training methods are also presented to verify the feasibility and efficiency of the architecture compared with two benchmark methods (e.g. standard machine learning training and standard single aggregation server training). The simulation results show that the proposed methods outperform the benchmark methods in terms of operational stability, energy consumption and communication cost.
[[2304.06707] Toward Reliable Human Pose Forecasting with Uncertainty](http://arxiv.org/abs/2304.06707) #fair
Recently, there has been an arms race of pose forecasting methods aimed at solving the spatio-temporal task of predicting a sequence of future 3D poses of a person given a sequence of past observed ones. However, the lack of unified benchmarks and limited uncertainty analysis have hindered progress in the field. To address this, we first develop an open-source library for human pose forecasting, featuring multiple models, datasets, and standardized evaluation metrics, with the aim of promoting research and moving toward a unified and fair evaluation. Second, we devise two types of uncertainty in the problem to increase performance and convey better trust: 1) we propose a method for modeling aleatoric uncertainty by using uncertainty priors to inject knowledge about the behavior of uncertainty. This focuses the capacity of the model in the direction of more meaningful supervision while reducing the number of learned parameters and improving stability; 2) we introduce a novel approach for quantifying the epistemic uncertainty of any model through clustering and measuring the entropy of its assignments. Our experiments demonstrate up to $25\%$ improvements in accuracy and better performance in uncertainty estimation.
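The idea of measuring the entropy of cluster assignments can be sketched in a few lines. The abstract does not say how assignments are computed, so this example assumes soft assignments obtained from a softmax over negative squared distances to fixed centroids, and uses Shannon entropy as the uncertainty score. The functions `soft_assignments` and `assignment_entropy`, the centroids, and the 2D points standing in for pose representations are all hypothetical.

```python
import math

def soft_assignments(point, centroids, temperature=1.0):
    """Softmax over negative squared distances to each centroid (an assumed choice)."""
    logits = [
        -sum((p - c) ** 2 for p, c in zip(point, centroid)) / temperature
        for centroid in centroids
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assignment_entropy(probs):
    """Shannon entropy (in nats) of a cluster-assignment distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

centroids = [[0.0, 0.0], [4.0, 0.0]]
# A point close to one centroid gives a confident, low-entropy assignment.
confident = assignment_entropy(soft_assignments([0.1, 0.0], centroids))
# A point halfway between the centroids gives the most uncertain assignment.
ambiguous = assignment_entropy(soft_assignments([2.0, 0.0], centroids))
```

Under these assumptions, higher entropy flags inputs that do not fall clearly into any cluster, which is one plausible reading of an epistemic uncertainty signal.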
[[2304.06333] Priors for symbolic regression](http://arxiv.org/abs/2304.06333) #fair
When choosing between competing symbolic models for a data set, a human will naturally prefer the "simpler" expression or the one which more closely resembles equations previously seen in a similar context. This suggests a non-uniform prior on functions, which is, however, rarely considered within a symbolic regression (SR) framework. In this paper we develop methods to incorporate detailed prior information on both functions and their parameters into SR. Our prior on the structure of a function is based on an $n$-gram language model, which is sensitive to the arrangement of operators relative to one another in addition to the frequency of occurrence of each operator. We also develop a formalism based on the Fractional Bayes Factor to treat numerical parameter priors in such a way that models may be fairly compared through the Bayesian evidence, and explicitly compare Bayesian, Minimum Description Length and heuristic methods for model selection. We demonstrate the performance of our priors relative to literature standards on benchmarks and a real-world dataset from the field of cosmology.
[[2304.06596] Beyond Submodularity: A Unified Framework of Randomized Set Selection with Group Fairness Constraints](http://arxiv.org/abs/2304.06596) #fair
Machine learning algorithms play an important role in a variety of important decision-making processes, including targeted advertisement displays, home loan approvals, and criminal behavior predictions. Given the far-reaching impact of these algorithms, it is crucial that they operate fairly, free from bias or prejudice towards certain groups in the population. Ensuring impartiality in these algorithms is essential for promoting equality and avoiding discrimination. To this end we introduce a unified framework for randomized subset selection that incorporates group fairness constraints. Our problem involves a global utility function and a set of group utility functions for each group; here, a group refers to a group of individuals (e.g., people) sharing the same attributes (e.g., gender). Our aim is to generate a distribution across feasible subsets, specifying the selection probability of each feasible set, to maximize the global utility function while meeting a predetermined quota for each group utility function in expectation. Note that there may not necessarily be any direct connections between the global utility function and each group utility function. We demonstrate that this framework unifies and generalizes many significant applications in machine learning and operations research. Our algorithmic results either improve the best known results or provide the first approximation algorithms for new applications.
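The optimisation problem described in this abstract can be written compactly as follows. The notation is assumed for illustration and may differ from the paper's: $\mathcal{F}$ is the family of feasible subsets, $p$ the distribution over them, $f$ the global utility, $g_i$ a utility function for group $i$ (assuming one per group), and $\alpha_i$ its quota. Treating the objective as an expectation and the quota as a lower bound is also an assumption.

```latex
\begin{aligned}
\max_{p} \quad & \sum_{S \in \mathcal{F}} p(S)\, f(S) \\
\text{s.t.} \quad & \sum_{S \in \mathcal{F}} p(S)\, g_i(S) \ge \alpha_i
    \quad \text{for each group } i, \\
& \sum_{S \in \mathcal{F}} p(S) = 1, \qquad p(S) \ge 0
    \quad \text{for all } S \in \mathcal{F}.
\end{aligned}
```

Written this way, the problem is linear in $p$, but $\mathcal{F}$ can be exponentially large, which is one reason approximation algorithms are of interest.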
[[2304.06133] Towards Evaluating Explanations of Vision Transformers for Medical Imaging](http://arxiv.org/abs/2304.06133) #interpretability
As deep learning models increasingly find applications in critical domains such as medical imaging, the need for transparent and trustworthy decision-making becomes paramount. Many explainability methods provide insights into how these models make predictions by attributing importance to input features. As Vision Transformer (ViT) becomes a promising alternative to convolutional neural networks for image classification, its interpretability remains an open research question. This paper investigates the performance of various interpretation methods on a ViT applied to classify chest X-ray images. We introduce the notion of evaluating faithfulness, sensitivity, and complexity of ViT explanations. The obtained results indicate that Layerwise relevance propagation for transformers outperforms Local interpretable model-agnostic explanations and Attention visualization, providing a more accurate and reliable representation of what a ViT has actually learned. Our findings provide insights into the applicability of ViT explanations in medical imaging and highlight the importance of using appropriate evaluation criteria for comparing them.
[[2304.06258] MProtoNet: A Case-Based Interpretable Model for Brain Tumor Classification with 3D Multi-parametric Magnetic Resonance Imaging](http://arxiv.org/abs/2304.06258) #interpretability
Recent applications of deep convolutional neural networks in medical imaging raise concerns about their interpretability. While most explainable deep learning applications use post hoc methods (such as GradCAM) to generate feature attribution maps, there is a new type of case-based reasoning models, namely ProtoPNet and its variants, which identify prototypes during training and compare input image patches with those prototypes. We propose the first medical prototype network (MProtoNet) to extend ProtoPNet to brain tumor classification with 3D multi-parametric magnetic resonance imaging (mpMRI) data. To address different requirements between 2D natural images and 3D mpMRIs especially in terms of localizing attention regions, a new attention module with soft masking and online-CAM loss is introduced. Soft masking helps sharpen attention maps, while online-CAM loss directly utilizes image-level labels when training the attention module. MProtoNet achieves statistically significant improvements in interpretability metrics of both correctness and localization coherence (with a best activation precision of $0.713\pm0.058$) without human-annotated labels during training, when compared with GradCAM and several ProtoPNet variants. The source code is available at https://github.com/aywi/mprotonet.
[[2304.06391] VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking](http://arxiv.org/abs/2304.06391) #interpretability
The lack of interpretability of the Vision Transformer may hinder its use in critical real-world applications despite its effectiveness. To overcome this issue, we propose a post-hoc interpretability method called VISION DIFFMASK, which uses the activations of the model's hidden layers to predict the relevant parts of the input that contribute to its final predictions. Our approach uses a gating mechanism to identify the minimal subset of the original input that preserves the predicted distribution over classes. We demonstrate the faithfulness of our method, by introducing a faithfulness task, and comparing it to other state-of-the-art attribution methods on CIFAR-10 and ImageNet-1K, achieving compelling results. To aid reproducibility and further extension of our work, we open source our implementation: https://github.com/AngelosNal/Vision-DiffMask
[[2304.06653] G2T: A simple but versatile framework for topic modeling based on pretrained language model and community detection](http://arxiv.org/abs/2304.06653) #interpretability
It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings with an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches suffer from the inability to select appropriate parameters and incomplete models that overlook the quantitative relation between words with topics and topics with text. To solve these issues, we propose graph to topic (G2T), a simple but effective framework for topic modelling. The framework is composed of four modules. First, document representation is acquired using pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in document semantic graphs are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word--topic distribution is computed based on a variant of TFIDF. Automatic evaluation suggests that G2T achieved state-of-the-art performance on both English and Chinese documents with different lengths. Human judgements demonstrate that G2T can produce topics with better interpretability and coverage than baselines. In addition, G2T can not only determine the topic number automatically but also give the probabilistic distribution of words in topics and topics in documents. Finally, G2T is publicly available, and the distillation experiments provide instruction on how it works.
[[2304.06118] Explanation of Face Recognition via Saliency Maps](http://arxiv.org/abs/2304.06118) #explainability
Despite the significant progress in face recognition in the past years, face recognition systems are often treated as "black boxes" and have been criticized for lacking explainability. It becomes increasingly important to understand the characteristics and decisions of deep face recognition systems to make them more acceptable to the public. Explainable face recognition (XFR) refers to the problem of interpreting why the recognition model matches a probe face with one identity over others. Recent studies have explored the use of visual saliency maps as an explanation, but they often lack a deeper analysis in the context of face recognition. This paper starts by proposing a rigorous definition of explainable face recognition (XFR) which focuses on the decision-making process of the deep recognition model. Following the new definition, a similarity-based RISE algorithm (S-RISE) is then introduced to produce high-quality visual saliency maps. Furthermore, an evaluation approach is proposed to systematically validate the reliability and accuracy of general visual saliency-based XFR methods.
[[2304.06107] PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting](http://arxiv.org/abs/2304.06107) #diffusion
Generative models such as StyleGAN2 and Stable Diffusion have achieved state-of-the-art performance in computer vision tasks such as image synthesis, inpainting, and de-noising. However, current generative models for face inpainting often fail to preserve fine facial details and the identity of the person, despite creating aesthetically convincing image structures and textures. In this work, we propose Person Aware Tuning (PAT) of Mask-Aware Transformer (MAT) for face inpainting, which addresses this issue. Our proposed method, PATMAT, effectively preserves identity by incorporating reference images of a subject and fine-tuning a MAT architecture trained on faces. By using ~40 reference images, PATMAT creates anchor points in MAT's style module, and tunes the model using the fixed anchors to adapt the model to a new face identity. Moreover, PATMAT's use of multiple images per anchor during training allows the model to use fewer reference images than competing methods. We demonstrate that PATMAT outperforms state-of-the-art models in terms of image quality, the preservation of person-specific details, and the identity of the subject. Our results suggest that PATMAT can be a promising approach for improving the quality of personalized face inpainting.
[[2304.06140] An Edit Friendly DDPM Noise Space: Inversion and Manipulations](http://arxiv.org/abs/2304.06140) #diffusion
Denoising diffusion probabilistic models (DDPMs) employ a sequence of white Gaussian noise samples to generate an image. In analogy with GANs, those noise maps could be considered as the latent code associated with the generated image. However, this native noise space does not possess a convenient structure, and is thus challenging to work with in editing tasks. Here, we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means, and present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated). As opposed to the native DDPM noise space, the edit-friendly noise maps do not have a standard normal distribution and are not statistically independent across timesteps. However, they allow perfect reconstruction of any desired image, and simple transformations on them translate into meaningful manipulations of the output image (e.g., shifting, color edits). Moreover, in text-conditional models, fixing those noise maps while changing the text prompt, modifies semantics while retaining structure. We illustrate how this property enables text-based editing of real images via the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM inversion). We also show how it can be used within existing diffusion-based editing methods to improve their quality and diversity.
[[2304.06408] Intriguing properties of synthetic images: from generative adversarial networks to diffusion models](http://arxiv.org/abs/2304.06408) #diffusion
Detecting fake images is becoming a major goal of computer vision. This need is becoming more and more pressing with the continuous improvement of synthesis methods based on Generative Adversarial Networks (GAN), and even more with the appearance of powerful methods based on Diffusion Models (DM). Towards this end, it is important to gain insight into which image features better discriminate fake images from real ones. In this paper we report on our systematic study of a large number of image generators of different families, aimed at discovering the most forensically relevant characteristics of real and generated images. Our experiments provide a number of interesting observations and shed light on some intriguing properties of synthetic images: (1) not only the GAN models but also the DM and VQ-GAN (Vector Quantized Generative Adversarial Networks) models give rise to visible artifacts in the Fourier domain and exhibit anomalous regular patterns in the autocorrelation; (2) when the dataset used to train the model lacks sufficient variety, its biases can be transferred to the generated images; (3) synthetic and real images exhibit significant differences in the mid-high frequency signal content, observable in their radial and angular spectral power distributions.
[[2304.06648] DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning](http://arxiv.org/abs/2304.06648) #diffusion
Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enables fast adaptation to new domains. DiffFit is embarrassingly simple: it fine-tunes only the bias terms and newly added scaling factors in specific layers, yet yields a significant training speed-up and reduced model storage costs. Compared with full fine-tuning, DiffFit achieves 2$\times$ training speed-up and only needs to store approximately 0.12% of the total model parameters. Intuitive theoretical analysis has been provided to justify the efficacy of scaling factors on fast adaptation. On 8 downstream datasets, DiffFit achieves superior or competitive performances compared to the full fine-tuning while being more efficient. Remarkably, we show that DiffFit can adapt a pre-trained low-resolution generative model to a high-resolution one by adding minimal cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$ more training efficient than the closest competitor.
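The parameter-selection idea can be illustrated with a toy layer in plain Python. This is a sketch under stated assumptions, not DiffFit's implementation: the layer form `gamma * (W x + b)`, the per-output scaling vector, the helper names, and the 512-dimensional size are all hypothetical, and the resulting fraction is not meant to reproduce the 0.12% figure from the abstract.

```python
def make_layer(d_in, d_out):
    """Toy 'pre-trained' linear layer stored as plain lists (placeholder values)."""
    return {
        "weight": [[0.01] * d_in for _ in range(d_out)],  # frozen during fine-tuning
        "bias": [0.0] * d_out,                            # fine-tuned
    }

def add_scale(layer):
    """Attach a newly added scaling factor, initialised to 1 so outputs are unchanged."""
    layer["gamma"] = [1.0] * len(layer["bias"])
    return layer

def forward(layer, x):
    """Compute gamma * (W x + b) for one input vector."""
    return [
        g * (sum(w * v for w, v in zip(row, x)) + b)
        for row, b, g in zip(layer["weight"], layer["bias"], layer["gamma"])
    ]

def param_count(p):
    # A parameter is either a matrix (list of rows) or a vector (list of floats).
    return sum(len(row) for row in p) if isinstance(p[0], list) else len(p)

def trainable_fraction(layer, trainable=("bias", "gamma")):
    """Share of parameters that would be updated (and stored) after fine-tuning."""
    total = sum(param_count(p) for p in layer.values())
    tuned = sum(param_count(layer[name]) for name in trainable)
    return tuned / total

layer = add_scale(make_layer(d_in=512, d_out=512))
output = forward(layer, [1.0] * 512)
fraction = trainable_fraction(layer)
```

Because the weight matrix grows with the product of the input and output sizes while the bias and scaling vectors grow only with the output size, the trainable share shrinks as layers get wider.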
[[2304.06700] Learning Controllable 3D Diffusion Models from Single-view Images](http://arxiv.org/abs/2304.06700) #diffusion
Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, we present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis for single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (https://jiataogu.me/control3diff) for video comparisons.
[[2304.06711] DiffusionRig: Learning Personalized Priors for Facial Appearance Editing](http://arxiv.org/abs/2304.06711) #diffusion
We address the problem of learning person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person. This enables us to edit this specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details. Key to our approach, which we dub DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images by an off-the-shelf estimator. At a high level, DiffusionRig learns to map simplistic renderings of 3D face models to realistic photos of a given person. Specifically, DiffusionRig is trained in two stages: it first learns generic facial priors from a large-scale face dataset and then person-specific priors from a small portrait photo collection of the person of interest. By learning the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig" the lighting, facial expression, head pose, etc. of a portrait photo, conditioned only on coarse 3D models, while preserving this person's identity and other high-frequency characteristics. Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism. Please see the project website (https://diffusionrig.github.io) for the supplemental material, video, code, and data.
[[2304.06714] Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction](http://arxiv.org/abs/2304.06714) #diffusion
3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning, even from sparsely available views. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction.
[[2304.06720] Expressive Text-to-Image Generation with Rich Text](http://arxiv.org/abs/2304.06720) #diffusion
Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on cross-attention maps of a vanilla diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.