[[2304.06727] Contingency Analyses with Warm Starter using Probabilistic Graphical Model](http://arxiv.org/abs/2304.06727) #secure
Cyberthreats are an increasingly common risk to the power grid and can thwart secure grid operation. We propose to extend contingency analysis (CA) that is currently used to secure the grid against natural threats to protect against cyberthreats. However, unlike traditional N-1 or N-2 contingencies, cyberthreats (e.g., MadIoT) require CA to solve harder N-k (with k >> 2) contingencies in a practical amount of time. Purely physics-based solvers, while robust, are slow and may not solve N-k contingencies in a timely manner, whereas the emerging data-driven alternatives to power grid analytics are fast but not sufficiently generalizable, interpretable, or scalable. To address these challenges, we propose a novel conditional Gaussian Random Field-based data-driven method that is both fast and robust. It achieves speedup by improving starting points for the physical solvers. To improve the physical interpretability and generalizability, the proposed method incorporates domain knowledge by considering the graphical nature of the grid topology. To improve scalability, the method applies physics-informed regularization that reduces the model size and complexity. Experiments validate that simulating MadIoT-induced N-k contingencies with our warm starter requires 5x fewer iterations for a realistic 2000-bus system.
[[2304.06983] A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings](http://arxiv.org/abs/2304.06983) #security
File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security. Existing methods mainly treat file fragments as 1d byte signals and utilize the captured inter-byte features for classification, while the bit information within bytes, i.e., intra-byte information, is seldom considered. This is inherently inapt for classifying variable-length coding files whose symbols are represented by a variable number of bits. Conversely, we propose Byte2Image, a novel data augmentation technique, to introduce the neglected intra-byte information into file fragments and re-treat them as 2d gray-scale images, which allows us to capture both inter-byte and intra-byte correlations simultaneously through powerful convolutional neural networks (CNNs). Specifically, to convert file fragments to 2d images, we employ a sliding byte window to expose the neglected intra-byte information and stack their n-gram features row by row. We further propose a byte sequence & image fusion network as a classifier, which can jointly model the raw 1d byte sequence and the converted 2d image to perform FFC. Experiments on the FFT-75 dataset validate that our proposed method can achieve notable accuracy improvements over state-of-the-art methods in nearly all scenarios. The code will be released at https://github.com/wenyang001/Byte2Image.
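To make the idea of exposing intra-byte information concrete, here is a minimal sketch that re-reads a fragment at each of the eight possible bit offsets and stacks the resulting byte rows into a 2d array. This is an illustration only, not the Byte2Image construction itself: the abstract does not spell out the window or the n-gram features, and the function name `bit_shifted_rows` and the row layout are assumptions.

```python
def bit_shifted_rows(fragment: bytes, shifts: int = 8) -> list[list[int]]:
    """Return one row per bit offset; each row re-reads the fragment as bytes.

    Hypothetical helper: row s holds the 8-bit values that start at bit
    offsets s, 8 + s, 16 + s, ... of the fragment (big-endian bit order).
    """
    value = int.from_bytes(fragment, "big")  # whole fragment as one integer
    n_bits = 8 * len(fragment)
    n_out = len(fragment) - 1  # last window would run past the end for s > 0
    rows = []
    for s in range(shifts):
        row = []
        for i in range(n_out):
            # Right-shift so the window starting at bit 8*i + s lands in the low byte.
            shift = n_bits - (8 * i + s) - 8
            row.append((value >> shift) & 0xFF)
        rows.append(row)
    return rows


# A 2-byte fragment yields an 8 x 1 "image": one pixel per bit offset.
image = bit_shifted_rows(b"\x0f\xf0")
```

Row 0 reproduces the original byte-aligned view, while the other rows contain values that straddle byte boundaries, which is the kind of information a purely byte-level model never sees.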
[[2304.06725] Advanced Security Threat Modelling for Blockchain-Based FinTech Applications](http://arxiv.org/abs/2304.06725) #security
Cybersecurity threats and vulnerabilities continue to grow in number and complexity, presenting an increasing challenge for organizations worldwide. Organizations use threat modelling and bug bounty programs to address these threats, which often operate independently. In this paper, we propose a Metric-Based Feedback Methodology (MBFM) that integrates bug bounty programs with threat modelling to improve the overall security posture of an organization. By analyzing and categorizing vulnerability data, the methodology enables identifying root causes and refining threat models to prioritize security efforts more effectively. The paper outlines the proposed methodology and its assumptions and provides a foundation for future research to develop the methodology into a versatile framework. Further research should focus on automating the process, integrating additional security testing approaches, and leveraging machine learning algorithms for vulnerability prediction and team-specific recommendations.
[[2304.06728] Late Breaking Results: Scalable and Efficient Hyperdimensional Computing for Network Intrusion Detection](http://arxiv.org/abs/2304.06728) #security
Cybersecurity has emerged as a critical challenge for the industry. With the large complexity of the security landscape, sophisticated and costly deep learning models often fail to provide timely detection of cyber threats on edge devices. Brain-inspired hyperdimensional computing (HDC) has been introduced as a promising solution to address this issue. However, existing HDC approaches use static encoders and require very high dimensionality and hundreds of training iterations to achieve reasonable accuracy. This results in a serious loss of learning efficiency and causes huge latency for detecting attacks. In this paper, we propose CyberHD, an innovative HDC learning framework that identifies and regenerates insignificant dimensions to capture complicated patterns of cyber threats with remarkably lower dimensionality. Additionally, the holographic distribution of patterns in high dimensional space provides CyberHD with notably high robustness against hardware errors.
[[2304.07166] Fuzzing the Latest NTFS in Linux with Papora: An Empirical Study](http://arxiv.org/abs/2304.07166) #security
Recently, the first feature-rich NTFS implementation, NTFS3, has been upstreamed to Linux. Although ensuring the security of NTFS3 is essential for the future of Linux, it remains unclear whether the most recent version of NTFS for Linux contains 0-day vulnerabilities. To this end, we implemented Papora, the first effective fuzzer for NTFS3. We have identified and reported 3 CVE-assigned 0-day vulnerabilities and 9 severe bugs in NTFS3. Furthermore, we have investigated the underlying causes as well as types of these vulnerabilities and bugs. We have conducted an empirical study on the identified bugs, and its results offer practical insights regarding the security of NTFS3 in Linux.
[[2304.06929] Challenges towards the Next Frontier in Privacy](http://arxiv.org/abs/2304.06929) #privacy
In July 2022, we organized a workshop (with the title Differential privacy (DP): Challenges towards the next frontier) with experts from industry, academia, and the public sector to seek answers to broad questions pertaining to privacy and its implications in the design of industry-grade systems. This document is the only public summary of the conversations from the workshop.
There are two potential purposes of this document, which we envision: i) it serves as a standing reference for algorithmic/design decisions that are taken in the space of privacy, and ii) it provides guidance on future research directions. The document covers a broad array of topics, from infrastructure needs for designing private systems, to achieving better privacy/utility trade-offs, to conveying privacy guarantees to a broad audience. Finally, the document also looks at attacking and auditing these systems.
[[2304.07134] Pool Inference Attacks on Local Differential Privacy: Quantifying the Privacy Guarantees of Apple's Count Mean Sketch in Practice](http://arxiv.org/abs/2304.07134) #privacy
Behavioral data generated by users' devices, ranging from emoji use to pages visited, are collected at scale to improve apps and services. These data, however, contain fine-grained records and can reveal sensitive information about individual users. Local differential privacy has been used by companies as a solution to collect data from users while preserving privacy. We here first introduce pool inference attacks, where an adversary has access to a user's obfuscated data, defines pools of objects, and exploits the user's polarized behavior in multiple data collections to infer the user's preferred pool. Second, we instantiate this attack against Count Mean Sketch, a local differential privacy mechanism proposed by Apple and deployed in iOS and Mac OS devices, using a Bayesian model. Using Apple's parameters for the privacy loss $\varepsilon$, we then consider two specific attacks: one in the emojis setting -- where an adversary aims at inferring a user's preferred skin tone for emojis -- and one against visited websites -- where an adversary wants to learn the political orientation of a user from the news websites they visit. In both cases, we show the attack to be much more effective than a random guess when the adversary collects enough data. We find that users with high polarization and relevant interest are significantly more vulnerable, and we show that our attack is well-calibrated, allowing the adversary to target such vulnerable users. We finally validate our results for the emojis setting using user data from Twitter. Taken together, our results show that pool inference attacks are a concern for data protected by local differential privacy mechanisms with a large $\varepsilon$, emphasizing the need for additional technical safeguards and the need for more research on how to apply local differential privacy for multiple collections.
[[2304.07234] Sparsity in neural networks can increase their privacy](http://arxiv.org/abs/2304.07234) #privacy
This article measures how sparsity can make neural networks more robust to membership inference attacks. The obtained empirical results show that sparsity improves the privacy of the network, while preserving comparable performances on the task at hand. This empirical study completes and extends existing literature.
[[2304.07239] Separating Key Agreement and Computational Differential Privacy](http://arxiv.org/abs/2304.07239) #privacy
Two party differential privacy allows two parties who do not trust each other, to come together and perform a joint analysis on their data whilst maintaining individual-level privacy. We show that any efficient, computationally differentially private protocol that has black-box access to key agreement (and nothing stronger), is also an efficient, information-theoretically differentially private protocol. In other words, the existence of efficient key agreement protocols is insufficient for efficient, computationally differentially private protocols. In doing so, we make progress in answering an open question posed by Vadhan about the minimal computational assumption needed for computational differential privacy.
Combined with the information-theoretic lower bound due to McGregor, Mironov, Pitassi, Reingold, Talwar, and Vadhan in [FOCS'10], we show that there is no fully black-box reduction from efficient, computationally differentially private protocols for computing the Hamming distance (or equivalently inner product over the integers) on $n$ bits, with additive error lower than $O\left(\frac{\sqrt{n}}{e^{\epsilon}\log(n)}\right)$, to key agreement.
This complements the result by Haitner, Mazor, Silbak, and Tsfadia in [STOC'22], which showed that computing the Hamming distance implies key agreement. We conclude that key agreement is *strictly* weaker than computational differential privacy for computing the inner product, thereby answering their open question on whether key agreement is sufficient.
[[2304.07092] Obfuscation of Discrete Data](http://arxiv.org/abs/2304.07092) #protect
Data obfuscation deals with the problem of masking a data-set in such a way that the utility of the data is maximized while minimizing the risk of the disclosure of sensitive information. To protect data we address some ways that may as well retain its statistical uses to some extent. One such way is to mask a data with additive noise and revert to certain desired parameters of the original distribution from the knowledge of the noise distribution and masked data. In this project, we discuss the estimation of any desired quantile and range of a quantitative data set masked with additive noise.
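The principle of recovering parameters of the original distribution from the masked data and the known noise distribution can be illustrated with the two simplest parameters, the mean and the variance. The sketch below is not the quantile or range estimator discussed in the abstract; it assumes the noise is independent of the data and has known mean and variance, and the function name is hypothetical.

```python
import random
import statistics


def estimate_original_moments(masked, noise_mean, noise_var):
    """Estimate mean and variance of the unmasked data from masked values.

    Assumes masked = original + noise, with noise independent of the data,
    so means and variances add.
    """
    est_mean = statistics.fmean(masked) - noise_mean
    est_var = statistics.variance(masked) - noise_var
    return est_mean, est_var


rng = random.Random(0)
original = [rng.uniform(0.0, 10.0) for _ in range(20000)]
# Mask each value with zero-mean Gaussian noise of standard deviation 2.
masked = [x + rng.gauss(0.0, 2.0) for x in original]
est_mean, est_var = estimate_original_moments(masked, 0.0, 4.0)
```

Individual masked values reveal little about individual records, yet the aggregate estimates stay close to the statistics of the unmasked data, which is the utility/disclosure trade-off the abstract describes.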
[[2304.07165] Hybrid DLT as a data layer for real-time, data-intensive applications](http://arxiv.org/abs/2304.07165) #protect
We propose a new approach, termed Hybrid DLT, to address a broad range of industrial use cases where certain properties of both private and public DLTs are valuable, while other properties may be unnecessary or detrimental. The Hybrid DLT approach involves a system where private ledgers, with limited data block dissemination, are collaboratively created by nodes within a private network. The Notary, a publicly auditable authoritative component, maintains a single, official, coherent history for each private ledger without requiring access to data blocks. This is achieved by leveraging a public DLT solution to render the ledger histories tamper-proof, consequently providing tamper-evidence for ledger data disclosed to external actors. We present Traent Hybrid Blockchain, a commercial implementation of the Hybrid DLT approach: a real-time, data-intensive collaboration system for organizations seeking immutable data while also needing to comply with the European General Data Protection Regulation (GDPR).
[[2304.06934] Classification of social media Toxic comments using Machine learning models](http://arxiv.org/abs/2304.06934) #protect
The abstract outlines the problem of toxic comments on social media platforms, where individuals use disrespectful, abusive, and unreasonable language that can drive users away from discussions. This behavior is referred to as anti-social behavior, which occurs during online debates, comments, and fights. The comments containing explicit language can be classified into various categories, such as toxic, severe toxic, obscene, threat, insult, and identity hate. This behavior leads to online harassment and cyberbullying, which forces individuals to stop expressing their opinions and ideas. To protect users from offensive language, companies have started flagging comments and blocking users. The abstract proposes to create a classifier using an LSTM-CNN model that can differentiate between toxic and non-toxic comments with high accuracy. The classifier can help organizations examine the toxicity of the comment section better.
[[2304.06919] Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense](http://arxiv.org/abs/2304.06919) #defense
While having achieved great success in rich real-life applications, deep neural network (DNN) models have long been criticized for their vulnerability to adversarial attacks. Tremendous research efforts have been dedicated to mitigating the threats of adversarial attacks, but the essential trait of adversarial examples is not yet clear, and most existing methods are yet vulnerable to hybrid attacks and suffer from counterattacks. In light of this, in this paper, we first reveal a gradient-based correlation between sensitivity analysis-based DNN interpreters and the generation process of adversarial examples, which indicates the Achilles's heel of adversarial attacks and sheds light on linking together the two long-standing challenges of DNN: fragility and unexplainability. We then propose an interpreter-based ensemble framework called X-Ensemble for robust adversary defense. X-Ensemble adopts a novel detection-rectification process and features in building multiple sub-detectors and a rectifier upon various types of interpretation information toward target classifiers. Moreover, X-Ensemble employs the Random Forests (RF) model to combine sub-detectors into an ensemble detector for adversarial hybrid attacks defense. The non-differentiable property of RF further makes it a precious choice against the counterattack of adversaries. Extensive experiments under various types of state-of-the-art attacks and diverse attack scenarios demonstrate the advantages of X-Ensemble to competitive baseline methods.
[[2304.06723] Introduction to Presentation Attack Detection in Fingerprint Biometrics](http://arxiv.org/abs/2304.06723) #attack
This chapter provides an introduction to Presentation Attack Detection (PAD) in fingerprint biometrics, also coined anti-spoofing, describes early developments in this field, and briefly summarizes recent trends and open issues.
[[2304.06724] GradMDM: Adversarial Attack on Dynamic Networks](http://arxiv.org/abs/2304.06724) #attack
Dynamic neural networks can greatly reduce computation redundancy without compromising accuracy by adapting their structures based on the input. In this paper, we explore the robustness of dynamic neural networks against energy-oriented attacks targeted at reducing their efficiency. Specifically, we attack dynamic models with our novel algorithm GradMDM. GradMDM is a technique that adjusts the direction and the magnitude of the gradients to effectively find a small perturbation for each input, that will activate more computational units of dynamic models during inference. We evaluate GradMDM on multiple datasets and dynamic models, where it outperforms previous energy-oriented attack techniques, significantly increasing computation complexity while reducing the perceptibility of the perturbations.
[[2304.06908] Generating Adversarial Examples with Better Transferability via Masking Unimportant Parameters of Surrogate Model](http://arxiv.org/abs/2304.06908) #attack
Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples. Moreover, the transferability of the adversarial examples has received broad attention in recent years, which means that adversarial examples crafted by a surrogate model can also attack unknown models. This phenomenon gave birth to the transfer-based adversarial attacks, which aim to improve the transferability of the generated adversarial examples. In this paper, we propose to improve the transferability of adversarial examples in the transfer-based attack via masking unimportant parameters (MUP). The key idea in MUP is to refine the pretrained surrogate models to boost the transfer-based attack. Based on this idea, a Taylor expansion-based metric is used to evaluate the parameter importance score and the unimportant parameters are masked during the generation of adversarial examples. This process is simple, yet can be naturally combined with various existing gradient-based optimizers for generating adversarial examples, thus further improving the transferability of the generated adversarial examples. Extensive experiments are conducted to validate the effectiveness of the proposed MUP-based methods.
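The masking step can be sketched as follows. The abstract only says that the importance score is Taylor expansion-based; the first-order form `|w * dL/dw|` used here is an assumption, as are the function name and the `mask_ratio` parameter. The sketch uses NumPy and operates on a single weight array rather than a full surrogate model.

```python
import numpy as np


def mask_unimportant(weights, grads, mask_ratio=0.2):
    """Zero out the fraction of weights with the lowest importance score.

    Importance is taken as |w * dL/dw|, a first-order Taylor estimate of the
    loss change when a weight is removed (an assumed form of the metric).
    """
    importance = np.abs(weights * grads)
    n_mask = int(mask_ratio * weights.size)
    masked = weights.copy()
    if n_mask == 0:
        return masked
    # Flat indices of the n_mask least important parameters.
    idx = np.argsort(importance, axis=None)[:n_mask]
    masked.flat[idx] = 0.0
    return masked


weights = np.array([[1.0, -2.0], [0.5, 3.0]])
grads = np.array([[0.1, 0.1], [4.0, 0.01]])
masked = mask_unimportant(weights, grads, mask_ratio=0.5)
```

In a transfer-based attack, the masked surrogate would then be used to compute the gradients that drive whichever gradient-based optimizer generates the adversarial example.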
[[2304.06963] Delay Impact on Stubborn Mining Attack Severity in Imperfect Bitcoin Network](http://arxiv.org/abs/2304.06963) #attack
Stubborn mining attack greatly downgrades Bitcoin throughput and also benefits malicious miners (attackers). This paper aims to quantify the impact of block receiving delay on stubborn mining attack severity in imperfect Bitcoin networks. We develop an analytic model and derive formulas of both relative revenue and system throughput, which are applied to study attack severity. Experiment results validate our analysis method and show that imperfect networks favor attackers. The quantitative analysis offers useful insight into stubborn mining attack and then helps the development of countermeasures.
[[2304.07210] Measuring Re-identification Risk](http://arxiv.org/abs/2304.07210) #attack
Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as the Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
[[2304.06767] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment](http://arxiv.org/abs/2304.06767) #robust
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially significant repercussions. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) as a means of addressing this problem, wherein generative models are fine-tuned using RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment of generative models, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models more effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently assembles a streaming dataset. This dataset serves as the basis for aligning the generative model and can be employed under both offline and online settings. Notably, the sample generation process within RAFT is gradient-free, rendering it compatible with black-box generators. Through extensive experiments, we demonstrate that our proposed algorithm exhibits strong performance in the context of both large language models and diffusion models.
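The sample-selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RAFT implementation: `generate` and `reward` are hypothetical callables standing in for a black-box generator and a reward model, `k` and `keep` are illustrative parameters, and the subsequent fine-tuning on the selected pairs is omitted.

```python
import random


def select_top_samples(prompts, generate, reward, k=8, keep=1):
    """For each prompt, draw k candidates and keep the highest-reward ones.

    Only forward calls to `generate` and `reward` are needed, so the
    generator can be treated as a black box (no gradients required).
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        ranked = sorted(candidates, key=reward, reverse=True)
        selected.extend((prompt, c) for c in ranked[:keep])
    return selected


# Toy stand-ins: the "generator" appends random letters, and the "reward"
# prefers outputs containing many 'a' characters.
rng = random.Random(0)
toy_generate = lambda p: p + "".join(rng.choice("ab") for _ in range(5))
toy_reward = lambda s: s.count("a")
dataset = select_top_samples(["x", "y"], toy_generate, toy_reward, k=4, keep=1)
```

The returned (prompt, sample) pairs would form one batch of the streaming dataset used for supervised fine-tuning; repeating the loop with the updated model corresponds to the online setting.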
[[2304.06914] SMAE: Few-shot Learning for HDR Deghosting with Saturation-Aware Masked Autoencoders](http://arxiv.org/abs/2304.06914) #robust
Generating a high-quality High Dynamic Range (HDR) image from dynamic scenes has recently been extensively studied by exploiting Deep Neural Networks (DNNs). Most DNN-based methods require a large amount of training data with ground truth, requiring tedious and time-consuming work. Few-shot HDR imaging aims to generate satisfactory images with limited data. However, it is difficult for modern DNNs to avoid overfitting when trained on only a few images. In this work, we propose a novel semi-supervised approach to realize few-shot HDR imaging via two stages of training, called SSHDR. Unlike previous methods, which directly recover content and remove ghosts simultaneously and therefore struggle to reach an optimum, we first generate content of saturated regions with a self-supervised mechanism and then address ghosts via an iterative semi-supervised learning framework. Concretely, considering that saturated regions can be regarded as masking Low Dynamic Range (LDR) input regions, we design a Saturated Mask AutoEncoder (SMAE) to learn a robust feature representation and reconstruct a non-saturated HDR image. We also propose an adaptive pseudo-label selection strategy to pick high-quality HDR pseudo-labels in the second stage to avoid the effect of mislabeled samples. Experiments demonstrate that SSHDR outperforms state-of-the-art methods quantitatively and qualitatively within and across different datasets, achieving appealing HDR visualization with few labeled samples.
[[2304.06917] One-Shot Stylization for Full-Body Human Images](http://arxiv.org/abs/2304.06917) #robust
The goal of human stylization is to transfer full-body human photos to a style specified by a single art character reference image. Although previous work has succeeded in example-based stylization of faces and generic scenes, full-body human stylization is a more complex domain. This work addresses several unique challenges of stylizing full-body human images. We propose a method for one-shot fine-tuning of a pose-guided human generator to preserve the "content" (garments, face, hair, pose) of the input photo and the "style" of the artistic reference. Since body shape deformation is an essential component of an art character's style, we incorporate a novel skeleton deformation module to reshape the pose of the input person and modify the DiOr pose-guided person generator to be more robust to the rescaled poses falling outside the distribution of the realistic poses that the generator is originally trained on. Several human studies verify the effectiveness of our approach.
[[2304.06955] Uncertainty-Aware Null Space Networks for Data-Consistent Image Reconstruction](http://arxiv.org/abs/2304.06955) #robust
Reconstructing an image from noisy and incomplete measurements is a central task in several image processing applications. In recent years, state-of-the-art reconstruction methods have been developed based on recent advances in deep learning. Especially for highly underdetermined problems, maintaining data consistency is a key goal. This can be achieved either by iterative network architectures or by a subsequent projection of the network reconstruction. However, for such approaches to be used in safety-critical domains such as medical imaging, the network reconstruction should not only provide the user with a reconstructed image, but also with some level of confidence in the reconstruction. In order to meet these two key requirements, this paper combines deep null-space networks with uncertainty quantification. Evaluation of the proposed method includes image reconstruction from undersampled Radon measurements on a toy CT dataset and accelerated MRI reconstruction on the fastMRI dataset. This work is the first approach to solving inverse problems that additionally models data-dependent uncertainty by estimating an input-dependent scale map, providing a robust assessment of reconstruction quality.
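Data consistency via a null-space projection can be illustrated with a small linear example. The sketch below assumes the commonly used null-space form `x = A⁺y + (I − A⁺A) N(A⁺y)` with a toy function standing in for the trained network `N`; it uses NumPy, assumes noiseless measurements, and does not cover the uncertainty (scale map) estimation described in the abstract.

```python
import numpy as np


def null_space_reconstruct(A, y, network):
    """Reconstruct x so that A @ x reproduces the measurements y.

    The network output is projected onto the null space of A, so it can only
    add components that the measurements cannot see.
    """
    A_pinv = np.linalg.pinv(A)
    x0 = A_pinv @ y                           # minimum-norm solution
    P_null = np.eye(A.shape[1]) - A_pinv @ A  # projector onto ker(A)
    return x0 + P_null @ network(x0)


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))               # underdetermined forward operator
x_true = rng.standard_normal(6)
y = A @ x_true                                # noiseless measurements
toy_network = lambda x: np.tanh(x) + 1.0      # stand-in for a trained network
x_hat = null_space_reconstruct(A, y, toy_network)
```

Whatever the network outputs, `A @ x_hat` equals `y` up to floating-point error, which is the data-consistency property that the projection guarantees.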
[[2304.07031] Spectral Transfer Guided Active Domain Adaptation For Thermal Imagery](http://arxiv.org/abs/2304.07031) #robust
The exploitation of visible spectrum datasets has led deep networks to show remarkable success. However, real-world tasks include low-lighting conditions which give rise to performance bottlenecks for models trained on large-scale RGB image datasets. Thermal IR cameras are more robust against such conditions. Therefore, the usage of thermal imagery in real-world applications can be useful. Unsupervised domain adaptation (UDA) allows transferring information from a source domain to a fully unlabeled target domain. Despite substantial improvements in UDA, the performance gap between UDA and its supervised learning counterpart remains significant. By picking a small number of target samples to annotate and using them in training, active domain adaptation tries to mitigate this gap with minimum annotation expense. We propose an active domain adaptation method in order to examine the efficiency of combining the visible spectrum and thermal imagery modalities. When the domain gap is considerably large as in the visible-to-thermal task, we may conclude that the methods without explicit domain alignment cannot achieve their full potential. To this end, we propose a spectral transfer guided active domain adaptation method to select the most informative unlabeled target samples while aligning source and target domains. We used the large-scale visible spectrum dataset MS-COCO as the source domain and the thermal dataset FLIR ADAS as the target domain to present the results of our method. Extensive experimental evaluation demonstrates that our proposed method outperforms the state-of-the-art active domain adaptation methods. The code and models are publicly available.
[[2304.07140] TUM-FAÇADE: Reviewing and enriching point cloud benchmarks for façade segmentation](http://arxiv.org/abs/2304.07140) #robust
Point clouds are widely regarded as one of the best dataset types for urban mapping purposes. Hence, point cloud datasets are commonly investigated as benchmark types for various urban interpretation methods. Yet, few researchers have addressed the use of point cloud benchmarks for façade segmentation. Robust façade segmentation is becoming a key factor in various applications ranging from simulating autonomous driving functions to preserving cultural heritage. In this work, we present a method of enriching existing point cloud datasets with façade-related classes that have been designed to facilitate façade segmentation testing. We propose how to efficiently extend existing datasets and comprehensively assess their potential for façade segmentation. We use the method to create the TUM-FAÇADE dataset, which extends the capabilities of TUM-MLS-2016. Not only can TUM-FAÇADE facilitate the development of point-cloud-based façade segmentation tasks, but our procedure can also be applied to enrich further datasets.
[[2304.07193] DINOv2: Learning Robust Visual Features without Supervision](http://arxiv.org/abs/2304.07193) #robust
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
[[2304.07221] Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models](http://arxiv.org/abs/2304.07221) #robust
Recently, pre-trained point cloud models have found extensive applications in downstream tasks like object classification. However, these tasks often require full fine-tuning of models and lead to storage-intensive procedures, thus limiting the real applications of pre-trained models. Inspired by the great success of visual prompt tuning (VPT) in vision, we attempt to explore prompt tuning, which serves as an efficient alternative to full fine-tuning for large-scale models, to point cloud pre-trained models to reduce storage costs. However, it is non-trivial to apply the traditional static VPT to point clouds, owing to the distribution diversity of point cloud data. For instance, the scanned point clouds exhibit various types of missing or noisy points. To address this issue, we propose an Instance-aware Dynamic Prompt Tuning (IDPT) for point cloud pre-trained models, which utilizes a prompt module to perceive the semantic prior features of each instance. This semantic prior facilitates the learning of unique prompts for each instance, thus enabling downstream tasks to robustly adapt to pre-trained point cloud models. Notably, extensive experiments conducted on downstream tasks demonstrate that IDPT outperforms full fine-tuning in most tasks with a mere 7% of the trainable parameters, thus significantly reducing the storage pressure. Code is available at https://github.com/zyh16143998882/IDPT.
[[2304.07230] PARFormer: Transformer-based Multi-Task Network for Pedestrian Attribute Recognition](http://arxiv.org/abs/2304.07230) #robust
Pedestrian attribute recognition (PAR) has received increasing attention because of its wide application in video surveillance and pedestrian analysis. Extracting robust feature representation is one of the key challenges in this task. The existing methods mainly use the convolutional neural network (CNN) as the backbone network to extract features. However, these methods mainly focus on small discriminative regions while ignoring the global perspective. To overcome these limitations, we propose a pure transformer-based multi-task PAR network named PARFormer, which includes four modules. In the feature extraction module, we build a transformer-based strong baseline for feature extraction, which achieves competitive results on several PAR benchmarks compared with the existing CNN-based baseline methods. In the feature processing module, we propose an effective data augmentation strategy named batch random mask (BRM) block to reinforce the attentive feature learning of random patches. Furthermore, we propose a multi-attribute center loss (MACL) to enhance the inter-attribute discriminability in the feature representations. In the viewpoint perception module, we explore the impact of viewpoints on pedestrian attributes, and propose a multi-view contrastive loss (MCVL) that enables the network to exploit the viewpoint information. In the attribute recognition module, we alleviate the negative-positive imbalance problem to generate the attribute predictions. The above modules interact and jointly learn a highly discriminative feature space, and supervise the generation of the final features. Extensive experimental results show that the proposed PARFormer network performs well compared to the state-of-the-art methods on several public datasets, including PETA, RAP, and PA100K. Code will be released at https://github.com/xwf199/PARFormer.
[[2304.07101] Task-oriented Document-Grounded Dialog Systems by HLTPR@RWTH for DSTC9 and DSTC10](http://arxiv.org/abs/2304.07101) #robust
This paper summarizes our contributions to the document-grounded dialog tasks at the 9th and 10th Dialog System Technology Challenges (DSTC9 and DSTC10). In both iterations the task consists of three subtasks: first detect whether the current turn is knowledge seeking, second select a relevant knowledge document, and third generate a response grounded on the selected document. For DSTC9 we proposed different approaches to make the selection task more efficient. The best method, Hierarchical Selection, actually improves the results compared to the original baseline and gives a speedup of 24x. In the DSTC10 iteration of the task, the challenge was to adapt systems trained on written dialogs to perform well on noisy automatic speech recognition transcripts. Therefore, we proposed data augmentation techniques to increase the robustness of the models as well as methods to adapt the style of generated responses to fit well into the preceding dialog. Additionally, we proposed a noisy channel model that allows for increasing the factuality of the generated responses. In addition to summarizing our previous contributions, in this work, we also report on a few small improvements and reconsider the automatic evaluation metrics for the generation task which have shown a low correlation to human judgments.
[[2304.07226] BS-GAT Behavior Similarity Based Graph Attention Network for Network Intrusion Detection](http://arxiv.org/abs/2304.07226) #robust
With the development of the Internet of Things (IoT), network intrusion detection is becoming more complex and extensive. It is essential to investigate an intelligent, automated, and robust network intrusion detection method. Network intrusion detection methods based on graph neural networks have been proposed. However, further study is needed because the graph construction of the existing methods does not fully adapt to the characteristics of practical network intrusion datasets. To address this issue, this paper proposes a graph neural network algorithm based on behavior similarity (BS-GAT) using a graph attention network. First, a novel graph construction method is developed using behavior similarity by analyzing the characteristics of the practical datasets. The data flows are treated as nodes in the graph, and the behavior rules of nodes are used as edges in the graph, constructing a graph with a relatively uniform number of neighbors for each node. Then, the edge behavior relationship weights are incorporated into the graph attention network to utilize the relationship between data flows and the structure information of the graph, which is used to improve the performance of network intrusion detection. Finally, experiments are conducted on the latest datasets to evaluate the performance of the proposed behavior similarity based graph attention network for network intrusion detection. The results show that the proposed method is effective and has superior performance compared to existing solutions.
[[2304.07288] Cross-Entropy Loss Functions: Theoretical Analysis and Applications](http://arxiv.org/abs/2304.07288) #robust
Cross-entropy is a widely used loss function in applications. It coincides with the logistic loss applied to the outputs of a neural network, when the softmax is used. But what guarantees can we rely on when using cross-entropy as a surrogate loss? We present a theoretical analysis of a broad family of losses, comp-sum losses, that includes cross-entropy (or logistic loss), generalized cross-entropy, the mean absolute error and other cross-entropy-like loss functions. We give the first $H$-consistency bounds for these loss functions. These are non-asymptotic guarantees that upper bound the zero-one loss estimation error in terms of the estimation error of a surrogate loss, for the specific hypothesis set $H$ used. We further show that our bounds are tight. These bounds depend on quantities called minimizability gaps, which only depend on the loss function and the hypothesis set. To make them more explicit, we give a specific analysis of these gaps for comp-sum losses. We also introduce a new family of loss functions, smooth adversarial comp-sum losses, derived from their comp-sum counterparts by adding in a related smooth term. We show that these loss functions are beneficial in the adversarial setting by proving that they admit $H$-consistency bounds. This leads to new adversarial robustness algorithms that consist of minimizing a regularized smooth adversarial comp-sum loss. While our main purpose is a theoretical analysis, we also present an extensive empirical analysis comparing comp-sum losses. We further report the results of a series of experiments demonstrating that our adversarial robustness algorithms outperform the current state-of-the-art, while also achieving a superior non-adversarial accuracy.
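The relationship between the losses named in this abstract can be checked numerically. The sketch below is not taken from the paper: it uses the commonly cited definitions of generalized cross-entropy, `(1 - p_y^q) / q`, and of the mean absolute error on softmax outputs, `1 - p_y` (up to a constant factor), and all function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, y):
    """Negative log of the softmax probability of the true class y."""
    return -math.log(softmax(scores)[y])

def multinomial_logistic(scores, y):
    """Logistic loss written on score differences: log(1 + sum_{j != y} exp(s_j - s_y))."""
    return math.log(1.0 + sum(math.exp(s - scores[y])
                              for j, s in enumerate(scores) if j != y))

def generalized_cross_entropy(scores, y, q):
    """Assumed definition (1 - p_y^q) / q for q in (0, 1]."""
    return (1.0 - softmax(scores)[y] ** q) / q

def mean_absolute_error(scores, y):
    """1 - p_y; the L1 distance to the one-hot target is twice this value."""
    return 1.0 - softmax(scores)[y]

scores, y = [2.0, 0.5, -1.0], 0
print(cross_entropy(scores, y), multinomial_logistic(scores, y))
print(generalized_cross_entropy(scores, y, 1.0), mean_absolute_error(scores, y))
```

Under these definitions, cross-entropy on softmax outputs equals the logistic form, generalized cross-entropy reduces to the mean absolute error at `q = 1`, and it approaches cross-entropy as `q` tends to 0.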
[[2304.07076] BCE-Net: Reliable Building Footprints Change Extraction based on Historical Map and Up-to-Date Images using Contrastive Learning](http://arxiv.org/abs/2304.07076) #extraction
Automatic and periodic recompiling of building databases with up-to-date high-resolution images has become a critical requirement for rapidly developing urban environments. However, the architecture of most existing approaches for change extraction attempts to learn features related to changes but ignores objectives related to buildings. This inevitably leads to the generation of significant pseudo-changes, due to factors such as seasonal changes in images and the inclination of building façades. To alleviate the above-mentioned problems, we developed a contrastive learning approach by validating historical building footprints against single up-to-date remotely sensed images. This contrastive learning strategy allowed us to inject the semantics of buildings into a pipeline for the detection of changes, which is achieved by increasing the distinguishability of features of buildings from those of non-buildings. In addition, to reduce the effects of inconsistencies between historical building polygons and buildings in up-to-date images, we employed a deformable convolutional neural network to learn offsets intuitively. In summary, we formulated a multi-branch building extraction method that identifies newly constructed and removed buildings, respectively. To validate our method, we conducted comparative experiments using the public Wuhan University building change detection dataset and a more practical dataset named SI-BU that we established. Our method achieved F1 scores of 93.99% and 70.74% on the above datasets, respectively. Moreover, when the data of the public dataset were divided in the same manner as in previous related studies, our method achieved an F1 score of 94.63%, which surpasses that of the state-of-the-art method.
[[2304.06931] Scale Federated Learning for Label Set Mismatch in Medical Image Classification](http://arxiv.org/abs/2304.06931) #federate
Federated learning (FL) has been introduced to the healthcare domain as a decentralized learning paradigm that allows multiple parties to train a model collaboratively without privacy leakage. However, most previous studies have assumed that every client holds an identical label set. In reality, medical specialists tend to annotate only diseases within their knowledge domain or interest. This implies that label sets in each client can be different and even disjoint. In this paper, we propose the framework FedLSM to solve the problem of label set mismatch. FedLSM adopts different training strategies on data with different uncertainty levels to efficiently utilize unlabeled or partially labeled data, as well as class-wise adaptive aggregation in the classification layer to avoid inaccurate aggregation when clients have missing labels. We evaluate FedLSM on two public real-world medical image datasets, including chest x-ray (CXR) diagnosis with 112,120 CXR images and skin lesion diagnosis with 10,015 dermoscopy images, and show that it significantly outperforms other state-of-the-art FL algorithms. Code will be made available upon acceptance.
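To make the idea of class-wise aggregation concrete, here is a minimal sketch. It is an assumption-laden illustration rather than the FedLSM algorithm: it averages each class row of the classification layer only over the clients whose label set contains that class, uses an unweighted mean, and falls back to the previous global row when no client annotated the class. The function name and data layout are hypothetical.

```python
def classwise_aggregate(global_rows, client_rows, client_label_sets):
    """Aggregate a classification layer row by row (one row per class).

    global_rows:       previous global weights, a list of rows (lists of floats)
    client_rows:       one weight matrix per client, same shape as global_rows
    client_label_sets: one set of annotated class indices per client
    """
    aggregated = []
    for c, previous_row in enumerate(global_rows):
        # Only clients that actually annotated class c contribute to row c.
        rows = [weights[c]
                for weights, labels in zip(client_rows, client_label_sets)
                if c in labels]
        if not rows:
            # Hypothetical fallback: keep the old row if nobody labelled class c.
            aggregated.append(list(previous_row))
            continue
        aggregated.append([sum(r[d] for r in rows) / len(rows)
                           for d in range(len(previous_row))])
    return aggregated

global_rows = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
client_a = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]   # annotated classes 0 and 1
client_b = [[7.0, 7.0], [4.0, 4.0], [6.0, 6.0]]   # annotated classes 1 and 2
print(classwise_aggregate(global_rows, [client_a, client_b], [{0, 1}, {1, 2}]))
```

In this toy setup the rows that a client trained without any labels for that class (class 2 for the first client, class 0 for the second) never enter the average.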
[[2304.06947] TimelyFL: Heterogeneity-aware Asynchronous Federated Learning with Adaptive Partial Training](http://arxiv.org/abs/2304.06947) #federate
In cross-device Federated Learning (FL) environments, scaling synchronous FL methods is challenging as stragglers hinder the training process. Moreover, the availability of each client to join the training is highly variable over time due to system heterogeneities and intermittent connectivity. Recent asynchronous FL methods (e.g., FedBuff) have been proposed to overcome these issues by allowing slower users to continue their work on local training based on stale models and to contribute to aggregation when ready. However, we show empirically that this method can lead to a substantial drop in training accuracy as well as a slower convergence rate. The primary reason is that fast-speed devices contribute to many more rounds of aggregation while others join more intermittently or not at all, and with stale model updates. To overcome this barrier, we propose TimelyFL, a heterogeneity-aware asynchronous FL framework with adaptive partial training. During the training, TimelyFL adjusts the local training workload based on the real-time resource capabilities of each client, aiming to allow more available clients to join in the global update without staleness. We demonstrate the performance benefits of TimelyFL by conducting extensive experiments on various datasets (e.g., CIFAR-10, Google Speech, and Reddit) and models (e.g., ResNet20, VGG11, and ALBERT). In comparison with the state-of-the-art (i.e., FedBuff), our evaluations reveal that TimelyFL improves participation rate by 21.13%, harvests 1.28x - 2.89x more efficiency on convergence rate, and provides a 6.25% increment on test accuracy.
[[2304.06901] Systemic Fairness](http://arxiv.org/abs/2304.06901) #fair
Machine learning algorithms are increasingly used to make or support decisions in a wide range of settings. With such expansive use there is also growing concern about the fairness of such methods. Prior literature on algorithmic fairness has extensively addressed risks and in many cases presented approaches to manage some of them. However, most studies have focused on fairness issues that arise from actions taken by a (single) focal decision-maker or agent. In contrast, most real-world systems have many agents that work collectively as part of a larger ecosystem. For example, in a lending scenario, there are multiple lenders who evaluate loans for applicants, along with policymakers and other institutions whose decisions also affect outcomes. Thus, the broader impact of any lending decision of a single decision maker will likely depend on the actions of multiple different agents in the ecosystem. This paper develops formalisms for firm versus systemic fairness, and calls for a greater focus in the algorithmic fairness literature on ecosystem-wide fairness - or more simply systemic fairness - in real-world contexts.
[[2304.06819] Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction](http://arxiv.org/abs/2304.06819) #interpretability
Integrating whole-slide images (WSIs) and bulk transcriptomics for predicting patient survival can improve our understanding of patient prognosis. However, this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor, while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context, our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way?, and (2) how can we capture dense multimodal interactions between these two modalities? Specifically, we propose to learn biological pathway tokens from transcriptomics that can encode specific cellular functions. Together with histology patch tokens that encode the different morphological patterns in the WSI, we argue that they form appropriate reasoning units for downstream interpretability analyses. We propose fusing both modalities using a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens. Our proposed model, SURVPATH, achieves state-of-the-art performance when evaluated against both unimodal and multimodal baselines on five datasets from The Cancer Genome Atlas. Our interpretability framework identifies key multimodal prognostic factors, and, as such, can provide valuable insights into the interaction between genotype and phenotype, enabling a deeper understanding of the underlying biological mechanisms at play. We make our code public at: https://github.com/ajv012/SurvPath.
[[2304.07152] Combining Stochastic Explainers and Subgraph Neural Networks can Increase Expressivity and Interpretability](http://arxiv.org/abs/2304.07152) #interpretability
Subgraph-enhanced graph neural networks (SGNN) can increase the expressive power of the standard message-passing framework. This model family represents each graph as a collection of subgraphs, generally extracted by random sampling or with hand-crafted heuristics. Our key observation is that by selecting "meaningful" subgraphs, besides improving the expressivity of a GNN, it is also possible to obtain interpretable results. For this purpose, we introduce a novel framework that jointly predicts the class of the graph and a set of explanatory sparse subgraphs, which can be analyzed to understand the decision process of the classifier. We compare the performance of our framework against standard subgraph extraction policies, like random node/edge deletion strategies. The subgraphs produced by our framework achieve comparable performance in terms of accuracy, with the additional benefit of providing explanations.
[[2304.07111] Grouping Shapley Value Feature Importances of Random Forests for explainable Yield Prediction](http://arxiv.org/abs/2304.07111) #explainability
Explainability in yield prediction helps us fully explore the potential of machine learning models that are already able to achieve high accuracy for a variety of yield prediction scenarios. The data included for the prediction of yields are intricate and the models are often difficult to understand. However, understanding the models can be simplified by using natural groupings of the input features. Grouping can be achieved, for example, by the time the features are captured or by the sensor used to do so. The state-of-the-art for interpreting machine learning models is currently defined by the game-theoretic approach of Shapley values. To handle groups of features, the calculated Shapley values are typically added together, ignoring the theoretical limitations of this approach. We explain the concept of Shapley values directly computed for predefined groups of features and introduce an algorithm to compute them efficiently on tree structures. We provide a blueprint for designing swarm plots that combine many local explanations for global understanding. Extensive evaluation of two different yield prediction problems shows the worth of our approach and demonstrates how we can enable a better understanding of yield prediction models in the future, ultimately leading to mutual enrichment of research and application.
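A small worked example shows why summing per-feature Shapley values can differ from computing Shapley values directly on groups. The sketch below uses brute-force enumeration on a toy game, not the tree-based algorithm the abstract mentions; the value function, feature indices, and group names are all made up for illustration.

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {p}) - value(coalition))
        result[p] = total
    return result

# Toy game: the payoff is 1 only when all three features are present.
def feature_value(features):
    return 1.0 if {0, 1, 2} <= set(features) else 0.0

# Hypothetical grouping, e.g. by the sensor that captured the features.
groups = {"sensor_a": {0, 1}, "sensor_b": {2}}

def group_value(group_names):
    """A coalition of groups is worth what the union of its features is worth."""
    features = set().union(*(groups[g] for g in group_names)) if group_names else set()
    return feature_value(features)

per_feature = shapley([0, 1, 2], feature_value)
summed = {g: sum(per_feature[f] for f in members) for g, members in groups.items()}
grouped = shapley(list(groups), group_value)
print(summed)
print(grouped)
```

Here each feature receives 1/3, so summing gives 2/3 to `sensor_a` and 1/3 to `sensor_b`, while treating the groups as players splits the payoff evenly at 1/2 each.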
[[2304.06790] Inpaint Anything: Segment Anything Meets Image Inpainting](http://arxiv.org/abs/2304.06790) #diffusion
Modern image inpainting systems, despite the significant progress, often struggle with mask selection and hole filling. Based on the Segment-Anything Model (SAM), we make the first attempt at mask-free image inpainting and propose a new paradigm of "clicking and filling", which is named Inpaint Anything (IA). The core idea behind IA is to combine the strengths of different models in order to build a very powerful and user-friendly pipeline for solving inpainting-related problems. IA supports three main features: (i) Remove Anything: users could click on an object and IA will remove it and smooth the "hole" with the context; (ii) Fill Anything: after certain objects are removed, users could provide text-based prompts to IA, and then it will fill the hole with the corresponding generative content via driving AIGC models like Stable Diffusion; (iii) Replace Anything: with IA, users have another option to retain the click-selected object and replace the remaining background with the newly generated scenes. We are also very willing to help everyone share and promote new projects based on our Inpaint Anything (IA). Our codes are available at https://github.com/geekyutao/Inpaint-Anything.
[[2304.06818] Soundini: Sound-Guided Diffusion for Natural Video Editing](http://arxiv.org/abs/2304.06818) #diffusion
We propose a method for adding sound-guided visual effects to specific regions of videos with a zero-shot setting. Animating the appearance of the visual effect is challenging because each frame of the edited video should have visual changes while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames, ignoring the visual style variations over time, e.g., thunderstorm, wave, fire crackling. To overcome this limitation, we utilize temporal sound features for the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
[[2304.07060] DCFace: Synthetic Face Generation with Dual Condition Diffusion Model](http://arxiv.org/abs/2304.07060) #diffusion
Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of the same subjects under different factors (e.g., variations in pose, illumination, expression, aging and occlusion) that follow the real image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel Patch-wise style extractor and Time-step dependent ID loss enable DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Code is available at https://github.com/mk-minchul/dcface
[[2304.07087] Memory Efficient Diffusion Probabilistic Models via Patch-based Generation](http://arxiv.org/abs/2304.07087) #diffusion
Diffusion probabilistic models have been successful in generating high-quality and diverse images. However, traditional models, whose input and output are high-resolution images, suffer from excessive memory requirements, making them less practical for edge devices. Previous approaches for generative adversarial networks proposed a patch-based method that uses positional encoding and global content information. Nevertheless, designing a patch-based approach for diffusion probabilistic models is non-trivial. In this paper, we present a diffusion probabilistic model that generates images on a patch-by-patch basis. We propose two conditioning methods for patch-based generation. First, we propose position-wise conditioning using one-hot representation to ensure patches are in proper positions. Second, we propose Global Content Conditioning (GCC) to ensure patches have coherent content when concatenated together. We evaluate our model qualitatively and quantitatively on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off between maximum memory consumption and generated image quality. Specifically, when an entire image is divided into 2 x 2 patches, our proposed approach can reduce the maximum memory consumption by half while maintaining comparable image quality.
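The position-wise conditioning described above can be pictured with a short sketch. It only illustrates splitting an image into a 2 x 2 grid and attaching a one-hot position code to each patch; how the paper feeds that code into the model is not stated in the abstract, and the function names and the nested-list image format are assumptions made here.

```python
def split_with_position(image, grid=2):
    """Split a 2D image (list of rows) into grid x grid patches.

    Returns a list of (one_hot_position, patch) pairs in row-major order.
    """
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            patch = [row[gx * pw:(gx + 1) * pw]
                     for row in image[gy * ph:(gy + 1) * ph]]
            one_hot = [0] * (grid * grid)
            one_hot[gy * grid + gx] = 1  # marks where this patch belongs
            patches.append((one_hot, patch))
    return patches

def reassemble(patches, grid=2):
    """Rebuild the full image from patches, using only their one-hot positions."""
    ordered = sorted(patches, key=lambda item: item[0].index(1))
    rows = []
    for gy in range(grid):
        band = [patch for _, patch in ordered[gy * grid:(gy + 1) * grid]]
        for r in range(len(band[0])):
            rows.append([value for patch in band for value in patch[r]])
    return rows

image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_with_position(image)
print(patches[1])
```

Because each patch carries its own position code, the patches can be generated or processed in any order and still be placed correctly when concatenated.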
[[2304.07090] Delta Denoising Score](http://arxiv.org/abs/2304.07090) #diffusion
We introduce Delta Denoising Score (DDS), a novel scoring function for text-based image editing that guides minimal modifications of an input image towards the content described in a target prompt. DDS leverages the rich generative prior of text-to-image diffusion models and can be used as a loss term in an optimization problem to steer an image towards a desired direction dictated by a text. DDS utilizes the Score Distillation Sampling (SDS) mechanism for the purpose of image editing. We show that using only SDS often produces non-detailed and blurry outputs due to noisy gradients. To address this issue, DDS uses a prompt that matches the input image to identify and remove undesired erroneous directions of SDS. Our key premise is that SDS should be zero when calculated on pairs of matched prompts and images, meaning that if the score is non-zero, its gradients can be attributed to the erroneous component of SDS. Our analysis demonstrates the competence of DDS for text based image-to-image translation. We further show that DDS can be used to train an effective zero-shot image translation model. Experimental results indicate that DDS outperforms existing methods in terms of stability and quality, highlighting its potential for real-world applications in text-based image editing.
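The premise that SDS should vanish on matched image-prompt pairs suggests a compact way to write the idea. The notation below is assumed for illustration and does not come from the abstract: $\epsilon_\phi$ is the diffusion model's noise prediction, $z_t$ and $\hat{z}_t$ are the noised edited and input images at timestep $t$ (using the same sampled noise $\epsilon$), $y$ is the target prompt, and $\hat{y}$ is the prompt that matches the input image.

$$\nabla_z \mathcal{L}_{\mathrm{SDS}} \propto \epsilon_\phi(z_t, y, t) - \epsilon$$

$$\nabla_z \mathcal{L}_{\mathrm{DDS}} \propto \big(\epsilon_\phi(z_t, y, t) - \epsilon\big) - \big(\epsilon_\phi(\hat{z}_t, \hat{y}, t) - \epsilon\big) = \epsilon_\phi(z_t, y, t) - \epsilon_\phi(\hat{z}_t, \hat{y}, t)$$

Read this way, the second bracket is the SDS direction measured on the matched pair, which the stated premise treats as pure error, so subtracting it removes that component from the editing direction.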
[[2304.07169] A Comparative Study on Generative Models for High Resolution Solar Observation Imaging](http://arxiv.org/abs/2304.07169) #diffusion
Solar activity is one of the main drivers of variability in our solar system and the key source of space weather phenomena that affect Earth and near Earth space. The extensive record of high resolution extreme ultraviolet (EUV) observations from the Solar Dynamics Observatory (SDO) offers an unprecedented, very large dataset of solar images. In this work, we make use of this comprehensive dataset to investigate capabilities of current state-of-the-art generative models to accurately capture the data distribution behind the observed solar activity states. Starting from StyleGAN-based methods, we uncover severe deficits of this model family in handling fine-scale details of solar images when training on high resolution samples, contrary to training on natural face images. When switching to the diffusion based generative model family, we observe strong improvements of fine-scale detail generation. For the GAN family, we are able to achieve similar improvements in fine-scale generation when turning to ProjectedGANs, which use multi-scale discriminators with a pre-trained frozen feature extractor. We conduct ablation studies to clarify mechanisms responsible for proper fine-scale handling. Using distributed training on supercomputers, we are able to train generative models for up to 1024x1024 resolution that produce high quality samples indistinguishable to human experts, as suggested by the evaluation we conduct. We make all code, models and workflows used in this study publicly available at https://github.com/SLAMPAI/generative-models-for-highres-solar-images.
[[2304.07132] Towards Controllable Diffusion Models via Reward-Guided Exploration](http://arxiv.org/abs/2304.07132) #diffusion
By formulating data samples' formation as a Markov denoising process, diffusion models achieve state-of-the-art performances in a collection of tasks. Recently, many variants of diffusion models have been proposed to enable controlled sample generation. Most of these existing methods either formulate the controlling information as an input (i.e., conditional representation) for the noise approximator, or introduce a pre-trained classifier in the test-phase to guide the Langevin dynamic towards the conditional goal. However, the former line of methods only works when the controlling information can be formulated as conditional representations, while the latter requires the pre-trained guidance classifier to be differentiable. In this paper, we propose a novel framework named RGDM (Reward-Guided Diffusion Model) that guides the training-phase of diffusion models via reinforcement learning (RL). The proposed training framework bridges the objective of weighted log-likelihood and maximum entropy RL, which enables calculating policy gradients via samples from a pay-off distribution proportional to exponential scaled rewards, rather than from policies themselves. Such a framework alleviates the high gradient variances and enables diffusion models to explore for highly rewarded samples in the reverse process. Experiments on 3D shape and molecule generation tasks show significant improvements over existing conditional diffusion models.
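The phrase "pay-off distribution proportional to exponential scaled rewards" can be illustrated with a few lines of code. This is a generic reward-weighted log-likelihood sketch and not the RGDM training procedure: the temperature parameter, the function names, and the use of a finite candidate set are all assumptions made for the example.

```python
import math

def payoff_weights(rewards, temperature=1.0):
    """Normalized weights proportional to exp(reward / temperature)."""
    scaled = [r / temperature for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def reward_weighted_nll(log_likelihoods, rewards, temperature=1.0):
    """Negative log-likelihood of candidates, weighted by the pay-off weights."""
    weights = payoff_weights(rewards, temperature)
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))

rewards = [0.0, math.log(3.0)]          # second candidate is 3x more "valuable"
weights = payoff_weights(rewards)
loss = reward_weighted_nll([-1.0, -2.0], rewards)
print(weights, loss)
```

Minimizing such a loss pushes the model's likelihood toward high-reward samples without differentiating through the reward itself, which is the property the abstract contrasts with classifier guidance.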