[[2303.15540] Intel TDX Demystified: A Top-Down Approach](http://arxiv.org/abs/2303.15540) #secure
Intel Trust Domain Extensions (TDX) is a new architectural extension in the 4th Generation Intel Xeon Scalable Processor that supports confidential computing. TDX allows the deployment of virtual machines in the Secure-Arbitration Mode (SEAM) with encrypted CPU state and memory, integrity protection, and remote attestation. TDX aims to enforce hardware-assisted isolation for virtual machines and minimize the attack surface exposed to host platforms, which are considered untrustworthy or adversarial in confidential computing's new threat model. TDX can be leveraged by regulated industries or sensitive data holders to outsource their computations and data with end-to-end protection in public cloud infrastructure.
This paper aims to provide a comprehensive understanding of TDX to potential adopters, domain experts, and security researchers looking to leverage the technology for their own purposes. We adopt a top-down approach, starting with high-level security principles and moving to low-level technical details of TDX. Our analysis is based on publicly available documentation and source code, offering insights from security researchers outside of Intel.
[[2303.15735] Improving the Transferability of Adversarial Samples by Path-Augmented Method](http://arxiv.org/abs/2303.15735) #security
Deep neural networks have achieved unprecedented success on diverse vision tasks. However, they are vulnerable to adversarial noise that is imperceptible to humans. This phenomenon negatively affects their deployment in real-world scenarios, especially security-related ones. To evaluate the robustness of a target model in practice, transfer-based attacks craft adversarial samples with a local model and have attracted increasing attention from researchers due to their high efficiency. The state-of-the-art transfer-based attacks are generally based on data augmentation, which typically augments multiple training images from a linear path when learning adversarial samples. However, such methods select the image augmentation path heuristically and may augment images that are semantically inconsistent with the target images, which harms the transferability of the generated adversarial samples. To overcome this pitfall, we propose the Path-Augmented Method (PAM). Specifically, PAM first constructs a candidate augmentation path pool. It then settles the employed augmentation paths during adversarial sample generation with greedy search. Furthermore, to avoid augmenting semantically inconsistent images, we train a Semantics Predictor (SP) to constrain the length of the augmentation path. Extensive experiments confirm that PAM can achieve an improvement of over 4.8% on average compared with the state-of-the-art baselines in terms of the attack success rates.
[[2303.15818] Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition](http://arxiv.org/abs/2303.15818) #security
Face recognition is a prevailing authentication solution in numerous biometric applications. Physical adversarial attacks, as an important surrogate, can identify the weaknesses of face recognition systems and evaluate their robustness before deployment. However, most existing physical attacks are either readily detectable or ineffective against commercial recognition systems. The goal of this work is to develop a more reliable technique that can carry out an end-to-end evaluation of adversarial robustness for commercial systems. This requires that the technique simultaneously deceive black-box recognition models and evade defensive mechanisms. To achieve this, we design adversarial textured 3D meshes (AT3D) with an elaborate topology on a human face, which can be 3D-printed and pasted on the attacker's face to evade the defenses. However, the mesh-based optimization regime calculates gradients in high-dimensional mesh space, and can be trapped in local optima with unsatisfactory transferability. To move away from the mesh-based space, we propose to perturb the low-dimensional coefficient space based on the 3D Morphable Model, which significantly improves black-box transferability while also offering faster search and better visual quality. Extensive experiments in digital and physical scenarios show that our method effectively explores the security vulnerabilities of multiple popular commercial services, including three recognition APIs, four anti-spoofing APIs, two prevailing mobile phones and two automated access control systems.
[[2303.15821] Scaling Multi-Objective Security Games Provably via Space Discretization Based Evolutionary Search](http://arxiv.org/abs/2303.15821) #security
In the field of security, multi-objective security games (MOSGs) allow defenders to simultaneously protect targets from multiple heterogeneous attackers. MOSGs aim to simultaneously maximize all the heterogeneous payoffs, e.g., life, money, and crime rate, without merging heterogeneous attackers. In real-world scenarios, the number of heterogeneous attackers and targets to be protected may exceed the capability of most existing state-of-the-art methods, i.e., MOSGs are limited by the issue of scalability. To this end, this paper proposes a general framework called SDES based on many-objective evolutionary search to scale up MOSGs to large-scale targets and heterogeneous attackers. SDES consists of four consecutive key components, i.e., discretization, optimization, restoration and evaluation, and refinement. Specifically, SDES first discretizes the originally high-dimensional continuous solution space to the low-dimensional discrete one by the maximal indifference property in game theory. This property helps evolutionary algorithms (EAs) bypass the high-dimensional step function and ensure a well-convergent Pareto front. Then, a many-objective EA is used for optimization in the low-dimensional discrete solution space to obtain a well-spaced Pareto front. To evaluate solutions, SDES restores solutions back to the original space via bit-wisely optimizing a novel solution divergence. Finally, the refinement in SDES boosts the optimization performance with acceptable cost. Theoretically, we prove the optimization consistency and convergence of SDES. Experiment results show that SDES is the first linear-time MOSG algorithm for both large-scale attackers and targets. SDES is able to solve up to 20 attackers and 100 targets MOSG problems, while the state-of-the-art methods can only solve up to 8 attackers and 25 targets ones. Ablation study verifies the necessity of all components in SDES.
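The Pareto front mentioned in the abstract can be illustrated with a generic non-dominated filter. This is only a minimal sketch of the concept under a maximization assumption, not the SDES algorithm; the function names and the sample payoff vectors are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and
    strictly better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical defender payoffs against two attacker types.
payoffs = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0), (2.0, 1.5)]
front = pareto_front(payoffs)
print(front)
```

The filter is quadratic in the number of candidate solutions, which is acceptable for a small illustration but says nothing about the scaling behaviour claimed for SDES.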
[[2303.15965] SFHarmony: Source Free Domain Adaptation for Distributed Neuroimaging Analysis](http://arxiv.org/abs/2303.15965) #privacy
To represent the biological variability of clinical neuroimaging populations, it is vital to be able to combine data across scanners and studies. However, different MRI scanners produce images with different characteristics, resulting in a domain shift known as the 'harmonisation problem'. Additionally, neuroimaging data is inherently personal in nature, leading to data privacy concerns when sharing the data. To overcome these barriers, we propose an Unsupervised Source-Free Domain Adaptation (SFDA) method, SFHarmony. Through modelling the imaging features as a Gaussian Mixture Model and minimising an adapted Bhattacharyya distance between the source and target features, we can create a model that performs well for the target data whilst having a shared feature representation across the data domains, without needing access to the source data for adaptation or target labels. We demonstrate the performance of our method on simulated and real domain shifts, showing that the approach is applicable to classification, segmentation and regression tasks, requiring no changes to the algorithm. Our method outperforms existing SFDA approaches across a range of realistic data scenarios, demonstrating the potential utility of our approach for MRI harmonisation and general SFDA problems. Our code is available at https://github.com/nkdinsdale/SFHarmony.
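For readers unfamiliar with the Bhattacharyya distance, the standard closed form for two univariate Gaussians is sketched below. The abstract uses an adapted variant over Gaussian Mixture Models, which is not reproduced here; the function name and the example parameters are hypothetical.

```python
import math


def bhattacharyya_gaussian(mu1: float, var1: float, mu2: float, var2: float) -> float:
    """Standard Bhattacharyya distance between N(mu1, var1) and N(mu2, var2).

    D_B = (mu1 - mu2)^2 / (4 (var1 + var2))
          + 0.5 * ln((var1 + var2) / (2 * sqrt(var1 * var2)))
    """
    mean_term = (mu1 - mu2) ** 2 / (4.0 * (var1 + var2))
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


# Hypothetical source and target feature statistics.
same = bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0)
shifted = bhattacharyya_gaussian(0.0, 1.0, 2.0, 1.0)
print(same, shifted)
```

The distance is zero for identical distributions and grows with both mean shift and variance mismatch, which is why it can serve as a feature-alignment objective.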
[[2303.16028] Synthetically generated text for supervised text analysis](http://arxiv.org/abs/2303.16028) #privacy
Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
[[2303.15563] Privacy-preserving machine learning for healthcare: open challenges and future perspectives](http://arxiv.org/abs/2303.15563) #privacy
Machine Learning (ML) has recently shown tremendous success in modeling various healthcare prediction tasks, ranging from disease diagnosis and prognosis to patient treatment. Due to the sensitive nature of medical data, privacy must be considered along the entire ML pipeline, from model training to inference. In this paper, we conduct a review of recent literature concerning Privacy-Preserving Machine Learning (PPML) for healthcare. We primarily focus on privacy-preserving training and inference-as-a-service, and perform a comprehensive review of existing trends, identify challenges, and discuss opportunities for future research directions. The aim of this review is to guide the development of private and efficient ML models in healthcare, with the prospects of translating research efforts into real-world settings.
[[2303.15916] From Private to Public: Benchmarking GANs in the Context of Private Time Series Classification](http://arxiv.org/abs/2303.15916) #privacy
Deep learning has proven to be successful in various domains and for different tasks. However, when it comes to private data several restrictions are making it difficult to use deep learning approaches in these application fields. Recent approaches try to generate data privately instead of applying a privacy-preserving mechanism directly, on top of the classifier. The solution is to create public data from private data in a manner that preserves the privacy of the data. In this work, two very prominent GAN-based architectures were evaluated in the context of private time series classification. In contrast to previous work, mostly limited to the image domain, the scope of this benchmark was the time series domain. The experiments show that especially GSWGAN performs well across a variety of public datasets outperforming the competitor DPWGAN. An analysis of the generated datasets further validates the superiority of GSWGAN in the context of time series generation.
[[2303.15991] Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks](http://arxiv.org/abs/2303.15991) #privacy
Increasingly deep neural networks hinder the democratization of privacy-enhancing distributed learning, such as federated learning (FL), to resource-constrained devices. To overcome this challenge, in this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL), allowing multiple client devices to offload substantial training workloads to an edge server via layer-wise model split. By observing that existing PSL schemes incur excessive training latency and a large volume of data transmission, we propose an innovative PSL framework, namely, efficient parallel split learning (EPSL), to accelerate model training. To be specific, EPSL parallelizes client-side model training and reduces the dimension of local gradients for back propagation (BP) via last-layer gradient aggregation, leading to a significant reduction in server-side training and communication latency. Moreover, by considering the heterogeneous channel conditions and computing capabilities at client devices, we jointly optimize subchannel allocation, power control, and cut layer selection to minimize the per-round latency. Simulation results show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy compared with the state-of-the-art benchmarks, and the tailored resource management and layer split strategy can considerably reduce latency compared with the counterpart without optimization.
[[2303.15553] MoViT: Memorizing Vision Transformers for Medical Image Analysis](http://arxiv.org/abs/2303.15553) #protect
The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, due to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in the context of medical imaging, where both annotation efforts and data protection result in limited data availability. In this work, inspired by the human decision-making process of correlating "new evidence" with previously memorized "experience", we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache history attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, to update the stored external memories with the historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially in cases where only a small amount of annotated data is available. More importantly, MoViT can reach a competitive performance of ViT with only 3.0% of the training data.
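The idea of updating stored memories with a historical moving average can be sketched as a plain exponential moving average. This is an assumption about the general shape of such an update, not MoViT's actual scheme; the function name, the `momentum` parameter and the toy vectors are hypothetical.

```python
from typing import List


def moving_average_update(memory: List[float], snapshot: List[float],
                          momentum: float = 0.9) -> List[float]:
    """Blend a stored memory vector with a new snapshot.

    A high momentum keeps most of the history, so a single noisy
    snapshot cannot overwrite what was cached earlier.
    """
    return [momentum * m + (1.0 - momentum) * s for m, s in zip(memory, snapshot)]


# Hypothetical cached attention vector and a new training-step snapshot.
memory = [1.0, 0.0]
snapshot = [0.0, 1.0]
memory = moving_average_update(memory, snapshot, momentum=0.9)
print(memory)
```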
[[2303.15564] Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder](http://arxiv.org/abs/2303.15564) #defense
Deep neural networks are vulnerable to backdoor attacks, where an adversary maliciously manipulates the model behavior through overlaying images with special triggers. Existing backdoor defense methods often require access to some validation data and the model parameters, which is impractical in many real-world applications, e.g., when the model is provided as a cloud service. In this paper, we address the practical task of blind backdoor defense at test time, in particular for black-box models. The true label of every test image needs to be recovered on the fly from the hard label predictions of a suspicious model. The heuristic trigger search in image space, however, is not scalable to complex triggers or high image resolution. We circumvent this barrier by leveraging generic image generation models, and propose a framework of Blind Defense with Masked AutoEncoder (BDMAE). It uses the image structural similarity and label consistency between the test image and MAE restorations to detect possible triggers. The detection result is refined by considering the topology of triggers. We obtain a purified test image from restorations for making predictions. Our approach is blind to the model architectures, trigger patterns or image benignity. Extensive experiments on multiple datasets with different backdoor attacks validate its effectiveness and generalizability. Code is available at https://github.com/tsun/BDMAE.
[[2303.15571] EMShepherd: Detecting Adversarial Samples via Side-channel Leakage](http://arxiv.org/abs/2303.15571) #defense
Deep Neural Networks (DNNs) are vulnerable to adversarial perturbations: small changes crafted deliberately on the input to mislead the model into making wrong predictions. Adversarial attacks have disastrous consequences for deep learning-empowered critical applications. Existing defense and detection techniques both require extensive knowledge of the model, testing inputs, and even execution details. They are not viable for general deep learning implementations where the model internals are unknown, a common 'black-box' scenario for model users. Inspired by the fact that electromagnetic (EM) emanations of a model inference are dependent on both operations and data and may contain footprints of different input classes, we propose a framework, EMShepherd, to capture EM traces of model execution, perform processing on traces and exploit them for adversarial detection. Only benign samples and their EM traces are used to train the adversarial detector: a set of EM classifiers and class-specific unsupervised anomaly detectors. When the victim model system is under attack by an adversarial example, the model execution will be different from executions for the known classes, and the EM trace will be different. We demonstrate that our air-gapped EMShepherd can effectively detect different adversarial attacks on a commonly used FPGA deep learning accelerator for both Fashion MNIST and CIFAR-10 datasets. It achieves a 100% detection rate on most types of adversarial samples, which is comparable to the state-of-the-art 'white-box' software-based detectors.
[[2303.15754] Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization](http://arxiv.org/abs/2303.15754) #attack
Vision transformers (ViTs) have been successfully deployed in a variety of computer vision tasks, but they are still vulnerable to adversarial samples. Transfer-based attacks use a local model to generate adversarial samples and directly transfer them to attack a target black-box model. The high efficiency of transfer-based attacks makes them a severe security threat to ViT-based applications. Therefore, it is vital to design effective transfer-based attacks to identify the deficiencies of ViTs beforehand in security-sensitive scenarios. Existing efforts generally focus on regularizing the input gradients to stabilize the update direction of adversarial samples. However, the variance of the back-propagated gradients in intermediate blocks of ViTs may still be large, which may make the generated adversarial samples focus on some model-specific features and get stuck in poor local optima. To overcome the shortcomings of existing approaches, we propose the Token Gradient Regularization (TGR) method. According to the structural characteristics of ViTs, TGR reduces the variance of the back-propagated gradient in each internal block of ViTs in a token-wise manner and utilizes the regularized gradient to generate adversarial samples. Extensive experiments on attacking both ViTs and CNNs confirm the superiority of our approach. Notably, compared to the state-of-the-art transfer-based attacks, our TGR offers a performance improvement of 8.8% on average.
[[2303.15847] Canary in Twitter Mine: Collecting Phishing Reports from Experts and Non-experts](http://arxiv.org/abs/2303.15847) #attack
The rise in phishing attacks via e-mail and short message service (SMS) has not slowed down at all. The first thing we need to do to combat the ever-increasing number of phishing attacks is to collect and characterize more phishing cases that reach end users. Without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach using Twitter as a new observation point to immediately collect and characterize phishing cases via e-mail and SMS that evade countermeasures and reach users. Specifically, we propose CrowdCanary, a system capable of structurally and accurately extracting phishing information (e.g., URLs and domains) from tweets about phishing by users who have actually discovered or encountered it. In our three months of live operation, CrowdCanary identified 35,432 phishing URLs out of 38,935 phishing reports. We confirmed that 31,960 (90.2%) of these phishing URLs were later detected by the anti-virus engine, demonstrating that CrowdCanary is superior to existing systems in both accuracy and volume of threat extraction. We also analyzed users who shared phishing threats by utilizing the extracted phishing URLs and categorized them into two distinct groups - namely, experts and non-experts. As a result, we found that CrowdCanary could collect information that is specifically included in non-expert reports, such as information shared only by the company brand name in the tweet, information about phishing attacks that we find only in the image of the tweet, and information about the landing page before the redirect.
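Extracting URLs and domains from tweet text can be sketched with the standard library alone. This is not CrowdCanary's pipeline: the assumption that reporters "defang" links (for example `hxxps` and `[.]`), the regular expression, the helper names and the sample tweet are all hypothetical. The last line only re-derives the 90.2% figure from the two counts quoted in the abstract.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def refang(text: str) -> str:
    """Undo common defanging so that the URL can be parsed again."""
    text = re.sub(r"hxxp", "http", text, flags=re.IGNORECASE)
    return text.replace("[.]", ".").replace("[:]", ":")


def extract_urls(tweet: str) -> List[str]:
    """Return every http(s) URL found in the refanged tweet text."""
    return URL_RE.findall(refang(tweet))


def domain_of(url: str) -> Optional[str]:
    """Return the host part of a URL, or None if it has none."""
    return urlparse(url).hostname


# Hypothetical report written by a user who received a phishing SMS.
tweet = "Phishing SMS pretending to be a bank: hxxps://login.example[.]com/verify please report"
urls = extract_urls(tweet)
domains = [domain_of(u) for u in urls]

# Share of extracted URLs later flagged, using the counts from the abstract.
detected_share = round(100 * 31960 / 35432, 1)
print(urls, domains, detected_share)
```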
[[2303.16004] A Survey on Malware Detection with Graph Representation Learning](http://arxiv.org/abs/2303.16004) #attack
Malware detection has become a major concern due to the increasing number and complexity of malware. Traditional detection methods based on signatures and heuristics are used for malware detection, but unfortunately, they suffer from poor generalization to unknown attacks and can be easily circumvented using obfuscation techniques. In recent years, Machine Learning (ML) and notably Deep Learning (DL) achieved impressive results in malware detection by learning useful representations from data and have become a solution preferred over traditional methods. More recently, the application of such techniques on graph-structured data has achieved state-of-the-art performance in various domains and demonstrates promising results in learning more robust representations from malware. Yet, no literature review focusing on graph-based deep learning for malware detection exists. In this survey, we provide an in-depth literature review to summarize and unify existing works under the common approaches and architectures. We notably demonstrate that Graph Neural Networks (GNNs) reach competitive results in learning robust embeddings from malware represented as expressive graph structures, leading to an efficient detection by downstream classifiers. This paper also reviews adversarial attacks that are utilized to fool graph-based detection methods. Challenges and future research directions are discussed at the end of the paper.
[[2303.16031] A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network](http://arxiv.org/abs/2303.16031) #attack
Speaker verification has been widely used in many authentication scenarios. However, training models for speaker verification requires large amounts of data and computing power, so users often use untrustworthy third-party data or deploy third-party models directly, which may create security risks. In this paper, we propose a backdoor attack for the above scenario. Specifically, for the Siamese network in the speaker verification system, we try to implant a universal identity in the model that can simulate any enrolled speaker and pass the verification. So the attacker does not need to know the victim, which makes the attack more flexible and stealthy. In addition, we design and compare three ways of selecting attacker utterances and two ways of poisoned training for the GE2E loss function in different scenarios. The results on the TIMIT and Voxceleb1 datasets show that our approach can achieve a high attack success rate while guaranteeing the normal verification accuracy. Our work reveals the vulnerability of the speaker verification system and provides a new perspective to further improve the robustness of the system.
[[2303.15472] Learning Rotation-Equivariant Features for Visual Correspondence](http://arxiv.org/abs/2303.15472) #robust
Extracting discriminative local features that are invariant to imaging variations is an integral part of establishing correspondences between images. In this work, we introduce a self-supervised learning framework to extract discriminative rotation-invariant descriptors using group-equivariant CNNs. Thanks to employing group-equivariant CNNs, our method effectively learns to obtain rotation-equivariant features and their orientations explicitly, without having to perform sophisticated data augmentations. The resultant features and their orientations are further processed by group aligning, a novel invariant mapping technique that shifts the group-equivariant features by their orientations along the group dimension. Our group aligning technique achieves rotation-invariance without any collapse of the group dimension and thus eschews loss of discriminability. The proposed method is trained end-to-end in a self-supervised manner, where we use an orientation alignment loss for the orientation estimation and a contrastive descriptor loss for robust local descriptors to geometric/photometric variations. Our method demonstrates state-of-the-art matching accuracy among existing rotation-invariant descriptors under varying rotation and also shows competitive results when transferred to the task of keypoint matching and camera pose estimation.
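The group aligning step described above, shifting features along the group dimension by their orientation, can be illustrated with a cyclic shift. This is a simplified sketch: it assumes a cyclic rotation group whose action permutes the group axis, and it substitutes a plain argmax over response energy for the learned orientation estimate. All names and values are hypothetical.

```python
from typing import List

Feature = List[List[float]]  # shape: [group_size][channels]


def cyclic_shift(feat: Feature, k: int) -> Feature:
    """Move the entry at group index i to index (i + k) mod group_size."""
    k %= len(feat)
    return feat[-k:] + feat[:-k] if k else list(feat)


def dominant_orientation(feat: Feature) -> int:
    """Stand-in orientation estimate: the group index with the most energy."""
    energies = [sum(v * v for v in vec) for vec in feat]
    return energies.index(max(energies))


def group_align(feat: Feature) -> Feature:
    """Shift so that the dominant orientation sits at group index 0.

    The group dimension is kept at full size; nothing is pooled away.
    """
    return cyclic_shift(feat, -dominant_orientation(feat))


# Hypothetical descriptor with 4 rotation bins and 2 channels.
feat = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.0, 0.5]]
aligned = group_align(feat)

# Rotating the input is modelled as a cyclic shift of the group axis;
# the aligned descriptor is the same for every such rotation.
rotated_versions = [group_align(cyclic_shift(feat, k)) for k in range(len(feat))]
print(aligned)
```

Compared with max- or average-pooling over the group axis, this keeps every group element in the descriptor, which is the property the abstract credits for avoiding a loss of discriminability.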
[[2303.15494] Semantic-visual Guided Transformer for Few-shot Class-incremental Learning](http://arxiv.org/abs/2303.15494) #robust
Few-shot class-incremental learning (FSCIL) has recently attracted extensive attention in various areas. Existing FSCIL methods depend heavily on the robustness of the feature backbone pre-trained on base classes. In recent years, different Transformer variants have made significant progress in feature representation learning across many fields. Nevertheless, the Transformer has so far not achieved in FSCIL scenarios the potential it has shown in other fields. In this paper, we develop a semantic-visual guided Transformer (SV-T) to enhance the feature-extraction capacity of the pre-trained feature backbone on incremental classes. Specifically, we first utilize the visual (image) labels provided by the base classes to supervise the optimization of the Transformer. Then, a text encoder is introduced to automatically generate the corresponding semantic (text) labels for each image from the base classes. Finally, the constructed semantic labels are further applied to the Transformer to guide the updating of its hyperparameters. Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone. More importantly, our SV-T is an independent method, which can be applied directly to existing FSCIL architectures for acquiring embeddings of various incremental classes. Extensive experiments on three benchmarks, two FSCIL architectures, and two Transformer variants show that our proposed SV-T obtains a significant improvement in comparison to the existing state-of-the-art FSCIL methods.
[[2303.15623] Real-Time Semantic Segmentation using Hyperspectral Images for Mapping Unstructured and Unknown Environments](http://arxiv.org/abs/2303.15623) #robust
Autonomous navigation in unstructured off-road environments is greatly improved by semantic scene understanding. Conventional image processing algorithms are difficult to implement and lack robustness due to a lack of structure and high variability across off-road environments. The use of neural networks and machine learning can overcome these challenges, but they require large labeled data sets for training. In our work we propose the use of hyperspectral images for real-time pixel-wise semantic classification and segmentation, without the need for any prior training data. The resulting segmented image is processed to extract, filter, and approximate objects as polygons, using a polygon approximation algorithm. The resulting polygons are then used to generate a semantic map of the environment. Using our framework, we show the capability to add new semantic classes at run-time for classification. The proposed methodology is also shown to operate in real-time and produce outputs at a frequency of 1Hz, using high resolution hyperspectral images.
[[2303.15651] 4D Panoptic Segmentation as Invariant and Equivariant Field Prediction](http://arxiv.org/abs/2303.15651) #robust
In this paper, we develop rotation-equivariant neural networks for 4D panoptic segmentation. 4D panoptic segmentation is a recently established benchmark task for autonomous driving, which requires recognizing semantic classes and object instances on the road based on LiDAR scans, as well as assigning temporally consistent IDs to instances across time. We observe that the driving scenario is symmetric to rotations on the ground plane. Therefore, rotation-equivariance could provide better generalization and more robust feature learning. Specifically, we review the object instance clustering strategies, and restate the centerness-based approach and the offset-based approach as the prediction of invariant scalar fields and equivariant vector fields. Other sub-tasks are also unified from this perspective, and different invariant and equivariant layers are designed to facilitate their predictions. Through evaluation on the standard 4D panoptic segmentation benchmark of SemanticKITTI, we show that our equivariant models achieve higher accuracy with lower computational costs compared to their non-equivariant counterparts. Moreover, our method sets the new state-of-the-art performance and achieves 1st place on the SemanticKITTI 4D Panoptic Segmentation leaderboard.
[[2303.15671] Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval](http://arxiv.org/abs/2303.15671) #robust
Colonoscopic video retrieval, which is a critical part of polyp treatment, has great clinical significance for the prevention and treatment of colorectal cancer. However, retrieval models trained on action recognition datasets usually produce unsatisfactory retrieval results on colonoscopic datasets due to the large domain gap between them. To seek a solution to this problem, we construct a large-scale colonoscopic dataset named Colo-Pair for medical practice. Based on this dataset, a simple yet effective training method called Colo-SCRL is proposed for more robust representation learning. It aims to refine general knowledge from colonoscopies through masked autoencoder-based reconstruction and momentum contrast to improve retrieval performance. To the best of our knowledge, this is the first attempt to employ the contrastive learning paradigm for medical video retrieval. Empirical results show that our method significantly outperforms current state-of-the-art methods in the colonoscopic video retrieval task.
[[2303.15676] Cross-View Visual Geo-Localization for Outdoor Augmented Reality](http://arxiv.org/abs/2303.15676) #robust
Precise estimation of global orientation and location is critical to ensure a compelling outdoor Augmented Reality (AR) experience. We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database. Recently, neural network-based methods have shown state-of-the-art performance in cross-view matching. However, most of the prior works focus only on location estimation, ignoring orientation, which cannot meet the requirements in outdoor AR applications. We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation. Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance. Furthermore, we present an approach to extend the single image query-based geo-localization approach by utilizing temporal information from a navigation pipeline for robust continuous geo-localization. Experimentation on several large-scale real-world video sequences demonstrates that our approach enables high-precision and stable AR insertion.
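For context, the standard hinge-style triplet ranking loss is sketched below. The abstract's modified loss for joint location and orientation estimation is not specified here and is not reproduced; the margin value, the function name and the toy 2-D embeddings are hypothetical.

```python
import math
from typing import Sequence


def triplet_ranking_loss(anchor: Sequence[float], positive: Sequence[float],
                         negative: Sequence[float], margin: float = 0.2) -> float:
    """Hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the matching pair is closer than the
    non-matching pair by at least the margin.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical embeddings: a ground query, its matching aerial tile,
# and two non-matching tiles (one far away, one confusingly close).
query = (0.0, 0.0)
match = (0.0, 1.0)
easy_negative = (3.0, 4.0)
hard_negative = (0.0, 1.1)

easy_loss = triplet_ranking_loss(query, match, easy_negative)
hard_loss = triplet_ranking_loss(query, match, hard_negative)
print(easy_loss, hard_loss)
```

Only the hard negative contributes a gradient signal in this form, which is why triplet-based training pipelines commonly pay attention to how negatives are sampled.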
[[2303.15742] System-status-aware Adaptive Network for Online Streaming Video Understanding](http://arxiv.org/abs/2303.15742) #robust
Recent years have witnessed great progress in deep neural networks for real-time applications. However, most existing works do not explicitly consider the general case where the device's state and the available resources fluctuate over time, and none of them investigate or address the impact of varying computational resources for online video understanding tasks. This paper proposes a System-status-aware Adaptive Network (SAN) that considers the device's real-time state to provide high-quality predictions with low delay. Usage of our agent's policy improves efficiency and robustness to fluctuations of the system status. On two widely used video understanding tasks, SAN obtains state-of-the-art performance while constantly keeping processing delays low. Moreover, training such an agent on various types of hardware configurations is not easy as the labeled training data might not be available, or can be computationally prohibitive. To address this challenging problem, we propose a Meta Self-supervised Adaptation (MSA) method that adapts the agent's policy to new hardware configurations at test-time, allowing for easy deployment of the model onto other unseen hardware platforms.
[[2303.15768] RobustSwap: A Simple yet Robust Face Swapping Model against Attribute Leakage](http://arxiv.org/abs/2303.15768) #robust
Face swapping aims at injecting a source image's identity (i.e., facial features) into a target image, while strictly preserving the target's attributes, which are irrelevant to identity. However, we observed that previous approaches still suffer from source attribute leakage, where the source image's attributes interfere with the target image's. In this paper, we analyze the latent space of StyleGAN and find the adequate combination of the latents geared for face swapping task. Based on the findings, we develop a simple yet robust face swapping model, RobustSwap, which is resistant to the potential source attribute leakage. Moreover, we exploit the coordination of 3DMM's implicit and explicit information as a guidance to incorporate the structure of the source image and the precise pose of the target image. Despite our method solely utilizing an image dataset without identity labels for training, our model has the capability to generate high-fidelity and temporally consistent videos. Through extensive qualitative and quantitative evaluations, we demonstrate that our method shows significant improvements compared with the previous face swapping models in synthesizing both images and videos. Project page is available at https://robustswap.github.io/
[[2303.15833] Complementary Domain Adaptation and Generalization for Unsupervised Continual Domain Shift Learning](http://arxiv.org/abs/2303.15833) #robust
Continual domain shift poses a significant challenge in real-world applications, particularly in situations where labeled data is not available for new domains. The challenge of acquiring knowledge in this problem setting is referred to as unsupervised continual domain shift learning. Existing methods for domain adaptation and generalization have limitations in addressing this issue, as they focus either on adapting to a specific domain or generalizing to unseen domains, but not both. In this paper, we propose Complementary Domain Adaptation and Generalization (CoDAG), a simple yet effective learning framework that combines domain adaptation and generalization in a complementary manner to achieve three major goals of unsupervised continual domain shift learning: adapting to a current domain, generalizing to unseen domains, and preventing forgetting of previously seen domains. Our approach is model-agnostic, meaning that it is compatible with any existing domain adaptation and generalization algorithms. We evaluate CoDAG on several benchmark datasets and demonstrate that our model outperforms state-of-the-art models in all datasets and evaluation metrics, highlighting its effectiveness and robustness in handling unsupervised continual domain shift learning.
[[2303.15999] Thread Counting in Plain Weave for Old Paintings Using Semi-Supervised Regression Deep Learning Models](http://arxiv.org/abs/2303.15999) #robust
In this work the authors develop regression approaches based on deep learning to perform thread density estimation for plain weave canvas analysis. Previous approaches were based on Fourier analysis, which is quite robust in some scenarios but fails in others; on machine learning tools, which involve pre-labeling of the painting at hand; or on the segmentation of thread crossing points, which provides good estimations in all scenarios with no need for pre-labeling. The segmentation approach is time-consuming, as estimation of the densities is performed after locating the crossing points. In this novel proposal, we avoid this step by computing the density of threads directly from the image with a regression deep learning model. We also incorporate some improvements in the initial preprocessing of the input image with an impact on the final error. Several models are proposed and analyzed to retain the best one. Furthermore, we reduce the density estimation error by introducing a semi-supervised approach. The performance of our novel algorithm is analyzed with works by Ribera, Velázquez, and Poussin, where we compare our results to those of previous approaches. Finally, the method is put into practice to support the change of authorship of a masterpiece at the Museo del Prado.
[[2303.16058] Unmasked Teacher: Towards Training-Efficient Video Foundation Models](http://arxiv.org/abs/2303.16058) #robust
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performances on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
[[2303.16099] Medical Image Analysis using Deep Relational Learning](http://arxiv.org/abs/2303.16099) #robust
In the past ten years, with the help of deep learning, especially the rapid development of deep neural networks, medical image analysis has made remarkable progress. However, how to effectively use the relational information between various tissues or organs in medical images is still a very challenging problem, and it has not been fully studied. In this thesis, we propose two novel solutions to this problem based on deep relational learning. First, we propose a context-aware fully convolutional network that effectively models implicit relation information between features to perform medical image segmentation. The network achieves the state-of-the-art segmentation results on the Multi Modal Brain Tumor Segmentation 2017 (BraTS2017) and Multi Modal Brain Tumor Segmentation 2018 (BraTS2018) data sets. Subsequently, we propose a new hierarchical homography estimation network to achieve accurate medical image mosaicing by learning the explicit spatial relationship between adjacent frames. We use the UCL Fetoscopy Placenta dataset to conduct experiments and our hierarchical homography estimation network outperforms the other state-of-the-art mosaicing methods while generating robust and meaningful mosaicing result on unseen frames.
[[2303.16191] Hard Nominal Example-aware Template Mutual Matching for Industrial Anomaly Detection](http://arxiv.org/abs/2303.16191) #robust
Anomaly detectors are widely used in industrial production to detect and localize unknown defects in query images. These detectors are trained on nominal images and have shown success in distinguishing anomalies from most normal samples. However, because hard-nominal examples are scattered and far apart from most normal samples, they are often mistaken for anomalies by existing anomaly detectors. To address this problem, we propose a simple yet efficient method: Hard Nominal Example-aware Template Mutual Matching (HETMM). Specifically, HETMM aims to construct a robust prototype-based decision boundary, which can precisely distinguish between hard-nominal examples and anomalies, yielding fewer false positives and missed detections. Moreover, HETMM mutually explores the anomalies in two directions between queries and the template set, and thus it is capable of capturing logical anomalies. This is a significant advantage over most anomaly detectors, which frequently fail to detect logical anomalies. Additionally, to meet the speed-accuracy demands, we further propose Pixel-level Template Selection (PTS) to streamline the original template set. PTS selects cluster centres and hard-nominal examples to form a tiny set, maintaining the original decision boundaries. Comprehensive experiments on five real-world datasets demonstrate that our methods outperform existing approaches at real-time inference speed. Furthermore, HETMM can be hot-updated by inserting novel samples, which may promptly address some incremental learning issues.
[[2303.15846] Soft-prompt tuning to predict lung cancer using primary care free-text Dutch medical notes](http://arxiv.org/abs/2303.15846) #robust
We investigate different natural language processing (NLP) approaches based on contextualised word representations for the problem of early prediction of lung cancer using free-text patient medical notes of Dutch primary care physicians. Because lung cancer has a low prevalence in primary care, we also address the problem of classification under highly imbalanced classes. Specifically, we use large Transformer-based pretrained language models (PLMs) and investigate: 1) how soft prompt-tuning (an NLP technique used to adapt PLMs using small amounts of training data) compares to standard model fine-tuning; 2) whether simpler static word embedding models (WEMs) can be more robust compared to PLMs in highly imbalanced settings; and 3) how models fare when trained on notes from a small number of patients. We find that 1) soft-prompt tuning is an efficient alternative to standard model fine-tuning; 2) PLMs show better discrimination but worse calibration compared to simpler static word embedding models as the classification problem becomes more imbalanced; and 3) results when training models on a small number of patients are mixed and show no clear differences between PLMs and WEMs. All our code is available open source at https://bitbucket.org/aumc-kik/prompt_tuning_cancer_prediction/.
[[2303.15901] Denoising Autoencoder-based Defensive Distillation as an Adversarial Robustness Algorithm](http://arxiv.org/abs/2303.15901) #robust
Adversarial attacks significantly threaten the robustness of deep neural networks (DNNs). Despite the multiple defensive methods employed, they are nevertheless vulnerable to poison attacks, where attackers meddle with the initial training data. In order to defend DNNs against such adversarial attacks, this work proposes a novel method that combines the defensive distillation mechanism with a denoising autoencoder (DAE). This technique tries to lower the sensitivity of the distilled model to poison attacks by spotting and reconstructing poisonous adversarial inputs in the training data. We added carefully created adversarial samples to the initial training data to assess the proposed method's performance. Our experimental findings demonstrate that our method successfully identified and reconstructed the poisonous inputs while also considering enhancing the DNN's resilience. The proposed approach provides a potent and robust defense mechanism for DNNs in various applications where data poisoning attacks are a concern. Thus, the defensive distillation technique's limitation posed by poisonous adversarial attacks is overcome.
[[2303.15453] Robustness of Utilizing Feedback in Embodied Visual Navigation](http://arxiv.org/abs/2303.15453) #robust
This paper presents a framework for training an agent to actively request help in object-goal navigation tasks, with feedback indicating the location of the target object in its field of view. To make the agent more robust in scenarios where a teacher may not always be available, the proposed training curriculum includes a mix of episodes with and without feedback. The results show that this approach improves the agent's performance, even in the absence of feedback.
[[2303.15489] Railway Network Delay Evolution: A Heterogeneous Graph Neural Network Approach](http://arxiv.org/abs/2303.15489) #robust
Railway operations involve different types of entities (stations, trains, etc.), making the existing graph/network models with homogeneous nodes (i.e., the same kind of nodes) incapable of capturing the interactions between the entities. This paper aims to develop a heterogeneous graph neural network (HetGNN) model, which can address different types of nodes (i.e., heterogeneous nodes), to investigate the train delay evolution on railway networks. To this end, a graph architecture combining the HetGNN model and the GraphSAGE homogeneous GNN (HomoGNN), called SAGE-Het, is proposed. The aim is to capture the interactions between trains, trains and stations, and stations and other stations on delay evolution based on different edges. In contrast to the traditional methods that require the inputs to have constant dimensions (e.g., in rectangular or grid-like arrays) or only allow homogeneous nodes in the graph, SAGE-Het allows for flexible inputs and heterogeneous nodes. The data from two sub-networks of the China railway network are applied to test the performance and robustness of the proposed SAGE-Het model. The experimental results show that SAGE-Het exhibits better performance than the existing delay prediction methods and some advanced HetGNNs used for other prediction tasks, and that its predictive performance under different prediction time horizons (10/20/30 min ahead) exceeds that of the other baseline methods. Specifically, the influences of train interactions on delay propagation are investigated based on the proposed model. The results show that train interactions become subtle when the train headways increase. This finding directly contributes to decision-making in the situation where conflict-resolution or train-canceling actions are needed.
[[2303.15631] Multiphysics discovery with moving boundaries using Ensemble SINDy and Peridynamic Differential Operator](http://arxiv.org/abs/2303.15631) #robust
This study proposes a novel framework for learning the underlying physics of phenomena with moving boundaries. The proposed approach combines Ensemble SINDy and Peridynamic Differential Operator (PDDO) and imposes an inductive bias assuming the moving boundary physics evolve in its own corotational coordinate system. The robustness of the approach is demonstrated by considering various levels of noise in the measured data using the 2D Fisher-Stefan model. The confidence intervals of recovered coefficients are listed, and the uncertainties of the moving boundary positions are depicted by obtaining the solutions with the recovered coefficients. Although the main focus of this study is the Fisher-Stefan model, the proposed approach is applicable to any type of moving boundary problem with a smooth moving boundary front without a mushy region. The code and data for this framework are available at: https://github.com/alicanbekar/MB_PDDO-SINDy.
[[2303.15634] Learning Rate Schedules in the Presence of Distribution Shift](http://arxiv.org/abs/2303.15634) #robust
We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift, and we give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically increase in the presence of distribution shift. Finally, we provide experiments for high-dimensional regression models and neural networks to illustrate these learning rate schedules and their cumulative regret.
[[2303.15810] Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization](http://arxiv.org/abs/2303.15810) #robust
Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy, as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed In-sample Learning paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a completely in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.
[[2303.15845] Conditional Generative Models are Provably Robust: Pointwise Guarantees for Bayesian Inverse Problems](http://arxiv.org/abs/2303.15845) #robust
Conditional generative models have become a very powerful tool to sample from Bayesian inverse problem posteriors. It is well-known in classical Bayesian literature that posterior measures are quite robust with respect to perturbations of both the prior measure and the negative log-likelihood, which includes perturbations of the observations. However, to the best of our knowledge, the robustness of conditional generative models with respect to perturbations of the observations has not been investigated yet. In this paper, we prove for the first time that appropriately learned conditional generative models provide robust results for single observations.
[[2303.16100] Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures](http://arxiv.org/abs/2303.16100) #robust
Executing machine learning inference tasks on resource-constrained edge devices requires careful hardware-software co-design optimizations. Recent examples have shown how transformer-based deep neural network models such as ALBERT can be used to enable the execution of natural language processing (NLP) inference on mobile systems-on-chip housing custom hardware accelerators. However, while these existing solutions are effective in alleviating the latency, energy, and area costs of running single NLP tasks, achieving multi-task inference requires running computations over multiple variants of the model parameters, which are tailored to each of the targeted tasks. This approach leads to either prohibitive on-chip memory requirements or paying the cost of off-chip memory access. This paper proposes adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks. The proposed model's performance and robustness to data compression methods are evaluated across several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator to extrapolate performance, power, and area improvements over the execution of a traditional ALBERT model on the same hardware platform.
[[2303.15710] Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks](http://arxiv.org/abs/2303.15710) #extraction
Recently, RGB-Thermal based perception has shown significant advances. Thermal information provides useful clues when visual cameras suffer from poor lighting conditions, such as low light and fog. However, how to effectively fuse RGB images and thermal data remains an open challenge. Previous works involve naive fusion strategies such as merging them at the input, concatenating multi-modality features inside models, or applying attention to each data modality. These fusion strategies are straightforward yet insufficient. In this paper, we propose a novel fusion method named Explicit Attention-Enhanced Fusion (EAEF) that fully takes advantage of each type of data. Specifically, we consider the following cases: i) both RGB data and thermal data, ii) only one of the types of data, and iii) none of them generate discriminative features. EAEF uses one branch to enhance feature extraction for i) and iii) and the other branch to remedy insufficient representations for ii). The outputs of two branches are fused to form complementary features. As a result, the proposed fusion method outperforms state-of-the-art by 1.6% in mIoU on semantic segmentation, 3.1% in MAE on salient object detection, 2.3% in mAP on object detection, and 8.1% in MAE on crowd counting. The code is available at https://github.com/FreeformRobotics/EAEFNet.
[[2303.15743] HS-Pose: Hybrid Scope Feature Extraction for Category-level Object Pose Estimation](http://arxiv.org/abs/2303.15743) #extraction
In this paper, we focus on the problem of category-level object pose estimation, which is challenging due to the large intra-category shape variation. 3D graph convolution (3D-GC) based methods have been widely used to extract local geometric features, but they have limitations for complex shaped objects and are sensitive to noise. Moreover, the scale and translation invariant properties of 3D-GC restrict the perception of an object's size and translation information. In this paper, we propose a simple network structure, the HS-layer, which extends 3D-GC to extract hybrid scope latent features from point cloud data for category-level object pose estimation tasks. The proposed HS-layer: 1) is able to perceive local-global geometric structure and global information, 2) is robust to noise, and 3) can encode size and translation information. Our experiments show that the simple replacement of the 3D-GC layer with the proposed HS-layer on the baseline method (GPV-Pose) achieves a significant improvement, with the performance increased by 14.5% on 5d2cm metric and 10.3% on IoU75. Our method outperforms the state-of-the-art methods by a large margin (8.3% on 5d2cm, 6.9% on IoU75) on the REAL275 dataset and runs in real-time (50 FPS).
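The 5d2cm figure quoted above refers to a pose-accuracy threshold. As a hedged sketch, the check below assumes the common reading of that metric: a prediction counts as correct when its rotation error is under 5 degrees and its translation error is under 2 cm. The function names, the strict inequalities, and the use of centimetres for translations are assumptions for this example; benchmark implementations also treat symmetric objects specially, which is omitted here.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(r_gt^T r_pred) equals the element-wise product sum of the two matrices.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between two translations given in centimetres."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_gt)))


def within_5deg_2cm(r_pred, t_pred, r_gt, t_gt, max_deg=5.0, max_cm=2.0):
    """True when both the rotation and the translation error are under threshold."""
    return (rotation_error_deg(r_pred, r_gt) < max_deg
            and translation_error_cm(t_pred, t_gt) < max_cm)


def rot_z(deg):
    """Rotation by `deg` degrees about the z axis (helper for the example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
close_pose = within_5deg_2cm(rot_z(3.0), [0.0, 0.0, 1.0], identity, [0.0, 0.0, 0.0])
far_pose = within_5deg_2cm(rot_z(10.0), [0.0, 0.0, 1.0], identity, [0.0, 0.0, 0.0])
```

In this toy case the 3-degree, 1 cm prediction passes and the 10-degree one fails on rotation alone.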
[[2303.15764] X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance](http://arxiv.org/abs/2303.15764) #extraction
Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV) and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior methods adopt text-independent multilayer perceptrons (MLPs) to predict the attributes of the target mesh with the supervision of CLIP loss. However, such text-independent architecture lacks textual guidance during predicting attributes, thus leading to unsatisfactory stylization and slow convergence. To address these limitations, we present X-Mesh, an innovative text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module (TDAM). The TDAM dynamically integrates the guidance of the target text by utilizing text-relevant spatial and channel-wise attentions during vertex feature extraction, resulting in more accurate attribute prediction and faster convergence speed. Furthermore, existing works lack standard benchmarks and automated metrics for evaluation, often relying on subjective and non-reproducible user studies to assess the quality of stylized 3D assets. To overcome this limitation, we introduce a new standard text-mesh benchmark, namely MIT-30, and two automated metrics, which will enable future research to achieve fair and objective comparisons. Our extensive qualitative and quantitative experiments demonstrate that X-Mesh outperforms previous state-of-the-art methods.
[[2303.16066] Neural Collapse Inspired Federated Learning with Non-iid Data](http://arxiv.org/abs/2303.16066) #federate
One of the challenges in federated learning is the non-independent and identically distributed (non-iid) characteristics between heterogeneous devices, which cause significant differences in local updates and affect the performance of the central server. Although many studies have been proposed to address this challenge, they only focus on local training and aggregation processes to smooth the changes and fail to achieve high performance with deep learning models. Inspired by the phenomenon of neural collapse, we force each client to be optimized toward an optimal global structure for classification. Specifically, we initialize it as a random simplex Equiangular Tight Frame (ETF) and fix it as the unit optimization target of all clients during the local updating. After guaranteeing all clients are learning to converge to the global optimum, we propose to add a global memory vector for each category to remedy the parameter fluctuation caused by the bias of the intra-class condition distribution among clients. Our experimental results show that our method can improve the performance with faster convergence speed on different-size datasets.
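To make the simplex Equiangular Tight Frame (ETF) concrete, the sketch below builds K unit-norm class vectors whose pairwise cosine similarity is exactly -1/(K-1), using the standard formula sqrt(K/(K-1)) * (I - (1/K) * 1 1^T). One labelled simplification: the abstract describes a random ETF, which is usually obtained by applying a random orthogonal rotation, whereas this example keeps a fixed axis-aligned basis so it stays dependency-free. The function names are hypothetical.

```python
import math


def simplex_etf(num_classes, dim):
    """Return num_classes unit vectors in R^dim that form a simplex ETF.

    Uses a fixed basis (no random rotation), so it requires dim >= num_classes.
    """
    if num_classes < 2 or dim < num_classes:
        raise ValueError("this sketch needs 2 <= num_classes <= dim")
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    vectors = []
    for i in range(k):
        v = [0.0] * dim
        for j in range(k):
            # Entry (j, i) of scale * (I - (1/k) * ones)
            v[j] = scale * ((1.0 if i == j else 0.0) - 1.0 / k)
        vectors.append(v)
    return vectors


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


etf = simplex_etf(num_classes=4, dim=8)
```

Because every pair of class vectors is equally and maximally separated, such a frame can serve as a fixed classifier target shared by all clients.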
[[2303.16141] A Comparative Study of Federated Learning Models for COVID-19 Detection](http://arxiv.org/abs/2303.16141) #federate
Deep learning is effective in diagnosing COVID-19 and requires a large amount of data to be effectively trained. Due to data and privacy regulations, hospitals generally have no access to data from other hospitals. Federated learning (FL) has been used to solve this problem, where it utilizes a distributed setting to train models in hospitals in a privacy-preserving manner. Deploying FL is not always feasible as it requires high computation and network communication resources. This paper evaluates five FL algorithms' performance and resource efficiency for COVID-19 detection. A decentralized setting with CNN networks is set up, and the performance of FL algorithms is compared with a centralized environment. We examined the algorithms with varying numbers of participants, federated rounds, and selection algorithms. Our results show that cyclic weight transfer can have better overall performance, and results are better with fewer participating hospitals. Our results demonstrate good performance for detecting COVID-19 patients and might be useful in deploying FL algorithms for COVID-19 detection and medical image analysis in general.
[[2303.16181] Learning Federated Visual Prompt in Null Space for MRI Reconstruction](http://arxiv.org/abs/2303.16181) #federate
Federated Magnetic Resonance Imaging (MRI) reconstruction enables multiple hospitals to collaborate distributedly without aggregating local data, thereby protecting patient privacy. However, the data heterogeneity caused by different MRI protocols, insufficient local training data, and limited communication bandwidth inevitably impair global model convergence and updating. In this paper, we propose a new algorithm, FedPR, to learn federated visual prompts in the null space of global prompt for MRI reconstruction. FedPR is a new federated paradigm that adopts a powerful pre-trained model while only learning and communicating the prompts with few learnable parameters, thereby significantly reducing communication costs and achieving competitive performance on limited local data. Moreover, to deal with catastrophic forgetting caused by data heterogeneity, FedPR also updates efficient federated visual prompts that project the local prompts into an approximate null space of the global prompt, thereby suppressing the interference of gradients on the server performance. Extensive experiments on federated MRI show that FedPR significantly outperforms state-of-the-art FL algorithms with <6% of communication costs when given the limited amount of local training data.
[[2303.15986] Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks](http://arxiv.org/abs/2303.15986) #federate
There is a growing trend of cyberattacks against Internet of Things (IoT) devices; moreover, the sophistication and motivation of those attacks are increasing. The vast scale of IoT, its diverse hardware and software, and the typical placement of devices in uncontrolled environments make traditional IT security mechanisms such as signature-based intrusion detection and prevention systems challenging to integrate. They also struggle to cope with the rapidly evolving IoT threat landscape due to long delays between the analysis and publication of the detection rules. Machine learning methods have shown faster response to emerging threats; however, model training architectures like cloud or edge computing face multiple drawbacks in IoT settings, including network overhead and data isolation arising from the large scale and heterogeneity that characterize these networks.
This work presents an architecture for training unsupervised models for network intrusion detection in large, distributed IoT and Industrial IoT (IIoT) deployments. We leverage Federated Learning (FL) to collaboratively train between peers and reduce isolation and network overhead problems. We build upon it to include an unsupervised device clustering algorithm fully integrated into the FL pipeline to address the heterogeneity issues that arise in FL settings. The architecture is implemented and evaluated using a testbed that includes various emulated IoT/IIoT devices and attackers interacting in a complex network topology comprising 100 emulated devices, 30 switches and 10 routers. The anomaly detection models are evaluated on real attacks performed by the testbed's threat actors, including the entire Mirai malware lifecycle, an additional botnet based on the Merlin command and control server and other red-teaming tools performing scanning activities and multiple attacks targeting the emulated devices.
[[2303.15486] Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation](http://arxiv.org/abs/2303.15486) #federate
Multimodal learning has seen great success mining data features from multiple modalities with remarkable model performance improvement. Meanwhile, federated learning (FL) addresses the data sharing problem, enabling privacy-preserving collaborative training that gives access to sufficient valuable data. Great potential, therefore, arises from their confluence, known as multimodal federated learning. However, the predominant approaches are limited in that they often assume that each local dataset records samples from all modalities. In this paper, we aim to bridge this gap by proposing an Unimodal Training - Multimodal Prediction (UTMP) framework under the context of multimodal federated learning. We design HA-Fedformer, a novel transformer-based model that empowers unimodal training with only a unimodal dataset at the client and multimodal testing by aggregating multiple clients' knowledge for better accuracy. The key advantages are twofold. Firstly, to alleviate the impact of non-IID data, we develop an uncertainty-aware aggregation method for the local encoders with layer-wise Markov Chain Monte Carlo sampling. Secondly, to overcome the challenge of unaligned language sequences, we implement a cross-modal decoder aggregation to capture the hidden signal correlation between decoders trained by data from different modalities. Our experiments on popular sentiment analysis benchmarks, CMU-MOSI and CMU-MOSEI, demonstrate that HA-Fedformer significantly outperforms state-of-the-art multimodal models under the UTMP federated learning frameworks, with 15%-20% improvement on most attributes.
[[2303.15799] Fast Convergence Federated Learning with Aggregated Gradients](http://arxiv.org/abs/2303.15799) #federate
Federated Learning (FL) is a novel machine learning framework that enables multiple distributed devices to cooperatively train a shared model, scheduled by a central server, while keeping private data local. However, non-independent-and-identically-distributed (Non-IID) data samples and frequent communication among participants slow down the convergence rate and increase communication costs. To achieve fast convergence, we improve the local gradient descent approach in the conventional local update rule by introducing the aggregated gradients at each local update epoch, and propose an adaptive learning rate algorithm that further takes the deviation between the local parameter and the global parameter into consideration at each iteration. The above strategy requires all clients' local parameters and gradients at each local iteration, which is challenging as there is no communication during local update epochs. Accordingly, we use a mean field approach, introducing two mean field terms to estimate the average local parameters and gradients respectively, which does not require clients to exchange their private information with each other at each local update epoch. Numerical results show that our proposed framework is superior to state-of-the-art schemes in model accuracy and convergence rate on both IID and Non-IID datasets.
[[2303.16071] Edge Selection and Clustering for Federated Learning in Optical Inter-LEO Satellite Constellation](http://arxiv.org/abs/2303.16071) #federate
Low-Earth orbit (LEO) satellites have been widely deployed for various Earth observation missions due to their capability of collecting large amounts of image or sensor data. However, traditionally, the data training process is performed in the terrestrial cloud server, which leads to a high transmission overhead. With the recent development of LEO, it is more imperative to provide ultra-dense LEO constellations with enhanced on-board computation capability. Building on this, we propose collaborative federated learning over an LEO satellite constellation (FedLEO). We allocate the entire process to LEOs with low-payload inter-satellite transmissions, whilst the low-delay terrestrial gateway server (GS) only takes care of initial signal control. The GS initially selects an LEO server, whereas its LEO clients are all determined by a clustering mechanism and by communication capability through the optical inter-satellite links (ISLs). Re-clustering with a new LEO server is executed once the communication quality of FedLEO becomes low. In the simulations, we numerically analyze the proposed FedLEO under practical Walker-based LEO constellation configurations along with the MNIST training dataset for a classification mission. The proposed FedLEO outperforms the conventional centralized and distributed architectures with higher classification accuracy as well as comparably lower latency of joint communication and computing.
[[2303.15493] Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis](http://arxiv.org/abs/2303.15493) #fair
In this paper, we propose binary sparse convolutional networks called BSC-Net for efficient point cloud analysis. We empirically observe that the sparse convolution operation causes larger quantization errors than standard convolution. However, conventional network quantization methods directly binarize the weights and activations in sparse convolution, resulting in a performance drop due to the significant quantization loss. In contrast, we search for the optimal subset of convolution operations that activates the sparse convolution at various locations to alleviate quantization error, and the performance gap between real-valued and binary sparse convolutional networks is closed without complexity overhead. Specifically, we first present the shifted sparse convolution that fuses the information in the receptive field for the active sites that match the pre-defined positions. Then we employ differentiable search strategies to discover the optimal positions for active site matching in the shifted sparse convolution, and the quantization errors are significantly alleviated for efficient point cloud analysis. For fair evaluation of the proposed method, we empirically select recent advances that are beneficial for sparse convolution network binarization to construct a strong baseline. The experimental results on Scan-Net and NYU Depth v2 show that our BSC-Net achieves significant improvement upon our strong baseline and outperforms the state-of-the-art network binarization methods by a remarkable margin without additional computation overhead for binarizing sparse convolutional networks.
[[2303.15889] Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition](http://arxiv.org/abs/2303.15889) #fair
Demographic biases in source datasets have been shown to be one of the causes of unfairness and discrimination in the predictions of Machine Learning models. One of the most prominent types of demographic bias is statistical imbalance in the representation of demographic groups in the datasets. In this paper, we study the measurement of these biases by reviewing the existing metrics, including those that can be borrowed from other disciplines. We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics. To illustrate the utility of our framework, and to further understand the practical characteristics of the metrics, we conduct a case study of 20 datasets used in Facial Emotion Recognition (FER), analyzing the biases present in them. Our experimental results show that many metrics are redundant and that a reduced subset of metrics may be sufficient to measure the amount of demographic bias. The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models. The code is available at https://github.com/irisdominguez/dataset_bias_metrics.
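To make "statistical imbalance in the representation of demographic groups" concrete, the sketch below computes two commonly used representational measures from per-group sample counts: normalized Shannon evenness and the max/min imbalance ratio. The abstract does not name specific metrics, so treating these two as representative is an assumption, and the function names and example counts are hypothetical.

```python
import math


def shannon_evenness(counts):
    """Normalized Shannon entropy in [0, 1]; 1 means perfectly balanced groups."""
    if len(counts) < 2:
        return 1.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # empty groups contribute 0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


def imbalance_ratio(counts):
    """Largest group size divided by smallest (1 means balanced).

    Raises ZeroDivisionError if a group is empty; handling that case is left
    to the caller in this sketch.
    """
    return max(counts) / min(counts)


balanced = [50, 50]   # hypothetical group counts
skewed = [90, 10]
print(shannon_evenness(balanced), imbalance_ratio(balanced))
print(shannon_evenness(skewed), imbalance_ratio(skewed))
```

Both functions summarize the same count vector, which hints at why a study comparing many such metrics could find some of them redundant.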
[[2303.15697] Model and Evaluation: Towards Fairness in Multilingual Text Classification](http://arxiv.org/abs/2303.15697) #fair
Recently, more and more research has focused on addressing bias in text classification models. However, existing research mainly focuses on the fairness of monolingual text classification models, and research on fairness for multilingual text classification is still very limited. In this paper, we focus on the task of multilingual text classification and propose a debiasing framework for multilingual text classification based on contrastive learning. Our proposed method does not rely on any external language resources and can be extended to any other languages. The model contains four modules: multilingual text representation module, language fusion module, text debiasing module, and text classification module. The multilingual text representation module uses a multilingual pre-trained language model to represent the text, the language fusion module makes the semantic spaces of different languages tend to be consistent through contrastive learning, and the text debiasing module uses contrastive learning to make the model unable to identify sensitive attributes' information. The text classification module completes the basic tasks of multilingual text classification. In addition, the existing research on the fairness of multilingual text classification is relatively simple in the evaluation mode. The evaluation method of fairness is the same as the monolingual equality difference evaluation method, that is, the evaluation is performed on a single language. We propose a multi-dimensional fairness evaluation framework for multilingual text classification, which evaluates the model's monolingual equality difference, multilingual equality difference, multilingual equality performance difference, and destructiveness of the fairness strategy. We hope that our work can provide a more general debiasing method and a more comprehensive evaluation framework for multilingual text fairness tasks.
[[2303.15708] Bias or Diversity? Unraveling Semantic Discrepancy in U](http://arxiv.org/abs/2303.15708) #fair
There is a broad consensus that news media outlets incorporate ideological biases in their news articles. However, prior studies on measuring the discrepancies among media outlets and further dissecting the origins of semantic differences suffer from small sample sizes and limited scope. In this study, we collect a large dataset of 1.8 million news headlines from major U.S. media outlets spanning from 2014 to 2022 to thoroughly track and dissect the semantic discrepancy in U.S. news media. We employ multiple correspondence analysis (MCA) to quantify the semantic discrepancy relating to four prominent topics - domestic politics, economic issues, social issues, and foreign affairs. Additionally, we compare the most frequent n-grams in media headlines to provide further qualitative insights into our analysis. Our findings indicate that on domestic politics and social issues, the discrepancy can be attributed to a certain degree of media bias. Meanwhile, the discrepancy in reporting foreign affairs is largely attributed to the diversity in individual journalistic styles. Finally, U.S. media outlets show consistency and high similarity in their coverage of economic issues.
[[2303.15555] Object Discovery from Motion-Guided Tokens](http://arxiv.org/abs/2303.15555) #interpretability
Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guidance and mid-level feature tokenization. Although both have been separately investigated, we introduce a new transformer decoder showing that their benefits can compound thanks to motion-guided vector quantization. We show that our architecture effectively leverages the synergy between motion and tokenization, improving upon the state of the art on both synthetic and real datasets. Our approach enables the emergence of interpretable object-specific mid-level features, demonstrating the benefits of motion-guidance (no labeling) and quantization (interpretability, memory efficiency).
[[2303.15569] Core-Periphery Principle Guided Redesign of Self-Attention in Transformers](http://arxiv.org/abs/2303.15569) #interpretability
Designing more efficient, reliable, and explainable neural network architectures is critical to studies that are based on artificial intelligence (AI) techniques. Previous studies, by post-hoc analysis, have found that the best-performing ANNs surprisingly resemble biological neural networks (BNN), which indicates that ANNs and BNNs may share some common principles to achieve optimal performance in either machine learning or cognitive/behavior tasks. Inspired by this phenomenon, we proactively instill organizational principles of BNNs to guide the redesign of ANNs. We leverage the Core-Periphery (CP) organization, which is widely found in human brain networks, to guide the information communication mechanism in the self-attention of vision transformer (ViT) and name this novel framework as CP-ViT. In CP-ViT, the attention operation between nodes is defined by a sparse graph with a Core-Periphery structure (CP graph), where the core nodes are redesigned and reorganized to play an integrative role and serve as a center for other periphery nodes to exchange information. We evaluated the proposed CP-ViT on multiple public datasets, including medical image datasets (INbreast) and natural image datasets. Interestingly, by incorporating the BNN-derived principle (CP structure) into the redesign of ViT, our CP-ViT outperforms other state-of-the-art ANNs. In general, our work advances the state of the art in three aspects: 1) This work provides novel insights for brain-inspired AI: we can utilize the principles found in BNNs to guide and improve our ANN architecture design; 2) We show that there exist sweet spots of CP graphs that lead to CP-ViTs with significantly improved performance; and 3) The core nodes in CP-ViT correspond to task-related meaningful and important image patches, which can significantly enhance the interpretability of the trained deep model.
[[2303.15656] Predicting Adverse Neonatal Outcomes for Preterm Neonates with Multi-Task Learning](http://arxiv.org/abs/2303.15656) #interpretability
Diagnosis of adverse neonatal outcomes is crucial for preterm survival since it enables doctors to provide timely treatment. Machine learning (ML) algorithms have been demonstrated to be effective in predicting adverse neonatal outcomes. However, most previous ML-based methods have only focused on predicting a single outcome, ignoring the potential correlations between different outcomes, and potentially leading to suboptimal results and overfitting issues. In this work, we first analyze the correlations between three adverse neonatal outcomes and then formulate the diagnosis of multiple neonatal outcomes as a multi-task learning (MTL) problem. We then propose an MTL framework to jointly predict multiple adverse neonatal outcomes. In particular, the MTL framework contains shared hidden layers and multiple task-specific branches. Extensive experiments have been conducted using Electronic Health Records (EHRs) from 121 preterm neonates. Empirical results demonstrate the effectiveness of the MTL framework. Furthermore, the feature importance is analyzed for each neonatal outcome, providing insights into model interpretability.
[[2303.15827] PDExplain: Contextual Modeling of PDEs in the Wild](http://arxiv.org/abs/2303.15827) #explainability
We propose an explainable method for solving Partial Differential Equations by using a contextual scheme called PDExplain. During the training phase, our method is fed with data collected from an operator-defined family of PDEs accompanied by the general form of this family. In the inference phase, a minimal sample collected from a phenomenon is provided, where the sample is related to the PDE family but not necessarily to the set of specific PDEs seen in the training phase. We show how our algorithm can predict the PDE solution for future timesteps. Moreover, our method provides an explainable form of the PDE, a trait that can assist in modelling phenomena based on data in physical sciences. To verify our method, we conduct extensive experimentation, examining its quality both in terms of prediction error and explainability.
[[2303.15649] StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing](http://arxiv.org/abs/2303.15649) #diffusion
A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. Existing methods either finetune the model or invert the image in the latent space of the pretrained model. However, they suffer from two problems: (1) Unsatisfying results for selected regions, and unexpected changes in nonselected regions. (2) They require careful text prompt editing where the prompt should include all visual objects in the input image. To address these, we propose two improvements: (1) Optimizing only the input of the value linear network in the cross-attention layers is sufficiently powerful to reconstruct a real image. (2) We propose attention regularization to preserve the object-like attention maps after editing, enabling us to obtain accurate style editing without invoking significant structural changes. We further improve the editing technique which is used for the unconditional branch of classifier-free guidance, as well as the conditional one as used by P2P. Extensive experimental prompt-editing results on a variety of images demonstrate qualitatively and quantitatively that our method has editing capabilities superior to those of existing and concurrent works.
[[2303.15780] Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion](http://arxiv.org/abs/2303.15780) #diffusion
We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D. Our method is designed for a novel task, which is to convert a given 3D scene to another scene according to text instructions. Instruct 3D-to-3D applies pretrained Image-to-Image diffusion models for 3D-to-3D conversion. This enables the likelihood maximization of each viewpoint image and high-quality 3D generation. In addition, our proposed method explicitly inputs the source 3D scene as a condition, which enhances 3D consistency and controllability of how much of the source 3D scene structure is reflected. We also propose dynamic scaling, which allows the intensity of the geometry transformation to be adjusted. We performed quantitative and qualitative evaluations and showed that our proposed method achieves higher quality 3D-to-3D conversions than baseline methods.
[[2303.16187] Visual Chain-of-Thought Diffusion Models](http://arxiv.org/abs/2303.16187) #diffusion
Recent progress with conditional image diffusion models has been stunning, and this holds true whether we are speaking about models conditioned on a text description, a scene layout, or a sketch. Unconditional image diffusion models are also improving but lag behind, as do diffusion models which are conditioned on lower-dimensional features like class labels. We propose to close the gap between conditional and unconditional models using a two-stage sampling procedure. In the first stage we sample an embedding describing the semantic content of the image. In the second stage we sample the image conditioned on this embedding and then discard the embedding. Doing so lets us leverage the power of conditional diffusion models on the unconditional generation task, which we show improves FID by 25-50% compared to standard unconditional generation.
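The two-stage procedure described above can be pictured with a toy sketch: stage 1 draws a semantic embedding, stage 2 draws a sample conditioned on it, and only the sample is returned. Everything here is a hypothetical stand-in (2-D points instead of images, fixed cluster centres instead of learned embeddings, Gaussian noise instead of a diffusion sampler); it shows the control flow only.

```python
import random

# Hypothetical "embedding space": three cluster centres in 2-D.
CENTRES = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]


def sample_embedding(rng):
    """Stage 1: stand-in for a model that samples a semantic embedding."""
    return rng.choice(CENTRES)


def sample_image_given(embedding, rng, noise=0.1):
    """Stage 2: stand-in for a conditional model; jitters around the embedding."""
    return tuple(e + rng.gauss(0.0, noise) for e in embedding)


def sample_unconditional(rng):
    embedding = sample_embedding(rng)            # stage 1
    image = sample_image_given(embedding, rng)   # stage 2
    return image                                 # the embedding is discarded


rng = random.Random(0)
samples = [sample_unconditional(rng) for _ in range(5)]
print(samples)
```

From the caller's point of view `sample_unconditional` takes no conditioning input, which is what lets a conditional model be reused for the unconditional task.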
[[2303.16203] Your Diffusion Model is Secretly a Zero-Shot Classifier](http://arxiv.org/abs/2303.16203) #diffusion
The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation. In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. We also find that our diffusion-based approach has stronger multimodal relational reasoning abilities than competing contrastive approaches. Finally, we evaluate diffusion models trained on ImageNet and find that they approach the performance of SOTA discriminative classifiers trained on the same dataset, even with weak augmentations and no regularization. Results and visualizations at https://diffusion-classifier.github.io/
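One way to picture classification with a conditional diffusion model is to noise the input, ask the model to predict that noise under each candidate label, and pick the label with the lowest average prediction error. Using noise-prediction error as the score is an assumption here (the abstract only says "density estimates"), and the noise predictor below is a closed-form toy that pretends each class is concentrated at a single mean; all names and values are hypothetical.

```python
import math
import random

# Hypothetical class-conditional "data distributions": one point mass per label.
CLASS_MEANS = {"cat": (0.0, 0.0), "dog": (4.0, 0.0), "bird": (0.0, 4.0)}


def predict_noise(x_t, alpha_bar, label):
    """Toy stand-in for a trained conditional noise predictor.

    Assumes the clean sample for `label` equals its class mean, so the noise
    can be solved for directly from x_t = sqrt(a) * x0 + sqrt(1 - a) * eps.
    """
    mean = CLASS_MEANS[label]
    scale = math.sqrt(1.0 - alpha_bar)
    return tuple((xt - math.sqrt(alpha_bar) * m) / scale for xt, m in zip(x_t, mean))


def classify(x0, rng, num_trials=32):
    errors = {label: 0.0 for label in CLASS_MEANS}
    for _ in range(num_trials):
        alpha_bar = rng.uniform(0.1, 0.9)              # random noise level
        eps = tuple(rng.gauss(0.0, 1.0) for _ in x0)   # noise shared by all labels
        x_t = tuple(
            math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)
        )
        for label in CLASS_MEANS:
            eps_hat = predict_noise(x_t, alpha_bar, label)
            errors[label] += sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
    return min(errors, key=errors.get)   # label with the lowest total error


rng = random.Random(0)
print(classify((3.5, 0.2), rng))
```

In this toy setting the error for each label is proportional to the squared distance between the input and that label's mean, so the rule reduces to nearest-mean classification; a real model would replace `predict_noise` with a trained network.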
[[2303.15772] Ecosystem Graphs: The Social Footprint of Foundation Models](http://arxiv.org/abs/2303.15772) #diffusion
Foundation models (e.g. ChatGPT, StableDiffusion) pervasively influence society, warranting immediate social attention. While the models themselves garner much attention, to accurately characterize their impact, we must consider the broader sociotechnical ecosystem. We propose Ecosystem Graphs as a documentation framework to transparently centralize knowledge of this ecosystem. Ecosystem Graphs is composed of assets (datasets, models, applications) linked together by dependencies that indicate technical (e.g. how Bing relies on GPT-4) and social (e.g. how Microsoft relies on OpenAI) relationships. To supplement the graph structure, each asset is further enriched with fine-grained metadata (e.g. the license or training emissions). We document the ecosystem extensively at https://crfm.stanford.edu/ecosystem-graphs/. As of March 16, 2023, we annotate 262 assets (64 datasets, 128 models, 70 applications) from 63 organizations linked by 356 dependencies. We show Ecosystem Graphs functions as a powerful abstraction and interface for achieving the minimum transparency required to address myriad use cases. Therefore, we envision Ecosystem Graphs will be a community-maintained resource that provides value to stakeholders spanning AI researchers, industry professionals, social scientists, auditors and policymakers.
[[2303.16169] Diffusion Maps for Group-Invariant Manifolds](http://arxiv.org/abs/2303.16169) #diffusion
In this article, we consider the manifold learning problem when the data set is invariant under the action of a compact Lie group $K$. Our approach consists in augmenting the data-induced graph Laplacian by integrating over orbits under the action of $K$ of the existing data points. We prove that this $K$-invariant Laplacian operator $L$ can be diagonalized by using the unitary irreducible representation matrices of $K$, and we provide an explicit formula for computing the eigenvalues and eigenvectors of $L$. Moreover, we show that the normalized Laplacian operator $L_N$ converges to the Laplace-Beltrami operator of the data manifold with an improved convergence rate, where the improvement grows with the dimension of the symmetry group $K$. This work extends the steerable graph Laplacian framework of Landa and Shkolnisky from the case of $\operatorname{SO}(2)$ to arbitrary compact Lie groups.
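As a sketch of what "integrating over orbits" could look like, assume a Gaussian affinity with bandwidth $\varepsilon$ (both the kernel choice and the symbols $W$, $\widetilde{W}$, $\varepsilon$, $\mu$ are assumptions, not taken from the abstract). The standard affinity compares data points directly, while a $K$-invariant affinity averages over the orbit of one of the points with respect to the Haar measure $\mu$ on $K$:

$$
W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{\varepsilon}\right),
\qquad
\widetilde{W}_{ij} = \int_K \exp\!\left(-\frac{\lVert x_i - k \cdot x_j \rVert^2}{\varepsilon}\right) d\mu(k).
$$

Under this reading, each data point effectively contributes its whole orbit $\{k \cdot x_j : k \in K\}$ to the graph, which gives an intuition for why the convergence improvement would grow with the dimension of $K$.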