secure

security

Title: Detecting Out-of-distribution Examples via Class-conditional Impressions Reappearing. (arXiv:2303.09746v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09746
Code URL: null
Copy Paste: [[2303.09746] Detecting Out-of-distribution Examples via Class-conditional Impressions Reappearing](http://arxiv.org/abs/2303.09746) #security
Summary:
Out-of-distribution (OOD) detection aims at enhancing standard deep neural networks to distinguish anomalous inputs from original training data. Previous progress has introduced various approaches where the in-distribution training data and even several OOD examples are prerequisites. However, due to privacy and security, auxiliary data tends to be impractical in a real-world scenario. In this paper, we propose a data-free method without training on natural data, called Class-Conditional Impressions Reappearing (C2IR), which utilizes image impressions from the fixed model to recover class-conditional feature statistics. Based on that, we introduce Integral Probability Metrics to estimate layer-wise class-conditional deviations and obtain layer weights by Measuring Gradient-based Importance (MGI). The experiments verify the effectiveness of our method and indicate that C2IR outperforms other post-hoc methods and reaches comparable performance to the full access (ID and OOD) detection method, especially in the far-OOD dataset (SVHN).

privacy

Title: MODIFY: Model-driven Face Stylization without Style Images. (arXiv:2303.09831v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09831
Code URL: null
Copy Paste: [[2303.09831] MODIFY: Model-driven Face Stylization without Style Images](http://arxiv.org/abs/2303.09831) #privacy
Summary:
Existing face stylization methods always acquire the presence of the target (style) domain during the translation process, which violates privacy regulations and limits their applicability in real-world systems. To address this issue, we propose a new method called MODel-drIven Face stYlization (MODIFY), which relies on the generative model to bypass the dependence of the target images. Briefly, MODIFY first trains a generative model in the target domain and then translates a source input to the target domain via the provided style model. To preserve the multimodal style information, MODIFY further introduces an additional remapping network, mapping a known continuous distribution into the encoder's embedding space. During translation in the source domain, MODIFY fine-tunes the encoder module within the target style-persevering model to capture the content of the source input as precisely as possible. Our method is extremely simple and satisfies versatile training modes for face stylization. Experimental results on several different datasets validate the effectiveness of MODIFY for unsupervised face stylization.

Title: Rehearsal-Free Domain Continual Face Anti-Spoofing: Generalize More and Forget Less. (arXiv:2303.09914v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09914
Code URL: null
Copy Paste: [[2303.09914] Rehearsal-Free Domain Continual Face Anti-Spoofing: Generalize More and Forget Less](http://arxiv.org/abs/2303.09914) #privacy
Summary:
Face Anti-Spoofing (FAS) is recently studied under the continual learning setting, where the FAS models are expected to evolve after encountering the data from new domains. However, existing methods need extra replay buffers to store previous data for rehearsal, which becomes infeasible when previous data is unavailable because of privacy issues. In this paper, we propose the first rehearsal-free method for Domain Continual Learning (DCL) of FAS, which deals with catastrophic forgetting and unseen domain generalization problems simultaneously. For better generalization to unseen domains, we design the Dynamic Central Difference Convolutional Adapter (DCDCA) to adapt Vision Transformer (ViT) models during the continual learning sessions. To alleviate the forgetting of previous domains without using previous data, we propose the Proxy Prototype Contrastive Regularization (PPCR) to constrain the continual learning with previous domain knowledge from the proxy prototypes. Simulate practical DCL scenarios, we devise two new protocols which evaluate both generalization and anti-forgetting performance. Extensive experimental results show that our proposed method can improve the generalization performance in unseen domains and alleviate the catastrophic forgetting of the previous knowledge. The codes and protocols will be released soon.

Title: Privacy-preserving Pedestrian Tracking using Distributed 3D LiDARs. (arXiv:2303.09915v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09915
Code URL: null
Copy Paste: [[2303.09915] Privacy-preserving Pedestrian Tracking using Distributed 3D LiDARs](http://arxiv.org/abs/2303.09915) #privacy
Summary:
The growing demand for intelligent environments unleashes an extraordinary cycle of privacy-aware applications that makes individuals' life more comfortable and safe. Examples of these applications include pedestrian tracking systems in large areas. Although the ubiquity of camera-based systems, they are not a preferable solution due to the vulnerability of leaking the privacy of pedestrians.In this paper, we introduce a novel privacy-preserving system for pedestrian tracking in smart environments using multiple distributed LiDARs of non-overlapping views. The system is designed to leverage LiDAR devices to track pedestrians in partially covered areas due to practical constraints, e.g., occlusion or cost. Therefore, the system uses the point cloud captured by different LiDARs to extract discriminative features that are used to train a metric learning model for pedestrian matching purposes. To boost the system's robustness, we leverage a probabilistic approach to model and adapt the dynamic mobility patterns of individuals and thus connect their sub-trajectories.We deployed the system in a large-scale testbed with 70 colorless LiDARs and conducted three different experiments. The evaluation result at the entrance hall confirms the system's ability to accurately track the pedestrians with a 0.98 F-measure even with zero-covered areas. This result highlights the promise of the proposed system as the next generation of privacy-preserving tracking means in smart environments.

protect

Title: High Accurate and Explainable Multi-Pill Detection Framework with Graph Neural Network-Assisted Multimodal Data Fusion. (arXiv:2303.09782v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09782
Code URL: null
Copy Paste: [[2303.09782] High Accurate and Explainable Multi-Pill Detection Framework with Graph Neural Network-Assisted Multimodal Data Fusion](http://arxiv.org/abs/2303.09782) #protect
Summary:
Due to the significant resemblance in visual appearance, pill misuse is prevalent and has become a critical issue, responsible for one-third of all deaths worldwide. Pill identification, thus, is a crucial concern needed to be investigated thoroughly. Recently, several attempts have been made to exploit deep learning to tackle the pill identification problem. However, most published works consider only single-pill identification and fail to distinguish hard samples with identical appearances. Also, most existing pill image datasets only feature single pill images captured in carefully controlled environments under ideal lighting conditions and clean backgrounds. In this work, we are the first to tackle the multi-pill detection problem in real-world settings, aiming at localizing and identifying pills captured by users in a pill intake. Moreover, we also introduce a multi-pill image dataset taken in unconstrained conditions. To handle hard samples, we propose a novel method for constructing heterogeneous a priori graphs incorporating three forms of inter-pill relationships, including co-occurrence likelihood, relative size, and visual semantic correlation. We then offer a framework for integrating a priori with pills' visual features to enhance detection accuracy. Our experimental results have proved the robustness, reliability, and explainability of the proposed framework. Experimentally, it outperforms all detection benchmarks in terms of all evaluation metrics. Specifically, our proposed framework improves COCO mAP metrics by 9.4% over Faster R-CNN and 12.0% compared to vanilla YOLOv5. Our study opens up new opportunities for protecting patients from medication errors using an AI-based pill identification solution.

Title: Exorcising ''Wraith'': Protecting LiDAR-based Object Detector in Automated Driving System from Appearing Attacks. (arXiv:2303.09731v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2303.09731
Code URL: null
Copy Paste: [[2303.09731] Exorcising ''Wraith'': Protecting LiDAR-based Object Detector in Automated Driving System from Appearing Attacks](http://arxiv.org/abs/2303.09731) #protect
Summary:
Automated driving systems rely on 3D object detectors to recognize possible obstacles from LiDAR point clouds. However, recent works show the adversary can forge non-existent cars in the prediction results with a few fake points (i.e., appearing attack). By removing statistical outliers, existing defenses are however designed for specific attacks or biased by predefined heuristic rules. Towards more comprehensive mitigation, we first systematically inspect the mechanism of recent appearing attacks: Their common weaknesses are observed in crafting fake obstacles which (i) have obvious differences in the local parts compared with real obstacles and (ii) violate the physical relation between depth and point density. In this paper, we propose a novel plug-and-play defensive module which works by side of a trained LiDAR-based object detector to eliminate forged obstacles where a major proportion of local parts have low objectness, i.e., to what degree it belongs to a real object. At the core of our module is a local objectness predictor, which explicitly incorporates the depth information to model the relation between depth and point density, and predicts each local part of an obstacle with an objectness score. Extensive experiments show, our proposed defense eliminates at least 70% cars forged by three known appearing attacks in most cases, while, for the best previous defense, less than 30% forged cars are eliminated. Meanwhile, under the same circumstance, our defense incurs less overhead for AP/precision on cars compared with existing defenses. Furthermore, We validate the effectiveness of our proposed defense on simulation-based closed-loop control driving tests in the open-source system of Baidu's Apollo.

defense

Title: Moving Target Defense for Service-oriented Mission-critical Networks. (arXiv:2303.09893v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2303.09893
Code URL: null
Copy Paste: [[2303.09893] Moving Target Defense for Service-oriented Mission-critical Networks](http://arxiv.org/abs/2303.09893) #defense
Summary:
Modern mission-critical systems (MCS) are increasingly softwarized and interconnected. As a result, their complexity increased, and so their vulnerability against cyber-attacks. The current adoption of virtualization and service-oriented architectures (SOA) in MCSs provides additional flexibility that can be leveraged to withstand and mitigate attacks, e.g., by moving critical services or data flows. This enables the deployment of strategies for moving target defense (MTD), which allows stripping attackers of their asymmetric advantage from the long reconnaissance of MCSs. However, it is challenging to design MTD strategies, given the diverse threat landscape, resource limitations, and potential degradation in service availability. In this paper, we combine two optimization models to explore feasible service configurations for SOA-based systems and to derive subsequent MTD actions with their time schedule based on an attacker-defender game. Our results indicate that even for challenging and diverse attack scenarios, our models can defend the system by up to 90% of the system operation time with a limited MTD defender budget.

attack

Title: Adversarial Counterfactual Visual Explanations. (arXiv:2303.09962v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09962
Code URL: null
Copy Paste: [[2303.09962] Adversarial Counterfactual Visual Explanations](http://arxiv.org/abs/2303.09962) #attack
Summary:
Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations regardless of their characteristics. Yet, adversarial attacks cannot be used directly in a counterfactual explanation perspective, as such perturbations are perceived as noise and not as actionable and understandable image modifications. Building on the robust learning literature, this paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations, without modifying the classifiers to explain. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model to polish them. This allows studying the target model regardless of its robustification level. Extensive experimentation shows the advantages of our counterfactual explanation approach over current State-of-the-Art in multiple testbeds.

Title: Fuzziness-tuned: Improving the Transferability of Adversarial Examples. (arXiv:2303.10078v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.10078
Code URL: null
Copy Paste: [[2303.10078] Fuzziness-tuned: Improving the Transferability of Adversarial Examples](http://arxiv.org/abs/2303.10078) #attack
Summary:
With the development of adversarial attacks, adversairal examples have been widely used to enhance the robustness of the training models on deep neural networks. Although considerable efforts of adversarial attacks on improving the transferability of adversarial examples have been developed, the attack success rate of the transfer-based attacks on the surrogate model is much higher than that on victim model under the low attack strength (e.g., the attack strength $\epsilon=8/255$). In this paper, we first systematically investigated this issue and found that the enormous difference of attack success rates between the surrogate model and victim model is caused by the existence of a special area (known as fuzzy domain in our paper), in which the adversarial examples in the area are classified wrongly by the surrogate model while correctly by the victim model. Then, to eliminate such enormous difference of attack success rates for improving the transferability of generated adversarial examples, a fuzziness-tuned method consisting of confidence scaling mechanism and temperature scaling mechanism is proposed to ensure the generated adversarial examples can effectively skip out of the fuzzy domain. The confidence scaling mechanism and the temperature scaling mechanism can collaboratively tune the fuzziness of the generated adversarial examples through adjusting the gradient descent weight of fuzziness and stabilizing the update direction, respectively. Specifically, the proposed fuzziness-tuned method can be effectively integrated with existing adversarial attacks to further improve the transferability of adverarial examples without changing the time complexity. Extensive experiments demonstrated that fuzziness-tuned method can effectively enhance the transferability of adversarial examples in the latest transfer-based attacks.

robust

Title: Rt-Track: Robust Tricks for Multi-Pedestrian Tracking. (arXiv:2303.09668v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09668
Code URL: null
Copy Paste: [[2303.09668] Rt-Track: Robust Tricks for Multi-Pedestrian Tracking](http://arxiv.org/abs/2303.09668) #robust
Summary:
Object tracking is divided into single-object tracking (SOT) and multi-object tracking (MOT). MOT aims to maintain the identities of multiple objects across a series of continuous video sequences. In recent years, MOT has made rapid progress. However, modeling the motion and appearance models of objects in complex scenes still faces various challenging issues. In this paper, we design a novel direction consistency method for smooth trajectory prediction (STP-DC) to increase the modeling of motion information and overcome the lack of robustness in previous methods in complex scenes. Existing methods use pedestrian re-identification (Re-ID) to model appearance, however, they extract more background information which lacks discriminability in occlusion and crowded scenes. We propose a hyper-grain feature embedding network (HG-FEN) to enhance the modeling of appearance models, thus generating robust appearance descriptors. We also proposed other robustness techniques, including CF-ECM for storing robust appearance information and SK-AS for improving association accuracy. To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques. It achieves 79.5 MOTA, 76.0 IDF1 and 62.1 HOTA on the test set of MOT17.Rt-track also achieves 77.9 MOTA, 78.4 IDF1 and 63.3 HOTA on MOT20, surpassing all published methods.

Title: Instance-Conditioned GAN Data Augmentation for Representation Learning. (arXiv:2303.09677v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09677
Code URL: null
Copy Paste: [[2303.09677] Instance-Conditioned GAN Data Augmentation for Representation Learning](http://arxiv.org/abs/2303.09677) #robust
Summary:
Data augmentation has become a crucial component to train state-of-the-art visual representation models. However, handcrafting combinations of transformations that lead to improved performances is a laborious task, which can result in visually unrealistic samples. To overcome these limitations, recent works have explored the use of generative models as learnable data augmentation tools, showing promising results in narrow application domains, e.g., few-shot learning and low-data medical imaging. In this paper, we introduce a data augmentation module, called DA_IC-GAN, which leverages instance-conditioned GAN generations and can be used off-the-shelf in conjunction with most state-of-the-art training recipes. We showcase the benefits of DA_IC-GAN by plugging it out-of-the-box into the supervised training of ResNets and DeiT models on the ImageNet dataset, and achieving accuracy boosts up to between 1%p and 2%p with the highest capacity models. Moreover, the learnt representations are shown to be more robust than the baselines when transferred to a handful of out-of-distribution datasets, and exhibit increased invariance to variations of instance and viewpoints. We additionally couple DA_IC-GAN with a self-supervised training recipe and show that we can also achieve an improvement of 1%p in accuracy in some settings. With this work, we strengthen the evidence on the potential of learnable data augmentations to improve visual representation learning, paving the road towards non-handcrafted augmentations in model training.

Title: GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates. (arXiv:2303.09800v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09800
Code URL: null
Copy Paste: [[2303.09800] GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates](http://arxiv.org/abs/2303.09800) #robust
Summary:
3D object detection serves as the core basis of the perception tasks in autonomous driving. Recent years have seen the rapid progress of multi-modal fusion strategies for more robust and accurate 3D object detection. However, current researches for robust fusion are all learning-based frameworks, which demand a large amount of training data and are inconvenient to implement in new scenes. In this paper, we propose GOOD, a general optimization-based fusion framework that can achieve satisfying detection without training additional models and is available for any combinations of 2D and 3D detectors to improve the accuracy and robustness of 3D detection. First we apply the mutual-sided nearest-neighbor probability model to achieve the 3D-2D data association. Then we design an optimization pipeline that can optimize different kinds of instances separately based on the matching result. Apart from this, the 3D MOT method is also introduced to enhance the performance aided by previous frames. To the best of our knowledge, this is the first optimization-based late fusion framework for multi-modal 3D object detection which can be served as a baseline for subsequent research. Experiments on both nuScenes and KITTI datasets are carried out and the results show that GOOD outperforms by 9.1\% on mAP score compared with PointPillars and achieves competitive results with the learning-based late fusion CLOCs.

Title: Prototype Knowledge Distillation for Medical Segmentation with Missing Modality. (arXiv:2303.09830v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09830
Code URL: null
Copy Paste: [[2303.09830] Prototype Knowledge Distillation for Medical Segmentation with Missing Modality](http://arxiv.org/abs/2303.09830) #robust
Summary:
Multi-modality medical imaging is crucial in clinical treatment as it can provide complementary information for medical image segmentation. However, collecting multi-modal data in clinical is difficult due to the limitation of the scan time and other clinical situations. As such, it is clinically meaningful to develop an image segmentation paradigm to handle this missing modality problem. In this paper, we propose a prototype knowledge distillation (ProtoKD) method to tackle the challenging problem, especially for the toughest scenario when only single modal data can be accessed. Specifically, our ProtoKD can not only distillate the pixel-wise knowledge of multi-modality data to single-modality data but also transfer intra-class and inter-class feature variations, such that the student model could learn more robust feature representation from the teacher model and inference with only one single modality data. Our method achieves state-of-the-art performance on BraTS benchmark.

Title: Robust Semi-Supervised Learning for Histopathology Images through Self-Supervision Guided Out-of-Distribution Scoring. (arXiv:2303.09930v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09930
Code URL: null
Copy Paste: [[2303.09930] Robust Semi-Supervised Learning for Histopathology Images through Self-Supervision Guided Out-of-Distribution Scoring](http://arxiv.org/abs/2303.09930) #robust
Summary:
Semi-supervised learning (semi-SL) is a promising alternative to supervised learning for medical image analysis when obtaining good quality supervision for medical imaging is difficult. However, semi-SL assumes that the underlying distribution of unaudited data matches that of the few labeled samples, which is often violated in practical settings, particularly in medical images. The presence of out-of-distribution (OOD) samples in the unlabeled training pool of semi-SL is inevitable and can reduce the efficiency of the algorithm. Common preprocessing methods to filter out outlier samples may not be suitable for medical images that involve a wide range of anatomical structures and rare morphologies. In this paper, we propose a novel pipeline for addressing open-set supervised learning challenges in digital histology images. Our pipeline efficiently estimates an OOD score for each unlabelled data point based on self-supervised learning to calibrate the knowledge needed for a subsequent semi-SL framework. The outlier score derived from the OOD detector is used to modulate sample selection for the subsequent semi-SL stage, ensuring that samples conforming to the distribution of the few labeled samples are more frequently exposed to the subsequent semi-SL framework. Our framework is compatible with any semi-SL framework, and we base our experiments on the popular Mixmatch semi-SL framework. We conduct extensive studies on two digital pathology datasets, Kather colorectal histology dataset and a dataset derived from TCGA-BRCA whole slide images, and establish the effectiveness of our method by comparing with popular methods and frameworks in semi-SL algorithms through various experiments.

Title: Deep Graph-based Spatial Consistency for Robust Non-rigid Point Cloud Registration. (arXiv:2303.09950v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09950
Code URL: https://github.com/qinzheng93/graphscnet
Copy Paste: [[2303.09950] Deep Graph-based Spatial Consistency for Robust Non-rigid Point Cloud Registration](http://arxiv.org/abs/2303.09950) #robust
Summary:
We study the problem of outlier correspondence pruning for non-rigid point cloud registration. In rigid registration, spatial consistency has been a commonly used criterion to discriminate outliers from inliers. It measures the compatibility of two correspondences by the discrepancy between the respective distances in two point clouds. However, spatial consistency no longer holds in non-rigid cases and outlier rejection for non-rigid registration has not been well studied. In this work, we propose Graph-based Spatial Consistency Network (GraphSCNet) to filter outliers for non-rigid registration. Our method is based on the fact that non-rigid deformations are usually locally rigid, or local shape preserving. We first design a local spatial consistency measure over the deformation graph of the point cloud, which evaluates the spatial compatibility only between the correspondences in the vicinity of a graph node. An attention-based non-rigid correspondence embedding module is then devised to learn a robust representation of non-rigid correspondences from local spatial consistency. Despite its simplicity, GraphSCNet effectively improves the quality of the putative correspondences and attains state-of-the-art performance on three challenging benchmarks. Our code and models are available at https://github.com/qinzheng93/GraphSCNet.

Title: Uncertainty-informed Mutual Learning for Joint Medical Image Classification and Segmentation. (arXiv:2303.10049v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.10049
Code URL: null
Copy Paste: [[2303.10049] Uncertainty-informed Mutual Learning for Joint Medical Image Classification and Segmentation](http://arxiv.org/abs/2303.10049) #robust
Summary:
Classification and segmentation are crucial in medical image analysis as they enable accurate diagnosis and disease monitoring. However, current methods often prioritize the mutual learning features and shared model parameters, while neglecting the reliability of features and performances. In this paper, we propose a novel Uncertainty-informed Mutual Learning (UML) framework for reliable and interpretable medical image analysis. Our UML introduces reliability to joint classification and segmentation tasks, leveraging mutual learning with uncertainty to improve performance. To achieve this, we first use evidential deep learning to provide image-level and pixel-wise confidences. Then, an Uncertainty Navigator Decoder is constructed for better using mutual features and generating segmentation results. Besides, an Uncertainty Instructor is proposed to screen reliable masks for classification. Overall, UML could produce confidence estimation in features and performance for each link (classification and segmentation). The experiments on the public datasets demonstrate that our UML outperforms existing methods in terms of both accuracy and robustness. Our UML has the potential to explore the development of more reliable and explainable medical image analysis models. We will release the codes for reproduction after acceptance.

Title: Refinement for Absolute Pose Regression with Neural Feature Synthesis. (arXiv:2303.10087v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.10087
Code URL: null
Copy Paste: [[2303.10087] Refinement for Absolute Pose Regression with Neural Feature Synthesis](http://arxiv.org/abs/2303.10087) #robust
Summary:
Absolute Pose Regression (APR) methods use deep neural networks to directly regress camera poses from RGB images. Despite their advantages in inference speed and simplicity, these methods still fall short of the accuracy achieved by geometry-based techniques. To address this issue, we propose a new model called the Neural Feature Synthesizer (NeFeS). Our approach encodes 3D geometric features during training and renders dense novel view features at test time to refine estimated camera poses from arbitrary APR methods. Unlike previous APR works that require additional unlabeled training data, our method leverages implicit geometric constraints during test time using a robust feature field. To enhance the robustness of our NeFeS network, we introduce a feature fusion module and a progressive training strategy. Our proposed method improves the state-of-the-art single-image APR accuracy by as much as 54.9% on indoor and outdoor benchmark datasets without additional time-consuming unlabeled data training.

Title: DORIC : Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing. (arXiv:2303.09827v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.09827
Code URL: null
Copy Paste: [[2303.09827] DORIC : Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing](http://arxiv.org/abs/2303.09827) #robust
Summary:
We present our work on Track 2 in the Dialog System Technology Challenges 11 (DSTC11). DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction. In the absence of in-domain training dataset, robust utterance representation that can be used across domains is necessary to induce users' intentions. To achieve this, we leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs to remove the artifacts of unnecessary information. Furthermore, we devised the method that generates each cluster's name for the explainability of clustered results. Our approach achieved 3rd place in the precision score and showed superior accuracy and normalized mutual information (NMI) score than the baseline model on various domain datasets.

Title: More Robust Schema-Guided Dialogue State Tracking via Tree-Based Paraphrase Ranking. (arXiv:2303.09905v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.09905
Code URL: null
Copy Paste: [[2303.09905] More Robust Schema-Guided Dialogue State Tracking via Tree-Based Paraphrase Ranking](http://arxiv.org/abs/2303.09905) #robust
Summary:
The schema-guided paradigm overcomes scalability issues inherent in building task-oriented dialogue (TOD) agents with static ontologies. Instead of operating on dialogue context alone, agents have access to hierarchical schemas containing task-relevant natural language descriptions. Fine-tuned language models excel at schema-guided dialogue state tracking (DST) but are sensitive to the writing style of the schemas. We explore methods for improving the robustness of DST models. We propose a framework for generating synthetic schemas which uses tree-based ranking to jointly optimise lexical diversity and semantic faithfulness. The generalisation of strong baselines is improved when augmenting their training data with prompts generated by our framework, as demonstrated by marked improvements in average joint goal accuracy (JGA) and schema sensitivity (SS) on the SGD-X benchmark.

Title: It Is All About Data: A Survey on the Effects of Data on Adversarial Robustness. (arXiv:2303.09767v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09767
Code URL: null
Copy Paste: [[2303.09767] It Is All About Data: A Survey on the Effects of Data on Adversarial Robustness](http://arxiv.org/abs/2303.09767) #robust
Summary:
Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to confuse the model into making a mistake. Such examples pose a serious threat to the applicability of machine-learning-based systems, especially in life- and safety-critical domains. To address this problem, the area of adversarial robustness investigates mechanisms behind adversarial attacks and defenses against these attacks. This survey reviews literature that focuses on the effects of data used by a model on the model's adversarial robustness. It systematically identifies and summarizes the state-of-the-art research in this area and further discusses gaps of knowledge and promising future research directions.

Title: SE-GSL: A General and Effective Graph Structure Learning Framework through Structural Entropy Optimization. (arXiv:2303.09778v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09778
Code URL: https://github.com/ringbdstack/se-gsl
Copy Paste: [[2303.09778] SE-GSL: A General and Effective Graph Structure Learning Framework through Structural Entropy Optimization](http://arxiv.org/abs/2303.09778) #robust
Summary:
Graph Neural Networks (GNNs) are de facto solutions to structural data learning. However, it is susceptible to low-quality and unreliable structure, which has been a norm rather than an exception in real-world graphs. Existing graph structure learning (GSL) frameworks still lack robustness and interpretability. This paper proposes a general GSL framework, SE-GSL, through structural entropy and the graph hierarchy abstracted in the encoding tree. Particularly, we exploit the one-dimensional structural entropy to maximize embedded information content when auxiliary neighbourhood attributes are fused to enhance the original graph. A new scheme of constructing optimal encoding trees is proposed to minimize the uncertainty and noises in the graph whilst assuring proper community partition in hierarchical abstraction. We present a novel sample-based mechanism for restoring the graph structure via node structural entropy distribution. It increases the connectivity among nodes with larger uncertainty in lower-level communities. SE-GSL is compatible with various GNN models and enhances the robustness towards noisy and heterophily structures. Extensive experiments show significant improvements in the effectiveness and robustness of structure learning and node representation learning.

Title: GADFormer: An Attention-based Model for Group Anomaly Detection on Trajectories. (arXiv:2303.09841v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09841
Code URL: null
Copy Paste: [[2303.09841] GADFormer: An Attention-based Model for Group Anomaly Detection on Trajectories](http://arxiv.org/abs/2303.09841) #robust
Summary:
Group Anomaly Detection (GAD) reveals anomalous behavior among groups consisting of multiple member instances, which are, individually considered, not necessarily anomalous. This task is of major importance across multiple disciplines, in which also sequences like trajectories can be considered as a group. However, with increasing amount and heterogenity of group members, actual abnormal groups get harder to detect, especially in an unsupervised or semi-supervised setting. Recurrent Neural Networks are well established deep sequence models, but recent works have shown that their performance can decrease with increasing sequence lengths. Hence, we introduce with this paper GADFormer, a GAD specific BERT architecture, capable to perform attention-based Group Anomaly Detection on trajectories in an unsupervised and semi-supervised setting. We show formally and experimentally how trajectory outlier detection can be realized as an attention-based Group Anomaly Detection problem. Furthermore, we introduce a Block Attention-anomaly Score (BAS) to improve the interpretability of transformer encoder blocks for GAD. In addition to that, synthetic trajectory generation allows us to optimize the training for domain-specific GAD. In extensive experiments we investigate our approach versus GRU in their robustness for trajectory noise and novelties on synthetic and real world datasets.

Title: An evaluation framework for dimensionality reduction through sectional curvature. (arXiv:2303.09909v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09909
Code URL: null
Copy Paste: [[2303.09909] An evaluation framework for dimensionality reduction through sectional curvature](http://arxiv.org/abs/2303.09909) #robust
Summary:
Unsupervised machine learning lacks ground truth by definition. This poses a major difficulty when designing metrics to evaluate the performance of such algorithms. In sharp contrast with supervised learning, for which plenty of quality metrics have been studied in the literature, in the field of dimensionality reduction only a few over-simplistic metrics has been proposed. In this work, we aim to introduce the first highly non-trivial dimensionality reduction performance metric. This metric is based on the sectional curvature behaviour arising from Riemannian geometry. To test its feasibility, this metric has been used to evaluate the performance of the most commonly used dimension reduction algorithms in the state of the art. Furthermore, to make the evaluation of the algorithms robust and representative, using curvature properties of planar curves, a new parameterized problem instance generator has been constructed in the form of a function generator. Experimental results are consistent with what could be expected based on the design and characteristics of the evaluated algorithms and the features of the data instances used to feed the method.

Title: Finding Competence Regions in Domain Generalization. (arXiv:2303.09989v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09989
Code URL: null
Copy Paste: [[2303.09989] Finding Competence Regions in Domain Generalization](http://arxiv.org/abs/2303.09989) #robust
Summary:
We propose a "learning to reject" framework to address the problem of silent failures in Domain Generalization (DG), where the test distribution differs from the training distribution. Assuming a mild distribution shift, we wish to accept out-of-distribution (OOD) data whenever a model's estimated competence foresees trustworthy responses, instead of rejecting OOD data outright. Trustworthiness is then predicted via a proxy incompetence score that is tightly linked to the performance of a classifier. We present a comprehensive experimental evaluation of incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, leading to significant improvements of the average accuracy below a suitable incompetence threshold. However, the scores are not yet good enough to allow for a favorable accuracy/rejection trade-off in all tested domains. Surprisingly, our results also indicate that classifiers optimized for DG robustness do not outperform a naive Empirical Risk Minimization (ERM) baseline in the competence region, that is, where test samples elicit low incompetence scores.

Title: Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting. (arXiv:2303.10144v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.10144
Code URL: https://github.com/nicolinho/dutd
Copy Paste: [[2303.10144] Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting](http://arxiv.org/abs/2303.10144) #robust
Summary:
Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning. However, in reinforcement learning, even for supervised sub-problems such as world model learning, early stopping is not applicable as the dataset is continually evolving. As a solution, we propose a new general method that dynamically adjusts the update to data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training. We apply our method to DreamerV2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the DeepMind Control Suite and the Atari $100$k benchmark. The results demonstrate that one can better balance under- and overestimation by adjusting the UTD ratio with our approach compared to the default setting in DreamerV2 and that it is competitive with an extensive hyperparameter search which is not feasible for many applications. Our method eliminates the need to set the UTD hyperparameter by hand and even leads to a higher robustness with regard to other learning-related hyperparameters further reducing the amount of necessary tuning.

biometric

steal

extraction

Title: Exploring Sparse Visual Prompt for Cross-domain Semantic Segmentation. (arXiv:2303.09792v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09792
Code URL: https://github.com/iccv2595/svdp
Copy Paste: [[2303.09792] Exploring Sparse Visual Prompt for Cross-domain Semantic Segmentation](http://arxiv.org/abs/2303.09792) #extraction
Summary:
Visual Domain Prompts (VDP) have shown promising potential in addressing visual cross-domain problems. Existing methods adopt VDP in classification domain adaptation (DA), such as tuning image-level or feature-level prompts for target domains. Since the previous dense prompts are opaque and mask out continuous spatial details in the prompt regions, it will suffer from inaccurate contextual information extraction and insufficient domain-specific feature transferring when dealing with the dense prediction (i.e. semantic segmentation) DA problems. Therefore, we propose a novel Sparse Visual Domain Prompts (SVDP) approach tailored for addressing domain shift problems in semantic segmentation, which holds minimal discrete trainable parameters (e.g. 10\%) of the prompt and reserves more spatial information. To better apply SVDP, we propose Domain Prompt Placement (DPP) method to adaptively distribute several SVDP on regions with large data distribution distance based on uncertainty guidance. It aims to extract more local domain-specific knowledge and realizes efficient cross-domain learning. Furthermore, we design a Domain Prompt Updating (DPU) method to optimize prompt parameters differently for each target domain sample with different degrees of domain shift, which helps SVDP to better fit target domain knowledge. Experiments, which are conducted on the widely-used benchmarks (Cityscapes, Foggy-Cityscapes, and ACDC), show that our proposed method achieves state-of-the-art performances on the source-free adaptations, including six Test Time Adaptation and one Continual Test-Time Adaptation in semantic segmentation.

Title: TKN: Transformer-based Keypoint Prediction Network For Real-time Video Prediction. (arXiv:2303.09807v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09807
Code URL: null
Copy Paste: [[2303.09807] TKN: Transformer-based Keypoint Prediction Network For Real-time Video Prediction](http://arxiv.org/abs/2303.09807) #extraction
Summary:
Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy while ignoring the slow prediction speed caused by complicated model structures that learn too much redundant information with excessive GPU memory consumption. Furthermore, conventional methods mostly predict frames sequentially (frame-by-frame) and thus are hard to accelerate. Consequently, valuable use cases such as real-time danger prediction and warning cannot achieve fast enough inference speed to be applicable in reality. Therefore, we propose a transformer-based keypoint prediction neural network (TKN), an unsupervised learning method that boost the prediction process via constrained information extraction and parallel prediction scheme. TKN is the first real-time video prediction solution to our best knowledge, while significantly reducing computation costs and maintaining other performance. Extensive experiments on KTH and Human3.6 datasets demonstrate that TKN predicts 11 times faster than existing methods while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.

Title: Vision Transformer for Action Units Detection. (arXiv:2303.09917v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09917
Code URL: null
Copy Paste: [[2303.09917] Vision Transformer for Action Units Detection](http://arxiv.org/abs/2303.09917) #extraction
Summary:
Facial Action Units detection (FAUs) represents a fine-grained classification problem that involves identifying different units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach for addressing the task of Action Units (AU) detection in the context of Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer(ViViT) Network to capture the temporal facial change in the video. Besides, to reduce massive size of the Vision Transformers model, we replace the ViViT feature extraction layers with the CNN backbone (Regnet). Our model outperform the baseline model of ABAW 2023 challenge, with a notable 14\% difference in result. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.

membership infer

federate

Title: No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier. (arXiv:2303.10058v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.10058
Code URL: null
Copy Paste: [[2303.10058] No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier](http://arxiv.org/abs/2303.10058) #federate
Summary:
Data heterogeneity is an inherent challenge that hinders the performance of federated learning (FL). Recent studies have identified the biased classifiers of local models as the key bottleneck. Previous attempts have used classifier calibration after FL training, but this approach falls short in improving the poor feature representations caused by training-time classifier biases. Resolving the classifier bias dilemma in FL requires a full understanding of the mechanisms behind the classifier. Recent advances in neural collapse have shown that the classifiers and feature prototypes under perfect training scenarios collapse into an optimal structure called simplex equiangular tight frame (ETF). Building on this neural collapse insight, we propose a solution to the FL's classifier bias problem by utilizing a synthetic and fixed ETF classifier during training. The optimal classifier structure enables all clients to learn unified and optimal feature representations even under extremely heterogeneous data. We devise several effective modules to better adapt the ETF structure in FL, achieving both high generalization and personalization. Extensive experiments demonstrate that our method achieves state-of-the-art performances on CIFAR-10, CIFAR-100, and Tiny-ImageNet.

fair

Title: Trained on 100 million words and still in shape: BERT meets British National Corpus. (arXiv:2303.09859v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.09859
Code URL: null
Copy Paste: [[2303.09859] Trained on 100 million words and still in shape: BERT meets British National Corpus](http://arxiv.org/abs/2303.09859) #fair
Summary:
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

interpretability

Title: Explainable GeoAI: Can saliency maps help interpret artificial intelligence's learning process? An empirical study on natural feature detection. (arXiv:2303.09660v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09660
Code URL: null
Copy Paste: [[2303.09660] Explainable GeoAI: Can saliency maps help interpret artificial intelligence's learning process? An empirical study on natural feature detection](http://arxiv.org/abs/2303.09660) #interpretability
Summary:
Improving the interpretability of geospatial artificial intelligence (GeoAI) models has become critically important to open the "black box" of complex AI models, such as deep learning. This paper compares popular saliency map generation techniques and their strengths and weaknesses in interpreting GeoAI and deep learning models' reasoning behaviors, particularly when applied to geospatial analysis and image processing tasks. We surveyed two broad classes of model explanation methods: perturbation-based and gradient-based methods. The former identifies important image areas, which help machines make predictions by modifying a localized area of the input image. The latter evaluates the contribution of every single pixel of the input image to the model's prediction results through gradient backpropagation. In this study, three algorithms-the occlusion method, the integrated gradients method, and the class activation map method-are examined for a natural feature detection task using deep learning. The algorithms' strengths and weaknesses are discussed, and the consistency between model-learned and human-understandable concepts for object recognition is also compared. The experiments used two GeoAI-ready datasets to demonstrate the generalizability of the research findings.

Title: DUDES: Deep Uncertainty Distillation using Ensembles for Semantic Segmentation. (arXiv:2303.09843v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09843
Code URL: null
Copy Paste: [[2303.09843] DUDES: Deep Uncertainty Distillation using Ensembles for Semantic Segmentation](http://arxiv.org/abs/2303.09843) #interpretability
Summary:
Deep neural networks lack interpretability and tend to be overconfident, which poses a serious problem in safety-critical applications like autonomous driving, medical imaging, or machine vision tasks with high demands on reliability. Quantifying the predictive uncertainty is a promising endeavour to open up the use of deep neural networks for such applications. Unfortunately, current available methods are computationally expensive. In this work, we present a novel approach for efficient and reliable uncertainty estimation which we call Deep Uncertainty Distillation using Ensembles for Segmentation (DUDES). DUDES applies student-teacher distillation with a Deep Ensemble to accurately approximate predictive uncertainties with a single forward pass while maintaining simplicity and adaptability. Experimentally, DUDES accurately captures predictive uncertainties without sacrificing performance on the segmentation task and indicates impressive capabilities of identifying wrongly classified pixels and out-of-domain samples on the Cityscapes dataset. With DUDES, we manage to simultaneously simplify and outperform previous work on Deep Ensemble-based Uncertainty Distillation.

explainability

Title: Causal Temporal Graph Convolutional Neural Networks (CTGCN). (arXiv:2303.09634v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09634
Code URL: null
Copy Paste: [[2303.09634] Causal Temporal Graph Convolutional Neural Networks (CTGCN)](http://arxiv.org/abs/2303.09634) #explainability
Summary:
Many large-scale applications can be elegantly represented using graph structures. Their scalability, however, is often limited by the domain knowledge required to apply them. To address this problem, we propose a novel Causal Temporal Graph Convolutional Neural Network (CTGCN). Our CTGCN architecture is based on a causal discovery mechanism, and is capable of discovering the underlying causal processes. The major advantages of our approach stem from its ability to overcome computational scalability problems with a divide and conquer technique, and from the greater explainability of predictions made using a causal model. We evaluate the scalability of our CTGCN on two datasets to demonstrate that our method is applicable to large scale problems, and show that the integration of causality into the TGCN architecture improves prediction performance up to 40% over typical TGCN approach. Our results are obtained without requiring additional domain knowledge, making our approach adaptable to various domains, specifically when little contextual knowledge is available.

watermark

Title: A Recipe for Watermarking Diffusion Models. (arXiv:2303.10137v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.10137
Code URL: https://github.com/yunqing-me/watermarkdm
Copy Paste: [[2303.10137] A Recipe for Watermarking Diffusion Models](http://arxiv.org/abs/2303.10137) #watermark
Summary:
Recently, diffusion models (DMs) have demonstrated their advantageous potential for generative tasks. Widespread interest exists in incorporating DMs into downstream applications, such as producing or editing photorealistic images. However, practical deployment and unprecedented power of DMs raise legal issues, including copyright protection and monitoring of generated content. In this regard, watermarking has been a proven solution for copyright protection and content monitoring, but it is underexplored in the DMs literature. Specifically, DMs generate samples from longer tracks and may have newly designed multimodal structures, necessitating the modification of conventional watermarking pipelines. To this end, we conduct comprehensive analyses and derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable Diffusion), via training from scratch or finetuning. Our recipe is straightforward but involves empirically ablated implementation details, providing a solid foundation for future research on watermarking DMs. Our Code: https://github.com/yunqing-me/WatermarkDM.

Title: Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation. (arXiv:2303.09732v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2303.09732
Code URL: null
Copy Paste: [[2303.09732] Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation](http://arxiv.org/abs/2303.09732) #watermark
Summary:
Copyright protection for deep neural networks (DNNs) is an urgent need for AI corporations. To trace illegally distributed model copies, DNN watermarking is an emerging technique for embedding and verifying secret identity messages in the prediction behaviors or the model internals. Sacrificing less functionality and involving more knowledge about the target DNN, the latter branch called \textit{white-box DNN watermarking} is believed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts in both the academy and the industry.

In this paper, we present the first systematic study on how the mainstream white-box DNN watermarks are commonly vulnerable to neural structural obfuscation with \textit{dummy neurons}, a group of neurons which can be added to a target model but leave the model behavior invariant. Devising a comprehensive framework to automatically generate and inject dummy neurons with high stealthiness, our novel attack intensively modifies the architecture of the target model to inhibit the success of watermark verification. With extensive evaluation, our work for the first time shows that nine published watermarking schemes require amendments to their verification procedures.

diffusion

Title: DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion. (arXiv:2303.09604v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09604
Code URL: null
Copy Paste: [[2303.09604] DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion](http://arxiv.org/abs/2303.09604) #diffusion
Summary:
We introduce a novel method to automatically generate an artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable. To address an assortment of challenges with our task at hand including conflicting goals (artistic stylization vs. legibility), lack of ground truth, and immense search space, our approach utilizes large language models to bridge texts and visual images for stylization and build an unsupervised generative model with a diffusion model backbone. Specifically, we employ the denoising generator in Latent Diffusion Model (LDM), with the key addition of a CNN-based discriminator to adapt the input style onto the input text. The discriminator uses rasterized images of a given letter/word font as real samples and output of the denoising generator as fake samples. Our model is coined DS-Fusion for discriminated and stylized diffusion. We showcase the quality and versatility of our method through numerous examples, qualitative and quantitative evaluation, as well as ablation studies. User studies comparing to strong baselines including CLIPDraw and DALL-E 2, as well as artist-crafted typographies, demonstrate strong performance of DS-Fusion.

Title: HIVE: Harnessing Human Feedback for Instructional Visual Editing. (arXiv:2303.09618v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09618
Code URL: null
Copy Paste: [[2303.09618] HIVE: Harnessing Human Feedback for Instructional Visual Editing](http://arxiv.org/abs/2303.09618) #diffusion
Summary:
Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models, where outputs are generated based on an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not adhere to the correct instructions and preferences of users. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward. Besides, to mitigate the bias brought by the limitation of data, we contribute a new 1M training dataset, a 3.6K reward dataset for rewards learning, and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive empirical experiments quantitatively and qualitatively, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.

Title: SUD$^2$: Supervision by Denoising Diffusion Models for Image Reconstruction. (arXiv:2303.09642v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09642
Code URL: null
Copy Paste: [[2303.09642] SUD$^2$: Supervision by Denoising Diffusion Models for Image Reconstruction](http://arxiv.org/abs/2303.09642) #diffusion
Summary:
Many imaging inverse problems$\unicode{x2014}$such as image-dependent in-painting and dehazing$\unicode{x2014}$are challenging because their forward models are unknown or depend on unknown latent parameters. While one can solve such problems by training a neural network with vast quantities of paired training data, such paired training data is often unavailable. In this paper, we propose a generalized framework for training image reconstruction networks when paired training data is scarce. In particular, we demonstrate the ability of image denoising algorithms and, by extension, denoising diffusion models to supervise network training in the absence of paired training data.

Title: Denoising Diffusion Autoencoders are Unified Self-supervised Learners. (arXiv:2303.09769v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09769
Code URL: null
Copy Paste: [[2303.09769] Denoising Diffusion Autoencoders are Unified Self-supervised Learners](http://arxiv.org/abs/2303.09769) #diffusion
Summary:
Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image generation, DDAE has already learned strongly linear-separable representations at its intermediate layers without auxiliary encoders, thus making diffusion pre-training emerge as a general approach for self-supervised generative and discriminative learning. To verify this, we perform linear probe and fine-tuning evaluations on multi-class datasets. Our diffusion-based approach achieves 95.9% and 50.0% linear probe accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable to masked autoencoders and contrastive learning for the first time. Additionally, transfer learning from ImageNet confirms DDAE's suitability for latent-space Vision Transformers, suggesting the potential for scaling DDAEs as unified foundation models.

Title: DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery. (arXiv:2303.09813v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09813
Code URL: null
Copy Paste: [[2303.09813] DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery](http://arxiv.org/abs/2303.09813) #diffusion
Summary:
Learning from a large corpus of data, pre-trained models have achieved impressive progress nowadays. As popular generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledgeable diffusion models for mainstream discriminative tasks, i.e., unsupervised object discovery: saliency segmentation and object localization. However, the challenges exist as there is one structural difference between generative and discriminative models, which limits the direct use. Besides, the lack of explicitly labeled data significantly limits performance in unsupervised settings. To tackle these issues, we introduce DiffusionSeg, one novel synthesis-exploitation framework containing two-stage strategies. To alleviate data insufficiency, we synthesize abundant images, and propose a novel training-free AttentionCut to obtain masks in the first synthesis stage. In the second exploitation stage, to bridge the structural gap, we use the inversion technique, to map the given image back to diffusion features. These features can be directly used by downstream architectures. Extensive experiments and ablation studies demonstrate the superiority of adapting diffusion for unsupervised object discovery.

Title: FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model. (arXiv:2303.09833v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09833
Code URL: https://github.com/vvictoryuki/freedom
Copy Paste: [[2303.09833] FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model](http://arxiv.org/abs/2303.09833) #diffusion
Summary:
Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods are training-required. They need to train a time-dependent classifier or a condition-dependent score estimator, which increases the cost of constructing conditional diffusion models and is inconvenient to transfer across different conditions. Some current works aim to overcome this limitation by proposing training-free solutions, but most can only be applied to a specific category of tasks and not to more general conditions. In this work, we propose a training-Free conditional Diffusion Model (FreeDoM) used for various conditions. Specifically, we leverage off-the-shelf pre-trained networks, such as a face detection model, to construct time-independent energy functions, which guide the generation process without requiring training. Furthermore, because the construction of the energy function is very flexible and adaptable to various conditions, our proposed FreeDoM has a broader range of applications than existing training-free methods. FreeDoM is advantageous in its simplicity, effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective for various conditions and suitable for diffusion models of diverse data domains, including image and latent code domains.

Title: DiffusionRet: Generative Text-Video Retrieval with Diffusion Model. (arXiv:2303.09867v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.09867
Code URL: null
Copy Paste: [[2303.09867] DiffusionRet: Generative Text-Video Retrieval with Diffusion Model](http://arxiv.org/abs/2303.09867) #diffusion
Summary:
Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by generation loss and the feature extractor trained with contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, with superior performances, justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at https://github.com/jpthu17/DiffusionRet.

Title: GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation. (arXiv:2303.10056v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.10056
Code URL: null
Copy Paste: [[2303.10056] GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation](http://arxiv.org/abs/2303.10056) #diffusion
Summary:
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. However, the tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade. Such changes often require massive fine-tuning or even training from scratch with the prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables various capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing for the generation of high-quality images from captions beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model for challenging case generation. By the alignment of various feature representations, the GlueNet allows for flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.

Title: DialogPaint: A Dialog-based Image Editing Model. (arXiv:2303.10073v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.10073
Code URL: null
Copy Paste: [[2303.10073] DialogPaint: A Dialog-based Image Editing Model](http://arxiv.org/abs/2303.10073) #diffusion
Summary:
We present DialogPaint, an innovative framework that employs an interactive conversational approach for image editing. The framework comprises a pretrained dialogue model (Blenderbot) and a diffusion model (Stable Diffusion). The dialogue model engages in conversation with users to understand their requirements and generates concise instructions based on the dialogue. Subsequently, the Stable Diffusion model employs these instructions, along with the input image, to produce the desired output. Due to the difficulty of acquiring fine-tuning data for such models, we leverage multiple large-scale models to generate simulated dialogues and corresponding image pairs. After fine-tuning our framework with the synthesized data, we evaluate its performance in real application scenes. The results demonstrate that DialogPaint excels in both objective and subjective evaluation metrics effectively handling ambiguous instructions and performing tasks such as object replacement, style transfer, color modification. Moreover, our framework supports multi-round editing, allowing for the completion of complicated editing tasks.

Title: Diffusing the Optimal Topology: A Generative Optimization Approach. (arXiv:2303.09760v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09760
Code URL: null
Copy Paste: [[2303.09760] Diffusing the Optimal Topology: A Generative Optimization Approach](http://arxiv.org/abs/2303.09760) #diffusion
Summary:
Topology Optimization seeks to find the best design that satisfies a set of constraints while maximizing system performance. Traditional iterative optimization methods like SIMP can be computationally expensive and get stuck in local minima, limiting their applicability to complex or large-scale problems. Learning-based approaches have been developed to accelerate the topology optimization process, but these methods can generate designs with floating material and low performance when challenged with out-of-distribution constraint configurations. Recently, deep generative models, such as Generative Adversarial Networks and Diffusion Models, conditioned on constraints and physics fields have shown promise, but they require extensive pre-processing and surrogate models for improving performance. To address these issues, we propose a Generative Optimization method that integrates classic optimization like SIMP as a refining mechanism for the topology generated by a deep generative model. We also remove the need for conditioning on physical fields using a computationally inexpensive approximation inspired by classic ODE solutions and reduce the number of steps needed to generate a feasible and performant topology. Our method allows us to efficiently generate good topologies and explicitly guide them to regions with high manufacturability and high performance, without the need for external auxiliary models or additional labeled data. We believe that our method can lead to significant advancements in the design and optimization of structures in engineering applications, and can be applied to a broader spectrum of performance-aware engineering design problems.

Title: Discovering mesoscopic descriptions of collective movement with neural stochastic modelling. (arXiv:2303.09906v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.09906
Code URL: https://github.com/utkarshp1161/neu_sde
Copy Paste: [[2303.09906] Discovering mesoscopic descriptions of collective movement with neural stochastic modelling](http://arxiv.org/abs/2303.09906) #diffusion
Summary:
Collective motion is an ubiquitous phenomenon in nature, inspiring engineers, physicists and mathematicians to develop mathematical models and bio-inspired designs. Collective motion at small to medium group sizes ($\sim$10-1000 individuals, also called the `mesoscale'), can show nontrivial features due to stochasticity. Therefore, characterizing both the deterministic and stochastic aspects of the dynamics is crucial in the study of mesoscale collective phenomena. Here, we use a physics-inspired, neural-network based approach to characterize the stochastic group dynamics of interacting individuals, through a stochastic differential equation (SDE) that governs the collective dynamics of the group. We apply this technique on both synthetic and real-world datasets, and identify the deterministic and stochastic aspects of the dynamics using drift and diffusion fields, enabling us to make novel inferences about the nature of order in these systems.

Title: Data-Centric Learning from Unlabeled Graphs with Diffusion Model. (arXiv:2303.10108v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.10108
Code URL: null
Copy Paste: [[2303.10108] Data-Centric Learning from Unlabeled Graphs with Diffusion Model](http://arxiv.org/abs/2303.10108) #diffusion
Summary:
Graph property prediction tasks are important and numerous. While each task offers a small size of labeled examples, unlabeled graphs have been collected from various sources and at a large scale. A conventional approach is training a model with the unlabeled graphs on self-supervised tasks and then fine-tuning the model on the prediction tasks. However, the self-supervised task knowledge could not be aligned or sometimes conflicted with what the predictions needed. In this paper, we propose to extract the knowledge underlying the large set of unlabeled graphs as a specific set of useful data points to augment each property prediction model. We use a diffusion model to fully utilize the unlabeled graphs and design two new objectives to guide the model's denoising process with each task's labeled data to generate task-specific graph examples and their labels. Experiments demonstrate that our data-centric approach performs significantly better than fourteen existing various methods on fifteen tasks. The performance improvement brought by unlabeled data is visible as the generated labeled examples unlike self-supervised learning.