[[2301.05920] An Introduction of System-Scientific Approaches to Cognitive Security](http://arxiv.org/abs/2301.05920) #security
Human cognitive capacities and the needs of human-centric solutions for "Industry 5.0" make humans an indispensable component in Cyber-Physical Systems (CPSs), referred to as Human-Cyber-Physical Systems (HCPSs), where AI-powered technologies are incorporated to assist and augment humans. The close integration between humans and technologies in Section 1.1 and cognitive attacks in Section 1.2.4 poses emerging security challenges, where attacks can exploit vulnerabilities of human cognitive processes, affect their behaviors, and ultimately damage the HCPS. Defending HCPSs against cognitive attacks requires a new security paradigm, which we refer to as "cognitive security" in Section 1.2.5. The vulnerabilities of human cognitive systems and the associated methods of exploitation distinguish cognitive security from "cognitive reliability" and give rise to a distinctive CIA triad, as shown in Sections 1.2.5.1 and 1.2.5.2, respectively. Section 1.2.5.3 introduces cognitive and technical defense methods that deter the kill chain of cognitive attacks and achieve cognitive security. System scientific perspectives in Section 1.3 offer a promising direction to address the new challenges of cognitive security by developing quantitative, modular, multi-scale, and transferable solutions.
[[2301.05941] Improving Confidentiality for NFT Referenced Data Stores](http://arxiv.org/abs/2301.05941) #security
A non-fungible token (NFT) references a data store location, typically, using a URL or another unique identifier. At the minimum, a NFT is expected to guarantee ownership and control over the tokenised asset. However, information stored on a third party data store may be copied and stolen. We propose a solution to give control back to the information owner by storing encrypted content on the data store and providing additional security against hacks and zero day exploits. The content on our data store is never decrypted or returned to its owner for decryption during rekeying. Also, the key size in our protocol does not increase with each rekeying. With this, we reduce the synchronisation steps and maintain a bounded key size.
[[2301.05761] Local Model Explanations and Uncertainty Without Model Access](http://arxiv.org/abs/2301.05761) #security
We present a model-agnostic algorithm for generating post-hoc explanations and uncertainty intervals for a machine learning model when only a sample of inputs and outputs from the model is available, rather than direct access to the model itself. This situation may arise when model evaluations are expensive; when privacy, security and bandwidth constraints are imposed; or when there is a need for real-time, on-device explanations. Our algorithm constructs explanations using local polynomial regression and quantifies the uncertainty of the explanations using a bootstrapping approach. Through a simulation study, we show that the uncertainty intervals generated by our algorithm exhibit a favorable trade-off between interval width and coverage probability compared to the naive confidence intervals from classical regression analysis. We further demonstrate the capabilities of our method by applying it to black-box models trained on two real datasets.
[[2301.05795] Poisoning Attacks and Defenses in Federated Learning: A Survey](http://arxiv.org/abs/2301.05795) #defense
Federated learning (FL) enables the training of models among distributed clients without compromising the privacy of training datasets, while the invisibility of clients datasets and the training process poses a variety of security threats. This survey provides the taxonomy of poisoning attacks and experimental evaluation to discuss the need for robust FL.
[[2301.05819] Deepfake Detection using Biological Features: A Survey](http://arxiv.org/abs/2301.05819) #attack
Deepfake is a deep learning-based technique that makes it easy to change or modify images and videos. In investigations and court, visual evidence is commonly employed, but these pieces of evidence may now be suspect due to technological advancements in deepfake. Deepfakes have been used to blackmail individuals, plan terrorist attacks, disseminate false information, defame individuals, and foment political turmoil. This study describes the history of deepfake, its development and detection, and the challenges based on physiological measurements such as eyebrow recognition, eye blinking detection, eye movement detection, ear and mouth detection, and heartbeat detection. The study also proposes a scope in this field and compares the different biological features and their classifiers. Deepfakes are created using the generative adversarial network (GANs) model, and were once easy to detect by humans due to visible artifacts. However, as technology has advanced, deepfakes have become highly indistinguishable from natural images, making it important to review detection methods.
[[2301.05776] Young Labeled Faces in the Wild (YLFW): A Dataset for Children Faces Recognition](http://arxiv.org/abs/2301.05776) #robust
Face recognition has achieved outstanding performance in the last decade with the development of deep learning techniques.
Nowadays, the challenges in face recognition are related to specific scenarios, for instance, the performance under diverse image quality, the robustness for aging and edge cases of person age (children and elders), distinguishing of related identities.
In this set of problems, recognizing children's faces is one of the most sensitive and important. One of the reasons for this problem is the existing bias towards adults in existing face datasets.
In this work, we present a benchmark dataset for children's face recognition, which is compiled similarly to the famous face recognition benchmarks LFW, CALFW, CPLFW, XQLFW and AgeDB.
We also present a development dataset (separated into train and test parts) for adapting face recognition models for face images of children.
The proposed data is balanced for African, Asian, Caucasian, and Indian races. To the best of our knowledge, this is the first standartized data tool set for benchmarking and the largest collection for development for children's face recognition. Several face recognition experiments are presented to demonstrate the performance of the proposed data tool set.
[[2301.05805] Safe Control Transitions: Machine Vision Based Observable Readiness Index and Data-Driven Takeover Time Prediction](http://arxiv.org/abs/2301.05805) #robust
To make safe transitions from autonomous to manual control, a vehicle must have a representation of the awareness of driver state; two metrics which quantify this state are the Observable Readiness Index and Takeover Time. In this work, we show that machine learning models which predict these two metrics are robust to multiple camera views, expanding from the limited view angles in prior research. Importantly, these models take as input feature vectors corresponding to hand location and activity as well as gaze location, and we explore the tradeoffs of different views in generating these feature vectors. Further, we introduce two metrics to evaluate the quality of control transitions following the takeover event (the maximal lateral deviation and velocity deviation) and compute correlations of these post-takeover metrics to the pre-takeover predictive metrics.
[[2301.05858] Robust Remote Sensing Scene Classification with Multi-View Voting and Entropy Ranking](http://arxiv.org/abs/2301.05858) #robust
Deep convolutional neural networks have been widely used in scene classification of remotely sensed images. In this work, we propose a robust learning method for the task that is secure against partially incorrect categorization of images. Specifically, we remove and correct errors in the labels progressively by iterative multi-view voting and entropy ranking. At each time step, we first divide the training data into disjoint parts for separate training and voting. The unanimity in the voting reveals the correctness of the labels, so that we can train a strong model with only the images with unanimous votes. In addition, we adopt entropy as an effective measure for prediction uncertainty, in order to partially recover labeling errors by ranking and selection. We empirically demonstrate the superiority of the proposed method on the WHU-RS19 dataset and the AID dataset.
[[2301.05957] Towards Spatial Equilibrium Object Detection](http://arxiv.org/abs/2301.05957) #robust
Semantic objects are unevenly distributed over images. In this paper, we study the spatial disequilibrium problem of modern object detectors and propose to quantify this ``spatial bias'' by measuring the detection performance over zones. Our analysis surprisingly shows that the spatial imbalance of objects has a great impact on the detection performance, limiting the robustness of detection applications. This motivates us to design a more generalized measurement, termed Spatial equilibrium Precision (SP), to better characterize the detection performance of object detectors. Furthermore, we also present a spatial equilibrium label assignment (SELA) to alleviate the spatial disequilibrium problem by injecting the prior spatial weight into the optimization process of detectors. Extensive experiments on PASCAL VOC, MS COCO, and 3 application datasets on face mask/fruit/helmet images demonstrate the advantages of our method. Our findings challenge the conventional sense of object detectors and show the indispensability of spatial equilibrium. We hope these discoveries would stimulate the community to rethink how an excellent object detector should be. All the source code, evaluation protocols, and the tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval
[[2301.05815] First Three Years of the International Verification of Neural Networks Competition (VNN-COMP)](http://arxiv.org/abs/2301.05815) #robust
This paper presents a summary and meta-analysis of the first three iterations of the annual International Verification of Neural Networks Competition (VNN-COMP) held in 2020, 2021, and 2022. In the VNN-COMP, participants submit software tools that analyze whether given neural networks satisfy specifications describing their input-output behavior. These neural networks and specifications cover a variety of problem classes and tasks, corresponding to safety and robustness properties in image classification, neural control, reinforcement learning, and autonomous systems. We summarize the key processes, rules, and results, present trends observed over the last three years, and provide an outlook into possible future developments.
[[2301.05931] Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning](http://arxiv.org/abs/2301.05931) #robust
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. However, identifying novel drug combinations through wet-lab experiments is resource intensive due to the vast combinatorial search space. Recently, computational approaches, specifically deep learning models have emerged as an efficient way to discover synergistic combinations. While previous methods reported fair performance, their models usually do not take advantage of multi-modal data and they are unable to handle new drugs or cell lines. In this study, we collected data from various datasets covering various drug-related aspects. Then, we take advantage of large-scale pre-training models to generate informative representations and features for drugs, proteins, and diseases. Based on that, a message-passing graph is built on top to propagate information together with graph structure learning flexibility. This is first introduced in the biological networks and enables us to generate pseudo-relations in the graph. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods on synergistic prediction benchmark datasets. We are also capable of inferencing new drug combination data in a test on an independent set released by AstraZeneca, where 10% of improvement over previous methods is observed. In addition, we're robust against unseen drugs and surpass almost 15% AU ROC compared to the second-best model. We believe our framework contributes to both the future wet-lab discovery of novel drugs and the building of promising guidance for precise combination medicine.
[[2301.05981] Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures](http://arxiv.org/abs/2301.05981) #robust
Traditional reinforcement learning (RL) aims to maximize the expected total reward, while the risk of uncertain outcomes needs to be controlled to ensure reliable performance in a risk-averse setting. In this paper, we consider the problem of maximizing dynamic risk of a sequence of rewards in infinite-horizon Markov Decision Processes (MDPs). We adapt the Expected Conditional Risk Measures (ECRMs) to the infinite-horizon risk-averse MDP and prove its time consistency. Using a convex combination of expectation and conditional value-at-risk (CVaR) as a special one-step conditional risk measure, we reformulate the risk-averse MDP as a risk-neutral counterpart with augmented action space and manipulation on the immediate rewards. We further prove that the related Bellman operator is a contraction mapping, which guarantees the convergence of any value-based RL algorithms. Accordingly, we develop a risk-averse deep Q-learning framework, and our numerical studies based on two simple MDPs show that the risk-averse setting can reduce the variance and enhance robustness of the results.
[[2301.05796] Learning Trajectory-Conditioned Relations to Predict Pedestrian Crossing Behavior](http://arxiv.org/abs/2301.05796) #extraction
In smart transportation, intelligent systems avoid potential collisions by predicting the intent of traffic agents, especially pedestrians. Pedestrian intent, defined as future action, e.g., start crossing, can be dependent on traffic surroundings. In this paper, we develop a framework to incorporate such dependency given observed pedestrian trajectory and scene frames. Our framework first encodes regional joint information between a pedestrian and surroundings over time into feature-map vectors. The global relation representations are then extracted from pairwise feature-map vectors to estimate intent with past trajectory condition. We evaluate our approach on two public datasets and compare against two state-of-the-art approaches. The experimental results demonstrate that our method helps to inform potential risks during crossing events with 0.04 improvement in F1-score on JAAD dataset and 0.01 improvement in recall on PIE dataset. Furthermore, we conduct ablation experiments to confirm the contribution of the relation extraction in our framework.
[[2301.05839] NCP: Neural Correspondence Prior for Effective Unsupervised Shape Matching](http://arxiv.org/abs/2301.05839) #extraction
We present Neural Correspondence Prior (NCP), a new paradigm for computing correspondences between 3D shapes. Our approach is fully unsupervised and can lead to high-quality correspondences even in challenging cases such as sparse point clouds or non-isometric meshes, where current methods fail. Our first key observation is that, in line with neural priors observed in other domains, recent network architectures on 3D data, even without training, tend to produce pointwise features that induce plausible maps between rigid or non-rigid shapes. Secondly, we show that given a noisy map as input, training a feature extraction network with the input map as supervision tends to remove artifacts from the input and can act as a powerful correspondence denoising mechanism, both between individual pairs and within a collection. With these observations in hand, we propose a two-stage unsupervised paradigm for shape matching by (i) performing unsupervised training by adapting an existing approach to obtain an initial set of noisy matches, and (ii) using these matches to train a network in a supervised manner. We demonstrate that this approach significantly improves the accuracy of the maps, especially when trained within a collection. We show that NCP is data-efficient, fast, and achieves state-of-the-art results on many tasks. Our code can be found online: https://github.com/pvnieo/NCP.
[[2301.06002] ACTIVE: A Deep Model for Sperm and Impurity Detection in Microscopic Videos](http://arxiv.org/abs/2301.06002) #extraction
The accurate detection of sperms and impurities is a very challenging task, facing problems such as the small size of targets, indefinite target morphologies, low contrast and resolution of the video, and similarity of sperms and impurities. So far, the detection of sperms and impurities still largely relies on the traditional image processing and detection techniques which only yield limited performance and often require manual intervention in the detection process, therefore unfavorably escalating the time cost and injecting the subjective bias into the analysis. Encouraged by the successes of deep learning methods in numerous object detection tasks, here we report a deep learning model based on Double Branch Feature Extraction Network (DBFEN) and Cross-conjugate Feature Pyramid Networks (CCFPN).DBFEN is designed to extract visual features from tiny objects with a double branch structure, and CCFPN is further introduced to fuse the features extracted by DBFEN to enhance the description of position and high-level semantic information. Our work is the pioneer of introducing deep learning approaches to the detection of sperms and impurities. Experiments show that the highest AP50 of the sperm and impurity detection is 91.13% and 59.64%, which lead its competitors by a substantial margin and establish new state-of-the-art results in this problem.
[[2301.06020] Delving Deep into Pixel Alignment Feature for Accurate Multi-view Human Mesh Recovery](http://arxiv.org/abs/2301.06020) #extraction
Regression-based methods have shown high efficiency and effectiveness for multi-view human mesh recovery. The key components of a typical regressor lie in the feature extraction of input views and the fusion of multi-view features. In this paper, we present Pixel-aligned Feedback Fusion (PaFF) for accurate yet efficient human mesh recovery from multi-view images. PaFF is an iterative regression framework that performs feature extraction and fusion alternately. At each iteration, PaFF extracts pixel-aligned feedback features from each input view according to the reprojection of the current estimation and fuses them together with respect to each vertex of the downsampled mesh. In this way, our regressor can not only perceive the misalignment status of each view from the feedback features but also correct the mesh parameters more effectively based on the feature fusion on mesh vertices. Additionally, our regressor disentangles the global orientation and translation of the body mesh from the estimation of mesh parameters such that the camera parameters of input views can be better utilized in the regression process. The efficacy of our method is validated in the Human3.6M dataset via comprehensive ablation experiments, where PaFF achieves 33.02 MPJPE and brings significant improvements over the previous best solutions by more than 29%. The project page with code and video results can be found at https://kairobo.github.io/PaFF/.
[[2301.05948] $\texttt{tasksource}$: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation](http://arxiv.org/abs/2301.05948) #extraction
The HuggingFace Datasets Hub hosts thousands of datasets. This provides exciting opportunities for language model training and evaluation. However, the datasets for a given type of task are stored with different schemas, and harmonization is harder than it seems (https://xkcd.com/927/). Multi-task training or evaluation requires manual work to fit data into task templates. Various initiatives independently address this problem by releasing the harmonized datasets or harmonization codes to preprocess datasets to the same format. We identify patterns across previous preprocessings, e.g. mapping of column names, and extraction of a specific sub-field from structured data in a column, and propose a structured annotation framework that makes our annotations fully exposed and not buried in unstructured code. We release a dataset annotation framework and dataset annotations for more than 400 English tasks (https://github.com/sileod/tasksource). These annotations provide metadata, like the name of the columns that should be used as input or labels for all datasets, and can save time for future dataset preprocessings, even if they do not use our framework. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size on an external evaluation https://hf.co/sileod/deberta-v3-base-tasksource-nli.
[[2301.06009] Rationalizing Predictions by Adversarial Information Calibration](http://arxiv.org/abs/2301.06009) #extraction
Explaining the predictions of AI models is paramount in safety-critical
applications, such as in legal or medical domains. One form of explanation for
a prediction is an extractive rationale, i.e., a subset of features of an
instance that lead the model to give its prediction on that instance. For
example, the subphrase he stole the mobile phone'' can be an extractive
rationale for the prediction of
Theft''. Previous works on generating
extractive rationales usually employ a two-phase model: a selector that selects
the most important features (i.e., the rationale) followed by a predictor that
makes the prediction based exclusively on the selected features. One
disadvantage of these works is that the main signal for learning to select
features comes from the comparison of the answers given by the predictor to the
ground-truth answers. In this work, we propose to squeeze more information from
the predictor via an information calibration method. More precisely, we train
two models jointly: one is a typical neural model that solves the task at hand
in an accurate but black-box manner, and the other is a selector-predictor
model that additionally produces a rationale for its prediction. The first
model is used as a guide for the second model. We use an adversarial technique
to calibrate the information extracted by the two models such that the
difference between them is an indicator of the missed or over-selected
features. In addition, for natural language tasks, we propose a
language-model-based regularizer to encourage the extraction of fluent
rationales. Experimental results on a sentiment analysis task, a hate speech
recognition task as well as on three tasks from the legal domain show the
effectiveness of our approach to rationale extraction.
[[2301.05797] FedSSC: Shared Supervised-Contrastive Federated Learning](http://arxiv.org/abs/2301.05797) #federate
Federated learning is widely used to perform decentralized training of a global model on multiple devices while preserving the data privacy of each device. However, it suffers from heterogeneous local data on each training device which increases the difficulty to reach the same level of accuracy as the centralized training. Supervised Contrastive Learning which outperform cross-entropy tries to minimizes the difference between feature space of points belongs to the same class and pushes away points from different classes. We propose Supervised Contrastive Federated Learning in which devices can share the learned class-wise feature spaces with each other and add the supervised-contrastive learning loss as a regularization term to foster the feature space learning. The loss tries to minimize the cosine similarity distance between the feature map and the averaged feature map from another device in the same class and maximizes the distance between the feature map and that in a different class. This new regularization term when added on top of the moon regularization term is found to outperform the other state-of-the-art regularization terms in solving the heterogeneous data distribution problem.
[[2301.05849] Survey of Knowledge Distillation in Federated Edge Learning](http://arxiv.org/abs/2301.05849) #federate
The increasing demand for intelligent services and privacy protection of mobile and Internet of Things (IoT) devices motivates the wide application of Federated Edge Learning (FEL), in which devices collaboratively train on-device Machine Learning (ML) models without sharing their private data. \textcolor{black}{Limited by device hardware, diverse user behaviors and network infrastructure, the algorithm design of FEL faces challenges related to resources, personalization and network environments}, and Knowledge Distillation (KD) has been leveraged as an important technique to tackle the above challenges in FEL. In this paper, we investigate the works that KD applies to FEL, discuss the limitations and open problems of existing KD-based FEL approaches, and provide guidance for their real deployment.
[[2301.05860] A Comprehensive Survey of Graph-level Learning](http://arxiv.org/abs/2301.05860) #interpretability
Graphs have a superior ability to represent relational data, like chemical compounds, proteins, and social networks. Hence, graph-level learning, which takes a set of graphs as input, has been applied to many tasks including comparison, regression, classification, and more. Traditional approaches to learning a set of graphs tend to rely on hand-crafted features, such as substructures. But while these methods benefit from good interpretability, they often suffer from computational bottlenecks as they cannot skirt the graph isomorphism problem. Conversely, deep learning has helped graph-level learning adapt to the growing scale of graphs by extracting features automatically and decoding graphs into low-dimensional representations. As a result, these deep graph learning methods have been responsible for many successes. Yet, there is no comprehensive survey that reviews graph-level learning starting with traditional learning and moving through to the deep learning approaches. This article fills this gap and frames the representative algorithms into a systematic taxonomy covering traditional learning, graph-level deep neural networks, graph-level graph neural networks, and graph pooling. To ensure a thoroughly comprehensive survey, the evolutions, interactions, and communications between methods from four different branches of development are also examined. This is followed by a brief review of the benchmark data sets, evaluation metrics, and common downstream applications. The survey concludes with 13 future directions of necessary research that will help to overcome the challenges facing this booming field.
[[2301.06021] Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes](http://arxiv.org/abs/2301.06021) #interpretability
This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.
[[2301.06015] Diffusion-based Generation, Optimization, and Planning in 3D Scenes](http://arxiv.org/abs/2301.06015) #diffusion
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding.
[[2301.06052] T2M-GPT: Generating Human Motion from Textual Descriptions with discrete Representations](http://arxiv.org/abs/2301.06052) #diffusion
In this work, we investigate a simple and must-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textural descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during the training to alleviate training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with FID 0.116 largely outperforming MotionDiffuse of 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation.