[[2212.07387] Vulnerability Analysis of Smart Contracts](http://arxiv.org/abs/2212.07387) #secure
Blockchain Platforms and smart contracts are not secure to vulnerability issues. Security breached of smart contracts have led to huge financial losses in terms of cryptocurrencies and tokens. In this paper, we present a systematic survey of vulnerability analysis of smart contracts. We begin by providing a brief about the major type of attacks and vulnerabilities that are present in smart contracts. Then we discuss existing frameworks, methods and technologies used for vulnerability detection. We summarise our findings in a table which lists each framework and the attacks it protects against.
[[2212.06951] AI Ethics on Blockchain: Topic Analysis on Twitter Data for Blockchain Security](http://arxiv.org/abs/2212.06951) #security
Blockchain has empowered computer systems to be more secure using a distributed network. However, the current blockchain design suffers from fairness issues in transaction ordering. Miners are able to reorder transactions to generate profits, the so-called miner extractable value (MEV). Existing research recognizes MEV as a severe security issue and proposes potential solutions, including prominent Flashbots. However, previous studies have mostly analyzed blockchain data, which might not capture the impacts of MEV in a much broader AI society. Thus, in this research, we applied natural language processing (NLP) methods to comprehensively analyze topics in tweets on MEV. We collected more than 20000 tweets with #MEV and #Flashbots hashtags and analyzed their topics. Our results show that the tweets discussed profound topics of ethical concern, including security, equity, emotional sentiments, and the desire for solutions to MEV. We also identify the co-movements of MEV activities on blockchain and social media platforms. Our study contributes to the literature at the interface of blockchain security, MEV solutions, and AI ethics.
[[2212.06987] A Survey on Privacy of Personal and Non-Personal Data in B5G/6G Networks](http://arxiv.org/abs/2212.06987) #privacy
The upcoming Beyond 5G (B5G) and 6G networks are expected to provide enhanced capabilities such as ultra-high data rates, dense connectivity, and high scalability. It opens many possibilities for a new generation of services driven by Artificial Intelligence (AI) and billions of interconnected smart devices. However, with this expected massive upgrade, the privacy of people, organizations, and states is becoming a rising concern. The recent introduction of privacy laws and regulations for personal and non-personal data signals that global awareness is emerging in the current privacy landscape. Yet, many gaps need to be identified in the case of two data types. If not detected, they can lead to significant privacy leakages and attacks that will affect billions of people and organizations who utilize B5G/6G. This survey is a comprehensive study of personal and non-personal data privacy in B5G/6G to identify the current progress and future directions to ensure data privacy. We provide a detailed comparison of the two data types and a set of related privacy goals for B5G/6G. Next, we bring data privacy issues with possible solutions. This paper also provides future directions to preserve personal and non-personal data privacy in future networks.
[[2212.07326] Mathematical model of printing-imaging channel for blind detection of fake copy detection patterns](http://arxiv.org/abs/2212.07326) #protect
Nowadays, copy detection patterns (CDP) appear as a very promising anti-counterfeiting technology for physical object protection. However, the advent of deep learning as a powerful attacking tool has shown that the general authentication schemes are unable to compete and fail against such attacks. In this paper, we propose a new mathematical model of printing-imaging channel for the authentication of CDP together with a new detection scheme based on it. The results show that even deep learning created copy fakes unknown at the training stage can be reliably authenticated based on the proposed approach and using only digital references of CDP during authentication.
[[2212.06958] A Novel Active Solution for Two-Dimensional Face Presentation Attack Detection](http://arxiv.org/abs/2212.06958) #attack
Identity authentication is the process of verifying one's identity. There are several identity authentication methods, among which biometric authentication is of utmost importance. Facial recognition is a sort of biometric authentication with various applications, such as unlocking mobile phones and accessing bank accounts. However, presentation attacks pose the greatest threat to facial recognition. A presentation attack is an attempt to present a non-live face, such as a photo, video, mask, and makeup, to the camera. Presentation attack detection is a countermeasure that attempts to identify between a genuine user and a presentation attack. Several industries, such as financial services, healthcare, and education, use biometric authentication services on various devices. This illustrates the significance of presentation attack detection as the verification step. In this paper, we study state-of-the-art to cover the challenges and solutions related to presentation attack detection in a single place. We identify and classify different presentation attack types and identify the state-of-the-art methods that could be used to detect each of them. We compare the state-of-the-art literature regarding attack types, evaluation metrics, accuracy, and datasets and discuss research and industry challenges of presentation attack detection. Most presentation attack detection approaches rely on extensive data training and quality, making them difficult to implement. We introduce an efficient active presentation attack detection approach that overcomes weaknesses in the existing literature. The proposed approach does not require training data, is CPU-light, can process low-quality images, has been tested with users of various ages and is shown to be user-friendly and highly robust to 2-dimensional presentation attacks.
[[2212.06836] Towards Efficient and Domain-Agnostic Evasion Attack with High-dimensional Categorical Inputs](http://arxiv.org/abs/2212.06836) #attack
Our work targets at searching feasible adversarial perturbation to attack a classifier with high-dimensional categorical inputs in a domain-agnostic setting. This is intrinsically an NP-hard knapsack problem where the exploration space becomes explosively larger as the feature dimension increases. Without the help of domain knowledge, solving this problem via heuristic method, such as Branch-and-Bound, suffers from exponential complexity, yet can bring arbitrarily bad attack results. We address the challenge via the lens of multi-armed bandit based combinatorial search. Our proposed method, namely FEAT, treats modifying each categorical feature as pulling an arm in multi-armed bandit programming. Our objective is to achieve highly efficient and effective attack using an Orthogonal Matching Pursuit (OMP)-enhanced Upper Confidence Bound (UCB) exploration strategy. Our theoretical analysis bounding the regret gap of FEAT guarantees its practical attack performance. In empirical analysis, we compare FEAT with other state-of-the-art domain-agnostic attack methods over various real-world categorical data sets of different applications. Substantial experimental observations confirm the expected efficiency and attack effectiveness of FEAT applied in different application scenarios. Our work further hints the applicability of FEAT for assessing the adversarial vulnerability of classification systems with high-dimensional categorical inputs.
[[2212.07278] Backdoor Mitigation in Deep Neural Networks via Strategic Retraining](http://arxiv.org/abs/2212.07278) #attack
Deep Neural Networks (DNN) are becoming increasingly more important in assisted and automated driving. Using such entities which are obtained using machine learning is inevitable: tasks such as recognizing traffic signs cannot be developed reasonably using traditional software development methods. DNN however do have the problem that they are mostly black boxes and therefore hard to understand and debug. One particular problem is that they are prone to hidden backdoors. This means that the DNN misclassifies its input, because it considers properties that should not be decisive for the output. Backdoors may either be introduced by malicious attackers or by inappropriate training. In any case, detecting and removing them is important in the automotive area, as they might lead to safety violations with potentially severe consequences. In this paper, we introduce a novel method to remove backdoors. Our method works for both intentional as well as unintentional backdoors. We also do not require prior knowledge about the shape or distribution of backdoors. Experimental evidence shows that our method performs well on several medium-sized examples.
[[2212.06872] Examining the Difference Among Transformers and CNNs with Explanation Methods](http://arxiv.org/abs/2212.06872) #robust
We propose a methodology that systematically applies deep explanation algorithms on a dataset-wide basis, to compare different types of visual recognition backbones, such as convolutional networks (CNNs), global attention networks, and local attention networks. Examination of both qualitative visualizations and quantitative statistics across the dataset helps us to gain intuitions that are not just anecdotal, but are supported by the statistics computed on the entire dataset. Specifically, we propose two methods. The first one, sub-explanation counting, systematically searches for minimally-sufficient explanations of all images and count the amount of sub-explanations for each network. The second one, called cross-testing, computes salient regions using one network and then evaluates the performance by only showing these regions as an image to other networks. Through a combination of qualitative insights and quantitative statistics, we illustrate that 1) there are significant differences between the salient features of CNNs and attention models; 2) the occlusion-robustness in local attention models and global attention models may come from different decision-making mechanisms.
[[2212.06969] Localizing Objects in 3D from Egocentric Videos with Visual Queries](http://arxiv.org/abs/2212.06969) #robust
With the recent advances in video and 3D understanding, novel 4D spatio-temporal challenges fusing both concepts have emerged. Towards this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle the problem of VQ3D by lifting the 2D localization results of the sister task Visual Queries with 2D Localization (VQ2D) into a 3D reconstruction. Yet, we point out that the low number of Queries with Poses (QwP) from previous VQ3D methods severally hinders their overall success rate and highlights the need for further effort in 3D modeling to tackle the VQ3D task. In this work, we formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. We estimate more robust camera poses, leading to more successful object queries and substantially improved VQ3D performance. In practice, our method reaches a top-1 overall success rate of 86.36% on the Ego4D Episodic Memory Benchmark VQ3D, a 10x improvement over the previous state-of-the-art. In addition, we provide a complete empirical study highlighting the remaining challenges in VQ3D.
[[2212.07016] Understanding Zero-Shot Adversarial Robustness for Large-Scale Models](http://arxiv.org/abs/2212.07016) #robust
Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of \emph{adapting large-scale models for zero-shot adversarial robustness}. We first identify two key factors during model adaption -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaption methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins in the existence of text guidance. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, seeing an average improvement of over 31 points over ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models.
[[2212.07026] Improving group robustness under noisy labels using predictive uncertainty](http://arxiv.org/abs/2212.07026) #robust
The standard empirical risk minimization (ERM) can underperform on certain minority groups (i.e., waterbirds in lands or landbirds in water) due to the spurious correlation between the input and its label. Several studies have improved the worst-group accuracy by focusing on the high-loss samples. The hypothesis behind this is that such high-loss samples are \textit{spurious-cue-free} (SCF) samples. However, these approaches can be problematic since the high-loss samples may also be samples with noisy labels in the real-world scenarios. To resolve this issue, we utilize the predictive uncertainty of a model to improve the worst-group accuracy under noisy labels. To motivate this, we theoretically show that the high-uncertainty samples are the SCF samples in the binary classification problem. This theoretical result implies that the predictive uncertainty is an adequate indicator to identify SCF samples in a noisy label setting. Motivated from this, we propose a novel ENtropy based Debiasing (END) framework that prevents models from learning the spurious cues while being robust to the noisy labels. In the END framework, we first train the \textit{identification model} to obtain the SCF samples from a training set using its predictive uncertainty. Then, another model is trained on the dataset augmented with an oversampled SCF set. The experimental results show that our END framework outperforms other strong baselines on several real-world benchmarks that consider both the noisy labels and the spurious-cues.
[[2212.07044] 3D Neuron Morphology Analysis](http://arxiv.org/abs/2212.07044) #robust
We consider the problem of finding an accurate representation of neuron shapes, extracting sub-cellular features, and classifying neurons based on neuron shapes. In neuroscience research, the skeleton representation is often used as a compact and abstract representation of neuron shapes. However, existing methods are limited to getting and analyzing "curve" skeletons which can only be applied for tubular shapes. This paper presents a 3D neuron morphology analysis method for more general and complex neuron shapes. First, we introduce the concept of skeleton mesh to represent general neuron shapes and propose a novel method for computing mesh representations from 3D surface point clouds. A skeleton graph is then obtained from skeleton mesh and is used to extract sub-cellular features. Finally, an unsupervised learning method is used to embed the skeleton graph for neuron classification. Extensive experiment results are provided and demonstrate the robustness of our method to analyze neuron morphology.
[[2212.07086] NLIP: Noise-robust Language-Image Pre-training](http://arxiv.org/abs/2212.07086) #robust
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions to complete noisy ones, which uses the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide captioning generation. By collaboratively optimizing noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
[[2212.07144] Uncertain Facial Expression Recognition via Multi-task Assisted Correction](http://arxiv.org/abs/2212.07144) #robust
Deep models for facial expression recognition achieve high performance by training on large-scale labeled data. However, publicly available datasets contain uncertain facial expressions caused by ambiguous annotations or confusing emotions, which could severely decline the robustness. Previous studies usually follow the bias elimination method in general tasks without considering the uncertainty problem from the perspective of different corresponding sources. In this paper, we propose a novel method of multi-task assisted correction in addressing uncertain facial expression recognition called MTAC. Specifically, a confidence estimation block and a weighted regularization module are applied to highlight solid samples and suppress uncertain samples in every batch. In addition, two auxiliary tasks, i.e., action unit detection and valence-arousal measurement, are introduced to learn semantic distributions from a data-driven AU graph and mitigate category imbalance based on latent dependencies between discrete and continuous emotions, respectively. Moreover, a re-labeling strategy guided by feature-level similarity constraint further generates new labels for identified uncertain samples to promote model learning. The proposed method can flexibly combine with existing frameworks in a fully-supervised or weakly-supervised manner. Experiments on RAF-DB, AffectNet, and AffWild2 datasets demonstrate that the MTAC obtains substantial improvements over baselines when facing synthetic and real uncertainties and outperforms the state-of-the-art methods.
[[2212.07181] Event-based YOLO Object Detection: Proof of Concept for Forward Perception System](http://arxiv.org/abs/2212.07181) #robust
Neuromorphic vision or event vision is an advanced vision technology, where in contrast to the visible camera that outputs pixels, the event vision generates neuromorphic events every time there is a brightness change which exceeds a specific threshold in the field of view (FOV). This study focuses on leveraging neuromorphic event data for roadside object detection. This is a proof of concept towards building artificial intelligence (AI) based pipelines which can be used for forward perception systems for advanced vehicular applications. The focus is on building efficient state-of-the-art object detection networks with better inference results for fast-moving forward perception using an event camera. In this article, the event-simulated A2D2 dataset is manually annotated and trained on two different YOLOv5 networks (small and large variants). To further assess its robustness, single model testing and ensemble model testing are carried out.
[[2212.07283] Generative Robust Classification](http://arxiv.org/abs/2212.07283) #robust
Training adversarially robust discriminative (i.e., softmax) classifier has been the dominant approach to robust classification. Building on recent work on adversarial training (AT)-based generative models, we investigate using AT to learn unnormalized class-conditional density models and then performing generative robust classification. Our result shows that, under the condition of similar model capacities, the generative robust classifier achieves comparable performance to a baseline softmax robust classifier when the test data is clean or when the test perturbation is of limited size, and much better performance when the test perturbation size exceeds the training perturbation size. The generative classifier is also able to generate samples or counterfactuals that more closely resemble the training data, suggesting that the generative classifier can better capture the class-conditional distributions. In contrast to standard discriminative adversarial training where advanced data augmentation techniques are only effective when combined with weight averaging, we find it straightforward to apply advanced data augmentation to achieve better robustness in our approach. Our result suggests that the generative classifier is a competitive alternative to robust classification, especially for problems with limited number of classes.
[[2212.07422] ECON: Explicit Clothed humans Obtained from Normals](http://arxiv.org/abs/2212.07422) #robust
The combination of artist-curated scans, and deep implicit functions (IF), is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry but produce disembodied limbs or degenerate shapes for unseen poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit and explicit methods. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a "canvas" for stitching together detailed surface patches. ECON infers high-fidelity 3D humans even in loose clothes and challenging poses, while having realistic faces and fingers. This goes beyond previous methods. Quantitative, evaluation of the CAPE and Renderpeople datasets shows that ECON is more accurate than the state of the art. Perceptual studies also show that ECON's perceived realism is better by a large margin. Code and models are available for research purposes at https://xiuyuliang.cn/econ
[[2212.07284] MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling](http://arxiv.org/abs/2212.07284) #robust
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
[[2212.06864] Task-Adaptive Meta-Learning Framework for Advancing Spatial Generalizability](http://arxiv.org/abs/2212.06864) #robust
Spatio-temporal machine learning is critically needed for a variety of societal applications, such as agricultural monitoring, hydrological forecast, and traffic management. These applications greatly rely on regional features that characterize spatial and temporal differences. However, spatio-temporal data are often complex and pose several unique challenges for machine learning models: 1) multiple models are needed to handle region-based data patterns that have significant spatial heterogeneity across different locations; 2) local models trained on region-specific data have limited ability to adapt to other regions that have large diversity and abnormality; 3) spatial and temporal variations entangle data complexity that requires more robust and adaptive models; 4) limited spatial-temporal data in real scenarios (e.g., crop yield data is collected only once a year) makes the problems intrinsically challenging. To bridge these gaps, we propose task-adaptive formulations and a model-agnostic meta-learning framework that ensembles regionally heterogeneous data into location-sensitive meta tasks. We conduct task adaptation following an easy-to-hard task hierarchy in which different meta models are adapted to tasks of different difficulty levels. One major advantage of our proposed method is that it improves the model adaptation to a large number of heterogeneous tasks. It also enhances the model generalization by automatically adapting the meta model of the corresponding difficulty level to any new tasks. We demonstrate the superiority of our proposed framework over a diverse set of baselines and state-of-the-art meta-learning frameworks. Our extensive experiments on real crop yield data show the effectiveness of the proposed method in handling spatial-related heterogeneous tasks in real societal applications.
[[2212.07414] Hierarchical Over-the-Air FedGradNorm](http://arxiv.org/abs/2212.07414) #robust
Multi-task learning (MTL) is a learning paradigm to learn multiple related tasks simultaneously with a single shared network where each task has a distinct personalized header network for fine-tuning. MTL can be integrated into a federated learning (FL) setting if tasks are distributed across clients and clients have a single shared network, leading to personalized federated learning (PFL). To cope with statistical heterogeneity in the federated setting across clients which can significantly degrade the learning performance, we use a distributed dynamic weighting approach. To perform the communication between the remote parameter server (PS) and the clients efficiently over the noisy channel in a power and bandwidth-limited regime, we utilize over-the-air (OTA) aggregation and hierarchical federated learning (HFL). Thus, we propose hierarchical over-the-air (HOTA) PFL with a dynamic weighting strategy which we call HOTA-FedGradNorm. Our algorithm considers the channel conditions during the dynamic weight selection process. We conduct experiments on a wireless communication system dataset (RadComDynamic). The experimental results demonstrate that the training speed with HOTA-FedGradNorm is faster compared to the algorithms with a naive static equal weighting strategy. In addition, HOTA-FedGradNorm provides robustness against the negative channel effects by compensating for the channel conditions during the dynamic weight selection process.
[[2212.07299] Child PalmID: Contactless Palmprint Recognition](http://arxiv.org/abs/2212.07299) #biometric
Developing and least developed countries face the dire challenge of ensuring that each child in their country receives required doses of vaccination, adequate nutrition and proper medication. International agencies such as UNICEF, WHO and WFP, among other organizations, strive to find innovative solutions to determine which child has received the benefits and which have not. Biometric recognition systems have been sought out to help solve this problem. To that end, this report establishes a baseline accuracy of a commercial contactless palmprint recognition system that may be deployed for recognizing children in the age group of one to five years old. On a database of contactless palmprint images of one thousand unique palms from 500 children, we establish SOTA authentication accuracy of 90.85% @ FAR of 0.01%, rank-1 identification accuracy of 99.0% (closed set), and FPIR=0.01 @ FNIR=0.3 for open-set identification using PalmMobile SDK from Armatura.
[[2212.07047] Shared Coupling-bridge for Weakly Supervised Local Feature Learning](http://arxiv.org/abs/2212.07047) #extraction
Sparse local feature extraction is usually believed to be of important significance in typical vision tasks such as simultaneous localization and mapping, image matching and 3D reconstruction. At present, it still has some deficiencies needing further improvement, mainly including the discrimination power of extracted local descriptors, the localization accuracy of detected keypoints, and the efficiency of local feature learning. This paper focuses on promoting the currently popular sparse local feature learning with camera pose supervision. Therefore, it pertinently proposes a Shared Coupling-bridge scheme with four light-weight yet effective improvements for weakly-supervised local feature (SCFeat) learning. It mainly contains: i) the \emph{Feature-Fusion-ResUNet Backbone} (F2R-Backbone) for local descriptors learning, ii) a shared coupling-bridge normalization to improve the decoupling training of description network and detection network, iii) an improved detection network with peakiness measurement to detect keypoints and iv) the fundamental matrix error as a reward factor to further optimize feature detection training. Extensive experiments prove that our SCFeat improvement is effective. It could often obtain a state-of-the-art performance on classic image matching and visual localization. In terms of 3D reconstruction, it could still achieve competitive results. For sharing and communication, our source codes are available at https://github.com/sunjiayuanro/SCFeat.git.
[[2212.07060] VINet: Lightweight, Scalable, and Heterogeneous Cooperative Perception for 3D Object Detection](http://arxiv.org/abs/2212.07060) #extraction
Utilizing the latest advances in Artificial Intelligence (AI), the computer vision community is now witnessing an unprecedented evolution in all kinds of perception tasks, particularly in object detection. Based on multiple spatially separated perception nodes, Cooperative Perception (CP) has emerged to significantly advance the perception of automated driving. However, current cooperative object detection methods mainly focus on ego-vehicle efficiency without considering the practical issues of system-wide costs. In this paper, we introduce VINet, a unified deep learning-based CP network for scalable, lightweight, and heterogeneous cooperative 3D object detection. VINet is the first CP method designed from the standpoint of large-scale system-level implementation and can be divided into three main phases: 1) Global Pre-Processing and Lightweight Feature Extraction which prepare the data into global style and extract features for cooperation in a lightweight manner; 2) Two-Stream Fusion which fuses the features from scalable and heterogeneous perception nodes; and 3) Central Feature Backbone and 3D Detection Head which further process the fused features and generate cooperative detection results. A cooperative perception platform is designed and developed for CP dataset acquisition and several baselines are compared during the experiments. The experimental analysis shows that VINet can achieve remarkable improvements for pedestrians and cars with 2x less system-wide computational costs and 12x less system-wide communicational costs.
[[2212.07112] DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog](http://arxiv.org/abs/2212.07112) #extraction
Harvesting question-answer (QA) pairs from customer service chatlog in the wild is an efficient way to enrich the knowledge base for customer service chatbots in the cold start or continuous integration scenarios. Prior work attempts to obtain 1-to-1 QA pairs from growing customer service chatlog, which fails to integrate the incomplete utterances from the dialog context for composite QA retrieval. In this paper, we propose N-to-N QA extraction task in which the derived questions and corresponding answers might be separated across different utterances. We introduce a suite of generative/discriminative tagging based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets and for the first time setup a benchmark for N-to-N DialogQAE with utterance and session level evaluation metrics. With a deep dive into extracted QA pairs, we find that the relations between and inside the QA pairs can be indicators to analyze the dialogue structure, e.g. information seeking, clarification, barge-in and elaboration. We also show that the proposed models can adapt to different domains and languages, and reduce the labor cost of knowledge accumulation in the real-world product dialogue platform.
[[2212.07156] MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text](http://arxiv.org/abs/2212.07156) #extraction
Modal verbs (e.g., "can", "should", or "must") occur highly frequently in scientific articles. Decoding their function is not straightforward: they are often used for hedging, but they may also denote abilities and restrictions. Understanding their meaning is important for various NLP tasks such as writing assistance or accurate information extraction from scientific text.
To foster research on the usage of modals in this genre, we introduce the MIST (Modals In Scientific Text) dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function. We systematically evaluate a set of competitive neural architectures on MIST. Transfer experiments reveal that leveraging non-scientific data is of limited benefit for modeling the distinctions in MIST. Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs, yet, classifiers trained on scientific data generalize to some extent to unseen scientific domains.
[[2212.07172] Quotations, Coreference Resolution, and Sentiment Annotations in Croatian News Articles: An Exploratory Study](http://arxiv.org/abs/2212.07172) #extraction
This paper presents a corpus annotated for the task of direct-speech extraction in Croatian. The paper focuses on the annotation of the quotation, co-reference resolution, and sentiment annotation in SETimes news corpus in Croatian and on the analysis of its language-specific differences compared to English. From this, a list of the phenomena that require special attention when performing these annotations is derived. The generated corpus with quotation features annotations can be used for multiple tasks in the field of Natural Language Processing.
[[2212.07179] FLAGS Framework for Comparative Analysis of Federated Learning Algorithms](http://arxiv.org/abs/2212.07179) #federate
Federated Learning (FL) has become a key choice for distributed machine learning. Initially focused on centralized aggregation, recent works in FL have emphasized greater decentralization to adapt to the highly heterogeneous network edge. Among these, Hierarchical, Device-to-Device and Gossip Federated Learning (HFL, D2DFL \& GFL respectively) can be considered as foundational FL algorithms employing fundamental aggregation strategies. A number of FL algorithms were subsequently proposed employing multiple fundamental aggregation schemes jointly. Existing research, however, subjects the FL algorithms to varied conditions and gauges the performance of these algorithms mainly against Federated Averaging (FedAvg) only. This work consolidates the FL landscape and offers an objective analysis of the major FL algorithms through a comprehensive cross-evaluation for a wide range of operating conditions. In addition to the three foundational FL algorithms, this work also analyzes six derived algorithms. To enable a uniform assessment, a multi-FL framework named FLAGS: Federated Learning AlGorithms Simulation has been developed for rapid configuration of multiple FL algorithms. Our experiments indicate that fully decentralized FL algorithms achieve comparable accuracy under multiple operating conditions, including asynchronous aggregation and the presence of stragglers. Furthermore, decentralized FL can also operate in noisy environments and with a comparably higher local update rate. However, the impact of extremely skewed data distributions on decentralized FL is much more adverse than on centralized variants. The results indicate that it may not be necessary to restrict the devices to a single FL algorithm; rather, multi-FL nodes may operate with greater efficiency.
[[2212.07224] FedSkip: Combatting Statistical Heterogeneity with Federated Skip Aggregation](http://arxiv.org/abs/2212.07224) #federate
The statistical heterogeneity of the non-independent and identically distributed (non-IID) data in local clients significantly limits the performance of federated learning. Previous attempts like FedProx, SCAFFOLD, MOON, FedNova and FedDyn resort to an optimization perspective, which requires an auxiliary term or re-weights local updates to calibrate the learning bias or the objective inconsistency. However, in addition to previous explorations for improvement in federated averaging, our analysis shows that another critical bottleneck is the poorer optima of client models in more heterogeneous conditions. We thus introduce a data-driven approach called FedSkip to improve the client optima by periodically skipping federated averaging and scattering local models to the cross devices. We provide theoretical analysis of the possible benefit from FedSkip and conduct extensive experiments on a range of datasets to demonstrate that FedSkip achieves much higher accuracy, better aggregation efficiency and competing communication efficiency. Source code is available at: https://github.com/MediaBrain-SJTU/FedSkip.
[[2212.07356] Scheduling and Aggregation Design for Asynchronous Federated Learning over Wireless Networks](http://arxiv.org/abs/2212.07356) #federate
Federated Learning (FL) is a collaborative machine learning (ML) framework that combines on-device training and server-based aggregation to train a common ML model among distributed agents. In this work, we propose an asynchronous FL design with periodic aggregation to tackle the straggler issue in FL systems. Considering limited wireless communication resources, we investigate the effect of different scheduling policies and aggregation designs on the convergence performance. Driven by the importance of reducing the bias and variance of the aggregated model updates, we propose a scheduling policy that jointly considers the channel quality and training data representation of user devices. The effectiveness of our channel-aware data-importance-based scheduling policy, compared with state-of-the-art methods proposed for synchronous FL, is validated through simulations. Moreover, we show that an "age-aware" aggregation weighting design can significantly improve the learning performance in an asynchronous FL setting.
[[2212.07265] Cross-Channel: Scalable Off-Chain Channels Supporting Fair and Atomic Cross-Chain Operations](http://arxiv.org/abs/2212.07265) #fair
Cross-chain technology facilitates the interoperability among isolated blockchains on which users can freely communicate and transfer values. Existing cross-chain protocols suffer from the scalability problem when processing on-chain transactions. Off-chain channels, as a promising blockchain scaling technique, can enable micro-payment transactions without involving on-chain transaction settlement. However, existing channel schemes can only be applied to operations within a single blockchain, failing to support cross-chain services. Therefore in this paper, we propose Cross-Channel, the first off-chain channel to support cross-chain services. We introduce a novel hierarchical channel structure, a new hierarchical settlement protocol, and a smart general fair exchange protocol, to ensure scalability, fairness, and atomicity of cross-chain interactions. Besides, Cross-Channel provides strong security and practicality by avoiding high latency in asynchronous networks. Through a 50-instance deployment of Cross-Channel on AliCloud, we demonstrate that Cross-Channel is well-suited for processing cross-chain transactions in high-frequency and large-scale, and brings a significantly enhanced throughput with a small amount of gas and delay overhead.
[[2212.06925] On the Relationship Between Explanation and Prediction: A Causal View](http://arxiv.org/abs/2212.06925) #explainability
Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.
[[2212.07056] On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach](http://arxiv.org/abs/2212.07056) #explainability
Explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a trade-off between the two. To search for the most necessary and sufficient explanation, the Probability of Necessity and Sufficiency (PNS) can be applied since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limits its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples for the features, which is different from the random perturbation. In particular, NSEG combines the edges and node features to generate an explanation, where the common edge explanation is a special case of the combined explanation. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
[[2212.06858] LidarCLIP or: How I Learned to Talk to Point Clouds](http://arxiv.org/abs/2212.06858) #diffusion
Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also use LidarCLIP as a tool to investigate fundamental lidar capabilities through natural language. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. We hope LidarCLIP can inspire future work to dive deeper into connections between text and point cloud understanding. Code and trained models available at https://github.com/atonderski/lidarclip.
[[2212.06909] Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting](http://arxiv.org/abs/2212.06909) #diffusion
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built, by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
[[2212.07352] Bi-Noising Diffusion: Towards Conditional Diffusion Models with Generative Restoration Priors](http://arxiv.org/abs/2212.07352) #diffusion
Conditional diffusion probabilistic models can model the distribution of natural images and can generate diverse and realistic samples based on given conditions. However, oftentimes their results can be unrealistic with observable color shifts and textures. We believe that this issue results from the divergence between the probabilistic distribution learned by the model and the distribution of natural images. The delicate conditions gradually enlarge the divergence during each sampling timestep. To address this issue, we introduce a new method that brings the predicted samples to the training data manifold using a pretrained unconditional diffusion model. The unconditional model acts as a regularizer and reduces the divergence introduced by the conditional model at each sampling step. We perform comprehensive experiments to demonstrate the effectiveness of our approach on super-resolution, colorization, turbulence removal, and image-deraining tasks. The improvements obtained by our method suggest that the priors can be incorporated as a general plugin for improving conditional diffusion models.