[[2209.09388] An Owner-managed Indirect-Permission Social Authentication Method for Private Key Recovery](http://arxiv.org/abs/2209.09388)
In this paper, we propose a very secure and reliable owner-self-managed private key recovery method. In recent years, the Public Key Authentication (PKA) method has been identified as the most feasible online security solution. However, losing the private key also implies the risk of losing the ownership of the assets associated with the private key. For key protection, the commonly adopted something-you-x solutions require a new secret to protect the target secret and fall into a circular protection issue, as the new secret has to be protected too. To resolve the circular protection issue and provide a truly secure and reliable solution, we propose separating the permission and possession of the private key. Then we create secret shares of the permission using the open public keys of selected trustees while having the owner possess the permission-encrypted private key. Then by applying the social authentication method, one may easily retrieve the permission to recover the private key. Our analysis shows that our proposed indirect-permission method is six orders of magnitude more secure and reliable than
[[2209.09424] PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference](http://arxiv.org/abs/2209.09424)
The rapid growth and deployment of deep learning (DL) has witnessed emerging privacy and security concerns. To mitigate these issues, secure multi-party computation (MPC) has been discussed, to enable the privacy-preserving DL computation. In practice, they often come at very high computation and communication overhead, and potentially prohibit their popularity in large scale systems. Two orthogonal research trends have attracted enormous interests in addressing the energy efficiency in secure deep learning, i.e., overhead reduction of MPC comparison protocol, and hardware acceleration. However, they either achieve a low reduction ratio and suffer from high latency due to limited computation and communication saving, or are power-hungry as existing works mainly focus on general computing platforms such as CPUs and GPUs.
In this work, as the first attempt, we develop a systematic framework, PolyMPCNet, of joint overhead reduction of MPC comparison protocol and hardware acceleration, by integrating hardware latency of the cryptographic building block into the DNN loss function to achieve high energy efficiency, accuracy, and security guarantee. Instead of heuristically checking the model sensitivity after a DNN is well-trained (through deleting or dropping some non-polynomial operators), our key design principle is to enforce exactly what is assumed in the DNN design -- training a DNN that is both hardware efficient and secure, while escaping the local minima and saddle points and maintaining high accuracy. More specifically, we propose a straight through polynomial activation initialization method for cryptographic hardware friendly trainable polynomial activation function to replace the expensive 2P-ReLU operator. We develop a cryptographic hardware scheduler and the corresponding performance model for Field Programmable Gate Arrays (FPGA) platform.
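To illustrate why a polynomial activation is attractive in this setting, the sketch below contrasts ReLU, which needs a comparison, with a quadratic activation that uses only additions and multiplications. It is a minimal illustration, not PolyMPCNet's implementation: the function names, the quadratic degree, the identity-like default coefficients, and the example coefficients `0.25` and `0.5` are all assumptions made for this example.

```python
def relu(x):
    # ReLU needs a comparison, which is the costly step under two-party computation.
    return x if x > 0.0 else 0.0


def poly_activation(x, a=0.0, b=1.0, c=0.0):
    # Hypothetical trainable quadratic activation a*x^2 + b*x + c.
    # Horner's rule: only additions and multiplications, no comparison.
    # Defaults (a=0, b=1, c=0) make it the identity map; this choice of
    # starting point is an assumption of the sketch, not taken from the paper.
    return (a * x + b) * x + c


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        # Illustrative coefficients only; in practice they would be learned.
        print(x, relu(x), poly_activation(x, a=0.25, b=0.5, c=0.0))
```

In a real network the coefficients `a`, `b`, and `c` would be parameters updated by the optimizer alongside the layer weights.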
[[2209.09642] A Secure Healthcare 5](http://arxiv.org/abs/2209.09642)
In recent years, the global Internet of Medical Things (IoMT) industry has evolved at a tremendous speed. Security and privacy are key concerns on the IoMT, owing to the huge scale and deployment of IoMT networks. Machine learning (ML) and blockchain (BC) technologies have significantly enhanced the capabilities and facilities of healthcare 5.0, spawning a new area known as "Smart Healthcare." By identifying concerns early, a smart healthcare system can help avoid long-term damage. This will enhance the quality of life for patients while reducing their stress and healthcare costs. The IoMT enables a range of functionalities in the field of information technology, one of which is smart and interactive health care. However, combining medical data into a single storage location to train a powerful machine learning model raises concerns about privacy, ownership, and compliance with greater concentration. Federated learning (FL) overcomes the preceding difficulties by utilizing a centralized aggregate server to disseminate a global learning model. Simultaneously, the local participant keeps control of patient information, assuring data confidentiality and security. This article conducts a comprehensive analysis of the findings on blockchain technology entangled with federated learning in healthcare 5.0. The purpose of this study is to construct a secure health monitoring system in healthcare 5.0 by utilizing blockchain technology and an Intrusion Detection System (IDS) to detect any malicious activity in a healthcare network and to enable physicians to monitor patients through medical sensors and take necessary measures periodically by predicting diseases.
[[2209.09835] EM-Fault It Yourself: Building a Replicable EMFI Setup for Desktop and Server Hardware](http://arxiv.org/abs/2209.09835)
EMFI has become a popular fault injection (FI) technique due to its ability to inject faults precisely considering timing and location. Recently, ARM, RISC-V, and even x86 processing units in different packages were shown to be vulnerable to electromagnetic fault injection (EMFI) attacks. However, past publications lack a detailed description of the entire attack setup, hindering researchers and companies from easily replicating the presented attacks on their devices. In this work, we first show how to build an automated EMFI setup with high scanning resolution and good repeatability that is large enough to attack modern desktop and server CPUs. We structurally lay out all details on mechanics, hardware, and software along with this paper. Second, we use our setup to attack a deeply embedded security co-processor in modern AMD systems on a chip (SoCs), the AMD Secure Processor (AMD-SP). Using a previously published code execution exploit, we run two custom payloads on the AMD-SP that utilize the SoC to different degrees. We then visualize these fault locations on SoC photographs allowing us to reason about the SoC's components under attack. Finally, we show that the signature verification process of one of the first executed firmware parts is susceptible to EMFI attacks, undermining the security architecture of the entire SoC. To the best of our knowledge, this is the first reported EMFI attack against an AMD desktop CPU.
[[2209.09630] Detection of Malicious Websites Using Machine Learning Techniques](http://arxiv.org/abs/2209.09630)
In detecting malicious websites, a common approach is the use of blacklists which are not exhaustive in themselves and are unable to generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce the vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different sets of datasets and then carried out a cross-datasets analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently high across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model of predicting every link as malicious across all metrics and datasets. Also, we found no evidence that any subset of lexical features generalizes across models or datasets. This research should be relevant to cybersecurity professionals and academic researchers as it could form the basis for real-life detection systems or further research work.
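As a rough illustration of what "lexical features" can mean for a URL classifier, the sketch below derives a few simple counts from the URL string alone. The feature set and the function name `lexical_features` are hypothetical choices for this example; the abstract does not list the features the study actually used.

```python
from urllib.parse import urlsplit


def lexical_features(url):
    # Hypothetical feature set: everything is computed from the URL text,
    # without fetching the page or consulting a blacklist.
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "has_at_symbol": "@" in url,
    }


if __name__ == "__main__":
    print(lexical_features("http://login-secure.example.com/a/b/index.php?id=42"))
```

A dictionary like this can be turned into a feature vector and passed to any of the classifier families named in the abstract.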
[[2209.09653] A Framework for Preserving Privacy and Cybersecurity in Brain-Computer Interfacing Applications](http://arxiv.org/abs/2209.09653)
Brain-Computer Interfaces (BCIs) comprise a rapidly evolving field of technology with the potential of far-reaching impact in domains ranging from medical over industrial to artistic, gaming, and military. Today, these emerging BCI applications are typically still at early technology readiness levels, but because BCIs create novel, technical communication channels for the human brain, they have raised privacy and security concerns. To mitigate such risks, a large body of countermeasures has been proposed in the literature, but a general framework is lacking which would describe how privacy and security of BCI applications can be protected by design, i.e., already as an integral part of the early BCI design process, in a systematic manner, and allowing suitable depth of analysis for different contexts such as commercial BCI product development vs. academic research and lab prototypes. Here we propose the adoption of recent systems-engineering methodologies for privacy threat modeling, risk assessment, and privacy engineering to the BCI field. These methodologies address privacy and security concerns in a more systematic and holistic way than previous approaches, and provide reusable patterns on how to move from principles to actions. We apply these methodologies to BCI and data flows and derive a generic, extensible, and actionable framework for brain-privacy-preserving cybersecurity in BCI applications. This framework is designed for flexible application to the wide range of current and future BCI applications. We also propose a range of novel privacy-by-design features for BCIs, with an emphasis on features promoting BCI transparency as a prerequisite for informational self-determination of BCI users, as well as design features for ensuring BCI user autonomy. We anticipate that our framework will contribute to the development of privacy-respecting, trustworthy BCI technologies.
[[2209.09769] Peer-group Behaviour Analytics of Windows Authentications Events Using Hierarchical Bayesian Modelling](http://arxiv.org/abs/2209.09769)
Cyber-security analysts face an increasingly large number of alerts received on any given day. This is mainly due to the low precision of many existing methods to detect threats, producing a substantial number of false positives. Usually, several signature-based and statistical anomaly detectors are implemented within a computer network to detect threats. Recent efforts in User and Entity Behaviour Analytics modelling shed light on how to reduce the burden on Security Operations Centre analysts through a better understanding of peer-group behaviour. Statistically, the challenge consists of accurately grouping users with similar behaviour, and then identifying those who deviate from their peers. This work proposes a new approach for peer-group behaviour modelling of Windows authentication events, using principles from hierarchical Bayesian models. This is a two-stage approach where in the first stage, peer-groups are formed based on a data-driven method, given the user's individual authentication pattern. In the second stage, the counts of users authenticating to different entities are aggregated by hour and modelled by a Poisson distribution, taking into account seasonality components and hierarchical principles. Finally, we compare grouping users based on their human resources records against the data-driven methods and provide empirical evidence about alert reduction on a real-world authentication data set from a large enterprise network.
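The second stage can be pictured with a much simpler stand-in: score each user's hourly count against a Poisson rate estimated from the rest of the peer group. The sketch below is not the hierarchical Bayesian model of the paper; it ignores seasonality and the hierarchy, and the function names, the leave-one-out rate estimate, and the `0.001` threshold are assumptions made for this example.

```python
import math


def poisson_tail(k, rate):
    # P(X >= k) for X ~ Poisson(rate), computed as 1 - P(X <= k - 1).
    if k <= 0:
        return 1.0
    pmf = math.exp(-rate)   # P(X = 0)
    cdf = pmf
    for i in range(1, k):
        pmf *= rate / i     # P(X = i) from P(X = i - 1)
        cdf += pmf
    return max(0.0, 1.0 - cdf)


def flag_unusual(counts_by_user, threshold=0.001):
    # Hypothetical rule: flag a user whose count is improbably high under a
    # Poisson rate equal to the mean count of the other peers in the group.
    # Requires at least two users in the group.
    flagged = []
    for user, count in counts_by_user.items():
        peers = [c for u, c in counts_by_user.items() if u != user]
        rate = sum(peers) / len(peers)
        if poisson_tail(count, rate) < threshold:
            flagged.append(user)
    return flagged


if __name__ == "__main__":
    hourly_counts = {"a": 3, "b": 4, "c": 2, "d": 30}  # made-up counts
    print(flag_unusual(hourly_counts))
```

Estimating the rate from the peers rather than from the user alone is what lets a deviation from the group, and not just from one's own history, raise an alert.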
[[2209.09855] Toward Identification and Characterization of IoT Software Update Practices](http://arxiv.org/abs/2209.09855)
Software update systems are critical for ensuring systems remain free of bugs and vulnerabilities while they are in service. While many Internet of Things (IoT) devices are capable of outlasting desktops and mobile phones, their software update practices are not yet well understood. This paper discusses efforts toward characterizing the IoT software update landscape through network analysis of IoT device traffic. Our results suggest that vendors do not currently follow security best practices, and that software update standards, while available, are not being deployed. We discuss our findings and give a research agenda for improving the overall security and transparency of software updates on IoT.
[[2209.09239] Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey](http://arxiv.org/abs/2209.09239)
Data quality is the key factor for the development of trustworthy AI in healthcare. A large volume of curated datasets with controlled confounding factors can help improve the accuracy, robustness and privacy of downstream AI algorithms. However, access to good quality datasets is limited by the technical difficulty of data acquisition and large-scale sharing of healthcare data is hindered by strict ethical restrictions. Data synthesis algorithms, which generate data with a similar distribution as real clinical data, can serve as a potential solution to address the scarcity of good quality data during the development of trustworthy AI. However, state-of-the-art data synthesis algorithms, especially deep learning algorithms, focus more on imaging data while neglecting the synthesis of non-imaging healthcare data, including clinical measurements, medical signals and waveforms, and electronic healthcare records (EHRs). Thus, in this paper, we will review the synthesis algorithms, particularly for non-imaging medical data, with the aim of providing trustworthy AI in this domain. This tutorial-styled review paper will provide comprehensive descriptions of non-imaging medical data synthesis on aspects including algorithms, evaluations, limitations and future research directions.
[[2209.09584] Non-Disclosing Credential On-chaining for Blockchain-based Decentralized Applications](http://arxiv.org/abs/2209.09584)
Many service systems rely on verifiable identity-related information of their users. Manipulation and unwanted exposure of this privacy-relevant information, however, must at the same time be prevented and avoided. Peer-to-peer blockchain-based decentralization with a smart contract-based execution model and verifiable off-chain computations leveraging zero-knowledge proofs promise to provide the basis for next-generation, non-disclosing credential management solutions. In this paper, we propose a novel credential on-chaining system that ensures blockchain-based transparency while preserving pseudonymity. We present a general model compliant to the W3C verifiable credential recommendation and demonstrate how it can be applied to solve existing problems that require computational identity-related attribute verification. Our zkSNARKs-based reference implementation and evaluation show that, compared to related approaches based on, e.g., CL-signatures, our approach provides significant performance advantages and more flexible proof mechanisms, underpinning our vision of increasingly decentralized, transparent, and trustworthy service systems.
[[2209.09631] De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks](http://arxiv.org/abs/2209.09631)
Unstructured textual data are at the heart of health systems: liaison letters between doctors, operating reports, coding of procedures according to the ICD-10 standard, etc. The details included in these documents make it possible to get to know the patient better, to better manage him or her, to better study the pathologies, to accurately remunerate the associated medical acts... All this seems to be (at least partially) within reach today through artificial intelligence techniques. However, for obvious reasons of privacy protection, the designers of these AIs do not have the legal right to access these documents as long as they contain identifying data. De-identifying these documents, i.e. detecting and deleting all identifying information present in them, is a legally necessary step for sharing this data between two complementary worlds. Over the last decade, several proposals have been made to de-identify documents, mainly in English. While the detection scores are often high, the substitution methods are often not very robust to attack. In French, very few methods are based on arbitrary detection and/or substitution rules. In this paper, we propose a new comprehensive de-identification method dedicated to French-language medical documents. Both the approach for the detection of identifying elements (based on deep learning) and their substitution (based on differential privacy) are based on the most proven existing approaches. The result is an approach that effectively protects the privacy of the patients at the heart of these medical documents. The whole approach has been evaluated on a French language medical dataset of a French public hospital and the results are very encouraging.
[[2209.09502] GAMA: Generative Adversarial Multi-Object Scene Attacks](http://arxiv.org/abs/2209.09502)
The majority of methods for crafting adversarial attacks have focused on scenes with a single dominant object (e.g., images from ImageNet). On the other hand, natural scenes include multiple dominant objects that are semantically related. Thus, it is crucial to explore designing attack strategies that look beyond learning on single-object scenes or attack single-object victim classifiers. Due to their inherent property of strong transferability of perturbations to unknown models, this paper presents the first approach of using generative models for adversarial attacks on multi-object scenes. In order to represent the relationships between different objects in the input scene, we leverage upon the open-sourced pre-trained vision-language model CLIP (Contrastive Language-Image Pre-training), with the motivation to exploit the encoded semantics in the language space along with the visual space. We call this attack approach Generative Adversarial Multi-object scene Attacks (GAMA). GAMA demonstrates the utility of the CLIP model as an attacker's tool to train formidable perturbation generators for multi-object scenes. Using the joint image-text features to train the generator, we show that GAMA can craft potent transferable perturbations in order to fool victim classifiers in various attack settings. For example, GAMA triggers ~16% more misclassification than state-of-the-art generative approaches in black-box settings where both the classifier architecture and data distribution of the attacker are different from the victim. Our code will be made publicly available soon.
[[2209.09652] Adversarial Color Projection: A Projector-Based Physical Attack to DNNs](http://arxiv.org/abs/2209.09652)
Recent advances have shown that deep neural networks (DNNs) are susceptible to adversarial perturbations. Therefore, it is necessary to evaluate the robustness of advanced DNNs using adversarial attacks. However, traditional physical attacks that use stickers as perturbations are more vulnerable than recent light-based physical attacks. In this work, we propose a projector-based physical attack called adversarial color projection (AdvCP), which performs an adversarial attack by manipulating the physical parameters of the projected light. Experiments show the effectiveness of our method in both digital and physical environments. The experimental results demonstrate that the proposed method has excellent attack transferability, which enables AdvCP to mount effective black-box attacks. We discuss the threats AdvCP poses to future vision-based systems and applications and propose some ideas for light-based physical attacks.
[[2209.09883] Leveraging Local Patch Differences in Multi-Object Scenes for Generative Adversarial Attacks](http://arxiv.org/abs/2209.09883)
State-of-the-art generative model-based attacks against image classifiers overwhelmingly focus on single-object (i.e., single dominant object) images. Different from such settings, we tackle a more practical problem of generating adversarial perturbations using multi-object (i.e., multiple dominant objects) images as they are representative of most real-world scenes. Our goal is to design an attack strategy that can learn from such natural scenes by leveraging the local patch differences that occur inherently in such images (e.g., difference between the local patch on the object 'person' and the object 'bike' in a traffic scene). Our key idea is: to misclassify an adversarial multi-object image, each local patch in the image should confuse the victim classifier. Based on this, we propose a novel generative attack (called Local Patch Difference or LPD-Attack) where a novel contrastive loss function uses the aforesaid local differences in feature space of multi-object scenes to optimize the perturbation generator. Through various experiments across diverse victim convolutional neural networks, we show that our approach outperforms baseline generative attacks with highly transferable perturbations when evaluated under different white-box and black-box settings.
[[2209.09557] CANflict: Exploiting Peripheral Conflicts for Data-Link Layer Attacks on Automotive Networks](http://arxiv.org/abs/2209.09557)
Current research in the automotive domain has proven the limitations of the CAN protocol from a security standpoint. Application-layer attacks, which involve the creation of malicious packets, are deemed feasible from remote but can be easily detected by modern IDS. On the other hand, more recent link-layer attacks are stealthier and possibly more disruptive but require physical access to the bus. In this paper, we present CANflict, a software-only approach that allows reliable manipulation of the CAN bus at the data link layer from an unmodified microcontroller, overcoming the limitations of state-of-the-art works. We demonstrate that it is possible to deploy stealthy CAN link-layer attacks from a remotely compromised ECU, targeting another ECU on the same CAN network. To do this, we exploit the presence of pin conflicts between microcontroller peripherals to craft polyglot frames, which allows an attacker to control the CAN traffic at the bit level and bypass the protocol's rules. We experimentally demonstrate the effectiveness of our approach on high-, mid-, and low-end microcontrollers, and we provide the ground for future research by releasing an extensible tool that can be used to implement our approach on different platforms and to build CAN countermeasures at the data link layer.
[[2209.09577] Understanding Real-world Threats to Deep Learning Models in Android Apps](http://arxiv.org/abs/2209.09577)
Famous for its superior performance, deep learning (DL) has been popularly used within many applications, which has also attracted various threats to the models. One primary threat is from adversarial attacks. Researchers have intensively studied this threat for several years and proposed dozens of approaches to create adversarial examples (AEs). But most of the approaches are only evaluated on limited models and datasets (e.g., MNIST, CIFAR-10). Thus, the effectiveness of attacking real-world DL models is not quite clear. In this paper, we perform the first systematic study of adversarial attacks on real-world DNN models and provide a real-world model dataset named RWM. Particularly, we design a suite of approaches to adapt current AE generation algorithms to the diverse real-world DL models, including automatically extracting DL models from Android apps, capturing the inputs and outputs of the DL models in apps, generating AEs and validating them by observing the apps' execution. For black-box DL models, we design a semantic-based approach to build suitable datasets and use them for training substitute models when performing transfer-based attacks. After analyzing 245 DL models collected from 62,583 real-world apps, we have a unique opportunity to understand the gap between real-world DL models and contemporary AE generation algorithms. To our surprise, the current AE generation algorithms can only directly attack 6.53% of the models. Benefiting from our approach, the success rate rises to 47.35%.
[[2209.09688] Sparse Vicious Attacks on Graph Neural Networks](http://arxiv.org/abs/2209.09688)
Graph Neural Networks (GNNs) have proven to be successful in several predictive modeling tasks for graph-structured data. Amongst those tasks, link prediction is one of the fundamental problems for many real-world applications, such as recommender systems. However, GNNs are not immune to adversarial attacks, i.e., carefully crafted malicious examples that are designed to fool the predictive model. In this work, we focus on a specific, white-box attack on GNN-based link prediction models, where a malicious node aims to appear in the list of recommended nodes for a given target victim. To achieve this goal, the attacker node may also count on the cooperation of other existing peers that it directly controls, namely on the ability to inject a number of "vicious" nodes in the network. Specifically, all these malicious nodes can add new edges or remove existing ones, thereby perturbing the original graph. Thus, we propose SAVAGE, a novel framework and a method to mount this type of link prediction attack. SAVAGE formulates the adversary's goal as an optimization task, striking the balance between the effectiveness of the attack and the sparsity of malicious resources required. Extensive experiments conducted on real-world and synthetic datasets demonstrate that adversarial attacks implemented through SAVAGE indeed achieve a high attack success rate while using a small number of vicious nodes. Finally, although those attacks require full knowledge of the target model, we show that they are successfully transferable to other black-box methods for link prediction.
[[2209.09348] Visible-Infrared Person Re-Identification Using Privileged Intermediate Information](http://arxiv.org/abs/2209.09348)
Visible-infrared person re-identification (ReID) aims to recognize the same person of interest across a network of RGB and IR cameras. Some deep learning (DL) models have directly incorporated both modalities to discriminate persons in a joint representation space. However, this cross-modal ReID problem remains challenging due to the large domain shift in data distributions between RGB and IR modalities. This paper introduces a novel approach for creating an intermediate virtual domain that acts as a bridge between the two main domains (i.e., RGB and IR modalities) during training. This intermediate domain is considered as privileged information (PI) that is unavailable at test time, and allows formulating this cross-modal matching task as a problem in learning under privileged information (LUPI). We devised a new method to generate images between visible and infrared domains that provide additional information to train a deep ReID model through an intermediate domain adaptation. In particular, by employing color-free and multi-step triplet loss objectives during training, our method provides common feature representation spaces that are robust to large visible-infrared domain shifts. Experimental results on challenging visible-infrared ReID datasets indicate that our proposed approach consistently improves matching accuracy, without any computational overhead at test time. The code is available at: https://github.com/alehdaghi/Cross-Modal-Re-ID-via-LUPI
[[2209.09391] QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars](http://arxiv.org/abs/2209.09391)
Real-time tracking of human body motion is crucial for interactive and immersive experiences in AR/VR. However, very limited sensor data about the body is available from standalone wearable devices such as HMDs (Head Mounted Devices) or AR glasses. In this work, we present a reinforcement learning framework that takes in sparse signals from an HMD and two controllers, and simulates plausible and physically valid full body motions. Using high quality full body motion as dense supervision during training, a simple policy network can learn to output appropriate torques for the character to balance, walk, and jog, while closely following the input signals. Our results demonstrate surprisingly similar leg motions to ground truth without any observations of the lower body, even when the input is only the 6D transformations of the HMD. We also show that a single policy can be robust to diverse locomotion styles, different body sizes, and novel environments.
[[2209.09484] Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos](http://arxiv.org/abs/2209.09484)
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. We will open-source code and data to facilitate future research.
[[2209.09554] Towards Robust Referring Image Segmentation](http://arxiv.org/abs/2209.09554)
Referring Image Segmentation (RIS) aims to connect image and language via outputting the corresponding object masks given a text description, which is a fundamental vision-language task. Despite lots of works that have achieved considerable progress for RIS, in this work, we explore an essential question, "what if the description is wrong or misleading of the text description?". We term such a sentence as a negative sentence. However, we find that existing works cannot handle such settings. To this end, we propose a novel formulation of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the negative sentence inputs besides the regularly given text inputs. We present three different datasets via augmenting the input negative sentences and a new metric to unify both input types. Furthermore, we design a new transformer-based model named RefSegformer, where we introduce a token-based vision and language fusion module. Such a module can be easily extended to our R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves the new state-of-the-art results on three regular RIS datasets and three R-RIS datasets, which serves as a new solid baseline for further research. The project page is at https://lxtgh.github.io/project/robust_ref_seg/.
[[2209.09574] Sampling Agnostic Feature Representation for Long-Term Person Re-identification](http://arxiv.org/abs/2209.09574)
Person re-identification is a problem of identifying individuals across non-overlapping cameras. Although remarkable progress has been made in the re-identification problem, it is still a challenging problem due to appearance variations of the same person as well as other people of similar appearance. Some prior works solved the issues by separating features of positive samples from features of negative ones. However, the performances of existing models considerably depend on the characteristics and statistics of the samples used for training. Thus, we propose a novel framework named sampling independent robust feature representation network (SirNet) that learns disentangled feature embedding from randomly chosen samples. A carefully designed sampling independent maximum discrepancy loss is introduced to model samples of the same person as a cluster. As a result, the proposed framework can generate additional hard negatives/positives using the learned features, which results in better discriminability from other identities. Extensive experimental results on large-scale benchmark datasets verify that the proposed model is more effective than prior state-of-the-art models.
[[2209.09723] GANet: Goal Area Network for Motion Forecasting](http://arxiv.org/abs/2209.09723)
Predicting the future motion of road participants is crucial for autonomous driving but is extremely challenging due to staggering motion uncertainty. Recently, most motion forecasting methods resort to the goal-based strategy, i.e., predicting endpoints of motion trajectories as conditions to regress the entire trajectories, so that the search space of solutions can be reduced. However, accurate goal coordinates are hard to predict and evaluate. In addition, the point representation of the destination limits the utilization of rich road context, leading to inaccurate prediction results in many cases. A goal area, i.e., the possible destination area, rather than a goal coordinate, could provide a softer constraint for searching potential trajectories by allowing more tolerance and guidance. In view of this, we propose a new goal area-based framework, named Goal Area Network (GANet), for motion forecasting, which models goal areas rather than exact goal coordinates as preconditions for trajectory prediction and thereby performs more robustly and accurately. Specifically, we propose a GoICrop (Goal Area of Interest) operator to effectively extract semantic lane features in goal areas and model actors' future interactions, which greatly benefits future trajectory estimation. GANet ranks 1st on the leaderboard of the Argoverse Challenge among all public literature (as of the paper submission), and its source code will be released.
[[2209.09808] Enhancing vehicle detection accuracy in thermal infrared images using multiple GANs](http://arxiv.org/abs/2209.09808)
Vehicle detection is fairly accurate in good-illumination conditions but degrades under low-light conditions. The combined effect of low light and glare from vehicle headlights or tail-lights makes missed detections more likely for state-of-the-art object detection models. Thermal infrared images, however, are based on thermal radiation and are robust to illumination changes. Recently, Generative Adversarial Networks (GANs) have been extensively used in image domain transfer tasks. State-of-the-art GAN models have attempted to improve vehicle detection accuracy at night-time by converting infrared images to day-time RGB images. However, these models have been found to under-perform during night-time conditions compared to day-time conditions. Therefore, this study attempts to alleviate this shortcoming by proposing three different approaches, based on combinations of GAN models at two different levels, that try to reduce the feature distribution gap between day-time and night-time infrared images. A quantitative comparison of the proposed models with the state-of-the-art models has been carried out by testing the models using state-of-the-art object detection models. Both the quantitative and qualitative analyses show that the proposed models outperform the state-of-the-art GAN models for vehicle detection in night-time conditions, demonstrating the efficacy of the proposed models.
[[2209.09844] Frequency Dropout: Feature-Level Regularization via Randomized Filtering](http://arxiv.org/abs/2209.09844)
Deep convolutional neural networks have shown remarkable performance on various computer vision tasks, and yet, they are susceptible to picking up spurious correlations from the training signal. So-called 'shortcuts' can occur during learning, for example, when there are specific frequencies present in the image data that correlate with the output predictions. Both high and low frequencies can be characteristic of the underlying noise distribution caused by the image acquisition rather than in relation to the task-relevant information about the image content. Models that learn features related to this characteristic noise will not generalize well to new data.
In this work, we propose a simple yet effective training strategy, Frequency Dropout, to prevent convolutional neural networks from learning frequency-specific imaging features. We employ randomized filtering of feature maps during training which acts as a feature-level regularization. In this study, we consider common image processing filters such as Gaussian smoothing, Laplacian of Gaussian, and Gabor filtering. Our training strategy is model-agnostic and can be used for any computer vision task. We demonstrate the effectiveness of Frequency Dropout on a range of popular architectures and multiple tasks including image classification, domain adaptation, and semantic segmentation using both computer vision and medical imaging datasets. Our results suggest that the proposed approach does not only improve predictive accuracy but also improves robustness against domain shift.
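The randomized filtering of feature maps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it covers only Gaussian smoothing, and the per-channel probability `p`, the `sigma_range`, the clamped border handling, and all function names are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(channel, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(channel[0])
    out = []
    for row in channel:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(x + k - radius, 0), width - 1)
                acc += w * row[src]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(channel):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*channel)]


def gaussian_smooth(channel, sigma):
    """Separable 2D Gaussian blur: filter the rows, then the columns."""
    radius = max(1, int(math.ceil(2 * sigma)))
    kernel = gaussian_kernel(sigma, radius)
    rows_done = smooth_rows(channel, kernel)
    return transpose(smooth_rows(transpose(rows_done), kernel))


def frequency_dropout(feature_maps, p=0.5, sigma_range=(0.5, 2.0), rng=random, training=True):
    """Filter each channel with probability p during training (assumed scheme)."""
    if not training:
        return feature_maps  # no filtering at inference time
    out = []
    for channel in feature_maps:
        if rng.random() < p:
            sigma = rng.uniform(*sigma_range)  # random filter strength per channel
            out.append(gaussian_smooth(channel, sigma))
        else:
            out.append(channel)
    return out
```

Here `feature_maps` is a list of channels, each a 2D list of floats. The Laplacian of Gaussian and Gabor filters mentioned in the abstract would slot in as alternative choices next to `gaussian_smooth`.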
[[2209.09619] Personal Attribute Prediction from Conversations](http://arxiv.org/abs/2209.09619)
Personal knowledge bases (PKBs) are critical to many applications, such as Web-based chatbots and personalized recommendation. Conversations containing rich personal knowledge can be regarded as a main source to populate the PKB. Given a user, a user attribute, and user utterances from a conversational system, we aim to predict the personal attribute value for the user, which is helpful for the enrichment of PKBs. However, there are three issues existing in previous studies: (1) manually labeled utterances are required for model training; (2) personal attribute knowledge embedded in both utterances and external resources is underutilized; (3) the performance on predicting some difficult personal attributes is unsatisfactory. In this paper, we propose a framework DSCGN based on the pre-trained language model with a noise-robust loss function to predict personal attributes from conversations without requiring any labeled utterances. We yield two categories of supervision, i.e., document-level supervision via a distant supervision strategy and contextualized word-level supervision via a label guessing method, by mining the personal attribute knowledge embedded in both unlabeled utterances and external resources to fine-tune the language model. Extensive experiments over two real-world data sets (i.e., a profession data set and a hobby data set) show our framework obtains the best performance compared with all the twelve baselines in terms of nDCG and MRR.
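For readers unfamiliar with the two ranking metrics named above, a minimal sketch follows. It assumes the linear-gain form of nDCG (some papers use an exponential gain instead) and treats any relevance above zero as a hit for MRR; the abstract does not say which variants were used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant item in each ranked list."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(ranked_lists)
```

Each input list holds the relevance of the predicted attribute values in ranked order, so `mrr([[0, 1], [1, 0]])` averages the reciprocal ranks 1/2 and 1.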
[[2209.09813] Register Variation Remains Stable Across 60 Languages](http://arxiv.org/abs/2209.09813)
This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation are tested by comparing variation within vs. between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal.
[[2209.09624] Robust Online and Distributed Mean Estimation Under Adversarial Data Corruption](http://arxiv.org/abs/2209.09624)
We study robust mean estimation in an online and distributed scenario in the presence of adversarial data attacks. At each time step, each agent in a network receives a potentially corrupted data point, where the data points were originally independent and identically distributed samples of a random variable. We propose online and distributed algorithms for all agents to asymptotically estimate the mean. We provide the error-bound and the convergence properties of the estimates to the true mean under our algorithms. Based on the network topology, we further evaluate each agent's trade-off in convergence rate between incorporating data from neighbors and learning with only local observations.
[[2209.09240] Distributed Semi-supervised Fuzzy Regression with Interpolation Consistency Regularization](http://arxiv.org/abs/2209.09240)
Recently, distributed semi-supervised learning (DSSL) algorithms have shown their effectiveness in leveraging unlabeled samples over interconnected networks, where agents cannot share their original data with each other and can only communicate non-sensitive information with their neighbors. However, existing DSSL algorithms cannot cope with data uncertainties and may suffer from high computation and communication overhead. To handle these issues, we propose a distributed semi-supervised fuzzy regression (DSFR) model with fuzzy if-then rules and interpolation consistency regularization (ICR). ICR, which was proposed recently for semi-supervised problems, can force decision boundaries to pass through sparse data areas, thus increasing model robustness. However, its application in distributed scenarios has not been considered yet. In this work, we propose a distributed Fuzzy C-means (DFCM) method and a distributed interpolation consistency regularization (DICR), both built on the well-known alternating direction method of multipliers, to locate the parameters in the antecedent and consequent components of DSFR, respectively. Notably, the DSFR model converges very fast since it does not involve a back-propagation procedure, and it scales to large datasets thanks to DFCM and DICR. Experimental results on both artificial and real-world datasets show that the proposed DSFR model can achieve much better performance than the state-of-the-art DSSL algorithm in terms of both loss value and computational cost.
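The idea behind interpolation consistency regularization can be illustrated with a small centralized sketch. It is not the distributed DICR procedure from the paper: it assumes a single scalar-output regression model, uses that same model to produce the targets, and the names `mix` and `icr_loss` are hypothetical.

```python
def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def icr_loss(model, u1, u2, lam):
    """Squared gap between the prediction at the mixed input and the mixed predictions."""
    pred_at_mix = model(mix(u1, u2, lam))
    mix_of_preds = lam * model(u1) + (1.0 - lam) * model(u2)
    return (pred_at_mix - mix_of_preds) ** 2
```

For unlabeled points `u1` and `u2`, the penalty is zero when the model behaves linearly between them and grows as its output bends along the segment, which is what pushes sharp changes away from densely populated regions.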
[[2209.09441] Locally Constrained Representations in Reinforcement Learning](http://arxiv.org/abs/2209.09441)
The success of Reinforcement Learning (RL) heavily relies on the ability to learn robust representations from the observations of the environment. In most cases, the representations learned purely by the reinforcement learning loss can differ vastly across states depending on how the value functions change. However, the representations learned need not be very specific to the task at hand. Relying only on the RL objective may yield representations that vary greatly across successive time steps. In addition, since the RL loss has a changing target, the representations learned would depend on how good the current values/policies are. Thus, disentangling the representations from the main task would allow them to focus more on capturing transition dynamics which can improve generalization. To this end, we propose locally constrained representations, where an auxiliary loss forces the state representations to be predictable by the representations of the neighbouring states. This encourages the representations to be driven not only by the value/policy learning but also self-supervised learning, which constrains the representations from changing too rapidly. We evaluate the proposed method on several known benchmarks and observe strong performance. Especially in continuous control tasks, our experiments show a significant advantage over a strong baseline.
[[2209.09617] Seq2Seq Surrogates of Epidemic Models to Facilitate Bayesian Inference](http://arxiv.org/abs/2209.09617)
Epidemic models are powerful tools in understanding infectious disease. However, as they increase in size and complexity, they can quickly become computationally intractable. Recent progress in modelling methodology has shown that surrogate models can be used to emulate complex epidemic models with a high-dimensional parameter space. We show that deep sequence-to-sequence (seq2seq) models can serve as accurate surrogates for complex epidemic models with sequence-based model parameters, effectively replicating seasonal and long-term transmission dynamics. Once trained, our surrogate can predict scenarios several thousand times faster than the original model, making it ideal for policy exploration. We demonstrate that replacing a traditional epidemic model with a learned simulator facilitates robust Bayesian inference.
[[2209.09882] Soft Action Priors: Towards Robust Policy Transfer](http://arxiv.org/abs/2209.09882)
Despite success in many challenging problems, reinforcement learning (RL) is still confronted with sample inefficiency, which can be mitigated by introducing prior knowledge to agents. However, many transfer techniques in reinforcement learning make the limiting assumption that the teacher is an expert. In this paper, we use the action prior from the Reinforcement Learning as Inference framework - that is, a distribution over actions at each state which resembles a teacher policy, rather than a Bayesian prior - to recover state-of-the-art policy distillation techniques. Then, we propose a class of adaptive methods that can robustly exploit action priors by combining reward shaping and auxiliary regularization losses. In contrast to prior work, we develop algorithms for leveraging suboptimal action priors that may nevertheless impart valuable knowledge - which we call soft action priors. The proposed algorithms adapt by adjusting the strength of teacher feedback according to an estimate of the teacher's usefulness in each state. We perform tabular experiments, which show that the proposed methods achieve state-of-the-art performance, surpassing it when learning from suboptimal priors. Finally, we demonstrate the robustness of the adaptive algorithms in continuous action deep RL problems, in which adaptive algorithms considerably improved stability when compared to existing policy distillation methods.
[[2209.09389] State-driven Implicit Modeling for Sparsity and Robustness in Neural Networks](http://arxiv.org/abs/2209.09389)
Implicit models are a general class of learning models that forgo the hierarchical layer structure typical in neural networks and instead define the internal states based on an "equilibrium" equation, offering competitive performance and reduced memory consumption. However, training such models usually relies on expensive implicit differentiation for backward propagation. In this work, we present a new approach to training implicit models, called State-driven Implicit Modeling (SIM), where we constrain the internal states and outputs to match those of a baseline model, circumventing costly backward computations. The training problem becomes convex by construction and can be solved in a parallel fashion, thanks to its decomposable structure. We demonstrate how the SIM approach can be applied to significantly improve sparsity (parameter reduction) and robustness of baseline models trained on FashionMNIST and CIFAR-100 datasets.
[[2209.09423] Fairness and robustness in anti-causal prediction](http://arxiv.org/abs/2209.09423)
Robustness to distribution shift and fairness have independently emerged as two important desiderata required of modern machine learning models. While these two desiderata seem related, the connection between them is often unclear in practice. Here, we discuss these connections through a causal lens, focusing on anti-causal prediction tasks, where the input to a classifier (e.g., an image) is assumed to be generated as a function of the target label and the protected attribute. By taking this perspective, we draw explicit connections between a common fairness criterion - separation - and a common notion of robustness - risk invariance. These connections provide new motivation for applying the separation criterion in anticausal settings, and inform old discussions regarding fairness-performance tradeoffs. In addition, our findings suggest that robustness-motivated approaches can be used to enforce separation, and that they often work better in practice than methods designed to directly enforce separation. Using a medical dataset, we empirically validate our findings on the task of detecting pneumonia from X-rays, in a setting where differences in prevalence across sex groups motivate a fairness mitigation. Our findings highlight the importance of considering causal structure when choosing and enforcing fairness criteria.
[[2209.09500] Reduction from Complementary-Label Learning to Probability Estimates](http://arxiv.org/abs/2209.09500)
Complementary-Label Learning (CLL) is a weakly-supervised learning problem that aims to learn a multi-class classifier from only complementary labels, which indicate a class to which an instance does not belong. Existing approaches mainly adopt the paradigm of reduction to ordinary classification, which applies specific transformations and surrogate losses to connect CLL back to ordinary classification. Those approaches, however, face several limitations, such as the tendency to overfit or be hooked on deep models. In this paper, we sidestep those limitations with a novel perspective--reduction to probability estimates of complementary classes. We prove that accurate probability estimates of complementary labels lead to good classifiers through a simple decoding step. The proof establishes a reduction framework from CLL to probability estimates. The framework offers explanations of several key CLL approaches as its special cases and allows us to design an improved algorithm that is more robust in noisy environments. The framework also suggests a validation procedure based on the quality of probability estimates, leading to an alternative way to validate models with only complementary labels. The flexible framework opens a wide range of unexplored opportunities in using deep and non-deep models for probability estimates to solve the CLL problem. Empirical experiments further verified the framework's efficacy and robustness in various settings.
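The abstract does not spell out the decoding step, so the sketch below shows one plausible form under a stated assumption: complementary labels are drawn uniformly from the wrong classes, in which case the true class is the one least likely to appear as a complementary label. The function names are hypothetical.

```python
def uniform_complementary_distribution(true_class, num_classes):
    """Complementary-label distribution when wrong classes are equally likely (assumed)."""
    share = 1.0 / (num_classes - 1)
    return [0.0 if k == true_class else share for k in range(num_classes)]


def decode(comp_probs):
    """Predict the class with the smallest estimated complementary-label probability."""
    return min(range(len(comp_probs)), key=lambda k: comp_probs[k])
```

For example, `decode([0.45, 0.05, 0.5])` picks class 1, the class the estimator considers least likely to be "not the label". With a non-uniform complementary-label process, a different decoding rule would be needed.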
[[2209.09489] Perceptual Quality Assessment for Digital Human Heads](http://arxiv.org/abs/2209.09489)
Digital humans have attracted growing research interest over the last decade, and considerable effort has gone into their generation, representation, rendering, and animation. However, quality assessment for digital humans has fallen behind. Therefore, to tackle the challenge of digital human quality assessment, we propose the first large-scale quality assessment database for scanned digital human heads (DHHs). The constructed database consists of 55 reference DHHs and 1,540 distorted DHHs along with the subjective ratings. Then, a simple yet effective full-reference (FR) projection-based method is proposed. The pretrained Swin Transformer tiny is employed for hierarchical feature extraction and the multi-head attention module is utilized for feature fusion. The experimental results reveal that the proposed method exhibits state-of-the-art performance among the mainstream FR metrics. The database and the method presented in this work will be made publicly available.
[[2209.09657] View-Disentangled Transformer for Brain Lesion Detection](http://arxiv.org/abs/2209.09657)
Deep neural networks (DNNs) have been widely adopted in brain lesion detection and segmentation. However, locating small lesions in 2D MRI slices is challenging and requires balancing the granularity of 3D context aggregation against computational complexity. In this paper, we propose a novel view-disentangled transformer to enhance the extraction of MRI features for more accurate tumour detection. First, the proposed transformer harvests long-range correlation among different positions in a 3D brain scan. Second, the transformer models a stack of slice features as multiple 2D views and enhances these features view-by-view, which approximates 3D correlation computation in an efficient way. Third, we deploy the proposed transformer module in a transformer backbone, which can effectively detect the 2D regions surrounding brain lesions. The experimental results show that our proposed view-disentangled transformer performs well for brain lesion detection on a challenging brain MRI dataset.
[[2209.09316] Activity report analysis with automatic single or multispan answer extraction](http://arxiv.org/abs/2209.09316)
In the era of IoT (Internet of Things) we are surrounded by a plethora of AI-enabled devices that can transcribe images, video, audio, and sensor signals into text descriptions. When such transcriptions are captured in activity reports for monitoring, life logging and anomaly detection applications, a user would typically request a summary or ask targeted questions about certain sections of the report they are interested in. Depending on the context and the type of question asked, a question answering (QA) system would need to automatically determine whether the answer covers single-span or multi-span text components. Currently available QA datasets primarily focus on single-span responses only (such as SQuAD[4]) or contain a low proportion of examples with multiple-span answers (such as DROP[3]). To investigate automatic selection of single/multi-span answers in the use case described, we created a new smart home environment dataset comprised of questions paired with single-span or multi-span answers depending on the question and context queried. In addition, we propose a RoBERTa[6]-based multiple span extraction question answering (MSEQA) model returning the appropriate answer span for a given question. Our experiments show that the proposed model outperforms state-of-the-art QA models on our dataset while providing comparable performance on published individual single/multi-span task datasets.
[[2209.09432] CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction](http://arxiv.org/abs/2209.09432)
Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recall, sequence labeling models cannot handle quotations with complicated structures well. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.
[[2209.09450] A Few-shot Approach to Resume Information Extraction via Prompts](http://arxiv.org/abs/2209.09450)
Prompt learning has been shown to achieve near-fine-tuning performance in most text classification tasks with very few training examples. It is advantageous for NLP tasks where samples are scarce. In this paper, we apply it to a practical scenario, i.e., resume information extraction, and enhance the existing method to make it more applicable to this task. In particular, we create multiple sets of manual templates and verbalizers based on the textual characteristics of resumes. In addition, we compare the performance of Masked Language Model (MLM) pre-trained language models (PLMs) and Seq2Seq PLMs on this task. Furthermore, we improve the verbalizer design method for Knowledgeable Prompt-tuning in order to provide an example for the design of prompt templates and verbalizers for other application-based NLP tasks. To this end, we propose the concept of the Manual Knowledgeable Verbalizer (MKV), a rule for constructing the Knowledgeable Verbalizer corresponding to the application scenario. Experiments demonstrate that templates and verbalizers designed based on our rules are more effective and robust than existing manual templates and automatically generated prompt methods. We find that the currently available automatic prompt methods cannot compete with manually designed prompt templates for some realistic task scenarios. The results of the final confusion matrix indicate that our proposed MKV significantly mitigates the sample imbalance issue.
[[2209.09485] Generalizing through Forgetting -- Domain Generalization for Symptom Event Extraction in Clinical Notes](http://arxiv.org/abs/2209.09485)
Symptom information is primarily documented in free-text clinical notes and is not directly accessible for downstream applications. To address this challenge, information extraction approaches that can handle clinical language variation across different institutions and specialties are needed. In this paper, we present domain generalization for symptom extraction using pretraining and fine-tuning data that differs from the target domain in terms of institution and/or specialty and patient population. We extract symptom events using a transformer-based joint entity and relation extraction method. To reduce reliance on domain-specific features, we propose a domain generalization method that dynamically masks frequent symptom words in the source domain. Additionally, we pretrain the transformer language model (LM) on task-related unlabeled texts for better representation. Our experiments indicate that masking and adaptive pretraining methods can significantly improve performance when the source domain is more distant from the target domain.
[[2209.09811] Predictive Scale-Bridging Simulations through Active Learning](http://arxiv.org/abs/2209.09811)
Throughout computational science, there is a growing need to utilize the continual improvements in raw computational horsepower to achieve greater physical fidelity through scale-bridging over brute-force increases in the number of mesh elements. For instance, quantitative predictions of transport in nanoporous media, critical to hydrocarbon extraction from tight shale formations, are impossible without accounting for molecular-level interactions. Similarly, inertial confinement fusion simulations rely on numerical diffusion to simulate molecular effects such as non-local transport and mixing without truly accounting for molecular interactions. With these two disparate applications in mind, we develop a novel capability which uses an active learning approach to optimize the use of local fine-scale simulations for informing coarse-scale hydrodynamics. Our approach addresses three challenges: forecasting continuum coarse-scale trajectory to speculatively execute new fine-scale molecular dynamics calculations, dynamically updating coarse-scale from fine-scale calculations, and quantifying uncertainty in neural network models.
[[2209.09338] Reviewing Embeddings for Graph Neural Networks](http://arxiv.org/abs/2209.09338)
Current graph representation learning techniques use Graph Neural Networks (GNNs) to extract features from dataset embeddings. In this work, we examine the quality of these embeddings and assess how changing them can affect the accuracy of GNNs. We explore different embedding extraction techniques for both images and texts. We find that the choice of embedding biases the performance of different GNN architectures and thus the choice of embedding influences the selection of GNNs regardless of the underlying dataset. In addition, we only see an improvement in accuracy from some GNN models compared to the accuracy of models trained from scratch or fine-tuned on the underlying data without utilizing the graph connections. As an alternative, we propose Graph-connected Network (GraNet) layers which use GNN message passing within large models to allow neighborhood aggregation. This gives a chance for the model to inherit weights from large pre-trained models if possible and we demonstrate that this approach improves the accuracy compared to the previous methods: on Flickr_v2, GraNet beats GAT2 and GraphSAGE by 7.7% and 1.7% respectively.
[[2209.09651] Deep Convolutional Architectures for Extrapolative Forecast in Time-dependent Flow Problems](http://arxiv.org/abs/2209.09651)
Physical systems whose dynamics are governed by partial differential equations (PDEs) find applications in numerous fields, from engineering design to weather forecasting. The process of obtaining the solution from such PDEs may be computationally expensive for large-scale and parameterized problems. In this work, deep learning techniques developed especially for time-series forecasts, such as LSTM and TCN, or for spatial-feature extraction such as CNN, are employed to model the system dynamics for advection dominated problems. These models take as input a sequence of high-fidelity vector solutions for consecutive time-steps obtained from the PDEs and forecast the solutions for the subsequent time-steps using auto-regression; thereby reducing the computation time and power needed to obtain such high-fidelity solutions. The models are tested on numerical benchmarks (1D Burgers' equation and Stoker's dam break problem) to assess the long-term prediction accuracy, even outside the training domain (extrapolation). Non-intrusive reduced-order modelling techniques such as deep auto-encoder networks are utilized to compress the high-fidelity snapshots before feeding them as input to the forecasting models in order to reduce the complexity and the required computations in the online and offline stages. Deep ensembles are employed to perform uncertainty quantification of the forecasting models, which provides information about the variance of the predictions as a result of the epistemic uncertainties.
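The auto-regressive forecasting loop described above can be sketched independently of the network type. In this minimal sketch, `step_model` stands in for any trained one-step predictor (LSTM, TCN, or CNN in the paper) and `window` is an assumed input length; both names are hypothetical.

```python
def rollout(step_model, history, n_steps, window):
    """Forecast n_steps ahead by feeding each prediction back in as input."""
    states = list(history)
    for _ in range(n_steps):
        # The model only ever sees the most recent `window` states,
        # which after the first step include its own earlier predictions.
        states.append(step_model(states[-window:]))
    return states[len(history):]
```

Because every prediction becomes part of the next input, errors can compound over long horizons, which is one reason the abstract pairs the forecasts with ensemble-based uncertainty estimates.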
[[2209.09675] Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization](http://arxiv.org/abs/2209.09675)
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
[[2209.09775] FedToken: Tokenized Incentives for Data Contribution in Federated Learning](http://arxiv.org/abs/2209.09775)
Incentives that compensate for the involved costs in the decentralized training of a Federated Learning (FL) model act as a key stimulus for clients' long-term participation. However, it is challenging to convince clients for quality participation in FL due to the absence of: (i) full information on the client's data quality and properties; (ii) the value of client's data contributions; and (iii) the trusted mechanism for monetary incentive offers. This often leads to poor efficiency in training and communication. While several works focus on strategic incentive designs and client selection to overcome this problem, there is a major knowledge gap in terms of an overall design tailored to the foreseen digital economy, including Web 3.0, while simultaneously meeting the learning objectives. To address this gap, we propose a contribution-based tokenized incentive scheme, namely FedToken, backed by blockchain technology that ensures fair allocation of tokens amongst the clients that corresponds to the valuation of their data during model training. Leveraging the engineered Shapley-based scheme, we first approximate the contribution of local models during model aggregation, then strategically schedule clients lowering the communication rounds for convergence and anchor ways to allocate *affordable* tokens under a constrained monetary budget. Extensive simulations demonstrate the efficacy of our proposed method.
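To make the Shapley-based contribution idea concrete, here is a toy sketch. It computes exact Shapley values by enumerating all join orders, which is only feasible for a handful of clients; the paper instead approximates contributions during aggregation. The `value` function, the proportional split, and both function names are assumptions made for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def allocate_tokens(contributions, budget):
    """Split a fixed token budget in proportion to non-negative contributions."""
    clipped = {p: max(c, 0.0) for p, c in contributions.items()}
    total = sum(clipped.values())
    if total == 0:
        return {p: 0.0 for p in clipped}
    return {p: budget * c / total for p, c in clipped.items()}
```

In a federated setting, `value(coalition)` might be the validation accuracy of a model aggregated from that subset of clients. Clients with negative contributions receive no tokens under this assumed rule.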
[[2209.09331] Training an Assassin AI for The Resistance: Avalon](http://arxiv.org/abs/2209.09331)
The Resistance: Avalon is a partially observable social deduction game. This area of AI game playing is fairly undeveloped. Implementing an AI for this game involves multiple components specific to each phase as well as to each role in the game. In this paper, we plan to iteratively develop the required components for each role/phase by first addressing the Assassination phase, which can be modeled as a machine learning problem. Using a publicly available dataset from an online version of the game, we train classifiers that emulate an Assassin. After trying various classification techniques, we are able to achieve above-average human performance using a simple linear support vector classifier. The eventual goal of this project is to develop an intelligent and complete Avalon player that can play through each phase of the game as any role.
[[2209.09592] Closing the Gender Wage Gap: Adversarial Fairness in Job Recommendation](http://arxiv.org/abs/2209.09592)
The goal of this work is to help mitigate the already existing gender wage gap by supplying unbiased job recommendations based on resumes from job seekers. We employ a generative adversarial network to remove gender bias from word2vec representations of 12M job vacancy texts and 900k resumes. Our results show that representations created from recruitment texts contain algorithmic bias and that this bias results in real-world consequences for recommendation systems. Without controlling for bias, women are recommended jobs with significantly lower salary in our data. With adversarially fair representations, this wage gap disappears, meaning that our debiased job recommendations reduce wage discrimination. We conclude that adversarial debiasing of word representations can increase real-world fairness of systems and thus may be part of the solution for creating fairness-aware recommendation systems.
[[2209.09483] Interpretable Edge Enhancement and Suppression Learning for 3D Point Cloud Segmentation](http://arxiv.org/abs/2209.09483)
3D point clouds can flexibly represent continuous surfaces and can be used for various applications; however, the lack of structural information makes point cloud recognition challenging. Recent edge-aware methods mainly use edge information as an extra feature that describes local structures to facilitate learning. Although these methods show that incorporating edges into the network design is beneficial, they generally lack interpretability, making users wonder how exactly edges help. To shed light on this issue, in this study, we propose the Diffusion Unit (DU) that handles edges in an interpretable manner while providing decent improvement. Our method is interpretable in three ways. First, we theoretically show that DU learns to perform task-beneficial edge enhancement and suppression. Second, we experimentally observe and verify the edge enhancement and suppression behavior. Third, we empirically demonstrate that this behavior contributes to performance improvement. Extensive experiments performed on challenging benchmarks verify the superiority of DU in terms of both interpretability and performance gain. Specifically, our method achieves state-of-the-art performance in object part segmentation using ShapeNet part and scene segmentation using S3DIS. Our source code will be released at https://github.com/martianxiu/DiffusionUnit.
[[2209.09513] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering](http://arxiv.org/abs/2209.09513)
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (SQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering SQA questions. SQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
[[2209.09659] Ki-Pode: Keypoint-based Implicit Pose Distribution Estimation of Rigid Objects](http://arxiv.org/abs/2209.09659)
The estimation of 6D poses of rigid objects is a fundamental problem in computer vision. Traditionally, pose estimation is concerned with the determination of a single best estimate. However, a single estimate is unable to express visual ambiguity, which in many cases is unavoidable due to object symmetries or occlusion of identifying features. Inability to account for ambiguities in pose can lead to failure in subsequent methods, which is unacceptable when the cost of failure is high. Estimates of full pose distributions are, contrary to single estimates, well suited for expressing uncertainty in pose. Motivated by this, we propose a novel pose distribution estimation method. An implicit formulation of the probability distribution over object pose is derived from an intermediary representation of an object as a set of keypoints. This ensures that the pose distribution estimates have a high level of interpretability. Furthermore, our method is based on conservative approximations, which lead to reliable estimates. The method has been evaluated on the task of rotation distribution estimation on the YCB-V and T-LESS datasets and performs reliably on all objects.
[[2209.09768] An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition](http://arxiv.org/abs/2209.09768)
Recent works on multi-modal emotion recognition move towards end-to-end models, which, unlike the two-phase pipeline, can extract task-specific features supervised by the target task. However, previous methods only model the feature interactions between the textual modality and either the acoustic or the visual modality, and do not capture the feature interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy and can further leverage such interactions to significantly reduce the computation and memory complexity by reducing the input token length. At the high level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of the three modalities. The experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves state-of-the-art performance. Further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed progressive tri-modal attention, which helps our model achieve better performance while significantly reducing the computation and memory cost. Our code will be publicly available.
[[2209.09814] PainPoints: A Framework for Language-based Detection of Chronic Pain and Expert-Collaborative Text-Summarization](http://arxiv.org/abs/2209.09814)
Chronic pain is a pervasive disorder which is often very disabling and is associated with comorbidities such as depression and anxiety. Neuropathic Pain (NP) is a common sub-type which is often caused by nerve damage and has a known pathophysiology. Another common sub-type is Fibromyalgia (FM), which is described as musculoskeletal, diffuse pain that is widespread through the body. The pathophysiology of FM is poorly understood, making it very hard to diagnose. Standard medications and treatments for FM and NP differ from one another, and a misdiagnosis can cause an increase in symptom severity. To overcome this difficulty, we propose a novel framework, PainPoints, which accurately detects the sub-type of pain and generates clinical notes by summarizing the patient interviews. Specifically, PainPoints makes use of large language models to perform sentence-level classification of the text obtained from interviews of FM and NP patients with a reliable AUC of 0.83. Using a sufficiency-based interpretability approach, we explain how the fine-tuned model accurately picks up on the nuances that patients use to describe their pain. Finally, we generate summaries of these interviews via expert interventions by introducing a novel facet-based approach. PainPoints thus enables practitioners to add/drop facets and generate a custom summary based on the notion of "facet-coverage", which is also introduced in this work.