[[2301.04402] Secure access system using signature verification over tablet PC](http://arxiv.org/abs/2301.04402) #secure
Low-cost portable devices capable of capturing signature signals are being increasingly used. Additionally, the social and legal acceptance of the written signature for authentication purposes is opening a range of new applications. We describe a highly versatile and scalable prototype for Web-based secure access using signature verification. The proposed architecture can be easily extended to work with different kinds of sensors and large-scale databases. Several remarks are also given on security and privacy of network-based signature verification.
[[2301.04491] Managing the Migration to Post-Quantum-Cryptography](http://arxiv.org/abs/2301.04491) #secure
Cryptographically relevant quantum computers (CRQC) are presumably able to break today's prevalent classic cryptographic algorithms. Protocols and schemes based on these algorithms would become insecure if such CRQCs would become available. Although it is not exactly known, whether this will actually happen, organizations (and the IT society) have to plan on migrating to quantum-resilient cryptographic measures, also known as Post-Quantum Cryptography (PQC). However, migrating IT systems and applications in organizations to support and integrate new software components is a difficult task. There exists to the best of our knowledge no generalized approach to manage such a complex migration for cryptography used in IT systems. We present a process for managing the migration from classic cryptography to PQC. Our solution is based on best practices, challenges, and problems derived from established software migration approaches. Compared to existing approaches, our proposal provides a means to help organizations migrate to PQC in a manageable manner and maintain crypto-agility. Thus, our process does not only serve as a framework for a one-time adaptation but also as a blueprint for organizing crypto-agile IT systems.
[[2301.04230] User-Centered Security in Natural Language Processing](http://arxiv.org/abs/2301.04230) #security
This dissertation proposes a framework of user-centered security in Natural Language Processing (NLP), and demonstrates how it can improve the accessibility of related research. Accordingly, it focuses on two security domains within NLP with great public interest. First, that of author profiling, which can be employed to compromise online privacy through invasive inferences. Without access and detailed insight into these models' predictions, there is no reasonable heuristic by which Internet users might defend themselves from such inferences. Secondly, that of cyberbullying detection, which by default presupposes a centralized implementation; i.e., content moderation across social platforms. As access to appropriate data is restricted, and the nature of the task rapidly evolves (both through lexical variation, and cultural shifts), the effectiveness of its classifiers is greatly diminished and thereby often misrepresented.
Under the proposed framework, we predominantly investigate the use of adversarial attacks on language; i.e., changing a given input (generating adversarial samples) such that a given model does not function as intended. These attacks form a common thread between our user-centered security problems; they are highly relevant for privacy-preserving obfuscation methods against author profiling, and adversarial samples might also prove useful to assess the influence of lexical variation and augmentation on cyberbullying detection.
[[2301.04314] ML-FEED: Machine Learning Framework for Efficient Exploit Detection (Extended version)](http://arxiv.org/abs/2301.04314) #security
Machine learning (ML)-based methods have recently become attractive for detecting security vulnerability exploits. Unfortunately, state-of-the-art ML models like long short-term memories (LSTMs) and transformers incur significant computation overheads. This overhead makes it infeasible to deploy them in real-time environments. We propose a novel ML-based exploit detection model, ML-FEED, that enables highly efficient inference without sacrificing performance. We develop a novel automated technique to extract vulnerability patterns from the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) databases. This feature enables ML-FEED to be aware of the latest cyber weaknesses. Second, it is not based on the traditional approach of classifying sequences of application programming interface (API) calls into exploit categories. Such traditional methods that process entire sequences incur huge computational overheads. Instead, ML-FEED operates at a finer granularity and predicts the exploits triggered by every API call of the program trace. Then, it uses a state table to update the states of these potential exploits and track the progress of potential exploit chains. ML-FEED also employs a feature engineering approach that uses natural language processing-based word embeddings, frequency vectors, and one-hot encoding to detect semantically-similar instruction calls. Then, it updates the states of the predicted exploit categories and triggers an alarm when a vulnerability fingerprint executes. Our experiments show that ML-FEED is 72.9x and 75,828.9x faster than state-of-the-art lightweight LSTM and transformer models, respectively. We trained and tested ML-FEED on 79 real-world exploit categories. It predicts categories of exploit in real-time with 98.2% precision, 97.4% recall, and 97.8% F1 score. These results also outperform the LSTM and transformer baselines.
[[2301.04370] Order-Preserving Database Encryption with Secret Sharing](http://arxiv.org/abs/2301.04370) #security
The order-preserving encryption (OPE) problem was initially formulated by the database community in 2004 soon after the paradigm database-as-a-service (DaaS) was coined in 2002. Over the past two decades, OPE has drawn tremendous research interest from communities of databases, cryptography, and security; we have witnessed significant advances in OPE schemes both theoretically and systematically. All existing OPE schemes assume that the outsourced database is modeled as a single semi-honest adversary who should learn nothing more than the order information of plaintext messages up to a negligible probability. This paper addresses the OPE problem from a new perspective: instead of modeling the outsourced database as a single semi-honest adversary, we assume the outsourced database \textit{service} compromises a cluster of non-colluding servers, which is a practical assumption as all major cloud vendors support multiple database instances deployed to exclusive sub-networks or even to distinct data centers. This assumption allows us to design a new stateless OPE protocol, namely order-preserving database encryption with secret sharing (ODES), by employing secret-sharing schemes among those presumably non-colluding servers. We will demonstrate that ODES guarantees the latest security level, namely IND-FAOCPA, and outperforms the state-of-the-art scheme by orders of magnitude.
[[2301.04587] Electric Vehicles Security and Privacy: Challenges, Solutions, and Future Needs](http://arxiv.org/abs/2301.04587) #security
Electric Vehicles (EVs) share common technologies with classical fossil-fueled cars, but they also employ novel technologies and components (e.g., Charging System and Battery Management System) that create an unexplored attack surface for malicious users. Although multiple contributions in the literature explored cybersecurity aspects of particular components of the EV ecosystem (e.g., charging infrastructure), there is still no contribution to the holistic cybersecurity of EVs and their related technologies from a cyber-physical system perspective.
In this paper, we provide the first in-depth study of the security and privacy threats associated with the EVs ecosystem. We analyze the threats associated with both the EV and the different charging solutions. Focusing on the Cyber-Physical Systems (CPS) paradigm, we provide a detailed analysis of all the processes that an attacker might exploit to affect the security and privacy of both drivers and the infrastructure. To address the highlighted threats, we present possible solutions that might be implemented. We also provide an overview of possible future directions to guarantee the security and privacy of the EVs ecosystem. Based on our analysis, we stress the need for EV-specific cybersecurity solutions.
[[2301.04283] A Multi-Modal Geographic Pre-Training Method](http://arxiv.org/abs/2301.04283) #privacy
As a core task in location-based services (LBS) (e.g., navigation maps), query and point of interest (POI) matching connects users' intent with real-world geographic information. Recently, pre-trained models (PTMs) have made advancements in many natural language processing (NLP) tasks. Generic text-based PTMs do not have enough geographic knowledge for query-POI matching. To overcome this limitation, related literature attempts to employ domain-adaptive pre-training based on geo-related corpus. However, a query generally contains mentions of multiple geographic objects, such as nearby roads and regions of interest (ROIs). The geographic context (GC), i.e., these diverse geographic objects and their relationships, is therefore pivotal to retrieving the most relevant POI. Single-modal PTMs can barely make use of the important GC and therefore have limited performance. In this work, we propose a novel query-POI matching method Multi-modal Geographic language model (MGeo), which comprises a geographic encoder and a multi-modal interaction module. MGeo represents GC as a new modality and is able to fully extract multi-modal correlations for accurate query-POI matching. Besides, there is no publicly available benchmark for this topic. In order to facilitate further research, we build a new open-source large-scale benchmark Geographic TExtual Similarity (GeoTES). The POIs come from an open-source geographic information system (GIS). The queries are manually generated by annotators to prevent privacy issues. Compared with several strong baselines, the extensive experiment results and detailed ablation analyses on GeoTES demonstrate that our proposed multi-modal pre-training method can significantly improve the query-POI matching capability of generic PTMs, even when the queries' GC is not provided. Our code and dataset are publicly available at https://github.com/PhantomGrapes/MGeo.
[[2301.04214] CageCoach: Sharing-Oriented Redaction-Capable Distributed Cryptographic File System](http://arxiv.org/abs/2301.04214) #protect
The modern data economy is built on sharing data. However, sharing data can be an expensive and risky endeavour. Existing sharing systems like Distributed File Systems provide full read, write, and execute Role-based Access Control (RBAC) for sharing data, but can be expensive and difficult to scale. Likewise such systems operate on a binary access model for their data, either a user can read all the data or read none of the data. This approach is not necessary for a more read-only oriented data landscape, and one where data contains many dimensions that represent a risk if overshared. In order to encourage users to share data and smooth out the process of accessing such data a new approach is needed. This new approach must simplify the RBAC of older DFS approaches to something more read-only and something that integrates redaction for user protections. To accomplish this we present CageCoach, a simple sharing-oriented Distributed Cryptographic File System (DCFS). CageCoach leverages the simplicity and speed of basic HTTP, linked data concepts, and automatic redaction systems to facilitate safe and easy sharing of user data. The implementation of CageCoach is available at https://github.umn.edu/CARPE415/CageCoach.
[[2301.04218] Diffusion Models For Stronger Face Morphing Attacks](http://arxiv.org/abs/2301.04218) #attack
Face morphing attacks seek to deceive a Face Recognition (FR) system by presenting a morphed image consisting of the biometric qualities from two different identities with the aim of triggering a false acceptance with one of the two identities, thereby presenting a significant threat to biometric systems. The success of a morphing attack is dependent on the ability of the morphed image to represent the biometric characteristics of both identities that were used to create the image. We present a novel morphing attack that uses a Diffusion-based architecture to improve the visual fidelity of the image and improve the ability of the morphing attack to represent characteristics from both identities. We demonstrate the high fidelity of the proposed attack by evaluating its visual fidelity via the Frechet Inception Distance. Extensive experiments are conducted to measure the vulnerability of FR systems to the proposed attack. The proposed attack is compared to two state-of-the-art GAN-based morphing attacks along with two Landmark-based attacks. The ability of a morphing attack detector to detect the proposed attack is measured and compared against the other attacks. Additionally, a novel metric to measure the relative strength between morphing attacks is introduced and evaluated.
[[2301.04554] Universal Detection of Backdoor Attacks via Density-based Clustering and Centroids Analysis](http://arxiv.org/abs/2301.04554) #attack
In this paper, we propose a Universal Defence based on Clustering and Centroids Analysis (CCA-UD) against backdoor attacks. The goal of the proposed defence is to reveal whether a Deep Neural Network model is subject to a backdoor attack by inspecting the training dataset. CCA-UD first clusters the samples of the training set by means of density-based clustering. Then, it applies a novel strategy to detect the presence of poisoned clusters. The proposed strategy is based on a general misclassification behaviour obtained when the features of a representative example of the analysed cluster are added to benign samples. The capability of inducing a misclassification error is a general characteristic of poisoned samples, hence the proposed defence is attack-agnostic. This mask a significant difference with respect to existing defences, that, either can defend against only some types of backdoor attacks, e.g., when the attacker corrupts the label of the poisoned samples, or are effective only when some conditions on the poisoning ratios adopted by the attacker or the kind of triggering pattern used by the attacker are satisfied. Experiments carried out on several classification tasks, considering different types of backdoor attacks and triggering patterns, including both local and global triggers, reveal that the proposed method is very effective to defend against backdoor attacks in all the cases, always outperforming the state of the art techniques.
[[2301.04299] SoK: Adversarial Machine Learning Attacks and Defences in Multi-Agent Reinforcement Learning](http://arxiv.org/abs/2301.04299) #attack
Multi-Agent Reinforcement Learning (MARL) is vulnerable to Adversarial Machine Learning (AML) attacks and needs adequate defences before it can be used in real world applications. We have conducted a survey into the use of execution-time AML attacks against MARL and the defences against those attacks. We surveyed related work in the application of AML in Deep Reinforcement Learning (DRL) and Multi-Agent Learning (MAL) to inform our analysis of AML for MARL. We propose a novel perspective to understand the manner of perpetrating an AML attack, by defining Attack Vectors. We develop two new frameworks to address a gap in current modelling frameworks, focusing on the means and tempo of an AML attack against MARL, and identify knowledge gaps and future avenues of research.
[[2301.04400] Resynthesis-based Attacks Against Logic Locking](http://arxiv.org/abs/2301.04400) #attack
Logic locking has been a promising solution to many hardware security threats, such as intellectual property infringement and overproduction. Due to the increased attention that threats have received, many efficient specialized attacks against logic locking have been introduced over the years. However, the ability of an adversary to manipulate a locked netlist prior to mounting an attack has not been investigated thoroughly. This paper introduces a resynthesis-based strategy that utilizes the strength of a commercial electronic design automation (EDA) tool to reveal the vulnerabilities of a locked circuit. To do so, in a pre-attack step, a locked netlist is resynthesized using different synthesis parameters in a systematic way, leading to a large number of functionally equivalent but structurally different locked circuits. Then, under the oracle-less threat model, where it is assumed that the adversary only possesses the locked circuit, not the original circuit to query, a prominent attack is applied to these generated netlists collectively, from which a large number of key bits are deciphered. Nevertheless, this paper also describes how the proposed oracle-less attack can be integrated with an oracle-guided attack. The feasibility of the proposed approach is demonstrated for several benchmarks, including remarkable results for breaking a recently proposed provably secure logic locking method and deciphering values of a large number of key bits of the CSAW'19 circuits with very high accuracy.
[[2301.04591] MVAM: Multi-variant Attacks on Memory for IoT Trust Computing](http://arxiv.org/abs/2301.04591) #attack
With the significant development of the Internet of Things and low-cost cloud services, the sensory and data processing requirements of IoT systems are continually going up. TrustZone is a hardware-protected Trusted Execution Environment (TEE) for ARM processors specifically designed for IoT handheld systems. It provides memory isolation techniques to protect trusted application data from being exploited by malicious entities. In this work, we focus on identifying different vulnerabilities of the TrustZone extension of ARM Cortex-M processors. Then design and implement a threat model to execute those attacks. We have found that TrustZone is vulnerable to buffer overflow-based attacks. We have used this to create an attack called MOFlow and successfully leaked the data of another trusted app. This is done by intentionally overflowing the memory of one app to access the encrypted memory of other apps inside the secure world. We have also found that, by not validating the input parameters in the entry function, TrustZone has exposed a security weakness. We call this Achilles heel and present an attack model showing how to exploit this weakness too. Our proposed novel attacks are implemented and successfully tested on two recent ARM Cortex-M processors available on the market (M23 and M33).
[[2301.04472] Adversarial training with informed data selection](http://arxiv.org/abs/2301.04472) #attack
With the increasing amount of available data and advances in computing capabilities, deep neural networks (DNNs) have been successfully employed to solve challenging tasks in various areas, including healthcare, climate, and finance. Nevertheless, state-of-the-art DNNs are susceptible to quasi-imperceptible perturbed versions of the original images -- adversarial examples. These perturbations of the network input can lead to disastrous implications in critical areas where wrong decisions can directly affect human lives. Adversarial training is the most efficient solution to defend the network against these malicious attacks. However, adversarial trained networks generally come with lower clean accuracy and higher computational complexity. This work proposes a data selection (DS) strategy to be applied in the mini-batch training. Based on the cross-entropy loss, the most relevant samples in the batch are selected to update the model parameters in the backpropagation. The simulation results show that a good compromise can be obtained regarding robustness and standard accuracy, whereas the computational complexity of the backpropagation pass is reduced.
[[2301.04243] Robust Human Identity Anonymization using Pose Estimation](http://arxiv.org/abs/2301.04243) #robust
Many outdoor autonomous mobile platforms require more human identity anonymized data to power their data-driven algorithms. The human identity anonymization should be robust so that less manual intervention is needed, which remains a challenge for current face detection and anonymization systems. In this paper, we propose to use the skeleton generated from the state-of-the-art human pose estimation model to help localize human heads. We develop criteria to evaluate the performance and compare it with the face detection approach. We demonstrate that the proposed algorithm can reduce missed faces and thus better protect the identity information for the pedestrians. We also develop a confidence-based fusion method to further improve the performance.
[[2301.04410] GraVIS: Grouping Augmented Views from Independent Sources for Dermatology Analysis](http://arxiv.org/abs/2301.04410) #robust
Self-supervised representation learning has been extremely successful in medical image analysis, as it requires no human annotations to provide transferable representations for downstream tasks. Recent self-supervised learning methods are dominated by noise-contrastive estimation (NCE, also known as contrastive learning), which aims to learn invariant visual representations by contrasting one homogeneous image pair with a large number of heterogeneous image pairs in each training step. Nonetheless, NCE-based approaches still suffer from one major problem that is one homogeneous pair is not enough to extract robust and invariant semantic information. Inspired by the archetypical triplet loss, we propose GraVIS, which is specifically optimized for learning self-supervised features from dermatology images, to group homogeneous dermatology images while separating heterogeneous ones. In addition, a hardness-aware attention is introduced and incorporated to address the importance of homogeneous image views with similar appearance instead of those dissimilar homogeneous ones. GraVIS significantly outperforms its transfer learning and self-supervised learning counterparts in both lesion segmentation and disease classification tasks, sometimes by 5 percents under extremely limited supervision. More importantly, when equipped with the pre-trained weights provided by GraVIS, a single model could achieve better results than winners that heavily rely on ensemble strategies in the well-known ISIC 2017 challenge.
[[2301.04414] How Does Traffic Environment Quantitatively Affect the Autonomous Driving Prediction?](http://arxiv.org/abs/2301.04414) #robust
An accurate trajectory prediction is crucial for safe and efficient autonomous driving in complex traffic environments. In recent years, artificial intelligence has shown strong capabilities in improving prediction accuracy. However, its characteristics of inexplicability and uncertainty make it challenging to determine the traffic environmental effect on prediction explicitly, posing significant challenges to safety-critical decision-making. To address these challenges, this study proposes a trajectory prediction framework with the epistemic uncertainty estimation ability that outputs high uncertainty when confronting unforeseeable or unknown scenarios. The proposed framework is used to analyze the environmental effect on the prediction algorithm performance. In the analysis, the traffic environment is considered in terms of scenario features and shifts, respectively, where features are divided into kinematic features of a target agent, features of its surrounding traffic participants, and other features. In addition, feature correlation and importance analyses are performed to study the above features' influence on the prediction error and epistemic uncertainty. Further, a cross-dataset case study is conducted using multiple intersection datasets to investigate the impact of unavoidable distributional shifts in the real world on trajectory prediction. The results indicate that the deep ensemble-based method has advantages in improving prediction robustness and estimating epistemic uncertainty. The consistent conclusions are obtained by the feature correlation and importance analyses, including the conclusion that kinematic features of the target agent have relatively strong effects on the prediction error and epistemic uncertainty. Furthermore, the prediction failure caused by distributional shifts and the potential of the deep ensemble-based method are analyzed.
[[2301.04447] VS-Net: Multiscale Spatiotemporal Features for Lightweight Video Salient Document Detection](http://arxiv.org/abs/2301.04447) #robust
Video Salient Document Detection (VSDD) is an essential task of practical computer vision, which aims to highlight visually salient document regions in video frames. Previous techniques for VSDD focus on learning features without considering the cooperation among and across the appearance and motion cues and thus fail to perform in practical scenarios. Moreover, most of the previous techniques demand high computational resources, which limits the usage of such systems in resource-constrained settings. To handle these issues, we propose VS-Net, which captures multi-scale spatiotemporal information with the help of dilated depth-wise separable convolution and Approximation Rank Pooling. VS-Net extracts the key features locally from each frame across embedding sub-spaces and forwards the features between adjacent and parallel nodes, enhancing model performance globally. Our model generates saliency maps considering both the background and foreground simultaneously, making it perform better in challenging scenarios. The immense experiments regulated on the benchmark MIDV-500 dataset show that the VS-Net model outperforms state-of-the-art approaches in both time and robustness measures.
[[2301.04634] Street-View Image Generation from a Bird's-Eye View Layout](http://arxiv.org/abs/2301.04634) #robust
Bird's-Eye View (BEV) Perception has received increasing attention in recent years as it provides a concise and unified spatial representation across views and benefits a diverse set of downstream driving applications. While the focus has been placed on discriminative tasks such as BEV segmentation, the dual generative task of creating street-view images from a BEV layout has rarely been explored. The ability to generate realistic street-view images that align with a given HD map and traffic layout is critical for visualizing complex traffic scenarios and developing robust perception models for autonomous driving. In this paper, we propose BEVGen, a conditional generative model that synthesizes a set of realistic and spatially consistent surrounding images that match the BEV layout of a traffic scenario. BEVGen incorporates a novel cross-view transformation and spatial attention design which learn the relationship between cameras and map views to ensure their consistency. Our model can accurately render road and lane lines, as well as generate traffic scenes under different weather conditions and times of day. The code will be made publicly available.
[[2301.04347] Counteracts: Testing Stereotypical Representation in Pre-trained Language Models](http://arxiv.org/abs/2301.04347) #robust
Language models have demonstrated strong performance on various natural language understanding tasks. Similar to humans, language models could also have their own bias that is learned from the training data. As more and more downstream tasks integrate language models as part of the pipeline, it is necessary to understand the internal stereotypical representation and the methods to mitigate the negative effects. In this paper, we proposed a simple method to test the internal stereotypical representation in pre-trained language models using counterexamples. We mainly focused on gender bias, but the method can be extended to other types of bias. We evaluated models on 9 different cloze-style prompts consisting of knowledge and base prompts. Our results indicate that pre-trained language models show a certain amount of robustness when using unrelated knowledge, and prefer shallow linguistic cues, such as word position and syntactic structure, to alter the internal stereotypical representation. Such findings shed light on how to manipulate language models in a neutral approach for both finetuning and evaluation.
[[2301.04344] Robust Bayesian Target Value Optimization](http://arxiv.org/abs/2301.04344) #robust
We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.
[[2301.04404] A prediction and behavioural analysis of machine learning methods for modelling travel mode choice](http://arxiv.org/abs/2301.04404) #robust
The emergence of a variety of Machine Learning (ML) approaches for travel mode choice prediction poses an interesting question to transport modellers: which models should be used for which applications? The answer to this question goes beyond simple predictive performance, and is instead a balance of many factors, including behavioural interpretability and explainability, computational complexity, and data efficiency. There is a growing body of research which attempts to compare the predictive performance of different ML classifiers with classical random utility models. However, existing studies typically analyse only the disaggregate predictive performance, ignoring other aspects affecting model choice. Furthermore, many studies are affected by technical limitations, such as the use of inappropriate validation schemes, incorrect sampling for hierarchical data, lack of external validation, and the exclusive use of discrete metrics. We address these limitations by conducting a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice (out-of-sample predictive performance, accuracy of predicted market shares, extraction of behavioural indicators, and computational efficiency). We combine several real world datasets with synthetic datasets, where the data generation function is known. The results indicate that the models with the highest disaggregate predictive performance (namely extreme gradient boosting and random forests) provide poorer estimates of behavioural indicators and aggregate mode shares, and are more expensive to estimate, than other models, including deep neural networks and Multinomial Logit (MNL). It is further observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
[[2301.04470] InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning](http://arxiv.org/abs/2301.04470) #extraction
The construction of lightweight High-definition (HD) maps containing geometric and semantic information is of foremost importance for the large-scale deployment of autonomous driving. To automatically generate such type of map from a set of images captured by a vehicle, most works formulate this mapping as a segmentation problem, which implies heavy post-processing to obtain the final vectorized representation. Alternative techniques have the ability to generate an HD map in an end-to-end manner but rely on computationally expensive auto-regressive models. To bring camera-based to an applicable level, we propose InstaGraM, a fast end-to-end network generating a vectorized HD map via instance-level graph modeling of the map elements. Our strategy consists of three main stages: top-view feature extraction, road elements' vertices and edges detection, and conversion to a semantic vector representation. After top-down feature extraction, an encoder-decoder architecture is utilized to predict a set of vertices and edge maps of the road elements. Finally, these vertices along with edge maps are associated through an attentional graph neural network generating a semantic vectorized map. Instead of relying on a common segmentation approach, we propose to regress distance transform maps as they provide strong spatial relations and directional information between vertices. Comprehensive experiments on nuScenes dataset show that our proposed network outperforms HDMapNet by 13.7 mAP and achieves comparable accuracy with VectorMapNet 5x faster inference speed.
[[2301.04581] Elevation Estimation-Driven Building 3D Reconstruction from Single-View Remote Sensing Imagery](http://arxiv.org/abs/2301.04581) #extraction
Building 3D reconstruction from remote sensing images has a wide range of applications in smart cities, photogrammetry and other fields. Methods for automatic 3D urban building modeling typically employ multi-view images as input to algorithms to recover point clouds and 3D models of buildings. However, such models rely heavily on multi-view images of buildings, which are time-intensive and limit the applicability and practicality of the models. To solve these issues, we focus on designing an efficient DSM estimation-driven reconstruction framework (Building3D), which aims to reconstruct 3D building models from the input single-view remote sensing image. First, we propose a Semantic Flow Field-guided DSM Estimation (SFFDE) network, which utilizes the proposed concept of elevation semantic flow to achieve the registration of local and global features. Specifically, in order to make the network semantics globally aware, we propose an Elevation Semantic Globalization (ESG) module to realize the semantic globalization of instances. Further, in order to alleviate the semantic span of global features and original local features, we propose a Local-to-Global Elevation Semantic Registration (L2G-ESR) module based on elevation semantic flow. Our Building3D is rooted in the SFFDE network for building elevation prediction, synchronized with a building extraction network for building masks, and then sequentially performs point cloud reconstruction, surface reconstruction (or CityGML model reconstruction). On this basis, our Building3D can optionally generate CityGML models or surface mesh models of the buildings. Extensive experiments on ISPRS Vaihingen and DFC2019 datasets on the DSM estimation task show that our SFFDE significantly improves upon state-of-the-arts. Furthermore, our Building3D achieves impressive results in the 3D point cloud and 3D model reconstruction process.
[[2301.04626] Deep Axial Hypercomplex Networks](http://arxiv.org/abs/2301.04626) #extraction
Over the past decade, deep hypercomplex-inspired networks have enhanced feature extraction for image classification by enabling weight sharing across input channels. Recent works make it possible to improve representational capabilities by using hypercomplex-inspired networks which consume high computational costs. This paper reduces this cost by factorizing a quaternion 2D convolutional module into two consecutive vectormap 1D convolutional modules. Also, we use 5D parameterized hypercomplex multiplication based fully connected layers. Incorporating both yields our proposed hypercomplex network, a novel architecture that can be assembled to construct deep axial-hypercomplex networks (DANs) for image classifications. We conduct experiments on CIFAR benchmarks, SVHN, and Tiny ImageNet datasets and achieve better performance with fewer trainable parameters and FLOPS. Our proposed model achieves almost 2% higher performance for CIFAR and SVHN datasets, and more than 3% for the ImageNet-Tiny dataset and takes six times fewer parameters than the real-valued ResNets. Also, it shows state-of-the-art performance on CIFAR benchmarks in hypercomplex space.
[[2301.04434] Multilingual Entity and Relation Extraction from Unified to Language-specific Training](http://arxiv.org/abs/2301.04434) #extraction
Entity and relation extraction is a key task in information extraction, where the output can be used for downstream NLP tasks. Existing approaches for entity and relation extraction tasks mainly focus on the English corpora and ignore other languages. Thus, it is critical to improving performance in a multilingual setting. Meanwhile, multilingual training is usually used to boost cross-lingual performance by transferring knowledge from languages (e.g., high-resource) to other (e.g., low-resource) languages. However, language interference usually exists in multilingual tasks as the model parameters are shared among all languages. In this paper, we propose a two-stage multilingual training method and a joint model called Multilingual Entity and Relation Extraction framework (mERE) to mitigate language interference across languages. Specifically, we randomly concatenate sentences in different languages to train a Language-universal Aggregator (LA), which narrows the distance of embedding representations by obtaining the unified language representation. Then, we separate parameters to mitigate interference via tuning a Language-specific Switcher (LS), which includes several independent sub-modules to refine the language-specific feature representation. After that, to enhance the relational triple extraction, the sentence representations concatenated with the relation feature are used to recognize the entities. Extensive experimental results show that our method outperforms both the monolingual and multilingual baseline methods. Besides, we also perform detailed analysis to show that mERE is lightweight but effective on relational triple extraction and mERE{} is easy to transfer to other backbone models of multi-field tasks, which further demonstrates the effectiveness of our method.
[[2301.04571] Improving And Analyzing Neural Speaker Embeddings for ASR](http://arxiv.org/abs/2301.04571) #extraction
Neural speaker embeddings encode the speaker's speech characteristics through a DNN model and are prevalent for speaker verification tasks. However, few studies have investigated the usage of neural speaker embeddings for an ASR system. In this work, we present our efforts w.r.t integrating neural speaker embeddings into a conformer based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline in combination with the Weighted-Simple-Add integration method results in x-vector and c-vector reaching on par performance with i-vectors. We further compare and analyze different speaker embeddings. We present our acoustic model improvements obtained by switching from newbob learning rate schedule to one cycle learning schedule resulting in a ~3% relative WER reduction on Switchboard, additionally reducing the overall training time by 17%. By further adding neural speaker embeddings, we gain additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
[[2301.04643] tieval: An Evaluation Framework for Temporal Information Extraction Systems](http://arxiv.org/abs/2301.04643) #extraction
Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, having access to a large volume of corpora makes it difficult when it comes to benchmark TIE systems. On the one hand, different datasets have different annotation schemes, thus hindering the comparison between competitors across different corpora. On the other hand, the fact that each corpus is commonly disseminated in a different format requires a considerable engineering effort for a researcher/practitioner to develop parsers for all of them. This constraint forces researchers to select a limited amount of datasets to evaluate their systems which consequently limits the comparability of the systems. Yet another obstacle that hinders the comparability of the TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and $F_1$, a few others prefer temporal awareness -- a metric tailored to be more comprehensive on the evaluation of temporal systems. Although the reason for the absence of temporal awareness in the evaluation of most systems is not clear, one of the factors that certainly weights this decision is the necessity to implement the temporal closure algorithm in order to compute temporal awareness, which is not straightforward to implement neither is currently easily available. All in all, these problems have limited the fair comparison between approaches and consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
[[2301.04511] Federated Learning and Blockchain-enabled Fog-IoT Platform for Wearables in Predictive Healthcare](http://arxiv.org/abs/2301.04511) #federate
Over the years, the popularity and usage of wearable Internet of Things (IoT) devices in several healthcare services are increased. Among the services that benefit from the usage of such devices is predictive analysis, which can improve early diagnosis in e-health. However, due to the limitations of wearable IoT devices, challenges in data privacy, service integrity, and network structure adaptability arose. To address these concerns, we propose a platform using federated learning and private blockchain technology within a fog-IoT network. These technologies have privacy-preserving features securing data within the network. We utilized the fog-IoT network's distributive structure to create an adaptive network for wearable IoT devices. We designed a testbed to examine the proposed platform's ability to preserve the integrity of a classifier. According to experimental results, the introduced implementation can effectively preserve a patient's privacy and a predictive service's integrity. We further investigated the contributions of other technologies to the security and adaptability of the IoT network. Overall, we proved the feasibility of our platform in addressing significant security and privacy challenges of wearable IoT devices in predictive healthcare through analysis, simulation, and experimentation.
[[2301.04430] Network Adaptive Federated Learning: Congestion and Lossy Compression](http://arxiv.org/abs/2301.04430) #federate
In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the client's lossy compression choices to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.
[[2301.04632] Federated Learning under Heterogeneous and Correlated Client Availability](http://arxiv.org/abs/2301.04632) #federate
The enormous amount of data produced by mobile and IoT devices has motivated the development of federated learning (FL), a framework allowing such devices (or clients) to collaboratively train machine learning models without sharing their local data. FL algorithms (like FedAvg) iteratively aggregate model updates computed by clients on their own datasets. Clients may exhibit different levels of participation, often correlated over time and with other clients. This paper presents the first convergence analysis for a FedAvg-like FL algorithm under heterogeneous and correlated client availability. Our analysis highlights how correlation adversely affects the algorithm's convergence rate and how the aggregation strategy can alleviate this effect at the cost of steering training toward a biased model. Guided by the theoretical analysis, we propose CA-Fed, a new FL algorithm that tries to balance the conflicting goals of maximizing convergence speed and minimizing model bias. To this purpose, CA-Fed dynamically adapts the weight given to each client and may ignore clients with low availability and large correlation. Our experimental results show that CA-Fed achieves higher time-average accuracy and a lower standard deviation than state-of-the-art AdaFed and F3AST, both on synthetic and real datasets.
[[2301.04474] Speech Driven Video Editing via an Audio-Conditioned Diffusion Model](http://arxiv.org/abs/2301.04474) #diffusion
In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model with audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off the shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.