[[2209.12987] TrustToken, a Trusted SoC solution for Non-Trusted Intellectual Property (IP)s](http://arxiv.org/abs/2209.12987)
Secure and trustworthy execution in heterogeneous SoCs is a major priority in the modern computing system. Security of SoCs mainly addresses two broad layers of trust issues: 1. Protection against hardware security threats(Side-channel, IP Privacy, Cloning, Fault Injection, and Denial of Service); and 2. Protection against malicious software attacks running on SoC processors. To resist malicious software-level attackers from gaining unauthorized access and compromising security, we propose a root of trust-based trusted execution mechanism \textbf{\textit{(named as \textbf{TrustToken}) }}. TrustToken builds a security block to provide a root of trust-based IP security: secure key generation and truly random source.
\textbf{TrustToken} only allows trusted communication between the non-trusted third-party IP and the rest of the SoC world by providing essential security features, i.e., secure, isolated execution, and trusted user interaction. The proposed design achieves this by interconnecting the third-party IP interface to \textbf{TrustToken} Controller and checking IP authorization(Token) signals \texttt{`correctness'} at run-time. \textbf{TrustToken} architecture shows a very low overhead resource utilization LUT (618, 1.16 \%), FF (44, 0.04 \%), and BUFG (2 , 6.25\%) in implementation. The experiment results show that TrustToken can provide a secure, low-cost, and trusted solution for non-trusted SoC IPs.
[[2209.13431] MTTBA- A Key Contributor for Sustainable Energy Consumption Time and Space Utility for Highly Secured Crypto Transactions in Blockchain Technology](http://arxiv.org/abs/2209.13431)
A Merkle tree is an information construction that is used in Blockchain to verify data or transactions in a large content pool in a safe manner. The role of the Merkle tree is crucial in Bitcoin and other cryptocurrencies in a Blockchain network. In this paper, we propose a bright and enhanced verification method, Merkle Trim Tree-based Blockchain Authentication (MTTBA) for the hash node traversal to reach the Merkle Root in a minimum time. MTTBA is a unique mechanism for verifying the Merkle Tree's accumulated transactions specifically for an odd number of transactions. The future impact of cryptocurrency is going to be massive and MTTBA proves its efficacy in transaction speed and eliminating node duplication. Our method enables any block to validate transactions' full availability without duplicating hash nodes. Performance has been evaluated in different parameters and the results show marked improvement in throughput(1680ms), processing time(29700kbps), memory usage(140MB), and security(99.30%). The energy consumption factor is crucial in the scenario, and MTTBA has achieved the lowest of 240 joules.
[[2209.12985] A Bibliometrics Analysis on 28 years of Authentication and Threat Model Area](http://arxiv.org/abs/2209.12985)
The large volume of publications in any research area can make it difficult for researchers to track their research areas' trends, challenges, and characteristics. Bibliometrics solves this problem by bringing statistical tools to help the analysis of selected publications from an online database. Although there are different works in security, our study aims to fill the bibliometric gap in the authentication and threat model area. As a result, a description of the dataset obtained, an overview of some selected variables, and an analysis of the ten most cited articles in this selected dataset is presented, which brings together publications from the last 28 years in these areas combined.
[[2209.12993] Device Tracking via Linux's New TCP Source Port Selection Algorithm (Extended Version)](http://arxiv.org/abs/2209.12993)
We describe a tracking technique for Linux devices, exploiting a new TCP source port generation mechanism recently introduced to the Linux kernel. This mechanism is based on an algorithm, standardized in RFC 6056, for boosting security by better randomizing port selection. Our technique detects collisions in a hash function used in the said algorithm, based on sampling TCP source ports generated in an attacker-prescribed manner. These hash collisions depend solely on a per-device key, and thus the set of collisions forms a device ID that allows tracking devices across browsers, browser privacy modes, containers, and IPv4/IPv6 networks (including some VPNs). It can distinguish among devices with identical hardware and software, and lasts until the device restarts.
We implemented this technique and then tested it using tracking servers in two different locations and with Linux devices on various networks. We also tested it on an Android device that we patched to introduce the new port selection algorithm. The tracking technique works in real-life conditions, and we report detailed findings about it, including its dwell time, scalability, and success rate in different network types. We worked with the Linux kernel team to mitigate the exploit, resulting in a security patch introduced in May 2022 to the Linux kernel, and we provide recommendations for better securing the port selection algorithm in the paper.
[[2209.13288] A Benchmark Comparison of Python Malware Detection Approaches](http://arxiv.org/abs/2209.13288)
While attackers often distribute malware to victims via open-source, community-driven package repositories, these repositories do not currently run automated malware detection systems. In this work, we explore the security goals of the repository administrators and the requirements for deployments of such malware scanners via a case study of the Python ecosystem and PyPI repository, which includes interviews with administrators and maintainers. Further, we evaluate existing malware detection techniques for deployment in this setting by creating a benchmark dataset and comparing several existing tools, including the malware checks implemented in PyPI, Bandit4Mal, and OSSGadget's OSS Detect Backdoor.
We find that repository administrators have exacting technical demands for such malware detection tools. Specifically, they consider a false positive rate of even 0.01% to be unacceptably high, given the large number of package releases that might trigger false alerts. Measured tools have false positive rates between 15% and 97%; increasing thresholds for detection rules to reduce this rate renders the true positive rate useless. In some cases, these checks emitted alerts more often for benign packages than malicious ones. However, we also find a successful socio-technical malware detection system: external security researchers also perform repository malware scans and report the results to repository administrators. These parties face different incentives and constraints on their time and tooling. We conclude with recommendations for improving detection capabilities and strengthening the collaboration between security researchers and software repository administrators.
[[2209.13454] Artificial Intelligence for Cybersecurity: Threats, Attacks and Mitigation](http://arxiv.org/abs/2209.13454)
With the advent of the digital era, every day-to-day task is automated due to technological advances. However, technology has yet to provide people with enough tools and safeguards. As the internet connects more-and-more devices around the globe, the question of securing the connected devices grows at an even spiral rate. Data thefts, identity thefts, fraudulent transactions, password compromises, and system breaches are becoming regular everyday news. The surging menace of cyber-attacks got a jolt from the recent advancements in Artificial Intelligence. AI is being applied in almost every field of different sciences and engineering. The intervention of AI not only automates a particular task but also improves efficiency by many folds. So it is evident that such a scrumptious spread would be very appetizing to cybercriminals. Thus the conventional cyber threats and attacks are now ``intelligent" threats. This article discusses cybersecurity and cyber threats along with both conventional and intelligent ways of defense against cyber-attacks. Furthermore finally, end the discussion with the potential prospects of the future of AI in cybersecurity.
[[2209.13590] Sauron U-Net: Simple automated redundancy elimination in medical image segmentation via filter pruning](http://arxiv.org/abs/2209.13590)
We present Sauron, a filter pruning method that eliminates redundant feature maps by discarding the corresponding filters with automatically-adjusted layer-specific thresholds. Furthermore, Sauron minimizes a regularization term that, as we show with various metrics, promotes the formation of feature maps clusters. In contrast to most filter pruning methods, Sauron is single-phase, similarly to typical neural network optimization, requiring fewer hyperparameters and design decisions. Additionally, unlike other cluster-based approaches, our method does not require pre-selecting the number of clusters, which is non-trivial to determine and varies across layers. We evaluated Sauron and three state-of-the-art filter pruning methods on three medical image segmentation tasks. This is an area where filter pruning has received little attention and where it can help building efficient models for medical grade computers that cannot use cloud services due to privacy considerations. Sauron achieved models with higher performance and pruning rate than the competing pruning methods. Additionally, since Sauron removes filters during training, its optimization accelerated over time. Finally, we show that the feature maps of a Sauron-pruned model were highly interpretable. The Sauron code is publicly available at https://github.com/jmlipman/SauronUNet.
[[2209.13073] Preprint: Privacy-preserving IoT Data Sharing Scheme](http://arxiv.org/abs/2209.13073)
Data sharing can be granted using different factors one of which is something in a users or an IoT devices environment which is in this paper broadcast signals. Using broadcast signals to measure Received Signal Strength Indicator values and Machine Learning models this paper implements an IoT data sharing scheme based on something that is in an IoT devices environment. The proposed scheme is experimentally tested using different ML models and shows 97.78 percent as its highest accuracy.
[[2209.13032] Totems: Physical Objects for Verifying Visual Integrity](http://arxiv.org/abs/2209.13032)
We introduce a new approach to image forensics: placing physical refractive objects, which we call totems, into a scene so as to protect any photograph taken of that scene. Totems bend and redirect light rays, thus providing multiple, albeit distorted, views of the scene within a single image. A defender can use these distorted totem pixels to detect if an image has been manipulated. Our approach unscrambles the light rays passing through the totems by estimating their positions in the scene and using their known geometric and material properties. To verify a totem-protected image, we detect inconsistencies between the scene reconstructed from totem viewpoints and the scene's appearance from the camera viewpoint. Such an approach makes the adversarial manipulation task more difficult, as the adversary must modify both the totem and image pixels in a geometrically consistent manner without knowing the physical properties of the totem. Unlike prior learning-based approaches, our method does not require training on datasets of specific manipulations, and instead uses physical properties of the scene and camera to solve the forensics problem.
[[2209.13113] FG-UAP: Feature-Gathering Universal Adversarial Perturbation](http://arxiv.org/abs/2209.13113)
Deep Neural Networks (DNNs) are susceptible to elaborately designed perturbations, whether such perturbations are dependent or independent of images. The latter one, called Universal Adversarial Perturbation (UAP), is very attractive for model robustness analysis, since its independence of input reveals the intrinsic characteristics of the model. Relatively, another interesting observation is Neural Collapse (NC), which means the feature variability may collapse during the terminal phase of training. Motivated by this, we propose to generate UAP by attacking the layer where NC phenomenon happens. Because of NC, the proposed attack could gather all the natural images' features to its surrounding, which is hence called Feature-Gathering UAP (FG-UAP).
We evaluate the effectiveness our proposed algorithm on abundant experiments, including untargeted and targeted universal attacks, attacks under limited dataset, and transfer-based black-box attacks among different architectures including Vision Transformers, which are believed to be more robust. Furthermore, we investigate FG-UAP in the view of NC by analyzing the labels and extracted features of adversarial examples, finding that collapse phenomenon becomes stronger after the model is corrupted. The code will be released when the paper is accepted.
[[2209.13353] Suppress with a Patch: Revisiting Universal Adversarial Patch Attacks against Object Detection](http://arxiv.org/abs/2209.13353)
Adversarial patch-based attacks aim to fool a neural network with an intentionally generated noise, which is concentrated in a particular region of an input image. In this work, we perform an in-depth analysis of different patch generation parameters, including initialization, patch size, and especially positioning a patch in an image during training. We focus on the object vanishing attack and run experiments with YOLOv3 as a model under attack in a white-box setting and use images from the COCO dataset. Our experiments have shown, that inserting a patch inside a window of increasing size during training leads to a significant increase in attack strength compared to a fixed position. The best results were obtained when a patch was positioned randomly during training, while patch position additionally varied within a batch.
[[2209.13523] Watch What You Pretrain For: Targeted, Transferable Adversarial Examples on Self-Supervised Speech Recognition models](http://arxiv.org/abs/2209.13523)
Targeted adversarial attacks against Automatic Speech Recognition (ASR) are thought to require white-box access to the targeted model to be effective, which mitigates the threat that they pose. We show that the recent line of Transformer ASR models pretrained with Self-Supervised Learning (SSL) are much more at risk: adversarial examples generated against them are transferable, making these models vulnerable to targeted, zero-knowledge attacks. We release an adversarial dataset that partially fools most publicly released SSL-pretrained ASR models (Wav2Vec2, HuBERT, WavLM, etc). With low-level additive noise achieving a 30dB Signal-Noise Ratio, we can force these models to predict our target sentences with up to 80% accuracy, instead of their original transcription. With an ablation study, we show that Self-Supervised pretraining is the main cause of that vulnerability. We also propose an explanation for that curious phenomenon, which increases the threat posed by adversarial attacks on state-of-the-art ASR models.
[[2209.13204] NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System](http://arxiv.org/abs/2209.13204)
We present a neural network-based system for long-term, multi-action human motion synthesis. The system, dubbed as NEURAL MARIONETTE, can produce high-quality and meaningful motions with smooth transitions from simple user input, including a sequence of action tags with expected action duration, and optionally a hand-drawn moving trajectory if the user specifies. The core of our system is a novel Transformer-based motion generation model, namely MARIONET, which can generate diverse motions given action tags. Different from existing motion generation models, MARIONET utilizes contextual information from the past motion clip and future action tag, dedicated to generating actions that can smoothly blend historical and future actions. Specifically, MARIONET first encodes target action tag and contextual information into an action-level latent code. The code is unfolded into frame-level control signals via a time unrolling module, which could be then combined with other frame-level control signals like the target trajectory. Motion frames are then generated in an auto-regressive way. By sequentially applying MARIONET, the system NEURAL MARIONETTE can robustly generate long-term, multi-action motions with the help of two simple schemes, namely "Shadow Start" and "Action Revision". Along with the novel system, we also present a new dataset dedicated to the multi-action motion synthesis task, which contains both action tags and their contextual information. Extensive experiments are conducted to study the action accuracy, naturalism, and transition smoothness of the motions generated by our system.
[[2209.13284] Frame Interpolation for Dynamic Scenes with Implicit Flow Encoding](http://arxiv.org/abs/2209.13284)
In this paper, we propose an algorithm to interpolate between a pair of images of a dynamic scene. While in the past years significant progress in frame interpolation has been made, current approaches are not able to handle images with brightness and illumination changes, which are common even when the images are captured shortly apart. We propose to address this problem by taking advantage of the existing optical flow methods that are highly robust to the variations in the illumination. Specifically, using the bidirectional flows estimated using an existing pre-trained flow network, we predict the flows from an intermediate frame to the two input images. To do this, we propose to encode the bidirectional flows into a coordinate-based network, powered by a hypernetwork, to obtain a continuous representation of the flow across time. Once we obtain the estimated flows, we use them within an existing blending network to obtain the final intermediate frame. Through extensive experiments, we demonstrate that our approach is able to produce significantly better results than state-of-the-art frame interpolation algorithms.
[[2209.13420] Stacking Ensemble Learning in Deep Domain Adaptation for Ophthalmic Image Classification](http://arxiv.org/abs/2209.13420)
Domain adaptation is an attractive approach given the availability of a large amount of labeled data with similar properties but different domains. It is effective in image classification tasks where obtaining sufficient label data is challenging. We propose a novel method, named SELDA, for stacking ensemble learning via extending three domain adaptation methods for effectively solving real-world problems. The major assumption is that when base domain adaptation models are combined, we can obtain a more accurate and robust model by exploiting the ability of each of the base models. We extend Maximum Mean Discrepancy (MMD), Low-rank coding, and Correlation Alignment (CORAL) to compute the adaptation loss in three base models. Also, we utilize a two-fully connected layer network as a meta-model to stack the output predictions of these three well-performing domain adaptation models to obtain high accuracy in ophthalmic image classification tasks. The experimental results using Age-Related Eye Disease Study (AREDS) benchmark ophthalmic dataset demonstrate the effectiveness of the proposed model.
[[2209.13514] StyleSwap: Style-Based Generator Empowers Robust Face Swapping](http://arxiv.org/abs/2209.13514)
Numerous attempts have been made to the task of person-agnostic face swapping given its wide applications. While existing methods mostly rely on tedious network and loss designs, they still struggle in the information balancing between the source and target faces, and tend to produce visible artifacts. In this work, we introduce a concise and effective framework named StyleSwap. Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, thus the generator's advantage can be adopted for optimizing identity similarity. We identify that with only minimal modifications, a StyleGAN2 architecture can successfully handle the desired information from both source and target. Additionally, inspired by the ToRGB layers, a Swapping-Driven Mask Branch is further devised to improve information blending. Furthermore, the advantage of StyleGAN inversion can be adopted. Particularly, a Swapping-Guided ID Inversion strategy is proposed to optimize identity similarity. Extensive experiments validate that our framework generates high-quality face swapping results that outperform state-of-the-art methods both qualitatively and quantitatively.
[[2209.12944] On the Impact of Speech Recognition Errors in Passage Retrieval for Spoken Question Answering](http://arxiv.org/abs/2209.12944)
Interacting with a speech interface to query a Question Answering (QA) system is becoming increasingly popular. Typically, QA systems rely on passage retrieval to select candidate contexts and reading comprehension to extract the final answer. While there has been some attention to improving the reading comprehension part of QA systems against errors that automatic speech recognition (ASR) models introduce, the passage retrieval part remains unexplored. However, such errors can affect the performance of passage retrieval, leading to inferior end-to-end performance. To address this gap, we augment two existing large-scale passage ranking and open domain QA datasets with synthetic ASR noise and study the robustness of lexical and dense retrievers against questions with ASR noise. Furthermore, we study the generalizability of data augmentation techniques across different domains; with each domain being a different language dialect or accent. Finally, we create a new dataset with questions voiced by human users and use their transcriptions to show that the retrieval performance can further degrade when dealing with natural ASR noise instead of synthetic ASR noise.
[[2209.13187] DAMO-NLP at NLPCC-2022 Task 2: Knowledge Enhanced Robust NER for Speech Entity Linking](http://arxiv.org/abs/2209.13187)
Speech Entity Linking aims to recognize and disambiguate named entities in spoken languages. Conventional methods suffer gravely from the unfettered speech styles and the noisy transcripts generated by ASR systems. In this paper, we propose a novel approach called Knowledge Enhanced Named Entity Recognition (KENER), which focuses on improving robustness through painlessly incorporating proper knowledge in the entity recognition stage and thus improving the overall performance of entity linking. KENER first retrieves candidate entities for a sentence without mentions, and then utilizes the entity descriptions as extra information to help recognize mentions. The candidate entities retrieved by a dense retrieval module are especially useful when the input is short or noisy. Moreover, we investigate various data sampling strategies and design effective loss functions, in order to improve the quality of retrieved entities in both recognition and disambiguation stages. Lastly, a linking with filtering module is applied as the final safeguard, making it possible to filter out wrongly-recognized mentions. Our system achieves 1st place in Track 1 and 2nd place in Track 2 of NLPCC-2022 Shared Task 2.
[[2209.13331] EditEval: An Instruction-Based Benchmark for Text Improvements](http://arxiv.org/abs/2209.13331)
Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform these skills and the ability to edit remains sparse. This work presents EditEval: An instruction-based, benchmark and evaluation suite that leverages high-quality existing and new datasets for automatic evaluation of editing capabilities such as making text more cohesive and paraphrasing. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA, particularly when neutralizing and updating information. Our analysis also shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models. Through the release of this benchmark and a publicly available leaderboard challenge, we hope to unlock future research in developing models capable of iterative and more controllable editing.
[[2209.13160] Collaborative Decision Making Using Action Suggestions](http://arxiv.org/abs/2209.13160)
The level of autonomy is increasing in systems spanning multiple domains, but these systems still experience failures. One way to mitigate the risk of failures is to integrate human oversight of the autonomous systems and rely on the human to take control when the autonomy fails. In this work, we formulate a method of collaborative decision making through action suggestions that improves action selection without taking control of the system. Our approach uses each suggestion efficiently by incorporating the implicit information shared through suggestions to modify the agent's belief and achieves better performance with fewer suggestions than naively following the suggested actions. We assume collaborative agents share the same objective and communicate through valid actions. By assuming the suggested action is dependent only on the state, we can incorporate the suggested action as an independent observation of the environment. The assumption of a collaborative environment enables us to use the agent's policy to estimate the distribution over action suggestions. We propose two methods that use suggested actions and demonstrate the approach through simulated experiments. The proposed methodology results in increased performance while also being robust to suboptimal suggestions.
[[2209.13237] Reinforcement Learning for Cognitive Delay/Disruption Tolerant Network Node Management in an LEO-based Satellite Constellation](http://arxiv.org/abs/2209.13237)
In recent years, with the large-scale deployment of space spacecraft entities and the increase of satellite onboard capabilities, delay/disruption tolerant network (DTN) emerged as a more robust communication protocol than TCP/IP in the case of excessive network dynamics. DTN node buffer management is still an active area of research, as the current implementation of the DTN core protocol still relies on the assumption that there is always enough memory available in different network nodes to store and forward bundles. In addition, the classical queuing theory does not apply to the dynamic management of DTN node buffers. Therefore, this paper proposes a centralized approach to automatically manage cognitive DTN nodes in low earth orbit (LEO) satellite constellation scenarios based on the advanced reinforcement learning (RL) strategy advantage actor-critic (A2C). The method aims to explore training a geosynchronous earth orbit intelligent agent to manage all DTN nodes in an LEO satellite constellation scenario. The goal of the A2C agent is to maximize delivery success rate and minimize network resource consumption cost while considering node memory utilization. The intelligent agent can dynamically adjust the radio data rate and perform drop operations based on bundle priority. In order to measure the effectiveness of applying A2C technology to DTN node management issues in LEO satellite constellation scenarios, this paper compares the trained intelligent agent strategy with the other two non-RL policies, including random and standard policies. Experiments show that the A2C strategy balances delivery success rate and cost, and provides the highest reward and the lowest node memory utilization.
[[2209.13254] Identifying and Extracting Football Features from Real-World Media Sources using Only Synthetic Training Data](http://arxiv.org/abs/2209.13254)
Real-world images used for training machine learning algorithms are often unstructured and inconsistent. The process of analysing and tagging these images can be costly and error prone (also availability, gaps and legal conundrums). However, as we demonstrate in this article, the potential to generate accurate graphical images that are indistinguishable from real-world sources has a multitude of benefits in machine learning paradigms. One such example of this is football data from broadcast services (television and other streaming media sources). The football games are usually recorded from multiple sources (cameras and phones) and resolutions, not to mention, occlusion of visual details and other artefacts (like blurring, weathering and lighting conditions) which make it difficult to accurately identify features. We demonstrate an approach which is able to overcome these limitations using generated tagged and structured images. The generated images are able to simulate a variety views and conditions (including noise and blurring) which may only occur sporadically in real-world data and make it difficult for machine learning algorithm to 'cope' with these unforeseen problems in real-data. This approach enables us to rapidly train and prepare a robust solution that accurately extracts features (e.g., spacial locations, markers on the pitch, player positions, ball location and camera FOV) from real-world football match sources for analytical purposes.
[[2209.13511] Phy-Taylor: Physics-Model-Based Deep Neural Networks](http://arxiv.org/abs/2209.13511)
Purely data-driven deep neural networks (DNNs) applied to physical engineering systems can infer relations that violate physics laws, thus leading to unexpected consequences. To address this challenge, we propose a physics-model-based DNN framework, called Phy-Taylor, that accelerates learning compliant representations with physical knowledge. The Phy-Taylor framework makes two key contributions; it introduces a new architectural Physics-compatible neural network (PhN), and features a novel compliance mechanism, we call {\em Physics-guided Neural Network Editing\/}. The PhN aims to directly capture nonlinearities inspired by physical quantities, such as kinetic energy, potential energy, electrical power, and aerodynamic drag force. To do so, the PhN augments neural network layers with two key components: (i) monomials of Taylor series expansion of nonlinear functions capturing physical knowledge, and (ii) a suppressor for mitigating the influence of noise. The neural-network editing mechanism further modifies network links and activation functions consistently with physical knowledge. As an extension, we also propose a self-correcting Phy-Taylor framework that introduces two additional capabilities: (i) physics-model-based safety relationship learning, and (ii) automatic output correction when violations of safety occur. Through experiments, we show that (by expressing hard-to-learn nonlinearities directly and by constraining dependencies) Phy-Taylor features considerably fewer parameters, and a remarkably accelerated training process, while offering enhanced model robustness and accuracy.
[[2209.12962] FaRO 2: an Open Source, Configurable Smart City Framework for Real-Time Distributed Vision and Biometric Systems](http://arxiv.org/abs/2209.12962)
Recent global growth in the interest of smart cities has led to trillions of dollars of investment toward research and development. These connected cities have the potential to create a symbiosis of technology and society and revolutionize the cost of living, safety, ecological sustainability, and quality of life of societies on a world-wide scale. Some key components of the smart city construct are connected smart grids, self-driving cars, federated learning systems, smart utilities, large-scale public transit, and proactive surveillance systems. While exciting in prospect, these technologies and their subsequent integration cannot be attempted without addressing the potential societal impacts of such a high degree of automation and data sharing. Additionally, the feasibility of coordinating so many disparate tasks will require a fast, extensible, unifying framework. To that end, we propose FaRO2, a completely reimagined successor to FaRO1, built from the ground up. FaRO2 affords all of the same functionality as its predecessor, serving as a unified biometric API harness that allows for seamless evaluation, deployment, and simple pipeline creation for heterogeneous biometric software. FaRO2 additionally provides a fully declarative capability for defining and coordinating custom machine learning and sensor pipelines, allowing the distribution of processes across otherwise incompatible hardware and networks. FaRO2 ultimately provides a way to quickly configure, hot-swap, and expand large coordinated or federated systems online without interruptions for maintenance. Because much of the data collected in a smart city contains Personally Identifying Information (PII), FaRO2 also provides built-in tools and layers to ensure secure and encrypted streaming, storage, and access of PII data across distributed systems.
[[2209.13090] EEG-based Image Feature Extraction for Visual Classification using Deep Learning](http://arxiv.org/abs/2209.13090)
While capable of segregating visual data, humans take time to examine a single piece, let alone thousands or millions of samples. The deep learning models efficiently process sizeable information with the help of modern-day computing. However, their questionable decision-making process has raised considerable concerns. Recent studies have identified a new approach to extract image features from EEG signals and combine them with standard image features. These approaches make deep learning models more interpretable and also enables faster converging of models with fewer samples. Inspired by recent studies, we developed an efficient way of encoding EEG signals as images to facilitate a more subtle understanding of brain signals with deep learning models. Using two variations in such encoding methods, we classified the encoded EEG signals corresponding to 39 image classes with a benchmark accuracy of 70% on the layered dataset of six subjects, which is significantly higher than the existing work. Our image classification approach with combined EEG features achieved an accuracy of 82% compared to the slightly better accuracy of a pure deep learning approach; nevertheless, it demonstrates the viability of the theory.
[[2209.13116] Spatio-Temporal Relation Learning for Video Anomaly Detection](http://arxiv.org/abs/2209.13116)
Anomaly identification is highly dependent on the relationship between the object and the scene, as different/same object actions in same/different scenes may lead to various degrees of normality and anomaly. Therefore, object-scene relation actually plays a crucial role in anomaly detection but is inadequately explored in previous works. In this paper, we propose a Spatial-Temporal Relation Learning (STRL) framework to tackle the video anomaly detection task. First, considering dynamic characteristics of the objects as well as scene areas, we construct a Spatio-Temporal Auto-Encoder (STAE) to jointly exploit spatial and temporal evolution patterns for representation learning. For better pattern extraction, two decoding branches are designed in the STAE module, i.e. an appearance branch capturing spatial cues by directly predicting the next frame, and a motion branch focusing on modeling the dynamics via optical flow prediction. Then, to well concretize the object-scene relation, a Relation Learning (RL) module is devised to analyze and summarize the normal relations by introducing the Knowledge Graph Embedding methodology. Specifically in this process, the plausibility of object-scene relation is measured by jointly modeling object/scene features and optimizable object-scene relation maps. Extensive experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
[[2209.13232] A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective](http://arxiv.org/abs/2209.13232)
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (\emph{e.g.,} social network analysis and recommender systems), computer vision (\emph{e.g.,} object detection and point cloud learning), and natural language processing (\emph{e.g.,} relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, \emph{i.e.,} 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
[[2209.13459] EgoSpeed-Net: Forecasting Speed-Control in Driver Behavior from Egocentric Video Data](http://arxiv.org/abs/2209.13459)
Speed-control forecasting, a challenging problem in driver behavior analysis, aims to predict the future actions of a driver in controlling vehicle speed such as braking or acceleration. In this paper, we try to address this challenge solely using egocentric video data, in contrast to the majority of works in the literature using either third-person view data or extra vehicle sensor data such as GPS, or both. To this end, we propose a novel graph convolutional network (GCN) based network, namely, EgoSpeed-Net. We are motivated by the fact that the position changes of objects over time can provide us very useful clues for forecasting the speed change in future. We first model the spatial relations among the objects from each class, frame by frame, using fully-connected graphs, on top of which GCNs are applied for feature extraction. Then we utilize a long short-term memory network to fuse such features per class over time into a vector, concatenate such vectors and forecast a speed-control action using a multilayer perceptron classifier. We conduct extensive experiments on the Honda Research Institute Driving Dataset and demonstrate the superior performance of EgoSpeed-Net.
[[2209.13500] Dense-TNT: Efficient Vehicle Type Classification Neural Network Using Satellite Imagery](http://arxiv.org/abs/2209.13500)
Accurate vehicle type classification serves a significant role in the intelligent transportation system. It is critical for ruler to understand the road conditions and usually contributive for the traffic light control system to response correspondingly to alleviate traffic congestion. New technologies and comprehensive data sources, such as aerial photos and remote sensing data, provide richer and high-dimensional information. Also, due to the rapid development of deep neural network technology, image based vehicle classification methods can better extract underlying objective features when processing data. Recently, several deep learning models have been proposed to solve the problem. However, traditional pure convolutional based approaches have constraints on global information extraction, and the complex environment, such as bad weather, seriously limits the recognition capability. To improve the vehicle type classification capability under complex environment, this study proposes a novel Densely Connected Convolutional Transformer in Transformer Neural Network (Dense-TNT) framework for the vehicle type classification by stacking Densely Connected Convolutional Network (DenseNet) and Transformer in Transformer (TNT) layers. Three-region vehicle data and four different weather conditions are deployed for recognition capability evaluation. Experimental findings validate the recognition ability of our proposed vehicle classification model with little decay, even under the heavy foggy weather condition.
[[2209.13136] A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing](http://arxiv.org/abs/2209.13136)
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
[[2209.13464] Information Extraction and Human-Robot Dialogue towards Real-life Tasks: A Baseline Study with the MobileCS Dataset](http://arxiv.org/abs/2209.13464)
Recently, there have merged a class of task-oriented dialogue (TOD) datasets collected through Wizard-of-Oz simulated games. However, the Wizard-of-Oz data are in fact simulated data and thus are fundamentally different from real-life conversations, which are more noisy and casual. Recently, the SereTOD challenge is organized and releases the MobileCS dataset, which consists of real-world dialog transcripts between real users and customer-service staffs from China Mobile. Based on the MobileCS dataset, the SereTOD challenge has two tasks, not only evaluating the construction of the dialogue system itself, but also examining information extraction from dialog transcripts, which is crucial for building the knowledge base for TOD. This paper mainly presents a baseline study of the two tasks with the MobileCS dataset. We introduce how the two baselines are constructed, the problems encountered, and the results. We anticipate that the baselines can facilitate exciting future research to build human-robot dialogue systems for real-life tasks.
[[2209.13080] FedStack: Personalized activity monitoring using stacked federated learning](http://arxiv.org/abs/2209.13080)
Recent advances in remote patient monitoring (RPM) systems can recognize various human activities to measure vital signs, including subtle motions from superficial vessels. There is a growing interest in applying artificial intelligence (AI) to this area of healthcare by addressing known limitations and challenges such as predicting and classifying vital signs and physical movements, which are considered crucial tasks. Federated learning is a relatively new AI technique designed to enhance data privacy by decentralizing traditional machine learning modeling. However, traditional federated learning requires identical architectural models to be trained across the local clients and global servers. This limits global model architecture due to the lack of local models heterogeneity. To overcome this, a novel federated learning architecture, FedStack, which supports ensembling heterogeneous architectural client models was proposed in this study. This work offers a protected privacy system for hospitalized in-patients in a decentralized approach and identifies optimum sensor placement. The proposed architecture was applied to a mobile health sensor benchmark dataset from 10 different subjects to classify 12 routine activities. Three AI models, ANN, CNN, and Bi-LSTM were trained on individual subject data. The federated learning architecture was applied to these models to build local and global models capable of state of the art performances. The local CNN model outperformed ANN and Bi-LSTM models on each subject data. Our proposed work has demonstrated better performance for heterogeneous stacking of the local models compared to homogeneous stacking. This work sets the stage to build an enhanced RPM system that incorporates client privacy to assist with clinical observations for patients in an acute mental health facility and ultimately help to prevent unexpected death.
[[2209.13115] Semi-Synchronous Personalized Federated Learning over Mobile Edge Networks](http://arxiv.org/abs/2209.13115)
Personalized Federated Learning (PFL) is a new Federated Learning (FL) approach to address the heterogeneity issue of the datasets generated by distributed user equipments (UEs). However, most existing PFL implementations rely on synchronous training to ensure good convergence performances, which may lead to a serious straggler problem, where the training time is heavily prolonged by the slowest UE. To address this issue, we propose a semi-synchronous PFL algorithm, termed as Semi-Synchronous Personalized FederatedAveraging (PerFedS$^2$), over mobile edge networks. By jointly optimizing the wireless bandwidth allocation and UE scheduling policy, it not only mitigates the straggler problem but also provides convergent training loss guarantees. We derive an upper bound of the convergence rate of PerFedS2 in terms of the number of participants per global round and the number of rounds. On this basis, the bandwidth allocation problem can be solved using analytical solutions and the UE scheduling policy can be obtained by a greedy algorithm. Experimental results verify the effectiveness of PerFedS2 in saving training time as well as guaranteeing the convergence of training loss, in contrast to synchronous and asynchronous PFL algorithms.
[[2209.12995] Habitat classification from satellite observations with sparse annotations](http://arxiv.org/abs/2209.12995)
Remote sensing benefits habitat conservation by making monitoring of large areas easier compared to field surveying especially if the remote sensed data can be automatically analyzed. An important aspect of monitoring is classifying and mapping habitat types present in the monitored area. Automatic classification is a difficult task, as classes have fine-grained differences and their distributions are long-tailed and unbalanced. Usually training data used for automatic land cover classification relies on fully annotated segmentation maps, annotated from remote sensed imagery to a fairly high-level taxonomy, i.e., classes such as forest, farmland, or urban area. A challenge with automatic habitat classification is that reliable data annotation requires field-surveys. Therefore, full segmentation maps are expensive to produce, and training data is often sparse, point-like, and limited to areas accessible by foot. Methods for utilizing these limited data more efficiently are needed.
We address these problems by proposing a method for habitat classification and mapping, and apply this method to classify the entire northern Finnish Lapland area into Natura2000 classes. The method is characterized by using finely-grained, sparse, single-pixel annotations collected from the field, combined with large amounts of unannotated data to produce segmentation maps. Supervised, unsupervised and semi-supervised methods are compared, and the benefits of transfer learning from a larger out-of-domain dataset are demonstrated. We propose a \ac{CNN} biased towards center pixel classification ensembled with a random forest classifier, that produces higher quality classifications than the models themselves alone. We show that cropping augmentations, test-time augmentation and semi-supervised learning can help classification even further.
[[2209.13177] A Survey of Fairness in Medical Image Analysis: Concepts, Algorithms, Evaluations, and Challenges](http://arxiv.org/abs/2209.13177)
Fairness, a criterion focuses on evaluating algorithm performance on different demographic groups, has gained attention in natural language processing, recommendation system and facial recognition. Since there are plenty of demographic attributes in medical image samples, it is important to understand the concepts of fairness, be acquainted with unfairness mitigation techniques, evaluate fairness degree of an algorithm and recognize challenges in fairness issues in medical image analysis (MedIA). In this paper, we first give a comprehensive and precise definition of fairness, following by introducing currently used techniques in fairness issues in MedIA. After that, we list public medical image datasets that contain demographic attributes for facilitating the fairness research and summarize current algorithms concerning fairness in MedIA. To help achieve a better understanding of fairness, and call attention to fairness related issues in MedIA, experiments are conducted comparing the difference between fairness and data imbalance, verifying the existence of unfairness in various MedIA tasks, especially in classification, segmentation and detection, and evaluating the effectiveness of unfairness mitigation algorithms. Finally, we conclude with opportunities and challenges in fairness in MedIA.
[[2209.13369] OBBStacking: An Ensemble Method for Remote Sensing Object Detection](http://arxiv.org/abs/2209.13369)
Ensemble methods are a reliable way to combine several models to achieve superior performance. However, research on the application of ensemble methods in the remote sensing object detection scenario is mostly overlooked. Two problems arise. First, one unique characteristic of remote sensing object detection is the Oriented Bounding Boxes (OBB) of the objects and the fusion of multiple OBBs requires further research attention. Second, the widely used deep learning object detectors provide a score for each detected object as an indicator of confidence, but how to use these indicators effectively in an ensemble method remains a problem. Trying to address these problems, this paper proposes OBBStacking, an ensemble method that is compatible with OBBs and combines the detection results in a learned fashion. This ensemble method helps take 1st place in the Challenge Track \textit{Fine-grained Object Recognition in High-Resolution Optical Images}, which was featured in \textit{2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation}. The experiments on DOTA dataset and FAIR1M dataset demonstrate the improved performance of OBBStacking and the features of OBBStacking are analyzed.
[[2209.13578] Learning When to Advise Human Decision Makers](http://arxiv.org/abs/2209.13578)
Artificial intelligence (AI) systems are increasingly used for providing advice to facilitate human decision making. While a large body of work has explored how AI systems can be optimized to produce accurate and fair advice and how algorithmic advice should be presented to human decision makers, in this work we ask a different basic question: When should algorithms provide advice? Motivated by limitations of the current practice of constantly providing algorithmic advice, we propose the design of AI systems that interact with the human user in a two-sided manner and provide advice only when it is likely to be beneficial to the human in making their decision. Our AI systems learn advising policies using past human decisions. Then, for new cases, the learned policies utilize input from the human to identify cases where algorithmic advice would be useful, as well as those where the human is better off deciding alone. We conduct a large-scale experiment to evaluate our approach by using data from the US criminal justice system on pretrial-release decisions. In our experiment, participants were asked to assess the risk of defendants to violate their release terms if released and were advised by different advising approaches. The results show that our interactive-advising approach manages to provide advice at times of need and to significantly improve human decision making compared to fixed, non-interactive advising approaches. Our approach has additional advantages in facilitating human learning, preserving complementary strengths of human decision makers, and leading to more positive responsiveness to the advice.
[[2209.13179] Explainable Global Fairness Verification of Tree-Based Classifiers](http://arxiv.org/abs/2209.13179)
We present a new approach to the global fairness verification of tree-based classifiers. Given a tree-based classifier and a set of sensitive features potentially leading to discrimination, our analysis synthesizes sufficient conditions for fairness, expressed as a set of traditional propositional logic formulas, which are readily understandable by human experts. The verified fairness guarantees are global, in that the formulas predicate over all the possible inputs of the classifier, rather than just a few specific test instances. Our analysis is formally proved both sound and complete. Experimental results on public datasets show that the analysis is precise, explainable to human experts and efficient enough for practical adoption.
[[2209.13123] Explainable Graph Pyramid Autoformer for Long-Term Traffic Forecasting](http://arxiv.org/abs/2209.13123)
Accurate traffic forecasting is vital to an intelligent transportation system. Although many deep learning models have achieved state-of-art performance for short-term traffic forecasting of up to 1 hour, long-term traffic forecasting that spans multiple hours remains a major challenge. Moreover, most of the existing deep learning traffic forecasting models are black box, presenting additional challenges related to explainability and interpretability. We develop Graph Pyramid Autoformer (X-GPA), an explainable attention-based spatial-temporal graph neural network that uses a novel pyramid autocorrelation attention mechanism. It enables learning from long temporal sequences on graphs and improves long-term traffic forecasting accuracy. Our model can achieve up to 35 % better long-term traffic forecast accuracy than that of several state-of-the-art methods. The attention-based scores from the X-GPA model provide spatial and temporal explanations based on the traffic dynamics, which change for normal vs. peak-hour traffic and weekday vs. weekend traffic.