[[2208.00388] Secure Email Transmission Protocols -- A New Architecture Design](http://arxiv.org/abs/2208.00388)
During today's digital age, emails have become a crucial part of communications for both personal and enterprise usage. However, email transmission protocols were not designed with security in mind, and this has always been a challenge while trying to make email transmission more secure. On top of the basic layer of SMTP, POP3, and IMAP protocols to send and retrieve emails, there are several other major security protocols used in current days to secure email transmission such as TLS/SSL, STARTTLS, and PGP/GPG encryption. The most general design used in email transmission architecture is SMTP with PGP/GPG encryption sending through an TLS/SSL secure channel. Regardless, vulnerabilities within these security protocols and encryption methods, there is still work can be done regarding the architecture design. In this paper, we discuss the challenges among current email transmission security protocols and architectures. We explore some new techniques and propose a new email transmission architecture using EEKS structure and Schnorr Signature to eliminate the usage of PGP/GPG for encryption while achieving Perfect Forward Secrecy.
[[2208.00205] BlockScope: Detecting and Investigating Propagated Vulnerabilities in Forked Blockchain Projects](http://arxiv.org/abs/2208.00205)
Due to the open-source nature of the blockchain ecosystem, it is common for new blockchains to fork or partially reuse the code of classic blockchains. For example, the popular Dogecoin, Litecoin, Binance BSC, and Polygon are all variants of Bitcoin/Ethereum. These "forked" blockchains thus could encounter similar vulnerabilities that are propagated from Bitcoin/Ethereum during forking or subsequently commit fetching. In this paper, we conduct a systematic study of detecting and investigating the propagated vulnerabilities in forked blockchain projects. To facilitate this study, we propose BlockScope, a novel tool that can effectively and efficiently detect multiple types of cloned vulnerabilities given an input of existing Bitcoin/Ethereum security patches. Specifically, BlockScope adopts similarity-based code match and designs a new way of calculating code similarity to cover all the syntax-wide variant (i.e., Type-1, Type-2, and Type-3) clones. Moreover, BlockScope automatically extracts and leverages the contexts of patch code to narrow down the search scope and locate only potentially relevant code for comparison. Our evaluation shows that BlockScope achieves good precision and high recall both at 91.8% (1.8 times higher recall than that in ReDeBug). BlockScope allows us to discover 101 previously unknown vulnerabilities in 13 out of the 16 forked projects of Bitcoin and Ethereum, including 16 from Dogecoin, 6 from Litecoin, 1 from Binance, and 4 from Optimism. We have reported all the vulnerabilities to their developers; 40 of them have been patched or accepted, 66 were acknowledged or under pending, and only 4 were rejected. We further investigate the propagation and patching processes of discovered vulnerabilities, and reveal three types of vulnerability propagation from source to forked projects, as well as the long delay (over 200 days) for releasing patches in Bitcoin forks.
[[2208.00235] 'PeriHack': Designing a Serious Game for Cybersecurity Awareness](http://arxiv.org/abs/2208.00235)
This paper describes the design process for the cybersecurity serious game 'PeriHack'. Publicly released under a CC (BY-NC-SA) license, PeriHack is a board and card game for two players or teams that simulates the struggle between a red team (attackers) and a blue team (defenders). The game requires players to explore a sample network looking for vulnerabilities and then chain different attacks to exploit possible weaknesses of different nature, which may include both technical and social engineering exploits. At the same time, it also simulates budget level constraints for the blue team by providing limited resources to evaluate and prioritize different critical vulnerabilities. The game is discussed via the lenses of the AGE and 6-11 Frameworks and was primarily designed as a learning tool for students in the cybersecurity and technology related fields.
[[2208.00258] Developers Struggle with Authentication in Blazor WebAssembly](http://arxiv.org/abs/2208.00258)
WebAssembly is a growing technology to build cross-platform applications. We aim to understand the security issues that developers encounter when adopting WebAssembly. We mined WebAssembly questions on Stack Overflow and identified 359 security-related posts. We classified these posts into 8 themes, reflecting developer intentions, and 19 topics, representing developer issues in this domain. We found that the most prevalent themes are related to bug fix support, requests for how to implement particular features, clarification questions, and setup or configuration issues. We noted that the topmost issues attribute to authentication in Blazor WebAssembly. We discuss six of them and provide our suggestions to clear these issues in practice.
[[2208.00373] Modification tolerant signature schemes: location and correction](http://arxiv.org/abs/2208.00373)
This paper considers malleable digital signatures, for situations where data is modified after it is signed. They can be used in applications where either the data can be modified (collaborative work), or the data must be modified (redactable and content extraction signatures) or we need to know which parts of the data have been modified (data forensics). A \new{classical} digital signature is valid for a message only if the signature is authentic and not even one bit of the message has been modified. We propose a general framework of modification tolerant signature schemes (MTSS), which can provide either location only or both location and correction, for modifications in a signed message divided into $n$ blocks. This general scheme uses a set of allowed modifications that must be specified. We present an instantiation of MTSS with a tolerance level of $d$, indicating modifications can appear in any set of up to $d$ message blocks. This tolerance level $d$ is needed in practice for parametrizing and controlling the growth of the signature size with respect to the number $n$ of blocks; using combinatorial group testing (CGT) the signature has size $O(d^2 \log n)$ which is close to the \new{best known} lower bound \new{of $\Omega(\frac{d^2}{\log d} (\log n))$}. There has been work in this very same direction using CGT by Goodrich et al. (ACNS 2005) and Idalino et al. (IPL 2015). Our work differs from theirs in that in one scheme we extend these ideas to include corrections of modification with provable security, and in another variation of the scheme we go in the opposite direction and guarantee privacy for redactable signatures, in this case preventing any leakage of redacted information.
[[2208.00214] Towards Privacy-Preserving, Real-Time and Lossless Feature Matching](http://arxiv.org/abs/2208.00214)
Most visual retrieval applications store feature vectors for downstream matching tasks. These vectors, from where user information can be spied out, will cause privacy leakage if not carefully protected. To mitigate privacy risks, current works primarily utilize non-invertible transformations or fully cryptographic algorithms. However, transformation-based methods usually fail to achieve satisfying matching performances while cryptosystems suffer from heavy computational overheads. In addition, secure levels of current methods should be improved to confront potential adversary attacks. To address these issues, this paper proposes a plug-in module called SecureVector that protects features by random permutations, 4L-DEC converting and existing homomorphic encryption techniques. For the first time, SecureVector achieves real-time and lossless feature matching among sanitized features, along with much higher security levels than current state-of-the-arts. Extensive experiments on face recognition, person re-identification, image retrieval, and privacy analyses demonstrate the effectiveness of our method. Given limited public projects in this field, codes of our method and implemented baselines are made open-source in https://github.com/IrvingMeng/SecureVector.
[[2208.00283] Recurring Contingent Service Payment](http://arxiv.org/abs/2208.00283)
Fair exchange protocols let two mutually distrustful parties exchange digital data in a way that neither party can cheat. They have various applications such as the exchange of digital items, or the exchange of digital coins and digital services between a buyer and seller. At CCS 2017, two blockchain-based protocols were proposed to support the fair exchange of digital coins and a certain service; namely, "proofs of retrievability" (PoR). In this work, we identify two notable issues of these protocols, (1) waste of the seller's resources, and (2) real-time information leakage. To rectify these issues, we formally define and propose a blockchain-based generic construction called "recurring contingent service payment" (RC-S-P). RC-S-P lets a fair exchange of digital coins and verifiable service occur periodically while ensuring that the buyer cannot waste the seller's resources, and the parties' privacy is preserved. It supports arbitrary verifiable services, such as PoR, or verifiable computation and imposes low on-chain overheads. Also, we present a concrete efficient instantiation of RC-S-P when the verifiable service is PoR. The instantiation is called "recurring contingent PoR payment" (RC-PoR-P). We have implemented RC-PoR-P and analysed its cost. When it deals with a 4-GB outsourced file, a verifier can check a proof in 90 milliseconds, and a dispute between prover and verifier is resolved in 0.1 milliseconds.
[[2208.00293] Global Attention-based Encoder-Decoder LSTM Model for Temperature Prediction of Permanent Magnet Synchronous Motors](http://arxiv.org/abs/2208.00293)
Temperature monitoring is critical for electrical motors to determine if device protection measures should be executed. However, the complexity of the internal structure of Permanent Magnet Synchronous Motors (PMSM) makes the direct temperature measurement of the internal components difficult. This work pragmatically develops three deep learning models to estimate the PMSMs' internal temperature based on readily measurable external quantities. The proposed supervised learning models exploit Long Short-Term Memory (LSTM) modules, bidirectional LSTM, and attention mechanism to form encoder-decoder structures to predict simultaneously the temperatures of the stator winding, tooth, yoke, and permanent magnet. Experiments were conducted in an exhaustive manner on a benchmark dataset to verify the proposed models' performances. The comparative analysis shows that the proposed global attention-based encoder-decoder (EnDec) model provides a competitive overall performance of 1.72 Mean Squared Error (MSE) and 5.34 Mean Absolute Error (MAE).
[[2208.00498] DNNShield: Dynamic Randomized Model Sparsification, A Defense Against Adversarial Machine Learning](http://arxiv.org/abs/2208.00498)
DNNs are known to be vulnerable to so-called adversarial attacks that manipulate inputs to cause incorrect results that can be beneficial to an attacker or damaging to the victim. Recent works have proposed approximate computation as a defense mechanism against machine learning attacks. We show that these approaches, while successful for a range of inputs, are insufficient to address stronger, high-confidence adversarial attacks. To address this, we propose DNNSHIELD, a hardware-accelerated defense that adapts the strength of the response to the confidence of the adversarial input. Our approach relies on dynamic and random sparsification of the DNN model to achieve inference approximation efficiently and with fine-grain control over the approximation error. DNNSHIELD uses the output distribution characteristics of sparsified inference compared to a dense reference to detect adversarial inputs. We show an adversarial detection rate of 86% when applied to VGG16 and 88% when applied to ResNet50, which exceeds the detection rate of the state of the art approaches, with a much lower overhead. We demonstrate a software/hardware-accelerated FPGA prototype, which reduces the performance impact of DNNSHIELD relative to software-only CPU and GPU implementations.
[[2208.00094] Robust Trajectory Prediction against Adversarial Attacks](http://arxiv.org/abs/2208.00094)
Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving (AD) systems. However, these methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions. In this work, we identify two key ingredients to defend trajectory prediction models against adversarial attacks including (1) designing effective adversarial training methods and (2) adding domain-specific data augmentation to mitigate the performance degradation on clean data. We demonstrate that our method is able to improve the performance by 46% on adversarial data and at the cost of only 3% performance degradation on clean data, compared to the model trained with clean data. Additionally, compared to existing robust methods, our method can improve performance by 21% on adversarial examples and 9% on clean data. Our robust model is evaluated with a planner to study its downstream impacts. We demonstrate that our model can significantly reduce the severe accident rates (e.g., collisions and off-road driving).
[[2208.00428] Robust Real-World Image Super-Resolution against Adversarial Attacks](http://arxiv.org/abs/2208.00428)
Recently deep neural networks (DNNs) have achieved significant success in real-world image super-resolution (SR). However, adversarial image samples with quasi-imperceptible noises could threaten deep learning SR models. In this paper, we propose a robust deep learning framework for real-world SR that randomly erases potential adversarial noises in the frequency domain of input images or features. The rationale is that on the SR task clean images or features have a different pattern from the attacked ones in the frequency domain. Observing that existing adversarial attacks usually add high-frequency noises to input images, we introduce a novel random frequency mask module that blocks out high-frequency components possibly containing the harmful perturbations in a stochastic manner. Since the frequency masking may not only destroys the adversarial perturbations but also affects the sharp details in a clean image, we further develop an adversarial sample classifier based on the frequency domain of images to determine if applying the proposed mask module. Based on the above ideas, we devise a novel real-world image SR framework that combines the proposed frequency mask modules and the proposed adversarial classifier with an existing super-resolution backbone network. Experiments show that our proposed method is more insensitive to adversarial attacks and presents more stable SR results than existing models and defenses.
[[2208.00351] Chinese grammatical error correction based on knowledge distillation](http://arxiv.org/abs/2208.00351)
In view of the poor robustness of existing Chinese grammatical error correction models on attack test sets and large model parameters, this paper uses the method of knowledge distillation to compress model parameters and improve the anti-attack ability of the model. In terms of data, the attack test set is constructed by integrating the disturbance into the standard evaluation data set, and the model robustness is evaluated by the attack test set. The experimental results show that the distilled small model can ensure the performance and improve the training speed under the condition of reducing the number of model parameters, and achieve the optimal effect on the attack test set, and the robustness is significantly improved.
[[2208.00081] Sampling Attacks on Meta Reinforcement Learning: A Minimax Formulation and Complexity Analysis](http://arxiv.org/abs/2208.00081)
Meta reinforcement learning (meta RL), as a combination of meta-learning ideas and reinforcement learning (RL), enables the agent to adapt to different tasks using a few samples. However, this sampling-based adaptation also makes meta RL vulnerable to adversarial attacks. By manipulating the reward feedback from sampling processes in meta RL, an attacker can mislead the agent into building wrong knowledge from training experience, which deteriorates the agent's performance when dealing with different tasks after adaptation. This paper provides a game-theoretical underpinning for understanding this type of security risk. In particular, we formally define the sampling attack model as a Stackelberg game between the attacker and the agent, which yields a minimax formulation. It leads to two online attack schemes: Intermittent Attack and Persistent Attack, which enable the attacker to learn an optimal sampling attack, defined by an $\epsilon$-first-order stationary point, within $\mathcal{O}(\epsilon^{-2})$ iterations. These attack schemes freeride the learning progress concurrently without extra interactions with the environment. By corroborating the convergence results with numerical experiments, we observe that a minor effort of the attacker can significantly deteriorate the learning performance, and the minimax approach can also help robustify the meta RL algorithms.
[[2208.00343] Electromagnetic Signal Injection Attacks on Differential Signaling](http://arxiv.org/abs/2208.00343)
Differential signaling is a method of data transmission that uses two complementary electrical signals to encode information. This allows a receiver to reject any noise by looking at the difference between the two signals, assuming the noise affects both signals in the same way. Many protocols such as USB, Ethernet, and HDMI use differential signaling to achieve a robust communication channel in a noisy environment. This generally works well and has led many to believe that it is infeasible to remotely inject attacking signals into such a differential pair. In this paper we challenge this assumption and show that an adversary can in fact inject malicious signals from a distance, purely using common-mode injection, i.e., injecting into both wires at the same time. We show how this allows an attacker to inject bits or even arbitrary messages into a communication line. Such an attack is a significant threat to many applications, from home security and privacy to automotive systems, critical infrastructure, or implantable medical devices; in which incorrect data or unauthorized control could cause significant damage, or even fatal accidents.
We show in detail the principles of how an electromagnetic signal can bypass the noise rejection of differential signaling, and eventually result in incorrect bits in the receiver. We show how an attacker can exploit this to achieve a successful injection of an arbitrary bit, and we analyze the success rate of injecting longer arbitrary messages. We demonstrate the attack on a real system and show that the success rate can reach as high as $90\%$. Finally, we present a case study where we wirelessly inject a message into a Controller Area Network (CAN) bus, which is a differential signaling bus protocol used in many critical applications, including the automotive and aviation sector.
[[2208.00002] HOB-CNN: Hallucination of Occluded Branches with a Convolutional Neural Network for 2D Fruit Trees](http://arxiv.org/abs/2208.00002)
Orchard automation has attracted the attention of researchers recently due to the shortage of global labor force. To automate tasks in orchards such as pruning, thinning, and harvesting, a detailed understanding of the tree structure is required. However, occlusions from foliage and fruits can make it challenging to predict the position of occluded trunks and branches. This work proposes a regression-based deep learning model, Hallucination of Occluded Branch Convolutional Neural Network (HOB-CNN), for tree branch position prediction in varying occluded conditions. We formulate tree branch position prediction as a regression problem towards the horizontal locations of the branch along the vertical direction or vice versa. We present comparative experiments on Y-shaped trees with two state-of-the-art baselines, representing common approaches to the problem. Experiments show that HOB-CNN outperform the baselines at predicting branch position and shows robustness against varying levels of occlusion. We further validated HOB-CNN against two different types of 2D trees, and HOB-CNN shows generalization across different trees and robustness under different occluded conditions.
[[2208.00113] Neural Correspondence Field for Object Pose Estimation](http://arxiv.org/abs/2208.00113)
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully-connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the first defined only for query points in the surface vicinity. We call the mapping realized by this network as Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown superior especially in challenging cases with occlusion. The project website is at: linhuang17.github.io/NCF.
[[2208.00147] Few-Shot Class-Incremental Learning from an Open-Set Perspective](http://arxiv.org/abs/2208.00147)
The continual appearance of new objects in the visual world poses considerable challenges for current deep learning methods in real-world deployments. The challenge of new task learning is often exacerbated by the scarcity of data for the new categories due to rarity or cost. Here we explore the important task of Few-Shot Class-Incremental Learning (FSCIL) and its extreme data scarcity condition of one-shot. An ideal FSCIL model needs to perform well on all classes, regardless of their presentation order or paucity of data. It also needs to be robust to open-set real-world conditions and be easily adapted to the new tasks that always arise in the field. In this paper, we first reevaluate the current task setting and propose a more comprehensive and practical setting for the FSCIL task. Then, inspired by the similarity of the goals for FSCIL and modern face recognition systems, we propose our method -- Augmented Angular Loss Incremental Classification or ALICE. In ALICE, instead of the commonly used cross-entropy loss, we propose to use the angular penalty loss to obtain well-clustered features. As the obtained features not only need to be compactly clustered but also diverse enough to maintain generalization for future incremental classes, we further discuss how class augmentation, data augmentation, and data balancing affect classification performance. Experiments on benchmark datasets, including CIFAR100, miniImageNet, and CUB200, demonstrate the improved performance of ALICE over the state-of-the-art FSCIL methods.
[[2208.00150] Learning Shadow Correspondence for Video Shadow Detection](http://arxiv.org/abs/2208.00150)
Video shadow detection aims to generate consistent shadow predictions among video frames. However, the current approaches suffer from inconsistent shadow predictions across frames, especially when the illumination and background textures change in a video. We make an observation that the inconsistent predictions are caused by the shadow feature inconsistency, i.e., the features of the same shadow regions show dissimilar proprieties among the nearby frames.In this paper, we present a novel Shadow-Consistent Correspondence method (SC-Cor) to enhance pixel-wise similarity of the specific shadow regions across frames for video shadow detection. Our proposed SC-Cor has three main advantages. Firstly, without requiring the dense pixel-to-pixel correspondence labels, SC-Cor can learn the pixel-wise correspondence across frames in a weakly-supervised manner. Secondly, SC-Cor considers intra-shadow separability, which is robust to the variant textures and illuminations in videos. Finally, SC-Cor is a plug-and-play module that can be easily integrated into existing shadow detectors with no extra computational cost. We further design a new evaluation metric to evaluate the temporal stability of the video shadow detection results. Experimental results show that SC-Cor outperforms the prior state-of-the-art method, by 6.51% on IoU and 3.35% on the newly introduced temporal stability metric.
[[2208.00219] Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation](http://arxiv.org/abs/2208.00219)
Few-shot object detection has been extensively investigated by incorporating meta-learning into region-based detection frameworks. Despite its success, the said paradigm is still constrained by several factors, such as (i) low-quality region proposals for novel classes and (ii) negligence of the inter-class correlation among different classes. Such limitations hinder the generalization of base-class knowledge for the detection of novel-class objects. In this work, we design Meta-DETR, which (i) is the first image-level few-shot detector, and (ii) introduces a novel inter-class correlational meta-learning strategy to capture and leverage the correlation among different classes for robust and accurate few-shot object detection. Meta-DETR works entirely at image level without any region proposals, which circumvents the constraint of inaccurate proposals in prevalent few-shot detection frameworks. In addition, the introduced correlational meta-learning enables Meta-DETR to simultaneously attend to multiple support classes within a single feedforward, which allows to capture the inter-class correlation among different classes, thus significantly reducing the misclassification over similar classes and enhancing knowledge generalization to novel classes. Experiments over multiple few-shot object detection benchmarks show that the proposed Meta-DETR outperforms state-of-the-art methods by large margins. The implementation codes are available at https://github.com/ZhangGongjie/Meta-DETR.
[[2208.00237] RBP-Pose: Residual Bounding Box Projection for Category-Level Pose Estimation](http://arxiv.org/abs/2208.00237)
Category-level object pose estimation aims to predict the 6D pose as well as the 3D metric size of arbitrary objects from a known set of categories. Recent methods harness shape prior adaptation to map the observed point cloud into the canonical space and apply Umeyama algorithm to recover the pose and size. However, their shape prior integration strategy boosts pose estimation indirectly, which leads to insufficient pose-sensitive feature extraction and slow inference speed. To tackle this problem, in this paper, we propose a novel geometry-guided Residual Object Bounding Box Projection network RBP-Pose that jointly predicts object pose and residual vectors describing the displacements from the shape-prior-indicated object surface projections on the bounding box towards the real surface projections. Such definition of residual vectors is inherently zero-mean and relatively small, and explicitly encapsulates spatial cues of the 3D object for robust and accurate pose regression. We enforce geometry-aware consistency terms to align the predicted pose and residual vectors to further boost performance.
[[2208.00374] Neuro-Symbolic Learning: Principles and Applications in Ophthalmology](http://arxiv.org/abs/2208.00374)
Neural networks have been rapidly expanding in recent years, with novel strategies and applications. However, challenges such as interpretability, explainability, robustness, safety, trust, and sensibility remain unsolved in neural network technologies, despite the fact that they will unavoidably be addressed for critical applications. Attempts have been made to overcome the challenges in neural network computing by representing and embedding domain knowledge in terms of symbolic representations. Thus, the neuro-symbolic learning (NeSyL) notion emerged, which incorporates aspects of symbolic representation and bringing common sense into neural networks (NeSyL). In domains where interpretability, reasoning, and explainability are crucial, such as video and image captioning, question-answering and reasoning, health informatics, and genomics, NeSyL has shown promising outcomes. This review presents a comprehensive survey on the state-of-the-art NeSyL approaches, their principles, advances in machine and deep learning algorithms, applications such as opthalmology, and most importantly, future perspectives of this emerging field.
[[2208.00385] Evaluating Table Structure Recognition: A New Perspective](http://arxiv.org/abs/2208.00385)
Existing metrics used to evaluate table structure recognition algorithms have shortcomings with regard to capturing text and empty cells alignment. In this paper, we build on prior work and propose a new metric - TEDS based IOU similarity (TEDS (IOU)) for table structure recognition which uses bounding boxes instead of text while simultaneously being robust against the above disadvantages. We demonstrate the effectiveness of our metric against previous metrics through various examples.
[[2208.00438] Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition](http://arxiv.org/abs/2208.00438)
Artistic text recognition is an extremely challenging task with a wide range of applications. However, current scene text recognition methods mainly focus on irregular text while have not explored artistic text specifically. The challenges of artistic text recognition include the various appearance with special-designed fonts and effects, the complex connections and overlaps between characters, and the severe interference from background patterns. To alleviate these problems, we propose to recognize the artistic text at three levels. Firstly, corner points are applied to guide the extraction of local features inside characters, considering the robustness of corner structures to appearance and shape. In this way, the discreteness of the corner points cuts off the connection between characters, and the sparsity of them improves the robustness for background interference. Secondly, we design a character contrastive loss to model the character-level feature, improving the feature representation for character classification. Thirdly, we utilize Transformer to learn the global feature on image-level and model the global relationship of the corner points, with the assistance of a corner-query cross-attention mechanism. Besides, we provide an artistic text dataset to benchmark the performance. Experimental results verify the significant superiority of our proposed method on artistic text recognition and also achieve state-of-the-art performance on several blurred and perspective datasets.
[[2208.00444] BYOLMed3D: Self-Supervised Representation Learning of Medical Videos using Gradient Accumulation Assisted 3D BYOL Framework](http://arxiv.org/abs/2208.00444)
Applications on Medical Image Analysis suffer from acute shortage of large volume of data properly annotated by medical experts. Supervised Learning algorithms require a large volumes of balanced data to learn robust representations. Often supervised learning algorithms require various techniques to deal with imbalanced data. Self-supervised learning algorithms on the other hand are robust to imbalance in the data and are capable of learning robust representations. In this work, we train a 3D BYOL self-supervised model using gradient accumulation technique to deal with the large number of samples in a batch generally required in a self-supervised algorithm. To the best of our knowledge, this work is one of the first of its kind in this domain. We compare the results obtained through our experiments in the downstream task of ACL Tear Injury detection with the contemporary self-supervised pre-training methods and also with ResNet3D-18 initialized with the Kinetics-400 pre-trained weights. From the downstream task experiments, it is evident that the proposed framework outperforms the existing baselines.
[[2208.00453] One-Shot Medical Landmark Localization by Edge-Guided Transform and Noisy Landmark Refinement](http://arxiv.org/abs/2208.00453)
As an important upstream task for many medical applications, supervised landmark localization still requires non-negligible annotation costs to achieve desirable performance. Besides, due to cumbersome collection procedures, the limited size of medical landmark datasets impacts the effectiveness of large-scale self-supervised pre-training methods. To address these challenges, we propose a two-stage framework for one-shot medical landmark localization, which first infers landmarks by unsupervised registration from the labeled exemplar to unlabeled targets, and then utilizes these noisy pseudo labels to train robust detectors. To handle the significant structure variations, we learn an end-to-end cascade of global alignment and local deformations, under the guidance of novel loss functions which incorporate edge information. In stage II, we explore self-consistency for selecting reliable pseudo labels and cross-consistency for semi-supervised learning. Our method achieves state-of-the-art performances on public datasets of different body parts, which demonstrates its general applicability.
[[2208.00539] Is current research on adversarial robustness addressing the right problem?](http://arxiv.org/abs/2208.00539)
Short answer: Yes, Long answer: No! Indeed, research on adversarial robustness has led to invaluable insights helping us understand and explore different aspects of the problem. Many attacks and defenses have been proposed over the last couple of years. The problem, however, remains largely unsolved and poorly understood. Here, I argue that the current formulation of the problem serves short term goals, and needs to be revised for us to achieve bigger gains. Specifically, the bound on perturbation has created a somewhat contrived setting and needs to be relaxed. This has misled us to focus on model classes that are not expressive enough to begin with. Instead, inspired by human vision and the fact that we rely more on robust features such as shape, vertices, and foreground objects than non-robust features such as texture, efforts should be steered towards looking for significantly different classes of models. Maybe instead of narrowing down on imperceptible adversarial perturbations, we should attack a more general problem which is finding architectures that are simultaneously robust to perceptible perturbations, geometric transformations (e.g. rotation, scaling), image distortions (lighting, blur), and more (e.g. occlusion, shadow). Only then we may be able to solve the problem of adversarial vulnerability.
[[2208.00632] Multi-spectral Vehicle Re-identification with Cross-directional Consistency Network and a High-quality Benchmark](http://arxiv.org/abs/2208.00632)
To tackle the challenge of vehicle re-identification (Re-ID) in complex lighting environments and diverse scenes, multi-spectral sources like visible and infrared information are taken into consideration due to their excellent complementary advantages.
However, multi-spectral vehicle Re-ID suffers cross-modality discrepancy caused by heterogeneous properties of different modalities as well as a big challenge of the diverse appearance with different views in each identity.
Meanwhile, diverse environmental interference leads to heavy sample distributional discrepancy in each modality.
In this work, we propose a novel cross-directional consistency network to simultaneously overcome the discrepancies from both modality and sample aspects.
In particular, we design a new cross-directional center loss to pull the modality centers of each identity close to mitigate cross-modality discrepancy, while the sample centers of each identity close to alleviate the sample discrepancy. Such strategy can generate discriminative multi-spectral feature representations for vehicle Re-ID.
In addition, we design an adaptive layer normalization unit to dynamically adjust individual feature distribution to handle distributional discrepancy of intra-modality features for robust learning.
To provide a comprehensive evaluation platform, we create a high-quality RGB-NIR-TIR multi-spectral vehicle Re-ID benchmark (MSVR310), including 310 different vehicles from a broad range of viewpoints, time spans and environmental complexities.
Comprehensive experiments on both created and public datasets demonstrate the effectiveness of the proposed approach comparing to the state-of-the-art methods.
[[2208.00662] Local Perception-Aware Transformer for Aerial Tracking](http://arxiv.org/abs/2208.00662)
Transformer-based visual object tracking has been utilized extensively. However, the Transformer structure is lack of enough inductive bias. In addition, only focusing on encoding the global feature does harm to modeling local details, which restricts the capability of tracking in aerial robots. Specifically, with local-modeling to global-search mechanism, the proposed tracker replaces the global encoder by a novel local-recognition encoder. In the employed encoder, a local-recognition attention and a local element correction network are carefully designed for reducing the global redundant information interference and increasing local inductive bias. Meanwhile, the latter can model local object details precisely under aerial view through detail-inquiry net. The proposed method achieves competitive accuracy and robustness in several authoritative aerial benchmarks with 316 sequences in total. The proposed tracker's practicability and efficiency have been validated by the real-world tests.
[[2208.00690] Generative Bias for Visual Question Answering](http://arxiv.org/abs/2208.00690)
The task of Visual Question Answering (VQA) is known to be plagued by the issue of VQA models exploiting biases within the dataset to make its final prediction. Many previous ensemble based debiasing methods have been proposed where an additional model is purposefully trained to be biased in order to aid in training a robust target model. However, these methods compute the bias for a model from the label statistics of the training data or directly from single modal branches. In contrast, in this work, in order to better learn the bias a target VQA model suffers from, we propose a generative method to train the bias model \emph{directly from the target model}, called GenB. In particular, GenB employs a generative network to learn the bias through a combination of the adversarial objective and knowledge distillation. We then debias our target model with GenB as a bias model, and show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE.
[[2208.00338] Symmetry Regularization and Saturating Nonlinearity for Robust Quantization](http://arxiv.org/abs/2208.00338)
Robust quantization improves the tolerance of networks for various implementations, allowing reliable output in different bit-widths or fragmented low-precision arithmetic. In this work, we perform extensive analyses to identify the sources of quantization error and present three insights to robustify a network against quantization: reduction of error propagation, range clamping for error minimization, and inherited robustness against quantization. Based on these insights, we propose two novel methods called symmetry regularization (SymReg) and saturating nonlinearity (SatNL). Applying the proposed methods during training can enhance the robustness of arbitrary neural networks against quantization on existing post-training quantization (PTQ) and quantization-aware training (QAT) algorithms and enables us to obtain a single weight flexible enough to maintain the output quality under various conditions. We conduct extensive studies on CIFAR and ImageNet datasets and validate the effectiveness of the proposed methods.
[[2208.00461] Adaptive Temperature Scaling for Robust Calibration of Deep Neural Networks](http://arxiv.org/abs/2208.00461)
In this paper, we study the post-hoc calibration of modern neural networks, a problem that has drawn a lot of attention in recent years. Many calibration methods of varying complexity have been proposed for the task, but there is no consensus about how expressive these should be. We focus on the task of confidence scaling, specifically on post-hoc methods that generalize Temperature Scaling, we call these the Adaptive Temperature Scaling family. We analyse expressive functions that improve calibration and propose interpretable methods. We show that when there is plenty of data complex models like neural networks yield better performance, but are prone to fail when the amount of data is limited, a common situation in certain post-hoc calibration applications like medical diagnosis. We study the functions that expressive methods learn under ideal conditions and design simpler methods but with a strong inductive bias towards these well-performing functions. Concretely, we propose Entropy-based Temperature Scaling, a simple method that scales the confidence of a prediction according to its entropy. Results show that our method obtains state-of-the-art performance when compared to others and, unlike complex models, it is robust against data scarcity. Moreover, our proposed model enables a deeper interpretation of the calibration process.
[[2208.00465] Vector-Based Data Improves Left-Right Eye-Tracking Classifier Performance After a Covariate Distributional Shift](http://arxiv.org/abs/2208.00465)
The main challenges of using electroencephalogram (EEG) signals to make eye-tracking (ET) predictions are the differences in distributional patterns between benchmark data and real-world data and the noise resulting from the unintended interference of brain signals from multiple sources. Increasing the robustness of machine learning models in predicting eye-tracking position from EEG data is therefore integral for both research and consumer use. In medical research, the usage of more complicated data collection methods to test for simpler tasks has been explored to address this very issue. In this study, we propose a fine-grain data approach for EEG-ET data collection in order to create more robust benchmarking. We train machine learning models utilizing both coarse-grain and fine-grain data and compare their accuracies when tested on data of similar/different distributional patterns in order to determine how susceptible EEG-ET benchmarks are to differences in distributional data. We apply a covariate distributional shift to test for this susceptibility. Results showed that models trained on fine-grain, vector-based data were less susceptible to distributional shifts than models trained on coarse-grain, binary-classified data.
[[2208.00368] Skeleton-Parted Graph Scattering Networks for 3D Human Motion Prediction](http://arxiv.org/abs/2208.00368)
Graph convolutional network based methods that model the body-joints' relations, have recently shown great promise in 3D skeleton-based human motion prediction. However, these methods have two critical issues: first, deep graph convolutions filter features within only limited graph spectrums, losing sufficient information in the full band; second, using a single graph to model the whole body underestimates the diverse patterns on various body-parts. To address the first issue, we propose adaptive graph scattering, which leverages multiple trainable band-pass graph filters to decompose pose features into richer graph spectrum bands. To address the second issue, body-parts are modeled separately to learn diverse dynamics, which enables finer feature extraction along the spatial dimensions. Integrating the above two designs, we propose a novel skeleton-parted graph scattering network (SPGSN). The cores of the model are cascaded multi-part graph scattering blocks (MPGSBs), building adaptive graph scattering on diverse body-parts, as well as fusing the decomposed features based on the inferred spectrum importance and body-part interactions. Extensive experiments have shown that SPGSN outperforms state-of-the-art methods by remarkable margins of 13.8%, 9.3% and 2.7% in terms of 3D mean per joint position error (MPJPE) on Human3.6M, CMU Mocap and 3DPW datasets, respectively.
[[2208.00627] A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval](http://arxiv.org/abs/2208.00627)
The computer-aided diagnosis (CAD) system can provide a reference basis for the clinical diagnosis of skin diseases. Convolutional neural networks (CNNs) can not only extract visual elements such as colors and shapes but also semantic features. As such they have made great improvements in many tasks of dermoscopy images. The imaging of dermoscopy has no main direction, indicating that there are a large number of skin lesion target rotations in the datasets. However, CNNs lack anti-rotation ability, which is bound to affect the feature extraction ability of CNNs. We propose a rotation meanout (RM) network to extract rotation invariance features from dermoscopy images. In RM, each set of rotated feature maps corresponds to a set of weight-sharing convolution outputs and they are fused using meanout operation to obtain the final feature maps. Through theoretical derivation, the proposed RM network is rotation-equivariant and can extract rotation-invariant features when being followed by the global average pooling (GAP) operation. The extracted rotation-invariant features can better represent the original data in classification and retrieval tasks for dermoscopy images. The proposed RM is a general operation, which does not change the network structure or increase any parameter, and can be flexibly embedded in any part of CNNs. Extensive experiments are conducted on a dermoscopy image dataset. The results show our method outperforms other anti-rotation methods and achieves great improvements in dermoscopy image classification and retrieval tasks, indicating the potential of rotation invariance in the field of dermoscopy images.
[[2208.00773] Real Time Object Detection System with YOLO and CNN Models: A Review](http://arxiv.org/abs/2208.00773)
The field of artificial intelligence is built on object detection techniques. YOU ONLY LOOK ONCE (YOLO) algorithm and it's more evolved versions are briefly described in this research survey. This survey is all about YOLO and convolution neural networks (CNN)in the direction of real time object detection.YOLO does generalized object representation more effectively without precision losses than other object detection models.CNN architecture models have the ability to eliminate highlights and identify objects in any given image. When implemented appropriately, CNN models can address issues like deformity diagnosis, creating educational or instructive application, etc. This article reached atnumber of observations and perspective findings through the analysis.Also it provides support for the focused visual information and feature extraction in the financial and other industries, highlights the method of target detection and feature selection, and briefly describe the development process of YOLO algorithm.
[[2208.00346] Improving Distantly Supervised Relation Extraction by Natural Language Inference](http://arxiv.org/abs/2208.00346)
To reduce human annotations for relation extraction (RE) tasks, distantly supervised approaches have been proposed, while struggling with low performance. In this work, we propose a novel DSRE-NLI framework, which considers both distant supervision from existing knowledge bases and indirect supervision from pretrained language models for other tasks. DSRE-NLI energizes an off-the-shelf natural language inference (NLI) engine with a semi-automatic relation verbalization (SARV) mechanism to provide indirect supervision and further consolidates the distant annotations to benefit multi-classification RE models. The NLI-based indirect supervision acquires only one relation verbalization template from humans as a semantically general template for each relationship, and then the template set is enriched by high-quality textual patterns automatically mined from the distantly annotated corpus. With two simple and effective data consolidation strategies, the quality of training data is substantially improved. Extensive experiments demonstrate that the proposed framework significantly improves the SOTA performance (up to 7.73\% of F1) on distantly supervised RE benchmark datasets.
[[2208.00635] DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning](http://arxiv.org/abs/2208.00635)
Although pre-trained language models (PLMs) have achieved state-of-the-art performance on various natural language processing (NLP) tasks, they are shown to be lacking in knowledge when dealing with knowledge driven tasks. Despite the many efforts made for injecting knowledge into PLMs, this problem remains open. To address the challenge, we propose \textbf{DictBERT}, a novel approach that enhances PLMs with dictionary knowledge which is easier to acquire than knowledge graph (KG). During pre-training, we present two novel pre-training tasks to inject dictionary knowledge into PLMs via contrastive learning: \textit{dictionary entry prediction} and \textit{entry description discrimination}. In fine-tuning, we use the pre-trained DictBERT as a plugin knowledge base (KB) to retrieve implicit knowledge for identified entries in an input sequence, and infuse the retrieved knowledge into the input to enhance its representation via a novel extra-hop attention mechanism. We evaluate our approach on a variety of knowledge driven and language understanding tasks, including NER, relation extraction, CommonsenseQA, OpenBookQA and GLUE. Experimental results demonstrate that our model can significantly improve typical PLMs: it gains a substantial improvement of 0.5\%, 2.9\%, 9.0\%, 7.1\% and 3.3\% on BERT-large respectively, and is also effective on RoBERTa-large.
[[2208.00042] GoodFATR: A Platform for Automated Threat Report Collection and IOC Extraction](http://arxiv.org/abs/2208.00042)
To adapt to a constantly evolving landscape of cyber threats, organizations actively need to collect Indicators of Compromise (IOCs), i.e., forensic artifacts that signal that a host or network might have been compromised. IOCs can be collected through open-source and commercial structured IOC feeds. But, they can also be extracted from a myriad of unstructured threat reports written in natural language and distributed using a wide array of sources such as blogs and social media. This work presents GoodFATR an automated platform for collecting threat reports from a wealth of sources and extracting IOCs from them. GoodFATR supports 6 sources: RSS, Twitter, Telegram, Malpedia, APTnotes, and ChainSmith. GoodFATR continuously monitors the sources, downloads new threat reports, extracts 41 indicator types from the collected reports, and filters generic indicators to output the IOCs. We propose a novel majority-vote methodology for evaluating the accuracy of indicator extraction tools, and apply it to compare 7 popular tools with GoodFATR's indicator extraction module. We run GoodFATR over 15 months to collect 472,891 reports from the 6 sources; extract 1,043,932 indicators from the reports; and identify 655,971 IOCs. We analyze the collected data to identify the top IOC contributors and the IOC class distribution. Finally, we present a case study on how GoodFATR can assist in tracking cybercrime relations on the Bitcoin blockchain.
[[2208.00335] Functional Rule Extraction Method for Artificial Neural Networks](http://arxiv.org/abs/2208.00335)
The idea I propose in this paper is a method that is based on comprehensive functions for directed and undirected rule extraction from artificial neural network operations.
[[2208.00275] Revisiting the Critical Factors of Augmentation-Invariant Representation Learning](http://arxiv.org/abs/2208.00275)
We focus on better understanding the critical factors of augmentation-invariant representation learning. We revisit MoCo v2 and BYOL and try to prove the authenticity of the following assumption: different frameworks bring about representations of different characteristics even with the same pretext task. We establish the first benchmark for fair comparisons between MoCo v2 and BYOL, and observe: (i) sophisticated model configurations enable better adaptation to pre-training dataset; (ii) mismatched optimization strategies of pre-training and fine-tuning hinder model from achieving competitive transfer performances. Given the fair benchmark, we make further investigation and find asymmetry of network structure endows contrastive frameworks to work well under the linear evaluation protocol, while may hurt the transfer performances on long-tailed classification tasks. Moreover, negative samples do not make models more sensible to the choice of data augmentations, nor does the asymmetric network structure. We believe our findings provide useful information for future work.
[[2208.00651] De-biased Representation Learning for Fairness with Unreliable Labels](http://arxiv.org/abs/2208.00651)
Removing bias while keeping all task-relevant information is challenging for fair representation learning methods since they would yield random or degenerate representations w.r.t. labels when the sensitive attributes correlate with labels. Existing works proposed to inject the label information into the learning procedure to overcome such issues. However, the assumption that the observed labels are clean is not always met. In fact, label bias is acknowledged as the primary source inducing discrimination. In other words, the fair pre-processing methods ignore the discrimination encoded in the labels either during the learning procedure or the evaluation stage. This contradiction puts a question mark on the fairness of the learned representations. To circumvent this issue, we explore the following question: \emph{Can we learn fair representations predictable to latent ideal fair labels given only access to unreliable labels?} In this work, we propose a \textbf{D}e-\textbf{B}iased \textbf{R}epresentation Learning for \textbf{F}airness (DBRF) framework which disentangles the sensitive information from non-sensitive attributes whilst keeping the learned representations predictable to ideal fair labels rather than observed biased ones. We formulate the de-biased learning framework through information-theoretic concepts such as mutual information and information bottleneck. The core concept is that DBRF advocates not to use unreliable labels for supervision when sensitive information benefits the prediction of unreliable labels. Experiment results over both synthetic and real-world data demonstrate that DBRF effectively learns de-biased representations towards ideal labels.
[[2208.00726] Proportional Fair Division of Multi-layered Cakes](http://arxiv.org/abs/2208.00726)
We study the multi-layered cake cutting problem, where the multi-layered cake is divided among agents proportionally. This problem was initiated by Hosseini et al.(2020) under two constraints, one is contiguity and the other is feasibility. Basically we will show the existence of proportional multi-allocation for any number of agents with any number of preferences that satisfies contiguity and feasibility constraints using the idea of switching point for individual agent and majority agents. First we show that exact feasible multi-allocation is guaranteed to exist for two agents with two types of preferences. Second we see that we always get an envy-free multi-allocation that satisfies the feasibility and contiguity constraints for three agent with two types of preferences such that each agent has a share to each layer even without the knowledge of the unique preference of the third agent.
[[2208.00457] INSightR-Net: Interpretable Neural Network for Regression using Similarity-based Comparisons to Prototypical Examples](http://arxiv.org/abs/2208.00457)
Convolutional neural networks (CNNs) have shown exceptional performance for a range of medical imaging tasks. However, conventional CNNs are not able to explain their reasoning process, therefore limiting their adoption in clinical practice. In this work, we propose an inherently interpretable CNN for regression using similarity-based comparisons (INSightR-Net) and demonstrate our methods on the task of diabetic retinopathy grading. A prototype layer incorporated into the architecture enables visualization of the areas in the image that are most similar to learned prototypes. The final prediction is then intuitively modeled as a mean of prototype labels, weighted by the similarities. We achieved competitive prediction performance with our INSightR-Net compared to a ResNet baseline, showing that it is not necessary to compromise performance for interpretability. Furthermore, we quantified the quality of our explanations using sparsity and diversity, two concepts considered important for a good explanation, and demonstrated the effect of several parameters on the latent space embeddings.
[[2208.00064] Thutmose Tagger: Single-pass neural model for Inverse Text Normalization](http://arxiv.org/abs/2208.00064)
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition (ASR). It converts numbers, dates, abbreviations, and other semiotic classes from the spoken form generated by ASR to their written forms. One can consider ITN as a Machine Translation task and use neural sequence-to-sequence models to solve it. Unfortunately, such neural models are prone to hallucinations that could lead to unacceptable errors. To mitigate this issue, we propose a single-pass token classifier model that regards ITN as a tagging task. The model assigns a replacement fragment to every input token or marks it for deletion or copying without changes. We present a dataset preparation method based on the granular alignment of ITN examples. The proposed model is less prone to hallucination errors. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets. One-to-one correspondence between tags and input words improves the interpretability of the model's predictions, simplifies debugging, and allows for post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings. The model and the code to prepare the dataset is published as part of NeMo project.
[[2208.00563] Backdoor Watermarking Deep Learning Classification Models With Deep Fidelity](http://arxiv.org/abs/2208.00563)
Backdoor Watermarking is a promising paradigm to protect the copyright of deep neural network (DNN) models for classification tasks. In the existing works on this subject, researchers have intensively focused on watermarking robustness, while fidelity, which is concerned with the original functionality, has received less attention. In this paper, we show that the existing shared notion of the sole measurement of learning accuracy is insufficient to characterize backdoor fidelity. Meanwhile, we show that the analogous concept of embedding distortion in multimedia watermarking, interpreted as the total weight loss (TWL) in DNN backdoor watermarking, is also unsuitable to measure the fidelity. To solve this problem, we propose the concept of deep fidelity, which states that the backdoor watermarked DNN model should preserve both the feature representation and decision boundary of the unwatermarked host model. Accordingly, to realize deep fidelity, we propose two loss functions termed as penultimate feature loss (PFL) and softmax probability-distribution loss (SPL) to preserve feature representation, while the decision boundary is preserved by the proposed fix last layer (FixLL) treatment, inspired by the recent discovery that deep learning with a fixed classifier causes no loss of learning accuracy. With the above designs, both embedding from scratch and fine-tuning strategies are implemented to evaluate deep fidelity of backdoor embedding, whose advantages over the existing methods are verified via experiments using ResNet18 for MNIST and CIFAR-10 classifications, and wide residual network (i.e., WRN28_10) for CIFAR-100 task.