[[2304.07676] Privacy-Enhanced Living: A Local Differential Privacy Approach to Secure Smart Home Data](http://arxiv.org/abs/2304.07676) #secure
The rapid expansion of Internet of Things (IoT) devices in smart homes has significantly improved the quality of life, offering enhanced convenience, automation, and energy efficiency. However, this proliferation of connected devices raises critical concerns regarding security and privacy of the user data. In this paper, we propose a differential privacy-based system to ensure comprehensive security for data generated by smart homes. We employ the randomized response technique for the data and utilize Local Differential Privacy (LDP) to achieve data privacy. The data is then transmitted to an aggregator, where an obfuscation method is applied to ensure individual anonymity. Furthermore, we implement the Hidden Markov Model (HMM) technique at the aggregator level and apply differential privacy to the private data received from smart homes. Consequently, our approach achieves a dual layer of privacy protection, addressing the security concerns associated with IoT devices in smart cities.
[[2304.07721] A Novel end-to-end Framework for Occluded Pixel Reconstruction with Spatio-temporal Features for Improved Person Re-identification](http://arxiv.org/abs/2304.07721) #security
Person re-identification is vital for monitoring and tracking crowd movement to enhance public security. However, re-identification in the presence of occlusion substantially reduces the performance of existing systems and is a challenging area. In this work, we propose a plausible solution to this problem by developing effective occlusion detection and reconstruction framework for RGB images/videos consisting of Deep Neural Networks. Specifically, a CNN-based occlusion detection model classifies individual input frames, followed by a Conv-LSTM and Autoencoder to reconstruct the occluded pixels corresponding to the occluded frames for sequential (video) and non-sequential (image) data, respectively. The quality of the reconstructed RGB frames is further refined and fine-tuned using a Conditional Generative Adversarial Network (cGAN). Our method is evaluated on four well-known public data sets of the domain, and the qualitative reconstruction results are indeed appealing. Quantitative evaluation in terms of re-identification accuracy of the Siamese network showed an exceptional Rank-1 accuracy after occluded pixel reconstruction on various datasets. A comparative analysis with state-of-the-art approaches also demonstrates the robustness of our work for use in real-life surveillance systems.
[[2304.07411] SoK: The MITRE ATT&CK Framework in Research and Practice](http://arxiv.org/abs/2304.07411) #security
The MITRE ATT&CK framework, a comprehensive knowledge base of adversary tactics and techniques, has been widely adopted by the cybersecurity industry as well as by academic researchers. Its broad range of industry applications include threat intelligence, threat detection, and incident response, some of which go beyond what it was originally designed for. Despite its popularity, there is a lack of a systematic review of the applications and the research on ATT&CK. This systematization of work aims to fill this gap. To this end, it introduces the first taxonomic systematization of the research literature on ATT&CK, studies its degree of usefulness in different applications, and identifies important gaps and discrepancies in the literature to identify key directions for future work. The results of this work provide valuable insights for academics and practitioners alike, highlighting the need for more research on the practical implementation and evaluation of ATT&CK.
[[2304.07470] Few-shot Weakly-supervised Cybersecurity Anomaly Detection](http://arxiv.org/abs/2304.07470) #security
With increased reliance on Internet based technologies, cyberattacks compromising users' sensitive data are becoming more prevalent. The scale and frequency of these attacks are escalating rapidly, affecting systems and devices connected to the Internet. The traditional defense mechanisms may not be sufficiently equipped to handle the complex and ever-changing new threats. The significant breakthroughs in the machine learning methods including deep learning, had attracted interests from the cybersecurity research community for further enhancements in the existing anomaly detection methods. Unfortunately, collecting labelled anomaly data for all new evolving and sophisticated attacks is not practical. Training and tuning the machine learning model for anomaly detection using only a handful of labelled data samples is a pragmatic approach. Therefore, few-shot weakly supervised anomaly detection is an encouraging research direction. In this paper, we propose an enhancement to an existing few-shot weakly-supervised deep learning anomaly detection framework. This framework incorporates data augmentation, representation learning and ordinal regression. We then evaluated and showed the performance of our implemented framework on three benchmark datasets: NSL-KDD, CIC-IDS2018, and TON_IoT.
[[2304.07594] Preventing Malicious Use of Keyloggers Using Anti-Keyloggers](http://arxiv.org/abs/2304.07594) #security
Multinational corporations routinely track how its employees use their computers, the internet, or email. There are roughly a thousand devices on the market that enable businesses to monitor their workforce. Observe what their users do online, in their emails, and on their so-called personal laptops while at work. Our team will develop a key-logger for this Journal project that will allow us to capture every keystroke. Additionally, it will enable us to capture the mouse movements and clicks. This will enable us to keep an eye on how a person uses the internet and every other application on his or her "own" computer. Keyloggers, on the other hand, can also be used to steal data in the form of malware or something comparable. An anti-key logger will also be created to address this issue, allowing us to determine whether a key logger is already monitoring the system. We will be able to remain watchful and maintain the security of the data on our system thanks to the anti-keylogger.
[[2304.07648] Certifying Zero-Knowledge Circuits with Refinement Types](http://arxiv.org/abs/2304.07648) #security
Zero-knowledge (ZK) proof systems have emerged as a promising solution for building security-sensitive applications. However, bugs in ZK applications are extremely difficult to detect and can allow a malicious party to silently exploit the system without leaving any observable trace. This paper presents Coda, a novel statically-typed language for building zero-knowledge applications. Critically, Coda makes it possible to formally specify and statically check properties of a ZK application through a rich refinement type system. One of the key challenges in formally verifying ZK applications is that they require reasoning about polynomial equations over large prime fields that go beyond the capabilities of automated theorem provers. Coda mitigates this challenge by generating a set of Coq lemmas that can be proven in an interactive manner with the help of a tactic library. We have used Coda to re-implement 79 arithmetic circuits from widely-used Circom libraries and applications. Our evaluation shows that Coda makes it possible to specify important and formally verify correctness properties of these circuits. Our evaluation also revealed 6 previously-unknown vulnerabilities in the original Circom projects.
[[2304.07668] FedBlockHealth: A Synergistic Approach to Privacy and Security in IoT-Enabled Healthcare through Federated Learning and Blockchain](http://arxiv.org/abs/2304.07668) #security
The rapid adoption of Internet of Things (IoT) devices in healthcare has introduced new challenges in preserving data privacy, security and patient safety. Traditional approaches need to ensure security and privacy while maintaining computational efficiency, particularly for resource-constrained IoT devices. This paper proposes a novel hybrid approach combining federated learning and blockchain technology to provide a secure and privacy-preserved solution for IoT-enabled healthcare applications. Our approach leverages a public-key cryptosystem that provides semantic security for local model updates, while blockchain technology ensures the integrity of these updates and enforces access control and accountability. The federated learning process enables a secure model aggregation without sharing sensitive patient data. We implement and evaluate our proposed framework using EMNIST datasets, demonstrating its effectiveness in preserving data privacy and security while maintaining computational efficiency. The results suggest that our hybrid approach can significantly enhance the development of secure and privacy-preserved IoT-enabled healthcare applications, offering a promising direction for future research in this field.
[[2304.07533] ALiSNet: Accurate and Lightweight Human Segmentation Network for Fashion E-Commerce](http://arxiv.org/abs/2304.07533) #privacy
Accurately estimating human body shape from photos can enable innovative applications in fashion, from mass customization, to size and fit recommendations and virtual try-on. Body silhouettes calculated from user pictures are effective representations of the body shape for downstream tasks. Smartphones provide a convenient way for users to capture images of their body, and on-device image processing allows predicting body segmentation while protecting users privacy. Existing off-the-shelf methods for human segmentation are closed source and cannot be specialized for our application of body shape and measurement estimation. Therefore, we create a new segmentation model by simplifying Semantic FPN with PointRend, an existing accurate model. We finetune this model on a high-quality dataset of humans in a restricted set of poses relevant for our application. We obtain our final model, ALiSNet, with a size of 4MB and 97.6$\pm$1.0$\%$ mIoU, compared to Apple Person Segmentation, which has an accuracy of 94.4$\pm$5.7$\%$ mIoU on our dataset.
[[2304.07460] Communication and Energy Efficient Wireless Federated Learning with Intrinsic Privacy](http://arxiv.org/abs/2304.07460) #privacy
Federated Learning (FL) is a collaborative learning framework that enables edge devices to collaboratively learn a global model while keeping raw data locally. Although FL avoids leaking direct information from local datasets, sensitive information can still be inferred from the shared models. To address the privacy issue in FL, differential privacy (DP) mechanisms are leveraged to provide formal privacy guarantee. However, when deploying FL at the wireless edge with over-the-air computation, ensuring client-level DP faces significant challenges. In this paper, we propose a novel wireless FL scheme called private federated edge learning with sparsification (PFELS) to provide client-level DP guarantee with intrinsic channel noise while reducing communication and energy overhead and improving model accuracy. The key idea of PFELS is for each device to first compress its model update and then adaptively design the transmit power of the compressed model update according to the wireless channel status without any artificial noise addition. We provide a privacy analysis for PFELS and prove the convergence of PFELS under general non-convex and non-IID settings. Experimental results show that compared with prior work, PFELS can improve the accuracy with the same DP guarantee and save communication and energy costs simultaneously.
[[2304.07735] Shuffled Transformer for Privacy-Preserving Split Learning](http://arxiv.org/abs/2304.07735) #privacy
In conventional split learning, training and testing data often face severe privacy leakage threats. Existing solutions often have to trade learning accuracy for data privacy, or the other way around. We propose a lossless privacy-preserving split learning framework, on the basis of the permutation equivalence properties which are inherent to many neural network modules. We adopt Transformer as the example building block to the framework. It is proved that the Transformer encoder block is permutation equivalent, and thus training/testing could be done equivalently on permuted data. We further introduce shuffling-based privacy guarantee and enhance it by mix-up training. All properties are verified by conducted experiments, which also show strong defence against privacy attacks compared to the state-of-the-art, yet without any accuracy decline.
[[2304.07822] A Random-patch based Defense Strategy Against Physical Attacks for Face Recognition Systems](http://arxiv.org/abs/2304.07822) #defense
The physical attack has been regarded as a kind of threat against real-world computer vision systems. Still, many existing defense methods are only useful for small perturbations attacks and can't detect physical attacks effectively. In this paper, we propose a random-patch based defense strategy to robustly detect physical attacks for Face Recognition System (FRS). Different from mainstream defense methods which focus on building complex deep neural networks (DNN) to achieve high recognition rate on attacks, we introduce a patch based defense strategy to a standard DNN aiming to obtain robust detection models. Extensive experimental results on the employed datasets show the superiority of the proposed defense method on detecting white-box attacks and adaptive attacks which attack both FRS and the defense method. Additionally, due to the simpleness yet robustness of our method, it can be easily applied to the real world face recognition system and extended to other defense methods to boost the detection performance.
[[2304.07360] Combining Generators of Adversarial Malware Examples to Increase Evasion Rate](http://arxiv.org/abs/2304.07360) #defense
Antivirus developers are increasingly embracing machine learning as a key component of malware defense. While machine learning achieves cutting-edge outcomes in many fields, it also has weaknesses that are exploited by several adversarial attack techniques. Many authors have presented both white-box and black-box generators of adversarial malware examples capable of bypassing malware detectors with varying success. We propose to combine contemporary generators in order to increase their potential. Combining different generators can create more sophisticated adversarial examples that are more likely to evade anti-malware tools. We demonstrated this technique on five well-known generators and recorded promising results. The best-performing combination of AMG-random and MAB-Malware generators achieved an average evasion rate of 15.9% against top-tier antivirus products. This represents an average improvement of more than 36% and 627% over using only the AMG-random and MAB-Malware generators, respectively. The generator that benefited the most from having another generator follow its procedure was the FGSM injection attack, which improved the evasion rate on average between 91.97% and 1,304.73%, depending on the second generator used. These results demonstrate that combining different generators can significantly improve their effectiveness against leading antivirus programs.
[[2304.07296] MLOps Spanning Whole Machine Learning Life Cycle: A Survey](http://arxiv.org/abs/2304.07296) #defense
Google AlphaGos win has significantly motivated and sped up machine learning (ML) research and development, which led to tremendous ML technical advances and wider adoptions in various domains (e.g., Finance, Health, Defense, and Education). These advances have resulted in numerous new concepts and technologies, which are too many for people to catch up to and even make them confused, especially for newcomers to the ML area. This paper is aimed to present a clear picture of the state-of-the-art of the existing ML technologies with a comprehensive survey. We lay out this survey by viewing ML as a MLOps (ML Operations) process, where the key concepts and activities are collected and elaborated with representative works and surveys. We hope that this paper can serve as a quick reference manual (a survey of surveys) for newcomers (e.g., researchers, practitioners) of ML to get an overview of the MLOps process, as well as a good understanding of the key technologies used in each step of the ML process, and know where to find more details.
[[2304.07549] MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing](http://arxiv.org/abs/2304.07549) #attack
The existing multi-modal face anti-spoofing (FAS) frameworks are designed based on two strategies: halfway and late fusion. However, the former requires test modalities consistent with the training input, which seriously limits its deployment scenarios. And the latter is built on multiple branches to process different modalities independently, which limits their use in applications with low memory or fast execution requirements. In this work, we present a single branch based Transformer framework, namely Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve the performance of arbitrary modal attacks with the help of multi-modal data. Specifically, MA-ViT adopts the early fusion to aggregate all the available training modalities data and enables flexible testing of any given modal samples. Further, we develop the Modality-Agnostic Transformer Block (MATB) in MA-ViT, which consists of two stacked attentions named Modal-Disentangle Attention (MDA) and Cross-Modal Attention (CMA), to eliminate modality-related information for each modal sequences and supplement modality-agnostic liveness features from another modal sequences, respectively. Experiments demonstrate that the single model trained based on MA-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin, and approaches the multi-modal frameworks introduced with smaller FLOPs and model parameters.
[[2304.07580] Surveillance Face Presentation Attack Detection Challenge](http://arxiv.org/abs/2304.07580) #attack
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, most of the studies lacked consideration of long-distance scenarios. Specifically, compared with FAS in traditional scenes such as phone unlocking, face payment, and self-service security inspection, FAS in long-distance such as station squares, parks, and self-service supermarkets are equally important, but it has not been sufficiently explored yet. In order to fill this gap in the FAS community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask). SuHiFiMask contains $10,195$ videos from $101$ subjects of different age groups, which are collected by $7$ mainstream surveillance cameras. Based on this dataset and protocol-$3$ for evaluating the robustness of the algorithm under quality changes, we organized a face presentation attack detection challenge in surveillance scenarios. It attracted 180 teams for the development phase with a total of 37 teams qualifying for the final round. The organization team re-verified and re-ran the submitted code and used the results as the final ranking. In this paper, we present an overview of the challenge, including an introduction to the dataset used, the definition of the protocol, the evaluation metrics, and the announcement of the competition results. Finally, we present the top-ranked algorithms and the research ideas provided by the competition for attack detection in long-range surveillance scenarios.
[[2304.07395] Investigation of ensemble methods for the detection of deepfake face manipulations](http://arxiv.org/abs/2304.07395) #robust
The recent wave of AI research has enabled a new brand of synthetic media, called deepfakes. Deepfakes have impressive photorealism, which has generated exciting new use cases but also raised serious threats to our increasingly digital world. To mitigate these threats, researchers have tried to come up with new methods for deepfake detection that are more effective than traditional forensics and heavily rely on deep AI technology. In this paper, following up on encouraging prior work for deepfake detection with attribution and ensemble techniques, we explore and compare multiple designs for ensemble detectors. The goal is to achieve robustness and good generalization ability by leveraging ensembles of models that specialize in different manipulation categories. Our results corroborate that ensembles can achieve higher accuracy than individual models when properly tuned, while the generalization ability relies on access to a large number of training data for a diverse set of known manipulations.
[[2304.07461] Beta-Rank: A Robust Convolutional Filter Pruning Method For Imbalanced Medical Image Analysis](http://arxiv.org/abs/2304.07461) #robust
As deep neural networks include a high number of parameters and operations, it can be a challenge to implement these models on devices with limited computational resources. Despite the development of novel pruning methods toward resource-efficient models, it has become evident that these models are not capable of handling "imbalanced" and "limited number of data points". With input and output information, along with the values of the filters, a novel filter pruning method is proposed. Our pruning method considers the fact that all information about the importance of a filter may not be reflected in the value of the filter. Instead, it is reflected in the changes made to the data after the filter is applied to it. In this work, three methods are compared with the same training conditions except for the ranking of each method. We demonstrated that our model performed significantly better than other methods for medical datasets which are inherently imbalanced. When we removed up to 58% of FLOPs for the IDRID dataset and up to 45% for the ISIC dataset, our model was able to yield an equivalent (or even superior) result to the baseline model while other models were unable to achieve similar results. To evaluate FLOP and parameter reduction using our model in real-world settings, we built a smartphone app, where we demonstrated a reduction of up to 79% in memory usage and 72% in prediction time. All codes and parameters for training different models are available at https://github.com/mohofar/Beta-Rank
[[2304.07515] S3M: Scalable Statistical Shape Modeling through Unsupervised Correspondences](http://arxiv.org/abs/2304.07515) #robust
Statistical shape models (SSMs) are an established way to geometrically represent the anatomy of a population with various clinically relevant applications. However, they typically require domain expertise and labor-intensive manual segmentations or landmark annotations to generate. Methods to estimate correspondences for SSMs typically learn with such labels as supervision signals. We address these shortcomings by proposing an unsupervised method that leverages deep geometric features and functional correspondences to learn local and global shape structures across complex anatomies simultaneously. Our pipeline significantly improves unsupervised correspondence estimation for SSMs compared to baseline methods, even on highly irregular surface topologies. We demonstrate this for two different anatomical structures: the thyroid and a multi-chamber heart dataset. Furthermore, our method is robust enough to learn from noisy neural network predictions, enabling scaling SSMs to larger patient populations without manual annotation.
[[2304.07775] Robust Cross-Modal Knowledge Distillation for Unconstrained Videos](http://arxiv.org/abs/2304.07775) #robust
Cross-modal distillation has been widely used to transfer knowledge across different modalities, enriching the representation of the target unimodal one. Recent studies highly relate the temporal synchronization between vision and sound to the semantic consistency for cross-modal distillation. However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation. To this end, we first propose a \textit{Modality Noise Filter} (MNF) module to erase the irrelevant noise in teacher modality with cross-modal context. After this purification, we then design a \textit{Contrastive Semantic Calibration} (CSC) module to adaptively distill useful knowledge for target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method could bring a performance boost compared with other distillation methods in both visual action recognition and video retrieval task. We also extend to the audio tagging task to prove the generalization of our method. The source code is available at \href{https://github.com/GeWu-Lab/cross-modal-distillation}{https://github.com/GeWu-Lab/cross-modal-distillation}.
[[2304.07499] Robust Educational Dialogue Act Classifiers with Low-Resource and Imbalanced Datasets](http://arxiv.org/abs/2304.07499) #robust
Dialogue acts (DAs) can represent conversational actions of tutors or students that take place during tutoring dialogues. Automating the identification of DAs in tutoring dialogues is significant to the design of dialogue-based intelligent tutoring systems. Many prior studies employ machine learning models to classify DAs in tutoring dialogues and invest much effort to optimize the classification accuracy by using limited amounts of training data (i.e., low-resource data scenario). However, beyond the classification accuracy, the robustness of the classifier is also important, which can reflect the capability of the classifier on learning the patterns from different class distributions. We note that many prior studies on classifying educational DAs employ cross entropy (CE) loss to optimize DA classifiers on low-resource data with imbalanced DA distribution. The DA classifiers in these studies tend to prioritize accuracy on the majority class at the expense of the minority class which might not be robust to the data with imbalanced ratios of different DA classes. To optimize the robustness of classifiers on imbalanced class distributions, we propose to optimize the performance of the DA classifier by maximizing the area under the ROC curve (AUC) score (i.e., AUC maximization). Through extensive experiments, our study provides evidence that (i) by maximizing AUC in the training process, the DA classifier achieves significant performance improvement compared to the CE approach under low-resource data, and (ii) AUC maximization approaches can improve the robustness of the DA classifier under different class imbalance ratios.
[[2304.07699] USNID: A Framework for Unsupervised and Semi-supervised New Intent Discovery](http://arxiv.org/abs/2304.07699) #robust
New intent discovery is of great value to natural language processing, allowing for a better understanding of user needs and providing friendly services. However, most existing methods struggle to capture the complicated semantics of discrete text representations when limited or no prior knowledge of labeled data is available. To tackle this problem, we propose a novel framework called USNID for unsupervised and semi-supervised new intent discovery, which has three key technologies. First, it takes full use of unsupervised or semi-supervised data to mine shallow semantic similarity relations and provide well-initialized representations for clustering. Second, it designs a centroid-guided clustering mechanism to address the issue of cluster allocation inconsistency and provide high-quality self-supervised targets for representation learning. Third, it captures high-level semantics in unsupervised or semi-supervised data to discover fine-grained intent-wise clusters by optimizing both cluster-level and instance-level objectives. We also propose an effective method for estimating the cluster number in open-world scenarios without knowing the number of new intents beforehand. USNID performs exceptionally well on several intent benchmark datasets, achieving new state-of-the-art results in unsupervised and semi-supervised new intent discovery and demonstrating robust performance with different cluster numbers.
[[2304.07830] How does ChatGPT rate sound semantics?](http://arxiv.org/abs/2304.07830) #robust
Semantic dimensions of sound have been playing a central role in understanding the nature of auditory sensory experience as well as the broader relation between perception, language, and meaning. Accordingly, and given the recent proliferation of large language models (LLMs), here we asked whether such models exhibit an organisation of perceptual semantics similar to those observed in humans. Specifically, we prompted ChatGPT, a chatbot based on a state-of-the-art LLM, to rate musical instrument sounds on a set of 20 semantic scales. We elicited multiple responses in separate chats, analogous to having multiple human raters. ChatGPT generated semantic profiles that only partially correlated with human ratings, yet showed robust agreement along well-known psychophysical dimensions of musical sounds such as brightness (bright-dark) and pitch height (deep-high). Exploratory factor analysis suggested the same dimensionality but different spatial configuration of a latent factor space between the chatbot and human ratings. Unexpectedly, the chatbot showed degrees of internal variability that were comparable in magnitude to that of human ratings. Our work highlights the potential of LLMs to capture salient dimensions of human sensory experience.
[[2304.07304] Explaining, Analyzing, and Probing Representations of Self-Supervised Learning Models for Sensor-based Human Activity Recognition](http://arxiv.org/abs/2304.07304) #robust
In recent years, self-supervised learning (SSL) frameworks have been extensively applied to sensor-based Human Activity Recognition (HAR) in order to learn deep representations without data annotations. While SSL frameworks reach performance almost comparable to supervised models, studies on interpreting representations learnt by SSL models are limited. Nevertheless, modern explainability methods could help to unravel the differences between SSL and supervised representations: how they are being learnt, what properties of input data they preserve, and when SSL can be chosen over supervised training. In this paper, we aim to analyze deep representations of two recent SSL frameworks, namely SimCLR and VICReg. Specifically, the emphasis is made on (i) comparing the robustness of supervised and SSL models to corruptions in input data; (ii) explaining predictions of deep learning models using saliency maps and highlighting what input channels are mostly used for predicting various activities; (iii) exploring properties encoded in SSL and supervised representations using probing. Extensive experiments on two single-device datasets (MobiAct and UCI-HAR) have shown that self-supervised learning representations are significantly more robust to noise in unseen data compared to supervised models. In contrast, features learnt by the supervised approaches are more homogeneous across subjects and better encode the nature of activities.
[[2304.07391] Revenue Management without Demand Forecasting: A Data-Driven Approach for Bid Price Generation](http://arxiv.org/abs/2304.07391) #robust
Traditional revenue management relies on long and stable historical data and predictable demand patterns. However, meeting those requirements is not always possible. Many industries face demand volatility on an ongoing basis, an example would be air cargo which has much shorter booking horizon with highly variable batch arrivals. Even for passenger airlines where revenue management (RM) is well-established, reacting to external shocks is a well-known challenge that requires user monitoring and manual intervention. Moreover, traditional RM comes with strict data requirements including historical bookings and pricing even in the absence of any bookings, spanning multiple years. For companies that have not established a practice in RM, that type of extensive data is usually not available. We present a data-driven approach to RM which eliminates the need for demand forecasting and optimization techniques. We develop a methodology to generate bid prices using historical booking data only. Our approach is an ex-post greedy heuristic to estimate proxies for marginal opportunity costs as a function of remaining capacity and time-to-departure solely based on historical booking data. We utilize a neural network algorithm to project bid price estimations into the future. We conduct an extensive simulation study where we measure performance of our methodology compared to that of an optimally generated bid price using dynamic programming (DP). We also extend our simulations to measure performance of both data-driven and DP generated bid prices under the presence of demand misspecification. Our results show that our data-driven methodology stays near a theoretical optimum (<1% revenue gap) for a wide-range of settings, whereas DP deviates more significantly from the optimal as the magnitude of misspecification is increased. This highlights the robustness of our data-driven approach.
[[2304.07485] Critical Sampling for Robust Evolution Operator Learning of Unknown Dynamical Systems](http://arxiv.org/abs/2304.07485) #robust
Given an unknown dynamical system, what is the minimum number of samples needed for effective learning of its governing laws and accurate prediction of its future evolution behavior, and how to select these critical samples? In this work, we propose to explore this problem based on a design approach. Starting from a small initial set of samples, we adaptively discover critical samples to achieve increasingly accurate learning of the system evolution. One central challenge here is that we do not know the network modeling error since the ground-truth system state is unknown, which is however needed for critical sampling. To address this challenge, we introduce a multi-step reciprocal prediction network where forward and backward evolution networks are designed to learn the temporal evolution behavior in the forward and backward time directions, respectively. Very interestingly, we find that the desired network modeling error is highly correlated with the multi-step reciprocal prediction error, which can be directly computed from the current system state. This allows us to perform a dynamic selection of critical samples from regions with high network modeling errors for dynamical systems. Additionally, a joint spatial-temporal evolution network is introduced which incorporates spatial dynamics modeling into the temporal evolution prediction for robust learning of the system evolution operator with few samples. Our extensive experimental results demonstrate that our proposed method is able to dramatically reduce the number of samples needed for effective learning and accurate prediction of evolution behaviors of unknown dynamical systems by up to hundreds of times.
[[2304.07832] Characterizing the load profile in power grids by Koopman mode decomposition of interconnected dynamics](http://arxiv.org/abs/2304.07832) #robust
Electricity load forecasting is crucial for effectively managing and optimizing power grids. Over the past few decades, various statistical and deep learning approaches have been used to develop load forecasting models. This paper presents an interpretable machine learning approach that identifies load dynamics using data-driven methods within an operator-theoretic framework. We represent the load data using the Koopman operator, which is inherent to the underlying dynamics. By computing the corresponding eigenfunctions, we decompose the load dynamics into coherent spatiotemporal patterns that are the most robust features of the dynamics. Each pattern evolves independently according to its single frequency, making its predictability based on linear dynamics. We emphasize that the load dynamics are constructed based on coherent spatiotemporal patterns that are intrinsic to the dynamics and are capable of encoding rich dynamical features at multiple time scales. These features are related to complex interactions over interconnected power grids and different exogenous effects. To implement the Koopman operator approach more efficiently, we cluster the load data using a modern kernel-based clustering approach and identify power stations with similar load patterns, particularly those with synchronized dynamics. We evaluate our approach using a large-scale dataset from a renewable electric power system within the continental European electricity system and show that the Koopman-based approach outperforms a deep learning (LSTM) architecture in terms of accuracy and computational efficiency. The code for this paper has been deposited in a GitHub repository, which can be accessed at the following address github.com/Shakeri-Lab/Power-Grids.
[[2304.07598] Understanding Rug Pulls: An In-Depth Behavioral Analysis of Fraudulent NFT Creators](http://arxiv.org/abs/2304.07598) #steal
The explosive growth of non-fungible tokens (NFTs) on Web3 has created a new frontier for digital art and collectibles, but also an emerging space for fraudulent activities. This study provides an in-depth analysis of NFT rug pulls, which are fraudulent schemes aimed at stealing investors' funds. Using data from 758 rug pulls across 10 NFT marketplaces, we examine the structural and behavioral properties of these schemes, identify the characteristics and motivations of rug-pullers, and classify NFT projects into groups based on creators' association with their accounts. Our findings reveal that repeated rug pulls account for a significant proportion of the rise in NFT-related cryptocurrency crimes, with one NFT collection attempting 37 rug pulls within three months. Additionally, we identify the largest group of creators influencing the majority of rug pulls, and demonstrate the connection between rug-pullers of different NFT projects through the use of the same wallets to store and move money. Our study contributes to the understanding of NFT market risks and provides insights for designing preventative strategies to mitigate future losses.
[[2304.07314] Uncovering the Inner Workings of STEGO for Safe Unsupervised Semantic Segmentation](http://arxiv.org/abs/2304.07314) #extraction
Self-supervised pre-training strategies have recently shown impressive results for training general-purpose feature extraction backbones in computer vision. In combination with the Vision Transformer architecture, the DINO self-distillation technique has interesting emerging properties, such as unsupervised clustering in the latent space and semantic correspondences of the produced features without using explicit human-annotated labels. The STEGO method for unsupervised semantic segmentation contrastively distills feature correspondences of a DINO-pre-trained Vision Transformer and recently set a new state of the art. However, the detailed workings of STEGO have yet to be disentangled, preventing its usage in safety-critical applications. This paper provides a deeper understanding of the STEGO architecture and training strategy by conducting studies that uncover the working mechanisms behind STEGO, reproduce and extend its experimental validation, and investigate the ability of STEGO to transfer to different datasets. Results demonstrate that the STEGO architecture can be interpreted as a semantics-preserving dimensionality reduction technique.
[[2304.07486] Region-Enhanced Feature Learning for Scene Semantic Segmentation](http://arxiv.org/abs/2304.07486) #extraction
Semantic segmentation in complex scenes not only relies on local object appearance but also on object locations and the surrounding environment. Nonetheless, it is difficult to model long-range context in the format of pairwise point correlations due to its huge computational cost for large-scale point clouds.In this paper, we propose to use regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden. We introduce a novel Region-Enhanced Feature Learning network (REFL-Net) that leverages region correlations to enhance the features of ambiguous points. We design a Region-based Feature Enhancement module (RFE) which consists of a Semantic-Spatial Region Extraction (SSRE) stage and a Region Dependency Modeling (RDM) stage. In the SSRE stage, we group the input points into a set of regions according to the point distances in both semantic and spatial space.In the RDM part, we explore region-wise semantic and spatial relationships via a self-attention block on region features and fuse point features with the region features to obtain more discriminative representations. Our proposed RFE module is a plug-and-play module that can be integrated with common semantic segmentation backbones. We conduct extensive experiments on ScanNetv2 and S3DIS datasets, and evaluate our RFE module with different segmentation backbones. Our REFL-Net achieves 1.8% mIoU gain on ScanNetv2 and 1.0% mIoU gain on S3DIS respectively with negligible computational cost compared to the backbone networks. Both quantitative and qualitative results show the powerful long-range context modeling ability and strong generalization ability of our REFL-Net.
[[2304.07584] FSDNet-An efficient fire detection network for complex scenarios based on YOLOv3 and DenseNet](http://arxiv.org/abs/2304.07584) #extraction
Fire is one of the common disasters in daily life. To achieve fast and accurate detection of fires, this paper proposes a detection network called FSDNet (Fire Smoke Detection Network), which consists of a feature extraction module, a fire classification module, and a fire detection module. Firstly, a dense connection structure is introduced in the basic feature extraction module to enhance the feature extraction ability of the backbone network and alleviate the gradient disappearance problem. Secondly, a spatial pyramid pooling structure is introduced in the fire detection module, and the Mosaic data augmentation method and CIoU loss function are used in the training process to comprehensively improve the flame feature extraction ability. Finally, in view of the shortcomings of public fire datasets, a fire dataset called MS-FS (Multi-scene Fire And Smoke) containing 11938 fire images was created through data collection, screening, and object annotation. To prove the effectiveness of the proposed method, the accuracy of the method was evaluated on two benchmark fire datasets and MS-FS. The experimental results show that the accuracy of FSDNet on the two benchmark datasets is 99.82% and 91.15%, respectively, and the average precision on MS-FS is 86.80%, which is better than the mainstream fire detection methods.
[[2304.07803] EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation](http://arxiv.org/abs/2304.07803) #extraction
Estimating the depths of equirectangular (360) images (EIs) is challenging given the distorted 180 x 360 field-of-view, which is hard to be addressed via convolutional neural network (CNN). Although a transformer with global attention achieves significant improvements over CNN for EI depth estimation task, it is computationally inefficient, which raises the need for transformer with local attention. However, to apply local attention successfully for EIs, a specific strategy, which addresses distorted equirectangular geometry and limited receptive field simultaneously, is required. Prior works have only cared either of them, resulting in unsatisfactory depths occasionally. In this paper, we propose an equirectangular geometry-biased transformer termed EGformer, which enables local attention extraction in a global manner considering the equirectangular geometry. To achieve this, we actively utilize the equirectangular geometry as the bias for the local attention instead of struggling to reduce the distortion of EIs. As compared to the most recent transformer based EI depth estimation studies, the proposed approach yields the best depth outcomes overall with the lowest computational cost and the fewest parameters, demonstrating the effectiveness of the proposed methods.
[[2304.07396] Improving Patient Pre-screening for Clinical Trials: Assisting Physicians with Large Language Models](http://arxiv.org/abs/2304.07396) #extraction
Physicians considering clinical trials for their patients are met with the laborious process of checking many text based eligibility criteria. Large Language Models (LLMs) have shown to perform well for clinical information extraction and clinical reasoning, including medical tests, but not yet in real-world scenarios. This paper investigates the use of InstructGPT to assist physicians in determining eligibility for clinical trials based on a patient's summarised medical profile. Using a prompting strategy combining one-shot, selection-inference and chain-of-thought techniques, we investigate the performance of LLMs on 10 synthetically created patient profiles. Performance is evaluated at four levels: ability to identify screenable eligibility criteria from a trial given a medical profile; ability to classify for each individual criterion whether the patient qualifies; the overall classification whether a patient is eligible for a clinical trial and the percentage of criteria to be screened by physician. We evaluated against 146 clinical trials and a total of 4,135 eligibility criteria. The LLM was able to correctly identify the screenability of 72% (2,994/4,135) of the criteria. Additionally, 72% (341/471) of the screenable criteria were evaluated correctly. The resulting trial level classification as eligible or ineligible resulted in a recall of 0.5. By leveraging LLMs with a physician-in-the-loop, a recall of 1.0 and precision of 0.71 on clinical trial level can be achieved while reducing the amount of criteria to be checked by an estimated 90%. LLMs can be used to assist physicians with pre-screening of patients for clinical trials. By forcing instruction-tuned LLMs to produce chain-of-thought responses, the reasoning can be made transparent to and the decision process becomes amenable by physicians, thereby making such a system feasible for use in real-world scenarios.
[[2304.07625] Neural Approaches to Entity-Centric Information Extraction](http://arxiv.org/abs/2304.07625) #extraction
Artificial Intelligence (AI) has huge impact on our daily lives with applications such as voice assistants, facial recognition, chatbots, autonomously driving cars, etc. Natural Language Processing (NLP) is a cross-discipline of AI and Linguistics, dedicated to study the understanding of the text. This is a very challenging area due to unstructured nature of the language, with many ambiguous and corner cases. In this thesis we address a very specific area of NLP that involves the understanding of entities (e.g., names of people, organizations, locations) in text. First, we introduce a radically different, entity-centric view of the information in text. We argue that instead of using individual mentions in text to understand their meaning, we should build applications that would work in terms of entity concepts. Next, we present a more detailed model on how the entity-centric approach can be used for the entity linking task. In our work, we show that this task can be improved by considering performing entity linking at the coreference cluster level rather than each of the mentions individually. In our next work, we further study how information from Knowledge Base entities can be integrated into text. Finally, we analyze the evolution of the entities from the evolving temporal perspective.
[[2304.07774] Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification](http://arxiv.org/abs/2304.07774) #extraction
Text simplification is one of the domains in Natural Language Processing (NLP) that offers an opportunity to understand the text in a simplified manner for exploration. However, it is always hard to understand and retrieve knowledge from unstructured text, which is usually in the form of compound and complex sentences. There are state-of-the-art neural network-based methods to simplify the sentences for improved readability while replacing words with plain English substitutes and summarising the sentences and paragraphs. In the Knowledge Graph (KG) creation process from unstructured text, summarising long sentences and substituting words is undesirable since this may lead to information loss. However, KG creation from text requires the extraction of all possible facts (triples) with the same mentions as in the text. In this work, we propose a controlled simplification based on the factual information in a sentence, i.e., triple. We present a classical syntactic dependency-based approach to split and rephrase a compound and complex sentence into a set of simplified sentences. This simplification process will retain the original wording with a simple structure of possible domain facts in each sentence, i.e., triples. The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity (SC), followed by reduction through a controlled syntactic simplification process. Last, an experiment for a dataset re-annotation is also conducted through GPT3; we aim to publish this refined corpus as a resource. This work is accepted and presented in International workshop on Learning with Knowledge Graphs (IWLKG) at WSDM-2023 Conference. The code and data is available at www.github.com/sallmanm/SynSim.
[[2304.07421] Peer-to-Peer Federated Continual Learning for Naturalistic Driving Action Recognition](http://arxiv.org/abs/2304.07421) #federate
Naturalistic driving action recognition (NDAR) has proven to be an effective method for detecting driver distraction and reducing the risk of traffic accidents. However, the intrusive design of in-cabin cameras raises concerns about driver privacy. To address this issue, we propose a novel peer-to-peer (P2P) federated learning (FL) framework with continual learning, namely FedPC, which ensures privacy and enhances learning efficiency while reducing communication, computational, and storage overheads. Our framework focuses on addressing the clients' objectives within a serverless FL framework, with the goal of delivering personalized and accurate NDAR models. We demonstrate and evaluate the performance of FedPC on two real-world NDAR datasets, including the State Farm Distracted Driver Detection and Track 3 NDAR dataset in the 2023 AICity Challenge. The results of our experiments highlight the strong competitiveness of FedPC compared to the conventional client-to-server (C2S) FLs in terms of performance, knowledge dissemination rate, and compatibility with new clients.
[[2304.07310] Federated and distributed learning applications for electronic health records and structured medical data: A scoping review](http://arxiv.org/abs/2304.07310) #federate
Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations and discusses potential innovations. We searched five databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from three primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1160 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.
[[2304.07488] SalientGrads: Sparse Models for Communication Efficient and Data Aware Distributed Federated Training](http://arxiv.org/abs/2304.07488) #federate
Federated learning (FL) enables the training of a model leveraging decentralized data in client sites while preserving privacy by not collecting data. However, one of the significant challenges of FL is limited computation and low communication bandwidth in resource limited edge client nodes. To address this, several solutions have been proposed in recent times including transmitting sparse models and learning dynamic masks iteratively, among others. However, many of these methods rely on transmitting the model weights throughout the entire training process as they are based on ad-hoc or random pruning criteria. In this work, we propose Salient Grads, which simplifies the process of sparse training by choosing a data aware subnetwork before training, based on the model-parameter's saliency scores, which is calculated from the local client data. Moreover only highly sparse gradients are transmitted between the server and client models during the training process unlike most methods that rely on sharing the entire dense model in each round. We also demonstrate the efficacy of our method in a real world federated learning application and report improvement in wall-clock communication time.
[[2304.07514] PI-FL: Personalized and Incentivized Federated Learning](http://arxiv.org/abs/2304.07514) #federate
Personalized FL has been widely used to cater to heterogeneity challenges with non-IID data. A primary obstacle is considering the personalization process from the client's perspective to preserve their autonomy. Allowing the clients to participate in personalized FL decisions becomes significant due to privacy and security concerns, where the clients may not be at liberty to share private information necessary for producing good quality personalized models. Moreover, clients with high-quality data and resources are reluctant to participate in the FL process without reasonable incentive. In this paper, we propose PI-FL, a one-shot personalization solution complemented by a token-based incentive mechanism that rewards personalized training. PI-FL outperforms other state-of-the-art approaches and can generate good-quality personalized models while respecting clients' privacy.
[[2304.07537] Gradient-less Federated Gradient Boosting Trees with Learnable Learning Rates](http://arxiv.org/abs/2304.07537) #federate
The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the needs to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induce per-node level communication frequency and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost which does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both communication rounds and communication overhead by factors ranging from 25x to 700x.
[[2304.07408] Fairness in Visual Clustering: A Novel Transformer Clustering Approach](http://arxiv.org/abs/2304.07408) #fair
Promoting fairness for deep clustering models in unsupervised clustering settings to reduce demographic bias is a challenging goal. This is because of the limitation of large-scale balanced data with well-annotated labels for sensitive or protected attributes. In this paper, we first evaluate demographic bias in deep clustering models from the perspective of cluster purity, which is measured by the ratio of positive samples within a cluster to their correlation degree. This measurement is adopted as an indication of demographic bias. Then, a novel loss function is introduced to encourage a purity consistency for all clusters to maintain the fairness aspect of the learned clustering model. Moreover, we present a novel attention mechanism, Cross-attention, to measure correlations between multiple clusters, strengthening faraway positive samples and improving the purity of clusters during the learning process. Experimental results on a large-scale dataset with numerous attribute settings have demonstrated the effectiveness of the proposed approach on both clustering accuracy and fairness enhancement on several sensitive attributes.
[[2304.07382] Zero-Shot Multi-Label Topic Inference with Sentence Encoders](http://arxiv.org/abs/2304.07382) #fair
Sentence encoders have indeed been shown to achieve superior performances for many downstream text-mining tasks and, thus, claimed to be fairly general. Inspired by this, we performed a detailed study on how to leverage these sentence encoders for the "zero-shot topic inference" task, where the topics are defined/provided by the users in real-time. Extensive experiments on seven different datasets demonstrate that Sentence-BERT demonstrates superior generality compared to other encoders, while Universal Sentence Encoder can be preferred when efficiency is a top priority.
[[2304.07437] Medical Question Summarization with Entity-driven Contrastive Learning](http://arxiv.org/abs/2304.07437) #fair
By summarizing longer consumer health questions into shorter and essential ones, medical question answering (MQA) systems can more accurately understand consumer intentions and retrieve suitable answers. However, medical question summarization is very challenging due to obvious distinctions in health trouble descriptions from patients and doctors. Although existing works have attempted to utilize Seq2Seq, reinforcement learning, or contrastive learning to solve the problem, two challenges remain: how to correctly capture question focus to model its semantic intention, and how to obtain reliable datasets to fairly evaluate performance. To address these challenges, this paper proposes a novel medical question summarization framework using entity-driven contrastive learning (ECL). ECL employs medical entities in frequently asked questions (FAQs) as focuses and devises an effective mechanism to generate hard negative samples. This approach forces models to pay attention to the crucial focus information and generate more ideal question summarization. Additionally, we find that some MQA datasets suffer from serious data leakage problems, such as the iCliniq dataset's 33% duplicate rate. To evaluate the related methods fairly, this paper carefully checks leaked samples to reorganize more reasonable datasets. Extensive experiments demonstrate that our ECL method outperforms state-of-the-art methods by accurately capturing question focus and generating medical question summaries. The code and datasets are available at https://github.com/yrbobo/MQS-ECL.
[[2304.07788] Assisting clinical practice with fuzzy probabilistic decision trees](http://arxiv.org/abs/2304.07788) #interpretability
The need for fully human-understandable models is increasingly being recognised as a central theme in AI research. The acceptance of AI models to assist in decision making in sensitive domains will grow when these models are interpretable, and this trend towards interpretable models will be amplified by upcoming regulations. One of the killer applications of interpretable AI is medical practice, which can benefit from accurate decision support methodologies that inherently generate trust. In this work, we propose FPT, (MedFP), a novel method that combines probabilistic trees and fuzzy logic to assist clinical practice. This approach is fully interpretable as it allows clinicians to generate, control and verify the entire diagnosis procedure; one of the methodology's strength is the capability to decrease the frequency of misdiagnoses by providing an estimate of uncertainties and counterfactuals. Our approach is applied as a proof-of-concept to two real medical scenarios: classifying malignant thyroid nodules and predicting the risk of progression in chronic kidney disease patients. Our results show that probabilistic fuzzy decision trees can provide interpretable support to clinicians, furthermore, introducing fuzzy variables into the probabilistic model brings significant nuances that are lost when using the crisp thresholds set by traditional probabilistic decision trees. We show that FPT and its predictions can assist clinical practice in an intuitive manner, with the use of a user-friendly interface specifically designed for this purpose. Moreover, we discuss the interpretability of the FPT model.
[[2304.07609] ODSmoothGrad: Generating Saliency Maps for Object Detectors](http://arxiv.org/abs/2304.07609) #explainability
Techniques for generating saliency maps continue to be used for explainability of deep learning models, with efforts primarily applied to the image classification task. Such techniques, however, can also be applied to object detectors, not only with the classification scores, but also for the bounding box parameters, which are regressed values for which the relevant pixels contributing to these parameters can be identified. In this paper, we present ODSmoothGrad, a tool for generating saliency maps for the classification and the bounding box parameters in object detectors. Given the noisiness of saliency maps, we also apply the SmoothGrad algorithm to visually enhance the pixels of interest. We demonstrate these capabilities on one-stage and two-stage object detectors, with comparisons using classifier-based techniques.
[[2304.07670] Explanations of Black-Box Models based on Directional Feature Interactions](http://arxiv.org/abs/2304.07670) #explainability
As machine learning algorithms are deployed ubiquitously to a variety of domains, it is imperative to make these often black-box models transparent. Several recent works explain black-box models by capturing the most influential features for prediction per instance; such explanation methods are univariate, as they characterize importance per feature. We extend univariate explanation to a higher-order; this enhances explainability, as bivariate methods can capture feature interactions in black-box models, represented as a directed graph. Analyzing this graph enables us to discover groups of features that are equally important (i.e., interchangeable), while the notion of directionality allows us to identify the most influential features. We apply our bivariate method on Shapley value explanations, and experimentally demonstrate the ability of directional explanations to discover feature interactions. We show the superiority of our method against state-of-the-art on CIFAR10, IMDB, Census, Divorce, Drug, and gene data.
[[2304.07361] PTW: Pivotal Tuning Watermarking for Pre-Trained Image Generators](http://arxiv.org/abs/2304.07361) #watermark
Deepfakes refer to content synthesized using deep generators, which, when \emph{misused}, have the potential to erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators only few entities can train and provide. The threat are malicious users that exploit access to the provided model and generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that is later extractable from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability and our study reveals that watermarking is not robust against an adaptive white-box attacker who has control over the generator's parameters. We propose an adaptive attack that can successfully remove any watermarking with access to only $200$ non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available.
[[2304.07410] Text-Conditional Contextualized Avatars For Zero-Shot Personalization](http://arxiv.org/abs/2304.07410) #diffusion
Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language. However, the personalization aspect of these generative models is still challenging and under-explored. In this work, we propose a pipeline that enables personalization of image generation with avatars capturing a user's identity in a delightful way. Our pipeline is zero-shot, avatar texture and style agnostic, and does not require training on the avatar at all - it is scalable to millions of users who can generate a scene with their avatar. To render the avatar in a pose faithful to the given text prompt, we propose a novel text-to-3D pose diffusion model trained on a curated large-scale dataset of in-the-wild human poses improving the performance of the SOTA text-to-motion models significantly. We show, for the first time, how to leverage large-scale image datasets to learn human 3D pose parameters and overcome the limitations of motion capture datasets.
[[2304.07429] Identity Encoder for Personalized Diffusion](http://arxiv.org/abs/2304.07429) #diffusion
Many applications can benefit from personalized image generation models, including image enhancement, video conferences, just to name a few. Existing works achieved personalization by fine-tuning one model for each person. While being successful, this approach incurs additional computation and storage overhead for each new identity. Furthermore, it usually expects tens or hundreds of examples per identity to achieve the best performance. To overcome these challenges, we propose an encoder-based approach for personalization. We learn an identity encoder which can extract an identity representation from a set of reference images of a subject, together with a diffusion generator that can generate new images of the subject conditioned on the identity representation. Once being trained, the model can be used to generate images of arbitrary identities given a few examples even if the model hasn't been trained on the identity. Our approach greatly reduces the overhead for personalized image generation and is more applicable in many potential applications. Empirical results show that our approach consistently outperforms existing fine-tuning based approach in both image generation and reconstruction, and the outputs is preferred by users more than 95% of the time compared with the best performing baseline.
[[2304.07302] HGWaveNet: A Hyperbolic Graph Neural Network for Temporal Link Prediction](http://arxiv.org/abs/2304.07302) #diffusion
Temporal link prediction, aiming to predict future edges between paired nodes in a dynamic graph, is of vital importance in diverse applications. However, existing methods are mainly built upon uniform Euclidean space, which has been found to be conflict with the power-law distributions of real-world graphs and unable to represent the hierarchical connections between nodes effectively. With respect to the special data characteristic, hyperbolic geometry offers an ideal alternative due to its exponential expansion property. In this paper, we propose HGWaveNet, a novel hyperbolic graph neural network that fully exploits the fitness between hyperbolic spaces and data distributions for temporal link prediction. Specifically, we design two key modules to learn the spatial topological structures and temporal evolutionary information separately. On the one hand, a hyperbolic diffusion graph convolution (HDGC) module effectively aggregates information from a wider range of neighbors. On the other hand, the internal order of causal correlation between historical states is captured by hyperbolic dilated causal convolution (HDCC) modules. The whole model is built upon the hyperbolic spaces to preserve the hierarchical structural information in the entire data flow. To prove the superiority of HGWaveNet, extensive experiments are conducted on six real-world graph datasets and the results show a relative improvement by up to 6.67% on AUC for temporal link prediction over SOTA methods.
[[2304.07358] Exact Subspace Diffusion for Decentralized Multitask Learning](http://arxiv.org/abs/2304.07358) #diffusion
Classical paradigms for distributed learning, such as federated or decentralized gradient descent, employ consensus mechanisms to enforce homogeneity among agents. While these strategies have proven effective in i.i.d. scenarios, they can result in significant performance degradation when agents follow heterogeneous objectives or data. Distributed strategies for multitask learning, on the other hand, induce relationships between agents in a more nuanced manner, and encourage collaboration without enforcing consensus. We develop a generalization of the exact diffusion algorithm for subspace constrained multitask learning over networks, and derive an accurate expression for its mean-squared deviation when utilizing noisy gradient approximations. We verify numerically the accuracy of the predicted performance expressions, as well as the improved performance of the proposed approach over alternatives based on approximate projections.