[[2211.12584] Big Earth Data and Machine Learning for Sustainable and Resilient Agriculture](http://arxiv.org/abs/2211.12584)
Big streams of Earth images from satellites or other platforms (e.g., drones and mobile phones) are becoming increasingly available at low or no cost and with enhanced spatial and temporal resolution. This thesis recognizes the unprecedented opportunities offered by the high quality and open access Earth observation data of our times and introduces novel machine learning and big data methods to properly exploit them towards developing applications for sustainable and resilient agriculture. The thesis addresses three distinct thematic areas, i.e., the monitoring of the Common Agricultural Policy (CAP), the monitoring of food security and applications for smart and resilient agriculture. The methodological innovations of the developments related to the three thematic areas address the following issues: i) the processing of big Earth Observation (EO) data, ii) the scarcity of annotated data for machine learning model training and iii) the gap between machine learning outputs and actionable advice.
This thesis demonstrates how big data technologies such as data cubes, distributed learning, linked open data and semantic enrichment can be used to exploit the data deluge and extract knowledge to address real user needs. Furthermore, this thesis argues for the importance of semi-supervised and unsupervised machine learning models that circumvent the ever-present challenge of scarce annotations and thus allow for model generalization in space and time. Specifically, it is shown that only a few ground truth samples are needed to generate high-quality crop type maps and crop phenology estimates. Finally, this thesis argues that there is a considerable distance in value between model inferences and decision making in real-world scenarios and thereby showcases the power of causal and interpretable machine learning in bridging this gap.
[[2211.12864] Privacy-Enhancing Optical Embeddings for Lensless Classification](http://arxiv.org/abs/2211.12864)
Lensless imaging can provide visual privacy due to the highly multiplexed characteristic of its measurements. However, this alone is a weak form of security, as various adversarial attacks can be designed to invert the one-to-many scene mapping of such cameras. In this work, we enhance the privacy provided by lensless imaging by (1) downsampling at the sensor and (2) using a programmable mask with variable patterns as our optical encoder. We build a prototype from a low-cost LCD and Raspberry Pi components, for a total cost of around 100 USD. This very low price point allows our system to be deployed and leveraged in a broad range of applications. In our experiments, we first demonstrate the viability and reconfigurability of our system by applying it to various classification tasks: MNIST, CelebA (face attributes), and CIFAR10. By jointly optimizing the mask pattern and a digital classifier in an end-to-end fashion, low-dimensional, privacy-enhancing embeddings are learned directly at the sensor. Secondly, we show how the proposed system, through variable mask patterns, can thwart adversaries that attempt to invert the system (1) via plaintext attacks or (2) in the event of camera parameters leaks. We demonstrate the defense of our system to both risks, with 55% and 26% drops in image quality metrics for attacks based on model-based convex optimization and generative neural networks respectively. We open-source a wave propagation and camera simulator needed for end-to-end optimization, the training software, and a library for interfacing with the camera.
[[2211.13218] CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning](http://arxiv.org/abs/2211.13218)
Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data-rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new task accuracy, and inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 5.4% in average accuracy. We also outperform the state of art by as much as 6.6% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings.
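
The idea of assembling a prompt from learned components with input-conditioned weights can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of plain cosine similarity between a query and one key per component, and the list-based vectors are all assumptions made for illustration (the abstract only states that the scheme is attention-based and trained end-to-end).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assemble_prompt(query, keys, components):
    """Weight each prompt component by the similarity of its key to the query.

    query:      feature vector of the input (hypothetical, e.g. from a frozen encoder)
    keys:       one key vector per component (assumed learnable)
    components: one prompt vector per component (assumed learnable)
    Returns (weights, prompt), where prompt is the weighted sum of the components.
    """
    weights = [cosine(query, key) for key in keys]
    dim = len(components[0])
    prompt = [
        sum(w * comp[i] for w, comp in zip(weights, components))
        for i in range(dim)
    ]
    return weights, prompt


# Toy usage: the query matches the first key, so only the first component contributes.
weights, prompt = assemble_prompt(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    components=[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
)
```

Because the weights depend on the input, two different inputs generally receive two different prompts even though the pool of components is shared.
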
[[2211.12752] Agent-Specific Deontic Modality Detection in Legal Language](http://arxiv.org/abs/2211.12752)
Legal documents are typically long and written in legalese, which makes it particularly difficult for laypeople to understand their rights and duties. While natural language understanding technologies can be valuable in supporting such understanding in the legal domain, the limited availability of datasets annotated for deontic modalities in the legal domain, due to the cost of hiring experts and privacy issues, is a bottleneck. To this end, we introduce LEXDEMOD, a corpus of English contracts annotated with deontic modality expressed with respect to a contracting party or agent along with the modal triggers. We benchmark this dataset on two tasks: (i) agent-specific multi-label deontic modality classification, and (ii) agent-specific deontic modality and trigger span detection using Transformer-based (Vaswani et al., 2017) language models. Transfer learning experiments show that the linguistic diversity of modal expressions in LEXDEMOD generalizes reasonably from lease to employment and rental agreements. A small case study indicates that a model trained on LEXDEMOD can detect red flags with high recall. We believe our work offers a new research direction for deontic modality detection in the legal domain.
[[2211.12899] Emerging Biometric Modalities and their Use: Loopholes in the Terminology of the GDPR and Resulting Privacy Risks](http://arxiv.org/abs/2211.12899)
Technological advancements allow biometric applications to be more omnipresent than at any time before. This paper argues that in the current EU data protection regulation, classification applications using biometric data receive less protection compared to biometric recognition. We analyse preconditions in the regulatory language and explore how this has the potential to be the source of unique privacy risks for processing operations classifying individuals based on soft traits like emotions. This can have a high impact on personal freedoms and human rights and should therefore be subject to a data protection impact assessment.
[[2211.13041] A new Privacy Preserving and Scalable Revocation Method for Self Sovereign Identity -- The Perfect Revocation Method does not exist yet](http://arxiv.org/abs/2211.13041)
Digital Identities are playing an essential role in our digital lives. The Digital Identities in use today are based on central architectures. Central Digital Identity providers control and know our data and, thereby, our Identity. Self Sovereign Identities (SSI) are based on a decentralized data storage and data exchange architecture, where the user is in sole control of his data and identity. Most of the issued credentials need the possibility of revocation. For a Central Digital Identity, revocation is easy. In decentral architectures, revocation is more challenging. Revocation can be done with different methods, e.g., lists, compressed lists and cryptographic accumulators. A revocation method must be privacy preserving and must scale. This paper gives an overview of the available revocation methods, includes a survey to define requirements, assesses different revocation groups against the requirements, highlights shortcomings of the methods, and introduces a new revocation method called Linked Validity Verifiable Credentials.
[[2211.13097] DeepVulSeeker: A Novel Vulnerability Identification Framework via Code Graph Structure and Pre-training Mechanism](http://arxiv.org/abs/2211.13097)
Software vulnerabilities can pose severe harms to a computing system. They can lead to system crash, privacy leakage, or even physical damage. Correctly identifying vulnerabilities among enormous software codes in a timely manner is so far the essential prerequisite to patch them. Unfortunately, the current vulnerability identification methods, either the classic ones or the deep-learning-based ones, have several critical drawbacks, making them unable to meet the present-day demands put forward by the software industry. To overcome the drawbacks, in this paper, we propose DeepVulSeeker, a novel fully automated vulnerability identification framework, which leverages both code graph structures and the semantic features with the help of the recently advanced Graph Representation Self-Attention and pre-training mechanisms. Our experiments show that DeepVulSeeker not only reaches an accuracy as high as 0.99 on traditional CWE datasets, but also outperforms all other existing methods on two highly-complicated datasets. We also tested DeepVulSeeker on three case studies, and found that DeepVulSeeker is able to understand the implications of the vulnerabilities. We have fully implemented DeepVulSeeker and open-sourced it for future follow-up research.
[[2211.13195] Privacy-Preserving Application-to-Application Authentication Using Dynamic Runtime Behaviors](http://arxiv.org/abs/2211.13195)
Application authentication is typically performed using some form of secret credentials such as cryptographic keys, passwords, or API keys. Since clients are responsible for securely storing and managing the keys, this approach is vulnerable to attacks on clients. Similarly, a centrally managed key store is also susceptible to various attacks and, if compromised, can leak credentials. To resolve such issues, we propose an application authentication scheme that relies on an application's unique and distinguishable behavior to lock the key during a setup phase and unlock it for authentication. Our system adds a fuzzy-extractor layer on top of current credential authentication systems. During a key enrollment process, the application's behavioral data collected from various sensors in the network are used to hide the credential key. The fuzzy extractor releases the key to the server if the application's behavior during the authentication matches the one collected during the enrollment, with some noise tolerance. We designed the system, analyzed its security, and implemented and evaluated it using 10 real-life applications deployed in our network. Our security analysis shows that the system is secure against client compromise, vault compromise, and feature observation. The evaluation shows the scheme can achieve a 0 percent False Accept Rate with an average False Rejection Rate of 14 percent and takes about 51 ms to successfully authenticate a client. In light of these promising results, we expect our system to be of practical use, since its deployment requires zero to minimal changes on the server.
[[2211.12857] Explaining Image Classifiers with Multiscale Directional Image Representation](http://arxiv.org/abs/2211.12857)
Image classifiers are known to be difficult to interpret and therefore require explanation methods to understand their decisions. We present ShearletX, a novel mask explanation method for image classifiers based on the shearlet transform -- a multiscale directional image representation. Current mask explanation methods are regularized by smoothness constraints that protect against undesirable fine-grained explanation artifacts. However, the smoothness of a mask limits its ability to separate fine-detail patterns that are relevant for the classifier from nearby nuisance patterns that do not affect the classifier. ShearletX solves this problem by avoiding smoothness regularization altogether, replacing it by shearlet sparsity constraints. The resulting explanations consist of a few edges, textures, and smooth parts of the original image that are the most relevant for the decision of the classifier. To support our method, we propose a mathematical definition for explanation artifacts and an information theoretic score to evaluate the quality of mask explanations. We demonstrate the superiority of ShearletX over previous mask based explanation methods using these new metrics, and present exemplary situations where separating fine-detail patterns allows explaining phenomena that were not explainable before.
[[2211.12686] Batching of Tasks by Users of Pseudonymous Forums: Anonymity Compromise and Protection](http://arxiv.org/abs/2211.12686)
There are a number of forums where people participate under pseudonyms. One example is peer review, where the identity of reviewers for any paper is confidential. When participating in these forums, people frequently engage in "batching": executing multiple related tasks (e.g., commenting on multiple papers) at nearly the same time. Our empirical analysis shows that batching is common in two applications we consider: peer review and Wikipedia edits. In this paper, we identify and address the risk of deanonymization arising from linking batched tasks. To protect against linkage attacks, we take the approach of adding delay to the posting time of batched tasks. We first show that under some natural assumptions, no delay mechanism can provide a meaningful differential privacy guarantee. We therefore propose a "one-sided" formulation of differential privacy for protecting against linkage attacks. We design a mechanism that adds zero-inflated uniform delay to events and show it can preserve privacy. We prove that this noise distribution is in fact optimal in minimizing expected delay among mechanisms adding independent noise to each event, thereby establishing the Pareto frontier of the trade-off between the expected delay for batched and unbatched events. Finally, we conduct a series of experiments on Wikipedia and Bitcoin data that corroborate the practical utility of our algorithm in obfuscating batching without introducing onerous delay to a system.
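
A zero-inflated uniform delay can be read as a mixture of a point mass at zero and a uniform distribution. The sketch below samples from such a mixture; the parameter names `p_zero` and `max_delay`, and the choice of `[0, max_delay]` as the uniform support, are assumptions for illustration, since the abstract does not state how the paper parameterizes or calibrates the mechanism.

```python
import random


def zero_inflated_uniform_delay(p_zero, max_delay, rng=random):
    """Sample a posting delay: 0 with probability p_zero, else Uniform(0, max_delay).

    p_zero and max_delay are hypothetical knobs; how to choose them for a given
    privacy target is not covered by this sketch.
    """
    if not 0.0 <= p_zero <= 1.0:
        raise ValueError("p_zero must lie in [0, 1]")
    if max_delay < 0.0:
        raise ValueError("max_delay must be non-negative")
    if rng.random() < p_zero:
        return 0.0
    return rng.uniform(0.0, max_delay)


def expected_delay(p_zero, max_delay):
    """Mean of the mixture: the zero branch contributes 0, the uniform branch max_delay / 2."""
    return (1.0 - p_zero) * max_delay / 2.0


# Toy usage: delay each event in a batch independently before posting.
rng = random.Random(0)
event_times = [100.0, 100.5, 101.0]
posted_times = [t + zero_inflated_uniform_delay(0.5, 10.0, rng) for t in event_times]
```

Each event is delayed independently, which matches the class of mechanisms ("independent noise to each event") that the abstract refers to.
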
[[2211.13171] Query Efficient Cross-Dataset Transferable Black-Box Attack on Action Recognition](http://arxiv.org/abs/2211.13171)
Black-box adversarial attacks present a realistic threat to action recognition systems. Existing black-box attacks follow either a query-based approach where an attack is optimized by querying the target model, or a transfer-based approach where attacks are generated using a substitute model. While these methods can achieve decent fooling rates, the former tends to be highly query-inefficient while the latter assumes extensive knowledge of the black-box model's training data. In this paper, we propose a new attack on action recognition that addresses these shortcomings by generating perturbations to disrupt the features learned by a pre-trained substitute model to reduce the number of queries. By using a nearly disjoint dataset to train the substitute model, our method removes the requirement that the substitute model be trained using the same dataset as the target model, and leverages queries to the target model to retain the fooling rate benefits provided by query-based methods. This ultimately results in attacks which are more transferable than conventional black-box attacks. Through extensive experiments, we demonstrate highly query-efficient black-box attacks with the proposed framework. Our method achieves 8% and 12% higher deception rates compared to state-of-the-art query-based and transfer-based attacks, respectively.
[[2211.12713] Reliable Robustness Evaluation via Automatically Constructed Attack Ensembles](http://arxiv.org/abs/2211.12713)
Attack Ensemble (AE), which combines multiple attacks together, provides a reliable way to evaluate adversarial robustness. In practice, AEs are often constructed and tuned by human experts, which however tends to be sub-optimal and time-consuming. In this work, we present AutoAE, a conceptually simple approach for automatically constructing AEs. In brief, AutoAE repeatedly adds the attack and its iteration steps to the ensemble that maximizes ensemble improvement per additional iteration consumed. We show theoretically that AutoAE yields AEs provably within a constant factor of the optimal for a given defense. We then use AutoAE to construct two AEs for $l_{\infty}$ and $l_2$ attacks, and apply them without any tuning or adaptation to 45 top adversarial defenses on the RobustBench leaderboard. In all but one case we achieve equal or better (often the latter) robustness evaluation than existing AEs, and notably, in 29 cases we achieve better robustness evaluation than the best known one. Such performance of AutoAE shows itself as a reliable evaluation protocol for adversarial robustness, which further indicates the huge potential of automatic AE construction. Code is available at https://github.com/LeegerPENG/AutoAE.
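
The greedy rule described above (add the attack and iteration count with the best improvement per additional iteration) can be sketched on toy data. This is not the AutoAE code from the linked repository: the candidate table, the attack names, the use of "number of newly fooled examples" as the improvement measure, and the explicit iteration budget are all assumptions for illustration.

```python
def greedy_ensemble(candidates, budget):
    """Greedily build an attack ensemble under an iteration budget.

    candidates: dict mapping (attack_name, iterations) -> set of example ids
                that this attack fools (hypothetical, precomputed toy data).
    budget:     total number of iterations the ensemble may consume.
    Returns (chosen, covered, spent).
    """
    chosen, covered, spent = [], set(), 0
    while True:
        best, best_ratio = None, 0.0
        for (name, iters), fooled in candidates.items():
            if (name, iters) in chosen or spent + iters > budget:
                continue
            # Improvement per additional iteration consumed.
            ratio = len(fooled - covered) / iters
            if ratio > best_ratio:
                best, best_ratio = (name, iters), ratio
        if best is None:  # nothing affordable improves the ensemble any further
            break
        chosen.append(best)
        covered |= candidates[best]
        spent += best[1]
    return chosen, covered, spent


# Toy usage with made-up attacks and example ids.
toy_candidates = {
    ("pgd", 10): {1, 2, 3, 4},
    ("pgd", 50): {1, 2, 3, 4, 5, 6},
    ("cw", 20): {5, 6, 7},
    ("fab", 40): {7, 8},
}
chosen, covered, spent = greedy_ensemble(toy_candidates, budget=80)
# chosen == [("pgd", 10), ("cw", 20), ("fab", 40)], spent == 70
```

In the toy run, the cheap 10-step attack is picked first because it fools four examples for ten iterations, and the 50-step variant is never added because everything it fools is already covered.
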
[[2211.12990] Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners](http://arxiv.org/abs/2211.12990)
This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset. We attack amortized meta-learners, which allows us to craft colluding sets of inputs that are tailored to fool the system's learning algorithm when used as training data. Jointly crafted adversarial inputs might be expected to synergistically manipulate a classifier, allowing for very strong data-poisoning attacks that would be hard to detect. We show that in a white box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance. However, in opposition to the well-known transferability of adversarial examples in general, the colluding sets do not transfer well to different classifiers. We explore two hypotheses to explain this: 'overfitting' by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred. Regardless of the mitigation strategies suggested by these hypotheses, the colluding inputs transfer no better than adversarial inputs that are generated independently in the usual way.
[[2211.12506] Dynamic Loss For Robust Learning](http://arxiv.org/abs/2211.12506)
Label noise and class imbalance commonly coexist in real-world data. Previous works for robust learning, however, usually address only one of these data biases and underperform when facing both. To mitigate this gap, this work presents a novel meta-learning based dynamic loss that automatically adjusts the objective functions with the training process to robustly learn a classifier from long-tailed noisy data. Concretely, our dynamic loss comprises a label corrector and a margin generator, which respectively correct noisy labels and generate additive per-class classification margins by perceiving the underlying data distribution as well as the learning state of the classifier. Equipped with a new hierarchical sampling strategy that enriches a small amount of unbiased metadata with diverse and hard samples, the two components in the dynamic loss are optimized jointly through meta-learning and train the classifier to adapt well to clean and balanced test data. Extensive experiments show our method achieves state-of-the-art accuracy on multiple real-world and synthetic datasets with various types of data biases, including CIFAR-10/100, Animal-10N, ImageNet-LT, and Webvision. Code will soon be publicly available.
[[2211.12542] PVT3D: Point Voxel Transformers for Place Recognition from Sparse Lidar Scans](http://arxiv.org/abs/2211.12542)
Place recognition based on point cloud (LiDAR) scans is an important module for achieving robust autonomy in robots or self-driving vehicles. Training deep networks to match such scans presents a difficult trade-off: a higher spatial resolution of the network's intermediate representations is needed to perform fine-grained matching of subtle geometric features, but growing it too large makes the memory requirements infeasible. In this work, we propose a Point-Voxel Transformer network (PVT3D) that achieves robust fine-grained matching with low memory requirements. It leverages a sparse voxel branch to extract and aggregate information at a lower resolution and a point-wise branch to obtain fine-grained local information. A novel hierarchical cross-attention transformer (HCAT) uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the output global descriptor of the point cloud. Extensive experiments show that the proposed PVT3D method surpasses the state-of-the-art by a large amount on several datasets (Oxford RobotCar, TUM, USyd). For instance, we achieve AR@1 of 85.6% on the TUM dataset, which surpasses the strongest prior model by ~15%.
[[2211.12716] Global Meets Local: Effective Multi-Label Image Classification via Category-Aware Weak Supervision](http://arxiv.org/abs/2211.12716)
Multi-label image classification, which can be categorized into label-dependency and region-based methods, is a challenging problem due to the complex underlying object layouts. Although region-based methods are less likely to encounter issues with model generalizability than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework to perform effective noisy-proposal suppression and to interact between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision to concentrate on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on more high-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which can build the high-order feature correlation containing not only global-to-local, but also local-to-local relations. Both advantages guarantee a boost in the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.
[[2211.12826] Data-driven Feature Tracking for Event Cameras](http://arxiv.org/abs/2211.12826)
Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in a grayscale frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. By directly transferring zero-shot from synthetic to real data, our data-driven tracker outperforms existing approaches in relative feature age by up to 120% while also achieving the lowest latency. This performance gap is further increased to 130% by adapting our tracker to real data with a novel self-supervision strategy.
[[2211.12853] BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields](http://arxiv.org/abs/2211.12853)
Neural Radiance Fields (NeRF) have received considerable attention recently, due to their impressive capability in photo-realistic 3D reconstruction and novel view synthesis, given a set of posed camera images. Earlier work usually assumes the input images are of good quality. However, image degradation (e.g. image motion blur in low-light conditions) can easily happen in real-world scenarios, which would further affect the rendering quality of NeRF. In this paper, we present a novel bundle adjusted deblur Neural Radiance Fields (BAD-NeRF), which can be robust to severely motion-blurred images and inaccurate camera poses. Our approach models the physical image formation process of a motion blurred image, and jointly learns the parameters of NeRF and recovers the camera motion trajectories during exposure time. In experiments, we show that by directly modeling the real physical image formation process, BAD-NeRF achieves superior performance over prior works on both synthetic and real datasets.
[[2211.12886] OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields](http://arxiv.org/abs/2211.12886)
Reconstructing 3D shapes from planar cross-sections is a challenge inspired by downstream applications like medical imaging and geographic informatics. The input is an in/out indicator function fully defined on a sparse collection of planes in space, and the output is an interpolation of the indicator function to the entire volume. Previous works addressing this sparse and ill-posed problem either produce low quality results, or rely on additional priors such as target topology, appearance information, or input normal directions. In this paper, we present OReX, a method for 3D shape reconstruction from slices alone, featuring a Neural Field as the interpolation prior. A simple neural network is trained on the input planes to receive a 3D coordinate and return an inside/outside estimate for the query point. This prior is powerful in inducing smoothness and self-similarities. The main challenge for this approach is high-frequency details, as the neural prior is overly smoothing. To alleviate this, we offer an iterative estimation architecture and a hierarchical input sampling scheme that encourage coarse-to-fine training, allowing focusing on high frequencies at later stages. In addition, we identify and analyze a common ripple-like effect stemming from the mesh extraction step. We mitigate it by regularizing the spatial gradients of the indicator function around input in/out boundaries, cutting the problem at the root.
Through extensive qualitative and quantitative experimentation, we demonstrate our method is robust, accurate, and scales well with the size of the input. We report state-of-the-art results compared to previous approaches and recent potential solutions, and demonstrate the benefit of our individual contributions through analysis and ablation studies.
[[2211.12926] Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification](http://arxiv.org/abs/2211.12926)
In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This problem setup is significantly more challenging than traditionally-studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes text appearing in the logos as well as the graphical design of the logos to learn robust contrastive representations. These representations are jointly learned for multiple views of logos over a batch and thereby they generalize well to unseen logos. We evaluate our proposed framework for cropped logo verification, cropped logo identification, and end-to-end logo identification in natural scene tasks; and compare it against state-of-the-art methods. Further, the literature lacks a 'very-large-scale' collection of reference logo images that can facilitate the study of one-hundred thousand-scale logo identification. To fill this gap in the literature, we introduce Wikidata Reference Logo Dataset (WiRLD), containing logos for 100K business brands harvested from Wikidata. Our proposed framework achieves an area under the ROC curve of 91.3% on the QMUL-OpenLogo dataset for the verification task, and outperforms state-of-the-art methods by 9.1% and 2.6% on the one-shot logo identification task on the Toplogos-10 and the FlickrLogos32 datasets, respectively. Further, we show that our method is more stable compared to other baselines even when the number of candidate logos is on a 100K scale.
[[2211.13081] Robust Mean Teacher for Continual and Gradual Test-Time Adaptation](http://arxiv.org/abs/2211.13081)
Since experiencing domain shifts during test-time is inevitable in practice, test-time adaptation (TTA) continues to adapt the model during deployment. Recently, the area of continual and gradual TTA emerged. In contrast to standard TTA, continual TTA considers not only a single domain shift, but a sequence of shifts. Gradual TTA further exploits the property that some shifts evolve gradually over time. Since in both settings long test sequences are present, error accumulation needs to be addressed for methods relying on self-training. In this work, we propose and show that in the setting of TTA, the symmetric cross-entropy is better suited as a consistency loss for mean teachers compared to the commonly used cross-entropy. This is justified by our analysis with respect to the (symmetric) cross-entropy's gradient properties. To pull the test feature space closer to the source domain, where the pre-trained model is well posed, contrastive learning is leveraged. Since applications differ in their requirements, we address different settings, namely having source data available and the more challenging source-free setting. We demonstrate the effectiveness of our proposed method 'robust mean teacher' (RMT) on the continual and gradual corruption benchmarks CIFAR10C, CIFAR100C, and ImageNet-C. We further consider ImageNet-R and propose a new continual DomainNet-126 benchmark. State-of-the-art results are achieved on all benchmarks.
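
For readers unfamiliar with the symmetric cross-entropy used as a consistency loss, a minimal sketch follows. It assumes the common definition, the sum of the cross-entropy taken in both directions between the student and teacher class distributions; the abstract does not spell out the exact form or any weighting, so the function names and the unweighted sum are assumptions.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_c target[c] * log(pred[c]), with clamping for log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))


def symmetric_cross_entropy(student, teacher, eps=1e-12):
    """Sum of both cross-entropy directions between two class distributions."""
    return cross_entropy(teacher, student, eps) + cross_entropy(student, teacher, eps)


# Toy usage with made-up softmax outputs for a 3-class problem.
student_probs = [0.7, 0.2, 0.1]
teacher_probs = [0.6, 0.3, 0.1]
loss = symmetric_cross_entropy(student_probs, teacher_probs)
```

Unlike the plain cross-entropy, swapping the two arguments leaves this loss unchanged, so neither the student nor the teacher prediction is treated as the fixed target.
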
[[2211.12570] Predicting the Type and Target of Offensive Social Media Posts in Marathi](http://arxiv.org/abs/2211.12570)
The presence of offensive language on social media is very common, motivating platforms to invest in strategies to make communities safer. This includes developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high resource languages such as French, German, and Spanish. In this paper we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0 or MOLD 2.0 and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD with expanded annotation to the levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.
[[2211.12677] Word-Level Representation From Bytes For Language Modeling](http://arxiv.org/abs/2211.12677)
Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages, such as not being robust to noise and being difficult to generalize to new languages. Also, the current trend of scaling up models reveals that larger models require larger embeddings, but that makes parallelization hard. Previous work on image classification proves that splitting raw input into a sequence of chunks is a strong, model-agnostic inductive bias. Based on this observation, we rethink the existing character-aware method that takes character-level inputs but makes word-level sequence modeling and prediction. We overhaul this method by introducing a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level hidden states to avoid the time and space requirement of word-level prediction. With these two improvements combined, we have a token-free model with slim input embeddings for downstream tasks. We name our method Byte2Word and perform evaluations on language modeling and text classification. Experiments show that Byte2Word is on par with the strong sub-word baseline BERT but only takes up 10% of the embedding size. We further test our method on synthetic noise and cross-lingual transfer and find it competitive with baseline methods in both settings.
[[2211.13112] This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish](http://arxiv.org/abs/2211.13112)
The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
[[2211.13124] Number Theory Meets Linguistics: Modelling Noun Pluralisation Across 1497 Languages Using 2-adic Metrics](http://arxiv.org/abs/2211.13124)
A simple machine learning model of pluralisation as a linear regression problem minimising a p-adic metric substantially outperforms even the most robust of Euclidean-space regressors on languages in the Indo-European, Austronesian, Trans-New Guinea, Sino-Tibetan, Nilo-Saharan, Oto-Manguean and Atlantic-Congo language families. There is insufficient evidence to support modelling distinct noun declensions as a p-adic neighbourhood even in Indo-European languages.
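To make the idea of regression under a p-adic metric more concrete, here is a small sketch of the p-adic absolute value on integers and a loss that sums it over the residuals of a linear map. How the paper encodes word forms as numbers is not described in the abstract, so the integer inputs and the names `p_adic_abs` and `p_adic_loss` are assumptions for illustration only.

```python
from fractions import Fraction


def p_adic_valuation(n, p=2):
    """Largest exponent v such that p**v divides the nonzero integer n."""
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    n = abs(n)
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v


def p_adic_abs(n, p=2):
    """p-adic absolute value |n|_p = p**(-v_p(n)), with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(n, p))


def p_adic_loss(w, b, xs, ys, p=2):
    """Sum of p-adic absolute values of the residuals w*x + b - y.

    Inputs are integers here; this encoding is a hypothetical stand-in.
    """
    return sum(p_adic_abs(w * x + b - y, p) for x, y in zip(xs, ys))
```

Under this metric a residual of 8 counts as small (1/8 for p = 2) while a residual of 3 counts as large (1), which is the reverse of what a Euclidean loss would report.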
[[2211.12624] Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization](http://arxiv.org/abs/2211.12624)
Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requires less memory and computational cost.
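The remark about restricting the regularizer to the top layer can be illustrated with a standard closed form. For a final linear layer with logits z = Wx and softmax cross-entropy loss, the Hessian with respect to W is (diag(p) - pp^T) ⊗ xx^T, so its trace is (1 - ||p||²)·||x||². The sketch below computes that quantity. It is a textbook identity used as an assumption here, not necessarily the exact regularizer derived in the paper, which is defined on a robust (adversarial) loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_layer_hessian_trace(features, logits):
    """Trace of the cross-entropy Hessian w.r.t. the last linear layer's weights.

    features: input x to the last layer (penultimate activations).
    logits:   output z = W x of the last layer.
    Uses trace((diag(p) - p p^T) kron x x^T) = (1 - sum p_i^2) * sum x_j^2.
    """
    probs = softmax(logits)
    return (1.0 - sum(p * p for p in probs)) * sum(f * f for f in features)
```

The cost depends only on the number of classes and the feature width, not on how many layers precede the last one, which matches the abstract's point about independence from network depth.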
[[2211.12703] Subgroup Robustness Grows On Trees: An Empirical Baseline Investigation](http://arxiv.org/abs/2211.12703)
Researchers have proposed many methods for fair and robust machine learning, but comprehensive empirical evaluation of their subgroup robustness is lacking. In this work, we address this gap in the context of tabular data, where sensitive subgroups are clearly-defined, real-world fairness problems abound, and prior works often do not compare to state-of-the-art tree-based models as baselines. We conduct an empirical comparison of several previously-proposed methods for fair and robust learning alongside state-of-the-art tree-based methods and other baselines. Via experiments with more than $340{,}000$ model configurations on eight datasets, we show that tree-based methods have strong subgroup robustness, even when compared to robustness- and fairness-enhancing methods. Moreover, the best tree-based models tend to show good performance over a range of metrics, while robust or group-fair models can show brittleness, with significant performance differences across different metrics for a fixed model. We also demonstrate that tree-based models show less sensitivity to hyperparameter configurations, and are less costly to train. Our work suggests that tree-based ensemble models make an effective baseline for tabular data, and are a sensible default when subgroup robustness is desired. For associated code and detailed results, see https://github.com/jpgard/subgroup-robustness-grows-on-trees .
[[2211.12776] Research on Data Fusion Algorithm Based on Deep Learning in Target Tracking](http://arxiv.org/abs/2211.12776)
To address the limitation that the deep long short-term memory (DLSTM) network cannot perform parallel computation and cannot obtain global information, this paper first carries out feature extraction and feature processing according to the characteristics of eye movement data and tracking data. Then, by introducing a convolutional neural network (CNN) into the deep long short-term memory network, a new network structure is developed and a fusion strategy is designed, yielding an eye tracking data fusion algorithm based on the long short-term memory network. The experimental results show that, compared with the two fusion algorithms based on deep learning, the proposed algorithm performs well in terms of fusion quality.
[[2211.12515] Smart Agriculture : A Novel Multilevel Approach for Agricultural Risk Assessment over Unstructured Data](http://arxiv.org/abs/2211.12515)
Detecting opportunities and threats from massive text data is a challenging task for most. Traditionally, companies would rely mainly on structured data to detect and predict risks, losing a huge amount of information that could be extracted from unstructured text data. Fortunately, artificial intelligence came to remedy this issue by innovating in data extraction and processing techniques, allowing us to understand and make use of Natural Language data and turning it into structures that a machine can process and extract insight from. Uncertainty refers to a state of not knowing what will happen in the future. This paper aims to leverage natural language processing and machine learning techniques to model uncertainties and evaluate the risk level in each uncertainty cluster using massive text data.
[[2211.12814] Vertical Federated Learning](http://arxiv.org/abs/2211.12814)
Vertical Federated Learning (VFL) is a federated learning setting where multiple parties with different features about the same set of users jointly train machine learning models without exposing their raw data or model parameters. Motivated by the rapid growth in VFL research and real-world applications, we provide a comprehensive review of the concept and algorithms of VFL, as well as current advances and challenges in various aspects, including effectiveness, efficiency, and privacy. We provide an exhaustive categorization for VFL settings and privacy-preserving protocols and comprehensively analyze the privacy attacks and defense strategies for each protocol. In the end, we propose a unified framework, termed VFLow, which considers the VFL problem under communication, computation, privacy, and effectiveness constraints. Finally, we review the most recent advances in industrial applications, highlighting open challenges and future directions for VFL.
[[2211.12874] A Dynamic Weighted Federated Learning for Android Malware Classification](http://arxiv.org/abs/2211.12874)
Android malware attacks are increasing daily at a tremendous volume, making Android users more vulnerable to cyber-attacks. Researchers have developed many machine learning (ML) and deep learning (DL) techniques to detect and mitigate Android malware attacks. However, due to technological advancement, the number of Android mobile devices is rising. Furthermore, the devices are geographically dispersed, resulting in distributed data. In such a scenario, traditional ML/DL techniques are infeasible, since all of these approaches require the data to be kept in a central system. This may pose a problem for user privacy, and, because of the massive proliferation of Android mobile devices, putting the data in a central system creates an overhead. Also, the traditional ML/DL-based Android malware classification techniques are not scalable. Researchers have proposed federated learning (FL) based Android malware classification systems to address privacy preservation and scalability with high classification performance. In traditional FL, Federated Averaging (FedAvg) is utilized to construct the global model at each round by merging all of the local models obtained from all of the clients that participated in the FL. However, the conventional FedAvg has a disadvantage: because FedAvg favors all local models equally when averaging, including one poor-performing local model in global model development for each round may result in an under-performing global model. To address this issue, our main objective in this work is to design a dynamic weighted federated averaging (DW-FedAvg) strategy in which the weights for each local model are automatically updated based on their performance at the client. The DW-FedAvg is evaluated using four popular benchmark datasets used in Android malware classification research: Melgenome, Drebin, Kronodroid and Tuandromd.
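The contrast between equal and performance-based averaging can be sketched as follows. The abstract does not state the exact weighting rule of DW-FedAvg, so weighting each client in proportion to a local score (for example validation accuracy) is an assumption chosen for illustration, as are the function names.

```python
def weighted_average(local_models, scores):
    """Aggregate flat parameter lists with weights proportional to scores.

    local_models: list of equal-length parameter lists, one per client.
    scores:       non-negative per-client scores (hypothetical, e.g. accuracy).
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one client needs a positive score")
    weights = [s / total for s in scores]
    n_params = len(local_models[0])
    return [sum(w * model[i] for w, model in zip(weights, local_models))
            for i in range(n_params)]


def fedavg(local_models):
    """Equal-weight averaging: every client counts the same."""
    return weighted_average(local_models, [1.0] * len(local_models))
```

With two clients holding parameters `[1.0, 1.0]` and `[3.0, 5.0]`, equal weighting lands at the midpoint, while scores of 0.75 and 0.25 pull the global model toward the better-performing first client.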
[[2211.13116] Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data](http://arxiv.org/abs/2211.13116)
Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL. Existing data augmentation methods based on federated generative models or raw data sharing strategies for solving the non-IID problem still suffer from low performance, privacy protection concerns, and high communication overhead in decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method, named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using some simple statistics (e.g., distributions of each column and global covariance). Specifically, we propose the multimodal distribution transformation and the inverse cumulative distribution mapping, which respectively synthesize continuous and discrete columns in tabular data from noise according to the pre-learned statistics. Furthermore, we theoretically show that Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state-of-the-art in test performance and communication efficiency.
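For the discrete columns, the general idea of inverse cumulative distribution mapping can be sketched in a few lines: category frequencies are turned into a cumulative distribution, and a noise value in [0, 1) is mapped to the category whose cumulative interval contains it. Using uniform noise and the helper names below are assumptions for this example. The paper's actual mapping, and how it couples columns through the global covariance, may differ.

```python
import bisect
import random
from itertools import accumulate


def build_cdf(category_probs):
    """Return (categories, cumulative probabilities) from a frequency dict."""
    categories = list(category_probs)
    cumulative = list(accumulate(category_probs[c] for c in categories))
    return categories, cumulative


def inverse_cdf_sample(u, categories, cumulative):
    """Map a noise value u in [0, 1) to a category via the inverse CDF."""
    idx = bisect.bisect_right(cumulative, u * cumulative[-1])
    return categories[min(idx, len(categories) - 1)]


def synthesize_column(category_probs, n, seed=0):
    """Draw n synthetic values for one discrete column from uniform noise."""
    rng = random.Random(seed)
    categories, cumulative = build_cdf(category_probs)
    return [inverse_cdf_sample(rng.random(), categories, cumulative)
            for _ in range(n)]
```

Only the per-column frequencies are needed to generate values, which is consistent with the abstract's emphasis on sharing simple statistics rather than raw records.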
[[2211.12578] Online Federated Learning via Non-Stationary Detection and Adaptation amidst Concept Drift](http://arxiv.org/abs/2211.12578)
Federated Learning (FL) is an emerging domain in the broader context of artificial intelligence research. Methodologies pertaining to FL assume distributed model training, consisting of a collection of clients and a server, with the main goal of achieving an optimal global model with restrictions on data sharing due to privacy concerns. It is worth highlighting that the diverse existing literature in FL mostly assumes stationary data generation processes; such an assumption is unrealistic in real-world conditions where concept drift occurs due to, for instance, seasonal or periodic observations or faults in sensor measurements. In this paper, we introduce a multiscale algorithmic framework which combines theoretical guarantees of the *FedAvg* and *FedOMD* algorithms in near-stationary settings with a non-stationary detection and adaptation technique to ameliorate FL generalization performance in the presence of model/concept drifts. We present a multi-scale algorithmic framework leading to $\tilde{\mathcal{O}}\left(\min\left\{\sqrt{LT},\ \Delta^{\frac{1}{3}}T^{\frac{2}{3}} + \sqrt{T}\right\}\right)$ *dynamic regret* for $T$ rounds with an underlying general convex loss function, where $L$ is the number of times non-stationary drifts occurred and $\Delta$ is the cumulative magnitude of drift experienced within $T$ rounds.
[[2211.12640] Event-Triggered Decentralized Federated Learning over Resource-Constrained Edge Devices](http://arxiv.org/abs/2211.12640)
Federated learning (FL) is a technique for distributed machine learning (ML), in which edge devices carry out local model training on their individual datasets. In traditional FL algorithms, trained models at the edge are periodically sent to a central server for aggregation, utilizing a star topology as the underlying communication graph. However, assuming access to a central coordinator is not always practical, e.g., in ad hoc wireless network settings. In this paper, we develop a novel methodology for fully decentralized FL, where in addition to local training, devices conduct model aggregation via cooperative consensus formation with their one-hop neighbors over the decentralized underlying physical network. We further eliminate the need for a timing coordinator by introducing asynchronous, event-triggered communications among the devices. In doing so, to account for the inherent resource heterogeneity challenges in FL, we define personalized communication triggering conditions at each device that weigh the change in local model parameters against the available local resources. We theoretically demonstrate that our methodology converges to the globally optimal learning model at a $O{(\frac{\ln{k}}{\sqrt{k}})}$ rate under standard assumptions in distributed learning and consensus literature. Our subsequent numerical evaluations demonstrate that our methodology obtains substantial improvements in convergence speed and/or communication savings compared with existing decentralized FL baselines.
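A toy version of the two ingredients described here, a per-device communication trigger and a consensus step with one-hop neighbors, might look like the sketch below. The abstract only says that the trigger weighs parameter change against available resources, so the specific rule (Euclidean change compared with a scaled resource cost), the `mixing` factor, and all names are assumptions for illustration.

```python
import math


def should_broadcast(current, last_sent, resource_cost, threshold_scale=1.0):
    """Trigger a broadcast when the model moved enough to justify the cost.

    resource_cost is a hypothetical per-device scalar: a higher value means
    scarcer resources and therefore a higher bar for communicating.
    """
    change = math.sqrt(sum((c - l) ** 2 for c, l in zip(current, last_sent)))
    return change > threshold_scale * resource_cost


def consensus_step(own, neighbor_models, mixing=0.5):
    """Move the local model toward the mean of received neighbor models."""
    if not neighbor_models:
        return list(own)
    n = len(neighbor_models)
    neighbor_mean = [sum(m[i] for m in neighbor_models) / n
                     for i in range(len(own))]
    return [(1.0 - mixing) * o + mixing * nm
            for o, nm in zip(own, neighbor_mean)]
```

A device with a low resource cost broadcasts after small parameter changes, while a constrained device stays silent until its model has drifted further, which is the personalization the abstract refers to.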
[[2211.13009] Federated Learning on Non-IID Graphs via Structural Knowledge Sharing](http://arxiv.org/abs/2211.13009)
Graph neural networks (GNNs) have shown their superiority in modeling graph data. Owing to the advantages of federated learning, federated graph learning (FGL) enables clients to train strong GNN models in a distributed manner without sharing their private data. A core challenge in federated systems is the non-IID problem, which also widely exists in real-world graph data. For example, local data of clients may come from diverse datasets or even domains, e.g., social networks and molecules, increasing the difficulty for FGL methods to capture commonly shared knowledge and learn a generalized encoder. From real-world graph datasets, we observe that some structural properties are shared by various domains, presenting great potential for sharing structural knowledge in FGL. Inspired by this, we propose FedStar, an FGL framework that extracts and shares the common underlying structure information for inter-graph federated learning tasks. To explicitly extract the structure information rather than encoding them along with the node features, we define structure embeddings and encode them with an independent structure encoder. Then, the structure encoder is shared across clients while the feature-based knowledge is learned in a personalized way, making FedStar capable of capturing more structure-based domain-invariant information and avoiding feature misalignment issues. We perform extensive experiments over both cross-dataset and cross-domain non-IID FGL settings, demonstrating the superiority of FedStar.
[[2211.12743] Efficient List-Decodable Regression using Batches](http://arxiv.org/abs/2211.12743)
We begin the study of list-decodable linear regression using batches. In this setting only an $\alpha \in (0,1]$ fraction of the batches are genuine. Each genuine batch contains $\ge n$ i.i.d. samples from a common unknown distribution and the remaining batches may contain arbitrary or even adversarial samples. We derive a polynomial time algorithm that for any $n\ge \tilde \Omega(1/\alpha)$ returns a list of size $\mathcal O(1/\alpha^2)$ such that one of the items in the list is close to the true regression parameter. The algorithm requires only $\tilde{\mathcal{O}}(d/\alpha^2)$ genuine batches and works under fairly general assumptions on the distribution.
The results demonstrate the utility of batch structure, which allows for the first polynomial time algorithm for list-decodable regression; this may be impossible in the non-batch setting, as suggested by a recent SQ lower bound (Diakonikolas et al., 2021).
[[2211.12757] FAIRification of MLC data](http://arxiv.org/abs/2211.12757)
The multi-label classification (MLC) task has increasingly been receiving interest from the machine learning (ML) community, as evidenced by the growing number of papers and methods that appear in the literature. Hence, ensuring proper, correct, robust, and trustworthy benchmarking is of utmost importance for the further development of the field. We believe that this can be achieved by adhering to the recently emerged data management standards, such as the FAIR (Findable, Accessible, Interoperable, and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability, and Technology) principles. To FAIRify the MLC datasets, we introduce an ontology-based online catalogue of MLC datasets that follow these principles. The catalogue extensively describes many MLC datasets with comprehensible meta-features, MLC-specific semantic descriptions, and different data provenance information. The MLC data catalogue is extensively described in our recent publication in Nature Scientific Reports, Kostovska & Bogatinovski et al., and available at: this http URL In addition, we provide an ontology-based system for easy access and querying of performance/benchmark data obtained from a comprehensive MLC benchmark study. The system is available at: this http URL
[[2211.12572] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation](http://arxiv.org/abs/2211.12572)
Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable for both real or generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing of the class and appearance of objects in a given image, and modifications of global qualities such as lighting and color.
[[2211.12737] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation](http://arxiv.org/abs/2211.12737)
Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different to natural images, and the language used to succinctly capture relevant details in medical data uses a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multi-modal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models faithfully representing medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and evaluate image quality and text-image alignment by human domain experts. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement of a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge in the text-encoder and can improve its representation capabilities of certain diseases like pneumothorax by 25%.
[[2211.13203] Inversion-Based Creativity Transfer with Diffusion Models](http://arxiv.org/abs/2211.13203)
In this paper, we introduce the task of "Creativity Transfer". The artistic creativity within a painting is the means of expression, which includes not only the painting material, colors, and brushstrokes, but also the high-level attributes including semantic elements, object shape, etc. Previous arbitrary example-guided artistic image generation methods (e.g., style transfer) often fail to control shape changes or convey semantic elements. The pre-trained text-to-image synthesis diffusion probabilistic models have achieved remarkable quality, but they often require extensive textual descriptions to accurately portray attributes of a particular painting. We believe that the uniqueness of an artwork lies precisely in the fact that it cannot be adequately explained with normal language. Our key idea is to learn artistic creativity directly from a single painting and then guide the synthesis without providing complex textual descriptions. Specifically, we assume creativity as a learnable textual description of a painting. We propose an attention-based inversion method, which can efficiently and accurately learn the holistic and detailed information of an image, thus capturing the complete artistic creativity of a painting. We demonstrate the quality and efficiency of our method on numerous paintings of various artists and styles. Code and models are available at https://github.com/zyxElsa/creativity-transfer.
[[2211.13220] Tetrahedral Diffusion Models for 3D Shape Generation](http://arxiv.org/abs/2211.13220)
Recently, probabilistic denoising diffusion models (DDMs) have greatly advanced the generative power of neural networks. DDMs, inspired by non-equilibrium thermodynamics, have not only been used for 2D image generation, but can also readily be applied to 3D point clouds. However, representing 3D shapes as point clouds has a number of drawbacks, most obvious perhaps that they have no notion of topology or connectivity. Here, we explore an alternative route and introduce tetrahedral diffusion models, an extension of DDMs to tetrahedral partitions of 3D space. The much more structured 3D representation with space-filling tetrahedra makes it possible to guide and regularize the diffusion process and to apply it to colorized assets. To manipulate the proposed representation, we develop tetrahedral convolutions, down- and up-sampling kernels. With those operators, 3D shape generation amounts to learning displacement vectors and signed distance values on the tetrahedral grid. Our experiments confirm that Tetrahedral Diffusion yields plausible, visually pleasing and diverse 3D shapes, is able to handle surface attributes like color, and can be guided at test time to manipulate the resulting shapes.
[[2211.13221] Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths](http://arxiv.org/abs/2211.13221)
AI-generated content has attracted a lot of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to achieve high performance, which makes expanding diffusion models to high-dimensional video synthesis tasks more computationally expensive. To ease this problem while leveraging its advantages, we introduce lightweight video diffusion models that synthesize high-fidelity and arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous methods on 3D pixel space under a limited computational budget. In addition, though trained on tens of frames, our models can generate videos with arbitrary lengths, i.e., thousands of frames, in an autoregressive way. Finally, conditional latent perturbation is further introduced to reduce performance degradation when synthesizing long-duration videos. Extensive experiments on various datasets and generated lengths suggest that our framework is able to sample much more realistic and longer videos than previous approaches, including GAN-based, autoregressive-based, and diffusion-based methods.
[[2211.13224] Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors](http://arxiv.org/abs/2211.13224)
Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts. While these models are trained on large internet-scale datasets, such pre-trained models are not directly introduced to any semantic localization or grounding. Most current approaches for localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that utilize architectures or loss functions geared towards localization, but they need to be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, are capable of grounding various semantic phrases with no segmentation-specific re-training. An inference time optimization process is introduced, that is capable of generating segmentation masks conditioned on natural language. We evaluate our proposal Peekaboo for unsupervised semantic segmentation on the Pascal VOC dataset. In addition, we evaluate for referring segmentation on the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information), semantic grounding technique leveraging diffusion-based generative models with no re-training. Our code will be released publicly.
[[2211.13227] Paint by Example: Exemplar-based Image Editing with Diffusion Models](http://arxiv.org/abs/2211.13227)
Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.
[[2211.13095] Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition](http://arxiv.org/abs/2211.13095)
Recent work has shown that despite their impressive capabilities, text-to-image diffusion models such as DALL-E 2 (Ramesh et al., 2022) can display strange behaviours when a prompt contains a word with multiple possible meanings, often generating images containing both senses of the word (Rassin et al., 2022). In this work we seek to put forward a possible explanation of this phenomenon. Using the similar Stable Diffusion model (Rombach et al., 2022), we first show that when given an input that is the sum of encodings of two distinct words, the model can produce an image containing both concepts represented in the sum. We then demonstrate that the CLIP encoder used to encode prompts (Radford et al., 2021) encodes polysemous words as a superposition of meanings, and that using linear algebraic techniques we can edit these representations to influence the senses represented in the generated images. Combining these two findings, we suggest that the homonym duplication phenomenon described by Rassin et al. (2022) is caused by diffusion models producing images representing both of the meanings that are present in superposition in the encoding of a polysemous word.
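The two linear-algebraic operations mentioned in this abstract, summing encodings and editing a representation to suppress one sense, can be illustrated with toy vectors. The sketch below uses orthogonal projection to remove a direction from a vector. The abstract does not say which editing technique the authors use, so projection is an assumption, and the 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def superpose(a, b):
    """Element-wise sum of two encodings."""
    return [x + y for x, y in zip(a, b)]


def project_out(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    denom = dot(direction, direction)
    if denom == 0:
        raise ValueError("direction must be nonzero")
    scale = dot(vector, direction) / denom
    return [v - scale * d for v, d in zip(vector, direction)]


# Toy stand-ins for two senses of a polysemous word (hypothetical values).
sense_animal = [1.0, 0.0, 0.0]
sense_sport = [0.0, 1.0, 0.0]
mixed = superpose(sense_animal, sense_sport)
animal_only = project_out(mixed, sense_sport)
```

After projection the edited vector has no component along the removed sense direction, so in this toy setting only the remaining sense is left to influence whatever is generated from it.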