secure

Title: PMFault: Faulting and Bricking Server CPUs through Management Interfaces. (arXiv:2301.05538v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2301.05538
Code URL: null
Copy Paste: [[2301.05538] PMFault: Faulting and Bricking Server CPUs through Management Interfaces](http://arxiv.org/abs/2301.05538) #secure
Summary:
Apart from the actual CPU, modern server motherboards contain other auxiliary components, for example voltage regulators for power management. Those are connected to the CPU and the separate Baseboard Management Controller (BMC) via the I2C-based PMBus.

In this paper, using the case study of the widely used Supermicro X11SSL motherboard, we show how remotely exploitable software weaknesses in the BMC (or other processors with PMBus access) can be used to access the PMBus and then perform hardware-based fault injection attacks on the main CPU. The underlying weaknesses include insecure firmware encryption and signing mechanisms, a lack of authentication for the firmware upgrade process and the IPMI KCS control interface, as well as the motherboard design (with the PMBus connected to the BMC and SMBus by default).

First, we show that undervolting through the PMBus allows breaking the integrity guarantees of SGX enclaves, bypassing Intel's countermeasures against previous undervolting attacks like Plundervolt/V0ltPwn. Second, we experimentally show that overvolting outside the specified range has the potential of permanently damaging Intel Xeon CPUs, rendering the server inoperable. We assess the impact of our findings on other server motherboards made by Supermicro and ASRock.

Our attacks, dubbed PMFault, can be carried out by a privileged software adversary and do not require physical access to the server motherboard or knowledge of the BMC login credentials.

We responsibly disclosed the issues reported in this paper to Supermicro and discuss possible countermeasures at different levels. To the best of our knowledge, the 12th generation of Supermicro motherboards, which was designed before we reported PMFault to Supermicro, is not vulnerable.

Title: Threat Models over Space and Time: A Case Study of E2EE Messaging Applications. (arXiv:2301.05653v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2301.05653
Code URL: null
Copy Paste: [[2301.05653] Threat Models over Space and Time: A Case Study of E2EE Messaging Applications](http://arxiv.org/abs/2301.05653) #secure
Summary:
Threat modelling is foundational to secure systems engineering and should be done in consideration of the context within which systems operate. On the other hand, the continuous evolution of both the technical sophistication of threats and the system attack surface is an inescapable reality. In this work, we explore the extent to which real-world systems engineering reflects the changing threat context. To this end we examine the desktop clients of six widely used end-to-end-encrypted mobile messaging applications to understand the extent to which they adjusted their threat model over space (when enabling clients on new platforms, such as desktop clients) and time (as new threats emerged). We experimented with short-lived adversarial access against these desktop clients and analyzed the results with respect to two popular threat elicitation frameworks, STRIDE and LINDDUN. The results demonstrate that system designers need to both recognise the threats in the evolving context within which systems operate and, more importantly, to mitigate them by rescoping trust boundaries in a manner that those within the administrative boundary cannot violate security and privacy properties. Such a nuanced understanding of trust boundary scopes and their relationship with administrative boundaries allows for better administration of shared components, including securing them with safe defaults.

security

Title: Security-Aware Approximate Spiking Neural Networks. (arXiv:2301.05264v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2301.05264
Code URL: null
Copy Paste: [[2301.05264] Security-Aware Approximate Spiking Neural Networks](http://arxiv.org/abs/2301.05264) #security
Summary:
Deep Neural Networks (DNNs) and Spiking Neural Networks (SNNs) are both known for their susceptibility to adversarial attacks. Therefore, researchers in the recent past have extensively studied the robustness and defense of DNNs and SNNs under adversarial attacks. Compared to accurate SNNs (AccSNN), approximate SNNs (AxSNNs) are known to be up to 4X more energy-efficient for ultra-low power applications. Unfortunately, the robustness of AxSNNs under adversarial attacks is yet unexplored. In this paper, we first extensively analyze the robustness of AxSNNs with different structural parameters and approximation levels under two gradient-based and two neuromorphic attacks. Then, we propose two novel defense methods, i.e., precision scaling and approximate quantization-aware filtering (AQF), for securing AxSNNs. We evaluated the effectiveness of these two defense methods using both static and neuromorphic datasets. Our results demonstrate that AxSNNs are more prone to adversarial attacks than AccSNNs, but precision scaling and AQF significantly improve the robustness of AxSNNs. For instance, a PGD attack on AxSNN results in a 72% accuracy loss compared to AccSNN without any attack, whereas the same attack on the precision-scaled AxSNN leads to only a 17% accuracy loss in the static MNIST dataset (4X robustness improvement). Similarly, a Sparse Attack on AxSNN leads to a 77% accuracy loss when compared to AccSNN without any attack, whereas the same attack on an AxSNN with AQF leads to only a 2% accuracy loss in the neuromorphic DVS128 Gesture dataset (38X robustness improvement).
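
The "4X" and "38X" figures quoted in this summary look like ratios of accuracy loss without and with the defense. The check below assumes that reading, which the abstract does not state explicitly; `improvement_factor` is a hypothetical helper name.

```python
def improvement_factor(loss_undefended: float, loss_defended: float) -> float:
    """Ratio of accuracy loss without the defense to accuracy loss with it.

    Assumption: this is how the summary's 'robustness improvement' is defined.
    """
    return loss_undefended / loss_defended


# Accuracy losses (in percentage points) taken from the summary above.
pgd_factor = improvement_factor(72, 17)    # PGD on MNIST, precision scaling
sparse_factor = improvement_factor(77, 2)  # Sparse Attack on DVS128 Gesture, AQF

print(f"PGD: {pgd_factor:.1f}x, Sparse Attack: {sparse_factor:.1f}x")
```

Under this assumption the ratios come out at roughly 4.2 and 38.5, which is consistent with the rounded-down 4X and 38X in the summary.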

Title: An RTL Implementation of the Data Encryption Standard (DES). (arXiv:2301.05530v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2301.05530
Code URL: null
Copy Paste: [[2301.05530] An RTL Implementation of the Data Encryption Standard (DES)](http://arxiv.org/abs/2301.05530) #security
Summary:
Data Encryption Standard (DES) is based on the Feistel block cipher, developed in 1971 by IBM cryptography researcher Horst Feistel. DES uses 16 rounds of the Feistel structure. But with the changes in recent years, the internet is starting to be used more to connect devices to each other. These devices can range from powerful computing devices, such as desktop computers and tablets, to resource-constrained devices. When it comes to these constrained devices, cryptography algorithms using a different key for each round fail to provide the necessary security and performance.
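
The 16-round Feistel structure mentioned in this summary can be illustrated with a minimal sketch. This is not DES: the round function `toy_round` and the round keys are hypothetical placeholders, whereas real DES uses expansion, S-boxes, permutations, and a key schedule. The sketch only shows that a Feistel network is inverted by running the same rounds with the round keys in reverse order.

```python
MASK32 = 0xFFFFFFFF


def toy_round(half: int, key: int) -> int:
    """Hypothetical round function; stands in for the DES f-function."""
    return ((half * 0x9E3779B1) ^ key) & MASK32


def feistel_encrypt(left: int, right: int, round_keys, f=toy_round):
    """Run one Feistel round per key, then swap the halves."""
    for key in round_keys:
        left, right = right, left ^ f(right, key)
    return right, left


def feistel_decrypt(left: int, right: int, round_keys, f=toy_round):
    """Decryption is the same network with the round keys reversed."""
    return feistel_encrypt(left, right, list(reversed(round_keys)), f)


round_keys = list(range(1, 17))  # 16 placeholder round keys, one per round
cipher = feistel_encrypt(0x01234567, 0x89ABCDEF, round_keys)
plain = feistel_decrypt(*cipher, round_keys)
print([hex(x) for x in plain])
```

The round function never needs to be invertible, which is what makes the structure attractive for hardware: encryption and decryption can share the same datapath.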

privacy

protect

Title: The 2022 n2c2/UW Shared Task on Extracting Social Determinants of Health. (arXiv:2301.05571v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2301.05571
Code URL: null
Copy Paste: [[2301.05571] The 2022 n2c2/UW Shared Task on Extracting Social Determinants of Health](http://arxiv.org/abs/2301.05571) #protect
Summary:
Objective: The n2c2/UW SDOH Challenge explores the extraction of social determinant of health (SDOH) information from clinical notes. The objectives include the advancement of natural language processing (NLP) information extraction techniques for SDOH and clinical information more broadly. This paper presents the shared task, data, participating teams, performance results, and considerations for future work.

Materials and Methods: The task used the Social History Annotated Corpus (SHAC), which consists of clinical text with detailed event-based annotations for SDOH events such as alcohol, drug, tobacco, employment, and living situation. Each SDOH event is characterized through attributes related to status, extent, and temporality. The task includes three subtasks related to information extraction (Subtask A), generalizability (Subtask B), and learning transfer (Subtask C). In addressing this task, participants utilized a range of techniques, including rules, knowledge bases, n-grams, word embeddings, and pretrained language models (LM).

Results: A total of 15 teams participated, and the top teams utilized pretrained deep learning LM. The top team across all subtasks used a sequence-to-sequence approach achieving 0.901 F1 for Subtask A, 0.774 F1 for Subtask B, and 0.889 F1 for Subtask C.

Conclusions: Similar to many NLP tasks and domains, pretrained LM yielded the best performance, including generalizability and learning transfer. An error analysis indicates extraction performance varies by SDOH, with lower performance achieved for conditions, like substance use and homelessness, that increase health risks (risk factors) and higher performance achieved for conditions, like substance abstinence and living with family, that reduce health risks (protective factors).

defense

attack

Title: On the feasibility of attacking Thai LPR systems with adversarial examples. (arXiv:2301.05506v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2301.05506
Code URL: null
Copy Paste: [[2301.05506] On the feasibility of attacking Thai LPR systems with adversarial examples](http://arxiv.org/abs/2301.05506) #attack
Summary:
Recent advances in deep neural networks (DNNs) have significantly enhanced the capabilities of optical character recognition (OCR) technology, enabling its adoption in a wide range of real-world applications. Despite this success, DNN-based OCR has been shown to be vulnerable to adversarial attacks, in which the adversary can influence the DNN model's prediction by carefully manipulating the input to the model. Prior work has demonstrated the security impacts of adversarial attacks on various OCR languages. However, to date, no studies have been conducted and evaluated on an OCR system tailored specifically for the Thai language. To bridge this gap, this work presents a feasibility study of performing adversarial attacks on a specific Thai OCR application -- Thai License Plate Recognition (LPR). Moreover, we propose a new type of adversarial attack based on the semi-targeted scenario and show that this scenario is highly realistic in LPR applications. Our experimental results show the feasibility of our attacks, as they can be performed on a commodity desktop computer with an attack success rate of over 90%.

robust

Title: Multi-Target Landmark Detection with Incomplete Images via Reinforcement Learning and Shape Prior. (arXiv:2301.05392v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05392
Code URL: null
Copy Paste: [[2301.05392] Multi-Target Landmark Detection with Incomplete Images via Reinforcement Learning and Shape Prior](http://arxiv.org/abs/2301.05392) #robust
Summary:
Medical images are generally acquired with limited field-of-view (FOV), which could lead to incomplete regions of interest (ROI), and thus impose a great challenge on medical image analysis. This is particularly evident for learning-based multi-target landmark detection, where algorithms could be misled into primarily learning the background variation caused by the varying FOV, and thus fail to detect the targets. Based on learning a navigation policy, instead of predicting targets directly, reinforcement learning (RL)-based methods have the potential to tackle this challenge in an efficient manner. Inspired by this, in this work we propose a multi-agent RL framework for simultaneous multi-target landmark detection. This framework aims to learn from incomplete and/or complete images to form an implicit knowledge of global structure, which is consolidated during the training stage for the detection of targets from either complete or incomplete test images. To further explicitly exploit the global structural information from incomplete images, we propose to embed a shape model into the RL process. With this prior knowledge, the proposed RL model can not only localize dozens of targets simultaneously, but also work effectively and robustly in the presence of incomplete images. We validated the applicability and efficacy of the proposed method on various multi-target detection tasks with incomplete images from practical clinics, using body dual-energy X-ray absorptiometry (DXA), cardiac MRI and head CT datasets. Results showed that our method could predict the whole set of landmarks from incomplete training images with a missing proportion of up to 80% (average distance error 2.29 cm on body DXA), and could detect unseen landmarks in regions with missing image information outside the FOV of target images (average distance error 6.84 mm on 3D half-head CT).

Title: Towards Single Camera Human 3D-Kinematics. (arXiv:2301.05435v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05435
Code URL: https://github.com/bittnerma/direct3dkinematicestimation
Copy Paste: [[2301.05435] Towards Single Camera Human 3D-Kinematics](http://arxiv.org/abs/2301.05435) #robust
Summary:
Markerless estimation of 3D kinematics has great potential to clinically diagnose and monitor movement disorders without referrals to expensive motion capture labs; however, current approaches are limited by performing multiple de-coupled steps to estimate the kinematics of a person from videos. Most current techniques work in a multi-step approach by first detecting the pose of the body and then fitting a musculoskeletal model to the data for accurate kinematic estimation. Errors in the training data of the pose detection algorithms, model scaling, as well as the requirement of multiple cameras limit the use of these techniques in a clinical setting. Our goal is to pave the way toward fast, easily applicable and accurate 3D kinematic estimation. To this end, we propose a novel approach for direct 3D human kinematic estimation (D3KE) from videos using deep neural networks. Our experiments demonstrate that the proposed end-to-end training is robust and outperforms 2D and 3D markerless motion capture based kinematic estimation pipelines in terms of joint angle error by a large margin (35%, from 5.44 to 3.54 degrees). We show that D3KE is superior to the multi-step approach and can run at video framerate speeds. This technology shows the potential for clinical analysis from mobile devices in the future.

Title: CLIP the Gap: A Single Domain Generalization Approach for Object Detection. (arXiv:2301.05499v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05499
Code URL: null
Copy Paste: [[2301.05499] CLIP the Gap: A Single Domain Generalization Approach for Object Detection](http://arxiv.org/abs/2301.05499) #robust
Summary:
Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD [49], on their own diverse weather-driving benchmark.

Title: RCPS: Rectified Contrastive Pseudo Supervision for Semi-Supervised Medical Image Segmentation. (arXiv:2301.05500v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05500
Code URL: null
Copy Paste: [[2301.05500] RCPS: Rectified Contrastive Pseudo Supervision for Semi-Supervised Medical Image Segmentation](http://arxiv.org/abs/2301.05500) #robust
Summary:
Medical image segmentation methods are generally designed as fully-supervised to guarantee model performance, which require a significant amount of expert annotated samples that are high-cost and laborious. Semi-supervised image segmentation can alleviate the problem by utilizing a large number of unlabeled images along with limited labeled images. However, learning a robust representation from numerous unlabeled images remains challenging due to potential noise in pseudo labels and insufficient class separability in feature space, which undermines the performance of current semi-supervised segmentation approaches. To address the issues above, we propose a novel semi-supervised segmentation method named as Rectified Contrastive Pseudo Supervision (RCPS), which combines a rectified pseudo supervision and voxel-level contrastive learning to improve the effectiveness of semi-supervised segmentation. Particularly, we design a novel rectification strategy for the pseudo supervision method based on uncertainty estimation and consistency regularization to reduce the noise influence in pseudo labels. Furthermore, we introduce a bidirectional voxel contrastive loss to the network to ensure intra-class consistency and inter-class contrast in feature space, which increases class separability in the segmentation. The proposed RCPS segmentation method has been validated on two public datasets and an in-house clinical dataset. Experimental results reveal that the proposed method yields better segmentation performance compared with the state-of-the-art methods in semi-supervised medical image segmentation. The source code is available at https://github.com/hsiangyuzhao/RCPS.

Title: Deep learning-based approaches for human motion decoding in smart walkers for rehabilitation. (arXiv:2301.05575v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05575
Code URL: null
Copy Paste: [[2301.05575] Deep learning-based approaches for human motion decoding in smart walkers for rehabilitation](http://arxiv.org/abs/2301.05575) #robust
Summary:
Gait disabilities are among the most frequent disabilities worldwide. Their treatment relies on rehabilitation therapies, in which smart walkers are being introduced to empower the user's recovery and autonomy, while reducing the clinicians' effort. To that end, these walkers should be able to decode human motion and needs as early as possible. Current walkers decode motion intention using information from wearable or embedded sensors, namely inertial units, force and hall sensors, and lasers, whose main limitations imply an expensive solution or hinder the perception of human movement. Smart walkers commonly lack a seamless human-robot interaction, which intuitively understands human motions. A contactless approach is proposed in this work, addressing human motion decoding as an early action recognition/detection problem, using RGB-D cameras. We studied different deep learning-based algorithms, organised in three different approaches, to process lower body RGB-D video sequences, recorded from an embedded camera of a smart walker, and classify them into 4 classes (stop, walk, turn right/left). A custom dataset involving 15 healthy participants walking with the device was acquired and prepared, resulting in 28800 balanced RGB-D frames, to train and evaluate the deep networks. The best results were attained by a convolutional neural network with a channel attention mechanism, reaching accuracy values of 99.61% and above 93%, for offline early detection/recognition and trial simulations, respectively. Following the hypothesis that human lower body features encode prominent information, fostering a more robust prediction towards real-time applications, the algorithm focus was also evaluated using the Dice metric, leading to values slightly higher than 30%. Promising results were attained for early action detection as a human motion decoding strategy, with enhancements in the focus of the proposed architectures.

Title: It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers. (arXiv:2301.05453v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2301.05453
Code URL: https://github.com/cosmaadrian/time-enriched-multimodal-depression-detection
Copy Paste: [[2301.05453] It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers](http://arxiv.org/abs/2301.05453) #robust
Summary:
Depression detection from user-generated content on the internet has been a long-lasting topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquitous use of social media platforms lays out the perfect avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models for extracting image and text embeddings. Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant, which can operate on randomly sampled and unordered sets of posts to be more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset, and 0.902 F1 score on the only multimodal Reddit dataset.
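
The time2vec positional embedding mentioned in this summary maps a scalar time value to one linear component plus several sinusoidal components. The sketch below follows the published Time2Vec definition with fixed, hypothetical frequencies and phases; in a trained model these would be learned parameters. How the paper computes relative time between posts is not specified in the abstract, so the consecutive-difference step here is an assumption.

```python
import math


def time2vec(tau: float, omegas, phis):
    """Time2Vec: first component is linear, the remaining ones are sin(w*tau + p)."""
    linear = omegas[0] * tau + phis[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omegas[1:], phis[1:])]
    return [linear] + periodic


# Hypothetical post timestamps in hours; relative time is taken here as the
# gap to the previous post (an assumption, not the paper's stated procedure).
timestamps = [0.0, 1.5, 26.0]
gaps = [0.0] + [b - a for a, b in zip(timestamps, timestamps[1:])]

omegas = [0.1, 1.0, 0.25]  # placeholder values; learned in practice
phis = [0.0, 0.0, 0.5]

embeddings = [time2vec(g, omegas, phis) for g in gaps]
for gap, emb in zip(gaps, embeddings):
    print(gap, [round(x, 3) for x in emb])
```

The resulting vectors would typically be added to or concatenated with the per-post text and image embeddings before the transformer layers.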

Title: Learning to Control and Coordinate Hybrid Traffic Through Robot Vehicles at Complex and Unsignalized Intersections. (arXiv:2301.05294v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2301.05294
Code URL: null
Copy Paste: [[2301.05294] Learning to Control and Coordinate Hybrid Traffic Through Robot Vehicles at Complex and Unsignalized Intersections](http://arxiv.org/abs/2301.05294) #robust
Summary:
Intersections are essential road infrastructures for traffic in modern metropolises; however, they can also be the bottleneck of traffic flows due to traffic incidents or the absence of traffic coordination mechanisms such as traffic lights. Thus, various control and coordination mechanisms that are beyond traditional control methods have been proposed to improve the efficiency of intersection traffic. Amongst these methods, the control of foreseeable hybrid traffic that consists of human-driven vehicles (HVs) and robot vehicles (RVs) has recently emerged. We propose a decentralized reinforcement learning approach for the control and coordination of hybrid traffic at real-world, complex intersections--a topic that has not been previously explored. Comprehensive experiments are conducted to show the effectiveness of our approach. In particular, we show that using 5% RVs, we can prevent congestion formation inside the intersection under the actual traffic demand of 700 vehicles per hour. In contrast, without RVs, congestion starts to develop when the traffic demand reaches as low as 200 vehicles per hour. Further performance gains (reduced waiting time of vehicles at the intersection) are obtained as the RV penetration rate increases. When there exist more than 50% RVs in traffic, our method starts to outperform traffic signals on the average waiting time of all vehicles at the intersection.

Our method is also robust against both blackout events and sudden RV percentage drops, and enjoys excellent generalizability, which is illustrated by its successful deployment in two unseen intersections.

biometric

steal

extraction

membership infer

federate

fair

interpretability

Title: Reworking geometric morphometrics into a methodology of transformation grids. (arXiv:2301.05623v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05623
Code URL: null
Copy Paste: [[2301.05623] Reworking geometric morphometrics into a methodology of transformation grids](http://arxiv.org/abs/2301.05623) #interpretability
Summary:
Today's typical application of geometric morphometrics to a quantitative comparison of organismal anatomies begins by standardizing samples of homologously labelled point configurations for location, orientation, and scale, and then renders the ensuing comparisons graphically by thin-plate spline as applied to group averages, principal components, regression predictions, or canonical variates. The scale-standardization step has recently come under criticism as inappropriate, at least for growth studies. This essay argues for a similar rethinking of the centering and rotation, and then the replacement of the thin-plate spline interpolant of the resulting configurations by a different strategy that leaves unexplained residuals at every landmark individually in order to simplify the interpretation of the displayed grid as a whole, the "transformation grid" that has been highlighted as the true underlying topic ever since D'Arcy Thompson's exposition of 1917.

For analyses of comparisons involving gradients at large geometric scale, this paper argues for replacement of all the Procrustes conventions by a version of my two-point registration of 1986 (originally Francis Galton's of 1907). The choice of the two points interacts with another non-Procrustes concern, interpretability of the grid lines of a coordinate system deformed according to a fitted polynomial trend rather than an interpolating thin-plate spline.

The paper works two examples using previously published cranial data; there result new findings pertinent to the interpretation of both of these classic data sets.

A concluding discussion suggests that the current toolkit of geometric morphometrics, centered on Procrustes shape coordinates and thin-plate splines, is too restricted to suit many of the interpretive purposes of evolutionary and developmental biology.

Title: From stage to page: language independent bootstrap measures of distinctiveness in fictional speech. (arXiv:2301.05659v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2301.05659
Code URL: null
Copy Paste: [[2301.05659] From stage to page: language independent bootstrap measures of distinctiveness in fictional speech](http://arxiv.org/abs/2301.05659) #interpretability
Summary:
Stylometry is mostly applied to authorial style. Recently, researchers have begun investigating the style of characters, finding that the variation remains within authorial bounds. We address the stylistic distinctiveness of characters in drama. Our primary contribution is methodological; we introduce and evaluate two non-parametric methods to produce a summary statistic for character distinctiveness that can be usefully applied and compared across languages and times. Our first method is based on bootstrap distances between 3-gram probability distributions, the second (reminiscent of 'unmasking' techniques) on word keyness curves. Both methods are validated and explored by applying them to a reasonably large corpus (a subset of DraCor): we analyse 3301 characters drawn from 2324 works, covering five centuries and four languages (French, German, Russian, and the works of Shakespeare). Both methods appear useful; the 3-gram method is statistically more powerful but the word keyness method offers rich interpretability. Both methods are able to capture phonological differences such as accent or dialect, as well as broad differences in topic and lexical richness. Based on exploratory analysis, we find that smaller characters tend to be more distinctive, and that women are cross-linguistically more distinctive than men, with this latter finding carefully interrogated using multiple regression. This greater distinctiveness stems from a historical tendency for female characters to be restricted to an 'internal narrative domain' covering mainly direct discourse and family/romantic themes. It is hoped that direct, comparable statistical measures will form a basis for more sophisticated future studies, and advances in theory.
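
The first method in this summary, a bootstrap distance between 3-gram probability distributions, might be sketched roughly as follows. The abstract does not say whether the 3-grams are over characters or words, which distance is used, or how resampling is done; this sketch assumes character 3-grams, total variation distance, and resampling of speech lines with replacement. All function names are hypothetical.

```python
import random
from collections import Counter


def trigram_dist(lines):
    """Relative frequencies of character 3-grams over a list of speech lines."""
    counts = Counter()
    for line in lines:
        counts.update(line[i:i + 3] for i in range(len(line) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def tv_distance(p, q):
    """Total variation distance between two sparse distributions."""
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def bootstrap_distance(char_lines, other_lines, n_boot=200, seed=0):
    """Mean distance over bootstrap resamples of both sets of lines."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_boot):
        a = rng.choices(char_lines, k=len(char_lines))
        b = rng.choices(other_lines, k=len(other_lines))
        dists.append(tv_distance(trigram_dist(a), trigram_dist(b)))
    return sum(dists) / len(dists)


# Toy example: one character's lines against the rest of a play.
character = ["ay, ay, what's the matter?", "ay, marry is it"]
others = ["good morrow to you both", "what is the matter here?"]
score = bootstrap_distance(character, others)
print(round(score, 3))
```

A summary statistic of this kind is language independent in the sense that it needs no tokenizer or lexicon, only the raw text of each character's speech.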

explainability

watermark

diffusion

Title: Neural Image Compression with a Diffusion-Based Decoder. (arXiv:2301.05489v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2301.05489
Code URL: null
Copy Paste: [[2301.05489] Neural Image Compression with a Diffusion-Based Decoder](http://arxiv.org/abs/2301.05489) #diffusion
Summary:
Diffusion probabilistic models have recently achieved remarkable success in generating high quality image and video data. In this work, we build on this class of generative models and introduce a method for lossy compression of high resolution images. The resulting codec, which we call DIffusion-based Residual Augmentation Codec (DIRAC), is the first neural codec to allow smooth traversal of the rate-distortion-perception tradeoff at test time, while obtaining competitive performance with GAN-based methods in perceptual quality. Furthermore, while sampling from diffusion probabilistic models is notoriously expensive, we show that in the compression setting the number of steps can be drastically reduced.

Title: In BLOOM: Creativity and Affinity in Artificial Lyrics and Art. (arXiv:2301.05402v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2301.05402
Code URL: https://github.com/ecrows/in-bloom
Copy Paste: [[2301.05402] In BLOOM: Creativity and Affinity in Artificial Lyrics and Art](http://arxiv.org/abs/2301.05402) #diffusion
Summary:
We apply a large multilingual language model (BLOOM-176B) in open-ended generation of Chinese song lyrics, and evaluate the resulting lyrics for coherence and creativity using human reviewers. We find that current computational metrics for evaluating large language model outputs (MAUVE) have limitations in evaluation of creative writing. We note that the human concept of creativity requires lyrics to be both comprehensible and distinctive -- and that humans assess certain types of machine-generated lyrics to score more highly than real lyrics by popular artists. Inspired by the inherently multimodal nature of album releases, we leverage a Chinese-language stable diffusion model to produce high-quality lyric-guided album art, demonstrating a creative approach for an artist seeking inspiration for an album or single. Finally, we introduce the MojimLyrics dataset, a Chinese-language dataset of popular song lyrics for future research.