Dance Dance Revolution (DDR) is a rhythm-based video game with millions of players worldwide. Players perform steps on a dance platform, following prompts from an on-screen step chart to step on the platform's buttons at specific, musically salient points in time. Scores depend on both hitting the correct buttons and hitting them at the correct time. Step charts vary in difficulty, with harder charts containing more steps and more complex sequences. The dance pad contains up, down, left, and right arrows, each of which can be in one of four states: on, off, hold, or release. Because the four arrows can be activated or released independently, there are 256 possible step combinations at any instant.
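The count of 256 follows from each of the four arrows independently taking one of four states, a short enumeration makes the arithmetic concrete:

```python
from itertools import product

ARROWS = ("left", "down", "up", "right")
STATES = ("off", "on", "hold", "release")

# One state per arrow, chosen independently: 4^4 = 256 combinations.
step_combinations = list(product(STATES, repeat=len(ARROWS)))
```

Each element of `step_combinations` is one instantaneous pad configuration, e.g. `("off", "on", "off", "hold")`.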
Step charts exhibit rich structure and complex semantics to ensure that step sequences are both challenging and enjoyable. Charts tend to mirror musical structure: particular sequences of steps correspond to different motifs, and entire passages may reappear as sections of the song are repeated. Moreover, chart authors strive to avoid patterns that would compel a player to face away from the screen.
The DDR community uses simulators, such as the open-source StepMania, that allow fans to create and play their own charts. A number of prolific authors produce and disseminate packs of charts, bundling metadata with relevant recordings. Typically, for each song, packs contain one chart for each of five difficulty levels.
Despite the game's popularity, players have some reasonable complaints: For one, packs are limited to songs with favorable licenses, meaning players may be unable to dance to their favorite songs. Even when charts are available, players may tire of repeatedly performing the same charts. Although players can produce their own charts, the process is painstaking and requires significant expertise.
Researchers at the University of California, led by Chris Donahue, sought to automate the process of step chart generation so that players can dance to a wider variety of charts, on any song of their choosing. Drawing on a large collection of human-authored charts, the researchers trained a deep-learning system to create dances of its own, introducing the task of learning to choreograph, in which the machine learns to generate step charts from raw audio. Although this task has previously been approached via ad-hoc methods, the researchers are the first to cast it as a learning task that seeks to mimic the semantics of human-generated charts. The problem is broken into two subtasks. First, step placement identifies a set of timestamps in the song at which to place steps; this process can be conditioned on a user-specified difficulty level. Second, step selection chooses which steps to place at each timestamp. Running these two stages in sequence yields a playable step chart.
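The two-stage pipeline can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `place_step` and `select_step` are hypothetical stand-ins for the trained models, and the audio frames here are toy loudness scalars.

```python
def choreograph(audio_frames, difficulty, place_step, select_step):
    """Stage 1 (placement): pick the frame indices that should carry a
    step. Stage 2 (selection): choose a step for each chosen index.
    Returns a chart as a list of (timestamp, step) pairs."""
    timestamps = [t for t, frame in enumerate(audio_frames)
                  if place_step(frame, difficulty)]
    return [(t, select_step(t)) for t in timestamps]

# Toy stand-ins: place a step on loud frames (difficulty acts as a
# loudness threshold here), and always select a "right" step.
loudness = [0.1, 0.9, 0.2, 0.8, 0.95]
chart = choreograph(
    loudness,
    difficulty=0.5,
    place_step=lambda frame, threshold: frame > threshold,
    select_step=lambda t: "right",
)
```

Real models would score spectrogram frames rather than scalars, but the control flow (placement, then selection) matches the decomposition described above.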
Their system, called Dance Dance Convolution, takes the raw audio of pop songs as input and produces dance routines as output: a machine that choreographs music.
Progress on learning to choreograph may also lead to advances in music information retrieval (MIR). The team's step placement task, for example, closely resembles onset detection, a well-studied MIR problem. The goal of onset detection is to identify the times of all musically salient events, such as melody notes or drum strikes. While not every onset in the data corresponds to a DDR step, every DDR step corresponds to an onset. In addition to marking steps, DDR packs specify a metronome click track for each song. For songs with changing tempos, the exact location of each change and the new tempo are annotated. This click data could help to spur algorithmic innovation for beat tracking and tempo detection.
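To make the connection to onset detection concrete, here is a minimal, dependency-free sketch of one classic onset-strength signal: short-time energy novelty, where sudden increases in frame energy mark candidate onsets. The frame sizes, threshold, and synthetic audio are illustrative choices, not values from the paper.

```python
import math

def energy_novelty(signal, frame_len=512, hop=256):
    """Short-time energy per frame, then the positive energy increase
    between consecutive frames -- a crude onset-strength signal."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return [max(energies[i] - energies[i - 1], 0.0) if i else 0.0
            for i in range(len(energies))]

# Synthetic audio: one second of silence at 8 kHz with two short tone
# bursts standing in for percussive onsets.
sr = 8000
audio = [0.0] * sr
for burst_start in (2000, 5000):
    for i in range(200):
        audio[burst_start + i] = math.sin(2 * math.pi * 440 * i / sr)

novelty = energy_novelty(audio)
threshold = 0.5 * max(novelty)           # crude peak picking
onset_frames = [i for i, v in enumerate(novelty) if v > threshold]
```

Production systems typically use spectral flux over a log-magnitude spectrogram instead of raw energy, but the shape of the problem, turning audio into a sparse list of salient times, is the same one step placement solves.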
Unfortunately, MIR research is stymied by the difficulty of accessing large, well-annotated datasets. Songs are often subject to copyright issues, and thus must be gathered by each researcher independently. Collating audio with separately-distributed metadata is nontrivial and error-prone owing to the multiple available versions of many songs. Researchers must often manually align their version of a song to the metadata. In contrast, the researchers' dataset is publicly available and standardized, and it contains meticulously annotated labels as well as the relevant recordings.
The team believes that DDR charts represent an abundant and under-recognized source of annotated data for MIR research. Stepmania Online, a popular repository of DDR data, distributes over 350 GB of packs with annotations for more than 100,000 songs. In addition to introducing a novel task and methodology, the researchers contribute two large public datasets, which they consider to be of notably high quality and consistency. Each dataset is a collection of recordings and step charts: one contains charts by a single author, the other by multiple authors.
For both prediction stages of learning to choreograph, the team demonstrates the superior performance of neural networks over strong alternatives. Their best model for step placement jointly learns convolutional neural network (CNN) representations and a recurrent neural network (RNN), which integrates information across consecutive time slices. This method outperforms CNNs alone, multilayer perceptrons (MLPs), and linear models.
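The following is a toy, dependency-free caricature of that architecture, not the paper's model: a tiny 1-D convolution extracts local features from a sequence of scalar "audio frames," and a one-unit recurrent pass carries context across consecutive time slices before squashing each slice to a step-placement probability. All weights, shapes, and inputs here are invented for illustration.

```python
import math

def conv1d(seq, kernel):
    """Valid 1-D convolution: each output mixes a local window of
    frames, a stand-in for learned CNN feature extraction."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def rnn_placement(features, w_in=0.8, w_rec=0.5, bias=-0.2):
    """Toy recurrent pass: each step mixes the current feature with the
    previous hidden state, then a sigmoid yields a step probability."""
    h, probs = 0.0, []
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)          # carry context forward
        probs.append(1.0 / (1.0 + math.exp(-(h + bias))))
    return probs

# Toy loudness sequence: quiet, a loud event, then quiet again.
frames = [0.0, 0.1, 0.9, 1.0, 0.2, 0.0]
feats = conv1d(frames, kernel=[0.25, 0.5, 0.25])     # local "CNN" features
probs = rnn_placement(feats)                         # one probability per slice
```

The real system operates on spectrogram windows with learned multi-channel filters and LSTM units, but the division of labor is the same: convolution for local acoustic patterns, recurrence for integrating information across time slices.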
The best-performing system for step selection consists of a conditional LSTM generative model. As auxiliary information, the model takes beat phase, a number representing the fraction of a beat at which a step occurs. Additionally, the best models receive the time difference (measured in beats) since the last and until the next step. This model selects steps that are more consistent with expert authors than the best n-gram and fixed-window models, as measured by perplexity and per-token accuracy.
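Those auxiliary features are cheap to derive from a tempo annotation. A sketch, assuming a constant BPM and using a hypothetical helper name (the paper's feature pipeline is not spelled out here):

```python
import math

def beat_features(step_times, bpm):
    """For each step time (in seconds), compute beat phase plus the gap
    in beats since the previous step and until the next one (None at
    the chart boundaries). Illustrative helper, not the paper's code."""
    beats = [t * bpm / 60.0 for t in step_times]   # seconds -> beats
    feats = []
    for i, b in enumerate(beats):
        phase = b - math.floor(b)                  # fraction of a beat
        prev_gap = b - beats[i - 1] if i > 0 else None
        next_gap = beats[i + 1] - b if i + 1 < len(beats) else None
        feats.append((phase, prev_gap, next_gap))
    return feats

# Four steps at 120 BPM (one beat every 0.5 s): an on-beat step, an
# off-beat step, and two more on-beat steps.
feats = beat_features([0.0, 0.25, 0.5, 1.0], bpm=120)
```

For the off-beat step at 0.25 s, the beat phase is 0.5: it falls exactly halfway through a beat, which is precisely the rhythmic information the conditional model can exploit.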
Donahue and the team use 80 percent of the music to train the machine-learning algorithm to recognize suitable times for step placement, and test the resulting model on the remaining 20 percent of the data. Similar proportions are used to train another algorithm for step selection. Such train/test splits are standard practice in machine learning, including in tasks such as natural-language processing.
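An 80/20 split at the song level can be sketched as follows; the shuffling and seeding here are illustrative assumptions, not the paper's exact procedure:

```python
import random

def split_songs(songs, train_frac=0.8, seed=0):
    """Shuffle a list of songs and split it into train/test portions.
    Splitting by whole song keeps every chart of a song on one side,
    so the model is always evaluated on unseen music."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(songs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_songs([f"song_{i}" for i in range(100)])
```

With 100 songs this yields 80 for training and 20 held out for evaluation.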
"By combining insights from musical onset detection and statistical language modeling, we have designed and evaluated a number of deep-learning methods for learning to choreograph," they say.
Top image: Dual Shockers