Final Performance – MelodyRNN/PerformanceRNN

For the final performance, I want to explore different RNN models further and try to generate rap beats based on some of the RNNs we have looked at.

I wanted to explore ImprovRNN, PerformanceRNN, MelodyRNN, as well as DrumsRNN.

I collected different MIDI files (Eminem and Jay Chou, a Chinese pop star) as well as .xml files, because training ImprovRNN requires MusicXML files.

I was actually most interested in the ImprovRNN because I thought the music pieces it generated were by far the most listenable ones.

However, I ran into a lot of problems with the XML files.

There were two ways that I got my musicXML files:

  1. convert midi files to xml files using musescore
  2. directly download xml files from websites like MuseScore

However, neither way worked for me. I did not have too much trouble converting my MusicXML files into NoteSequences, but when I tried to convert the NoteSequences into TFRecords, I failed.

After trying many times and failing, I decided to stay away from ImprovRNN for the moment and spend some more time on MelodyRNN and PerformanceRNN.

I trained both MelodyRNN and PerformanceRNN on my Eminem MIDI dataset.

I trained both models for 10,000 steps and the loss is small (I guess because my dataset is small too? It is only 15-20 MIDI files).


I tried different primer melodies for MelodyRNN:

melody_rnn_generate \
--config=lookback_rnn \
--run_dir=/tmp/melody_rnn/logdir/run1 \
--output_dir=./generated/melody_rnn/eminem2/ \
--num_outputs=10 \
--num_steps=1000 \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
--primer_melody="[60, -2, 52, -2, 60, -2, 52, -2]"

melody_rnn_generate \
--config=lookback_rnn \
--run_dir=/tmp/melody_rnn/logdir/run1 \
--output_dir=./generated/melody_rnn/eminem3/ \
--num_outputs=3 \
--num_steps=1000 \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]"



performance_rnn_generate \
--config=performance_with_dynamics \
--run_dir=/tmp/performance_rnn/logdir/run2 \
--output_dir=./generated/ \
--num_outputs=10 \
--num_steps=1000 \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]"


Some results:


In the end, I used the pre-trained DrumsRNN because I did not have enough time to train a third model. I curated some MelodyRNN pieces that I think are good, combined them with the results I got from DrumsRNN, and made a one-minute beat.


Overall, though the outcome does not sound like a rap beat at all, I feel the two models have captured the very repetitive nature of rap beats (the same bars repeat many times).

For the future, I definitely want to spend some more time on the MusicXML files and hopefully I can succeed in training my own ImprovRNN model.

I also think the dataset a model is trained on is very important, but I have not really had a chance to work on that. So I will probably spend more time researching how to create a good dataset.

I also want to look into MusicVAE more.

Week 8 – Models Presentation

Improv RNN – Magenta model (Google AI, 2016)

Improv RNN is a model that uses a recurrent neural network (more specifically, an LSTM – a long short-term memory network) to generate melodies over chord progressions.

Data it is trained on

The model trains on lead sheets in MusicXML format.

A lead sheet is a musical representation containing chords and melody (and lyrics, which are ignored by the model).

You can find lead sheets in various places on the web such as MuseScore. Magenta is currently only able to read lead sheets in MusicXML format; MuseScore provides MusicXML download links, e.g.

I think the model looks at both the chord progression and the melody in the dataset, as well as the relationship between them.


For the chords: one-hot encoding

One-hot encoding is often used for classifying categorical data. It transforms our categorical labels into vectors of 0s and 1s.

The length of these vectors is the number of categories we have; in other words, it equals the number of output categories.

Each category corresponds to one element of the vector, and that particular element is 1 while the rest of the elements stay 0 (that is why it is called one-hot).

For instance, if we have three different categories of data, A, B, and C, their encodings will be something like [1,0,0], [0,1,0], and [0,0,1].

In this specific model, each chord is encoded as a one-hot vector over 48 triads (major/minor/augmented/diminished for all 12 root pitch classes), so the vector has 48 elements with a single 1 at the position of the chord's triad. For instance, D major would be encoded as a 48-element vector whose only 1 sits at the position corresponding to the D-major triad.

Also, there are three different configurations:

basic improv rnn / attention improv rnn / chord pitches improv rnn

Q: What are the differences between the three Improv RNN configurations? (Are they used at the same time? I think they are used separately.)

Q: how are the melodies encoded?

To use the model (after installing Magenta):

  1. Convert a collection of MusicXML lead sheets into NoteSequences.
  2. Create SequenceExamples: SequenceExamples are fed into the model during training and evaluation. Each SequenceExample will contain a sequence of inputs and a sequence of labels that represent a lead sheet.
  3. Train and evaluate the model: start a training job using the attention configuration. --run_dir is the directory where checkpoints and TensorBoard data for this run will be stored. --sequence_example_file is the TFRecord file of SequenceExamples that will be fed to the model.


My biggest concern/problem is that I cannot really envision what the dataset looks like.

Generate Melody Over Chords

To generate your own melody, you will also need a primer melody.

At least one note needs to be fed to the model before it can start generating consecutive notes. We can use --primer_melody to specify a priming melody using a string representation of a Python list. The values in the list should be ints that follow the melodies_lib.Melody format (-2 = no event, -1 = note-off event, values 0 through 127 = note-on event for that MIDI pitch). For example, --primer_melody="[60, -2, 60, -2, 67, -2, 67, -2]" would prime the model with the first four notes of Twinkle Twinkle Little Star. Instead of using --primer_melody, we can use --primer_midi to prime our model with a melody stored in a MIDI file.
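To make the melodies_lib.Melody format described above concrete, here is a small helper of my own (not part of Magenta) that builds such a primer list from MIDI pitches:

```python
# melodies_lib.Melody event format: -2 = no event, -1 = note-off,
# 0..127 = note-on at that MIDI pitch. This helper is my own sketch:
# it emits one note-on per note and holds it with "no event" steps.
NO_EVENT = -2

def make_primer(pitches, steps_per_note=2):
    """Encode a list of MIDI pitches as a Melody event list."""
    events = []
    for pitch in pitches:
        events.append(pitch)                              # note-on event
        events.extend([NO_EVENT] * (steps_per_note - 1))  # hold the note
    return events

# First four notes of Twinkle Twinkle Little Star (C4 C4 G4 G4):
print(make_primer([60, 60, 67, 67]))  # [60, -2, 60, -2, 67, -2, 67, -2]
```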

Backing chords:

In addition, the backing chord progression must be provided using --backing_chords, a string representation of the backing chords separated by spaces. For example, --backing_chords="Am Dm G C F Bdim E E" uses the chords from I Will Survive. By default, each chord will last 16 steps (a single measure), but --steps_per_chord can also be set to a different value.
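A rough sketch of my own (assuming the default of 16 steps per chord) of how a backing chord string expands into one chord per step:

```python
# Expand a --backing_chords string into one chord symbol per step,
# assuming each chord lasts steps_per_chord steps (16 by default,
# i.e. one measure). This is my own illustration of what the flag
# means, not Magenta's internal code.
def expand_backing_chords(chords, steps_per_chord=16):
    per_step = []
    for chord in chords.split():
        per_step.extend([chord] * steps_per_chord)
    return per_step

steps = expand_backing_chords("Am Dm G C F Bdim E E")
print(len(steps))           # 8 chords x 16 steps = 128
print(steps[0], steps[16])  # Am Dm
```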

GitHub repo

Examples I found

Week 7 – Generative Music Performance Two



I wanted to create a project to help envision what it would sound like if certain rappers used their styles to perform other rappers' representative works.

My two main goals:

  1. Generate lyrics that make sense and have rhymes and punctuation.
  2. Somehow match the generated lyrics to a certain beat.

I tried two datasets:

  1. A compilation of over 30 different rappers' remixes of Versace by Migos (found on SoundCloud and Genius).
  2. A compilation of Kanye West lyrics.


Week 5 – Bias in machine learning

The instance I found of bias in machine learning is Microsoft's AI millennial chatbot Tay.

Tay was supposed to be a chatbot that mimics the talking style of a teenage girl. It would learn from people's tweets on Twitter and respond to them. However, after just one day it started to post heavily biased comments that were racist and terrible.

Garbage in, garbage out. Microsoft probably underestimated the power of people's hate comments on Twitter and the influence they would have on Tay's behavior.

Week 1 – Generative Music – project

The project I found is called Performance RNN, by Ian Simon and Sageev Oore. This project is posted on Magenta, and I came across it while reading Kyle McDonald's article Neural Nets for Generating Music.

As described by the creators, Performance RNN is "an LSTM-based recurrent neural network designed to model polyphonic music with expressive timing and dynamics."

Basically, as far as I understood the project, all the sounds (notes) are pre-made; the system itself does not create the original sounds. However, via a stream of MIDI events, the system generates expressive timing and dynamics for those notes.

Often, when a system creates generative music pieces, there is a lack of performance in them ("with all notes at the same volume and quantized"). Performance can be achieved by manipulating the speed of a note, the space between notes, or how hard a note is struck.

Performance RNN therefore uses note-on and note-off events to define the pitch, the velocity, and the feel of the notes, and in that sense generates music pieces that are more emotional and performative.
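As far as I understand it, a performance in this representation is a stream of events: note-on, note-off, time-shift, and velocity changes. A toy sketch of such a stream (my own illustration, not the model's exact event vocabulary):

```python
# Toy event stream in the spirit of Performance RNN: note-on/note-off
# events carry a MIDI pitch, time-shift events advance the clock in
# milliseconds, and velocity events set how hard the following notes
# are struck. This is my own illustration, not the model's encoding.
events = [
    ("velocity", 90),     # play the next notes fairly loudly
    ("note_on", 60),      # start C4
    ("time_shift", 480),  # let 480 ms of expressive time pass
    ("note_off", 60),     # release C4
    ("velocity", 45),     # drop to a softer dynamic
    ("note_on", 67),      # start G4
    ("time_shift", 250),  # a shorter, quicker note
    ("note_off", 67),     # release G4
]

# The total elapsed time is just the sum of the time shifts, which is
# what lets the model place notes off the grid instead of quantizing.
total_ms = sum(value for kind, value in events if kind == "time_shift")
print(total_ms)  # 730
```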