To generate music, we need a reasonably large training set of music files from which to extract sequences when building our training dataset. To simplify the process, in this chapter we use recordings of a single instrument. We collected a number of melodies and stored them as MIDI files. The following sample of a MIDI file shows what one looks like:
We can see the intervals between notes, the offset of each note, and its pitch.
To get started, we load each file with the converter.parse(file) function to create a Music21 stream object. We will later use this stream object to get a list of all the notes and chords in the file. Because the most salient features of a note's pitch can be recreated from its string notation, we append the pitch of every note as a string. To handle chords, we encode the pitch class of every note in the chord into a single string, with the notes separated by dots, and append that string. This encoding makes it easy to decode the model's generated output back into the correct notes and chords.
We load the data from the MIDI files into an array, as shown in the code snippet below:
from music21 import converter, instrument, note, chord
import glob

notes = []
for file in glob.glob("/data/*.mid"):
    midi = converter.parse(file)
    notes_to_parse = None
    parts = instrument.partitionByInstrument(midi)
    if parts:  # file has instrument parts
        notes_to_parse = parts.parts[0].recurse()
    else:  # file has notes in a flat structure
        notes_to_parse = midi.flat.notes
    for element in notes_to_parse:
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))
        elif isinstance(element, chord.Chord):
            notes.append('.'.join(str(n) for n in element.normalOrder))
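To illustrate why this string encoding is easy to reverse, the sketch below shows a hypothetical helper (decode_element is our own name, not part of Music21) that maps an encoded token back to either a pitch name or the chord's pitch classes:

```python
def decode_element(token):
    """Decode a token from the notes list.

    Chords were stored as dot-separated pitch-class integers
    (e.g. "0.4.7"); single notes as pitch names (e.g. "C4").
    """
    if '.' in token or token.isdigit():
        # chord: split back into integer pitch classes
        return [int(n) for n in token.split('.')]
    # single note: keep the pitch name string
    return token

print(decode_element("0.4.7"))  # -> [0, 4, 7]
print(decode_element("C4"))     # -> C4
```

The same check (dot or all digits) can later be used on the model's generated tokens to decide whether to build a music21 chord or a single note.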
The next step is to create input sequences for the model and the corresponding outputs as shown in the figure below.
The model outputs a note or chord for each input sequence; as the target we use the first note or chord that follows the input sequence in our list of notes. To complete the final step of data preparation for our network, we one-hot encode the outputs, turning each target into a categorical vector that the network can predict.
This is completed with the following code:
import numpy
from keras.utils import np_utils

sequence_length = 100
# get all pitch names
pitchnames = sorted(set(item for item in notes))
n_vocab = len(pitchnames)
# create a dictionary to map pitches to integers
note_to_int = dict((note, number) for number, note in enumerate(pitchnames))
network_input = []
network_output = []
# create input sequences and the corresponding outputs
for i in range(0, len(notes) - sequence_length, 1):
    sequence_in = notes[i:i + sequence_length]
    sequence_out = notes[i + sequence_length]
    network_input.append([note_to_int[char] for char in sequence_in])
    network_output.append(note_to_int[sequence_out])
n_patterns = len(network_input)
# reshape the input into a format compatible with LSTM layers
network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
# normalize input to the range 0-1
network_input = network_input / float(n_vocab)
# one-hot encode the output
network_output = np_utils.to_categorical(network_output)
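To see what this preparation produces, here is a minimal sketch on a tiny made-up notes list (the toy tokens and the short sequence length are assumptions for illustration; the one-hot step uses a NumPy identity matrix, which is equivalent to np_utils.to_categorical):

```python
import numpy

# toy data standing in for the notes extracted from the MIDI files
notes = ['C4', 'E4', 'G4', 'C4', 'E4', 'G4', 'C4', 'E4']
sequence_length = 3

pitchnames = sorted(set(notes))
note_to_int = {n: i for i, n in enumerate(pitchnames)}
n_vocab = len(pitchnames)  # 3 distinct pitches

network_input, network_output = [], []
for i in range(len(notes) - sequence_length):
    network_input.append([note_to_int[n] for n in notes[i:i + sequence_length]])
    network_output.append(note_to_int[notes[i + sequence_length]])

n_patterns = len(network_input)
# (patterns, timesteps, features), scaled to 0-1 for the LSTM
network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
network_input = network_input / float(n_vocab)
# one-hot encode the targets
network_output = numpy.eye(n_vocab)[network_output]

print(network_input.shape)   # (5, 3, 1)
print(network_output.shape)  # (5, 3)
```

Each of the 5 training patterns is a window of 3 consecutive tokens, and each one-hot target row has a single 1 at the index of the note that follows the window.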