
Encoding & Decoding Guide

How The Jam Machine converts MIDI files to text tokens and back.


What’s in a MIDI File

A MIDI file contains:

The Jam Machine encodes all of this into a flat text sequence that a language model can learn.


The Encoding Pipeline

MIDI File
  → miditok extracts events (Note-On, Time-Shift, etc.)
  → Velocity is removed (not used by the model)
  → Time shifts are normalized and quantized
  → Bar markers are added every 4 beats
  → Note density is computed per bar
  → Instruments are mapped to 16 families
  → Events are serialized to text tokens

Step by step:

1. Extract events. The miditok library converts each MIDI instrument track into a sequence of events: Note-On, Note-Off, Time-Shift, etc.

2. Remove velocity. Velocity (how hard a note is struck) is stripped out to reduce vocabulary size. All generated notes use a default velocity.

3. Quantize time. Time shifts are quantized to 4 steps per beat. This means the finest resolution is a 16th note. Anything shorter (grace notes, humanized timing) is rounded to the nearest step.

4. Add bar markers. BAR_START and BAR_END tokens are inserted every 4 beats (one bar in 4/4 time).

5. Compute density. Each bar gets a DENSITY value (0-3) based on how many notes it contains. This gives the model a high-level knob for “how busy” a bar should be.

6. Map instruments to families. MIDI has 128 instrument programs. The model groups them into 16 families:

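| Family | Name | MIDI Programs |
| --- | --- | --- |
| 0 | Piano | 0-7 |
| 1 | Chromatic Percussion | 8-15 |
| 2 | Organ | 16-23 |
| 3 | Guitar | 24-31 |
| 4 | Bass | 32-39 |
| 5 | Strings | 40-47 |
| 6 | Ensemble | 48-55 |
| 7 | Brass | 56-63 |
| 8 | Reed | 64-71 |
| 9 | Pipe | 72-79 |
| 10 | Synth Lead | 80-87 |
| 11 | Synth Pad | 88-95 |
| 12 | Synth Effects | 96-103 |
| 13 | Ethnic | 104-111 |
| 14 | Percussive | 112-119 |
| 15 | Sound Effects | 120-127 |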

Drums are a special case: they use INST=DRUMS instead of a family number.

7. Serialize to text. Each event becomes a text token. The full piece is a single string of space-separated tokens.
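Steps 6 and 7 can be sketched in a few lines. This is a minimal illustration, not The Jam Machine's actual code: the function names `family_token` and `serialize` are hypothetical, and the only facts taken from this guide are the 8-programs-per-family table, the INST=DRUMS special case, and the space-separated output format.

```python
from typing import List, Tuple


def family_token(program: int, is_drum: bool = False) -> str:
    """Return the INST token for a MIDI program number (0-127)."""
    if is_drum:
        # Drums bypass the family table entirely.
        return "INST=DRUMS"
    if not 0 <= program <= 127:
        raise ValueError(f"MIDI program out of range: {program}")
    # Each family in the table spans 8 consecutive programs.
    return f"INST={program // 8}"


def serialize(events: List[Tuple[str, int]]) -> str:
    """Join (name, value) events into one space-separated token string."""
    return " ".join(f"{name}={value}" for name, value in events)


# Hypothetical one-bar bass track, laid out like the worked example below.
track = [
    "TRACK_START",
    family_token(33),
    "DENSITY=1",
    "BAR_START",
    serialize([("NOTE_ON", 40), ("TIME_DELTA", 4), ("NOTE_OFF", 40)]),
    "BAR_END",
    "TRACK_END",
]
text = " ".join(track)
print(text)
```

Integer division by 8 reproduces every row of the family table, so no lookup table is needed for the encoding direction.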


Token Vocabulary

Structure tokens

| Token | Meaning |
| --- | --- |
| PIECE_START | Beginning of a piece |
| TRACK_START | Beginning of an instrument track |
| TRACK_END | End of an instrument track |
| BAR_START | Beginning of a bar (4 beats) |
| BAR_END | End of a bar |

Metadata tokens

| Token | Meaning | Values |
| --- | --- | --- |
| INST=<n> | Instrument family | 0-15 or DRUMS |
| DENSITY=<n> | Note density of the bar | 0-3 |
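To illustrate how a note count could be bucketed into the four DENSITY levels, here is a small sketch. The thresholds are invented for illustration; this guide does not state the real boundaries, and `density_token` is a hypothetical name.

```python
from bisect import bisect_right

# HYPOTHETICAL bucket boundaries (notes per bar). The real thresholds
# used by The Jam Machine are not given in this guide.
DENSITY_THRESHOLDS = (4, 8, 12)


def density_token(note_count: int) -> str:
    """Map a bar's note count to a DENSITY token with a level from 0 to 3."""
    if note_count < 0:
        raise ValueError("note_count must be non-negative")
    # bisect_right counts how many thresholds the note count has reached.
    level = bisect_right(DENSITY_THRESHOLDS, note_count)
    return f"DENSITY={level}"


for count in (0, 5, 9, 20):
    print(count, density_token(count))
```

With these assumed boundaries, an empty bar maps to DENSITY=0 and any bar with 12 or more notes maps to DENSITY=3.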

Note tokens

| Token | Meaning | Values |
| --- | --- | --- |
| NOTE_ON=<pitch> | Start playing a note | 0-127 (MIDI pitch) |
| NOTE_OFF=<pitch> | Stop playing a note | 0-127 |
| TIME_DELTA=<steps> | Wait before next event | 1-16 (4 steps = 1 beat) |

The total vocabulary is ~300 tokens.
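The figure can be sanity-checked by enumerating the tokens listed in the tables above. The sketch below only counts what this guide documents; the trained tokenizer may contain additional entries (for example special tokens), which is an assumption and would account for any difference.

```python
# Enumerate every token described in the three tables above.
structure = ["PIECE_START", "TRACK_START", "TRACK_END", "BAR_START", "BAR_END"]
instruments = [f"INST={n}" for n in range(16)] + ["INST=DRUMS"]
densities = [f"DENSITY={n}" for n in range(4)]
note_on = [f"NOTE_ON={pitch}" for pitch in range(128)]
note_off = [f"NOTE_OFF={pitch}" for pitch in range(128)]
time_delta = [f"TIME_DELTA={steps}" for steps in range(1, 17)]

vocab = structure + instruments + densities + note_on + note_off + time_delta
print(len(vocab))  # 5 + 17 + 4 + 128 + 128 + 16 = 298
```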


Worked Example: Reptilia Drums

Here’s the actual encoded output for the first bar of drums from The Strokes - Reptilia:

TRACK_START
INST=DRUMS
DENSITY=1
BAR_START
  TIME_DELTA=2        ← wait half a beat
  NOTE_ON=35          ← kick drum (GM note 35)
  NOTE_OFF=35         ← release kick
  NOTE_ON=40          ← electric snare (GM note 40)
  NOTE_OFF=40         ← release snare
  NOTE_ON=40          ← snare again
  NOTE_OFF=40
  TIME_DELTA=4        ← wait one full beat
  NOTE_ON=35          ← kick drum
  TIME_DELTA=2        ← wait half a beat
  NOTE_OFF=35
  TIME_DELTA=2        ← wait half a beat
  NOTE_ON=40          ← snare
  TIME_DELTA=2
  NOTE_OFF=40
BAR_END

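Read as a timeline: the bar starts with a half-beat rest, then a kick and a snare hit together, with a second snare on the same step. After a full-beat rest comes another kick held for half a beat, then a half-beat rest and a final snare held for half a beat. It is a classic rock drum pattern.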
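The same reading can be done mechanically by accumulating TIME_DELTA values and recording the step at which each NOTE_ON occurs. This is an illustrative parser, not the project's decoder; `note_onsets` is a hypothetical helper name.

```python
# The bar above with annotations removed.
BAR_TOKENS = """
TIME_DELTA=2 NOTE_ON=35 NOTE_OFF=35 NOTE_ON=40 NOTE_OFF=40 NOTE_ON=40 NOTE_OFF=40
TIME_DELTA=4 NOTE_ON=35 TIME_DELTA=2 NOTE_OFF=35 TIME_DELTA=2 NOTE_ON=40
TIME_DELTA=2 NOTE_OFF=40
""".split()


def note_onsets(tokens):
    """Return [(step, pitch), ...] for every NOTE_ON, plus total steps elapsed."""
    step = 0
    onsets = []
    for token in tokens:
        name, _, value = token.partition("=")
        if name == "TIME_DELTA":
            step += int(value)  # advance the clock
        elif name == "NOTE_ON":
            onsets.append((step, int(value)))
    return onsets, step


onsets, total_steps = note_onsets(BAR_TOKENS)
print(onsets)       # [(2, 35), (2, 40), (2, 40), (6, 35), (10, 40)]
print(total_steps)  # 12
```

The TIME_DELTA values add up to 12 steps, so the final beat of the 16-step bar is not written out explicitly in the token stream.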


Quantization Caveats

The encoding is lossy. Here’s what’s lost:

Timing resolution. Time is quantized to 4 steps per beat (16th-note grid). Finer timing, such as guitar strums where strings are hit in rapid succession, grace notes, and humanized timing offsets, is rounded to the nearest step. Offsets that round to zero steps are discarded entirely (the TIME_DELTA=0 tokens are dropped).
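A rough sketch of this rounding is shown below. It assumes offsets are measured in MIDI ticks with 480 ticks per beat and that each delta is rounded independently; both are assumptions for illustration, and the function names are hypothetical. The actual pipeline may quantize absolute positions instead.

```python
import math

STEPS_PER_BEAT = 4  # from this guide: a 16th-note grid


def quantize_delta(ticks: int, ticks_per_beat: int = 480) -> int:
    """Round a tick offset to the nearest grid step (exact halves round up)."""
    return math.floor(ticks * STEPS_PER_BEAT / ticks_per_beat + 0.5)


def time_delta_tokens(tick_deltas, ticks_per_beat: int = 480):
    """Convert tick offsets to TIME_DELTA tokens, dropping zero-step offsets."""
    tokens = []
    for ticks in tick_deltas:
        steps = quantize_delta(ticks, ticks_per_beat)
        if steps > 0:  # TIME_DELTA=0 is never emitted
            tokens.append(f"TIME_DELTA={steps}")
    return tokens


# One beat, a slightly late half beat, a grace-note gap, a loose 16th.
print(time_delta_tokens([480, 250, 30, 115]))
# ['TIME_DELTA=4', 'TIME_DELTA=2', 'TIME_DELTA=1']
```

The 30-tick gap rounds to zero steps and disappears, which is how a strummed chord collapses onto a single step.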

Velocity. All note velocities are stripped during encoding and replaced with a default value during decoding.

Instrument specificity. A specific MIDI program (e.g., program 33, Electric Bass (finger)) becomes family 4 (Bass). During decoding, a random program from that family is assigned back, so the decoded instrument may differ from the original.
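The decoding direction can be sketched as follows. This assumes a uniform random choice among the family's 8 programs, which the guide does not confirm; `random_program` is a hypothetical name.

```python
import random


def random_program(family: int, rng: random.Random) -> int:
    """Pick a MIDI program (0-127) from the given family (0-15)."""
    if not 0 <= family <= 15:
        raise ValueError(f"family out of range: {family}")
    first = family * 8  # first program of the family
    # ASSUMPTION: every program in the family is equally likely.
    return rng.randrange(first, first + 8)


rng = random.Random(0)  # seeded only to make the demo repeatable
program = random_program(4, rng)
print(program, program // 8)  # some program in 32-39, and family 4
```

Re-encoding the chosen program always yields the same family, so an encode, decode, encode round trip is stable at the family level even though the exact program is not.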

This is by design. The quantization reduces the vocabulary size and makes patterns easier for the model to learn. The trade-off is that the decoded MIDI won’t perfectly reproduce the original, but the musical content (which notes, when on the 16th-note grid, and which instrument families) is preserved.

The resolution is fixed by the trained model’s vocabulary. Changing the quantization would require retraining the model on a new dataset encoded with the new resolution.


Decoding: Text Back to MIDI

The reverse pipeline:

Text Tokens
  → Parse tokens back to events
  → Reconstruct time shifts from TIME_DELTA values
  → Fill missing time shifts at bar boundaries
  → Add default velocity to all notes
  → Map instrument families back to MIDI programs
  → Assemble into a MIDI file via miditok

The decoder handles edge cases like bars with no notes (empty density=0 bars) and over-quantized events that exceed the bar length.
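As an illustration of the bar-boundary fill, the sketch below pads a bar whose TIME_DELTA values fall short of 16 steps. It is a simplified assumption about how the fill could work, with a hypothetical `pad_bar` helper. Bars that exceed 16 steps are returned unchanged here, because this guide does not say how the decoder resolves them.

```python
BAR_STEPS = 16  # 4 beats x 4 steps per beat


def pad_bar(tokens):
    """Append a TIME_DELTA so the bar's time shifts total BAR_STEPS."""
    used = sum(
        int(token.split("=", 1)[1])
        for token in tokens
        if token.startswith("TIME_DELTA=")
    )
    remaining = BAR_STEPS - used
    padded = list(tokens)
    if remaining > 0:
        # remaining is at most 16, so it fits in a single TIME_DELTA token.
        padded.append(f"TIME_DELTA={remaining}")
    return padded


print(pad_bar(["NOTE_ON=35", "TIME_DELTA=12", "NOTE_OFF=35"]))
# ['NOTE_ON=35', 'TIME_DELTA=12', 'NOTE_OFF=35', 'TIME_DELTA=4']
print(pad_bar([]))  # an empty bar becomes one full-bar wait
# ['TIME_DELTA=16']
```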


Piano Roll: Decoded Reptilia

Here’s the piano roll of the first 32 bars of The Strokes - Reptilia (from the original MIDI):

[Figure: piano roll of the first 32 bars of Reptilia]

You can see the arrangement structure: drums and bass enter around bar 8, the 2nd guitar plays sustained chords, and the 1st guitar has a rhythmic riff pattern. This is the kind of musical structure the model learns to reproduce.

Split instruments

Some MIDI files have a single instrument split across multiple tracks (common with guitar overdubs). Here’s how Reptilia’s 1st Guitar appears across two tracks:

[Figure: Reptilia’s 1st Guitar split across two tracks]

Track 0 carries the main part (726 notes spanning the full song), while Track 1 has just 5 notes in a short section — likely a brief harmony or overdub.

The 2nd Guitar shows a similar pattern — a main track with 3498 notes and a secondary track with 166 notes appearing in two sections:

[Figure: Reptilia’s 2nd Guitar split across two tracks]

