This page explores how The Jam Machine's GPT-2 model internally represents MIDI music tokens. All charts are interactive — hover for details, zoom, and pan.
Generated from JammyMachina/elec-gmusic-familized-model-13-12__17-35-53 using the
jammy.analysis module.
Each dot is a token from the model's vocabulary (301 tokens). Position comes from t-SNE dimensionality reduction of the 512-dimensional embeddings down to 2, so tokens the model considers similar are placed close together. Hover over any point to see the token name.
What to look for: The trained model clusters instruments (blue) tightly and places time deltas (red) in their own group, while notes (black) spread across the space. The untrained model's map is random noise.
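A minimal sketch of how such a map can be produced, assuming scikit-learn is available. Random vectors stand in for the model's 301 x 512 embedding matrix; in the real pipeline the matrix would come from the checkpoint's token-embedding layer (e.g. `model.transformer.wte.weight` for a Hugging Face GPT-2 model).

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the token-embedding matrix (vocab_size x hidden_size).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(301, 512))

# Project 512-D embeddings down to 2-D; similar tokens land close together.
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)  # one (x, y) point per vocabulary token
```

Each row of `coords` becomes one dot in the scatter plot, colored by its token's category.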
Each column is a token (grouped by category), each row is one of 512 embedding dimensions. The trained model shows visible patterns within categories. The untrained model is uniform noise. Hover to see individual token values.
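Grouping the heatmap's columns by category is a small reordering step. A sketch, with hypothetical token names standing in for the real vocabulary (the actual names come from the model's tokenizer):

```python
import numpy as np

# Hypothetical token names illustrating the vocabulary's categories.
tokens = ["PIECE_START", "TRACK_START", "INST=DRUMS", "INST=0",
          "DENSITY=1", "DENSITY=2", "NOTE_ON=36", "NOTE_OFF=36",
          "TIME_DELTA=1", "TIME_DELTA=4", "BAR_START", "BAR_END"]

def category(name):
    # Group by the part before "=", so INST=DRUMS and INST=0 share a group.
    return name.split("=")[0]

# A stable sort keyed on category keeps same-category tokens adjacent.
order = sorted(range(len(tokens)), key=lambda i: category(tokens[i]))

rng = np.random.default_rng(0)
emb = rng.normal(size=(512, len(tokens)))   # rows: dimensions, columns: tokens
heatmap = emb[:, order]                     # columns reordered by category
print([tokens[i] for i in order])
```

The real heatmap is this matrix with the model's actual 301 token columns, one pixel per cell.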
For each token in a drum sequence, what does the model predict comes next? Bars show the top 10 most likely tokens.
What to look for: After PIECE_START, the model predicts TRACK_START with near certainty. After INST=DRUMS, it predicts DENSITY tokens. After BAR_START, it predicts drum notes. The model has learned the grammar of MIDI text.
Same sequence, two models. The trained model makes sharp, confident predictions. The untrained model spreads probability nearly uniformly.
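The top-10 bars and the sharp-versus-uniform contrast both come straight from the next-token distribution. A sketch with random logits standing in for real model output (in the real pipeline the logits come from running the model over the drum sequence at each position):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; a uniform distribution maximizes it.
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
vocab_size = 301

# Stand-ins: large-scale logits mimic a confident trained model,
# near-zero logits mimic an untrained one.
trained_logits = rng.normal(size=vocab_size) * 5.0
untrained_logits = rng.normal(size=vocab_size) * 0.1

p_trained = softmax(trained_logits)
p_untrained = softmax(untrained_logits)

# Top 10 most likely next tokens, as shown in the bar chart.
top10 = np.argsort(p_trained)[::-1][:10]
print(top10, entropy(p_trained), entropy(p_untrained))
```

The entropy gap is what the side-by-side chart visualizes: the trained model's distribution is peaked, the untrained one's is close to uniform over all 301 tokens.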
Each cell shows how much one token attends to another (darker = more attention). Hover for exact weights.
What to look for: Layer 0 shows a strong diagonal (self-attention). The last layer shows vertical columns on structural tokens (DENSITY, INST) — the model looks back at these anchors regardless of distance.
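A sketch of where such a matrix comes from, using a randomly generated causal attention layer as a stand-in. With Hugging Face transformers the real weights would be `model(ids, output_attentions=True).attentions[layer][0]`, shaped (heads, seq, seq):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_heads = 16, 8

# Random scores stand in for one layer's pre-softmax attention scores.
scores = rng.normal(size=(n_heads, seq_len, seq_len))
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # no looking ahead
scores[:, mask] = -np.inf

# Row-wise softmax: each token's attention over its visible context sums to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The heatmap plots the head-averaged matrix: cell (i, j) is how much
# token i attends to token j.
attn = weights.mean(axis=0)
print(attn.shape)
```

The causal mask is why the heatmap is lower-triangular: a token can only attend to itself and earlier tokens.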
Where does the last token (BAR_END) look for information?
What to look for: The early layer spreads attention evenly; the late layer concentrates on specific tokens. The model has learned which tokens carry the most useful information for predicting what comes next.
How attention from the last token builds up through all 6 layers. Each row is a layer, columns are source tokens, darker = stronger attention. Hover for exact weights.
What to look for: Attention shifts as you go deeper. The final layer focuses on the tokens most relevant for the next prediction.
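Building that layers-by-tokens heatmap amounts to taking the last token's head-averaged attention row from every layer and stacking them. A sketch with randomly generated causal attention standing in for the model's real `attentions` tuple:

```python
import numpy as np

def random_causal_attention(rng, n_heads, seq_len):
    # Stand-in for one layer's softmaxed attention, shape (heads, seq, seq).
    scores = rng.normal(size=(n_heads, seq_len, seq_len))
    scores[:, np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_layers, n_heads, seq_len = 6, 8, 16
attentions = [random_causal_attention(rng, n_heads, seq_len)
              for _ in range(n_layers)]

# For each layer: average over heads, then keep the LAST token's row,
# i.e. how much it looks at every token in the sequence.
last_token_rows = np.stack([layer.mean(axis=0)[-1] for layer in attentions])
print(last_token_rows.shape)  # (layers, source tokens)
```

Each row of `last_token_rows` sums to 1, so rows are directly comparable across layers.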
Each of the model's 48 attention heads (6 layers x 8 heads) can specialize in tracking different aspects of music. Here we show the two heads whose attention profiles differ the most.
What to look for: Bar lengths show what proportion of attention each head pays to each token category. This specialization emerges from training — different heads learn complementary roles.
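One way to compute such per-head category profiles and pick the most dissimilar pair, sketched with random attention rows and hypothetical category labels standing in for the real sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, seq_len = 6, 8, 12

# Hypothetical category label for each position in the analysed sequence.
categories = ["structure", "instrument", "density", "note", "note", "time",
              "note", "note", "time", "note", "time", "structure"]
cat_names = sorted(set(categories))

# Stand-in: one normalized attention row per head (48 heads total), giving
# how each head spreads the last token's attention over the sequence.
rows = rng.random(size=(n_layers * n_heads, seq_len))
rows /= rows.sum(axis=1, keepdims=True)

# Per-head profile: total attention mass paid to each token category.
labels = np.array(categories)
profiles = np.stack([
    [row[labels == c].sum() for c in cat_names]
    for row in rows
])

# Pick the pair of heads whose profiles differ most (Euclidean distance here;
# the page's exact dissimilarity measure may differ).
d = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=-1)
a, b = np.unravel_index(np.argmax(d), d.shape)
print(cat_names, a, b)
```

Each profile sums to 1, so the bar lengths for any head can be read directly as proportions.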
Generated by scripts/build-analysis-page.py