Exploring the Embeddings and Attention

This page explores how The Jam Machine's GPT-2 model internally represents MIDI music tokens. All charts are interactive — hover for details, zoom, and pan.

Generated from JammyMachina/elec-gmusic-familized-model-13-12__17-35-53 using the jammy.analysis module.

Back to home

Token Embedding Space (t-SNE)

Each dot is a token from the model's vocabulary (301 tokens). Position is determined by t-SNE, which reduces the 512-dimensional embeddings to 2-D. Tokens the model considers similar are placed close together. Hover over any point to see the token name.

What to look for: The trained model clusters instruments (blue) tightly, time deltas (red) in their own group, and notes (black) spread across the space. The untrained model is random noise.
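The reduction step can be sketched as follows. This is a minimal, illustrative version using scikit-learn's t-SNE on random vectors as a stand-in for the real embedding matrix (the actual model has 301 tokens of 512 dimensions; smaller sizes are used here only for speed):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embedding matrix: random vectors in place of the
# model's real 301 x 512 token embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 32))

# Reduce to 2-D; perplexity must be smaller than the number of tokens.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (60, 2)
```

Each row of `coords` is then plotted as one dot, colored by the token's category.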


Embedding Matrix: Trained vs Untrained

Each column is a token (grouped by category), each row is one of 512 embedding dimensions. The trained model shows visible patterns within categories. The untrained model is uniform noise. Hover to see individual token values.
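The "visible patterns within categories" can be quantified: tokens in the same category of a trained model point in similar directions, so their average pairwise cosine similarity is high, while an untrained matrix has near-zero similarity. A small sketch with simulated matrices (a shared category direction plus noise stands in for "trained"; pure noise stands in for "untrained"):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, per_cat = 64, 20

# "Trained" stand-in: tokens of one category share a common direction.
center = rng.normal(size=dim)
trained = center + 0.5 * rng.normal(size=(per_cat, dim))
# "Untrained" stand-in: independent random vectors.
untrained = rng.normal(size=(per_cat, dim))

def mean_cosine(mat):
    # Average cosine similarity over all off-diagonal pairs.
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(mat)
    return (sims.sum() - n) / (n * (n - 1))

print(mean_cosine(trained) > mean_cosine(untrained))  # True
```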


Next Token Predictions

For each token in a drum sequence, what does the model predict will come next? Bars show the 10 most likely next tokens.

What to look for: After PIECE_START, the model predicts TRACK_START with near certainty. After INST=DRUMS, it predicts DENSITY tokens. After BAR_START, it predicts drum notes. The model has learned the grammar of MIDI text.
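Producing such a chart comes down to turning the model's logits at each position into probabilities and keeping the top k. A hedged sketch with a toy vocabulary and random logits standing in for real model output:

```python
import numpy as np

# Toy vocabulary and random logits; the real chart uses the model's
# 301-token vocabulary and the logits it emits at each position.
vocab = ["TRACK_START", "INST=DRUMS", "DENSITY=2", "BAR_START", "NOTE_ON=36", "TIME_DELTA=4"]
rng = np.random.default_rng(0)
logits = rng.normal(size=len(vocab))

# Softmax (shifted by the max for numerical stability), then top-k.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)[:3]
```

`top` holds the (token, probability) pairs that become the bars, highest first.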


Predictions: Trained vs Untrained

Same sequence, two models. The trained model makes sharp, confident predictions. The untrained model spreads probability nearly uniformly.
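"Sharp" versus "spread out" has a standard measure: entropy. A confident distribution has low entropy; a uniform one has the maximum. A minimal sketch with two hand-picked example distributions:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats; small epsilon guards against log(0).
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# A sharp "trained"-style distribution vs a uniform "untrained"-style one.
sharp = [0.9, 0.05, 0.03, 0.02]
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(sharp) < entropy(uniform))  # True
```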


Attention Patterns: Early vs Late Layer

Each cell shows how much one token attends to another (darker = more attention). Hover for exact weights.

What to look for: Layer 0 shows a strong diagonal — each token attends mostly to itself and its immediate neighbors. The last layer shows vertical columns on structural tokens (DENSITY, INST) — the model looks back at these anchors regardless of distance.
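Each heatmap is a row-stochastic matrix: raw attention scores are causally masked (a token cannot attend to the future) and softmaxed so every query row sums to 1. A minimal sketch with random scores in place of the model's:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 8
scores = rng.normal(size=(seq_len, seq_len))

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax turns scores into the attention weights that get plotted.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(np.allclose(weights.sum(axis=1), 1.0))  # True
```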


Attention Flow: Early vs Late

Where does the last token (BAR_END) look for information?

What to look for: Early layer spreads attention evenly. Late layer concentrates on specific tokens — the model has learned which tokens carry the most useful information for predicting what comes next.
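The flow chart is just one slice of the attention tensor: average over heads, then take the row belonging to the last query position. A sketch with a random stand-in tensor (shapes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, seq_len = 8, 10
# Stand-in attention tensor for one layer, shape (heads, query, key),
# with each query row normalized to sum to 1.
attn = rng.random(size=(n_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)

# Average over heads, then take the last query position (e.g. BAR_END).
flow = attn.mean(axis=0)[-1]
print(np.isclose(flow.sum(), 1.0))  # True
```

`flow[i]` is then drawn as the strength of the arrow from token i to the last token.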


Information Flow Across Layers

How attention from the last token builds up through all 6 layers. Each row is a layer, columns are source tokens, darker = stronger attention. Hover for exact weights.

What to look for: Attention shifts as you go deeper. The final layer focuses on the tokens most relevant for the next prediction.
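Stacking that last-token slice for every layer yields the grid shown here: one row per layer, one column per source token. A sketch with random stand-in tensors matching the model's 6 layers and 8 heads:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, n_heads, seq_len = 6, 8, 10
# Stand-in per-layer attention tensors, rows normalized to sum to 1.
layers = rng.random(size=(n_layers, n_heads, seq_len, seq_len))
layers /= layers.sum(axis=-1, keepdims=True)

# One row per layer: head-averaged attention from the last token.
stacked = layers.mean(axis=1)[:, -1, :]
print(stacked.shape)  # (6, 10)
```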


Head Specialization

Each of the model's 48 attention heads (6 layers × 8 heads) can specialize in tracking different aspects of music. Here we show the two heads with the most different attention profiles.

What to look for: Bar lengths show what proportion of attention each head pays to each token category. This specialization emerges from training — different heads learn complementary roles.
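One way to compute such profiles (the category labels and the L1-distance criterion below are assumptions for illustration, not necessarily what `jammy.analysis` does): sum each head's attention mass per token category, normalize, and pick the pair of heads whose profiles are farthest apart.

```python
import numpy as np

rng = np.random.default_rng(3)
n_heads, seq_len, n_cats = 48, 10, 4
# Hypothetical category label (0-3) for each token in a short sequence.
categories = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 3])
# Stand-in attention for all 48 heads, rows normalized to sum to 1.
attn = rng.random(size=(n_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)

# Profile: share of each head's total attention landing on each category.
profiles = np.stack([
    np.array([attn[h][:, categories == c].sum() for c in range(n_cats)])
    for h in range(n_heads)
])
profiles /= profiles.sum(axis=1, keepdims=True)

# The "most different" pair maximizes the L1 distance between profiles.
dists = np.abs(profiles[:, None] - profiles[None, :]).sum(-1)
a, b = np.unravel_index(dists.argmax(), dists.shape)
```

`profiles[a]` and `profiles[b]` are the two bar charts shown above.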


Generated by scripts/build-analysis-page.py