Cornell CS 4782 / 5782 · Extra Credit Demo · yc2998

Rotary Position Embedding: position as angle.

RoPE encodes token position by rotating query & key vectors in a plane — not by adding to them. This page unpacks why, one panel at a time.

00

Why do we need this at all?

Raw attention is order-blind. It treats a sentence as a set, not a sequence — so "the dog bit the man" and "the man bit the dog" look identical to it. Before we fix it with rotations, let's see the bug with our own eyes.

[Interactive demo] Tokens above, attention matrix below; darker cell = the row token attends more to the column token. Shuffle the tokens and notice the attention values just permute around. Turn RoPE on, then shuffle again: now the values actually change.
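The same experiment as a minimal numpy sketch (the random embeddings and weight matrices here are stand-ins, made up for illustration): with no positional signal, shuffling the tokens only permutes the attention matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                          # 5 tokens, 8-dim embeddings
X = rng.normal(size=(T, d))          # stand-in token embeddings
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))

def attention(X):
    """softmax(Q K^T / sqrt(d)) with no positional encoding."""
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=-1, keepdims=True)

perm = rng.permutation(T)                     # shuffle the "sentence"
A, A_shuf = attention(X), attention(X[perm])

# The shuffled attention matrix is exactly the original with rows/cols permuted:
print(np.allclose(A_shuf, A[np.ix_(perm, perm)]))   # True
```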
the setup

Position info has to come from somewhere. The original transformer added a position vector to each token embedding. RoPE rotates the queries and keys instead, and that change is what makes attention care about relative position rather than absolute indices.

01

The rotation idea

Think of a 2-D query vector $\mathbf{q}\in\mathbb{R}^2$. RoPE at position $m$ rotates it by angle $m\theta$: $$\mathbf{q}_m \;=\; R(m\theta)\,\mathbf{q} \;=\;\begin{bmatrix}\cos m\theta&-\sin m\theta\\ \sin m\theta&\cos m\theta\end{bmatrix}\mathbf{q}$$ The position is the amount of rotation. Nothing more.
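A minimal numpy sketch of that single rotation (the `rotate` helper name is just for illustration):

```python
import numpy as np

def rotate(q, m, theta):
    """Rotate a 2-D vector q by angle m*theta: RoPE in a single plane."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return R @ q

q = np.array([1.0, 0.0])
theta = 0.1
print(rotate(q, 0, theta))   # position 0: unchanged
print(rotate(q, 5, theta))   # position 5: same length, direction turned by 0.5 rad
```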

[Interactive demo] Query $\mathbf{q}$ at position $m$; a slider sets $m$.
aha

Position $m$ becomes angle $m\theta$. Farther tokens = bigger rotation. Nothing is added to the vector — only its direction changes.

02

Many frequencies at once

Real embeddings aren't 2-D. RoPE chops a $d$-dim vector into $d/2$ planes, and each plane gets its own frequency: $$\theta_i \;=\; 10000^{-2i/d},\qquad i=0,1,\dots,d/2-1$$ Low-index planes rotate fast (fine-grained position), high-index planes rotate slowly (coarse).
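Here is one way this might look in numpy, assuming the common convention of pairing dimensions $(2i, 2i{+}1)$ into plane $i$ (the `rope` helper name is illustrative):

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Apply RoPE at position m to a vector x of even dimension d."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)      # theta_i = 10000^{-2i/d}
    angles = m * theta                # plane i rotates by m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]  # plane i = (x[2i], x[2i+1])
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

x = np.arange(8, dtype=float)
print(rope(x, m=3))   # plane 0 rotates fastest, plane 3 barely moves
```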

[Interactive demo] $d=8$, so $4$ rotating planes, one per pair of dimensions; a slider sets position $m$.
aha

RoPE is multi-scale. Fast planes (left) track short distances; slow planes (right) track long ones. One embedding, many clocks.

03

Relative position, for free

Here's the punchline. If you rotate $\mathbf{q}$ by $m\theta$ and $\mathbf{k}$ by $n\theta$, their dot product is: $$\langle R(m\theta)\mathbf{q},\; R(n\theta)\mathbf{k}\rangle \;=\; \langle\mathbf{q},\; R((n-m)\theta)\mathbf{k}\rangle$$ It depends only on $n-m$. Absolute positions cancel out. Slide both $m$ and $n$ by the same amount — nothing changes.
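A quick numerical check of the identity in a single plane, as a numpy sketch (the positions and vectors are arbitrary):

```python
import numpy as np

def R(angle):
    """2-D rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.1

def score(m, n):
    return (R(m * theta) @ q) @ (R(n * theta) @ k)

# Slide both positions by the same amount: only n - m matters.
print(np.isclose(score(3, 10), score(103, 110)))         # True  (n - m = 7 in both)
print(np.isclose(score(3, 10), q @ (R(7 * theta) @ k)))  # True  (the identity above)
print(np.isclose(score(3, 10), score(3, 11)))            # False (different offset)
```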

[Interactive demo] Panels show $\mathbf{q}_m$, $\mathbf{k}_n$, and $\langle\mathbf{q}_m,\mathbf{k}_n\rangle$ plotted over the offset $m-n$; sliders set the query position $m$ and the key position $n$, with readouts for the distance $m-n$ and the dot product. Turn RoPE on and drag either slider: the dot product holds perfectly still.
aha

This is the whole reason RoPE works. The model never sees "token #742" — it sees "the other token is 3 positions away". That's what translation-equivariant attention should do.