RoPE encodes token position by rotating query & key vectors in a plane — not by adding to them. This page unpacks why, one panel at a time.
Raw attention is order-blind. It treats a sentence as a set, not a sequence — so "the dog bit the man" and "the man bit the dog" look identical to it. Before we fix it with rotations, let's see the bug with our own eyes.
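A minimal NumPy sketch of that blindness (the embeddings are random toy stand-ins, not a trained model): swap "dog" and "man" and the attention scores are merely reshuffled, never changed.

```python
import numpy as np

# Toy embeddings for "the", "dog", "bit", "man" -- random stand-ins, not a trained model.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["the", "dog", "bit", "man"]}

def attention_scores(tokens):
    """Plain dot-product attention with no position information."""
    X = np.stack([vocab[t] for t in tokens])              # (seq, dim)
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

a = attention_scores(["the", "dog", "bit", "the", "man"])
b = attention_scores(["the", "man", "bit", "the", "dog"])

# Swapping "dog" and "man" only permutes rows and columns of the score matrix:
perm = [0, 4, 2, 3, 1]
print(np.allclose(b, a[np.ix_(perm, perm)]))   # True -- attention saw a set, not a sequence
```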
Position info has to come from somewhere. The original transformer added a sinusoidal position vector to each token embedding. RoPE rotates instead, and that turns out to be the better move: rotation is what makes attention depend only on relative position, as the final panel shows.
Think of a 2-D query vector $\mathbf{q}\in\mathbb{R}^2$. RoPE at position $m$ rotates it by angle $m\theta$: $$\mathbf{q}_m \;=\; R(m\theta)\,\mathbf{q} \;=\;\begin{bmatrix}\cos m\theta&-\sin m\theta\\ \sin m\theta&\cos m\theta\end{bmatrix}\mathbf{q}$$ The position is the amount of rotation. Nothing more.
Position $m$ becomes angle $m\theta$. Farther tokens = bigger rotation. Nothing is added to the vector — only its direction changes.
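A tiny sketch of that single-plane rotation, with $\theta = 0.5$ picked arbitrarily for illustration: the vector's length never changes, only its direction.

```python
import numpy as np

def rope_2d(q, m, theta=0.5):
    """Rotate a 2-D query q by m * theta (theta = 0.5 is an arbitrary illustrative frequency)."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return R @ q

q = np.array([1.0, 0.0])
for m in range(4):
    qm = rope_2d(q, m)
    print(m, qm.round(3), np.linalg.norm(qm).round(3))  # norm stays 1.0: only the direction moves
```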
Real embeddings aren't 2-D. RoPE chops a $d$-dim vector into $d/2$ planes, and each plane gets its own frequency: $$\theta_i \;=\; 10000^{-2i/d},\qquad i=0,1,\dots,d/2-1$$ Low-index planes rotate fast (fine-grained position), high-index planes rotate slowly (coarse).
RoPE is multi-scale. Fast planes (left) track short distances; slow planes (right) track long ones. One embedding, many clocks.
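A minimal sketch of the full-dimensional version. It assumes the interleaved layout that pairs dimensions (0, 1), (2, 3), ...; some implementations pair dimension $i$ with $i + d/2$ instead, but the frequencies are the same.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Apply RoPE to a d-dim vector x at position m.
    Uses the interleaved layout: plane i is dimensions (2i, 2i+1)."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)   # theta_i = 10000^(-2i/d): fast planes first, slow planes last
    angles = m * theta               # plane i rotates by m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]        # the two coordinates of each plane
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # the same 2-D rotation, once per plane
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.ones(8)
print(rope(x, m=1).round(4))  # plane 0 (theta = 1) swings hard, plane 3 (theta ~ 0.001) barely moves
```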
Here's the punchline. If you rotate $\mathbf{q}$ by $m\theta$ and $\mathbf{k}$ by $n\theta$, their dot product is: $$\langle R(m\theta)\mathbf{q},\; R(n\theta)\mathbf{k}\rangle \;=\; \langle\mathbf{q},\; R((n-m)\theta)\mathbf{k}\rangle$$ It depends only on $n-m$. Absolute positions cancel out. Slide both $m$ and $n$ by the same amount — nothing changes.
This is the whole reason RoPE works. The model never sees "token #742" — it sees "the other token is 3 positions away". That's what translation-equivariant attention should do.
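You can check the cancellation numerically. In this sketch the frequency $\theta = 0.3$, the positions, and the shift of 1000 are all arbitrary choices:

```python
import numpy as np

def rot(v, angle):
    """Rotate a 2-D vector by the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ v

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.3                # arbitrary frequency for illustration
m, n, shift = 5, 12, 1000

score       = rot(q, m * theta) @ rot(k, n * theta)                      # <R(m)q, R(n)k>
score_shift = rot(q, (m + shift) * theta) @ rot(k, (n + shift) * theta)
score_rel   = q @ rot(k, (n - m) * theta)                                # <q, R(n-m)k>

print(np.allclose(score, score_shift))  # True: slide both positions, nothing changes
print(np.allclose(score, score_rel))    # True: only n - m survives
```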