## Questions

- Why do we add position to the embeddings instead of, say, concatenating the embedding and position vector?
- Why does the embedding support this positional perturbation?
- Is it similar to linear combination / superposition?
- How could we visualize this?

## Intuition

- [[Recurrent Neural Nets]] don't scale computationally because you have to process the input sequentially. [[Transformers]] address this by considering all the input in parallel. Part of how this works is by cleverly encoding token position along with token embedding.
- For each input token in the sequence you want a vector encoding its position.
- To make things work well with tensor math, you make the token's position vector length match the token's embedding vector length. This allows you to create an input -> embeddings matrix and an input -> positions matrix of the same shape, so the two can simply be added element-wise (see the sketch at the end of this note).

![[_Media/Transformer Positional Encoding Matrix - ML Mastery.png]]

## Encoding

In the "Attention Is All You Need" paper, the positional encoding of the $k^{th}$ token in the sequence is:

$$
\begin{align}
P(k, 2i) &= \sin\left(\frac{k}{n^{2i/d}}\right) \\
P(k, 2i+1) &= \cos\left(\frac{k}{n^{2i/d}}\right)
\end{align}
$$

- $L$: input token sequence length
- $k$: position of the token in the sequence, $0 \le k < L$
- $d$: dimension of the output embedding space
- $i$: index over the sine/cosine dimension pairs, $0 \le i < d/2$ (each $i$ fills one sine column and one cosine column, which is where the $/2$ comes from; it applies to $i$, not to $k$)
- $P(k, j)$: position function mapping input position $k$ to index $(k, j)$ of the positional matrix
- $n$: a user-defined scalar. The original paper set it to 10,000

## Dead ends

- You can't just use the raw token index as the encoding, because its magnitude grows without bound for longer inputs.
- You can't just normalize the index to 0-1 by dividing by the input length, because variable-length sequences would normalize differently: the same position would get a different value depending on the window length.

## Resources

- [A Gentle Introduction to Positional Encoding in Transformer Models, Part 1 - MachineLearningMastery.com](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/)
- [How Positional Embeddings work in Self-Attention (code in Pytorch) | AI Summer (theaisummer.com)](https://theaisummer.com/positional-embeddings/)
- [What do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding (arXiv:2010.04903)](https://arxiv.org/pdf/2010.04903.pdf)

---

- Links:
- Created at: 2023-06-05
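
## Sketch

A minimal NumPy sketch of the sinusoidal encoding above, assuming $d$ is even; the function name `positional_encoding` is just an illustrative choice, not something from the paper or the linked resources.

```python
import numpy as np

def positional_encoding(L, d, n=10_000):
    """Sinusoidal positional encoding matrix of shape (L, d).

    P[k, 2i]   = sin(k / n**(2i/d))
    P[k, 2i+1] = cos(k / n**(2i/d))

    Assumes d is even.
    """
    k = np.arange(L)[:, None]          # token positions, shape (L, 1)
    i = np.arange(d // 2)[None, :]     # dimension-pair index, shape (1, d/2)
    angles = k / n ** (2 * i / d)      # shape (L, d/2) via broadcasting

    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)        # even columns get sine
    P[:, 1::2] = np.cos(angles)        # odd columns get cosine
    return P
```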
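
A toy usage example, matching the intuition that the positions matrix and the embeddings matrix have the same shape and are simply added; the sizes and the random "embeddings" here are made up for illustration.

```python
import numpy as np

L, d = 4, 8                                     # made-up sizes: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((L, d))  # stand-in for a learned embedding lookup

P = positional_encoding(L, d)                   # same (L, d) shape as the embeddings

# Position is injected by element-wise addition, not concatenation,
# so the model input keeps width d.
x = token_embeddings + P
print(x.shape)                                  # (4, 8)
```

Plotting `P` as a heatmap (for example with matplotlib's `imshow`) reproduces the striped pattern in the image above, which is one way to approach the visualization question.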