## Questions
- Why do we add position to the embeddings instead of, say, concatenating the embedding and position vector?
- Why does the embedding support this positional perturbation?
- Is it similar to linear combination / superposition?
- How could we visualize this?
## Intuition
- [[Recurrent Neural Nets]] don't scale computationally because they must process the input sequentially. [[Transformers]] address this by processing the whole input in parallel. Part of how this works is by cleverly encoding each token's position along with its embedding.
- For each token in the input sequence you want a vector encoding its position
- To make things work well with tensor math, you make the position vector the same length as the token's embedding vector. This lets you build an input -> embeddings matrix and an input -> positions matrix of the same shape and simply add them (see the sketch below).
![[_Media/Transformer Positional Encoding Matrix - ML Mastery.png]]
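A minimal sketch of the shape-matching point above, using toy dimensions and random stand-in values (not real embeddings):

```python
import numpy as np

# Toy shapes, just to show why matching shapes matter:
L, d = 4, 8                                 # sequence length, embedding dimension
token_embeddings = np.random.randn(L, d)    # stand-in for learned token embeddings
positional_matrix = np.random.randn(L, d)   # stand-in for the P(k, j) matrix below

assert token_embeddings.shape == positional_matrix.shape
model_input = token_embeddings + positional_matrix  # element-wise sum, still (L, d)
print(model_input.shape)  # (4, 8)
```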
## Encoding
In the "Attention is All You Need" paper, position of the $k^{th}$ token in the sequence is:
$$
\begin{align}
P(k, 2i) &= \sin\left(\frac{k}{n^{2i/d}}\right) \\
P(k, 2i+1) &= \cos\left(\frac{k}{n^{2i/d}}\right)
\end{align}
$$
- $L$: input token sequence length
- $k$: position of the token in the sequence, $0 \le k < L$ ^[Not $L/2$: the halving applies to $i$, since each $i$ covers one sin/cos pair.]
- $i$: index over embedding-dimension pairs, $0 \le i < d/2$
- $d$: dimension of the output embedding space
- $P(k, j)$: the positional encoding function; its value fills entry $(k, j)$ of the $L \times d$ positional matrix
- $n$: a user-defined scalar; the original paper sets it to 10,000
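A minimal NumPy sketch of the formula above (the function name and toy dimensions are mine; assumes $d$ is even):

```python
import numpy as np

def positional_encoding(L: int, d: int, n: float = 10_000.0) -> np.ndarray:
    """Return an L x d matrix whose row k is the encoding of position k."""
    P = np.zeros((L, d))
    for k in range(L):               # token position
        for i in range(d // 2):      # each i fills one sin/cos pair of columns
            denom = n ** (2 * i / d)
            P[k, 2 * i] = np.sin(k / denom)
            P[k, 2 * i + 1] = np.cos(k / denom)
    return P

P = positional_encoding(L=4, d=8)
print(P.shape)  # (4, 8)
print(P[0])     # position 0: all sin entries are 0, all cos entries are 1
```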
## Dead ends
- You can't just use the raw index as the encoding: its magnitude grows with position, so long sequences produce huge values that swamp the embedding
- You can't just normalize the index to 0-1 by dividing by the sequence length either: the same value would mean different absolute positions for different sequence lengths, so the encoding depends on the window length (see the numeric illustration below)
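A quick numeric illustration of both dead ends (toy numbers):

```python
# Raw index: position 5000 contributes a value of 5000, which swamps
# embedding components that are typically O(1).
raw_encoding_far_token = 5000

# Normalized index: the same value 0.5 points at token 5 in a 10-token
# sequence but token 500 in a 1000-token sequence, so it no longer
# identifies an absolute position.
pos_in_short = 5 / 10       # 0.5
pos_in_long = 500 / 1000    # 0.5
print(pos_in_short == pos_in_long)  # True, yet very different positions
```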
## Resources
- [A Gentle Introduction to Positional Encoding in Transformer Models, Part 1 - MachineLearningMastery.com](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/)
- [How Positional Embeddings work in Self-Attention (code in Pytorch) | AI Summer (theaisummer.com)](https://theaisummer.com/positional-embeddings/)
- [What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding (arXiv 2010.04903)](https://arxiv.org/pdf/2010.04903.pdf)
---
- Links:
- Created at: 2023-06-05