An embedding for a word $W_n$ is a vector of learnable parameters $\theta_n$^[learnable vectors, i.e. parameterized representations] assigned by a lookup function $f_\theta$ such that $f_\theta(W_n) = \theta_n$.^[the concept generalizes to inputs beyond words, e.g. images, audio, video]
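A minimal sketch of that lookup, assuming a made-up toy vocabulary and embedding size (the names `word_to_id`, `theta`, and `f_theta` are purely illustrative):

```python
# A toy lookup: the embedding "function" f_theta is just indexing into a
# learnable parameter matrix theta with one row per vocabulary entry.
# Vocabulary and dimensions here are made up for illustration.
import numpy as np

vocab = ["cat", "dog", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d = 4                                   # embedding dimension
theta = np.random.default_rng(0).normal(size=(len(vocab), d))  # learned during training

def f_theta(word):
    """Return the embedding vector theta_n for word W_n."""
    return theta[word_to_id[word]]

print(f_theta("cat"))   # a 4-dimensional vector, updated as theta is trained
```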
## Questions
- I'm assuming the word is already tokenized? Does it always have to be?
- How do word and document embeddings relate? Where does the information you pack into word embeddings stop and the LLM / transformer choke-point encoding start?
- distributed representation of words
## History
Proposed by Bengio et al. (2001, 2003) to tackle what's known as the [[curse of dimensionality]] in [[statistical language modeling]]. Word embeddings allowed training a neural network that established relationships between different words while preserving their syntactic and semantic relationships.
[[Word2Vec]] (Mikolov et al. 2013) tried to improve on the computational intensity of Bengio's original NNLM proposal. Mikolov proposed two models:
- Continuous bag-of-words (CBOW)
- Continuous skip-gram
Both of these models do away with the non-linear hidden layer, removing the computational cost of that densely connected layer in the NNLM. The trade-off is under-parameterization: for the same amount of data, the learned representations are less precise, which is typically offset by training on much more data.
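To make the "no hidden layer" point concrete, here is a rough CBOW-style forward pass under toy assumptions (random weights, tiny vocabulary; `W_in`, `W_out`, and `cbow_probs` are invented names, not Word2Vec's actual implementation): the averaged context embeddings project straight to vocabulary scores, with no intermediate non-linear layer.

```python
# Minimal CBOW forward pass (a sketch, not Mikolov's full training setup).
# Context embeddings feed a linear output layer directly -- there is no
# intermediate non-linear hidden layer as in Bengio's NNLM.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, d = len(vocab), 8                         # vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))    # input embeddings (theta)
W_out = rng.normal(scale=0.1, size=(d, V))   # output projection

def cbow_probs(context_words):
    """Predict a distribution over the centre word from its context."""
    ids = [word_to_id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)        # average the context embeddings
    logits = h @ W_out                # project straight to vocabulary scores
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()            # softmax over the vocabulary

# "the cat ___ on the" -> probabilities for each candidate centre word
print(dict(zip(vocab, cbow_probs(["the", "cat", "on", "the"]).round(3))))
```

Skip-gram is the mirror image: it uses the centre word's embedding to predict each surrounding context word.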
## Resources
- [The Ultimate Guide to Word Embeddings (neptune.ai)](https://neptune.ai/blog/word-embeddings-guide)
---
- Links: [[Natural Language Processing]] [[Embeddings]] [[Feature Vectors]] [[Statistical language modeling]]
- Created at: 2023-05-23