An embedding for a word $W_n$ is a vector of parameters $\theta_n$ that can be learned^[learnable vectors, i.e. parameterized functions] such that $f_\theta(W_n) = \theta_n$.^[the concept generalizes to inputs beyond words, e.g. images, audio, video]

## Questions

- I'm assuming the word is already tokenized? Does it always have to be?
- How do word and document embeddings relate? Where does the information you pack into word embeddings stop, and where does the LLM / transformer choke-point encoding start?
- Distributed representation of words

## History

Proposed by Bengio et al. (2001, 2003) to tackle what's known as the [[curse of dimensionality]] in [[statistical language modeling]]. Word embeddings allowed training a neural network that established relationships between different words while preserving the words' syntactic and semantic properties.

[[Word2Vec]] (Mikolov et al., 2013) tried to improve on the computational intensity of Bengio's original NNLM proposal. Mikolov proposed two models:

- Continuous bag-of-words (CBOW)
- Continuous skip-gram

Both models do away with the hidden layer, removing the cost of NNLM's densely connected layer. This under-parameterization means the learned representation is less precise for a fixed amount of data; in practice it is offset by training on much more data. A toy sketch of the embedding-lookup and CBOW ideas is appended at the end of this note.

## Resources

- [The Ultimate Guide to Word Embeddings (neptune.ai)](https://neptune.ai/blog/word-embeddings-guide)

---

- Links: [[Natural Language Processing]] [[Embeddings]] [[Feature Vectors]] [[Statistical language modeling]]
- Created at: 2023-05-23
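
A minimal numpy sketch of the two ideas above: an embedding is just a learnable row of a parameter matrix indexed by a word id, and CBOW scores a target word directly from averaged context embeddings with no hidden layer. The toy corpus, the matrices `W_in`/`W_out`, and the single SGD step are illustrative assumptions, not Mikolov's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; in practice the text is tokenized and mapped to integer ids first.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 8                  # vocabulary size, embedding dimension
W_in = rng.normal(0.0, 0.1, (V, D))   # input embeddings: f_theta(w) = W_in[word_to_id[w]]
W_out = rng.normal(0.0, 0.1, (V, D))  # output ("context") embeddings

def embed(word):
    """Embedding lookup: the learned parameters *are* the representation."""
    return W_in[word_to_id[word]]

def cbow_forward(context_words, target_word):
    """CBOW: average context embeddings, score every vocab word (no hidden layer)."""
    h = np.mean([embed(w) for w in context_words], axis=0)  # (D,)
    scores = W_out @ h                                      # (V,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    loss = -np.log(probs[word_to_id[target_word]])          # softmax cross-entropy
    return h, probs, loss

# One SGD step on a single (context, target) pair.
lr = 0.1
context, target = ["the", "brown"], "quick"
h, probs, loss = cbow_forward(context, target)

grad_scores = probs.copy()
grad_scores[word_to_id[target]] -= 1.0        # dL/dscores for softmax + cross-entropy
grad_h = W_out.T @ grad_scores                # backprop into the averaged context vector
W_out -= lr * np.outer(grad_scores, h)        # update output embeddings
for w in context:                             # update each context word's input embedding
    W_in[word_to_id[w]] -= lr * grad_h / len(context)

print("loss on this pair:", round(loss, 4))
print('embedding for "fox":', np.round(embed("fox"), 3))
```

Skip-gram flips the direction, predicting each context word from the centre word's embedding, but the no-hidden-layer structure is the same.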