## Raw notes

- RNNs are an architecture for sequence modeling, e.g. modeling [[Sequence to Sequence]] tasks (e.g. translation, answer generation) or sequence classification tasks (e.g. sentiment analysis)
- motivation: e.g. given an image of a ball, can you predict its next location?
- Sequential data is everywhere
    - audio
    - text
    - time series
    - other structured sequences like DNA
- Sequence modeling applications
    - Many inputs to one output, e.g. sentiment classification
    - One input to many outputs, e.g. image captioning
    - Many to many, e.g. machine translation
- How do you add a "temporal" or sequential ordering dimension?
- To model sequences we need to:
    - handle variable-length sequences
    - track long-term dependencies (related parts of the input that are far apart)
    - maintain information about order
    - share parameters across the sequence
- Training uses back-propagation through time
    - instead of back-propagating through a single feed-forward pass, we also back-prop through the recurrent weights $W_{hh}$ at every time step (single-step sketch at the end of these notes)
    - computationally expensive: the gradient w.r.t. $h_0$ involves repeated multiplication by $W_{hh}$, one factor per time step
    - exploding gradients -> could do gradient clipping to scale back large gradient values (sketch below)
    - vanishing gradients -> tackled by
        - choice of activation function
        - weight initialization
        - network architecture
- Why are vanishing gradients a problem?
    - when you multiply a long sequence of small numbers together, you lose the ability to learn from tokens further back in the sequence.
    - specifically, the gradient contributions from distant time steps get smaller and smaller, so an error at the output barely adjusts the parameters with respect to those early inputs.
    - this biases the learned parameters towards short-term dependencies (words close together)
- Trick 1 - activation functions
    - the derivative of ReLU is 0 for x <= 0 and 1 for x > 0, so it doesn't shrink the gradient for positive inputs, unlike the derivatives of tanh or sigmoid, which shrink towards 0 as |x| gets larger
- Trick 2 - parameter initialization
    - Initialize weights to the identity matrix and biases to zero
- Trick 3 - use a more complex type of recurrent unit that can track long-term dependencies
    - gated cells selectively add / remove information within each recurrent unit (LSTM, GRU)

LSTM

- 1) forget, 2) store, 3) update, 4) output
- an LSTM recurrent cell is able to track information through many time steps.
- Maintain a cell state
- gates control information flow (cell-step sketch below)
    - forget gate gets rid of irrelevant information (from the past)
    - store gate keeps what's relevant from the current input
    - update gate selectively updates the cell state
    - output gate returns a filtered version of the cell state
- In practice, this makes back-prop through time much more stable, because the cell state gives gradients a mostly uninterrupted path back through time.

RNN Applications & Limitations

- e.g. music generation - predict the next note in the sequence
- sentiment classification - many inputs to a single output, i.e. the sentiment category

Limitations

- there's an encoding bottleneck: a large input has to be compressed into a fixed representation that can be operated on, and information can be lost in this encoding
- Slow, no parallelization
- memory isn't really long enough; doesn't scale well to very long input sequences

Desired capabilities

- Continuous input stream
- Parallelizable
- Long memory

## Resources

- https://www.youtube.com/watch?v=QvkQ1B3FBqA

---

- Links:
- Created at: 2023-06-05
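To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step, assuming the standard update $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$; the variable names and dimensions are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b_h):
    """One vanilla RNN step; the same weights are shared across every time step."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Unrolling over a toy sequence: back-propagation through time differentiates
# through every one of these steps, so the gradient w.r.t. h_0 accumulates one
# factor involving W_hh per step (the source of exploding / vanishing gradients).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_hh = np.eye(hidden_dim)                              # identity init (Trick 2)
W_xh = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):            # 5-step toy sequence
    h = rnn_step(x_t, h, W_hh, W_xh, b_h)
```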
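Gradient clipping, as mentioned under exploding gradients, can be sketched as scaling the whole gradient down whenever its global norm exceeds a threshold; `max_norm` and the list-of-arrays gradient representation here are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down by a common factor if their combined norm is too large."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-8))
    return [g * scale for g in grads]

# e.g. clipped = clip_by_global_norm([dW_hh, dW_xh, db_h], max_norm=5.0)
# where dW_hh, dW_xh, db_h are hypothetical gradient arrays from back-prop through time
```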
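The LSTM forget / store / update / output flow can be sketched as a single cell step using the standard gate equations; the parameter names (`W_f`, `b_f`, ...) and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM step: forget -> store -> update -> output."""
    z = np.concatenate([h_prev, x_t])                    # shared input to all gates

    f = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate: drop irrelevant past info
    i = sigmoid(params["W_i"] @ z + params["b_i"])       # store gate: keep what's relevant now
    g = np.tanh(params["W_g"] @ z + params["b_g"])       # candidate values for the cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate: filter the cell state

    c_t = f * c_prev + i * g                             # update the cell state
    h_t = o * np.tanh(c_t)                               # filtered cell state -> new hidden state
    return h_t, c_t

# Toy usage: random parameters, zero initial states, a 5-step sequence.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
params = {}
for gate in ("f", "i", "g", "o"):
    params[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_cell_step(x_t, h, c, params)
```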