## Raw notes
- RNNs are an architecture for sequence modeling. e.g. modeling [[Sequence to Sequence]] tasks (e.g. translation, answer generation) or sequence classification tasks (e.g. sentiment analysis)
- motivating example: given a single snapshot of a ball, can you predict where it moves next? You can only do so given its previous positions, i.e. the sequence
- Sequential data is everywhere
- audio
- text
- time series
- other structured sequences like DNA
- Sequence modeling Applications
- Many inputs to one output e.g. sentiment classification
- One input to many outputs e.g. image captioning
- many to many e.g. machine translation
- How do you add a "temporal" or sequential ordering dimension to the model?
- To model sequences we need to:
- to handle variable-length sequences
- track long-term dependencies (related parts of the input that are far apart)
- maintain information about order
- share parameters across the sequence (all four requirements show up in the sketch below)
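A minimal NumPy sketch of a vanilla RNN forward pass, tying the four requirements above together: the same weights are reused at every time step, the hidden state carries ordering information, and the loop handles any sequence length. All dimensions and weight names here are illustrative assumptions, not from the source.

```python
import numpy as np

# Illustrative sizes (assumptions).
input_dim, hidden_dim, output_dim = 8, 16, 4

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrent)
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

def rnn_forward(x_seq):
    """x_seq: array of input vectors, one per time step (any length)."""
    h = np.zeros(hidden_dim)                       # initial hidden state h_0
    outputs = []
    for x_t in x_seq:                              # same parameters shared at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden state tracks order/history
        outputs.append(W_hy @ h + b_y)             # per-step output
    return outputs, h                              # final h summarizes the sequence

outs, h_T = rnn_forward(rng.normal(size=(5, input_dim)))
```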
- Training uses back-propagation through time
- instead of back-propagating through a single feed-forward pass, we also need to back-propagate through the recurrent weights $W_{hh}$ at every time step
- computationally expensive: the gradient w.r.t. $h_0$ involves repeated gradient computations through every intermediate time step
- exploding gradients -> mitigated by gradient clipping, which scales back large gradient values (see the training-step sketch below)
- vanishing gradients -> tackled by
- choice of activation function
- weight initialization
- network architecture
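A hedged PyTorch sketch of a single training step, showing where back-propagation through time happens and where gradient clipping slots in. The model, data shapes, and the clipping threshold of 1.0 are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy many-to-one setup (shapes and sizes are assumptions).
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 20, 8)        # (batch, time steps, features)
y = torch.randint(0, 2, (32,))    # one label per sequence

opt.zero_grad()
out, h_n = rnn(x)                 # unrolls the recurrence over all 20 steps
loss = loss_fn(head(h_n[-1]), y)  # predict from the final hidden state
loss.backward()                   # BPTT: gradients flow back through W_hh at every
                                  # step, all the way to h_0

torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # clip exploding gradients
opt.step()
```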
- Why are vanishing gradients a problem
- when you multiply a long sequence of small numbers together, you lose the ability to learn from tokens further back in the sequence.
- specifically, gradients flowing back to earlier time steps get smaller and smaller, so an error at the output barely adjusts the model's use of distant inputs (see the numeric example below).
- this biases the learned parameters to capture short-term dependencies (words close together)
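A tiny numeric illustration of the point above. The per-step factor of 0.25 is an assumption (it's the maximum derivative of the sigmoid); the effect is the same for any factor below 1.

```python
# Multiplying many per-step gradient factors < 1 drives the total gradient toward 0.
per_step = 0.25                      # e.g. the largest possible sigmoid derivative
for steps in (5, 20, 50):
    print(steps, per_step ** steps)
# 5  -> ~9.8e-04
# 20 -> ~9.1e-13
# 50 -> ~7.9e-31  (error signals from 50 steps back are effectively gone)
```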
- Trick 1 - activation functions
- the derivative of ReLU is 0 for x <= 0 and 1 for x > 0. This has a boosting characteristic compared to the derivatives of tanh or sigmoid, which shrink as |x| gets larger
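A quick NumPy check of that claim: the ReLU derivative is exactly 1 wherever the unit is active, while the sigmoid and tanh derivatives shrink rapidly as |x| grows.

```python
import numpy as np

x = np.array([0.5, 2.0, 5.0])
s = 1 / (1 + np.exp(-x))
d_sigmoid = s * (1 - s)              # at most 0.25, ~0.007 at x = 5
d_tanh = 1 - np.tanh(x) ** 2         # ~2e-4 at x = 5
d_relu = (x > 0).astype(float)       # exactly 1.0 for every positive x
print(d_sigmoid, d_tanh, d_relu)
```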
- Trick 2 - parameter initialization
- Initialize weights to identity and biases to zero
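A hedged PyTorch sketch of this trick, using the standard `torch.nn.init` helpers; the layer sizes are assumptions.

```python
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16)
nn.init.eye_(rnn.weight_hh_l0)    # recurrent weights start as the identity matrix
nn.init.zeros_(rnn.bias_hh_l0)    # biases start at zero
nn.init.zeros_(rnn.bias_ih_l0)
```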
- Trick 3 - use a more complex type of recurrent unit that can have the effect of tracking long-term dependencies
- gated cells selectively add / remove information within each recurrent unit (LSTM, GRU)
LSTM
- 1) forget, 2) store, 3) update, 4) output
- an LSTM recurrent cell is able to track information through many time-steps.
- Maintain a cell state
- gates control information flow
- forget gate gets rid of irrelevant information (from the past)
- store gate keeps what's relevant from the current input
- update gate selectively updates the cell state
- output gate returns a filtered version of the cell state
- In practice, this makes back-prop through time much more stable.
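A from-scratch NumPy sketch of a single LSTM step, mapping the forget/store/update/output operations above onto the usual gate equations. Weight names and the concatenated-input layout are assumptions; real implementations fuse these matrices for speed.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state h_t and cell state c_t."""
    W_f, W_i, W_g, W_o, b_f, b_i, b_g, b_o = params
    z = np.concatenate([h_prev, x_t])

    f = sigmoid(W_f @ z + b_f)    # forget gate: drop irrelevant past information
    i = sigmoid(W_i @ z + b_i)    # store (input) gate: what to keep from x_t
    g = np.tanh(W_g @ z + b_g)    # candidate values to write into the cell state
    o = sigmoid(W_o @ z + b_o)    # output gate: what part of the state to expose

    c_t = f * c_prev + i * g      # update the cell state
    h_t = o * np.tanh(c_t)        # output a filtered version of the cell state
    return h_t, c_t
```

The additive update `c_t = f * c_prev + i * g` is why gradients flow through time more stably than in a vanilla RNN: the cell state is not squashed through a nonlinearity at every step.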
RNN Applications & Limitations
- e.g. music generation - predict the next note in the sequence
- sentiment classification - many-to-one: a single output, i.e. the sentiment category (sketched below)
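A minimal PyTorch sketch of the many-to-one pattern for sentiment classification; the vocabulary size, dimensions, and two-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len) of token ids
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.classify(h_n[-1])              # one output per whole sequence

logits = SentimentRNN()(torch.randint(0, 10_000, (4, 32)))   # -> shape (4, 2)
```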
Limitations
- there's an encoding bottleneck: a long input has to be compressed into a single fixed-size representation before it can be operated on, and information can be lost in that compression.
- Slow, no parallelization
- memory isn't really long enough in practice; RNNs don't scale well to very long input sequences.
Desired capabilities
- Continuous input stream
- Parallelizable
- Long memory
## Resources
- https://www.youtube.com/watch?v=QvkQ1B3FBqA
---
- Links:
- Created at: 2023-06-05