Deep generative models of genetic variation capture mutation effects

- Title: [[Source Notes/Deep generative models of genetic variation capture mutation effects]]
- Type: #source/paper
- Author:
- Reference: [[1712.06527] Deep generative models of genetic variation capture mutation effects (arxiv.org)](https://arxiv.org/abs/1712.06527)
- Published at:
- Reviewed at: [[2023-08-17]]
- Links: [[Probabilistic Model]] [[Generative Models]] [[Generative Protein Model]] [[Source Notes/Auto-Encoding Variational Bayes|Auto-Encoding Variational Bayes]]

---

# Questions

- How does sequence weighting work?
- How can I parameterize the model with an initial protein / protein family?
- How does Probabilistic PCA work (44, 88)?

# Rough Notes

There's a need to assess whether a given mutation to a protein or RNA will disrupt its function.

Assays can measure many attributes:

- ligand binding, splicing, catalysis (4, 8, 11, 13, 21, 23, 29)
- cellular / organism fitness under some selection pressure (5-7, 9, 12, 14, 17, 19, 20)

Modeling pairwise interactions has yielded good results. Second- and third-order interactions are at the limit of computational tractability for proteins of about 200 amino acids. There is still no generalizable path to fourth-order interactions or higher, because every higher-order interaction adds its own free parameter to estimate.

Instead, look towards latent variable models.

- PCA and admixture analysis (44-46) can be thought of as latent variable models with linear dependencies, which limits the types of correlations they can capture.

[[Variational Inference]] lets us model nonlinear latent variables in many problem domains, including chemical structures (49. Gómez-Bombarelli, R., et al., Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.)

> We found that the combination of using biologically motivated priors and Bayesian approaches for inference on the weights was important to learning models that generalize.
> ...
> we found that our variational Bayesian approach performed better, (Table 1). Most importantly, only the Bayesian approaches for inference of the global parameters that estimate approximate posteriors were able to consistently outperform the previous pairwise-interaction models.

Latent and global variables capture biological structure.

**Claim** The formulation of their VAE / latent variable model over sequence space can be thought of as a discrete, non-linear analog of PCA.
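
To make the "discrete, non-linear analog of PCA" claim concrete for myself, here is a minimal sketch contrasting the two generative stories. Probabilistic PCA draws a latent `z ~ N(0, I)` and emits a real-valued vector through a linear map plus Gaussian noise. The latent variable model over sequences draws the same kind of `z` but passes it through a non-linear function and emits one categorical distribution per position. Everything below is an assumption for illustration: the toy sizes, the single ReLU hidden layer, the random untrained weights, and the names `ppca_sample`, `decode_probs`, and `sample_sequence`. It is not the paper's architecture, priors, or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): sequence length, alphabet size, latent dim, hidden units.
L, q, d, h = 10, 20, 2, 16


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


# Probabilistic PCA: x = W z + mu + Gaussian noise (linear, continuous output).
W = rng.normal(size=(L * q, d))
mu = rng.normal(size=L * q)


def ppca_sample(z, sigma=0.1):
    return W @ z + mu + sigma * rng.normal(size=L * q)


# Non-linear, discrete analog: z -> hidden layer -> per-position categorical.
# Weights are random here; in a VAE they would be learned.
W1 = rng.normal(size=(h, d))
b1 = np.zeros(h)
W2 = rng.normal(size=(L * q, h))
b2 = np.zeros(L * q)


def decode_probs(z):
    hidden = np.maximum(0.0, W1 @ z + b1)       # ReLU non-linearity
    logits = (W2 @ hidden + b2).reshape(L, q)   # one row of logits per position
    return softmax(logits, axis=-1)             # each row sums to 1


def sample_sequence(z):
    probs = decode_probs(z)
    # Draw one residue index per position from its categorical distribution.
    return np.array([rng.choice(q, p=p) for p in probs])


z = rng.normal(size=d)        # same standard-normal prior on z in both models
x_real = ppca_sample(z)       # real-valued vector of length L * q
probs = decode_probs(z)       # (L, q) matrix of residue probabilities
seq = sample_sequence(z)      # length-L array of residue indices in [0, q)
```

The only structural differences are the non-linearity between `z` and the output and the categorical (rather than Gaussian) emission, which is how I read the claim.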