## 7.1 Introduction
- [[Statistical Inference]]
## 7.2 Parametric and Non-parametric models
- [[Statistical Model]]
### Regression model
Assume we have observed data $(X_1, Y_1), \dots, (X_n, Y_n)$. We call the $\boldsymbol X$ variables **predictors** (regressors, features, independent variables) and the $Y$ variable the **target** (outcome, response variable, dependent variable), and $r(x) = E(Y|X=x)$ is called the **regression function**. The *goal* of predicting $Y_i$ given the data $X_i$ is called, unsurprisingly, **prediction** (or classification in the case where $Y$ is discrete). Regression often colloquially refers to both the process of curve estimation and making predictions. Here Wasserman defines **regression** specifically as the process of curve estimation, i.e. estimating the functions in the [[Statistical Model]] so that you can compute the regression function (i.e. the function that returns the expected value of $Y_i$ given the predictors $X_i$).
Regression models are sometimes (often?) re-written as
$
Y = r(X) + \epsilon
$
with $E(\epsilon) = 0$, i.e. $\epsilon = Y - r(X)$ is the noise left over after conditioning on $X$; even a perfect estimate of $r$ still leaves this irreducible error in the predictions.
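As a concrete sketch (my own illustration, not from the book): simulate data from $Y = r(X) + \epsilon$ and estimate the regression function with a crude binned-mean estimator. The choice of $r$, the noise level, and the bin count are all assumptions made just for this example.
```python
import numpy as np

rng = np.random.default_rng(0)

# True (in practice unknown) regression function r(x) = E(Y | X = x).
def r(x):
    return np.sin(2 * np.pi * x)

# Simulate n observations of (X, Y) with Y = r(X) + epsilon, E(epsilon) = 0.
n = 500
X = rng.uniform(0, 1, size=n)
Y = r(X) + rng.normal(0, 0.3, size=n)

# Crude non-parametric estimate of r: average the Y values whose X falls in
# each bin (a "regressogram"); 20 bins is an arbitrary choice here.
bins = np.linspace(0, 1, 21)
bin_index = np.digitize(X, bins) - 1
r_hat = np.array([Y[bin_index == b].mean() for b in range(len(bins) - 1)])

# Prediction at a new x: look up the estimated mean for its bin.
x_new = 0.37
print(r_hat[np.digitize(x_new, bins) - 1], "vs true", r(x_new))
```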
Hints at Frequentist inference vs Bayesian inference. Will cover both, but starting with Frequentist. Noted that most courses usually start with parametric models but Wasserman starts with non-parametric (because he thinks they're easier to understand).
## 7.3 Fundamental Concepts of Inference
Most inferential problems fall into one of three types:
- Estimation (aka Point estimates)
- Confidence sets
- Hypothesis testing
### Point estimation
Point estimates are a single best guess for some quantity of interest (e.g. a parameter in a parametric model, a [[Cumulative Distribution Function|CDF]], a [[Probability Density Function]] $f$, a regression function $r$, or a prediction of some future, unseen value $Y$ of a random variable).
By convention, a point estimate of some parameter $\theta$ is written $\hat \theta$. When the estimate is computed from $n$ IID observations $X_1, \dots, X_n$, the estimator is written as a function of the data: $\hat \theta_n = g(X_1, \dots, X_n)$.
Define the **bias** of the estimator $\hat \theta_n$ as
$
bias(\hat \theta_n) = E_{\theta}(\hat \theta_n) - \theta
$
The estimator is said to be unbiased if $E_{\theta}(\hat \theta_n) = \theta$, i.e. the bias is zero; more generally, the bias measures how far off the estimator is, on average, from the true value. There used to be a lot of emphasis placed on finding unbiased estimators, but this is less of a concern now and many modern estimators are biased.
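A quick numerical illustration of bias (my own sketch, not from the text): the plug-in variance estimator $\hat \sigma^2_n = \frac{1}{n}\sum_i (X_i - \bar X_n)^2$ is a standard example of a biased but widely used estimator, and its bias $E_{\theta}(\hat \theta_n) - \theta$ can be approximated by simulation.
```python
import numpy as np

rng = np.random.default_rng(1)

n = 10            # sample size
sigma2 = 4.0      # true variance (the theta being estimated)
reps = 100_000    # Monte Carlo repetitions

# Plug-in (MLE) variance estimator: divides by n, not n - 1.
def theta_hat(x):
    return np.mean((x - x.mean()) ** 2)

estimates = np.array([theta_hat(rng.normal(0, np.sqrt(sigma2), size=n))
                      for _ in range(reps)])

# bias(theta_hat) = E(theta_hat) - theta; analytically it is -sigma2 / n here.
print("simulated bias:", estimates.mean() - sigma2)
print("analytic bias: ", -sigma2 / n)
```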
An estimator is **consistent** if $\hat \theta_n \xrightarrow{P} \theta$, i.e. it converges in probability to the true value as the sample size $n \to \infty$ (the labeled arrow is `\xrightarrow{P}` in LaTeX).
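A tiny sanity-check sketch (assumptions mine, not from the book): the sample mean is a consistent estimator of the true mean, so its error should shrink as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(2)
theta = 3.0  # true mean we are estimating

# Consistency in action: |mean(X_1..X_n) - theta| shrinks as n grows.
for n in (10, 100, 10_000, 1_000_000):
    x = rng.normal(theta, 2.0, size=n)
    print(n, abs(x.mean() - theta))
```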
The **sampling distribution** is the distribution of $\hat \theta_n$, and the standard deviation of $\hat \theta_n$ is the **standard error**
$
se = se(\hat \theta_n) = \sqrt{V(\hat \theta_n)}
$
It's often not possible to compute $se$ analytically, so an estimate $\hat{se}$ is used instead.
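A small simulation sketch (my own, with arbitrary parameter choices): for the sample mean of $n$ IID $N(\mu, \sigma^2)$ observations the exact standard error is $\sigma / \sqrt{n}$, and the usual estimate $\hat{se} = s / \sqrt{n}$ plugs in the sample standard deviation. Simulating many datasets shows the sampling distribution of $\hat \theta_n$ and lets us compare all three.
```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma, n = 5.0, 2.0, 50
reps = 50_000

# Sampling distribution of the sample mean: one estimate per simulated dataset.
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print("sd of sampling distribution (se):", means.std())       # ~ sigma / sqrt(n)
print("analytic se:                     ", sigma / np.sqrt(n))

# In practice sigma is unknown, so se is estimated from a single dataset.
x = rng.normal(mu, sigma, size=n)
print("estimated se from one sample:    ", x.std(ddof=1) / np.sqrt(n))
```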
---
- Links: [[All Of Statistics - Larry Wasserman|All of Stats]]
- Created at: [[2021-09-26]]