
Commit cb6fe08
Update regression.md
1 parent 637bd2c

notes/6_regression/regression.md
Lines changed: 143 additions & 25 deletions

## Regression Analysis

Regression analysis and curve fitting are important tools in statistics, econometrics, engineering, and modern machine-learning pipelines. At their core they seek a deterministic (or probabilistic) mapping

$\widehat f: \mathcal X \longrightarrow \mathcal Y$

that minimizes a suitably chosen loss function with respect to a sample of observations

$\mathcal D = \{(\mathbf x_1,y_1),\dots,(\mathbf x_N,y_N)\}\subseteq \mathcal X\times\mathcal Y.$

A **regression problem** is typically posed under the additive error model

$y_i\;=\; f_*(\mathbf x_i)\; +\; \varepsilon_i,\qquad \mathbb E[\varepsilon_i\mid\mathbf x_i]=0,\; \operatorname{Var}(\varepsilon_i)=\sigma^2,$

where $f_*$ is an (unknown) deterministic function and $(\varepsilon_i)$ are random errors. The analyst's objective is to construct an estimator $\widehat f$ (or, equivalently, to estimate a parameter vector $\widehat{\boldsymbol\theta}$ specifying $\widehat f$) such that some notion of risk (mean-squared error, negative log-likelihood, predictive log-loss, etc.) is minimized.

Below we preserve the original high-level outline but expand each section in greater mathematical detail. Symbols used repeatedly are collected in the following table.

| Symbol | Meaning |
| ------ | ------- |
| $N$ | sample size (number of observations) |
| $p$ | number of predictors (features) |
| $\mathbf X \in \mathbb R^{N\times p}$ | design / model matrix whose $i$-th row is $\mathbf x_i^\top$ |
| $\mathbf y = (y_1,\dots,y_N)^\top$ | vector of responses |
| $\boldsymbol\beta\in\mathbb R^{p}$ | vector of unknown regression coefficients |
| $\widehat{\boldsymbol\beta}$ | estimator of $\boldsymbol\beta$ |
| $\mathbf r=\mathbf y-\mathbf X\widehat{\boldsymbol\beta}$ | vector of residuals |
| $\lVert\cdot\rVert_2$ | Euclidean ($\ell_2$) norm |

---

### Curve Fitting

Curve fitting emphasizes the geometric problem of approximating a cloud of points by a parametric curve or surface. The archetypal formulation is **polynomial least squares**: given scalar inputs $x_i\in\mathbb R$, fit a degree-$m$ polynomial

$P_m(x)=\sum_{k=0}^{m} a_k x^{k}\quad (\boldsymbol a\in\mathbb R^{m+1})$

by minimizing the **sum-of-squares loss**

\begin{align}
S(\boldsymbol a)&=\sum_{i=1}^{N}\bigl(P_m(x_i)-y_i\bigr)^2.
\end{align}

In matrix form, let $\mathbf V\in\mathbb R^{N\times(m+1)}$ be the *Vandermonde matrix* with $V_{ik}=x_i^{k}$ ($k=0,\dots,m$) and $\mathbf a=(a_0,\dots,a_m)^\top$. The normal equations read

$\mathbf V^{\top}\mathbf V\,\widehat{\mathbf a}=\mathbf V^{\top}\mathbf y.$

Provided $\mathbf V^{\top}\mathbf V$ is nonsingular (which fails when $m \ge N$ or when the $x_i$ take fewer than $m+1$ distinct values), the minimizer is uniquely given by

$\widehat{\mathbf a}=(\mathbf V^{\top}\mathbf V)^{-1}\mathbf V^{\top}\mathbf y.$
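
The normal-equations derivation can be checked numerically. The sketch below (plain NumPy, with made-up noisy data) builds the Vandermonde matrix and solves the least-squares problem; solving with an orthogonal factorisation is numerically safer than inverting $\mathbf V^{\top}\mathbf V$ explicitly.

```python
import numpy as np

# Illustrative noisy data (made up for this sketch; not from the notes).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 30)
y = 1.0 + 0.5 * x - 0.3 * x**2 + rng.normal(scale=0.2, size=x.size)

m = 2                                        # polynomial degree
V = np.vander(x, m + 1, increasing=True)     # V[i, k] = x_i**k, columns 1, x, x^2

# Least-squares solve via orthogonal factorisation (no explicit inverse of V^T V).
a_hat, *_ = np.linalg.lstsq(V, y, rcond=None)

# np.polyfit solves the same problem but returns coefficients in descending
# powers, so the reversed result should agree with a_hat.
print(a_hat)
print(np.polyfit(x, y, m)[::-1])
```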

![curve_fitting](https://github.com/djeada/Numerical-Methods/assets/37275728/03a26675-9baa-4557-92fb-2ab86c9d7b7c)

> **Remark 1 (Overfitting and Regularisation).** High-degree polynomials can interpolate noisy data yet extrapolate disastrously. Ridge ($\ell_2$) or Lasso ($\ell_1$) penalties enforce smoothness or sparsity:
> $S_{\lambda}(\boldsymbol a)=\underbrace{\|\mathbf V\boldsymbol a-\mathbf y\|_2^2}_{\text{data-fit}}\; +\; \lambda\underbrace{\|\boldsymbol a\|_q^q}_{\text{regulariser}},\quad q\in\{1,2\}.$
> Closed-form solutions exist for $q=2$; for $q=1$ convex optimisation is required.
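
For the $q=2$ case the penalised objective in Remark 1 still has a closed form. A minimal sketch follows (the function name and data are illustrative; note that the intercept coefficient is penalised along with the others here, which one may or may not want):

```python
import numpy as np

def ridge_poly_fit(x, y, m, lam):
    """Ridge (q = 2) polynomial fit: minimise ||V a - y||_2^2 + lam * ||a||_2^2."""
    V = np.vander(x, m + 1, increasing=True)   # V[i, k] = x_i**k
    # Closed form: (V^T V + lam I)^{-1} V^T y, solved without an explicit inverse.
    return np.linalg.solve(V.T @ V + lam * np.eye(m + 1), V.T @ y)

# lam = 0 recovers ordinary polynomial least squares; larger lam shrinks a_hat.
```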

Other classical curve-fitting families include **splines**, **B-splines**, **Bézier curves**, **wavelet bases**, and **kernel smoothers** (e.g. the Nadaraya–Watson estimator). Each trades parametric flexibility against interpretability and computational cost.

---

### Regression Analysis

In modern statistical language, *regression* is synonymous with *conditional mean modelling*. We assume

$\mathbb E[\,y\mid\mathbf x\,]=\mu(\mathbf x;\,\boldsymbol\beta),$

where $\mu(\cdot\,;\boldsymbol\beta)$ is a known mean function indexed by parameters $\boldsymbol\beta$. The task is to estimate $\boldsymbol\beta$ from i.i.d. samples.

#### 1. Linear Model

When

$\mu(\mathbf x;\,\boldsymbol\beta)=\mathbf x^{\top}\boldsymbol\beta,$

the model is **linear in parameters**. Writing $\mathbf X\boldsymbol\beta$ for the vector of fitted values, the *ordinary least squares* (OLS) estimator is obtained by solving

\begin{align}
\widehat{\boldsymbol\beta}_{\text{OLS}}&=\arg\min_{\boldsymbol\beta}\;\|\mathbf y-\mathbf X\boldsymbol\beta\|_2^2.
\end{align}

Assuming $\operatorname{rank}(\mathbf X)=p\le N$, the solution is

$\widehat{\boldsymbol\beta}_{\text{OLS}}=(\mathbf X^{\top}\mathbf X)^{-1}\mathbf X^{\top}\mathbf y.$

**Gauss–Markov Theorem.** Under spherical errors $\operatorname{Cov}(\boldsymbol\varepsilon)=\sigma^{2}\mathbf I_N$, OLS is the **best linear unbiased estimator** (BLUE): for any linear unbiased estimator $\tilde{\boldsymbol\beta}=\mathbf C\mathbf y$ with $\mathbf C\mathbf X=\mathbf I_p$ we have

$\operatorname{Var}(\tilde{\boldsymbol\beta})-\operatorname{Var}(\widehat{\boldsymbol\beta}_{\text{OLS}})\;\text{is positive semi-definite}.$
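
As a complement to the closed form, here is a small NumPy sketch of the OLS fit together with the residual variance estimate $\widehat\sigma^2=\mathrm{RSS}/(N-p)$ and the usual standard errors from $\widehat\sigma^2(\mathbf X^{\top}\mathbf X)^{-1}$ (the design matrix `X` and response `y` are assumed to be given):

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate, standard errors and residual variance (sketch only)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||y - X beta||_2
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)               # unbiased variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # Var(beta_hat) under spherical errors
    return beta, np.sqrt(np.diag(cov)), sigma2
```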

#### 2. Generalised Linear Model (GLM)

For exponential-family responses ($y\sim$ Bernoulli, Poisson, Gamma, etc.) we posit

$g(\,\mu(\mathbf x)\,) = \mathbf x^{\top}\boldsymbol\beta,$

where $g$ is a monotonic *link* function. For example, logistic regression sets $g(\mu)=\log\bigl(\mu/(1-\mu)\bigr)$. Parameters are estimated via **maximum likelihood**,

$\widehat{\boldsymbol\beta}=\arg\max_{\boldsymbol\beta}\; \prod_{i=1}^{N} f_Y\bigl(y_i;\,\mu_i(\boldsymbol\beta)\bigr),$

which is solved by Fisher scoring or (quasi-)Newton iterations.
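
In practice one rarely hand-codes Fisher scoring. A hedged sketch using `statsmodels` (assuming it is installed; the simulated Poisson data, coefficients and log link are illustrative) would look roughly like this:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=200)
X = sm.add_constant(x)                     # columns: intercept, x
y = rng.poisson(np.exp(0.3 + 0.8 * x))     # true coefficients (0.3, 0.8)

# GLM with Poisson family and canonical log link; .fit() runs IRLS / Fisher scoring.
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)                       # estimates close to (0.3, 0.8)
```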

#### 3. Non-linear Least Squares (NLS)

Suppose $\mu(\mathbf x;\boldsymbol\beta)$ is nonlinear in $\boldsymbol\beta$, e.g. Michaelis–Menten kinetics

$\mu(x;V_{\max},K_m)=\frac{V_{\max}x}{K_m+x}.$

The loss

$S(\boldsymbol\beta)=\sum_{i=1}^{N}\bigl(y_i-\mu(\mathbf x_i;\boldsymbol\beta)\bigr)^2$

is then generally non-convex in $\boldsymbol\beta$; Levenberg–Marquardt or trust-region methods are the standard solvers.
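
For the Michaelis–Menten example, `scipy.optimize.curve_fit` wraps exactly this kind of non-linear least-squares solve. A sketch with illustrative, made-up data:

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, v_max, k_m):
    return v_max * x / (k_m + x)

# Hypothetical substrate concentrations and noisy reaction rates.
x = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.28, 0.55, 0.86, 1.20, 1.45, 1.62])

# curve_fit performs the NLS solve; because the objective is non-convex,
# a sensible starting guess p0 matters.
popt, pcov = curve_fit(michaelis_menten, x, y, p0=[2.0, 1.0])
v_max_hat, k_m_hat = popt
print(v_max_hat, k_m_hat)
```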

---

### Concepts in Regression

| Concept | Formal Definition |
| ------- | ----------------- |
| **Parameter Estimation** | Solve $\widehat{\boldsymbol\theta}=\arg\min_{\boldsymbol\theta}\,\mathcal L(\boldsymbol\theta)$, where $\mathcal L$ is a least-squares or negative log-likelihood criterion. |
| **Fitted Values** | $\widehat{y}_i = \mu(\mathbf x_i;\widehat{\boldsymbol\theta})$ |
| **Residuals** | Raw: $r_i=y_i-\widehat{y}_i$; internally studentised: $r_i/(\hat\sigma\sqrt{1-h_{ii}})$, with $h_{ii}$ the hat-matrix diagonal. |
| **Loss / Error** | Regression: $\mathrm{RSS}=\sum_{i}r_i^{2}$; classification: $\mathrm{CrossEntropy}=-\sum_{i}\bigl[y_i\log\widehat{y}_i+(1-y_i)\log(1-\widehat{y}_i)\bigr]$ |
| **Risk** | $R(\widehat f)=\mathbb E\bigl[\mathcal L(\widehat f(\mathbf x),y)\bigr]$. Empirical risk minimisation (ERM) replaces $\mathbb E$ by the sample mean. |
| **Goodness-of-Fit** | $R^2 = 1-\tfrac{\mathrm{RSS}}{\mathrm{TSS}}$ with $\mathrm{TSS}=\sum_{i}(y_i-\bar y)^2$; adjusted $\bar R^2=1-(1-R^2)\tfrac{N-1}{N-p-1}$; $\mathrm{AIC}=2k-2\log\hat L$; $\mathrm{BIC}=k\log N-2\log\hat L$. |
| **Inference** | Wald test: $z_j = \widehat\beta_j/\widehat{\mathrm{se}}(\widehat\beta_j)\stackrel{\text{approx}}{\sim} N(0,1)$; LR test: $2(\ell_1-\ell_0)\sim\chi^2_{df}$. |
| **Prediction Interval** | For new $\mathbf x_0$: $\widehat y_0\pm t_{N-p,\,1-\alpha/2}\;\widehat\sigma\sqrt{1+ \mathbf x_0^{\top}(\mathbf X^{\top}\mathbf X)^{-1}\mathbf x_0}$. |
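
Several of the goodness-of-fit quantities above are easy to compute from residuals alone. A small sketch under a Gaussian-likelihood assumption (here `p` counts the regression coefficients, and the error variance is counted as one extra parameter in $k$, one of several common conventions):

```python
import numpy as np

def fit_metrics(y, y_hat, p):
    """R^2, adjusted R^2, AIC and BIC under a Gaussian likelihood (sketch)."""
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    # Gaussian log-likelihood evaluated at the MLE sigma^2 = RSS / n.
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    k = p + 1                          # coefficients plus the error variance
    return {"R2": r2, "adjR2": adj_r2,
            "AIC": 2 * k - 2 * loglik, "BIC": k * np.log(n) - 2 * loglik}
```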

---

### Types of Regression Methods

1. **Ordinary Least Squares (OLS)** – Closed-form solution; BLUE under the Gauss–Markov conditions.
2. **Ridge Regression** – Penalised least squares with penalty $\lambda\|\boldsymbol\beta\|_2^2$; solution $\widehat{\boldsymbol\beta}=(\mathbf X^{\top}\mathbf X+\lambda\mathbf I)^{-1}\mathbf X^{\top}\mathbf y$ (a short scikit-learn sketch of this and several other estimators follows this list).
3. **Lasso & Elastic Net** – $\ell_1$ and mixed $\ell_1+\ell_2$ penalties promoting sparsity; solved by coordinate descent or LARS.
4. **Generalised Linear Models (GLM)** – Logistic, probit, Poisson; estimated by iteratively reweighted least squares.
5. **Non-linear Regression (NLS)** – Fitted with gradient-based optimisers; asymptotic theory requires identifiability and regularity conditions.
6. **Robust Regression** – M-estimators with Huber or Tukey bisquare $\rho$-functions; minimises $\sum_{i}\rho(r_i/\hat\sigma)$.
7. **Quantile Regression** – Minimises the asymmetric absolute loss $\sum_{i}\rho_\tau(r_i)$ with $\rho_\tau(u)=u(\tau-\mathbb 1_{u<0})$.
8. **Bayesian Regression** – Places a prior $p(\boldsymbol\beta)$ and returns the posterior $p(\boldsymbol\beta\mid\mathbf y)\propto L(\boldsymbol\beta)\,p(\boldsymbol\beta)$; the predictive distribution integrates over the posterior.
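
Several of these estimators are available off the shelf. Assuming scikit-learn is installed, a rough comparison sketch (hypothetical simulated data; the regularisation strengths are arbitrary) might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, HuberRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=100)
y[::10] += 8.0                                   # a few gross outliers

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),                   # l2 penalty
    "lasso": Lasso(alpha=0.1),                   # l1 penalty, sparse coefficients
    "huber": HuberRegressor(),                   # robust M-estimator
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```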

> **Computational Note.** High-dimensional ($p\gg N$) problems demand numerical linear-algebra tricks: the Woodbury identity, iterative conjugate gradients, stochastic gradient descent (SGD), or variance-reduced methods (SVRG, SAGA).

---

### Worked Examples

#### Example 1 – OLS in Matrix Form

Let

$\mathbf X = \begin{bmatrix}1 & 0.8\\1 & 1.2\\1 & 1.9\\1 & 2.4\\1 & 3.0\end{bmatrix},\qquad \mathbf y = \begin{bmatrix}1.2\\1.9\\3.1\\3.9\\5.1\end{bmatrix}.$

Computing $\mathbf X^{\top}\mathbf X=\begin{bmatrix}5 & 9.3\\9.3 & 20.45\end{bmatrix}$ and $\mathbf X^{\top}\mathbf y=\begin{bmatrix}15.2\\33.79\end{bmatrix}$ gives

$\widehat{\boldsymbol\beta}=(\mathbf X^{\top}\mathbf X)^{-1}\mathbf X^{\top}\mathbf y\approx\begin{bmatrix}-0.216\\1.751\end{bmatrix},\quad R^2\approx 0.999.$

Thus the fitted line is $\widehat y\approx -0.216+1.751x$.
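
These numbers can be verified in a few lines of NumPy; the snippet simply redoes the matrix arithmetic of the example:

```python
import numpy as np

X = np.array([[1, 0.8], [1, 1.2], [1, 1.9], [1, 2.4], [1, 3.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])

XtX = X.T @ X                      # [[5.0, 9.3], [9.3, 20.45]]
Xty = X.T @ y                      # [15.2, 33.79]
beta = np.linalg.solve(XtX, Xty)   # approx [-0.216, 1.751]

resid = y - X @ beta
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)   # approx 0.999
print(beta, r2)
```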

#### Example 2 – Logistic Regression, MLE Derivatives

For binary data $y_i\in\{0,1\}$ the log-likelihood is

$\ell(\boldsymbol\beta)=\sum_{i=1}^{N}\Bigl[y_i\,\mathbf x_i^{\top}\boldsymbol\beta-\log\bigl\{1+e^{\mathbf x_i^{\top}\boldsymbol\beta}\bigr\}\Bigr].$

Gradient and Hessian:

\begin{align}
\nabla\ell &= \mathbf X^{\top}(\mathbf y-\boldsymbol\pi),\qquad \boldsymbol\pi=\operatorname{logit}^{-1}(\mathbf X\boldsymbol\beta),\\
\nabla^2\ell &=-\mathbf X^{\top}\operatorname{diag}\bigl(\boldsymbol\pi\odot(1-\boldsymbol\pi)\bigr)\,\mathbf X \quad\text{(negative definite for full-rank }\mathbf X\text{)}.
\end{align}

Newton iteration: $\boldsymbol\beta^{(t+1)}=\boldsymbol\beta^{(t)}-(\nabla^2\ell)^{-1}\nabla\ell$.
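
The Newton update transcribes almost line for line into code. A sketch follows (a design matrix `X` with an intercept column and a binary `y` are assumed; no safeguard against perfectly separated data is included):

```python
import numpy as np

def logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson (equivalently IRLS) for the logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))             # logit^{-1}(X beta)
        grad = X.T @ (y - pi)                            # gradient of log-likelihood
        hess = -X.T @ ((pi * (1.0 - pi))[:, None] * X)   # Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta - step                               # beta - (Hessian)^{-1} gradient
        if np.max(np.abs(step)) < tol:
            break
    return beta
```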

---

### Applications

* **Finance & Econometrics** – Capital asset pricing (CAPM), term-structure models, volatility forecasting (GARCH regression), default-probability prediction.
* **Healthcare & Epidemiology** – Survival analysis (Cox proportional hazards), dose-response curves, genome-wide association studies (GWAS) via penalised regression.
* **Engineering** – System identification, Kalman-filter regressions, fatigue-life modelling.
* **Marketing & A/B Testing** – Uplift modelling, mixed-effects regressions for hierarchical data.
* **Machine Learning Pipelines** – Feature-engineering baselines, stacking/blending meta-learners, interpretability audits.

---

### Limitations & Pitfalls

1. **Model Misspecification** – When $f_*(\mathbf x)$ lies outside the chosen hypothesis class, estimators remain biased even as $N\to\infty$.
2. **Violation of IID** – Autocorrelated or clustered errors require GLS or sandwich covariance estimators.
3. **Heteroscedasticity** – $\operatorname{Var}(\varepsilon_i\mid\mathbf x_i)=\sigma_i^2$ invalidates the usual OLS variance formulas; use White/HC estimators.
4. **Multicollinearity** – Near-linear dependence among predictors inflates $\operatorname{Var}(\widehat\beta_j)$; ridge regularisation shrinks the condition number.
5. **High-Leverage Points & Outliers** – Influence is measured by Cook's $D_i=\frac{r_i^2 h_{ii}}{p\hat\sigma^2(1-h_{ii})^2}$; robust M-estimators mitigate their effect.
6. **Overfitting / High Variance** – Cross-validation, information criteria, or Bayesian model averaging select model complexity (see the sketch after this list).
7. **External Validity** – Regression learns the conditional mean on $\mathcal D$; distribution shift (covariate shift, concept drift) breaks prediction.
8. **Causal Inference vs. Prediction** – Regression coefficients are not causal unless confounding is addressed (instrumental variables, RCTs, DAGs).
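
As an illustration of point 6, model complexity for the polynomial fits of the curve-fitting section can be chosen by cross-validation. A plain-NumPy sketch (scalar inputs `x`, squared-error loss, and an arbitrary fold count are assumed):

```python
import numpy as np

def cv_mse_poly(x, y, degree, k=5, seed=0):
    """k-fold cross-validated MSE of a degree-`degree` polynomial fit (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

# Evaluating cv_mse_poly over a grid of degrees and taking the minimiser is
# the usual recipe for choosing model complexity.
```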

---

### Further Reading

1. Seber, G. A. F., & Lee, A. J. (2003). *Linear Regression Analysis* (2nd ed.). Wiley.
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.
3. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman & Hall.
4. Kennedy, P. (2008). *A Guide to Econometrics* (7th ed.). Wiley-Blackwell.