Updated linear least squares docs

cs357 · Mar 7, 2024 · 5fa29bc · 5fa29bc
1 parent 1675f65
commit 5fa29bc
Showing 1 changed file with 109 additions and 49 deletions.
diff --git a/notes/linear-least-squares.md b/notes/linear-least-squares.md
@@ -1,7 +1,70 @@
 ---
 title: Least Squares Data Fitting
-description: Add description here...
+description: Solving Least Squares problems with different methods
 sort: 17
+author:
+  - CS 357 Course Staff
+changelog:
+  - 
+    name: Bhargav Chandaka
+    netid: bhargav9
+    date: 2024-03-6
+    message: major reorganization to match up with content in slides/videos
+  - 
+    name: Yuxuan Chen
+    netid: yuxuan19
+    date: 2023-4-28
+    message: adding computational complexity using reduced SVD
+  - 
+    name: Arnav Shah
+    netid: arnavss2
+    date: 2022-04-9
+    message: add few comments from slides asked in homework
+  - 
+    name: Jerry Yang
+    netid: jiayiy7
+    date: 2020-08-8
+    message: adds formal proof link for solving least-squares using SVD
+  - 
+    name: Mariana Silva
+    netid: mfsilva
+    date: 2020-4-26
+    message: improved text overall; removed theory of the nonlinear least-squares
+  # - 
+  #   name: Erin Carrier
+  #   netid: ecarrie2
+  #   date: 2018-11-14
+  #   message: fix typo in lstsq res sum range
+  - 
+    name: Erin Carrier
+    netid: ecarrie2
+    date: 2018-1-14
+    message: removes demo links
+  # - 
+  #   name: Erin Carrier
+  #   netid: ecarrie2
+  #   date: 2017-11-29
+  #   message: fixes typos in lst-sq code, jacobian desc in nonlinear lst-sq
+  # - 
+  #   name: Erin Carrier
+  #   netid: ecarrie2
+  #   date: 2017-11-17
+  #   message: fixes incorrect link
+  # - 
+  #   name: Erin Carrier
+  #   netid: ecarrie2
+  #   date: 2017-11-16
+  #   message: adds review questions, minor formatting changes throughout for consistency, adds normal equations and interp vs lst-sq sections,  removes Gauss-Newton from nonlinear least squares
+#   - 
+#     name: Yu Meng
+#     netid: yumeng5
+#     date: 2017-11-12
+#     message: first complete draft
+  # - 
+  #   name: Luke Olson
+  #   netid: lukeo
+  #   date: 2017-10-17
+  #   message: outline
 ---
 # Least Squares Data Fitting
 
@@ -16,44 +79,34 @@ sort: 17
 
 ## Linear Regression with a Set of Data
 
-Consider a set of <span>\\(m\\)</span> data points (where <span>\\(m>2\\)</span>), \\(\{(t_1,y_1),(t_2,y_2),\dots,(t_m,y_m)\}\\). Suppose we want to find a straight line that best fits these data points.
+Given <span>\\(m\\)</span> data points (where <span>\\(m>2\\)</span>), \\(\\{(t_1,y_1),(t_2,y_2),\dots,(t_m,y_m)\\}\\), we want to find a straight line that best fits these data points.
 Mathematically, we are finding $$x_0$$ and $$x_1$$ such that
-<div>\[ y_i = x_1\,t_i + x_0, \quad \forall i \in [1,m]. \]</div>
+<div>\[ y_i = x_0 + x_1\,t_i , \quad \forall i \in [1,m]. \]</div>
 
-In matrix form, the resulting linear system is
+In matrix form, the resulting linear system is:
 <div>\[ \begin{bmatrix} 1 & t_1 \\ 1& t_2 \\ \vdots & \vdots\\ 1& t_m  \end{bmatrix} \begin{bmatrix} x_0\\ x_1 \end{bmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix} \]</div>
 
-However, it is obvious that we have more equations than unknowns, and there is usually no exact solution to the above problem.
+$${\bf A x} = {\bf b} $$
 
-Generally, if we have a linear system
+where $${\bf A}$$ is an \\(m\times n\\) matrix, $${\bf x}$$ is an \\(n\times 1\\) matrix, and $${\bf b}$$ is an \\(m\times 1\\) matrix. 
 
-$${\bf A x} = {\bf b} $$
+Ideally, we want to find the appropriate linear combination of the columns of $${\bf A}$$ that makes up the vector $${\bf b}$$. If a solution exists that satisfies $${\bf A x} = {\bf b} $$, then $${\bf b} \in range({\bf A})$$.
 
-where $${\bf A}$$ is an \\(m\times n\\) matrix. When <span>\\(m>n\\)</span>  we call this system **_overdetermined_** and the equality is usually not exactly satisfiable as $${\bf b}$$ may not lie in the column space of <span>\\({\bf A}\\)</span>.
+However, in this system of linear equations, we have more equations than unknowns, and there is usually no exact solution to the above problem.
+
+When \\( m>n\\), we call this linear system **_overdetermined_** and the $${\bf A x} = {\bf b} $$ equality is usually not exactly satisfiable as $${\bf b}$$ may not lie in the column space of $${\bf A}$$.
 
 Therefore, an overdetermined system is better written as
 
 $${\bf A x} \cong {\bf b} $$
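
As a minimal sketch of the setup above, the matrix $${\bf A}$$ for a straight-line fit can be assembled from a column of ones and a column of the $$t_i$$ values. The data points are borrowed from the SVD example later in these notes; the variable names (`rank_aug`, `b_in_range`) are illustrative assumptions, and the rank comparison is just one simple way to check whether $${\bf b}$$ lies in the range of $${\bf A}$$.

```python
import numpy as np

# Data points (t_i, y_i) taken from the SVD example later in these notes
t = np.array([1.0, 2.0, 3.0])
b = np.array([1.2, 1.9, 1.0])

# Columns of A: all ones (multiplies x_0) and the t values (multiplies x_1)
A = np.column_stack((np.ones_like(t), t))
m, n = A.shape  # m equations, n unknowns

# b lies in range(A) only if appending b as a column does not raise the rank
rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.column_stack((A, b)))
b_in_range = bool(rank_aug == rank_A)

print("A is", m, "x", n, "with rank", rank_A)
print("b in range(A):", b_in_range)
```

Here the three points are not collinear, so the augmented matrix has a larger rank than $${\bf A}$$ and no exact solution exists, which is why the system is written with $$\cong$$.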
 
 ## Linear Least-squares Problem
 
-For an overdetermined system \\({\bf A x}\cong {\bf b}\\), we are typically looking for a solution \\({\bf x}\\) that minimizes the squared Euclidean norm of the residual vector \\({\bf r} = {\bf b} - {\bf A} {\bf x}\\),
+For an overdetermined system \\({\bf A x}\cong {\bf b}\\), we are typically looking for a solution \\({\bf x}\\) that minimizes the Euclidean norm (or, equivalently, the squared Euclidean norm) of the residual vector \\({\bf r} = {\bf b} - {\bf A} {\bf x}\\),
 
 $$\min_{ {\bf x} } \|{\bf r}\|_2^2 = \min_{ {\bf x} } \|{\bf b} - {\bf A}  {\bf x}\|_2^2.$$
 
-This problem \\(A {\bf x} \cong {\bf b}\\) is called a **_linear least-squares problem_**, and the solution \\({\bf x}\\) is called **_least-squares solution_**. Linear Least Squares problem \\(A {\bf x} \cong {\bf b}\\) always has solution. Here we will first focus on linear least-squares problems.
-
-## Data Fitting vs Interpolation
-
-It is important to understand that interpolation and least-squares data fitting, while somewhat similar, are fundamentally different in their goals. In both problems we have a set of data points <span>\\((t_i, y_i)\\)</span>, \\(i=1,\ldots,m\\), and we are attempting to determine the coefficients for a linear combination of basis functions.
-
-With interpolation, we are looking for the linear combination of basis functions such that the resulting function passes through each of the data points _exactly_. So, for <span>\\(m\\)</span> unique data points, we need <span>\\(m\\)</span> linearly independent basis functions (and the resulting linear system will be square and full rank, so it will have an exact solution).
-
-In contrast, however, with least squares data fitting we have some model that we are trying to find the parameters of the model that best fits the data points. For example, with linear least squares we may have 300 noisy data points that we want to model as a quadratic function. Therefore, we are trying represent our data as
-
-$$y = x_0 + x_1 t + x_2 t^2 $$
-
-where <span>\\(x_0, x_1,\\)</span> and <span>\\(x_2\\)</span> are the unknowns we want to determine (the coefficients to our basis functions). Because there are significantly more data points than parameters, we do not expect that the function will exactly pass through the data points. For this example, with noisy data points we would not want our function to pass through the data points exactly as we are looking to model the general trend and not capture the noise.
+This problem \\({\bf A} {\bf x} \cong {\bf b}\\) is called a **_linear least-squares problem_**, and the solution \\({\bf x}\\) is called the **_least-squares solution_**. Here $${\bf A}$$ is an $${m \times n}$$ matrix with $${m \ge n}$$, where $${m}$$ is the number of data points and $${n}$$ is the number of parameters of the "best fit" function. The linear least-squares problem \\({\bf A} {\bf x} \cong {\bf b}\\) **_always_** has a solution, and this solution is unique if and only if $${rank({\bf A})= n}$$.
 
 ## Normal Equations
 
@@ -74,7 +127,7 @@ is called the system of **normal equations**. If the matrix $${\bf A} $$ is full
 
 $${\bf x} = ({\bf A} ^T {\bf A})^{-1} {\bf A} ^T \mathbf{b}$$
 
-We can look at the second-order sufficient condition of the the minimization problem by evaluating the Hessian of $$\phi$$:
+We can look at the second-order sufficient condition of the minimization problem by evaluating the Hessian of $$\phi$$:
 
 $${\bf H} = 2 {\bf A} ^T {\bf A}$$
 
@@ -89,17 +142,30 @@ Because of this, finding the least squares solution using Normal Equations is of
 Another approach to solve Linear Least Squares is to find $${\bf y} = {\bf A} {\bf x}$$ which is closest to the vector $${\bf b}$$.
 When the residual $${\bf r} = {\bf b} - {\bf y} = {\bf b} - {\bf A} {\bf x}$$ is orthogonal to all columns of $${\bf A}$$, then $${\bf y}$$ is closest to $${\bf b}$$.
 
+$${\bf A}^T{\bf r} = {\bf A^T}\left({\bf b} - {\bf A} {\bf x}\right) = 0 \implies {\bf A} ^T {\bf A}  {\bf x} = {\bf A}^T  {\bf b}$$
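
The orthogonality condition above can be checked numerically. The following is a small sketch, assuming the three data points from the SVD example in these notes; it solves the normal equations with a general-purpose solver and then verifies that the residual is orthogonal to the columns of $${\bf A}$$.

```python
import numpy as np

# Data points (t_i, y_i) taken from the SVD example in these notes
t = np.array([1.0, 2.0, 3.0])
b = np.array([1.2, 1.9, 1.0])
A = np.column_stack((np.ones_like(t), t))

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Residual of the least-squares solution
r = b - A @ x

print("x =", x)
print("A^T r =", A.T @ r)  # entries should be zero up to rounding error
```

For this data the fitted line is approximately $$y = 1.5667 - 0.1\,t$$, and both entries of $${\bf A}^T{\bf r}$$ vanish up to floating-point rounding.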
+
+## Data Fitting vs Interpolation
+
+It is important to understand that interpolation and least-squares data fitting, while somewhat similar, are fundamentally different in their goals. In both problems we have a set of data points <span>\\((t_i, y_i)\\)</span>, \\(i=1,\ldots,m\\), and we are attempting to determine the coefficients for a linear combination of basis functions.
+
+With interpolation, we are looking for the linear combination of basis functions such that the resulting function passes through each of the data points _exactly_. So, for <span>\\(m\\)</span> unique data points, we need <span>\\(m\\)</span> linearly independent basis functions (and the resulting linear system will be square and full rank, so it will have an exact solution).
+
+In contrast, with least squares data fitting we have a model, and we are trying to find the parameters of that model that best fit the data points. For example, with linear least squares we may have 300 noisy data points that we want to model as a quadratic function. Therefore, we are trying to represent our data as
+
+$$y = x_0 + x_1 t + x_2 t^2 $$
+
+where <span>\\(x_0, x_1,\\)</span> and <span>\\(x_2\\)</span> are the unknowns we want to determine (the coefficients to our basis functions). Because there are significantly more data points than parameters, we do not expect that the function will exactly pass through the data points. For this example, with noisy data points we would not want our function to pass through the data points exactly as we are looking to model the general trend and not capture the noise.
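
To make the quadratic example concrete, here is a hedged sketch that generates 300 synthetic noisy points and fits the three coefficients with `np.linalg.lstsq`. The "true" coefficients, the noise level, and the random seed are all made-up assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

m = 300
t = np.linspace(-1.0, 1.0, m)

# Hypothetical "true" coefficients (x_0, x_1, x_2), assumed for illustration
x_true = np.array([1.0, -2.0, 0.5])
noise = 0.1 * rng.standard_normal(m)
y = x_true[0] + x_true[1] * t + x_true[2] * t**2 + noise

# One column per basis function: 1, t, t^2  (A is 300 x 3)
A = np.column_stack((np.ones(m), t, t**2))

# Least-squares fit; rcond=None selects NumPy's default rank cutoff
x_fit, residual_ss, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)

# The fitted curve follows the trend but does not pass through the points
max_misfit = np.max(np.abs(A @ x_fit - y))
print("fitted coefficients:", x_fit)
print("largest pointwise misfit:", max_misfit)
```

The fitted coefficients land close to the assumed ones while the misfit at individual points stays nonzero, which is the intended behavior for data fitting as opposed to interpolation.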
+
 ## Computational Complexity
 
-Since the system of normal equations yield a square and symmetric matrix, the least-squares solution can be
-computed using efficient methods such as Cholesky factorization. Note that the overall computational complexity of the factorization is
-$$\mathcal{O}(n^3)$$. However, the construction of the matrix $${\bf A} ^T {\bf A}$$ has complexity $$\mathcal{O}(mn^2)$$.
-In typical data fitting problems, $$ m >> n$$ and hence the overall complexity of the Normal Equations method is $$\mathcal{O}(mn^2)$$.
+Since the system of normal equations yields a square and symmetric matrix (which is also positive definite when $${\bf A}$$ has full column rank), the least-squares solution can be computed using efficient methods such as Cholesky factorization. Note that the overall computational complexity of the factorization is $$\mathcal{O}(n^3)$$. However, the construction of the matrix $${\bf A} ^T {\bf A}$$ has complexity $$\mathcal{O}(mn^2)$$.
+
+In typical data fitting problems, $$ m \gg n$$ and hence the overall complexity of the Normal Equations method is $${\bf \mathcal{O}(mn^2)}$$.
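
The steps whose costs are counted above can be sketched as follows, assuming $${\bf A}$$ has full column rank so that the Cholesky factorization exists. This is an illustration rather than a tuned implementation: `np.linalg.solve` is a general solver that does not exploit the triangular structure of the factor, and the data points are the ones from the SVD example in these notes.

```python
import numpy as np

# Data points (t_i, y_i) taken from the SVD example in these notes
t = np.array([1.0, 2.0, 3.0])
b = np.array([1.2, 1.9, 1.0])
A = np.column_stack((np.ones_like(t), t))

# Form the normal equations: the O(m n^2) step
AtA = A.T @ A
Atb = A.T @ b

# Cholesky factorization AtA = L L^T: the O(n^3) step
L = np.linalg.cholesky(AtA)

# Two triangular systems: L w = A^T b, then L^T x = w
w = np.linalg.solve(L, Atb)
x = np.linalg.solve(L.T, w)

print("x =", x)
```

A dedicated triangular solver would perform each of the two solves in $$\mathcal{O}(n^2)$$ operations, so forming $${\bf A}^T{\bf A}$$ remains the dominant cost when $$m \gg n$$.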
 
 ## Solving Least-Squares Problems Using SVD
 
 Another way to solve the least-squares problem \\({\bf A} {\bf x} \cong {\bf b}\\)
-(where we are looking for \\({\bf x}\\) that minimizes $$\|{\bf b} - {\bf A} {\bf x}\|_2^2$$ is to use the singular value decomposition
+(where we are looking for \\({\bf x}\\) that minimizes $$\|{\bf b} - {\bf A} {\bf x}\|_2^2$$) is to use the singular value decomposition
 (SVD) of <span>\\({\bf A}\\)</span>,
 
 $${\bf A} = {\bf U \Sigma V}^T $$
@@ -181,7 +247,7 @@ where \\({\bf u}_i\\) represents the <span>\\(i\\)</span>th column of <span>\\({
 
 #### Example of a Least-squares solution using SVD
 
-Assume we have <span>\\(3\\)</span> data points, \\(\{(t_i,y_i)\}=\{(1,1.2),(2,1.9),(3,1)\}\\), we want to find a line that best fits these data points. The code for using SVD to solve this least-squares problem is:
+Assume we have <span>\\(3\\)</span> data points, \\(\{(t_i,y_i)\}=\{(1,1.2),(2,1.9),(3,1)\}\\), and we want to find the coefficients for a line, $${y = x_0 + x_1 t}$$, that best fits these data points. The code for using SVD to solve this least-squares problem is:
 
 ```python
 import numpy as np
@@ -195,6 +261,7 @@ y = np.zeros(len(A[0]))
 z = np.dot(U.T,b)
 k = 0
 threshold = 0.01
+# multiply z = U^T b by the pseudo-inverse of Sigma, stopping at the first singular value at or below the threshold
 while k < len(A[0]) and s[k] > threshold:
   y[k] = z[k]/s[k]
   k += 1
@@ -214,23 +281,16 @@ If the fitting function \\(f(t,{\bf x})\\) for data points $$(t_i,y_i), i = 1, .
 is a **_non-linear least-squares problem_**.
 
 ## Review Questions
-
-- See this [review link](/cs357/fa2020/reviews/rev-17-least-squares.html)
-
-## ChangeLog
-
-* 2023-04-28 Yuxuan Chen <[email protected]>: adding computational complexity using reduced SVD
-* 2022-04-09 Arnav Shah <[email protected]>: add few comments from slides asked in homework
-* 2020-08-08 Jerry Yang <[email protected]>: adds formal proof link for solving least-squares using SVD
-* 2020-04-26 Mariana Silva <[email protected]>: improved text overall; removed theory of the nonlinear least-squares
-* 2018-11-14 Erin Carrier <[email protected]>: fix typo in lstsq res sum range
-* 2018-01-14 Erin Carrier <[email protected]>: removes demo links
-* 2017-11-29 Erin Carrier <[email protected]>: fixes typos in lst-sq code,
-  jacobian desc in nonlinear lst-sq
-* 2017-11-17 Erin Carrier <[email protected]>: fixes incorrect link
-* 2017-11-16 Erin Carrier <[email protected]>: adds review questions
-  minor formatting changes throughout for consistency,
-  adds normal equations and interp vs lst-sq sections
-  removes Gauss-Newton from nonlinear least squares
-* 2017-11-12 Yu Meng <[email protected]>: first complete draft
-* 2017-10-17 Luke Olson <[email protected]>: outline
+1. What does the least-squares solution minimize?
+2. For a given model and given data points, can you form the system $${\bf A x} \cong {\bf b} $$ for a least squares problem?
+3. For a small problem, given some data points and a model, can you determine the least squares solution?
+4. In general, what can we say about the value of the residual for the least squares solution?
+5. What are the differences between least squares data fitting and interpolation?
+6. Given the SVD of a matrix $${\bf A}$$, how can we use the SVD to compute the residual of the least squares solution?
+7. Given the SVD of a matrix $${\bf A}$$, how can we use the SVD to compute the least squares solution? Be able to do this for a small problem.
+8. Given an already computed SVD of a matrix $${\bf A}$$, what is the cost of using the SVD to solve a least squares problem?
+9. Why would you use the SVD instead of normal equations to find the solution to $${\bf A x} \cong {\bf b} $$?
+10. Which costs less: solving a least squares problem via the normal equations or solving a least squares problem using the SVD?
+11. What is the difference between a linear and a nonlinear least squares problem? What sort of model makes it a nonlinear problem? For data points $${\left(t_i, y_i\right)}$$, is fitting $${y = a\cos(t) + b}$$, where $${a}$$ and $${b}$$ are the coefficients we are trying to determine, a linear or nonlinear least squares problem?
+- See this [review link](/cs357/fa2020/reviews/rev-17-least-squares.html)