linear_regression_intro - wording
daslu committed Dec 29, 2024
1 parent a550ab9 commit 2d4e2d7
Showing 1 changed file with 55 additions and 71 deletions.
126 changes: 55 additions & 71 deletions notebooks/noj_book/linear_regression_intro.clj
@@ -1,11 +1,11 @@
;; # Intro to Linear Regression - DRAFT 🛠
;; # Introduction to Linear Regression - DRAFT 🛠

;; **last update:** 2024-12-29

;; Here we offer an intro to [linear regression](https://en.wikipedia.org/wiki/Linear_regression)
;; following the
;; In this tutorial, we introduce the fundamentals of [linear regression](https://en.wikipedia.org/wiki/Linear_regression),
;; guided by the
;; [In Depth: Linear Regression](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html)
;; section of the
;; chapter of the
;; [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
;; by Jake VanderPlas.

@@ -25,20 +25,19 @@

;; ## Simple Linear Regression

;; Let us start with the basic model of a straight-line fit, where points $(x,y)$ are
;; assumed to have a relationship where $y$ can be predicted as $$y=ax+b.$$
;; Here, $a$ is the slope of the line, and $b$ is the so-called intercept, the height of
;; crossing the $y$ axis.
;; We begin with the classic straight-line model: for data points $(x, y)$,
;; we assume there is a linear relationship allowing us to predict $y$ as
;; $$y = ax + b.$$
;; In this formulation, $a$ is the slope and $b$ is the intercept,
;; the point where our line would cross the $y$ axis.
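
;; As a quick sanity check of the formula: with $a=2$ and $b=-5$,
;; the input $x=3$ should give $y = 2 \cdot 3 - 5 = 1$.
;; A tiny sketch in plain Clojure:

(let [a 2
      b -5
      x 3]
  (+ (* a x) b))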

;; To explore this kind of relationship, we will use
;; Fastmath and Tablecloth to
;; generate random data where it holds for $a=2$ and $b=-5$.
;; To illustrate, we'll use Fastmath and Tablecloth to create synthetic data
;; in which the relationship is known to hold with $a=2$ and $b=-5$.

;; For each row of the dataset below, we pick $x$ as a uniform
;; random number between 0 to 10, and we compute $y$ as
;; $ax+b$ plus additional standard Gaussian random error
;; (Normally distributed with mean 0 and variance 1),
;; independently of all other rows.
;; For each row in the dataset below, we draw $x$ uniformly from 0 to 10
;; and compute $y = ax + b$ plus an extra random noise term
;; (drawn from a standard Gaussian distribution).
;; This noise is added independently for every row.

(def simple-linear-data
(let [rng (rand/rng 1234)
@@ -56,20 +55,20 @@

simple-linear-data
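
;; To make the recipe above concrete, here is a minimal, self-contained
;; sketch of the same construction in plain Clojure, using the
;; Box-Muller transform for the Gaussian noise. (The notebook itself
;; uses Fastmath's seeded RNG and a Tablecloth dataset instead.)

(defn rand-gaussian
  "A standard normal sample via the Box-Muller transform."
  []
  (* (Math/sqrt (* -2.0 (Math/log (- 1.0 (rand)))))
     (Math/cos (* 2.0 Math/PI (rand)))))

(def toy-xy-pairs
  (for [_ (range 5)]
    (let [x (* 10 (rand))
          y (+ (* 2 x) -5 (rand-gaussian))]
      {:x x, :y y})))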

;; Let us plot the data using Tableplot's Plotly API.
;; Let's plot these points using Tableplot's Plotly API.

(-> simple-linear-data
plotly/layer-point)

;; ### Regression using Fastmath

;; To estimate a linear relationship, we may use the Fastmath library.
;; We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
(reg/lm
;; ys - a "column" sequence of `y` values:
(simple-linear-data :y)
;; xss - a sequence of "rows", each containing `x` values
;; xss - a sequence of "rows", each containing `x` values:
;; (one `x` per row, in our case):
(-> simple-linear-data
(tc/select-columns [:x])
@@ -81,39 +80,33 @@ simple-linear-data

simple-linear-data-model

;; When the fit model is printed, we get a nice tabular summary.
;; Let us capture the printed value and display it here.
;; Using [Kindly](https://scicloj.github.io/kindly/),
;; we mark it as `kind/code` for convenient rendering.
;; Printing the model gives a tabular summary.
;; We'll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
(with-out-str
(println
simple-linear-data-model)))

;; We can see the regression could estimate the intercept $b$ and
;; the slope $a$ (the coefficient of $x$).
;; As you can see, the estimated coefficients match our intercept $b$
;; and slope $a$ (the coefficient of $x$).
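
;; Since the fitted model can be applied as a function to a feature row
;; (as we will also do below), we can recover both estimates directly
;; from predictions: the prediction at $x=0$ is the fitted intercept,
;; and the difference between the predictions at $x=1$ and $x=0$ is the
;; fitted slope. A sketch (the exact numbers vary with the noise):

(let [f simple-linear-data-model
      b-hat (f [0])
      a-hat (- (f [1]) b-hat)]
  {:slope a-hat
   :intercept b-hat})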

;; ### Dataset ergonomics

;; Let us write a couple of convenience functions
;; that will help us use regression with datasets and view
;; the summary with Kindly.
;; We are working on similar functionality in the
;; [Tablemath](https://scicloj.github.io/tablemath) library,
;; which is still in experimental stage and is thus not included
;; in Noj yet.
;; Below are a couple of helper functions that simplify how we use regression with datasets
;; and display model summaries. We have similar ideas under development in the
;; [Tablemath](https://scicloj.github.io/tablemath) library, but it is still in
;; an experimental stage and not part of Noj yet.
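
;; The key building block is the dataset's notion of an inference
;; target: one column is marked as the target, and the mark is later
;; recognized through column metadata. A small sketch of the idea,
;; assuming `ds-mod` is `tech.v3.dataset.modelling` and that its
;; `set-inference-target` is the marking counterpart of the
;; `ds-mod/inference-column?` predicate used later in this tutorial:

(-> (tc/dataset {:x [1 2 3]
                 :y [2 4 6]})
    (ds-mod/set-inference-target :y)
    (tc/select-columns ds-mod/inference-column?))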

(defn lm
"Compute a linear regression model for `dataset`.
The first column marked as a target is used as the target;
all the remaining, unmarked columns are the features.
The resulting model is of type `fastmath.ml.regression.LMData`,
a generated by [Fastmath](https://github.com/generateme/fastmath).
created via [Fastmath](https://github.com/generateme/fastmath).
See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
for `options`."

([dataset]
(lm dataset nil))
([dataset options]
@@ -148,24 +141,22 @@ simple-linear-data-model

;; ### Prediction

;; Given a linear model, we can ask it for new predictions.

;; Here is the prediction for $x=3$:
;; Once we have a linear model, we can generate new predictions.
;; For instance, let's predict $y$ when $x=3$:

(simple-linear-data-model [3])
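
;; The model behaves like an ordinary Clojure function of a feature
;; row, so mapping it over several rows gives a batch of predictions.
;; A small usage sketch:

(mapv simple-linear-data-model
      [[0] [3] [10]])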

;; ### Displaying the regression line

;; We can use Tableplot to visualize the regression model
;; as and additional layer.
;; Behind the scenes, it uses Fastmath `lm` by default, so
;; the result should be compatible.
;; We can visualize the fitted line by adding a smooth layer to our scatter plot.
;; Tableplot makes this convenient:

(-> simple-linear-data
plotly/layer-point
plotly/layer-smooth)
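
;; Tableplot infers the column mappings from the dataset here.
;; The same plot with the mappings spelled out would look as follows
;; (a sketch, assuming Tableplot's `:=x`/`:=y` substitution keys):

(-> simple-linear-data
    (plotly/layer-point {:=x :x, :=y :y})
    (plotly/layer-smooth {:=x :x, :=y :y}))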

;; We could also generate this explicitly ourselves:
;; Alternatively, we can build the regression line explicitly.
;; We'll obtain predictions and then plot them:

(-> simple-linear-data
(tc/map-columns :prediction
@@ -176,7 +167,7 @@

;; ## Multiple linear regression

;; Linear dependency on a few variables can be estimated the same way.
;; We can easily extend these ideas to multiple linear predictors.

(def multiple-linear-data
(let [rng (rand/rng 1234)
@@ -202,12 +193,10 @@

(summary multiple-linear-data-model)
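
;; With two predictors, the fitted surface is a plane:
;; $$y = a_1 x_1 + a_2 x_2 + b.$$
;; Prediction works exactly as before, with one value per feature in
;; the row. A sketch, assuming the dataset has two feature columns:

(multiple-linear-data-model [3 4])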

;; Visualization is different, of course.
;; Here, we use a combination of Tableplot's Plotly API
;; and Plotly's own specification.
;; We are planning to improve Tableplot so that
;; such details will be handled more directly
;; as layers in its API. 🛠
;; Visualizing multiple dimensions is more involved. Below is a 3D
;; scatter plot using Tableplot's Plotly API, along with an explicit
;; Plotly surface spec for the regression plane. We expect to simplify this
;; process in the future, so these details become part of higher-level layers. 🛠

(-> multiple-linear-data
(plotly/layer-point {:=coordinates :3d
@@ -229,12 +218,11 @@

;; ## Coming soon: Regularization 🛠


;; ## Example: Predicting Bicycle Traffic

;; Following the Python Data Science Handbook, we will look into predicting
;; the daily number of bicycle trips across Fremont Bridge in Seattle.
;; We can use the information of weather, season, day of week, etc.
;; As in the Python Data Science Handbook, we'll try predicting the daily number
;; of bicycle trips across the Fremont Bridge in Seattle. The features will include
;; weather, season, day of week, and related factors.

;; ### Reading and parsing data

@@ -262,16 +250,15 @@ weather

;; ### Preprocessing

;; Our bike counts data are hourly, but the weather data is daily.
;; To join them, we will need to convert the bike hourly counts to daily counts.
;; The bike counts come as hourly data, but our weather information is daily.
;; We'll need to aggregate the hourly counts into daily totals before combining the datasets.

;; In the Python book, this is done as follows in Pandas:
;; In the Python handbook, one does:
;; ```python
;; daily = counts.resample('d').sum()
;; ```

;; Tablecloth's full support for time series is still under construction.
;; For now, we will have to be a bit more verbose:
;; Since Tablecloth's time series features are still evolving, we'll be a bit more explicit:

(def daily-totals
(-> counts
@@ -281,27 +268,25 @@ weather
(tc/aggregate-columns [:total :west :east]
tcc/sum)))


daily-totals
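
;; A toy version of the same aggregation, to show the shape of the
;; pipeline. (A sketch: the real data first need the date extracted
;; from the hourly timestamps.)

(-> (tc/dataset {:date  [:d1 :d1 :d2 :d2]
                 :total [3 4 5 6]})
    (tc/group-by [:date])
    (tc/aggregate-columns [:total] tcc/sum))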

;; ### Prediction by day-of-week

;; Let us prepare the data for regression on the day of week.
;; Next, we'll explore a simple regression by day of week.

(def days-of-week
[:Mon :Tue :Wed :Thu :Fri :Sat :Sun])


;; We will convert numbers to days-of-week keywords:
;; We'll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
(comp days-of-week dec))

;; E.g.,
;; For example,
(idx->day-of-week 1)
(idx->day-of-week 7)
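
;; The `(comp days-of-week dec)` trick works because a Clojure vector
;; is itself a function of the (0-based) index. The same mapping as an
;; explicit map:

(zipmap (range 1 8) days-of-week)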

;; Now, let us prepare the data:
;; Now, let's build our dataset:

(def totals-with-day-of-week
(-> daily-totals
@@ -321,7 +306,7 @@ totals-with-day-of-week
(tc/add-column day-of-week
#(-> (:day-of-week %)
(tcc/eq day-of-week)
;; turn booleans into 0s and 1s
;; convert booleans to 0/1
(tcc/* 1)))))
totals-with-day-of-week
days-of-week)
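
;; The boolean-to-0/1 trick in isolation, as a small sketch using the
;; column API directly:

(-> (tcc/column [:Mon :Tue :Mon])
    (tcc/eq :Mon) ;; a boolean column: true, false, true
    (tcc/* 1))    ;; multiplied by 1: 1, 0, 1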
@@ -331,23 +316,22 @@ totals-with-day-of-week
(-> totals-with-one-hot-days-of-week
(tc/select-columns ds-mod/inference-column?))

;; The binary columns are collinear (sum up to 1),
;; but we will avoid the intercept.
;; This way, the interpretation of each coefficient is the expected
;; bike count for the corresponding day of week.
;; Since the binary columns sum to 1, they're collinear, and we won't use an intercept.
;; This way, each coefficient directly reflects the expected bike count
;; for that day of week.
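
;; In other words, without an intercept the model is
;; $$\hat{y} = \sum_{d} \beta_d \, [\text{day} = d],$$
;; so for a row whose day is $d$, the prediction collapses to
;; $\beta_d$: the coefficient itself is that day's expected count.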

(def days-of-week-model
(lm totals-with-one-hot-days-of-week
{:intercept? false}))

;; Here are the regression results:
;; Let's take a look at the results:

(-> days-of-week-model
println
with-out-str
kind/code)

;; We can see the difference between weekends and weekdays.
;; We can clearly see weekend versus weekday differences.

;; ### Coming soon: more predictors for the bike counts 🛠
