linear_regression_intro - wording
daslu committed Dec 29, 2024
1 parent a550ab9 commit 2d4e2d7
Showing 1 changed file with 55 additions and 71 deletions.
126 changes: 55 additions & 71 deletions notebooks/noj_book/linear_regression_intro.clj
@@ -1,11 +1,11 @@
;; # Intro to Linear Regression - DRAFT 🛠
;; # Introduction to Linear Regression - DRAFT 🛠

;; **last update:** 2024-12-29

;; Here we offer an intro to [linear regression](https://en.wikipedia.org/wiki/Linear_regression)
;; following the
;; In this tutorial, we introduce the fundamentals of [linear regression](https://en.wikipedia.org/wiki/Linear_regression),
;; guided by the
;; [In Depth: Linear Regression](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html)
;; section of the
;; chapter of the
;; [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
;; by Jake VanderPlas.

@@ -25,20 +25,19 @@

;; ## Simple Linear Regression

;; Let us start with the basic model of a straight-line fit, where points $(x,y)$ are
;; assumed to have a relationship where $y$ can be predicted as $$y=ax+b.$$
;; Here, $a$ is the slope of the line, and $b$ is the so-called intercept, the height of
;; crossing the $y$ axis.
;; We begin with the classic straight-line model: for data points $(x, y)$,
;; we assume there is a linear relationship allowing us to predict $y$ as
;; $$y = ax + b.$$
;; In this formulation, $a$ is the slope and $b$ is the intercept,
;; the point where our line would cross the $y$ axis.
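
;; As a quick sanity check of the formula: with $a=2$ and $b=-5$,
;; the input $x=3$ should give $y = 2 \cdot 3 - 5 = 1$.
;; A tiny sketch in plain Clojure:

(let [a 2
      b -5
      x 3]
  (+ (* a x) b))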

;; To explore this kind of relationship, we will use
;; Fastmath and Tablecloth to
;; generate random data where it holds for $a=2$ and $b=-5$.
;; To illustrate, we'll use Fastmath and Tablecloth to create synthetic data
;; in which the relationship is known to hold with $a=2$ and $b=-5$.

;; For each row of the dataset below, we pick $x$ as a uniform
;; random number between 0 to 10, and we compute $y$ as
;; $ax+b$ plus additional standard Gaussian random error
;; (Normally distributed with mean 0 and variance 1),
;; independently of all other rows.
;; For each row in the dataset below, we draw $x$ uniformly from 0 to 10
;; and compute $y = ax + b$ plus an extra random noise term
;; (drawn from a standard Gaussian distribution).
;; This noise is added independently for every row.

(def simple-linear-data
(let [rng (rand/rng 1234)
@@ -56,20 +55,20 @@

simple-linear-data
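
;; To make the recipe above concrete, here is a minimal, self-contained
;; sketch of the same construction in plain Clojure, using the
;; Box-Muller transform for the Gaussian noise. (The notebook itself
;; uses Fastmath's seeded RNG and a Tablecloth dataset instead.)

(defn rand-gaussian
  "A standard normal sample via the Box-Muller transform."
  []
  (* (Math/sqrt (* -2.0 (Math/log (- 1.0 (rand)))))
     (Math/cos (* 2.0 Math/PI (rand)))))

(def toy-xy-pairs
  (for [_ (range 5)]
    (let [x (* 10 (rand))
          y (+ (* 2 x) -5 (rand-gaussian))]
      {:x x, :y y})))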

;; Let us plot the data using Tableplot's Plotly API.
;; Let's plot these points using Tableplot's Plotly API.

(-> simple-linear-data
plotly/layer-point)

;; ### Regression using Fastmath

;; To estimate a linear relationship, we may use the Fastmath library.
;; We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
(reg/lm
;; ys - a "column" sequence of `y` values:
(simple-linear-data :y)
;; xss - a sequence of "rows", each containing `x` values
;; xss - a sequence of "rows", each containing `x` values:
;; (one `x` per row, in our case):
(-> simple-linear-data
(tc/select-columns [:x])
@@ -81,39 +80,33 @@ simple-linear-data

simple-linear-data-model

;; When the fit model is printed, we get a nice tabular summary.
;; Let us capture the printed value and display it here.
;; Using [Kindly](https://scicloj.github.io/kindly/),
;; we mark it as `kind/code` for convenient rendering.
;; Printing the model gives a tabular summary.
;; We'll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
(with-out-str
(println
simple-linear-data-model)))

;; We can see the regression could estimate the intercept $b$ and
;; the slope $a$ (the coefficient of $x$).
;; As you can see, the estimated coefficients match our intercept $b$
;; and slope $a$ (the coefficient of $x$).
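
;; Since the fitted model can be applied as a function to a feature row
;; (as we will also do below), we can recover both estimates directly
;; from predictions: the prediction at $x=0$ is the fitted intercept,
;; and the difference between the predictions at $x=1$ and $x=0$ is the
;; fitted slope. A sketch (the exact numbers vary with the noise):

(let [f simple-linear-data-model
      b-hat (f [0])
      a-hat (- (f [1]) b-hat)]
  {:slope a-hat
   :intercept b-hat})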

;; ### Dataset ergonomics

;; Let us write a couple of convenience functions
;; that will help us use regression with datasets and view
;; the summary with Kindly.
;; We are working on similar functionality in the
;; [Tablemath](https://scicloj.github.io/tablemath) library,
;; which is still in experimental stage and is thus not included
;; in Noj yet.
;; Below are a couple of helper functions that simplify how we use regression with datasets
;; and display model summaries. We have similar ideas under development in the
;; [Tablemath](https://scicloj.github.io/tablemath) library, but it is still in
;; an experimental stage and not part of Noj yet.
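
;; The key building block is the dataset's notion of an inference
;; target: one column is marked as the target, and the mark is later
;; recognized through column metadata. A small sketch of the idea,
;; assuming `ds-mod` is `tech.v3.dataset.modelling` and that its
;; `set-inference-target` is the marking counterpart of the
;; `ds-mod/inference-column?` predicate used later in this tutorial:

(-> (tc/dataset {:x [1 2 3]
                 :y [2 4 6]})
    (ds-mod/set-inference-target :y)
    (tc/select-columns ds-mod/inference-column?))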

(defn lm
"Compute a linear regression model for `dataset`.
The first column marked as a target is used as the target;
all the remaining, unmarked columns are the features.
The resulting model is of type `fastmath.ml.regression.LMData`,
a generated by [Fastmath](https://github.com/generateme/fastmath).
created via [Fastmath](https://github.com/generateme/fastmath).
See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
for `options`."

([dataset]
(lm dataset nil))
([dataset options]
@@ -148,24 +141,22 @@ simple-linear-data-model

;; ### Prediction

;; Given a linear model, we can ask it for new predictions.

;; Here is the prediction for $x=3$:
;; Once we have a linear model, we can generate new predictions.
;; For instance, let's predict $y$ when $x=3$:

(simple-linear-data-model [3])
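
;; The model behaves like an ordinary Clojure function of a feature
;; row, so mapping it over several rows gives a batch of predictions.
;; A small usage sketch:

(mapv simple-linear-data-model
      [[0] [3] [10]])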

;; ### Displaying the regression line

;; We can use Tableplot to visualize the regression model
;; as and additional layer.
;; Behind the scenes, it uses Fastmath `lm` by default, so
;; the result should be compatible.
;; We can visualize the fitted line by adding a smooth layer to our scatter plot.
;; Tableplot makes this convenient:

(-> simple-linear-data
plotly/layer-point
plotly/layer-smooth)
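
;; Tableplot infers the column mappings from the dataset here.
;; The same plot with the mappings spelled out would look as follows
;; (a sketch, assuming Tableplot's `:=x`/`:=y` substitution keys):

(-> simple-linear-data
    (plotly/layer-point {:=x :x, :=y :y})
    (plotly/layer-smooth {:=x :x, :=y :y}))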

;; We could also generate this explicitly ourselves:
;; Alternatively, we can build the regression line explicitly.
;; We'll obtain predictions and then plot them:

(-> simple-linear-data
(tc/map-columns :prediction
@@ -176,7 +167,7 @@

;; ## Multiple linear regression

;; Linear dependency on a few variables can be estimated the same way.
;; We can easily extend these ideas to multiple linear predictors.

(def multiple-linear-data
(let [rng (rand/rng 1234)
@@ -202,12 +193,10 @@

(summary multiple-linear-data-model)
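
;; With two predictors, the fitted surface is a plane:
;; $$y = a_1 x_1 + a_2 x_2 + b.$$
;; Prediction works exactly as before, with one value per feature in
;; the row. A sketch, assuming the dataset has two feature columns:

(multiple-linear-data-model [3 4])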

;; Visualization is different, of course.
;; Here, we use a combination of Tableplot's Plotly API
;; and Plotly's own specification.
;; We are planning to improve Tableplot so that
;; such details will be handled more directly
;; as layers in its API. 🛠
;; Visualizing multiple dimensions is more involved. Below is a 3D
;; scatter plot using Tableplot's Plotly API, along with an explicit
;; Plotly surface spec for the regression plane. We expect to simplify this
;; process in the future, so these details become part of higher-level layers. 🛠

(-> multiple-linear-data
(plotly/layer-point {:=coordinates :3d
@@ -229,12 +218,11 @@

;; ## Coming soon: Regularization 🛠


;; ## Example: Predicting Bicycle Traffic

;; Following the Python Data Science Handbook, we will look into predicting
;; the daily number of bicycle trips across Fremont Bridge in Seattle.
;; We can use the information of weather, season, day of week, etc.
;; As in the Python Data Science Handbook, we'll try predicting the daily number
;; of bicycle trips across the Fremont Bridge in Seattle. The features will include
;; weather, season, day of week, and related factors.

;; ### Reading and parsing data

@@ -262,16 +250,15 @@ weather

;; ### Preprocessing

;; Our bike counts data are hourly, but the weather data is daily.
;; To join them, we will need to convert the bike hourly counts to daily counts.
;; The bike counts come as hourly data, but our weather information is daily.
;; We'll need to aggregate the hourly counts into daily totals before combining the datasets.

;; In the Python book, this is done as follows in Pandas:
;; In the Python handbook, one does:
;; ```python
;; daily = counts.resample('d').sum()
;; ```

;; Tablecloth's full support for time series is still under construction.
;; For now, we will have to be a bit more verbose:
;; Since Tablecloth's time series features are still evolving, we'll be a bit more explicit:

(def daily-totals
(-> counts
@@ -281,27 +268,25 @@ weather
(tc/aggregate-columns [:total :west :east]
tcc/sum)))


daily-totals
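
;; A toy version of the same aggregation, to show the shape of the
;; pipeline. (A sketch: the real data first need the date extracted
;; from the hourly timestamps.)

(-> (tc/dataset {:date  [:d1 :d1 :d2 :d2]
                 :total [3 4 5 6]})
    (tc/group-by [:date])
    (tc/aggregate-columns [:total] tcc/sum))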

;; ### Prediction by day-of-week

;; Let us prepare the data for regression on the day of week.
;; Next, we'll explore a simple regression by day of week.

(def days-of-week
[:Mon :Tue :Wed :Thu :Fri :Sat :Sun])


;; We will convert numbers to days-of-week keywords:
;; We'll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
(comp days-of-week dec))

;; E.g.,
;; For example,
(idx->day-of-week 1)
(idx->day-of-week 7)
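
;; The `(comp days-of-week dec)` trick works because a Clojure vector
;; is itself a function of the (0-based) index. The same mapping as an
;; explicit map:

(zipmap (range 1 8) days-of-week)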

;; Now, let us prepare the data:
;; Now, let's build our dataset:

(def totals-with-day-of-week
(-> daily-totals
@@ -321,7 +306,7 @@ totals-with-day-of-week
(tc/add-column day-of-week
#(-> (:day-of-week %)
(tcc/eq day-of-week)
;; turn booleans into 0s and 1s
;; convert booleans to 0/1
(tcc/* 1)))))
totals-with-day-of-week
days-of-week)
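
;; The boolean-to-0/1 trick in isolation, as a small sketch using the
;; column API directly:

(-> (tcc/column [:Mon :Tue :Mon])
    (tcc/eq :Mon) ;; a boolean column: true, false, true
    (tcc/* 1))    ;; multiplied by 1: 1, 0, 1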
@@ -331,23 +316,22 @@ totals-with-day-of-week
(-> totals-with-one-hot-days-of-week
(tc/select-columns ds-mod/inference-column?))

;; The binary columns are collinear (sum up to 1),
;; but we will avoid the intercept.
;; This way, the interpretation of each coefficient is the expected
;; bike count for the corresponding day of week.
;; Since the binary columns sum to 1, they're collinear, and we won't use an intercept.
;; This way, each coefficient directly reflects the expected bike count
;; for that day of week.
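
;; In other words, without an intercept the model is
;; $$\hat{y} = \sum_{d} \beta_d \, [\text{day} = d],$$
;; so for a row whose day is $d$, the prediction collapses to
;; $\beta_d$: the coefficient itself is that day's expected count.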

(def days-of-week-model
(lm totals-with-one-hot-days-of-week
{:intercept? false}))

;; Here are the regression results:
;; Let's take a look at the results:

(-> days-of-week-model
println
with-out-str
kind/code)

;; We can see the difference between weekends and weekdays.
;; We can clearly see weekend versus weekday differences.

;; ### Coming soon: more predictors for the bike counts 🛠
