-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathml_more.qmd
423 lines (245 loc) · 41 KB
/
ml_more.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
# Extending Machine Learning {#sec-ml-more}
![](img/chapter_gp_plots/gp_plot_8.svg){width=75%}
We've explored some fundamental aspects of machine learning (ML) for typical data settings and modeling objectives, but there are many other areas of machine learning that we haven't covered, and honestly, you just can't cover everything in a single book. The field is always evolving, progressing, and branching out, and covers every data domain, which is what makes it so fun! Here we'll briefly discuss some of the other aspects of ML that you'll want to be aware of as you continue your journey.
## Key Ideas {#sec-ml-more-key}
As we wrap up our focus on machine learning, here are some things to keep in mind:
- ML can be applied to virtually any modeling or data domain.
- Other widely used areas and applications of ML include unsupervised learning, reinforcement learning, computer vision, natural language processing, and more generally, artificial intelligence.
- While tabular data has traditionally been the primary format for modeling, the landscape has changed dramatically, and you may need to incorporate other data to reach your modeling goals.
### Why this matters {#sec-ml-more-why}
It's very important to know just how *unlimited* the modeling universe is, but also to recognize the common thread that connects all models. Even when we get into other data situations and complex models, we can always fall back on the core approaches we've already seen and know well at this point, and know that those ideas can potentially be applied in any modeling situation.
### Helpful context {#sec-ml-more-good}
For the content in this chapter, a basic idea of modeling and machine learning would probably be enough. We're not going to get too technical in this section.
## Unsupervised Learning {#sec-ml-more-unsuper}
All the models considered thus far would fall under the name of **supervised learning**. That is, we have a target variable that we are trying to predict with various features, and we use the data to train a model to predict it. However, there are settings in which we do not have a target variable, or we do not have a target variable for all of the data. In these cases, we can still use what's often referred to as **unsupervised learning** to learn about the data.
Unsupervised learning is a type of machine learning that involves training a model without an explicit target variable in the sense that we've seen. But to be clear, a model and target is still definitely there! Unsupervised learning attempts learn patterns in the data in a general sense, and can be used in a wide range of applications, including cluster analysis, anomaly detection, and dimensionality reduction. Although these may initially seem as fundamentally different modeling approaches, just like much of what we've seen, it's probably best to think of these as different flavors of a more general approach.
Traditionally, one of the more common applications of unsupervised learning falls under the heading of **dimension reduction**, or **data compression**. Here we reduce our feature set to a smaller **latent**, or hidden, or unobserved, subset that accounts for most of the (co-)variance of the larger set. Alternatively, we may reduce the rows to a small number of hidden, or unobserved, clusters. For example, we start with 100 features and reduce them to 10 features that still account for most of what's important in the original set, or we classify each observation as belonging to 2-3 clusters. Either way, the primary goal is to reduce the dimensionality of the data, not predict an explicit target.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-cluster-scatter
#| fig.cap: Two Variables with Three Overlapping Clusters
# Generate random data for three clusters
set.seed(123)
n = 900
x = rnorm(n, mean = c(0, 0, 3), sd = 1)
y = rnorm(n, mean = c(0, 5, 2), sd = 2)
cluster = rep(1:3, n/3)
# Create a data frame
df = tibble(x = x, y = y, cluster = factor(paste0('Cluster ', cluster)))
# Create the scatter plot
ggplot(df) +
geom_point(aes(x = x, y = y, color = cluster), size = 3) +
labs(
title = "",
x = "X1",
y = "X2",
color = "Cluster"
) +
see::scale_color_okabeito() +
theme(
legend.text = element_text(size = 20),
)
ggsave(
"img/cluster_scatter.png",
width = 8,
height = 6
)
```
:::
:::{.content-visible when-format='pdf'}
![Two Variables with Three Overlapping Clusters](img/cluster_scatter.png){#fig-cluster-scatter width=75%}
:::
Classical methods in this domain include **principal components analysis** (PCA), **singular value decomposition** (SVD), **factor analysis**, and **Latent Dirichlet Allocation**, which are geared toward reducing column dimensions. Also included are clustering methods such as **k-means** and **hierarchical clustering**, where we reduce observations into clusters or groups. Sometimes these methods are often used as preprocessing steps for supervised learning problems, or as a part of exploratory data analysis, but often they are an end in themselves.
Most of us are familiar with **recommender systems**, e.g., with Netflix or Amazon recommendations, which suggest products or movies, and we're all now becoming extremely familiar with text analysis methods through chatbots and similar. While the underlying models are notably more complex these days, they actually just started off as SVD (recommender systems) or a form of factor analysis (text analysis via latent semantic analysis/latent dirichlet allocation). Having a conceptual understanding of the simpler methods can aid in understanding the more complex ones.
::: {.callout-note title='Dimension Reduction in Preprocessing' collapse='true'}
You probably should not use a dimension reduction technique as a preprocessing step for a supervised learning problem. Instead, use a modeling approach that can handle high-dimensional data, has a built-in way to reduce features (e.g., lasso, boosting, dropout), or use a dimension reduction technique that is specifically designed for supervised learning (e.g., partial least squares). Creating a reduced set of features, but which are created without any connection to the target, will generally be suboptimal for a supervised learning problem.
:::
### Connections {#sec-ml-more-connections}
#### Clusters are latent categorical features {#sec-ml-more-clusters-latent}
In both clustering rows and reducing columns, we're essentially reducing the dimension of the features. For methods like PCA and factor analysis, we're explicitly reducing the number of data columns to a smaller set of numeric features. For example, we might take answers to responses to dozens of questions from a personality inventory, and reduce them to [five key features](https://en.wikipedia.org/wiki/Big_Five_personality_traits) that represent general aspects of personality. These new features are on their own scale, often standardized, but they still reflect at least some of the original items' variability [^componentvar].
[^componentvar]: Ideally we'd capture all the variability, but that's not the end result, and some techniques or results may only capture a relatively small percentage. In our personality example, this could be because the questions don't adequately capture the underlying personality constructs (i.e., an issue of the reliability of instrument), or because personality is just not that simple and we'd need more dimensions.
Now, imagine if we reduced the features to a single categorical variable, say, with two or three groups. Now you have cluster analysis! You can discretize any continuous feature to a coarser set of categories, and this goes for latent variables as well as those we actually observe in our data. For example, if we do a factor analysis with one latent feature, we could either convert it to a probability of some class with an appropriate transformation, or just say that scores higher than some cutoff are in cluster A and the others are in cluster B. Indeed, there is a whole class of clustering models called **mixture models** that do just that - they estimate the latent probability of class membership. Many of these approaches are conceptually similar or even identical to the continuous method counterparts, and the primary difference is how we think about and interpret the results.
```{r}
#| echo: false
#| eval: false
#| label: save-pca-as-net
# as usual, graphviz is a complete cluster to work with, and impossible to order. It literally changes the order in front of my eyes while typing this comment that has nothing to do with the graph. And whatever you see will not necessarily be what's rendered
set.seed(42)
g = DiagrammeR::grViz('img/pca_as_net.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/pca_as_net.svg", width = 800, height = 800)
g = DiagrammeR::grViz('img/autoencoder.dot')
# NOTE THAT DIAGRAMMER MAY NOT DISPLAY EXACTLY AS PREVIEWED
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/autoencoder.svg")
```
#### PCA as a neural network {#sec-ml-more-pca-as-net}
Consider the following neural network, called an **autoencoder**. Its job is to shrink the features down to a smaller, simpler representation, and then rebuild the feature set from the compressed state, resulting in an output that matches the original as closely as possible. It's trained by minimizing the error between the original data and the reconstructed data. The autoencoder is a special case of a neural network used as a component of many larger architectures such as those seen with large language models, but can be used for dimension reduction in and of itself if we are specifically interested in the compression layer, sometimes called a **bottleneck**.
![PCA or Autoencoder](img/pca_as_net.svg){#fig-pca-auto}
Consider the following setup for such a situation:
- Single hidden layer
- Number of hidden nodes = number of inputs
- Linear activation function
An autoencoder in this case would be equivalent to principal components analysis. In the approach described, PCA perfectly reconstructs the original data when considering all components, and so the error would be zero. But that doesn't give us any dimension reduction, as we have as many nodes in the compression layer as we did inputs. So with PCA, we often only focus on a small number of components that capture the data variance by some arbitrary amount. The discarded nodes are actually still estimated though.
Neural networks are not bound to linear activation functions, the size of the inputs, or even a single layer. As such, they provide a much more flexible approach that can compress the data at a certain layer, but still have very good reconstruction error. Typical autoencoders would have multiple layers with notably more nodes than inputs, at least for some layers. They may ultimately compress to a bottleneck layer consisting of a fewer set of nodes, before expanding out again. An autoencoder is not as easily interpretable as typical factor analytic techniques, and we still have to sort out the architecture. However, it's a good example of how the same underlying approach can be used for different purposes.
<!-- due to diagrammer issues in preview vs rendered, we just created a png also, note which one is ultimately used -->
![Conceptual Diagram of an Autoencoder](img/autoencoder.png){#fig-autoencoder width=75%}
<!-- The only real difference between the two approaches is the objective (function), and for any k latent features I come up with I can create (at least) k + 1 clusters before taking into account interactions of the latent variables.
https://stats.stackexchange.com/questions/122213/latent-class-analysis-vs-cluster-analysis-differences-in-inferences -->
:::{.callout-note title='Autoencoders and LLMs' collapse='true'}
Encoder-decoder models, which are the basis for LLMs, can be seen as a type of autoencoder, and are used in many applications, including machine translation, image captioning, and more. Autoencoders suggest the same inputs and outputs, but a similar type of architecture, or even part of it, as in 'decoder-only' approaches of many of the current popular LLMs, might be used to classify or generate text, as with large language models.
:::
#### Latent linear models {#sec-ml-more-latent-linear}
Some dimension reduction techniques can be thought of as *latent linear models*. The following depicts factor analysis as a latent linear model. The 'targets' are the observed features, and we predict each one by some linear combination of latent variables.
$$
\begin{aligned}
x_1 &= \beta_{11} h_1 + \beta_{12} h_2 + \beta_{13} h_3 + \beta_{14} h_4 + \epsilon_1 \\
x_2 &= \beta_{21} h_1 + \beta_{22} h_2 + \beta_{23} h_3 + \beta_{24} h_4 + \epsilon_2 \\
x_3 &= \beta_{31} h_1 + \beta_{32} h_2 + \beta_{33} h_3 + \beta_{34} h_4 + \epsilon_3 \\
\end{aligned}
$$
In this scenario, the $h$ are estimated latent variables, and $\beta$ are the coefficients, which in some contexts are called **loadings**. The $\epsilon$ are the residuals, which are assumed to be independent and normally distributed as with a standard linear model. The $\beta$ are usually estimated by maximum likelihood. The latent variables are not observed, but are to be estimated as part of the modeling process, and typically standardized with mean 0 and standard deviation of 1[^factorest]. The number of latent variables we use is a hyperparameter in the ML sense, and so can be determined by the usual means[^nocvforFA]. To tie some more common models together:
[^factorest]: They can also be derived in post-processing depending on the estimation approach.
[^nocvforFA]: Actually, in application as typically seen in social sciences, cross-validation is very rarely employed, and the number of latent variables is determined by some combination of theory, model comparison for training data only, or trial and error. Not that we're advocating for that, but it's a common practice.
- Factor analysis is the more general approach with varying residual variance. In a multivariate sense, we can write the model with $\mathbf{X}$ is the data matrix, $\mathbf{W}$ is the loading matrix (weights), $\mathbf{Z}$ is the latent variable matrix, and $\Psi$ is the covariance.
$$ \mathbf{X} = \mathrm{N}(\mathbf{Z} \mathbf{W}, \Psi) $$
- Probabilistic PCA is a factor analysis with $\Psi = \sigma^2 \mathbf{I}$, where $\sigma^2$ is the (constant across X) residual variance.
- PCA is a factor analysis with no (residual) variance, and the latent variables are orthogonal (independent).
- Independent component analysis is a factor analysis that does not assume an underlying gaussian data generating process.
- Non-negative matrix factorization and latent dirichlet allocation are factor analyses applied to counts (think poisson and multinomial regression).
In other words, many traditional dimension reduction techniques can be formulated in the context of a linear model.
### Other unsupervised learning techniques {#sec-ml-more-other-unsupervised}
There are several techniques that are used to visualize high-dimensional data in simpler ways, such as multidimensional scaling, t-SNE, and (H)DBSCAN. These are often used as a part of exploratory data analysis to identify groups.
**Cluster analysis** is a method with a long history and many different approaches, including hierarchical clustering algorithms (agglomerative, divisive), k-means, and more. Distance matrices are often the first step for these clustering approaches, and there are many ways to calculate distances between observations. With the distances we can group observations with small distances and separate those with large distances. Conversely, some methods use adjacency matrices, which focus on similarity of observations rather than differences (like correlations), and can be used for graph-based approaches to find hidden clusters (see network analysis).
**Anomaly/outlier detection** is an approach for finding 'unusual' data points, or otherwise small, atypical clusters. This is often done by looking for data points that are far from the rest of the data, or that are not well explained by the model. This approach is often used for situations like fraud detection or network intrusion detection. For example, standard clustering (small anomalous groups) or modeling techniques (observations with high residuals) might be used to identify outliers.
```{python}
#| echo: false
#| eval: false
#| label: python-network-graph
import matplotlib as mpl
import matplotlib.pyplot as plt
import networkx as nx
seed = 13648 # Seed random number generators for reproducibility
G = nx.random_k_out_graph(25, 3, 0.5, self_loops=False, seed=seed)
pos = nx.spring_layout(G, seed=seed)
node_sizes = [3 + 10 * i for i in range(len(G))]
M = G.number_of_edges()
edge_colors = range(2, M + 2)
edge_alphas = [(5 + i) / (M + 4) for i in range(M)]
cmap = plt.cm.Oranges
nodes = nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color="#A0CBE2")
edges = nx.draw_networkx_edges(
G,
pos,
node_size=node_sizes,
arrowstyle="-",
arrowsize=10,
edge_color=edge_colors,
edge_cmap=cmap,
width=2,
)
nx.draw(G, pos,node_size=node_sizes,
arrowstyle="-",
arrowsize=10,
edge_color=edge_colors,
edge_cmap=cmap,
width=2,)
plt.show()
# set alpha value for each edge
# for i in range(M):
# edges[i].set_alpha(edge_alphas[i])
# pc = mpl.collections.PatchCollection(edges, cmap=cmap)
# pc.set_array(edge_colors)
# ax = plt.gca()
# ax.set_axis_off()
# plt.colorbar(pc, ax=ax)
# plt.show()
# Save the graph as an image file
# plt.savefig('img/network_graph.png')
```
![Network Graph](img/network_us.png){#fig-network-graph width=75%}
**Network analysis** is a type of unsupervised learning that involves analyzing the relationships between entities. It is a graph-based approach that involves identifying nodes (e.g., people) and edges (e.g., do they know each other?) in a network. It is used in a wide range of applications, like identifying communities within a network, or to see how they evolve over time. It is also used to identify relationships between entities, such as people, products, or documents. One might be interested in such things as which nodes that have the most connections, or the general 'connectedness' of a network. Network analysis or similar graphical models typically have their own clustering techniques that are based on the edge (connection) weights between individuals, such as modularity, or the number of edges between individuals, such as k-clique.
In summary, there are many methods that fall under the umbrella of unsupervised learning, but even when you don't think you have an explicit target variable, you can still understand or frame these as models in familiar ways. It's important to not get hung up on trying to distinguish modeling approaches with somewhat arbitrary labels, and focus more on what their modeling goal is and how best to achieve it!
:::{.callout-note title='Generative vs. Discriminative Models' collapse='true'}
Many unsupervised learning and many deep learning techniques involved in computer vision and natural language processing are often thought of as **generative** models. These attempt to model the underlying data generating process, i.e., the features, but possibly a target variable also. In contrast, most supervised learning models are often thought of as **discriminative** models, that try to model the conditional distribution of the target given the features only.
These labels are a bit problematic though. Any probabilistic model can be used to generate data, even if it is only for the target, so calling a model generative isn't all that clarifying. And models that might be thought of as discriminative in a machine learning context might not be in others (e.g., [Bayesian](https://stats.stackexchange.com/questions/7455/the-connection-between-bayesian-statistics-and-generative-modeling)).
:::
## Reinforcement Learning {#sec-ml-more-reinforcement}
![Reinforcement Learning](img/ml-rl_basic.svg){width=75%}
**Reinforcement learning** (RL) is a type of modeling approach that involves training an **agent** to make decisions in an **environment**. The agent receives feedback in the form of rewards or punishments for its actions, and the goal is to maximize its rewards over time by learning which actions lead to positive or negative outcomes. Typical data involves a sequence of states, actions, and rewards, and the agent learns a policy that maps states to actions. The agent learns by interacting with the environment, and the environment changes based on the agent's actions.
The agent's goal is to learn a **policy**, which is a set of rules that dictate which actions to take in different situations. The agent learns by trial and error, adjusting its policy based on the feedback it receives from the environment. The classic example is a game like chess or a simple video game. In these scenarios, the agent learns which moves (actions) lead to winning the game (positive reward) and which moves lead to losing the game (negative reward). Over time, the agent improves its policy to make better moves that increase its chances of winning.
One of the key challenges in reinforcement learning is balancing **exploration and exploitation**. Exploration is about trying new actions that could lead to higher rewards, while exploitation is about sticking to the actions that have already been found to give good rewards.
Reinforcement learning has many applications, including robotics, games, and autonomous driving, but there is little restriction on where it might be applied. It is also often a key part of some deep learning models, where reinforcement is supplied via human feedback or other means to an otherwise automatic modeling process.
## Working with Specialized Data Types {#sec-ml-more-non-tabular}
While our focus in this book is on tabular data due to its ubiquity, there are many other types of data used for machine learning and modeling in general. This data often starts in a special format or must be considered uniquely. You'll often hear this labeled as 'unstructured', but that's probably not the best conceptual way to think about it, as the data is still structured in some way, sometimes in a strict format (e.g., images). Here we'll briefly discuss some of the other types of data you'll potentially come across.
### Spatial {#sec-ml-more-spatial}
:::{.content-visible when-format='html'}
![Spatial Data (code available from [Kyle Walker](https://walker-data.com/mapboxapi/articles/creating-tiles.html))](img/ml-spatial-tippecanoe-example.gif){width=75%}
:::
:::{.content-visible when-format='pdf'}
![Spatial Data (code available from [Kyle Walker](https://github.com/walkerke/mb-immigrants))](img/ml-spatial-kw.png){width=75%}
:::
Spatial data, which includes geographic and similar information, can be quite complex. It often comes in specific formats (e.g., shapefiles), and may require specialized tools to work with it. Spatial specific features may include continuous variables like latitude and longitude, or tracking data from a device like a smartwatch. Other spatial features are more discrete, such as states or political regions within a country.
We could use these spatial features as we would others in the tabular setting, but we often want to take into account the uniqueness of a particular region, or the correlation of spatial regions. Historically, most spatial data can be incorporated into approaches like mixed models or generalized additive models, but in certain applications, such as satellite imagery, deep learning models are more the norm, and the models often transition into image processing techniques.
### Audio {#sec-ml-more-audio}
![Sound wave](img/wiki-colour_soundwave.svg)
Audio data is a type of time series data that is also the focus for many modeling applications. Think of the sound of someone speaking or music playing, as it changes over time. Such data is often represented as a waveform, which is a plot of the amplitude of the sound wave over time.
The goal of modeling audio data may include speech recognition, language translation, music generation, and more. Like spatial data, audio data is typically stored in specific formats and can be quite large by default. Also like spatial data, the specific type of data and research question may allow for a tabular format. In that case, the modeling approaches used are similar to those for other time series data.
Deep learning methods have proven very effective for analyzing audio data, and can even create songs people actually like, [even recently helping the Beatles to release one more song](https://en.wikipedia.org/wiki/Now_and_Then_(Beatles_song)). Nowadays, you can generate an entire song in any genre you want, just by typing a text prompt!
### Computer vision {#sec-ml-more-cv}
<!-- https://alexlenail.me/NN-SVG/LeNet.html -->
<!-- ![Convolutional Neural Network](img/ml-resnet_architecture.png){#fig-cnn width=66%} -->
![Convolutional Neural Network [LeNet](https://alexlenail.me/NN-SVG/LeNet.html)](img/cnn.png){#fig-cnn width=66%}
Computer vision involves a range of models and techniques for analyzing and interpreting image-based data. It includes tasks like image classification (labeling an image), object detection (finding the location of objects in an image), image segmentation (identifying the boundaries of objects in an image), and object tracking (following objects as they move over time).
Typically, your raw data is an image, which is represented as a matrix of pixel values. For example, each row of the matrix could be a grayscale value for a pixel, or it could be a 3-dimensional array of Red, Green, and Blue (RGB) values for each pixel. The modeling goal is to extract features from the image data that can be used for the task at hand. For example, you might extract features that relate to color, texture, and shape. You can then use these features to train a model to classify images or whatever your task may be.
Image processing is a broad field with many applications. It is used in medical imaging, satellite imagery, self-driving cars, and more. And while it can be really fun to classify objects such as cats and dogs, or generate images from text and vice versa, it can be challenging due to the size of the data, issues specific to video/image quality, and the model complexity. Even if your base data is often the same or very similar across tasks, the model architecture and training process can vary widely depending on the task at hand.
These days we generally don't have to start from scratch though, as there are pretrained models that can be used for image processing tasks, which you can then fine-tune for your specific task. These models are often based on **convolutional neural networks** (CNNs), which are a type of deep learning model. CNNs are designed to take advantage of the spatial structure of images, and they use a series of convolutional layers to extract features from the image. These features are then passed through a series of fully connected layers to make a prediction. CNNs have been used to achieve state-of-the-art results on a wide range of image processing tasks, and are the standard for many image processing applications. More recently, diffusion models, which seek to reconstruct images after successively adding noise to the initial input, have been shown to be quite effective for a wide range of tasks involving image generation.
### Natural language processing {#sec-ml-more-nlp}
![Partial GPT4 output from a prompt: Write a very brief short story about using models in data science. It should reflect the style of Donald Barthelme.](img/gpt4_barth_data.png){width=100%}
One of the hottest areas of modeling development in recent times regards **natural language processing**, as evidenced by the runaway success of models like [ChatGPT](https://chat.openai.com/). Natural language processing (NLP) is a field of study that focuses on understanding human language, and along with computer vision, is a very visible subfield of artificial intelligence. NLP is used in a wide range of applications, including language translation, speech recognition, text classification, and more. NLP is behind some of the most exciting modeling applications today, with tools that continue to amaze with their capabilities to generate summaries of articles, answering questions, write code, and even [pass the bar exam with flying colors](https://www.abajournal.com/web/article/latest-version-of-chatgpt-aces-the-bar-exam-with-score-in-90th-percentile)!
Early efforts in this field were based on statistical models, and then variations on things like PCA, but it took a lot of [data pre-processing work](https://m-clark.github.io/text-analysis-with-R/intro.html) to get much from those approaches, and results could still be unsatisfactory. More recently, deep learning models became the standard application, and there is no looking back in that regard due to their success. Current state of the art models have been trained on massive amounts of data, [even much of the internet](https://commoncrawl.org/), and require a tremendous amount of computing power. Thankfully, you don't have to train such a model yourself to take advantage of the results. Now you can simply use a pretrained model like GPT-4 for your own tasks. In some cases, much of the trouble comes with just generating the best prompt to produce the desired results. However, the field and the models are evolving very rapidly, and, for those who don't have the resources of Google, Meta, or OpenAI, things are getting easier to implement all the time. In the meantime, feel free to just [play around with ChatGPT yourself](https://chat.openai.com/).
## Pretrained Models & Transfer Learning {#sec-ml-more-pretrained}
Pretrained models are those that have been trained on a large amount of data, and can be used for a wide range of tasks, even on data they were not trained on. They are widely employed in image and natural language processing. The basic idea is that if you can use a model that was trained on the entire internet of text, why start from scratch? Computer vision models already understand things like edges and colors, so there is little need to reinvent the wheel when you know those features would be useful for your own task. These are viable in tasks where the inputs you are using are similar to the data the model was trained on, as is the case with images and text.
You can use a pretrained model as a starting point for your own model, and then **fine-tune** it for your specific task, and this is more generally called **transfer learning**. The gist is that you only need to train part of the model on your specific data, for example the last layer or two of a deep learning model. You 'freeze' the already learned weights for most of the model, while allowing the model to relearn the weights for the last layer(s) on your data. This can save a lot of time and resources, and can be especially useful when you don't have a lot of data to train your model on.
### Self-supervised learning {#sec-ml-more-self-supervised}
Self-supervised learning is a type of machine learning technique that involves training a model on a task that can be generated from the data itself. In this setting, there is no labeled data. For example, you might train a model to predict the next word in a sentence. The idea is that the model learns to extract (represent) useful features from the data by trying to predict the missing information, which is imposed by a **mask** that hides parts of the data, and may change from sample to sample[^maskvsmissing]. We can then see how well our predictions match the targets that were masked.
[^maskvsmissing]: If your 'mask' was a truly missing value rather than self-imposed, the general idea of self-supervised learning is essentially the same as missing value imputation, but the latter doesn't sound as sexy and was already well established.
This can be a useful approach when you don't have labeled data, or just when you don't have a lot of labeled data. Once trained, the model can be used as other pre-trained models to predict other unlabeled data. Self-supervised learning is often used in natural language processing, but can be applied to other types of data as well.
## Combining Models {#sec-ml-more-combine}
It's important to note that the types of data used in ML and DL and their associated models are not mutually exclusive. For example, you might have a video that contains both audio and visual information pertinent to the task. Or you might want to produce images from text inputs. In these cases, you can use a combination of models to extract features from the data, which may just be more features in a tabular format, or be as complex as a **multimodal** deep learning architecture.
Many computer vision, audio, natural language and other modeling approaches incorporate **transformers**. They are based on the idea of **attention**, which is a mechanism that allows the model to focus on certain parts of the input sequence and less on others. Transformers are used in many state-of-the-art models with different data types, such as those that combine text and images. The transformer architecture, [although complex]((https://bbycroft.net/llm)), underpins many of today's most sophisticated models, so is worth being aware of even if it isn't your usual modeling domain.
As an example[^torch_frame], we added a transformer-based approach to process the text reviews in the movie review data set used in other chapters. We kept to the same basic data setup otherwise, and we ended up with notably better performance than the other models demonstrated, pushing toward 90% accuracy on test, even without fiddling too much with many hyper parameters. It's a good example of a case where we have standard tabular data, but need to deal with additional data structure in a different way. By combining the approaches to obtain a final output for prediction, we obtained better results than we would with a single model. This won't always be the case, but keep it in mind when you are dealing with different data sources or types.
[^torch_frame]: We use a recently developed Python module [torch_frame](https://pytorch-frame.readthedocs.io/) for this. Our approach is in a notebook available in the python chapter notebooks.
<!-- TODO: add link to notebook -->
## Artificial Intelligence {#sec-ml-more-ai}
![AI](img/ai_workshop_brain_3.jpeg){width=66%}
The prospect of combining models for computer vision, natural language processing, audio processing, and other domains can produce tools that mimic many aspects of what we call intelligence[^intel]. Current efforts in **artificial intelligence** produce models that can pass law and medical exams, create better explanations of images and text than average human effort, and produce conversation on par with humans. AI even helped to create this book!
[^intel]: It seems most discussions of AI in the public sphere haven't really defined intelligence very clearly in the first place, and the academic realm has struggled with the concept for centuries. This is why you can see people arguing about whether a model is 'intelligent', whether it 'reasons', etc. The only place it makes sense to ask these questions is with an operational definition of these terms that works for (data) science, but that doesn't mean the definition would be satisfying to most people.
In many discussions of ML and AI, [many put ML as a subset of AI](https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/artificial-intelligence-vs-machine-learning), but this is a bit off the mark from a modeling perspective in our opinion[^mlsubset]. In terms of models, practically most of what we'd call modern AI almost exclusively employs deep learning models, while the ML approach to training and evaluating models can be used for any underlying model, from simple linear models to the most complex deep learning models, and whether the application falls under the domain of AI or not. Furthermore, statistical model applications have never seriously attempted what we might call AI.
[^mlsubset]: In almost every instance of this we've seen, there isn't any actual detail or specific enough definitions to make the comparison meaningful to begin with, so don't take it too seriously.
If AI is some 'autonomous and general set of tools that attempt to engage the world in a human-like way or better', it's not clear why it'd be compared to ML in the first place. That's kind of like saying the brain is a subset of cognition. The brain does the work, much like ML does the modeling work with data, and gives rise to what we call cognition, but generally we would not compare the brain to cognition. We also wouldn't call ML a subset of climate science or medicine for similar reasons - they are domains in which it is used, much like the domain of artificial intelligence.
The main point is to not get too hung up on the labels, and focus on the modeling goal and how best to achieve it. Deep learning models, and machine learning in general, can be used for AI or non-AI settings, as we have seen for ourselves. And models used for AI still employ the *perspective* of the ML approach - the steps taken from data to model output are largely the same, as we are concerned with validation and generalization.
Many of the non-AI settings we use modeling for may well be things we can eventually rely on AI to do. At present though, the computational limits, and the amount of data that would be required for AI models to do well, or the ability of AI to be able to deal with situations in which there is *only* small bits of data to train on, are still hindrances in many places we might like to use it. However, we feel it's likely these issues will eventually be overcome. Even then, a statistical approach may still have a place when the data is small.
**Artificial general intelligence (AGI)** is the holy grail of AI, and like AI itself, is not consistently defined. Generally, the idea behind AGI is the creation of some autonomous agent that can perform any task that a human can perform, many that humans cannot, and generalize abilities to new problems that have not even been seen yet. It seems we are getting closer to AGI all the time, but it's not yet clear when it will be achieved, or even what it will look like when it is, especially since no one has an agreed upon definition of what intelligence is in the first place.
All that being said, to be perfectly honest, you may well be reading a history book. Given recent advancements just in the last couple years, it almost seems unlikely that the data science being performed five years from now will resemble much of how things are done today[^scotty]. We are already capable of making faster and further advancements in many domains due to AI, and it's likely that the next generation of data scientists will be able to do so even more easily. The future is here, and it is amazing. Buckle up!
[^scotty]: A good reference for this sentiment is a scene from Star Trek in which [Scotty has to use a contemporary computer](https://www.youtube.com/watch?v=hShY6xZWVGE).
## Wrapping Up {#sec-ml-more-wrap}
We hope you've enjoyed the journey and have a better understanding of the core concepts. By now you also have a couple modeling tools in hand, and also have a good idea of where things can go. We encourage you to continue learning and experimenting with what you've seen, and to apply what you've learned to your own problems. The best way to learn is by doing, so don't be afraid to get your hands dirty and start building models!
### The common thread {#sec-ml-more-common}
Even the most complex models can be thought of as a series of steps that go from input to output. In between, things can get very complicated, but often the underlying operations are the same ones you saw used with the simplest models. One of the key goals of any model is to generalize to new data, and this is the same no matter what type of data you're working with or what type of model you're using.
### Choose your own adventure {#sec-ml-more-adventure}
The sky's the limit with machine learning modeling techniques, so go where your heart leads you, and have some fun! If you started here, feel free to go back to the linear model chapters for a more traditional and statistical modeling overview. Otherwise, continue on for an overview of a few more modeling topics, such as causal modeling (@sec-causal), data issues (@sec-data), and things to avoid (@sec-danger).
### Additional resources {#sec-ml-more-resources}
- Courses on ML and DL: FastAI (@howard_practical_2024), [Coursera](https://www.coursera.org/collections/machine-learning), [edX](https://www.edx.org/), [DeepLearning.AI](https://www.deeplearning.ai/), and many others are great places to get more formal training.
- [Kaggle](https://www.kaggle.com/): Even if you don't compete, you can learn a lot from what others are doing.
- [Unsupervised learning overview at Google](https://cloud.google.com/discover/what-is-unsupervised-learning)
- Machine Learning and AI: Beyond the Basics (@raschka_machine_2023)
- A Visual Introduction to LLMs (@3blue1brown_how_2024)
- Build a LLM from Scratch (@raschka_build_2023)
- [Visualizing Transformer Models](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1) (@vig_deconstructing_2019)
- Self-supervised learning: The dark matter of intelligence (@lecun_self-supervised_2021)