From 26ba4bf0e0bbb56736331d89cc70c51fd437b7b2 Mon Sep 17 00:00:00 2001
From: Ayush Joshi
Date: Thu, 16 Nov 2023 11:38:04 +0530
Subject: [PATCH] Added a brief explanation of `Training Neural Networks` to the main `ai` documentation

Signed-off-by: Ayush Joshi
---
 .github/workflows/docs.yml          |  4 +--
 ai/__init__.py                      |  1 +
 docs/ml/Neural-Networks.md          |  3 ++-
 docs/ml/README.md                   |  2 +-
 docs/ml/Training-Neural-Networks.md | 39 +++++++++++++++++++++++++++++
 5 files changed, 45 insertions(+), 4 deletions(-)
 create mode 100644 docs/ml/Training-Neural-Networks.md

diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index 2753d83..4ef43e3 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -2,8 +2,8 @@ name: github-pages
 
 on:
   push:
-    branches:
-      - docs
+    paths:
+      - 'CHANGELOG.md'
 
 permissions:
   actions: write
diff --git a/ai/__init__.py b/ai/__init__.py
index eb05963..60410dd 100644
--- a/ai/__init__.py
+++ b/ai/__init__.py
@@ -90,6 +90,7 @@ def _PreprocessReadme(fpath: Union[str, pathlib.Path]) -> str:
     'Classification.md',
     'Regularization-for-Sparsity.md',
     'Neural-Networks.md',
+    'Training-Neural-Networks.md',
 )
diff --git a/docs/ml/Neural-Networks.md b/docs/ml/Neural-Networks.md
index 802552d..ec3f59a 100644
--- a/docs/ml/Neural-Networks.md
+++ b/docs/ml/Neural-Networks.md
@@ -126,4 +126,5 @@ Now our model has all the standard components of what people usually mean when t
 * A set of nodes, analogous to neurons, organized in layers.
 * A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
 * A set of biases, one for each node.
-* An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
\ No newline at end of file
+* An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
+
diff --git a/docs/ml/README.md b/docs/ml/README.md
index 91b715f..6a6b69b 100644
--- a/docs/ml/README.md
+++ b/docs/ml/README.md
@@ -14,6 +14,6 @@
 12. [Classification](https://github.com/joshiayush/ai/blob/master/docs/ml/Classification.md)
 13. [Regularization for Sparsity](https://github.com/joshiayush/ai/blob/master/docs/ml/Regularization-for-Sparsity.md)
 14. [Neural Networks](https://github.com/joshiayush/ai/blob/master/docs/ml/Neural-Networks.md)
-15. Training Neural Nets
+15. [Training Neural Networks](https://github.com/joshiayush/ai/blob/master/docs/ml/Training-Neural-Networks.md)
 16. Multi-Class Neural Nets
 17. Embeddings
\ No newline at end of file
diff --git a/docs/ml/Training-Neural-Networks.md b/docs/ml/Training-Neural-Networks.md
new file mode 100644
index 0000000..b2dacdc
--- /dev/null
+++ b/docs/ml/Training-Neural-Networks.md
@@ -0,0 +1,39 @@
+# Training Neural Networks
+
+**Backpropagation** is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks.
+
+## Best Practices
+
+This section explains backpropagation's failure cases and the most common way to regularize a neural network.
+
+### Failure Cases
+
+There are a number of common ways for backpropagation to go wrong.
+
+#### Vanishing Gradients
+
+The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.
+
+When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.
+
+The ReLU activation function can help prevent vanishing gradients.
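+
+Concretely, the gradient that reaches a lower layer is a product of one derivative term per layer above it. Below is a minimal NumPy sketch of that product for sigmoid versus ReLU activations (the depth of 20 layers and the 0.25 bound on the sigmoid derivative are illustrative assumptions, not values taken from this repository's code):
+
+```python
+import numpy as np
+
+depth = 20
+# Best case for sigmoid: its derivative never exceeds 0.25.
+sigmoid_grads = np.full(depth, 0.25)
+# ReLU's derivative is exactly 1 for positive inputs.
+relu_grads = np.full(depth, 1.0)
+
+print(np.prod(sigmoid_grads))  # ~9.1e-13 -- the gradient has effectively vanished
+print(np.prod(relu_grads))     # 1.0 -- the gradient survives the full depth
+```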
+
+#### Exploding Gradients
+
+If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge.
+
+Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
+
+#### Dead ReLU Units
+
+Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may never change enough to bring the weighted sum back above 0.
+
+Lowering the learning rate can help keep ReLU units from dying.
+
+### Dropout Regularization
+
+Yet another form of regularization, called **Dropout**, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:
+
+* 0.0 = No dropout regularization.
+* 1.0 = Drop out everything. The model learns nothing.
+* Values between 0.0 and 1.0 = More useful.
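+
+As a concrete illustration, here is a minimal NumPy sketch of "inverted" dropout applied to one layer's activations for a single training step (the rate of 0.5, the random seed, and the fake activations are arbitrary choices for the example; in practice a framework's built-in dropout layer handles this):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+rate = 0.5                            # fraction of unit activations to drop
+activations = rng.random(10)          # stand-in for one layer's activations
+
+# Zero out a random subset of units, then rescale the survivors so the
+# expected magnitude of the layer's output is unchanged.
+keep_mask = rng.random(activations.shape) >= rate
+dropped = np.where(keep_mask, activations, 0.0) / (1.0 - rate)
+
+print(dropped)  # roughly half the units are zeroed for this gradient step
+```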