From d9fbff84e890b3005cc16c735de3b33733a8575f Mon Sep 17 00:00:00 2001 From: Ayush Joshi Date: Tue, 14 Nov 2023 11:49:25 +0530 Subject: [PATCH] Removed "Introduction to Tensorflow" to keep the document focused only on the theoretical part Signed-off-by: Ayush Joshi --- docs/ml/Introduction-to-TensorFlow.md | 629 --------------- notebooks/ml/Machine_Learning.ipynb | 1025 +------------------------ 2 files changed, 2 insertions(+), 1652 deletions(-) delete mode 100644 docs/ml/Introduction-to-TensorFlow.md diff --git a/docs/ml/Introduction-to-TensorFlow.md b/docs/ml/Introduction-to-TensorFlow.md deleted file mode 100644 index ff9512b..0000000 --- a/docs/ml/Introduction-to-TensorFlow.md +++ /dev/null @@ -1,629 +0,0 @@ -# Introduction to TensorFlow - -TensorFlow is an end-to-end open source platform for machine learning. TensorFlow is a rich system for managing all aspects of a machine learning system; however, this class focuses on using a particular TensorFlow API to develop and train machine learning models. See the [TensorFlow documentation](https://tensorflow.org/) for complete details on the broader TensorFlow system. - -TensorFlow APIs are arranged hierarchically, with the high-level APIs built on the low-level APIs. Machine learning researchers use the low-level APIs to create and explore new machine learning algorithms. In this class, you will use a high-level API named tf.keras to define and train machine learning models and to make predictions. tf.keras is the TensorFlow variant of the open-source [Keras](https://keras.io/) API. - -The following figure shows the hierarchy of TensorFlow toolkits: - -
-  Figure 1. TensorFlow toolkit hierarchy.
-</div>
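To make the hierarchy concrete, here is a minimal, illustrative sketch (not part of the original exercises) contrasting a low-level tensor operation with the high-level tf.keras API used throughout this document; the constants and layer size are arbitrary examples:

```python
import tensorflow as tf

# Low-level API: operate directly on tensors.
x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
y = tf.matmul(x, w)  # a 1x1 tensor containing [[11.0]]

# High-level API (tf.keras): describe a model declaratively and let
# TensorFlow handle the underlying tensor operations and training loop.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))
model.compile(optimizer='rmsprop', loss='mean_squared_error')
```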
- -## Linear regression with tf.keras - -### Simple Linear regression with Synthetic Data - -```python -import pandas as pd -import tensorflow as tf - -from matplotlib import pyplot as plt -``` - -### Define functions that build and train a model - -The following code defines two functions: - - * `build_model(my_learning_rate)`, which builds an empty model. - * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass. - -Since you don't need to understand model building code right now, you may optionally explore this code. - -```python -def build_model(learning_rate): - """Create and compile a simple linear regression model.""" - # Most simple tf.keras models are sequential. - # A sequential model contains one or more layers. - model = tf.keras.models.Sequential() - - # Describe the topography of the model. - # The topography of a simple linear regression model - # is a single node in a single layer. - model.add(tf.keras.layers.Dense(units=1, - input_shape=(1,))) - - # Compile the model topography into code that - # TensorFlow can efficiently execute. Configure - # training to minimize the model's mean squared error. - model.compile(optimizer=tf.keras.optimizers.RMSprop( - learning_rate=learning_rate), - loss='mean_squared_error', - metrics=[tf.keras.metrics.RootMeanSquaredError()]) - - return model - - -def train_model(model, feature, label, epochs, batch_size): - """Train the model by feeding it data.""" - - # Feed the feature values and the label values to the - # model. The model will train for the specified number - # of epochs, gradually learning how the feature values - # relate to the label values. - history = model.fit(x=feature, - y=label, - batch_size=batch_size, - epochs=epochs) - - # Gather the trained model's weight and bias. - trained_weight = model.get_weights()[0] - trained_bias = model.get_weights()[1] - - # The list of epochs is stored separately from the - # rest of history. - epochs = history.epoch - - # Gather the history (a snapshot) of each epoch. - hist = pd.DataFrame(history.history) - - # Specifically gather the model's root mean - # squared error at each epoch. - rmse = hist['root_mean_squared_error'] - - return trained_weight, trained_bias, epochs, rmse -``` - -### Define plotting functions - -We're using a popular Python library called [Matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) to create the following two plots: - -* a plot of the feature values vs. the label values, and a line showing the output of the trained model. -* a [loss curve](https://developers.google.com/machine-learning/glossary/#loss_curve). - -```python -def plot_the_model(trained_weight, trained_bias, feature, label): - """Plot the trained model against the training feature and label.""" - - # Label the axes. - plt.xlabel('feature') - plt.ylabel('label') - - # Plot the feature values vs. label values. - plt.scatter(feature, label) - - # Create a red line representing the model. The red line starts - # at coordinates (x0, y0) and ends at coordinates (x1, y1). - x0 = 0 - y0 = trained_bias - x1 = feature[-1] - y1 = trained_bias + (trained_weight * x1) - plt.plot([x0, x1], [y0, y1], c='r') - - # Render the scatter plot and the red line. - plt.show() - -def plot_the_loss_curve(epochs, rmse): - """Plot the loss curve, which shows loss vs. 
epoch.""" - - plt.figure() - plt.xlabel('Epoch') - plt.ylabel('Root Mean Squared Error') - - plt.plot(epochs, rmse, label='Loss') - plt.legend() - plt.ylim([rmse.min() * 0.97, rmse.max()]) - plt.show() -``` - -### Define the dataset - -The dataset consists of 12 [examples](https://developers.google.com/machine-learning/glossary/#example). Each example consists of one [feature](https://developers.google.com/machine-learning/glossary/#feature) and one [label](https://developers.google.com/machine-learning/glossary/#label). - -```python -feature = ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]) -label = ([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2]) -``` - -### Specify the hyperparameters - -The hyperparameters in this Colab are as follows: - - * [learning rate](https://developers.google.com/machine-learning/glossary/#learning_rate) - * [epochs](https://developers.google.com/machine-learning/glossary/#epoch) - * [batch_size](https://developers.google.com/machine-learning/glossary/#batch_size) - -The following code cell initializes these hyperparameters and then invokes the functions that build and train the model. - -```python -learning_rate=0.01 -epochs=10 -batch_size=12 - -model = build_model(learning_rate) -trained_weight, trained_bias, epochs, rmse = train_model(model, feature, - label, epochs, - batch_size) -plot_the_model(trained_weight, trained_bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -### Task 1: Examine the graphs - -Examine the top graph. The blue dots identify the actual data; the red line identifies the output of the trained model. Ideally, the red line should align nicely with the blue dots. Does it? Probably not. - -A certain amount of randomness plays into training a model, so you'll get somewhat different results every time you train. That said, unless you are an extremely lucky person, the red line probably *doesn't* align nicely with the blue dots. - -Examine the bottom graph, which shows the loss curve. Notice that the loss curve decreases but doesn't flatten out, which is a sign that the model hasn't trained sufficiently. - -### Task 2: Increase the number of epochs - -Training loss should steadily decrease, steeply at first, and then more slowly. Eventually, training loss should eventually stay steady (zero slope or nearly zero slope), which indicates that training has [converged](http://developers.google.com/machine-learning/glossary/#convergence). - -In Task 1, the training loss did not converge. One possible solution is to train for more epochs. Your task is to increase the number of epochs sufficiently to get the model to converge. However, it is inefficient to train past convergence, so don't just set the number of epochs to an arbitrarily high value. - -Examine the loss curve. Does the model converge? - -```python -learning_rate=0.01 -epochs=450 -batch_size=12 - -model = build_model(learning_rate) -trained_weight, trained_bias, epochs, rmse = train_model(model, feature, - label, epochs, - batch_size) -plot_the_model(trained_weight, trained_bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -### Task 3: Increase the learning rate - -In Task 2, you increased the number of epochs to get the model to converge. Sometimes, you can get the model to converge more quickly by increasing the learning rate. However, setting the learning rate too high often makes it impossible for a model to converge. In Task 3, we've intentionally set the learning rate too high. 
Run the following code cell and see what happens. - -```python -learning_rate=100 -epochs=500 -batch_size = batch_size - -model = build_model(learning_rate) -trained_weight, trained_bias, epochs, rmse = train_model(model, feature, - label, epochs, - batch_size) -plot_the_model(trained_weight, trained_bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -The resulting model is terrible; the red line doesn't align with the blue dots. Furthermore, the loss curve oscillates like a [roller coaster](https://www.wikipedia.org/wiki/Roller_coaster). An oscillating loss curve strongly suggests that the learning rate is too high. - -### Task 4: Find the ideal combination of epochs and learning rate - -Assign values to the following two hyperparameters to make training converge as efficiently as possible: - -* `learning_rate` -* `epochs` - -```python -learning_rate=0.14 -epochs=70 -batch_size = batch_size - -model = build_model(learning_rate) -trained_weight, trained_bias, epochs, rmse = train_model(model, feature, - label, epochs, - batch_size) -plot_the_model(trained_weight, trained_bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -### Task 5: Adjust the batch size - -The system recalculates the model's loss value and adjusts the model's weights and bias after each **iteration**. Each iteration is the span in which the system processes one batch. For example, if the **batch size** is 6, then the system recalculates the model's loss value and adjusts the model's weights and bias after processing every 6 examples. - -One **epoch** spans sufficient iterations to process every example in the dataset. For example, if the batch size is 12, then each epoch lasts one iteration. However, if the batch size is 6, then each epoch consumes two iterations. - -It is tempting to simply set the batch size to the number of examples in the dataset (12, in this case). However, the model might actually train faster on smaller batches. Conversely, very small batches might not contain enough information to help the model converge. - -Experiment with `batch_size` in the following code cell. What's the smallest integer you can set for `batch_size` and still have the model converge in a hundred epochs? - -```python -learning_rate=0.05 -epochs=125 -batch_size=1 # Wow, a batch size of 1 works! - -model = build_model(learning_rate) -trained_weight, trained_bias, epochs, rmse = train_model(model, feature, - label, epochs, - batch_size) -plot_the_model(trained_weight, trained_bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -### Summary of Hyperparameter Tuning - -Most machine learning problems require a lot of hyperparameter tuning. Unfortunately, we can't provide concrete tuning rules for every model. Lowering the learning rate can help one model converge efficiently but make another model converge much too slowly. You must experiment to find the best set of hyperparameters for your dataset. That said, here are a few rules of thumb: - - * Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero. - * If the training loss does not converge, train for more epochs. - * If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging. - * If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate. 
- * Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination. - * Setting the batch size to a *very* small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation. - * For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you'll need to reduce the batch size to enable a batch to fit into memory. - -Remember: the ideal combination of hyperparameters is data dependent, so you must always experiment and verify. - -## Linear Regression with a Real Dataset - -Now we are going to use a real dataset to predict the prices of houses in California. - -### The Dataset - -The [dataset for this exercise](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) is based on 1990 census data from California. The dataset is old but still provides a great opportunity to learn about machine learning programming. - -```python -import pandas as pd -import tensorflow as tf -from matplotlib import pyplot as plt - -# The following lines adjust the granularity of reporting. -pd.options.display.max_rows = 10 -pd.options.display.float_format = "{:.1f}".format -``` - -### The dataset - -Datasets are often stored on disk or at a URL in [.csv format](https://wikipedia.org/wiki/Comma-separated_values). - -A well-formed .csv file contains column names in the first row, followed by many rows of data. A comma divides each value in each row. For example, here are the first five rows of the .csv file holding the California Housing Dataset: - -``` -"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value" --114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000 --114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000 --114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000 --114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000 -``` - -### Load the .csv file into a pandas DataFrame - -Like many machine learning programs, we gather the `.csv` file and stores the data in memory as a pandas Dataframe. Pandas is an open source Python library. The primary datatype in pandas is a DataFrame. You can imagine a pandas DataFrame as a spreadsheet in which each row is identified by a number and each column by a name. Pandas is itself built on another open source Python library called NumPy. - -The following code cell imports the .csv file into a pandas DataFrame and scales the values in the label (`median_house_value`): - -```python -# Import the dataset. -training_df = pd.read_csv(filepath_or_buffer="https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv") - -# Scale the label. -training_df["median_house_value"] /= 1000.0 - -# Print the first rows of the pandas DataFrame. -training_df.head() -``` - -Scaling `median_house_value` puts the value of each house in units of thousands. Scaling will keep loss values and learning rates in a friendlier range. - -Although scaling a label is usually *not* essential, scaling features in a multi-feature model usually *is* essential. - -### Examine the dataset - -A large part of most machine learning projects is getting to know your data. 
The pandas API provides a `describe` function that outputs the following statistics about every column in the DataFrame: - -* `count`, which is the number of rows in that column. Ideally, `count` contains the same value for every column. - -* `mean` and `std`, which contain the mean and standard deviation of the values in each column. - -* `min` and `max`, which contain the lowest and highest values in each column. - -* `25%`, `50%`, `75%`, which contain various [quantiles](https://developers.google.com/machine-learning/glossary/#quantile). - -```python -# Get statistics on the dataset. -training_df.describe() -``` - -### Task 1: Identify anomalies in the dataset - -Do you see any anomalies (strange values) in the data? - -> The maximum value (max) of several columns seems very high compared to the other quantiles. For example, example the total_rooms column. Given the quantile values (25%, 50%, and 75%), you might expect the max value of total_rooms to be approximately 5,000 or possibly 10,000. However, the max value is actually 37,937. -> -> When you see anomalies in a column, become more careful about using that column as a feature. That said, anomalies in potential features sometimes mirror anomalies in the label, which could make the column be (or seem to be) a powerful feature. - -### Define functions that build and train a model - -The following code defines two functions: - - * `build_model(my_learning_rate)`, which builds a randomly-initialized model. - * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass. - -Since you don't need to understand model building code right now, you may optionally explore this code. - -```python -def build_model(my_learning_rate): - """Create and compile a simple linear regression model.""" - # Most simple tf.keras models are sequential. - model = tf.keras.models.Sequential() - - # Describe the topography of the model. - # The topography of a simple linear regression model - # is a single node in a single layer. - model.add(tf.keras.layers.Dense(units=1, - input_shape=(1,))) - - # Compile the model topography into code that TensorFlow can efficiently - # execute. Configure training to minimize the model's mean squared error. - model.compile(optimizer=tf.keras.optimizers.RMSprop( - learning_rate=my_learning_rate), - loss="mean_squared_error", - metrics=[tf.keras.metrics.RootMeanSquaredError()]) - - return model - - -def train_model(model, df, feature, label, epochs, batch_size): - """Train the model by feeding it data.""" - - # Feed the model the feature and the label. - # The model will train for the specified number of epochs. - history = model.fit(x=df[feature], - y=df[label], - batch_size=batch_size, - epochs=epochs) - - # Gather the trained model's weight and bias. - trained_weight = model.get_weights()[0] - trained_bias = model.get_weights()[1] - - # The list of epochs is stored separately from the rest of history. - epochs = history.epoch - - # Isolate the error for each epoch. - hist = pd.DataFrame(history.history) - - # To track the progression of training, we're going to take a snapshot - # of the model's root mean squared error at each epoch. - rmse = hist["root_mean_squared_error"] - - return trained_weight, trained_bias, epochs, rmse -``` - -### Define plotting functions - -We're using a popular Python library called [Matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) to create the following two plots: - -* a plot of the feature values vs. 
the label values, and a line showing the output of the trained model. -* a [loss curve](https://developers.google.com/machine-learning/glossary/#loss_curve). - -```python -def plot_the_model(trained_weight, trained_bias, feature, label): - """Plot the trained model against 200 random training examples.""" - - # Label the axes. - plt.xlabel(feature) - plt.ylabel(label) - - # Create a scatter plot from 200 random points of the dataset. - random_examples = training_df.sample(n=200) - plt.scatter(random_examples[feature], random_examples[label]) - - # Create a red line representing the model. The red line starts - # at coordinates (x0, y0) and ends at coordinates (x1, y1). - x0 = 0 - y0 = trained_bias - x1 = random_examples[feature].max() - y1 = trained_bias + (trained_weight * x1) - plt.plot([x0, x1], [y0, y1], c='r') - - # Render the scatter plot and the red line. - plt.show() - - -def plot_the_loss_curve(epochs, rmse): - """Plot a curve of loss vs. epoch.""" - - plt.figure() - plt.xlabel("Epoch") - plt.ylabel("Root Mean Squared Error") - - plt.plot(epochs, rmse, label="Loss") - plt.legend() - plt.ylim([rmse.min() * 0.97, rmse.max()]) - plt.show() -``` - -### Call the model functions - -An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). For example, real-life home-value prediction models typically rely on hundreds of features and synthetic features. However, this model relies on only one feature. For now, you'll arbitrarily use `total_rooms` as that feature. - - -```python -# The following variables are the hyperparameters. -learning_rate = 0.01 -epochs = 30 -batch_size = 30 - -# Specify the feature and the label. -feature = "total_rooms" # the total number of rooms on a specific city block. -label="median_house_value" # the median value of a house on a specific city block. -# That is, you're going to create a model that predicts house value based -# solely on total_rooms. - -# Discard any pre-existing version of the model. -model = None - -# Invoke the functions. -model = build_model(learning_rate) -weight, bias, epochs, rmse = train_model(model, training_df, - feature, label, - epochs, batch_size) - -print("\nThe learned weight for your model is %.4f" % weight) -print("The learned bias for your model is %.4f\n" % bias ) - -plot_the_model(weight, bias, feature, label) -plot_the_loss_curve(epochs, rmse) -``` - -A certain amount of randomness plays into training a model. Consequently, you'll get different results each time you train the model. That said, given the dataset and the hyperparameters, the trained model will generally do a poor job describing the feature's relation to the label. - -### Use the model to make predictions - -You can use the trained model to make predictions. In practice, [you should make predictions on examples that are not used in training](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data). However, for this exercise, you'll just work with a subset of the same training dataset. 
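(As an aside, one possible way to hold out such a test set with pandas is sketched below; the 80/20 split and `random_state=1` are arbitrary illustrative choices, and this split is not used anywhere in the rest of the exercise.)

```python
# Illustrative only: hold out 20% of the rows as a test set.
train_split = training_df.sample(frac=0.8, random_state=1)
test_split = training_df.drop(train_split.index)
```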
- -First, run the following code to define the house prediction function: - -```python -def predict_house_values(n, feature, label): - """Predict house values based on a feature.""" - - batch = training_df[feature][10000:10000 + n] - predicted_values = model.predict_on_batch(x=batch) - - print("feature label predicted") - print(" value value value") - print(" in thousand$ in thousand$") - print("--------------------------------------") - for i in range(n): - print ("%5.0f %6.0f %15.0f" % (training_df[feature][10000 + i], - training_df[label][10000 + i], - predicted_values[i][0] )) -``` - -Now, invoke the house prediction function on 10 examples: - -```python -predict_house_values(10, feature, label) -``` - -### Task 2: Judge the predictive power of the model - -Look at the preceding table. How close is the predicted value to the label value? In other words, does your model accurately predict house values? - -> Most of the predicted values differ significantly from the label value, so the trained model probably doesn't have much predictive power. However, the first 10 examples might not be representative of the rest of the examples. - -### Task 3: Try a different feature - -The `total_rooms` feature had only a little predictive power. Would a different feature have greater predictive power? Try using `population` as the feature instead of `total_rooms`. - -Note: When you change features, you might also need to change the hyperparameters. - -```python -# Pick a feature other than "total_rooms" -feature = "population" - -# Possibly, experiment with the hyperparameters. -learning_rate = 0.05 -epochs = 18 -batch_size = 3 - -# Don't change anything below. -model = build_model(learning_rate) -weight, bias, epochs, rmse = train_model(model, training_df, - feature, label, - epochs, batch_size) - -plot_the_model(weight, bias, feature, label) -plot_the_loss_curve(epochs, rmse) - -predict_house_values(10, feature, label) -``` - -Did `population` produce better predictions than `total_rooms`? - -> Training is not entirely deterministic, but population typically converges at a slightly higher RMSE than total_rooms. So, population appears to be about the same or slightly worse at making predictions than total_rooms. - -### Task 4: Define a synthetic feature - -You have determined that `total_rooms` and `population` were not useful features. That is, neither the total number of rooms in a neighborhood nor the neighborhood's population successfully predicted the median house price of that neighborhood. Perhaps though, the *ratio* of `total_rooms` to `population` might have some predictive power. That is, perhaps block density relates to median house value. - -To explore this hypothesis, do the following: - -1. Create a [synthetic feature](https://developers.google.com/machine-learning/glossary/#synthetic_feature) that's a ratio of `total_rooms` to `population`. -2. Tune the three hyperparameters. -3. Determine whether this synthetic feature produces - a lower loss value than any of the single features you - tried earlier. - -```python -# Define a synthetic feature -training_df["rooms_per_person"] = training_df["total_rooms"] / training_df["population"] -feature = "rooms_per_person" - -# Tune the hyperparameters. -learning_rate = 0.06 -epochs = 24 -batch_size = 30 - -# Don't change anything below this line. 
-model = build_model(learning_rate) -weight, bias, epochs, mae = train_model(model, training_df, - feature, label, - epochs, batch_size) - -plot_the_model(weight, bias, feature, label) -plot_the_loss_curve(epochs, mae) -predict_house_values(15, feature, label) -``` - -Based on the loss values, this synthetic feature produces a better model than the individual features you tried in Task 2 and Task 3. However, the model still isn't creating great predictions. - -### Task 5. Find feature(s) whose raw values correlate with the label - -So far, we've relied on trial-and-error to identify possible features for the model. Let's rely on statistics instead. - -A **correlation matrix** indicates how each attribute's raw values relate to the other attributes' raw values. Correlation values have the following meanings: - - * `1.0`: perfect positive correlation; that is, when one attribute rises, the other attribute rises. - * `-1.0`: perfect negative correlation; that is, when one attribute rises, the other attribute falls. - * `0.0`: no correlation; the two columns [are not linearly related](https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg). - -In general, the higher the absolute value of a correlation value, the greater its predictive power. For example, a correlation value of -0.8 implies far more predictive power than a correlation of -0.2. - -The following code cell generates the correlation matrix for attributes of the California Housing Dataset: - -```python -# Generate a correlation matrix. -training_df.corr() -``` - -The correlation matrix shows nine potential features (including a synthetic -feature) and one label (`median_house_value`). A strong negative correlation or strong positive correlation with the label suggests a potentially good feature. - -**Your Task:** Determine which of the nine potential features appears to be the best candidate for a feature? - -> The median_income correlates 0.7 with the label (median_house_value), so median_income might be a good feature. The other seven potential features all have a correlation relatively close to 0. - -```python -feature = "median_income" - -# Possibly, experiment with the hyperparameters. -learning_rate = 0.01 -epochs = 10 -batch_size = 3 - -# Don't change anything below. -model = build_model(learning_rate) -weight, bias, epochs, rmse = train_model(model, training_df, - feature, label, - epochs, batch_size) - -plot_the_model(weight, bias, feature, label) -plot_the_loss_curve(epochs, rmse) - -predict_house_values(10, feature, label) -``` - diff --git a/notebooks/ml/Machine_Learning.ipynb b/notebooks/ml/Machine_Learning.ipynb index 9f61742..c7c5072 100644 --- a/notebooks/ml/Machine_Learning.ipynb +++ b/notebooks/ml/Machine_Learning.ipynb @@ -3,8 +3,7 @@ { "cell_type": "markdown", "metadata": { - "id": "view-in-github", - "colab_type": "text" + "id": "view-in-github" }, "source": [ "\"Open" @@ -645,1026 +644,6 @@ "In summary, **Stochastic Gradient Descent** is an optimization algorithm that is efficient and can help to find the global minimum of a function. It has been widely used in machine learning and deep learning tasks." ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "0FlUVkD4-kE9" - }, - "source": [ - "# Introduction to TensorFlow\n", - "\n", - "TensorFlow is an end-to-end open source platform for machine learning. 
TensorFlow is a rich system for managing all aspects of a machine learning system; however, this class focuses on using a particular TensorFlow API to develop and train machine learning models. See the [TensorFlow documentation](https://tensorflow.org/) for complete details on the broader TensorFlow system.\n", - "\n", - "TensorFlow APIs are arranged hierarchically, with the high-level APIs built on the low-level APIs. Machine learning researchers use the low-level APIs to create and explore new machine learning algorithms. In this class, you will use a high-level API named tf.keras to define and train machine learning models and to make predictions. tf.keras is the TensorFlow variant of the open-source [Keras](https://keras.io/) API.\n", - "\n", - "The following figure shows the hierarchy of TensorFlow toolkits:" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FeDKXYBR-kE9" - }, - "source": [ - "
\n",
- "  Figure 1. TensorFlow toolkit hierarchy.\n",
- "</div>
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TIM8MGrI-kE-" - }, - "source": [ - "## Linear regression with tf.keras" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bYxlpYOx-kE_" - }, - "source": [ - "### Simple Linear regression with Synthetic Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DzcMgP4G-kFC" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import tensorflow as tf\n", - "\n", - "from matplotlib import pyplot as plt" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "91aFmOPi-kFQ" - }, - "source": [ - "### Define functions that build and train a model\n", - "\n", - "The following code defines two functions:\n", - "\n", - " * `build_model(my_learning_rate)`, which builds an empty model.\n", - " * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass.\n", - "\n", - "Since you don't need to understand model building code right now, you may optionally explore this code." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "YdcgMG1C-kFT" - }, - "outputs": [], - "source": [ - "def build_model(learning_rate):\n", - " \"\"\"Create and compile a simple linear regression model.\"\"\"\n", - " # Most simple tf.keras models are sequential.\n", - " # A sequential model contains one or more layers.\n", - " model = tf.keras.models.Sequential()\n", - "\n", - " # Describe the topography of the model.\n", - " # The topography of a simple linear regression model\n", - " # is a single node in a single layer.\n", - " model.add(tf.keras.layers.Dense(units=1,\n", - " input_shape=(1,)))\n", - "\n", - " # Compile the model topography into code that\n", - " # TensorFlow can efficiently execute. Configure\n", - " # training to minimize the model's mean squared error.\n", - " model.compile(optimizer=tf.keras.optimizers.RMSprop(\n", - " learning_rate=learning_rate),\n", - " loss='mean_squared_error',\n", - " metrics=[tf.keras.metrics.RootMeanSquaredError()])\n", - "\n", - " return model\n", - "\n", - "\n", - "def train_model(model, feature, label, epochs, batch_size):\n", - " \"\"\"Train the model by feeding it data.\"\"\"\n", - "\n", - " # Feed the feature values and the label values to the\n", - " # model. 
The model will train for the specified number\n", - " # of epochs, gradually learning how the feature values\n", - " # relate to the label values.\n", - " history = model.fit(x=feature,\n", - " y=label,\n", - " batch_size=batch_size,\n", - " epochs=epochs)\n", - "\n", - " # Gather the trained model's weight and bias.\n", - " trained_weight = model.get_weights()[0]\n", - " trained_bias = model.get_weights()[1]\n", - "\n", - " # The list of epochs is stored separately from the\n", - " # rest of history.\n", - " epochs = history.epoch\n", - "\n", - " # Gather the history (a snapshot) of each epoch.\n", - " hist = pd.DataFrame(history.history)\n", - "\n", - " # Specifically gather the model's root mean\n", - " # squared error at each epoch.\n", - " rmse = hist['root_mean_squared_error']\n", - "\n", - " return trained_weight, trained_bias, epochs, rmse" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sBpYaQkp-kFW" - }, - "source": [ - "### Define plotting functions\n", - "\n", - "We're using a popular Python library called [Matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) to create the following two plots:\n", - "\n", - "* a plot of the feature values vs. the label values, and a line showing the output of the trained model.\n", - "* a [loss curve](https://developers.google.com/machine-learning/glossary/#loss_curve)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gKJFeeeQ-kFY" - }, - "outputs": [], - "source": [ - "def plot_the_model(trained_weight, trained_bias, feature, label):\n", - " \"\"\"Plot the trained model against the training feature and label.\"\"\"\n", - "\n", - " # Label the axes.\n", - " plt.xlabel('feature')\n", - " plt.ylabel('label')\n", - "\n", - " # Plot the feature values vs. label values.\n", - " plt.scatter(feature, label)\n", - "\n", - " # Create a red line representing the model. The red line starts\n", - " # at coordinates (x0, y0) and ends at coordinates (x1, y1).\n", - " x0 = 0\n", - " y0 = trained_bias\n", - " x1 = feature[-1]\n", - " y1 = trained_bias + (trained_weight * x1)\n", - " plt.plot([x0, x1], [y0, y1], c='r')\n", - "\n", - " # Render the scatter plot and the red line.\n", - " plt.show()\n", - "\n", - "def plot_the_loss_curve(epochs, rmse):\n", - " \"\"\"Plot the loss curve, which shows loss vs. epoch.\"\"\"\n", - "\n", - " plt.figure()\n", - " plt.xlabel('Epoch')\n", - " plt.ylabel('Root Mean Squared Error')\n", - "\n", - " plt.plot(epochs, rmse, label='Loss')\n", - " plt.legend()\n", - " plt.ylim([rmse.min() * 0.97, rmse.max()])\n", - " plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EkH59w_y-kFZ" - }, - "source": [ - "### Define the dataset\n", - "\n", - "The dataset consists of 12 [examples](https://developers.google.com/machine-learning/glossary/#example). Each example consists of one [feature](https://developers.google.com/machine-learning/glossary/#feature) and one [label](https://developers.google.com/machine-learning/glossary/#label)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XMI2qzhR-kFZ" - }, - "outputs": [], - "source": [ - "feature = ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])\n", - "label = ([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "W1oIGROF-kFd" - }, - "source": [ - "### Specify the hyperparameters\n", - "\n", - "The hyperparameters in this Colab are as follows:\n", - "\n", - " * [learning rate](https://developers.google.com/machine-learning/glossary/#learning_rate)\n", - " * [epochs](https://developers.google.com/machine-learning/glossary/#epoch)\n", - " * [batch_size](https://developers.google.com/machine-learning/glossary/#batch_size)\n", - "\n", - "The following code cell initializes these hyperparameters and then invokes the functions that build and train the model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QAw4p9X5-kFf" - }, - "outputs": [], - "source": [ - "learning_rate=0.01\n", - "epochs=10\n", - "batch_size=12\n", - "\n", - "model = build_model(learning_rate)\n", - "trained_weight, trained_bias, epochs, rmse = train_model(model, feature,\n", - " label, epochs,\n", - " batch_size)\n", - "plot_the_model(trained_weight, trained_bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aU-jG1VV-kFi" - }, - "source": [ - "### Task 1: Examine the graphs\n", - "\n", - "Examine the top graph. The blue dots identify the actual data; the red line identifies the output of the trained model. Ideally, the red line should align nicely with the blue dots. Does it? Probably not.\n", - "\n", - "A certain amount of randomness plays into training a model, so you'll get somewhat different results every time you train. That said, unless you are an extremely lucky person, the red line probably *doesn't* align nicely with the blue dots. \n", - "\n", - "Examine the bottom graph, which shows the loss curve. Notice that the loss curve decreases but doesn't flatten out, which is a sign that the model hasn't trained sufficiently." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZdZWKVDQ-kFk" - }, - "source": [ - "### Task 2: Increase the number of epochs\n", - "\n", - "Training loss should steadily decrease, steeply at first, and then more slowly. Eventually, training loss should eventually stay steady (zero slope or nearly zero slope), which indicates that training has [converged](http://developers.google.com/machine-learning/glossary/#convergence).\n", - "\n", - "In Task 1, the training loss did not converge. One possible solution is to train for more epochs. Your task is to increase the number of epochs sufficiently to get the model to converge. However, it is inefficient to train past convergence, so don't just set the number of epochs to an arbitrarily high value.\n", - "\n", - "Examine the loss curve. Does the model converge?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rNWGM8U9-kFn" - }, - "outputs": [], - "source": [ - "learning_rate=0.01\n", - "epochs=450\n", - "batch_size=12\n", - "\n", - "model = build_model(learning_rate)\n", - "trained_weight, trained_bias, epochs, rmse = train_model(model, feature,\n", - " label, epochs,\n", - " batch_size)\n", - "plot_the_model(trained_weight, trained_bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Qwh9SVE4-kFp" - }, - "source": [ - "### Task 3: Increase the learning rate\n", - "\n", - "In Task 2, you increased the number of epochs to get the model to converge. Sometimes, you can get the model to converge more quickly by increasing the learning rate. However, setting the learning rate too high often makes it impossible for a model to converge. In Task 3, we've intentionally set the learning rate too high. Run the following code cell and see what happens." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2ZXSePOu-kFq" - }, - "outputs": [], - "source": [ - "learning_rate=100\n", - "epochs=500\n", - "batch_size = batch_size\n", - "\n", - "model = build_model(learning_rate)\n", - "trained_weight, trained_bias, epochs, rmse = train_model(model, feature,\n", - " label, epochs,\n", - " batch_size)\n", - "plot_the_model(trained_weight, trained_bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7YzmOTeJ-kFq" - }, - "source": [ - "The resulting model is terrible; the red line doesn't align with the blue dots. Furthermore, the loss curve oscillates like a [roller coaster](https://www.wikipedia.org/wiki/Roller_coaster). An oscillating loss curve strongly suggests that the learning rate is too high." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KPplLwZN-kFr" - }, - "source": [ - "### Task 4: Find the ideal combination of epochs and learning rate\n", - "\n", - "Assign values to the following two hyperparameters to make training converge as efficiently as possible:\n", - "\n", - "* `learning_rate`\n", - "* `epochs`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9WCwiaJ9-kFt" - }, - "outputs": [], - "source": [ - "learning_rate=0.14\n", - "epochs=70\n", - "batch_size = batch_size\n", - "\n", - "model = build_model(learning_rate)\n", - "trained_weight, trained_bias, epochs, rmse = train_model(model, feature,\n", - " label, epochs,\n", - " batch_size)\n", - "plot_the_model(trained_weight, trained_bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dYCC8X_E-kFv" - }, - "source": [ - "### Task 5: Adjust the batch size\n", - "\n", - "The system recalculates the model's loss value and adjusts the model's weights and bias after each **iteration**. Each iteration is the span in which the system processes one batch. For example, if the **batch size** is 6, then the system recalculates the model's loss value and adjusts the model's weights and bias after processing every 6 examples. \n", - "\n", - "One **epoch** spans sufficient iterations to process every example in the dataset. For example, if the batch size is 12, then each epoch lasts one iteration. However, if the batch size is 6, then each epoch consumes two iterations. 
\n", - "\n", - "It is tempting to simply set the batch size to the number of examples in the dataset (12, in this case). However, the model might actually train faster on smaller batches. Conversely, very small batches might not contain enough information to help the model converge.\n", - "\n", - "Experiment with `batch_size` in the following code cell. What's the smallest integer you can set for `batch_size` and still have the model converge in a hundred epochs?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2yvkcSWW-kFz" - }, - "outputs": [], - "source": [ - "learning_rate=0.05\n", - "epochs=125\n", - "batch_size=1 # Wow, a batch size of 1 works!\n", - "\n", - "model = build_model(learning_rate)\n", - "trained_weight, trained_bias, epochs, rmse = train_model(model, feature,\n", - " label, epochs,\n", - " batch_size)\n", - "plot_the_model(trained_weight, trained_bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gnt4N18w-kF3" - }, - "source": [ - "### Summary of Hyperparameter Tuning\n", - "\n", - "Most machine learning problems require a lot of hyperparameter tuning. Unfortunately, we can't provide concrete tuning rules for every model. Lowering the learning rate can help one model converge efficiently but make another model converge much too slowly. You must experiment to find the best set of hyperparameters for your dataset. That said, here are a few rules of thumb:\n", - "\n", - " * Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.\n", - " * If the training loss does not converge, train for more epochs.\n", - " * If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.\n", - " * If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.\n", - " * Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.\n", - " * Setting the batch size to a *very* small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.\n", - " * For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you'll need to reduce the batch size to enable a batch to fit into memory.\n", - "\n", - "Remember: the ideal combination of hyperparameters is data dependent, so you must always experiment and verify." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jeoXI_nu-kF4" - }, - "source": [ - "## Linear Regression with a Real Dataset\n", - "\n", - "Now we are going to use a real dataset to predict the prices of houses in California." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jqq1gQw5-kF5" - }, - "source": [ - "### The Dataset\n", - " \n", - "The [dataset for this exercise](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) is based on 1990 census data from California. The dataset is old but still provides a great opportunity to learn about machine learning programming." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ulhMpMiF-kF7" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import tensorflow as tf\n", - "from matplotlib import pyplot as plt\n", - "\n", - "# The following lines adjust the granularity of reporting.\n", - "pd.options.display.max_rows = 10\n", - "pd.options.display.float_format = \"{:.1f}\".format" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fcbY8-bu-kF-" - }, - "source": [ - "### The dataset\n", - "\n", - "Datasets are often stored on disk or at a URL in [.csv format](https://wikipedia.org/wiki/Comma-separated_values).\n", - "\n", - "A well-formed .csv file contains column names in the first row, followed by many rows of data. A comma divides each value in each row. For example, here are the first five rows of the .csv file holding the California Housing Dataset:\n", - "\n", - "```\n", - "\"longitude\",\"latitude\",\"housing_median_age\",\"total_rooms\",\"total_bedrooms\",\"population\",\"households\",\"median_income\",\"median_house_value\"\n", - "-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000\n", - "-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000\n", - "-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000\n", - "-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4EHxBmKW-kGB" - }, - "source": [ - "### Load the .csv file into a pandas DataFrame\n", - "\n", - "Like many machine learning programs, we gather the `.csv` file and stores the data in memory as a pandas Dataframe. Pandas is an open source Python library. The primary datatype in pandas is a DataFrame. You can imagine a pandas DataFrame as a spreadsheet in which each row is identified by a number and each column by a name. Pandas is itself built on another open source Python library called NumPy.\n", - "\n", - "The following code cell imports the .csv file into a pandas DataFrame and scales the values in the label (`median_house_value`):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MEA3z6Gd-kGE" - }, - "outputs": [], - "source": [ - "# Import the dataset.\n", - "training_df = pd.read_csv(filepath_or_buffer=\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")\n", - "\n", - "# Scale the label.\n", - "training_df[\"median_house_value\"] /= 1000.0\n", - "\n", - "# Print the first rows of the pandas DataFrame.\n", - "training_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jvm_Gp5G-kGH" - }, - "source": [ - "Scaling `median_house_value` puts the value of each house in units of thousands. Scaling will keep loss values and learning rates in a friendlier range. \n", - "\n", - "Although scaling a label is usually *not* essential, scaling features in a multi-feature model usually *is* essential." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aID_glKx-kGJ" - }, - "source": [ - "### Examine the dataset\n", - "\n", - "A large part of most machine learning projects is getting to know your data. The pandas API provides a `describe` function that outputs the following statistics about every column in the DataFrame:\n", - "\n", - "* `count`, which is the number of rows in that column. 
Ideally, `count` contains the same value for every column.\n", - "\n", - "* `mean` and `std`, which contain the mean and standard deviation of the values in each column.\n", - "\n", - "* `min` and `max`, which contain the lowest and highest values in each column.\n", - "\n", - "* `25%`, `50%`, `75%`, which contain various [quantiles](https://developers.google.com/machine-learning/glossary/#quantile)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "d1DC8o07-kGL" - }, - "outputs": [], - "source": [ - "# Get statistics on the dataset.\n", - "training_df.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KNjfPF72-kGN" - }, - "source": [ - "### Task 1: Identify anomalies in the dataset\n", - "\n", - "Do you see any anomalies (strange values) in the data?\n", - "\n", - "> The maximum value (max) of several columns seems very high compared to the other quantiles. For example, example the total_rooms column. Given the quantile values (25%, 50%, and 75%), you might expect the max value of total_rooms to be approximately 5,000 or possibly 10,000. However, the max value is actually 37,937.\n", - ">\n", - "> When you see anomalies in a column, become more careful about using that column as a feature. That said, anomalies in potential features sometimes mirror anomalies in the label, which could make the column be (or seem to be) a powerful feature." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "PdD4pK-d-kGP" - }, - "source": [ - "### Define functions that build and train a model\n", - "\n", - "The following code defines two functions:\n", - "\n", - " * `build_model(my_learning_rate)`, which builds a randomly-initialized model.\n", - " * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass.\n", - "\n", - "Since you don't need to understand model building code right now, you may optionally explore this code." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nFbnWhN_-kGQ" - }, - "outputs": [], - "source": [ - "def build_model(my_learning_rate):\n", - " \"\"\"Create and compile a simple linear regression model.\"\"\"\n", - " # Most simple tf.keras models are sequential.\n", - " model = tf.keras.models.Sequential()\n", - "\n", - " # Describe the topography of the model.\n", - " # The topography of a simple linear regression model\n", - " # is a single node in a single layer.\n", - " model.add(tf.keras.layers.Dense(units=1,\n", - " input_shape=(1,)))\n", - "\n", - " # Compile the model topography into code that TensorFlow can efficiently\n", - " # execute. 
Configure training to minimize the model's mean squared error.\n", - " model.compile(optimizer=tf.keras.optimizers.RMSprop(\n", - " learning_rate=my_learning_rate),\n", - " loss=\"mean_squared_error\",\n", - " metrics=[tf.keras.metrics.RootMeanSquaredError()])\n", - "\n", - " return model\n", - "\n", - "\n", - "def train_model(model, df, feature, label, epochs, batch_size):\n", - " \"\"\"Train the model by feeding it data.\"\"\"\n", - "\n", - " # Feed the model the feature and the label.\n", - " # The model will train for the specified number of epochs.\n", - " history = model.fit(x=df[feature],\n", - " y=df[label],\n", - " batch_size=batch_size,\n", - " epochs=epochs)\n", - "\n", - " # Gather the trained model's weight and bias.\n", - " trained_weight = model.get_weights()[0]\n", - " trained_bias = model.get_weights()[1]\n", - "\n", - " # The list of epochs is stored separately from the rest of history.\n", - " epochs = history.epoch\n", - "\n", - " # Isolate the error for each epoch.\n", - " hist = pd.DataFrame(history.history)\n", - "\n", - " # To track the progression of training, we're going to take a snapshot\n", - " # of the model's root mean squared error at each epoch.\n", - " rmse = hist[\"root_mean_squared_error\"]\n", - "\n", - " return trained_weight, trained_bias, epochs, rmse" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5jCKbdiI-kGS" - }, - "source": [ - "### Define plotting functions\n", - "\n", - "We're using a popular Python library called [Matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) to create the following two plots:\n", - "\n", - "* a plot of the feature values vs. the label values, and a line showing the output of the trained model.\n", - "* a [loss curve](https://developers.google.com/machine-learning/glossary/#loss_curve)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "BIIkR4pT-kGU" - }, - "outputs": [], - "source": [ - "def plot_the_model(trained_weight, trained_bias, feature, label):\n", - " \"\"\"Plot the trained model against 200 random training examples.\"\"\"\n", - "\n", - " # Label the axes.\n", - " plt.xlabel(feature)\n", - " plt.ylabel(label)\n", - "\n", - " # Create a scatter plot from 200 random points of the dataset.\n", - " random_examples = training_df.sample(n=200)\n", - " plt.scatter(random_examples[feature], random_examples[label])\n", - "\n", - " # Create a red line representing the model. The red line starts\n", - " # at coordinates (x0, y0) and ends at coordinates (x1, y1).\n", - " x0 = 0\n", - " y0 = trained_bias\n", - " x1 = random_examples[feature].max()\n", - " y1 = trained_bias + (trained_weight * x1)\n", - " plt.plot([x0, x1], [y0, y1], c='r')\n", - "\n", - " # Render the scatter plot and the red line.\n", - " plt.show()\n", - "\n", - "\n", - "def plot_the_loss_curve(epochs, rmse):\n", - " \"\"\"Plot a curve of loss vs. 
epoch.\"\"\"\n", - "\n", - " plt.figure()\n", - " plt.xlabel(\"Epoch\")\n", - " plt.ylabel(\"Root Mean Squared Error\")\n", - "\n", - " plt.plot(epochs, rmse, label=\"Loss\")\n", - " plt.legend()\n", - " plt.ylim([rmse.min() * 0.97, rmse.max()])\n", - " plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Wwutc1zc-kGV" - }, - "source": [ - "### Call the model functions\n", - "\n", - "An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). For example, real-life home-value prediction models typically rely on hundreds of features and synthetic features. However, this model relies on only one feature. For now, you'll arbitrarily use `total_rooms` as that feature.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "BQH7x7AZ-kGX" - }, - "outputs": [], - "source": [ - "# The following variables are the hyperparameters.\n", - "learning_rate = 0.01\n", - "epochs = 30\n", - "batch_size = 30\n", - "\n", - "# Specify the feature and the label.\n", - "feature = \"total_rooms\" # the total number of rooms on a specific city block.\n", - "label=\"median_house_value\" # the median value of a house on a specific city block.\n", - "# That is, you're going to create a model that predicts house value based\n", - "# solely on total_rooms.\n", - "\n", - "# Discard any pre-existing version of the model.\n", - "model = None\n", - "\n", - "# Invoke the functions.\n", - "model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(model, training_df,\n", - " feature, label,\n", - " epochs, batch_size)\n", - "\n", - "print(\"\\nThe learned weight for your model is %.4f\" % weight)\n", - "print(\"The learned bias for your model is %.4f\\n\" % bias )\n", - "\n", - "plot_the_model(weight, bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6ENg3FRx-kGZ" - }, - "source": [ - "A certain amount of randomness plays into training a model. Consequently, you'll get different results each time you train the model. That said, given the dataset and the hyperparameters, the trained model will generally do a poor job describing the feature's relation to the label." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "62fWZ_2C-kGb" - }, - "source": [ - "### Use the model to make predictions\n", - "\n", - "You can use the trained model to make predictions. In practice, [you should make predictions on examples that are not used in training](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data). 
However, for this exercise, you'll just work with a subset of the same training dataset.\n", - "\n", - "First, run the following code to define the house prediction function:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MkFW-qrL-kGd" - }, - "outputs": [], - "source": [ - "def predict_house_values(n, feature, label):\n", - " \"\"\"Predict house values based on a feature.\"\"\"\n", - "\n", - " batch = training_df[feature][10000:10000 + n]\n", - " predicted_values = model.predict_on_batch(x=batch)\n", - "\n", - " print(\"feature label predicted\")\n", - " print(\" value value value\")\n", - " print(\" in thousand$ in thousand$\")\n", - " print(\"--------------------------------------\")\n", - " for i in range(n):\n", - " print (\"%5.0f %6.0f %15.0f\" % (training_df[feature][10000 + i],\n", - " training_df[label][10000 + i],\n", - " predicted_values[i][0] ))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T-2jmxlF-kGg" - }, - "source": [ - "Now, invoke the house prediction function on 10 examples:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VV1p4Grf-kGl" - }, - "outputs": [], - "source": [ - "predict_house_values(10, feature, label)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LNEb5E2z-kGo" - }, - "source": [ - "### Task 2: Judge the predictive power of the model\n", - "\n", - "Look at the preceding table. How close is the predicted value to the label value? In other words, does your model accurately predict house values?\n", - "\n", - "> Most of the predicted values differ significantly from the label value, so the trained model probably doesn't have much predictive power. However, the first 10 examples might not be representative of the rest of the examples. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bUiF3g9c-kGp" - }, - "source": [ - "### Task 3: Try a different feature\n", - "\n", - "The `total_rooms` feature had only a little predictive power. Would a different feature have greater predictive power? Try using `population` as the feature instead of `total_rooms`.\n", - "\n", - "Note: When you change features, you might also need to change the hyperparameters." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "P_SKVoyo-kGq" - }, - "outputs": [], - "source": [ - "# Pick a feature other than \"total_rooms\"\n", - "feature = \"population\"\n", - "\n", - "# Possibly, experiment with the hyperparameters.\n", - "learning_rate = 0.05\n", - "epochs = 18\n", - "batch_size = 3\n", - "\n", - "# Don't change anything below.\n", - "model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(model, training_df,\n", - " feature, label,\n", - " epochs, batch_size)\n", - "\n", - "plot_the_model(weight, bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)\n", - "\n", - "predict_house_values(10, feature, label)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yJa8_56u-kGt" - }, - "source": [ - "Did `population` produce better predictions than `total_rooms`?\n", - "\n", - "> Training is not entirely deterministic, but population typically converges at a slightly higher RMSE than total_rooms. So, population appears to be about the same or slightly worse at making predictions than total_rooms." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vKxPOt0G-kGv" - }, - "source": [ - "### Task 4: Define a synthetic feature\n", - "\n", - "You have determined that `total_rooms` and `population` were not useful features. That is, neither the total number of rooms in a neighborhood nor the neighborhood's population successfully predicted the median house price of that neighborhood. Perhaps though, the *ratio* of `total_rooms` to `population` might have some predictive power. That is, perhaps block density relates to median house value.\n", - "\n", - "To explore this hypothesis, do the following:\n", - "\n", - "1. Create a [synthetic feature](https://developers.google.com/machine-learning/glossary/#synthetic_feature) that's a ratio of `total_rooms` to `population`.\n", - "2. Tune the three hyperparameters.\n", - "3. Determine whether this synthetic feature produces\n", - "   a lower loss value than any of the single features you\n", - "   tried earlier." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hEsFU0kE-kGx" - }, - "outputs": [], - "source": [ - "# Define a synthetic feature\n", - "training_df[\"rooms_per_person\"] = training_df[\"total_rooms\"] / training_df[\"population\"]\n", - "feature = \"rooms_per_person\"\n", - "\n", - "# Tune the hyperparameters.\n", - "learning_rate = 0.06\n", - "epochs = 24\n", - "batch_size = 30\n", - "\n", - "# Don't change anything below this line.\n", - "model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(model, training_df,\n", - "                                          feature, label,\n", - "                                          epochs, batch_size)\n", - "\n", - "plot_the_model(weight, bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)\n", - "predict_house_values(15, feature, label)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ou4yVCYv-kG0" - }, - "source": [ - "Based on the loss values, this synthetic feature produces a better model than the individual features you tried in Task 2 and Task 3. However, the model still isn't creating great predictions." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "E_W7Xt2--kG1" - }, - "source": [ - "### Task 5: Find feature(s) whose raw values correlate with the label\n", - "\n", - "So far, we've relied on trial-and-error to identify possible features for the model. Let's rely on statistics instead.\n", - "\n", - "A **correlation matrix** indicates how each attribute's raw values relate to the other attributes' raw values. Correlation values have the following meanings:\n", - "\n", - "  * `1.0`: perfect positive correlation; that is, when one attribute rises, the other attribute rises.\n", - "  * `-1.0`: perfect negative correlation; that is, when one attribute rises, the other attribute falls.\n", - "  * `0.0`: no correlation; the two columns [are not linearly related](https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg).\n", - "\n", - "In general, the higher the absolute value of a correlation value, the greater its predictive power. 
For example, a correlation value of -0.8 implies far more predictive power than a correlation of -0.2.\n", - "\n", - "The following code cell generates the correlation matrix for attributes of the California Housing Dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "A409yR4g-kG3" - }, - "outputs": [], - "source": [ - "# Generate a correlation matrix.\n", - "training_df.corr()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2Q8srhLJ-kG4" - }, - "source": [ - "The correlation matrix shows nine potential features (including a synthetic\n", - "feature) and one label (`median_house_value`). A strong negative correlation or strong positive correlation with the label suggests a potentially good feature. \n", - "\n", - "**Your Task:** Determine which of the nine potential features appears to be the best candidate.\n", - "\n", - "> The median_income correlates 0.7 with the label (median_house_value), so median_income might be a good feature. The other eight potential features all have a correlation relatively close to 0." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "izL-jUY5-kG5" - }, - "outputs": [], - "source": [ - "feature = \"median_income\"\n", - "\n", - "# Possibly, experiment with the hyperparameters.\n", - "learning_rate = 0.01\n", - "epochs = 10\n", - "batch_size = 3\n", - "\n", - "# Don't change anything below.\n", - "model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(model, training_df,\n", - "                                          feature, label,\n", - "                                          epochs, batch_size)\n", - "\n", - "plot_the_model(weight, bias, feature, label)\n", - "plot_the_loss_curve(epochs, rmse)\n", - "\n", - "predict_house_values(10, feature, label)" - ] - }, { "cell_type": "markdown", "metadata": { @@ -4801,7 +3780,7 @@ "metadata": { "colab": { "provenance": [], - "include_colab_link": true + "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)",