From d9c35deec76c7f70b609cb2b93ec63daa2776aad Mon Sep 17 00:00:00 2001 From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com> Date: Sat, 3 Sep 2022 00:20:17 +0530 Subject: [PATCH 1/4] Added Kaggle example - Beginner classification --- ...ggle_beginner_example_classification.ipynb | 2149 +++++++++++++++++ 1 file changed, 2149 insertions(+) create mode 100644 documentation/tutorials/kaggle_beginner_example_classification.ipynb diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb new file mode 100644 index 00000000..a8d9cdcc --- /dev/null +++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb @@ -0,0 +1,2149 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "MDBzBKC_pnXl" + }, + "source": [ + "# Titanic - TFDF\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3u9YXGqAZWwj" + }, + "source": [ + "Kaggle Dataset - [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MA9_xqWRpqZU" + }, + "source": [ + "## Introduction\n", + "\n", + "[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)\n", + "is a collection of state-of-the-art algorithms of Decision Forest models\n", + "that are compatible with [Keras APIs](https://www.tensorflow.org/api_docs/python/tf/keras)\n", + ".\n", + "The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),\n", + "[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),\n", + "and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),\n", + "and can be used for regression, classification, and ranking tasks.\n", + "For a beginner's guide to TensorFlow Decision Forests,\n", + "please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-eDuuv2RroES" + }, + "source": [ + "### Random Forest\n", + "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform (or provide a strong baseline) before you begin experimenting with neural networks.\n", + "\n", + "In this example we will use TensorFlow to train each of these on a dataset you load from a CSV file. This is a common pattern in practice. 
Roughly, your code will look as follows:\n", + "\n", + "```\n", + "import tensorflow_decision_forests as tfdf\n", + "import pandas as pd\n", + " \n", + "dataset = pd.read_csv(\"project/dataset.csv\")\n", + "tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label=\"my_label\", task=tfdf.keras.Task.CLASSIFICATION)\n", + "\n", + "model = tfdf.keras.RandomForestModel()\n", + "model.fit(tf_dataset)\n", + " \n", + "model.summary()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mrTx_bPrtd17" + }, + "source": [ + "### Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dl6_Mdy7sUC7" + }, + "source": [ + "#### Install TensorFlow Decision Forests\n", + "\n", + "There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.\n", + "\n", + "In this example we'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library used to train large models. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lSDtXxIDseKq" + }, + "outputs": [], + "source": [ + "!pip install tensorflow_decision_forests --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zr0eiHcyvG1m" + }, + "source": [ + "#### Import the library" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IA1LNshiumEA" + }, + "outputs": [], + "source": [ + "# Scientific computing # \n", + "import numpy as np # Numpy Documentation - https://numpy.org/doc/stable/ \n", + "\n", + "# - Data processing - #\n", + "import pandas as pd # Pandas Documentation - https://pandas.pydata.org/docs/\n", + "\n", + "# -- Hide Warnings -- #\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "# ---- Tensorflow ---- #\n", + "import tensorflow as tf\n", + "import tensorflow_decision_forests as tfdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CjdtV-KWvcWA", + "outputId": "7087788a-b3f9-416f-8f35-33a363c6a81e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TensorFlow v2.9.1\n", + "TensorFlow Decision Forests v0.2.7\n" + ] + } + ], + "source": [ + "print(\"TensorFlow v\" + tf.__version__)\n", + "print(\"TensorFlow Decision Forests v\" + tfdf.__version__)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vzX1j4YewLSr" + }, + "source": [ + "### Download the Titanic dataset\n", + "[Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_vmqf8o60D37" + }, + "source": [ + "To run this notebook, you need to have a Kaggle account.\n", + "\n", + "If you do not have an account, you can create one here: [Kaggle Register](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F) \n", + "\n", + "In order to get a token to use in the following cell, check out the [Authentication Section](https://www.kaggle.com/docs/api#authentication) of Kaggle API documentation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ekAcuqTFvt3p" + }, + "outputs": [], + "source": [ + "#@title Enter your Kaggle token in order to fetch the dataset\n", + "\n", + "username = '' #@param {type:\"string\"}\n", + "key = '' #@param {type: \"string\"}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "JKyVN-lC0HOC" + }, + "outputs": [], + "source": [ + "#@title Configure Kaggle\n", + "try:\n", + " from google.colab import files, drive\n", + "\n", + " # Install and Configure Kaggle\n", + " import json\n", + "\n", + " token = {\n", + " \"username\":username,\n", + " \"key\":key\n", + " }\n", + "\n", + " # Installing kaggle\n", + " !pip install kaggle &> /dev/null\n", + "\n", + " # Creating .kaggle if necessary\n", + " !if [ -d .kaggle ]; then echo \".kaggle exists\"; else echo \".kaggle does not exist ... Creating it\"; mkdir .kaggle; if [ -d .kaggle ]; then echo \"Successfully created\"; else echo \"Error creating .kaggle\"; fi; fi\n", + "\n", + " with open('/content/.kaggle/kaggle.json', 'w') as file:\n", + " json.dump(token, file)\n", + "\n", + " # Creating .kaggle if necessary\n", + " !if [ -d ~/.kaggle ]; then echo \" ~/.kaggle exists\"; else echo \" ~/.kaggle does not exist ... 
Creating it\"; mkdir ~/.kaggle; if [ -d ~/.kaggle ]; then echo \"Successfully created\"; else echo \"Error creating ~/.kaggle\"; fi; fi\n", + " !cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json\n", + "\n", + " # kaggle configuration\n", + " !kaggle config set -n path -v{/content}\n", + "\n", + " # Changing mode\n", + " !chmod 600 /root/.kaggle/kaggle.json\n", + "except Exception:\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "J501lMS40lUR" + }, + "outputs": [], + "source": [ + "#@title Download Dataset\n", + "import os\n", + "\n", + "DOWNLOAD_LOCATION = \"/root/Downloads/\"\n", + "\n", + "if os.path.exists(DOWNLOAD_LOCATION):\n", + " if os.path.isdir(DOWNLOAD_LOCATION):\n", + " print(\"{} exists and is a directory\".format(DOWNLOAD_LOCATION))\n", + " else:\n", + " print(\"{} exists but is not a directory!!!\".format(DOWNLOAD_LOCATION))\n", + "else:\n", + " print(\"{} does not exist ... Creating it\".format(DOWNLOAD_LOCATION))\n", + " os.makedirs(DOWNLOAD_LOCATION)\n", + "\n", + "# Downloading\n", + "!kaggle competitions download -c titanic -p {DOWNLOAD_LOCATION}\n", + "\n", + "# Extracting archives\n", + "!cd {DOWNLOAD_LOCATION}; unzip -qq \\*.zip; rm -f *.zip" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PCFSJUjl2fuT" + }, + "source": [ + "## Data Loading\n", + "Note: Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (>1M examples), using the TensorFlow Dataset to read the files may be better suited." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "18QhsN2L16wH", + "outputId": "664d0036-8db4-4345-f961-a39554a0d50e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Full train dataset shape is (891, 12)\n" + ] + } + ], + "source": [ + "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n", + "train_full_data = pd.read_csv(train_file_path)\n", + "print(\"Full train dataset shape is {}\".format(train_full_data.shape))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8KqgoL95mhw" + }, + "source": [ + "The data is composed of 12 columns and 891 entries. We can see all 12 dimensions of our dataset by printing out the first 3 entries using the following code: \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + }, + "id": "v4rywCtW2pfK", + "outputId": "b39ba084-53c8-49c6-ea35-1fadc9ca4b39" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "2 Heikkinen, Miss. Laina female 26.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C \n", + "2 0 STON/O2. 3101282 7.9250 NaN S " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_full_data.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYqyU_6MOrH8" + }, + "source": [ + "* 8 feature columns named `Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked`.\n", + "* Label column named `Survived`.\n", + "* We will drop the following unnecessary columns : `PassengerId`, `Name` and `Ticket`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "shj-eSteOqPE" + }, + "outputs": [], + "source": [ + "train_full_data = train_full_data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GxrmygY2QZ-I" + }, + "source": [ + "Let's print the updated table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "id": "SYvo-ty6QiHN", + "outputId": "0931e98b-1634-4747-ea48-ae38d4ad9f01" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked\n", + "0 0 3 male 22.0 1 0 7.2500 NaN S\n", + "1 1 1 female 38.0 1 0 71.2833 C85 C\n", + "2 1 3 female 26.0 0 0 7.9250 NaN S" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_full_data.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qs070SbkMJix" + }, + "source": [ + "For a description of each column, refer to the [Kaggle data page](https://www.kaggle.com/competitions/titanic/data)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2EpAa_q55Ke8" + }, + "source": [ + "## Prepare the dataset\n", + "This dataset contains a mix of numeric and categorical features, some with missing values. TF-DF supports all of these natively, and no preprocessing is required. This is one advantage of tree-based models, and it makes them a great entry point to TensorFlow and ML." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uQMX8Md3ISq0" + }, + "source": [ + "List the distinct values stored in the `Survived` column. `Survived` takes one of two values, 0 or 1." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YmrDp4SL7hTw", + "outputId": "d0d3a509-3144-42d6-beb8-18eced608b78" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Label classes: [0, 1]\n" + ] + } + ], + "source": [ + "label=\"Survived\"\n", + "classes = train_full_data[label].unique().tolist()\n", + "print(f\"Label classes: {classes}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0NGJhK0R58Oa" + }, + "source": [ + "Let's split the dataset into training and validation sets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CW3ofmmI5xIr", + "outputId": "8919362e-6d2c-4323-92a5-780bd3e9af8a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "611 examples in training, 280 examples in validation.\n" + ] + } + ], + "source": [ + "def split_dataset(dataset, test_ratio=0.30):\n", + " test_indices = np.random.rand(len(dataset)) < test_ratio\n", + " return dataset[~test_indices], dataset[test_indices]\n", + "\n", + "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n", + "print(\"{} examples in training, {} examples in validation.\".format(\n", + " len(train_ds_pd), len(val_ds_pd)))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0ZrYmer6tMp" + }, + "source": [ + "There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). 
A single-line helper function does this for you: \n", + "\n", + "```\n", + "tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.CLASSIFICATION)\n", + "```\n", + "\n", + "`tf.data` is a high-[performance](https://www.tensorflow.org/guide/data_performance) data loading library that is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory, originally designed for the floating-point operations required to render graphics. GPUs are well suited to training deep learning models because they can process many computations in parallel. `tf.data` is not necessary for tree-based models until you begin distributed training.\n", + "\n", + "Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is a common mistake among new practitioners. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.\n", + "\n", + "Note that `tf.data` is a bit tricky to use and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DyAHpZ0R6B5R" + }, + "outputs": [], + "source": [ + "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " train_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)\n", + "\n", + "val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " val_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cwdbYZeTJP89" + }, + "source": [ + "## Exploratory Data Analysis (EDA)\n", + "Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n", + "\n", + "Several thorough EDA notebooks for this dataset are already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3m46QYDz8IB4" + }, + "source": [ + "## Create a Random Forest " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "11yxinBK78qU" + }, + "outputs": [], + "source": [ + "model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n", + "model.compile(metrics=[\"accuracy\"]) # Optional, you can use this to include a list of eval metrics" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fxBlIUPD8SKU" + }, + "source": [ + "## Train your model\n", + "\n", + "This is a one-liner.\n", + "\n", + "Note: You can safely ignore the warning about Autograph." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_tQfGooA8OI2" + }, + "outputs": [], + "source": [ + "model.fit(x=train_ds)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YoqPROtT9A33" + }, + "source": [ + "## Visualize your model\n", + "One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 404 + }, + "id": "Cwv7-NXc8WUq", + "outputId": "eebf77c4-37ea-4651-d693-7eeaa081cdf9" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RtGEzEGU9FsI" + }, + "source": [ + "## Evaluate the model on OOB data and the test dataset\n", + "\n", + "Let's plot accuracy on OOB evaluation dataset as a function of the number of trees in the forest. One of the nice features about this particular hyperparameter is that larger values are usually better, and come with little risk aside from slowing down training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 279 + }, + "id": "4nOZy6lX9CwJ", + "outputId": "220a98fd-8281-4924-e9e0-5ece77289184" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "logs = model.make_inspector().training_logs()\n", + "plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])\n", + "plt.xlabel(\"Number of trees\")\n", + "plt.ylabel(\"Accuracy (out-of-bag)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KWlw6i0U9UcE" + }, + "source": [ + "You can also see some general stats on the OOB dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_nEjaF9Y9NjF", + "outputId": "e04c43f8-a735-44cb-d078-b808664a0145" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Evaluation(num_examples=611, accuracy=0.806873977086743, loss=0.7393123309627944, rmse=None, ndcg=None, aucs=None, auuc=None, qini=None)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "inspector = model.make_inspector()\n", + "inspector.evaluation()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W8OwVx569bbU" + }, + "source": [ + "Now, let's run an evaluation using the test data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KyH_XC1d9X9x", + "outputId": "64568368-962e-4cba-ea8e-ed169fd328a3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1/1 [==============================] - 1s 958ms/step - loss: 0.0000e+00 - accuracy: 0.8679\n", + "loss: 0.0000\n", + "accuracy: 0.8679\n" + ] + } + ], + "source": [ + "evaluation = model.evaluate(x=val_ds,return_dict=True)\n", + "\n", + "for name, value in evaluation.items():\n", + " print(f\"{name}: {value:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TK0l4Qgxbwcq" + }, + "source": [ + "# Test Set Prediction\n", + "Now we will generate predictions for `test.csv`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BbC05cBKcTQ5" + }, + "outputs": [], + "source": [ + "test_file_path = os.path.join(DOWNLOAD_LOCATION, \"test.csv\")\n", + "test_data = pd.read_csv(test_file_path)\n", + "ids = test_data.pop('PassengerId')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OUDeYu2zcYrk" + }, + "outputs": [], + "source": [ + "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " test_data, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4l13fn1seaHj" + }, + "source": [ + "The model outputs a float value, while the submission expects either 0 (did not survive) or 1 (survived), so let's threshold the predicted value at 0.5 to convert it to a binary value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8tY1evHAcoTH" + }, + "outputs": [], + "source": [ + "preds = model.predict(test_ds)\n", + "preds = preds >= 0.5\n", + "preds = preds.astype('int')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "Jxtj1lp6csVQ", + "outputId": 
"4b6929fc-b490-4806-fff7-e71576f0261c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived\n", + "0 892 0\n", + "1 893 0\n", + "2 894 0\n", + "3 895 0\n", + "4 896 0" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output = pd.DataFrame({'PassengerId': ids,\n", + " 'Survived': preds.squeeze()})\n", + "\n", + "output.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yZOhWNRcpM-G" + }, + "source": [ + "You can download the predicted output as a CSV file and do submission on the [Competition page](https://www.kaggle.com/competitions/titanic/submit) on Kaggle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j8FquAuufmtS" + }, + "outputs": [], + "source": [ + "output_filename = \"test_prediction_output.csv\"\n", + "output.to_csv(output_filename, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 17 + }, + "id": "2ary3LNoffRA", + "outputId": "e20e4278-86a5-4282-ba7b-81d35fa8b5a5" + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "\n", + " async function download(id, filename, size) {\n", + " if (!google.colab.kernel.accessAllowed) {\n", + " return;\n", + " }\n", + " const div = document.createElement('div');\n", + " const label = document.createElement('label');\n", + " label.textContent = `Downloading \"${filename}\": `;\n", + " div.appendChild(label);\n", + " const progress = document.createElement('progress');\n", + " progress.max = size;\n", + " div.appendChild(progress);\n", + " document.body.appendChild(div);\n", + "\n", + " const buffers = [];\n", + " let downloaded = 0;\n", + "\n", + " const channel = await google.colab.kernel.comms.open(id);\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + "\n", + " for await (const message of channel.messages) {\n", + " // Send a message to notify the kernel that 
we're ready.\n", + " channel.send({})\n", + " if (message.buffers) {\n", + " for (const buffer of message.buffers) {\n", + " buffers.push(buffer);\n", + " downloaded += buffer.byteLength;\n", + " progress.value = downloaded;\n", + " }\n", + " }\n", + " }\n", + " const blob = new Blob(buffers, {type: 'application/binary'});\n", + " const a = document.createElement('a');\n", + " a.href = window.URL.createObjectURL(blob);\n", + " a.download = filename;\n", + " div.appendChild(a);\n", + " a.click();\n", + " div.remove();\n", + " }\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from google.colab import files\n", + "files.download('test_prediction_output.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wFMtKF5irxhs" + }, + "source": [ + "# Try it out yourself\n", + "We've provided a bunch of code which you can use to explore the dataset, in case this is helpful to you in your future work. The code you need to write for this exercise is only a couple lines. \n", + "\n", + "Note: For this section the `test_ratio` is decreased from 0.3 to 0.1. 
Therefore, your results may differ from those above.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CUirWAzfGkkC" + }, + "source": [ + "## Explore the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SrBp6T9xsqb5" + }, + "outputs": [], + "source": [ + "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n", + "train_full_data = pd.read_csv(train_file_path)\n", + "print(\"Full train dataset shape is {}\".format(train_full_data.shape))\n", + "\n", + "label=\"Survived\"\n", + "classes = train_full_data[label].unique().tolist()\n", + "print(f\"Label classes: {classes}\")\n", + "\n", + "train_full_data[label] = train_full_data[label].map(classes.index)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pg7v6HFZNcNI" + }, + "source": [ + "### Split the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UNGS3XDb3-4K" + }, + "outputs": [], + "source": [ + " def split_dataset(dataset, test_ratio=0.10):\n", + " # YOUR CODE HERE\n", + "\n", + " \n", + " # Add code to split the dataset\n", + " return # your split data set\n", + "\n", + "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n", + "print(\"{} examples in training, {} examples in validation.\".format(\n", + " len(train_ds_pd), len(val_ds_pd)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "zQKZitDR4t0o" + }, + "outputs": [], + "source": [ + "#@title Solution\n", + "'''def split_dataset(dataset, test_ratio=0.10):\n", + " test_indices = np.random.rand(len(dataset)) < test_ratio\n", + " return dataset[~test_indices], dataset[test_indices]'''" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NeZAkVroGsN3" + }, + "source": [ + "## Create tf.data.Datasets from the Pandas DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZQVp9z1f4dou" + }, + "outputs": [], + 
"source": [ + "# YOUR CODE HERE\n", + "\n", + "\n", + "# Add code to create a tf.data.Dataset for train and test from the DataFrames\n", + "# Example...\n", + "# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...\n", + "# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "TvEAL3Sh5M6i" + }, + "outputs": [], + "source": [ + "#@title Solution\n", + "#train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + "# train_ds_pd, \n", + "# label = label, \n", + "# task = tfdf.keras.Task.CLASSIFICATION)\n", + "\n", + "#val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + "# val_ds_pd, \n", + "# label = label, \n", + "# task = tfdf.keras.Task.CLASSIFICATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_N7MeCoBG25D" + }, + "source": [ + "## Create your model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_ptiMVC15yhl" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n", + "\n", + "\n", + "# Add code to create a random forest\n", + "# Example ...\n", + "# mymodel = tfdf.keras. ...\n", + "# mymodel.compile(metrics=[\"accuracy\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "CdVGq1Dq5zq3" + }, + "outputs": [], + "source": [ + "#@title Solution\n", + "#mymodel = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n", + "#mymodel.compile(metrics=[\"accuracy\"]) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pricbSWnHE4w" + }, + "source": [ + "## Train your Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9bbEyEUfGLda" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n", + "\n", + "\n", + "# Add code to train your model\n", + "# Example ...\n", + "# mymodel.fit(..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "3Y3crGNkGIU-" + }, + "outputs": [], + "source": [ + "#@title Solution\n", + "#mymodel.fit(x=train_ds)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D3xi8eNnGeDA" + }, + "source": [ + "## Evaluate your model\n", + "Uncomment these cells after completing the code above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bQlSIbiMHVV4" + }, + "outputs": [], + "source": [ + "#mymodel.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CKa4fo5zHYRE" + }, + "outputs": [], + "source": [ + "#mymodel.evaluate(test_ds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e66pPL8OHk_4" + }, + "outputs": [], + "source": [ + "#inspector = mymodel.make_inspector()\n", + "#inspector.evaluation()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2344zfqIHcfG" + }, + "outputs": [], + "source": [ + "#evaluation = mymodel.evaluate(x=test_ds,return_dict=True)\n", + "\n", + "#for name, value in evaluation.items():\n", + "# print(f\"{name}: {value:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lh3gxL9OHKbD" + }, + "source": [ + "# References\n", + "* Dive deep into \n", + " * [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)\n", + " * [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)\n", + " * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)\n", + " * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)\n", + " * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests).\n", + "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant 
Banrrjee.\n", + "* TensorFlow Decision Forests tutorials which are a set of 3 very interesting tutorials.\n", + " * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)\n", + " * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)\n", + " * [Advanced Tutorial](https://www.tensorflow.org/decision_forests/tutorials/advanced_colab)\n", + "* The [TensorFlow Forum](https://discuss.tensorflow.org/) where one can get in touch with the TensorFlow community. Check it out if you haven't yet." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From b10b8f0c7904073db00f5aa2f1b5289ca213bd3d Mon Sep 17 00:00:00 2001 From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com> Date: Sat, 3 Sep 2022 05:21:57 +0530 Subject: [PATCH 2/4] Minor change --- ...ggle_beginner_example_classification.ipynb | 54 ++----------------- 1 file changed, 5 insertions(+), 49 deletions(-) diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb index a8d9cdcc..30a01174 100644 --- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb +++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb @@ -874,7 +874,7 @@ "## Exploratory Data Analysis (EDA)\n", "Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. 
It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n", "\n", - "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee." + "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." ] }, { @@ -1411,7 +1411,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "", "text/plain": [ "
" ] @@ -1764,49 +1764,7 @@ "outputs": [ { "data": { - "application/javascript": [ - "\n", - " async function download(id, filename, size) {\n", - " if (!google.colab.kernel.accessAllowed) {\n", - " return;\n", - " }\n", - " const div = document.createElement('div');\n", - " const label = document.createElement('label');\n", - " label.textContent = `Downloading \"${filename}\": `;\n", - " div.appendChild(label);\n", - " const progress = document.createElement('progress');\n", - " progress.max = size;\n", - " div.appendChild(progress);\n", - " document.body.appendChild(div);\n", - "\n", - " const buffers = [];\n", - " let downloaded = 0;\n", - "\n", - " const channel = await google.colab.kernel.comms.open(id);\n", - " // Send a message to notify the kernel that we're ready.\n", - " channel.send({})\n", - "\n", - " for await (const message of channel.messages) {\n", - " // Send a message to notify the kernel that we're ready.\n", - " channel.send({})\n", - " if (message.buffers) {\n", - " for (const buffer of message.buffers) {\n", - " buffers.push(buffer);\n", - " downloaded += buffer.byteLength;\n", - " progress.value = downloaded;\n", - " }\n", - " }\n", - " }\n", - " const blob = new Blob(buffers, {type: 'application/binary'});\n", - " const a = document.createElement('a');\n", - " a.href = window.URL.createObjectURL(blob);\n", - " a.download = filename;\n", - " div.appendChild(a);\n", - " a.click();\n", - " div.remove();\n", - " }\n", - " " - ], + "application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await 
google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ", "text/plain": [ "" ] @@ -1816,9 +1774,7 @@ }, { "data": { - "application/javascript": [ - "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)" - ], + "application/javascript": "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)", "text/plain": [ "" ] @@ -2112,7 +2068,7 @@ " * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)\n", " * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)\n", " * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests).\n", - "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee.\n", + "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee.\n", "* TensorFlow Decision Forests tutorials which are a set of 3 very interesting tutorials.\n", " * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)\n", " * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)\n", From ec26f124a782f1968b1ff9a6ad9e30ae67ea6d35 Mon Sep 17 00:00:00 2001 From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com> Date: Fri, 16 Sep 2022 01:19:03 +0530 Subject: [PATCH 3/4] Update 
kaggle_beginner_example_classification.ipynb --- ...ggle_beginner_example_classification.ipynb | 519 ++++-------------- 1 file changed, 109 insertions(+), 410 deletions(-) diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb index 30a01174..bf560c8e 100644 --- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb +++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb @@ -1,12 +1,38 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Copyright 2020 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, { "cell_type": "markdown", "metadata": { "id": "MDBzBKC_pnXl" }, "source": [ - "# Titanic - TFDF\n", + "# Structured Data Classification using TFDF\n", "\n", "\n", "
\n", @@ -24,15 +50,6 @@ "
" ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "3u9YXGqAZWwj" - }, - "source": [ - "Kaggle Dataset - [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview)" - ] - }, { "cell_type": "markdown", "metadata": { @@ -49,18 +66,8 @@ "[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),\n", "and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),\n", "and can be used for regression, classification, and ranking tasks.\n", - "For a beginner's guide to TensorFlow Decision Forests,\n", - "please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-eDuuv2RroES" - }, - "source": [ - "### Random Forest\n", - "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform (or provide a strong baseline) before you begin experimenting with neural networks.\n", + "For an introduction to [TFDF](https://www.tensorflow.org/decision_forests) without Kaggle, please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).\n", + "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform neural networks.\n", "\n", "In this example we will use TensorFlow to train each of these on a dataset you load from a CSV file. This is a common pattern in practice. 
Roughly, your code will look as follows:\n", "\n", @@ -93,11 +100,7 @@ "id": "dl6_Mdy7sUC7" }, "source": [ - "#### Install TensorFlow Decision Forests\n", - "\n", - "There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.\n", - "\n", - "In this example we'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library used to train large models. " + "#### Install TensorFlow Decision Forests" ] }, { @@ -134,10 +137,6 @@ "# - Data processing - #\n", "import pandas as pd # Pandas Documentation - https://pandas.pydata.org/docs/\n", "\n", - "# -- Hide Warnings -- #\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", "# ---- Tensorflow ---- #\n", "import tensorflow as tf\n", "import tensorflow_decision_forests as tfdf" @@ -147,11 +146,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "CjdtV-KWvcWA", - "outputId": "7087788a-b3f9-416f-8f35-33a363c6a81e" + "id": "CjdtV-KWvcWA" }, "outputs": [ { @@ -175,7 +170,7 @@ }, "source": [ "### Download the Titanic dataset\n", - "[Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic." + "The [Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic." 
] }, { @@ -285,19 +280,15 @@ "id": "PCFSJUjl2fuT" }, "source": [ - "## Data Loading\n", - "Note: Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (>1M examples), using the TensorFlow Dataset to read the files may be better suited." + "## Load the dataset\n", + "Note: Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (>1M examples), using the [TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to read the files may be better suited." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "18QhsN2L16wH", - "outputId": "664d0036-8db4-4345-f961-a39554a0d50e" + "id": "18QhsN2L16wH" }, "outputs": [ { @@ -327,12 +318,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 187 - }, - "id": "v4rywCtW2pfK", - "outputId": "b39ba084-53c8-49c6-ea35-1fadc9ca4b39" + "id": "v4rywCtW2pfK" }, "outputs": [ { @@ -547,25 +533,11 @@ "train_full_data = train_full_data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)" ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "GxrmygY2QZ-I" - }, - "source": [ - "Let's print the updated table." - ] - }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 143 - }, - "id": "SYvo-ty6QiHN", - "outputId": "0931e98b-1634-4747-ea48-ae38d4ad9f01" + "id": "SYvo-ty6QiHN" }, "outputs": [ { @@ -742,7 +714,7 @@ "id": "Qs070SbkMJix" }, "source": [ - "To know more about the data description you can refer [Kaggle](https://www.kaggle.com/competitions/titanic/data)." + "Refer to [Kaggle](https://www.kaggle.com/competitions/titanic/data) for a comprehensive guide to the data." 
] }, { @@ -752,7 +724,7 @@ }, "source": [ "## Prepare the dataset\n", - "This dataset contains a mix of numeric, categorical and missing features. TF-DF supports all these feature types natively, and no preprocessing is required. This is one advantage of tree-based models; making them a great entry point to tensorflow and ML." + "This dataset contains a mix of numeric, categorical and missing features. TF-DF supports all these feature types natively, and no preprocessing is required. This is one advantage of tree-based models; making them a great entry point to TensorFlow and ML." ] }, { @@ -768,11 +740,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "YmrDp4SL7hTw", - "outputId": "d0d3a509-3144-42d6-beb8-18eced608b78" + "id": "YmrDp4SL7hTw" }, "outputs": [ { @@ -795,18 +763,14 @@ "id": "0NGJhK0R58Oa" }, "source": [ - "Let's split the dataset into training and testing:" + "Split the dataset into training and testing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "CW3ofmmI5xIr", - "outputId": "8919362e-6d2c-4323-92a5-780bd3e9af8a" + "id": "CW3ofmmI5xIr" }, "outputs": [ { @@ -839,9 +803,7 @@ "tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.CLASSIFICATION)\n", "```\n", "\n", - "This is a high [performance](https://www.tensorflow.org/guide/data_performance) data loading library which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory that conventionally perform floating point operations required for rendering graphics. GPUs are optimized for training artificial intelligence and deep learning models as they can process multiple computations simultaneously. 
It is not necessary for tree-based models until you begin to do distributed training.\n", - "\n", - "Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is the most common bug new researchers encounter. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.\n", + "This is a high [performance](https://www.tensorflow.org/guide/data_performance) data loading library which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). It is not necessary for tree-based models until you begin to do distributed training.\n", "\n", "Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help." ] @@ -855,14 +817,14 @@ "outputs": [], "source": [ "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", - " train_ds_pd, \n", - " label = label, \n", - " task = tfdf.keras.Task.CLASSIFICATION)\n", + " train_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)\n", "\n", "val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", - " val_ds_pd, \n", - " label = label, \n", - " task = tfdf.keras.Task.CLASSIFICATION)" + " val_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" ] }, { @@ -872,7 +834,7 @@ }, "source": [ "## Exploratory Data Analysis (EDA)\n", - "Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n", + "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. 
This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n", "\n", "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." ] @@ -883,7 +845,7 @@ "id": "3m46QYDz8IB4" }, "source": [ - "## Create a Random Forest " + "## Create and train a Random Forest model " ] }, { @@ -898,19 +860,6 @@ "model.compile(metrics=[\"accuracy\"]) # Optional, you can use this to include a list of eval metrics" ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "fxBlIUPD8SKU" - }, - "source": [ - "## Train your model\n", - "\n", - "This is a one-liner.\n", - "\n", - "Note: You can safely ignore the warning about Autograph." - ] - }, { "cell_type": "code", "execution_count": null, @@ -936,12 +885,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 404 - }, - "id": "Cwv7-NXc8WUq", - "outputId": "eebf77c4-37ea-4651-d693-7eeaa081cdf9" + "id": "Cwv7-NXc8WUq" }, "outputs": [ { @@ -1392,7 +1336,7 @@ "id": "RtGEzEGU9FsI" }, "source": [ - "## Evaluate the model on OOB data and the test dataset\n", + "## Evaluate the model on OOB data and the validation dataset\n", "\n", "Let's plot accuracy on OOB evaluation dataset as a function of the number of trees in the forest. One of the nice features about this particular hyperparameter is that larger values are usually better, and come with little risk aside from slowing down training." 
] @@ -1401,12 +1345,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 279 - }, - "id": "4nOZy6lX9CwJ", - "outputId": "220a98fd-8281-4924-e9e0-5ece77289184" + "id": "4nOZy6lX9CwJ" }, "outputs": [ { @@ -1444,11 +1383,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "_nEjaF9Y9NjF", - "outputId": "e04c43f8-a735-44cb-d078-b808664a0145" + "id": "_nEjaF9Y9NjF" }, "outputs": [ { @@ -1480,11 +1415,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "KyH_XC1d9X9x", - "outputId": "64568368-962e-4cba-ea8e-ed169fd328a3" + "id": "KyH_XC1d9X9x" }, "outputs": [ { @@ -1510,7 +1441,7 @@ "id": "TK0l4Qgxbwcq" }, "source": [ - "# Test Set Prediction\n", + "## Test Set Prediction\n", "Now we will do prediction on `test.csv`.\n" ] }, @@ -1536,8 +1467,8 @@ "outputs": [], "source": [ "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", - " test_data, \n", - " task = tfdf.keras.Task.CLASSIFICATION)" + " test_data, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" ] }, { @@ -1566,12 +1497,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - }, - "id": "Jxtj1lp6csVQ", - "outputId": "4b6929fc-b490-4806-fff7-e71576f0261c" + "id": "Jxtj1lp6csVQ" }, "outputs": [ { @@ -1754,17 +1680,54 @@ "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 17 - }, - "id": "2ary3LNoffRA", - "outputId": "e20e4278-86a5-4282-ba7b-81d35fa8b5a5" + "id": "2ary3LNoffRA" }, "outputs": [ { "data": { - "application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n 
label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ", + "application/javascript": [ + "\n", + " async function download(id, filename, size) {\n", + " if (!google.colab.kernel.accessAllowed) {\n", + " return;\n", + " }\n", + " const div = document.createElement('div');\n", + " const label = document.createElement('label');\n", + " label.textContent = `Downloading \"${filename}\": `;\n", + " div.appendChild(label);\n", + " const progress = document.createElement('progress');\n", + " progress.max = size;\n", + " div.appendChild(progress);\n", + " document.body.appendChild(div);\n", + "\n", + " const buffers = [];\n", + " let downloaded = 0;\n", + "\n", + " const channel = await google.colab.kernel.comms.open(id);\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + "\n", + " for await (const message of channel.messages) {\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + " if (message.buffers) {\n", + " for (const buffer of message.buffers) {\n", + " buffers.push(buffer);\n", + " downloaded += 
buffer.byteLength;\n", + " progress.value = downloaded;\n", + " }\n", + " }\n", + " }\n", + " const blob = new Blob(buffers, {type: 'application/binary'});\n", + " const a = document.createElement('a');\n", + " a.href = window.URL.createObjectURL(blob);\n", + " a.download = filename;\n", + " div.appendChild(a);\n", + " a.click();\n", + " div.remove();\n", + " }\n", + " " + ], "text/plain": [ "" ] @@ -1774,7 +1737,9 @@ }, { "data": { - "application/javascript": "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)", + "application/javascript": [ + "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)" + ], "text/plain": [ "" ] @@ -1788,273 +1753,6 @@ "files.download('test_prediction_output.csv')" ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "wFMtKF5irxhs" - }, - "source": [ - "# Try it out yourself\n", - "We've provided a bunch of code which you can use to explore the dataset, in case this is helpful to you in your future work. The code you need to write for this exercise is only a couple lines. \n", - "\n", - "Note: For this section the `test_ratio` is decreased from 0.3 to 0.1. 
Therefore, you can get different result.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CUirWAzfGkkC" - }, - "source": [ - "## Explore the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SrBp6T9xsqb5" - }, - "outputs": [], - "source": [ - "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n", - "train_full_data = pd.read_csv(train_file_path)\n", - "print(\"Full train dataset shape is {}\".format(train_full_data.shape))\n", - "\n", - "label=\"Survived\"\n", - "classes = train_full_data[label].unique().tolist()\n", - "print(f\"Label classes: {classes}\")\n", - "\n", - "train_full_data[label] = train_full_data[label].map(classes.index)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Pg7v6HFZNcNI" - }, - "source": [ - "### Split the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UNGS3XDb3-4K" - }, - "outputs": [], - "source": [ - " def split_dataset(dataset, test_ratio=0.10):\n", - " # YOUR CODE HERE\n", - "\n", - " \n", - " # Add code to split the dataset\n", - " return # your split data set\n", - "\n", - "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n", - "print(\"{} examples in training, {} examples in validation.\".format(\n", - " len(train_ds_pd), len(val_ds_pd)))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "zQKZitDR4t0o" - }, - "outputs": [], - "source": [ - "#@title Solution\n", - "'''def split_dataset(dataset, test_ratio=0.10):\n", - " test_indices = np.random.rand(len(dataset)) < test_ratio\n", - " return dataset[~test_indices], dataset[test_indices]'''" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NeZAkVroGsN3" - }, - "source": [ - "## Create tf.data.Datasets from the Pandas DataFrame" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZQVp9z1f4dou" - }, - "outputs": [], - 
"source": [ - "# YOUR CODE HERE\n", - "\n", - "\n", - "# Add code to create a tf.data.Dataset for train and test from the DataFrames\n", - "# Example...\n", - "# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...\n", - "# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(..." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "TvEAL3Sh5M6i" - }, - "outputs": [], - "source": [ - "#@title Solution\n", - "#train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", - "# train_ds_pd, \n", - "# label = label, \n", - "# task = tfdf.keras.Task.CLASSIFICATION)\n", - "\n", - "#val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", - "# val_ds_pd, \n", - "# label = label, \n", - "# task = tfdf.keras.Task.CLASSIFICATION)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_N7MeCoBG25D" - }, - "source": [ - "## Create your model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_ptiMVC15yhl" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n", - "\n", - "\n", - "# Add code to create a random forest\n", - "# Example ...\n", - "# mymodel = tfdf.keras. ...\n", - "# mymodel.compile(metrics=[\"accuracy\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "CdVGq1Dq5zq3" - }, - "outputs": [], - "source": [ - "#@title Solution\n", - "#mymodel = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n", - "#mymodel.compile(metrics=[\"accuracy\"]) " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pricbSWnHE4w" - }, - "source": [ - "## Train your Model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9bbEyEUfGLda" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n", - "\n", - "\n", - "# Add code to train your model\n", - "# Example ...\n", - "# mymodel.fit(..." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "3Y3crGNkGIU-" - }, - "outputs": [], - "source": [ - "#@title Solution\n", - "#mymodel.fit(x=train_ds)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D3xi8eNnGeDA" - }, - "source": [ - "## Evaluate your model\n", - "Uncomment these cells after completing the code above." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "bQlSIbiMHVV4" - }, - "outputs": [], - "source": [ - "#mymodel.summary()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "CKa4fo5zHYRE" - }, - "outputs": [], - "source": [ - "#mymodel.evaluate(test_ds)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "e66pPL8OHk_4" - }, - "outputs": [], - "source": [ - "#inspector = mymodel.make_inspector()\n", - "#inspector.evaluation()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2344zfqIHcfG" - }, - "outputs": [], - "source": [ - "#evaluation = mymodel.evaluate(x=test_ds,return_dict=True)\n", - "\n", - "#for name, value in evaluation.items():\n", - "# print(f\"{name}: {value:.4f}\")" - ] - }, { "cell_type": "markdown", "metadata": { @@ -2080,7 +1778,8 @@ "metadata": { "colab": { "collapsed_sections": [], - "provenance": [] + "name": "kaggle_beginner_example_classification.ipynb", + "toc_visible": true }, "kernelspec": { "display_name": "Python 3", From d2b243366e425b9a423b01180c71f1b0394a7477 Mon Sep 17 00:00:00 2001 From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com> Date: Fri, 16 Sep 2022 01:33:46 +0530 Subject: [PATCH 4/4] Minor Changes --- ...ggle_beginner_example_classification.ipynb | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb index 
bf560c8e..bd796f04 100644 --- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb +++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Copyright 2020 The TensorFlow Authors." + "##### Copyright 2022 The TensorFlow Authors." ] }, { @@ -717,6 +717,18 @@ "Refer to [Kaggle](https://www.kaggle.com/competitions/titanic/data) for a comprehensive guide to the data." ] }, + { + "cell_type": "markdown", + "metadata": { + "id": "cwdbYZeTJP89" + }, + "source": [ + "## Exploratory Data Analysis (EDA)\n", + "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n", + "\n", + "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." + ] + }, { "cell_type": "markdown", "metadata": { @@ -827,18 +839,6 @@ " task = tfdf.keras.Task.CLASSIFICATION)" ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "cwdbYZeTJP89" - }, - "source": [ - "## Exploratory Data Analysis (EDA)\n", - "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n", - "\n", - "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." - ] - }, { "cell_type": "markdown", "metadata": {