\n",
+ " "
+ ],
+ "text/plain": [
+ " Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked\n",
+ "0 0 3 male 22.0 1 0 7.2500 NaN S\n",
+ "1 1 1 female 38.0 1 0 71.2833 C85 C\n",
+ "2 1 3 female 26.0 0 0 7.9250 NaN S"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "train_full_data.head(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Qs070SbkMJix"
+ },
+ "source": [
+ "To know more about the data description you can refer [Kaggle](https://www.kaggle.com/competitions/titanic/data)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2EpAa_q55Ke8"
+ },
+ "source": [
+ "## Prepare the dataset\n",
+ "This dataset contains a mix of numeric, categorical and missing features. TF-DF supports all these feature types natively, and no preprocessing is required. This is one advantage of tree-based models; making them a great entry point to tensorflow and ML."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uQMX8Md3ISq0"
+ },
+ "source": [
+ "Collect the distinct values in the `Survived` column into a list. `Survived` takes one of two values, 0 or 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "YmrDp4SL7hTw",
+ "outputId": "d0d3a509-3144-42d6-beb8-18eced608b78"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Label classes: [0, 1]\n"
+ ]
+ }
+ ],
+ "source": [
+ "label=\"Survived\"\n",
+ "classes = train_full_data[label].unique().tolist()\n",
+ "print(f\"Label classes: {classes}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0NGJhK0R58Oa"
+ },
+ "source": [
+ "Let's split the dataset into training and testing:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "CW3ofmmI5xIr",
+ "outputId": "8919362e-6d2c-4323-92a5-780bd3e9af8a"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "611 examples in training, 280 examples in validation.\n"
+ ]
+ }
+ ],
+ "source": [
+ "def split_dataset(dataset, test_ratio=0.30):\n",
+ " test_indices = np.random.rand(len(dataset)) < test_ratio\n",
+ " return dataset[~test_indices], dataset[test_indices]\n",
+ "\n",
+ "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n",
+ "print(\"{} examples in training, {} examples in validation.\".format(\n",
+ " len(train_ds_pd), len(val_ds_pd)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "I0ZrYmer6tMp"
+ },
+ "source": [
+ "There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). A single-line helper function does this for you:\n",
+ "\n",
+ "```\n",
+ "tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.CLASSIFICATION)\n",
+ "```\n",
+ "\n",
+ "This is a high [performance](https://www.tensorflow.org/guide/data_performance) data loading library which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory that conventionally perform floating point operations required for rendering graphics. GPUs are optimized for training artificial intelligence and deep learning models as they can process multiple computations simultaneously. It is not necessary for tree-based models until you begin to do distributed training.\n",
+ "\n",
+ "Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is the most common bug new researchers encounter. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.\n",
+ "\n",
+ "Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "DyAHpZ0R6B5R"
+ },
+ "outputs": [],
+ "source": [
+ "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
+ " train_ds_pd, \n",
+ " label = label, \n",
+ " task = tfdf.keras.Task.CLASSIFICATION)\n",
+ "\n",
+ "val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
+ " val_ds_pd, \n",
+ " label = label, \n",
+ " task = tfdf.keras.Task.CLASSIFICATION)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cwdbYZeTJP89"
+ },
+ "source": [
+ "## Exploratory Data Analysis (EDA)\n",
+ "Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n",
+ "\n",
+ "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3m46QYDz8IB4"
+ },
+ "source": [
+ "## Create a Random Forest "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "11yxinBK78qU"
+ },
+ "outputs": [],
+ "source": [
+ "model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n",
+ "model.compile(metrics=[\"accuracy\"]) # Optional, you can use this to include a list of eval metrics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fxBlIUPD8SKU"
+ },
+ "source": [
+ "## Train your model\n",
+ "\n",
+ "This is a one-liner.\n",
+ "\n",
+ "Note: You can safely ignore the warning about Autograph."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_tQfGooA8OI2"
+ },
+ "outputs": [],
+ "source": [
+ "model.fit(x=train_ds)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YoqPROtT9A33"
+ },
+ "source": [
+ "## Visualize your model\n",
+ "One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 404
+ },
+ "id": "Cwv7-NXc8WUq",
+ "outputId": "eebf77c4-37ea-4651-d693-7eeaa081cdf9"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RtGEzEGU9FsI"
+ },
+ "source": [
+ "## Evaluate the model on OOB data and the test dataset\n",
+ "\n",
+ "Let's plot accuracy on the OOB (out-of-bag) evaluation dataset as a function of the number of trees in the forest. One nice property of this hyperparameter is that larger values are usually better and carry little risk aside from slowing down training."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 279
+ },
+ "id": "4nOZy6lX9CwJ",
+ "outputId": "220a98fd-8281-4924-e9e0-5ece77289184"
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
\n",
+ " "
+ ],
+ "text/plain": [
+ " PassengerId Survived\n",
+ "0 892 0\n",
+ "1 893 0\n",
+ "2 894 0\n",
+ "3 895 0\n",
+ "4 896 0"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "output = pd.DataFrame({'PassengerId': ids,\n",
+ " 'Survived': preds.squeeze()})\n",
+ "\n",
+ "output.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yZOhWNRcpM-G"
+ },
+ "source": [
+ "You can download the predicted output as a CSV file and submit it on the [Competition page](https://www.kaggle.com/competitions/titanic/submit) on Kaggle."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "j8FquAuufmtS"
+ },
+ "outputs": [],
+ "source": [
+ "output_filename = \"test_prediction_output.csv\"\n",
+ "output.to_csv(output_filename, index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 17
+ },
+ "id": "2ary3LNoffRA",
+ "outputId": "e20e4278-86a5-4282-ba7b-81d35fa8b5a5"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/javascript": [
+ "\n",
+ " async function download(id, filename, size) {\n",
+ " if (!google.colab.kernel.accessAllowed) {\n",
+ " return;\n",
+ " }\n",
+ " const div = document.createElement('div');\n",
+ " const label = document.createElement('label');\n",
+ " label.textContent = `Downloading \"${filename}\": `;\n",
+ " div.appendChild(label);\n",
+ " const progress = document.createElement('progress');\n",
+ " progress.max = size;\n",
+ " div.appendChild(progress);\n",
+ " document.body.appendChild(div);\n",
+ "\n",
+ " const buffers = [];\n",
+ " let downloaded = 0;\n",
+ "\n",
+ " const channel = await google.colab.kernel.comms.open(id);\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ "\n",
+ " for await (const message of channel.messages) {\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ " if (message.buffers) {\n",
+ " for (const buffer of message.buffers) {\n",
+ " buffers.push(buffer);\n",
+ " downloaded += buffer.byteLength;\n",
+ " progress.value = downloaded;\n",
+ " }\n",
+ " }\n",
+ " }\n",
+ " const blob = new Blob(buffers, {type: 'application/binary'});\n",
+ " const a = document.createElement('a');\n",
+ " a.href = window.URL.createObjectURL(blob);\n",
+ " a.download = filename;\n",
+ " div.appendChild(a);\n",
+ " a.click();\n",
+ " div.remove();\n",
+ " }\n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/javascript": [
+ "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from google.colab import files\n",
+ "files.download('test_prediction_output.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wFMtKF5irxhs"
+ },
+ "source": [
+ "# Try it out yourself\n",
+ "We've provided code that you can use to explore the dataset, in case it is helpful in your future work. The code you need to write for this exercise is only a couple of lines.\n",
+ "\n",
+ "Note: For this section, `test_ratio` is decreased from 0.3 to 0.1, so your results may differ from those above.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CUirWAzfGkkC"
+ },
+ "source": [
+ "## Explore the dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SrBp6T9xsqb5"
+ },
+ "outputs": [],
+ "source": [
+ "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n",
+ "train_full_data = pd.read_csv(train_file_path)\n",
+ "print(\"Full train dataset shape is {}\".format(train_full_data.shape))\n",
+ "\n",
+ "label=\"Survived\"\n",
+ "classes = train_full_data[label].unique().tolist()\n",
+ "print(f\"Label classes: {classes}\")\n",
+ "\n",
+ "train_full_data[label] = train_full_data[label].map(classes.index)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Pg7v6HFZNcNI"
+ },
+ "source": [
+ "### Split the dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "UNGS3XDb3-4K"
+ },
+ "outputs": [],
+ "source": [
+ " def split_dataset(dataset, test_ratio=0.10):\n",
+ " # YOUR CODE HERE\n",
+ "\n",
+ " \n",
+ " # Add code to split the dataset\n",
+ " return # your split data set\n",
+ "\n",
+ "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n",
+ "print(\"{} examples in training, {} examples in validation.\".format(\n",
+ " len(train_ds_pd), len(val_ds_pd)))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "zQKZitDR4t0o"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Solution\n",
+ "'''def split_dataset(dataset, test_ratio=0.10):\n",
+ " test_indices = np.random.rand(len(dataset)) < test_ratio\n",
+ " return dataset[~test_indices], dataset[test_indices]'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NeZAkVroGsN3"
+ },
+ "source": [
+ "## Create tf.data.Datasets from the Pandas DataFrame"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ZQVp9z1f4dou"
+ },
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE\n",
+ "\n",
+ "\n",
+ "# Add code to create a tf.data.Dataset for train and test from the DataFrames\n",
+ "# Example...\n",
+ "# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...\n",
+ "# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "TvEAL3Sh5M6i"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Solution\n",
+ "#train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
+ "# train_ds_pd, \n",
+ "# label = label, \n",
+ "# task = tfdf.keras.Task.CLASSIFICATION)\n",
+ "\n",
+ "#test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
+ "# val_ds_pd, \n",
+ "# label = label, \n",
+ "# task = tfdf.keras.Task.CLASSIFICATION)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_N7MeCoBG25D"
+ },
+ "source": [
+ "## Create your model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_ptiMVC15yhl"
+ },
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE\n",
+ "\n",
+ "\n",
+ "# Add code to create a random forest\n",
+ "# Example ...\n",
+ "# mymodel = tfdf.keras. ...\n",
+ "# mymodel.compile(metrics=[\"accuracy\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "CdVGq1Dq5zq3"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Solution\n",
+ "#mymodel = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n",
+ "#mymodel.compile(metrics=[\"accuracy\"]) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pricbSWnHE4w"
+ },
+ "source": [
+ "## Train your Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9bbEyEUfGLda"
+ },
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE\n",
+ "\n",
+ "\n",
+ "# Add code to train your model\n",
+ "# Example ...\n",
+ "# mymodel.fit(..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "3Y3crGNkGIU-"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Solution\n",
+ "#mymodel.fit(x=train_ds)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "D3xi8eNnGeDA"
+ },
+ "source": [
+ "## Evaluate your model\n",
+ "Uncomment these cells after completing the code above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bQlSIbiMHVV4"
+ },
+ "outputs": [],
+ "source": [
+ "#mymodel.summary()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "CKa4fo5zHYRE"
+ },
+ "outputs": [],
+ "source": [
+ "#mymodel.evaluate(test_ds)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "e66pPL8OHk_4"
+ },
+ "outputs": [],
+ "source": [
+ "#inspector = mymodel.make_inspector()\n",
+ "#inspector.evaluation()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2344zfqIHcfG"
+ },
+ "outputs": [],
+ "source": [
+ "#evaluation = mymodel.evaluate(x=test_ds,return_dict=True)\n",
+ "\n",
+ "#for name, value in evaluation.items():\n",
+ "# print(f\"{name}: {value:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Lh3gxL9OHKbD"
+ },
+ "source": [
+ "# References\n",
+ "* Dive deep into \n",
+ " * [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)\n",
+ " * [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)\n",
+ " * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)\n",
+ " * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)\n",
+ " * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests).\n",
+ "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee.\n",
+ "* TensorFlow Decision Forests tutorials which are a set of 3 very interesting tutorials.\n",
+ " * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)\n",
+ " * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)\n",
+ " * [Advanced Tutorial](https://www.tensorflow.org/decision_forests/tutorials/advanced_colab)\n",
+ "* The [TensorFlow Forum](https://discuss.tensorflow.org/) where one can get in touch with the TensorFlow community. Check it out if you haven't yet."
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [],
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
From b10b8f0c7904073db00f5aa2f1b5289ca213bd3d Mon Sep 17 00:00:00 2001
From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com>
Date: Sat, 3 Sep 2022 05:21:57 +0530
Subject: [PATCH 2/4] Minor change
---
...ggle_beginner_example_classification.ipynb | 54 ++-----------------
1 file changed, 5 insertions(+), 49 deletions(-)
diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
index a8d9cdcc..30a01174 100644
--- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb
+++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
@@ -874,7 +874,7 @@
"## Exploratory Data Analysis (EDA)\n",
"Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n",
"\n",
- "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee."
+ "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee."
]
},
{
@@ -1411,7 +1411,7 @@
"outputs": [
{
"data": {
- "image/png": "\n",
+ "image/png": "",
"text/plain": [
"
"
]
@@ -1764,49 +1764,7 @@
"outputs": [
{
"data": {
- "application/javascript": [
- "\n",
- " async function download(id, filename, size) {\n",
- " if (!google.colab.kernel.accessAllowed) {\n",
- " return;\n",
- " }\n",
- " const div = document.createElement('div');\n",
- " const label = document.createElement('label');\n",
- " label.textContent = `Downloading \"${filename}\": `;\n",
- " div.appendChild(label);\n",
- " const progress = document.createElement('progress');\n",
- " progress.max = size;\n",
- " div.appendChild(progress);\n",
- " document.body.appendChild(div);\n",
- "\n",
- " const buffers = [];\n",
- " let downloaded = 0;\n",
- "\n",
- " const channel = await google.colab.kernel.comms.open(id);\n",
- " // Send a message to notify the kernel that we're ready.\n",
- " channel.send({})\n",
- "\n",
- " for await (const message of channel.messages) {\n",
- " // Send a message to notify the kernel that we're ready.\n",
- " channel.send({})\n",
- " if (message.buffers) {\n",
- " for (const buffer of message.buffers) {\n",
- " buffers.push(buffer);\n",
- " downloaded += buffer.byteLength;\n",
- " progress.value = downloaded;\n",
- " }\n",
- " }\n",
- " }\n",
- " const blob = new Blob(buffers, {type: 'application/binary'});\n",
- " const a = document.createElement('a');\n",
- " a.href = window.URL.createObjectURL(blob);\n",
- " a.download = filename;\n",
- " div.appendChild(a);\n",
- " a.click();\n",
- " div.remove();\n",
- " }\n",
- " "
- ],
+ "application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ",
"text/plain": [
""
]
@@ -1816,9 +1774,7 @@
},
{
"data": {
- "application/javascript": [
- "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)"
- ],
+ "application/javascript": "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)",
"text/plain": [
""
]
@@ -2112,7 +2068,7 @@
" * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)\n",
" * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)\n",
" * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests).\n",
- "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banrrjee.\n",
+ "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee.\n",
"* TensorFlow Decision Forests tutorials which are a set of 3 very interesting tutorials.\n",
" * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)\n",
" * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)\n",
From ec26f124a782f1968b1ff9a6ad9e30ae67ea6d35 Mon Sep 17 00:00:00 2001
From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com>
Date: Fri, 16 Sep 2022 01:19:03 +0530
Subject: [PATCH 3/4] Update kaggle_beginner_example_classification.ipynb
---
...ggle_beginner_example_classification.ipynb | 519 ++++--------------
1 file changed, 109 insertions(+), 410 deletions(-)
diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
index 30a01174..bf560c8e 100644
--- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb
+++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
@@ -1,12 +1,38 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Copyright 2020 The TensorFlow Authors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
"id": "MDBzBKC_pnXl"
},
"source": [
- "# Titanic - TFDF\n",
+ "# Structured Data Classification using TFDF\n",
"\n",
"
\n",
"
\n",
@@ -24,15 +50,6 @@
"
"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3u9YXGqAZWwj"
- },
- "source": [
- "Kaggle Dataset - [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {
@@ -49,18 +66,8 @@
"[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),\n",
"and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),\n",
"and can be used for regression, classification, and ranking tasks.\n",
- "For a beginner's guide to TensorFlow Decision Forests,\n",
- "please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-eDuuv2RroES"
- },
- "source": [
- "### Random Forest\n",
- "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform (or provide a strong baseline) before you begin experimenting with neural networks.\n",
+ "For an introduction to [TFDF](https://www.tensorflow.org/decision_forests) without Kaggle, please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).\n",
+ "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform neural networks.\n",
"\n",
"In this example we will use TensorFlow to train each of these on a dataset you load from a CSV file. This is a common pattern in practice. Roughly, your code will look as follows:\n",
"\n",
@@ -93,11 +100,7 @@
"id": "dl6_Mdy7sUC7"
},
"source": [
- "#### Install TensorFlow Decision Forests\n",
- "\n",
- "There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.\n",
- "\n",
- "In this example we'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library used to train large models. "
+ "#### Install TensorFlow Decision Forests"
]
},
{
@@ -134,10 +137,6 @@
"# - Data processing - #\n",
"import pandas as pd # Pandas Documentation - https://pandas.pydata.org/docs/\n",
"\n",
- "# -- Hide Warnings -- #\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')\n",
- "\n",
"# ---- Tensorflow ---- #\n",
"import tensorflow as tf\n",
"import tensorflow_decision_forests as tfdf"
@@ -147,11 +146,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "CjdtV-KWvcWA",
- "outputId": "7087788a-b3f9-416f-8f35-33a363c6a81e"
+ "id": "CjdtV-KWvcWA"
},
"outputs": [
{
@@ -175,7 +170,7 @@
},
"source": [
"### Download the Titanic dataset\n",
- "[Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic."
+ "The [Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic."
]
},
{
@@ -285,19 +280,15 @@
"id": "PCFSJUjl2fuT"
},
"source": [
- "## Data Loading\n",
- "Note: Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (>1M examples), using the TensorFlow Dataset to read the files may be better suited."
+ "## Load the dataset\n",
+ "Note: Pandas is practical because you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), reading the files with a [TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) may be better suited."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "18QhsN2L16wH",
- "outputId": "664d0036-8db4-4345-f961-a39554a0d50e"
+ "id": "18QhsN2L16wH"
},
"outputs": [
{
@@ -327,12 +318,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 187
- },
- "id": "v4rywCtW2pfK",
- "outputId": "b39ba084-53c8-49c6-ea35-1fadc9ca4b39"
+ "id": "v4rywCtW2pfK"
},
"outputs": [
{
@@ -547,25 +533,11 @@
"train_full_data = train_full_data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GxrmygY2QZ-I"
- },
- "source": [
- "Let's print the updated table."
- ]
- },
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 143
- },
- "id": "SYvo-ty6QiHN",
- "outputId": "0931e98b-1634-4747-ea48-ae38d4ad9f01"
+ "id": "SYvo-ty6QiHN"
},
"outputs": [
{
@@ -742,7 +714,7 @@
"id": "Qs070SbkMJix"
},
"source": [
- "To know more about the data description you can refer [Kaggle](https://www.kaggle.com/competitions/titanic/data)."
+ "Refer to [Kaggle](https://www.kaggle.com/competitions/titanic/data) for a comprehensive guide to the data."
]
},
{
@@ -752,7 +724,7 @@
},
"source": [
"## Prepare the dataset\n",
- "This dataset contains a mix of numeric, categorical and missing features. TF-DF supports all these feature types natively, and no preprocessing is required. This is one advantage of tree-based models; making them a great entry point to tensorflow and ML."
+ "This dataset contains a mix of numeric and categorical features, some with missing values. TF-DF supports all of these natively, so no preprocessing is required. This is one advantage of tree-based models, and it makes them a great entry point to TensorFlow and ML."
]
},
{
@@ -768,11 +740,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "YmrDp4SL7hTw",
- "outputId": "d0d3a509-3144-42d6-beb8-18eced608b78"
+ "id": "YmrDp4SL7hTw"
},
"outputs": [
{
@@ -795,18 +763,14 @@
"id": "0NGJhK0R58Oa"
},
"source": [
- "Let's split the dataset into training and testing:"
+ "Split the dataset into training and testing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "CW3ofmmI5xIr",
- "outputId": "8919362e-6d2c-4323-92a5-780bd3e9af8a"
+ "id": "CW3ofmmI5xIr"
},
"outputs": [
{
@@ -839,9 +803,7 @@
"tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.CLASSIFICATION)\n",
"```\n",
"\n",
- "This is a high [performance](https://www.tensorflow.org/guide/data_performance) data loading library which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory that conventionally perform floating point operations required for rendering graphics. GPUs are optimized for training artificial intelligence and deep learning models as they can process multiple computations simultaneously. It is not necessary for tree-based models until you begin to do distributed training.\n",
- "\n",
- "Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is the most common bug new researchers encounter. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.\n",
+ "This is a [high-performance](https://www.tensorflow.org/guide/data_performance) data loading library that is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). It is not necessary for tree-based models until you begin distributed training.\n",
"\n",
"Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help."
]
@@ -855,14 +817,14 @@
"outputs": [],
"source": [
"train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
- " train_ds_pd, \n",
- " label = label, \n",
- " task = tfdf.keras.Task.CLASSIFICATION)\n",
+ " train_ds_pd, \n",
+ " label = label, \n",
+ " task = tfdf.keras.Task.CLASSIFICATION)\n",
"\n",
"val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
- " val_ds_pd, \n",
- " label = label, \n",
- " task = tfdf.keras.Task.CLASSIFICATION)"
+ " val_ds_pd, \n",
+ " label = label, \n",
+ " task = tfdf.keras.Task.CLASSIFICATION)"
]
},
{
@@ -872,7 +834,7 @@
},
"source": [
"## Exploratory Data Analysis (EDA)\n",
- "Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. \n",
+ "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n",
"\n",
"For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee."
]
@@ -883,7 +845,7 @@
"id": "3m46QYDz8IB4"
},
"source": [
- "## Create a Random Forest "
+ "## Create and train a Random Forest model"
]
},
{
@@ -898,19 +860,6 @@
"model.compile(metrics=[\"accuracy\"]) # Optional, you can use this to include a list of eval metrics"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fxBlIUPD8SKU"
- },
- "source": [
- "## Train your model\n",
- "\n",
- "This is a one-liner.\n",
- "\n",
- "Note: You can safely ignore the warning about Autograph."
- ]
- },
{
"cell_type": "code",
"execution_count": null,
@@ -936,12 +885,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 404
- },
- "id": "Cwv7-NXc8WUq",
- "outputId": "eebf77c4-37ea-4651-d693-7eeaa081cdf9"
+ "id": "Cwv7-NXc8WUq"
},
"outputs": [
{
@@ -1392,7 +1336,7 @@
"id": "RtGEzEGU9FsI"
},
"source": [
- "## Evaluate the model on OOB data and the test dataset\n",
+ "## Evaluate the model on OOB data and the validation dataset\n",
"\n",
"Let's plot accuracy on OOB evaluation dataset as a function of the number of trees in the forest. One of the nice features about this particular hyperparameter is that larger values are usually better, and come with little risk aside from slowing down training."
]
@@ -1401,12 +1345,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 279
- },
- "id": "4nOZy6lX9CwJ",
- "outputId": "220a98fd-8281-4924-e9e0-5ece77289184"
+ "id": "4nOZy6lX9CwJ"
},
"outputs": [
{
@@ -1444,11 +1383,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "_nEjaF9Y9NjF",
- "outputId": "e04c43f8-a735-44cb-d078-b808664a0145"
+ "id": "_nEjaF9Y9NjF"
},
"outputs": [
{
@@ -1480,11 +1415,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "KyH_XC1d9X9x",
- "outputId": "64568368-962e-4cba-ea8e-ed169fd328a3"
+ "id": "KyH_XC1d9X9x"
},
"outputs": [
{
@@ -1510,7 +1441,7 @@
"id": "TK0l4Qgxbwcq"
},
"source": [
- "# Test Set Prediction\n",
+ "## Test Set Prediction\n",
"Now we will do prediction on `test.csv`.\n"
]
},
@@ -1536,8 +1467,8 @@
"outputs": [],
"source": [
"test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
- " test_data, \n",
- " task = tfdf.keras.Task.CLASSIFICATION)"
+ " test_data, \n",
+ " task = tfdf.keras.Task.CLASSIFICATION)"
]
},
{
@@ -1566,12 +1497,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 206
- },
- "id": "Jxtj1lp6csVQ",
- "outputId": "4b6929fc-b490-4806-fff7-e71576f0261c"
+ "id": "Jxtj1lp6csVQ"
},
"outputs": [
{
@@ -1754,17 +1680,54 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 17
- },
- "id": "2ary3LNoffRA",
- "outputId": "e20e4278-86a5-4282-ba7b-81d35fa8b5a5"
+ "id": "2ary3LNoffRA"
},
"outputs": [
{
"data": {
- "application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ",
+ "application/javascript": [
+ "\n",
+ " async function download(id, filename, size) {\n",
+ " if (!google.colab.kernel.accessAllowed) {\n",
+ " return;\n",
+ " }\n",
+ " const div = document.createElement('div');\n",
+ " const label = document.createElement('label');\n",
+ " label.textContent = `Downloading \"${filename}\": `;\n",
+ " div.appendChild(label);\n",
+ " const progress = document.createElement('progress');\n",
+ " progress.max = size;\n",
+ " div.appendChild(progress);\n",
+ " document.body.appendChild(div);\n",
+ "\n",
+ " const buffers = [];\n",
+ " let downloaded = 0;\n",
+ "\n",
+ " const channel = await google.colab.kernel.comms.open(id);\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ "\n",
+ " for await (const message of channel.messages) {\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ " if (message.buffers) {\n",
+ " for (const buffer of message.buffers) {\n",
+ " buffers.push(buffer);\n",
+ " downloaded += buffer.byteLength;\n",
+ " progress.value = downloaded;\n",
+ " }\n",
+ " }\n",
+ " }\n",
+ " const blob = new Blob(buffers, {type: 'application/binary'});\n",
+ " const a = document.createElement('a');\n",
+ " a.href = window.URL.createObjectURL(blob);\n",
+ " a.download = filename;\n",
+ " div.appendChild(a);\n",
+ " a.click();\n",
+ " div.remove();\n",
+ " }\n",
+ " "
+ ],
"text/plain": [
""
]
@@ -1774,7 +1737,9 @@
},
{
"data": {
- "application/javascript": "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)",
+ "application/javascript": [
+ "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)"
+ ],
"text/plain": [
""
]
@@ -1788,273 +1753,6 @@
"files.download('test_prediction_output.csv')"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wFMtKF5irxhs"
- },
- "source": [
- "# Try it out yourself\n",
- "We've provided a bunch of code which you can use to explore the dataset, in case this is helpful to you in your future work. The code you need to write for this exercise is only a couple lines. \n",
- "\n",
- "Note: For this section the `test_ratio` is decreased from 0.3 to 0.1. Therefore, you can get different result.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "CUirWAzfGkkC"
- },
- "source": [
- "## Explore the dataset"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "SrBp6T9xsqb5"
- },
- "outputs": [],
- "source": [
- "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n",
- "train_full_data = pd.read_csv(train_file_path)\n",
- "print(\"Full train dataset shape is {}\".format(train_full_data.shape))\n",
- "\n",
- "label=\"Survived\"\n",
- "classes = train_full_data[label].unique().tolist()\n",
- "print(f\"Label classes: {classes}\")\n",
- "\n",
- "train_full_data[label] = train_full_data[label].map(classes.index)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Pg7v6HFZNcNI"
- },
- "source": [
- "### Split the dataset"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "UNGS3XDb3-4K"
- },
- "outputs": [],
- "source": [
- " def split_dataset(dataset, test_ratio=0.10):\n",
- " # YOUR CODE HERE\n",
- "\n",
- " \n",
- " # Add code to split the dataset\n",
- " return # your split data set\n",
- "\n",
- "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n",
- "print(\"{} examples in training, {} examples in validation.\".format(\n",
- " len(train_ds_pd), len(val_ds_pd)))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "zQKZitDR4t0o"
- },
- "outputs": [],
- "source": [
- "#@title Solution\n",
- "'''def split_dataset(dataset, test_ratio=0.10):\n",
- " test_indices = np.random.rand(len(dataset)) < test_ratio\n",
- " return dataset[~test_indices], dataset[test_indices]'''"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NeZAkVroGsN3"
- },
- "source": [
- "## Create tf.data.Datasets from the Pandas DataFrame"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ZQVp9z1f4dou"
- },
- "outputs": [],
- "source": [
- "# YOUR CODE HERE\n",
- "\n",
- "\n",
- "# Add code to create a tf.data.Dataset for train and test from the DataFrames\n",
- "# Example...\n",
- "# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...\n",
- "# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "TvEAL3Sh5M6i"
- },
- "outputs": [],
- "source": [
- "#@title Solution\n",
- "#train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
- "# train_ds_pd, \n",
- "# label = label, \n",
- "# task = tfdf.keras.Task.CLASSIFICATION)\n",
- "\n",
- "#val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n",
- "# val_ds_pd, \n",
- "# label = label, \n",
- "# task = tfdf.keras.Task.CLASSIFICATION)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_N7MeCoBG25D"
- },
- "source": [
- "## Create your model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "_ptiMVC15yhl"
- },
- "outputs": [],
- "source": [
- "# YOUR CODE HERE\n",
- "\n",
- "\n",
- "# Add code to create a random forest\n",
- "# Example ...\n",
- "# mymodel = tfdf.keras. ...\n",
- "# mymodel.compile(metrics=[\"accuracy\"])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "CdVGq1Dq5zq3"
- },
- "outputs": [],
- "source": [
- "#@title Solution\n",
- "#mymodel = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n",
- "#mymodel.compile(metrics=[\"accuracy\"]) "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pricbSWnHE4w"
- },
- "source": [
- "## Train your Model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "9bbEyEUfGLda"
- },
- "outputs": [],
- "source": [
- "# YOUR CODE HERE\n",
- "\n",
- "\n",
- "# Add code to train your model\n",
- "# Example ...\n",
- "# mymodel.fit(..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "cellView": "form",
- "id": "3Y3crGNkGIU-"
- },
- "outputs": [],
- "source": [
- "#@title Solution\n",
- "#mymodel.fit(x=train_ds)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "D3xi8eNnGeDA"
- },
- "source": [
- "## Evaluate your model\n",
- "Uncomment these cells after completing the code above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bQlSIbiMHVV4"
- },
- "outputs": [],
- "source": [
- "#mymodel.summary()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "CKa4fo5zHYRE"
- },
- "outputs": [],
- "source": [
- "#mymodel.evaluate(test_ds)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "e66pPL8OHk_4"
- },
- "outputs": [],
- "source": [
- "#inspector = mymodel.make_inspector()\n",
- "#inspector.evaluation()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2344zfqIHcfG"
- },
- "outputs": [],
- "source": [
- "#evaluation = mymodel.evaluate(x=test_ds,return_dict=True)\n",
- "\n",
- "#for name, value in evaluation.items():\n",
- "# print(f\"{name}: {value:.4f}\")"
- ]
- },
{
"cell_type": "markdown",
"metadata": {
@@ -2080,7 +1778,8 @@
"metadata": {
"colab": {
"collapsed_sections": [],
- "provenance": []
+ "name": "kaggle_beginner_example_classification.ipynb",
+ "toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
From d2b243366e425b9a423b01180c71f1b0394a7477 Mon Sep 17 00:00:00 2001
From: Vansh Sharma <74853090+vanshhhhh@users.noreply.github.com>
Date: Fri, 16 Sep 2022 01:33:46 +0530
Subject: [PATCH 4/4] Minor Changes
---
...ggle_beginner_example_classification.ipynb | 26 +++++++++----------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
index bf560c8e..bd796f04 100644
--- a/documentation/tutorials/kaggle_beginner_example_classification.ipynb
+++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "##### Copyright 2020 The TensorFlow Authors."
+ "##### Copyright 2022 The TensorFlow Authors."
]
},
{
@@ -717,6 +717,18 @@
"Refer to [Kaggle](https://www.kaggle.com/competitions/titanic/data) for a comprehensive guide to the data."
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cwdbYZeTJP89"
+ },
+ "source": [
+ "## Exploratory Data Analysis (EDA)\n",
+ "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their datasets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n",
+ "\n",
+ "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee."
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -827,18 +839,6 @@
" task = tfdf.keras.Task.CLASSIFICATION)"
]
},
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "cwdbYZeTJP89"
- },
- "source": [
- "## Exploratory Data Analysis (EDA)\n",
- "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n",
- "\n",
- "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee."
- ]
- },
{
"cell_type": "markdown",
"metadata": {